diff mbox series

[nf-next,v3,3/3] netfilter: Introduce egress hook

Message ID d2256c451876583bbbf8f0e82a5a43ce35c5cf2f.1598517740.git.lukas@wunner.de
State Changes Requested
Delegated to: Pablo Neira
Headers show
Series Netfilter egress hook | expand

Commit Message

Lukas Wunner Aug. 27, 2020, 8:55 a.m. UTC
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.

Support the same on egress.  This allows filtering locally generated
traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
hook up egress rules from user space has been submitted separately.

Position the hook immediately before a packet is handed to traffic
control and then sent out on an interface, thereby mirroring the ingress
order.  This order allows marking packets in the netfilter egress hook
and subsequently using the mark in tc.  Another benefit of this order is
consistency with a lot of existing documentation which says that egress
tc is performed after netfilter hooks.

To avoid a performance degradation in the default case (with neither
netfilter nor traffic control used), Daniel Borkmann suggests "a single
static_key which wraps an empty function call entry which can then be
patched by the kernel at runtime. Inside that trampoline we can still
keep the ordering [between netfilter and traffic control] intact":

https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/

To this end, introduce nf_sch_egress() which is dynamically patched into
__dev_queue_xmit(), contingent on egress_needed_key.  Inside that
function, nf_egress() and sch_handle_egress() is called, each contingent
on its own separate static_key.

nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
This change alone causes a speedup if neither netfilter nor traffic
control is used, apparently because it reduces instruction cache
pressure.  The same effect was previously observed by Eric Dumazet for
the ingress path:

https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/

Overall, performance improves with this commit if neither netfilter nor
traffic control is used. However it degrades a little if only traffic
control is used, due to the "noinline", the additional outer static key
and the added netfilter code:

* Before:       4730418pps 2270Mb/sec (2270600640bps)
* After:        4759206pps 2284Mb/sec (2284418880bps)

* Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
* After  + tc:  4007728pps 1923Mb/sec (1923709440bps)

* After  + nft: 3714546pps 1782Mb/sec (1782982080bps)

Measured on a bare-metal Core i7-3615QM.

Commands to perform a measurement:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Commands to enable egress traffic control:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Commands to enable egress netfilter:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 include/linux/netdevice.h        |  8 +++++
 include/linux/netfilter_netdev.h | 27 +++++++++++++++++
 include/linux/rtnetlink.h        |  2 +-
 include/uapi/linux/netfilter.h   |  1 +
 net/core/dev.c                   | 52 ++++++++++++++++++++++++++++----
 net/netfilter/Kconfig            |  8 +++++
 net/netfilter/core.c             | 24 ++++++++++++---
 net/netfilter/nft_chain_filter.c |  4 ++-
 net/sched/Kconfig                |  3 ++
 9 files changed, 117 insertions(+), 12 deletions(-)

Comments

John Fastabend Aug. 28, 2020, 6:52 p.m. UTC | #1
Lukas Wunner wrote:
> Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> handle_ing() under unique static key") introduced the ability to
> classify packets on ingress.
> 
> Support the same on egress.  This allows filtering locally generated
> traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
> allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
> hook up egress rules from user space has been submitted separately.
> 
> Position the hook immediately before a packet is handed to traffic
> control and then sent out on an interface, thereby mirroring the ingress
> order.  This order allows marking packets in the netfilter egress hook
> and subsequently using the mark in tc.  Another benefit of this order is
> consistency with a lot of existing documentation which says that egress
> tc is performed after netfilter hooks.
> 
> To avoid a performance degradation in the default case (with neither
> netfilter nor traffic control used), Daniel Borkmann suggests "a single
> static_key which wraps an empty function call entry which can then be
> patched by the kernel at runtime. Inside that trampoline we can still
> keep the ordering [between netfilter and traffic control] intact":
> 
> https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/
> 
> To this end, introduce nf_sch_egress() which is dynamically patched into
> __dev_queue_xmit(), contingent on egress_needed_key.  Inside that
> function, nf_egress() and sch_handle_egress() is called, each contingent
> on its own separate static_key.
> 
> nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
> This change alone causes a speedup if neither netfilter nor traffic
> control is used, apparently because it reduces instruction cache
> pressure.  The same effect was previously observed by Eric Dumazet for
> the ingress path:
> 
> https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/
> 
> Overall, performance improves with this commit if neither netfilter nor
> traffic control is used. However it degrades a little if only traffic
> control is used, due to the "noinline", the additional outer static key
> and the added netfilter code:
> 
> * Before:       4730418pps 2270Mb/sec (2270600640bps)
> * After:        4759206pps 2284Mb/sec (2284418880bps)

These baseline numbers seem low to me.

> 
> * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> 
> * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
> 
> Measured on a bare-metal Core i7-3615QM.

OK I have some server class systems here I would like to run these
benchmarks again on to be sure we don't have any performance
regressions on that side.

I'll try to get to it asap, but likely will be Monday morning
by the time I get to it. I assume that should be no problem
seeing we are only on rc2.

Thanks.

> 
> Commands to perform a measurement:
> ip link add dev foo type dummy
> ip link set dev foo up
> modprobe pktgen
> echo "add_device foo" > /proc/net/pktgen/kpktgend_3
> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Thats a single thread correct? -t option if I recall correctly.
I think we should also try with many threads to see if
that makes a difference. I guess probably not, but lets
see.

> 
> Commands to enable egress traffic control:
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
> 
> Commands to enable egress netfilter:
> nft add table netdev t
> nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
> nft add rule netdev t co ip daddr 4.3.2.1/32 drop
> 

I'll give above a try.
John Fastabend Sept. 3, 2020, 5 a.m. UTC | #2
John Fastabend wrote:
> Lukas Wunner wrote:
> > Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> > handle_ing() under unique static key") introduced the ability to
> > classify packets on ingress.
> > 
> > Support the same on egress.  This allows filtering locally generated
> > traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
> > allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
> > hook up egress rules from user space has been submitted separately.
> > 
> > Position the hook immediately before a packet is handed to traffic
> > control and then sent out on an interface, thereby mirroring the ingress
> > order.  This order allows marking packets in the netfilter egress hook
> > and subsequently using the mark in tc.  Another benefit of this order is
> > consistency with a lot of existing documentation which says that egress
> > tc is performed after netfilter hooks.
> > 
> > To avoid a performance degradation in the default case (with neither
> > netfilter nor traffic control used), Daniel Borkmann suggests "a single
> > static_key which wraps an empty function call entry which can then be
> > patched by the kernel at runtime. Inside that trampoline we can still
> > keep the ordering [between netfilter and traffic control] intact":
> > 
> > https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/
> > 
> > To this end, introduce nf_sch_egress() which is dynamically patched into
> > __dev_queue_xmit(), contingent on egress_needed_key.  Inside that
> > function, nf_egress() and sch_handle_egress() is called, each contingent
> > on its own separate static_key.
> > 
> > nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
> > This change alone causes a speedup if neither netfilter nor traffic
> > control is used, apparently because it reduces instruction cache
> > pressure.  The same effect was previously observed by Eric Dumazet for
> > the ingress path:
> > 
> > https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/
> > 
> > Overall, performance improves with this commit if neither netfilter nor
> > traffic control is used. However it degrades a little if only traffic
> > control is used, due to the "noinline", the additional outer static key
> > and the added netfilter code:

I don't think it actualy improves performance at least I didn't observe
that. From the code its not clear why this would be the case either. As
a nit I would prefer that line removed from the commit message.

I guess the Before/After below is just showing some noise in the
measurement.

> > 
> > * Before:       4730418pps 2270Mb/sec (2270600640bps)
> > * After:        4759206pps 2284Mb/sec (2284418880bps)
> 
> These baseline numbers seem low to me.

I used a 10Gbps ixgbe nic to measure the performance after the dummy
device hung on me for some reason. I'll try to investigate what happened
later. It was unrelated to these patches though.

But, with 10Gbps NIC and doing a pktgen benchmark with and without
the patches applied I didn't see any measurable differences. Both
cases reached 14Mpps.

> 
> > 
> > * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> > * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)

Same here before/after aggregate appears to be the same. Even the
numbers above show a 1.2% degradation. Just curious is the above
from a single run or averaged over multiple runs or something
else? Seems like noise to me.

I did see something odd where it appeared fairness between threads
was slightly worse. I don't have any explanation for this? Did
you have a chance to run the test with -t >1?

Also the overhead on your system for adding a tc rule seems
a bit high. In my case a single tc drop rule added ~7% overhead
at 14mpps. Above it looks more like 16% so double. Maybe a
missing JIT or some other configuration. Either a perf trace
or looking at your config would help figure that out.

> > 
> > * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
> > 

I haven't had a chance to do these benchmarks, but for my use
cases its more important to _not_ degrade tc performance.

I will note though that this is getting close to a 10% perf
degradation from using tc. I haven't looked much into it,
but that seems high to simply drop a packet.

Do you have plans to address the performance degradation? Otherwise
if I was building some new components its unclear why we would
choose the slower option over the tc hook. The two suggested
use cases security policy and DSR sound like new features, any
reason to not just use existing infrastructure?

Is the use case primarily legacy things already running in
nft infrastructure? I guess if you have code running now
moving it to this hook is faster and even if its 10% slower
than it could be that may be better than a rewrite?

> > Measured on a bare-metal Core i7-3615QM.
> 
> OK I have some server class systems here I would like to run these
> benchmarks again on to be sure we don't have any performance
> regressions on that side.
> 
> I'll try to get to it asap, but likely will be Monday morning
> by the time I get to it. I assume that should be no problem
> seeing we are only on rc2.

Sorry Monday had to look into a different bug.

> 
> Thanks.
> 

Thanks.
John
nevola Sept. 4, 2020, 8:54 a.m. UTC | #3
Hi,

On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
[...]
>
> I don't think it actualy improves performance at least I didn't observe
> that. From the code its not clear why this would be the case either. As
> a nit I would prefer that line removed from the commit message.
>

It hasn't been proven to be untrue either.


[...]
>
> Do you have plans to address the performance degradation? Otherwise
> if I was building some new components its unclear why we would
> choose the slower option over the tc hook. The two suggested
> use cases security policy and DSR sound like new features, any
> reason to not just use existing infrastructure?
>

Unfortunately, tc is not an option as it is required to interact with
nft objects (sets, maps, chains, etc), more complex than just a drop.
Also, when building new features we try to maintain the application
stack as simple as possible, not trying to do ugly integrations.

I understand that you measure performance with a drop, but using this
hook we reduce the datapath consistently for these use cases and
hence, improving traffic performance.

Thank you for your time!
John Fastabend Sept. 4, 2020, 3:46 p.m. UTC | #4
Laura García Liébana wrote:
> Hi,
> 
> On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> [...]
> >
> > I don't think it actualy improves performance at least I didn't observe
> > that. From the code its not clear why this would be the case either. As
> > a nit I would prefer that line removed from the commit message.
> >
> 
> It hasn't been proven to be untrue either.

huh? Its stated in the commit message with no reason for why it might
be the case and I can't reproduce it. Also the numbers posted show such a
slight increase (~1%) its likely just random system noise.

Sorry maybe that was a joke? Just poured some coffee so might be missing it.

> 
> 
> [...]
> >
> > Do you have plans to address the performance degradation? Otherwise
> > if I was building some new components its unclear why we would
> > choose the slower option over the tc hook. The two suggested
> > use cases security policy and DSR sound like new features, any
> > reason to not just use existing infrastructure?
> >
> 
> Unfortunately, tc is not an option as it is required to interact with
> nft objects (sets, maps, chains, etc), more complex than just a drop.
> Also, when building new features we try to maintain the application
> stack as simple as possible, not trying to do ugly integrations.

We have code that interacts with iptables as well. How I read the
above is in your case you have a bunch of existing software and you
want something slightly faster. Even if its not as fast the 10%
overhead is OK in your case and/or you believe the overhead of all
the other components is much higher so it will be lost in the noise.

> I understand that you measure performance with a drop, but using this
> hook we reduce the datapath consistently for these use cases and
> hence, improving traffic performance.

I measured drops because it was the benchmark provided in the patch
series. Also it likely looks a lot like any DDOS that might be put
there. You mentioned security policies which should probably include
DDOS so I would expect drop performance to be at least a useful
metric even if its not the only or most important in your case.

Lets post a selftest that represents the use case so folks like
myself can understand and benchmark correctly. This gives the extra
benefit of ensuring we don't regress going forward and can add it
to CI.

> 
> Thank you for your time!
nevola Sept. 5, 2020, 11:13 a.m. UTC | #5
Hi John,

On Fri, Sep 4, 2020 at 5:46 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Laura García Liébana wrote:
> > Hi,
> >
> > On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > [...]
> > >
> > > I don't think it actualy improves performance at least I didn't observe
> > > that. From the code its not clear why this would be the case either. As
> > > a nit I would prefer that line removed from the commit message.
> > >
> >
> > It hasn't been proven to be untrue either.
>
> huh? Its stated in the commit message with no reason for why it might
> be the case and I can't reproduce it. Also the numbers posted show such a
> slight increase (~1%) its likely just random system noise.
>
> Sorry maybe that was a joke? Just poured some coffee so might be missing it.
>
> >
> >
> > [...]
> > >
> > > Do you have plans to address the performance degradation? Otherwise
> > > if I was building some new components its unclear why we would
> > > choose the slower option over the tc hook. The two suggested
> > > use cases security policy and DSR sound like new features, any
> > > reason to not just use existing infrastructure?
> > >
> >
> > Unfortunately, tc is not an option as it is required to interact with
> > nft objects (sets, maps, chains, etc), more complex than just a drop.
> > Also, when building new features we try to maintain the application
> > stack as simple as possible, not trying to do ugly integrations.
>
> We have code that interacts with iptables as well. How I read the
> above is in your case you have a bunch of existing software and you
> want something slightly faster. Even if its not as fast the 10%
> overhead is OK in your case and/or you believe the overhead of all
> the other components is much higher so it will be lost in the noise.
>

Not a bunch of software, but the other way around. We replaced, a year
now, all the existing iptables, ip6tables, ebtables, arptables,
x_tables, ipset and ipvs components (both in-kernel and user-space)
with just nftables. As all these components features are integrated
natively, objects (sets, maps, chains, stateful objects, etc.) are
created in a form of nftables scheme that are integrated all together.
That is why the tc workaround is not an option for people that are
moving to nftables to use just a hook.


> > I understand that you measure performance with a drop, but using this
> > hook we reduce the datapath consistently for these use cases and
> > hence, improving traffic performance.
>
> I measured drops because it was the benchmark provided in the patch
> series. Also it likely looks a lot like any DDOS that might be put
> there. You mentioned security policies which should probably include
> DDOS so I would expect drop performance to be at least a useful
> metric even if its not the only or most important in your case.
>
> Lets post a selftest that represents the use case so folks like
> myself can understand and benchmark correctly. This gives the extra
> benefit of ensuring we don't regress going forward and can add it
> to CI.
>

From the 4 use cases we found until now (although I'm sure there will
be many more), 2 are related to filtering (not necessarily related to
DDoS mitigation though) and 2 are related to packet mangling. One of
the packet mangling DSR case, it is working great from ingress but in
certain traffic generated from user-space in the node, we require the
egress hook. In addition to that, having this hook available in nft,
we could improve performance by optimizing the datapath for several
load balancing cases.

As I said, I understand that you're worried about dropping performance
but this hook will allow many more possibilities to improve the
traffic datapath.

Thank you!
Pablo Neira Ayuso Sept. 19, 2020, 3:54 p.m. UTC | #6
Hi Lukas,

On Thu, Aug 27, 2020 at 10:55:03AM +0200, Lukas Wunner wrote:
[...]
> Overall, performance improves with this commit if neither netfilter nor
> traffic control is used. However it degrades a little if only traffic
> control is used, due to the "noinline", the additional outer static key
> and the added netfilter code:
> 
> * Before:       4730418pps 2270Mb/sec (2270600640bps)
> * After:        4759206pps 2284Mb/sec (2284418880bps)
> 
> * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> 
> * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
[...]
> Commands to enable egress traffic control:
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

1,6 0 0 0, means drop. This is a program with one instruction that
says "drop this packet".

> Commands to enable egress netfilter:
> nft add table netdev t
> nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
> nft add rule netdev t co ip daddr 4.3.2.1/32 drop

However, this is actually doing much more than that:

nft --debug=netlink add rule netdev t co ip daddr 4.3.2.1/32 drop
netdev 
  [ meta load protocol => reg 1 ]
  [ cmp eq reg 1 0x00000008 ]
  [ payload load 4b @ network header + 16 => reg 1 ]
  [ bitwise reg 1 = (reg=1 & 0xffffffff ) ^ 0x00000000 ]
  [ cmp eq reg 1 0x01020304 ]
  [ immediate reg 0 drop ]

So this is comparing apples and pears in some way :-)

Then, I'd suggest the Netfilter ruleset to compare it with tc should be:

add table netdev t
add chain netdev t co { type filter hook egress device foo priority 0 ; policy drop; }

Would you redo these numbers using this ruleset to address Daniel's
comments regarding performance?

Moreover, Daniel also suggested dev_direct_xmit() path from AF_PACKET
allows packets to escape from policy, it seems this also needs to be
extended to add a hook there too.

Could you work on this and send a v2?

Thank you.
Lukas Wunner Sept. 28, 2020, 12:20 p.m. UTC | #7
On Sat, Sep 19, 2020 at 05:54:05PM +0200, Pablo Neira Ayuso wrote:
> Would you redo these numbers using this ruleset to address Daniel's
> comments regarding performance?
> 
> Moreover, Daniel also suggested dev_direct_xmit() path from AF_PACKET
> allows packets to escape from policy, it seems this also needs to be
> extended to add a hook there too.
> 
> Could you work on this and send a v2?

Sure, will do.

Thanks,

Lukas
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0e303f6603f..5d88a40aca21 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -796,6 +796,10 @@  struct xps_dev_maps {
 
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_NET_EGRESS
+extern struct static_key_false egress_needed_key;
+#endif
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1779,6 +1783,7 @@  enum netdev_priv_flags {
  *	@xps_maps:	XXX: need comments on this one
  *	@miniq_egress:		clsact qdisc specific data for
  *				egress processing
+ *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
  *				the watchdog (see dev_watchdog())
@@ -2057,6 +2062,9 @@  struct net_device {
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	DECLARE_HASHTABLE	(qdisc_hash, 4);
diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 49e26479642e..92d3611a782e 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -47,6 +47,9 @@  static inline void nf_hook_netdev_init(struct net_device *dev)
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
+#endif
 }
 
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -72,4 +75,28 @@  static inline int nf_hook_ingress(struct sk_buff *skb)
 	return 0;
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
+
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(const struct sk_buff *skb)
+{
+	return nf_hook_netdev_active(NF_NETDEV_EGRESS,
+				     skb->dev->nf_hooks_egress);
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return nf_hook_netdev(skb, NF_NETDEV_EGRESS,
+			      skb->dev->nf_hooks_egress);
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline int nf_hook_egress_active(struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return 0;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
 #endif /* _NETFILTER_INGRESS_H_ */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bb9cb84114c1..49fe4f3b8fbd 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -97,7 +97,7 @@  void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
 #endif
 
-#ifdef CONFIG_NET_EGRESS
+#ifdef CONFIG_NET_SCH_EGRESS
 void net_inc_egress_queue(void);
 void net_dec_egress_queue(void);
 #endif
diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index ca9e63d6e0e4..d1616574c54f 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -50,6 +50,7 @@  enum nf_inet_hooks {
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e1aae2df6762..ebdd2a348f38 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2099,20 +2099,30 @@  EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
 #endif
 
 #ifdef CONFIG_NET_EGRESS
-static DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+
+#ifdef CONFIG_NET_SCH_EGRESS
+static DEFINE_STATIC_KEY_FALSE(sch_egress_needed_key);
 
 void net_inc_egress_queue(void)
 {
 	static_branch_inc(&egress_needed_key);
+	static_branch_inc(&sch_egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_inc_egress_queue);
 
 void net_dec_egress_queue(void)
 {
+	static_branch_dec(&sch_egress_needed_key);
 	static_branch_dec(&egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
-#endif
+
+#define sch_egress_needed static_branch_unlikely(&sch_egress_needed_key)
+#else
+#define sch_egress_needed false
+#endif /* CONFIG_NET_SCH_EGRESS */
+#endif /* CONFIG_NET_EGRESS */
 
 static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
@@ -3854,9 +3864,20 @@  int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
+static inline int nf_egress(struct sk_buff *skb)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = nf_hook_egress(skb);
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_SCH_EGRESS
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3890,6 +3911,25 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
+#endif /* CONFIG_NET_SCH_EGRESS */
+
+	return skb;
+}
+
+/* noinline reduces icache pressure in __dev_queue_xmit() hotpath for
+ * default case that neither netfilter nor traffic control is used.
+ */
+static noinline struct sk_buff *
+nf_sch_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	if (nf_hook_egress_active(skb)) {
+		*ret = nf_egress(skb);
+		if (*ret < 0)
+			return NULL;
+	}
+
+	if (sch_egress_needed)
+		return sch_handle_egress(skb, ret, dev);
 
 	return skb;
 }
@@ -4081,13 +4121,13 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
-		skb = sch_handle_egress(skb, &rc, dev);
+		skb = nf_sch_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 25313c29d799..93a300af82b8 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -10,6 +10,14 @@  config NETFILTER_INGRESS
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
 config NETFILTER_NETLINK
 	tristate
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 3ac7c8c1548d..d42c8c2a7c30 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -306,6 +306,12 @@  nf_hook_entry_head(struct net *net, int pf, unsigned int hooknum,
 		if (dev && dev_net(dev) == net)
 			return &dev->nf_hooks_ingress;
 	}
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (hooknum == NF_NETDEV_EGRESS) {
+		if (dev && dev_net(dev) == net)
+			return &dev->nf_hooks_egress;
+	}
 #endif
 	WARN_ON_ONCE(1);
 	return NULL;
@@ -318,11 +324,13 @@  static int __nf_register_net_hook(struct net *net, int pf,
 	struct nf_hook_entries __rcu **pp;
 
 	if (pf == NFPROTO_NETDEV) {
-#ifndef CONFIG_NETFILTER_INGRESS
-		if (reg->hooknum == NF_NETDEV_INGRESS)
+		if ((!IS_ENABLED(CONFIG_NETFILTER_INGRESS) &&
+		     reg->hooknum == NF_NETDEV_INGRESS) ||
+		    (!IS_ENABLED(CONFIG_NETFILTER_EGRESS) &&
+		     reg->hooknum == NF_NETDEV_EGRESS))
 			return -EOPNOTSUPP;
-#endif
-		if (reg->hooknum != NF_NETDEV_INGRESS ||
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
 		    !reg->dev || dev_net(reg->dev) != net)
 			return -EINVAL;
 	}
@@ -348,6 +356,10 @@  static int __nf_register_net_hook(struct net *net, int pf,
 	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 		net_inc_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+		static_branch_inc(&egress_needed_key);
+#endif
 #ifdef CONFIG_JUMP_LABEL
 	static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
@@ -406,6 +418,10 @@  static void __nf_unregister_net_hook(struct net *net, int pf,
 		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 			net_dec_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+			static_branch_dec(&egress_needed_key);
+#endif
 #ifdef CONFIG_JUMP_LABEL
 		static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index c78d01bc02e9..67ce6dbb5496 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -277,9 +277,11 @@  static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
-	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
+			  (1 << NF_NETDEV_EGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
 	},
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..390b7e6b9e91 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -396,6 +396,9 @@  config NET_SCH_INGRESS
 	  To compile this code as a module, choose M here: the module will be
 	  called sch_ingress with alias of sch_clsact.
 
+config NET_SCH_EGRESS
+	def_bool NET_SCH_INGRESS
+
 config NET_SCH_PLUG
 	tristate "Plug network traffic until release (PLUG)"
 	help