
[nf-next,3/3] netfilter: Introduce egress hook

Message ID 14ab7e5af20124a34a50426fd570da7d3b0369ce.1583927267.git.lukas@wunner.de
State Accepted
Delegated to: Pablo Neira
Series: Netfilter egress hook

Commit Message

Lukas Wunner March 11, 2020, 11:59 a.m. UTC
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.

Allow the same on egress.  Position the hook immediately before a packet
is handed to tc and then sent out on an interface, thereby mirroring the
ingress order.  This order allows marking packets in the netfilter
egress hook and subsequently using the mark in tc.  Another benefit of
this order is consistency with a lot of existing documentation which
says that egress tc is performed after netfilter hooks.
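
As an illustration (a sketch only; the device name, address and mark value
below are arbitrary), a packet can be marked from the netfilter egress hook
and the mark then matched by tc:

nft add table netdev t
nft add chain netdev t out \{ type filter hook egress device eth0 priority 0 \; \}
nft add rule netdev t out ip daddr 192.0.2.1 meta mark set 42
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip handle 42 fw action drop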

Egress hooks already exist for the most common protocols, such as
NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
they are executed earlier during packet processing.  However for more
exotic protocols, there is currently no provision to apply netfilter on
egress.  A common workaround is to enslave the interface to a bridge and
use ebtables, or to resort to tc.  But when the ingress hook was
introduced, consensus was that users should be given the choice to use
netfilter or tc, whichever tool suits their needs best:
https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/

There have also been occasional user requests for a netfilter egress
hook in the past, e.g.:
https://www.spinics.net/lists/netfilter/msg50038.html

Performance measurements with pktgen surprisingly show a speedup rather
than a slowdown with this commit:

* Without this commit:
  Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
  2920481pps 1401Mb/sec (1401830880bps) errors: 0

* With this commit:
  Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
  2941410pps 1411Mb/sec (1411876800bps) errors: 0

* Without this commit + tc egress:
  Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
  2562631pps 1230Mb/sec (1230062880bps) errors: 0

* With this commit + tc egress:
  Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
  2659259pps 1276Mb/sec (1276444320bps) errors: 0

* With this commit + nft egress:
  Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
  2413320pps 1158Mb/sec (1158393600bps) errors: 0

Tested on a bare-metal Core i7-3615QM; each measurement was performed
three times to verify that the numbers are stable.

Commands to perform a measurement:
modprobe pktgen
echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000

Commands for testing tc egress:
tc qdisc add dev lo clsact
tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32

Commands for testing nft egress:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop
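
(Note that the nft egress syntax above depends on the corresponding nft
user space support, which was posted as a separate series.)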

All testing was performed on the loopback interface to avoid distorting
measurements by the packet handling in the low-level Ethernet driver.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/netdevice.h        |  4 ++++
 include/linux/netfilter_netdev.h | 27 +++++++++++++++++++++++++++
 include/uapi/linux/netfilter.h   |  1 +
 net/core/dev.c                   | 23 ++++++++++++++++++++---
 net/netfilter/Kconfig            |  8 ++++++++
 net/netfilter/core.c             | 24 ++++++++++++++++++++----
 net/netfilter/nft_chain_filter.c |  4 +++-
 7 files changed, 83 insertions(+), 8 deletions(-)

Comments

Daniel Borkmann March 11, 2020, 2:05 p.m. UTC | #1
On 3/11/20 12:59 PM, Lukas Wunner wrote:
> Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> handle_ing() under unique static key") introduced the ability to
> classify packets on ingress.
> 
> Allow the same on egress.  Position the hook immediately before a packet
> is handed to tc and then sent out on an interface, thereby mirroring the
> ingress order.  This order allows marking packets in the netfilter
> egress hook and subsequently using the mark in tc.  Another benefit of
> this order is consistency with a lot of existing documentation which
> says that egress tc is performed after netfilter hooks.
> 
> Egress hooks already exist for the most common protocols, such as
> NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
> they are executed earlier during packet processing.  However for more
> exotic protocols, there is currently no provision to apply netfilter on
> egress.  A common workaround is to enslave the interface to a bridge and

Sorry for the late reply, but still NAK. The primary use case for this patch
set is an out-of-tree module, which you mentioned last time [0] ...

   There's another reason I chose netfilter over tc:  I need to activate the
   filter from a kernel module, hence need an in-kernel (rather than user space)
   API.

   netfilter provides that via nf_register_net_hook(), I couldn't find
   anything similar for tc.  And an egress netfilter hook seemed like
   an obvious missing feature given the presence of an ingress hook.

   The module I need this for is out-of-tree:
   https://github.com/RevolutionPi/piControl/commit/da199ccd2099

   In my experience the argument that a feature is needed for an out-of-tree
   module holds zero value upstream.  If there's no in-tree user, the feature
   isn't merged, I've seen this more than enough.  Which is why I didn't mention
   it in the first place.

   For our use case I wouldn't even need the nft user space support which I
   posted separately, I just implemented it for completeness and to increase
   acceptability of the present series.

... and as you mentioned there's local out and post routing already where you
can mark packets, so no need to make the fast-path slower for exotic protocols
which can be solved through other means.

   [0] https://lore.kernel.org/netdev/20191123142305.g2kkaudhhyui22fq@wunner.de/

> use ebtables, or to resort to tc.  But when the ingress hook was
> introduced, consensus was that users should be given the choice to use
> netfilter or tc, whichever tool suits their needs best:
> https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
> 
> There have also been occasional user requests for a netfilter egress
> hook in the past, e.g.:
> https://www.spinics.net/lists/netfilter/msg50038.html
> 
> Performance measurements with pktgen surprisingly show a speedup rather
> than a slowdown with this commit:
> 
> * Without this commit:
>    Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
>    2920481pps 1401Mb/sec (1401830880bps) errors: 0
> 
> * With this commit:
>    Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
>    2941410pps 1411Mb/sec (1411876800bps) errors: 0

So you are suggesting that we've just optimized the stack by adding more
hooks to it ...?

> * Without this commit + tc egress:
>    Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
>    2562631pps 1230Mb/sec (1230062880bps) errors: 0
> 
> * With this commit + tc egress:
>    Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
>    2659259pps 1276Mb/sec (1276444320bps) errors: 0
> 
> * With this commit + nft egress:
>    Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
>    2413320pps 1158Mb/sec (1158393600bps) errors: 0
> 
> Tested on a bare-metal Core i7-3615QM; each measurement was performed
> three times to verify that the numbers are stable.
> 
> Commands to perform a measurement:
> modprobe pktgen
> echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000
> 
> Commands for testing tc egress:
> tc qdisc add dev lo clsact
> tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32
> 
> Commands for testing nft egress:
> nft add table netdev t
> nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
> nft add rule netdev t co ip daddr 4.3.2.1/32 drop
> 
> All testing was performed on the loopback interface to avoid distorting
> measurements by the packet handling in the low-level Ethernet driver.
> 
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
Lukas Wunner March 11, 2020, 3:54 p.m. UTC | #2
On Wed, Mar 11, 2020 at 03:05:16PM +0100, Daniel Borkmann wrote:
> no need to make the fast-path slower for exotic protocols
> which can be solved through other means.

As said the fast-path gets faster, not slower.

> > * Without this commit:
> >    Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
> >    2920481pps 1401Mb/sec (1401830880bps) errors: 0
> > 
> > * With this commit:
> >    Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
> >    2941410pps 1411Mb/sec (1411876800bps) errors: 0
> 
> So you are suggesting that we've just optimized the stack by adding more
> hooks to it ...?

Since I've provided numbers to disprove your allegation, I think the
onus is now on you to prove that your allegation holds any water.
Please reproduce the measurements and let's go from there.

This isn't much work, I've made it really easy by providing all the
steps necessary in the commit message.

Thanks,

Lukas
Daniel Borkmann March 12, 2020, 10:40 p.m. UTC | #3
On 3/11/20 4:54 PM, Lukas Wunner wrote:
> On Wed, Mar 11, 2020 at 03:05:16PM +0100, Daniel Borkmann wrote:
>> no need to make the fast-path slower for exotic protocols
>> which can be solved through other means.
> 
> As said the fast-path gets faster, not slower.
> 
>>> * Without this commit:
>>>     Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
>>>     2920481pps 1401Mb/sec (1401830880bps) errors: 0
>>>
>>> * With this commit:
>>>     Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
>>>     2941410pps 1411Mb/sec (1411876800bps) errors: 0
>>
>> So you are suggesting that we've just optimized the stack by adding more
>> hooks to it ...?
> 
> Since I've provided numbers to disprove your allegation, I think the
> onus is now on you to prove that your allegation holds any water.
> Please reproduce the measurements and let's go from there.
> 
> This isn't much work, I've made it really easy by providing all the
> steps necessary in the commit message.

So in terms of micro-benchmarking with pktgen, if I understand you correctly,
you are basically measuring the loopback device by pushing packets through
__dev_queue_xmit() -> loopback_xmit() -> netif_rx() until the stack drops them
in the IP layer on the ingress side? I wonder what your perf profile looks like ...
Setting a drop point in the tc layer and then measuring the effect before/after
this change with CONFIG_NETFILTER_EGRESS enabled, I'm getting a stable degradation
from ~4.123Mpps to ~4.082Mpps with pktgen, definitely not seeing a speedup.

# ip link add dev foo type dummy
# ip link set dev foo up
# tc qdisc add dev foo clsact
# tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
# modprobe pktgen
# echo "add_device foo" > /proc/net/pktgen/kpktgend_3
# samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'foo' -n 400000000 -m '11:11:11:11:11:11' -d '1.1.1.1'
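
(The bytecode '1,6 0 0 0,' is a single classic BPF instruction, BPF_RET
with k=0, i.e. "ret 0".)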

Also to let pktgen count the skb:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index acc849df60b5..8920da7a7a67 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3372,11 +3372,11 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
                 ret = dev_queue_xmit(pkt_dev->skb);
                 switch (ret) {
                 case NET_XMIT_SUCCESS:
+               case NET_XMIT_DROP:
                         pkt_dev->sofar++;
                         pkt_dev->seq_num++;
                         pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
                         break;
-               case NET_XMIT_DROP:
                 case NET_XMIT_CN:
                 /* These are all valid return codes for a qdisc but
                  * indicate packets are being dropped or will likely

(In any case, this whole discussion is moot for out-of-tree code.)

Thanks,
Daniel
Pablo Neira Ayuso March 13, 2020, 2:55 p.m. UTC | #4
On Wed, Mar 11, 2020 at 03:05:16PM +0100, Daniel Borkmann wrote:
> On 3/11/20 12:59 PM, Lukas Wunner wrote:
> > Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> > handle_ing() under unique static key") introduced the ability to
> > classify packets on ingress.
> > 
> > Allow the same on egress.  Position the hook immediately before a packet
> > is handed to tc and then sent out on an interface, thereby mirroring the
> > ingress order.  This order allows marking packets in the netfilter
> > egress hook and subsequently using the mark in tc.  Another benefit of
> > this order is consistency with a lot of existing documentation which
> > says that egress tc is performed after netfilter hooks.
> > 
> > Egress hooks already exist for the most common protocols, such as
> > NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
> > they are executed earlier during packet processing.  However for more
> > exotic protocols, there is currently no provision to apply netfilter on
> > egress.  A common workaround is to enslave the interface to a bridge and
> 
> Sorry for late reply, but still NAK.

I agree Lukas' use case is very specific.

However, this is useful.

We have plans to support NAT64 and NAT46, and this is the right spot
to do this mangling. There is already support for the tunneling
infrastructure in netfilter from ingress; this spot on egress will
allow us to perform the tunneling from here. There is also no way to
drop traffic generated by dhclient; this hook also allows filtering such
locally generated traffic. And many more use cases.
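
For instance (a sketch reusing the nft egress syntax from the patch
description; the device name is arbitrary), the raw DHCP traffic that
dhclient emits through its packet socket could be dropped on egress:

nft add table netdev t
nft add chain netdev t out \{ type filter hook egress device eth0 priority 0 \; \}
nft add rule netdev t out udp sport 68 udp dport 67 drop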

The performance impact is negligible; Lukas already provided what you
asked for.

And more importantly:

I really think this patchset is _not_ interfering in your goals at all.
Daniel Borkmann March 14, 2020, 12:12 a.m. UTC | #5
On 3/13/20 3:55 PM, Pablo Neira Ayuso wrote:
> On Wed, Mar 11, 2020 at 03:05:16PM +0100, Daniel Borkmann wrote:
>> On 3/11/20 12:59 PM, Lukas Wunner wrote:
>>> Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
>>> handle_ing() under unique static key") introduced the ability to
>>> classify packets on ingress.
>>>
>>> Allow the same on egress.  Position the hook immediately before a packet
>>> is handed to tc and then sent out on an interface, thereby mirroring the
>>> ingress order.  This order allows marking packets in the netfilter
>>> egress hook and subsequently using the mark in tc.  Another benefit of
>>> this order is consistency with a lot of existing documentation which
>>> says that egress tc is performed after netfilter hooks.
>>>
>>> Egress hooks already exist for the most common protocols, such as
>>> NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
>>> they are executed earlier during packet processing.  However for more
>>> exotic protocols, there is currently no provision to apply netfilter on
>>> egress.  A common workaround is to enslave the interface to a bridge and
>>
>> Sorry for late reply, but still NAK.
> 
> I agree Lukas' use case is very specific.
> 
> However, this is useful.
> 
> We have plans to support NAT64 and NAT46, and this is the right spot
> to do this mangling. There is already support for the tunneling

But why is existing local-out or post-routing hook _not_ sufficient for
NAT64 given it being IP based?

> infrastructure in netfilter from ingress; this spot on egress will
> allow us to perform the tunneling from here. There is also no way to
> drop traffic generated by dhclient; this hook also allows filtering such
> locally generated traffic. And many more use cases.

This is a known fact for ~17 years [0] or probably more by now and no one
from netfilter folks cared to address it in all the years, so I presume
it cannot be important enough, and these days it can be filtered through
other means already. Tbh, it's a bit laughable that you bring this up as
an argument ...

   [0] https://www.spinics.net/lists/netfilter/msg19488.html

> The performance impact is negligible; Lukas already provided what you
> asked for.

Sure, and the claimed result was "as said the fast-path gets faster, not
slower" without any explanation or digging into details on why this might
be, especially since it appears counter-intuitive as was stated by the
author ... and later demonstrated w/ measurements that show the opposite.
Pablo Neira Ayuso March 15, 2020, 1:28 p.m. UTC | #6
Hello Daniel,

On Sat, Mar 14, 2020 at 01:12:02AM +0100, Daniel Borkmann wrote:
> On 3/13/20 3:55 PM, Pablo Neira Ayuso wrote:
[...]
> > We have plans to support NAT64 and NAT46, and this is the right spot
> > to do this mangling. There is already support for the tunneling
> 
> But why is existing local-out or post-routing hook _not_ sufficient for
> NAT64 given it being IP based?

Those hooks do not come at the end of IP processing. There is
very relevant IP code after those hooks that cannot be bypassed, such
as fragmentation, tunneling and neighbour output. Such a transformation
needs to happen after the IP processing, exactly where Lukas is
proposing.
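
(For IPv4, roughly: NF_INET_POST_ROUTING runs from ip_output(), before
ip_finish_output() performs fragmentation and hands the packet to
neigh_output() and ultimately dev_queue_xmit(), which is where the
proposed egress hook sits.)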

[...]
> > infrastructure in netfilter from ingress; this spot on egress will
> > allow us to perform the tunneling from here. There is also no way to
> > drop traffic generated by dhclient; this hook also allows filtering such
> > locally generated traffic. And many more use cases.
> 
> This is a known fact for ~17 years [0] or probably more by now and no one
> from netfilter folks cared to address it in all the years, so I presume
> it cannot be important enough, and these days it can be filtered through
> other means already. Tbh, it's a bit laughable that you bring this up as
> an argument ...
> 
>   [0] https://www.spinics.net/lists/netfilter/msg19488.html

Look: ip6tables, arptables and ebtables are a copy and paste from the
original iptables.

At that time, the only way to add support for ingress/egress
classification in netfilter would have been to add "devtables", yet
another copy and paste from iptables, and that was a no-go.

This is not a problem anymore since there is a consolidated netfilter
framework to achieve ingress/egress classification.

> > The performance impact is negligible; Lukas already provided what you
> > asked for.
> 
> Sure, and the claimed result was "as said the fast-path gets faster, not
> slower" without any explanation or digging into details on why this might
> be, especially since it appears counter-intuitive as was stated by the
> author ... and later demonstrated w/ measurements that show the opposite.

I remember one of your colleagues using this same argument against new
hooks back in 2015 [0], and the overhead of introducing that hook was
proven to be negligible. This patchset introduces code that looks very
much the same.

I can make a list of recent updates to the output path, several of
them targeted at very specific use cases. You did not care about the
performance impact of those at all; however, you care about
netfilter for some unknown reason.

In my opinion, your original feedback has been addressed; it's time to
move on.

Thank you.

[0] https://lwn.net/Articles/642414/
nevola April 23, 2020, 2:44 p.m. UTC | #7
On Sun, Mar 15, 2020 at 2:29 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> Hello Daniel,
>
> On Sat, Mar 14, 2020 at 01:12:02AM +0100, Daniel Borkmann wrote:
> > On 3/13/20 3:55 PM, Pablo Neira Ayuso wrote:
> [...]
> > > We have plans to support NAT64 and NAT46, and this is the right spot
> > > to do this mangling. There is already support for the tunneling
> >
> > But why is existing local-out or post-routing hook _not_ sufficient for
> > NAT64 given it being IP based?
>
> Those hooks do not come at the end of IP processing. There is
> very relevant IP code after those hooks that cannot be bypassed, such
> as fragmentation, tunneling and neighbour output. Such a transformation
> needs to happen after the IP processing, exactly where Lukas is
> proposing.
>
> [...]
> > > infrastructure in netfilter from ingress; this spot on egress will
> > > allow us to perform the tunneling from here. There is also no way to
> > > drop traffic generated by dhclient; this hook also allows filtering such
> > > locally generated traffic. And many more use cases.

Hi,

Any chance to continue with this approach? I'm afraid outbound
AF_PACKET traffic also cannot be filtered without this hook.
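
(Packet sockets hand their frames to dev_queue_xmit() directly, below
the protocol-level netfilter hooks, so only a device-level egress hook
can see this traffic.)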

Thanks.
Lukas Wunner April 23, 2020, 4:05 p.m. UTC | #8
On Thu, Apr 23, 2020 at 04:44:44PM +0200, Laura Garcia wrote:
> On Sun, Mar 15, 2020 at 2:29 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Sat, Mar 14, 2020 at 01:12:02AM +0100, Daniel Borkmann wrote:
> > > On 3/13/20 3:55 PM, Pablo Neira Ayuso wrote:
> > > > We have plans to support NAT64 and NAT46, and this is the right spot
> > > > to do this mangling. There is already support for the tunneling
> > >
> > > But why is existing local-out or post-routing hook _not_ sufficient for
> > > NAT64 given it being IP based?
> >
> > Those hooks do not come at the end of IP processing. There is
> > very relevant IP code after those hooks that cannot be bypassed, such
> > as fragmentation, tunneling and neighbour output. Such a transformation
> > needs to happen after the IP processing, exactly where Lukas is
> > proposing.
> >
> > [...]
> > > > infrastructure in netfilter from ingress; this spot on egress will
> > > > allow us to perform the tunneling from here. There is also no way to
> > > > drop traffic generated by dhclient; this hook also allows filtering such
> > > > locally generated traffic. And many more use cases.
> 
> Any chance to continue with this approach? I'm afraid outbound
> AF_PACKET traffic also cannot be filtered without this hook.

Thanks Laura, good to hear there's interest in the functionality.

Daniel submitted a revert of this series but didn't cc me:

https://lore.kernel.org/netdev/bbdee6355234e730ef686f9321bd072bcf4bb232.1584523237.git.daniel@iogearbox.net/

In the ensuing discussion it turned out that the performance argument
may be addressed by a rearrangement of sch_handle_egress() and
nf_egress() invocations.  I could look into amending the series
accordingly and resubmitting, though I'm currently swamped with
other work.

The question is whether that's going to be sufficient because Daniel
mentioned having an in-tree user as a prerequisite for accepting this
feature, to which Pablo responded with NAT64/NAT46.  I don't have
intentions of implementing those, but maybe someone else has.

Thanks,

Lukas
Pablo Neira Ayuso April 27, 2020, 11:44 p.m. UTC | #9
Hi Lukas,

On Thu, Apr 23, 2020 at 06:05:42PM +0200, Lukas Wunner wrote:
[...]
> Daniel submitted a revert of this series but didn't cc me:
> 
> https://lore.kernel.org/netdev/bbdee6355234e730ef686f9321bd072bcf4bb232.1584523237.git.daniel@iogearbox.net/
> 
> In the ensuing discussion it turned out that the performance argument
> may be addressed by a rearrangement of sch_handle_egress() and
> nf_egress() invocations.  I could look into amending the series
> accordingly and resubmitting, though I'm currently swamped with
> other work.
> 
> The question is whether that's going to be sufficient because Daniel
> mentioned having an in-tree user as a prerequisite for accepting this
> feature, to which Pablo responded with NAT64/NAT46.  I don't have
> intentions of implementing those, but maybe someone else has.

I'd love to work on integrating that feature; there are a few
implementations out there that might be useful for this.

However, I'm terribly biased: I'm the Netfilter maintainer.

Even so, I really think this hook is going to be very useful for
the Linux community from day one.
Daniel Borkmann April 28, 2020, 8:11 p.m. UTC | #10
Hey Lukas,

On 4/23/20 6:05 PM, Lukas Wunner wrote:
> On Thu, Apr 23, 2020 at 04:44:44PM +0200, Laura Garcia wrote:
>> On Sun, Mar 15, 2020 at 2:29 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>> On Sat, Mar 14, 2020 at 01:12:02AM +0100, Daniel Borkmann wrote:
>>>> On 3/13/20 3:55 PM, Pablo Neira Ayuso wrote:
>>>>> We have plans to support NAT64 and NAT46, and this is the right spot
>>>>> to do this mangling. There is already support for the tunneling
>>>>
>>>> But why is existing local-out or post-routing hook _not_ sufficient for
>>>> NAT64 given it being IP based?
>>>
>>> Those hooks do not come at the end of IP processing. There is
>>> very relevant IP code after those hooks that cannot be bypassed, such
>>> as fragmentation, tunneling and neighbour output. Such a transformation
>>> needs to happen after the IP processing, exactly where Lukas is
>>> proposing.
>>>
>>> [...]
>>>>> infrastructure in netfilter from ingress; this spot on egress will
>>>>> allow us to perform the tunneling from here. There is also no way to
>>>>> drop traffic generated by dhclient; this hook also allows filtering such
>>>>> locally generated traffic. And many more use cases.
>>
>> Any chance to continue with this approach? I'm afraid outbound
>> AF_PACKET traffic also cannot be filtered without this hook.
> 
> Thanks Laura, good to hear there's interest in the functionality.
> 
> Daniel submitted a revert of this series but didn't cc me:
> 
> https://lore.kernel.org/netdev/bbdee6355234e730ef686f9321bd072bcf4bb232.1584523237.git.daniel@iogearbox.net/
> 
> In the ensuing discussion it turned out that the performance argument
> may be addressed by a rearrangement of sch_handle_egress() and
> nf_egress() invocations.  I could look into amending the series
> accordingly and resubmitting, though I'm currently swamped with
> other work.

The rework of these hooks is still on my todo list; I'm too swamped with
other stuff as well right now, but I'll try to have a prototype this
net-next development cycle.

Thanks,
Daniel
Lukas Wunner Aug. 20, 2020, 10:37 a.m. UTC | #11
On Tue, Apr 28, 2020 at 10:11:19PM +0200, Daniel Borkmann wrote:
> > https://lore.kernel.org/netdev/bbdee6355234e730ef686f9321bd072bcf4bb232.1584523237.git.daniel@iogearbox.net/
> > 
> > In the ensuing discussion it turned out that the performance argument
> > may be addressed by a rearrangement of sch_handle_egress() and
> > nf_egress() invocations.  I could look into amending the series
> > accordingly and resubmitting, though I'm currently swamped with
> > other work.
> 
> The rework of these hooks is still on my todo list; I'm too swamped with
> other stuff as well right now, but I'll try to have a prototype this
> net-next development cycle.

Daniel, I have it running now the way you want it (I think) and am
benchmarking several variants.  I'm now faced with the following
choice:

* One variant speeds up the default case with neither tc nor nft in use
  (2241 -> 2285 Mb/sec), but tc becomes a little slower (1929 -> 1927
  Mb/sec).

* Another variant still speeds up the default case but not by as much
  (2241 -> 2274 Mb/sec) and speeds up tc as well (1929 -> 1931 Mb/sec).

The difference is that the first variant uses an outer static_key which
patches in a function containing two inner static_keys for nft + tc.
The second variant uses an outer static_key and a single inner static_key
for nft (but no static_key for tc).

nft is slower in the second variant (2231 -> 2225 Mb/sec).

I'm leaning towards the first variant, but because it incurs a small
performance degradation if tc is used, I wanted to give you a heads-up.
If this is totally unacceptable for you, I'll post the second variant
instead.

I need a few more days to update the commit messages, perform further
testing and apply polish, so I plan to dump the patches to the list
next week.  Just thought I'd ask for your opinion, I'm aware this is
difficult to judge without seeing the code.

I'm using static_keys instead of BPF fmod_ret (which you've suggested)
because I would like to avoid a dependency on BPF as it might not be an
option for space-constrained embedded machines and a BPF JIT isn't
available on as many arches as CONFIG_JUMP_LABEL.

Thanks,

Lukas
Lukas Wunner Aug. 20, 2020, 4:35 p.m. UTC | #12
On Thu, Aug 20, 2020 at 12:37:01PM +0200, Lukas Wunner wrote:
> I need a few more days to update the commit messages, perform further
> testing and apply polish, so I plan to dump the patches to the list
> next week.  Just thought I'd ask for your opinion, I'm aware this is
> difficult to judge without seeing the code.

FWIW, the code for the first variant is in the top-most commit on the
following branch:

https://github.com/l1k/linux/commits/nft_egress_v3

Again, it gives the best performance if neither nft nor tc classification
is enabled on egress, but incurs a small performance degradation for
the tc-only case.  Ignore the commit message, I haven't updated it yet.

And to get the second variant, the following patch needs to be applied
with "patch -R -p1".  This variant speeds up the normal case but not by
as much, also speeds up tc, has worse performance for nft.  It doesn't
use a static_key for tc, save for the outer one (shared with nft),
hence no performance degradation for the tc-only case.

-- >8 --
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93e63e756771..ef2cc21a0f11 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -785,6 +785,10 @@ struct xps_dev_maps {
 
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_NET_EGRESS
+extern struct static_key_false egress_needed_key;
+#endif
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bb9cb84114c1..49fe4f3b8fbd 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -97,7 +97,7 @@ void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
 #endif
 
-#ifdef CONFIG_NET_EGRESS
+#ifdef CONFIG_NET_SCH_EGRESS
 void net_inc_egress_queue(void);
 void net_dec_egress_queue(void);
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index a7e2ff191481..f1ac84beef76 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1815,20 +1815,30 @@ EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
 #endif
 
 #ifdef CONFIG_NET_EGRESS
-static DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+
+#ifdef CONFIG_NET_SCH_EGRESS
+static DEFINE_STATIC_KEY_FALSE(sch_egress_needed_key);
 
 void net_inc_egress_queue(void)
 {
 	static_branch_inc(&egress_needed_key);
+	static_branch_inc(&sch_egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_inc_egress_queue);
 
 void net_dec_egress_queue(void)
 {
+	static_branch_dec(&sch_egress_needed_key);
 	static_branch_dec(&egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
-#endif
+
+#define sch_egress_needed static_branch_likely(&sch_egress_needed_key)
+#else
+#define sch_egress_needed false
+#endif /* CONFIG_NET_SCH_EGRESS */
+#endif /* CONFIG_NET_EGRESS */
 
 static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
@@ -3604,7 +3614,7 @@ static int nf_egress(struct sk_buff *skb)
 static inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_SCH_EGRESS
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3638,7 +3648,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
-#endif /* CONFIG_NET_CLS_ACT */
+#endif /* CONFIG_NET_SCH_EGRESS */
 
 	return skb;
 }
@@ -3655,7 +3665,10 @@ nf_sch_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 			return NULL;
 	}
 
-	return sch_handle_egress(skb, ret, dev);
+	if (sch_egress_needed)
+		return sch_handle_egress(skb, ret, dev);
+
+	return skb;
 }
 #endif /* CONFIG_NET_EGRESS */
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index a78a439e4fdd..8e7d5eed663c 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -358,7 +358,7 @@ static int __nf_register_net_hook(struct net *net, int pf,
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
 	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
-		net_inc_egress_queue();
+		static_branch_inc(&egress_needed_key);
 #endif
 #ifdef CONFIG_JUMP_LABEL
 	static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]);
@@ -420,7 +420,7 @@ static void __nf_unregister_net_hook(struct net *net, int pf,
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
 		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
-			net_dec_egress_queue();
+			static_branch_dec(&egress_needed_key);
 #endif
 #ifdef CONFIG_JUMP_LABEL
 		static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]);
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..806d5d60fc9a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -383,6 +383,9 @@ config NET_SCH_INGRESS
 	  To compile this code as a module, choose M here: the module will be
 	  called sch_ingress with alias of sch_clsact.
 
+config NET_SCH_EGRESS
+	def_bool NET_SCH_INGRESS
+
 config NET_SCH_PLUG
 	tristate "Plug network traffic until release (PLUG)"
 	---help---

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6c3f7032e8d9..2d2606d1b1b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1750,6 +1750,7 @@  enum netdev_priv_flags {
  *	@xps_maps:	XXX: need comments on this one
  *	@miniq_egress:		clsact qdisc specific data for
  *				egress processing
+ *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
  *				the watchdog (see dev_watchdog())
@@ -2025,6 +2026,9 @@  struct net_device {
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	DECLARE_HASHTABLE	(qdisc_hash, 4);
diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 49e26479642e..92d3611a782e 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -47,6 +47,9 @@  static inline void nf_hook_netdev_init(struct net_device *dev)
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
+#endif
 }
 
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -72,4 +75,28 @@  static inline int nf_hook_ingress(struct sk_buff *skb)
 	return 0;
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
+
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(const struct sk_buff *skb)
+{
+	return nf_hook_netdev_active(NF_NETDEV_EGRESS,
+				     skb->dev->nf_hooks_egress);
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return nf_hook_netdev(skb, NF_NETDEV_EGRESS,
+			      skb->dev->nf_hooks_egress);
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline int nf_hook_egress_active(struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return 0;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
 #endif /* _NETFILTER_INGRESS_H_ */
diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index ca9e63d6e0e4..d1616574c54f 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -50,6 +50,7 @@  enum nf_inet_hooks {
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2e55d4e41517..89724a9f1bbb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3771,6 +3771,7 @@  EXPORT_SYMBOL(dev_loopback_xmit);
 static struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3804,11 +3805,24 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
-
+#endif /* CONFIG_NET_CLS_ACT */
 	return skb;
 }
 #endif /* CONFIG_NET_EGRESS */
 
+static inline int nf_egress(struct sk_buff *skb)
+{
+	if (nf_hook_egress_active(skb)) {
+		int ret;
+
+		rcu_read_lock();
+		ret = nf_hook_egress(skb);
+		rcu_read_unlock();
+		return ret;
+	}
+	return 0;
+}
+
 #ifdef CONFIG_XPS
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
 			       struct xps_dev_maps *dev_maps, unsigned int tci)
@@ -3995,13 +4009,16 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
+		if (nf_egress(skb) < 0)
+			goto out;
+
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 468fea1aebba..f4c68f60f241 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -10,6 +10,14 @@  config NETFILTER_INGRESS
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
 config NETFILTER_NETLINK
 	tristate
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 78f046ec506f..85e9c959aba7 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -306,6 +306,12 @@  nf_hook_entry_head(struct net *net, int pf, unsigned int hooknum,
 		if (dev && dev_net(dev) == net)
 			return &dev->nf_hooks_ingress;
 	}
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (hooknum == NF_NETDEV_EGRESS) {
+		if (dev && dev_net(dev) == net)
+			return &dev->nf_hooks_egress;
+	}
 #endif
 	WARN_ON_ONCE(1);
 	return NULL;
@@ -318,11 +324,13 @@  static int __nf_register_net_hook(struct net *net, int pf,
 	struct nf_hook_entries __rcu **pp;
 
 	if (pf == NFPROTO_NETDEV) {
-#ifndef CONFIG_NETFILTER_INGRESS
-		if (reg->hooknum == NF_NETDEV_INGRESS)
+		if ((!IS_ENABLED(CONFIG_NETFILTER_INGRESS) &&
+		     reg->hooknum == NF_NETDEV_INGRESS) ||
+		    (!IS_ENABLED(CONFIG_NETFILTER_EGRESS) &&
+		     reg->hooknum == NF_NETDEV_EGRESS))
 			return -EOPNOTSUPP;
-#endif
-		if (reg->hooknum != NF_NETDEV_INGRESS ||
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
 		    !reg->dev || dev_net(reg->dev) != net)
 			return -EINVAL;
 	}
@@ -348,6 +356,10 @@  static int __nf_register_net_hook(struct net *net, int pf,
 	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 		net_inc_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+		net_inc_egress_queue();
+#endif
 #ifdef CONFIG_JUMP_LABEL
 	static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
@@ -406,6 +418,10 @@  static void __nf_unregister_net_hook(struct net *net, int pf,
 		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 			net_dec_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+			net_dec_egress_queue();
+#endif
 #ifdef CONFIG_JUMP_LABEL
 		static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index c78d01bc02e9..67ce6dbb5496 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -277,9 +277,11 @@  static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
-	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
+			  (1 << NF_NETDEV_EGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
 	},
 };