
[net-next,v3,0/9] netfilter: flowtable bridge and vlan enhancements

Message ID 20201111193737.1793-1-pablo@netfilter.org


Pablo Neira Ayuso Nov. 11, 2020, 7:37 p.m. UTC
Hi,

The following patchset augments the Netfilter flowtable fastpath [1] to
support network topologies that combine IP forwarding, bridge and
VLAN devices.

A typical scenario that can benefit from this infrastructure is composed
of several VMs connected to bridge ports where the bridge master device
'br0' has an IP address. A DHCP server is also assumed to be running to
provide connectivity to the VMs. The VMs reach the Internet through
'br0' as default gateway, which makes the packet enter the IP forwarding
path. Then, netfilter is used to NAT the packets before they leave
through the wan device.

Something like this:

                       fast path
                .------------------------.
               /                          \
               |           IP forwarding   |
               |          /             \  .
               |       br0               eth0
               .       / \
               -- veth1  veth2
                   .
                   .
                   .
                 eth0
           ab:cd:ef:ab:cd:ef
                  VM

The idea is to accelerate forwarding by building a fast path that takes
packets from the ingress path of the bridge port and places them in the
egress path of the wan device (and vice versa), skipping the classic
bridge and IP stack paths.

This patchset is composed of:

Patch #1 adds a placeholder for the hash calculation, instead of using
	 the dir field. Changes in v3: this patch is new.

Patch #2 adds the transmit path type field to the flow tuple. Two transmit
         paths are supported so far: the neighbour and the xfrm transmit
         paths. This patch comes in preparation to add a new direct ethernet
         transmit path (see patch #7).

	 Changes in v3: I have simplified xmit path type logic.

Patch #3 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
         netdev_ops. This new function describes the list of netdevice hops
         to reach a given destination MAC address in the local network topology,
         e.g.
                           IP forwarding
                          /             \
                       br0              eth0
                       / \
                   veth1 veth2
                    .
                    .
                    .
                   eth0
             ab:cd:ef:ab:cd:ef

          where veth1 and veth2 are bridge ports and eth0 provides Internet
          connectivity. eth0 is the interface in the VM which is connected to
          the veth1 bridge port. Then, for packets going to br0 whose
          destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path()
          provides the following path: br0 -> veth1.

Patch #4 adds .ndo_fill_forward_path for VLAN devices, which provides the next
         device hop via vlan->real_dev. This annotates the VLAN id and protocol.
	 This is useful to know what VLAN headers are expected from the ingress
	 device. This also provides information regarding the VLAN headers
	 to be pushed in the egress path.

Patch #5 adds .ndo_fill_forward_path for bridge devices, which allows FDB
         lookups to locate the next device hop (bridge port) in the
	 forwarding path.

	 Changes in v3: use READ_ONCE() as requested by Nikolay Aleksandrov.

Patch #6 updates the flowtable to use the dev_fill_forward_path()
         infrastructure to obtain the ingress device in the fastpath.

	 Changes in v3: If the user specifies:

                devices = { br0, veth1 }
                            ^^^

         then, this still works and the fastpath is only created
         between br0 and veth1 using the neighbour path. Thus, the
	 classic bridge path is exercised. If the user specifies:

                devices = { veth0, veth1 }
                            ^^^^^

         then, dev_fill_forward_path() provides an earlier ingress fastpath,
	 i.e. flowtable lookups for incoming packets happen in veth0 instead
	 of br0.

	There is a fallback to the neighbour path if:

	- dev_fill_forward_path() finds no real ingress device.
	- the ingress device that is obtained is not part of the flowtable
	  devices.
	- this route has an xfrm policy.

Patch #7 updates the flowtable to use dev_fill_forward_path() to obtain the
	 egress device in the forwarding path. This also adds the direct
	 ethernet transmit path, which pushes the ethernet header onto the
	 packet and sends it through dev_queue_xmit(). This patch adds
	 support for the bridge, so bridge ports use this direct xmit path.

	 Changes in v3:
	- add the direct xmit path and use it for bridge device types;
	  in v2 this came as two separate patches.
	- use a union so that either the direct xmit info or the IP dst
	  is stored in the flow tuple, to reduce memory footprint.
	- use dev_hard_header() in the direct xmit path to simplify the
	  VLAN support coming in patch #8.

Patch #8 adds ingress VLAN support (up to 2 VLAN tags, QinQ). The VLAN
	 information is also provided by dev_fill_forward_path(). Store the
	 VLAN id and protocol in the flow tuple for hash lookups. The VLAN
	 support in the xmit path is achieved by annotating the first vlan
	 device found in the xmit path and by calling dev_hard_header()
	 (previous patch #7) before dev_queue_xmit().

	Changes in v3:
	- Use VLAN device xmit path instead of manually pushing the
	  VLAN headers from the flowtable datapath to reuse the existing
	  device xmit infrastructure.

Patch #9 extends the nft_flowtable.sh selftest with a test that covers
	 the bridge and vlan support coming in this patchset.

	Changes in v3: this patch is new.

= Performance numbers

My testbed environment consists of three containers:

  192.168.20.2     .20.1     .10.1   10.141.10.2
         veth0       veth0 veth1      veth0
        ns1 <---------> nsr1 <--------> ns2
                        SNAT
     iperf -c                          iperf -s

where nsr1 is used for forwarding. There is a bridge device br0 in nsr1
and veth0 is a port of br0. SNAT is performed on the veth1 device of nsr1.

- ns2 runs iperf -s
- ns1 runs iperf -c 10.141.10.2 -n 100G

My results are:

- Baseline (no flowtable, classic forwarding path + netfilter): ~16 Gbit/s
- Fastpath (with flowtable, this patchset): ~25 Gbit/s

This is an improvement of ~50% compared to the baseline. My testbed is not
optimized: there are a few instrumentation toggles turned on in my
kernel for debugging, and this equipment is aging a bit.

Please apply, thanks.

Pablo Neira Ayuso (9):
  netfilter: flowtable: add hash offset field to tuple
  netfilter: flowtable: add xmit path types
  net: resolve forwarding path from virtual netdevice and HW destination address
  net: 8021q: resolve forwarding path for vlan devices
  bridge: resolve forwarding path for bridge devices
  netfilter: flowtable: use dev_fill_forward_path() to obtain ingress device
  netfilter: flowtable: use dev_fill_forward_path() to obtain egress device
  netfilter: flowtable: add vlan support
  selftests: netfilter: flowtable bridge and VLAN support

 include/linux/netdevice.h                     |  35 +++
 include/net/netfilter/nf_flow_table.h         |  43 +++-
 net/8021q/vlan_dev.c                          |  15 ++
 net/bridge/br_device.c                        |  24 +++
 net/core/dev.c                                |  31 +++
 net/netfilter/nf_flow_table_core.c            |  51 +++--
 net/netfilter/nf_flow_table_ip.c              | 200 ++++++++++++++----
 net/netfilter/nft_flow_offload.c              | 159 +++++++++++++-
 .../selftests/netfilter/nft_flowtable.sh      |  82 +++++++
 9 files changed, 580 insertions(+), 60 deletions(-)

Comments

Jakub Kicinski Nov. 14, 2020, 1:55 a.m. UTC | #1
On Wed, 11 Nov 2020 20:37:28 +0100 Pablo Neira Ayuso wrote:
> The following patchset augments the Netfilter flowtable fastpath [1] to
> support for network topologies that combine IP forwarding, bridge and
> VLAN devices.
> 
> A typical scenario that can benefit from this infrastructure is composed
> of several VMs connected to bridge ports where the bridge master device
> 'br0' has an IP address. A DHCP server is also assumed to be running to
> provide connectivity to the VMs. The VMs reach the Internet through
> 'br0' as default gateway, which makes the packet enter the IP forwarding
> path. Then, netfilter is used to NAT the packets before they leave
> through the wan device.
> 
> Something like this:
> 
>                        fast path
>                 .------------------------.
>                /                          \
>                |           IP forwarding   |
>                |          /             \  .
>                |       br0               eth0
>                .       / \
>                -- veth1  veth2
>                    .
>                    .
>                    .
>                  eth0
>            ab:cd:ef:ab:cd:ef
>                   VM
> 
> The idea is to accelerate forwarding by building a fast path that takes
> packets from the ingress path of the bridge port and place them in the
> egress path of the wan device (and vice versa). Hence, skipping the
> classic bridge and IP stack paths.

The problem that immediately comes to mind is that if there is any
dynamic forwarding state the cache you're creating would need to be
flushed when FDB changes. Are you expecting users would plug into the
flowtable devices where they know things are fairly static?
Pablo Neira Ayuso Nov. 14, 2020, 11:59 a.m. UTC | #2
On Fri, Nov 13, 2020 at 05:55:56PM -0800, Jakub Kicinski wrote:
> On Wed, 11 Nov 2020 20:37:28 +0100 Pablo Neira Ayuso wrote:
> > The following patchset augments the Netfilter flowtable fastpath [1] to
> > support for network topologies that combine IP forwarding, bridge and
> > VLAN devices.
> > 
> > A typical scenario that can benefit from this infrastructure is composed
> > of several VMs connected to bridge ports where the bridge master device
> > 'br0' has an IP address. A DHCP server is also assumed to be running to
> > provide connectivity to the VMs. The VMs reach the Internet through
> > 'br0' as default gateway, which makes the packet enter the IP forwarding
> > path. Then, netfilter is used to NAT the packets before they leave
> > through the wan device.
> > 
> > Something like this:
> > 
> >                        fast path
> >                 .------------------------.
> >                /                          \
> >                |           IP forwarding   |
> >                |          /             \  .
> >                |       br0               eth0
> >                .       / \
> >                -- veth1  veth2
> >                    .
> >                    .
> >                    .
> >                  eth0
> >            ab:cd:ef:ab:cd:ef
> >                   VM
> > 
> > The idea is to accelerate forwarding by building a fast path that takes
> > packets from the ingress path of the bridge port and place them in the
> > egress path of the wan device (and vice versa). Hence, skipping the
> > classic bridge and IP stack paths.
> 
> The problem that immediately comes to mind is that if there is any
> dynamic forwarding state the cache you're creating would need to be
> flushed when FDB changes. Are you expecting users would plug into the
> flowtable devices where they know things are fairly static?

If any of the flowtable devices goes down or is removed, the entries are
removed from the flowtable. This means packets of existing flows are
pushed back up to the classic bridge / forwarding path to re-evaluate
the fast path.

For each new flow, the fast path is selected freshly, so it uses
the up-to-date FDB to select a new bridge port.

Existing flows still follow the old path. The same happens with FIB
currently.

It should be possible to explore purging entries in the flowtable that
are stale due to changes in the topology (either in FDB or FIB).

What scenario do you have specifically in mind? Something like VM
migrates from one bridge port to another?

Thank you.
Tobias Waldekranz Nov. 14, 2020, 2 p.m. UTC | #3
On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> If any of the flowtable device goes down / removed, the entries are
> removed from the flowtable. This means packets of existing flows are
> pushed up back to classic bridge / forwarding path to re-evaluate the
> fast path.
>
> For each new flow, the fast path that is selected freshly, so they use
> the up-to-date FDB to select a new bridge port.
>
> Existing flows still follow the old path. The same happens with FIB
> currently.
>
> It should be possible to explore purging entries in the flowtable that
> are stale due to changes in the topology (either in FDB or FIB).
>
> What scenario do you have specifically in mind? Something like VM
> migrates from one bridge port to another?

This should work in the case when the bridge ports are normal NICs or
switchdev ports, right?

In that case, relying on link state is brittle as you can easily have a
switch or a media converter between the bridge and the end-station:

        br0                  br0
        / \                  / \
    eth0   eth1          eth0   eth1
     /      \      =>     /      \
  [sw0]     [sw1]      [sw0]     [sw1]
   /          \         /          \
  A                                 A

In a scenario like this, A has clearly moved. But neither eth0 nor eth1
has seen any changes in link state.

This particular example is a bit contrived. But this is essentially what
happens in redundant topologies when reconfigurations occur (e.g. STP).

These protocols will typically signal reconfigurations to all bridges
though, so as long as the affected flows are flushed at the same time as
the FDB it should work.

Interesting stuff!
Jakub Kicinski Nov. 14, 2020, 5:03 p.m. UTC | #4
On Sat, 14 Nov 2020 15:00:03 +0100 Tobias Waldekranz wrote:
> On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > If any of the flowtable device goes down / removed, the entries are
> > removed from the flowtable. This means packets of existing flows are
> > pushed up back to classic bridge / forwarding path to re-evaluate the
> > fast path.
> >
> > For each new flow, the fast path that is selected freshly, so they use
> > the up-to-date FDB to select a new bridge port.
> >
> > Existing flows still follow the old path. The same happens with FIB
> > currently.
> >
> > It should be possible to explore purging entries in the flowtable that
> > are stale due to changes in the topology (either in FDB or FIB).
> >
> > What scenario do you have specifically in mind? Something like VM
> > migrates from one bridge port to another?  

Indeed, 2 VMs A and B, talking to each other, A is _outside_ the
system (reachable via eth0), B is inside (veth1). When A moves inside
and gets its veth, neither B's veth1 nor eth0 will change state, so the
cache wouldn't get flushed, right?

> This should work in the case when the bridge ports are normal NICs or
> switchdev ports, right?
> 
> In that case, relying on link state is brittle as you can easily have a
> switch or a media converter between the bridge and the end-station:
> 
>         br0                  br0
>         / \                  / \
>     eth0   eth1          eth0   eth1
>      /      \      =>     /      \
>   [sw0]     [sw1]      [sw0]     [sw1]
>    /          \         /          \
>   A                                 A
> 
> In a scenario like this, A has clearly moved. But neither eth0 nor eth1
> has seen any changes in link state.
> 
> This particular example is a bit contrived. But this is essentially what
> happens in redundant topologies when reconfigurations occur (e.g. STP).
> 
> These protocols will typically signal reconfigurations to all bridges
> though, so as long as the affected flows are flushed at the same time as
> the FDB it should work.
> 
> Interesting stuff!

Agreed, could be interesting for all NAT/conntrack setups, not just VMs.
Pablo Neira Ayuso Nov. 16, 2020, 10:18 p.m. UTC | #5
On Sat, Nov 14, 2020 at 09:03:47AM -0800, Jakub Kicinski wrote:
> On Sat, 14 Nov 2020 15:00:03 +0100 Tobias Waldekranz wrote:
> > On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > If any of the flowtable device goes down / removed, the entries are
> > > removed from the flowtable. This means packets of existing flows are
> > > pushed up back to classic bridge / forwarding path to re-evaluate the
> > > fast path.
> > >
> > > For each new flow, the fast path that is selected freshly, so they use
> > > the up-to-date FDB to select a new bridge port.
> > >
> > > Existing flows still follow the old path. The same happens with FIB
> > > currently.
> > >
> > > It should be possible to explore purging entries in the flowtable that
> > > are stale due to changes in the topology (either in FDB or FIB).
> > >
> > > What scenario do you have specifically in mind? Something like VM
> > > migrates from one bridge port to another?  
> 
> Indeed, 2 VMs A and B, talking to each other, A is _outside_ the
> system (reachable via eth0), B is inside (veth1). When A moves inside
> and gets its veth. Neither B's veth1 not eth0 will change state, so
> cache wouldn't get flushed, right?

The flow tuple includes the input interface as part of the hash key,
so packets will not match the existing entries in the flowtable after
the topology update. The stale flow entries are removed after 30 seconds
if no matching packets are seen. New flow entries will be created for
the new topology: a few packets have to go through the classic
forwarding path, which creates the flow entries that represent the flow
in the new topology.
Jakub Kicinski Nov. 16, 2020, 10:28 p.m. UTC | #6
On Mon, 16 Nov 2020 23:18:15 +0100 Pablo Neira Ayuso wrote:
> On Sat, Nov 14, 2020 at 09:03:47AM -0800, Jakub Kicinski wrote:
> > On Sat, 14 Nov 2020 15:00:03 +0100 Tobias Waldekranz wrote:  
> > > On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> > > > If any of the flowtable device goes down / removed, the entries are
> > > > removed from the flowtable. This means packets of existing flows are
> > > > pushed up back to classic bridge / forwarding path to re-evaluate the
> > > > fast path.
> > > >
> > > > For each new flow, the fast path that is selected freshly, so they use
> > > > the up-to-date FDB to select a new bridge port.
> > > >
> > > > Existing flows still follow the old path. The same happens with FIB
> > > > currently.
> > > >
> > > > It should be possible to explore purging entries in the flowtable that
> > > > are stale due to changes in the topology (either in FDB or FIB).
> > > >
> > > > What scenario do you have specifically in mind? Something like VM
> > > > migrates from one bridge port to another?    
> > 
> > Indeed, 2 VMs A and B, talking to each other, A is _outside_ the
> > system (reachable via eth0), B is inside (veth1). When A moves inside
> > and gets its veth. Neither B's veth1 not eth0 will change state, so
> > cache wouldn't get flushed, right?  
> 
> The flow tuple includes the input interface as part of the hash key,
> so packets will not match the existing entries in the flowtable after
> the topology update. 

To be clear - the input interface for B -> A traffic remains B.
So if B was talking to A before A moved it will keep hitting 
the cached entry.

Are you saying A -> B traffic won't match so it will update the cache,
since conntrack flows are bi-directional?

> The stale flow entries are removed after 30 seconds
> if no matching packets are seen. New flow entries will be created for
> the new topology, a few packets have to go through the classic
> forwarding path so the new flow entries that represent the flow in the
> new topology are created.
Pablo Neira Ayuso Nov. 16, 2020, 10:31 p.m. UTC | #7
On Sat, Nov 14, 2020 at 03:00:03PM +0100, Tobias Waldekranz wrote:
> On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > If any of the flowtable device goes down / removed, the entries are
> > removed from the flowtable. This means packets of existing flows are
> > pushed up back to classic bridge / forwarding path to re-evaluate the
> > fast path.
> >
> > For each new flow, the fast path that is selected freshly, so they use
> > the up-to-date FDB to select a new bridge port.
> >
> > Existing flows still follow the old path. The same happens with FIB
> > currently.
> >
> > It should be possible to explore purging entries in the flowtable that
> > are stale due to changes in the topology (either in FDB or FIB).
> >
> > What scenario do you have specifically in mind? Something like VM
> > migrates from one bridge port to another?
> 
> This should work in the case when the bridge ports are normal NICs or
> switchdev ports, right?

Yes.

> In that case, relying on link state is brittle as you can easily have a
> switch or a media converter between the bridge and the end-station:
> 
>         br0                  br0
>         / \                  / \
>     eth0   eth1          eth0   eth1
>      /      \      =>     /      \
>   [sw0]     [sw1]      [sw0]     [sw1]
>    /          \         /          \
>   A                                 A
> 
> In a scenario like this, A has clearly moved. But neither eth0 nor eth1
> has seen any changes in link state.
> 
> This particular example is a bit contrived. But this is essentially what
> happens in redundant topologies when reconfigurations occur (e.g. STP).
> 
> These protocols will typically signal reconfigurations to all bridges
> though, so as long as the affected flows are flushed at the same time as
> the FDB it should work.

Yes, watching the FDB should allow cleaning up stale flows immediately.
Pablo Neira Ayuso Nov. 16, 2020, 10:36 p.m. UTC | #8
On Mon, Nov 16, 2020 at 02:28:44PM -0800, Jakub Kicinski wrote:
> On Mon, 16 Nov 2020 23:18:15 +0100 Pablo Neira Ayuso wrote:
> > On Sat, Nov 14, 2020 at 09:03:47AM -0800, Jakub Kicinski wrote:
> > > On Sat, 14 Nov 2020 15:00:03 +0100 Tobias Waldekranz wrote:  
> > > > On Sat, Nov 14, 2020 at 12:59, Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> > > > > If any of the flowtable device goes down / removed, the entries are
> > > > > removed from the flowtable. This means packets of existing flows are
> > > > > pushed up back to classic bridge / forwarding path to re-evaluate the
> > > > > fast path.
> > > > >
> > > > > For each new flow, the fast path that is selected freshly, so they use
> > > > > the up-to-date FDB to select a new bridge port.
> > > > >
> > > > > Existing flows still follow the old path. The same happens with FIB
> > > > > currently.
> > > > >
> > > > > It should be possible to explore purging entries in the flowtable that
> > > > > are stale due to changes in the topology (either in FDB or FIB).
> > > > >
> > > > > What scenario do you have specifically in mind? Something like VM
> > > > > migrates from one bridge port to another?    
> > > 
> > > Indeed, 2 VMs A and B, talking to each other, A is _outside_ the
> > > system (reachable via eth0), B is inside (veth1). When A moves inside
> > > and gets its veth. Neither B's veth1 not eth0 will change state, so
> > > cache wouldn't get flushed, right?  
> > 
> > The flow tuple includes the input interface as part of the hash key,
> > so packets will not match the existing entries in the flowtable after
> > the topology update. 
> 
> To be clear - the input interface for B -> A traffic remains B.
> So if B was talking to A before A moved it will keep hitting 
> the cached entry.

Yes, traffic for B -> A still hits the cached entry.

> Are you saying A -> B traffic won't match so it will update the cache,
> since conntrack flows are bi-directional?

Yes, traffic for A -> B won't match the flowtable entry, and this will
update the cache.
Jakub Kicinski Nov. 16, 2020, 10:45 p.m. UTC | #9
On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:
> > Are you saying A -> B traffic won't match so it will update the cache,
> > since conntrack flows are bi-directional?  
> 
> Yes, Traffic for A -> B won't match the flowtable entry, this will
> update the cache.

That's assuming there will be A -> B traffic without B sending a
request which reaches A, first.
Pablo Neira Ayuso Nov. 16, 2020, 10:56 p.m. UTC | #10
On Mon, Nov 16, 2020 at 02:45:21PM -0800, Jakub Kicinski wrote:
> On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:
> > > Are you saying A -> B traffic won't match so it will update the cache,
> > > since conntrack flows are bi-directional?  
> > 
> > Yes, Traffic for A -> B won't match the flowtable entry, this will
> > update the cache.
> 
> That's assuming there will be A -> B traffic without B sending a
> request which reaches A, first.

B might send packets to A, but they will not get anywhere. Assuming
TCP, this will trigger retransmissions, so B -> A will kick in to
refresh the entry.

Is the scenario you describe a showstopper?

Thank you.
Pablo Neira Ayuso Nov. 21, 2020, 12:31 p.m. UTC | #11
On Mon, Nov 16, 2020 at 11:56:58PM +0100, Pablo Neira Ayuso wrote:
> On Mon, Nov 16, 2020 at 02:45:21PM -0800, Jakub Kicinski wrote:
> > On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:
> > > > Are you saying A -> B traffic won't match so it will update the cache,
> > > > since conntrack flows are bi-directional?  
> > > 
> > > Yes, Traffic for A -> B won't match the flowtable entry, this will
> > > update the cache.
> > 
> > That's assuming there will be A -> B traffic without B sending a
> > request which reaches A, first.
> 
> B might send packets to A but this will not get anywhere. Assuming
> TCP, this will trigger retransmissions so B -> A will kick in to
> refresh the entry.
> 
> Is this scenario that you describe a showstopper?

I have been discussing tracking fdb updates on topology changes with
the bridge maintainer; I'll be exploring extensions to the existing
fdb_notify() infrastructure to deal with this scenario you describe.
On my side this topology update scenario is not a priority to be
supported in this patchset, but it's feasible to support it later on.
Jakub Kicinski Nov. 21, 2020, 6:15 p.m. UTC | #12
On Sat, 21 Nov 2020 13:31:38 +0100 Pablo Neira Ayuso wrote:
> On Mon, Nov 16, 2020 at 11:56:58PM +0100, Pablo Neira Ayuso wrote:
> > On Mon, Nov 16, 2020 at 02:45:21PM -0800, Jakub Kicinski wrote:  
> > > On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:  
> > > > > Are you saying A -> B traffic won't match so it will update the cache,
> > > > > since conntrack flows are bi-directional?    
> > > > 
> > > > Yes, Traffic for A -> B won't match the flowtable entry, this will
> > > > update the cache.  
> > > 
> > > That's assuming there will be A -> B traffic without B sending a
> > > request which reaches A, first.  
> > 
> > B might send packets to A but this will not get anywhere. Assuming
> > TCP, this will trigger retransmissions so B -> A will kick in to
> > refresh the entry.
> > 
> > Is this scenario that you describe a showstopper?  

Sorry I got distracted.
 
> I have been discussing the topology update by tracking fdb updates
> with the bridge maintainer, I'll be exploring extensions to the
> existing fdb_notify() infrastructure to deal with this scenario you
> describe. On my side this topology update scenario is not a priority
> to be supported in this patchset, but it's feasible to support it
> later on.

My concern is that invalidation is _the_ hard part of creating caches.
And I feel like merging this as is would be setting our standards pretty
low. 

Please gather some review tags from senior netdev developers. I don't
feel confident enough to apply this as 100% my own decision.
Pablo Neira Ayuso Nov. 21, 2020, 6:56 p.m. UTC | #13
On Sat, Nov 21, 2020 at 10:15:51AM -0800, Jakub Kicinski wrote:
> On Sat, 21 Nov 2020 13:31:38 +0100 Pablo Neira Ayuso wrote:
> > On Mon, Nov 16, 2020 at 11:56:58PM +0100, Pablo Neira Ayuso wrote:
> > > On Mon, Nov 16, 2020 at 02:45:21PM -0800, Jakub Kicinski wrote:  
> > > > On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:  
[...]
> > I have been discussing the topology update by tracking fdb updates
> > with the bridge maintainer, I'll be exploring extensions to the
> > existing fdb_notify() infrastructure to deal with this scenario you
> > describe. On my side this topology update scenario is not a priority
> > to be supported in this patchset, but it's feasible to support it
> > later on.
> 
> My concern is that invalidation is _the_ hard part of creating caches.
> And I feel like merging this as is would be setting our standards pretty
> low. 

Interesting, let's summarize a bit to make sure we're on the same
page:

- This "cache" is optional, you enable it on demand through ruleset.
- This "cache" is configurable, you can specify through ruleset policy
  what policies get into the cache and _when_ they are placed in the
  cache.
- This is not affecting any existing default configuration, neither
  Linux networking nor classic-path Netfilter configurations; it's a
  rather new thing.
- This is showing a performance improvement of ~50% with a very simple
  testbed. A few years ago I was reaching a x2.5 performance boost in
  software in a pktgen testbed.
- This is adding minimal changes to netdev_ops, just a single
  callback.

For the live VM migration you describe, connections might time out,
but there are many use-cases where this is still valid, some of which
have already been described here.

> Please gather some review tags from senior netdev developers. I don't
> feel confident enough to apply this as 100% my own decision.

Fair enough.

This requirement for very specific Netfilter infrastructure which does
not affect any other Networking subsystem sounds new to me.

Which senior developers specifically would you like me to poke to get
an acknowledgement so that this can be accepted?

Thank you.
Jakub Kicinski Nov. 21, 2020, 7:23 p.m. UTC | #14
On Sat, 21 Nov 2020 19:56:21 +0100 Pablo Neira Ayuso wrote:
> > Please gather some review tags from senior netdev developers. I don't
> > feel confident enough to apply this as 100% my own decision.  
> 
> Fair enough.
> 
> This requirement for very specific Netfilter infrastructure which does
> not affect any other Networking subsystem sounds new to me.

You mean me asking for reviews from other senior folks when I don't
feel good about some code? I've asked others the same thing in the
past, e.g. Paolo for his RPS thing.

> What senior developers specifically you would like I should poke to
> get an acknowledgement on this to get this accepted of your
> preference?

I don't want to make a list. Maybe netconf attendees are a safe bet?
Pablo Neira Ayuso Nov. 22, 2020, 11:33 a.m. UTC | #15
Hi Jakub,

On Sat, Nov 21, 2020 at 11:23:48AM -0800, Jakub Kicinski wrote:
> On Sat, 21 Nov 2020 19:56:21 +0100 Pablo Neira Ayuso wrote:
> > > Please gather some review tags from senior netdev developers. I don't
> > > feel confident enough to apply this as 100% my own decision.  
> > 
> > Fair enough.
> > 
> > This requirement for very specific Netfilter infrastructure which does
> > not affect any other Networking subsystem sounds new to me.
> 
> You mean me asking for reviews from other senior folks when I don't
> feel good about some code? I've asked others the same thing in the
> past, e.g. Paolo for his RPS thing.

No, I'm perfectly fine with peer review.

Note that I am sending this to net-next as a patchset (not as a PR)
_only_ because this is adding a new .ndo_fill_forward_path to
netdev_ops.

That's the only thing that is relevant to the Netdev core
infrastructure IMO, and this new .ndo is private, not exposed to
userspace.

Let's have a look at the diffstats again:

 include/linux/netdevice.h                     |  35 +++
 include/net/netfilter/nf_flow_table.h         |  43 +++-
 net/8021q/vlan_dev.c                          |  15 ++
 net/bridge/br_device.c                        |  27 +++
 net/core/dev.c                                |  46 ++++
 net/netfilter/nf_flow_table_core.c            |  51 +++--
 net/netfilter/nf_flow_table_ip.c              | 200 ++++++++++++++----
 net/netfilter/nft_flow_offload.c              | 159 +++++++++++++-
 .../selftests/netfilter/nft_flowtable.sh      |  82 +++++++
 9 files changed, 598 insertions(+), 60 deletions(-)

So this is adding _minimal_ changes to the NetDev infrastructure. Most
of the code is an extension to the flowtable Netfilter infrastructure,
and the flowtable has been a cache since its conception.

I am adding the .ndo indirection to avoid dependencies for the
Netfilter modules: Netfilter could use a direct reference to a bridge
function, but that would pull in the bridge modules.

> > What senior developers specifically you would like I should poke to
> > get an acknowledgement on this to get this accepted of your
> > preference?
> 
> I don't want to make a list. Maybe netconf attendees are a safe bet?

I have no idea who to ask; traditionally it's the NetDev maintainer
(AFAIK it's only you at this stage) that has the last word on
getting something merged.

I consider all developers that have reviewed this patchset to be
senior developers.