[nf-next,v2,0/8] nftables: Set implementation for arbitrary concatenation of ranges

Message ID	cover.1574428269.git.sbrivio@redhat.com
Headers	show Return-Path: <netfilter-devel-owner@vger.kernel.org> From: Stefano Brivio <sbrivio@redhat.com> To: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org Cc: Florian Westphal <fw@strlen.de>, =?utf-8?q?Kadlecsik_J=C3=B3zsef?= <kadlec@blackhole.kfki.hu>, Eric Garver <eric@garver.life>, Phil Sutter <phil@nwl.cc> Subject: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Date: Fri, 22 Nov 2019 14:39:59 +0100 Message-Id: <cover.1574428269.git.sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Sender: netfilter-devel-owner@vger.kernel.org Precedence: bulk
Series	nftables: Set implementation for arbitrary concatenation of ranges \| expand [nf-next,v2,0/8] nftables: Set implementation for arbitrary concatenation of ranges [nf-next,v2,1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields [nf-next,v2,2/8] bitmap: Introduce bitmap_cut(): cut bits and shift remaining [nf-next,v2,3/8] nf_tables: Add set type for arbitrary concatenation of ranges [nf-next,v2,4/8] selftests: netfilter: Introduce tests for sets with range concatenation [nf-next,v2,5/8] nft_set_pipapo: Provide unrolled lookup loops for common field sizes [nf-next,v2,6/8] nft_set_pipapo: Prepare for vectorised implementation: alignment [nf-next,v2,7/8] nft_set_pipapo: Prepare for vectorised implementation: helpers [nf-next,v2,8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation

Message ID

cover.1574428269.git.sbrivio@redhat.com

Headers

From: Stefano Brivio <sbrivio@redhat.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org
Cc: Florian Westphal <fw@strlen.de>, =?utf-8?q?Kadlecsik_J=C3=B3zsef?=
	<kadlec@blackhole.kfki.hu>, Eric Garver <eric@garver.life>,
	Phil Sutter <phil@nwl.cc>
Subject: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary
	concatenation of ranges
Date: Fri, 22 Nov 2019 14:39:59 +0100
Message-Id: <cover.1574428269.git.sbrivio@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Sender: netfilter-devel-owner@vger.kernel.org
Precedence: bulk

Series

nftables: Set implementation for arbitrary concatenation of ranges | expand

Message

Stefano Brivio Nov. 22, 2019, 1:39 p.m. UTC

Existing nftables set implementations allow matching entries with
interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, entries
specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22,
but not both.

In other words, none of the set types allows matching on range
expressions for more than one packet field at a time, such as ipset
does with types bitmap:ip,mac, and, to a more limited extent
(netmasks, not arbitrary ranges), with types hash:net,net,
hash:net,port, hash:ip,port,net, and hash:net,port,net.

As a pure hash-based approach is unsuitable for matching on ranges,
and "proxying" the existing red-black tree type looks impractical as
elements would need to be shared and managed across all employed
trees, this new set implementation intends to fill the functionality
gap by employing a relatively novel approach.

The fundamental idea, illustrated in deeper detail in patch 3/8, is to
use lookup tables classifying a small number of grouped bits from each
field, and map the lookup results in a way that yields a verdict for
the full set of specified fields.

The grouping bit aspect is loosely inspired by the Grouper algorithm,
by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 3/8 for the full
reference).

A reference, stand-alone implementation of the algorithm itself is
available at:
	https://pipapo.lameexcu.se

Some notes about possible future optimisations are also mentioned
there. This algorithm reduces the matching problem to, essentially,
a repetitive sequence of simple bitwise operations, and is
particularly suitable to be optimised by leveraging SIMD instruction
sets. An AVX2-based implementation is also presented in this series.

I plan to post the adaptation of the existing AVX2 vectorised
implementation for (at least) NEON at a later time.

Patch 1/8 implements the needed UAPI bits: additions to the existing
interface are kept to a minimum by recycling existing concepts for
both ranging and concatenation, as suggested by Florian.

Patch 2/8 adds a new bitmap operation that copies the source bitmap
onto the destination while removing a given region, and is needed to
delete regions of arrays mapping between lookup tables.

Patch 3/8 is the actual set implementation.

Patch 4/8 introduces selftests for the new implementation.

Patch 5/8 provides an easy optimisation with substantial gain on
matching rates.

Patches 6/8 and 7/8 are preparatory work to add an alternative,
vectorised lookup implementation.

Patch 8/8 contains the AVX2-based implementation of the lookup
routines.

The nftables and libnftnl counterparts depend on changes to the UAPI
header file included in patch 1/8.

Credits go to Jay Ligatti, Josh Kuhn, and Chris Gage for their
original Grouper implementation and article from ICCCN proceedings
(see reference in patch 3/8), and to Daniel Lemire for his public
domain implementation of a fast iterator on set bits using built-in
implementations of the CTZL operation, also included in patch 3/8.

Special thanks go to Florian Westphal for all the nftables consulting
and the original interface idea, to Sabrina Dubroca for support with
RCU and bit manipulation topics, to Eric Garver for an early review,
and to Phil Sutter for reaffirming the need for the use case covered
here.

v2: changes listed in messages for 3/8 and 8/8

Stefano Brivio (8):
  netfilter: nf_tables: Support for subkeys, set with multiple ranged
    fields
  bitmap: Introduce bitmap_cut(): cut bits and shift remaining
  nf_tables: Add set type for arbitrary concatenation of ranges
  selftests: netfilter: Introduce tests for sets with range
    concatenation
  nft_set_pipapo: Provide unrolled lookup loops for common field sizes
  nft_set_pipapo: Prepare for vectorised implementation: alignment
  nft_set_pipapo: Prepare for vectorised implementation: helpers
  nft_set_pipapo: Introduce AVX2-based lookup implementation

 include/linux/bitmap.h                        |    4 +
 include/net/netfilter/nf_tables_core.h        |    2 +
 include/uapi/linux/netfilter/nf_tables.h      |   16 +
 lib/bitmap.c                                  |   66 +
 net/netfilter/Makefile                        |    6 +-
 net/netfilter/nf_tables_api.c                 |    4 +-
 net/netfilter/nf_tables_set_core.c            |    8 +
 net/netfilter/nft_set_pipapo.c                | 2165 +++++++++++++++++
 net/netfilter/nft_set_pipapo.h                |  236 ++
 net/netfilter/nft_set_pipapo_avx2.c           |  838 +++++++
 net/netfilter/nft_set_pipapo_avx2.h           |   14 +
 tools/testing/selftests/netfilter/Makefile    |    3 +-
 .../selftests/netfilter/nft_concat_range.sh   | 1481 +++++++++++
 13 files changed, 4839 insertions(+), 4 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.c
 create mode 100644 net/netfilter/nft_set_pipapo.h
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h
 create mode 100755 tools/testing/selftests/netfilter/nft_concat_range.sh

Comments

Pablo Neira Ayuso Nov. 23, 2019, 8:05 p.m. UTC | #1

On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
[...]
> Patch 1/8 implements the needed UAPI bits: additions to the existing
> interface are kept to a minimum by recycling existing concepts for
> both ranging and concatenation, as suggested by Florian.
> 
> Patch 2/8 adds a new bitmap operation that copies the source bitmap
> onto the destination while removing a given region, and is needed to
> delete regions of arrays mapping between lookup tables.
> 
> Patch 3/8 is the actual set implementation.
> 
> Patch 4/8 introduces selftests for the new implementation.
[...]

After talking to Florian, I'm inclined to merge upstream up to patch
4/8 in this merge window, once the UAPI discussion is sorted out.

Thanks.

Stefano Brivio Nov. 25, 2019, 9:31 a.m. UTC | #2

On Sat, 23 Nov 2019 21:05:18 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
> [...]
> > Patch 1/8 implements the needed UAPI bits: additions to the existing
> > interface are kept to a minimum by recycling existing concepts for
> > both ranging and concatenation, as suggested by Florian.
> > 
> > Patch 2/8 adds a new bitmap operation that copies the source bitmap
> > onto the destination while removing a given region, and is needed to
> > delete regions of arrays mapping between lookup tables.
> > 
> > Patch 3/8 is the actual set implementation.
> > 
> > Patch 4/8 introduces selftests for the new implementation.  
> [...]
> 
> After talking to Florian, I'm inclined to merge upstream up to patch
> 4/8 in this merge window, once the UAPI discussion is sorted out.

Thanks for the update. Let me know if there's some specific topic or
concern I can start addressing for patches 5/8 to 8/8.

Pablo Neira Ayuso Nov. 25, 2019, 10:02 a.m. UTC | #3

On Mon, Nov 25, 2019 at 10:31:06AM +0100, Stefano Brivio wrote:
> On Sat, 23 Nov 2019 21:05:18 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
> > [...]
> > > Patch 1/8 implements the needed UAPI bits: additions to the existing
> > > interface are kept to a minimum by recycling existing concepts for
> > > both ranging and concatenation, as suggested by Florian.
> > > 
> > > Patch 2/8 adds a new bitmap operation that copies the source bitmap
> > > onto the destination while removing a given region, and is needed to
> > > delete regions of arrays mapping between lookup tables.
> > > 
> > > Patch 3/8 is the actual set implementation.
> > > 
> > > Patch 4/8 introduces selftests for the new implementation.  
> > [...]
> > 
> > After talking to Florian, I'm inclined to merge upstream up to patch
> > 4/8 in this merge window, once the UAPI discussion is sorted out.
> 
> Thanks for the update. Let me know if there's some specific topic or
> concern I can start addressing for patches 5/8 to 8/8.

Merge window is now closed, I was trying to get the bare minimum in
this round. Now we have a bit more time to merge this upstream.

BTW, do you have numbers comparing the AVX2 version with the C code? I
quickly had a look at your numbers, but not clear to me if this is
compared there.

Thanks.

Stefano Brivio Nov. 25, 2019, 1:36 p.m. UTC | #4

On Mon, 25 Nov 2019 11:02:14 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> BTW, do you have numbers comparing the AVX2 version with the C code? I
> quickly had a look at your numbers, but not clear to me if this is
> compared there.

No, sorry, I didn't report that anywhere, I probably should have in
the commit messages for 4/8 and 5/8. This was from v1 at 4/8, single
thread on AMD Epyc 7351, C implementation without unrolled loops:

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):               9971887pps
    baseline hash (non-ranged entries):             5991032pps
    baseline rbtree (match on first field only):    2666255pps
    set with  1000 full, ranged entries:            2220404pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10004499pps
    baseline hash (non-ranged entries):             6011221pps
    baseline rbtree (match on first field only):    4035566pps
    set with   100 full, ranged entries:            4018240pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9497500pps
    baseline hash (non-ranged entries):             4685436pps
    baseline rbtree (match on first field only):    1354978pps
    set with  1000 full, ranged entries:            1052188pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10749256pps
    baseline hash (non-ranged entries):             6774103pps
    baseline rbtree (match on first field only):    2819211pps
    set with 30000 full, ranged entries:             283492pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9463935pps
    baseline hash (non-ranged entries):             3777039pps
    baseline rbtree (match on first field only):    2943527pps
    set with    10 full, ranged entries:            1927899pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9502200pps
    baseline hash (non-ranged entries):             3637739pps
    baseline rbtree (match on first field only):    1342323pps
    set with  1000 full, ranged entries:             753960pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10065715pps
    baseline hash (non-ranged entries):             5082895pps
    baseline rbtree (match on first field only):    2677391pps
    set with  1000 full, ranged entries:            1215104pps

I would re-run tests on v3 patches and include the comparisons in
commit messages. 

By the way, as you can see, even though the comparison with rbtree is
unfair (comparing > 1 fields adds substantial complexity), without AVX2
it doesn't scale as nicely. I plan to propose some optimisations that
should substantially improve the non-vectorised case, but what I have
in mind right now is a bit convoluted and I would skip it in this
initial submission.