diff mbox series

[nf,1/3] netfilter: nft_set_pipapo: store index in scratch maps

Message ID 20240206122531.21972-2-fw@strlen.de
State Changes Requested
Headers show
Series netfilter: nft_set_pipapo: map_index must be per set | expand

Commit Message

Florian Westphal Feb. 6, 2024, 12:23 p.m. UTC
Pipapo needs a scratchpad area to keep state during matching.
This state can be large and thus cannot reside on stack.

Each set preallocates percpu areas for this.

On each match stage, one scratchpad half starts with all-zero and the other
is inited to all-ones.

At the end of each stage, the half that starts with all-ones is
always zero.  Before next field is tested, pointers to the two halves
are swapped, i.e.  resmap pointer turns into fill pointer and vice versa.

After the last field has been processed, pipapo stashes the
index toggle in a percpu variable, with assumption that next packet
will start with the all-zero half and sets all bits in the other to 1.

This isn't reliable.

There can be multiple sets and we can't be sure that the upper
and lower half of all set scratch map is always in sync (lookups
can be conditional), so one set might have swapped, but other might
not have been queried.

Thus we need to keep the index per-set-and-cpu, just like the
scratchpad.

Note that this bug fix is incomplete, there is a related issue.

avx2 and normal implementation might use slightly different areas of the
map array space due to the avx2 alignment requirements, so
m->scratch (generic/fallback implementation) and ->scratch_aligned
(avx) may partially overlap. scratch and scratch_aligned are not distinct
objects, the latter is just the aligned address of the former.

After this change, write to scratch_align->map_index may write to
scratch->map, so this issue becomes more prominent, we can set to 1
a bit in the supposedly-all-zero area of scratch->map[].

A followup patch will remove the scratch_aligned and makes generic and
avx code use the same (aligned) area.

Its done in a separate change to ease review.

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nft_set_pipapo.c      | 34 +++++++++++++++--------------
 net/netfilter/nft_set_pipapo.h      | 14 ++++++++++--
 net/netfilter/nft_set_pipapo_avx2.c | 15 ++++++-------
 3 files changed, 37 insertions(+), 26 deletions(-)

Comments

Stefano Brivio Feb. 7, 2024, 2:15 p.m. UTC | #1
On Tue,  6 Feb 2024 13:23:06 +0100
Florian Westphal <fw@strlen.de> wrote:

> Pipapo needs a scratchpad area to keep state during matching.
> This state can be large and thus cannot reside on stack.
> 
> Each set preallocates percpu areas for this.
> 
> On each match stage, one scratchpad half starts with all-zero and the other
> is inited to all-ones.
> 
> At the end of each stage, the half that starts with all-ones is
> always zero.  Before next field is tested, pointers to the two halves
> are swapped, i.e.  resmap pointer turns into fill pointer and vice versa.
> 
> After the last field has been processed, pipapo stashes the
> index toggle in a percpu variable, with assumption that next packet
> will start with the all-zero half and sets all bits in the other to 1.
> 
> This isn't reliable.

Uh oh. In hindsight, this sounds so obvious now...

> There can be multiple sets and we can't be sure that the upper
> and lower half of all set scratch map is always in sync (lookups
> can be conditional), so one set might have swapped, but other might
> not have been queried.
> 
> Thus we need to keep the index per-set-and-cpu, just like the
> scratchpad.
> 
> Note that this bug fix is incomplete, there is a related issue.
> 
> avx2 and normal implementation might use slightly different areas of the
> map array space due to the avx2 alignment requirements, so
> m->scratch (generic/fallback implementation) and ->scratch_aligned
> (avx) may partially overlap. scratch and scratch_aligned are not distinct
> objects, the latter is just the aligned address of the former.
> 
> After this change, write to scratch_align->map_index may write to
> scratch->map, so this issue becomes more prominent, we can set to 1
> a bit in the supposedly-all-zero area of scratch->map[].
> 
> A followup patch will remove the scratch_aligned and makes generic and
> avx code use the same (aligned) area.
> 
> Its done in a separate change to ease review.
> 
> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
> Signed-off-by: Florian Westphal <fw@strlen.de>

Minus one nit (not worth respinning) and one half-doubt below,

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>

...I'm still reviewing the rest.

> ---
>  net/netfilter/nft_set_pipapo.c      | 34 +++++++++++++++--------------
>  net/netfilter/nft_set_pipapo.h      | 14 ++++++++++--
>  net/netfilter/nft_set_pipapo_avx2.c | 15 ++++++-------
>  3 files changed, 37 insertions(+), 26 deletions(-)
> 
> diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> index efd523496be4..a8aa915f3f0b 100644
> --- a/net/netfilter/nft_set_pipapo.c
> +++ b/net/netfilter/nft_set_pipapo.c
> @@ -342,9 +342,6 @@
>  #include "nft_set_pipapo_avx2.h"
>  #include "nft_set_pipapo.h"
>  
> -/* Current working bitmap index, toggled between field matches */
> -static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
> -
>  /**
>   * pipapo_refill() - For each set bit, set bits from selected mapping table item
>   * @map:	Bitmap to be scanned for set bits
> @@ -412,6 +409,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
>  		       const u32 *key, const struct nft_set_ext **ext)
>  {
>  	struct nft_pipapo *priv = nft_set_priv(set);
> +	struct nft_pipapo_scratch *scratch;
>  	unsigned long *res_map, *fill_map;
>  	u8 genmask = nft_genmask_cur(net);
>  	const u8 *rp = (const u8 *)key;
> @@ -422,15 +420,17 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
>  
>  	local_bh_disable();
>  
> -	map_index = raw_cpu_read(nft_pipapo_scratch_index);
> -
>  	m = rcu_dereference(priv->match);
>  
>  	if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
>  		goto out;
>  
> -	res_map  = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0);
> -	fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max);
> +	scratch = *raw_cpu_ptr(m->scratch);
> +
> +	map_index = scratch->map_index;
> +
> +	res_map  = scratch->map + (map_index ? m->bsize_max : 0);
> +	fill_map = scratch->map + (map_index ? 0 : m->bsize_max);
>  
>  	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
>  
> @@ -460,7 +460,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
>  		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
>  				  last);
>  		if (b < 0) {
> -			raw_cpu_write(nft_pipapo_scratch_index, map_index);
> +			scratch->map_index = map_index;
>  			local_bh_enable();
>  
>  			return false;
> @@ -477,7 +477,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
>  			 * current inactive bitmap is clean and can be reused as
>  			 * *next* bitmap (not initial) for the next packet.
>  			 */
> -			raw_cpu_write(nft_pipapo_scratch_index, map_index);
> +			scratch->map_index = map_index;
>  			local_bh_enable();
>  
>  			return true;
> @@ -1121,12 +1121,12 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
>  	int i;
>  
>  	for_each_possible_cpu(i) {
> -		unsigned long *scratch;
> +		struct nft_pipapo_scratch *scratch;
>  #ifdef NFT_PIPAPO_ALIGN
> -		unsigned long *scratch_aligned;
> +		void *scratch_aligned;
>  #endif
> -
> -		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2 +
> +		scratch = kzalloc_node(struct_size(scratch, map,
> +						   bsize_max * 2) +
>  				       NFT_PIPAPO_ALIGN_HEADROOM,
>  				       GFP_KERNEL, cpu_to_node(i));
>  		if (!scratch) {
> @@ -1145,7 +1145,9 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
>  		*per_cpu_ptr(clone->scratch, i) = scratch;
>  
>  #ifdef NFT_PIPAPO_ALIGN
> -		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
> +		scratch_aligned = NFT_PIPAPO_LT_ALIGN(&scratch->map);
> +		/* Need to align ->map start, not start of structure itself */
> +		scratch_aligned -= offsetof(struct nft_pipapo_scratch, map);

This should be fine with the current version of
NFT_PIPAPO_ALIGN_HEADROOM, but it took me quite some pondering, reasoning
below if you feel like double checking.

Let's say ARCH_KMALLOC_MINALIGN is 4, NFT_PIPAPO_LT_ALIGN is 32, we
need 100 bytes for the scratch map (existing implementation), and the
address, x, we get from kzalloc_node() is k + 28, with k aligned to 32
bytes.

Then, we'll ask to allocate 32 - 4 = 28 extra bytes
(NFT_PIPAPO_ALIGN_HEADROOM), that is, 128 bytes, and we'll start using
the area at x + 4 (aligned to 32 bytes), with 124 bytes in front of us.

With this change, and the current NFT_PIPAPO_ALIGN_HEADROOM, we'll
allocate (usually) 4 bytes more, 132 bytes, and we'll start using the
area at x + 4 anyway, with 128 bytes in front of us, and we could have
asked to allocate less, which made me think for a moment that
NFT_PIPAPO_ALIGN_HEADROOM needed to be adjusted too.

However, 'scratch' at k + 28 is not the worst case: k + 32 is. At that
point, we need anyway to ask for 28 extra bytes, because 'map_index'
will force us to start from x + 32.

>  		*per_cpu_ptr(clone->scratch_aligned, i) = scratch_aligned;
>  #endif
>  	}
> @@ -2132,7 +2134,7 @@ static int nft_pipapo_init(const struct nft_set *set,
>  	m->field_count = field_count;
>  	m->bsize_max = 0;
>  
> -	m->scratch = alloc_percpu(unsigned long *);
> +	m->scratch = alloc_percpu(struct nft_pipapo_scratch *);
>  	if (!m->scratch) {
>  		err = -ENOMEM;
>  		goto out_scratch;
> @@ -2141,7 +2143,7 @@ static int nft_pipapo_init(const struct nft_set *set,
>  		*per_cpu_ptr(m->scratch, i) = NULL;
>  
>  #ifdef NFT_PIPAPO_ALIGN
> -	m->scratch_aligned = alloc_percpu(unsigned long *);
> +	m->scratch_aligned = alloc_percpu(struct nft_pipapo_scratch *);
>  	if (!m->scratch_aligned) {
>  		err = -ENOMEM;
>  		goto out_free;
> diff --git a/net/netfilter/nft_set_pipapo.h b/net/netfilter/nft_set_pipapo.h
> index 1040223da5fa..144b186c4caf 100644
> --- a/net/netfilter/nft_set_pipapo.h
> +++ b/net/netfilter/nft_set_pipapo.h
> @@ -130,6 +130,16 @@ struct nft_pipapo_field {
>  	union nft_pipapo_map_bucket *mt;
>  };
>  
> +/**
> + * struct nft_pipapo_scratch - percpu data used for lookup and matching
> + * @map_index	Current working bitmap index, toggled between field matches
> + * @map		store partial matching results during lookup
> + */
> +struct nft_pipapo_scratch {
> +	u8 map_index;
> +	unsigned long map[];
> +};
> +
>  /**
>   * struct nft_pipapo_match - Data used for lookup and matching
>   * @field_count		Amount of fields in set
> @@ -142,9 +152,9 @@ struct nft_pipapo_field {
>  struct nft_pipapo_match {
>  	int field_count;
>  #ifdef NFT_PIPAPO_ALIGN
> -	unsigned long * __percpu *scratch_aligned;
> +	struct nft_pipapo_scratch * __percpu *scratch_aligned;
>  #endif
> -	unsigned long * __percpu *scratch;
> +	struct nft_pipapo_scratch * __percpu *scratch;
>  	size_t bsize_max;
>  	struct rcu_head rcu;
>  	struct nft_pipapo_field f[] __counted_by(field_count);
> diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
> index 52e0d026d30a..8cad7b2e759d 100644
> --- a/net/netfilter/nft_set_pipapo_avx2.c
> +++ b/net/netfilter/nft_set_pipapo_avx2.c
> @@ -71,9 +71,6 @@
>  #define NFT_PIPAPO_AVX2_ZERO(reg)					\
>  	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
>  
> -/* Current working bitmap index, toggled between field matches */
> -static DEFINE_PER_CPU(bool, nft_pipapo_avx2_scratch_index);
> -
>  /**
>   * nft_pipapo_avx2_prepare() - Prepare before main algorithm body
>   *
> @@ -1120,11 +1117,12 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
>  			    const u32 *key, const struct nft_set_ext **ext)
>  {
>  	struct nft_pipapo *priv = nft_set_priv(set);
> -	unsigned long *res, *fill, *scratch;
> +	struct nft_pipapo_scratch *scratch;
>  	u8 genmask = nft_genmask_cur(net);
>  	const u8 *rp = (const u8 *)key;
>  	struct nft_pipapo_match *m;
>  	struct nft_pipapo_field *f;
> +	unsigned long *res, *fill;
>  	bool map_index;
>  	int i, ret = 0;
>  
> @@ -1146,10 +1144,11 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
>  		kernel_fpu_end();
>  		return false;
>  	}
> -	map_index = raw_cpu_read(nft_pipapo_avx2_scratch_index);
>  
> -	res  = scratch + (map_index ? m->bsize_max : 0);
> -	fill = scratch + (map_index ? 0 : m->bsize_max);
> +	map_index = scratch->map_index;
> +
> +	res = scratch->map + (map_index ? m->bsize_max : 0);
> +	fill = scratch->map + (map_index ? 0 : m->bsize_max);

Nit (if you respin for any reason): the existing version had one extra
space to highlight the symmetry between 'res' and 'fill' in the right
operand. You kept that in nft_pipapo_lookup(), but dropped it here.

>  
>  	/* Starting map doesn't need to be set for this implementation */
>  
> @@ -1221,7 +1220,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
>  
>  out:
>  	if (i % 2)
> -		raw_cpu_write(nft_pipapo_avx2_scratch_index, !map_index);
> +		scratch->map_index = !map_index;
>  	kernel_fpu_end();
>  
>  	return ret >= 0;
Florian Westphal Feb. 7, 2024, 3:23 p.m. UTC | #2
Stefano Brivio <sbrivio@redhat.com> wrote:
> > This isn't reliable.
> 
> Uh oh. In hindsight, this sounds so obvious now...

Thats a recurring theme with a lot of bugs.
So, no, it was not obvious.

> > There can be multiple sets and we can't be sure that the upper
> > and lower half of all set scratch map is always in sync (lookups
> > can be conditional), so one set might have swapped, but other might
> > not have been queried.
> > 
> > Thus we need to keep the index per-set-and-cpu, just like the
> > scratchpad.
> > 
> > Note that this bug fix is incomplete, there is a related issue.
> > 
> > avx2 and normal implementation might use slightly different areas of the
> > map array space due to the avx2 alignment requirements, so
> > m->scratch (generic/fallback implementation) and ->scratch_aligned
> > (avx) may partially overlap. scratch and scratch_aligned are not distinct
> > objects, the latter is just the aligned address of the former.
> > 
> > After this change, write to scratch_align->map_index may write to
> > scratch->map, so this issue becomes more prominent, we can set to 1
> > a bit in the supposedly-all-zero area of scratch->map[].
> > 
> > A followup patch will remove the scratch_aligned and makes generic and
> > avx code use the same (aligned) area.
> > 
> > Its done in a separate change to ease review.
> > 
> > Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> 
> Minus one nit (not worth respinning) and one half-doubt below,
> 
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> 
> ...I'm still reviewing the rest.

Thanks for reviewing.

> >  #ifdef NFT_PIPAPO_ALIGN
> > -		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
> > +		scratch_aligned = NFT_PIPAPO_LT_ALIGN(&scratch->map);
> > +		/* Need to align ->map start, not start of structure itself */
> > +		scratch_aligned -= offsetof(struct nft_pipapo_scratch, map);
> 
> This should be fine with the current version of
> NFT_PIPAPO_ALIGN_HEADROOM, but it took me quite some pondering, reasoning
> below if you feel like double checking.

Good point.

> Let's say ARCH_KMALLOC_MINALIGN is 4, NFT_PIPAPO_LT_ALIGN is 32, we
> need 100 bytes for the scratch map (existing implementation), and the
> address, x, we get from kzalloc_node() is k + 28, with k aligned to 32
> bytes.
> Then, we'll ask to allocate 32 - 4 = 28 extra bytes
> (NFT_PIPAPO_ALIGN_HEADROOM), that is, 128 bytes, and we'll start using
> the area at x + 4 (aligned to 32 bytes), with 124 bytes in front of us.
>
> With this change, and the current NFT_PIPAPO_ALIGN_HEADROOM, we'll
> allocate (usually) 4 bytes more, 132 bytes, and we'll start using the
> area at x + 4 anyway, with 128 bytes in front of us, and we could have
> asked to allocate less, which made me think for a moment that
> NFT_PIPAPO_ALIGN_HEADROOM needed to be adjusted too.

We'll allocate sizeof(long) more space (map_index), which is 4 bytes in
your example.

> However, 'scratch' at k + 28 is not the worst case: k + 32 is. At that
> point, we need anyway to ask for 28 extra bytes, because 'map_index'
> will force us to start from x + 32.

Wait.  k + 32 is really "k" for old code: slab gave us an aligned
address.

In new code, k+4 is the perfect "already-aligned" address where we would
'no-op' the address on a ARCH_KMALLOC_MINALIGN == 4 system.

Lets assume we get address k, and that address is the worst
possible aligned one (with minalign 4), so we align
(k + 4) (&scratch->map[0]), then subtract the index/struct head,
which means we store (align(k+4) - 4), which would be 28.

Worst case aligned value allocator can provide for new code
is k or k + 32 (make no difference):

Lets say address we got from allocator is 0x4:

NFT_PIPAPO_LT_ALIGN(&scratch->map); -> aligned to 32, we store 28
as start of struct, so ->map[0] is at address 32.

Lets say address we got from allocator is 0x20 (32):
NFT_PIPAPO_LT_ALIGN(&scratch->map); -> aligned to 64, we store 60
as start of struct, so ->map[0] at 64.

In both cases ALIGN() ate 28 bytes of the allocation, which we accounted
for as NFT_PIPAPO_ALIGN_HEADROOM.

Maybe thats what you were saying.  I could try to add/expand the
comments here for the alignment calculations.

> > -	res  = scratch + (map_index ? m->bsize_max : 0);
> > -	fill = scratch + (map_index ? 0 : m->bsize_max);
> > +	map_index = scratch->map_index;
> > +
> > +	res = scratch->map + (map_index ? m->bsize_max : 0);
> > +	fill = scratch->map + (map_index ? 0 : m->bsize_max);
> 
> Nit (if you respin for any reason): the existing version had one extra
> space to highlight the symmetry between 'res' and 'fill' in the right
> operand. You kept that in nft_pipapo_lookup(), but dropped it here.

Oh, indeed, i'll fix this up.
Stefano Brivio Feb. 7, 2024, 5:27 p.m. UTC | #3
On Wed, 7 Feb 2024 16:23:28 +0100
Florian Westphal <fw@strlen.de> wrote:

> Stefano Brivio <sbrivio@redhat.com> wrote:
> > > This isn't reliable.  
> > 
> > Uh oh. In hindsight, this sounds so obvious now...  
> 
> Thats a recurring theme with a lot of bugs.
> So, no, it was not obvious.
> 
> > > There can be multiple sets and we can't be sure that the upper
> > > and lower half of all set scratch map is always in sync (lookups
> > > can be conditional), so one set might have swapped, but other might
> > > not have been queried.
> > > 
> > > Thus we need to keep the index per-set-and-cpu, just like the
> > > scratchpad.
> > > 
> > > Note that this bug fix is incomplete, there is a related issue.
> > > 
> > > avx2 and normal implementation might use slightly different areas of the
> > > map array space due to the avx2 alignment requirements, so
> > > m->scratch (generic/fallback implementation) and ->scratch_aligned
> > > (avx) may partially overlap. scratch and scratch_aligned are not distinct
> > > objects, the latter is just the aligned address of the former.
> > > 
> > > After this change, write to scratch_align->map_index may write to
> > > scratch->map, so this issue becomes more prominent, we can set to 1
> > > a bit in the supposedly-all-zero area of scratch->map[].
> > > 
> > > A followup patch will remove the scratch_aligned and makes generic and
> > > avx code use the same (aligned) area.
> > > 
> > > Its done in a separate change to ease review.
> > > 
> > > Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
> > > Signed-off-by: Florian Westphal <fw@strlen.de>  
> > 
> > Minus one nit (not worth respinning) and one half-doubt below,
> > 
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > 
> > ...I'm still reviewing the rest.  
> 
> Thanks for reviewing.
> 
> > >  #ifdef NFT_PIPAPO_ALIGN
> > > -		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
> > > +		scratch_aligned = NFT_PIPAPO_LT_ALIGN(&scratch->map);
> > > +		/* Need to align ->map start, not start of structure itself */
> > > +		scratch_aligned -= offsetof(struct nft_pipapo_scratch, map);  
> > 
> > This should be fine with the current version of
> > NFT_PIPAPO_ALIGN_HEADROOM, but it took me quite some pondering, reasoning
> > below if you feel like double checking.  
> 
> Good point.
> 
> > Let's say ARCH_KMALLOC_MINALIGN is 4, NFT_PIPAPO_LT_ALIGN is 32, we
> > need 100 bytes for the scratch map (existing implementation), and the
> > address, x, we get from kzalloc_node() is k + 28, with k aligned to 32
> > bytes.
> > Then, we'll ask to allocate 32 - 4 = 28 extra bytes
> > (NFT_PIPAPO_ALIGN_HEADROOM), that is, 128 bytes, and we'll start using
> > the area at x + 4 (aligned to 32 bytes), with 124 bytes in front of us.
> >
> > With this change, and the current NFT_PIPAPO_ALIGN_HEADROOM, we'll
> > allocate (usually) 4 bytes more, 132 bytes, and we'll start using the
> > area at x + 4 anyway, with 128 bytes in front of us, and we could have
> > asked to allocate less, which made me think for a moment that
> > NFT_PIPAPO_ALIGN_HEADROOM needed to be adjusted too.  
> 
> We'll allocate sizeof(long) more space (map_index), which is 4 bytes in
> your example.

Oh, right, it could be 8 bytes too. Let's stick to 4 for simplicity.

> > However, 'scratch' at k + 28 is not the worst case: k + 32 is. At that
> > point, we need anyway to ask for 28 extra bytes, because 'map_index'
> > will force us to start from x + 32.  
> 
> Wait.  k + 32 is really "k" for old code: slab gave us an aligned
> address.

Yes, I meant that for the *new* code: k + 32 mod 32 (that is, k) is the
worst case for the new code.

> In new code, k+4 is the perfect "already-aligned" address where we would
> 'no-op' the address on a ARCH_KMALLOC_MINALIGN == 4 system.

Isn't the already aligned-address k - 4, that is, k + 28? With k + 4,
we would have &scratch->map[0] at k + 8. But anyway:

> Lets assume we get address k, and that address is the worst
> possible aligned one (with minalign 4), so we align
> (k + 4) (&scratch->map[0]), then subtract the index/struct head,
> which means we store (align(k+4) - 4), which would be 28.
> 
> Worst case aligned value allocator can provide for new code
> is k or k + 32 (make no difference):
> 
> Lets say address we got from allocator is 0x4:
> 
> NFT_PIPAPO_LT_ALIGN(&scratch->map); -> aligned to 32, we store 28
> as start of struct, so ->map[0] is at address 32.
> 
> Lets say address we got from allocator is 0x20 (32):
> NFT_PIPAPO_LT_ALIGN(&scratch->map); -> aligned to 64, we store 60
> as start of struct, so ->map[0] at 64.
> 
> In both cases ALIGN() ate 28 bytes of the allocation, which we accounted
> for as NFT_PIPAPO_ALIGN_HEADROOM.
> 
> Maybe thats what you were saying.  I could try to add/expand the
> comments here for the alignment calculations.

...yes, the rest is exactly what I meant. I'm not really satisfied of
the paragraph below but maybe something on the lines of:

	/* Align &scratch->map (not the struct itself): the extra
	 * %NFT_PIPAPO_ALIGN_HEADROOM bytes passed to kzalloc_node() above
	 * guarantee we can waste up to those bytes in order to align the map
	 * field regardless of its offset within the struct.
	 */
Florian Westphal Feb. 7, 2024, 7:24 p.m. UTC | #4
Stefano Brivio <sbrivio@redhat.com> wrote:
> > In new code, k+4 is the perfect "already-aligned" address where we would
> > 'no-op' the address on a ARCH_KMALLOC_MINALIGN == 4 system.
> 
> Isn't the already aligned-address k - 4, that is, k + 28? With k + 4,
> we would have &scratch->map[0] at k + 8. But anyway:

Yes, k-4 (k+28).

[..]

> > Maybe thats what you were saying.  I could try to add/expand the
> > comments here for the alignment calculations.
> 
> ...yes, the rest is exactly what I meant. I'm not really satisfied of
> the paragraph below but maybe something on the lines of:
> 
> 	/* Align &scratch->map (not the struct itself): the extra
> 	 * %NFT_PIPAPO_ALIGN_HEADROOM bytes passed to kzalloc_node() above
> 	 * guarantee we can waste up to those bytes in order to align the map
> 	 * field regardless of its offset within the struct.
> 	 */

Thanks, thats good, I'll use that.

I think I'll also add a BUILD_BUG_ON to assert map
offset <= NFT_PIPAPO_ALIGN_HEADROOM, just in case.
diff mbox series

Patch

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index efd523496be4..a8aa915f3f0b 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -342,9 +342,6 @@ 
 #include "nft_set_pipapo_avx2.h"
 #include "nft_set_pipapo.h"
 
-/* Current working bitmap index, toggled between field matches */
-static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
-
 /**
  * pipapo_refill() - For each set bit, set bits from selected mapping table item
  * @map:	Bitmap to be scanned for set bits
@@ -412,6 +409,7 @@  bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 		       const u32 *key, const struct nft_set_ext **ext)
 {
 	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_scratch *scratch;
 	unsigned long *res_map, *fill_map;
 	u8 genmask = nft_genmask_cur(net);
 	const u8 *rp = (const u8 *)key;
@@ -422,15 +420,17 @@  bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 
 	local_bh_disable();
 
-	map_index = raw_cpu_read(nft_pipapo_scratch_index);
-
 	m = rcu_dereference(priv->match);
 
 	if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
 		goto out;
 
-	res_map  = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0);
-	fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max);
+	scratch = *raw_cpu_ptr(m->scratch);
+
+	map_index = scratch->map_index;
+
+	res_map  = scratch->map + (map_index ? m->bsize_max : 0);
+	fill_map = scratch->map + (map_index ? 0 : m->bsize_max);
 
 	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
 
@@ -460,7 +460,7 @@  bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
 				  last);
 		if (b < 0) {
-			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			scratch->map_index = map_index;
 			local_bh_enable();
 
 			return false;
@@ -477,7 +477,7 @@  bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 			 * current inactive bitmap is clean and can be reused as
 			 * *next* bitmap (not initial) for the next packet.
 			 */
-			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			scratch->map_index = map_index;
 			local_bh_enable();
 
 			return true;
@@ -1121,12 +1121,12 @@  static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 	int i;
 
 	for_each_possible_cpu(i) {
-		unsigned long *scratch;
+		struct nft_pipapo_scratch *scratch;
 #ifdef NFT_PIPAPO_ALIGN
-		unsigned long *scratch_aligned;
+		void *scratch_aligned;
 #endif
-
-		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2 +
+		scratch = kzalloc_node(struct_size(scratch, map,
+						   bsize_max * 2) +
 				       NFT_PIPAPO_ALIGN_HEADROOM,
 				       GFP_KERNEL, cpu_to_node(i));
 		if (!scratch) {
@@ -1145,7 +1145,9 @@  static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 		*per_cpu_ptr(clone->scratch, i) = scratch;
 
 #ifdef NFT_PIPAPO_ALIGN
-		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
+		scratch_aligned = NFT_PIPAPO_LT_ALIGN(&scratch->map);
+		/* Need to align ->map start, not start of structure itself */
+		scratch_aligned -= offsetof(struct nft_pipapo_scratch, map);
 		*per_cpu_ptr(clone->scratch_aligned, i) = scratch_aligned;
 #endif
 	}
@@ -2132,7 +2134,7 @@  static int nft_pipapo_init(const struct nft_set *set,
 	m->field_count = field_count;
 	m->bsize_max = 0;
 
-	m->scratch = alloc_percpu(unsigned long *);
+	m->scratch = alloc_percpu(struct nft_pipapo_scratch *);
 	if (!m->scratch) {
 		err = -ENOMEM;
 		goto out_scratch;
@@ -2141,7 +2143,7 @@  static int nft_pipapo_init(const struct nft_set *set,
 		*per_cpu_ptr(m->scratch, i) = NULL;
 
 #ifdef NFT_PIPAPO_ALIGN
-	m->scratch_aligned = alloc_percpu(unsigned long *);
+	m->scratch_aligned = alloc_percpu(struct nft_pipapo_scratch *);
 	if (!m->scratch_aligned) {
 		err = -ENOMEM;
 		goto out_free;
diff --git a/net/netfilter/nft_set_pipapo.h b/net/netfilter/nft_set_pipapo.h
index 1040223da5fa..144b186c4caf 100644
--- a/net/netfilter/nft_set_pipapo.h
+++ b/net/netfilter/nft_set_pipapo.h
@@ -130,6 +130,16 @@  struct nft_pipapo_field {
 	union nft_pipapo_map_bucket *mt;
 };
 
+/**
+ * struct nft_pipapo_scratch - percpu data used for lookup and matching
+ * @map_index	Current working bitmap index, toggled between field matches
+ * @map		store partial matching results during lookup
+ */
+struct nft_pipapo_scratch {
+	u8 map_index;
+	unsigned long map[];
+};
+
 /**
  * struct nft_pipapo_match - Data used for lookup and matching
  * @field_count		Amount of fields in set
@@ -142,9 +152,9 @@  struct nft_pipapo_field {
 struct nft_pipapo_match {
 	int field_count;
 #ifdef NFT_PIPAPO_ALIGN
-	unsigned long * __percpu *scratch_aligned;
+	struct nft_pipapo_scratch * __percpu *scratch_aligned;
 #endif
-	unsigned long * __percpu *scratch;
+	struct nft_pipapo_scratch * __percpu *scratch;
 	size_t bsize_max;
 	struct rcu_head rcu;
 	struct nft_pipapo_field f[] __counted_by(field_count);
diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
index 52e0d026d30a..8cad7b2e759d 100644
--- a/net/netfilter/nft_set_pipapo_avx2.c
+++ b/net/netfilter/nft_set_pipapo_avx2.c
@@ -71,9 +71,6 @@ 
 #define NFT_PIPAPO_AVX2_ZERO(reg)					\
 	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
 
-/* Current working bitmap index, toggled between field matches */
-static DEFINE_PER_CPU(bool, nft_pipapo_avx2_scratch_index);
-
 /**
  * nft_pipapo_avx2_prepare() - Prepare before main algorithm body
  *
@@ -1120,11 +1117,12 @@  bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
 			    const u32 *key, const struct nft_set_ext **ext)
 {
 	struct nft_pipapo *priv = nft_set_priv(set);
-	unsigned long *res, *fill, *scratch;
+	struct nft_pipapo_scratch *scratch;
 	u8 genmask = nft_genmask_cur(net);
 	const u8 *rp = (const u8 *)key;
 	struct nft_pipapo_match *m;
 	struct nft_pipapo_field *f;
+	unsigned long *res, *fill;
 	bool map_index;
 	int i, ret = 0;
 
@@ -1146,10 +1144,11 @@  bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
 		kernel_fpu_end();
 		return false;
 	}
-	map_index = raw_cpu_read(nft_pipapo_avx2_scratch_index);
 
-	res  = scratch + (map_index ? m->bsize_max : 0);
-	fill = scratch + (map_index ? 0 : m->bsize_max);
+	map_index = scratch->map_index;
+
+	res = scratch->map + (map_index ? m->bsize_max : 0);
+	fill = scratch->map + (map_index ? 0 : m->bsize_max);
 
 	/* Starting map doesn't need to be set for this implementation */
 
@@ -1221,7 +1220,7 @@  bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
 
 out:
 	if (i % 2)
-		raw_cpu_write(nft_pipapo_avx2_scratch_index, !map_index);
+		scratch->map_index = !map_index;
 	kernel_fpu_end();
 
 	return ret >= 0;