
[nf-next,3/5] nft_set_pipapo: Prepare for vectorised implementation: alignment

Message ID: 2723f85da2cd9d6b7158c7a2514c6b22f044b1b6.1582488826.git.sbrivio@redhat.com
State: Changes Requested
Delegated to: Pablo Neira
Series: nft_set_pipapo: Performance improvements: Season 1

Commit Message

Stefano Brivio Feb. 23, 2020, 9:23 p.m. UTC
SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/netfilter/nft_set_pipapo.c | 135 +++++++++++++++++++++++++++------
 1 file changed, 110 insertions(+), 25 deletions(-)
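
The allocation pattern the patch relies on, reduced to a standalone
userspace sketch (names here are illustrative, not the patch's; the
kernel version uses PTR_ALIGN() and shaves ARCH_KMALLOC_MINALIGN off
the headroom, since kvmalloc() already guarantees that much alignment):

	#include <stdint.h>
	#include <stdlib.h>

	#define ALIGN_BYTES	32			/* e.g. AVX2 */
	#define HEADROOM	(ALIGN_BYTES - 1)	/* worst-case padding */

	struct table {
		unsigned long *mem;	/* as returned by calloc(), for free() */
		unsigned long *data;	/* aligned view, used for lookups */
	};

	static int table_alloc(struct table *t, size_t longs)
	{
		t->mem = calloc(longs * sizeof(unsigned long) + HEADROOM, 1);
		if (!t->mem)
			return -1;

		/* Round the pointer up to the next ALIGN_BYTES boundary. */
		t->data = (unsigned long *)
			  (((uintptr_t)t->mem + ALIGN_BYTES - 1) &
			   ~(uintptr_t)(ALIGN_BYTES - 1));
		return 0;
	}

	static void table_free(struct table *t)
	{
		free(t->mem);	/* free the original, unaligned pointer */
	}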

Comments

Florian Westphal Feb. 23, 2020, 10:14 p.m. UTC | #1
Stefano Brivio <sbrivio@redhat.com> wrote:
>  struct nft_pipapo_field {
> @@ -439,6 +456,9 @@ struct nft_pipapo_field {
>  	unsigned long rules;
>  	size_t bsize;
>  	int bb;
> +#ifdef NFT_PIPAPO_ALIGN
> +	unsigned long *lt_aligned;
> +#endif
>  	unsigned long *lt;
>  	union nft_pipapo_map_bucket *mt;
>  };

I wonder if these structs can be compressed.
AFAICS bsize is a size in longs, so when this number is
large we also need to kvmalloc a large blob of memory.

I think u32 would be enough?

nft_pipapo_field is probably the most relevant one wrt. size.

>  struct nft_pipapo_match {
>  	int field_count;
> +#ifdef NFT_PIPAPO_ALIGN
> +	unsigned long * __percpu *scratch_aligned;
> +#endif
>  	unsigned long * __percpu *scratch;
>  	size_t bsize_max;

Same here (bsize_max -- could fit with hole after field_count)?

Also, since you know the size of nft_pipapo_match (including the
dynamically allocated array at the end), you could store the
original memory (*scratch) and the rcu_head at the end, since
they are not needed at lookup time and a little overhead to calculate
their storage offset is fine.

Not sure it's worth it, just an idea.

Stefano Brivio Feb. 23, 2020, 11:04 p.m. UTC | #2
On Sun, 23 Feb 2020 23:14:35 +0100
Florian Westphal <fw@strlen.de> wrote:

> Stefano Brivio <sbrivio@redhat.com> wrote:
> >  struct nft_pipapo_field {
> > @@ -439,6 +456,9 @@ struct nft_pipapo_field {
> >  	unsigned long rules;
> >  	size_t bsize;
> >  	int bb;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long *lt_aligned;
> > +#endif
> >  	unsigned long *lt;
> >  	union nft_pipapo_map_bucket *mt;
> >  };  
> 
> I wonder if these structs can be compressed.
> AFAICS bsize is a size in longs, so when this number is
> large we also need to kvmalloc a large blob of memory.
> 
> I think u32 would be enough?

Hm, yes, it is. I just thought it was a "good idea" to use size_t for
"sizes", but this is so implementation-specific that u32 would make
total sense (it's enough for 32GiB), and avoid holes on 64-bit archs.
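
For reference, since bsize counts longs, the u32 bound works out to:

	2^32 longs * 8 bytes/long = 32 GiB per bucket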

> nft_pipapo_field is probably the most relevant one wrt. size.

Right. I forgot about that when I added 'bb'.

> >  struct nft_pipapo_match {
> >  	int field_count;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long * __percpu *scratch_aligned;
> > +#endif
> >  	unsigned long * __percpu *scratch;
> >  	size_t bsize_max;  
> 
> Same here (bsize_max -- could fit with hole after field_count)?

Yes, makes sense.

> Also, since you know the size of nft_pipapo_match (including the
> dynamically allocated array at the end), you could store the
> original memory (*scratch) and the rcu_head at the end, since
> they are not needed at lookup time and a little overhead to calculate
> their storage offset is fine.
> 
> Not sure it's worth it, just an idea.

'*scratch' is actually needed at lookup time for implementations that
don't need stricter alignment than the natural one, but I could probably
use some macro trickery and "move" it as needed.

I'm not sure how to deal with fields after f[0], syntactically. Do you
have some, er, pointers?

Florian Westphal Feb. 23, 2020, 11:15 p.m. UTC | #3
Stefano Brivio <sbrivio@redhat.com> wrote:
> '*scratch' is actually needed at lookup time for implementations that
> don't need stricter alignment than the natural one, but I could probably
> use some macro trickery and "move" it as needed.

Ah I see, don't bother then for now.

> I'm not sure how to deal with fields after f[0], syntactically. Do you
> have some, er, pointers?

Don't bother, it's very ugly.

Base hook chains do this; it involves a comment at the tail of the
struct (see struct nf_hook_entries in include/linux/netfilter.h).
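
For reference, the trailer pattern being pointed to, abridged (the real
struct lives in include/linux/netfilter.h):

	struct nf_hook_entries {
		u16			num_hook_entries;
		/* padding */
		struct nf_hook_entry	hooks[];

		/* trailer: pointers to the orig_ops of each hook entry,
		 * followed by an rcu_head; reached only via pointer
		 * arithmetic on num_hook_entries, never by a named member.
		 */
	};
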
Stefano Brivio March 5, 2020, 8:35 p.m. UTC | #4
Hi Florian,

On Sun, 23 Feb 2020 23:14:35 +0100
Florian Westphal <fw@strlen.de> wrote:

> Stefano Brivio <sbrivio@redhat.com> wrote:
> >  struct nft_pipapo_field {
> > @@ -439,6 +456,9 @@ struct nft_pipapo_field {
> >  	unsigned long rules;
> >  	size_t bsize;
> >  	int bb;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long *lt_aligned;
> > +#endif
> >  	unsigned long *lt;
> >  	union nft_pipapo_map_bucket *mt;
> >  };  
> 
> I wonder if these structs can be compressed.
> AFAICS bsize is a size in longs, so when this number is
> large we also need to kvmalloc a large blob of memory.
> 
> I think u32 would be enough?
> 
> nft_pipapo_field is probably the most relevant one wrt. size.

...so I tried rearranging that struct. The results (on both x86_64 and
aarch64) are rather disappointing: the hole (that we get on 64-bit
architectures) seems to actually be beneficial.

If I turn the 'unsigned long' and 'size_t' members to u32, matching
rates drop very slightly (1-3% depending on the case in the usual
kselftest).

I then tried to shrink it more aggressively ('bb' and 'groups' can be
u8, 'bsize' can probably even be u16), and there the performance hit is
much more apparent (> 10%) -- but this is something I can easily explain
with word masks and shifts.

I'm not sure exactly what happens with the pair of u32's. The assembly
looks clean. I would probably need some micro-benchmarking to
clearly relate this to execution pipeline "features" and to, perhaps,
find a way to shuffle accesses to fields to actually speed this up
while fitting two fields in the same word.

However, I'm not so sure it's worth it at this stage.
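
For context, a pahole-style view of the two layouts under discussion on
x86_64, with NFT_PIPAPO_ALIGN undefined (offsets are illustrative, not
taken from the thread):

	struct nft_pipapo_field {			/* current */
		unsigned long rules;			/*  0	8 */
		size_t bsize;				/*  8	8 */
		int bb;					/* 16	4 */
		/* XXX: 4-byte hole */
		unsigned long *lt;			/* 24	8 */
		union nft_pipapo_map_bucket *mt;	/* 32	8 */
	};						/* size: 40 */

	struct nft_pipapo_field {			/* u32 variant */
		unsigned long rules;			/*  0	8 */
		u32 bsize;				/*  8	4 */
		int bb;					/* 12	4, fills the hole */
		unsigned long *lt;			/* 16	8 */
		union nft_pipapo_map_bucket *mt;	/* 24	8 */
	};						/* size: 32 */

The 1-3% drop happens despite the smaller, hole-free layout, which is
what makes the result surprising.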

> >  struct nft_pipapo_match {
> >  	int field_count;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long * __percpu *scratch_aligned;
> > +#endif
> >  	unsigned long * __percpu *scratch;
> >  	size_t bsize_max;  
> 
> Same here (bsize_max -- could fit with hole after field_count)?
> 
> Also, since you know the size of nft_pipapo_match (including the
> dynamically allocated array at the end), you could store the
> original memory (*scratch) and the rcu_head at the end, since
> they are not needed at lookup time and a little overhead to calculate
> their storage offset is fine.
> 
> Not sure it's worth it, just an idea.

I actually bothered with this: even without the trick you explained,
struct field f[0] can safely become f[16] (NFT_REG32_COUNT), and I can
move it to the top, and then push the rcu_head down. This, again, hits
lookup rates quite badly. With f[4] lookup rates are the same as the
current case.

So, well, I wouldn't touch this either -- maybe some micro-benchmarking
might suggest a better way, but I doubt it's worth it right now.
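
Roughly, the rearrangement tried above (a sketch, not what was merged):

	struct nft_pipapo_match {
		int field_count;
		u32 bsize_max;				/* fills the hole */
		struct nft_pipapo_field f[NFT_REG32_COUNT];	/* was f[0] */
		unsigned long * __percpu *scratch;
		struct rcu_head rcu;			/* cold at lookup time */
	};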

Patch

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 63e608e73e05..eb9a3cd8a811 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -398,6 +398,22 @@ 
 #define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
 					- (1UL << NFT_PIPAPO_MAP_NBITS))
 
+/* Definitions for vectorised implementations */
+#ifdef NFT_PIPAPO_ALIGN
+#define NFT_PIPAPO_ALIGN_HEADROOM					\
+	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
+#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
+		(field)->lt = (x);					\
+	} while (0)
+#else
+#define NFT_PIPAPO_ALIGN_HEADROOM	0
+#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
+#define NFT_PIPAPO_LT_ASSIGN(field, x)	((field)->lt = (x))
+#endif /* NFT_PIPAPO_ALIGN */
+
 #define nft_pipapo_for_each_field(field, index, match)		\
 	for ((field) = (match)->f, (index) = 0;			\
 	     (index) < (match)->field_count;			\
@@ -432,6 +448,7 @@  union nft_pipapo_map_bucket {
  * @bsize:	Size of each bucket in lookup table, in longs
  * @bb:		Number of bits grouped together in lookup table buckets
  * @lt:		Lookup table: 'groups' rows of buckets
+ * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
  * @mt:		Mapping table: one bucket per rule
  */
 struct nft_pipapo_field {
@@ -439,6 +456,9 @@  struct nft_pipapo_field {
 	unsigned long rules;
 	size_t bsize;
 	int bb;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long *lt_aligned;
+#endif
 	unsigned long *lt;
 	union nft_pipapo_map_bucket *mt;
 };
@@ -447,12 +467,16 @@  struct nft_pipapo_field {
  * struct nft_pipapo_match - Data used for lookup and matching
  * @field_count		Amount of fields in set
  * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
  * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
  * @rcu			Matching data is swapped on commits
  * @f:			Fields, with lookup and mapping tables
  */
 struct nft_pipapo_match {
 	int field_count;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long * __percpu *scratch_aligned;
+#endif
 	unsigned long * __percpu *scratch;
 	size_t bsize_max;
 	struct rcu_head rcu;
@@ -729,6 +753,7 @@  static struct nft_pipapo_elem *pipapo_get(const struct net *net,
 	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
 
 	nft_pipapo_for_each_field(f, i, m) {
+		unsigned long *lt = NFT_PIPAPO_LT_ALIGN(f->lt);
 		bool last = i == m->field_count - 1;
 		int b;
 
@@ -817,6 +842,10 @@  static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 	int group, bucket;
 
 	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
+#ifdef NFT_PIPAPO_ALIGN
+	new_bucket_size = roundup(new_bucket_size,
+				  NFT_PIPAPO_ALIGN / sizeof(*new_lt));
+#endif
 
 	if (new_bucket_size == f->bsize)
 		goto mt;
@@ -827,12 +856,15 @@  static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 		copy = new_bucket_size;
 
 	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS(f->bb) *
-			  new_bucket_size * sizeof(*new_lt), GFP_KERNEL);
+			  new_bucket_size * sizeof(*new_lt) +
+			  NFT_PIPAPO_ALIGN_HEADROOM,
+			  GFP_KERNEL);
 	if (!new_lt)
 		return -ENOMEM;
 
-	new_p = new_lt;
-	old_p = old_lt;
+	new_p = NFT_PIPAPO_LT_ALIGN(new_lt);
+	old_p = NFT_PIPAPO_LT_ALIGN(old_lt);
+
 	for (group = 0; group < f->groups; group++) {
 		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS(f->bb); bucket++) {
 			memcpy(new_p, old_p, copy * sizeof(*new_p));
@@ -861,7 +893,7 @@  static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 
 	if (new_lt) {
 		f->bsize = new_bucket_size;
-		f->lt = new_lt;
+		NFT_PIPAPO_LT_ASSIGN(f, new_lt);
 		kvfree(old_lt);
 	}
 
@@ -883,7 +915,8 @@  static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
 {
 	unsigned long *pos;
 
-	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS(f->bb) * group;
+	pos = NFT_PIPAPO_LT_ALIGN(f->lt);
+	pos += f->bsize * NFT_PIPAPO_BUCKETS(f->bb) * group;
 	pos += f->bsize * v;
 
 	__set_bit(rule, pos);
@@ -1048,22 +1081,27 @@  static void pipapo_lt_bits_adjust(struct nft_pipapo_field *f)
 		return;
 	}
 
-	new_lt = kvzalloc(lt_size, GFP_KERNEL);
+	new_lt = kvzalloc(lt_size + NFT_PIPAPO_ALIGN_HEADROOM, GFP_KERNEL);
 	if (!new_lt)
 		return;
 
 	NFT_PIPAPO_GROUP_BITS_ARE_8_OR_4;
-	if (f->bb == 4 && bb == 8)
-		pipapo_lt_4b_to_8b(f->groups, f->bsize, f->lt, new_lt);
-	else if (f->bb == 8 && bb == 4)
-		pipapo_lt_8b_to_4b(f->groups, f->bsize, f->lt, new_lt);
-	else
+	if (f->bb == 4 && bb == 8) {
+		pipapo_lt_4b_to_8b(f->groups, f->bsize,
+				   NFT_PIPAPO_LT_ALIGN(f->lt),
+				   NFT_PIPAPO_LT_ALIGN(new_lt));
+	} else if (f->bb == 8 && bb == 4) {
+		pipapo_lt_8b_to_4b(f->groups, f->bsize,
+				   NFT_PIPAPO_LT_ALIGN(f->lt),
+				   NFT_PIPAPO_LT_ALIGN(new_lt));
+	} else {
 		BUG();
+	}
 
 	f->groups = groups;
 	f->bb = bb;
 	kvfree(f->lt);
-	f->lt = new_lt;
+	NFT_PIPAPO_LT_ASSIGN(f, new_lt);
 }
 
 /**
@@ -1289,8 +1327,12 @@  static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 
 	for_each_possible_cpu(i) {
 		unsigned long *scratch;
+#ifdef NFT_PIPAPO_ALIGN
+		unsigned long *scratch_aligned;
+#endif
 
-		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
+		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2 +
+				       NFT_PIPAPO_ALIGN_HEADROOM,
 				       GFP_KERNEL, cpu_to_node(i));
 		if (!scratch) {
 			/* On failure, there's no need to undo previous
@@ -1306,6 +1348,11 @@  static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 		kfree(*per_cpu_ptr(clone->scratch, i));
 
 		*per_cpu_ptr(clone->scratch, i) = scratch;
+
+#ifdef NFT_PIPAPO_ALIGN
+		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
+		*per_cpu_ptr(clone->scratch_aligned, i) = scratch_aligned;
+#endif
 	}
 
 	return 0;
@@ -1433,21 +1480,33 @@  static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 	if (!new->scratch)
 		goto out_scratch;
 
+#ifdef NFT_PIPAPO_ALIGN
+	new->scratch_aligned = alloc_percpu(*new->scratch_aligned);
+	if (!new->scratch_aligned)
+		goto out_scratch;
+#endif
+
 	rcu_head_init(&new->rcu);
 
 	src = old->f;
 	dst = new->f;
 
 	for (i = 0; i < old->field_count; i++) {
+		unsigned long *new_lt;
+
 		memcpy(dst, src, offsetof(struct nft_pipapo_field, lt));
 
-		dst->lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS(src->bb) *
-				   src->bsize * sizeof(*dst->lt),
-				   GFP_KERNEL);
-		if (!dst->lt)
+		new_lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS(src->bb) *
+				  src->bsize * sizeof(*dst->lt) +
+				  NFT_PIPAPO_ALIGN_HEADROOM,
+				  GFP_KERNEL);
+		if (!new_lt)
 			goto out_lt;
 
-		memcpy(dst->lt, src->lt,
+		NFT_PIPAPO_LT_ASSIGN(dst, new_lt);
+
+		memcpy(NFT_PIPAPO_LT_ALIGN(new_lt),
+		       NFT_PIPAPO_LT_ALIGN(src->lt),
 		       src->bsize * sizeof(*dst->lt) *
 		       src->groups * NFT_PIPAPO_BUCKETS(src->bb));
 
@@ -1470,8 +1529,11 @@  static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 		kvfree(dst->lt);
 		dst--;
 	}
-	free_percpu(new->scratch);
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(new->scratch_aligned);
+#endif
 out_scratch:
+	free_percpu(new->scratch);
 	kfree(new);
 
 	return ERR_PTR(-ENOMEM);
@@ -1627,7 +1689,8 @@  static void pipapo_drop(struct nft_pipapo_match *m,
 			unsigned long *pos;
 			int b;
 
-			pos = f->lt + g * NFT_PIPAPO_BUCKETS(f->bb) * f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) + g *
+			      NFT_PIPAPO_BUCKETS(f->bb) * f->bsize;
 
 			for (b = 0; b < NFT_PIPAPO_BUCKETS(f->bb); b++) {
 				bitmap_cut(pos, pos, rulemap[i].to,
@@ -1733,6 +1796,9 @@  static void pipapo_reclaim_match(struct rcu_head *rcu)
 	for_each_possible_cpu(i)
 		kfree(*per_cpu_ptr(m->scratch, i));
 
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
 
 	pipapo_free_fields(m);
@@ -1936,8 +2002,8 @@  static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule,
 		for (b = 0; b < NFT_PIPAPO_BUCKETS(f->bb); b++) {
 			unsigned long *pos;
 
-			pos = f->lt + (g * NFT_PIPAPO_BUCKETS(f->bb) + b) *
-				      f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) +
+			      (g * NFT_PIPAPO_BUCKETS(f->bb) + b) * f->bsize;
 			if (test_bit(first_rule, pos) && x0 == -1)
 				x0 = b;
 			if (test_bit(first_rule + rule_count - 1, pos))
@@ -2218,11 +2284,21 @@  static int nft_pipapo_init(const struct nft_set *set,
 	m->scratch = alloc_percpu(unsigned long *);
 	if (!m->scratch) {
 		err = -ENOMEM;
-		goto out_free;
+		goto out_scratch;
 	}
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(m->scratch, i) = NULL;
 
+#ifdef NFT_PIPAPO_ALIGN
+	m->scratch_aligned = alloc_percpu(unsigned long *);
+	if (!m->scratch_aligned) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(m->scratch_aligned, i) = NULL;
+#endif
+
 	rcu_head_init(&m->rcu);
 
 	nft_pipapo_for_each_field(f, i, m) {
@@ -2233,7 +2309,7 @@  static int nft_pipapo_init(const struct nft_set *set,
 
 		f->bsize = 0;
 		f->rules = 0;
-		f->lt = NULL;
+		NFT_PIPAPO_LT_ASSIGN(f, NULL);
 		f->mt = NULL;
 	}
 
@@ -2251,7 +2327,11 @@  static int nft_pipapo_init(const struct nft_set *set,
 	return 0;
 
 out_free:
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
+out_scratch:
 	kfree(m);
 
 	return err;
@@ -2286,16 +2366,21 @@  static void nft_pipapo_destroy(const struct nft_set *set)
 			nft_set_elem_destroy(set, e, true);
 		}
 
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(m->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(m->scratch, cpu));
 		free_percpu(m->scratch);
-
 		pipapo_free_fields(m);
 		kfree(m);
 		priv->match = NULL;
 	}
 
 	if (priv->clone) {
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(priv->clone->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(priv->clone->scratch, cpu));
 		free_percpu(priv->clone->scratch);