diff mbox

[v1,1/14] rhashtable: Remove shift from bucket_table

Message ID E1YX61x-0005m7-GV@gondolin.me.apana.org.au
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Herbert Xu March 15, 2015, 10:44 a.m. UTC
Keeping both size and shift is silly.  We only need one.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 --
 lib/rhashtable.c           |    5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Laight March 17, 2015, 10:51 a.m. UTC | #1
From: Herbert Xu
> Sent: 15 March 2015 10:44
> Keeping both size and shift is silly.  We only need one.
...
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -162,7 +162,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
>  		return NULL;
> 
>  	tbl->size = nbuckets;
> -	tbl->shift = ilog2(nbuckets);
> 
>  	if (alloc_bucket_locks(ht, tbl) < 0) {
>  		bucket_table_free(tbl);
> @@ -189,7 +188,7 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
>  {
>  	/* Expand table when exceeding 75% load */
>  	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
> -	       (!ht->p.max_shift || tbl->shift < ht->p.max_shift);
> +	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));

Looks like you could pre-calculate the 'grow_at' size.
The test above would then be:
	return atomic_read(&ht->nelems) > tbl->grow_at_size;

Similarly for the shrink.

	David

Thomas Graf March 17, 2015, 10:56 a.m. UTC | #2
On 03/17/15 at 10:51am, David Laight wrote:
> Looks like you could pre-calculate the 'grow_at' size.
> The test above would then be:
> 	return atomic_read(&ht->nelems) > tbl->grow_at_size;
> 
> Similarly for the shrink.

Given the discussions, the grow decision will likely change to
a max bucket length limit anyway.
Herbert Xu March 17, 2015, 11 a.m. UTC | #3
On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
>
> Given the discussions, the grow decision will likely change to
> a max bucket length limit anyway.

Actually no.  In my patches the chain length is only used to
force an immediate rehash.  Growing is still based on the number
of elements.

The reason is that the maximum (not average) chain length actually
grows with the hash table size, even at 75% utilisation.

So unless we want ever decreasing utilisation as the table grows,
we have to cope with a few chains with many elements.  The limit
I'm currently using is 16 which shouldn't be hit even if you had
a 2^32 table (which you currently cannot).

Cheers,
Thomas Graf March 17, 2015, 11:22 a.m. UTC | #4
On 03/17/15 at 10:00pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
> >
> > Given the discussions, the grow decision will likely change to
> > a max bucket length limit anyway.
> 
> Actually no.  In my patches the chain length is only used to
> force an immediate rehash.  Growing is still based on the number
> of elements.
> 
> The reason is that the maximum (not average) chain length actually
> grows with the hash table size, even at 75% utilisation.
> 
> So unless we want ever decreasing utilisation as the table grows,
> we have to cope with a few chains with many elements.  The limit
> I'm currently using is 16 which shouldn't be hit even if you had
> a 2^32 table (which you currently cannot).

Not sure I understand the downside of a bucket length based grow
decision with optional forced rehashing plus synchronous table
realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
we consider deterministic lookup and insert behaviour more important
than overall table utilization? Given the rehashing, the grow decision
should not be attackable.
David Laight March 17, 2015, 11:22 a.m. UTC | #5
From: Herbert Xu
> Sent: 17 March 2015 11:01
> On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
> >
> > Given the discussions, the grow decision will likely change to
> > a max bucket length limit anyway.
> 
> Actually no.  In my patches the chain length is only used to
> force an immediate rehash.  Growing is still based on the number
> of elements.
> 
> The reason is that the maximum (not average) chain length actually
> grows with the hash table size, even at 75% utilisation.

That doesn't surprise me.

But won't the rehashed table be just as likely to have a long list?
So you are likely to get an immediate rehash?

	David

Herbert Xu March 17, 2015, 11:25 a.m. UTC | #6
On Tue, Mar 17, 2015 at 11:22:44AM +0000, David Laight wrote:
>
> But won't the rehashed table be just as likely to have a long list?
> So you are likely to get an immediate rehash?

Not if the threshold is set to a value that should never be hit
at 100% utilisation.

Cheers,
Herbert Xu March 17, 2015, 11:27 a.m. UTC | #7
On Tue, Mar 17, 2015 at 11:22:03AM +0000, tgraf@suug.ch wrote:
>
> Not sure I understand the downside of a bucket length based grow
> decision with optional forced rehashing plus synchronous table
> realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
> we consider deterministic lookup and insert behaviour more important
> than overall table utilization? Given the rehashing, the grow decision
> should not be attackable.

Do you really want to double the table size when 0.1% of the buckets
have a chain length > 4 but still < 16?

Cheers,
Thomas Graf March 17, 2015, 11:57 a.m. UTC | #8
On 03/17/15 at 10:27pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 11:22:03AM +0000, tgraf@suug.ch wrote:
> >
> > Not sure I understand the downside of a bucket length based grow
> > decision with optional forced rehashing plus synchronous table
> > realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
> > we consider deterministic lookup and insert behaviour more important
> > than overall table utilization? Given the rehashing, the grow decision
> > should not be attackable.
> 
> Do you really want to double the table size when 0.1% of the buckets
> have a chain length > 4 but still < 16?

If we constantly hit that bucket because we are handling just a few
TCP flows it would be worth doubling the size & rehashing to avoid the
additional cache misses of the linked list.

Although a limit of 4 would be too high. Ideally we should resize and
rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
It seems likely though that we'll always have a bucket with >=2
entries, so we would end up constantly doubling and rehashing. This is
the only thing that speaks for a table-wide nelems counter in my
opinion.

Another option would be to continue resizing & rehashing as long as we
have at least one bucket with >= 4 entries, but allow a table-size
dependent number of (length > 1 && length < 4) buckets. This may be
overengineered again though ;-)
David Laight March 17, 2015, 12:13 p.m. UTC | #9
From: Thomas Graf 
> Sent: 17 March 2015 11:58
...
> > Do you really want to double the table size when 0.1% of the buckets
> > have a chain length > 4 but still < 16?
> 
> If we constantly hit that bucket because we are handling just a few
> TCP flows it would be worth doubling the size & rehashing to avoid the
> additional cache misses of the linked list.
> 
> Although a limit of 4 would be too high. Ideally we should resize and
> rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> It seems likely though that we'll always have a bucket with >=2
> entries, so we would end up constantly doubling and rehashing. This is
> the only thing that speaks for a table-wide nelems counter in my
> opinion.

I think you are seriously overestimating the 'efficiency' of the hash function.
And not doing the 'birthday paradox' maths at all.

The only way you'll get a 'hash' that good is if you can pick the input
value in order to generate a perfect hash.
However you aren't going to manage that for inbound TCP connections since
none of the inputs to the hash can be chosen by the listening system
(unless IPv6 has something that can help you).

You may have to live with a few % of the items being on long chains.
Maybe count the number of items on chains longer than (say) 4 and
rehash or increase the table size if this exceeds a few % of the
table size.
(Or count the number of items that are further than 4 from the start
of the hash chain.)

	David

Thomas Graf March 17, 2015, 12:18 p.m. UTC | #10
On 03/17/15 at 12:13pm, David Laight wrote:
> From: Thomas Graf 
> > Sent: 17 March 2015 11:58
> ...
> > > Do you really want to double the table size when 0.1% of the buckets
> > > have a chain length > 4 but still < 16?
> > 
> > If we constantly hit that bucket because we are handling just a few
> > TCP flows it would be worth doubling the size & rehashing to avoid the
> > additional cache misses of the linked list.
> > 
> > Although a limit of 4 would be too high. Ideally we should resize and
> > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > It seems likely though that we'll always have a bucket with >=2
> > entries, so we would end up constantly doubling and rehashing. This is
> > the only thing that speaks for a table-wide nelems counter in my
> > opinion.
> 
> I think you are seriously overestimating the 'efficiency' of the hash function.
> And not doing the 'birthday paradox' maths at all.
> 
> The only way you'll get a 'hash' that good is if you can pick the input
> value in order to generate a perfect hash.
> However you aren't going to manage that for inbound TCP connections since
> none of the inputs to the hash can be chosen by the listening system
> (unless IPv6 has something that can help you).
> 
> You may have to live with a few % of the items being on long chains.
> Maybe count the number of items on chains longer than (say) 4 and
> rehash or increase the table size if this exceeds a few % of the
> table size.
> (Or count the number of items that are further than 4 from the start
> of the hash chain.)

Did you read my email all the way?  You basically omitted to quote
the 2nd half of my email and then rephrased it.
Herbert Xu March 17, 2015, 12:20 p.m. UTC | #11
On Tue, Mar 17, 2015 at 12:13:42PM +0000, David Laight wrote:
> From: Thomas Graf 
> > Sent: 17 March 2015 11:58
> ...
> > > Do you really want to double the table size when 0.1% of the buckets
> > > have a chain length > 4 but still < 16?
> > 
> > If we constantly hit that bucket because we are handling just a few
> > TCP flows it would be worth doubling the size & rehashing to avoid the
> > additional cache misses of the linked list.
> > 
> > Although a limit of 4 would be too high. Ideally we should resize and
> > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > It seems likely though that we'll always have a bucket with >=2
> > entries, so we would end up constantly doubling and rehashing. This is
> > the only thing that speaks for a table-wide nelems counter in my
> > opinion.
> 
> I think you are seriously overestimating the 'efficiency' of the hash function.
> And not doing the 'birthday paradox' maths at all.

Agreed.  Thomas, an easy test to do is to pump the integers from
0 to 65535 into jhash, then mask it with 65535 and run sort |
uniq -c | sort -n on it, I think you'll see what we're talking
about.

Cheers,
Thomas Graf March 17, 2015, 12:40 p.m. UTC | #12
On 03/17/15 at 11:20pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 12:13:42PM +0000, David Laight wrote:
> > From: Thomas Graf 
> > > Sent: 17 March 2015 11:58
> > ...
> > > > Do you really want to double the table size when 0.1% of the buckets
> > > > have a chain length > 4 but still < 16?
> > > 
> > > If we constantly hit that bucket because we are handling just a few
> > > TCP flows it would be worth doubling the size & rehashing to avoid the
> > > additional cache misses of the linked list.
> > > 
> > > Although a limit of 4 would be too high. Ideally we should resize and
> > > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > > It seems likely though that we'll always have a bucket with >=2
> > > entries, so we would end up constantly doubling and rehashing. This is
> > > the only thing that speaks for a table-wide nelems counter in my
> > > opinion.
> > 
> > I think you are seriously overestimating the 'efficiency' of the hash function.
> > And not doing the 'birthday paradox' maths at all.
> 
> Agreed.  Thomas, an easy test to do is to pump the integers from
> 0 to 65535 into jhash, then mask it with 65535 and run sort |
> uniq -c | sort -n on it, I think you'll see what we're talking
> about.

I'm not claiming perfect hash functions and this is exactly why I
think average utilization is not an optimal growth criterion because
it gives a very limited view into the actual chain lengths.

What you describe above is a 100% utilization scenario. Initially
we talked about 0.1% utilization and whether to resize & rehash if a
single chain has length > 4. My answer is: yes we should resize &
rehash or at least rehash in that case.

My point here is that a chain length of 4 may be a serious
performance bottleneck already and that it might be worth trying
to detect bad hashing distribution and attempt to fix it at an
earlier stage while ruling out the possibility of endless rehashes.
David Laight March 17, 2015, 1:06 p.m. UTC | #13
From: Thomas Graf
> Sent: 17 March 2015 12:40
...
> I'm not claiming perfect hash functions and this is exactly why I
> think average utilization is not an optimal growth criteria because
> it gives very limited view into the actual chain lengths.

If you assume that all the entries are being looked up equally
often the interesting number ought to be the number of failed compares.
So you want to sum 'length * (length - 1)/2' and compare against the
total number of items (or the table size).

> What you describe above is a 100% utilization scenario. Initially
> we talked about 0.1% utilization and whether to resize & rehash if a
> single chain has length > 4. My answer is: yes we should resize &
> rehash or at least rehash in that case.

0.1% utilisation is about as realistic as 100%.
50-75% is probably a reasonable limit.
This will give some short chains.

> My point here is that a chain length of 4 may be a serious
> performance bottleneck already and that it might be worth trying
> to detect bad hashing distribution and attempt to fix it at an
> earlier stage while ruling out the possibility of endless rehashes.

If you have 4 items and they are all on one chain then I suspect that
nothing you do will actually split them.

OTOH if you have 1000 items and one chain of length 4 it really doesn't
matter. Unless, of course, someone arranges to flood the last item with
traffic - in which case you are going to lose anyway.

	David

Herbert Xu March 17, 2015, 9:56 p.m. UTC | #14
On Tue, Mar 17, 2015 at 12:40:12PM +0000, 'tgraf@suug.ch' wrote:
>
> > Agreed.  Thomas, an easy test to do is to pump the integers from
> > 0 to 65535 into jhash, then mask it with 65535 and run sort |
> > uniq -c | sort -n on it, I think you'll see what we're talking
> > about.
> 
> I'm not claiming perfect hash functions and this is exactly why I
> think average utilization is not an optimal growth criterion because
> it gives a very limited view into the actual chain lengths.
> 
> What you describe above is a 100% utilization scenario. Initially

Actually 75% is just as bad.  To test that just repeat my script
above and add head -n $((65536 / 4 * 3)) before the first sort.

My point is that to get below anything like a chain length limit
of 2 (or even 4), your hash table is going to be mostly empty.
OK 0.1% might be an exaggeration but it's certainly nowhere near 50%
as your hash table size grows.

Please actually try out my test, here is a C program that will help
you do it:

#include <stdio.h>
#include <stdlib.h>	/* for atoi() */

typedef unsigned char u8;
typedef unsigned int u32;

static inline u32 rol32(u32 word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Best hash sizes are of power of two */
#define jhash_size(n)   ((u32)1<<(n))
/* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
#define jhash_mask(n)   (jhash_size(n)-1)

/* __jhash_mix -- mix 3 32-bit values reversibly. */
#define __jhash_mix(a, b, c)			\
{						\
	a -= c;  a ^= rol32(c, 4);  c += b;	\
	b -= a;  b ^= rol32(a, 6);  a += c;	\
	c -= b;  c ^= rol32(b, 8);  b += a;	\
	a -= c;  a ^= rol32(c, 16); c += b;	\
	b -= a;  b ^= rol32(a, 19); a += c;	\
	c -= b;  c ^= rol32(b, 4);  b += a;	\
}

/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
#define __jhash_final(a, b, c)			\
{						\
	c ^= b; c -= rol32(b, 14);		\
	a ^= c; a -= rol32(c, 11);		\
	b ^= a; b -= rol32(a, 25);		\
	c ^= b; c -= rol32(b, 16);		\
	a ^= c; a -= rol32(c, 4);		\
	b ^= a; b -= rol32(a, 14);		\
	c ^= b; c -= rol32(b, 24);		\
}

#define __get_unaligned_cpu32(x) (*(u32 *)(x))

/* An arbitrary initial parameter */
#define JHASH_INITVAL		0xdeadbeef

/* jhash - hash an arbitrary key
 * @k: sequence of bytes as key
 * @length: the length of the key
 * @initval: the previous hash, or an arbitrary value
 *
 * The generic version, hashes an arbitrary sequence of bytes.
 * No alignment or length assumptions are made about the input key.
 *
 * Returns the hash value of the key. The result depends on endianness.
 */
static inline u32 jhash(const void *key, u32 length, u32 initval)
{
	u32 a, b, c;
	const u8 *k = key;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + length + initval;

	/* All but the last block: affect some 32 bits of (a,b,c) */
	while (length > 12) {
		a += __get_unaligned_cpu32(k);
		b += __get_unaligned_cpu32(k + 4);
		c += __get_unaligned_cpu32(k + 8);
		__jhash_mix(a, b, c);
		length -= 12;
		k += 12;
	}
	/* Last block: affect all 32 bits of (c) */
	/* All the case statements fall through */
	switch (length) {
	case 12: c += (u32)k[11]<<24;
	case 11: c += (u32)k[10]<<16;
	case 10: c += (u32)k[9]<<8;
	case 9:  c += k[8];
	case 8:  b += (u32)k[7]<<24;
	case 7:  b += (u32)k[6]<<16;
	case 6:  b += (u32)k[5]<<8;
	case 5:  b += k[4];
	case 4:  a += (u32)k[3]<<24;
	case 3:  a += (u32)k[2]<<16;
	case 2:  a += (u32)k[1]<<8;
	case 1:  a += k[0];
		 __jhash_final(a, b, c);
	case 0: /* Nothing left to add */
		break;
	}

	return c;
}

/* jhash2 - hash an array of u32's
 * @k: the key which must be an array of u32's
 * @length: the number of u32's in the key
 * @initval: the previous hash, or an arbitrary value
 *
 * Returns the hash value of the key.
 */
static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
{
	u32 a, b, c;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + (length<<2) + initval;

	/* Handle most of the key */
	while (length > 3) {
		a += k[0];
		b += k[1];
		c += k[2];
		__jhash_mix(a, b, c);
		length -= 3;
		k += 3;
	}

	/* Handle the last 3 u32's: all the case statements fall through */
	switch (length) {
	case 3: c += k[2];
	case 2: b += k[1];
	case 1: a += k[0];
		__jhash_final(a, b, c);
	case 0:	/* Nothing left to add */
		break;
	}

	return c;
}


/* jhash_3words - hash exactly 3, 2 or 1 word(s) */
static inline u32 jhash_3words(u32 a, u32 b, u32 c, u32 initval)
{
	a += JHASH_INITVAL;
	b += JHASH_INITVAL;
	c += initval;

	__jhash_final(a, b, c);

	return c;
}

static inline u32 jhash_2words(u32 a, u32 b, u32 initval)
{
	return jhash_3words(a, b, 0, initval);
}

static inline u32 jhash_1word(u32 a, u32 initval)
{
	return jhash_3words(a, 0, 0, initval);
}

int main(int argc, char **argv)
{
	int i;
	struct {
		void *s;
		void *t;
		u32 l;
	} k = { .s = 0 };
	int total;

	total = atoi(argv[1]);

	for (i = 0; i < total; i++) {
		k.l = i;
		printf("0x%x\n", jhash2((u32 *)&k, 3, 12345) & (total - 1));
	}

	return 0;
}

Cheers,
Thomas Graf March 18, 2015, 9:51 a.m. UTC | #15
On 03/18/15 at 08:56am, Herbert Xu wrote:
> Actually 75% is just as bad.  To test that just repeat my script
> above and add head -n $((65536 / 4 * 3)) before the first sort.
> 
> My point is that to get below anything like a chain length limit
> of 2 (or even 4), your hash table is going to be mostly empty.
> OK 0.1% might be an exaggeration but it's certainly nowhere near 50%
> as your hash table size grows.
> 
> Please actually try out my test, here is a C program that will help
> you do it:

Thanks for providing that C program. I played around with it and
noticed a small typo which I fixed up. I'm attaching the version
I've used at the end of the mail.

Maybe there is a thinko on my part but the results I'm getting are
very impressive. I have a very hard time to produce more than a
single hash conflict for sequences up to 0..2^27.

I also tried with a pseudo random value for the u32 part of the key
instead of the sequence  number and got the same results.

I'm running it like this:
for i in $(seq 1 27); do \
echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c \
done

1:
2:
3:
      3 0x1
4:
      2 0xc
5:
6:
      2 0x2b
7:
8:
9:
10:
11:
      2 0x1c9
12:
13:
      2 0x9c7
      2 0x4e9
14:
15:
      2 0x5c8
16:
17:
18:
      2 0x34cb2
      2 0x21fd0
19:
      2 0x61fd0
      2 0x6e82f
20:
      2 0x61fd0
      2 0x6e82f
      2 0xfd571
21:
      2 0x1a0e02
      2 0x1776ea
22:
      2 0x379099
23:
      2 0x2650b6
      2 0x6dc5b5
      2 0x648fe1
      2 0x543446
24:
      2 0x8d60e0
      2 0x8726e5
25:
      2 0x1225f30
26:
      2 0x3a136f4
27:


--- jhash.c ---

#include <stdio.h>
#include <stdlib.h>	/* for atoi(), srandom(), random() */
#include <time.h>	/* for time() */

typedef unsigned char u8;
typedef unsigned int u32;

static inline u32 rol32(u32 word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Best hash sizes are of power of two */
#define jhash_size(n)   ((u32)1<<(n))
/* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
#define jhash_mask(n)   (jhash_size(n)-1)

/* __jhash_mix -- mix 3 32-bit values reversibly. */
#define __jhash_mix(a, b, c)			\
{						\
	a -= c;  a ^= rol32(c, 4);  c += b;	\
	b -= a;  b ^= rol32(a, 6);  a += c;	\
	c -= b;  c ^= rol32(b, 8);  b += a;	\
	a -= c;  a ^= rol32(c, 16); c += b;	\
	b -= a;  b ^= rol32(a, 19); a += c;	\
	c -= b;  c ^= rol32(b, 4);  b += a;	\
}

/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
#define __jhash_final(a, b, c)			\
{						\
	c ^= b; c -= rol32(b, 14);		\
	a ^= c; a -= rol32(c, 11);		\
	b ^= a; b -= rol32(a, 25);		\
	c ^= b; c -= rol32(b, 16);		\
	a ^= c; a -= rol32(c, 4);		\
	b ^= a; b -= rol32(a, 14);		\
	c ^= b; c -= rol32(b, 24);		\
}

#define __get_unaligned_cpu32(x) (*(u32 *)(x))

/* An arbitrary initial parameter */
#define JHASH_INITVAL		0xdeadbeef

/* jhash - hash an arbitrary key
 * @k: sequence of bytes as key
 * @length: the length of the key
 * @initval: the previous hash, or an arbitrary value
 *
 * The generic version, hashes an arbitrary sequence of bytes.
 * No alignment or length assumptions are made about the input key.
 *
 * Returns the hash value of the key. The result depends on endianness.
 */
static inline u32 jhash(const void *key, u32 length, u32 initval)
{
	u32 a, b, c;
	const u8 *k = key;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + length + initval;

	/* All but the last block: affect some 32 bits of (a,b,c) */
	while (length > 12) {
		a += __get_unaligned_cpu32(k);
		b += __get_unaligned_cpu32(k + 4);
		c += __get_unaligned_cpu32(k + 8);
		__jhash_mix(a, b, c);
		length -= 12;
		k += 12;
	}
	/* Last block: affect all 32 bits of (c) */
	/* All the case statements fall through */
	switch (length) {
	case 12: c += (u32)k[11]<<24;
	case 11: c += (u32)k[10]<<16;
	case 10: c += (u32)k[9]<<8;
	case 9:  c += k[8];
	case 8:  b += (u32)k[7]<<24;
	case 7:  b += (u32)k[6]<<16;
	case 6:  b += (u32)k[5]<<8;
	case 5:  b += k[4];
	case 4:  a += (u32)k[3]<<24;
	case 3:  a += (u32)k[2]<<16;
	case 2:  a += (u32)k[1]<<8;
	case 1:  a += k[0];
		 __jhash_final(a, b, c);
	case 0: /* Nothing left to add */
		break;
	}

	return c;
}

/* jhash2 - hash an array of u32's
 * @k: the key which must be an array of u32's
 * @length: the number of u32's in the key
 * @initval: the previous hash, or an arbitrary value
 *
 * Returns the hash value of the key.
 */
static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
{
	u32 a, b, c;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + (length<<2) + initval;

	/* Handle most of the key */
	while (length > 3) {
		a += k[0];
		b += k[1];
		c += k[2];
		__jhash_mix(a, b, c);
		length -= 3;
		k += 3;
	}

	/* Handle the last 3 u32's: all the case statements fall through */
	switch (length) {
	case 3: c += k[2];
	case 2: b += k[1];
	case 1: a += k[0];
		__jhash_final(a, b, c);
	case 0:	/* Nothing left to add */
		break;
	}

	return c;
}


/* jhash_3words - hash exactly 3, 2 or 1 word(s) */
static inline u32 jhash_3words(u32 a, u32 b, u32 c, u32 initval)
{
	a += JHASH_INITVAL;
	b += JHASH_INITVAL;
	c += initval;

	__jhash_final(a, b, c);

	return c;
}

static inline u32 jhash_2words(u32 a, u32 b, u32 initval)
{
	return jhash_3words(a, b, 0, initval);
}

static inline u32 jhash_1word(u32 a, u32 initval)
{
	return jhash_3words(a, 0, 0, initval);
}

int main(int argc, char **argv)
{
	int i;
	struct {
		void *s;
		void *t;
		u32 l;
	} k = { .s = 0 };
	int total;
	u32 initval;

	total = atoi(argv[1]);
	srandom(time(0));
	initval = random();

	for (i = 0; i < total; i++) {
		k.l = i;
		printf("0x%x\n", jhash2((u32 *)&k, sizeof(k)/4, initval) & (total - 1));
	}

	return 0;
}

Herbert Xu March 18, 2015, 9:55 a.m. UTC | #16
On Wed, Mar 18, 2015 at 09:51:02AM +0000, 'tgraf@suug.ch' wrote:
> 
> I'm running it like this:
> for i in $(seq 1 27); do \
> echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c \

uniq -D doesn't do what you think it does:

$ { echo a; echo b; echo a; } | uniq -D
$

So you should use sort | uniq -c

Cheers,
Thomas Graf March 18, 2015, 10:08 a.m. UTC | #17
On 03/18/15 at 08:55pm, Herbert Xu wrote:
> On Wed, Mar 18, 2015 at 09:51:02AM +0000, 'tgraf@suug.ch' wrote:
> > 
> > I'm running it like this:
> > for i in $(seq 1 27); do \
> > echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c \
> 
> uniq -D doesn't do what you think it does:
> 
> $ { echo a; echo b; echo a; } | uniq -D
> $
> 
> So you should use sort | uniq -c

Thanks. The important keyword is "adjacent" here ;-)

With this fixed up I can see what you mean. So if we are
to only do a chain length based decision, the limit would
have to grow together with the table size.

for i in $(seq 1 27); do
echo $i: && ./jhash $((2**$i)) | sort | uniq -D | uniq -c | sort -n -r | head -5;
done

1:
      2 0x0
2:
      3 0x0
3:
      3 0x0
      2 0x2
4:
      2 0xb
      2 0x8
      2 0x4
      2 0x2
      2 0x0
5:
      3 0xe
      3 0x13
      2 0xc
      2 0xa
      2 0x8
6:
      3 0x38
      3 0x37
      3 0x32
      3 0x2f
      3 0x2e
7:
      4 0x37
      3 0x7d
      3 0x79
      3 0x6f
      3 0x69
8:
      6 0xb7
      5 0x79
      4 0xf4
      4 0xe4
      4 0xa3
9:
      5 0xb7
      5 0x9e
      5 0x15b
      5 0x13b
      4 0xdd
10:
      5 0xa4
      5 0x79
      5 0x29e
      5 0x22a
      4 0xf4
11:
      5 0x92
      5 0x4a4
      5 0x479
      5 0x47
      5 0x379
12:
      6 0xf1
      6 0x751
      5 0xf00
      5 0xe26
      5 0xc26
13:
      7 0xf1
      6 0x520
      6 0x1e06
      6 0x1c44
      6 0x11ac
14:
      7 0x3174
      7 0x2f12
      6 0x520
      6 0x3763
      6 0x3615
15:
      7 0xf4
      7 0x7d0b
      7 0x179
      6 0x807
      6 0x69c9
16:
      7 0xf516
      7 0xe659
      7 0xdf00
      7 0x5f5a
      7 0x5cc3
17:
      7 0xe659
      7 0xde5a
      7 0xcaf
      7 0x9cce
      7 0x7fe3
18:
      9 0x17f9b
      8 0x3852d
      7 0x992b
      7 0x5e38
      7 0x3d39e
19:
      8 0x5f22b
      8 0x47d66
      8 0x436ba
      8 0x182aa
      7 0xe095
20:
      8 0xf87dd
      8 0xf464d
      8 0xef51d
      8 0xc08d4
      8 0x91a96
21:
      9 0xa7a90
      9 0x1fc5ad
      8 0xfae2f
      8 0xd4123
      8 0xb5686
22:
      9 0x97b11
      9 0x3fcfa3
      8 0xffa9e
      8 0xfa437
      8 0xef7e6
23:
     10 0x3b3115
      9 0x5cb5ba
      9 0x568fbf
      9 0x54d790
      9 0x4c2393
24:
     10 0x86373a
     10 0x35794f
      9 0xf6d21e
      9 0xde5301
      9 0xb36b9d

Herbert Xu March 18, 2015, 10:12 a.m. UTC | #18
On Wed, Mar 18, 2015 at 10:08:12AM +0000, 'tgraf@suug.ch' wrote:
>
> With this fixed up I can see what you mean. So if we are
> to only do a chain length based decision, the limit would
> have to grow together with the table size.

Well, we could just use a flat limit of 16 since our maximum table
size is 2^32 (actually a bit less) and 16 should be plenty for
that.

Remember this limit is only there to detect when somebody who has
stolen our secret is trying to DoS us.  So if we can keep them
under 16 per-bucket it should be sufficient to defeat any attack.

Of course I will also add a patch to limit the number of elements
to the table size (so maximum utilisation is 100%).  This will come
after we allow insertions to fail.

Cheers,
David Laight March 18, 2015, 10:26 a.m. UTC | #19
From: Herbert Xu
..
> Of course I will also add a patch to limit the number of elements
> to the table size (so maximum utilisation is 100%).  This will come
> after we allow insertions to fail.

There may be some uses where short hash chains are acceptable.
In which case a utilisation way above 100% may make sense.

If a table is likely to have repeated lookups for the same item
then it can make sense to cache the last looked up item for each chain.
This cached value can be written without any locking - the only
place care is needed is ensuring it doesn't reference an item
that has just been deleted.
(You'd want a separate array to not dirty the cache lines containing
the head pointers. Although for a large table they are misses anyway.)

	David

Thomas Graf March 18, 2015, 10:44 a.m. UTC | #20
On 03/18/15 at 09:12pm, Herbert Xu wrote:
> On Wed, Mar 18, 2015 at 10:08:12AM +0000, 'tgraf@suug.ch' wrote:
> >
> > With this fixed up I can see what you mean. So if we are
> > to only do a chain length based decision, the limit would
> > have to grow together with the table size.
> 
> We could just use a flat limit of 16, since our maximum table
> size is 2^32 (actually a bit less) and 16 should be plenty for
> that.
> 
> Remember this limit is only there to detect when somebody who has
> stolen our secret is trying to DoS us.  So if we can keep them
> under 16 per-bucket it should be sufficient to defeat any attack.

Agreed, that is certainly sufficient to avoid attacks. I didn't
want to give up on getting rid of the table-wide nelems counter
*inside* rhashtable if we have to maintain chain lengths anyway,
but I can see the difficulties.

The only approach that seems to work is to count the number of
buckets whose chain length exceeds a limit relative to the table
size, or something similar. That requires counting at the table
level, and if we have to do that we might as well just count the
number of elements in the table.

> Of course I will also add a patch to limit the number of elements
> to the table size (so maximum utilisation is 100%).  This will come
> after we allow insertions to fail.

For completeness, I'm also attaching the distribution of chain
lengths for each table size.

Percentage of buckets with more than 1 entry:

1: 0%
2: 50%
3: 25%
4: 25%
5: 28%
6: 28%
7: 31%
8: 26%
9: 25%
10: 27%
11: 26%
12: 26%
13: 26%
14: 26%
15: 26%
16: 26%
17: 26%
18: 26%
19: 26%
20: 26%
21: 26%
22: 26%
23: 26%

Distribution of chain lengths:
1:
      1 2
2:
      2 2
3:
      1 2
      1 3
4:
      3 2
      1 3
5:
      6 2
      4 3
6:
     11 2
      4 3
      1 4
7:
     27 2
      8 3
      2 4
8:
     50 2
     15 3
      3 4
      1 5
9:
    101 2
     33 3
      6 4
      1 5
10:
    201 2
     61 3
     17 4
      2 5
11:
    361 2
    134 3
     29 4
      5 5
      1 6
12:
    784 2
    242 3
     49 4
     14 5
      3 6
13:
   1552 2
    472 3
    118 4
     28 5
      5 6
      1 7
14:
   2965 2
    996 3
    262 4
     51 5
      9 6
      1 7
      1 8
15:
   6010 2
   2029 3
    510 4
     90 5
     21 6
      2 7
16:
  11993 2
   4068 3
   1014 4
    175 5
     40 6
      5 7
      1 8
17:
  24123 2
   8100 3
   1969 4
    403 5
     79 6
     10 7
      3 8
18:
  48127 2
  16145 3
   3930 4
    825 5
    137 6
     25 7
      1 8
      1 9
19:
  96558 2
  32250 3
   7990 4
   1619 5
    264 6
     40 7
      7 8
20:
 193267 2
  64476 3
  16051 4
   3140 5
    544 6
     71 7
      9 8
      1 9
21:
 385293 2
 128850 3
  32366 4
   6344 5
   1017 6
    128 7
     21 8
      2 9
22:
 769535 2
 257779 3
  64534 4
  13052 5
   2177 6
    322 7
     25 8
      1 9
23:
1540606 2
 514743 3
 130192 4
  26207 5
   4403 6
    635 7
     73 8
     11 9
24:
3073982 2
1032727 3
 262044 4
  53475 5
   9179 6
   1353 7
    184 8
     14 9
      2 10
      1 11
25:
6124258 2
2071733 3
 534130 4
 111862 5
  19754 6
   3030 7
    419 8
     56 9
      5 10
      1 12


Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1695378..f16e856 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -51,7 +51,6 @@  struct rhash_head {
  * @size: Number of hash buckets
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
- * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
@@ -63,7 +62,6 @@  struct bucket_table {
 	unsigned int		size;
 	unsigned int		rehash;
 	u32			hash_rnd;
-	u32			shift;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
 	struct list_head	walkers;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c523d3a..4a99c0c 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -162,7 +162,6 @@  static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 
 	tbl->size = nbuckets;
-	tbl->shift = ilog2(nbuckets);
 
 	if (alloc_bucket_locks(ht, tbl) < 0) {
 		bucket_table_free(tbl);
@@ -189,7 +188,7 @@  static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->shift < ht->p.max_shift);
+	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));
 }
 
 /**
@@ -202,7 +201,7 @@  static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->shift > ht->p.min_shift;
+	       tbl->size > (1 << ht->p.min_shift);
 }
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)