
rhashtable: Add cap on number of elements in hash table

Message ID: 20150513080640.GA27800@gondor.apana.org.au
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Herbert Xu May 13, 2015, 8:06 a.m. UTC
We currently have no limit on the number of elements in a hash table.
This is a problem because some users (tipc) set a ceiling on the
maximum table size and when that is reached the hash table may
degenerate.  Others may encounter OOM when growing, and if we allow
insertions when that happens, hash table performance may also
suffer.

This patch adds a new parameter insecure_max_entries which becomes
the cap on the table.  If unset it defaults to max_size.  If it is
also zero it means that there is no cap on the number of elements
in the table.  However, the table will grow whenever the utilisation
hits 100% and if that growth fails, you will get ENOMEM on insertion.

As allowing >100% utilisation is potentially dangerous, the name
contains the word insecure.

Note that the cap is not a hard limit.  This is done for performance
reasons as enforcing a hard limit will result in use of atomic ops
that are heavier than the ones we currently use.

The reasoning is that we're only guarding against a gross over-
subscription of the table, rather than a small breach of the limit.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
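
For illustration only (not part of the posted patch), a caller would opt
into the cap through the new field roughly as follows; struct test_obj,
test_params and the sizes here are made up:

#include <linux/rhashtable.h>

struct test_obj {
	int			value;	/* hash key */
	struct rhash_head	node;	/* table linkage */
};

static const struct rhashtable_params test_params = {
	.key_len	= sizeof(int),
	.key_offset	= offsetof(struct test_obj, value),
	.head_offset	= offsetof(struct test_obj, node),
	.max_size	= 1 << 16,	/* table stops growing here */
	.insecure_max_entries = 1 << 17,	/* soft cap on element count */
};

/* ... later: rhashtable_init(&ht, &test_params), then insert/lookup as
 * usual; inserts start failing once the cap is (roughly) reached.
 */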

Comments

David Miller May 15, 2015, 2:22 a.m. UTC | #1
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 13 May 2015 16:06:40 +0800

> We currently have no limit on the number of elements in a hash table.
> This is a problem because some users (tipc) set a ceiling on the
> maximum table size and when that is reached the hash table may
> degenerate.  Others may encounter OOM when growing, and if we allow
> insertions when that happens, hash table performance may also
> suffer.
> 
> This patch adds a new parameter insecure_max_entries which becomes
> the cap on the table.  If unset it defaults to max_size.  If it is
> also zero it means that there is no cap on the number of elements
> in the table.  However, the table will grow whenever the utilisation
> hits 100% and if that growth fails, you will get ENOMEM on insertion.
> 
> As allowing >100% utilisation is potentially dangerous, the name
> contains the word insecure.
> 
> Note that the cap is not a hard limit.  This is done for performance
> reasons as enforcing a hard limit will result in use of atomic ops
> that are heavier than the ones we currently use.
> 
> The reasoning is that we're only guarding against a gross over-
> subscription of the table, rather than a small breach of the limit.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I'm not so sure I can get behind this change.

For the default case where we propagate max_size into this new limit,
if we grow the hash table to its maximum size, and then just go one
past the limit, we're going to start failing inserts.

In my opinion, up to at least 2 X max_size, it's safe to allow the
insert, assuming a well-chosen hash function and a roughly even
distribution.

A two entry deep hash chain is not a reason to fail a hash insertion.

We would not fail hash insertions in pretty much any other hash
implementation in the kernel.  I would seriously rather have
3 entry deep hash chains than no new connections at all.

If an administrator has to analyze some situation where this is
happening and they see -E2BIG propagating into their applications,
this is going to be surprising.  And when they do all of the work
to figure out what is causing it, I am pretty sure they will be
somewhat upset that we considered it OK to do this.

So I'd like you to take some time to reconsider hash insert failures,
and only do them when it's going to create an unacceptable, _massive_
hardship for the machine.  And I do not think that 2 or 3 entry deep
hash chains qualify.

Thanks.
Herbert Xu May 15, 2015, 3:06 a.m. UTC | #2
On Thu, May 14, 2015 at 10:22:17PM -0400, David Miller wrote:
>
> In my opinion, up to at least 2 X max_size, it's safe to allow the
> insert, assuming a well-chosen hash function and a roughly even
> distribution.

OK I can make it 2 x max_size/table size.

Cheers,
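
For reference, one way to express that relaxed default (a sketch only,
not what was posted; compare the rhashtable_init() hunk at the bottom of
this thread):

	/* Sketch: default the soft cap to twice the maximum table size
	 * rather than the table size itself, so a table that has stopped
	 * growing can still absorb a 2x over-subscription before inserts
	 * start failing.
	 */
	if (params->insecure_max_entries)
		ht->p.insecure_max_entries =
			rounddown_pow_of_two(params->insecure_max_entries);
	else
		ht->p.insecure_max_entries = ht->p.max_size * 2;
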
David Miller May 15, 2015, 3:46 a.m. UTC | #3
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 May 2015 11:06:23 +0800

> On Thu, May 14, 2015 at 10:22:17PM -0400, David Miller wrote:
>>
>> In my opinion, up to at least 2 X max_size, it's safe to allow the
>> insert, assuming a well-chosen hash function and a roughly even
>> distribution.
> 
> OK I can make it 2 x max_size/table size.

The rest of my email after what you quoted was intended to get one
to consider this issue generally.  :-)

We wouldn't fail these inserts in any other hash table in the kernel.

Would we stop making new TCP sockets if the TCP ehash chains are 3
entries deep?  4?  5?  The answer to all of those is of course no
for any hash chain length of N whatsoever.

This new rhashtable behavior would be the default, and I seriously
doubt that it's behavior that people who use a hash table, generally
speaking, want.

Should there perhaps be hard protections for _extremely_ long hash
chains?  Sure, I'm willing to entertain that kind of idea.  But I
would do so at the very far end of the spectrum.  To the point where
the hash table is degenerating into a linked list.
Herbert Xu May 15, 2015, 6:30 a.m. UTC | #4
On Thu, May 14, 2015 at 11:46:15PM -0400, David Miller wrote:
> 
> We wouldn't fail these inserts in any other hash table in the kernel.
> 
> Would we stop making new TCP sockets if the TCP ehash chains are 3
> entries deep?  4?  5?  The answer to all of those is of course no
> for any hash chain length of N whatsoever.

I would agree with you if this were a fixed-size table.  If your
table grows with you (which is the whole point of rhashtable) then
we should never hit this until you run out of memory.

When you are running out of memory whether you fail when the table
growth fails or later when you can't even allocate an entry is
immaterial because failure is inevitable.

In my view everybody should be using rhashtable without a maximum
size.  The only place where it would make sense to have a maximum
size limit is if you also had a limit on the number of entries.
In which case you might as well make that the limit on the hash table
size.

> Should there perhaps be hard protections for _extremely_ long hash
> chains?  Sure, I'm willing to entertain that kind of idea.  But I
> would do so at the very far end of the spectrum.  To the point where
> the hash table is degenerating into a linked list.

Do you have any suggestions of what such a limit should be?

Cheers,
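
For reference, the resulting decision order on insert, paraphrased (see
the two hunks at the bottom of this thread for the real code; locking,
lookup and the actual linking are elided):

	if (rht_grow_above_100(ht, tbl)) {
		/* 100% utilised: try to grow/rehash, then retry the insert */
	}

	if (rht_grow_above_max(ht, tbl))
		return -E2BIG;	/* soft cap reached: refuse the insert */

	/* otherwise link the new element into its bucket chain */
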
David Miller May 16, 2015, 10:09 p.m. UTC | #5
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 May 2015 14:30:57 +0800

> On Thu, May 14, 2015 at 11:46:15PM -0400, David Miller wrote:
>> 
>> We wouldn't fail these inserts in any other hash table in the kernel.
>> 
>> Would we stop making new TCP sockets if the TCP ehash chains are 3
>> entries deep?  4?  5?  The answer to all of those is of course no
>> for any hash chain length of N whatsoever.
> 
> I would agree with you if this were a fixed-size table.  If your
> table grows with you (which is the whole point of rhashtable) then
> we should never hit this until you run out of memory.
> 
> When you are running out of memory whether you fail when the table
> growth fails or later when you can't even allocate an entry is
> immaterial because failure is inevitable.
> 
> In my view everybody should be using rhashtable without a maximum
> size.  The only place where it would make sense to have a maximum
> size limit is if you also had a limit on the number of entries.
> In which case you might as well make that the limit on the hash table
> size.

Ok, agreed.

>> Should there perhaps be hard protections for _extremely_ long hash
>> chains?  Sure, I'm willing to entertain that kind of idea.  But I
>> would do so at the very far end of the spectrum.  To the point where
>> the hash table is degenerating into a linked list.
> 
> Do you have any suggestions of what such a limit should be?

Good question.

Obviously something like 50 or 100 is too much.

Perhaps something between 5 and 10.

That's just my gut instinct speaking.
Herbert Xu May 17, 2015, 1:38 a.m. UTC | #6
On Sat, May 16, 2015 at 06:09:46PM -0400, David Miller wrote:
>
> Obviously something like 50 or 100 is too much.
> 
> Perhaps something between 5 and 10.

You are even more parsimonious than I :) Because the maximum chain
length grows at a rate of log N / log log N, I had chosen the number
16 which should be sufficient even if you had a 2^32 table (which
you currently can't as we max out somewhere below that).

FWIW Thomas did some numbers on this, and apparently 10 gets breached
in a 2^24 table at 100% utilisation.

Cheers,
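
For reference (not from the thread), the classical balls-into-bins
estimate behind these figures, for N entries hashed into a table of N
buckets, puts the longest chain at roughly ln N / ln ln N:

	ln(2^24) / ln(ln(2^24))  ≈  16.6 / 2.8  ≈  5.9
	ln(2^32) / ln(ln(2^32))  ≈  22.2 / 3.1  ≈  7.2

In practice the maximum comes out a small constant factor above this
leading-order estimate, which is consistent with 10 being breached at
2^24 and with 16 still holding comfortably at 2^32.
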
David Miller May 18, 2015, 8:12 p.m. UTC | #7
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 17 May 2015 09:38:29 +0800

> On Sat, May 16, 2015 at 06:09:46PM -0400, David Miller wrote:
>>
>> Obviously something like 50 or 100 is too much.
>> 
>> Perhaps something between 5 and 10.
> 
> You are even more parsimonious than I :) Because the maximum chain
> length grows at a rate of log N / log log N, I had chosen the number
> 16 which should be sufficient even if you had a 2^32 table (which
> you currently can't as we max out somewhere below that).
> 
> FWIW Thomas did some numbers on this, and apparently 10 gets breached
> in a 2^24 table at 100% utilisation.

Ok, this of course depends upon the distribution of the input data
and the strength/suitability of the hash function.

I'm a little bit disappointed in what Thomas found.  I would expect
the distribution to be at least a little bit better.
Herbert Xu May 18, 2015, 10:35 p.m. UTC | #8
On Mon, May 18, 2015 at 04:12:01PM -0400, David Miller wrote:
>
> Ok, this of course depends upon the distribution of the input data
> and the strength/suitability of the hash function.
> 
> I'm a little bit disappointed in what Thomas found.  I would expect
> the distribution to be at least a little bit better.

Just to make it clear this is not a problem with uneven distribution.
The same thing would happen with any hash function.  It's essentially
a generalised birthday paradox.

Of course the average chain length is still what you expect it to
be; it's only the maximum chain length that grows as log N / log log N.

Cheers,
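
The effect is easy to reproduce with a throwaway userspace program (not
from the thread; random() stands in for an ideal, perfectly uniform hash):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const size_t n = 1UL << 20;	/* buckets == elements: 100% utilisation */
	unsigned int *chain = calloc(n, sizeof(*chain));
	unsigned int max = 0;
	size_t i;

	if (!chain)
		return 1;

	srandom(1);
	for (i = 0; i < n; i++) {
		/* ideal hash: each element lands in a uniformly random bucket */
		size_t bucket = (size_t)random() % n;

		if (++chain[bucket] > max)
			max = chain[bucket];
	}

	/* The average chain length is exactly 1 here, but the longest chain
	 * lands around the ln n / ln ln n mark (~5 for n = 2^20) and
	 * typically a few entries beyond it.
	 */
	printf("n = %zu, longest chain = %u\n", n, max);
	free(chain);
	return 0;
}
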
David Laight May 19, 2015, 10:25 a.m. UTC | #9
From: Herbert Xu
> Sent: 18 May 2015 23:35
> On Mon, May 18, 2015 at 04:12:01PM -0400, David Miller wrote:
> >
> > Ok, this of course depends upon the distribution of the input data
> > and the strength/suitability of the hash function.
> >
> > I'm a little bit disappointed in what Thomas found.  I would expect
> > the distribution to be at least a little bit better.
> 
> Just to make it clear this is not a problem with uneven distribution.
> The same thing would happen with any hash function.  It's essentially
> a generalised birthday paradox.

Which means that with real data and a real hash function it is very
likely to be worse.
Did anyone look at what happens to the longest chains if you put
2 or 4 times as many items into the hash table?
Clearly there will be more short chains, but the longest might not
get proportionally longer (guessing at probability theory).

Personally I don't like the idea of rejecting inserts because chains
might get long, and certainly not because a malloc() fails in the
middle of the resize operation.

I'm sure there are some tables where slightly longer chains don't matter.

	David


Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index dbcbcc5..f3deeaa 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -17,6 +17,7 @@ 
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
+#include <linux/atomic.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
 #include <linux/jhash.h>
@@ -100,6 +101,7 @@  struct rhashtable;
  * @key_len: Length of key
  * @key_offset: Offset of key in struct to be hashed
  * @head_offset: Offset of rhash_head in struct to be hashed
+ * @insecure_max_entries: Maximum number of entries (may be exceeded)
  * @max_size: Maximum size while expanding
  * @min_size: Minimum size while shrinking
  * @nulls_base: Base value to generate nulls marker
@@ -115,6 +117,7 @@  struct rhashtable_params {
 	size_t			key_len;
 	size_t			key_offset;
 	size_t			head_offset;
+	unsigned int		insecure_max_entries;
 	unsigned int		max_size;
 	unsigned int		min_size;
 	u32			nulls_base;
@@ -286,6 +289,18 @@  static inline bool rht_grow_above_100(const struct rhashtable *ht,
 		(!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
+/**
+ * rht_grow_above_max - returns true if table is above maximum
+ * @ht:		hash table
+ * @tbl:	current table
+ */
+static inline bool rht_grow_above_max(const struct rhashtable *ht,
+				      const struct bucket_table *tbl)
+{
+	return ht->p.insecure_max_entries &&
+	       atomic_read(&ht->nelems) >= ht->p.insecure_max_entries;
+}
+
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -612,6 +627,10 @@  slow_path:
 			goto slow_path;
 	}
 
+	err = -E2BIG;
+	if (unlikely(rht_grow_above_max(ht, tbl)))
+		goto out;
+
 	err = 0;
 
 	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 4936fc4..cfd3408 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -14,6 +14,7 @@ 
  * published by the Free Software Foundation.
  */
 
+#include <linux/atomic.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/log2.h>
@@ -451,6 +452,10 @@  int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
 	    rht_grow_above_100(ht, tbl))
 		goto exit;
 
+	err = -E2BIG;
+	if (unlikely(rht_grow_above_max(ht, tbl)))
+		goto exit;
+
 	err = 0;
 
 	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
@@ -734,6 +739,12 @@  int rhashtable_init(struct rhashtable *ht,
 	if (params->max_size)
 		ht->p.max_size = rounddown_pow_of_two(params->max_size);
 
+	if (params->insecure_max_entries)
+		ht->p.insecure_max_entries =
+			rounddown_pow_of_two(params->insecure_max_entries);
+	else
+		ht->p.insecure_max_entries = ht->p.max_size;
+
 	ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
 
 	/* The maximum (not average) chain length grows with the