[v2,7/10] rhashtable: Disable automatic shrinking

Message ID: E1YZard-0000wE-Se@gondolin.me.apana.org.au
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Herbert Xu March 22, 2015, 8:04 a.m. UTC
Automatic shrinking is dangerous because it provides an easy
way for an adversary to cause us to do unnecessary work, thus
making the resizable hashtable a poor data structure.

This patch disables automatic shrinking but retains a manual
shrink function for those cases where insertions and removals
are overseen by a trusted entity, e.g., nft_hash.

The shrink function will now also shrink to fit rather than halve
the size of the table.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |   15 ---------------
 lib/rhashtable.c           |   44 ++++++++++++++++++++++++++++++--------------
 lib/test_rhashtable.c      |   16 ++++++----------
 3 files changed, 36 insertions(+), 39 deletions(-)
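
The intended usage after this change, sketched for a trusted caller; the
surrounding function is hypothetical, only rhashtable_shrink() is from
this series:

	static void flush_trusted_set(struct rhashtable *ht)
	{
		/* ... remove entries via rhashtable_remove_fast() ... */

		/* Insertions and removals are overseen by a trusted
		 * entity here, so an explicit shrink-to-fit is safe. */
		rhashtable_shrink(ht);
	}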


Comments

Thomas Graf March 22, 2015, 12:17 p.m. UTC | #1
On 03/22/15 at 07:04pm, Herbert Xu wrote:
> Automatic shrinking is dangerous because it provides an easy
> way for an adversary to cause us to do unnecessary work, thus
> making the resizable hashtable a poor data structure.
> 
> This patch disables automatic shrinking but retains a manual
> shrink function for those cases where insertions and removals
> are overseen by a trusted entity, e.g., nft_hash.

This is misleading. I agree that unconditional shrinking is dangerous.
Shrinking was an optional feature disabled by default before. The
inlining enabled it by default for all users. What is the benefit of
requiring this logic outside of rhashtable over just adding a flag to
enable shrinking at 30% utilization?
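
One way such a flag could look against the inlined API of this series;
the field name automatic_shrinking is an assumption, not something in
this patch:

	/* Hypothetical opt-in flag in the per-table parameters. */
	struct rhashtable_params {
		...
		bool automatic_shrinking;
	};

	/* In rhashtable_remove_fast(), after atomic_dec(&ht->nelems): */
	if (unlikely(ht->p.automatic_shrinking &&
		     rht_shrink_below_30(ht, tbl)))
		schedule_work(&ht->run_work);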

> The shrink function will now also shrink to fit rather than halve
> the size of the table.

I like this part a lot

>  int rhashtable_shrink(struct rhashtable *ht)
>  {
> -	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
> +	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);

If rhashtable_shrink() is called near the 75% border it will cause an
immediate expansion again. Maybe make this * 3 / 2 so we shrink near
30% utilization as before?
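
To put concrete numbers on the borderline, a standalone userspace demo
(roundup_pow_of_two() is reimplemented here just for the sketch):

	#include <stdio.h>

	static unsigned int roundup_pow_of_two(unsigned int n)
	{
		unsigned int p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	int main(void)
	{
		unsigned int nelems = 96;
		unsigned int by_4_3 = roundup_pow_of_two(nelems * 4 / 3); /* 128 */
		unsigned int by_3_2 = roundup_pow_of_two(nelems * 3 / 2); /* 256 */

		/* rht_grow_above_75() fires once nelems > size * 3 / 4, so
		 * the 4/3 factor can land the table at exactly 75% load:
		 * the very next insertion triggers another expansion. */
		printf("4/3: size %u, next grow after %u insert(s)\n",
		       by_4_3, by_4_3 * 3 / 4 - nelems + 1);	/* 1 */
		printf("3/2: size %u, next grow after %u insert(s)\n",
		       by_3_2, by_3_2 * 3 / 4 - nelems + 1);	/* 97 */
		return 0;
	}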

> +	struct bucket_table *new_tbl;
> +	struct bucket_table *tbl;
> +	int err;
>  
> -	ASSERT_RHT_MUTEX(ht);
> +	if (size < ht->p.min_size)
> +		size = ht->p.min_size;

We should only shrink if size < old_tbl->size

> -	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
> +	new_tbl = bucket_table_alloc(ht, size);
>  	if (new_tbl == NULL)
>  		return -ENOMEM;
Thomas Graf March 22, 2015, 1:06 p.m. UTC | #2
On 03/22/15 at 12:17pm, Thomas Graf wrote:
> On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > +	struct bucket_table *new_tbl;
> > +	struct bucket_table *tbl;
> > +	int err;
> >  
> > -	ASSERT_RHT_MUTEX(ht);
> > +	if (size < ht->p.min_size)
> > +		size = ht->p.min_size;
> 
> We should only shrink if size < old_tbl->size

I found the check further down. Any particular reason to check
after the allocation and then free again? Why do you want to avoid doing
the allocation inside the mutex?

> > -	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
> > +	new_tbl = bucket_table_alloc(ht, size);
> >  	if (new_tbl == NULL)
> >  		return -ENOMEM;
Herbert Xu March 23, 2015, 12:07 a.m. UTC | #3
On Sun, Mar 22, 2015 at 01:06:30PM +0000, Thomas Graf wrote:
> On 03/22/15 at 12:17pm, Thomas Graf wrote:
> > On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > > +	struct bucket_table *new_tbl;
> > > +	struct bucket_table *tbl;
> > > +	int err;
> > >  
> > > -	ASSERT_RHT_MUTEX(ht);
> > > +	if (size < ht->p.min_size)
> > > +		size = ht->p.min_size;
> > 
> > We should only shrink if size < old_tbl->size
> 
> I found the check further down. Any particular reason to check
> after the allocation and then free again? Why do you want to avoid doing
> the allocation inside the mutex?

It's just quality of code.  You should always try to minimise
the locked sections.

Cheers,
Herbert Xu March 23, 2015, 12:09 a.m. UTC | #4
On Sun, Mar 22, 2015 at 12:17:55PM +0000, Thomas Graf wrote:
> 
> This is misleading. I agree that unconditional shrinking is dangerous.
> Shrinking was an optional feature disabled by default before. The

How was shrinking disabled before? AFAICS it always kicked in at
30%.

> inlining enabled it by default for all users. What is the benefit of
> requiring this logic outside of rhashtable over just adding a flag to
> enable shrinking at 30% utilization?

That would be adding an extra branch on the fast-path for an
operation which almost nobody needs.
 
> >  int rhashtable_shrink(struct rhashtable *ht)
> >  {
> > -	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
> > +	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);
> 
> If rhashtable_shrink() is called near the 75% border it will cause an
> immediate expansion again. Maybe make this * 3 / 2 so we shrink near
> 30% utilization as before?

Sure that makes sense.

Cheers,
Thomas Graf March 23, 2015, 8:33 a.m. UTC | #5
On 03/23/15 at 11:09am, Herbert Xu wrote:
> On Sun, Mar 22, 2015 at 12:17:55PM +0000, Thomas Graf wrote:
> > 
> > This is misleading. I agree that unconditional shrinking is dangerous.
> > Shrinking was an optional feature disabled by default before. The
> 
> How was shrinking disabled before? AFAICS it always kicked in at
> 30%.

Before Daniel removed the indirection because all callers enabled
shrinking by default ;-) It was clear that some future users would
eventually not want shrinking and would thus require a conditional.

> > inlining enabled it by default for all users. What is the benefit of
> > requiring this logic outside of rhashtable over just adding a flag to
> > enable shrinking at 30% utilization?
> 
> That would be adding an extra branch on the fast-path for an
> operation which almost nobody needs.

I don't get why almost nobody would want shrinking. I agree that for
tables like TCP hash tables, once you have grown you want to keep that
table size because the load is likely to come back. But we will also
have lots of users such as the Netlink socket with a table per protocol
where not shrinking results in giving the user the ability to waste
memory indefinitely for no gain.

I'm not claiming you always want shrinking but what gain is there by
removing integrated support? Can you show numbers that the additional
branch actually hurts?
Thomas Graf March 23, 2015, 8:37 a.m. UTC | #6
On 03/23/15 at 11:07am, Herbert Xu wrote:
> On Sun, Mar 22, 2015 at 01:06:30PM +0000, Thomas Graf wrote:
> > On 03/22/15 at 12:17pm, Thomas Graf wrote:
> > > On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > > > +	struct bucket_table *new_tbl;
> > > > +	struct bucket_table *tbl;
> > > > +	int err;
> > > >  
> > > > -	ASSERT_RHT_MUTEX(ht);
> > > > +	if (size < ht->p.min_size)
> > > > +		size = ht->p.min_size;
> > > 
> > > We should only shrink if size < old_tbl->size
> > 
> > I found the check further down. Any particular reason to check
> > after the allocation and then free again? Why do you want to avoid doing
> > the allocation inside the mutex?
> 
> It's just quality of code.  You should always try to minimise
> the locked sections.

So do you expect the user to replicate the new table size calculation
outside of rhashtable_shrink() to avoid the cost of a possible massive
memory allocation even if no shrinking will take place?

I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
cheaply get the current table size and only do the allocation and take
the lock if the table size warrants shrinking.
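
A sketch of that shape, reusing names from the patch below; the
__rhashtable_shrink() slow-path helper is hypothetical:

	int rhashtable_shrink(struct rhashtable *ht)
	{
		unsigned int size;
		struct bucket_table *tbl;

		size = roundup_pow_of_two(atomic_read(&ht->nelems) * 3 / 2);
		if (size < ht->p.min_size)
			size = ht->p.min_size;

		/* Cheap RCU peek: skip the allocation and the mutex
		 * entirely when the table is already small enough. */
		rcu_read_lock();
		tbl = rht_dereference_rcu(ht->tbl, ht);
		if (tbl->size <= size) {
			rcu_read_unlock();
			return 0;
		}
		rcu_read_unlock();

		/* Slow path: allocate, take ht->mutex, re-check the size
		 * under the lock, then rhashtable_rehash() as in the
		 * patch below. */
		return __rhashtable_shrink(ht, size);	/* hypothetical helper */
	}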
Herbert Xu March 23, 2015, 9:28 a.m. UTC | #7
On Mon, Mar 23, 2015 at 08:33:19AM +0000, Thomas Graf wrote:
>
> I'm not claiming you always want shrinking but what gain is there by
> removing integrated support? Can you show numbers that the additional
> branch actually hurts?

You never want automatic shrinking unless all your users are
trusted.  I doubt there would be many rhashtable users where
this would apply.  Even nft_hash is quite tenuous.

Besides, if you really want automatic shrinking, you could always
do it in the caller of rhashtable_remove.  That way only you
would pay for the cost and not everybody else.

Cheers,
Herbert Xu March 23, 2015, 9:29 a.m. UTC | #8
On Mon, Mar 23, 2015 at 08:37:12AM +0000, Thomas Graf wrote:
>
> I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
> cheaply get the current table size and only do the allocation and take
> the lock if the table size warrants for shrinking.

Well you should never invoke rhashtable_shrink unless you actually
want to shrink.  So this is something that you should have checked
before rhashtable_shrink is called.

Cheers,
Thomas Graf March 23, 2015, 9:36 a.m. UTC | #9
On 03/23/15 at 08:28pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:33:19AM +0000, Thomas Graf wrote:
> >
> > I'm not claiming you always want shrinking but what gain is there by
> > removing integrated support? Can you show numbers that the additional
> > branch actually hurts?
> 
> You never want automatic shrinking unless all your users are
> trusted.  I doubt there would be many rhashtable users where
> this would apply.  Even nft_hash is quite tenuous.

Why?

> Besides, if you really want automatic shrinking, you could always
> do it in the caller of rhashtable_remove.  That way only you
> would pay for the cost and not everybody else.

The same can be said for growing. Why do we treat the two differently?
Would you expect users requiring shrinking to call
rhashtable_shrink() after every remove? Should they encode their
own logic based on rhashtable internals?
Herbert Xu March 23, 2015, 9:39 a.m. UTC | #10
On Mon, Mar 23, 2015 at 09:36:32AM +0000, Thomas Graf wrote:
>
> > You never want automatic shrinking unless all your users are
> > trusted.  I doubt there would be many rhashtable users where
> > this would apply.  Even nft_hash is quite tenuous.
> 
> Why?

Because with multiple rehashing it's quite easy to convert your
hash table into a linked list by repeatedly growing and shrinking.

Multiple rehashing simply cannot work unless you get rid of automatic
shrinking for the untrusted case.

> The same can be said for growing. Why do we treat the two differently?
> Would you expect users requiring shrinking to call
> rhashtable_shrink() after every remove? Should they encode their
> own logic based on rhashtable internals?

No, we must be able to grow or insertions will fail.  The same cannot
be said for shrinking.

Cheers,
Thomas Graf March 23, 2015, 9:43 a.m. UTC | #11
On 03/23/15 at 08:29pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:37:12AM +0000, Thomas Graf wrote:
> >
> > I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
> > cheaply get the current table size and only do the allocation and take
> > the lock if the table size warrants shrinking.
> 
> Well you should never invoke rhashtable_shrink unless you actually
> want to shrink.  So this is something that you should have checked
> before rhashtable_shrink is called.

How? The calculation of the table size is embedded in
rhashtable_shrink(). Should every user have a copy of that
calculation algorithm?

Why not just:

	unlikely(ht->p.shrink && rht_shrink_below_30(..))

If you really care about that additional conditional we
can also add:

static inline int rhashtable_remove_and_shrink(struct rhashtable *ht,
					       struct rhash_head *obj,
					       const struct rhashtable_params params)
{
	struct bucket_table *tbl;
	int err;

	rcu_read_lock();

	tbl = rht_dereference_rcu(ht->tbl, ht);

	err = rhashtable_remove_fast(ht, obj, params);
	if (unlikely(!err && rht_shrink_below_30(ht, tbl)))
		schedule_work(&ht->run_work);

	rcu_read_unlock();

	return err;
}

I just think it's wrong to rip out all the shrinking logic and
require every single user to re-add its own copy.
Herbert Xu March 23, 2015, 9:44 a.m. UTC | #12
On Mon, Mar 23, 2015 at 08:39:52PM +1100, Herbert Xu wrote:
> 
> Because with multiple rehashing it's quite easy to convert your
> hash table into a linked list by repeatedly growing and shrinking.
> 
> Multiple rehashing simply cannot work unless you get rid of automatic
> shrinking for the untrusted case.

Actually what I could do is allow automatic shrinking when there
are no outstanding rehashes.  So maybe we could restore this feature
after all.

Cheers,
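
A guess at how that condition might look, reusing the future_tbl test
from the patch below; the helper name is made up:

	/* Hypothetical: only shrink automatically when no rehash is
	 * outstanding, i.e. the table has no future_tbl chained on. */
	static bool rht_shrink_safe(struct rhashtable *ht,
				    const struct bucket_table *tbl)
	{
		return !rht_dereference(tbl->future_tbl, ht) &&
		       rht_shrink_below_30(ht, tbl);
	}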
Thomas Graf March 23, 2015, 10:08 a.m. UTC | #13
On 03/23/15 at 08:44pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:39:52PM +1100, Herbert Xu wrote:
> > 
> > Because with multiple rehashing it's quite easy to convert your
> > hash table into a linked list by repeatedly growing and shrinking.
> > 
> > Multiple rehashing simply cannot work unless you get rid of automatic
> > shrinking for the untrusted case.
> 
> Actually what I could do is allow automatic shrinking when there
> are no outstanding rehashes.  So maybe we could restore this feature
> after all.

OK. Maybe this patch should be posted in the context of enabling
multiple rehashes then. It is difficult to review without having
the full context. This connection was not clear to me from the
commit message.

I have yet to understand the implications of multiple rehashes.
The idea of having to traverse N tables for each insert, removal
and lookup in a pressure situation is still frightening.

I would like to compare it with an exponential growth logic.
Eventually both approaches can be combined to limit the chain
length of rehashes.
Herbert Xu March 23, 2015, 10:19 a.m. UTC | #14
On Mon, Mar 23, 2015 at 10:08:52AM +0000, Thomas Graf wrote:
>
> I have yet to understand the implications of multiple rehashes.
> The idea of having to traverse N tables for each insert, removal
> and lookup in a pressure situation is still frightening.

It's simple.  With only expansion, each table in the rehash chain is
strictly larger than its predecessor, so you can never have more than
log N tables.

Cheers,
David Miller March 23, 2015, 4:44 p.m. UTC | #15
From: Thomas Graf <tgraf@suug.ch>
Date: Mon, 23 Mar 2015 08:33:19 +0000

> I don't get why almost nobody would want shrinking. I agree that for
> tables like TCP hash tables, once you have grown you want to keep that
> table size because the load is likely to come back. But we will also
> have lots of users such as the Netlink socket with a table per protocol
> where not shrinking results in giving the user the ability to waste
> memory indefinitely for no gain.

The user can't do this with TCP?  Why is only netlink susceptible?

The only plausible argument for shrinking I've ever heard of is the
nft_hash case, and there that code can _explicitly_ ask for a shrink
after it has made a major table modification.

That puts all of the smarts for when to shrink where the knowledge
resides, and in this case that's the user.
David Miller March 23, 2015, 4:45 p.m. UTC | #16
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 23 Mar 2015 20:28:07 +1100

> You never want automatic shrinking unless all your users are
> trusted.

+1
Herbert Xu March 23, 2015, 9:48 p.m. UTC | #17
On Mon, Mar 23, 2015 at 12:44:34PM -0400, David Miller wrote:
>
> The user can't do this with TCP?  Why is only netlink susceptible?
> 
> The only plausible argument for shrinking I've ever heard of is the
> nft_hash case, and there that code can _explicitly_ ask for a shrink
> after it has made a major table modification.
> 
> That puts all of the smarts for when to shrink where the knowledge
> resides, and in this case that's the user.

One thing I have to say is that automatic shrinking is a really
good stress test, as re-enabling it allowed me to quickly identify
two bugs in the final patch :)

Other than that I totally agree that it should be disabled by
default as otherwise it increases the amortised cost of the hash
table for a paltry saving in memory.

We can do it afterwards once we're sure this whole thing is stable.

Cheers,
Thomas Graf March 23, 2015, 10:13 p.m. UTC | #18
On 03/23/15 at 12:44pm, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Mon, 23 Mar 2015 08:33:19 +0000
> 
> > I don't get why almost nobody would want shrinking. I agree that for
> > tables like TCP hash tables, once you have grown you want to keep that
> > table size because the load is likely to come back. But we will also
> > have lots of users such as the Netlink socket with a table per protocol
> > where not shrinking results in giving the user the ability to waste
> > memory indefinitely for no gain.
> 
> The user can't do this with TCP?  Why is only netlink susceptible?

You are right. Any table that doesn't shrink will eventually waste
memory. I used TCP vs Netlink because I believe it represents the
difference in priorities very well. TCP may go from 0 to 1M flows
within a fraction of a second, so if you've seen that many flows
before, you might get hit again, and you prioritize being ready to
handle that over eventually wasting the memory indefinitely.

Whereas with Netlink it seems (glad to be proven wrong) that the
need for instant growth is smaller, as it takes time to create 1M
sockets across many PIDs. So we gain something by releasing the
resources when they are not needed.

> The only plausible argument for shrinking I've ever heard of is the
> nft_hash case, and there that code can _explicitly_ ask for a shrink
> after it has made a major table modification.
> 
> That puts all of the smarts for when to shrink where the knowledge
> resides, and in this case that's the user.

My argument in favour of automatic shrinking is simplicity. I'm
absolutely fine with disabling it by default and requiring it to
be enabled explicitly.

Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 2a98b95..44aa579 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -252,19 +252,6 @@  static inline bool rht_grow_above_75(const struct rhashtable *ht,
 	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
-/**
- * rht_shrink_below_30 - returns true if nelems < 0.3 * table-size
- * @ht:		hash table
- * @tbl:	current table
- */
-static inline bool rht_shrink_below_30(const struct rhashtable *ht,
-				       const struct bucket_table *tbl)
-{
-	/* Shrink table beneath 30% load */
-	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > ht->p.min_size;
-}
-
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -745,8 +732,6 @@  static inline int rhashtable_remove_fast(
 		goto out;
 
 	atomic_dec(&ht->nelems);
-	if (rht_shrink_below_30(ht, tbl))
-		schedule_work(&ht->run_work);
 
 out:
 	rcu_read_unlock();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 798f01d..08a6123 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -261,30 +261,48 @@  EXPORT_SYMBOL_GPL(rhashtable_expand);
  * rhashtable_shrink - Shrink hash table while allowing concurrent lookups
  * @ht:		the hash table to shrink
  *
- * This function may only be called in a context where it is safe to call
- * synchronize_rcu(), e.g. not within a rcu_read_lock() section.
- *
- * The caller must ensure that no concurrent resizing occurs by holding
- * ht->mutex.
- *
- * The caller must ensure that no concurrent table mutations take place.
- * It is however valid to have concurrent lookups if they are RCU protected.
+ * This function shrinks the hash table to fit, i.e., to the smallest
+ * size that would not cause it to expand right away automatically.
  *
  * It is valid to have concurrent insertions and deletions protected by per
  * bucket locks or concurrent RCU protected lookups and traversals.
  */
 int rhashtable_shrink(struct rhashtable *ht)
 {
-	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
+	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);
+	struct bucket_table *new_tbl;
+	struct bucket_table *tbl;
+	int err;
 
-	ASSERT_RHT_MUTEX(ht);
+	if (size < ht->p.min_size)
+		size = ht->p.min_size;
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
+	new_tbl = bucket_table_alloc(ht, size);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
+	err = -EEXIST;
+
+	mutex_lock(&ht->mutex);
+	tbl = rht_dereference(ht->tbl, ht);
+	if (rht_dereference(tbl->future_tbl, ht))
+		goto out_free_tbl;
+
+	err = 0;
+	if (tbl->size <= size)
+		goto out_free_tbl;
+
 	rhashtable_rehash(ht, new_tbl);
-	return 0;
+
+	mutex_unlock(&ht->mutex);
+
+out:
+	return err;
+
+out_free_tbl:
+	mutex_unlock(&ht->mutex);
+	bucket_table_free(new_tbl);
+	goto out;
 }
 EXPORT_SYMBOL_GPL(rhashtable_shrink);
 
@@ -302,8 +320,6 @@  static void rht_deferred_worker(struct work_struct *work)
 
 	if (rht_grow_above_75(ht, tbl))
 		rhashtable_expand(ht);
-	else if (rht_shrink_below_30(ht, tbl))
-		rhashtable_shrink(ht);
 unlock:
 	mutex_unlock(&ht->mutex);
 }
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index a2ba6ad..0ceb332 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -167,17 +167,13 @@  static int __init test_rhashtable(struct rhashtable *ht)
 		rcu_read_unlock();
 	}
 
-	for (i = 0; i < TEST_NEXPANDS; i++) {
-		pr_info("  Table shrinkage iteration %u...\n", i);
-		mutex_lock(&ht->mutex);
-		rhashtable_shrink(ht);
-		mutex_unlock(&ht->mutex);
+	pr_info("  Table shrinkage...\n");
+	rhashtable_shrink(ht);
 
-		rcu_read_lock();
-		pr_info("  Verifying lookups...\n");
-		test_rht_lookup(ht);
-		rcu_read_unlock();
-	}
+	rcu_read_lock();
+	pr_info("  Verifying lookups...\n");
+	test_rht_lookup(ht);
+	rcu_read_unlock();
 
 	rcu_read_lock();
 	test_bucket_stats(ht, true);