diff mbox

rhashtable: Prevent spurious EBUSY errors on insertion

Message ID 20151203124129.GA5505@gondor.apana.org.au
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Herbert Xu Dec. 3, 2015, 12:41 p.m. UTC
On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> 
> OK that's better.  I think I see the problem.  The test in
> rhashtable_insert_rehash is racy and if two threads both try
> to grow the table one of them may be tricked into doing a rehash
> instead.
> 
> I'm working on a fix.

OK this patch fixes the EBUSY problem as far as I can tell.  Please
let me know if you still observe EBUSY with it.  I'll respond to the
ENOMEM problem in another email.

---8<---
Thomas and Phil observed that under stress rhashtable insertion
sometimes failed with EBUSY, even though this error should only
ever been seen when we're under attack and our hash chain length
has grown to an unacceptable level, even after a rehash.

It turns out that the logic for detecting whether there is an
existing rehash is faulty.  In particular, when two threads both
try to grow the same table at the same time, one of them may see
the newly grown table and thus erroneously conclude that it had
been rehashed.  This is what leads to the EBUSY error.

This patch fixes this by remembering the current last table we
used during insertion so that rhashtable_insert_rehash can detect
when another thread has also done a resize/rehash.  When this is
detected we will give up our resize/rehash and simply retry the
insertion with the new table.

Reported-by: Thomas Graf <tgraf@suug.ch>
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Phil Sutter Dec. 3, 2015, 3:38 p.m. UTC | #1
On Thu, Dec 03, 2015 at 08:41:29PM +0800, Herbert Xu wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> > 
> > OK that's better.  I think I see the problem.  The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> > 
> > I'm working on a fix.
> 
> OK this patch fixes the EBUSY problem as far as I can tell.  Please
> let me know if you still observe EBUSY with it.  I'll respond to the
> ENOMEM problem in another email.
> 
> ---8<---
> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
> 
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty.  In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed.  This is what leads to the EBUSY error.
> 
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash.  When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
> 
> Reported-by: Thomas Graf <tgraf@suug.ch>
> Reported-by: Phil Sutter <phil@nwl.cc>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Tested-by: Phil Sutter <phil@nwl.cc>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 4, 2015, 7:38 p.m. UTC | #2
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 3 Dec 2015 20:41:29 +0800

> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
> 
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty.  In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed.  This is what leads to the EBUSY error.
> 
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash.  When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
> 
> Reported-by: Thomas Graf <tgraf@suug.ch>
> Reported-by: Phil Sutter <phil@nwl.cc>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Looks good, applied, thanks Herbert.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xin Long Dec. 17, 2015, 8:46 a.m. UTC | #3
On Thu, Dec 3, 2015 at 8:41 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
>>
>> OK that's better.  I think I see the problem.  The test in
>> rhashtable_insert_rehash is racy and if two threads both try
>> to grow the table one of them may be tricked into doing a rehash
>> instead.
>>
>> I'm working on a fix.
>
> OK this patch fixes the EBUSY problem as far as I can tell.  Please
> let me know if you still observe EBUSY with it.  I'll respond to the
> ENOMEM problem in another email.
>
> ---8<---
> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever been seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
>
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty.  In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed.  This is what leads to the EBUSY error.
>
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash.  When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
>
> Reported-by: Thomas Graf <tgraf@suug.ch>
> Reported-by: Phil Sutter <phil@nwl.cc>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 843ceca..e50b31d 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -19,6 +19,7 @@
>
>  #include <linux/atomic.h>
>  #include <linux/compiler.h>
> +#include <linux/err.h>
>  #include <linux/errno.h>
>  #include <linux/jhash.h>
>  #include <linux/list_nulls.h>
> @@ -339,10 +340,11 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
>  int rhashtable_init(struct rhashtable *ht,
>                     const struct rhashtable_params *params);
>
> -int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> -                          struct rhash_head *obj,
> -                          struct bucket_table *old_tbl);
> -int rhashtable_insert_rehash(struct rhashtable *ht);
> +struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
> +                                           const void *key,
> +                                           struct rhash_head *obj,
> +                                           struct bucket_table *old_tbl);
> +int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
>
>  int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
>  void rhashtable_walk_exit(struct rhashtable_iter *iter);
> @@ -598,9 +600,11 @@ restart:
>
>         new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
>         if (unlikely(new_tbl)) {
> -               err = rhashtable_insert_slow(ht, key, obj, new_tbl);
> -               if (err == -EAGAIN)
> +               tbl = rhashtable_insert_slow(ht, key, obj, new_tbl);
> +               if (!IS_ERR_OR_NULL(tbl))
>                         goto slow_path;
> +
> +               err = PTR_ERR(tbl);
>                 goto out;
>         }
>
> @@ -611,7 +615,7 @@ restart:
>         if (unlikely(rht_grow_above_100(ht, tbl))) {
>  slow_path:
>                 spin_unlock_bh(lock);
> -               err = rhashtable_insert_rehash(ht);
> +               err = rhashtable_insert_rehash(ht, tbl);
>                 rcu_read_unlock();
>                 if (err)
>                         return err;
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index a54ff89..2ff7ed9 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -389,33 +389,31 @@ static bool rhashtable_check_elasticity(struct rhashtable *ht,
>         return false;
>  }
>
> -int rhashtable_insert_rehash(struct rhashtable *ht)
> +int rhashtable_insert_rehash(struct rhashtable *ht,
> +                            struct bucket_table *tbl)
>  {
>         struct bucket_table *old_tbl;
>         struct bucket_table *new_tbl;
> -       struct bucket_table *tbl;
>         unsigned int size;
>         int err;
>
>         old_tbl = rht_dereference_rcu(ht->tbl, ht);
> -       tbl = rhashtable_last_table(ht, old_tbl);
>
>         size = tbl->size;
>
> +       err = -EBUSY;
> +
>         if (rht_grow_above_75(ht, tbl))
>                 size *= 2;
>         /* Do not schedule more than one rehash */
>         else if (old_tbl != tbl)
> -               return -EBUSY;
> +               goto fail;
> +
> +       err = -ENOMEM;
>
>         new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
> -       if (new_tbl == NULL) {
> -               /* Schedule async resize/rehash to try allocation
> -                * non-atomic context.
> -                */
> -               schedule_work(&ht->run_work);
> -               return -ENOMEM;
> -       }
> +       if (new_tbl == NULL)
> +               goto fail;
>
>         err = rhashtable_rehash_attach(ht, tbl, new_tbl);
>         if (err) {
> @@ -426,12 +424,24 @@ int rhashtable_insert_rehash(struct rhashtable *ht)
>                 schedule_work(&ht->run_work);
>
>         return err;
> +
> +fail:
> +       /* Do not fail the insert if someone else did a rehash. */
> +       if (likely(rcu_dereference_raw(tbl->future_tbl)))
> +               return 0;
> +
> +       /* Schedule async rehash to retry allocation in process context. */
> +       if (err == -ENOMEM)
> +               schedule_work(&ht->run_work);
> +
> +       return err;
>  }
>  EXPORT_SYMBOL_GPL(rhashtable_insert_rehash);
>
> -int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
> -                          struct rhash_head *obj,
> -                          struct bucket_table *tbl)
> +struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
> +                                           const void *key,
> +                                           struct rhash_head *obj,
> +                                           struct bucket_table *tbl)
>  {
>         struct rhash_head *head;
>         unsigned int hash;
> @@ -467,7 +477,12 @@ int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
>  exit:
>         spin_unlock(rht_bucket_lock(tbl, hash));
>
> -       return err;
> +       if (err == 0)
> +               return NULL;
> +       else if (err == -EAGAIN)
> +               return tbl;
> +       else
> +               return ERR_PTR(err);
>  }
>  EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
>

sorry for late test, but unfortunately, my case with rhashtalbe still
return EBUSY.
I added some debug code in rhashtable_insert_rehash(), and found:
*future_tbl is null*

fail:
        /* Do not fail the insert if someone else did a rehash. */
        if (likely(rcu_dereference_raw(tbl->future_tbl))) {
                printk("future_tbl is there\n");
                return 0;
        } else {
                printk("future_tbl is null\n");
        }

any idea why ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 17, 2015, 8:48 a.m. UTC | #4
On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>
> sorry for late test, but unfortunately, my case with rhashtalbe still
> return EBUSY.
> I added some debug code in rhashtable_insert_rehash(), and found:
> *future_tbl is null*
> 
> fail:
>         /* Do not fail the insert if someone else did a rehash. */
>         if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>                 printk("future_tbl is there\n");
>                 return 0;
>         } else {
>                 printk("future_tbl is null\n");
>         }
> 
> any idea why ?

That's presumably because you got a genuine double rehash.

Until you post your code we can't really help you.

Cheers,
Xin Long Dec. 17, 2015, 9 a.m. UTC | #5
On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>
>> sorry for late test, but unfortunately, my case with rhashtalbe still
>> return EBUSY.
>> I added some debug code in rhashtable_insert_rehash(), and found:
>> *future_tbl is null*
>>
>> fail:
>>         /* Do not fail the insert if someone else did a rehash. */
>>         if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>>                 printk("future_tbl is there\n");
>>                 return 0;
>>         } else {
>>                 printk("future_tbl is null\n");
>>         }
>>
>> any idea why ?
>
> That's presumably because you got a genuine double rehash.
>
> Until you post your code we can't really help you.
>
i wish i could , but my codes is a big patch for sctp, and this issue
happens in a special stress test based on this patch.
im trying to think how i can show you. :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xin Long Dec. 17, 2015, 4:07 p.m. UTC | #6
On Thu, Dec 17, 2015 at 5:00 PM, Xin Long <lucien.xin@gmail.com> wrote:
> On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>>
>>> sorry for late test, but unfortunately, my case with rhashtalbe still
>>> return EBUSY.
>>> I added some debug code in rhashtable_insert_rehash(), and found:
>>> *future_tbl is null*
>>>
>>> fail:
>>>         /* Do not fail the insert if someone else did a rehash. */
>>>         if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>>>                 printk("future_tbl is there\n");
>>>                 return 0;
>>>         } else {
>>>                 printk("future_tbl is null\n");
>>>         }
>>>
>>> any idea why ?
>>
>> That's presumably because you got a genuine double rehash.
>>
>> Until you post your code we can't really help you.
>>
> i wish i could , but my codes is a big patch for sctp, and this issue
> happens in a special stress test based on this patch.
> im trying to think how i can show you. :)

I'm just wondering, why do not we handle the genuine double rehash
issue inside rhashtable? i mean it's just a temporary error that a
simple retry may fix it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 17, 2015, 5 p.m. UTC | #7
From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 17 Dec 2015 17:00:35 +0800

> On Thu, Dec 17, 2015 at 4:48 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, Dec 17, 2015 at 04:46:00PM +0800, Xin Long wrote:
>>>
>>> sorry for late test, but unfortunately, my case with rhashtalbe still
>>> return EBUSY.
>>> I added some debug code in rhashtable_insert_rehash(), and found:
>>> *future_tbl is null*
>>>
>>> fail:
>>>         /* Do not fail the insert if someone else did a rehash. */
>>>         if (likely(rcu_dereference_raw(tbl->future_tbl))) {
>>>                 printk("future_tbl is there\n");
>>>                 return 0;
>>>         } else {
>>>                 printk("future_tbl is null\n");
>>>         }
>>>
>>> any idea why ?
>>
>> That's presumably because you got a genuine double rehash.
>>
>> Until you post your code we can't really help you.
>>
> i wish i could , but my codes is a big patch for sctp, and this issue
> happens in a special stress test based on this patch.
> im trying to think how i can show you. :)

Simply post it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 18, 2015, 2:26 a.m. UTC | #8
On Fri, Dec 18, 2015 at 12:07:08AM +0800, Xin Long wrote:
>
> I'm just wondering, why do not we handle the genuine double rehash
> issue inside rhashtable? i mean it's just a temporary error that a
> simple retry may fix it.

Because a double rehash means that someone has cracked your hash
function and there is no point in trying anymore.

Cheers,
Xin Long Dec. 18, 2015, 8:18 a.m. UTC | #9
On Fri, Dec 18, 2015 at 10:26 AM, Herbert Xu
<herbert@gondor.apana.org.au> wrote:
> On Fri, Dec 18, 2015 at 12:07:08AM +0800, Xin Long wrote:
>>
>> I'm just wondering, why do not we handle the genuine double rehash
>> issue inside rhashtable? i mean it's just a temporary error that a
>> simple retry may fix it.
>
> Because a double rehash means that someone has cracked your hash
> function and there is no point in trying anymore.

ok, get your point, is it possible to be triggered by some cases under
a big stress insertion, but they are all legal cases. like we use rhash in
nftables, if there are a big batch sets to insert, may this issue happen?

>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 843ceca..e50b31d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -19,6 +19,7 @@ 
 
 #include <linux/atomic.h>
 #include <linux/compiler.h>
+#include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/jhash.h>
 #include <linux/list_nulls.h>
@@ -339,10 +340,11 @@  static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
 int rhashtable_init(struct rhashtable *ht,
 		    const struct rhashtable_params *params);
 
-int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
-			   struct rhash_head *obj,
-			   struct bucket_table *old_tbl);
-int rhashtable_insert_rehash(struct rhashtable *ht);
+struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
+					    const void *key,
+					    struct rhash_head *obj,
+					    struct bucket_table *old_tbl);
+int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
 
 int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
 void rhashtable_walk_exit(struct rhashtable_iter *iter);
@@ -598,9 +600,11 @@  restart:
 
 	new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
 	if (unlikely(new_tbl)) {
-		err = rhashtable_insert_slow(ht, key, obj, new_tbl);
-		if (err == -EAGAIN)
+		tbl = rhashtable_insert_slow(ht, key, obj, new_tbl);
+		if (!IS_ERR_OR_NULL(tbl))
 			goto slow_path;
+
+		err = PTR_ERR(tbl);
 		goto out;
 	}
 
@@ -611,7 +615,7 @@  restart:
 	if (unlikely(rht_grow_above_100(ht, tbl))) {
 slow_path:
 		spin_unlock_bh(lock);
-		err = rhashtable_insert_rehash(ht);
+		err = rhashtable_insert_rehash(ht, tbl);
 		rcu_read_unlock();
 		if (err)
 			return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a54ff89..2ff7ed9 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -389,33 +389,31 @@  static bool rhashtable_check_elasticity(struct rhashtable *ht,
 	return false;
 }
 
-int rhashtable_insert_rehash(struct rhashtable *ht)
+int rhashtable_insert_rehash(struct rhashtable *ht,
+			     struct bucket_table *tbl)
 {
 	struct bucket_table *old_tbl;
 	struct bucket_table *new_tbl;
-	struct bucket_table *tbl;
 	unsigned int size;
 	int err;
 
 	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-	tbl = rhashtable_last_table(ht, old_tbl);
 
 	size = tbl->size;
 
+	err = -EBUSY;
+
 	if (rht_grow_above_75(ht, tbl))
 		size *= 2;
 	/* Do not schedule more than one rehash */
 	else if (old_tbl != tbl)
-		return -EBUSY;
+		goto fail;
+
+	err = -ENOMEM;
 
 	new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
-	if (new_tbl == NULL) {
-		/* Schedule async resize/rehash to try allocation
-		 * non-atomic context.
-		 */
-		schedule_work(&ht->run_work);
-		return -ENOMEM;
-	}
+	if (new_tbl == NULL)
+		goto fail;
 
 	err = rhashtable_rehash_attach(ht, tbl, new_tbl);
 	if (err) {
@@ -426,12 +424,24 @@  int rhashtable_insert_rehash(struct rhashtable *ht)
 		schedule_work(&ht->run_work);
 
 	return err;
+
+fail:
+	/* Do not fail the insert if someone else did a rehash. */
+	if (likely(rcu_dereference_raw(tbl->future_tbl)))
+		return 0;
+
+	/* Schedule async rehash to retry allocation in process context. */
+	if (err == -ENOMEM)
+		schedule_work(&ht->run_work);
+
+	return err;
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert_rehash);
 
-int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
-			   struct rhash_head *obj,
-			   struct bucket_table *tbl)
+struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
+					    const void *key,
+					    struct rhash_head *obj,
+					    struct bucket_table *tbl)
 {
 	struct rhash_head *head;
 	unsigned int hash;
@@ -467,7 +477,12 @@  int rhashtable_insert_slow(struct rhashtable *ht, const void *key,
 exit:
 	spin_unlock(rht_bucket_lock(tbl, hash));
 
-	return err;
+	if (err == 0)
+		return NULL;
+	else if (err == -EAGAIN)
+		return tbl;
+	else
+		return ERR_PTR(err);
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert_slow);