Message ID    | E1YZard-0000wE-Se@gondolin.me.apana.org.au
State         | Changes Requested, archived
Delegated to: | David Miller
On 03/22/15 at 07:04pm, Herbert Xu wrote:
> Automatic shrinking is dangerous because it provides an easy
> way for an adversary to cause us to do unnecessary work. Thus
> making the resizable hashtable a poor data structure.
>
> This patch disables automatic shrinking but retains a manual
> shrink function for those cases where insertions and removals
> are overseen by a trusted entity, e.g., nft_hash.

This is misleading. I agree that unconditional shrinking is dangerous.
Shrinking was an optional feature disabled by default before. The
inlining enabled it by default for all users. What is the benefit of
requiring this logic outside of rhashtable over just adding a flag to
enable shrinking at 30% utilization?

> The shrink function will now also shrink to fit rather than halve
> the size of the table.

I like this part a lot.

> int rhashtable_shrink(struct rhashtable *ht)
> {
> -	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
> +	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);

If rhashtable_shrink() is called near the 75% border it will cause an
immediate expansion again. Maybe make this * 3 / 2 so we shrink near
30% utilization as before?

> +	struct bucket_table *new_tbl;
> +	struct bucket_table *tbl;
> +	int err;
>
> -	ASSERT_RHT_MUTEX(ht);
> +	if (size < ht->p.min_size)
> +		size = ht->p.min_size;

We should only shrink if size < old_tbl->size.

> -	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
> +	new_tbl = bucket_table_alloc(ht, size);
> 	if (new_tbl == NULL)
> 		return -ENOMEM;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/22/15 at 12:17pm, Thomas Graf wrote:
> On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > +	struct bucket_table *new_tbl;
> > +	struct bucket_table *tbl;
> > +	int err;
> >
> > -	ASSERT_RHT_MUTEX(ht);
> > +	if (size < ht->p.min_size)
> > +		size = ht->p.min_size;
>
> We should only shrink if size < old_tbl->size.

I found the check further down. Any particular reason to check after
the allocation and then free again? Why do you want to avoid the
allocation inside the mutex?

> > -	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
> > +	new_tbl = bucket_table_alloc(ht, size);
> > 	if (new_tbl == NULL)
> > 		return -ENOMEM;
On Sun, Mar 22, 2015 at 01:06:30PM +0000, Thomas Graf wrote:
> On 03/22/15 at 12:17pm, Thomas Graf wrote:
> > On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > > +	struct bucket_table *new_tbl;
> > > +	struct bucket_table *tbl;
> > > +	int err;
> > >
> > > -	ASSERT_RHT_MUTEX(ht);
> > > +	if (size < ht->p.min_size)
> > > +		size = ht->p.min_size;
> >
> > We should only shrink if size < old_tbl->size.
>
> I found the check further down. Any particular reason to check after
> the allocation and then free again? Why do you want to avoid the
> allocation inside the mutex?

It's just quality of code. You should always try to minimise the
locked sections.

Cheers,
On Sun, Mar 22, 2015 at 12:17:55PM +0000, Thomas Graf wrote:
>
> This is misleading. I agree that unconditional shrinking is dangerous.
> Shrinking was an optional feature disabled by default before.

How was shrinking disabled before? AFAICS it always kicked in at
30%.

> The inlining enabled it by default for all users. What is the benefit
> of requiring this logic outside of rhashtable over just adding a flag
> to enable shrinking at 30% utilization?

That would be adding an extra branch on the fast-path for an
operation which almost nobody needs.

> > int rhashtable_shrink(struct rhashtable *ht)
> > {
> > -	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
> > +	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);
>
> If rhashtable_shrink() is called near the 75% border it will cause an
> immediate expansion again. Maybe make this * 3 / 2 so we shrink near
> 30% utilization as before?

Sure, that makes sense.

Cheers,
On 03/23/15 at 11:09am, Herbert Xu wrote:
> On Sun, Mar 22, 2015 at 12:17:55PM +0000, Thomas Graf wrote:
> >
> > This is misleading. I agree that unconditional shrinking is dangerous.
> > Shrinking was an optional feature disabled by default before.
>
> How was shrinking disabled before? AFAICS it always kicked in at
> 30%.

Before Daniel removed the indirection, due to all callers enabling
shrinking by default ;-) It was clear that some future users would
eventually not want shrinking and thus require a conditional.

> > The inlining enabled it by default for all users. What is the benefit
> > of requiring this logic outside of rhashtable over just adding a flag
> > to enable shrinking at 30% utilization?
>
> That would be adding an extra branch on the fast-path for an
> operation which almost nobody needs.

I don't get why almost nobody would want shrinking. I agree that for
tables like TCP hash tables, once you have grown you want to keep that
table size because the load is likely to come back. But we will also
have lots of users such as the Netlink socket with a table per protocol
where not shrinking results in giving the user the ability to waste
memory indefinitely for no gain.

I'm not claiming you always want shrinking but what gain is there by
removing integrated support? Can you show numbers that the additional
branch actually hurts?
On 03/23/15 at 11:07am, Herbert Xu wrote:
> On Sun, Mar 22, 2015 at 01:06:30PM +0000, Thomas Graf wrote:
> > On 03/22/15 at 12:17pm, Thomas Graf wrote:
> > > On 03/22/15 at 07:04pm, Herbert Xu wrote:
> > > > +	struct bucket_table *new_tbl;
> > > > +	struct bucket_table *tbl;
> > > > +	int err;
> > > >
> > > > -	ASSERT_RHT_MUTEX(ht);
> > > > +	if (size < ht->p.min_size)
> > > > +		size = ht->p.min_size;
> > >
> > > We should only shrink if size < old_tbl->size.
> >
> > I found the check further down. Any particular reason to check after
> > the allocation and then free again? Why do you want to avoid the
> > allocation inside the mutex?
>
> It's just quality of code. You should always try to minimise the
> locked sections.

So do you expect the user to replicate the new table size calculation
outside of rhashtable_shrink() to avoid the cost of a possibly massive
memory allocation even if no shrinking will take place?

I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
cheaply get the current table size and only do the allocation and take
the lock if the table size warrants shrinking.
On Mon, Mar 23, 2015 at 08:33:19AM +0000, Thomas Graf wrote:
>
> I'm not claiming you always want shrinking but what gain is there by
> removing integrated support? Can you show numbers that the additional
> branch actually hurts?

You never want automatic shrinking unless all your users are
trusted. I doubt there would be many rhashtable users where
this would apply. Even nft_hash is quite tenuous.

Besides, if you really want automatic shrinking, you could always
do it in the caller of rhashtable_remove. That way only you
would pay for the cost and not everybody else.

Cheers,
On Mon, Mar 23, 2015 at 08:37:12AM +0000, Thomas Graf wrote:
>
> I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
> cheaply get the current table size and only do the allocation and take
> the lock if the table size warrants shrinking.

Well, you should never invoke rhashtable_shrink unless you actually
want to shrink. So this is something that you should have checked
before rhashtable_shrink is called.

Cheers,
On 03/23/15 at 08:28pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:33:19AM +0000, Thomas Graf wrote:
> >
> > I'm not claiming you always want shrinking but what gain is there by
> > removing integrated support? Can you show numbers that the additional
> > branch actually hurts?
>
> You never want automatic shrinking unless all your users are
> trusted. I doubt there would be many rhashtable users where
> this would apply. Even nft_hash is quite tenuous.

Why?

> Besides, if you really want automatic shrinking, you could always
> do it in the caller of rhashtable_remove. That way only you
> would pay for the cost and not everybody else.

The same can be said for growing. Why do we differ between the two?
Would you expect users requiring shrinking to call rhashtable_shrink()
after every remove? Should they encode their own logic based on
rhashtable internals?
On Mon, Mar 23, 2015 at 09:36:32AM +0000, Thomas Graf wrote:
>
> > You never want automatic shrinking unless all your users are
> > trusted. I doubt there would be many rhashtable users where
> > this would apply. Even nft_hash is quite tenuous.
>
> Why?

Because with multiple rehashing it's quite easy to convert your hash
table into a linked list by repeatedly growing and shrinking.

Multiple rehashing simply cannot work unless you get rid of automatic
shrinking for the untrusted case.

> The same can be said for growing. Why do we differ between the two?
> Would you expect users requiring shrinking to call rhashtable_shrink()
> after every remove? Should they encode their own logic based on
> rhashtable internals?

No, we must be able to grow or insertions will fail. The same cannot
be said for shrinking.

Cheers,
On 03/23/15 at 08:29pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:37:12AM +0000, Thomas Graf wrote:
> >
> > I think rhashtable_shrink() should fetch ht->tbl in an RCU section to
> > cheaply get the current table size and only do the allocation and take
> > the lock if the table size warrants shrinking.
>
> Well, you should never invoke rhashtable_shrink unless you actually
> want to shrink. So this is something that you should have checked
> before rhashtable_shrink is called.

How? The calculation of the table size is embedded in
rhashtable_shrink(). Should every user have a copy of that calculation
algorithm? Why not just:

	unlikely(ht->p.shrink && rht_shrink_below_30(..))

If you really care about that additional conditional we can also add:

static inline int rhashtable_remove_and_shrink()
{
	int err;

	rcu_read_lock();
	tbl = rht_dereference_rcu(ht->tbl, ht);
	err = rhashtable_remove_fast();
	if (unlikely(!err && rht_shrink_below_30(ht, tbl)))
		schedule_work(&ht->run_work);
	rcu_read_unlock();

	return err;
}

I just think it's wrong to rip out all the shrinking logic and require
every single user to re-add its own copy.
On Mon, Mar 23, 2015 at 08:39:52PM +1100, Herbert Xu wrote:
>
> Because with multiple rehashing it's quite easy to convert your
> hash table into a linked list by repeatedly growing and shrinking.
>
> Multiple rehashing simply cannot work unless you get rid of automatic
> shrinking for the untrusted case.

Actually, what I could do is allow automatic shrinking when there are
no outstanding rehashes. So maybe we could restore this feature after
all.

Cheers,
On 03/23/15 at 08:44pm, Herbert Xu wrote:
> On Mon, Mar 23, 2015 at 08:39:52PM +1100, Herbert Xu wrote:
> >
> > Because with multiple rehashing it's quite easy to convert your
> > hash table into a linked list by repeatedly growing and shrinking.
> >
> > Multiple rehashing simply cannot work unless you get rid of automatic
> > shrinking for the untrusted case.
>
> Actually, what I could do is allow automatic shrinking when there are
> no outstanding rehashes. So maybe we could restore this feature after
> all.

OK. Maybe this patch should be posted in the context of enabling
multiple rehashes then. It is difficult to review without the full
context; this correlation was not clear to me from the commit message.

I have yet to understand the implications of multiple rehashes.
The idea of having to traverse N tables for each insert, removal
and lookup in a pressure situation is still frightening. I would like
to compare it with an exponential growing logic. Eventually both
approaches can be combined to limit the chain length of rehashes.
On Mon, Mar 23, 2015 at 10:08:52AM +0000, Thomas Graf wrote:
>
> I have yet to understand the implications of multiple rehashes.
> The idea of having to traverse N tables for each insert, removal
> and lookup in a pressure situation is still frightening.

It's simple. With only expansion you can never have more than log N
tables.

Cheers,
From: Thomas Graf <tgraf@suug.ch>
Date: Mon, 23 Mar 2015 08:33:19 +0000

> I don't get why almost nobody would want shrinking. I agree that for
> tables like TCP hash tables, once you have grown you want to keep that
> table size because the load is likely to come back. But we will also
> have lots of users such as the Netlink socket with a table per protocol
> where not shrinking results in giving the user the ability to waste
> memory indefinitely for no gain.

The user can't do this with TCP? Why is only netlink susceptible?

The only plausible argument for shrinking I've ever heard of is the
nft_hash case, and there that code can _explicitly_ ask for a shrink
after it has made a major table modification.

That puts all of the smarts for when to shrink where the knowledge
resides, and in this case that's the user.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 23 Mar 2015 20:28:07 +1100

> You never want automatic shrinking unless all your users are
> trusted.

+1
On Mon, Mar 23, 2015 at 12:44:34PM -0400, David Miller wrote:
>
> The user can't do this with TCP? Why is only netlink susceptible?
>
> The only plausible argument for shrinking I've ever heard of is the
> nft_hash case, and there that code can _explicitly_ ask for a shrink
> after it has made a major table modification.
>
> That puts all of the smarts for when to shrink where the knowledge
> resides, and in this case that's the user.

One thing I have to say is that automatic shrinking is a really good
stress test, as reenabling it allowed me to quickly identify two bugs
in the final patch :)

Other than that, I totally agree that it should be disabled by default,
as otherwise it increases the amortised cost of the hash table for a
paltry saving in memory. We can do it afterwards once we're sure this
whole thing is stable.

Cheers,
On 03/23/15 at 12:44pm, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Mon, 23 Mar 2015 08:33:19 +0000
>
> > I don't get why almost nobody would want shrinking. I agree that for
> > tables like TCP hash tables, once you have grown you want to keep that
> > table size because the load is likely to come back. But we will also
> > have lots of users such as the Netlink socket with a table per protocol
> > where not shrinking results in giving the user the ability to waste
> > memory indefinitely for no gain.
>
> The user can't do this with TCP? Why is only netlink susceptible?

You are right. Any table that doesn't shrink will eventually waste
memory. I used TCP vs Netlink because I believe it represents the
difference in priorities very well. TCP may go from 0 to 1M flows
within a fraction of a second, so if you've seen that many flows
before you might get hit again, and you prioritize being ready to
handle that over eventually wasting the memory indefinitely. Whereas
with Netlink it seems (glad to be proven wrong) that the need for
instant growth is lesser, as it takes time to create 1M sockets across
many PIDs. So we gain something by releasing the resources if they are
not needed.

> The only plausible argument for shrinking I've ever heard of is the
> nft_hash case, and there that code can _explicitly_ ask for a shrink
> after it has made a major table modification.
>
> That puts all of the smarts for when to shrink where the knowledge
> resides, and in this case that's the user.

My argument for automatic shrinking is simplicity. I'm absolutely fine
with disabling it by default and requiring it to be enabled
explicitly.
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 2a98b95..44aa579 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -252,19 +252,6 @@ static inline bool rht_grow_above_75(const struct rhashtable *ht,
 	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
-/**
- * rht_shrink_below_30 - returns true if nelems < 0.3 * table-size
- * @ht:		hash table
- * @tbl:	current table
- */
-static inline bool rht_shrink_below_30(const struct rhashtable *ht,
-				       const struct bucket_table *tbl)
-{
-	/* Shrink table beneath 30% load */
-	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > ht->p.min_size;
-}
-
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -745,8 +732,6 @@ static inline int rhashtable_remove_fast(
 		goto out;
 
 	atomic_dec(&ht->nelems);
-	if (rht_shrink_below_30(ht, tbl))
-		schedule_work(&ht->run_work);
 
 out:
 	rcu_read_unlock();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 798f01d..08a6123 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -261,30 +261,48 @@ EXPORT_SYMBOL_GPL(rhashtable_expand);
  * rhashtable_shrink - Shrink hash table while allowing concurrent lookups
  * @ht:		the hash table to shrink
  *
- * This function may only be called in a context where it is safe to call
- * synchronize_rcu(), e.g. not within a rcu_read_lock() section.
- *
- * The caller must ensure that no concurrent resizing occurs by holding
- * ht->mutex.
- *
- * The caller must ensure that no concurrent table mutations take place.
- * It is however valid to have concurrent lookups if they are RCU protected.
+ * This function shrinks the hash table to fit, i.e., the smallest
+ * size would not cause it to expand right away automatically.
  *
  * It is valid to have concurrent insertions and deletions protected by per
  * bucket locks or concurrent RCU protected lookups and traversals.
  */
 int rhashtable_shrink(struct rhashtable *ht)
 {
-	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
+	unsigned size = roundup_pow_of_two(atomic_read(&ht->nelems) * 4 / 3);
+	struct bucket_table *new_tbl;
+	struct bucket_table *tbl;
+	int err;
 
-	ASSERT_RHT_MUTEX(ht);
+	if (size < ht->p.min_size)
+		size = ht->p.min_size;
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
+	new_tbl = bucket_table_alloc(ht, size);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
+	err = -EEXIST;
+
+	mutex_lock(&ht->mutex);
+	tbl = rht_dereference(ht->tbl, ht);
+	if (rht_dereference(tbl->future_tbl, ht))
+		goto out_free_tbl;
+
+	err = 0;
+	if (tbl->size <= size)
+		goto out_free_tbl;
+
 	rhashtable_rehash(ht, new_tbl);
-	return 0;
+
+	mutex_unlock(&ht->mutex);
+
+out:
+	return err;
+
+out_free_tbl:
+	mutex_unlock(&ht->mutex);
+	bucket_table_free(new_tbl);
+	goto out;
 }
 EXPORT_SYMBOL_GPL(rhashtable_shrink);
 
@@ -302,8 +320,6 @@ static void rht_deferred_worker(struct work_struct *work)
 
 	if (rht_grow_above_75(ht, tbl))
 		rhashtable_expand(ht);
-	else if (rht_shrink_below_30(ht, tbl))
-		rhashtable_shrink(ht);
 
 unlock:
 	mutex_unlock(&ht->mutex);
 }
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index a2ba6ad..0ceb332 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -167,17 +167,13 @@ static int __init test_rhashtable(struct rhashtable *ht)
 		rcu_read_unlock();
 	}
 
-	for (i = 0; i < TEST_NEXPANDS; i++) {
-		pr_info("  Table shrinkage iteration %u...\n", i);
-		mutex_lock(&ht->mutex);
-		rhashtable_shrink(ht);
-		mutex_unlock(&ht->mutex);
+	pr_info("  Table shrinkage...\n");
+	rhashtable_shrink(ht);
 
-		rcu_read_lock();
-		pr_info("  Verifying lookups...\n");
-		test_rht_lookup(ht);
-		rcu_read_unlock();
-	}
+	rcu_read_lock();
+	pr_info("  Verifying lookups...\n");
+	test_rht_lookup(ht);
+	rcu_read_unlock();
 
 	rcu_read_lock();
 	test_bucket_stats(ht, true);
Automatic shrinking is dangerous because it provides an easy
way for an adversary to cause us to do unnecessary work. Thus
making the resizable hashtable a poor data structure.

This patch disables automatic shrinking but retains a manual
shrink function for those cases where insertions and removals
are overseen by a trusted entity, e.g., nft_hash.

The shrink function will now also shrink to fit rather than halve
the size of the table.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/linux/rhashtable.h | 15 ---------------
 lib/rhashtable.c           | 44 ++++++++++++++++++++++++++++++--------------
 lib/test_rhashtable.c      | 16 ++++++----------
 3 files changed, 36 insertions(+), 39 deletions(-)