
[-,revised] rhashtable: detect when object movement might have invalidated a lookup

Message ID 87fu0kt5m0.fsf@notabene.neil.brown.name
State Rejected, archived
Delegated to: David Miller
Series [-,revised] rhashtable: detect when object movement might have invalidated a lookup

Commit Message

NeilBrown July 15, 2018, 11:57 p.m. UTC
Some users of rhashtable might need to change the key
of an object and move it to a different location in the table.
Other users might want to allocate objects using
SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
being used for a different (type-compatible) purpose and similarly
end up in a different hash-chain.

To support these, we store a unique NULLS_MARKER at the end of
each chain, and when a search fails to find a match, we check
if the NULLS marker found was the expected one.  If not,
the search is repeated.

The unique NULLS_MARKER is derived from the address of the
head of the chain.
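
(For reference: NULLS_MARKER() in include/linux/list_nulls.h tags a value
by shifting it left one bit and setting the bottom bit, and
rht_is_a_nulls() tests that bit, so a nulls marker can never be confused
with a word-aligned object pointer:

	/* include/linux/list_nulls.h */
	#define NULLS_MARKER(value) (1UL | (((long)value) << 1))

	/* include/linux/rhashtable.h */
	static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
	{
		return ((unsigned long) ptr & 1);
	}

Combined with the ">> 1" in the new RHT_NULLS_MARKER() below, the marker
stored at the end of a chain is effectively the chain-head address with
its bottom bit set.)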

If an object is removed and re-added to the same hash chain, we won't
notice by looking at the NULLS marker.  In this case we must be sure
that it was not re-added *after* its original location, or a lookup may
incorrectly fail.  The easiest solution is to ensure it is inserted at
the start of the chain.  insert_slow() already does that,
insert_fast() does not.  So this patch changes insert_fast to always
insert at the head of the chain.

Note that such a user must do their own double-checking of
the object found by rhashtable_lookup_fast() after ensuring
mutual exclusion with anything that might change the key, such as
successfully taking a new reference.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h | 35 +++++++++++++++++++++++------------
 lib/rhashtable.c           |  8 +++++---
 2 files changed, 28 insertions(+), 15 deletions(-)

Comments

Herbert Xu July 16, 2018, 12:51 a.m. UTC | #1
On Mon, Jul 16, 2018 at 09:57:11AM +1000, NeilBrown wrote:
> 
> Some users of rhashtable might need to change the key
> of an object and move it to a different location in the table.
> Other users might want to allocate objects using
> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
> being used for a different (type-compatible) purpose and similarly
> end up in a different hash-chain.
> 
> To support these, we store a unique NULLS_MARKER at the end of
> each chain, and when a search fails to find a match, we check
> if the NULLS marker found was the expected one.  If not,
> the search is repeated.
> 
> The unique NULLS_MARKER is derived from the address of the
> head of the chain.
> 
> If an object is removed and re-added to the same hash chain, we won't
> notice by looking at the NULLS marker.  In this case we must be sure
> that it was not re-added *after* its original location, or a lookup may
> incorrectly fail.  The easiest solution is to ensure it is inserted at
> the start of the chain.  insert_slow() already does that,
> insert_fast() does not.  So this patch changes insert_fast to always
> insert at the head of the chain.
> 
> Note that such a user must do their own double-checking of
> the object found by rhashtable_lookup_fast() after ensuring
> mutual exclusion with anything that might change the key, such as
> successfully taking a new reference.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I still don't understand why we need this feature.  The only
existing user of this (which currently doesn't use rhashtable)
does not re-add the reused entry to the same table.  IOW the flow
is always from table A to table B.  After which the entry will
be properly freed rather than reused.

So who is going to use this?

Cheers,
NeilBrown July 16, 2018, 1:23 a.m. UTC | #2
On Mon, Jul 16 2018, Herbert Xu wrote:

> On Mon, Jul 16, 2018 at 09:57:11AM +1000, NeilBrown wrote:
>> 
>> Some users of rhashtable might need to change the key
>> of an object and move it to a different location in the table.
>> Other users might want to allocate objects using
>> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
>> being used for a different (type-compatible) purpose and similarly
>> end up in a different hash-chain.
>> 
>> To support these, we store a unique NULLS_MARKER at the end of
>> each chain, and when a search fails to find a match, we check
>> if the NULLS marker found was the expected one.  If not,
>> the search is repeated.
>> 
>> The unique NULLS_MARKER is derived from the address of the
>> head of the chain.
>> 
>> If an object is removed and re-added to the same hash chain, we won't
>> notice by looking at the NULLS marker.  In this case we must be sure
>> that it was not re-added *after* its original location, or a lookup may
>> incorrectly fail.  The easiest solution is to ensure it is inserted at
>> the start of the chain.  insert_slow() already does that,
>> insert_fast() does not.  So this patch changes insert_fast to always
>> insert at the head of the chain.
>> 
>> Note that such a user must do their own double-checking of
>> the object found by rhashtable_lookup_fast() after ensuring
>> mutual exclusion with anything that might change the key, such as
>> successfully taking a new reference.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> I still don't understand why we need this feature.  The only
> existing user of this (which currently doesn't use rhashtable)
> does not readd the reused entry to the same table.  IOW the flow
> is always from table A to table B.  After which the entry will
> be properly freed rather than reused.
>
> So who is going to use this?

I want it so I can use SLAB_TYPESAFE_BY_RCU slab caches in lustre.
lustre isn't in mainline any more so I cannot point at the code but the
concept is simple.
Without this, you need to use rcu_free to free any object added to an
rhashtable.
When kmalloc is used, kfree_rcu() can be used, which is fairly painless.
When a kmem_cache is used, you need to provide your own rcu free
function, which is clumsy.
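
The boilerplate this requires looks something like the following sketch
("struct foo" and friends are stand-in names, not actual lustre code):

	struct foo {
		struct rhash_head	node;	/* rhashtable linkage */
		u32			key;
		struct rcu_head		rcu;	/* only needed for call_rcu() */
	};

	static struct kmem_cache *foo_cache;

	static void foo_free_rcu(struct rcu_head *head)
	{
		struct foo *f = container_of(head, struct foo, rcu);

		kmem_cache_free(foo_cache, f);
	}

	static void foo_free(struct foo *f)
	{
		/* defer the free until a grace period has elapsed */
		call_rcu(&f->rcu, foo_free_rcu);
	}
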
With SLAB_TYPESAFE_BY_RCU, that isn't needed.  The object can be freed
immediately, providing you can cope with it being re-used (as the same
type) immediately.  Part of handling that is coping with the possibility
that it might be inserted into the same hash table, and possibly the
same chain, as soon as it is freed.
lustre has 6 different resizable hashtables which I want to convert
to use rhashtable.
I currently need call_rcu handlers for 3 of these.  I want those
3 to use SLAB_TYPESAFE_BY_RCU instead so they can use
kmem_cache_free() directly.  For this, I need rhashtable to be safe if
an object is deleted and immediately re-inserted into the same hash
chain.
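
With SLAB_TYPESAFE_BY_RCU the same thing reduces to roughly this (again
with stand-in names):

	static int foo_init(void)
	{
		/* Freed objects may be reused immediately, but only as
		 * another struct foo; the memory is not returned to the
		 * page allocator until a grace period has elapsed.
		 */
		foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
					      SLAB_TYPESAFE_BY_RCU, NULL);
		return foo_cache ? 0 : -ENOMEM;
	}

	static void foo_free(struct foo *f)
	{
		kmem_cache_free(foo_cache, f);	/* no call_rcu() needed */
	}

The cost is that readers must revalidate any object they find, which is
the double-checking described in the patch.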

Thanks,
NeilBrown
Herbert Xu July 16, 2018, 2:16 a.m. UTC | #3
On Mon, Jul 16, 2018 at 11:23:43AM +1000, NeilBrown wrote:
>
> kmem_cache_free() directly.  For this, I need rhashtable to be safe if
> an object is deleted and immediately re-inserted into the same hash
> chain.

This means that

	rcu_read_lock();
	A = rhashtable_lookup();
	use(A);
	rcu_read_unlock();

A can turn into object B when it is used.  That is just too strange
for words.  Can we see some actual code on how this works?

For comparison, the existing net code where this happens A doesn't
actually change and it simply moves from one hashtable to another.

I'm reluctant to add semantics that would constrain how rhashtable
works unless we have real and valid use-cases for it.

Cheers,
NeilBrown July 16, 2018, 3:26 a.m. UTC | #4
On Mon, Jul 16 2018, Herbert Xu wrote:

> On Mon, Jul 16, 2018 at 11:23:43AM +1000, NeilBrown wrote:
>>
>> kmem_cache_free() directly.  For this, I need rhashtable to be safe if
>> an object is deleted and immediately re-inserted into the same hash
>> chain.
>
> This means that
>
> 	rcu_read_lock();
> 	A = rhashtable_lookup();
> 	use(A);
> 	rcu_read_unlock();
>
> A can turn into object B when it is used.  That is just too strange
> for words.  Can we see some actual code on how this works?

Look in Documentation/RCU/rculist_nulls.txt.
The very first example is a typical lookup for a nulls list.
The above sample code would read:

	rcu_read_lock();
begin:
	A = rhashtable_lookup(table, key);
	if (!A) {
		rcu_read_unlock();
		goto not_found;
	}
	if (!try_get_ref(A))
		goto begin;	/* A is being freed - retry the lookup */
	if (A->key != key) {
		/* the reference pinned a reused object - drop and retry */
		put_ref(A);
		goto begin;
	}
	rcu_read_unlock();
	use(A);
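
(Note the ordering: the reference is taken *before* the key is re-checked.
With SLAB_TYPESAFE_BY_RCU the object may be freed and reused at any
moment, so the key comparison only means something once a reference has
pinned the object.)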

If you particularly need to see real code rather than sample code, I can
have something for you in a day or so.

Thanks,
NeilBrown

>
> For comparison, the existing net code where this happens A doesn't
> actually change and it simply moves from one hashtable to another.
>
> I'm reluctant to add semantics that would constrain how rhashtable
> works unless we have real and valid use-cases for it.
>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu July 17, 2018, 6:30 a.m. UTC | #5
On Mon, Jul 16, 2018 at 01:26:42PM +1000, NeilBrown wrote:
>
> Look in Documentation/RCU/rculist_nulls.txt.
> The very first example is a typical lookup for a nulls list.
> The above sample code would read:

OK, but how will this work with rhlist? It would be very bad to
have a feature that works for rhashtable but fails in strange
ways when you use rhlist.

Cheers,
David Miller July 18, 2018, 8:14 p.m. UTC | #6
From: NeilBrown <neilb@suse.com>
Date: Mon, 16 Jul 2018 09:57:11 +1000

> Some users of rhashtable might need to change the key
> of an object and move it to a different location in the table.
> Other users might want to allocate objects using
> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
> being used for a different (type-compatible) purpose and similarly
> end up in a different hash-chain.
> 
> To support these, we store a unique NULLS_MARKER at the end of
> each chain, and when a search fails to find a match, we check
> if the NULLS marker found was the expected one.  If not,
> the search is repeated.
> 
> The unique NULLS_MARKER is derived from the address of the
> head of the chain.
> 
> If an object is removed and re-added to the same hash chain, we won't
> notice by looking at the NULLS marker.  In this case we must be sure
> that it was not re-added *after* its original location, or a lookup may
> incorrectly fail.  The easiest solution is to ensure it is inserted at
> the start of the chain.  insert_slow() already does that,
> insert_fast() does not.  So this patch changes insert_fast to always
> insert at the head of the chain.
> 
> Note that such a user must do their own double-checking of
> the object found by rhashtable_lookup_fast() after ensuring
> mutual exclusion with anything that might change the key, such as
> successfully taking a new reference.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Neil I have to be honest with you.

During this whole ordeal I was under the impression that this was all
going to be used for something in-tree.  But now I see that you want
to use all of this stuff for lustre which is out of tree.

It would be extremely hard for me to accept adding this kind of
complexity and weird semantics to an already extremely complicated
and delicate piece of infrastructure even if something in-tree would use
it.

But for something out-of-tree?  I'm sorry, no way.
NeilBrown July 20, 2018, 6:24 a.m. UTC | #7
On Tue, Jul 17 2018, Herbert Xu wrote:

> On Mon, Jul 16, 2018 at 01:26:42PM +1000, NeilBrown wrote:
>>
>> Look in Documentation/RCU/rculist_nulls.txt.
>> The very first example is a typical lookup for a nulls list.
>> The above sample code would read:
>
> OK, but how will this work with rhlist? It would be very bad to
> have a feature that works for rhashtable but fails in strange
> ways when you use rhlist.

It should be easy enough to handle in rhlist too.
When inserting a new object, we put it at the start of the chain, and if
there was already a list with the same key, that list is moved so it
hangs off the new object (see the sketch below).
A walk could see some objects repeatedly when this happens, but that
is already possible.
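
For reference, the structure rhlist already uses supports this directly
(from include/linux/rhashtable.h):

	struct rhlist_head {
		struct rhash_head		rhead;	/* bucket-chain linkage */
		struct rhlist_head __rcu	*next;	/* others with the same key */
	};

The new object would become the bucket-chain head via ->rhead, and the
existing same-key sublist would be re-pointed at its ->next, preserving
the insert-at-head rule the patch depends on.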

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
NeilBrown July 20, 2018, 6:30 a.m. UTC | #8
On Thu, Jul 19 2018, David Miller wrote:

> From: NeilBrown <neilb@suse.com>
> Date: Mon, 16 Jul 2018 09:57:11 +1000
>
>> Some users of rhashtable might need to change the key
>> of an object and move it to a different location in the table.
>> Other users might want to allocate objects using
>> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
>> being used for a different (type-compatible) purpose and similarly
>> end up in a different hash-chain.
>> 
>> To support these, we store a unique NULLS_MARKER at the end of
>> each chain, and when a search fails to find a match, we check
>> if the NULLS marker found was the expected one.  If not,
>> the search is repeated.
>> 
>> The unique NULLS_MARKER is derived from the address of the
>> head of the chain.
>> 
>> If an object is removed and re-added to the same hash chain, we won't
>> notice by looking at the NULLS marker.  In this case we must be sure
>> that it was not re-added *after* its original location, or a lookup may
>> incorrectly fail.  The easiest solution is to ensure it is inserted at
>> the start of the chain.  insert_slow() already does that,
>> insert_fast() does not.  So this patch changes insert_fast to always
>> insert at the head of the chain.
>> 
>> Note that such a user must do their own double-checking of
>> the object found by rhashtable_lookup_fast() after ensuring
>> mutual exclusion with anything that might change the key, such as
>> successfully taking a new reference.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> Neil I have to be honest with you.

Thank you.

>
> During this whole ordeal I was under the impression that this was all
> going to be used for something in-tree.  But now I see that you want
> to use all of this stuff for lustre which is out of tree.
>
> It would be extremely hard for me to accept adding this kind of
> complexity and weird semantics to an already extremely complicated
> and delicate piece of infrastructure even if something in-tree would use
> it.
>
> But for something out-of-tree?  I'm sorry, no way.

That's unfortunate, but I can live with it.  Nulls-list support is
just a nice-to-have for me.
I'll resend the patch with the unwanted complexity removed.

Does this ruling also apply to the bit-spin-lock changes and the
per-cpu-counter changes that I have proposed?
These improve scalability when updates dominate.  Not having these
in mainline would mean I need to carry a separate rhashtables
implementation for lustre, which means code divergence, which isn't
healthy in the long run.

(Note that, in my mind, lustre is only temporarily out-of-tree.  It is
coming back, hopefully this year).

Thanks,
NeilBrown
David Miller July 20, 2018, 6:43 a.m. UTC | #9
From: NeilBrown <neilb@suse.com>
Date: Fri, 20 Jul 2018 16:30:34 +1000

> Does this ruling also apply to the bit-spin-lock changes and the
> per-cpu-counter changes that I have proposed?  These improve
> scalability when updates dominate.  Not having these in mainline
> would mean I need to carry a separate rhashtables implementation for
> lustre, which means code divergence, which isn't healthy in the long
> run.

If it helps existing rhashtable users generally, then it is fine,
since it will actually be tested by upstream users.

Thanks.
NeilBrown July 20, 2018, 7:09 a.m. UTC | #10
On Thu, Jul 19 2018, David Miller wrote:

> From: NeilBrown <neilb@suse.com>
> Date: Fri, 20 Jul 2018 16:30:34 +1000
>
>> Does this ruling also apply to the bit-spin-lock changes and the
>> per-cpu-counter changes that I have proposed?  These improve
>> scalability when updates dominate.  Not having these in mainline
>> would mean I need to carry a separate rhashtables implementation for
>> lustre, which means code divergence, which isn't healthy in the long
>> run.
>
> If it helps existing rhashtable users generally, then it is fine,
> since it will actually be tested by upstream users.

Good, thanks.

NeilBrown

Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index eb7111039247..10435a77b156 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -75,8 +75,10 @@  struct bucket_table {
 	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
 };
 
+#define	RHT_NULLS_MARKER(ptr)	\
+	((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr)	\
-	((ptr) = (typeof(ptr)) NULLS_MARKER(0))
+	((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
@@ -471,6 +473,7 @@  static inline struct rhash_head *__rhashtable_lookup(
 		.ht = ht,
 		.key = key,
 	};
+	struct rhash_head __rcu * const *head;
 	struct bucket_table *tbl;
 	struct rhash_head *he;
 	unsigned int hash;
@@ -478,13 +481,19 @@  static inline struct rhash_head *__rhashtable_lookup(
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
 	hash = rht_key_hashfn(ht, tbl, key, params);
-	rht_for_each_rcu(he, tbl, hash) {
-		if (params.obj_cmpfn ?
-		    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
-		    rhashtable_compare(&arg, rht_obj(ht, he)))
-			continue;
-		return he;
-	}
+	head = rht_bucket(tbl, hash);
+	do {
+		rht_for_each_rcu_continue(he, *head, tbl, hash) {
+			if (params.obj_cmpfn ?
+			    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+			    rhashtable_compare(&arg, rht_obj(ht, he)))
+				continue;
+			return he;
+		}
+		/* An object might have been moved to a different hash chain,
+		 * while we walk along it - better check and retry.
+		 */
+	} while (he != RHT_NULLS_MARKER(head));
 
 	/* Ensure we see any new tables. */
 	smp_rmb();
@@ -580,6 +589,7 @@  static inline void *__rhashtable_insert_fast(
 		.ht = ht,
 		.key = key,
 	};
+	struct rhash_head __rcu **headp;
 	struct rhash_head __rcu **pprev;
 	struct bucket_table *tbl;
 	struct rhash_head *head;
@@ -603,12 +613,13 @@  static inline void *__rhashtable_insert_fast(
 	}
 
 	elasticity = RHT_ELASTICITY;
-	pprev = rht_bucket_insert(ht, tbl, hash);
+	headp = rht_bucket_insert(ht, tbl, hash);
+	pprev = headp;
 	data = ERR_PTR(-ENOMEM);
 	if (!pprev)
 		goto out;
 
-	rht_for_each_continue(head, *pprev, tbl, hash) {
+	rht_for_each_continue(head, *headp, tbl, hash) {
 		struct rhlist_head *plist;
 		struct rhlist_head *list;
 
@@ -648,7 +659,7 @@  static inline void *__rhashtable_insert_fast(
 	if (unlikely(rht_grow_above_100(ht, tbl)))
 		goto slow_path;
 
-	head = rht_dereference_bucket(*pprev, tbl, hash);
+	head = rht_dereference_bucket(*headp, tbl, hash);
 
 	RCU_INIT_POINTER(obj->next, head);
 	if (rhlist) {
@@ -658,7 +669,7 @@  static inline void *__rhashtable_insert_fast(
 		RCU_INIT_POINTER(list->next, NULL);
 	}
 
-	rcu_assign_pointer(*pprev, obj);
+	rcu_assign_pointer(*headp, obj);
 
 	atomic_inc(&ht->nelems);
 	if (rht_grow_above_75(ht, tbl))
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 0e04947b7e0c..1737fbd049da 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1164,8 +1164,7 @@  struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 					    unsigned int hash)
 {
 	const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-	static struct rhash_head __rcu *rhnull =
-		(struct rhash_head __rcu *)NULLS_MARKER(0);
+	static struct rhash_head __rcu *rhnull;
 	unsigned int index = hash & ((1 << tbl->nest) - 1);
 	unsigned int size = tbl->size >> tbl->nest;
 	unsigned int subhash = hash;
@@ -1183,8 +1182,11 @@  struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 		subhash >>= shift;
 	}
 
-	if (!ntbl)
+	if (!ntbl) {
+		if (!rhnull)
+			INIT_RHT_NULLS_HEAD(rhnull);
 		return &rhnull;
+	}
 
 	return &ntbl[subhash].bucket;