
[1/2] rhashtable: Introduce rhashtable_walk_*

Message ID E1YFWUl-0004gE-SH@gondolin.me.apana.org.au
State Superseded, archived
Delegated to: David Miller

Commit Message

Herbert Xu Jan. 25, 2015, 11:21 p.m. UTC
Some existing rhashtable users get too intimate with it by walking
the buckets directly.  This prevents us from easily changing the
internals of rhashtable.

This patch adds the helpers rhashtable_walk_init/next/end which
will replace these custom walkers.

They are meant to be usable for both procfs seq_file walks as well
as walking by a netlink dump.  The iterator structure should fit
inside a netlink dump cb structure, with at least one element to
spare.
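The size claim can be sanity-checked mechanically. A user-space sketch with illustrative stand-in definitions (the iterator layout mirrors the one added by this patch; the six-`long` scratch area mirrors `struct netlink_callback`'s `args[6]`, but nothing here is copied from kernel headers):

```c
#include <assert.h>

/* Illustrative stand-ins, not kernel definitions. */
struct rhashtable;
struct rhash_head;
typedef struct spinlock spinlock_t;

/* Same members, in the same order, as the iterator added by this patch. */
struct rhashtable_iter {
	struct rhashtable *ht;
	struct rhash_head *p;
	spinlock_t *lock;
	unsigned int slot;
	unsigned int skip;
};

/* struct netlink_callback carries a scratch area of six longs (args[6]). */
#define NL_CB_ARGS 6

/* "At least one element to spare": the iterator must fit in five longs. */
static int iter_fits_with_spare(void)
{
	return sizeof(struct rhashtable_iter) <=
	       (NL_CB_ARGS - 1) * sizeof(long);
}
```

On both 32-bit (three 4-byte pointers plus two ints = 20 bytes vs. 20) and 64-bit (32 bytes vs. 40) the iterator fits with the claimed element to spare.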
  
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |   44 +++++++++++++++++++++
 lib/rhashtable.c           |   91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Thomas Graf Jan. 26, 2015, 8:20 a.m. UTC | #1
On 01/26/15 at 10:21am, Herbert Xu wrote:
> +int rhashtable_walk_start(struct rhashtable_iter *iter)
> +{
> +	struct rhashtable *ht = iter->ht;
> +	int err;
> +
> +	err = mutex_lock_interruptible(&ht->mutex);
> +	rcu_read_lock();
> +
> +	if (!err)
> +		mutex_unlock(&ht->mutex);
> +
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(rhashtable_walk_start);

This doesn't seem to work for Netlink dumps, as it would hold an
RCU read-side section open across user read()s.

I assume you are relying on RCU to postpone the initial table
relinking done by the resize. Wouldn't it be better to hold the
mutex instead, so we could sleep during the dump?

> +/**
> + * rhashtable_walk_next - Return the next object and advance the iterator
> + * @iter:	Hash table iterator
> + *
> + * Note that you must call rhashtable_walk_stop when you are finished
> + * with the walk.
> + *
> + * Returns the next object or NULL when the end of the table is reached.
> + */
> +void *rhashtable_walk_next(struct rhashtable_iter *iter)
> +{
> +	const struct bucket_table *tbl;
> +	struct rhashtable *ht = iter->ht;
> +	struct rhash_head *p = iter->p;
> +
> +	tbl = rht_dereference_rcu(ht->tbl, ht);
> +
> +	if (p) {
> +		p = rht_dereference_bucket(p->next, tbl, iter->slot);
> +		goto next;
> +	}
> +
> +	for (; iter->slot < tbl->size; iter->slot++) {
> +		int skip = iter->skip;
> +
> +		iter->lock = bucket_lock(tbl, iter->slot);
> +		spin_lock_bh(iter->lock);
> +
> +		rht_for_each(p, tbl, iter->slot) {
> +			if (!skip)
> +				break;
> +			skip--;
> +		}
> +
> +next:
> +		if (!rht_is_a_nulls(p)) {
> +			iter->skip++;
> +			iter->p = p;
> +			return rht_obj(ht, p);

Same here, we leave the iterator with the bucket lock held.

> +		}
> +		spin_unlock_bh(iter->lock);
> +
> +		iter->skip = 0;
> +	}
> +
> +	iter->p = NULL;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(rhashtable_walk_next);
David Laight Jan. 26, 2015, 10:09 a.m. UTC | #2
From: Herbert Xu
> Some existing rhashtable users get too intimate with it by walking
> the buckets directly.  This prevents us from easily changing the
> internals of rhashtable.
> 
> This patch adds the helpers rhashtable_walk_init/next/end which
> will replace these custom walkers.
> 
> They are meant to be usable for both procfs seq_file walks as well
> as walking by a netlink dump.  The iterator structure should fit
> inside a netlink dump cb structure, with at least one element to
> spare.
...
> +/**
> + * rhashtable_walk_start - Start a hash table walk
> + * @iter:	Hash table iterator
> + *
> + * Start a hash table walk.
> + *
> + * Returns zero if successful.  Returns -EINTR if we couldn't
> + * obtain the resize lock.
> + */
> +int rhashtable_walk_start(struct rhashtable_iter *iter)
> +{
> +	struct rhashtable *ht = iter->ht;
> +	int err;
> +
> +	err = mutex_lock_interruptible(&ht->mutex);
> +	rcu_read_lock();
> +
> +	if (!err)
> +		mutex_unlock(&ht->mutex);
> +
> +	return err;
> +}

That doesn't look right to me.
Surely you shouldn't be calling rcu_read_lock() when the mutex
request is interrupted.

So maybe:
	err = mutex_lock_interruptible(&ht->mutex);
	if (err)
		return err;
	rcu_read_lock();
	return 0;
}

	David
Herbert Xu Jan. 26, 2015, 10:21 p.m. UTC | #3
On Mon, Jan 26, 2015 at 08:20:00AM +0000, Thomas Graf wrote:
> On 01/26/15 at 10:21am, Herbert Xu wrote:
> > +int rhashtable_walk_start(struct rhashtable_iter *iter)
> > +{
> > +	struct rhashtable *ht = iter->ht;
> > +	int err;
> > +
> > +	err = mutex_lock_interruptible(&ht->mutex);
> > +	rcu_read_lock();
> > +
> > +	if (!err)
> > +		mutex_unlock(&ht->mutex);
> > +
> > +	return err;
> > +}
> > +EXPORT_SYMBOL_GPL(rhashtable_walk_start);
> 
> This doesn't seem to work for Netlink dumps as it would define
> a RCU read side section across user read()s.

No, it does not.  The start call roughly corresponds to the start
call in seq_file, which does not span reads.

> I assume you relying on RCU to defer the initial table relinking
> in the resizing to be postponed. Wouldn't it be better to hold
> the mutex instead, we could sleep during the dump.

Note that this is not the final solution but just the first step.
So right now it does not prevent/postpone resizes completely.  It's
only meant to be a strict improvement over what we currently have.

So indeed resizes will be stopped within a single read but not
across reads for seq_file.  Similarly for netlink dump resizes
will only be stopped for a single skb.

I will do the rest of what we discussed once I implement rehashing
since there will be dependencies on it.
 
> Same here, we leave the iterator with the bucket lock held.

This is intentional; it is what udp_diag does.  We'll have to do
this if you ever want to convert UDP sockets over to rhashtable,
so we might as well do it now.

If you don't hold the bucket lock then anyone that recycles objects
like UDP will end up walking into the ether.

As before this is not held across reads but only within a read.
Please see patch #2.

Cheers,
Herbert Xu Jan. 26, 2015, 10:23 p.m. UTC | #4
On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
>
> That doesn't look right to me.
> Surely you shouldn't be calling rcu_read_lock() when the mutex
> request is interrupted.
> 
> So maybe:
> 	err = mutex_lock_interruptible(&ht->mutex);
> 	if (err)
> 		return err;
> 	rcu_read_lock();

No, we need to grab the RCU read lock while holding the mutex
in order to prevent future resizes from happening once we release
the mutex.

We don't want to hold the mutex which would stop other walks from
starting.

Cheers,
David Miller Jan. 26, 2015, 10:36 p.m. UTC | #5
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 27 Jan 2015 09:23:00 +1100

> On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
>>
>> That doesn't look right to me.
>> Surely you shouldn't be calling rcu_read_lock() when the mutex
>> request is interrupted.
>> 
>> So maybe:
>> 	err = mutex_lock_interruptible(&ht->mutex);
>> 	if (err)
>> 		return err;
>> 	rcu_read_lock();
> 
> No, we need to grab the RCU read lock while holding the mutex
> in order to prevent future resizes from happening once we release
> the mutex.
> 
> We don't want to hold the mutex which would stop other walks from
> starting.

I really think the amount of time and effort being put into making
the walker function properly is excessive.

Let's just say that if you want to use rhashtable, you have to use a
linked list, or similar separate mechanism, to have stable walks of
the entire set of objects.
Herbert Xu Jan. 26, 2015, 10:42 p.m. UTC | #6
On Mon, Jan 26, 2015 at 02:36:13PM -0800, David Miller wrote:
>
> I really think the amount of time and effort being put into making
> the walker function properly is excessive.
> 
> Let's just say that if you want to use rhashtable, you have to use a
> linked list, or similar separate mechanism, to have stable walks of
> the entire set of objects.

I definitely agree with that :)

However, I'd really like to get these primitives in first so that
at least I can convert the existing rhashtable walkers over and
get on with implementing rhashtable rehashing which is what I set
out to do.

As otherwise we'd be looking at fixing the existing users (in
particular, nft_hash) properly which might take a lot more time.

If and when we finally fix the existing users we can always remove
these primitives.

FWIW, when I first wrote this patch I called it
rhashtable_walk_dangerously; I can readopt that name if you think
that will dissuade people from using it.

Thanks,
Herbert Xu Jan. 26, 2015, 11:31 p.m. UTC | #7
On Tue, Jan 27, 2015 at 09:42:16AM +1100, Herbert Xu wrote:
> 
> As otherwise we'd be looking at fixing the existing users (in
> particular, nft_hash) properly which might take a lot more time.

I take that back.  Looks like my walkers would be no good for
netfilter anyway since it wants to do nested walks, meaning no
locks can be taken.

So I think I'll just have to tackle netfilter first in which case
the walk function would be no use anyway since I could just fix
netlink by itself.

Cheers,
Thomas Graf Jan. 27, 2015, 9:45 a.m. UTC | #8
On 01/27/15 at 10:31am, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 09:42:16AM +1100, Herbert Xu wrote:
> > 
> > As otherwise we'd be looking at fixing the existing users (in
> > particular, nft_hash) properly which might take a lot more time.
> 
> I take that back.  Looks like my walkers would be no good for
> netfilter anyway since it wants to do nested walks, meaning no
> locks can be taken.
> 
> So I think I'll just have to tackle netfilter first in which case
> the walk function would be no use anyway since I could just fix
> netlink by itself.

OK. I'll wait for your netfilter approach before posting NLM_F_INTR
patches.
Herbert Xu Jan. 27, 2015, 9:54 a.m. UTC | #9
On Tue, Jan 27, 2015 at 09:45:41AM +0000, Thomas Graf wrote:
>
> OK. I'll wait for your netfilter approach before posting NLM_F_INTR
> patches.

Why would we still need the INTR stuff given that Dave has asked
for this to be done outside of rhashtable?

What I'm planning on doing right now is to divide rhashtable users
into two camps, lockless insert/removal vs. locked insert/removal.
netfilter will obviously fall into the latter camp.

For those with locked insert/removal I will make resizing/rehashing
synchronous since there is zero benefit from doing it in a separate
thread.

Once that is done you can then easily walk the table by just holding
the lock, which is what netfilter does today.

I'll probably make netlink locked too since it's much simpler that
way.  Only really high performance users like TCP/UDP sockets need
lockless insert/removal and the complexity that comes with it.

Cheers,
David Laight Jan. 27, 2015, 10:09 a.m. UTC | #10
From: Herbert Xu 
> On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
> >
> > That doesn't look right to me.
> > Surely you shouldn't be calling rcu_read_lock() when the mutex
> > request is interrupted.
> >
> > So maybe:
> > 	err = mutex_lock_interruptible(&ht->mutex);
> > 	if (err)
> > 		return err;
> > 	rcu_read_lock();
> 
> No, we need to grab the RCU read lock while holding the mutex
> in order to prevent future resizes from happening once we release
> the mutex.

But if err is non-zero you don't hold the mutex.
Presumably the calling code also errors out immediately,
so the RCU lock isn't needed at all.

	David

Herbert Xu Jan. 27, 2015, 10:12 a.m. UTC | #11
On Tue, Jan 27, 2015 at 10:09:18AM +0000, David Laight wrote:
>
> But if err is non-zero you don't hold the mutex.
> Presumably the calling code also errors out immediately,
> so the RCU lock isn't needed at all.

No the calling code expects the RCU lock to be held because it
will always call stop.  Look at how seq_file works.
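The seq_file-style contract Herbert points to, where stop runs unconditionally after start and therefore start may leave a lock held even on failure, can be sketched with illustrative stand-ins (none of these names are the kernel API; -4 stands in for -EINTR):

```c
#include <assert.h>
#include <stddef.h>

static int objs[] = { 1, 2, 3 };
static int rcu_depth;		/* stands in for the RCU read-side nesting */

struct walk_iter { size_t slot; };

static int walk_start(struct walk_iter *it, int simulate_eintr)
{
	(void)it;
	rcu_depth++;		/* taken even on the error path ... */
	if (simulate_eintr)
		return -4;	/* mutex_lock_interruptible() failed */
	return 0;
}

static int *walk_next(struct walk_iter *it)
{
	if (it->slot >= sizeof(objs) / sizeof(objs[0]))
		return NULL;
	return &objs[it->slot++];
}

static void walk_stop(struct walk_iter *it)
{
	(void)it;
	rcu_depth--;		/* ... because stop always releases it */
}

/* The seq_file-style loop: stop runs no matter what start returned,
 * so the read-side section is balanced on both paths. */
static int dump(int simulate_eintr)
{
	struct walk_iter it = { 0 };
	int err = walk_start(&it, simulate_eintr);
	int seen = 0;

	if (!err)
		while (walk_next(&it))
			seen++;
	walk_stop(&it);
	return err ? err : seen;
}
```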

Cheers,
Thomas Graf Jan. 27, 2015, 10:15 a.m. UTC | #12
On 01/27/15 at 08:54pm, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 09:45:41AM +0000, Thomas Graf wrote:
> >
> > OK. I'll wait for your netfilter approach before posting NLM_F_INTR
> > patches.
> 
> Why would we still need the INTR stuff given that Dave has asked
> for this to be done outside of rhashtable?

Because libraries like libnl already handle this transparently for
various user space apps. Those apps are not even aware of the problem
and always get a consistent dump if the kernel implements INTR.

> What I'm planning on doing right now is to divide rhashtable users
> into two camps, lockless insert/removal vs. locked insert/removal.
> netfilter will obviously fall into the latter camp.
> 
> For those with locked insert/removal I will make resizing/rehashing
> synchronous since there is zero benefit from doing it in a separate
> thread.

Because making it synchronous does not help the Netlink dump problem
at all; see below.

Also, a sync resize causes random inserts/removals to take a lot
longer, leading to less predictable insert/removal times. I have
no objection to offering both sync and async resizing as long as the
choice of balance is configurable.

> Once that is done you can then easily walk the table by just holding
> the lock, which is what netfilter does today.

How? Do you want to hold the mutex while the user is reading? I don't
see how to achieve consistency with locking without introducing a ton
of complexity and attackable behaviour, like user space being able to
delay inserts. If at all, this seems acceptable only for privileged
users.

My vote is to rely on the proven INTR flag and optionally restart the
dump if user space did not indicate that INTR is supported as Patrick
suggested.
Herbert Xu Jan. 27, 2015, 10:24 a.m. UTC | #13
On Tue, Jan 27, 2015 at 10:15:12AM +0000, Thomas Graf wrote:
>
> Because libraries like libnl already handle this transparently for
> various user space apps. Those apps are not even aware of the problem
> and always get a consistent dump if the kernel implements INTR.

Did you see Dave's message? He said those that dump need to create
their own data structure so there is no point in proceeding with
any of this work within rhashtable.
 
> Because making it synchronous does not help the Netlink dump problem
> at all, see below.
> 
> Also, a sync resize causes a random inserts/removal to take a lot
> longer. This leads to less predictable insert/removal times. I have
> no objection to offering both sync and async resizing as long as the
> choice of balance is configurable.

So what?  They can switch over to lockless mode if they care.

The point is that for netfilter as it stands it makes no sense
to go lockless because its modus operandi is to hold a single
lock over everything, including walking/dumping.

Cheers,
Thomas Graf Jan. 27, 2015, 11:16 a.m. UTC | #14
On 01/27/15 at 09:24pm, Herbert Xu wrote:
> The point is that for netfilter as it stands it makes no sense
> to go lockless because its modus operandi is to hold a single
> lock over everything, including walking/dumping.

No objection. I have a patch prepared which allows the user to
provide ht->mutex himself, so nfset can provide its own existing
mutex to rhashtable and lock resizes out of inserts, removals and
dump iterations automatically. That would restore the old
behaviour of the nfset API without major surgery.

I was just waiting on your walker as you requested a hold off.
Herbert Xu Jan. 27, 2015, 11:23 a.m. UTC | #15
On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> 
> No objection. I have a patch prepared which allows the user to
> provide ht->mutex himself so nfset can provide its own existing
> mutex to rhashtable and lock out the resizes from inserts,
> removals and dump iterations automatically That would restore the
> old behaviour of the nfset API without major surgery.

If you take the mutex you might as well just make it synchronous.
There is zero difference.

Maybe you misunderstood my email.  I'm not making it synchronous
for everybody, just those that always take a lock on inserts/removals
and therefore don't need per-bucket locks.

Cheers,
Thomas Graf Jan. 27, 2015, 11:40 a.m. UTC | #16
On 01/27/15 at 10:23pm, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> > 
> > No objection. I have a patch prepared which allows the user to
> > provide ht->mutex himself so nfset can provide its own existing
> > mutex to rhashtable and lock out the resizes from inserts,
> > removals and dump iterations automatically That would restore the
> > old behaviour of the nfset API without major surgery.
> 
> If you take the mutex you might as well just make it synchronous.
> There is zero difference.
> 
> Maybe you misunderstood my email.  I'm not making it synchronous
> for everybody, just those that always take a lock on inserts/removals
> and therefore don't need per-bucket locks.

Understood. No objection; happy to review patches. I initially did
keep sync/async separate and it led to considerable code duplication
and complexity. I figured that if a user needs sync inserts they could
provide their own locking. I neglected to allow controlling the async
resize, though. Again, feel free to give it a shot; no objections.

This is unrelated to resize run control, though; the reason is that
I'm converting tcp_hashinfo et al. and they require a hybrid approach.
The tables may be too big to construct a parallel data structure, and
we don't want to hold off inserts or deletes while the expensive dump
is underway. Even though we can't build a shadow structure while
locking everybody else out, we still want to provide a way to somehow
achieve consistent information. I think that NLM_F_INTR, with a
fallback to restarting the dump if user space did not indicate that
INTR is supported, is a good option and very easy to implement, as
Patrick suggested. In that case, we want to lock resizes out of dump
iterations but still allow parallel insert/delete.
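The NLM_F_INTR idea Thomas describes amounts to recording a generation number when the dump starts and flagging the dump as interrupted if a resize bumped it underneath. A minimal sketch (the flag value is the real NLM_F_DUMP_INTR from <linux/netlink.h>; everything else is an illustrative stand-in):

```c
#include <assert.h>

#define NLM_F_DUMP_INTR 0x10	/* real flag value from <linux/netlink.h> */

static unsigned int table_gen;	/* bumped by every resize/rehash */

struct dump_ctx {
	unsigned int start_gen;
	unsigned int flags;
};

static void dump_begin(struct dump_ctx *ctx)
{
	ctx->start_gen = table_gen;
	ctx->flags = 0;
}

/* Run when emitting each message (and the final DONE message): if a
 * resize happened since the dump began, tell user space, so libraries
 * like libnl can transparently restart the dump. */
static void dump_check(struct dump_ctx *ctx)
{
	if (table_gen != ctx->start_gen)
		ctx->flags |= NLM_F_DUMP_INTR;
}
```

Netlink later gained essentially this pattern as a generic sequence check on the dump callback, so the per-subsystem cost is small.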
Patrick McHardy Jan. 27, 2015, 1:09 p.m. UTC | #17
On 27.01, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> > 
> > No objection. I have a patch prepared which allows the user to
> > provide ht->mutex himself so nfset can provide its own existing
> > mutex to rhashtable and lock out the resizes from inserts,
> > removals and dump iterations automatically That would restore the
> > old behaviour of the nfset API without major surgery.
> 
> If you take the mutex you might as well just make it synchronous.
> There is zero difference.
> 
> Maybe you misunderstood my email.  I'm not making it synchronous
> for everybody, just those that always take a lock on inserts/removals
> and therefore don't need per-bucket locks.

Actually I have a patchset queued that adds runtime additions and
removals, both active and timeout based. So netfilter won't have
pure synchronous behaviour anymore.
Herbert Xu Jan. 27, 2015, 8:36 p.m. UTC | #18
On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
>
> Actually I have a patchset queued that adds runtime additions and
> removals, both active and timeout based. So netfilter won't have
> pure synchronous behaviour anymore.

Can you show us the patchset?

Thanks,
Herbert Xu Jan. 27, 2015, 8:39 p.m. UTC | #19
On Tue, Jan 27, 2015 at 11:40:28AM +0000, Thomas Graf wrote:
> 
> This is unrelated to resize run control though, the reason is that
> I'm converting tcp_hashinfo et al and they require a hybrid approach.
> The tables may be too big to construct a parallel data structure, we
> don't want to hold off inserts or deletes while the expensive dump
> is underway. Even though we can't build a shadow structure while
> locking everybody else out, we still want to provide a way to somehow
> achieve consistent information. I think that NLM_F_INTR with fallback
> to restarting the dump is a good option and very easy to implement. In
> that case, we want to lock out resize from dumping iterations but
> still allow parallel insert/delete.

Well I guess Dave needs to make the call.  Do we want to allow
lockless walks over the hash table or not?

Personally I don't think a linked list is that big a deal.  But then
you guys were agonising over a single pointer, so who knows.

Cheers,
David Miller Jan. 27, 2015, 10:10 p.m. UTC | #20
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jan 2015 07:39:24 +1100

> On Tue, Jan 27, 2015 at 11:40:28AM +0000, Thomas Graf wrote:
>> 
>> This is unrelated to resize run control though, the reason is that
>> I'm converting tcp_hashinfo et al and they require a hybrid approach.
>> The tables may be too big to construct a parallel data structure, we
>> don't want to hold off inserts or deletes while the expensive dump
>> is underway. Even though we can't build a shadow structure while
>> locking everybody else out, we still want to provide a way to somehow
>> achieve consistent information. I think that NLM_F_INTR with fallback
>> to restarting the dump is a good option and very easy to implement. In
>> that case, we want to lock out resize from dumping iterations but
>> still allow parallel insert/delete.
> 
> Well I guess Dave needs to make the call.  Do we want to allow
> lockless walks over the hash table or not?
> 
> Personally I don't think a linked list is that big a deal.  But then
> you guys were agonsing over a single pointer so who knows.

For netlink a linked list is no big deal, but for something like TCP
sockets it really is.
Herbert Xu Jan. 27, 2015, 11:16 p.m. UTC | #21
On Tue, Jan 27, 2015 at 02:10:18PM -0800, David Miller wrote:
>
> For netlink a linked list is no big deal, but for something like TCP
> sockets it really is.

OK then I'll resubmit my iterators.

Cheers,
Patrick McHardy Jan. 28, 2015, 7:07 p.m. UTC | #22
On 28.01, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> >
> > Actually I have a patchset queued that adds runtime additions and
> > removals, both active and timeout based. So netfilter won't have
> > pure synchronous behaviour anymore.
> 
> Can you show us the patchset?

Sure, will send it over tomorrow.
Herbert Xu Jan. 30, 2015, 5:58 a.m. UTC | #23
On Wed, Jan 28, 2015 at 07:07:16PM +0000, Patrick McHardy wrote:
> On 28.01, Herbert Xu wrote:
> > On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> > >
> > > Actually I have a patchset queued that adds runtime additions and
> > > removals, both active and timeout based. So netfilter won't have
> > > pure synchronous behaviour anymore.
> > 
> > Can you show us the patchset?
> 
> Sure, will send it over tomorrow.

So how are you handling the locking in these cases?

Cheers,
Patrick McHardy Jan. 30, 2015, 8:10 a.m. UTC | #24
On 30.01, Herbert Xu wrote:
> On Wed, Jan 28, 2015 at 07:07:16PM +0000, Patrick McHardy wrote:
> > On 28.01, Herbert Xu wrote:
> > > On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> > > >
> > > > Actually I have a patchset queued that adds runtime additions and
> > > > removals, both active and timeout based. So netfilter won't have
> > > > pure synchronous behaviour anymore.
> > > 
> > > Can you show us the patchset?
> > 
> > Sure, will send it over tomorrow.
> 
> So how are you handling the locking in these cases?

Insertion is handled using the rhashtable internal locks. GC happens
in work context and takes the rhashtable mutex, then walks the table
similar to nft_hash_walk and removes stale entries.

Still need to work out something to reduce RCU grace periods
(currently ignored) and make sure we don't race between ->get/->remove
in the netlink API and GC.

Patch

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 6d7e840..b03b375 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,6 +18,7 @@ 
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
+#include <linux/compiler.h>
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
@@ -123,6 +124,22 @@  struct rhashtable {
 	bool                            being_destroyed;
 };
 
+/**
+ * struct rhashtable_iter - Hash table iterator
+ * @ht: Table to iterate through
+ * @p: Current pointer
+ * @lock: Slot lock
+ * @slot: Current slot
+ * @skip: Number of entries to skip in slot
+ */
+struct rhashtable_iter {
+	struct rhashtable *ht;
+	struct rhash_head *p;
+	spinlock_t *lock;
+	unsigned int slot;
+	unsigned int skip;
+};
+
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
 {
 	return NULLS_MARKER(ht->p.nulls_base + hash);
@@ -178,6 +195,33 @@  bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      bool (*compare)(void *, void *),
 				      void *arg);
 
+/**
+ * rhashtable_walk_init - Initialise an iterator
+ * @ht:		Table to walk over
+ * @iter:	Hash table Iterator
+ *
+ * This function prepares a hash table walk.
+ * Note that if you restart a walk after rhashtable_walk_stop you
+ * may see the same object twice.  Also, you may miss objects if
+ * there are removals in between rhashtable_walk_stop and the next
+ * call to rhashtable_walk_start.
+ *
+ * For a completely stable walk you should construct your own data
+ * structure outside the hash table.
+ */
+static inline void rhashtable_walk_init(struct rhashtable *ht,
+					struct rhashtable_iter *iter)
+{
+	iter->ht = ht;
+	iter->p = NULL;
+	iter->slot = 0;
+	iter->skip = 0;
+}
+
+int rhashtable_walk_start(struct rhashtable_iter *iter) __acquires(RCU);
+void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
+
 void rhashtable_destroy(struct rhashtable *ht);
 
 #define rht_dereference(p, ht) \
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 71c6aa1..d51fb06 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -813,6 +813,97 @@  exit:
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
 
+/**
+ * rhashtable_walk_start - Start a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Start a hash table walk.
+ *
+ * Returns zero if successful.  Returns -EINTR if we couldn't
+ * obtain the resize lock.
+ */
+int rhashtable_walk_start(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	int err;
+
+	err = mutex_lock_interruptible(&ht->mutex);
+	rcu_read_lock();
+
+	if (!err)
+		mutex_unlock(&ht->mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_start);
+
+/**
+ * rhashtable_walk_next - Return the next object and advance the iterator
+ * @iter:	Hash table iterator
+ *
+ * Note that you must call rhashtable_walk_stop when you are finished
+ * with the walk.
+ *
+ * Returns the next object or NULL when the end of the table is reached.
+ */
+void *rhashtable_walk_next(struct rhashtable_iter *iter)
+{
+	const struct bucket_table *tbl;
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+
+	if (p) {
+		p = rht_dereference_bucket(p->next, tbl, iter->slot);
+		goto next;
+	}
+
+	for (; iter->slot < tbl->size; iter->slot++) {
+		int skip = iter->skip;
+
+		iter->lock = bucket_lock(tbl, iter->slot);
+		spin_lock_bh(iter->lock);
+
+		rht_for_each(p, tbl, iter->slot) {
+			if (!skip)
+				break;
+			skip--;
+		}
+
+next:
+		if (!rht_is_a_nulls(p)) {
+			iter->skip++;
+			iter->p = p;
+			return rht_obj(ht, p);
+		}
+		spin_unlock_bh(iter->lock);
+
+		iter->skip = 0;
+	}
+
+	iter->p = NULL;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_next);
+
+/**
+ * rhashtable_walk_stop - Finish a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Finish a hash table walk.
+ */
+void rhashtable_walk_stop(struct rhashtable_iter *iter)
+{
+	if (iter->p)
+		spin_unlock_bh(iter->lock);
+
+	rcu_read_unlock();
+
+	iter->p = NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
+
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),