
[1/3] rcu: Introduce hlist_nulls variant of hlist

Message ID 491C282A.5050802@cosmosbay.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 13, 2008, 1:14 p.m. UTC
hlist uses a NULL value to finish a chain.

The hlist_nulls variant uses the low-order bit set to 1 to signal an end-of-list marker.

This allows us to store many different end markers, so that some lockless RCU
algorithms (used in the TCP/UDP stack, for example) can avoid some memory
barriers in fast paths.

Two new files are added:

include/linux/list_nulls.h
  - mimics the hlist part of include/linux/list.h, adapted to the hlist_nulls variant

include/linux/rculist_nulls.h
  - mimics the hlist part of include/linux/rculist.h, adapted to the hlist_nulls variant

   Only four helpers are declared for the moment:

     hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
     hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()

prefetch() calls were removed, since the end of a list is no longer a NULL value.
prefetch() could trigger useless (and possibly dangerous) memory transactions.

Example of use (extracted from __udp4_lib_lookup())

        struct sock *sk, *result;
        struct hlist_nulls_node *node;
        unsigned short hnum = ntohs(dport);
        unsigned int hash = udp_hashfn(net, hnum);
        struct udp_hslot *hslot = &udptable->hash[hash];
        int score, badness;

        rcu_read_lock();
begin:
        result = NULL;
        badness = -1;
        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                score = compute_score(sk, net, saddr, hnum, sport,
                                      daddr, dport, dif);
                if (score > badness) {
                        result = sk;
                        badness = score;
                }
        }
        /*
         * if the nulls value we got at the end of this lookup is
         * not the expected one, we must restart lookup.
         * We probably met an item that was moved to another chain.
         */
        if (get_nulls_value(node) != hash)
                goto begin;

        if (result) {
                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                        result = NULL;
                else if (unlikely(compute_score(result, net, saddr, hnum, sport,
                                  daddr, dport, dif) < badness)) {
                        sock_put(result);
                        goto begin;
                }
        }
        rcu_read_unlock();
        return result;
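
For the writer side, a minimal sketch (illustrative only, not part of this
patch; the per-slot lock and the sk_nulls_node field are assumptions based
on the UDP conversion):

	unsigned int i;

	/*
	 * Each chain gets its own nulls value -- here the slot index --
	 * so a reader can tell which chain it finished on.
	 */
	for (i = 0; i < UDP_HTABLE_SIZE; i++)
		INIT_HLIST_NULLS_HEAD(&udptable->hash[i].head, i);

	/*
	 * Insertion runs under the per-slot lock, concurrently with
	 * lockless readers.
	 */
	spin_lock(&hslot->lock);
	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, &hslot->head);
	spin_unlock(&hslot->lock);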

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/list_nulls.h    |   94 +++++++++++++++++++++++++++
 include/linux/rculist_nulls.h |  110 ++++++++++++++++++++++++++++++++
 2 files changed, 204 insertions(+)

Comments

Peter Zijlstra Nov. 13, 2008, 1:29 p.m. UTC | #1
On Thu, 2008-11-13 at 14:14 +0100, Eric Dumazet wrote:
> hlist uses NULL value to finish a chain.
> 
> hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
> 
> This allows to store many different end markers, so that some RCU lockless
> algos (used in TCP/UDP stack for example) can save some memory barriers in
> fast paths.
> 
> Two new files are added :
> 
> include/linux/list_nulls.h
>   - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant

How is the !rcu variant useful?

> include/linux/rculist_nulls.h
>   - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
> 
>    Only four helpers are declared for the moment :
> 
>      hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>      hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
> 
> prefetches() were removed, since an end of list is not anymore NULL value.
> prefetches() could trigger useless (and possibly dangerous) memory transactions.
> 
> Example of use (extracted from __udp4_lib_lookup())
> 
> 	struct sock *sk, *result;
>         struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(dport);
>         unsigned int hash = udp_hashfn(net, hnum);
>         struct udp_hslot *hslot = &udptable->hash[hash];
>         int score, badness;
> 
>         rcu_read_lock();
> begin:
>         result = NULL;
>         badness = -1;
>         sk_nulls_for_each_rcu(sk, node, &hslot->head) {
>                 score = compute_score(sk, net, saddr, hnum, sport,
>                                       daddr, dport, dif);
>                 if (score > badness) {
>                         result = sk;
>                         badness = score;
>                 }
>         }
>         /*
>          * if the nulls value we got at the end of this lookup is
>          * not the expected one, we must restart lookup.
>          * We probably met an item that was moved to another chain.
>          */
>         if (get_nulls_value(node) != hash)
>                 goto begin;

So by not using some memory barriers (it would be nice to have it
illustrated which ones), we can race and end up on the wrong chain; in
case that happens, we detect it by using this per-chain terminator and
try again.
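
For concreteness, a minimal sketch of the writer-side move that opens the
race (names are illustrative, not taken from the patch):

	/* writer: move obj from chain A to chain B */
	spin_lock(&slotA->lock);
	hlist_nulls_del_rcu(&obj->node);	/* unlink from A */
	spin_unlock(&slotA->lock);

	obj->key = newkey;			/* obj now hashes to B */

	spin_lock(&slotB->lock);
	hlist_nulls_add_head_rcu(&obj->node, &slotB->head);
	spin_unlock(&slotB->lock);

	/*
	 * A reader standing on obj keeps following ->next and ends on
	 * chain B's nulls marker.  Since that marker encodes B, not A,
	 * get_nulls_value() differs from the expected value and the
	 * reader restarts.
	 */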

It would be really good to have it explained in the rculist_nulls.h
comments which memory barriers are missing, what races they open, and how
this special terminator trick closes those races.

I'm sure most of us understand it now, but will we still in a few
months? And how about new people?

Other than that, very cool stuff! :-)

Eric Dumazet Nov. 13, 2008, 1:44 p.m. UTC | #2
Peter Zijlstra wrote:
> On Thu, 2008-11-13 at 14:14 +0100, Eric Dumazet wrote:
>> hlist uses NULL value to finish a chain.
>>
>> hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
>>
>> This allows to store many different end markers, so that some RCU lockless
>> algos (used in TCP/UDP stack for example) can save some memory barriers in
>> fast paths.
>>
>> Two new files are added :
>>
>> include/linux/list_nulls.h
>>   - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant
> 
> How is the !rcu variant useful?

For example, if a process holds a lock, it doesn't need the RCU version.

/proc/net/tcp comes to mind.
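
A minimal sketch (names illustrative): such a locked walker can use the
plain variant, since the lock keeps the chain stable:

	struct sock *sk;
	struct hlist_nulls_node *node;

	spin_lock_bh(&hslot->lock);
	hlist_nulls_for_each_entry(sk, node, &hslot->head, sk_nulls_node)
		dump_one_socket(sk);	/* dump_one_socket() is made up */
	spin_unlock_bh(&hslot->lock);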


> 
>> include/linux/rculist_nulls.h
>>   - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
>>
>>    Only four helpers are declared for the moment :
>>
>>      hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>>      hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
>>
>> prefetches() were removed, since an end of list is not anymore NULL value.
>> prefetches() could trigger useless (and possibly dangerous) memory transactions.
>>

> 
> So by not using some memory barriers (would be nice to have it
> illustrated which ones), we can race and end up on the wrong chain, in
> case that happens we detect this by using this per-chain terminator and
> try again.
> 
> It would be really good to have it explained in the rculist_nulls.h
> comments what memory barriers are missing, what races they open, and how
> the this special terminator trick closes that race.

OK, maybe I should add a Documentation/RCU/rculist_nulls.txt file with
appropriate examples and documentation.

(Say, the lookup/insert algorithms, with the standard hlist and memory
barriers, and with hlist_nulls without those two memory barriers.)

These two memory barriers can be found in these commits:

c37ccc0d4e2a4ee52f1a40cff1be0049f2104bba:

udp: add a missing smp_wmb() in udp_lib_get_port()

Corey Minyard spotted a missing memory barrier in udp_lib_get_port()

We need to make sure a reader cannot read the new 'sk->sk_next' value
together with the previous value of 'sk->sk_hash'. Otherwise, an item could
be deleted from one chain and inserted into another chain. If the new chain
was empty before the move, the 'next' pointer is NULL, and a lockless reader
cannot detect that it missed the following items in the original chain.

This patch is temporary, since we expect an upcoming patch
to introduce another way of handling the problem.


And commit 96631ed16c514cf8b28fab991a076985ce378c26:

udp: introduce sk_for_each_rcu_safenext()

Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
(udp: RCU handling for Unicast packets.)

 "If the socket is moved from one list to another list in-between the
 time the hash is calculated and the next field is accessed, and the
 socket has moved to the end of the new list, the traversal will not
 complete properly on the list it should have, since the socket will
 be on the end of the new list and there's not a way to tell it's on a
 new list and restart the list traversal.  I think that this can be
 solved by pre-fetching the "next" field (with proper barriers) before
 checking the hash."

This patch corrects this problem, introducing a new
sk_for_each_rcu_safenext() macro.
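
To illustrate the first of those barriers, a hedged sketch (not the actual
commit; names are illustrative) of the ordering smp_wmb() enforced when a
writer moves a socket:

	sk->sk_hash = newhash;	/* publish the new hash first ... */
	smp_wmb();		/* ... so a reader that follows a stale
				 * ->next into the new chain also sees
				 * the new hash, notices the mismatch
				 * and can restart its lookup */
	sk_add_node_rcu(sk, &newslot->head);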


> 
> I'm sure most of us understand it now, but will we still in a few
> months? - how about new people?
> 
> Other than that, very cool stuff! :-)

Thanks Peter ;)


Peter Zijlstra Nov. 14, 2008, 3:16 p.m. UTC | #3
On Thu, 2008-11-13 at 14:14 +0100, Eric Dumazet wrote:
> hlist uses NULL value to finish a chain.
> 
> hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
> 
> This allows to store many different end markers, so that some RCU lockless
> algos (used in TCP/UDP stack for example) can save some memory barriers in
> fast paths.
> 
> Two new files are added :
> 
> include/linux/list_nulls.h
>   - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant
> 
> include/linux/rculist_nulls.h
>   - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
> 
>    Only four helpers are declared for the moment :
> 
>      hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>      hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
> 
> prefetches() were removed, since an end of list is not anymore NULL value.
> prefetches() could trigger useless (and possibly dangerous) memory transactions.
> 
> Example of use (extracted from __udp4_lib_lookup())
> 
>         struct sock *sk, *result;
>         struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(dport);
>         unsigned int hash = udp_hashfn(net, hnum);
>         struct udp_hslot *hslot = &udptable->hash[hash];
>         int score, badness;
> 
>         rcu_read_lock();
> begin:
>         result = NULL;
>         badness = -1;
>         sk_nulls_for_each_rcu(sk, node, &hslot->head) {
>                 score = compute_score(sk, net, saddr, hnum, sport,
>                                       daddr, dport, dif);
>                 if (score > badness) {
>                         result = sk;
>                         badness = score;
>                 }
>         }
>         /*
>          * if the nulls value we got at the end of this lookup is
>          * not the expected one, we must restart lookup.
>          * We probably met an item that was moved to another chain.
>          */
>         if (get_nulls_value(node) != hash)
>                 goto begin;
> 
>         if (result) {
>                 if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
>                         result = NULL;
>                 else if (unlikely(compute_score(result, net, saddr, hnum, sport,
>                                   daddr, dport, dif) < badness)) {
>                         sock_put(result);
>                         goto begin;
>                 }
>         }
>         rcu_read_unlock();
>         return result;
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Paul E. McKenney Nov. 19, 2008, 5:01 p.m. UTC | #4
On Thu, Nov 13, 2008 at 02:14:18PM +0100, Eric Dumazet wrote:
> hlist uses NULL value to finish a chain.
>
> hlist_nulls variant use the low order bit set to 1 to signal an end-of-list 
> marker.
>
> This allows to store many different end markers, so that some RCU lockless
> algos (used in TCP/UDP stack for example) can save some memory barriers in
> fast paths.
>
> Two new files are added :
>
> include/linux/list_nulls.h
>  - mimics hlist part of include/linux/list.h, derived to hlist_nulls 
> variant
>
> include/linux/rculist_nulls.h
>  - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls 
> variant
>
>   Only four helpers are declared for the moment :
>
>     hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>     hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
>
> prefetches() were removed, since an end of list is not anymore NULL value.
> prefetches() could trigger useless (and possibly dangerous) memory 
> transactions.
>
> Example of use (extracted from __udp4_lib_lookup())
>
> 	struct sock *sk, *result;
>        struct hlist_nulls_node *node;
>        unsigned short hnum = ntohs(dport);
>        unsigned int hash = udp_hashfn(net, hnum);
>        struct udp_hslot *hslot = &udptable->hash[hash];
>        int score, badness;
>
>        rcu_read_lock();
> begin:
>        result = NULL;
>        badness = -1;
>        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
>                score = compute_score(sk, net, saddr, hnum, sport,
>                                      daddr, dport, dif);
>                if (score > badness) {
>                        result = sk;
>                        badness = score;
>                }
>        }
>        /*
>         * if the nulls value we got at the end of this lookup is
>         * not the expected one, we must restart lookup.
>         * We probably met an item that was moved to another chain.
>         */
>        if (get_nulls_value(node) != hash)
>                goto begin;
>
>        if (result) {
>                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
>                        result = NULL;
>                else if (unlikely(compute_score(result, net, saddr, hnum, 
> sport,
>                                  daddr, dport, dif) < badness)) {
>                        sock_put(result);
>                        goto begin;
>                }
>        }
>        rcu_read_unlock();
>        return result;

Looks good, but a few questions and suggestions interspersed below.

							Thanx, Paul

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
> include/linux/list_nulls.h    |   94 +++++++++++++++++++++++++++
> include/linux/rculist_nulls.h |  110 ++++++++++++++++++++++++++++++++
> 2 files changed, 204 insertions(+)

> diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
> new file mode 100644
> index 0000000..856dee8
> --- /dev/null
> +++ b/include/linux/list_nulls.h
> @@ -0,0 +1,94 @@
> +#ifndef _LINUX_LIST_NULLS_H
> +#define _LINUX_LIST_NULLS_H
> +
> +/*
> + * Special version of lists, where end of list is not a NULL pointer,
> + * but a 'nulls' marker, which can have many different values.
> + * (up to 2^31 different values guaranteed on all platforms)
> + *
> + * In the standard hlist, termination of a list is the NULL pointer.
> + * In this special 'nulls' variant, we use the fact that objects stored in
> + * a list are aligned on a word (4 or 8 bytes alignment).
> + * We therefore use the last significant bit of 'ptr' :
> + * Set to 1 : This is a 'nulls' end-of-list marker (ptr >> 1)
> + * Set to 0 : This is a pointer to some object (ptr)
> + */
> +
> +struct hlist_nulls_head {
> +	struct hlist_nulls_node *first;
> +};
> +
> +struct hlist_nulls_node {
> +	struct hlist_nulls_node *next, **pprev;
> +};
> +#define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
> +	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 1)))
> +
> +#define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
> +/**
> + * ptr_is_a_nulls - Test if a ptr is a nulls
> + * @ptr: ptr to be tested
> + *
> + */
> +static inline int is_a_nulls(const struct hlist_nulls_node *ptr)
> +{
> +	return ((unsigned long)ptr & 1);
> +}
> +
> +/**
> + * get_nulls_value - Get the 'nulls' value of the end of chain
> + * @ptr: end of chain
> + *
> + * Should be called only if is_a_nulls(ptr);
> + */
> +static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
> +{
> +	return ((unsigned long)ptr) >> 1;
> +}
> +
> +static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)
> +{
> +	return !h->pprev;
> +}
> +
> +static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
> +{
> +	return is_a_nulls(h->first);
> +}
> +
> +static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
> +{
> +	struct hlist_nulls_node *next = n->next;
> +	struct hlist_nulls_node **pprev = n->pprev;
> +	*pprev = next;
> +	if (!is_a_nulls(next))
> +		next->pprev = pprev;
> +}
> +
> +/**
> + * hlist_nulls_for_each_entry	- iterate over list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_node within the struct.
> + *
> + */
> +#define hlist_nulls_for_each_entry(tpos, pos, head, member)		       \
> +	for (pos = (head)->first;					       \
> +	     (!is_a_nulls(pos)) &&					       \
> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = pos->next)
> +
> +/**
> + * hlist_nulls_for_each_entry_from - iterate over a hlist continuing from current point
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.

And @pos is the starting point, correct?  Suggest something like:

	@pos:	the &struct hlist_node serving as starting point and cursor

> + * @member:	the name of the hlist_node within the struct.
> + *
> + */
> +#define hlist_nulls_for_each_entry_from(tpos, pos, member)	\
> +	for (; (!is_a_nulls(pos)) && 				\
> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = pos->next)
> +
> +#endif
> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> new file mode 100644
> index 0000000..b185ac4
> --- /dev/null
> +++ b/include/linux/rculist_nulls.h
> @@ -0,0 +1,110 @@
> +#ifndef _LINUX_RCULIST_NULLS_H
> +#define _LINUX_RCULIST_NULLS_H
> +
> +#ifdef __KERNEL__
> +
> +/*
> + * RCU-protected list version
> + */
> +#include <linux/list_nulls.h>
> +#include <linux/rcupdate.h>
> +
> +/**
> + * hlist_nulls_del_init_rcu - deletes entry from hash list with re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: hlist_nulls_unhashed() on the node return true after this. It is
> + * useful for RCU based read lockfree traversal if the writer side
> + * must know if the list entry is still hashed or already unhashed.
> + *
> + * In particular, it means that we can not poison the forward pointers
> + * that may still be used for walking the hash list and we can only
> + * zero the pprev pointer so list_unhashed() will return true after
> + * this.
> + *
> + * The caller must take whatever precautions are necessary (such as
> + * holding appropriate locks) to avoid racing with another
> + * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
> + * hlist_nulls_del_rcu(), running on this same list.  However, it is
> + * perfectly legal to run concurrently with the _rcu list-traversal
> + * primitives, such as hlist_nulls_for_each_entry_rcu().
> + */
> +static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
> +{
> +	if (!hlist_nulls_unhashed(n)) {
> +		__hlist_nulls_del(n);
> +		n->pprev = NULL;
> +	}
> +}

The point here is to allow an RCU reader to grab the update-side lock
while holding a reference to an hlist_nulls_node, and then be able to
blindly call hlist_nulls_del_init_rcu() without having to do any complex
check to see if the element has already been deleted?

But this only works if each free operation waits for a grace period.
If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to
revalidate after grabbing the update-side lock, right?  Hmmm...

> +
> +/**
> + * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: hlist_nulls_unhashed() on entry does not return true after this,
> + * the entry is in an undefined state. It is useful for RCU based
> + * lockfree traversal.
> + *
> + * In particular, it means that we can not poison the forward
> + * pointers that may still be used for walking the hash list.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
> + * or hlist_nulls_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_nulls_for_each_entry().
> + */
> +static inline void hlist_nulls_del_rcu(struct hlist_nulls_node *n)
> +{
> +	__hlist_nulls_del(n);
> +	n->pprev = LIST_POISON2;
> +}
> +
> +/**
> + * hlist_nulls_add_head_rcu
> + * @n: the element to add to the hash list.
> + * @h: the list to add to.
> + *
> + * Description:
> + * Adds the specified element to the specified hlist_nulls,
> + * while permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
> + * or hlist_nulls_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
> + * problems on Alpha CPUs.  Regardless of the type of CPU, the
> + * list-traversal primitive must be guarded by rcu_read_lock().
> + */
> +static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
> +					struct hlist_nulls_head *h)
> +{
> +	struct hlist_nulls_node *first = h->first;
> +
> +	n->next = first;
> +	n->pprev = &h->first;
> +	rcu_assign_pointer(h->first, n);
> +	if (!is_a_nulls(first))
> +		first->pprev = &n->next;
> +}
> +/**
> + * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_nulls_node within the struct.
> + *
> + */
> +#define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \
> +	for (pos = rcu_dereference((head)->first);			 \
> +		(!is_a_nulls(pos)) && 			\
> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1; }); \
> +		pos = rcu_dereference(pos->next))
> +
> +#endif
> +#endif

Any chance of using a trick like Oleg used to get rid of the "pos"
argument?  http://lkml.org/lkml/2008/3/12/47

The hlist_nulls_node must always be at an even address, correct?
Couldn't this fact be used to allow testing for is_a_nulls() on tpos
rather than on pos?  Or is there a better way to approach this?
Eric Dumazet Nov. 19, 2008, 5:53 p.m. UTC | #5
Paul E. McKenney wrote:
> On Thu, Nov 13, 2008 at 02:14:18PM +0100, Eric Dumazet wrote:
>> hlist uses NULL value to finish a chain.
>>
>> hlist_nulls variant use the low order bit set to 1 to signal an end-of-list 
>> marker.
>>
>> This allows to store many different end markers, so that some RCU lockless
>> algos (used in TCP/UDP stack for example) can save some memory barriers in
>> fast paths.
>>
>> Two new files are added :
>>
>> include/linux/list_nulls.h
>>  - mimics hlist part of include/linux/list.h, derived to hlist_nulls 
>> variant
>>
>> include/linux/rculist_nulls.h
>>  - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls 
>> variant
>>
>>   Only four helpers are declared for the moment :
>>
>>     hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>>     hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
>>
>> prefetches() were removed, since an end of list is not anymore NULL value.
>> prefetches() could trigger useless (and possibly dangerous) memory 
>> transactions.
>>
>> Example of use (extracted from __udp4_lib_lookup())
>>
>> 	struct sock *sk, *result;
>>        struct hlist_nulls_node *node;
>>        unsigned short hnum = ntohs(dport);
>>        unsigned int hash = udp_hashfn(net, hnum);
>>        struct udp_hslot *hslot = &udptable->hash[hash];
>>        int score, badness;
>>
>>        rcu_read_lock();
>> begin:
>>        result = NULL;
>>        badness = -1;
>>        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
>>                score = compute_score(sk, net, saddr, hnum, sport,
>>                                      daddr, dport, dif);
>>                if (score > badness) {
>>                        result = sk;
>>                        badness = score;
>>                }
>>        }
>>        /*
>>         * if the nulls value we got at the end of this lookup is
>>         * not the expected one, we must restart lookup.
>>         * We probably met an item that was moved to another chain.
>>         */
>>        if (get_nulls_value(node) != hash)
>>                goto begin;
>>
>>        if (result) {
>>                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
>>                        result = NULL;
>>                else if (unlikely(compute_score(result, net, saddr, hnum, 
>> sport,
>>                                  daddr, dport, dif) < badness)) {
>>                        sock_put(result);
>>                        goto begin;
>>                }
>>        }
>>        rcu_read_unlock();
>>        return result;
> 
> Looks good, but a few questions and suggestions interspersed below.
> 
> 							Thanx, Paul
> 
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
>> include/linux/list_nulls.h    |   94 +++++++++++++++++++++++++++
>> include/linux/rculist_nulls.h |  110 ++++++++++++++++++++++++++++++++
>> 2 files changed, 204 insertions(+)
> 
>> diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
>> new file mode 100644
>> index 0000000..856dee8
>> --- /dev/null
>> +++ b/include/linux/list_nulls.h
>> @@ -0,0 +1,94 @@
>> +#ifndef _LINUX_LIST_NULLS_H
>> +#define _LINUX_LIST_NULLS_H
>> +
>> +/*
>> + * Special version of lists, where end of list is not a NULL pointer,
>> + * but a 'nulls' marker, which can have many different values.
>> + * (up to 2^31 different values guaranteed on all platforms)
>> + *
>> + * In the standard hlist, termination of a list is the NULL pointer.
>> + * In this special 'nulls' variant, we use the fact that objects stored in
>> + * a list are aligned on a word (4 or 8 bytes alignment).
>> + * We therefore use the last significant bit of 'ptr' :
>> + * Set to 1 : This is a 'nulls' end-of-list marker (ptr >> 1)
>> + * Set to 0 : This is a pointer to some object (ptr)
>> + */
>> +
>> +struct hlist_nulls_head {
>> +	struct hlist_nulls_node *first;
>> +};
>> +
>> +struct hlist_nulls_node {
>> +	struct hlist_nulls_node *next, **pprev;
>> +};
>> +#define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
>> +	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 1)))
>> +
>> +#define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
>> +/**
>> + * ptr_is_a_nulls - Test if a ptr is a nulls
>> + * @ptr: ptr to be tested
>> + *
>> + */
>> +static inline int is_a_nulls(const struct hlist_nulls_node *ptr)
>> +{
>> +	return ((unsigned long)ptr & 1);
>> +}
>> +
>> +/**
>> + * get_nulls_value - Get the 'nulls' value of the end of chain
>> + * @ptr: end of chain
>> + *
>> + * Should be called only if is_a_nulls(ptr);
>> + */
>> +static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
>> +{
>> +	return ((unsigned long)ptr) >> 1;
>> +}
>> +
>> +static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)
>> +{
>> +	return !h->pprev;
>> +}
>> +
>> +static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
>> +{
>> +	return is_a_nulls(h->first);
>> +}
>> +
>> +static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
>> +{
>> +	struct hlist_nulls_node *next = n->next;
>> +	struct hlist_nulls_node **pprev = n->pprev;
>> +	*pprev = next;
>> +	if (!is_a_nulls(next))
>> +		next->pprev = pprev;
>> +}
>> +
>> +/**
>> + * hlist_nulls_for_each_entry	- iterate over list of given type
>> + * @tpos:	the type * to use as a loop cursor.
>> + * @pos:	the &struct hlist_node to use as a loop cursor.
>> + * @head:	the head for your list.
>> + * @member:	the name of the hlist_node within the struct.
>> + *
>> + */
>> +#define hlist_nulls_for_each_entry(tpos, pos, head, member)		       \
>> +	for (pos = (head)->first;					       \
>> +	     (!is_a_nulls(pos)) &&					       \
>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
>> +	     pos = pos->next)
>> +
>> +/**
>> + * hlist_nulls_for_each_entry_from - iterate over a hlist continuing from current point
>> + * @tpos:	the type * to use as a loop cursor.
>> + * @pos:	the &struct hlist_node to use as a loop cursor.
> 
> And @pos is the starting point, correct?  Suggest something like:
> 
> 	@pos:	the &struct hlist_node serving as starting point and cursor

Yes, the comment was copied from the hlist_for_each_entry_from() comment;
that one needs an update too.

> 
>> + * @member:	the name of the hlist_node within the struct.
>> + *
>> + */
>> +#define hlist_nulls_for_each_entry_from(tpos, pos, member)	\
>> +	for (; (!is_a_nulls(pos)) && 				\
>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
>> +	     pos = pos->next)
>> +
>> +#endif
>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>> new file mode 100644
>> index 0000000..b185ac4
>> --- /dev/null
>> +++ b/include/linux/rculist_nulls.h
>> @@ -0,0 +1,110 @@
>> +#ifndef _LINUX_RCULIST_NULLS_H
>> +#define _LINUX_RCULIST_NULLS_H
>> +
>> +#ifdef __KERNEL__
>> +
>> +/*
>> + * RCU-protected list version
>> + */
>> +#include <linux/list_nulls.h>
>> +#include <linux/rcupdate.h>
>> +
>> +/**
>> + * hlist_nulls_del_init_rcu - deletes entry from hash list with re-initialization
>> + * @n: the element to delete from the hash list.
>> + *
>> + * Note: hlist_nulls_unhashed() on the node return true after this. It is
>> + * useful for RCU based read lockfree traversal if the writer side
>> + * must know if the list entry is still hashed or already unhashed.
>> + *
>> + * In particular, it means that we can not poison the forward pointers
>> + * that may still be used for walking the hash list and we can only
>> + * zero the pprev pointer so list_unhashed() will return true after
>> + * this.
>> + *
>> + * The caller must take whatever precautions are necessary (such as
>> + * holding appropriate locks) to avoid racing with another
>> + * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
>> + * hlist_nulls_del_rcu(), running on this same list.  However, it is
>> + * perfectly legal to run concurrently with the _rcu list-traversal
>> + * primitives, such as hlist_nulls_for_each_entry_rcu().
>> + */
>> +static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>> +{
>> +	if (!hlist_nulls_unhashed(n)) {
>> +		__hlist_nulls_del(n);
>> +		n->pprev = NULL;
>> +	}
>> +}
> 
> The point here is to allow an RCU reader to grab the update-side lock
> while holding a reference to an hlist_nulls_node, and then be able to
> blindly call hlist_nulls_del_init_rcu() without having to do any complex
> check to see if the element has already been deleted?
> 
> But this only works if each free operation waits for a grace period.
> If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to
> revalidate after grabbing the update-side lock, right?  Hmmm...
> 

<start a brain refresh cycle>
  <read again your questions>
    Tilt... 


hlist_nulls_del_init_rcu() is only used by a writer, exactly
like hlist_del_init_rcu().
I see nothing special about SLAB_DESTROY_BY_RCU here.

static inline void hlist_del_init_rcu(struct hlist_node *n)
{
        if (!hlist_unhashed(n)) {
                __hlist_del(n);
                n->pprev = NULL;
        }
}
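
For context, the typical caller pattern is just this (a hedged sketch; the
lock name is illustrative):

	spin_lock(&hslot->lock);	/* serializes writers only */
	hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
	spin_unlock(&hslot->lock);
	/* readers may still be traversing the chain under RCU */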



>> +
>> +/**
>> + * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>> + * @n: the element to delete from the hash list.
>> + *
>> + * Note: hlist_nulls_unhashed() on entry does not return true after this,
>> + * the entry is in an undefined state. It is useful for RCU based
>> + * lockfree traversal.
>> + *
>> + * In particular, it means that we can not poison the forward
>> + * pointers that may still be used for walking the hash list.
>> + *
>> + * The caller must take whatever precautions are necessary
>> + * (such as holding appropriate locks) to avoid racing
>> + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
>> + * or hlist_nulls_del_rcu(), running on this same list.
>> + * However, it is perfectly legal to run concurrently with
>> + * the _rcu list-traversal primitives, such as
>> + * hlist_nulls_for_each_entry().
>> + */
>> +static inline void hlist_nulls_del_rcu(struct hlist_nulls_node *n)
>> +{
>> +	__hlist_nulls_del(n);
>> +	n->pprev = LIST_POISON2;
>> +}
>> +
>> +/**
>> + * hlist_nulls_add_head_rcu
>> + * @n: the element to add to the hash list.
>> + * @h: the list to add to.
>> + *
>> + * Description:
>> + * Adds the specified element to the specified hlist_nulls,
>> + * while permitting racing traversals.
>> + *
>> + * The caller must take whatever precautions are necessary
>> + * (such as holding appropriate locks) to avoid racing
>> + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
>> + * or hlist_nulls_del_rcu(), running on this same list.
>> + * However, it is perfectly legal to run concurrently with
>> + * the _rcu list-traversal primitives, such as
>> + * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
>> + * problems on Alpha CPUs.  Regardless of the type of CPU, the
>> + * list-traversal primitive must be guarded by rcu_read_lock().
>> + */
>> +static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
>> +					struct hlist_nulls_head *h)
>> +{
>> +	struct hlist_nulls_node *first = h->first;
>> +
>> +	n->next = first;
>> +	n->pprev = &h->first;
>> +	rcu_assign_pointer(h->first, n);
>> +	if (!is_a_nulls(first))
>> +		first->pprev = &n->next;
>> +}
>> +/**
>> + * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
>> + * @tpos:	the type * to use as a loop cursor.
>> + * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
>> + * @head:	the head for your list.
>> + * @member:	the name of the hlist_nulls_node within the struct.
>> + *
>> + */
>> +#define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \
>> +	for (pos = rcu_dereference((head)->first);			 \
>> +		(!is_a_nulls(pos)) && 			\
>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1; }); \
>> +		pos = rcu_dereference(pos->next))
>> +
>> +#endif
>> +#endif
> 
> Any chance of using a trick like Oleg used to get rid of the "pos"
> argument?  http://lkml.org/lkml/2008/3/12/47
> 
> The hlist_nulls_node must always be at an even address, correct?
> Couldn't this fact be used to allow testing for is_a_nulls() on tpos
> rather than on pos?  Or is there a better way to approach this?

#define sk_nulls_for_each_rcu(__sk, node, list) \
	hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)

1) __sk is the pointer to the found item, if any is found in the loop

2) node will contain the end value of the chain in case we find nothing in
   the loop, because we need to check it after the loop:

if (get_nulls_value(node) != hash)
	 goto begin;

I don't know; it seems quite complex to try to use only three args.

This algorithm is not very easy to read as is already...


Paul E. McKenney Nov. 19, 2008, 6:46 p.m. UTC | #6
On Wed, Nov 19, 2008 at 06:53:20PM +0100, Eric Dumazet wrote:
> Paul E. McKenney wrote:
>> On Thu, Nov 13, 2008 at 02:14:18PM +0100, Eric Dumazet wrote:
>>> hlist uses NULL value to finish a chain.
>>>
>>> hlist_nulls variant use the low order bit set to 1 to signal an 
>>> end-of-list marker.
>>>
>>> This allows to store many different end markers, so that some RCU 
>>> lockless
>>> algos (used in TCP/UDP stack for example) can save some memory barriers 
>>> in
>>> fast paths.
>>>
>>> Two new files are added :
>>>
>>> include/linux/list_nulls.h
>>>  - mimics hlist part of include/linux/list.h, derived to hlist_nulls 
>>> variant
>>>
>>> include/linux/rculist_nulls.h
>>>  - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls 
>>> variant
>>>
>>>   Only four helpers are declared for the moment :
>>>
>>>     hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
>>>     hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
>>>
>>> prefetches() were removed, since an end of list is not anymore NULL 
>>> value.
>>> prefetches() could trigger useless (and possibly dangerous) memory 
>>> transactions.
>>>
>>> Example of use (extracted from __udp4_lib_lookup())
>>>
>>> 	struct sock *sk, *result;
>>>        struct hlist_nulls_node *node;
>>>        unsigned short hnum = ntohs(dport);
>>>        unsigned int hash = udp_hashfn(net, hnum);
>>>        struct udp_hslot *hslot = &udptable->hash[hash];
>>>        int score, badness;
>>>
>>>        rcu_read_lock();
>>> begin:
>>>        result = NULL;
>>>        badness = -1;
>>>        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
>>>                score = compute_score(sk, net, saddr, hnum, sport,
>>>                                      daddr, dport, dif);
>>>                if (score > badness) {
>>>                        result = sk;
>>>                        badness = score;
>>>                }
>>>        }
>>>        /*
>>>         * if the nulls value we got at the end of this lookup is
>>>         * not the expected one, we must restart lookup.
>>>         * We probably met an item that was moved to another chain.
>>>         */
>>>        if (get_nulls_value(node) != hash)
>>>                goto begin;
>>>
>>>        if (result) {
>>>                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
>>>                        result = NULL;
>>>                else if (unlikely(compute_score(result, net, saddr, hnum, 
>>> sport,
>>>                                  daddr, dport, dif) < badness)) {
>>>                        sock_put(result);
>>>                        goto begin;
>>>                }
>>>        }
>>>        rcu_read_unlock();
>>>        return result;
>> Looks good, but a few questions and suggestions interspersed below.
>> 							Thanx, Paul
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>> ---
>>> include/linux/list_nulls.h    |   94 +++++++++++++++++++++++++++
>>> include/linux/rculist_nulls.h |  110 ++++++++++++++++++++++++++++++++
>>> 2 files changed, 204 insertions(+)
>>> diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
>>> new file mode 100644
>>> index 0000000..856dee8
>>> --- /dev/null
>>> +++ b/include/linux/list_nulls.h
>>> @@ -0,0 +1,94 @@
>>> +#ifndef _LINUX_LIST_NULLS_H
>>> +#define _LINUX_LIST_NULLS_H
>>> +
>>> +/*
>>> + * Special version of lists, where end of list is not a NULL pointer,
>>> + * but a 'nulls' marker, which can have many different values.
>>> + * (up to 2^31 different values guaranteed on all platforms)
>>> + *
>>> + * In the standard hlist, termination of a list is the NULL pointer.
>>> + * In this special 'nulls' variant, we use the fact that objects stored 
>>> in
>>> + * a list are aligned on a word (4 or 8 bytes alignment).
>>> + * We therefore use the last significant bit of 'ptr' :
>>> + * Set to 1 : This is a 'nulls' end-of-list marker (ptr >> 1)
>>> + * Set to 0 : This is a pointer to some object (ptr)
>>> + */
>>> +
>>> +struct hlist_nulls_head {
>>> +	struct hlist_nulls_node *first;
>>> +};
>>> +
>>> +struct hlist_nulls_node {
>>> +	struct hlist_nulls_node *next, **pprev;
>>> +};
>>> +#define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
>>> +	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 
>>> 1)))
>>> +
>>> +#define hlist_nulls_entry(ptr, type, member) 
>>> container_of(ptr,type,member)
>>> +/**
>>> + * ptr_is_a_nulls - Test if a ptr is a nulls
>>> + * @ptr: ptr to be tested
>>> + *
>>> + */
>>> +static inline int is_a_nulls(const struct hlist_nulls_node *ptr)
>>> +{
>>> +	return ((unsigned long)ptr & 1);
>>> +}
>>> +
>>> +/**
>>> + * get_nulls_value - Get the 'nulls' value of the end of chain
>>> + * @ptr: end of chain
>>> + *
>>> + * Should be called only if is_a_nulls(ptr);
>>> + */
>>> +static inline unsigned long get_nulls_value(const struct 
>>> hlist_nulls_node *ptr)
>>> +{
>>> +	return ((unsigned long)ptr) >> 1;
>>> +}
>>> +
>>> +static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)
>>> +{
>>> +	return !h->pprev;
>>> +}
>>> +
>>> +static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
>>> +{
>>> +	return is_a_nulls(h->first);
>>> +}
>>> +
>>> +static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
>>> +{
>>> +	struct hlist_nulls_node *next = n->next;
>>> +	struct hlist_nulls_node **pprev = n->pprev;
>>> +	*pprev = next;
>>> +	if (!is_a_nulls(next))
>>> +		next->pprev = pprev;
>>> +}
>>> +
>>> +/**
>>> + * hlist_nulls_for_each_entry	- iterate over list of given type
>>> + * @tpos:	the type * to use as a loop cursor.
>>> + * @pos:	the &struct hlist_node to use as a loop cursor.
>>> + * @head:	the head for your list.
>>> + * @member:	the name of the hlist_node within the struct.
>>> + *
>>> + */
>>> +#define hlist_nulls_for_each_entry(tpos, pos, head, member)		       \
>>> +	for (pos = (head)->first;					       \
>>> +	     (!is_a_nulls(pos)) &&					       \
>>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
>>> +	     pos = pos->next)
>>> +
>>> +/**
>>> + * hlist_nulls_for_each_entry_from - iterate over a hlist continuing 
>>> from current point
>>> + * @tpos:	the type * to use as a loop cursor.
>>> + * @pos:	the &struct hlist_node to use as a loop cursor.
>> And @pos is the starting point, correct?  Suggest something like:
>> 	@pos:	the &struct hlist_node serving as starting point and cursor
>
> Yes, comment was copied from hlist_for_each_entry_from() comment, this one
> needs update too.
>
>>> + * @member:	the name of the hlist_node within the struct.
>>> + *
>>> + */
>>> +#define hlist_nulls_for_each_entry_from(tpos, pos, member)	\
>>> +	for (; (!is_a_nulls(pos)) && 				\
>>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
>>> +	     pos = pos->next)
>>> +
>>> +#endif
>>> diff --git a/include/linux/rculist_nulls.h 
>>> b/include/linux/rculist_nulls.h
>>> new file mode 100644
>>> index 0000000..b185ac4
>>> --- /dev/null
>>> +++ b/include/linux/rculist_nulls.h
>>> @@ -0,0 +1,110 @@
>>> +#ifndef _LINUX_RCULIST_NULLS_H
>>> +#define _LINUX_RCULIST_NULLS_H
>>> +
>>> +#ifdef __KERNEL__
>>> +
>>> +/*
>>> + * RCU-protected list version
>>> + */
>>> +#include <linux/list_nulls.h>
>>> +#include <linux/rcupdate.h>
>>> +
>>> +/**
>>> + * hlist_nulls_del_init_rcu - deletes entry from hash list with 
>>> re-initialization
>>> + * @n: the element to delete from the hash list.
>>> + *
>>> + * Note: hlist_nulls_unhashed() on the node return true after this. It 
>>> is
>>> + * useful for RCU based read lockfree traversal if the writer side
>>> + * must know if the list entry is still hashed or already unhashed.
>>> + *
>>> + * In particular, it means that we can not poison the forward pointers
>>> + * that may still be used for walking the hash list and we can only
>>> + * zero the pprev pointer so list_unhashed() will return true after
>>> + * this.
>>> + *
>>> + * The caller must take whatever precautions are necessary (such as
>>> + * holding appropriate locks) to avoid racing with another
>>> + * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
>>> + * hlist_nulls_del_rcu(), running on this same list.  However, it is
>>> + * perfectly legal to run concurrently with the _rcu list-traversal
>>> + * primitives, such as hlist_nulls_for_each_entry_rcu().
>>> + */
>>> +static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>> +{
>>> +	if (!hlist_nulls_unhashed(n)) {
>>> +		__hlist_nulls_del(n);
>>> +		n->pprev = NULL;
>>> +	}
>>> +}
>> The point here is to allow an RCU reader to grab the update-side lock
>> while holding a reference to an hlist_nulls_node, and then be able to
>> blindly call hlist_nulls_del_init_rcu() without having to do any complex
>> check to see if the element has already been deleted?
>> But this only works if each free operation waits for a grace period.
>> If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to
>> revalidate after grabbing the update-side lock, right?  Hmmm...
>
> <start a brain refresh cycle>
>  <read again your questions>
>    Tilt... 
>
> hlist_nulls_del_init_rcu() is only used by a writer, exactly
> like hlist_del_init_rcu().
> I see nothing special about SLAB_DESTROY_BY_RCU here.
>
> static inline void hlist_del_init_rcu(struct hlist_node *n)
> {
>        if (!hlist_unhashed(n)) {
>                __hlist_del(n);
>                n->pprev = NULL;
>        }
> }

Not a problem, as you don't use it the way I was thinking.

For whatever it is worth, here is a more complete use case, on the
off-chance that it becomes useful some time:

	retry:
	rcu_read_lock();
	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
		if (!(curgen = still_valid(tpos)))
			goto retry;
		if (needs_deletion(tpos)) {
			spin_lock(&update_side_lock);
			if (still_valid(tpos) == curgen)
				hlist_nulls_del_init_rcu(pos);
			spin_unlock(&update_side_lock);
		}
	}
	rcu_read_unlock();

This approach requires that the key and a generation number be encoded
into a single word, and that the generation number be changed on each
allocation and on each free.
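
A hedged sketch of such an encoding (the bit split and names are made up
for illustration):

	#define GEN_BITS	8
	#define GEN_MASK	((1UL << GEN_BITS) - 1)

	static inline unsigned long make_keygen(unsigned long key,
						unsigned long gen)
	{
		/*
		 * still_valid() would recompute this and compare it
		 * against the value sampled when the object was found.
		 */
		return (key << GEN_BITS) | (gen & GEN_MASK);
	}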

>>> +
>>> +/**
>>> + * hlist_nulls_del_rcu - deletes entry from hash list without 
>>> re-initialization
>>> + * @n: the element to delete from the hash list.
>>> + *
>>> + * Note: hlist_nulls_unhashed() on entry does not return true after 
>>> this,
>>> + * the entry is in an undefined state. It is useful for RCU based
>>> + * lockfree traversal.
>>> + *
>>> + * In particular, it means that we can not poison the forward
>>> + * pointers that may still be used for walking the hash list.
>>> + *
>>> + * The caller must take whatever precautions are necessary
>>> + * (such as holding appropriate locks) to avoid racing
>>> + * with another list-mutation primitive, such as 
>>> hlist_nulls_add_head_rcu()
>>> + * or hlist_nulls_del_rcu(), running on this same list.
>>> + * However, it is perfectly legal to run concurrently with
>>> + * the _rcu list-traversal primitives, such as
>>> + * hlist_nulls_for_each_entry().
>>> + */
>>> +static inline void hlist_nulls_del_rcu(struct hlist_nulls_node *n)
>>> +{
>>> +	__hlist_nulls_del(n);
>>> +	n->pprev = LIST_POISON2;
>>> +}
>>> +
>>> +/**
>>> + * hlist_nulls_add_head_rcu
>>> + * @n: the element to add to the hash list.
>>> + * @h: the list to add to.
>>> + *
>>> + * Description:
>>> + * Adds the specified element to the specified hlist_nulls,
>>> + * while permitting racing traversals.
>>> + *
>>> + * The caller must take whatever precautions are necessary
>>> + * (such as holding appropriate locks) to avoid racing
>>> + * with another list-mutation primitive, such as 
>>> hlist_nulls_add_head_rcu()
>>> + * or hlist_nulls_del_rcu(), running on this same list.
>>> + * However, it is perfectly legal to run concurrently with
>>> + * the _rcu list-traversal primitives, such as
>>> + * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
>>> + * problems on Alpha CPUs.  Regardless of the type of CPU, the
>>> + * list-traversal primitive must be guarded by rcu_read_lock().
>>> + */
>>> +static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
>>> +					struct hlist_nulls_head *h)
>>> +{
>>> +	struct hlist_nulls_node *first = h->first;
>>> +
>>> +	n->next = first;
>>> +	n->pprev = &h->first;
>>> +	rcu_assign_pointer(h->first, n);
>>> +	if (!is_a_nulls(first))
>>> +		first->pprev = &n->next;
>>> +}
>>> +/**
>>> + * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
>>> + * @tpos:	the type * to use as a loop cursor.
>>> + * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
>>> + * @head:	the head for your list.
>>> + * @member:	the name of the hlist_nulls_node within the struct.
>>> + *
>>> + */
>>> +#define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \
>>> +	for (pos = rcu_dereference((head)->first);			 \
>>> +		(!is_a_nulls(pos)) && 			\
>>> +		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1; }); \
>>> +		pos = rcu_dereference(pos->next))
>>> +
>>> +#endif
>>> +#endif
>> Any chance of using a trick like Oleg used to get rid of the "pos"
>> argument?  http://lkml.org/lkml/2008/3/12/47
>> The hlist_nulls_node must always be at an even address, correct?
>> Couldn't this fact be used to allow testing for is_a_nulls() on tpos
>> rather than on pos?  Or is there a better way to approach this?
>
> #define sk_nulls_for_each_rcu(__sk, node, list) \
> 	hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
>
> 1) __sk is the pointer to found item if any is found in the loop
>
> 2) node will contain the end value of chain, in case we find nothing in 
> loop
>   because we need to check it after the loop.
>
> if (get_nulls_value(node) != hash)
> 	 goto begin;
>
> I dont know, it seems quite complex to try to use only three args ?
>
> This algo is not very easy to read as is already ...

One way around #2 would be for get_nulls_value() to undo the
hlist_nulls_entry().  Not sure whether it is worth it, but a
thought.

							Thanx, Paul
Arnaldo Carvalho de Melo Nov. 19, 2008, 6:53 p.m. UTC | #7
On Wed, Nov 19, 2008 at 10:46:24AM -0800, Paul E. McKenney wrote:
> For whatever it is worth, here is a more complete use case, on the
> off-chance that it becomes useful some time:
> 
> 	retry:
> 	rcu_read_lock();

	retry: /* should be here, huh? */

> 	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
> 		if (!(curgen = still_valid(tpos)))
> 			goto retry;
> 		if (needs_deletion(tpos)) {
> 			spin_lock(&update_side_lock);
> 			if (still_valid(tpos) == curgen)
> 				hlist_nulls_del_init_rcu(pos);
> 			spin_unlock(&update_side_lock);
> 		}
> 	}
> 	rcu_read_unlock();

- Arnaldo
Eric Dumazet Nov. 19, 2008, 8:39 p.m. UTC | #8
Paul E. McKenney wrote:
> On Wed, Nov 19, 2008 at 06:53:20PM +0100, Eric Dumazet wrote:
>> Paul E. McKenney wrote:
>>>> +
>>>> +/**
>>>> + * hlist_nulls_del_init_rcu - deletes entry from hash list with 
>>>> re-initialization
>>>> + * @n: the element to delete from the hash list.
>>>> + *
>>>> + * Note: hlist_nulls_unhashed() on the node return true after this. It 
>>>> is
>>>> + * useful for RCU based read lockfree traversal if the writer side
>>>> + * must know if the list entry is still hashed or already unhashed.
>>>> + *
>>>> + * In particular, it means that we can not poison the forward pointers
>>>> + * that may still be used for walking the hash list and we can only
>>>> + * zero the pprev pointer so list_unhashed() will return true after
>>>> + * this.
>>>> + *
>>>> + * The caller must take whatever precautions are necessary (such as
>>>> + * holding appropriate locks) to avoid racing with another
>>>> + * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
>>>> + * hlist_nulls_del_rcu(), running on this same list.  However, it is
>>>> + * perfectly legal to run concurrently with the _rcu list-traversal
>>>> + * primitives, such as hlist_nulls_for_each_entry_rcu().
>>>> + */
>>>> +static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>> +{
>>>> +	if (!hlist_nulls_unhashed(n)) {
>>>> +		__hlist_nulls_del(n);
>>>> +		n->pprev = NULL;
>>>> +	}
>>>> +}
>>> The point here is to allow an RCU reader to grab the update-side lock
>>> while holding a reference to an hlist_nulls_node, and then be able to
>>> blindly call hlist_nulls_del_init_rcu() without having to do any complex
>>> check to see if the element has already been deleted?
>>> But this only works if each free operation waits for a grace period.
>>> If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to
>>> revalidate after grabbing the update-side lock, right?  Hmmm...
>> <start a brain refresh cycle>
>>  <read again your questions>
>>    Tilt... 
>>
>> hlist_nulls_del_init_rcu() is only used by a writer, exactly
>> like hlist_del_init_rcu().
>> I see nothing special about SLAB_DESTROY_BY_RCU here.
>>
>> static inline void hlist_del_init_rcu(struct hlist_node *n)
>> {
>>        if (!hlist_unhashed(n)) {
>>                __hlist_del(n);
>>                n->pprev = NULL;
>>        }
>> }
> 
> Not a problem, as you don't use it the way I was thinking.
> 
> For whatever it is worth, here is a more complete use case, on the
> off-chance that it becomes useful some time:
> 
> 	retry:
> 	rcu_read_lock();
> 	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
> 		if (!(curgen = still_valid(tpos)))
> 			goto retry;
> 		if (needs_deletion(tpos)) {
> 			spin_lock(&update_side_lock);
> 			if (still_valid(tpos) == curgen)
> 				hlist_nulls_del_init_rcu(pos);
> 			spin_unlock(&update_side_lock);
> 		}
> 	}
> 	rcu_read_unlock();
> 
> This approach requires that the key and a generation number be encoded
> into a single word, and that the generation number be changed on each
> allocation and on each free.

Hum, we should add this template to Documentation/RCU, I guess.

Thanks


Paul E. McKenney Nov. 19, 2008, 9:17 p.m. UTC | #9
On Wed, Nov 19, 2008 at 04:53:47PM -0200, Arnaldo Carvalho de Melo wrote:
> On Wed, Nov 19, 2008 at 10:46:24AM -0800, Paul E. McKenney wrote:
> > For whatever it is worth, here is a more complete use case, on the
> > off-chance that it becomes useful some time:
> > 
> > 	retry:
> > 	rcu_read_lock();
> 
> 	retry: /* should be here, huh? */

Indeed!  Either that or have an rcu_read_unlock() before the
goto retry.

Good catch!

							Thanx, Paul

> > 	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
> > 		if (!(curgen = still_valid(tpos)))
> > 			goto retry;
> > 		if (needs_deletion(tpos)) {
> > 			spin_lock(&update_side_lock);
> > 			if (still_valid(tpos) == curgen)
> > 				hlist_nulls_del_init_rcu(pos);
> > 			spin_unlock(&update_side_lock);
> > 		}
> > 	}
> > 	rcu_read_unlock();
> 
> - Arnaldo
Paul E. McKenney Nov. 19, 2008, 9:21 p.m. UTC | #10
On Wed, Nov 19, 2008 at 09:39:57PM +0100, Eric Dumazet wrote:
> Paul E. McKenney wrote:
>> On Wed, Nov 19, 2008 at 06:53:20PM +0100, Eric Dumazet wrote:
>> Paul E. McKenney wrote:
>>>>> +
>>>>> +/**
>>>>> + * hlist_nulls_del_init_rcu - deletes entry from hash list with 
>>>>> re-initialization
>>>>> + * @n: the element to delete from the hash list.
>>>>> + *
>>>>> + * Note: hlist_nulls_unhashed() on the node return true after this. It 
>>>>> is
>>>>> + * useful for RCU based read lockfree traversal if the writer side
>>>>> + * must know if the list entry is still hashed or already unhashed.
>>>>> + *
>>>>> + * In particular, it means that we can not poison the forward pointers
>>>>> + * that may still be used for walking the hash list and we can only
>>>>> + * zero the pprev pointer so hlist_nulls_unhashed() will return true after
>>>>> + * this.
>>>>> + *
>>>>> + * The caller must take whatever precautions are necessary (such as
>>>>> + * holding appropriate locks) to avoid racing with another
>>>>> + * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
>>>>> + * hlist_nulls_del_rcu(), running on this same list.  However, it is
>>>>> + * perfectly legal to run concurrently with the _rcu list-traversal
>>>>> + * primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>> + */
>>>>> +static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>> +{
>>>>> +	if (!hlist_nulls_unhashed(n)) {
>>>>> +		__hlist_nulls_del(n);
>>>>> +		n->pprev = NULL;
>>>>> +	}
>>>>> +}
>>>> The point here is to allow an RCU reader to grab the update-side lock
>>>> while holding a reference to an hlist_nulls_node, and then be able to
>>>> blindly call hlist_nulls_del_init_rcu() without having to do any complex
>>>> check to see if the element has already been deleted?
>>>> But this only works if each free operation waits for a grace period.
>>>> If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to
>>>> revalidate after grabbing the update-side lock, right?  Hmmm...
>>> <start a brain refresh cycle>
>>>  <read again your questions>
>>>    Tilt... 
>>> hlist_nulls_del_init_rcu() is only used by a writer, exactly
>>> like hlist_del_init_rcu().
>>> I see nothing special about SLAB_DESTROY_BY_RCU here.
>>>
>>> static inline void hlist_del_init_rcu(struct hlist_node *n)
>>> {
>>>        if (!hlist_unhashed(n)) {
>>>                __hlist_del(n);
>>>                n->pprev = NULL;
>>>        }
>>> }
>> Not a problem, as you don't use it the way I was thinking.
>> For whatever it is worth, here is a more complete use case, on the
>> off-chance that it becomes useful some time:
>> 	retry:
>> 	rcu_read_lock();
>> 	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
>> 		if (!(curgen = still_valid(tpos)))
>> 			goto retry;
>> 		if (needs_deletion(tpos)) {
>> 			spin_lock(&update_side_lock);
>> 			if (still_valid(tpos) == curgen)
>> 				hlist_nulls_del_init_rcu(pos);
>> 			spin_unlock(&update_side_lock);
>> 		}
>> 	}
>> 	rcu_read_unlock();
>> This approach requires that the key and a generation number be encoded
>> into a single word, and that the generation number be changed on each
>> allocation and on each free.
>
> Hum, we should add this template to Documentation/RCU, I guess.

With Arnaldo's change -- probably should prototype and test to find
the other inevitable bugs.  :-/
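
For reference, the template with Arnaldo's fix folded in (still just an
untested sketch; still_valid(), needs_deletion() and update_side_lock
are placeholders):

	rcu_read_lock();
	retry:
	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
		if (!(curgen = still_valid(tpos)))
			goto retry;
		if (needs_deletion(tpos)) {
			spin_lock(&update_side_lock);
			if (still_valid(tpos) == curgen)
				hlist_nulls_del_init_rcu(pos);
			spin_unlock(&update_side_lock);
		}
	}
	rcu_read_unlock();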

							Thanx, Paul
diff mbox

Patch

diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
new file mode 100644
index 0000000..856dee8
--- /dev/null
+++ b/include/linux/list_nulls.h
@@ -0,0 +1,94 @@ 
+#ifndef _LINUX_LIST_NULLS_H
+#define _LINUX_LIST_NULLS_H
+
+/*
+ * Special version of lists, where end of list is not a NULL pointer,
+ * but a 'nulls' marker, which can have many different values.
+ * (up to 2^31 different values guaranteed on all platforms)
+ *
+ * In the standard hlist, termination of a list is the NULL pointer.
+ * In this special 'nulls' variant, we use the fact that objects stored in
+ * a list are aligned on a word (4 or 8 bytes alignment).
+ * We therefore use the least significant bit of 'ptr' :
+ * Set to 1 : This is a 'nulls' end-of-list marker (ptr >> 1)
+ * Set to 0 : This is a pointer to some object (ptr)
+ */
+
+struct hlist_nulls_head {
+	struct hlist_nulls_node *first;
+};
+
+struct hlist_nulls_node {
+	struct hlist_nulls_node *next, **pprev;
+};
+#define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
+	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 1)))
+
+#define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
+/**
+ * is_a_nulls - Test if a ptr is a 'nulls' end-of-list marker
+ * @ptr: ptr to be tested
+ *
+ */
+static inline int is_a_nulls(const struct hlist_nulls_node *ptr)
+{
+	return ((unsigned long)ptr & 1);
+}
+
+/**
+ * get_nulls_value - Get the 'nulls' value of the end of chain
+ * @ptr: end of chain
+ *
+ * Should be called only if is_a_nulls(ptr) is true.
+ */
+static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
+{
+	return ((unsigned long)ptr) >> 1;
+}
+
+static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)
+{
+	return !h->pprev;
+}
+
+static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
+{
+	return is_a_nulls(h->first);
+}
+
+static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
+{
+	struct hlist_nulls_node *next = n->next;
+	struct hlist_nulls_node **pprev = n->pprev;
+	*pprev = next;
+	if (!is_a_nulls(next))
+		next->pprev = pprev;
+}
+
+/**
+ * hlist_nulls_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_nulls_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry(tpos, pos, head, member)		       \
+	for (pos = (head)->first;					       \
+	     (!is_a_nulls(pos)) &&					       \
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+/**
+ * hlist_nulls_for_each_entry_from - iterate over a hlist continuing from current point
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
+ * @member:	the name of the hlist_nulls_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry_from(tpos, pos, member)	\
+	for (; (!is_a_nulls(pos)) && 				\
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+#endif
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
new file mode 100644
index 0000000..b185ac4
--- /dev/null
+++ b/include/linux/rculist_nulls.h
@@ -0,0 +1,110 @@ 
+#ifndef _LINUX_RCULIST_NULLS_H
+#define _LINUX_RCULIST_NULLS_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_nulls.h>
+#include <linux/rcupdate.h>
+
+/**
+ * hlist_nulls_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_nulls_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so hlist_nulls_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_nulls_add_head_rcu() or
+ * hlist_nulls_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_nulls_for_each_entry_rcu().
+ */
+static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
+{
+	if (!hlist_nulls_unhashed(n)) {
+		__hlist_nulls_del(n);
+		n->pprev = NULL;
+	}
+}
+
+/**
+ * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_nulls_unhashed() on the entry does not return true after
+ * this; the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu().
+ */
+static inline void hlist_nulls_del_rcu(struct hlist_nulls_node *n)
+{
+	__hlist_nulls_del(n);
+	n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_nulls_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_nulls,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with the _rcu
+ * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu(),
+ * whose rcu_dereference() prevents memory-consistency problems on
+ * Alpha CPUs.  Regardless of the type of CPU, the list-traversal
+ * primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
+					struct hlist_nulls_head *h)
+{
+	struct hlist_nulls_node *first = h->first;
+
+	n->next = first;
+	n->pprev = &h->first;
+	rcu_assign_pointer(h->first, n);
+	if (!is_a_nulls(first))
+		first->pprev = &n->next;
+}
+/**
+ * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_nulls_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_nulls_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \
+	for (pos = rcu_dereference((head)->first);			 \
+		(!is_a_nulls(pos)) && 			\
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference(pos->next))
+
+#endif
+#endif
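
For completeness, a minimal usage sketch of the plain (non-RCU) helpers;
struct foo, foo_table and FOO_SLOTS are illustration names, not part of
the patch.  Note the patch provides no plain add helper, so insertions
would go through hlist_nulls_add_head_rcu() under the update-side lock:

	#include <linux/list_nulls.h>

	#define FOO_SLOTS	16

	struct foo {
		unsigned long		key;
		struct hlist_nulls_node	node;
	};

	static struct hlist_nulls_head	foo_table[FOO_SLOTS];

	static void foo_init(void)
	{
		int i;

		/*
		 * End each chain with its slot index as the nulls value,
		 * so a lockless reader can use get_nulls_value() to detect
		 * that it was moved to another chain.
		 */
		for (i = 0; i < FOO_SLOTS; i++)
			INIT_HLIST_NULLS_HEAD(&foo_table[i], i);
	}

	/* Lookup with the update-side lock held: the plain iterator is enough. */
	static struct foo *foo_lookup_locked(unsigned long key)
	{
		struct hlist_nulls_head *head = &foo_table[key % FOO_SLOTS];
		struct hlist_nulls_node *pos;
		struct foo *f;

		hlist_nulls_for_each_entry(f, pos, head, node)
			if (f->key == key)
				return f;
		return NULL;
	}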