diff mbox

udp: Introduce special NULL pointers for hlist termination

Message ID 4909D551.9080309@cosmosbay.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 30, 2008, 3:40 p.m. UTC
Eric Dumazet a écrit :
> Paul E. McKenney a écrit :
>> On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
>>> Hum... Another way of handling all those cases and avoid memory barriers
>>> would be to have different "NULL" pointers.
>>>
>>> Each hash chain should have a unique "NULL" pointer (in the case of 
>>> UDP, it
>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>>
>>> Then, when performing a lookup, a reader should check the "NULL" pointer
>>> it get at the end of its lookup has is the "hash" value of its chain.
>>>
>>> If not -> restart the loop, aka "goto begin;" :)
>>>
>>> We could avoid memory barriers then.
>>>
>>> In the two cases Corey mentioned, this trick could let us avoid 
>>> memory barriers.
>>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>>
>>> What do you think ?
>>
>> Kinky!!!  ;-)
>>
>> Then the rcu_dereference() would be supplying the needed memory barriers.
>>
>> Hmmm...  I guess that the only confusion would be if the element got
>> removed and then added to the same list.  But then if its pointer was
>> pseudo-NULL, then that would mean that all subsequent elements had been
>> removed, and all preceding ones added after the scan started.
>>
>> Which might well be harmless, but I must defer to you on this one at
>> the moment.
>>
>> If you need a larger hash table, another approach would be to set the
>> pointer's low-order bit, allowing the upper bits to be a full-sized
>> index -- or even a pointer to the list header.  Just make very sure
>> to clear the pointer when freeing, or an element on the freelist
>> could end up looking like a legitimate end of list...  Which again
>> might well be safe, but why inflict this on oneself?
> 

Ok, here is an updated and tested patch.

Thanks everybody

[PATCH] udp: Introduce special NULL pointers for hlist termination

In order to safely detect changes in chains, we would like to have different
'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
value, so that the lockless readers can detect their lookups evaded from
their starting chain.

We introduce a new type of hlist implementation, named hlist_nulls, were
we use the least significant bit of the 'ptr' to tell if its a "NULL" value
or a pointer to an object. We expect to use this new hlist variant for TCP
as well.

For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
(UDP_HTABLE_SIZE=128)

Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
pointers *before* checking key values) we added in commit 
96631ed16c514cf8b28fab991a076985ce378c26
(udp: introduce sk_for_each_rcu_safenext())

This also saves a write memory barrier in udp_lib_get_port(), between
sk->sk_hash update and sk->next update)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/list_nulls.h    |   97 ++++++++++++++++++++++++++++++++
 include/linux/rculist.h       |   17 -----
 include/linux/rculist_nulls.h |   55 ++++++++++++++++++
 include/net/sock.h            |   50 ++++++++++++----
 include/net/udp.h             |    2
 net/ipv4/udp.c                |   40 ++++++-------
 net/ipv6/udp.c                |   22 ++++---
 7 files changed, 228 insertions(+), 55 deletions(-)

Comments

stephen hemminger Oct. 30, 2008, 3:51 p.m. UTC | #1
On Thu, 30 Oct 2008 16:40:01 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Eric Dumazet a écrit :
> > Paul E. McKenney a écrit :
> >> On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
> >>> Hum... Another way of handling all those cases and avoid memory barriers
> >>> would be to have different "NULL" pointers.
> >>>
> >>> Each hash chain should have a unique "NULL" pointer (in the case of 
> >>> UDP, it
> >>> can be the 128 values : [ (void*)0 .. (void *)127 ]
> >>>
> >>> Then, when performing a lookup, a reader should check the "NULL" pointer
> >>> it get at the end of its lookup has is the "hash" value of its chain.
> >>>
> >>> If not -> restart the loop, aka "goto begin;" :)
> >>>
> >>> We could avoid memory barriers then.
> >>>
> >>> In the two cases Corey mentioned, this trick could let us avoid 
> >>> memory barriers.
> >>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
> >>>
> >>> What do you think ?
> >>
> >> Kinky!!!  ;-)
> >>
> >> Then the rcu_dereference() would be supplying the needed memory barriers.
> >>
> >> Hmmm...  I guess that the only confusion would be if the element got
> >> removed and then added to the same list.  But then if its pointer was
> >> pseudo-NULL, then that would mean that all subsequent elements had been
> >> removed, and all preceding ones added after the scan started.
> >>
> >> Which might well be harmless, but I must defer to you on this one at
> >> the moment.
> >>
> >> If you need a larger hash table, another approach would be to set the
> >> pointer's low-order bit, allowing the upper bits to be a full-sized
> >> index -- or even a pointer to the list header.  Just make very sure
> >> to clear the pointer when freeing, or an element on the freelist
> >> could end up looking like a legitimate end of list...  Which again
> >> might well be safe, but why inflict this on oneself?
> > 
> 
> Ok, here is an updated and tested patch.
> 
> Thanks everybody
> 
> [PATCH] udp: Introduce special NULL pointers for hlist termination
> 
> In order to safely detect changes in chains, we would like to have different
> 'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
> value, so that the lockless readers can detect their lookups evaded from
> their starting chain.
> 
> We introduce a new type of hlist implementation, named hlist_nulls, were
> we use the least significant bit of the 'ptr' to tell if its a "NULL" value
> or a pointer to an object. We expect to use this new hlist variant for TCP
> as well.
> 
> For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
> (UDP_HTABLE_SIZE=128)
> 
> Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
> pointers *before* checking key values) we added in commit 
> 96631ed16c514cf8b28fab991a076985ce378c26
> (udp: introduce sk_for_each_rcu_safenext())
> 
> This also saves a write memory barrier in udp_lib_get_port(), between
> sk->sk_hash update and sk->next update)
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---

IMHO this goes over the edge into tricky hack. Is it really worth it?
Is there a better simpler way?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Oct. 30, 2008, 4:01 p.m. UTC | #2
On Thu, 2008-10-30 at 16:40 +0100, Eric Dumazet wrote:
> 
> [PATCH] udp: Introduce special NULL pointers for hlist termination
> 
> In order to safely detect changes in chains, we would like to have different
> 'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
> value, so that the lockless readers can detect their lookups evaded from
> their starting chain.
> 
> We introduce a new type of hlist implementation, named hlist_nulls, were
> we use the least significant bit of the 'ptr' to tell if its a "NULL" value
> or a pointer to an object. We expect to use this new hlist variant for TCP
> as well.
> 
> For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
> (UDP_HTABLE_SIZE=128)
> 
> Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
> pointers *before* checking key values) we added in commit 
> 96631ed16c514cf8b28fab991a076985ce378c26
> (udp: introduce sk_for_each_rcu_safenext())
> 
> This also saves a write memory barrier in udp_lib_get_port(), between
> sk->sk_hash update and sk->next update)
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  include/linux/list_nulls.h    |   97 ++++++++++++++++++++++++++++++++
>  include/linux/rculist.h       |   17 -----
>  include/linux/rculist_nulls.h |   55 ++++++++++++++++++
>  include/net/sock.h            |   50 ++++++++++++----
>  include/net/udp.h             |    2
>  net/ipv4/udp.c                |   40 ++++++-------
>  net/ipv6/udp.c                |   22 ++++---
>  7 files changed, 228 insertions(+), 55 deletions(-)

If we're going to do this, It'd be good to have the list_nulls stuff in
their own patch, as clearly they are not UDP specific.

Also, I think it would be very good to have some extensive comments in
the list_nulls files describing their use in clear and concise language,
because the above changelog doesn't even begin to explain things for
those not following this thread.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Corey Minyard Oct. 30, 2008, 4:28 p.m. UTC | #3
Stephen Hemminger wrote:
> On Thu, 30 Oct 2008 16:40:01 +0100
> Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>   
>> Eric Dumazet a écrit :
>>     
>>> Paul E. McKenney a écrit :
>>>       
>>>> On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
>>>>         
>>>>> Hum... Another way of handling all those cases and avoid memory barriers
>>>>> would be to have different "NULL" pointers.
>>>>>
>>>>> Each hash chain should have a unique "NULL" pointer (in the case of 
>>>>> UDP, it
>>>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>>>>
>>>>> Then, when performing a lookup, a reader should check the "NULL" pointer
>>>>> it get at the end of its lookup has is the "hash" value of its chain.
>>>>>
>>>>> If not -> restart the loop, aka "goto begin;" :)
>>>>>
>>>>> We could avoid memory barriers then.
>>>>>
>>>>> In the two cases Corey mentioned, this trick could let us avoid 
>>>>> memory barriers.
>>>>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>>>>
>>>>> What do you think ?
>>>>>           
>>>> Kinky!!!  ;-)
>>>>
>>>> Then the rcu_dereference() would be supplying the needed memory barriers.
>>>>
>>>> Hmmm...  I guess that the only confusion would be if the element got
>>>> removed and then added to the same list.  But then if its pointer was
>>>> pseudo-NULL, then that would mean that all subsequent elements had been
>>>> removed, and all preceding ones added after the scan started.
>>>>
>>>> Which might well be harmless, but I must defer to you on this one at
>>>> the moment.
>>>>
>>>> If you need a larger hash table, another approach would be to set the
>>>> pointer's low-order bit, allowing the upper bits to be a full-sized
>>>> index -- or even a pointer to the list header.  Just make very sure
>>>> to clear the pointer when freeing, or an element on the freelist
>>>> could end up looking like a legitimate end of list...  Which again
>>>> might well be safe, but why inflict this on oneself?
>>>>         
>> Ok, here is an updated and tested patch.
>>
>> Thanks everybody
>>
>> [PATCH] udp: Introduce special NULL pointers for hlist termination
>>
>> In order to safely detect changes in chains, we would like to have different
>> 'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
>> value, so that the lockless readers can detect their lookups evaded from
>> their starting chain.
>>
>> We introduce a new type of hlist implementation, named hlist_nulls, were
>> we use the least significant bit of the 'ptr' to tell if its a "NULL" value
>> or a pointer to an object. We expect to use this new hlist variant for TCP
>> as well.
>>
>> For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
>> (UDP_HTABLE_SIZE=128)
>>
>> Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
>> pointers *before* checking key values) we added in commit 
>> 96631ed16c514cf8b28fab991a076985ce378c26
>> (udp: introduce sk_for_each_rcu_safenext())
>>
>> This also saves a write memory barrier in udp_lib_get_port(), between
>> sk->sk_hash update and sk->next update)
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
>>     
>
> IMHO this goes over the edge into tricky hack. Is it really worth it?
> Is there a better simpler way?
>   
The only think I've thought of is to do a single smp_rmb() after the 
loop scanning the list and check the sk_hash value again.  That's better 
than the read barrier for every list element, but still not as good as 
this list from a performance point of view.

IMHO, this is a tricky hack, but it if is well abstracted and documented 
I think it's ok.  I'd guess something like this will become more often 
used as we get larger numbers of processors on systems.

It is annoying that it doesn't help the performance for multicast.  
However, I think the current patch will solve the DOS issue for 
multicast, since it switches to a normal spinlock and has a per-list lock.

-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 30, 2008, 5:12 p.m. UTC | #4
Stephen Hemminger a écrit :
> On Thu, 30 Oct 2008 16:40:01 +0100
> Eric Dumazet <dada1@cosmosbay.com> wrote:

>>
>> [PATCH] udp: Introduce special NULL pointers for hlist termination
>>
>> In order to safely detect changes in chains, we would like to have different
>> 'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
>> value, so that the lockless readers can detect their lookups evaded from
>> their starting chain.
>>
>> We introduce a new type of hlist implementation, named hlist_nulls, were
>> we use the least significant bit of the 'ptr' to tell if its a "NULL" value
>> or a pointer to an object. We expect to use this new hlist variant for TCP
>> as well.
>>
>> For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
>> (UDP_HTABLE_SIZE=128)
>>
>> Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
>> pointers *before* checking key values) we added in commit 
>> 96631ed16c514cf8b28fab991a076985ce378c26
>> (udp: introduce sk_for_each_rcu_safenext())
>>
>> This also saves a write memory barrier in udp_lib_get_port(), between
>> sk->sk_hash update and sk->next update)
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
> 
> IMHO this goes over the edge into tricky hack. Is it really worth it?
> Is there a better simpler way?

rwlocks , spinlocks, seqlocks :)

More seriously Stephen, if the infrastructure is clean, and well tested on a relative
simple case (UDP), it can then be deployed on a much more interesting protocol : TCP

The moment we switch to RCU, we have to accept the pain of really understand what
we did. Details are scary yes.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 31, 2008, 7:51 a.m. UTC | #5
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 30 Oct 2008 18:12:03 +0100

> More seriously Stephen, if the infrastructure is clean, and well
> tested on a relative simple case (UDP), it can then be deployed on a
> much more interesting protocol : TCP

I'm really looking forward to someone finally tackling that problem :)

> The moment we switch to RCU, we have to accept the pain of really
> understand what we did. Details are scary yes.

Agreed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 13, 2008, 1:13 p.m. UTC | #6
Hi all

Here is a serie of three patches (based on net-next-2.6), to continue work
with RCU on UDP/TCP/DCCP stacks

Many thanks for all usefull reviews and comments, especially from Paul and Corey.

1) Introduce hlist_nulls variant of hlist

   hlist uses NULL value to finish a chain.
   hlist_nulls variant use the low order bit set to 1 to signal an end marker.
   This allows to store many different end markers, so that some RCU lockless
   algos (used in TCP/UDP stack for example) can save some memory barriers in
   fast paths.

2) Use hlist_nulls in UDP RCU code

   This is a straightforward patch, using hlist_nulls infrastructure.
   RCU-ification already done on UDP two weeks ago, so hlist_nulls
   permits us to avoid some memory barriers, both at lookup time
   and delete time. Patch is large because it adds new macros to
   include/net/sock.h. These macros will be used by TCP & DCCP too.

3) Convert TCP & DCCP hash tables to use RCU & hlist_nulls

   RCU was added to UDP lookups, using a fast infrastructure :
   - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
     price of call_rcu() at freeing time.
   - hlist_nulls permits to use few memory barriers.

   This patch uses same infrastructure for TCP/DCCP established
   and timewait sockets.

   Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
   using short lived TCP connections. A followup patch, converting
   rwlocks to spinlocks will even speedup this case.

   __inet_lookup_established() is pretty fast now we dont have to
   dirty a contended cache line (read_lock/read_unlock)

   Only established and timewait hashtable are converted to RCU
  (bind table and listen table are still using traditional locking)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Nov. 13, 2008, 5:20 p.m. UTC | #7
Eric Dumazet <dada1@cosmosbay.com> writes:


> 1) Introduce hlist_nulls variant of hlist
>
>   hlist uses NULL value to finish a chain.
>   hlist_nulls variant use the low order bit set to 1 to signal an end marker.
>   This allows to store many different end markers, so that some RCU lockless
>   algos (used in TCP/UDP stack for example) can save some memory barriers in
>   fast paths.

Do you have any numbers that demonstrate the read memory barriers being
a performance problem? At least on x86 they should be very cheap
because they're normally nops.

-Andi
David Miller Nov. 17, 2008, 3:41 a.m. UTC | #8
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 13 Nov 2008 14:13:20 +0100

> Here is a serie of three patches (based on net-next-2.6), to continue work
> with RCU on UDP/TCP/DCCP stacks

All 4 patches applied to net-next-2.6, thanks a lot Eric!

These patches are incredibly cool!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Nov. 19, 2008, 7:52 p.m. UTC | #9
On Sun, 16 Nov 2008, David Miller wrote:

> These patches are incredibly cool!

How much does this give us on the aim9 udp test?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index e69de29..6adaa75 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -0,0 +1,97 @@ 
+#ifndef _LINUX_LIST_NULLS_H
+#define _LINUX_LIST_NULLS_H
+
+#include <linux/list.h>
+
+/*
+ * Special versions of lists, where a NULL pointer may have different values.
+ * (up to 2^31 different values guaranteed on all platforms)
+ *
+ * The least significant bit of 'ptr' is used to encode the 'NULL' value.
+ * Set to 1 : This is a NULL value (ptr >> 1)
+ * Set to 0 : This is a pointer to some object (ptr)
+ *
+ * Used for UDP sockets.
+ */
+
+struct hlist_nulls_head {
+	struct hlist_nulls_node *first;
+};
+
+struct hlist_nulls_node {
+	struct hlist_nulls_node *next, **pprev;
+};
+
+#define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
+	((ptr)->first = (struct hlist_nulls_node *) (1UL | ((nulls) << 1)))
+
+#define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
+
+/**
+ * ptr_is_a_nulls - Test if a ptr to struct hlist_nulls_node is a nulls
+ * @ptr: ptr to be tested
+ *
+ */
+static inline int is_a_nulls(struct hlist_nulls_node *ptr)
+{
+	return ((unsigned long)ptr & 1);
+}
+
+/**
+ * get_nulls_value - Returns the null associated with a ptr
+ * @ptr: ptr to struct hlist_nulls_node
+ *
+ * Caller must check is_a_nulls(ptr) is true before calling this.
+ */
+static inline unsigned long get_nulls_value(struct hlist_nulls_node *ptr)
+{
+	return ((unsigned long)ptr) >> 1;
+}
+
+static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)
+{
+	return !h->pprev;
+}
+
+static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
+{
+	return is_a_nulls(h->first);
+}
+
+static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
+{
+	struct hlist_nulls_node *next = n->next;
+	struct hlist_nulls_node **pprev = n->pprev;
+	*pprev = next;
+	if (!is_a_nulls(next))
+		next->pprev = pprev;
+}
+
+/**
+ * hlist_nulls_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry(tpos, pos, head, member)	 \
+	for (pos = (head)->first;				 \
+	     (!is_a_nulls(pos)) && ({ prefetch(pos->next); 1;}) && \
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+/**
+ * hlist_nulls_for_each_entry_from - iterate over a hlist continuing from current point
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry_from(tpos, pos, member)	\
+	for (; (!is_a_nulls(pos)) && 				\
+		({ prefetch(pos->next); 1;})   &&		\
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+#endif
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 3ba2998..e649bd3 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -383,22 +383,5 @@  static inline void hlist_add_after_rcu(struct hlist_node *prev,
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
 		pos = rcu_dereference(pos->next))
 
-/**
- * hlist_for_each_entry_rcu_safenext - iterate over rcu list of given type
- * @tpos:	the type * to use as a loop cursor.
- * @pos:	the &struct hlist_node to use as a loop cursor.
- * @head:	the head for your list.
- * @member:	the name of the hlist_node within the struct.
- * @next:       the &struct hlist_node to use as a next cursor
- *
- * Special version of hlist_for_each_entry_rcu that make sure
- * each next pointer is fetched before each iteration.
- */
-#define hlist_for_each_entry_rcu_safenext(tpos, pos, head, member, next) \
-	for (pos = rcu_dereference((head)->first);			 \
-		pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&	\
-		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
-		pos = rcu_dereference(next))
-
 #endif	/* __KERNEL__ */
 #endif
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index e69de29..f16b455 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -0,0 +1,55 @@ 
+#ifndef _LINUX_RCULIST_NULLS_H
+#define _LINUX_RCULIST_NULLS_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version, based on 'hlist_nulls' variant
+ *
+ * Used for UDP sockets.
+ */
+#include <linux/list_nulls.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
+{
+	if (!hlist_nulls_unhashed(n)) {
+		__hlist_nulls_del(n);
+		n->pprev = NULL;
+	}
+}
+
+static inline void hlist_nulls_del_rcu(struct hlist_nulls_node *n)
+{
+	__hlist_nulls_del(n);
+	n->pprev = LIST_POISON2;
+}
+
+static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
+					struct hlist_nulls_head *h)
+{
+	struct hlist_nulls_node *first = h->first;
+
+	n->next = first;
+	n->pprev = &h->first;
+	rcu_assign_pointer(h->first, n);
+	if (!is_a_nulls(first))
+		first->pprev = &n->next;
+}
+/**
+ * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \
+	for (pos = rcu_dereference((head)->first);			 \
+		(!is_a_nulls(pos)) && 			\
+		({ prefetch(pos->next); 1; }) &&			\
+		({ tpos = hlist_nulls_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference(pos->next))
+
+#endif  /* __KERNEL__ */
+#endif
diff --git a/include/net/sock.h b/include/net/sock.h
index a4f6d3f..ece2235 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -42,6 +42,7 @@ 
 
 #include <linux/kernel.h>
 #include <linux/list.h>
+#include <linux/list_nulls.h>
 #include <linux/timer.h>
 #include <linux/cache.h>
 #include <linux/module.h>
@@ -52,6 +53,7 @@ 
 #include <linux/security.h>
 
 #include <linux/filter.h>
+#include <linux/rculist_nulls.h>
 
 #include <asm/atomic.h>
 #include <net/dst.h>
@@ -106,6 +108,7 @@  struct net;
  *	@skc_reuse: %SO_REUSEADDR setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_node: main hash linkage for various protocol lookup tables
+ *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
  *	@skc_refcnt: reference count
  *	@skc_hash: hash value used with various protocol lookup tables
@@ -120,7 +123,10 @@  struct sock_common {
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse;
 	int			skc_bound_dev_if;
-	struct hlist_node	skc_node;
+	union {
+		struct hlist_node	skc_node;
+		struct hlist_nulls_node skc_nulls_node;
+	};
 	struct hlist_node	skc_bind_node;
 	atomic_t		skc_refcnt;
 	unsigned int		skc_hash;
@@ -206,6 +212,7 @@  struct sock {
 #define sk_reuse		__sk_common.skc_reuse
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_node			__sk_common.skc_node
+#define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_hash			__sk_common.skc_hash
@@ -296,12 +303,28 @@  static inline struct sock *sk_head(const struct hlist_head *head)
 	return hlist_empty(head) ? NULL : __sk_head(head);
 }
 
+static inline struct sock *__sk_nulls_head(const struct hlist_nulls_head *head)
+{
+	return hlist_nulls_entry(head->first, struct sock, sk_nulls_node);
+}
+
+static inline struct sock *sk_nulls_head(const struct hlist_nulls_head *head)
+{
+	return hlist_nulls_empty(head) ? NULL : __sk_nulls_head(head);
+}
+
 static inline struct sock *sk_next(const struct sock *sk)
 {
 	return sk->sk_node.next ?
 		hlist_entry(sk->sk_node.next, struct sock, sk_node) : NULL;
 }
 
+static inline struct sock *sk_nulls_next(const struct sock *sk)
+{
+	return (!is_a_nulls(sk->sk_nulls_node.next)) ?
+		hlist_entry(sk->sk_nulls_node.next, struct sock, sk_nulls_node) : NULL;
+}
+
 static inline int sk_unhashed(const struct sock *sk)
 {
 	return hlist_unhashed(&sk->sk_node);
@@ -363,18 +386,18 @@  static __inline__ int sk_del_node_init(struct sock *sk)
 	return rc;
 }
 
-static __inline__ int __sk_del_node_init_rcu(struct sock *sk)
+static __inline__ int __sk_nulls_del_node_init_rcu(struct sock *sk)
 {
 	if (sk_hashed(sk)) {
-		hlist_del_init_rcu(&sk->sk_node);
+		hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
 		return 1;
 	}
 	return 0;
 }
 
-static __inline__ int sk_del_node_init_rcu(struct sock *sk)
+static __inline__ int sk_nulls_del_node_init_rcu(struct sock *sk)
 {
-	int rc = __sk_del_node_init_rcu(sk);
+	int rc = __sk_nulls_del_node_init_rcu(sk);
 
 	if (rc) {
 		/* paranoid for a while -acme */
@@ -395,15 +418,15 @@  static __inline__ void sk_add_node(struct sock *sk, struct hlist_head *list)
 	__sk_add_node(sk, list);
 }
 
-static __inline__ void __sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
+static __inline__ void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
-	hlist_add_head_rcu(&sk->sk_node, list);
+	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
 }
 
-static __inline__ void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
+static __inline__ void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
 	sock_hold(sk);
-	__sk_add_node_rcu(sk, list);
+	__sk_nulls_add_node_rcu(sk, list);
 }
 
 static __inline__ void __sk_del_bind_node(struct sock *sk)
@@ -419,11 +442,16 @@  static __inline__ void sk_add_bind_node(struct sock *sk,
 
 #define sk_for_each(__sk, node, list) \
 	hlist_for_each_entry(__sk, node, list, sk_node)
-#define sk_for_each_rcu_safenext(__sk, node, list, next) \
-	hlist_for_each_entry_rcu_safenext(__sk, node, list, sk_node, next)
+#define sk_nulls_for_each(__sk, node, list) \
+	hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
+#define sk_nulls_for_each_rcu(__sk, node, list) \
+	hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
 #define sk_for_each_from(__sk, node) \
 	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
 		hlist_for_each_entry_from(__sk, node, sk_node)
+#define sk_nulls_for_each_from(__sk, node) \
+	if (__sk && ({ node = &(__sk)->sk_nulls_node; 1; })) \
+		hlist_nulls_for_each_entry_from(__sk, node, sk_nulls_node)
 #define sk_for_each_continue(__sk, node) \
 	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
 		hlist_for_each_entry_continue(__sk, node, sk_node)
diff --git a/include/net/udp.h b/include/net/udp.h
index df2bfe5..90e6ce5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -51,7 +51,7 @@  struct udp_skb_cb {
 #define UDP_SKB_CB(__skb)	((struct udp_skb_cb *)((__skb)->cb))
 
 struct udp_hslot {
-	struct hlist_head	head;
+	struct hlist_nulls_head	head;
 	spinlock_t		lock;
 } __attribute__((aligned(2 * sizeof(long))));
 struct udp_table {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1789b35..0f7ed53 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -127,9 +127,9 @@  static int udp_lib_lport_inuse(struct net *net, __u16 num,
 						 const struct sock *sk2))
 {
 	struct sock *sk2;
-	struct hlist_node *node;
+	struct hlist_nulls_node *node;
 
-	sk_for_each(sk2, node, &hslot->head)
+	sk_nulls_for_each(sk2, node, &hslot->head)
 		if (net_eq(sock_net(sk2), net)			&&
 		    sk2 != sk					&&
 		    sk2->sk_hash == num				&&
@@ -189,12 +189,7 @@  int udp_lib_get_port(struct sock *sk, unsigned short snum,
 	inet_sk(sk)->num = snum;
 	sk->sk_hash = snum;
 	if (sk_unhashed(sk)) {
-		/*
-		 * We need that previous write to sk->sk_hash committed
-		 * before write to sk->next done in following add_node() variant
-		 */
-		smp_wmb();
-		sk_add_node_rcu(sk, &hslot->head);
+		sk_nulls_add_node_rcu(sk, &hslot->head);
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 	}
 	error = 0;
@@ -261,7 +256,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node, *next;
+	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -271,7 +266,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
+	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain
@@ -285,6 +280,13 @@  begin:
 			badness = score;
 		}
 	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 */
+	if (get_nulls_value(node) != hash)
+		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -325,11 +327,11 @@  static inline struct sock *udp_v4_mcast_next(struct sock *sk,
 					     __be16 rmt_port, __be32 rmt_addr,
 					     int dif)
 {
-	struct hlist_node *node;
+	struct hlist_nulls_node *node;
 	struct sock *s = sk;
 	unsigned short hnum = ntohs(loc_port);
 
-	sk_for_each_from(s, node) {
+	sk_nulls_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (s->sk_hash != hnum					||
@@ -976,7 +978,7 @@  void udp_lib_unhash(struct sock *sk)
 	struct udp_hslot *hslot = &udptable->hash[hash];
 
 	spin_lock_bh(&hslot->lock);
-	if (sk_del_node_init_rcu(sk)) {
+	if (sk_nulls_del_node_init_rcu(sk)) {
 		inet_sk(sk)->num = 0;
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 	}
@@ -1129,7 +1131,7 @@  static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	int dif;
 
 	spin_lock(&hslot->lock);
-	sk = sk_head(&hslot->head);
+	sk = sk_nulls_head(&hslot->head);
 	dif = skb->dev->ifindex;
 	sk = udp_v4_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
 	if (sk) {
@@ -1138,7 +1140,7 @@  static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		do {
 			struct sk_buff *skb1 = skb;
 
-			sknext = udp_v4_mcast_next(sk_next(sk), uh->dest, daddr,
+			sknext = udp_v4_mcast_next(sk_nulls_next(sk), uh->dest, daddr,
 						   uh->source, saddr, dif);
 			if (sknext)
 				skb1 = skb_clone(skb, GFP_ATOMIC);
@@ -1558,10 +1560,10 @@  static struct sock *udp_get_first(struct seq_file *seq, int start)
 	struct net *net = seq_file_net(seq);
 
 	for (state->bucket = start; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) {
-		struct hlist_node *node;
+		struct hlist_nulls_node *node;
 		struct udp_hslot *hslot = &state->udp_table->hash[state->bucket];
 		spin_lock_bh(&hslot->lock);
-		sk_for_each(sk, node, &hslot->head) {
+		sk_nulls_for_each(sk, node, &hslot->head) {
 			if (!net_eq(sock_net(sk), net))
 				continue;
 			if (sk->sk_family == state->family)
@@ -1580,7 +1582,7 @@  static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 	struct net *net = seq_file_net(seq);
 
 	do {
-		sk = sk_next(sk);
+		sk = sk_nulls_next(sk);
 	} while (sk && (!net_eq(sock_net(sk), net) || sk->sk_family != state->family));
 
 	if (!sk) {
@@ -1751,7 +1753,7 @@  void __init udp_table_init(struct udp_table *table)
 	int i;
 
 	for (i = 0; i < UDP_HTABLE_SIZE; i++) {
-		INIT_HLIST_HEAD(&table->hash[i].head);
+		INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);
 		spin_lock_init(&table->hash[i].lock);
 	}
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 32d914d..581fcc1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -98,7 +98,7 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 				      int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node, *next;
+	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -108,19 +108,27 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
+	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain
 		 */
 		if (udp_hashfn(net, sk->sk_hash) != hash)
 			goto begin;
-		score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
+		score = compute_score(sk, net, hnum, saddr, sport,
+				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
 		}
 	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 */
+	if (get_nulls_value(node) != hash)
+		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -360,11 +368,11 @@  static struct sock *udp_v6_mcast_next(struct sock *sk,
 				      __be16 rmt_port, struct in6_addr *rmt_addr,
 				      int dif)
 {
-	struct hlist_node *node;
+	struct hlist_nulls_node *node;
 	struct sock *s = sk;
 	unsigned short num = ntohs(loc_port);
 
-	sk_for_each_from(s, node) {
+	sk_nulls_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (sock_net(s) != sock_net(sk))
@@ -409,7 +417,7 @@  static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	int dif;
 
 	spin_lock(&hslot->lock);
-	sk = sk_head(&hslot->head);
+	sk = sk_nulls_head(&hslot->head);
 	dif = inet6_iif(skb);
 	sk = udp_v6_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
 	if (!sk) {
@@ -418,7 +426,7 @@  static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	}
 
 	sk2 = sk;
-	while ((sk2 = udp_v6_mcast_next(sk_next(sk2), uh->dest, daddr,
+	while ((sk2 = udp_v6_mcast_next(sk_nulls_next(sk2), uh->dest, daddr,
 					uh->source, saddr, dif))) {
 		struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC);
 		if (buff) {