
[2/2] udp: RCU handling for Unicast packets.

Message ID 4908DEDE.5030706@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 29, 2008, 10:08 p.m. UTC
Paul E. McKenney wrote:
> On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
>> Hmm... Another way of handling all those cases and avoiding memory barriers
>> would be to have different "NULL" pointers.
>>
>> Each hash chain should have a unique "NULL" pointer (in the case of UDP, these
>> can be the 128 values [(void *)0 .. (void *)127]).
>>
>> Then, when performing a lookup, a reader should check that the "NULL" pointer
>> it gets at the end of its lookup is the "hash" value of its chain.
>>
>> If not -> restart the loop, aka "goto begin;" :)
>>
>> We could avoid memory barriers then.
>>
>> In the two cases Corey mentioned, this trick could let us avoid memory
>> barriers.
>> (The existing barrier in sk_add_node_rcu(sk, &hslot->head) should be enough.)
>>
>> What do you think ?
> 
> Kinky!!!  ;-)
> 
> Then the rcu_dereference() would be supplying the needed memory barriers.
> 
> Hmmm...  I guess that the only confusion would be if the element got
> removed and then added to the same list.  But then if its pointer was
> pseudo-NULL, then that would mean that all subsequent elements had been
> removed, and all preceding ones added after the scan started.
> 
> Which might well be harmless, but I must defer to you on this one at
> the moment.
> 
> If you need a larger hash table, another approach would be to set the
> pointer's low-order bit, allowing the upper bits to be a full-sized
> index -- or even a pointer to the list header.  Just make very sure
> to clear the pointer when freeing, or an element on the freelist
> could end up looking like a legitimate end of list...  Which again
> might well be safe, but why inflict this on oneself?

Well, in the UDP case the hash table size will be <= 65536 anyway, so we can
assume no dynamic kernel memory lies in the range [0 .. 65535].

Here is a patch (untested yet, it's really time for me to get some sleep ;) )

[PATCH] udp: Introduce special NULL pointers for hlist termination

In order to safely detect changes in chains, we would like to have distinct
'NULL' pointers. Each chain in the hash table is terminated by a unique
'NULL' value, so that lockless readers can detect when their lookup has
strayed from its starting chain.

We define 'NULL' values as pointers satisfying
((unsigned long)value < UDP_HTABLE_SIZE)

This removes the memory barrier (a read barrier to fetch 'next' pointers
*before* checking key values) added in commit
96631ed16c514cf8b28fab991a076985ce378c26 (udp: introduce
sk_for_each_rcu_safenext()).

This also addresses a missing memory barrier spotted by Corey Minyard
(a write barrier in udp_lib_get_port(), between the sk_hash update and the
->next update).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/linux/list.h    |   32 ++++++++++++++++++++++++++++++++
include/linux/rculist.h |   15 ++++++++-------
include/net/sock.h      |    9 +++++++--
net/ipv4/udp.c          |   19 +++++++++++++------
net/ipv6/udp.c          |   16 ++++++++++++----
5 files changed, 72 insertions(+), 19 deletions(-)
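
To make the scheme concrete, here is a minimal single-threaded user-space
sketch of the idea (hypothetical code, not the patch itself: HTABLE_SIZE
stands in for UDP_HTABLE_SIZE, and the final check mirrors the patch's
'goto begin' restart):

#include <stdio.h>

#define HTABLE_SIZE 128	/* stand-in for UDP_HTABLE_SIZE */

struct node {
	struct node *next;
	unsigned int key;
};

static struct node *htable[HTABLE_SIZE];

/* Any pointer value below HTABLE_SIZE is a pseudo-NULL terminator. */
static int is_pseudo_null(const struct node *p)
{
	return (unsigned long)p < HTABLE_SIZE;
}

static void htable_init(void)
{
	unsigned long i;

	/* Terminate each chain with its own slot index, not plain NULL. */
	for (i = 0; i < HTABLE_SIZE; i++)
		htable[i] = (struct node *)i;
}

static void htable_insert(struct node *n, unsigned int key)
{
	unsigned int hash = key % HTABLE_SIZE;

	n->key = key;
	n->next = htable[hash];
	htable[hash] = n;
}

static struct node *htable_lookup(unsigned int key)
{
	unsigned int hash = key % HTABLE_SIZE;
	struct node *p;

begin:
	/* The kernel patch also re-validates each item's hash; omitted here. */
	for (p = htable[hash]; !is_pseudo_null(p); p = p->next)
		if (p->key == key)
			return p;
	/*
	 * We fell off the chain.  If the pseudo-NULL we reached is not our
	 * own slot index, a concurrent writer moved an element to another
	 * chain while we walked it: restart from our own chain head.
	 * (Single-threaded here, so this never fires; under RCU this check
	 * is what replaces the read barriers.)
	 */
	if ((unsigned long)p != hash)
		goto begin;
	return NULL;
}

int main(void)
{
	struct node a;

	htable_init();
	htable_insert(&a, 1000);
	printf("lookup(1000) found it: %d\n", htable_lookup(1000) == &a);
	return 0;
}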

Comments

Corey Minyard Oct. 30, 2008, 3:22 a.m. UTC | #1
Eric Dumazet wrote:
> [ earlier discussion and patch description snipped ]
>
> This also addresses a missing memory barrier spotted by Corey Minyard
> (a write barrier in udp_lib_get_port(), between the sk_hash update and
> the ->next update)
I think you are right; this will certainly perform a lot better without 
the read barrier in the list traversal.  I haven't seen any problems 
with this approach, though it's unusual enough to perhaps warrant some 
extra comments in the code.

You do need to modify udp_lib_unhash(), as sk_del_node_init_rcu() will 
do a NULL check on the ->next value, so you will need a special version 
of that as well.

-corey
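
(A sketch of what such a special version might look like under the
value-below-nullval scheme; hypothetical code, not the actual fix:)

/*
 * Hypothetical nulls-aware unlink.  Identical to the usual
 * __hlist_del(), except the "is there a next node?" test must not
 * mistake a pseudo-NULL terminator (any value < nullval) for a real
 * node and dereference it.
 */
static inline void __hlist_del_nulls(struct hlist_node *n,
				     unsigned long nullval)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	*pprev = next;
	if ((unsigned long)next >= nullval)	/* a real node, not a terminator */
		next->pprev = pprev;
}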
David Miller Oct. 30, 2008, 5:40 a.m. UTC | #2
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 29 Oct 2008 23:08:30 +0100

> @@ -1746,7 +1753,7 @@ void __init udp_table_init(struct udp_table *table)
>  	int i;
>  
>  	for (i = 0; i < UDP_HTABLE_SIZE; i++) {
> -		INIT_HLIST_HEAD(&table->hash[i].head);
> +		table->hash[i].head.first = (struct hlist_node *)i;

Please hide this behind some list.h interface macro, even something
as simple as INIT_HLIST_HEAD_NULLS(X, index) would suffice.

And as Corey said, the code needs more comments for something as
clever as this! :-)

Thanks.
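
Something along these lines would presumably do (a sketch of the suggested
wrapper, under the value-below-UDP_HTABLE_SIZE scheme):

/* Sketch: terminate the chain with its slot index instead of NULL. */
#define INIT_HLIST_HEAD_NULLS(h, index) \
	((h)->first = (struct hlist_node *)(unsigned long)(index))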
Eric Dumazet Oct. 30, 2008, 5:51 a.m. UTC | #3
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 29 Oct 2008 23:08:30 +0100
> 
>> @@ -1746,7 +1753,7 @@ void __init udp_table_init(struct udp_table *table)
>>  	int i;
>>  
>>  	for (i = 0; i < UDP_HTABLE_SIZE; i++) {
>> -		INIT_HLIST_HEAD(&table->hash[i].head);
>> +		table->hash[i].head.first = (struct hlist_node *)i;
> 
> Please hide this behind some list.h interface macro, even something
> as simple as INIT_HLIST_HEAD_NULLS(X, index) would suffice.
> 
> And as Corey said, the code needs more comments for something as
> clever as this! :-)
> 

Yes, I agree 100%. Please give me one day to prepare a real patch,
or else akpm will kill us :)
Eric Dumazet Oct. 30, 2008, 7:04 a.m. UTC | #4
Eric Dumazet wrote:
> David Miller wrote:
>> [ quoted patch hunk snipped ]
>>
>> Please hide this behind some list.h interface macro, even something
>> as simple as INIT_HLIST_HEAD_NULLS(X, index) would suffice.
>>
>> And as Corey said, the code needs more comments for something as
>> clever as this! :-)
>>
> 
> Yes, I agree 100%. Please give me one day to prepare a real patch,
> or else akpm will kill us :)
> 

If we design something that could be reused, say for TCP sockets, we need to
be able to handle a very large number of 'NULL' pointers, say up to 64*1024*1024.

So let's use the low-order bit: set to one for 'NULL' pointers, and zero for
regular pointers.

This gives us 31 bits (or 63 bits) to store any valuable info :)

Then none of the ..._nulls() macros would need to know the max value
(UDP_HTABLE_SIZE in the UDP case), since all they have to do is test
((unsigned long)pos & 1).

At iterator exit, pos would contain the 'index' value (pos >> 1), hiding this
implementation detail.


David Miller Oct. 30, 2008, 7:05 a.m. UTC | #5
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 30 Oct 2008 08:04:42 +0100

> [ low-order-bit proposal snipped ]

This sounds fine to me.  Quite an improvement, in fact :)

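A sketch of the proposed encoding (illustrative only; these names are
assumptions, not an existing interface):

/*
 * The low-order bit tags chain terminators; real node pointers are
 * at least word-aligned, so their bit 0 is always clear.  The
 * remaining 31 (or 63) bits carry the chain index.
 */
#define is_a_nulls(ptr)		((unsigned long)(ptr) & 1)
#define get_nulls_value(ptr)	(((unsigned long)(ptr)) >> 1)
#define make_nulls_ptr(value)	\
	((struct hlist_node *)((((unsigned long)(value)) << 1) | 1UL))

A lookup would then stop when is_a_nulls(pos) is true and compare
get_nulls_value(pos) against its own chain index to decide whether to
restart, with no macro ever needing to know UDP_HTABLE_SIZE.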

Patch

diff --git a/include/linux/list.h b/include/linux/list.h
index 969f6e9..a3d5dd1 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -654,6 +654,22 @@  static inline void hlist_move_list(struct hlist_head *old,
 	     pos && ({ prefetch(pos->next); 1;}) &&			 \
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
+/**
+ * hlist_for_each_entry_nulls	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ * @nullval: the iteration should stop if a pointer is < nullval
+ *
+ * Special version of hlist_for_each_entry where the end pointer
+ * can be NULL but also any value < nullval.
+ */
+#define hlist_for_each_entry_nulls(tpos, pos, head, member, nullval)	 \
+	for (pos = (head)->first;					 \
+	     ((unsigned long)pos >= nullval) && ({ prefetch(pos->next); 1;}) && \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
 
 /**
  * hlist_for_each_entry_continue - iterate over a hlist continuing after current point
@@ -679,6 +695,22 @@  static inline void hlist_move_list(struct hlist_head *old,
 	     pos = pos->next)
 
 /**
+ * hlist_for_each_entry_from_nulls - iterate over a hlist continuing from current point
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @member:	the name of the hlist_node within the struct.
+ * @nullval: the iteration should stop if a pointer is < nullval
+ *
+ * Special version of hlist_for_each_entry_from where the end pointer
+ * can be NULL but also any value < nullval.
+ */
+#define hlist_for_each_entry_from_nulls(tpos, pos, member, nullval)	\
+	for (; ((unsigned long)pos >= nullval) && \
+		({ prefetch(pos->next); 1;})   && \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+/**
  * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
  * @tpos:	the type * to use as a loop cursor.
  * @pos:	the &struct hlist_node to use as a loop cursor.
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 3ba2998..6f78e2b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -384,21 +384,22 @@  static inline void hlist_add_after_rcu(struct hlist_node *prev,
 		pos = rcu_dereference(pos->next))
 
 /**
- * hlist_for_each_entry_rcu_safenext - iterate over rcu list of given type
+ * hlist_for_each_entry_rcu_nulls - iterate over rcu list of given type
  * @tpos:	the type * to use as a loop cursor.
  * @pos:	the &struct hlist_node to use as a loop cursor.
  * @head:	the head for your list.
  * @member:	the name of the hlist_node within the struct.
- * @next:       the &struct hlist_node to use as a next cursor
+ * @nullval:       the iteration should stop if a pointer is < nullval
  *
- * Special version of hlist_for_each_entry_rcu that make sure
- * each next pointer is fetched before each iteration.
+ * Special version of hlist_for_each_entry_rcu where the end pointer
+ * can be NULL but also any value < nullval.
  */
-#define hlist_for_each_entry_rcu_safenext(tpos, pos, head, member, next) \
+#define hlist_for_each_entry_rcu_nulls(tpos, pos, head, member, nullval) \
 	for (pos = rcu_dereference((head)->first);			 \
-		pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&	\
+		((unsigned long)pos >= nullval) && 			\
+		({ prefetch(pos->next); 1; }) &&			\
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
-		pos = rcu_dereference(next))
+		pos = rcu_dereference(pos->next))
 
 #endif	/* __KERNEL__ */
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index a4f6d3f..efe4def 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -419,11 +419,16 @@  static __inline__ void sk_add_bind_node(struct sock *sk,
 
 #define sk_for_each(__sk, node, list) \
 	hlist_for_each_entry(__sk, node, list, sk_node)
-#define sk_for_each_rcu_safenext(__sk, node, list, next) \
-	hlist_for_each_entry_rcu_safenext(__sk, node, list, sk_node, next)
+#define sk_for_each_nulls(__sk, node, list, nullval) \
+	hlist_for_each_entry_nulls(__sk, node, list, sk_node, nullval)
+#define sk_for_each_rcu_nulls(__sk, node, list, nullval) \
+	hlist_for_each_entry_rcu_nulls(__sk, node, list, sk_node, nullval)
 #define sk_for_each_from(__sk, node) \
 	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
 		hlist_for_each_entry_from(__sk, node, sk_node)
+#define sk_for_each_from_nulls(__sk, node, nullval) \
+	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
+		hlist_for_each_entry_from_nulls(__sk, node, sk_node, nullval)
 #define sk_for_each_continue(__sk, node) \
 	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
 		hlist_for_each_entry_continue(__sk, node, sk_node)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c3ecec8..a61fe89 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -129,7 +129,7 @@  static int udp_lib_lport_inuse(struct net *net, __u16 num,
 	struct sock *sk2;
 	struct hlist_node *node;
 
-	sk_for_each(sk2, node, &hslot->head)
+	sk_for_each_nulls(sk2, node, &hslot->head, UDP_HTABLE_SIZE)
 		if (net_eq(sock_net(sk2), net)			&&
 		    sk2 != sk					&&
 		    sk2->sk_hash == num				&&
@@ -256,7 +256,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node, *next;
+	struct hlist_node *node;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -266,7 +266,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
+	sk_for_each_rcu_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain
@@ -280,6 +280,13 @@  begin:
 			badness = score;
 		}
 	}
+	/*
+	 * if the 'NULL' pointer we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 */
+	if ((unsigned long)node != hash)
+		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -324,7 +331,7 @@  static inline struct sock *udp_v4_mcast_next(struct sock *sk,
 	struct sock *s = sk;
 	unsigned short hnum = ntohs(loc_port);
 
-	sk_for_each_from(s, node) {
+	sk_for_each_from_nulls(s, node, UDP_HTABLE_SIZE) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (s->sk_hash != hnum					||
@@ -1556,7 +1563,7 @@  static struct sock *udp_get_first(struct seq_file *seq, int start)
 		struct hlist_node *node;
 		struct udp_hslot *hslot = &state->udp_table->hash[state->bucket];
 		spin_lock_bh(&hslot->lock);
-		sk_for_each(sk, node, &hslot->head) {
+		sk_for_each_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
 			if (!net_eq(sock_net(sk), net))
 				continue;
 			if (sk->sk_family == state->family)
@@ -1746,7 +1753,7 @@  void __init udp_table_init(struct udp_table *table)
 	int i;
 
 	for (i = 0; i < UDP_HTABLE_SIZE; i++) {
-		INIT_HLIST_HEAD(&table->hash[i].head);
+		table->hash[i].head.first = (struct hlist_node *)i;
 		spin_lock_init(&table->hash[i].lock);
 	}
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 32d914d..13635ef 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -98,7 +98,7 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 				      int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node, *next;
+	struct hlist_node *node;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -108,19 +108,27 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
+	sk_for_each_rcu_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain
 		 */
 		if (udp_hashfn(net, sk->sk_hash) != hash)
 			goto begin;
-		score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
+		score = compute_score(sk, net, hnum, saddr, sport,
+				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
 		}
 	}
+	/*
+	 * if the 'NULL' pointer we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 */
+	if ((unsigned long)node != hash)
+		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -364,7 +372,7 @@  static struct sock *udp_v6_mcast_next(struct sock *sk,
 	struct sock *s = sk;
 	unsigned short num = ntohs(loc_port);
 
-	sk_for_each_from(s, node) {
+	sk_for_each_from_nulls(s, node, UDP_HTABLE_SIZE) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (sock_net(s) != sock_net(sk))