Patchwork [2/2] udp: RCU handling for Unicast packets.

login
register
mail settings
Submitter Eric Dumazet
Date Oct. 29, 2008, 2:36 p.m.
Message ID <490874F2.2060306@cosmosbay.com>
Download mbox | patch
Permalink /patch/6259/
State Accepted
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - Oct. 29, 2008, 2:36 p.m.
Corey Minyard a écrit :
> I believe there is a race in this patch:
> 
> +    sk_for_each_rcu(sk, node, &hslot->head) {
> +        /*
> +         * lockless reader, and SLAB_DESTROY_BY_RCU items:
> +         * We must check this item was not moved to another chain
> +         */
> +        if (udp_hashfn(net, sk->sk_hash) != hash)
> +            goto begin;
>         score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, 
> dif);
>         if (score > badness) {
>             result = sk;
>             badness = score;
>         }
>     }
> 
> If the socket is moved from one list to another list in-between the time 
> the hash is calculated and the next field is accessed, and the socket 
> has moved to the end of the new list, the traversal will not complete 
> properly on the list it should have, since the socket will be on the end 
> of the new list and there's not a way to tell it's on a new list and 
> restart the list traversal.  I think that this can be solved by 
> pre-fetching the "next" field (with proper barriers) before checking the 
> hash.
> 

You are absolutely right. As we *need* next pointer anyway for the prefetch(),
we can store it, so that its value is kept.

> I also might be nice to have a way to avoid recomputing the score the 
> second time, perhaps using a sequence number of some type.

Well, computing score is really cheap, everything is in cache, granted
the chain was not really huge and high score item at the beginning.

Adding yet another field in sock structure should be avoided if possible.

Thanks

[PATCH] udp: introduce sk_for_each_rcu_safenext()

Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
(udp: RCU handling for Unicast packets.)

 "If the socket is moved from one list to another list in-between the time 
  the hash is calculated and the next field is accessed, and the socket 
  has moved to the end of the new list, the traversal will not complete 
  properly on the list it should have, since the socket will be on the end 
  of the new list and there's not a way to tell it's on a new list and 
  restart the list traversal.  I think that this can be solved by 
  pre-fetching the "next" field (with proper barriers) before checking the 
  hash."

This patch corrects this problem, introducing a new sk_for_each_rcu_safenext()
macro.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/rculist.h |   17 +++++++++++++++++
 include/net/sock.h      |    4 ++--
 net/ipv4/udp.c          |    4 ++--
 net/ipv6/udp.c          |    4 ++--
 4 files changed, 23 insertions(+), 6 deletions(-)
Corey Minyard - Oct. 29, 2008, 3:34 p.m.
Eric Dumazet wrote:
> Corey Minyard found a race added in commit 
> 271b72c7fa82c2c7a795bc16896149933110672d
> (udp: RCU handling for Unicast packets.)
>
> "If the socket is moved from one list to another list in-between the 
> time  the hash is calculated and the next field is accessed, and the 
> socket  has moved to the end of the new list, the traversal will not 
> complete  properly on the list it should have, since the socket will 
> be on the end  of the new list and there's not a way to tell it's on a 
> new list and  restart the list traversal.  I think that this can be 
> solved by  pre-fetching the "next" field (with proper barriers) before 
> checking the  hash."
>
> This patch corrects this problem, introducing a new 
> sk_for_each_rcu_safenext()
> macro.
You also need the appropriate smp_wmb() in udp_lib_get_port() after 
sk_hash is set, I think, so the next field is guaranteed to be changed 
after the hash value is changed.


-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 4:09 p.m.
Corey Minyard a écrit :
> Eric Dumazet wrote:
>> Corey Minyard found a race added in commit 
>> 271b72c7fa82c2c7a795bc16896149933110672d
>> (udp: RCU handling for Unicast packets.)
>>
>> "If the socket is moved from one list to another list in-between the 
>> time  the hash is calculated and the next field is accessed, and the 
>> socket  has moved to the end of the new list, the traversal will not 
>> complete  properly on the list it should have, since the socket will 
>> be on the end  of the new list and there's not a way to tell it's on a 
>> new list and  restart the list traversal.  I think that this can be 
>> solved by  pre-fetching the "next" field (with proper barriers) before 
>> checking the  hash."
>>
>> This patch corrects this problem, introducing a new 
>> sk_for_each_rcu_safenext()
>> macro.
> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
> sk_hash is set, I think, so the next field is guaranteed to be changed 
> after the hash value is changed.
> 
> 

Not sure about this one Corey.

If a reader catches previous value of item->sk_hash, two cases are to be taken into :

1) its udp_hashfn(net, sk->sk_hash) is != hash  
  -> goto begin : Reader will redo its scan

2) its udp_hashfn(net, sk->sk_hash) is == hash
  -> next pointer is good enough : it points to next item in same hash chain.
     No need to rescan the chain at this point.
     Yes we could miss the fact that a new port was bound and this UDP message could be lost.


If we force a smp_wmb(), reader would fetch pointer to begin of list.


1) its udp_hashfn(net, sk->sk_hash) is != hash  
  -> goto begin : Reader will redo its scan : next pointer value had no meaning

2) its udp_hashfn(net, sk->sk_hash) is == hash
  ->next pointer "force reader" to rescan the chain, but wont find new items.

In any case, we cannot lost an UDP message sent to a stable port (previously bound)


Thanks

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 4:37 p.m.
On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
> Corey Minyard a écrit :
>> Eric Dumazet wrote:
>>> Corey Minyard found a race added in commit 
>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>> (udp: RCU handling for Unicast packets.)
>>>
>>> "If the socket is moved from one list to another list in-between the time 
>>>  the hash is calculated and the next field is accessed, and the socket  
>>> has moved to the end of the new list, the traversal will not complete  
>>> properly on the list it should have, since the socket will be on the end  
>>> of the new list and there's not a way to tell it's on a new list and  
>>> restart the list traversal.  I think that this can be solved by  
>>> pre-fetching the "next" field (with proper barriers) before checking the  
>>> hash."
>>>
>>> This patch corrects this problem, introducing a new 
>>> sk_for_each_rcu_safenext()
>>> macro.
>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>> after the hash value is changed.
>
> Not sure about this one Corey.
>
> If a reader catches previous value of item->sk_hash, two cases are to be 
> taken into :
>
> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
> will redo its scan
>
> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>  -> next pointer is good enough : it points to next item in same hash 
> chain.
>     No need to rescan the chain at this point.
>     Yes we could miss the fact that a new port was bound and this UDP 
> message could be lost.

3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
removed, freed, reallocated, and then readded with the same hash value,
possibly carrying the reader to a new position in the same list.

You might well cover this (will examine your code in detail on my plane
flight starting about 20 hours from now), but thought I should point it
out.  ;-)

						Thanx, Paul

> If we force a smp_wmb(), reader would fetch pointer to begin of list.
>
>
> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
> will redo its scan : next pointer value had no meaning
>
> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>  ->next pointer "force reader" to rescan the chain, but wont find new 
> items.
>
> In any case, we cannot lost an UDP message sent to a stable port 
> (previously bound)
>
>
> Thanks
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Corey Minyard - Oct. 29, 2008, 5:22 p.m.
Paul E. McKenney wrote:
> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>   
>> Corey Minyard a écrit :
>>     
>>> Eric Dumazet wrote:
>>>       
>>>> Corey Minyard found a race added in commit 
>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>> (udp: RCU handling for Unicast packets.)
>>>>
>>>> "If the socket is moved from one list to another list in-between the time 
>>>>  the hash is calculated and the next field is accessed, and the socket  
>>>> has moved to the end of the new list, the traversal will not complete  
>>>> properly on the list it should have, since the socket will be on the end  
>>>> of the new list and there's not a way to tell it's on a new list and  
>>>> restart the list traversal.  I think that this can be solved by  
>>>> pre-fetching the "next" field (with proper barriers) before checking the  
>>>> hash."
>>>>
>>>> This patch corrects this problem, introducing a new 
>>>> sk_for_each_rcu_safenext()
>>>> macro.
>>>>         
>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>> after the hash value is changed.
>>>       
>> Not sure about this one Corey.
>>
>> If a reader catches previous value of item->sk_hash, two cases are to be 
>> taken into :
>>
>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>> will redo its scan
>>
>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>  -> next pointer is good enough : it points to next item in same hash 
>> chain.
>>     No need to rescan the chain at this point.
>>     Yes we could miss the fact that a new port was bound and this UDP 
>> message could be lost.
>>     
>
> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
> removed, freed, reallocated, and then readded with the same hash value,
> possibly carrying the reader to a new position in the same list.
>   
If I understand this, without the smp_wmb(), it is possible that the 
next field can be written to main memory before the hash value is 
written.  If that happens, the following can occur:

  CPU1                    CPU2
  next is set to NULL (end of new list)
                          fetch next
                          calculate hash and compare to sk_hash
  sk_hash is set to new value

So I think in the above cases, your case #2 is not necessarily valid 
without the barrier.

And another possible issue.  If sk_hash is written before next, and CPU1 
is interrupted before CPU2, CPU2 will continually spin on the list until 
CPU1 comes back and moves it to the new list.  Note sure if that is an 
issue.

-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 5:32 p.m.
Paul E. McKenney a écrit :
> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>> Corey Minyard a écrit :
>>> Eric Dumazet wrote:
>>>> Corey Minyard found a race added in commit 
>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>> (udp: RCU handling for Unicast packets.)
>>>>
>>>> "If the socket is moved from one list to another list in-between the time 
>>>>  the hash is calculated and the next field is accessed, and the socket  
>>>> has moved to the end of the new list, the traversal will not complete  
>>>> properly on the list it should have, since the socket will be on the end  
>>>> of the new list and there's not a way to tell it's on a new list and  
>>>> restart the list traversal.  I think that this can be solved by  
>>>> pre-fetching the "next" field (with proper barriers) before checking the  
>>>> hash."
>>>>
>>>> This patch corrects this problem, introducing a new 
>>>> sk_for_each_rcu_safenext()
>>>> macro.
>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>> after the hash value is changed.
>> Not sure about this one Corey.
>>
>> If a reader catches previous value of item->sk_hash, two cases are to be 
>> taken into :
>>
>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>> will redo its scan
>>
>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>  -> next pointer is good enough : it points to next item in same hash 
>> chain.
>>     No need to rescan the chain at this point.
>>     Yes we could miss the fact that a new port was bound and this UDP 
>> message could be lost.
> 
> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
> removed, freed, reallocated, and then readded with the same hash value,
> possibly carrying the reader to a new position in the same list.

yes, but 'new position' is 'before any not yet examined objects', since
we insert objects only at chain head.

> 
> You might well cover this (will examine your code in detail on my plane
> flight starting about 20 hours from now), but thought I should point it
> out.  ;-)
> 
> 	

Yes, I'll double check too, this seems tricky :)

About SLAB_DESTROY_BY_RCU effect, we now have two different kmem_cache for "UDP-Lite"
and "UDP".

This is expected, but we could avoid that and alias these caches, since
these objects have the same *type* . (The fields used for the RCU lookups,
deletes and inserts are the same)

Maybe a hack in net/ipv4/udplite.c before calling proto_register(), to
copy the kmem_cache from UDP.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 5:45 p.m.
Corey Minyard a écrit :
> Paul E. McKenney wrote:
>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>  
>>> Corey Minyard a écrit :
>>>    
>>>> Eric Dumazet wrote:
>>>>      
>>>>> Corey Minyard found a race added in commit 
>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>> (udp: RCU handling for Unicast packets.)
>>>>>
>>>>> "If the socket is moved from one list to another list in-between 
>>>>> the time  the hash is calculated and the next field is accessed, 
>>>>> and the socket  has moved to the end of the new list, the traversal 
>>>>> will not complete  properly on the list it should have, since the 
>>>>> socket will be on the end  of the new list and there's not a way to 
>>>>> tell it's on a new list and  restart the list traversal.  I think 
>>>>> that this can be solved by  pre-fetching the "next" field (with 
>>>>> proper barriers) before checking the  hash."
>>>>>
>>>>> This patch corrects this problem, introducing a new 
>>>>> sk_for_each_rcu_safenext()
>>>>> macro.
>>>>>         
>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>> sk_hash is set, I think, so the next field is guaranteed to be 
>>>> changed after the hash value is changed.
>>>>       
>>> Not sure about this one Corey.
>>>
>>> If a reader catches previous value of item->sk_hash, two cases are to 
>>> be taken into :
>>>
>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : 
>>> Reader will redo its scan
>>>
>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>  -> next pointer is good enough : it points to next item in same hash 
>>> chain.
>>>     No need to rescan the chain at this point.
>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>> message could be lost.
>>>     
>>
>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>> removed, freed, reallocated, and then readded with the same hash value,
>> possibly carrying the reader to a new position in the same list.
>>   
> If I understand this, without the smp_wmb(), it is possible that the 
> next field can be written to main memory before the hash value is 
> written.  If that happens, the following can occur:
> 
>  CPU1                    CPU2
>  next is set to NULL (end of new list)

Well, if this item is injected to the same chain, next wont be set to NULL.

That would mean previous writers deleted all items from the chain.

In this case, readers can see NULL, it is not a problem at all.
List is/was empty.
An application cannot complain a packet is not
handled if its bind() syscall is not yet completed :)

If item is injected on another chain, we will detect hash mismatch and redo full scan.

>                          fetch next
>                          calculate hash and compare to sk_hash
>  sk_hash is set to new value
> 
> So I think in the above cases, your case #2 is not necessarily valid 
> without the barrier.
> 
> And another possible issue.  If sk_hash is written before next, and CPU1 
> is interrupted before CPU2, CPU2 will continually spin on the list until 
> CPU1 comes back and moves it to the new list.  Note sure if that is an 
> issue.

Probably not. Previously, readers were spining on read_lock(), when 
a writer was inside its critical section (write_lock()/write_unlock()).
So instead of spining inside read_unlock(), issuing stupid memory 
transactions, the readers can now spin reading hash chain and populate
cpu cache :)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 6:11 p.m.
On Wed, Oct 29, 2008 at 06:32:29PM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>> Corey Minyard a écrit :
>>>> Eric Dumazet wrote:
>>>>> Corey Minyard found a race added in commit 
>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>> (udp: RCU handling for Unicast packets.)
>>>>>
>>>>> "If the socket is moved from one list to another list in-between the 
>>>>> time  the hash is calculated and the next field is accessed, and the 
>>>>> socket  has moved to the end of the new list, the traversal will not 
>>>>> complete  properly on the list it should have, since the socket will be 
>>>>> on the end  of the new list and there's not a way to tell it's on a new 
>>>>> list and  restart the list traversal.  I think that this can be solved 
>>>>> by  pre-fetching the "next" field (with proper barriers) before 
>>>>> checking the  hash."
>>>>>
>>>>> This patch corrects this problem, introducing a new 
>>>>> sk_for_each_rcu_safenext()
>>>>> macro.
>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>>> after the hash value is changed.
>>> Not sure about this one Corey.
>>>
>>> If a reader catches previous value of item->sk_hash, two cases are to be 
>>> taken into :
>>>
>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>>> will redo its scan
>>>
>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>  -> next pointer is good enough : it points to next item in same hash 
>>> chain.
>>>     No need to rescan the chain at this point.
>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>> message could be lost.
>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>> removed, freed, reallocated, and then readded with the same hash value,
>> possibly carrying the reader to a new position in the same list.
>
> yes, but 'new position' is 'before any not yet examined objects', since
> we insert objects only at chain head.

OK.  However, this reasoning assumes that a socket with a given
udp_hashfn() value will appear on one and only one list.  There are no
side lists for sockets in other states?  (listen, &c)

>> You might well cover this (will examine your code in detail on my plane
>> flight starting about 20 hours from now), but thought I should point it
>> out.  ;-)
>
> Yes, I'll double check too, this seems tricky :)

;-)

> About SLAB_DESTROY_BY_RCU effect, we now have two different kmem_cache for 
> "UDP-Lite"
> and "UDP".
>
> This is expected, but we could avoid that and alias these caches, since
> these objects have the same *type* . (The fields used for the RCU lookups,
> deletes and inserts are the same)
>
> Maybe a hack in net/ipv4/udplite.c before calling proto_register(), to
> copy the kmem_cache from UDP.

As long as this preserves the aforementioned assumption that a socket
with a given hash can appear on one and only one list.  ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Oct. 29, 2008, 6:20 p.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 29 Oct 2008 15:36:34 +0100

> [PATCH] udp: introduce sk_for_each_rcu_safenext()

Also applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Corey Minyard - Oct. 29, 2008, 6:28 p.m.
Eric Dumazet wrote:
> Corey Minyard a écrit :
>> Paul E. McKenney wrote:
>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>  
>>>> Corey Minyard a écrit :
>>>>   
>>>>> Eric Dumazet wrote:
>>>>>     
>>>>>> Corey Minyard found a race added in commit 
>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>
>>>>>> "If the socket is moved from one list to another list in-between 
>>>>>> the time  the hash is calculated and the next field is accessed, 
>>>>>> and the socket  has moved to the end of the new list, the 
>>>>>> traversal will not complete  properly on the list it should have, 
>>>>>> since the socket will be on the end  of the new list and there's 
>>>>>> not a way to tell it's on a new list and  restart the list 
>>>>>> traversal.  I think that this can be solved by  pre-fetching the 
>>>>>> "next" field (with proper barriers) before checking the  hash."
>>>>>>
>>>>>> This patch corrects this problem, introducing a new 
>>>>>> sk_for_each_rcu_safenext()
>>>>>> macro.
>>>>>>         
>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() 
>>>>> after sk_hash is set, I think, so the next field is guaranteed to 
>>>>> be changed after the hash value is changed.
>>>>>       
>>>> Not sure about this one Corey.
>>>>
>>>> If a reader catches previous value of item->sk_hash, two cases are 
>>>> to be taken into :
>>>>
>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : 
>>>> Reader will redo its scan
>>>>
>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>  -> next pointer is good enough : it points to next item in same 
>>>> hash chain.
>>>>     No need to rescan the chain at this point.
>>>>     Yes we could miss the fact that a new port was bound and this 
>>>> UDP message could be lost.
>>>>     
>>>
>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>> removed, freed, reallocated, and then readded with the same hash value,
>>> possibly carrying the reader to a new position in the same list.
>>>   
>> If I understand this, without the smp_wmb(), it is possible that the 
>> next field can be written to main memory before the hash value is 
>> written.  If that happens, the following can occur:
>>
>>  CPU1                    CPU2
>>  next is set to NULL (end of new list)
>
> Well, if this item is injected to the same chain, next wont be set to 
> NULL.
>
> That would mean previous writers deleted all items from the chain.
I put my comment in the wrong place, I wasn't talking about being 
injected into the same chain.

>
> In this case, readers can see NULL, it is not a problem at all.
> List is/was empty.
> An application cannot complain a packet is not
> handled if its bind() syscall is not yet completed :)
>
> If item is injected on another chain, we will detect hash mismatch and 
> redo full scan.
If the item is injected onto the end of another chain, the next field 
will be NULL and you won't detect a hash mismatch.  It's basically the 
same issue as the previous race, though a lot more subtle and unlikely.  
If you get (from the previous socket) an old value of "sk_hash" and 
(from the new socket) a new value of "next" that is NULL, you will 
terminate the loop when you should have restarted it.  I'm pretty sure 
that can occur without the write barrier.

>
>>                          fetch next
>>                          calculate hash and compare to sk_hash
>>  sk_hash is set to new value
>>
>> So I think in the above cases, your case #2 is not necessarily valid 
>> without the barrier.
>>
>> And another possible issue.  If sk_hash is written before next, and 
>> CPU1 is interrupted before CPU2, CPU2 will continually spin on the 
>> list until CPU1 comes back and moves it to the new list.  Note sure 
>> if that is an issue.
>
> Probably not. Previously, readers were spining on read_lock(), when a 
> writer was inside its critical section (write_lock()/write_unlock()).
> So instead of spining inside read_unlock(), issuing stupid memory 
> transactions, the readers can now spin reading hash chain and populate
> cpu cache :)
Yes, I thought about that and thought I would point it out, but I agree, 
what you have is certainly better than spinning on a lock :).


-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Oct. 29, 2008, 6:29 p.m.
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Wed, 29 Oct 2008 11:11:14 -0700

> OK.  However, this reasoning assumes that a socket with a given
> udp_hashfn() value will appear on one and only one list.  There are no
> side lists for sockets in other states?  (listen, &c)

Nope, with UDP things are very simple, just one hash table.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 6:36 p.m.
Paul E. McKenney a écrit :
> On Wed, Oct 29, 2008 at 06:32:29PM +0100, Eric Dumazet wrote:
>> Paul E. McKenney a écrit :
>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>> Corey Minyard a écrit :
>>>>> Eric Dumazet wrote:
>>>>>> Corey Minyard found a race added in commit 
>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>
>>>>>> "If the socket is moved from one list to another list in-between the 
>>>>>> time  the hash is calculated and the next field is accessed, and the 
>>>>>> socket  has moved to the end of the new list, the traversal will not 
>>>>>> complete  properly on the list it should have, since the socket will be 
>>>>>> on the end  of the new list and there's not a way to tell it's on a new 
>>>>>> list and  restart the list traversal.  I think that this can be solved 
>>>>>> by  pre-fetching the "next" field (with proper barriers) before 
>>>>>> checking the  hash."
>>>>>>
>>>>>> This patch corrects this problem, introducing a new 
>>>>>> sk_for_each_rcu_safenext()
>>>>>> macro.
>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>>>> after the hash value is changed.
>>>> Not sure about this one Corey.
>>>>
>>>> If a reader catches previous value of item->sk_hash, two cases are to be 
>>>> taken into :
>>>>
>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>>>> will redo its scan
>>>>
>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>  -> next pointer is good enough : it points to next item in same hash 
>>>> chain.
>>>>     No need to rescan the chain at this point.
>>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>>> message could be lost.
>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>> removed, freed, reallocated, and then readded with the same hash value,
>>> possibly carrying the reader to a new position in the same list.
>> yes, but 'new position' is 'before any not yet examined objects', since
>> we insert objects only at chain head.
> 
> OK.  However, this reasoning assumes that a socket with a given
> udp_hashfn() value will appear on one and only one list.  There are no
> side lists for sockets in other states?  (listen, &c)
> 
>>> You might well cover this (will examine your code in detail on my plane
>>> flight starting about 20 hours from now), but thought I should point it
>>> out.  ;-)
>> Yes, I'll double check too, this seems tricky :)
> 
> ;-)
> 
>> About SLAB_DESTROY_BY_RCU effect, we now have two different kmem_cache for 
>> "UDP-Lite"
>> and "UDP".
>>
>> This is expected, but we could avoid that and alias these caches, since
>> these objects have the same *type* . (The fields used for the RCU lookups,
>> deletes and inserts are the same)
>>
>> Maybe a hack in net/ipv4/udplite.c before calling proto_register(), to
>> copy the kmem_cache from UDP.
> 
> As long as this preserves the aforementioned assumption that a socket
> with a given hash can appear on one and only one list.  ;-)
> 

Ouch, thanks Paul, that is indeed the point, well sort of.

If a UDP socket is freed, and re-allocated as an UDP-Lite socket, inserted on
the udplite_table, then we would have a problem with current implementation.

A reader could be directed to the chain of the other hash table, without
noticing it should restart its lookup...

Not worth adding a check to detect such a scenario, we can live with two different
kmem_cache after all, they are not that expensive.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 6:38 p.m.
On Wed, Oct 29, 2008 at 11:29:56AM -0700, David Miller wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Date: Wed, 29 Oct 2008 11:11:14 -0700
> 
> > OK.  However, this reasoning assumes that a socket with a given
> > udp_hashfn() value will appear on one and only one list.  There are no
> > side lists for sockets in other states?  (listen, &c)
> 
> Nope, with UDP things are very simple, just one hash table.

Cool!  That does make things easier.  ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 6:52 p.m.
On Wed, Oct 29, 2008 at 01:28:15PM -0500, Corey Minyard wrote:
> Eric Dumazet wrote:
>> Corey Minyard a écrit :
>>> Paul E. McKenney wrote:
>>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>>  
>>>>> Corey Minyard a écrit :
>>>>>   
>>>>>> Eric Dumazet wrote:
>>>>>>     
>>>>>>> Corey Minyard found a race added in commit 
>>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>>
>>>>>>> "If the socket is moved from one list to another list in-between the 
>>>>>>> time  the hash is calculated and the next field is accessed, and the 
>>>>>>> socket  has moved to the end of the new list, the traversal will not 
>>>>>>> complete  properly on the list it should have, since the socket will 
>>>>>>> be on the end  of the new list and there's not a way to tell it's on 
>>>>>>> a new list and  restart the list traversal.  I think that this can be 
>>>>>>> solved by  pre-fetching the "next" field (with proper barriers) 
>>>>>>> before checking the  hash."
>>>>>>>
>>>>>>> This patch corrects this problem, introducing a new 
>>>>>>> sk_for_each_rcu_safenext()
>>>>>>> macro.
>>>>>>>         
>>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>>>>> after the hash value is changed.
>>>>>>       
>>>>> Not sure about this one Corey.
>>>>>
>>>>> If a reader catches previous value of item->sk_hash, two cases are to 
>>>>> be taken into :
>>>>>
>>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>>>>> will redo its scan
>>>>>
>>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>>  -> next pointer is good enough : it points to next item in same hash 
>>>>> chain.
>>>>>     No need to rescan the chain at this point.
>>>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>>>> message could be lost.
>>>>>     
>>>>
>>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>>> removed, freed, reallocated, and then readded with the same hash value,
>>>> possibly carrying the reader to a new position in the same list.
>>>>   
>>> If I understand this, without the smp_wmb(), it is possible that the next 
>>> field can be written to main memory before the hash value is written.  If 
>>> that happens, the following can occur:
>>>
>>>  CPU1                    CPU2
>>>  next is set to NULL (end of new list)
>>
>> Well, if this item is injected to the same chain, next wont be set to 
>> NULL.
>>
>> That would mean previous writers deleted all items from the chain.
> I put my comment in the wrong place, I wasn't talking about being injected 
> into the same chain.
>
>>
>> In this case, readers can see NULL, it is not a problem at all.
>> List is/was empty.
>> An application cannot complain a packet is not
>> handled if its bind() syscall is not yet completed :)
>>
>> If item is injected on another chain, we will detect hash mismatch and 
>> redo full scan.
> If the item is injected onto the end of another chain, the next field will 
> be NULL and you won't detect a hash mismatch.  It's basically the same 
> issue as the previous race, though a lot more subtle and unlikely.  If you 
> get (from the previous socket) an old value of "sk_hash" and (from the new 
> socket) a new value of "next" that is NULL, you will terminate the loop 
> when you should have restarted it.  I'm pretty sure that can occur without 
> the write barrier.

One way of dealing with this is to keep a tail pointer.  Then, if the
element containing the NULL pointer doesn't match the tail pointer seen
at the start of the search, or if the tail pointer has changed,
restart the search.  Memory barriers will be needed.  ;-)

							Thanx, Paul

>>>                          fetch next
>>>                          calculate hash and compare to sk_hash
>>>  sk_hash is set to new value
>>>
>>> So I think in the above cases, your case #2 is not necessarily valid 
>>> without the barrier.
>>>
>>> And another possible issue.  If sk_hash is written before next, and CPU1 
>>> is interrupted before CPU2, CPU2 will continually spin on the list until 
>>> CPU1 comes back and moves it to the new list.  Note sure if that is an 
>>> issue.
>>
>> Probably not. Previously, readers were spining on read_lock(), when a 
>> writer was inside its critical section (write_lock()/write_unlock()).
>> So instead of spining inside read_unlock(), issuing stupid memory 
>> transactions, the readers can now spin reading hash chain and populate
>> cpu cache :)
> Yes, I thought about that and thought I would point it out, but I agree, 
> what you have is certainly better than spinning on a lock :).
>
>
> -corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 8 p.m.
Paul E. McKenney a écrit :
> On Wed, Oct 29, 2008 at 01:28:15PM -0500, Corey Minyard wrote:
>> Eric Dumazet wrote:
>>> Corey Minyard a écrit :
>>>> Paul E. McKenney wrote:
>>>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>>>  
>>>>>> Corey Minyard a écrit :
>>>>>>   
>>>>>>> Eric Dumazet wrote:
>>>>>>>     
>>>>>>>> Corey Minyard found a race added in commit 
>>>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>>>
>>>>>>>> "If the socket is moved from one list to another list in-between the 
>>>>>>>> time  the hash is calculated and the next field is accessed, and the 
>>>>>>>> socket  has moved to the end of the new list, the traversal will not 
>>>>>>>> complete  properly on the list it should have, since the socket will 
>>>>>>>> be on the end  of the new list and there's not a way to tell it's on 
>>>>>>>> a new list and  restart the list traversal.  I think that this can be 
>>>>>>>> solved by  pre-fetching the "next" field (with proper barriers) 
>>>>>>>> before checking the  hash."
>>>>>>>>
>>>>>>>> This patch corrects this problem, introducing a new 
>>>>>>>> sk_for_each_rcu_safenext()
>>>>>>>> macro.
>>>>>>>>         
>>>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>>>>> sk_hash is set, I think, so the next field is guaranteed to be changed 
>>>>>>> after the hash value is changed.
>>>>>>>       
>>>>>> Not sure about this one Corey.
>>>>>>
>>>>>> If a reader catches previous value of item->sk_hash, two cases are to 
>>>>>> be taken into :
>>>>>>
>>>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : Reader 
>>>>>> will redo its scan
>>>>>>
>>>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>>>  -> next pointer is good enough : it points to next item in same hash 
>>>>>> chain.
>>>>>>     No need to rescan the chain at this point.
>>>>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>>>>> message could be lost.
>>>>>>     
>>>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>>>> removed, freed, reallocated, and then readded with the same hash value,
>>>>> possibly carrying the reader to a new position in the same list.
>>>>>   
>>>> If I understand this, without the smp_wmb(), it is possible that the next 
>>>> field can be written to main memory before the hash value is written.  If 
>>>> that happens, the following can occur:
>>>>
>>>>  CPU1                    CPU2
>>>>  next is set to NULL (end of new list)
>>> Well, if this item is injected to the same chain, next wont be set to 
>>> NULL.
>>>
>>> That would mean previous writers deleted all items from the chain.
>> I put my comment in the wrong place, I wasn't talking about being injected 
>> into the same chain.
>>
>>> In this case, readers can see NULL, it is not a problem at all.
>>> List is/was empty.
>>> An application cannot complain a packet is not
>>> handled if its bind() syscall is not yet completed :)
>>>
>>> If item is injected on another chain, we will detect hash mismatch and 
>>> redo full scan.
>> If the item is injected onto the end of another chain, the next field will 
>> be NULL and you won't detect a hash mismatch.  It's basically the same 
>> issue as the previous race, though a lot more subtle and unlikely.  If you 
>> get (from the previous socket) an old value of "sk_hash" and (from the new 
>> socket) a new value of "next" that is NULL, you will terminate the loop 
>> when you should have restarted it.  I'm pretty sure that can occur without 
>> the write barrier.
> 
> One way of dealing with this is to keep a tail pointer.  Then, if the
> element containing the NULL pointer doesn't match the tail pointer seen
> at the start of the search, or if the tail pointer has changed,
> restart the search.  Memory barriers will be needed.  ;-)
> 

Hum... Another way of handling all those cases and avoid memory barriers
would be to have different "NULL" pointers.

Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
can be the 128 values : [ (void*)0 .. (void *)127 ]

Then, when performing a lookup, a reader should check the "NULL" pointer
it get at the end of its lookup has is the "hash" value of its chain.

If not -> restart the loop, aka "goto begin;" :)

We could avoid memory barriers then.

In the two cases Corey mentioned, this trick could let us avoid memory barriers.
(existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)

What do you think ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 8:17 p.m.
On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
>> On Wed, Oct 29, 2008 at 01:28:15PM -0500, Corey Minyard wrote:
>>> Eric Dumazet wrote:
>>>> Corey Minyard a écrit :
>>>>> Paul E. McKenney wrote:
>>>>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>>>>  
>>>>>>> Corey Minyard a écrit :
>>>>>>>   
>>>>>>>> Eric Dumazet wrote:
>>>>>>>>     
>>>>>>>>> Corey Minyard found a race added in commit 
>>>>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>>>>
>>>>>>>>> "If the socket is moved from one list to another list in-between 
>>>>>>>>> the time  the hash is calculated and the next field is accessed, 
>>>>>>>>> and the socket  has moved to the end of the new list, the traversal 
>>>>>>>>> will not complete  properly on the list it should have, since the 
>>>>>>>>> socket will be on the end  of the new list and there's not a way to 
>>>>>>>>> tell it's on a new list and  restart the list traversal.  I think 
>>>>>>>>> that this can be solved by  pre-fetching the "next" field (with 
>>>>>>>>> proper barriers) before checking the  hash."
>>>>>>>>>
>>>>>>>>> This patch corrects this problem, introducing a new 
>>>>>>>>> sk_for_each_rcu_safenext()
>>>>>>>>> macro.
>>>>>>>>>         
>>>>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after 
>>>>>>>> sk_hash is set, I think, so the next field is guaranteed to be 
>>>>>>>> changed after the hash value is changed.
>>>>>>>>       
>>>>>>> Not sure about this one Corey.
>>>>>>>
>>>>>>> If a reader catches previous value of item->sk_hash, two cases are to 
>>>>>>> be taken into :
>>>>>>>
>>>>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash   -> goto begin : 
>>>>>>> Reader will redo its scan
>>>>>>>
>>>>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>>>>  -> next pointer is good enough : it points to next item in same hash 
>>>>>>> chain.
>>>>>>>     No need to rescan the chain at this point.
>>>>>>>     Yes we could miss the fact that a new port was bound and this UDP 
>>>>>>> message could be lost.
>>>>>>>     
>>>>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>>>>> removed, freed, reallocated, and then readded with the same hash 
>>>>>> value,
>>>>>> possibly carrying the reader to a new position in the same list.
>>>>>>   
>>>>> If I understand this, without the smp_wmb(), it is possible that the 
>>>>> next field can be written to main memory before the hash value is 
>>>>> written.  If that happens, the following can occur:
>>>>>
>>>>>  CPU1                    CPU2
>>>>>  next is set to NULL (end of new list)
>>>> Well, if this item is injected to the same chain, next wont be set to 
>>>> NULL.
>>>>
>>>> That would mean previous writers deleted all items from the chain.
>>> I put my comment in the wrong place, I wasn't talking about being 
>>> injected into the same chain.
>>>
>>>> In this case, readers can see NULL, it is not a problem at all.
>>>> List is/was empty.
>>>> An application cannot complain a packet is not
>>>> handled if its bind() syscall is not yet completed :)
>>>>
>>>> If item is injected on another chain, we will detect hash mismatch and 
>>>> redo full scan.
>>> If the item is injected onto the end of another chain, the next field 
>>> will be NULL and you won't detect a hash mismatch.  It's basically the 
>>> same issue as the previous race, though a lot more subtle and unlikely.  
>>> If you get (from the previous socket) an old value of "sk_hash" and (from 
>>> the new socket) a new value of "next" that is NULL, you will terminate 
>>> the loop when you should have restarted it.  I'm pretty sure that can 
>>> occur without the write barrier.
>> One way of dealing with this is to keep a tail pointer.  Then, if the
>> element containing the NULL pointer doesn't match the tail pointer seen
>> at the start of the search, or if the tail pointer has changed,
>> restart the search.  Memory barriers will be needed.  ;-)
>
> Hum... Another way of handling all those cases and avoid memory barriers
> would be to have different "NULL" pointers.
>
> Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
> can be the 128 values : [ (void*)0 .. (void *)127 ]
>
> Then, when performing a lookup, a reader should check the "NULL" pointer
> it get at the end of its lookup has is the "hash" value of its chain.
>
> If not -> restart the loop, aka "goto begin;" :)
>
> We could avoid memory barriers then.
>
> In the two cases Corey mentioned, this trick could let us avoid memory 
> barriers.
> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>
> What do you think ?

Kinky!!!  ;-)

Then the rcu_dereference() would be supplying the needed memory barriers.

Hmmm...  I guess that the only confusion would be if the element got
removed and then added to the same list.  But then if its pointer was
pseudo-NULL, then that would mean that all subsequent elements had been
removed, and all preceding ones added after the scan started.

Which might well be harmless, but I must defer to you on this one at
the moment.

If you need a larger hash table, another approach would be to set the
pointer's low-order bit, allowing the upper bits to be a full-sized
index -- or even a pointer to the list header.  Just make very sure
to clear the pointer when freeing, or an element on the freelist
could end up looking like a legitimate end of list...  Which again
might well be safe, but why inflict this on oneself?

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Corey Minyard - Oct. 29, 2008, 9:29 p.m.
Paul E. McKenney wrote:
> O
..snip
>> Hum... Another way of handling all those cases and avoid memory barriers
>> would be to have different "NULL" pointers.
>>
>> Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>
>> Then, when performing a lookup, a reader should check the "NULL" pointer
>> it get at the end of its lookup has is the "hash" value of its chain.
>>
>> If not -> restart the loop, aka "goto begin;" :)
>>
>> We could avoid memory barriers then.
>>
>> In the two cases Corey mentioned, this trick could let us avoid memory 
>> barriers.
>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>
>> What do you think ?
>>     
>
> Kinky!!!  ;-)
>   
My thought exactly ;-).

> Then the rcu_dereference() would be supplying the needed memory barriers.
>
> Hmmm...  I guess that the only confusion would be if the element got
> removed and then added to the same list.  But then if its pointer was
> pseudo-NULL, then that would mean that all subsequent elements had been
> removed, and all preceding ones added after the scan started.
>
> Which might well be harmless, but I must defer to you on this one at
> the moment.
>   
I believe that is harmless, as re-scanning the same data should be fine.

> If you need a larger hash table, another approach would be to set the
> pointer's low-order bit, allowing the upper bits to be a full-sized
> index -- or even a pointer to the list header.  Just make very sure
> to clear the pointer when freeing, or an element on the freelist
> could end up looking like a legitimate end of list...  Which again
> might well be safe, but why inflict this on oneself?
>   
Kind of my thought, too.  That's a lot of work to avoid a single 
smb_wmb() on the socket creation path.  Plus this could be extra confusing.

-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2008, 9:57 p.m.
Corey Minyard a écrit :
> Paul E. McKenney wrote:
>> O
> ..snip
>>> Hum... Another way of handling all those cases and avoid memory barriers
>>> would be to have different "NULL" pointers.
>>>
>>> Each hash chain should have a unique "NULL" pointer (in the case of 
>>> UDP, it
>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>>
>>> Then, when performing a lookup, a reader should check the "NULL" pointer
>>> it get at the end of its lookup has is the "hash" value of its chain.
>>>
>>> If not -> restart the loop, aka "goto begin;" :)
>>>
>>> We could avoid memory barriers then.
>>>
>>> In the two cases Corey mentioned, this trick could let us avoid 
>>> memory barriers.
>>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>>
>>> What do you think ?
>>>     
>>
>> Kinky!!!  ;-)
>>   
> My thought exactly ;-).
> 
>> Then the rcu_dereference() would be supplying the needed memory barriers.
>>
>> Hmmm...  I guess that the only confusion would be if the element got
>> removed and then added to the same list.  But then if its pointer was
>> pseudo-NULL, then that would mean that all subsequent elements had been
>> removed, and all preceding ones added after the scan started.
>>
>> Which might well be harmless, but I must defer to you on this one at
>> the moment.
>>   
> I believe that is harmless, as re-scanning the same data should be fine.
> 
>> If you need a larger hash table, another approach would be to set the
>> pointer's low-order bit, allowing the upper bits to be a full-sized
>> index -- or even a pointer to the list header.  Just make very sure
>> to clear the pointer when freeing, or an element on the freelist
>> could end up looking like a legitimate end of list...  Which again
>> might well be safe, but why inflict this on oneself?
>>   
> Kind of my thought, too.  That's a lot of work to avoid a single 
> smb_wmb() on the socket creation path.  Plus this could be extra confusing.
> 

Sure this smp_wmb() seems harmless (but, remember this infrastructure will
next be deployed for TCP sockets as well ;) )

But saving smp_rmb() at lookup time, for each item is a clear win, no ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 29, 2008, 9:58 p.m.
On Wed, Oct 29, 2008 at 04:29:19PM -0500, Corey Minyard wrote:
> Paul E. McKenney wrote:
>> O
> ..snip
>>> Hum... Another way of handling all those cases and avoid memory barriers
>>> would be to have different "NULL" pointers.
>>>
>>> Each hash chain should have a unique "NULL" pointer (in the case of UDP, 
>>> it
>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>>
>>> Then, when performing a lookup, a reader should check the "NULL" pointer
>>> it get at the end of its lookup has is the "hash" value of its chain.
>>>
>>> If not -> restart the loop, aka "goto begin;" :)
>>>
>>> We could avoid memory barriers then.
>>>
>>> In the two cases Corey mentioned, this trick could let us avoid memory 
>>> barriers.
>>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>>
>>> What do you think ?
>>>     
>>
>> Kinky!!!  ;-)
>>   
> My thought exactly ;-).
>
>> Then the rcu_dereference() would be supplying the needed memory barriers.
>>
>> Hmmm...  I guess that the only confusion would be if the element got
>> removed and then added to the same list.  But then if its pointer was
>> pseudo-NULL, then that would mean that all subsequent elements had been
>> removed, and all preceding ones added after the scan started.
>>
>> Which might well be harmless, but I must defer to you on this one at
>> the moment.
>>   
> I believe that is harmless, as re-scanning the same data should be fine.
>
>> If you need a larger hash table, another approach would be to set the
>> pointer's low-order bit, allowing the upper bits to be a full-sized
>> index -- or even a pointer to the list header.  Just make very sure
>> to clear the pointer when freeing, or an element on the freelist
>> could end up looking like a legitimate end of list...  Which again
>> might well be safe, but why inflict this on oneself?
>>   
> Kind of my thought, too.  That's a lot of work to avoid a single smb_wmb() 
> on the socket creation path.  Plus this could be extra confusing.

Just to be clear, I was fulminating against any potential failure to
clear the pseudo-NULL pointer, not against the pseudo-pointer itself.
This sort of trick is already used in some of the RCU-protected trees
(for example, FIB tree, IIRC), so I would look a bit funny fulminating
too hard against it.  ;-)

The only other high-level approach I have come up with thus far is to
maintain separate hash tables for the long-lived UDP sockets (protected
by RCU) and for the short-lived UDP sockets (protected by locking).
Given the usual bimodal traffic pattern, most of the sockets are short
lived, but most of the data is transmitted over long-lived sockets.  If a
socket receives more than N packets (10? 50? 100?), it is moved from the
short-lived table to the long-lived table.  Sockets on the short-lived
table may be freed directly, while sockets on the long-lived table must
be RCU freed -- but this added overhead should be in the noise for a
long-lived connection.  Lookups hit the RCU-protected table, then the lock
protected table, then the RCU-protected table again, but still holding
the lock.  (Clearly, only search until you find the desired socket.)

However, I am not certain that this short-term/long-term approach is
better than the approach that Eric is proposing.  It might in fact be
worse.  But I throw it out anyway on the off-chance that it is helpful
as a comparison or as a solution to some future problem.

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra - Oct. 30, 2008, 11:04 a.m.
On Wed, 2008-10-29 at 21:00 +0100, Eric Dumazet wrote:

> 
> Hum... Another way of handling all those cases and avoid memory barriers
> would be to have different "NULL" pointers.
> 
> Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
> can be the 128 values : [ (void*)0 .. (void *)127 ]

Why not use the bucket pointer as terminating condition?

Because all you really need is a pointer that is specific per bucket,
and not a valid element, right?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra - Oct. 30, 2008, 11:12 a.m.
On Wed, 2008-10-29 at 15:36 +0100, Eric Dumazet wrote:
> +/**
> + * hlist_for_each_entry_rcu_safenext - iterate over rcu list of given type
> + * @tpos:      the type * to use as a loop cursor.
> + * @pos:       the &struct hlist_node to use as a loop cursor.
> + * @head:      the head for your list.
> + * @member:    the name of the hlist_node within the struct.
> + * @next:       the &struct hlist_node to use as a next cursor
> + *
> + * Special version of hlist_for_each_entry_rcu that make sure
> + * each next pointer is fetched before each iteration.
> + */
> +#define hlist_for_each_entry_rcu_safenext(tpos, pos, head, member, next) \
> +       for (pos = rcu_dereference((head)->first);                       \
> +               pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && \
> +               ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
> +               pos = rcu_dereference(next))
> +

why _safenext and not _safe like hlist_for_each_entry_safe() which also
keeps a next pointer?

Also note the difference in argument order between these two.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 30, 2008, 11:29 a.m.
Peter Zijlstra a écrit :
> On Wed, 2008-10-29 at 15:36 +0100, Eric Dumazet wrote:
>> +/**
>> + * hlist_for_each_entry_rcu_safenext - iterate over rcu list of given type
>> + * @tpos:      the type * to use as a loop cursor.
>> + * @pos:       the &struct hlist_node to use as a loop cursor.
>> + * @head:      the head for your list.
>> + * @member:    the name of the hlist_node within the struct.
>> + * @next:       the &struct hlist_node to use as a next cursor
>> + *
>> + * Special version of hlist_for_each_entry_rcu that make sure
>> + * each next pointer is fetched before each iteration.
>> + */
>> +#define hlist_for_each_entry_rcu_safenext(tpos, pos, head, member, next) \
>> +       for (pos = rcu_dereference((head)->first);                       \
>> +               pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && \
>> +               ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
>> +               pos = rcu_dereference(next))
>> +
> 
> why _safenext and not _safe like hlist_for_each_entry_safe() which also
> keeps a next pointer?
> 
> Also note the difference in argument order between these two.
> 

Yes, this one is going to vanish soon.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 30, 2008, 11:30 a.m.
Peter Zijlstra a écrit :
> On Wed, 2008-10-29 at 21:00 +0100, Eric Dumazet wrote:
> 
>> Hum... Another way of handling all those cases and avoid memory barriers
>> would be to have different "NULL" pointers.
>>
>> Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
>> can be the 128 values : [ (void*)0 .. (void *)127 ]
> 
> Why not use the bucket pointer as terminating condition?
> 
> Because all you really need is a pointer that is specific per bucket,
> and not a valid element, right?

Yes, but it forces compiler to keep around the bucket pointer.

Big chance this value will be stored on stack.

Next patch will use the least significant bit to distinguish a valid
pointer from a "NULL pointer"

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Oct. 30, 2008, 6:25 p.m.
On Thu, Oct 30, 2008 at 12:30:20PM +0100, Eric Dumazet wrote:
> Peter Zijlstra a écrit :
>> On Wed, 2008-10-29 at 21:00 +0100, Eric Dumazet wrote:
>>> Hum... Another way of handling all those cases and avoid memory barriers
>>> would be to have different "NULL" pointers.
>>>
>>> Each hash chain should have a unique "NULL" pointer (in the case of UDP, 
>>> it
>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>> Why not use the bucket pointer as terminating condition?
>> Because all you really need is a pointer that is specific per bucket,
>> and not a valid element, right?
>
> Yes, but it forces compiler to keep around the bucket pointer.
>
> Big chance this value will be stored on stack.
>
> Next patch will use the least significant bit to distinguish a valid
> pointer from a "NULL pointer"

As you might guess, I do like that idea.  ;-)

On my plane trip, so am reviewing what I believe to be your current
combined patchset consisting of six patches transmitted in this thread:

Message-ID: <49081D67.3050502@cosmosbay.com>
Message-ID: <49082718.2030201@cosmosbay.com>
Message-ID: <490874F2.2060306@cosmosbay.com>
Message-ID: <4908DEDE.5030706@cosmosbay.com>
Message-ID: <49094B0F.2090208@cosmosbay.com>
Message-ID: <490838C6.4060304@cosmosbay.com>

This probably won't be your latest and greatest by the time you receive
this, but it appears to be the latest and greatest that I have.  ;-)

A few comments, search for blank lines.

							Thanx, Paul

> diff --git a/include/linux/list.h b/include/linux/list.h
> index 969f6e9..a3d5dd1 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -654,6 +654,22 @@ static inline void hlist_move_list(struct hlist_head *old,
>  	     pos && ({ prefetch(pos->next); 1;}) &&			 \
>  		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
>  	     pos = pos->next)
> +/**
> + * hlist_for_each_entry_nulls	- iterate over list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_node within the struct.
> + * @nullval: the iteration should stop if a pointer is < nullval
> + *
> + * Special version of hlist_for_each_entry where the end pointer
> + * can be NULL but also any value < nullval.
> + */
> +#define hlist_for_each_entry_nulls(tpos, pos, head, member, nullval)	 \
> +	for (pos = (head)->first;					 \
> +	     ((unsigned long)pos >= nullval) && ({ prefetch(pos->next); 1;}) && \
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = pos->next)

It might be good to take a predicate macro/function instead of nullval,
in order to allow this primitive to be used for a number of different
pseudo-NULL-pointer schemes, and ditto for the similar macros you define
below.  Might be too early to know exactly how such a primitive should
look, though.

So just a random thought at this point.

In any case, this macro cannot be used in read-side critical sections.

Used by sk_for_each_nulls(), which is called by udp_lib_lport_inuse()
and udp_get_first().

udp_lib_lport_inuse() is called by udp_lib_get_port(), which holds the
hslot->lock, so should be OK.

In udp_get_first(), this same lock is held, so again should be OK.

>  /**
>   * hlist_for_each_entry_continue - iterate over a hlist continuing after current point
> @@ -679,6 +695,22 @@ static inline void hlist_move_list(struct hlist_head *old,
>  	     pos = pos->next)
>  
>  /**
> + * hlist_for_each_entry_from_nulls - iterate over a hlist continuing from current point
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @member:	the name of the hlist_node within the struct.
> + * @nullval: the iteration should stop if a pointer is < nullval
> + *
> + * Special version of hlist_for_each_entry_from where the end pointer
> + * can be NULL but also any value < nullval.
> + */
> +#define hlist_for_each_entry_from_nulls(tpos, pos, member, nullval)	\
> +	for (; ((unsigned long)pos >= nullval) && \
> +		({ prefetch(pos->next); 1;})   && \
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = pos->next)

This also must be called with the update-side lock held.  It is
wrappered by sk_for_each_from_nulls(), which in turn is called from
udp_v4_mcast_next() and udp_v6_mcast_next().  According to messages
earlier in this thread, that is the case.

udp_v4_mcast_next() is called from __udp4_lib_mcast_deliver(), which does
hold hslot->lock, so OK.  Ditto for the calls from udp_v6_mcast_next().

Interestingly enough, these two functions use sk_head(), which calls
__sk_head(), which do not do rcu_dereference().  Which is another reason
that this cannot be called from an RCU read-side critical section.

> +/**
>   * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
>   * @tpos:	the type * to use as a loop cursor.
>   * @pos:	the &struct hlist_node to use as a loop cursor.
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index e649bd3..6f78e2b 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -383,5 +383,23 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
>  		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
>  		pos = rcu_dereference(pos->next))
>  
> +/**
> + * hlist_for_each_entry_rcu_nulls - iterate over rcu list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_node within the struct.
> + * @nullval:       the iteration should stop if a pointer is < nullval
> + *
> + * Special version of hlist_for_each_entry_rcu where the end pointer
> + * can be NULL but also any value < nullval.
> + */
> +#define hlist_for_each_entry_rcu_nulls(tpos, pos, head, member, nullval) \
> +	for (pos = rcu_dereference((head)->first);			 \
> +		((unsigned long)pos >= nullval) && 			\
> +		({ prefetch(pos->next); 1; }) &&			\
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
> +		pos = rcu_dereference(pos->next))

Looks good.  It should be possible to get rid of the "pos" argument in
the upcoming version, as all of the offsets should be even, so the
bottom bit would come through unchanged.  In fact, it should be possible
in this variant as well:

	((unsigned long)pos - offsetof(typeof(*tpos)) >= nullval)

Or am I missing something?

>  #endif	/* __KERNEL__ */
>  #endif
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ada50c0..6e8545b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -361,6 +361,27 @@ static __inline__ int sk_del_node_init(struct sock *sk)
>  	return rc;
>  }
>  
> +static __inline__ int __sk_del_node_init_rcu(struct sock *sk)
> +{
> +	if (sk_hashed(sk)) {
> +		hlist_del_init_rcu(&sk->sk_node);

Won't your pseudo-NULL pointers mess up __hlist_del(), which is called
from hlist_del_init_rcu(), and which expects real NULL pointers?

Seems like you need _nulls variants of these list primitives as well.

> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static __inline__ int sk_del_node_init_rcu(struct sock *sk)
> +{
> +	int rc = __sk_del_node_init_rcu(sk);
> +
> +	if (rc) {
> +		/* paranoid for a while -acme */
> +		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
> +		__sock_put(sk);
> +	}
> +	return rc;
> +}
> +
>  static __inline__ void __sk_add_node(struct sock *sk, struct hlist_head *list)
>  {
>  	hlist_add_head(&sk->sk_node, list);
> @@ -372,6 +393,17 @@ static __inline__ void sk_add_node(struct sock *sk, struct hlist_head *list)
>  	__sk_add_node(sk, list);
>  }
>  
> +static __inline__ void __sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
> +{
> +	hlist_add_head_rcu(&sk->sk_node, list);

Same here -- hlist_add_head_rcu() expects real NULL pointers.

Or are these different lists?  I believe that these are the same lists,
given that the ->sk_node field is used in both cases.

So I believe that you need parallel _nulls primitives for this one also.

> +}
> +
> +static __inline__ void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
> +{
> +	sock_hold(sk);
> +	__sk_add_node_rcu(sk, list);
> +}
> +
>  static __inline__ void __sk_del_bind_node(struct sock *sk)
>  {
>  	__hlist_del(&sk->sk_bind_node);
> @@ -385,9 +417,16 @@ static __inline__ void sk_add_bind_node(struct sock *sk,
>  
>  #define sk_for_each(__sk, node, list) \
>  	hlist_for_each_entry(__sk, node, list, sk_node)
> +#define sk_for_each_nulls(__sk, node, list, nullval) \
> +	hlist_for_each_entry_nulls(__sk, node, list, sk_node, nullval)
> +#define sk_for_each_rcu_nulls(__sk, node, list, nullval) \
> +	hlist_for_each_entry_rcu_nulls(__sk, node, list, sk_node, nullval)
>  #define sk_for_each_from(__sk, node) \
>  	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
>  		hlist_for_each_entry_from(__sk, node, sk_node)
> +#define sk_for_each_from_nulls(__sk, node, nullval) \
> +	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
> +		hlist_for_each_entry_from_nulls(__sk, node, sk_node, nullval)
>  #define sk_for_each_continue(__sk, node) \
>  	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
>  		hlist_for_each_entry_continue(__sk, node, sk_node)
> @@ -587,8 +626,9 @@ struct proto {
>  	int			*sysctl_rmem;
>  	int			max_header;
>  
> -	struct kmem_cache		*slab;
> +	struct kmem_cache	*slab;
>  	unsigned int		obj_size;
> +	int			slab_flags;
>  
>  	atomic_t		*orphan_count;
>  
> @@ -597,7 +637,7 @@ struct proto {
>  
>  	union {
>  		struct inet_hashinfo	*hashinfo;
> -		struct hlist_head	*udp_hash;
> +		struct udp_table	*udp_table;
>  		struct raw_hashinfo	*raw_hash;
>  	} h;
>  
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 1e20509..df2bfe5 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -50,8 +50,15 @@ struct udp_skb_cb {
>  };
>  #define UDP_SKB_CB(__skb)	((struct udp_skb_cb *)((__skb)->cb))
>  
> -extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
> -extern rwlock_t udp_hash_lock;
> +struct udp_hslot {
> +	struct hlist_head	head;
> +	spinlock_t		lock;
> +} __attribute__((aligned(2 * sizeof(long))));
> +struct udp_table {
> +	struct udp_hslot	hash[UDP_HTABLE_SIZE];
> +};
> +extern struct udp_table udp_table;
> +extern void udp_table_init(struct udp_table *);
>  
>  
>  /* Note: this must match 'valbool' in sock_setsockopt */
> @@ -110,15 +117,7 @@ static inline void udp_lib_hash(struct sock *sk)
>  	BUG();
>  }
>  
> -static inline void udp_lib_unhash(struct sock *sk)
> -{
> -	write_lock_bh(&udp_hash_lock);
> -	if (sk_del_node_init(sk)) {
> -		inet_sk(sk)->num = 0;
> -		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -	}
> -	write_unlock_bh(&udp_hash_lock);
> -}
> +extern void udp_lib_unhash(struct sock *sk);
>  
>  static inline void udp_lib_close(struct sock *sk, long timeout)
>  {
> @@ -187,7 +186,7 @@ extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
>  struct udp_seq_afinfo {
>  	char			*name;
>  	sa_family_t		family;
> -	struct hlist_head	*hashtable;
> +	struct udp_table	*udp_table;
>  	struct file_operations	seq_fops;
>  	struct seq_operations	seq_ops;
>  };
> @@ -196,7 +195,7 @@ struct udp_iter_state {
>  	struct seq_net_private  p;
>  	sa_family_t		family;
>  	int			bucket;
> -	struct hlist_head	*hashtable;
> +	struct udp_table	*udp_table;
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/include/net/udplite.h b/include/net/udplite.h
> index b76b2e3..afdffe6 100644
> --- a/include/net/udplite.h
> +++ b/include/net/udplite.h
> @@ -11,7 +11,7 @@
>  #define UDPLITE_RECV_CSCOV   11 /* receiver partial coverage (threshold ) */
>  
>  extern struct proto 		udplite_prot;
> -extern struct hlist_head 	udplite_hash[UDP_HTABLE_SIZE];
> +extern struct udp_table		udplite_table;
>  
>  /*
>   *	Checksum computation is all in software, hence simpler getfrag.
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5e2a313..ded1eb5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2042,7 +2042,8 @@ int proto_register(struct proto *prot, int alloc_slab)
>  
>  	if (alloc_slab) {
>  		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
> -					       SLAB_HWCACHE_ALIGN, NULL);
> +					SLAB_HWCACHE_ALIGN | prot->slab_flags,
> +					NULL);
>  
>  		if (prot->slab == NULL) {
>  			printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n",
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 2095abc..915d92d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -104,12 +104,8 @@
>  #include <net/xfrm.h>
>  #include "udp_impl.h"
>  
> -/*
> - *	Snmp MIB for the UDP layer
> - */
> -
> -struct hlist_head udp_hash[UDP_HTABLE_SIZE];
> -DEFINE_RWLOCK(udp_hash_lock);
> +struct udp_table udp_table;
> +EXPORT_SYMBOL(udp_table);
>  
>  int sysctl_udp_mem[3] __read_mostly;
>  int sysctl_udp_rmem_min __read_mostly;
> @@ -123,7 +119,7 @@ atomic_t udp_memory_allocated;
>  EXPORT_SYMBOL(udp_memory_allocated);
>  
>  static int udp_lib_lport_inuse(struct net *net, __u16 num,
> -			       const struct hlist_head udptable[],
> +			       const struct udp_hslot *hslot,
>  			       struct sock *sk,
>  			       int (*saddr_comp)(const struct sock *sk1,
>  						 const struct sock *sk2))
> @@ -131,7 +127,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
>  	struct sock *sk2;
>  	struct hlist_node *node;
>  
> -	sk_for_each(sk2, node, &udptable[udp_hashfn(net, num)])
> +	sk_for_each_nulls(sk2, node, &hslot->head, UDP_HTABLE_SIZE)
>  		if (net_eq(sock_net(sk2), net)			&&
>  		    sk2 != sk					&&
>  		    sk2->sk_hash == num				&&
> @@ -154,12 +150,11 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
>  		       int (*saddr_comp)(const struct sock *sk1,
>  					 const struct sock *sk2 )    )
>  {
> -	struct hlist_head *udptable = sk->sk_prot->h.udp_hash;
> +	struct udp_hslot *hslot;
> +	struct udp_table *udptable = sk->sk_prot->h.udp_table;
>  	int    error = 1;
>  	struct net *net = sock_net(sk);
>  
> -	write_lock_bh(&udp_hash_lock);
> -
>  	if (!snum) {
>  		int low, high, remaining;
>  		unsigned rand;
> @@ -171,26 +166,39 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,

inet_get_local_port_range() used to be under the write_lock_bh(),
and no longer is.  So, how do we now protect against concurrent port
range changes?

>  		rand = net_random();
>  		snum = first = rand % remaining + low;
>  		rand |= 1;
> -		while (udp_lib_lport_inuse(net, snum, udptable, sk,
> -					   saddr_comp)) {
> +		for (;;) {
> +			hslot = &udptable->hash[udp_hashfn(net, snum)];
> +			spin_lock_bh(&hslot->lock);
> +			if (!udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
> +				break;
> +			spin_unlock_bh(&hslot->lock);
>  			do {
>  				snum = snum + rand;
>  			} while (snum < low || snum > high);

The above -really- confuses me, but not part of this patch.  If we are
out of range, keep going?  Well, I guess that since it is a short, we
cannot go very far...

>  			if (snum == first)
>  				goto fail;

And I don't understand how we are guaranteed to have scanned all the
possible ports upon failure, but happy to leave that to you guys.

>  		}
> -	} else if (udp_lib_lport_inuse(net, snum, udptable, sk, saddr_comp))
> -		goto fail;
> -
> +	} else {
> +		hslot = &udptable->hash[udp_hashfn(net, snum)];
> +		spin_lock_bh(&hslot->lock);
> +		if (udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
> +			goto fail_unlock;
> +	}
>  	inet_sk(sk)->num = snum;
>  	sk->sk_hash = snum;
>  	if (sk_unhashed(sk)) {
> -		sk_add_node(sk, &udptable[udp_hashfn(net, snum)]);
> +		/*
> +		 * We need that previous write to sk->sk_hash committed
> +		 * before write to sk->next done in following add_node() variant
> +		 */
> +		smp_wmb();
> +		sk_add_node_rcu(sk, &hslot->head);
>  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>  	}
>  	error = 0;
> +fail_unlock:
> +	spin_unlock_bh(&hslot->lock);
>  fail:
> -	write_unlock_bh(&udp_hash_lock);
>  	return error;
>  }
>  
> @@ -208,63 +216,96 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
>  	return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal);
>  }
>  
> +static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
> +			 unsigned short hnum,
> +			 __be16 sport, __be32 daddr, __be16 dport, int dif)
> +{
> +	int score = -1;
> +
> +	if (net_eq(sock_net(sk), net) && sk->sk_hash == hnum &&
> +			!ipv6_only_sock(sk)) {
> +		struct inet_sock *inet = inet_sk(sk);
> +
> +		score = (sk->sk_family == PF_INET ? 1 : 0);
> +		if (inet->rcv_saddr) {
> +			if (inet->rcv_saddr != daddr)
> +				return -1;
> +			score += 2;
> +		}
> +		if (inet->daddr) {
> +			if (inet->daddr != saddr)
> +				return -1;
> +			score += 2;
> +		}
> +		if (inet->dport) {
> +			if (inet->dport != sport)
> +				return -1;
> +			score += 2;
> +		}
> +		if (sk->sk_bound_dev_if) {
> +			if (sk->sk_bound_dev_if != dif)
> +				return -1;
> +			score += 2;
> +		}
> +	}
> +	return score;
> +}
> +
>  /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
>   * harder than this. -DaveM
>   */
>  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
>  		__be16 sport, __be32 daddr, __be16 dport,
> -		int dif, struct hlist_head udptable[])
> +		int dif, struct udp_table *udptable)
>  {
> -	struct sock *sk, *result = NULL;
> +	struct sock *sk, *result;
>  	struct hlist_node *node;
>  	unsigned short hnum = ntohs(dport);
> -	int badness = -1;
> -
> -	read_lock(&udp_hash_lock);
> -	sk_for_each(sk, node, &udptable[udp_hashfn(net, hnum)]) {
> -		struct inet_sock *inet = inet_sk(sk);
> -
> -		if (net_eq(sock_net(sk), net) && sk->sk_hash == hnum &&
> -				!ipv6_only_sock(sk)) {
> -			int score = (sk->sk_family == PF_INET ? 1 : 0);
> -			if (inet->rcv_saddr) {
> -				if (inet->rcv_saddr != daddr)
> -					continue;
> -				score+=2;
> -			}
> -			if (inet->daddr) {
> -				if (inet->daddr != saddr)
> -					continue;
> -				score+=2;
> -			}
> -			if (inet->dport) {
> -				if (inet->dport != sport)
> -					continue;
> -				score+=2;
> -			}
> -			if (sk->sk_bound_dev_if) {
> -				if (sk->sk_bound_dev_if != dif)
> -					continue;
> -				score+=2;
> -			}
> -			if (score == 9) {
> -				result = sk;
> -				break;
> -			} else if (score > badness) {
> -				result = sk;
> -				badness = score;
> -			}
> +	unsigned int hash = udp_hashfn(net, hnum);
> +	struct udp_hslot *hslot = &udptable->hash[hash];
> +	int score, badness;
> +
> +	rcu_read_lock();
> +begin:
> +	result = NULL;
> +	badness = -1;
> +	sk_for_each_rcu_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
> +		/*
> +		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
> +		 * We must check this item was not moved to another chain
> +		 */
> +		if (udp_hashfn(net, sk->sk_hash) != hash)
> +			goto begin;
> +		score = compute_score(sk, net, saddr, hnum, sport,
> +				      daddr, dport, dif);
> +		if (score > badness) {
> +			result = sk;
> +			badness = score;
> + 		}
> +	}
> +	/*
> +	 * if the 'NULL' pointer we got at the end of this lookup is
> +	 * not the expected one, we must restart lookup.
> +	 */
> +	if ((unsigned long)node != hash)
> +		goto begin;
> +
> +	if (result) {
> +		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
> +			result = NULL;
> +		else if (unlikely(compute_score(result, net, saddr, hnum, sport,
> +				  daddr, dport, dif) < badness)) {
> +			sock_put(result);
> +			goto begin;
>  		}
>  	}
> -	if (result)
> -		sock_hold(result);
> -	read_unlock(&udp_hash_lock);
> +	rcu_read_unlock();
>  	return result;
>  }
>  
>  static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb,
>  						 __be16 sport, __be16 dport,
> -						 struct hlist_head udptable[])
> +						 struct udp_table *udptable)
>  {
>  	struct sock *sk;
>  	const struct iphdr *iph = ip_hdr(skb);
> @@ -280,7 +321,7 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb,
>  struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
>  			     __be32 daddr, __be16 dport, int dif)
>  {
> -	return __udp4_lib_lookup(net, saddr, sport, daddr, dport, dif, udp_hash);
> +	return __udp4_lib_lookup(net, saddr, sport, daddr, dport, dif, &udp_table);
>  }
>  EXPORT_SYMBOL_GPL(udp4_lib_lookup);
>  
> @@ -293,7 +334,7 @@ static inline struct sock *udp_v4_mcast_next(struct sock *sk,
>  	struct sock *s = sk;
>  	unsigned short hnum = ntohs(loc_port);
>  
> -	sk_for_each_from(s, node) {
> +	sk_for_each_from_nulls(s, node, UDP_HTABLE_SIZE) {
>  		struct inet_sock *inet = inet_sk(s);
>  
>  		if (s->sk_hash != hnum					||
> @@ -323,7 +364,7 @@ found:
>   * to find the appropriate port.
>   */
>  
> -void __udp4_lib_err(struct sk_buff *skb, u32 info, struct hlist_head udptable[])
> +void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
>  {
>  	struct inet_sock *inet;
>  	struct iphdr *iph = (struct iphdr*)skb->data;
> @@ -392,7 +433,7 @@ out:
>  
>  void udp_err(struct sk_buff *skb, u32 info)
>  {
> -	__udp4_lib_err(skb, info, udp_hash);
> +	__udp4_lib_err(skb, info, &udp_table);
>  }
>  
>  /*
> @@ -933,6 +974,21 @@ int udp_disconnect(struct sock *sk, int flags)
>  	return 0;
>  }
>  
> +void udp_lib_unhash(struct sock *sk)
> +{
> +	struct udp_table *udptable = sk->sk_prot->h.udp_table;
> +	unsigned int hash = udp_hashfn(sock_net(sk), sk->sk_hash);
> +	struct udp_hslot *hslot = &udptable->hash[hash];
> +
> +	spin_lock(&hslot->lock);
> +	if (sk_del_node_init_rcu(sk)) {
> +		inet_sk(sk)->num = 0;
> +		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> +	}
> +	spin_unlock(&hslot->lock);
> +}
> +EXPORT_SYMBOL(udp_lib_unhash);
> +
>  static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  {
>  	int is_udplite = IS_UDPLITE(sk);
> @@ -1071,13 +1127,14 @@ drop:
>  static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>  				    struct udphdr  *uh,
>  				    __be32 saddr, __be32 daddr,
> -				    struct hlist_head udptable[])
> +				    struct udp_table *udptable)
>  {
>  	struct sock *sk;
> +	struct udp_hslot *hslot = &udptable->hash[udp_hashfn(net, ntohs(uh->dest))];
>  	int dif;
>  
> -	read_lock(&udp_hash_lock);
> -	sk = sk_head(&udptable[udp_hashfn(net, ntohs(uh->dest))]);
> +	spin_lock(&hslot->lock);
> +	sk = sk_head(&hslot->head);
>  	dif = skb->dev->ifindex;
>  	sk = udp_v4_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
>  	if (sk) {
> @@ -1102,7 +1159,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>  		} while (sknext);
>  	} else
>  		kfree_skb(skb);
> -	read_unlock(&udp_hash_lock);
> +	spin_unlock(&hslot->lock);
>  	return 0;
>  }
>  
> @@ -1148,7 +1205,7 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
>   *	All we need to do is get the socket, and then do a checksum.
>   */
>  
> -int __udp4_lib_rcv(struct sk_buff *skb, struct hlist_head udptable[],
> +int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  		   int proto)
>  {
>  	struct sock *sk;
> @@ -1246,7 +1303,7 @@ drop:
>  
>  int udp_rcv(struct sk_buff *skb)
>  {
> -	return __udp4_lib_rcv(skb, udp_hash, IPPROTO_UDP);
> +	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
>  }
>  
>  void udp_destroy_sock(struct sock *sk)
> @@ -1488,7 +1545,8 @@ struct proto udp_prot = {
>  	.sysctl_wmem	   = &sysctl_udp_wmem_min,
>  	.sysctl_rmem	   = &sysctl_udp_rmem_min,
>  	.obj_size	   = sizeof(struct udp_sock),
> -	.h.udp_hash	   = udp_hash,
> +	.slab_flags	   = SLAB_DESTROY_BY_RCU,
> +	.h.udp_table	   = &udp_table,
>  #ifdef CONFIG_COMPAT
>  	.compat_setsockopt = compat_udp_setsockopt,
>  	.compat_getsockopt = compat_udp_getsockopt,
> @@ -1498,20 +1556,23 @@ struct proto udp_prot = {
>  /* ------------------------------------------------------------------------ */
>  #ifdef CONFIG_PROC_FS
>  
> -static struct sock *udp_get_first(struct seq_file *seq)
> +static struct sock *udp_get_first(struct seq_file *seq, int start)
>  {
>  	struct sock *sk;
>  	struct udp_iter_state *state = seq->private;
>  	struct net *net = seq_file_net(seq);
>  
> -	for (state->bucket = 0; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) {
> +	for (state->bucket = start; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) {
>  		struct hlist_node *node;
> -		sk_for_each(sk, node, state->hashtable + state->bucket) {
> +		struct udp_hslot *hslot = &state->udp_table->hash[state->bucket];
> +		spin_lock_bh(&hslot->lock);
> +		sk_for_each_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
>  			if (!net_eq(sock_net(sk), net))
>  				continue;
>  			if (sk->sk_family == state->family)
>  				goto found;
>  		}
> +		spin_unlock_bh(&hslot->lock);
>  	}
>  	sk = NULL;
>  found:
> @@ -1525,20 +1586,18 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
>  
>  	do {
>  		sk = sk_next(sk);
> -try_again:
> -		;
>  	} while (sk && (!net_eq(sock_net(sk), net) || sk->sk_family != state->family));
>  
> -	if (!sk && ++state->bucket < UDP_HTABLE_SIZE) {
> -		sk = sk_head(state->hashtable + state->bucket);
> -		goto try_again;
> +	if (!sk) {
> +		spin_unlock_bh(&state->udp_table->hash[state->bucket].lock);
> +		return udp_get_first(seq, state->bucket + 1);
>  	}
>  	return sk;
>  }
>  
>  static struct sock *udp_get_idx(struct seq_file *seq, loff_t pos)
>  {
> -	struct sock *sk = udp_get_first(seq);
> +	struct sock *sk = udp_get_first(seq, 0);
>  
>  	if (sk)
>  		while (pos && (sk = udp_get_next(seq, sk)) != NULL)
> @@ -1547,9 +1606,7 @@ static struct sock *udp_get_idx(struct seq_file *seq, loff_t pos)
>  }
>  
>  static void *udp_seq_start(struct seq_file *seq, loff_t *pos)
> -	__acquires(udp_hash_lock)
>  {
> -	read_lock(&udp_hash_lock);
>  	return *pos ? udp_get_idx(seq, *pos-1) : SEQ_START_TOKEN;
>  }
>  
> @@ -1567,9 +1624,11 @@ static void *udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>  }
>  
>  static void udp_seq_stop(struct seq_file *seq, void *v)
> -	__releases(udp_hash_lock)
>  {
> -	read_unlock(&udp_hash_lock);
> +	struct udp_iter_state *state = seq->private;
> +
> +	if (state->bucket < UDP_HTABLE_SIZE)
> +		spin_unlock_bh(&state->udp_table->hash[state->bucket].lock);
>  }
>  
>  static int udp_seq_open(struct inode *inode, struct file *file)
> @@ -1585,7 +1644,7 @@ static int udp_seq_open(struct inode *inode, struct file *file)
>  
>  	s = ((struct seq_file *)file->private_data)->private;
>  	s->family		= afinfo->family;
> -	s->hashtable		= afinfo->hashtable;
> +	s->udp_table		= afinfo->udp_table;
>  	return err;
>  }
>  
> @@ -1657,7 +1716,7 @@ int udp4_seq_show(struct seq_file *seq, void *v)
>  static struct udp_seq_afinfo udp4_seq_afinfo = {
>  	.name		= "udp",
>  	.family		= AF_INET,
> -	.hashtable	= udp_hash,
> +	.udp_table	= &udp_table,
>  	.seq_fops	= {
>  		.owner	=	THIS_MODULE,
>  	},
> @@ -1692,10 +1751,21 @@ void udp4_proc_exit(void)
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +void __init udp_table_init(struct udp_table *table)
> +{
> +	int i;
> +	
> +	for (i = 0; i < UDP_HTABLE_SIZE; i++) {
> +		table->hash[i].head.first = (struct hlist_node *)i;
> +		spin_lock_init(&table->hash[i].lock);
> +	}
> +}
> +
>  void __init udp_init(void)
>  {
>  	unsigned long limit;
>  
> +	udp_table_init(&udp_table);
>  	/* Set the pressure threshold up by the same strategy of TCP. It is a
>  	 * fraction of global memory that is up to 1/2 at 256 MB, decreasing
>  	 * toward zero with the amount of memory, with a floor of 128 pages.
> @@ -1712,8 +1782,6 @@ void __init udp_init(void)
>  }
>  
>  EXPORT_SYMBOL(udp_disconnect);
> -EXPORT_SYMBOL(udp_hash);
> -EXPORT_SYMBOL(udp_hash_lock);
>  EXPORT_SYMBOL(udp_ioctl);
>  EXPORT_SYMBOL(udp_prot);
>  EXPORT_SYMBOL(udp_sendmsg);
> diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
> index 2e9bad2..9f4a616 100644
> --- a/net/ipv4/udp_impl.h
> +++ b/net/ipv4/udp_impl.h
> @@ -5,8 +5,8 @@
>  #include <net/protocol.h>
>  #include <net/inet_common.h>
>  
> -extern int  	__udp4_lib_rcv(struct sk_buff *, struct hlist_head [], int );
> -extern void 	__udp4_lib_err(struct sk_buff *, u32, struct hlist_head []);
> +extern int  	__udp4_lib_rcv(struct sk_buff *, struct udp_table *, int );
> +extern void 	__udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
>  
>  extern int	udp_v4_get_port(struct sock *sk, unsigned short snum);
>  
> diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
> index 3c80796..c784891 100644
> --- a/net/ipv4/udplite.c
> +++ b/net/ipv4/udplite.c
> @@ -12,16 +12,17 @@
>   */
>  #include "udp_impl.h"
>  
> -struct hlist_head 	udplite_hash[UDP_HTABLE_SIZE];
> +struct udp_table 	udplite_table;
> +EXPORT_SYMBOL(udplite_table);
>  
>  static int udplite_rcv(struct sk_buff *skb)
>  {
> -	return __udp4_lib_rcv(skb, udplite_hash, IPPROTO_UDPLITE);
> +	return __udp4_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
>  }
>  
>  static void udplite_err(struct sk_buff *skb, u32 info)
>  {
> -	__udp4_lib_err(skb, info, udplite_hash);
> +	__udp4_lib_err(skb, info, &udplite_table);
>  }
>  
>  static	struct net_protocol udplite_protocol = {
> @@ -50,7 +51,8 @@ struct proto 	udplite_prot = {
>  	.unhash		   = udp_lib_unhash,
>  	.get_port	   = udp_v4_get_port,
>  	.obj_size	   = sizeof(struct udp_sock),
> -	.h.udp_hash	   = udplite_hash,
> +	.slab_flags	   = SLAB_DESTROY_BY_RCU,
> +	.h.udp_table	   = &udplite_table,
>  #ifdef CONFIG_COMPAT
>  	.compat_setsockopt = compat_udp_setsockopt,
>  	.compat_getsockopt = compat_udp_getsockopt,
> @@ -71,7 +73,7 @@ static struct inet_protosw udplite4_protosw = {
>  static struct udp_seq_afinfo udplite4_seq_afinfo = {
>  	.name		= "udplite",
>  	.family		= AF_INET,
> -	.hashtable	= udplite_hash,
> +	.udp_table 	= &udplite_table,
>  	.seq_fops	= {
>  		.owner	=	THIS_MODULE,
>  	},
> @@ -108,6 +110,7 @@ static inline int udplite4_proc_init(void)
>  
>  void __init udplite4_register(void)
>  {
> +	udp_table_init(&udplite_table);
>  	if (proto_register(&udplite_prot, 1))
>  		goto out_register_err;
>  
> @@ -126,5 +129,4 @@ out_register_err:
>  	printk(KERN_CRIT "%s: Cannot add UDP-Lite protocol.\n", __func__);
>  }
>  
> -EXPORT_SYMBOL(udplite_hash);
>  EXPORT_SYMBOL(udplite_prot);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index e51da8c..8c3671d 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -54,62 +54,97 @@ int udp_v6_get_port(struct sock *sk, unsigned short snum)
>  	return udp_lib_get_port(sk, snum, ipv6_rcv_saddr_equal);
>  }
>  
> +static inline int compute_score(struct sock *sk, struct net *net,
> +				unsigned short hnum,
> +				struct in6_addr *saddr, __be16 sport,
> +				struct in6_addr *daddr, __be16 dport,
> +				int dif)
> +{
> +	int score = -1;
> +
> +	if (net_eq(sock_net(sk), net) && sk->sk_hash == hnum &&
> +			sk->sk_family == PF_INET6) {
> +		struct ipv6_pinfo *np = inet6_sk(sk);
> +		struct inet_sock *inet = inet_sk(sk);
> +
> +		score = 0;
> +		if (inet->dport) {
> +			if (inet->dport != sport)
> +				return -1;
> +			score++;
> +		}
> +		if (!ipv6_addr_any(&np->rcv_saddr)) {
> +			if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
> +				return -1;
> +			score++;
> +		}
> +		if (!ipv6_addr_any(&np->daddr)) {
> +			if (!ipv6_addr_equal(&np->daddr, saddr))
> +				return -1;
> +			score++;
> +		}
> +		if (sk->sk_bound_dev_if) {
> +			if (sk->sk_bound_dev_if != dif)
> +				return -1;
> +			score++;
> +		}
> +	}
> +	return score;
> +}
> +
>  static struct sock *__udp6_lib_lookup(struct net *net,
>  				      struct in6_addr *saddr, __be16 sport,
>  				      struct in6_addr *daddr, __be16 dport,
> -				      int dif, struct hlist_head udptable[])
> +				      int dif, struct udp_table *udptable)
>  {
> -	struct sock *sk, *result = NULL;
> +	struct sock *sk, *result;
>  	struct hlist_node *node;
>  	unsigned short hnum = ntohs(dport);
> -	int badness = -1;
> -
> -	read_lock(&udp_hash_lock);
> -	sk_for_each(sk, node, &udptable[udp_hashfn(net, hnum)]) {
> -		struct inet_sock *inet = inet_sk(sk);
> -
> -		if (net_eq(sock_net(sk), net) && sk->sk_hash == hnum &&
> -				sk->sk_family == PF_INET6) {
> -			struct ipv6_pinfo *np = inet6_sk(sk);
> -			int score = 0;
> -			if (inet->dport) {
> -				if (inet->dport != sport)
> -					continue;
> -				score++;
> -			}
> -			if (!ipv6_addr_any(&np->rcv_saddr)) {
> -				if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
> -					continue;
> -				score++;
> -			}
> -			if (!ipv6_addr_any(&np->daddr)) {
> -				if (!ipv6_addr_equal(&np->daddr, saddr))
> -					continue;
> -				score++;
> -			}
> -			if (sk->sk_bound_dev_if) {
> -				if (sk->sk_bound_dev_if != dif)
> -					continue;
> -				score++;
> -			}
> -			if (score == 4) {
> -				result = sk;
> -				break;
> -			} else if (score > badness) {
> -				result = sk;
> -				badness = score;
> -			}
> +	unsigned int hash = udp_hashfn(net, hnum);
> +	struct udp_hslot *hslot = &udptable->hash[hash];
> +	int score, badness;
> +
> +	rcu_read_lock();
> +begin:
> +	result = NULL;
> +	badness = -1;
> +	sk_for_each_rcu_nulls(sk, node, &hslot->head, UDP_HTABLE_SIZE) {
> +		/*
> +		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
> +		 * We must check this item was not moved to another chain
> +		 */
> +		if (udp_hashfn(net, sk->sk_hash) != hash)
> +			goto begin;
> +		score = compute_score(sk, net, hnum, saddr, sport,
> +				      daddr, dport, dif);
> +		if (score > badness) {
> +			result = sk;
> +			badness = score;
>  		}
>  	}
> -	if (result)
> -		sock_hold(result);
> -	read_unlock(&udp_hash_lock);
> +	/*
> +	 * if the 'NULL' pointer we got at the end of this lookup is
> +	 * not the expected one, we must restart lookup.
> +	 */
> +	if ((unsigned long)node != hash)
> +		goto begin;
> +
> +	if (result) {
> +		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
> +			result = NULL;
> +		else if (unlikely(compute_score(result, net, hnum, saddr, sport,
> +					daddr, dport, dif) < badness)) {
> +			sock_put(result);
> +			goto begin;
> + 		}
> +	}
> +	rcu_read_unlock();
>  	return result;
>  }
>  
>  static struct sock *__udp6_lib_lookup_skb(struct sk_buff *skb,
>  					  __be16 sport, __be16 dport,
> -					  struct hlist_head udptable[])
> +					  struct udp_table *udptable)
>  {
>  	struct sock *sk;
>  	struct ipv6hdr *iph = ipv6_hdr(skb);
> @@ -239,7 +274,7 @@ csum_copy_err:
>  
>  void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>  		    int type, int code, int offset, __be32 info,
> -		    struct hlist_head udptable[]                    )
> +		    struct udp_table *udptable)
>  {
>  	struct ipv6_pinfo *np;
>  	struct ipv6hdr *hdr = (struct ipv6hdr*)skb->data;
> @@ -275,7 +310,7 @@ static __inline__ void udpv6_err(struct sk_buff *skb,
>  				 struct inet6_skb_parm *opt, int type,
>  				 int code, int offset, __be32 info     )
>  {
> -	__udp6_lib_err(skb, opt, type, code, offset, info, udp_hash);
> +	__udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
>  }
>  
>  int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
> @@ -337,7 +372,7 @@ static struct sock *udp_v6_mcast_next(struct sock *sk,
>  	struct sock *s = sk;
>  	unsigned short num = ntohs(loc_port);
>  
> -	sk_for_each_from(s, node) {
> +	sk_for_each_from_nulls(s, node, UDP_HTABLE_SIZE) {
>  		struct inet_sock *inet = inet_sk(s);
>  
>  		if (sock_net(s) != sock_net(sk))
> @@ -374,14 +409,15 @@ static struct sock *udp_v6_mcast_next(struct sock *sk,
>   */
>  static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>  		struct in6_addr *saddr, struct in6_addr *daddr,
> -		struct hlist_head udptable[])
> +		struct udp_table *udptable)
>  {
>  	struct sock *sk, *sk2;
>  	const struct udphdr *uh = udp_hdr(skb);
> +	struct udp_hslot *hslot = &udptable->hash[udp_hashfn(net, ntohs(uh->dest))];
>  	int dif;
>  
> -	read_lock(&udp_hash_lock);
> -	sk = sk_head(&udptable[udp_hashfn(net, ntohs(uh->dest))]);
> +	spin_lock(&hslot->lock);
> +	sk = sk_head(&hslot->head);
>  	dif = inet6_iif(skb);
>  	sk = udp_v6_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
>  	if (!sk) {
> @@ -409,7 +445,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>  		sk_add_backlog(sk, skb);
>  	bh_unlock_sock(sk);
>  out:
> -	read_unlock(&udp_hash_lock);
> +	spin_unlock(&hslot->lock);
>  	return 0;
>  }
>  
> @@ -447,7 +483,7 @@ static inline int udp6_csum_init(struct sk_buff *skb, struct udphdr *uh,
>  	return 0;
>  }
>  
> -int __udp6_lib_rcv(struct sk_buff *skb, struct hlist_head udptable[],
> +int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  		   int proto)
>  {
>  	struct sock *sk;
> @@ -544,7 +580,7 @@ discard:
>  
>  static __inline__ int udpv6_rcv(struct sk_buff *skb)
>  {
> -	return __udp6_lib_rcv(skb, udp_hash, IPPROTO_UDP);
> +	return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
>  }
>  
>  /*
> @@ -1008,7 +1044,7 @@ int udp6_seq_show(struct seq_file *seq, void *v)
>  static struct udp_seq_afinfo udp6_seq_afinfo = {
>  	.name		= "udp6",
>  	.family		= AF_INET6,
> -	.hashtable	= udp_hash,
> +	.udp_table	= &udp_table,
>  	.seq_fops	= {
>  		.owner	=	THIS_MODULE,
>  	},
> @@ -1050,7 +1086,8 @@ struct proto udpv6_prot = {
>  	.sysctl_wmem	   = &sysctl_udp_wmem_min,
>  	.sysctl_rmem	   = &sysctl_udp_rmem_min,
>  	.obj_size	   = sizeof(struct udp6_sock),
> -	.h.udp_hash	   = udp_hash,
> +	.slab_flags	   = SLAB_DESTROY_BY_RCU,
> +	.h.udp_table	   = &udp_table,
>  #ifdef CONFIG_COMPAT
>  	.compat_setsockopt = compat_udpv6_setsockopt,
>  	.compat_getsockopt = compat_udpv6_getsockopt,
> diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
> index 92dd7da..2377920 100644
> --- a/net/ipv6/udp_impl.h
> +++ b/net/ipv6/udp_impl.h
> @@ -7,9 +7,9 @@
>  #include <net/inet_common.h>
>  #include <net/transp_v6.h>
>  
> -extern int  	__udp6_lib_rcv(struct sk_buff *, struct hlist_head [], int );
> +extern int  	__udp6_lib_rcv(struct sk_buff *, struct udp_table *, int );
>  extern void 	__udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *,
> -			       int , int , int , __be32 , struct hlist_head []);
> +			       int , int , int , __be32 , struct udp_table *);
>  
>  extern int	udp_v6_get_port(struct sock *sk, unsigned short snum);
>  
> diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
> index 3cd1a1a..05ab176 100644
> --- a/net/ipv6/udplite.c
> +++ b/net/ipv6/udplite.c
> @@ -15,14 +15,14 @@
>  
>  static int udplitev6_rcv(struct sk_buff *skb)
>  {
> -	return __udp6_lib_rcv(skb, udplite_hash, IPPROTO_UDPLITE);
> +	return __udp6_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
>  }
>  
>  static void udplitev6_err(struct sk_buff *skb,
>  			  struct inet6_skb_parm *opt,
>  			  int type, int code, int offset, __be32 info)
>  {
> -	__udp6_lib_err(skb, opt, type, code, offset, info, udplite_hash);
> +	__udp6_lib_err(skb, opt, type, code, offset, info, &udplite_table);
>  }
>  
>  static struct inet6_protocol udplitev6_protocol = {
> @@ -49,7 +49,8 @@ struct proto udplitev6_prot = {
>  	.unhash		   = udp_lib_unhash,
>  	.get_port	   = udp_v6_get_port,
>  	.obj_size	   = sizeof(struct udp6_sock),
> -	.h.udp_hash	   = udplite_hash,
> + 	.slab_flags	   = SLAB_DESTROY_BY_RCU,
> +	.h.udp_table	   = &udplite_table,
>  #ifdef CONFIG_COMPAT
>  	.compat_setsockopt = compat_udpv6_setsockopt,
>  	.compat_getsockopt = compat_udpv6_getsockopt,
> @@ -95,7 +96,7 @@ void udplitev6_exit(void)
>  static struct udp_seq_afinfo udplite6_seq_afinfo = {
>  	.name		= "udplite6",
>  	.family		= AF_INET6,
> -	.hashtable	= udplite_hash,
> +	.udp_table	= &udplite_table,
>  	.seq_fops	= {
>  		.owner	=	THIS_MODULE,
>  	},
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 31, 2008, 4:40 p.m.
Paul E. McKenney a écrit :
> On Thu, Oct 30, 2008 at 12:30:20PM +0100, Eric Dumazet wrote:
>> -		while (udp_lib_lport_inuse(net, snum, udptable, sk,
>> -					   saddr_comp)) {
>> +		for (;;) {
>> +			hslot = &udptable->hash[udp_hashfn(net, snum)];
>> +			spin_lock_bh(&hslot->lock);
>> +			if (!udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
>> +				break;
>> +			spin_unlock_bh(&hslot->lock);
>>  			do {
>>  				snum = snum + rand;
>>  			} while (snum < low || snum > high);
> 
> The above -really- confuses me, but not part of this patch.  If we are
> out of range, keep going?  Well, I guess that since it is a short, we
> cannot go very far...
> 
>>  			if (snum == first)
>>  				goto fail;
> 
> And I don't understand how we are guaranteed to have scanned all the
> possible ports upon failure, but happy to leave that to you guys.

Well, we have 65536(=2^16) possible port values, and while 'rand' is random,
it has the interesting property/bias of being odd.

We know (thanks modular arithmetic / congruence relation) we will hit
all 65356 values exactly once, after exactly 65536 iterations.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - Nov. 1, 2008, 3:10 a.m.
On Fri, Oct 31, 2008 at 05:40:46PM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
>> On Thu, Oct 30, 2008 at 12:30:20PM +0100, Eric Dumazet wrote:
>>> -		while (udp_lib_lport_inuse(net, snum, udptable, sk,
>>> -					   saddr_comp)) {
>>> +		for (;;) {
>>> +			hslot = &udptable->hash[udp_hashfn(net, snum)];
>>> +			spin_lock_bh(&hslot->lock);
>>> +			if (!udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
>>> +				break;
>>> +			spin_unlock_bh(&hslot->lock);
>>>  			do {
>>>  				snum = snum + rand;
>>>  			} while (snum < low || snum > high);
>> The above -really- confuses me, but not part of this patch.  If we are
>> out of range, keep going?  Well, I guess that since it is a short, we
>> cannot go very far...
>>>  			if (snum == first)
>>>  				goto fail;
>> And I don't understand how we are guaranteed to have scanned all the
>> possible ports upon failure, but happy to leave that to you guys.
>
> Well, we have 65536(=2^16) possible port values, and while 'rand' is 
> random,
> it has the interesting property/bias of being odd.
>
> We know (thanks modular arithmetic / congruence relation) we will hit
> all 65356 values exactly once, after exactly 65536 iterations.

Ah, got it!  Thank you for the explanation!

I was fixating on the low..high interval.  ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e649bd3..3ba2998 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -383,5 +383,22 @@  static inline void hlist_add_after_rcu(struct hlist_node *prev,
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
 		pos = rcu_dereference(pos->next))
 
+/**
+ * hlist_for_each_entry_rcu_safenext - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ * @next:       the &struct hlist_node to use as a next cursor
+ *
+ * Special version of hlist_for_each_entry_rcu that make sure
+ * each next pointer is fetched before each iteration.
+ */
+#define hlist_for_each_entry_rcu_safenext(tpos, pos, head, member, next) \
+	for (pos = rcu_dereference((head)->first);			 \
+		pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&	\
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference(next))
+
 #endif	/* __KERNEL__ */
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 0bea25d..a4f6d3f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -419,8 +419,8 @@  static __inline__ void sk_add_bind_node(struct sock *sk,
 
 #define sk_for_each(__sk, node, list) \
 	hlist_for_each_entry(__sk, node, list, sk_node)
-#define sk_for_each_rcu(__sk, node, list) \
-	hlist_for_each_entry_rcu(__sk, node, list, sk_node)
+#define sk_for_each_rcu_safenext(__sk, node, list, next) \
+	hlist_for_each_entry_rcu_safenext(__sk, node, list, sk_node, next)
 #define sk_for_each_from(__sk, node) \
 	if (__sk && ({ node = &(__sk)->sk_node; 1; })) \
 		hlist_for_each_entry_from(__sk, node, sk_node)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5ba0340..c3ecec8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -256,7 +256,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node;
+	struct hlist_node *node, *next;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -266,7 +266,7 @@  static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu(sk, node, &hslot->head) {
+	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1d9790e..32d914d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -98,7 +98,7 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 				      int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;
-	struct hlist_node *node;
+	struct hlist_node *node, *next;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash = udp_hashfn(net, hnum);
 	struct udp_hslot *hslot = &udptable->hash[hash];
@@ -108,7 +108,7 @@  static struct sock *__udp6_lib_lookup(struct net *net,
 begin:
 	result = NULL;
 	badness = -1;
-	sk_for_each_rcu(sk, node, &hslot->head) {
+	sk_for_each_rcu_safenext(sk, node, &hslot->head, next) {
 		/*
 		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
 		 * We must check this item was not moved to another chain