| Submitter | Eric Dumazet |
|---|---|
| Date | Oct. 8, 2008, 1:55 p.m. |
| Message ID | <48ECBBD8.9060602@cosmosbay.com> |
| Download | mbox | patch |
| Permalink | /patch/3318/ |
| State | Accepted |
| Delegated to: | David Miller |
| Headers | show |
Comments
On Tue, 28 Oct 2008 21:37:15 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > UDP sockets are hashed in a 128 slots hash table. > > This hash table is protected by *one* rwlock. > > This rwlock is readlocked each time an incoming UDP message is handled. > > This rwlock is writelocked each time a socket must be inserted in > hash table (bind time), or deleted from this table (unbind time) > > This is not scalable on SMP machines : > > 1) Even in read mode, lock() and unlock() are atomic operations and > must dirty a contended cache line, shared by all cpus. > > 2) A writer might be starved if many readers are 'in flight'. This can > happen on a machine with some NIC receiving many UDP messages. User > process can be delayed a long time at socket creation/dismantle time. > > > What Corey and I propose is to use RCU to protect this hash table. > > Goals are : > > 1) Optimizing handling of incoming Unicast UDP frames, so that no memory > writes should happen in the fast path. Using an array of rwlocks (one per > slot for example is not an option in this regard) > > Note: Multicasts and broadcasts still will need to take a lock, > because doing a full lockless lookup in this case is difficult. > > 2) No expensive operations in the socket bind/unhash phases : > - No expensive synchronize_rcu() calls. > > - No added rcu_head in socket structure, increasing memory needs, > but more important, forcing us to use call_rcu() calls, > that have the bad property of making sockets structure cold. > (rcu grace period between socket freeing and its potential reuse > make this socket being cold in CPU cache). > David did a previous patch using call_rcu() and noticed a 20% > impact on TCP connection rates. > > Quoting Cristopher Lameter : > "Right. That results in cacheline cooldown. You'd want to recycle > the object as they are cache hot on a per cpu basis. That is screwed > up by the delayed regular rcu processing. We have seen multiple > regressions due to cacheline cooldown. > The only choice in cacheline hot sensitive areas is to deal with the > complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU." > > - Because udp sockets are allocated from dedicated kmem_cache, > use of SLAB_DESTROY_BY_RCU can help here. > > Theory of operation : > --------------------- > > As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()), > special attention must be taken by readers and writers. > > Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed, > reused, inserted in a different chain or in worst case in the same chain > while readers could do lookups in the same time. > > In order to avoid loops, a reader must check each socket found in a chain > really belongs to the chain the reader was traversing. If it finds a > mismatch, lookup must start again at the begining. This *restart* loop > is the reason we had to use rdlock for the multicast case, because > we dont want to send same message several times to the same socket. > > We use RCU only for fast path. Thus, /proc/net/udp still take rdlocks. We should just make it a spin_lock later and speed up udp socket creation. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stephen Hemminger a écrit : > On Tue, 28 Oct 2008 21:37:15 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: >> We use RCU only for fast path. Thus, /proc/net/udp still take rdlocks. > > We should just make it a spin_lock later and speed up udp socket creation. Indeed -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Patch
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 85f8e8e..67d8430 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -155,55 +155,23 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum, write_lock_bh(&udp_hash_lock); if (!snum) { - int i, low, high, remaining; - unsigned rover, best, best_size_so_far; + int low, high, remaining; + unsigned rand; + unsigned short first; inet_get_local_port_range(&low, &high); remaining = (high - low) + 1; - best_size_so_far = UINT_MAX; - best = rover = net_random() % remaining + low; - - /* 1st pass: look for empty (or shortest) hash chain */ - for (i = 0; i < UDP_HTABLE_SIZE; i++) { - int size = 0; - - head = &udptable[udp_hashfn(net, rover)]; - if (hlist_empty(head)) - goto gotit; - - sk_for_each(sk2, node, head) { - if (++size >= best_size_so_far) - goto next; - } - best_size_so_far = size; - best = rover; - next: - /* fold back if end of range */ - if (++rover > high) - rover = low + ((rover - low) - & (UDP_HTABLE_SIZE - 1)); - - - } - - /* 2nd pass: find hole in shortest hash chain */ - rover = best; - for (i = 0; i < (1 << 16) / UDP_HTABLE_SIZE; i++) { - if (! __udp_lib_lport_inuse(net, rover, udptable)) - goto gotit; - rover += UDP_HTABLE_SIZE; - if (rover > high) - rover = low + ((rover - low) - & (UDP_HTABLE_SIZE - 1)); + rand = net_random(); + snum = first = rand % remaining + low; + rand |= 1; + while (__udp_lib_lport_inuse(net, snum, udptable)) { + do { + snum = snum + rand; + } while (snum < low || snum > high); + if (snum == first) + goto fail; } - - - /* All ports in use! */ - goto fail; - -gotit: - snum = rover; } else { head = &udptable[udp_hashfn(net, snum)];