From patchwork Fri Jan 23 00:14:58 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 19919
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <49790C02.90800@cosmosbay.com>
Date: Fri, 23 Jan 2009 01:14:58 +0100
From: Eric Dumazet
To: Vitaly Mayatskikh
Cc: David Miller, netdev@vger.kernel.org
Subject: Re: speed regression in udp_lib_lport_inuse()
References: <4978EE03.9040207@cosmosbay.com>

Vitaly Mayatskikh wrote:
> At Thu, 22 Jan 2009 23:06:59 +0100, Eric Dumazet wrote:
>
>>> err = bind(s, (const struct sockaddr*)&sa, sizeof(sa));
>> Bug here, if bind() returns -1 (all ports are in use)
>
> Yeah, there was an assert(), but the program runs into problems very
> soon; I was too lazy to handle this situation correctly and just
> removed it ;)
>
>>> Thanks!
>> Hello Vitaly, thanks for this excellent report.
>>
>> Yes, the current code is really not good when all ports are in use:
>>
>> We now have to scan long chains of 220 sockets, 28232 [1] times.
>> That is very long (but at least the thread is preemptible).
>>
>> In the past (before the recent patches), only one thread was allowed to
>> run in the kernel while scanning the udp port table (we had only one
>> global lock, udp_hash_lock, protecting the whole udp table).
>
> Very true, my (older) kernel with udp_hash_lock just became totally
> unresponsive after running this test. .29-rc2 only became jerky, but
> still works.
>
>> That thread was faster because it was not slowed down by other threads.
>> (But the rwlock we used was responsible for starvation of writers if
>> many UDP frames were received.)
>>
>> One way to solve the problem could be the following:
>>
>> 1) Raise UDP_HTABLE_SIZE from 128 to 1024 to reduce average chain lengths.
>>
>> 2) In the bind(0) algorithm, use rcu locking to find a possibly usable
>> port. All cpus can run in parallel, without dirtying locks. Then lock
>> the found chain and recheck that the port is still available before
>> using it.
>
> I think 2 is definitely better than 1, because 1 does not actually fix
> anything, it only postpones the problem slightly.
>
>> [1] replace 28232 by your actual /proc/sys/net/ipv4/ip_local_port_range
>> values: 61000 - 32768 = 28232
>>
>> I will try to code a patch before this weekend.
>
> Cool!
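(By the way, for anyone wanting to reproduce this: a minimal test program
in the same spirit, with the bind() return value checked this time, could
look like the sketch below. This is illustrative code, not Vitaly's
original reproducer; the socket count only needs to exceed the local port
range.)

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define NB_SOCKETS 30000	/* > 28232, the default local port range */

int main(void)
{
	static int fds[NB_SOCKETS];
	struct sockaddr_in sa;
	int i;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa.sin_port = 0;	/* ask the kernel for a random port */

	for (i = 0; i < NB_SOCKETS; i++) {
		fds[i] = socket(AF_INET, SOCK_DGRAM, 0);
		if (fds[i] == -1) {
			perror("socket");
			break;
		}
		/* this is the check the original test dropped */
		if (bind(fds[i], (const struct sockaddr *)&sa,
			 sizeof(sa)) == -1) {
			perror("bind");
			break;
		}
	}
	printf("bound %d sockets\n", i);
	return 0;
}

The last few bind() calls before the break are the slow ones: they are
the calls that have to walk every chain looking for a free port.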
>
>> Thanks
>>
>> Note: I tried to use a mutex to force only one thread into the bind(0)
>> code, but got no real speedup. It should help if you have an SMP
>> machine, though, since only one cpu will be busy in bind(0).
>
> You saved my time, I was thinking about trying mutexes also. Thanks :)

Could you try the following patch? (A rough timing sketch is appended
after the diff, if you want to double check the numbers.)

Thank you

[PATCH] udp: optimize bind(0) if many ports are in use

commit 9088c5609584684149f3fb5b065aa7f18dcb03ff
(udp: Improve port randomization) introduced a regression for the UDP
bind() syscall to a null port (getting a random port) in case a lot of
ports are already in use.

This is because we do about 28000 scans of very long chains (220 sockets
per chain), with many spin_lock_bh()/spin_unlock_bh() calls.

Fix this by using a bitmap (64 bytes for the current value of
UDP_HTABLE_SIZE) so that we scan each chain at most once.

Instead of 250 ms per bind() call, we get 2.9 ms after the patch.

Based on a report from Vitaly Mayatskikh

Reported-by: Vitaly Mayatskikh
Signed-off-by: Eric Dumazet
---

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cf5ab05..adbdbd8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,8 +120,11 @@ EXPORT_SYMBOL(sysctl_udp_wmem_min);
 atomic_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
+#define PORTS_PER_CHAIN (65536 / UDP_HTABLE_SIZE)
+
 static int udp_lib_lport_inuse(struct net *net, __u16 num,
 			       const struct udp_hslot *hslot,
+			       unsigned long *bitmap,
 			       struct sock *sk,
 			       int (*saddr_comp)(const struct sock *sk1,
 						 const struct sock *sk2))
@@ -132,12 +135,16 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 	sk_nulls_for_each(sk2, node, &hslot->head)
 		if (net_eq(sock_net(sk2), net)			&&
 		    sk2 != sk					&&
-		    sk2->sk_hash == num				&&
+		    (bitmap || sk2->sk_hash == num)		&&
 		    (!sk2->sk_reuse || !sk->sk_reuse)		&&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
 		     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
-		    (*saddr_comp)(sk, sk2))
-			return 1;
+		    (*saddr_comp)(sk, sk2)) {
+			if (bitmap)
+				__set_bit(sk2->sk_hash / UDP_HTABLE_SIZE, bitmap);
+			else
+				return 1;
+		}
 	return 0;
 }
 
@@ -158,34 +165,44 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 	struct net *net = sock_net(sk);
 
 	if (!snum) {
-		int low, high, remaining;
-		unsigned rand;
+		int low, high;
+		unsigned rand, slotn, bias;
 		unsigned short first;
+		DECLARE_BITMAP(bitmap, PORTS_PER_CHAIN);
 
 		inet_get_local_port_range(&low, &high);
-		remaining = (high - low) + 1;
 
 		rand = net_random();
-		snum = first = rand % remaining + low;
-		rand |= 1;
-		for (;;) {
-			hslot = &udptable->hash[udp_hashfn(net, snum)];
+		bias = rand;
+		rand = ((rand >> 16) | 1) * UDP_HTABLE_SIZE;
+		for (slotn = 0; slotn < UDP_HTABLE_SIZE; slotn++) {
+			first = slotn + bias;
+			hslot = &udptable->hash[udp_hashfn(net, first)];
+			bitmap_zero(bitmap, PORTS_PER_CHAIN);
 			spin_lock_bh(&hslot->lock);
-			if (!udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
-				break;
-			spin_unlock_bh(&hslot->lock);
+			udp_lib_lport_inuse(net, snum, hslot, bitmap, sk, saddr_comp);
+
+			snum = first;
+			/*
+			 * PORTS_PER_CHAIN loops, because snum is unsigned short
+			 * and we add an odd multiple of UDP_HTABLE_SIZE
+			 */
 			do {
-				snum = snum + rand;
-			} while (snum < low || snum > high);
-			if (snum == first)
-				goto fail;
+				if (low <= snum && snum <= high &&
+				    !test_bit(snum / UDP_HTABLE_SIZE, bitmap))
+					goto found;
+				snum += rand;
+			} while (snum != first);
+			spin_unlock_bh(&hslot->lock);
 		}
+		goto fail;
 	} else {
 		hslot = &udptable->hash[udp_hashfn(net, snum)];
 		spin_lock_bh(&hslot->lock);
-		if (udp_lib_lport_inuse(net, snum, hslot, sk, saddr_comp))
+		if (udp_lib_lport_inuse(net, snum, hslot, NULL, sk, saddr_comp))
 			goto fail_unlock;
 	}
+found:
 	inet_sk(sk)->num = snum;
 	sk->sk_hash = snum;
 	if (sk_unhashed(sk)) {
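If you want to double check the timing numbers (250 ms against 2.9 ms per
bind() call once the range is nearly full), wrapping each bind() in
clock_gettime() is enough for a rough measurement. A sketch, assuming the
test program above (bind_seconds() is only an illustrative helper, and
you may need -lrt on older glibc):

#include <time.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Time one bind() call; a rough per-call measurement, not a benchmark. */
static double bind_seconds(int fd, const struct sockaddr_in *sa)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	bind(fd, (const struct sockaddr *)sa, sizeof(*sa));
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

The interesting samples are the last successful calls before the port
range is exhausted, since those are the ones that must scan every chain.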