From patchwork Tue Jan 22 19:50:32 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Tom Herbert
X-Patchwork-Id: 214639
X-Patchwork-Delegate: davem@davemloft.net
Date: Tue, 22 Jan 2013 11:50:32 -0800 (PST)
From: Tom Herbert
To: netdev@vger.kernel.org, davem@davemloft.net
cc: netdev@markandruth.co.uk, eric.dumazet@gmail.com
Subject: [PATCH 3/5] soreuseport: UDP/IPv4 implementation
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

Allow multiple UDP sockets to bind to the same port.

Motivation for soreuseport would be something like a DNS server.  An
alternative would be to recv on the same socket from multiple threads.
As in the case of TCP, the load across these threads tends to be
disproportionate and we also see a lot of contention on the socket
lock.
Note that SO_REUSEADDR already allows multiple UDP sockets to bind to
the same port; however, there is no provision to prevent hijacking and
nothing to distribute packets across all the sockets sharing the same
bound port.  This patch does not change the semantics of SO_REUSEADDR,
but provides a usable form of that functionality for unicast.

Signed-off-by: Tom Herbert
---
 net/ipv4/udp.c |   61 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 79c8dbe..b360b30 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -139,6 +139,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 {
 	struct sock *sk2;
 	struct hlist_nulls_node *node;
+	kuid_t uid = sock_i_uid(sk);
 
 	sk_nulls_for_each(sk2, node, &hslot->head)
 		if (net_eq(sock_net(sk2), net) &&
@@ -147,6 +148,8 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
+		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
+		     !uid_eq(uid, sock_i_uid(sk2))) &&
 		    (*saddr_comp)(sk, sk2)) {
 			if (bitmap)
 				__set_bit(udp_sk(sk2)->udp_port_hash >> log,
@@ -169,6 +172,7 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
 {
 	struct sock *sk2;
 	struct hlist_nulls_node *node;
+	kuid_t uid = sock_i_uid(sk);
 	int res = 0;
 
 	spin_lock(&hslot2->lock);
@@ -179,6 +183,8 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
+		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
+		     !uid_eq(uid, sock_i_uid(sk2))) &&
 		    (*saddr_comp)(sk, sk2)) {
 			res = 1;
 			break;
@@ -337,26 +343,26 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 			!ipv6_only_sock(sk)) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_rcv_saddr) {
 			if (inet->inet_rcv_saddr != daddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -365,7 +371,6 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 /*
  * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, inet_num)
  */
-#define SCORE2_MAX (1 + 2 + 2 + 2)
 static inline int compute_score2(struct sock *sk, struct net *net,
 				 __be32 saddr, __be16 sport,
 				 __be32 daddr, unsigned int hnum, int dif)
@@ -380,21 +385,21 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 		if (inet->inet_num != hnum)
 			return -1;
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -409,19 +414,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash = 0;
 
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
 		score = compute_score2(sk, net, saddr, sport,
 				      daddr, hnum, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
-			if (score == SCORE2_MAX)
-				goto exact_match;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+						    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = next_pseudo_random32(hash);
 		}
 	}
 	/*
@@ -431,9 +446,7 @@ begin:
 	 */
 	if (get_nulls_value(node) != slot2)
 		goto begin;
-
 	if (result) {
-exact_match:
 		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
 			result = NULL;
 		else if (unlikely(compute_score2(result, net, saddr, sport,
@@ -457,7 +470,8 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
 	struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash = 0;
 
 	rcu_read_lock();
 	if (hslot->count > 10) {
@@ -486,13 +500,24 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	}
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
 		score = compute_score(sk, net, saddr, hnum, sport,
 				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+						    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = next_pseudo_random32(hash);
 		}
 	}
 	/*