From patchwork Mon Dec 7 13:24:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412023 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=TKG0SRe8; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPGn5y9kz9sVn for ; Tue, 8 Dec 2020 00:26:41 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726531AbgLGN0L (ORCPT ); Mon, 7 Dec 2020 08:26:11 -0500 Received: from smtp-fw-2101.amazon.com ([72.21.196.25]:36461 "EHLO smtp-fw-2101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726026AbgLGN0L (ORCPT ); Mon, 7 Dec 2020 08:26:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347570; x=1638883570; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=HKjIsmqmU/4GV4rzTHMtQpJccy4bICOTjf5HS8mC/Ik=; b=TKG0SRe8m9AkZMK5hQJmowRCbcgX40nS4KKyaCI8ZgdrBBNQPv0fVAX6 aTGNwLdU6HjW/PaBY/Cp3XCHAIMJRZGIE+h1r3zb+CVZiNUdoqn9oQ3JE T17kAnhb8Bxt+CilrvujV9M5lBFduiQ8rckpbtIi68CuJrMBmtPz7oSmA 4=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="67699485" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-1a-715bee71.us-east-1.amazon.com) ([10.43.8.6]) by smtp-border-fw-out-2101.iad2.amazon.com 
with ESMTP; 07 Dec 2020 13:25:37 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1a-715bee71.us-east-1.amazon.com (Postfix) with ESMTPS id C5A93A1DD7; Mon, 7 Dec 2020 13:25:34 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:25:33 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:25:29 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 01/13] tcp: Allow TCP_CLOSE sockets to hold the reuseport group. Date: Mon, 7 Dec 2020 22:24:44 +0900 Message-ID: <20201207132456.65472-2-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org

This patch prepares for migrating incoming connections in the later
commits. It adds a field (num_closed_socks) to struct sock_reuseport so
that TCP_CLOSE sockets can still access the reuseport group.

When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of
child sockets. One kind is referenced by its listening socket, and the
other is not.
The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in
the accept queue of their listening socket. So, we can pop them out and
push them into another listener's queue at the close() or shutdown()
syscall. The latter are the TCP_NEW_SYN_RECV sockets, which are still in
the three-way handshake and not yet in the accept queue. Thus, we cannot
access such sockets at close() or shutdown(), and we have to migrate
these immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when receiving the final ACK or retransmitting
SYN+ACKs. At that time, if we could select a new listener from the same
reuseport group, no connection would be aborted. However, this is
impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL
and forbids access to the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to hold sk_reuseport_cb while any
child socket references them. The point is that reuseport_detach_sock()
is called twice, from inet_unhash() and from sk_destruct(). On the first
call, it decrements num_socks and increments num_closed_socks. On the
second call, after all migrated connections have been accepted, it
decrements num_closed_socks and sets sk_reuseport_cb to NULL.

With this change, closed sockets can keep sk_reuseport_cb until all
child requests have been freed or accepted. Consequently, calling
listen() after shutdown() can fail with EADDRINUSE or EBUSY, because
inet_csk_bind_conflict() and reuseport_add_sock() expect that such
sockets do not have a reuseport group. Therefore, this patch also
loosens these validation rules so that the socket can listen again if it
shares its reuseport group with the other listening sockets.
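
The two-step detach described above can be modelled in userspace. The
sketch below is only an illustration of the counter bookkeeping, not the
kernel code: the toy_* types and functions are hypothetical stand-ins,
there is no locking or RCU, and the group is marked "freed" with a flag
instead of being released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (assumptions); these are NOT the kernel structures. */
enum toy_state { TOY_LISTEN, TOY_CLOSE };

struct toy_group {
	unsigned int num_socks;        /* listeners still selectable */
	unsigned int num_closed_socks; /* closed listeners still holding the group */
	bool freed;                    /* set when the group would be released */
};

struct toy_sock {
	enum toy_state state;
	struct toy_group *group;       /* models sk->sk_reuseport_cb */
};

/* Models the two branches taken by the detach path for TCP sockets. */
static void toy_detach(struct toy_sock *sk)
{
	struct toy_group *g = sk->group;

	assert(g != NULL);

	if (sk->state == TOY_CLOSE) {
		/* Second call: drop the last tie to the group. */
		assert(g->num_closed_socks > 0);
		g->num_closed_socks--;
		sk->group = NULL;
	} else {
		/* First call: stop being selectable, but keep the group. */
		assert(g->num_socks > 0);
		g->num_socks--;
		g->num_closed_socks++;
	}

	if (g->num_socks + g->num_closed_socks == 0)
		g->freed = true;
}

/* First detach happens while the socket is still listening. */
static void toy_close(struct toy_sock *sk)
{
	toy_detach(sk);
	sk->state = TOY_CLOSE;
}

/* Second detach happens when the socket is finally destroyed. */
static void toy_destruct(struct toy_sock *sk)
{
	if (sk->group)
		toy_detach(sk);
}
```

In this model a socket that has gone through toy_close() but not yet
toy_destruct() still points at its group, which is the property the
later migration patches rely on.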
Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h | 5 +++-- net/core/sock_reuseport.c | 39 +++++++++++++++++++++++---------- net/ipv4/inet_connection_sock.c | 7 ++++-- 3 files changed, 35 insertions(+), 16 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 505f1e18e9bf..0e558ca7afbf 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock; struct sock_reuseport { struct rcu_head rcu; - u16 max_socks; /* length of socks */ - u16 num_socks; /* elements in socks */ + u16 max_socks; /* length of socks */ + u16 num_socks; /* elements in socks */ + u16 num_closed_socks; /* closed elements in socks */ /* The last synq overflow event timestamp of this * reuse->socks[] group. */ diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index bbdd3c7b6cb5..c26f4256ff41 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -98,14 +98,15 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) return NULL; more_reuse->num_socks = reuse->num_socks; + more_reuse->num_closed_socks = reuse->num_closed_socks; more_reuse->prog = reuse->prog; more_reuse->reuseport_id = reuse->reuseport_id; more_reuse->bind_inany = reuse->bind_inany; more_reuse->has_conns = reuse->has_conns; + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); - more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); for (i = 0; i < reuse->num_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, @@ -152,8 +153,10 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) reuse = rcu_dereference_protected(sk2->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb, - lockdep_is_held(&reuseport_lock)); - if (old_reuse 
&& old_reuse->num_socks != 1) { + lockdep_is_held(&reuseport_lock)); + if (old_reuse == reuse) { + reuse->num_closed_socks--; + } else if (old_reuse && old_reuse->num_socks != 1) { spin_unlock_bh(&reuseport_lock); return -EBUSY; } @@ -174,8 +177,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) spin_unlock_bh(&reuseport_lock); - if (old_reuse) + if (old_reuse && old_reuse != reuse) call_rcu(&old_reuse->rcu, reuseport_free_rcu); + return 0; } EXPORT_SYMBOL(reuseport_add_sock); @@ -199,17 +203,28 @@ void reuseport_detach_sock(struct sock *sk) */ bpf_sk_reuseport_detach(sk); - rcu_assign_pointer(sk->sk_reuseport_cb, NULL); - - for (i = 0; i < reuse->num_socks; i++) { - if (reuse->socks[i] == sk) { - reuse->socks[i] = reuse->socks[reuse->num_socks - 1]; - reuse->num_socks--; - if (reuse->num_socks == 0) - call_rcu(&reuse->rcu, reuseport_free_rcu); + if (sk->sk_protocol == IPPROTO_TCP && sk->sk_state == TCP_CLOSE) { + reuse->num_closed_socks--; + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + } else { + for (i = 0; i < reuse->num_socks; i++) { + if (reuse->socks[i] != sk) + continue; break; } + + reuse->num_socks--; + reuse->socks[i] = reuse->socks[reuse->num_socks]; + + if (sk->sk_protocol == IPPROTO_TCP) + reuse->num_closed_socks++; + else + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); } + + if (reuse->num_socks + reuse->num_closed_socks == 0) + call_rcu(&reuse->rcu, reuseport_free_rcu); + spin_unlock_bh(&reuseport_lock); } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index f60869acbef0..1451aa9712b0 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk, bool reuse = sk->sk_reuse; bool reuseport = !!sk->sk_reuseport; kuid_t uid = sock_i_uid((struct sock *)sk); + struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb); /* * Unlike other sk 
lookup places we do not check @@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk, if ((!relax || (!reuseport_ok && reuseport && sk2->sk_reuseport && - !rcu_access_pointer(sk->sk_reuseport_cb) && + (!reuseport_cb || + reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) && (sk2->sk_state == TCP_TIME_WAIT || uid_eq(uid, sock_i_uid(sk2))))) && inet_rcv_saddr_equal(sk, sk2, true)) break; } else if (!reuseport_ok || !reuseport || !sk2->sk_reuseport || - rcu_access_pointer(sk->sk_reuseport_cb) || + (reuseport_cb && + reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2)))) { if (inet_rcv_saddr_equal(sk, sk2, true)) From patchwork Mon Dec 7 13:24:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412024 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=e1jAaR+Z; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPGq6217z9sWP for ; Tue, 8 Dec 2020 00:26:43 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726627AbgLGN02 (ORCPT ); Mon, 7 Dec 2020 08:26:28 -0500 Received: from smtp-fw-2101.amazon.com ([72.21.196.25]:36461 "EHLO smtp-fw-2101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725996AbgLGN02 (ORCPT ); 
Mon, 7 Dec 2020 08:26:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347586; x=1638883586; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=GUF0obSd4VoBjnotD58UAfe+GcsZfRAHuRtUll6vREE=; b=e1jAaR+Z4b4eRoATrelGpp+Q1AbpjCIMiSxp9SAbqxOZqBecH1wWZA2B rEaR4vKqJn1bN7g6290EtqwQMGlENYjgtfYPv303s1SaUPYryEb28RGgI 6ZSp3BA+HjFoHTTbn69tIKrN+vcNj4GIOs16zpv4saqYARb56EqLmVKlT s=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="67699577" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-1d-38ae4ad2.us-east-1.amazon.com) ([10.43.8.6]) by smtp-border-fw-out-2101.iad2.amazon.com with ESMTP; 07 Dec 2020 13:25:59 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1d-38ae4ad2.us-east-1.amazon.com (Postfix) with ESMTPS id B8C93A2040; Mon, 7 Dec 2020 13:25:56 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:25:55 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:25:45 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 02/13] bpf: Define migration types for SO_REUSEPORT. 
Date: Mon, 7 Dec 2020 22:24:45 +0900 Message-ID: <20201207132456.65472-3-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org As noted in the preceding commit, there are two migration types. In addition to that, the kernel will run the same eBPF program to select a listener for SYN packets. This patch defines three types to signal the kernel and the eBPF program if it is receiving a new request or migrating ESTABLISHED/SYN_RECV sockets in the accept queue or NEW_SYN_RECV socket during 3WHS. Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 14 ++++++++++++++ tools/include/uapi/linux/bpf.h | 14 ++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1233f14f659f..7a48e0055500 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4423,6 +4423,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + * the accept queue at close() or shutdown(). + * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + * final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. 
It begins from diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1233f14f659f..7a48e0055500 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4423,6 +4423,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + * the accept queue at close() or shutdown(). + * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + * final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. It begins from From patchwork Mon Dec 7 13:24:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412026 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=T+wdSdcL; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPHS74MHz9s1l for ; Tue, 8 Dec 2020 00:27:16 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726661AbgLGN0v (ORCPT ); Mon, 7 Dec 2020 08:26:51 
-0500 Received: from smtp-fw-33001.amazon.com ([207.171.190.10]:35272 "EHLO smtp-fw-33001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726174AbgLGN0v (ORCPT ); Mon, 7 Dec 2020 08:26:51 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347610; x=1638883610; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=ZgNA79r8oboVjgxvQ5hc/k4jl9iZmqF9kCrWvhttAso=; b=T+wdSdcLjqn3Dovn/2gfHPSv46v8GPzF/jc1xvAnHB40+A24MSgKEEaj n4QLHLZtglw/6wNxgC6d8AZP9bSgALP8uBwq6F9uFEuwbUusMvKMv17Ok m7WVARaj1ELCGfC5pp2Qz413hHXA+QqyVIxn68k0YSTL1FPpWUfKj60h+ M=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="101016343" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1e-42f764a0.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-33001.sea14.amazon.com with ESMTP; 07 Dec 2020 13:26:08 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1e-42f764a0.us-east-1.amazon.com (Postfix) with ESMTPS id A8983B3DD3; Mon, 7 Dec 2020 13:26:05 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:26:04 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:26:00 +0000 From: Kuniyuki Iwashima To: "David S . 
Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Waiman Long Subject: [PATCH v2 bpf-next 03/13] Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested() API" Date: Mon, 7 Dec 2020 22:24:46 +0900 Message-ID: <20201207132456.65472-4-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This reverts commit 607904c357c61adf20b8fd18af765e501d61a385 to use spin_lock_bh_nested() in the next commit. Link: https://lore.kernel.org/netdev/9d290a57-49e1-04cd-2487-262b0d7c5844@gmail.com/ Signed-off-by: Kuniyuki Iwashima CC: Waiman Long Acked-by: Waiman Long --- include/linux/spinlock.h | 8 ++++++++ include/linux/spinlock_api_smp.h | 2 ++ include/linux/spinlock_api_up.h | 1 + kernel/locking/spinlock.c | 8 ++++++++ 4 files changed, 19 insertions(+) diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h index 79897841a2cc..c020b375a071 100644 --- a/include/linux/spinlock.h +++ b/include/linux/spinlock.h @@ -227,6 +227,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock_nested(lock, subclass) +# define raw_spin_lock_bh_nested(lock, subclass) \ + _raw_spin_lock_bh_nested(lock, subclass) # define raw_spin_lock_nest_lock(lock, nest_lock) \ do { \ @@ -242,6 +244,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock(((void)(subclass), (lock))) # define 
raw_spin_lock_nest_lock(lock, nest_lock) _raw_spin_lock(lock) +# define raw_spin_lock_bh_nested(lock, subclass) _raw_spin_lock_bh(lock) #endif #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) @@ -369,6 +372,11 @@ do { \ raw_spin_lock_nested(spinlock_check(lock), subclass); \ } while (0) +#define spin_lock_bh_nested(lock, subclass) \ +do { \ + raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\ +} while (0) + #define spin_lock_nest_lock(lock, nest_lock) \ do { \ raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock); \ diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h index 19a9be9d97ee..d565fb6304f2 100644 --- a/include/linux/spinlock_api_smp.h +++ b/include/linux/spinlock_api_smp.h @@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr); void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) __acquires(lock); void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) __acquires(lock); +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) + __acquires(lock); void __lockfunc _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map) __acquires(lock); diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h index d0d188861ad6..d3afef9d8dbe 100644 --- a/include/linux/spinlock_api_up.h +++ b/include/linux/spinlock_api_up.h @@ -57,6 +57,7 @@ #define _raw_spin_lock(lock) __LOCK(lock) #define _raw_spin_lock_nested(lock, subclass) __LOCK(lock) +#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock) #define _raw_read_lock(lock) __LOCK(lock) #define _raw_write_lock(lock) __LOCK(lock) #define _raw_spin_lock_bh(lock) __LOCK_BH(lock) diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c index 0ff08380f531..48e99ed1bdd8 100644 --- a/kernel/locking/spinlock.c +++ b/kernel/locking/spinlock.c @@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) } EXPORT_SYMBOL(_raw_spin_lock_nested); +void 
__lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) +{ + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); +} +EXPORT_SYMBOL(_raw_spin_lock_bh_nested); + unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) { From patchwork Mon Dec 7 13:24:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412027 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=VOgq0R5I; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPHX5ktCz9sWY for ; Tue, 8 Dec 2020 00:27:20 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726708AbgLGN1M (ORCPT ); Mon, 7 Dec 2020 08:27:12 -0500 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:5055 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726081AbgLGN1M (ORCPT ); Mon, 7 Dec 2020 08:27:12 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347631; x=1638883631; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=Mn37ssPc5fLpkPkbvi2B7Yse7CJP+fPNOo1utZBK/SA=; 
b=VOgq0R5IKg9Pud0KwUs6mWzBV9/UxK52YqEklWmZVdE970oWtCrPRQ2k MvLq2Daa3WbLopjrUZmda41z98S1tJO38Q3kRNvcB9Rj5O1ZHnoJSm718 AI/w+6x+nzdqIhQr28gayi+IhadOwnRFNGRksj9IbD+xVoSAL18x9JfMS 8=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="102282574" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1d-2c665b5d.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 07 Dec 2020 13:26:30 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1d-2c665b5d.us-east-1.amazon.com (Postfix) with ESMTPS id 23FD2A17C5; Mon, 7 Dec 2020 13:26:26 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:26:26 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:26:15 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 04/13] tcp: Introduce inet_csk_reqsk_queue_migrate(). Date: Mon, 7 Dec 2020 22:24:47 +0900 Message-ID: <20201207132456.65472-5-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch defines a new function to migrate ESTABLISHED/SYN_RECV sockets. 
Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request holds a reference to
its full socket and its listener. In inet_csk_reqsk_queue_migrate(), we
only unlink the requests from the closing listener's queue and relink
them to the head of the new listener's queue. We do not touch each
request or its reference to the listener, so the migration completes in
O(1) time.

Moreover, if TFO requests caused RST before 3WHS has completed, they are
held in the listener's TFO queue to prevent DDoS attacks. Thus, we also
migrate the requests in the TFO queue in the same way.

After 3WHS has completed, there are three access patterns to incoming
sockets:

  (1) access to the full socket instead of request_sock
  (2) access to request_sock from the accept queue
  (3) access to request_sock from the TFO queue

In the first case, the full socket does not have a reference to its
request socket and listener, so we do not need the correct listener set
in the request socket. In the second case, we always have the correct
listener and currently do not use req->rsk_listener. However, the third
case, that of TCP_SYN_RECV sockets, needs special care, which the next
commit takes.
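
The O(1) relinking described above can be sketched as a splice of one
singly linked queue onto the head of another. This is a minimal
userspace illustration under stated assumptions: toy_req, toy_queue,
toy_enqueue() and toy_migrate() are hypothetical names, and the locking
and backlog accounting of the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy request; dl_next mirrors the link field name used in the patch. */
struct toy_req {
	struct toy_req *dl_next;
	int id;
};

/* Toy queue with head and tail pointers plus a length counter. */
struct toy_queue {
	struct toy_req *head;
	struct toy_req *tail;
	unsigned int len;
};

/* Append one request at the tail, as a listener would on a new connection. */
static void toy_enqueue(struct toy_queue *q, struct toy_req *r)
{
	r->dl_next = NULL;
	if (q->tail)
		q->tail->dl_next = r;
	else
		q->head = r;
	q->tail = r;
	q->len++;
}

/* Move every request of src to the head of dst without walking the list. */
static void toy_migrate(struct toy_queue *src, struct toy_queue *dst)
{
	if (!src->head)
		return;

	if (dst->head)
		src->tail->dl_next = dst->head;	/* old tail now precedes dst's head */
	else
		dst->tail = src->tail;		/* dst was empty: it inherits the tail */

	dst->head = src->head;
	dst->len += src->len;

	src->head = NULL;
	src->tail = NULL;
	src->len = 0;
}
```

Only a constant number of pointers change regardless of queue length,
which is why no per-request work is needed. Note that when the
destination is empty it is the destination's tail that must be updated.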
Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 1 + net/ipv4/inet_connection_sock.c | 68 ++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 7338b3865a2a..2ea2d743f8fc 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk, struct sock *inet_csk_reqsk_queue_add(struct sock *sk, struct request_sock *req, struct sock *child); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk); void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req, unsigned long timeout); struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 1451aa9712b0..5da38a756e4c 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -992,6 +992,74 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk, } EXPORT_SYMBOL(inet_csk_reqsk_queue_add); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) +{ + struct request_sock_queue *old_accept_queue, *new_accept_queue; + struct fastopen_queue *old_fastopenq, *new_fastopenq; + spinlock_t *l1, *l2, *l3, *l4; + + old_accept_queue = &inet_csk(sk)->icsk_accept_queue; + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; + old_fastopenq = &old_accept_queue->fastopenq; + new_fastopenq = &new_accept_queue->fastopenq; + + l1 = &old_accept_queue->rskq_lock; + l2 = &new_accept_queue->rskq_lock; + l3 = &old_fastopenq->lock; + l4 = &new_fastopenq->lock; + + /* sk is never selected as the new listener from reuse->socks[], + * so inversion deadlock does not happen here, + * but change the order to avoid the warning of lockdep. 
+ */ + if (sk < nsk) { + swap(l1, l2); + swap(l3, l4); + } + + spin_lock(l1); + spin_lock_nested(l2, SINGLE_DEPTH_NESTING); + + if (old_accept_queue->rskq_accept_head) { + if (new_accept_queue->rskq_accept_head) + old_accept_queue->rskq_accept_tail->dl_next = + new_accept_queue->rskq_accept_head; + else + new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail; + + new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head; + old_accept_queue->rskq_accept_head = NULL; + old_accept_queue->rskq_accept_tail = NULL; + + WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog); + WRITE_ONCE(sk->sk_ack_backlog, 0); + } + + spin_unlock(l2); + spin_unlock(l1); + + spin_lock_bh(l3); + spin_lock_bh_nested(l4, SINGLE_DEPTH_NESTING); + + new_fastopenq->qlen += old_fastopenq->qlen; + old_fastopenq->qlen = 0; + + if (old_fastopenq->rskq_rst_head) { + if (new_fastopenq->rskq_rst_head) + old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head; + else + new_fastopenq->rskq_rst_tail = old_fastopenq->rskq_rst_tail; + + new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head; + old_fastopenq->rskq_rst_head = NULL; + old_fastopenq->rskq_rst_tail = NULL; + } + + spin_unlock_bh(l4); + spin_unlock_bh(l3); +} +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate); + struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, struct request_sock *req, bool own_req) { From patchwork Mon Dec 7 13:24:48 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412028 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass 
(p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=QeEghicx; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPJC6dP6z9sW9 for ; Tue, 8 Dec 2020 00:27:55 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726730AbgLGN1b (ORCPT ); Mon, 7 Dec 2020 08:27:31 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:36694 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725804AbgLGN1a (ORCPT ); Mon, 7 Dec 2020 08:27:30 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347650; x=1638883650; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=Vonv/Q6hCNxXl/Pnwc4Bge1XzPa1igFZTKnOjMRgcwA=; b=QeEghicxIcu1iQIlOViankC8PG6bWG7rldwwU3Q/lFsPymngP529tYf6 tu4Hh3OcdH0fn7Ymd5HC3Va4IbxwCEYwImMBNmN0h3pNErGslIpScjMYA O64s91Ubkhj+QFHlMkrzTixWJi5iCGzLHFii3CgyZ4A7DwqBIdtzwXZYX A=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="69561666" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1d-37fd6b3d.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 07 Dec 2020 13:26:49 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1d-37fd6b3d.us-east-1.amazon.com (Postfix) with ESMTPS id 2DEAD28481E; Mon, 7 Dec 2020 13:26:45 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:26:45 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com 
From: Kuniyuki Iwashima
To: "David S . Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [PATCH v2 bpf-next 05/13] tcp: Set the new listener to migrated TFO requests.
Date: Mon, 7 Dec 2020 22:24:48 +0900
Message-ID: <20201207132456.65472-6-kuniyu@amazon.co.jp>
In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp>
References: <20201207132456.65472-1-kuniyu@amazon.co.jp>
X-Mailing-List: netdev@vger.kernel.org

A TFO request socket is freed only after both of the following have
happened: the 3WHS has completed (or aborted), and the child socket has
been accepted (or its listener has been closed). Hence, depending on the
order, there can be two kinds of request sockets in the accept queue.

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike for a TCP_ESTABLISHED socket, accept() does not free the request
socket of a TCP_SYN_RECV socket. It is freed later in
reqsk_fastopen_remove(), which also accesses request_sock.rsk_listener.
So, in order to complete TFO socket migration, we have to set
rsk_listener to the current listener at accept(), before
reqsk_fastopen_remove() runs.
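The reference handoff described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions, not kernel
code: toy_sock, toy_req and the toy_* helpers are hypothetical stand-ins
for struct sock, struct request_sock, sock_hold() and sock_put(), and a
plain int replaces the real refcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock: only the reference count. */
struct toy_sock {
	int refcnt;
};

/* Hypothetical stand-in for struct request_sock. */
struct toy_req {
	struct toy_sock *listener;	/* models request_sock.rsk_listener */
};

static void toy_sock_hold(struct toy_sock *sk)
{
	sk->refcnt++;
}

static void toy_sock_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/*
 * Point req at the listener that accept() was called on.
 * Drops the reference held on the old listener and takes one on the
 * new listener. Returns 1 if the listener was swapped, 0 if req
 * already pointed at sk.
 */
static int toy_req_set_listener(struct toy_req *req, struct toy_sock *sk)
{
	if (req->listener == sk)
		return 0;

	toy_sock_put(req->listener);
	toy_sock_hold(sk);
	req->listener = sk;
	return 1;
}
```

Calling toy_req_set_listener() a second time with the same listener
changes nothing, which mirrors the req->rsk_listener != sk check in the
hunk of this patch.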
Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/ipv4/inet_connection_sock.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 5da38a756e4c..143590858c2e 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) tcp_rsk(req)->tfo_listener) { spin_lock_bh(&queue->fastopenq.lock); if (tcp_rsk(req)->tfo_listener) { + if (req->rsk_listener != sk) { + /* TFO request was migrated to another listener so + * the new listener must be used in reqsk_fastopen_remove() + * to hold requests which cause RST. + */ + sock_put(req->rsk_listener); + sock_hold(sk); + req->rsk_listener = sk; + } + /* We are still waiting for the final ACK from 3WHS * so can't free req now. Instead, we set req->sk to * NULL to signify that the child socket is taken @@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req, if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) { BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req); - BUG_ON(sk != req->rsk_listener); /* Paranoid, to prevent race condition if * an inbound pkt destined for child is From patchwork Mon Dec 7 13:24:49 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412029 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; 
From:
Kuniyuki Iwashima
To: "David S . Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [PATCH v2 bpf-next 06/13] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
Date: Mon, 7 Dec 2020 22:24:49 +0900
Message-ID: <20201207132456.65472-7-kuniyu@amazon.co.jp>
In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp>
References: <20201207132456.65472-1-kuniyu@amazon.co.jp>
X-Mailing-List: netdev@vger.kernel.org

This patch lets reuseport_detach_sock() return a pointer to struct sock,
which is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

By default, the kernel selects a new listener randomly. In order to pick
out a different socket every time, we select the last element of socks[]
as the new listener. This behaviour is based on how the kernel moves
sockets in socks[]. (See also [1])

Basically, in order to redistribute sockets evenly, we have to use an eBPF
program called in a later commit, but as a side effect of this default
selection, the kernel can redistribute old requests evenly to new
listeners in the specific case where the application replaces listeners by
generations.

For example, we call listen() for four sockets (A, B, C, D), and close()
the first two in turn. The sockets move in socks[] as below.

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in its accept queue, we can redistribute old requests
evenly to new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
errors may happen. For instance, if only the new listeners have
TCP_SAVE_SYN, old requests do not hold SYN data, so the application will
face inconsistency and cause an error.

Therefore, if there are different kinds of sockets, we must attach an eBPF
program described in later commits.

Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
Reviewed-by: Benjamin Herrenschmidt
Signed-off-by: Kuniyuki Iwashima
---
 include/net/sock_reuseport.h |  2 +-
 net/core/sock_reuseport.c    | 16 +++++++++++++---
 net/ipv4/inet_hashtables.c   |  9 +++++++--
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 0e558ca7afbf..09a1b1539d4c 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -31,7 +31,7 @@ struct sock_reuseport { extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); -extern void reuseport_detach_sock(struct sock *sk); +extern struct sock *reuseport_detach_sock(struct sock *sk); extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index c26f4256ff41..2de42f8103ea 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -184,9 +184,11 @@ int
reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } EXPORT_SYMBOL(reuseport_add_sock); -void reuseport_detach_sock(struct sock *sk) +struct sock *reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; + struct bpf_prog *prog; + struct sock *nsk = NULL; int i; spin_lock_bh(&reuseport_lock); @@ -215,17 +217,25 @@ void reuseport_detach_sock(struct sock *sk) reuse->num_socks--; reuse->socks[i] = reuse->socks[reuse->num_socks]; + prog = rcu_dereference_protected(reuse->prog, + lockdep_is_held(&reuseport_lock)); + + if (sk->sk_protocol == IPPROTO_TCP) { + if (reuse->num_socks && !prog) + nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; - if (sk->sk_protocol == IPPROTO_TCP) reuse->num_closed_socks++; - else + } else { rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + } } if (reuse->num_socks + reuse->num_closed_socks == 0) call_rcu(&reuse->rcu, reuseport_free_rcu); spin_unlock_bh(&reuseport_lock); + + return nsk; } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 45fb450b4522..545538a6bfac 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk) { struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; struct inet_listen_hashbucket *ilb = NULL; + struct sock *nsk; spinlock_t *lock; if (sk_unhashed(sk)) @@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk) if (sk_unhashed(sk)) goto unlock; - if (rcu_access_pointer(sk->sk_reuseport_cb)) - reuseport_detach_sock(sk); + if (rcu_access_pointer(sk->sk_reuseport_cb)) { + nsk = reuseport_detach_sock(sk); + if (nsk) + inet_csk_reqsk_queue_migrate(sk, nsk); + } + if (ilb) { inet_unhash2(hashinfo, sk); ilb->count--; From patchwork Mon Dec 7 13:24:50 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412030 Return-Path: X-Original-To: 
From: Kuniyuki Iwashima
To: "David S . Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [PATCH v2 bpf-next 07/13] tcp: Migrate TCP_NEW_SYN_RECV requests.
Date: Mon, 7 Dec 2020 22:24:50 +0900
Message-ID: <20201207132456.65472-8-kuniyu@amazon.co.jp>
In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp>
References: <20201207132456.65472-1-kuniyu@amazon.co.jp>
X-Mailing-List: netdev@vger.kernel.org

This patch renames reuseport_select_sock() to __reuseport_select_sock()
and adds two wrapper functions around it that pass the migration type
defined in the previous commit.

  reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
  reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for
TCP_NEW_SYN_RECV requests when receiving the final ACK or sending a
SYN+ACK. Therefore, this patch also changes the code to call
reuseport_select_migrated_sock() even if the listening socket is
TCP_CLOSE. If we can pick out a listening socket from the reuseport
group, we rewrite request_sock.rsk_listener and resume processing the
request.
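The wrapper arrangement described above can be sketched in userspace as
follows. This is a simplified model under stated assumptions, not the
kernel implementation: every toy_* name is hypothetical, the group is a
fixed array, hash % num_socks stands in for the kernel's hash-based pick,
and BPF program execution is not modelled at all (has_prog only marks
that a program is attached, in which case a migration request is refused,
as in this patch).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names mirroring BPF_SK_REUSEPORT_MIGRATE_NO/_REQUEST. */
enum toy_migration {
	TOY_MIGRATE_NO = 0,
	TOY_MIGRATE_REQUEST,
};

/* Hypothetical stand-in for struct sock: only the reference count. */
struct toy_sock {
	int refcnt;
};

/* Hypothetical stand-in for struct sock_reuseport. */
struct toy_group {
	struct toy_sock *socks[4];
	unsigned int num_socks;
	int has_prog;	/* non-zero if an eBPF program is attached */
};

/* Core selector, playing the role of __reuseport_select_sock(). */
static struct toy_sock *toy_select_core(const struct toy_group *g,
					unsigned int hash,
					enum toy_migration migration)
{
	if (!g->num_socks)
		return NULL;

	/* With a program attached, this model refuses to migrate. */
	if (g->has_prog && migration != TOY_MIGRATE_NO)
		return NULL;

	/* Simplification: plain modulo instead of the kernel's scaling. */
	return g->socks[hash % g->num_socks];
}

/* Wrapper for the normal SYN path: no reference is taken. */
static struct toy_sock *toy_select_sock(const struct toy_group *g,
					unsigned int hash)
{
	return toy_select_core(g, hash, TOY_MIGRATE_NO);
}

/* Wrapper for migration: returns the socket only if a reference
 * could be taken, i.e. its count has not already dropped to zero.
 */
static struct toy_sock *toy_select_migrated_sock(const struct toy_group *g,
						 unsigned int hash)
{
	struct toy_sock *nsk = toy_select_core(g, hash, TOY_MIGRATE_REQUEST);

	if (nsk && nsk->refcnt > 0) {
		nsk->refcnt++;
		return nsk;
	}

	return NULL;
}
```

The point of the split is that callers on the migration path get a
listener they already hold a reference on, while the existing SYN path
keeps its old behaviour and signature.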
Link: https://lore.kernel.org/bpf/202012020136.bF0Z4Guu-lkp@intel.com/ Reported-by: kernel test robot Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 11 ++++++++ include/net/request_sock.h | 13 ++++++++++ include/net/sock_reuseport.h | 8 +++--- net/core/sock_reuseport.c | 40 ++++++++++++++++++++++++------ net/ipv4/inet_connection_sock.c | 13 ++++++++-- net/ipv4/tcp_ipv4.c | 9 +++++-- net/ipv6/tcp_ipv6.c | 9 +++++-- 7 files changed, 86 insertions(+), 17 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 2ea2d743f8fc..d8c3be31e987 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -272,6 +272,17 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk) reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); } +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk, + struct sock *nsk, + struct request_sock *req) +{ + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue, + &inet_csk(nsk)->icsk_accept_queue, + req); + sock_put(sk); + req->rsk_listener = nsk; +} + static inline int inet_csk_reqsk_queue_len(const struct sock *sk) { return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue); diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 29e41ff3ec93..d18ba0b857cc 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue) atomic_inc(&queue->qlen); } +static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue, + struct request_sock_queue *new_accept_queue, + const struct request_sock *req) +{ + atomic_dec(&old_accept_queue->qlen); + atomic_inc(&new_accept_queue->qlen); + + if (req->num_timeout == 0) { + atomic_dec(&old_accept_queue->young); + atomic_inc(&new_accept_queue->young); + } +} + static inline int reqsk_queue_len(const struct 
request_sock_queue *queue) { return atomic_read(&queue->qlen); diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 09a1b1539d4c..a48259a974be 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); extern struct sock *reuseport_detach_sock(struct sock *sk); -extern struct sock *reuseport_select_sock(struct sock *sk, - u32 hash, - struct sk_buff *skb, - int hdr_len); +extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len); +extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, + struct sk_buff *skb); extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog); extern int reuseport_detach_prog(struct sock *sk); diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 2de42f8103ea..1011c3756c92 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -170,7 +170,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } reuse->socks[reuse->num_socks] = sk; - /* paired with smp_rmb() in reuseport_select_sock() */ + /* paired with smp_rmb() in __reuseport_select_sock() */ smp_wmb(); reuse->num_socks++; rcu_assign_pointer(sk->sk_reuseport_cb, reuse); @@ -277,12 +277,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks, * @hdr_len: BPF filter expects skb data pointer at payload data. If * the skb does not yet point at the payload, this parameter represents * how far the pointer needs to advance to reach the payload. + * @migration: represents if it is selecting a listener for SYN or + * migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket. * Returns a socket that should receive the packet (or NULL on error). 
*/ -struct sock *reuseport_select_sock(struct sock *sk, - u32 hash, - struct sk_buff *skb, - int hdr_len) +static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len, + u8 migration) { struct sock_reuseport *reuse; struct bpf_prog *prog; @@ -296,13 +297,19 @@ struct sock *reuseport_select_sock(struct sock *sk, if (!reuse) goto out; - prog = rcu_dereference(reuse->prog); socks = READ_ONCE(reuse->num_socks); if (likely(socks)) { /* paired with smp_wmb() in reuseport_add_sock() */ smp_rmb(); - if (!prog || !skb) + prog = rcu_dereference(reuse->prog); + if (!prog) + goto select_by_hash; + + if (migration) + goto out; + + if (!skb) goto select_by_hash; if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) @@ -331,8 +338,27 @@ struct sock *reuseport_select_sock(struct sock *sk, rcu_read_unlock(); return sk2; } + +struct sock *reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len) +{ + return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO); +} EXPORT_SYMBOL(reuseport_select_sock); +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, + struct sk_buff *skb) +{ + struct sock *nsk; + + nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST); + if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt))) + return nsk; + + return NULL; +} +EXPORT_SYMBOL(reuseport_select_migrated_sock); + int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) { struct sock_reuseport *reuse; diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 143590858c2e..f042e9122074 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t) struct request_sock_queue *queue = &icsk->icsk_accept_queue; int max_syn_ack_retries, qlen, expire = 0, resend = 0; - if (inet_sk_state_load(sk_listener) != TCP_LISTEN) - goto drop; + if 
(inet_sk_state_load(sk_listener) != TCP_LISTEN) { + sk_listener = reuseport_select_migrated_sock(sk_listener, + req_to_sk(req)->sk_hash, NULL); + if (!sk_listener) { + sk_listener = req->rsk_listener; + goto drop; + } + inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req); + icsk = inet_csk(sk_listener); + queue = &icsk->icsk_accept_queue; + } max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries; /* Normally all the openreqs are young and become mature diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index af2338294598..a4eea6b36795 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1978,8 +1978,13 @@ int tcp_v4_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } /* We own a reference on the listener, increase it again * as we might lose it too soon. 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 1a1510513739..61b8c5855735 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1640,8 +1640,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } sock_hold(sk); refcounted = true; From patchwork Mon Dec 7 13:24:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412032 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=iGnVDJpP; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPK738dfz9sWP for ; Tue, 8 Dec 2020 00:28:43 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726422AbgLGN2V (ORCPT ); Mon, 7 Dec 2020 08:28:21 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:36838 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725823AbgLGN2S (ORCPT ); Mon, 7 Dec 2020 08:28:18 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; 
q=dns/txt; s=amazon201209; t=1607347698; x=1638883698; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=PB9Ez+y4FD9VGYXw++IaSHiEr4C9tHxS7oGDzn/NXxo=; b=iGnVDJpP71Ni0d7zUEtffkJc7NRFkVlT6aQqCJMuQ9P9fXGoxjfCCOg8 DPeDHBvXkmXLpPGflx55nsHkV0N6RcF/Uj02AP/6nP2zU0cTS3tQhIRhW AS37/2GCXvOeUp7FF8vaEq1/couVkbnZXo2JZIT9RBu9PE84/+0PwO+0k g=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="69561783" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-af6a10df.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 07 Dec 2020 13:27:37 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1a-af6a10df.us-east-1.amazon.com (Postfix) with ESMTPS id 9A40FA221B; Mon, 7 Dec 2020 13:27:34 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:27:33 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:27:29 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 08/13] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT. 
Date: Mon, 7 Dec 2020 22:24:51 +0900
Message-ID: <20201207132456.65472-9-kuniyu@amazon.co.jp>
In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp>
References: <20201207132456.65472-1-kuniyu@amazon.co.jp>
X-Mailing-List: netdev@vger.kernel.org

This commit adds two new bpf_attach_type values for
BPF_PROG_TYPE_SK_REUSEPORT so that the kernel can check whether the
attached eBPF program is capable of migrating sockets.

When the eBPF program is attached, the kernel runs it for socket
migration only if the expected_attach_type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. The kernel changes its behaviour
depending on the returned value:

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fall back to the random selection
  - SK_DROP, cancel the migration

Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau
Signed-off-by: Kuniyuki Iwashima
---
 include/uapi/linux/bpf.h       |  2 ++
 kernel/bpf/syscall.c           | 13 +++++++++++++
 tools/include/uapi/linux/bpf.h |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7a48e0055500..c7f6848c0226 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0cd3cc2af9c1..0737673c727c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1920,6 +1920,11 @@ static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr) attr->expected_attach_type = BPF_CGROUP_INET_SOCK_CREATE; break; + case BPF_PROG_TYPE_SK_REUSEPORT: + if
(!attr->expected_attach_type) + attr->expected_attach_type = + BPF_SK_REUSEPORT_SELECT; + break; } } @@ -2003,6 +2008,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, if (expected_attach_type == BPF_SK_LOOKUP) return 0; return -EINVAL; + case BPF_PROG_TYPE_SK_REUSEPORT: + switch (expected_attach_type) { + case BPF_SK_REUSEPORT_SELECT: + case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: + return 0; + default: + return -EINVAL; + } case BPF_PROG_TYPE_EXT: if (expected_attach_type) return -EINVAL; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7a48e0055500..c7f6848c0226 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; From patchwork Mon Dec 7 13:24:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412031 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=lIbaP/P9; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPK60vrzz9sW9 for ; Tue, 8 Dec 2020 00:28:42 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726485AbgLGN2W (ORCPT ); Mon, 7 Dec 2020 08:28:22 -0500 Received: from 
smtp-fw-9103.amazon.com ([207.171.188.200]:54916 "EHLO smtp-fw-9103.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725917AbgLGN2U (ORCPT ); Mon, 7 Dec 2020 08:28:20 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347699; x=1638883699; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=zZWQroLY9J+nGi+ykaI4WpcwCCgdqU/PMdq9gqllP8M=; b=lIbaP/P9pB0fX+tDTg8pBWongkWfx23+avN5YtWJ63RzYcqV8uHUDDeo PmqMLA7uGpOQ3e0W0nKF0FPgFdIWB2SUFqUmig269CBBrUEoAITd6Z23g bJfdmmNjmG+Qod7Oazq5gsDhm3uAijfFhYwAyfi4JNAbzwOVbvWSKeOPu w=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="901122159" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1d-2c665b5d.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9103.sea19.amazon.com with ESMTP; 07 Dec 2020 13:27:53 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1d-2c665b5d.us-east-1.amazon.com (Postfix) with ESMTPS id 4B0C0A1E56; Mon, 7 Dec 2020 13:27:50 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:27:50 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:27:45 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 09/13] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT. 
Date: Mon, 7 Dec 2020 22:24:52 +0900
Message-ID: <20201207132456.65472-10-kuniyu@amazon.co.jp>
In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp>
References: <20201207132456.65472-1-kuniyu@amazon.co.jp>
X-Mailing-List: netdev@vger.kernel.org

This commit introduces a new section (sk_reuseport/migrate) and sets
expected_attach_type for each of the two sections of
BPF_PROG_TYPE_SK_REUSEPORT programs.

Signed-off-by: Kuniyuki Iwashima
---
 tools/lib/bpf/libbpf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 9be88a90a4aa..ba64c891a5e7 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8471,7 +8471,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec, static const struct bpf_sec_def section_defs[] = { BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), - BPF_PROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT), + BPF_EAPROG_SEC("sk_reuseport/migrate", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE), + BPF_EAPROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT), SEC_DEF("kprobe/", KPROBE, .attach_fn = attach_kprobe), BPF_PROG_SEC("uprobe/", BPF_PROG_TYPE_KPROBE),

From patchwork Mon Dec 7 13:24:53 2020
X-Patchwork-Submitter: Kuniyuki Iwashima
X-Patchwork-Id: 1412033
Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=mUxqO/Z6; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPKg0Yp2z9sVn for ; Tue, 8 Dec 2020 00:29:11 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726370AbgLGN2v (ORCPT ); Mon, 7 Dec 2020 08:28:51 -0500 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:5379 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725917AbgLGN2u (ORCPT ); Mon, 7 Dec 2020 08:28:50 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347730; x=1638883730; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=rwwsd+4koA6t9hA5HF4gY60ckOiP8zAYsJKIAgHSb5c=; b=mUxqO/Z689jDpfZwwJjkumDZsIRhViJiM++1aueWuReI77rqGJaOT6f6 AI01b/z/LToI9mGyDM+GW0oPH1DQusji+QbgSmTA1QZkFsZYX+pOLpdk5 hiSq/Xsxvp/gcTUT/0kpVOe3fRR18NuhvY81FeVWkxi4w/RnykfZnrf2F c=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="102282946" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1d-16425a8d.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 07 Dec 2020 13:28:08 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1d-16425a8d.us-east-1.amazon.com (Postfix) with ESMTPS id A71E51010F1; Mon, 7 Dec 2020 13:28:05 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:04 +0000 Received: from 38f9d3582de7.ant.amazon.com 
(10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:00 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 10/13] bpf: Add migration to sk_reuseport_(kern|md). Date: Mon, 7 Dec 2020 22:24:53 +0900 Message-ID: <20201207132456.65472-11-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds a u8 migration field to sk_reuseport_kern and sk_reuseport_md so that the eBPF program can tell whether the kernel is calling it to select a listener for a SYN or to migrate sockets in the accept queue or an immature socket during the 3WHS. Note that this field is accessible only if the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. 
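The access rule for the new field can be sketched in plain userspace C. This is only an illustrative model, not kernel code: the enum and the function name model_migration_access_ok() are hypothetical stand-ins, and the enum values are not the real UAPI values. It assumes the check is "attach type is SELECT_OR_MIGRATE and the read is exactly one byte wide", as described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the attach types; values are illustrative only. */
enum model_attach_type {
	MODEL_SK_REUSEPORT_SELECT,
	MODEL_SK_REUSEPORT_SELECT_OR_MIGRATE,
};

/*
 * Userspace model of the verifier rule for the migration field:
 * the load is accepted only for SELECT_OR_MIGRATE programs and only
 * when the access size matches the one-byte field.
 */
static bool model_migration_access_ok(enum model_attach_type type, size_t size)
{
	return type == MODEL_SK_REUSEPORT_SELECT_OR_MIGRATE &&
	       size == sizeof(uint8_t);
}
```

Under this model, a program loaded from the plain "sk_reuseport" section would be rejected if it reads the field, while one loaded from "sk_reuseport/migrate" may read it as a single byte.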
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/linux/bpf.h | 1 + include/linux/filter.h | 4 ++-- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 15 ++++++++++++--- net/core/sock_reuseport.c | 2 +- tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 18 insertions(+), 6 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d05e75ed8c1b..cdeb27f4ad63 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1914,6 +1914,7 @@ struct sk_reuseport_kern { u32 hash; u32 reuseport_id; bool bind_inany; + u8 migration; }; bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); diff --git a/include/linux/filter.h b/include/linux/filter.h index 1b62397bd124..15d5bf13a905 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -967,12 +967,12 @@ void bpf_warn_invalid_xdp_action(u32 act); #ifdef CONFIG_INET struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash); + u32 hash, u8 migration); #else static inline struct sock * bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { return NULL; } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c7f6848c0226..cf518e83df5c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4462,6 +4462,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? 
*/ __u32 hash; /* A hash of the packet 4 tuples */ + __u8 migration; /* Migration type */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 77001a35768f..7bdf62f24044 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9860,7 +9860,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, struct sock_reuseport *reuse, struct sock *sk, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { reuse_kern->skb = skb; reuse_kern->sk = sk; @@ -9869,16 +9869,17 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; reuse_kern->bind_inany = reuse->bind_inany; + reuse_kern->migration = migration; } struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { struct sk_reuseport_kern reuse_kern; enum sk_action action; - bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash); + bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action = BPF_PROG_RUN(prog, &reuse_kern); if (action == SK_PASS) @@ -10017,6 +10018,10 @@ sk_reuseport_is_valid_access(int off, int size, case offsetof(struct sk_reuseport_md, hash): return size == size_default; + case bpf_ctx_range(struct sk_reuseport_md, migration): + return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && + size == sizeof(__u8); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10089,6 +10094,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, bind_inany): SK_REUSEPORT_LOAD_FIELD(bind_inany); break; + + case offsetof(struct sk_reuseport_md, migration): + SK_REUSEPORT_LOAD_FIELD(migration); + break; } return 
insn - insn_buf; diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 1011c3756c92..b877c8e552d2 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -313,7 +313,7 @@ static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, goto select_by_hash; if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) - sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); + sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration); else sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index c7f6848c0226..cf518e83df5c 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4462,6 +4462,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u8 migration; /* Migration type */ }; #define BPF_TAG_SIZE 8 From patchwork Mon Dec 7 13:24:54 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412034 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=iXT7U4Ia; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPL62FLqz9sW1 for ; Tue, 8 Dec 2020 00:29:34 +1100 (AEDT) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726830AbgLGN3L (ORCPT ); Mon, 7 Dec 2020 08:29:11 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:36988 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725852AbgLGN3L (ORCPT ); Mon, 7 Dec 2020 08:29:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347749; x=1638883749; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=7XiD9aIMDu4mgLkRDGGRXRcyA0AIujxpHFHP5MzN3Ls=; b=iXT7U4Ia/WVxfMxGp4P83ymnCf06ELcH9Pj9A9OpyYel09aJqIzevZGH Gz9+2aGgmgukohbuj6nOdQ23h9YNi02gZoCDw/9nTrVL8iCnE2GZ2ZXc0 N7ZcnBXAraNsAEObgwaV8l3B5EbC+XtywaJzZXqV247vRBE4Higfs85/1 0=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="69561866" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-e34f1ddc.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 07 Dec 2020 13:28:29 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1a-e34f1ddc.us-east-1.amazon.com (Postfix) with ESMTPS id 51446A1E93; Mon, 7 Dec 2020 13:28:25 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:24 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:15 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 11/13] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT. 
Date: Mon, 7 Dec 2020 22:24:54 +0900 Message-ID: <20201207132456.65472-12-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing in order to select the new listener. Currently, we can get a unique ID for each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the sk pointer available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 8 ++++++++ net/core/filter.c | 22 ++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 8 ++++++++ 3 files changed, 38 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index cf518e83df5c..a688a7a4fe85 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1655,6 +1655,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. 
+ * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4463,6 +4470,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ __u8 migration; /* Migration type */ + __bpf_md_ptr(struct bpf_sock *, sk); /* Current listening socket */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 7bdf62f24044..9f7018e3f545 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4631,6 +4631,18 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_pointer_cookie, struct sock *, sk) +{ + return __sock_gen_cookie(sk); +} + +static const struct bpf_func_proto bpf_get_socket_pointer_cookie_proto = { + .func = bpf_get_socket_pointer_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_SOCKET, +}; + BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) { return __sock_gen_cookie(ctx->sk); @@ -9989,6 +10001,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id, return &sk_reuseport_load_bytes_proto; case BPF_FUNC_skb_load_bytes_relative: return &sk_reuseport_load_bytes_relative_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_pointer_cookie_proto; default: return bpf_base_func_proto(func_id); } @@ -10022,6 +10036,10 @@ sk_reuseport_is_valid_access(int off, int size, return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && size == sizeof(__u8); + case offsetof(struct sk_reuseport_md, sk): + info->reg_type = PTR_TO_SOCKET; + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10098,6 +10116,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, 
migration): SK_REUSEPORT_LOAD_FIELD(migration); break; + + case offsetof(struct sk_reuseport_md, sk): + SK_REUSEPORT_LOAD_FIELD(sk); + break; } return insn - insn_buf; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index cf518e83df5c..a688a7a4fe85 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1655,6 +1655,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. + * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4463,6 +4470,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ __u8 migration; /* Migration type */ + __bpf_md_ptr(struct bpf_sock *, sk); /* Current listening socket */ }; #define BPF_TAG_SIZE 8 From patchwork Mon Dec 7 13:24:55 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412035 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 header.b=gxDhuIFN; dkim-atps=neutral Received: from 
vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPL715Jxz9sWP for ; Tue, 8 Dec 2020 00:29:35 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726862AbgLGN3a (ORCPT ); Mon, 7 Dec 2020 08:29:30 -0500 Received: from smtp-fw-4101.amazon.com ([72.21.198.25]:9822 "EHLO smtp-fw-4101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725852AbgLGN3a (ORCPT ); Mon, 7 Dec 2020 08:29:30 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347768; x=1638883768; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=jIfxzUB2+17RMJxfEDisRHBPm7+z8TJ3VAqSHs38L00=; b=gxDhuIFNojiIvTngYDueMbPZD2ltUr2Q6bZapFbjMHlpFIayvQ+2SRHt eZNqucKhE6kjn/rIj5Duecym9bRfxG/yHkR3fOOh8H5NJBtXE5fDuW2fX 6Br5uxH1unA7FnyOtC4cuvqC9HM+sRQeYqJkrZgjfi0qHnjYGYO/E2xeT k=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="67966355" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-e34f1ddc.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-4101.iad4.amazon.com with ESMTP; 07 Dec 2020 13:28:47 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1a-e34f1ddc.us-east-1.amazon.com (Postfix) with ESMTPS id CB581A071C; Mon, 7 Dec 2020 13:28:44 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:43 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:39 +0000 From: Kuniyuki Iwashima To: "David S . 
Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 12/13] bpf: Call bpf_run_sk_reuseport() for socket migration. Date: Mon, 7 Dec 2020 22:24:55 +0900 Message-ID: <20201207132456.65472-13-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch supports socket migration by eBPF. If the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. There are two noteworthy points. The first is that we select a listening socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do not have a struct sk_buff when closing a listener or retransmitting a SYN+ACK. However, some helper functions do not expect skb to be NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. The second is that we do not have struct request_sock in the unhash path, and the sk_hash of the listener is always zero. So we pass zero as hash to bpf_run_sk_reuseport(). 
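The skb handling described above can be summarized as a small decision table. The following is a userspace sketch under stated assumptions, not the kernel implementation: the enum and model_reuseport_dispatch() are hypothetical names, the three booleans stand in for "migration != 0", "expected_attach_type is SELECT_OR_MIGRATE", and "skb != NULL", and the allocation-failure path (-ENOMEM) is left out.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the dispatch step; names are illustrative only. */
enum model_outcome {
	MODEL_RUN_PROG,          /* run the program with the caller's skb */
	MODEL_RUN_PROG_TEMP_SKB, /* allocate an empty skb, run, then free it */
	MODEL_FALLBACK_HASH,     /* return NULL so the caller selects by hash */
	MODEL_CANCEL,            /* return an error so migration is cancelled */
};

/*
 * Model of the checks performed before the eBPF program runs:
 * - migration with a program that cannot migrate is cancelled;
 * - migration without an skb gets a temporary empty skb;
 * - a normal selection without an skb falls back to hash selection.
 */
static enum model_outcome model_reuseport_dispatch(bool migration,
						   bool prog_can_migrate,
						   bool have_skb)
{
	if (migration) {
		if (!prog_can_migrate)
			return MODEL_CANCEL;
		return have_skb ? MODEL_RUN_PROG : MODEL_RUN_PROG_TEMP_SKB;
	}

	return have_skb ? MODEL_RUN_PROG : MODEL_FALLBACK_HASH;
}
```

In this model, closing a listener corresponds to migration being set with no skb, so a migrate-capable program runs against a temporary skb and any other program has its migration cancelled.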
Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/core/filter.c | 19 +++++++++++++++++++ net/core/sock_reuseport.c | 21 +++++++++++---------- net/ipv4/inet_hashtables.c | 2 +- 3 files changed, 31 insertions(+), 11 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 9f7018e3f545..53fa3bcbf00f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9890,10 +9890,29 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, { struct sk_reuseport_kern reuse_kern; enum sk_action action; + bool allocated = false; + + if (migration) { + /* cancel migration for possibly incapable eBPF program */ + if (prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE) + return ERR_PTR(-ENOTSUPP); + + if (!skb) { + allocated = true; + skb = alloc_skb(0, GFP_ATOMIC); + if (!skb) + return ERR_PTR(-ENOMEM); + } + } else if (!skb) { + return NULL; /* fall back to select by hash */ + } bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action = BPF_PROG_RUN(prog, &reuse_kern); + if (allocated) + kfree_skb(skb); + if (action == SK_PASS) return reuse_kern.selected_sk; else diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index b877c8e552d2..2358e8896199 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -221,8 +221,15 @@ struct sock *reuseport_detach_sock(struct sock *sk) lockdep_is_held(&reuseport_lock)); if (sk->sk_protocol == IPPROTO_TCP) { - if (reuse->num_socks && !prog) - nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + if (reuse->num_socks) { + if (prog) + nsk = bpf_run_sk_reuseport(reuse, sk, prog, NULL, 0, + BPF_SK_REUSEPORT_MIGRATE_QUEUE); + + if (!nsk) + nsk = i == reuse->num_socks ? 
+ reuse->socks[i - 1] : reuse->socks[i]; + } reuse->num_closed_socks++; } else { @@ -306,15 +313,9 @@ static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, if (!prog) goto select_by_hash; - if (migration) - goto out; - - if (!skb) - goto select_by_hash; - if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration); - else + else if (!skb) sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); select_by_hash: @@ -352,7 +353,7 @@ struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, struct sock *nsk; nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST); - if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt))) + if (!IS_ERR_OR_NULL(nsk) && likely(refcount_inc_not_zero(&nsk->sk_refcnt))) return nsk; return NULL; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 545538a6bfac..59f58740c20d 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -699,7 +699,7 @@ void inet_unhash(struct sock *sk) if (rcu_access_pointer(sk->sk_reuseport_cb)) { nsk = reuseport_detach_sock(sk); - if (nsk) + if (!IS_ERR_OR_NULL(nsk)) inet_csk_reqsk_queue_migrate(sk, nsk); } From patchwork Mon Dec 7 13:24:56 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 1412036 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.jp Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=amazon.co.jp header.i=@amazon.co.jp header.a=rsa-sha256 header.s=amazon201209 
header.b=FN5YI/LK; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CqPLm4VGtz9s1l for ; Tue, 8 Dec 2020 00:30:08 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726924AbgLGN3r (ORCPT ); Mon, 7 Dec 2020 08:29:47 -0500 Received: from smtp-fw-4101.amazon.com ([72.21.198.25]:9822 "EHLO smtp-fw-4101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726089AbgLGN3q (ORCPT ); Mon, 7 Dec 2020 08:29:46 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1607347785; x=1638883785; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=1uyj+ON56WptrF13KwZJtXh7i/q+PlixzaIrrXNzRjs=; b=FN5YI/LKZ4tWcixIiLFn8TeySO6PezzSzri9R4ohvvByEY+GsbVKxUAD vzOq2wBSp40yVUWrA6hY5XTH9+JPdjfRbYAD9FYQRpWpYNkEk+/EAwvSo MueOZG6KnxMBRnM2E+iLlENGXUvoO6xHWZ2XnC397CipdDwPYUtNIq09T 8=; X-IronPort-AV: E=Sophos;i="5.78,399,1599523200"; d="scan'208";a="67966404" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1d-474bcd9f.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-4101.iad4.amazon.com with ESMTP; 07 Dec 2020 13:29:06 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1d-474bcd9f.us-east-1.amazon.com (Postfix) with ESMTPS id 127B4A1BFC; Mon, 7 Dec 2020 13:29:03 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:29:02 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.161.43) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Mon, 7 Dec 2020 13:28:58 +0000 From: Kuniyuki Iwashima To: "David S . 
Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v2 bpf-next 13/13] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Date: Mon, 7 Dec 2020 22:24:56 +0900 Message-ID: <20201207132456.65472-14-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201207132456.65472-1-kuniyu@amazon.co.jp> References: <20201207132456.65472-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.43] X-ClientProxiedBy: EX13D37UWC002.ant.amazon.com (10.43.162.123) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- .../bpf/prog_tests/select_reuseport_migrate.c | 173 ++++++++++++++++++ .../bpf/progs/test_select_reuseport_migrate.c | 53 ++++++ 2 files changed, 226 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c create mode 100644 tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c new file mode 100644 index 000000000000..814b1e3a4c56 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c @@ -0,0 +1,173 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. call listen() for 5 server sockets. + * 2. update a map to migrate all child socket + * to the last server socket (migrate_map[cookie] = 4) + * 3. call connect() for 25 client sockets. + * 4. call close() for first 4 server sockets. + * 5. call accept() for the last server socket. 
+ * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#include "test_progs.h" +#include "test_select_reuseport_migrate.skel.h" + +#define ADDRESS "127.0.0.1" +#define PORT 80 +#define NUM_SERVERS 5 +#define NUM_CLIENTS (NUM_SERVERS * 5) + + +static int test_listen(struct test_select_reuseport_migrate *skel, int server_fds[]) +{ + int i, err, optval = 1, migrated_to = NUM_SERVERS - 1; + int prog_fd, reuseport_map_fd, migrate_map_fd; + struct sockaddr_in addr; + socklen_t addr_len; + __u64 value; + + prog_fd = bpf_program__fd(skel->progs.prog_select_reuseport_migrate); + reuseport_map_fd = bpf_map__fd(skel->maps.reuseport_map); + migrate_map_fd = bpf_map__fd(skel->maps.migrate_map); + + addr_len = sizeof(addr); + addr.sin_family = AF_INET; + addr.sin_port = htons(PORT); + inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SERVERS; i++) { + server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (CHECK_FAIL(server_fds[i] == -1)) + return -1; + + err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT, + &optval, sizeof(optval)); + if (CHECK_FAIL(err == -1)) + return -1; + + if (i == 0) { + err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, + &prog_fd, sizeof(prog_fd)); + if (CHECK_FAIL(err == -1)) + return -1; + } + + err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len); + if (CHECK_FAIL(err == -1)) + return -1; + + err = listen(server_fds[i], 32); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST); + if (CHECK_FAIL(err == -1)) + return -1; + } + + return 0; +} + +static int test_connect(int client_fds[]) +{ + struct sockaddr_in addr; + socklen_t addr_len; + int i, err; + + addr_len = sizeof(addr); + 
addr.sin_family = AF_INET; + addr.sin_port = htons(PORT); + inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_CLIENTS; i++) { + client_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (CHECK_FAIL(client_fds[i] == -1)) + return -1; + + err = connect(client_fds[i], (struct sockaddr *)&addr, addr_len); + if (CHECK_FAIL(err == -1)) + return -1; + } + + return 0; +} + +static void test_close(int server_fds[], int num) +{ + int i; + + for (i = 0; i < num; i++) + if (server_fds[i] > 0) + close(server_fds[i]); +} + +static int test_accept(int server_fd) +{ + struct sockaddr_in addr; + socklen_t addr_len; + int cnt, client_fd; + + fcntl(server_fd, F_SETFL, O_NONBLOCK); + addr_len = sizeof(addr); + + for (cnt = 0; cnt < NUM_CLIENTS; cnt++) { + client_fd = accept(server_fd, (struct sockaddr *)&addr, &addr_len); + if (CHECK_FAIL(client_fd == -1)) + return -1; + } + + return cnt; +} + + +void test_select_reuseport_migrate(void) +{ + struct test_select_reuseport_migrate *skel; + int server_fds[NUM_SERVERS] = {0}; + int client_fds[NUM_CLIENTS] = {0}; + __u32 duration = 0; + int err; + + skel = test_select_reuseport_migrate__open_and_load(); + if (CHECK_FAIL(!skel)) + goto destroy; + + err = test_listen(skel, server_fds); + if (err) + goto close_server; + + err = test_connect(client_fds); + if (err) + goto close_client; + + test_close(server_fds, NUM_SERVERS - 1); + + err = test_accept(server_fds[NUM_SERVERS - 1]); + CHECK(err != NUM_CLIENTS, + "accept", + "expected (%d) != actual (%d)\n", + NUM_CLIENTS, err); + +close_client: + test_close(client_fds, NUM_CLIENTS); + +close_server: + test_close(server_fds, NUM_SERVERS); + +destroy: + test_select_reuseport_migrate__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c b/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c new file mode 100644 index 000000000000..f1ac07bb2c03 --- /dev/null +++ 
b/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. If reuse_md->migration is 0 (SYN packet), + * return SK_PASS without selecting a listener. + * 2. If reuse_md->migration is not 0 (socket migration), + * select a listener (reuseport_map[migrate_map[cookie]]) + * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#define NULL ((void *)0) + +struct bpf_map_def SEC("maps") reuseport_map = { + .type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, + .key_size = sizeof(int), + .value_size = sizeof(__u64), + .max_entries = 256, +}; + +struct bpf_map_def SEC("maps") migrate_map = { + .type = BPF_MAP_TYPE_HASH, + .key_size = sizeof(__u64), + .value_size = sizeof(int), + .max_entries = 256, +}; + +SEC("sk_reuseport/migrate") +int prog_select_reuseport_migrate(struct sk_reuseport_md *reuse_md) +{ + int *key, flags = 0; + __u64 cookie; + + if (!reuse_md->migration) + return SK_PASS; + + cookie = bpf_get_socket_cookie(reuse_md->sk); + + key = bpf_map_lookup_elem(&migrate_map, &cookie); + if (key == NULL) + return SK_DROP; + + bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags); + + return SK_PASS; +} + +int _version SEC("version") = 1; +char _license[] SEC("license") = "GPL";