From patchwork Wed Aug 8 08:01:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 954809 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="oGRA1loI"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 41lkNx0qMYz9ryt for ; Wed, 8 Aug 2018 18:01:37 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727202AbeHHKUF (ORCPT ); Wed, 8 Aug 2018 06:20:05 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:37110 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727025AbeHHKUE (ORCPT ); Wed, 8 Aug 2018 06:20:04 -0400 Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w7880CBH026087 for ; Wed, 8 Aug 2018 01:01:33 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=8ITqoHj5SxaQcGHLAYDJdlyJZ2yae1CrPa9p0oH0kLU=; b=oGRA1loIWOyTx5DMtGp215M49K0eqsjDqjs0tynQkD0geGtALXicYIAINgs1v9R1QsDO AbTMJHhv3lCGiWEgJvUE+yRuuzNA9qBjrlOu9LjQ1lFvLynOvWhsl2pqU8F9eGx8ioZf Lbs8azlYuN6Nt7Z5faJwdii6EHYAYNNTgfc= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2kqsbxrc70-3 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 08 Aug 2018 01:01:32 -0700 Received: from mx-out.facebook.com (192.168.52.123) by mail.thefacebook.com (192.168.16.22) with Microsoft SMTP Server (TLS) id 14.3.319.2; Wed, 8 Aug 2018 01:01:31 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id B9DA82941E27; Wed, 8 Aug 2018 01:01:21 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , , Eric Dumazet Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH bpf-next 1/9] tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket Date: Wed, 8 Aug 2018 01:01:21 -0700 Message-ID: <20180808080121.3013747-1-kafai@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20180808075917.3009181-1-kafai@fb.com> References: <20180808075917.3009181-1-kafai@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-08-08_02:, , signatures=0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Although the actual cookie check "__cookie_v[46]_check()" does not involve sk specific info, it checks whether the sk has recent synq overflow event in "tcp_synq_no_recent_overflow()". The tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second when it has sent out a syncookie (through "tcp_synq_overflow()"). The above per sk "recent synq overflow event timestamp" works well for non SO_REUSEPORT use case. However, it may cause random connection request reject/discard when SO_REUSEPORT is used with syncookie because it fails the "tcp_synq_no_recent_overflow()" test. When SO_REUSEPORT is used, it usually has multiple listening socks serving TCP connection requests destinated to the same local IP:PORT. There are cases that the TCP-ACK-COOKIE may not be received by the same sk that sent out the syncookie. For example, if reuse->socks[] began with {sk0, sk1}, 1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp was updated. 2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed and then sk2 was added. Here, sk2 does not have ts_recent_stamp set. There are other ordering that will trigger the similar situation below but the idea is the same. 3) When the TCP-ACK-COOKIE comes back, sk2 was selected. "tcp_synq_no_recent_overflow(sk2)" returns true. In this case, all syncookies sent by sk1 will be handled (and rejected) by sk2 while sk1 is still alive. The userspace may create and remove listening SO_REUSEPORT sockets as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle incoming requests, old process stopping and new process starting...etc. With or without SO_ATTACH_REUSEPORT_[CB]BPF, the sockets leaving and joining a reuseport group makes picking the same sk to check the syncookie very difficult (if not impossible). The later patches will allow bpf prog more flexibility in deciding where a sk should be located in a bpf map and selecting a particular SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock, replace the whole bpf reuseport_array in one map_update() by using map-in-map. Getting the syncookie check working smoothly across socks in the same "reuse->socks[]" is important. A partial solution is to set the newly added sk's ts_recent_stamp to the max ts_recent_stamp of a reuseport group but that will require to iterate through reuse->socks[] OR pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is joining a reuseport group. However, neither of them will solve the existing sk getting moved around the reuse->socks[] and that sk may not have ts_recent_stamp updated, unlikely under continuous synflood but not impossible. This patch opts to treat the reuseport group as a whole when considering the last synq overflow timestamp since they are serving the same IP:PORT from the userspace (and BPF program) perspective. "synq_overflow_ts" is added to "struct sock_reuseport". The tcp_synq_overflow() and tcp_synq_no_recent_overflow() will update/check reuse->synq_overflow_ts if the sk is in a reuseport group. Similar to the reuseport decision in __inet_lookup_listener(), both sk->sk_reuseport and sk->sk_reuseport_cb are tested for SO_REUSEPORT usage. Update on "synq_overflow_ts" happens at roughly once every second. A synflood test was done with a 16 rx-queues and 16 reuseport sockets. No meaningful performance change is observed. Before and after the change is ~9Mpps in IPv4. Cc: Eric Dumazet Signed-off-by: Martin KaFai Lau Acked-by: Alexei Starovoitov --- include/net/sock_reuseport.h | 4 ++++ include/net/tcp.h | 30 ++++++++++++++++++++++++++++-- net/core/sock_reuseport.c | 1 + 3 files changed, 33 insertions(+), 2 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 0054b3a9b923..6bef7a0052f2 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -12,6 +12,10 @@ struct sock_reuseport { u16 max_socks; /* length of socks */ u16 num_socks; /* elements in socks */ + /* The last synq overflow event timestamp of this + * reuse->socks[] group. + */ + unsigned int synq_overflow_ts; struct bpf_prog __rcu *prog; /* optional BPF sock selector */ struct sock *socks[0]; /* array of sock pointers */ }; diff --git a/include/net/tcp.h b/include/net/tcp.h index f6e0a9b1dff3..b0587318c64d 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -472,9 +473,22 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb); */ static inline void tcp_synq_overflow(const struct sock *sk) { - unsigned int last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp; + unsigned int last_overflow; unsigned int now = jiffies; + if (sk->sk_reuseport) { + struct sock_reuseport *reuse; + + reuse = rcu_dereference(sk->sk_reuseport_cb); + if (likely(reuse)) { + last_overflow = READ_ONCE(reuse->synq_overflow_ts); + if (time_after32(now, last_overflow + HZ)) + WRITE_ONCE(reuse->synq_overflow_ts, now); + return; + } + } + + last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp; if (time_after32(now, last_overflow + HZ)) tcp_sk(sk)->rx_opt.ts_recent_stamp = now; } @@ -482,9 +496,21 @@ static inline void tcp_synq_overflow(const struct sock *sk) /* syncookies: no recent synqueue overflow on this listening socket? */ static inline bool tcp_synq_no_recent_overflow(const struct sock *sk) { - unsigned int last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp; + unsigned int last_overflow; unsigned int now = jiffies; + if (sk->sk_reuseport) { + struct sock_reuseport *reuse; + + reuse = rcu_dereference(sk->sk_reuseport_cb); + if (likely(reuse)) { + last_overflow = READ_ONCE(reuse->synq_overflow_ts); + return time_after32(now, last_overflow + + TCP_SYNCOOKIE_VALID); + } + } + + last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp; return time_after32(now, last_overflow + TCP_SYNCOOKIE_VALID); } diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 064acb04be0f..3f188fad0162 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -81,6 +81,7 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); for (i = 0; i < reuse->num_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,