From patchwork Mon Aug 3 23:10:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340587 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=K+tzDq0B; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDBV00l1z9sSG for ; Tue, 4 Aug 2020 09:10:25 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729004AbgHCXKZ (ORCPT ); Mon, 3 Aug 2020 19:10:25 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:19330 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728940AbgHCXKZ (ORCPT ); Mon, 3 Aug 2020 19:10:25 -0400 Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MrawW014782 for ; Mon, 3 Aug 2020 16:10:24 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=hHxS2Toao5f/OikYvMZpCy8qVQSg+2tYIfC73UklCUI=; b=K+tzDq0BVNSL4x03rBnmJE67MitbR3LgZg59QJ0WSGiXG+qv72lhxMvwbwPMyU9UxZe2 G6vJHUC9t9l8erzCj3OoLqViHvHuQXZ1HmKKuiTaE6fbaJQ8NZdIhUy1jbIXfOaVGojA RW/Iq1AE/zp8e4TREWdjcivbAkvZshq/ZdU= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32n81fsnpk-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:24 -0700 Received: from intmgw004.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:23 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 9E6FB2943872; Mon, 3 Aug 2020 16:10:19 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 01/12] tcp: Use a struct to represent a saved_syn Date: Mon, 3 Aug 2020 16:10:19 -0700 Message-ID: <20200803231019.2681772-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 priorityscore=1501 spamscore=0 impostorscore=0 mlxscore=0 phishscore=0 suspectscore=38 lowpriorityscore=0 bulkscore=0 adultscore=0 mlxlogscore=732 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org The TCP_SAVE_SYN has both the network header and tcp header. The total length of the saved syn packet is currently stored in the first 4 bytes (u32) of an array and the actual packet data is stored after that. A later patch will add a bpf helper that allows to get the tcp header alone from the saved syn without the network header. It will be more convenient to have a direct offset to a specific header instead of re-parsing it. This requires to separately store the network hdrlen. The total header length (i.e. network + tcp) is still needed for the current usage in getsockopt. Although this total length can be obtained by looking into the tcphdr and then get the (th->doff << 2), this patch chooses to directly store the tcp hdrlen in the second four bytes of this newly created "struct saved_syn". By using a new struct, it can give a readable name to each individual header length. Signed-off-by: Martin KaFai Lau Reviewed-by: Eric Dumazet Acked-by: John Fastabend --- include/linux/tcp.h | 7 ++++++- include/net/request_sock.h | 8 +++++++- net/core/filter.c | 4 ++-- net/ipv4/tcp.c | 9 +++++---- net/ipv4/tcp_input.c | 16 +++++++++------- 5 files changed, 29 insertions(+), 15 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 527d668a5275..5528912dc468 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -406,7 +406,7 @@ struct tcp_sock { * socket. Used to retransmit SYNACKs etc. */ struct request_sock __rcu *fastopen_rsk; - u32 *saved_syn; + struct saved_syn *saved_syn; }; enum tsq_enum { @@ -484,6 +484,11 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp) tp->saved_syn = NULL; } +static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn) +{ + return saved_syn->network_hdrlen + saved_syn->tcp_hdrlen; +} + struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk); static inline u16 tcp_mss_clamp(const struct tcp_sock *tp, u16 mss) diff --git a/include/net/request_sock.h b/include/net/request_sock.h index cf8b33213bbc..b1b101814ecb 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -41,6 +41,12 @@ struct request_sock_ops { int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req); +struct saved_syn { + u32 network_hdrlen; + u32 tcp_hdrlen; + u8 data[]; +}; + /* struct request_sock - mini sock to represent a connection request */ struct request_sock { @@ -60,7 +66,7 @@ struct request_sock { struct timer_list rsk_timer; const struct request_sock_ops *rsk_ops; struct sock *sk; - u32 *saved_syn; + struct saved_syn *saved_syn; u32 secid; u32 peer_secid; }; diff --git a/net/core/filter.c b/net/core/filter.c index 7124f0fe6974..250b5552a148 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4550,9 +4550,9 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname, tp = tcp_sk(sk); if (optlen <= 0 || !tp->saved_syn || - optlen > tp->saved_syn[0]) + optlen > tcp_saved_syn_len(tp->saved_syn)) goto err_clear; - memcpy(optval, tp->saved_syn + 1, optlen); + memcpy(optval, tp->saved_syn->data, optlen); break; default: goto err_clear; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 27de9380ed14..8a774b5094e9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3791,20 +3791,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level, lock_sock(sk); if (tp->saved_syn) { - if (len < tp->saved_syn[0]) { - if (put_user(tp->saved_syn[0], optlen)) { + if (len < tcp_saved_syn_len(tp->saved_syn)) { + if (put_user(tcp_saved_syn_len(tp->saved_syn), + optlen)) { release_sock(sk); return -EFAULT; } release_sock(sk); return -EINVAL; } - len = tp->saved_syn[0]; + len = tcp_saved_syn_len(tp->saved_syn); if (put_user(len, optlen)) { release_sock(sk); return -EFAULT; } - if (copy_to_user(optval, tp->saved_syn + 1, len)) { + if (copy_to_user(optval, tp->saved_syn->data, len)) { release_sock(sk); return -EFAULT; } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a018bafd7bdf..05d0818c3e31 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6598,13 +6598,15 @@ static void tcp_reqsk_record_syn(const struct sock *sk, { if (tcp_sk(sk)->save_syn) { u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb); - u32 *copy; - - copy = kmalloc(len + sizeof(u32), GFP_ATOMIC); - if (copy) { - copy[0] = len; - memcpy(©[1], skb_network_header(skb), len); - req->saved_syn = copy; + struct saved_syn *saved_syn; + + saved_syn = kmalloc(struct_size(saved_syn, data, len), + GFP_ATOMIC); + if (saved_syn) { + saved_syn->network_hdrlen = skb_network_header_len(skb); + saved_syn->tcp_hdrlen = tcp_hdrlen(skb); + memcpy(saved_syn->data, skb_network_header(skb), len); + req->saved_syn = saved_syn; } } } From patchwork Mon Aug 3 23:10:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340589 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=E944Lyqx; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDBh3rxPz9sR4 for ; Tue, 4 Aug 2020 09:10:36 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729065AbgHCXKg (ORCPT ); Mon, 3 Aug 2020 19:10:36 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:62078 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727091AbgHCXKf (ORCPT ); Mon, 3 Aug 2020 19:10:35 -0400 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073Mt8Me001094 for ; Mon, 3 Aug 2020 16:10:34 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=ENnFZ5a2qYkZgQfuQqJO3qL+d0MIG6J7ZgK++HEmV+k=; b=E944LyqxK3KixA3ARacVnay3OjkOuakbsOi7qhQqXlvWp0I5OPB1B2CLSFcu7isUxF7C 5etTMI8N5OXhFXPRuyhAiELpUJngeoaFzE0pJJv9n9zOdqsjgYViHRBvLjLsOlb6LPfv zUxZ9kByilOw8oPuTe6ltvQQS8TOs6oIV9c= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32n80t9gdn-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:34 -0700 Received: from intmgw001.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:33 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id E13D02943872; Mon, 3 Aug 2020 16:10:26 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 02/12] tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt Date: Mon, 3 Aug 2020 16:10:26 -0700 Message-ID: <20200803231026.2682120-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 suspectscore=13 bulkscore=0 priorityscore=1501 spamscore=0 clxscore=1015 mlxscore=0 phishscore=0 impostorscore=0 lowpriorityscore=0 mlxlogscore=999 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org This change is mostly from an internal patch and adapts it from sysctl config to the bpf_setsockopt setup. The bpf_prog can set the max delay ack by using bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated to its peer through bpf header option. The receiving peer can then use this max delay ack and set a potentially lower rto by using bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced in the next patch. Another later selftest patch will also use it like the above to show how to write and parse bpf tcp header option. Reviewed-by: Eric Dumazet Signed-off-by: Martin KaFai Lau Acked-by: John Fastabend --- include/net/inet_connection_sock.h | 1 + include/uapi/linux/bpf.h | 1 + net/core/filter.c | 8 ++++++++ net/ipv4/tcp.c | 2 ++ net/ipv4/tcp_output.c | 2 ++ tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 15 insertions(+) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 1e209ce7d1bd..8b6d89ac91cc 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -86,6 +86,7 @@ struct inet_connection_sock { struct timer_list icsk_retransmit_timer; struct timer_list icsk_delack_timer; __u32 icsk_rto; + __u32 icsk_delack_max; __u32 icsk_pmtu_cookie; const struct tcp_congestion_ops *icsk_ca_ops; const struct inet_connection_sock_af_ops *icsk_af_ops; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index eb5e0c38eb2c..993060d9ecf2 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4244,6 +4244,7 @@ enum { enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ + TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ }; struct bpf_perf_event_value { diff --git a/net/core/filter.c b/net/core/filter.c index 250b5552a148..969c6b6b98d0 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4459,6 +4459,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, } else { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); + unsigned long timeout; if (optlen != sizeof(int)) return -EINVAL; @@ -4480,6 +4481,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, tp->snd_ssthresh = val; } break; + case TCP_BPF_DELACK_MAX: + timeout = usecs_to_jiffies(val); + if (timeout > TCP_DELACK_MAX || + timeout < TCP_TIMEOUT_MIN) + return -EINVAL; + inet_csk(sk)->icsk_delack_max = timeout; + break; case TCP_SAVE_SYN: if (val < 0 || val > 1) ret = -EINVAL; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8a774b5094e9..10db079cb9e3 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -418,6 +418,7 @@ void tcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&tp->tsorted_sent_queue); icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_delack_max = TCP_DELACK_MAX; tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT); minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U); @@ -2685,6 +2686,7 @@ int tcp_disconnect(struct sock *sk, int flags) icsk->icsk_backoff = 0; icsk->icsk_probes_out = 0; icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_delack_max = TCP_DELACK_MAX; tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; tp->snd_cwnd = TCP_INIT_CWND; tp->snd_cwnd_cnt = 0; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d8f16f6a9b02..984114087263 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3741,6 +3741,8 @@ void tcp_send_delayed_ack(struct sock *sk) ato = min(ato, max_ato); } + ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); + /* Stay within the limit we were given */ timeout = jiffies + ato; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index eb5e0c38eb2c..993060d9ecf2 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4244,6 +4244,7 @@ enum { enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ + TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ }; struct bpf_perf_event_value { From patchwork Mon Aug 3 23:10:33 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340591 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=EZB8ycHI; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDBq6SKBz9sR4 for ; Tue, 4 Aug 2020 09:10:43 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729096AbgHCXKl (ORCPT ); Mon, 3 Aug 2020 19:10:41 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:44184 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727091AbgHCXKl (ORCPT ); Mon, 3 Aug 2020 19:10:41 -0400 Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073Ms1Ze008659 for ; Mon, 3 Aug 2020 16:10:40 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=xWjUbfLNQflMBZgYI+GetO4DkixOzBlMvj3KasvFUKc=; b=EZB8ycHIkk6S2EBDQnSAZ4NJp9WVVIhb6UeZ37b0iK7FHSUzCW/WwLoyl865cLxZxzV6 hQPn0emuNXasD5WpsA/lmaLvrbJZ1h7pWns2jXHVafQTPJjSC/0pwxZQMenl7ru3B9Qp 4lU36A9xf0VlOrHhqcy5kosWp4AWREpIcvM= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32nrc9757k-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:40 -0700 Received: from intmgw003.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::c) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:39 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 2B90E2943872; Mon, 3 Aug 2020 16:10:33 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 03/12] tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt Date: Mon, 3 Aug 2020 16:10:33 -0700 Message-ID: <20200803231033.2682401-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235,18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 phishscore=0 adultscore=0 mlxlogscore=999 spamscore=0 priorityscore=1501 suspectscore=13 clxscore=1015 impostorscore=0 bulkscore=0 mlxscore=0 lowpriorityscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog to set the min rto of a connection. It could be used together with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX). A later selftest patch will communicate the max delay ack in a bpf tcp header option and then the receiving side can use bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto. Reviewed-by: Eric Dumazet Signed-off-by: Martin KaFai Lau Acked-by: John Fastabend --- include/net/inet_connection_sock.h | 1 + include/net/tcp.h | 2 +- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 7 +++++++ net/ipv4/tcp.c | 2 ++ tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 13 insertions(+), 1 deletion(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 8b6d89ac91cc..d54aa3cc71df 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -86,6 +86,7 @@ struct inet_connection_sock { struct timer_list icsk_retransmit_timer; struct timer_list icsk_delack_timer; __u32 icsk_rto; + __u32 icsk_rto_min; __u32 icsk_delack_max; __u32 icsk_pmtu_cookie; const struct tcp_congestion_ops *icsk_ca_ops; diff --git a/include/net/tcp.h b/include/net/tcp.h index e0c35d56091f..895e7aabf136 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -697,7 +697,7 @@ static inline void tcp_fast_path_check(struct sock *sk) static inline u32 tcp_rto_min(struct sock *sk) { const struct dst_entry *dst = __sk_dst_get(sk); - u32 rto_min = TCP_RTO_MIN; + u32 rto_min = inet_csk(sk)->icsk_rto_min; if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 993060d9ecf2..d77b7df71784 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4245,6 +4245,7 @@ enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ + TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ }; struct bpf_perf_event_value { diff --git a/net/core/filter.c b/net/core/filter.c index 969c6b6b98d0..0a1bf520c55d 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4488,6 +4488,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, return -EINVAL; inet_csk(sk)->icsk_delack_max = timeout; break; + case TCP_BPF_RTO_MIN: + timeout = usecs_to_jiffies(val); + if (timeout > TCP_RTO_MIN || + timeout < TCP_TIMEOUT_MIN) + return -EINVAL; + inet_csk(sk)->icsk_rto_min = timeout; + break; case TCP_SAVE_SYN: if (val < 0 || val > 1) ret = -EINVAL; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 10db079cb9e3..98606f94b01b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -418,6 +418,7 @@ void tcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&tp->tsorted_sent_queue); icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_rto_min = TCP_RTO_MIN; icsk->icsk_delack_max = TCP_DELACK_MAX; tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT); minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U); @@ -2686,6 +2687,7 @@ int tcp_disconnect(struct sock *sk, int flags) icsk->icsk_backoff = 0; icsk->icsk_probes_out = 0; icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_rto_min = TCP_RTO_MIN; icsk->icsk_delack_max = TCP_DELACK_MAX; tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; tp->snd_cwnd = TCP_INIT_CWND; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 993060d9ecf2..d77b7df71784 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4245,6 +4245,7 @@ enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ + TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ }; struct bpf_perf_event_value { From patchwork Mon Aug 3 23:10:39 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340593 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=e1yMYej7; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDBy0pggz9sR4 for ; Tue, 4 Aug 2020 09:10:50 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729139AbgHCXKt (ORCPT ); Mon, 3 Aug 2020 19:10:49 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:29970 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728662AbgHCXKt (ORCPT ); Mon, 3 Aug 2020 19:10:49 -0400 Received: from pps.filterd (m0109333.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073Mov6O009358 for ; Mon, 3 Aug 2020 16:10:48 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=bvt3I0oCfzjJXyCrbToSJXIJ9nR6dE9Azh74le8JbrQ=; b=e1yMYej7/wUuv/2E+iAPCaibW4UKmiMVT4KYkULTxPuc56LHNrBQ9RiOVzq0DD/l4CnW wCHDUMugwRnQGlXYB5DLAuEhPObNaSISP8Ju92+DNVgeWAuG0+WFjyxIOxYA3rbS420x gQeqyghyGXV8S9o1xuV2RxF5z3m8tR+54l4= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32nr3by49n-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:48 -0700 Received: from intmgw002.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:47 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 70F352943872; Mon, 3 Aug 2020 16:10:39 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 04/12] tcp: Add saw_unknown to struct tcp_options_received Date: Mon, 3 Aug 2020 16:10:39 -0700 Message-ID: <20200803231039.2682896-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 priorityscore=1501 lowpriorityscore=0 malwarescore=0 impostorscore=0 mlxscore=0 suspectscore=13 phishscore=0 bulkscore=0 mlxlogscore=999 clxscore=1015 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org In a later patch, the bpf prog only wants to be called to handle a header option if that particular header option cannot be handled by the kernel. This unknown option could be written by the peer's bpf-prog. It could also be a new standard option that the running kernel does not support it while a bpf-prog can handle it. This patch adds a "saw_unknown" bit to "struct tcp_options_received" and it uses an existing one byte hole to do that. "saw_unknown" will be set in tcp_parse_options() if it sees an option that the kernel cannot handle. Signed-off-by: Martin KaFai Lau Reviewed-by: Eric Dumazet Acked-by: John Fastabend --- include/linux/tcp.h | 2 ++ net/ipv4/tcp_input.c | 22 ++++++++++++++++------ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 5528912dc468..981145691f60 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -92,6 +92,8 @@ struct tcp_options_received { smc_ok : 1, /* SMC seen on SYN packet */ snd_wscale : 4, /* Window scaling received from sender */ rcv_wscale : 4; /* Window scaling to send to receiver */ + u8 saw_unknown:1, /* Received unknown option */ + unused:7; u8 num_sacks; /* Number of SACK blocks */ u16 user_mss; /* mss requested by user in ioctl */ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 05d0818c3e31..c1d2fb507701 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3799,7 +3799,7 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie, foc->exp = exp_opt; } -static void smc_parse_options(const struct tcphdr *th, +static bool smc_parse_options(const struct tcphdr *th, struct tcp_options_received *opt_rx, const unsigned char *ptr, int opsize) @@ -3808,10 +3808,13 @@ static void smc_parse_options(const struct tcphdr *th, if (static_branch_unlikely(&tcp_have_smc)) { if (th->syn && !(opsize & 1) && opsize >= TCPOLEN_EXP_SMC_BASE && - get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) + get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) { opt_rx->smc_ok = 1; + return true; + } } #endif + return false; } /* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped @@ -3872,6 +3875,7 @@ void tcp_parse_options(const struct net *net, ptr = (const unsigned char *)(th + 1); opt_rx->saw_tstamp = 0; + opt_rx->saw_unknown = 0; while (length > 0) { int opcode = *ptr++; @@ -3962,15 +3966,21 @@ void tcp_parse_options(const struct net *net, */ if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE && get_unaligned_be16(ptr) == - TCPOPT_FASTOPEN_MAGIC) + TCPOPT_FASTOPEN_MAGIC) { tcp_parse_fastopen_option(opsize - TCPOLEN_EXP_FASTOPEN_BASE, ptr + 2, th->syn, foc, true); - else - smc_parse_options(th, opt_rx, ptr, - opsize); + break; + } + + if (smc_parse_options(th, opt_rx, ptr, opsize)) + break; + + opt_rx->saw_unknown = 1; break; + default: + opt_rx->saw_unknown = 1; } ptr += opsize-2; length -= opsize; From patchwork Mon Aug 3 23:10:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340594 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=X0vk0psI; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDBz23zxz9sSG for ; Tue, 4 Aug 2020 09:10:51 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729173AbgHCXKu (ORCPT ); Mon, 3 Aug 2020 19:10:50 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:5502 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729124AbgHCXKu (ORCPT ); Mon, 3 Aug 2020 19:10:50 -0400 Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MrawZ014782 for ; Mon, 3 Aug 2020 16:10:49 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=qq70gJyeGBzaA0k1feKqTdNm9xRHpPrW4JAFYqw/jlQ=; b=X0vk0psIhdBu/JkQOawh8IXRamjRza82nLh1n0FhMNiKVvfX5DIekJQ7OtaR42Q/yk3W t3hlSlW9yy5XLfTsfoYxN/AzJlceR3mCyrTfY101iSH7vy4EVM3WeOZjpiSkduogCrLN Q51g1RQlEk1xgnrRbAR2uWLuUAkZkZvrjwk= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32n81fsnr6-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:48 -0700 Received: from intmgw002.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::e) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:48 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id AD9C72943872; Mon, 3 Aug 2020 16:10:45 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 05/12] bpf: tcp: Add bpf_skops_established() Date: Mon, 3 Aug 2020 16:10:45 -0700 Message-ID: <20200803231045.2683198-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 priorityscore=1501 spamscore=0 impostorscore=0 mlxscore=0 phishscore=0 suspectscore=15 lowpriorityscore=0 bulkscore=0 adultscore=0 mlxlogscore=999 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org In tcp_init_transfer(), it currently calls the bpf prog to give it a chance to handle the just "ESTABLISHED" event (e.g. do setsockopt on the newly established sk). Right now, it is done by calling the general purpose tcp_call_bpf(). In the later patch, it also needs to pass the just-received skb which concludes the 3 way handshake. E.g. the SYNACK received at the active side. The bpf prog can then learn some specific header options written by the peer's bpf-prog and potentially do setsockopt on the newly established sk. Thus, instead of reusing the general purpose tcp_call_bpf(), a new function bpf_skops_established() is added to allow passing the "skb" to the bpf prog. The actual skb passing from bpf_skops_established() to the bpf prog will happen together in a later patch which has the necessary bpf pieces. A "skb" arg is also added to tcp_init_transfer() such that it can then be passed to bpf_skops_established(). Calling the new bpf_skops_established() instead of tcp_call_bpf() should be a noop in this patch. Signed-off-by: Martin KaFai Lau Acked-by: John Fastabend --- include/net/tcp.h | 2 +- net/ipv4/tcp_fastopen.c | 2 +- net/ipv4/tcp_input.c | 32 ++++++++++++++++++++++++++++---- 3 files changed, 30 insertions(+), 6 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 895e7aabf136..ae18c2b222a3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -394,7 +394,7 @@ void tcp_metrics_init(void); bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst); void tcp_close(struct sock *sk, long timeout); void tcp_init_sock(struct sock *sk); -void tcp_init_transfer(struct sock *sk, int bpf_op); +void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb); __poll_t tcp_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); int tcp_getsockopt(struct sock *sk, int level, int optname, diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c index 19ad9586c720..d53190302087 100644 --- a/net/ipv4/tcp_fastopen.c +++ b/net/ipv4/tcp_fastopen.c @@ -272,7 +272,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk, refcount_set(&req->rsk_refcnt, 2); /* Now finish processing the fastopen child socket. */ - tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); + tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, skb); tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c1d2fb507701..9a8fb41676bc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -138,6 +138,29 @@ void clean_acked_data_flush(void) EXPORT_SYMBOL_GPL(clean_acked_data_flush); #endif +#ifdef CONFIG_CGROUP_BPF +static void bpf_skops_established(struct sock *sk, int bpf_op, + struct sk_buff *skb) +{ + struct bpf_sock_ops_kern sock_ops; + + sock_owned_by_me(sk); + + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + sock_ops.op = bpf_op; + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + /* skb will be passed to the bpf prog in a later patch. */ + + BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); +} +#else +static void bpf_skops_established(struct sock *sk, int bpf_op, + struct sk_buff *skb) +{ +} +#endif + static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb, unsigned int len) { @@ -5806,7 +5829,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) } EXPORT_SYMBOL(tcp_rcv_established); -void tcp_init_transfer(struct sock *sk, int bpf_op) +void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); @@ -5827,7 +5850,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op) tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk)); tp->snd_cwnd_stamp = tcp_jiffies32; - tcp_call_bpf(sk, bpf_op, 0, NULL); + bpf_skops_established(sk, bpf_op, skb); tcp_init_congestion_control(sk); tcp_init_buffer_space(sk); } @@ -5846,7 +5869,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) sk_mark_napi_id(sk, skb); } - tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); + tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, skb); /* Prevent spurious tcp_cwnd_restart() on first data * packet. @@ -6318,7 +6341,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) } else { tcp_try_undo_spurious_syn(sk); tp->retrans_stamp = 0; - tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); + tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, + skb); WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); } smp_mb(); From patchwork Mon Aug 3 23:10:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340597 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=Rnq/rR0o; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDC61cmnz9sR4 for ; Tue, 4 Aug 2020 09:10:58 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729164AbgHCXK5 (ORCPT ); Mon, 3 Aug 2020 19:10:57 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:40942 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729178AbgHCXK5 (ORCPT ); Mon, 3 Aug 2020 19:10:57 -0400 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.42/8.16.0.42) with SMTP id 073MtBI0013158 for ; Mon, 3 Aug 2020 16:10:56 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=GbNspzCKUO6lSZzBigfQVlWK4gdRi/hTWJIYDi/s62M=; b=Rnq/rR0owvU9n88Jy8DyowhGLac9rWJ4mhOxKwPEl0DbR64eNcs+Fbtfp/0Of06htlZ/ 7cmCnJ6Ya3jHfdvDjOLWFOZaMpypVvATQZd+AFd8Jo0kmBupoBB2sicOB3f2+8wu3uqf idi+/SnxEkUU97CLQQ6cs4C487mDh/sqpmE= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net with ESMTP id 32n81y9jjs-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:10:56 -0700 Received: from intmgw004.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:10:55 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 14CA62943872; Mon, 3 Aug 2020 16:10:51 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 06/12] bpf: tcp: Add bpf_skops_parse_hdr() Date: Mon, 3 Aug 2020 16:10:51 -0700 Message-ID: <20200803231051.2683561-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 malwarescore=0 phishscore=0 priorityscore=1501 mlxlogscore=995 bulkscore=0 suspectscore=13 impostorscore=0 adultscore=0 clxscore=1015 mlxscore=0 spamscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org The patch adds a function bpf_skops_parse_hdr(). It will call the bpf prog to parse the TCP header received at a tcp_sock that has at least reached the ESTABLISHED state. For the packets received during the 3WHS (SYN, SYNACK and ACK), the received skb will be available to the bpf prog during the callback in bpf_skops_established() introduced in the previous patch and in the bpf_skops_write_hdr_opt() that will be added in the next patch. Calling bpf prog to parse header is controlled by two new flags in tp->bpf_sock_ops_cb_flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG. When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set, the bpf prog will only be called when there is unknown option in the TCP header. When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set, the bpf prog will be called on all received TCP header. This function is half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Signed-off-by: Martin KaFai Lau Reviewed-by: Eric Dumazet --- include/uapi/linux/bpf.h | 4 +++- net/ipv4/tcp_input.c | 36 ++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 4 +++- 3 files changed, 42 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d77b7df71784..355cb97ec891 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4160,8 +4160,10 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x3F, }; /* List of known BPF sock_ops operators. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9a8fb41676bc..ec49f6a9b68b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -139,6 +139,36 @@ EXPORT_SYMBOL_GPL(clean_acked_data_flush); #endif #ifdef CONFIG_CGROUP_BPF +static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) +{ + bool unknown_opt = tcp_sk(sk)->rx_opt.saw_unknown && + BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG); + bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); + + if (likely(!unknown_opt && !parse_all_opt)) + return; + + /* The skb will be handled in the + * bpf_skops_established() or + * bpf_skops_write_hdr_opt(). + */ + switch (sk->sk_state) { + case TCP_SYN_RECV: + case TCP_SYN_SENT: + case TCP_LISTEN: + return; + } + + /* BPF prog will have access to the sk and skb. + * + * The bpf running context preparation and the actual bpf prog + * calling will be implemented in a later PATCH together with + * other bpf pieces. + */ +} + static void bpf_skops_established(struct sock *sk, int bpf_op, struct sk_buff *skb) { @@ -155,6 +185,10 @@ static void bpf_skops_established(struct sock *sk, int bpf_op, BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); } #else +static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) +{ +} + static void bpf_skops_established(struct sock *sk, int bpf_op, struct sk_buff *skb) { @@ -5621,6 +5655,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, goto discard; } + bpf_skops_parse_hdr(sk, skb); + return true; discard: diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index d77b7df71784..355cb97ec891 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4160,8 +4160,10 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x3F, }; /* List of known BPF sock_ops operators. From patchwork Mon Aug 3 23:10:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340599 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=OfzN6T9Z; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCH4N6Mz9sR4 for ; Tue, 4 Aug 2020 09:11:07 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729203AbgHCXLG (ORCPT ); Mon, 3 Aug 2020 19:11:06 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:23274 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728824AbgHCXLG (ORCPT ); Mon, 3 Aug 2020 19:11:06 -0400 Received: from pps.filterd (m0148460.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MtAS3011116 for ; Mon, 3 Aug 2020 16:11:03 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=+6C/EmKzptLsNc37lkSlYGKDI5Tji7ICeiDalHHeSL8=; b=OfzN6T9ZLfcp0bMMDPRahvL2ummvL4lFGNbDwnBB2N0v6qwqChflYsu4hzAZgodtO1Hb EFA1SO0Srzw5+nKR+X80ecjQgKCfScrMOqlICmov5+NE0edq1nb3aVqa/ttnE6L1o4ht aHSZFk9g/xaIoPizLgDCmHDQ3/5sP+5bPAY= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32n81jhgxu-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:03 -0700 Received: from intmgw005.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:02 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 5BD8A2943872; Mon, 3 Aug 2020 16:10:58 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 07/12] bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() Date: Mon, 3 Aug 2020 16:10:58 -0700 Message-ID: <20200803231058.2683864-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235,18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 impostorscore=0 spamscore=0 suspectscore=38 lowpriorityscore=0 phishscore=0 bulkscore=0 malwarescore=0 mlxscore=0 priorityscore=1501 mlxlogscore=999 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The bpf prog needs to parse the SYN header to learn what options have been sent by the peer's bpf-prog before writing its options into SYNACK. This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack(). This syn_skb will eventually be made available (as read-only) to the bpf prog. When writing options, the bpf prog will first be called to tell the kernel its required number of bytes. It is done by the new bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags. When the bpf prog returns, the kernel will know how many bytes are needed and then update the "*remaining" arg accordingly. 4 byte alignment will be included in the "*remaining" before this function returns. The 4 byte aligned number of bytes will also be stored into the opts->bpf_opt_len. "bpf_opt_len" is a newly added member to the struct tcp_out_options. Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the header options. The bpf prog is only called if it has reserved spaces before (opts->bpf_opt_len > 0). The bpf prog is the last one getting a chance to reserve header space and writing the header option. These two functions are half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Signed-off-by: Martin KaFai Lau Reviewed-by: Eric Dumazet --- include/net/tcp.h | 6 +- include/uapi/linux/bpf.h | 3 +- net/ipv4/tcp_input.c | 5 +- net/ipv4/tcp_ipv4.c | 5 +- net/ipv4/tcp_output.c | 105 +++++++++++++++++++++++++++++---- net/ipv6/tcp_ipv6.c | 5 +- tools/include/uapi/linux/bpf.h | 3 +- 7 files changed, 109 insertions(+), 23 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index ae18c2b222a3..220b1d1b00d1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -455,7 +455,8 @@ enum tcp_synack_type { struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type); + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb); int tcp_disconnect(struct sock *sk, int flags); void tcp_finish_connect(struct sock *sk, struct sk_buff *skb); @@ -2031,7 +2032,8 @@ struct tcp_request_sock_ops { int (*send_synack)(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type); + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb); }; extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 355cb97ec891..7f462d852138 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4162,8 +4162,9 @@ enum { BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0x3F, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F, }; /* List of known BPF sock_ops operators. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ec49f6a9b68b..4f5c22c4842b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6826,7 +6826,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, } if (fastopen_sk) { af_ops->send_synack(fastopen_sk, dst, &fl, req, - &foc, TCP_SYNACK_FASTOPEN); + &foc, TCP_SYNACK_FASTOPEN, skb); /* Add the child socket directly into the accept queue */ if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) { reqsk_fastopen_remove(fastopen_sk, req, false); @@ -6844,7 +6844,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, tcp_timeout_init((struct sock *)req)); af_ops->send_synack(sk, dst, &fl, req, &foc, !want_cookie ? TCP_SYNACK_NORMAL : - TCP_SYNACK_COOKIE); + TCP_SYNACK_COOKIE, + skb); if (want_cookie) { reqsk_free(req); return 0; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 5084333b5ab6..631a5ee0dd4e 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -965,7 +965,8 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { const struct inet_request_sock *ireq = inet_rsk(req); struct flowi4 fl4; @@ -976,7 +977,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL) return -1; - skb = tcp_make_synack(sk, dst, req, foc, synack_type); + skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb); if (skb) { __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 984114087263..71d262ec314a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -438,6 +438,7 @@ struct tcp_out_options { u8 ws; /* window scale, 0 to disable */ u8 num_sack_blocks; /* number of SACK blocks to include */ u8 hash_size; /* bytes in hash_location */ + u8 bpf_opt_len; /* length of BPF hdr option */ __u8 *hash_location; /* temporary pointer, overloaded */ __u32 tsval, tsecr; /* need to include OPTION_TS */ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ @@ -452,6 +453,59 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts) #endif } +#ifdef CONFIG_CGROUP_BPF +/* req, syn_skb and synack_type are used when writing synack */ +static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts, + unsigned int *remaining) +{ + if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) || + !*remaining) + return; + + /* The bpf running context preparation and the actual bpf prog + * calling will be implemented in a later PATCH together with + * other bpf pieces. + */ +} + +static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts) +{ + if (likely(!opts->bpf_opt_len)) + return; + + /* The bpf running context preparation and the actual bpf prog + * calling will be implemented in a later PATCH together with + * other bpf pieces. + */ +} +#else +static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts, + unsigned int *remaining) +{ +} + +static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts) +{ +} +#endif + /* Write previously computed TCP options to the packet. * * Beware: Something in the Internet is very sensitive to the ordering of @@ -691,6 +745,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, } } + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); + return MAX_TCP_OPTION_SPACE - remaining; } @@ -701,7 +757,8 @@ static unsigned int tcp_synack_options(const struct sock *sk, struct tcp_out_options *opts, const struct tcp_md5sig_key *md5, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); unsigned int remaining = MAX_TCP_OPTION_SPACE; @@ -758,6 +815,9 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); + bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb, + synack_type, opts, &remaining); + return MAX_TCP_OPTION_SPACE - remaining; } @@ -826,6 +886,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; } + if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp, + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) { + unsigned int remaining = MAX_TCP_OPTION_SPACE - size; + + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); + + size = MAX_TCP_OPTION_SPACE - remaining; + } + return size; } @@ -1213,6 +1282,9 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, } #endif + /* BPF prog is the last one writing header option */ + bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, 0, &opts); + INDIRECT_CALL_INET(icsk->icsk_af_ops->send_check, tcp_v6_send_check, tcp_v4_send_check, sk, skb); @@ -3336,20 +3408,20 @@ int tcp_send_synack(struct sock *sk) } /** - * tcp_make_synack - Prepare a SYN-ACK. - * sk: listener socket - * dst: dst entry attached to the SYNACK - * req: request_sock pointer - * foc: cookie for tcp fast open - * synack_type: Type of synback to prepare - * - * Allocate one skb and build a SYNACK packet. - * @dst is consumed : Caller should not use it again. + * tcp_make_synack - Allocate one skb and build a SYNACK packet. + * @sk: listener socket + * @dst: dst entry attached to the SYNACK. It is consumed and caller + * should not use it again. + * @req: request_sock pointer + * @foc: cookie for tcp fast open + * @synack_type: Type of synack to prepare + * @syn_skb: SYN packet just received. It could be NULL for rtx case. */ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); const struct tcp_sock *tp = tcp_sk(sk); @@ -3408,8 +3480,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req)); #endif skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4); + /* bpf program will be interested in the tcp_flags */ + TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK; tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5, - foc, synack_type) + sizeof(*th); + foc, synack_type, + syn_skb) + sizeof(*th); skb_push(skb, tcp_header_size); skb_reset_transport_header(skb); @@ -3441,6 +3516,9 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, rcu_read_unlock(); #endif + bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, + synack_type, &opts); + skb->skb_mstamp_ns = now; tcp_add_tx_delay(skb, tp); @@ -3936,7 +4014,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) int res; tcp_rsk(req)->txhash = net_tx_rndhash(); - res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL); + res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL, + NULL); if (!res) { __TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 305870a72352..87a633e1fbef 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -501,7 +501,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); struct ipv6_pinfo *np = tcp_inet6_sk(sk); @@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, IPPROTO_TCP)) == NULL) goto done; - skb = tcp_make_synack(sk, dst, req, foc, synack_type); + skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb); if (skb) { __tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr, diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 355cb97ec891..7f462d852138 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4162,8 +4162,9 @@ enum { BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0x3F, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F, }; /* List of known BPF sock_ops operators. From patchwork Mon Aug 3 23:11:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340601 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=FDDJZQ3X; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCK55sSz9sR4 for ; Tue, 4 Aug 2020 09:11:09 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728960AbgHCXLJ (ORCPT ); Mon, 3 Aug 2020 19:11:09 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:56606 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729190AbgHCXLI (ORCPT ); Mon, 3 Aug 2020 19:11:08 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MolF1013734 for ; Mon, 3 Aug 2020 16:11:08 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=1ujo6eMuF+9KbdpXujYgVna6biexRCf5aN/ftEu70Go=; b=FDDJZQ3XDKI+tZqQLkv7bZRPy96/+j13cD3JasX7GcxUvD+yZxZ1DFGizkn221eDKDkH VbO8epbnXnY3y+CRHFZtxuJsaoTY3ck+JQxJ/4PzKoyFkYZTdxzMh9GD4V1yh2qtaiNU ibQtJ2lP6Mh6zg9LLDtUlf/fsyhcEGc//jE= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 32nrnny2xv-4 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:08 -0700 Received: from intmgw003.03.ash8.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:07 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 992472943872; Mon, 3 Aug 2020 16:11:04 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 08/12] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Date: Mon, 3 Aug 2020 16:11:04 -0700 Message-ID: <20200803231104.2684437-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 lowpriorityscore=0 adultscore=0 malwarescore=0 spamscore=0 impostorscore=0 bulkscore=0 suspectscore=13 phishscore=0 mlxscore=0 mlxlogscore=734 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org A later patch needs to add a few pointers and a few u8 to sock_ops_kern. Hence, this patch saves some spaces by moving some of the existing members from u32 to u8 so that the later patch can still fit everything in a cacheline. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 4 ++-- net/core/filter.c | 15 ++++++++++----- 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 0a355b005bf4..c427dfa5f908 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1236,13 +1236,13 @@ struct bpf_sock_addr_kern { struct bpf_sock_ops_kern { struct sock *sk; - u32 op; union { u32 args[4]; u32 reply; u32 replylong[4]; }; - u32 is_fullsock; + u8 op; + u8 is_fullsock; u64 temp; /* temp and everything after is not * initialized to 0 before calling * the BPF program. New fields that diff --git a/net/core/filter.c b/net/core/filter.c index 0a1bf520c55d..4adfc90b221e 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8406,17 +8406,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, return insn - insn_buf; switch (si->off) { - case offsetof(struct bpf_sock_ops, op) ... + case offsetof(struct bpf_sock_ops, op): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + op), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, op)); + break; + + case offsetof(struct bpf_sock_ops, replylong[0]) ... offsetof(struct bpf_sock_ops, replylong[3]): - BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) != - sizeof_field(struct bpf_sock_ops_kern, op)); BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) != sizeof_field(struct bpf_sock_ops_kern, reply)); BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) != sizeof_field(struct bpf_sock_ops_kern, replylong)); off = si->off; - off -= offsetof(struct bpf_sock_ops, op); - off += offsetof(struct bpf_sock_ops_kern, op); + off -= offsetof(struct bpf_sock_ops, replylong[0]); + off += offsetof(struct bpf_sock_ops_kern, replylong[0]); if (type == BPF_WRITE) *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg, off); From patchwork Mon Aug 3 23:11:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340605 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=Op/D2YPc; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCf4qH2z9sSG for ; Tue, 4 Aug 2020 09:11:26 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729223AbgHCXLZ (ORCPT ); Mon, 3 Aug 2020 19:11:25 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:48272 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729036AbgHCXLZ (ORCPT ); Mon, 3 Aug 2020 19:11:25 -0400 Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MrZfF014756 for ; Mon, 3 Aug 2020 16:11:16 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type : content-transfer-encoding; s=facebook; bh=uR4DJ92fi1wOw9guPR4rgIIAkZcfhNsHi86S3kkO1F0=; b=Op/D2YPcYzIBxUI7LtSVGjktRsc3cA4Ar+LUaDKeMXw2vwa46k3R7vQz/RBg+lipfntF kvl9DJE/PV08+xNhl4GgW8hnCL5Anc+aH+tV1XluY/REB2hb8cN/P2t3Q7OhV2L92rJ4 EXfdJ8fdZ8iFru6rz7zrHOnSPjMeJVaeChU= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32n81fsnss-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:16 -0700 Received: from intmgw001.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:14 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id D976C2943872; Mon, 3 Aug 2020 16:11:10 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 09/12] bpf: tcp: Allow bpf prog to write and parse TCP header option Date: Mon, 3 Aug 2020 16:11:10 -0700 Message-ID: <20200803231110.2685038-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235,18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 priorityscore=1501 spamscore=0 impostorscore=0 mlxscore=0 phishscore=0 suspectscore=1 lowpriorityscore=0 bulkscore=0 adultscore=0 mlxlogscore=999 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org [ Note: The TCP changes here is mainly to implement the bpf pieces into the bpf_skops_*() functions introduced in the earlier patches. ] The earlier effort in BPF-TCP-CC allows the TCP Congestion Control algorithm to be written in BPF. It opens up opportunities to allow a faster turnaround time in testing/releasing new congestion control ideas to production environment. The same flexibility can be extended to writing TCP header option. It is not uncommon that people want to test new TCP header option to improve the TCP performance. Another use case is for data-center that has a more controlled environment and has more flexibility in putting header options for internal only use. For example, we want to test the idea in putting maximum delay ACK in TCP header option which is similar to a draft RFC proposal [1]. This patch introduces the necessary BPF API and use them in the TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse and write TCP header options. It currently supports most of the TCP packet except RST. Supported TCP header option: ─────────────────────────── This patch allows the bpf-prog to write any option kind. Different bpf-progs can write its own option by calling the new helper bpf_store_hdr_opt(). The helper will ensure there is no duplicated option in the header. By allowing bpf-prog to write any option kind, this gives a lot of flexibility to the bpf-prog. Different bpf-prog can write its own option kind. It could also allow the bpf-prog to support a recently standardized option on an older kernel. Sockops Callback Flags: ────────────────────── The header parsing and writing callback can be turned on by enabling a few newly added callback flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG: Call bpf when kernel has received a header option that the kernel cannot handle. It is useful when the peer doesn't send bpf-options very often. The bpf-prog can inspect the received header by sock_ops->skb_data which covers the whole header (including the fixed fields like ports, flags...etc) or use the new bpf_load_hdr_opt() to search for a particular TCP header option. BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG: Call bpf for all received TCP header. It could be used at the client/active side (i.e. connect() side) when the server told it that the server was in syncookie mode and required the active side to resend the bpf-written options. The active side can keep writing the bpf-options until it received a valid packet from the server side to confirm the earlier packet (and options) has been received. The later example patch is using it like this at the active side when the server is in syncookie mode. The bpf prog will usually turn this off in the common cases. When the above PARSE CB flags are turned on, the bpf-prog will be called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB. These PARSE CB flags will only ask the kernel to call the bpf-prog when the tcp packet is received at an already-established sk. It does not include the SYN-SYNACK-ACK during the 3WHS where the connection has not been established. The parsing of the SYN-SYNACK-ACK will be discussed in the "3 Way HandShake" section. BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG: Call bpf when the kernel is writing header options for the outgoing packet. The bpf will first be called to reserve the space in a skb during sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. The kernel native options will get the spaces first and the bpf can only reserve the remaining spaces left. The bpf prog can reserve through the new bpf_reserve_hdr_opt() helper. The bpf-prog will then be called to write the option during sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB. Again, the kernel will write its native options first before calling the bpf-prog. The bpf-prog will use the new bpf_store_hdr_opt() to write the option. The bpf_store_hdr_opt() will ensure the writing option has not already been written by the kernel or by the earlier bpf-progs. This will avoid sending duplicated options in the header. The bpf-prog can also learn what options have been written by the kernel or other bpf-progs by reading sock_ops->skb_data or by calling the new bpf_load_hdr_opt() helper. The default is off for all of the above new CB flags, i.e. the bpf prog will not be called to parse or write bpf hdr option. sock_ops->skb_data and bpf_load_hdr_opt() ───────────────────────────────────────── sock_ops->skb_data and sock_ops->skb_data_end covers the whole TCP header and its options. bpf_load_hdr_opt() helps to search a particular option "kind" in the TCP header. When parsing header (sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB), they are the received skb. When reserving header space (sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB), they will not be useful because the header has not been written yet. When writing header (sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB), they are the outgoing skb that the header option will be written into. The bpf-prog can use them to see what header has been written by the kernel or the earlier bpf-progs. When concluding a 3WHS, it is the received skb that completes the 3WHS: In ACTIVE_ESTABLISHED_CB, it is the received SYNACK In PASSIVE_ESTABLISHED_CB, it is usually the received ACK (or the received SYN-DATA in fastopen). 3 Way HandShake ─────────────── The bpf-prog can learn if it is sending SYN or SYNACK by reading the sock_ops->skb_tcp_flags. * Passive side When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB), the received SYN skb will be available to the bpf prog. The bpf prog can use the SYN skb (which may carry the header option sent from the remote bpf prog) to decide what bpf header option should be written to the outgoing SYNACK skb. The bpf-prog can get the whole SYN TCP header by "bpf_getsockopt(TCP_BPF_SYN)" or search for a particular header option by "bpf_load_hdr_opt(BPF_LOAD_HDR_OPT_TCP_SYN)". The bpf prog will also know if it is in syncookie mode (sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE) or not. The bpf prog can store the received SYN pkt by using the existing bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch also does this. [ Note that the fullsock here is a listen sk, bpf_sk_storage is not very useful here since the listen sk will be shared by many concurrent connection requests. Extending bpf_sk_storage support to request_sock will add weight to the minisock and it is not necessary better than storing the whole ~100 bytes SYN pkt. ] When the connection is established, the bpf prog will be called in the existing PASSIVE_ESTABLISHED_CB callback. At that time, the bpf prog can get the header option from the saved syn and then apply the needed operation to the newly established socket. The later patch will use the max delay ack specified in the SYN packet as an example. The received ACK (that concludes the 3WHS) will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data and bpf_load_hdr_opt() and it could be useful in syncookie scenario. More on this later. There is an existing getsockopt "TCP_SAVED_SYN" to return the whole saved syn pkt which includes the IP[46] header and the TCP header. A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to start getting from, e.g. starting from TCP header, or from IP[46] header. In the new getsockopt(TCP_BPF_SYN*), the kernel will know where it can get the SYN's packet from: - (a) the just received syn (available when the bpf prog is writing SYNACK) and it is the only way to get SYN in syncookie mode. or - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other existing CB). The bpf prog does not need to know where the SYN pkt can be obtained from. The "TCP_BPF_SYN*" getsockopt will hide this details. Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to bpf_load_hdr_opt() to search header option from the SYN packet. * Fastopen Fastopen should work the same as the regular non fastopen case. This is a test in a later patch. * Syncookie For syncookie, the later example patch asks the active side's bpf prog to resend the header options in ACK. The server can use bpf_load_hdr_opt() to look at the options in this received ACK during PASSIVE_ESTABLISHED_CB. * Active side The bpf prog will get a chance to write the bpf header option in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK pkt will also be available to the bpf prog during the existing ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data and bpf_load_hdr_opt(). * Turn off header CB flags after 3WHS If the bpf prog does not need to write/parse header options beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags to avoid being called for header options. Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on so that the kernel will only call it when there is option that the kernel cannot handle. Established Connection ────────────────────── The bpf prog will be called as long as the parse/write *_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags. That will allow the bpf prog to parse/write header options in the data, pure-ack, and fin packet. [1]: draft-wang-tcpm-low-latency-opt-00 https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 Signed-off-by: Martin KaFai Lau --- include/linux/bpf-cgroup.h | 25 +++ include/linux/filter.h | 4 + include/net/tcp.h | 49 +++++ include/uapi/linux/bpf.h | 228 +++++++++++++++++++- net/core/filter.c | 365 +++++++++++++++++++++++++++++++++ net/ipv4/tcp_input.c | 20 +- net/ipv4/tcp_minisocks.c | 1 + net/ipv4/tcp_output.c | 104 +++++++++- tools/include/uapi/linux/bpf.h | 228 +++++++++++++++++++- 9 files changed, 1006 insertions(+), 18 deletions(-) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 64f367044e25..2f98d2fce62e 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \ BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL) +/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a + * fullsock and its parent fullsock cannot be traced by + * sk_to_full_sk(). + * + * e.g. sock_ops->sk is a request_sock and it is under syncookie mode. + * Its listener-sk is not attached to the rsk_listener. + * In this case, the caller holds the listener-sk (unlocked), + * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with + * the listener-sk such that the cgroup-bpf-progs of the + * listener-sk will be run. + * + * Regardless of syncookie mode or not, + * calling bpf_setsockopt on listener-sk will not make sense anyway, + * so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here. + */ +#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled) \ + __ret = __cgroup_bpf_run_filter_sock_ops(sk, \ + sock_ops, \ + BPF_CGROUP_SOCK_OPS); \ + __ret; \ +}) + #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \ ({ \ int __ret = 0; \ diff --git a/include/linux/filter.h b/include/linux/filter.h index c427dfa5f908..995625950cc1 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1241,8 +1241,12 @@ struct bpf_sock_ops_kern { u32 reply; u32 replylong[4]; }; + struct sk_buff *syn_skb; + struct sk_buff *skb; + void *skb_data_end; u8 op; u8 is_fullsock; + u8 remaining_opt_len; u64 temp; /* temp and everything after is not * initialized to 0 before calling * the BPF program. New fields that diff --git a/include/net/tcp.h b/include/net/tcp.h index 220b1d1b00d1..da6a55fdd9a7 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2231,6 +2231,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, int len, int flags); #endif /* CONFIG_NET_SOCK_MSG */ +#ifdef CONFIG_CGROUP_BPF +/* Copy the listen sk's HDR_OPT_CB flags to its child. + * + * During 3-Way-HandShake, the synack is usually sent from + * the listen sk with the HDR_OPT_CB flags set so that + * bpf-prog will be called to write the BPF hdr option. + * + * In fastopen, the child sk is used to send synack instead + * of the listen sk. Thus, inheriting the HDR_OPT_CB flags + * from the listen sk gives the bpf-prog a chance to write + * BPF hdr option in the synack pkt during fastopen. + * + * Both fastopen and non-fastopen child will inherit the + * HDR_OPT_CB flags to keep the bpf-prog having a consistent + * behavior when deciding to clear this cb flags (or not) + * during the PASSIVE_ESTABLISHED_CB. + * + * In the future, other cb flags could be inherited here also. + */ +static inline void bpf_skops_init_child(const struct sock *sk, + struct sock *child) +{ + tcp_sk(child)->bpf_sock_ops_cb_flags = + tcp_sk(sk)->bpf_sock_ops_cb_flags & + (BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); +} + +static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops, + struct sk_buff *skb, + unsigned int end_offset) +{ + skops->skb = skb; + skops->skb_data_end = skb->data + end_offset; +} +#else +static inline void bpf_skops_init_child(const struct sock *sk, + struct sock *child) +{ +} + +static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops, + struct sk_buff *skb, + unsigned int end_offset) +{ +} +#endif + /* Call BPF_SOCK_OPS program that returns an int. If the return value * is < 0, then the BPF op failed (for example if the loaded BPF * program does not support the chosen operation or there is no BPF diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7f462d852138..c6162ee573e9 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3389,6 +3389,114 @@ union bpf_attr { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. * + * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags) + * Description + * Load Header option. Support loading TCP header option + * of a particular kind in BPF_PROG_TYPE_SOCK_OPS. + * + * The first byte of the *searchby_res* specifies the + * kind that it wants to search. + * + * If the searching kind is an experimental kind + * (i.e. 253 or 254 according to RFC6994). It also + * needs to specify the "magic" which is either + * 2 bytes or 4 bytes. It then also needs to + * specify the size of the magic by using + * the 2nd byte which is "kind-length" of a TCP + * header option and the "kind-length" also + * includes the first 2 bytes "kind" and "kind-length" + * itself as a normal TCP header option also does. + * + * For example, to search experimental kind 254 with + * 2 byte magic 0xeB9F, the searchby_res should be + * [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ]. + * + * To search for the standard window scale option (3), + * the searchby_res should be [ 3, 0, 0, .... 0 ]. + * Note, kind-length must be 0 for regular option. + * + * Searching for No-Op (0) and End-of-Option-List (1) are + * not supported. + * + * *len* must be at least 2 bytes which is the minimal size + * of a header option. + * + * Supported flags: + * * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the + * saved_syn packet or the just-received syn packet. + * + * Return + * >0 when found, the header option is copied to *searchby_res*. + * The return value is the total length copied. + * + * **-EINVAL** If param is invalid + * + * **-ENOMSG** The option is not found + * + * **-ENOENT** No syn packet available when + * **BPF_LOAD_HDR_OPT_TCP_SYN** is used + * + * **-ENOSPC** Not enough space. Only *len* number of + * bytes are copied. + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags) + * Description + * Store BPF header option. The data will be copied + * from buffer *from* with length *len* to the TCP header. + * + * The buffer *from* should have the whole option that + * includes the kind, kind-length, and the actual + * option data. The *len* must be at least kind-length + * long. The kind-length does not have to be 4 byte + * aligned. The kernel will take care of the padding + * and setting the 4 bytes aligned value to th->doff. + * + * This helper will check for duplicated option + * by searching the same option in the outgoing skb. + * + * This helper can only be called during + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** If param is invalid + * + * **-ENOSPC** Not enough space. Nothing has been written + * + * **-EEXIST** The writing option has already existed + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags) + * Description + * Reserve *len* bytes for the bpf header option. The + * space will be used by bpf_store_hdr_opt() later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * If bpf_reserve_hdr_opt() is called multiple times, + * the total number of bytes will be reserved. + * + * This helper can only be called during + * BPF_SOCK_OPS_HDR_OPT_LEN_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** if param is invalid + * + * **-ENOSPC** Not enough remaining space in the header. + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3533,6 +3641,9 @@ union bpf_attr { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(load_hdr_opt), \ + FN(store_hdr_opt), \ + FN(reserve_hdr_opt), /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4152,6 +4263,10 @@ struct bpf_sock_ops { __u64 bytes_received; __u64 bytes_acked; __bpf_md_ptr(struct bpf_sock *, sk); + __bpf_md_ptr(void *, skb_data); + __bpf_md_ptr(void *, skb_data_end); + __u32 skb_len; + __u32 skb_tcp_flags; }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -4160,7 +4275,7 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), - BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ @@ -4220,6 +4335,63 @@ enum { */ BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. */ + BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option. + * It will be called to handle + * the packets received at + * an already established + * connection. + * + * sock_ops->skb_data: + * Referring to the received skb. + * It covers the TCP header only. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option. + */ + BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the + * header option later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Not available because no header has + * been written yet. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the + * outgoing skb. (e.g. SYN, ACK, FIN). + * + * bpf_reserve_hdr_opt() should + * be used to reserve space. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Referring to the outgoing skb. + * It covers the TCP header + * that has already been written + * by the kernel and the + * earlier bpf-progs. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the outgoing + * skb. (e.g. SYN, ACK, FIN). + * + * bpf_store_hdr_opt() should + * be used to write the + * option. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option that + * has already been written + * by the kernel or the + * earlier bpf-progs. + */ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect @@ -4249,6 +4421,60 @@ enum { TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ + /* Copy the SYN pkt to optval + * + * BPF_PROG_TYPE_SOCK_OPS only. It is similar to the + * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit + * to only getting from the saved_syn. It can either get the + * syn packet from: + * + * 1. the just-received SYN packet (only available when writing the + * SYNACK). It will be useful when it is not necessary to + * save the SYN packet for latter use. It is also the only way + * to get the SYN during syncookie mode because the syn + * packet cannot be saved during syncookie. + * + * OR + * + * 2. the earlier saved syn which was done by + * bpf_setsockopt(TCP_SAVE_SYN). + * + * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the + * SYN packet is obtained. + * + * If the bpf-prog does not need the IP[46] header, the + * bpf-prog can avoid parsing the IP header by using + * TCP_BPF_SYN. Otherwise, the bpf-prog can get both + * IP[46] and TCP header by using TCP_BPF_SYN_IP. + * + * >0: Total number of bytes copied + * -ENOSPC: Not enough space in optval. Only optlen number of + * bytes is copied. + * -ENOENT: The SYN skb is not available now and the earlier SYN pkt + * is not saved by setsockopt(TCP_SAVE_SYN). + */ + TCP_BPF_SYN = 1005, /* Copy the TCP header */ + TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ +}; + +enum { + BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0), +}; + +/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + */ +enum { + BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the + * total option spaces + * required for an established + * sk in order to calculate the + * the MSS. No skb is actually + * sent. + */ + BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode + * when sending a SYN. + */ }; struct bpf_perf_event_value { diff --git a/net/core/filter.c b/net/core/filter.c index 4adfc90b221e..607f87901122 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4669,9 +4669,82 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock, + int optname, const u8 **start) +{ + struct sk_buff *syn_skb = bpf_sock->syn_skb; + const u8 *hdr_start; + int ret; + + if (syn_skb) { + /* sk is a request_sock here */ + + if (optname == TCP_BPF_SYN) { + hdr_start = syn_skb->data; + ret = tcp_hdrlen(syn_skb); + } else { + /* optname == TCP_BPF_SYN_IP */ + hdr_start = skb_network_header(syn_skb); + ret = skb_network_header_len(syn_skb) + + tcp_hdrlen(syn_skb); + } + } else { + struct sock *sk = bpf_sock->sk; + struct saved_syn *saved_syn; + + if (sk->sk_state == TCP_NEW_SYN_RECV) + /* synack retransmit. bpf_sock->syn_skb will + * not be available. It has to resort to + * saved_syn (if it is saved). + */ + saved_syn = inet_reqsk(sk)->saved_syn; + else + saved_syn = tcp_sk(sk)->saved_syn; + + if (!saved_syn) + return -ENOENT; + + if (optname == TCP_BPF_SYN) { + hdr_start = saved_syn->data + + saved_syn->network_hdrlen; + ret = saved_syn->tcp_hdrlen; + } else { + /* optname == TCP_BPF_SYN_IP */ + hdr_start = saved_syn->data; + ret = saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; + } + } + + *start = hdr_start; + return ret; +} + BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { + if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP && + optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_IP) { + int ret, copy_len = 0; + const u8 *start; + + ret = bpf_sock_ops_get_syn(bpf_sock, optname, &start); + if (ret > 0) { + copy_len = ret; + if (optlen < copy_len) { + copy_len = optlen; + ret = -ENOSPC; + } + + memcpy(optval, start, copy_len); + } + + /* Zero out unused buffer at the end */ + memset(optval + copy_len, 0, optlen - copy_len); + + return ret; + } + return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen); } @@ -6165,6 +6238,232 @@ static const struct bpf_func_proto bpf_sk_assign_proto = { .arg3_type = ARG_ANYTHING, }; +static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend, + u8 search_kind, const u8 *magic, + u8 magic_len, bool *eol) +{ + u8 kind, kind_len; + + *eol = false; + + while (op < opend) { + kind = op[0]; + + if (kind == TCPOPT_EOL) { + *eol = true; + return ERR_PTR(-ENOMSG); + } else if (kind == TCPOPT_NOP) { + op++; + continue; + } + + if (opend - op < 2 || opend - op < op[1] || op[1] < 2) + /* Something is wrong in the received header. + * Follow the TCP stack's tcp_parse_options() + * and just bail here. + */ + return ERR_PTR(-EFAULT); + + kind_len = op[1]; + if (search_kind == kind) { + if (!magic_len) + return op; + + if (magic_len > kind_len - 2) + return ERR_PTR(-ENOMSG); + + if (!memcmp(&op[2], magic, magic_len)) + return op; + } + + op += kind_len; + } + + return ERR_PTR(-ENOMSG); +} + +BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + void *, search_res, u32, len, u64, flags) +{ + bool eol, load_syn = flags & BPF_LOAD_HDR_OPT_TCP_SYN; + const u8 *op, *opend, *magic, *search = search_res; + u8 search_kind, search_len, copy_len, magic_len; + int ret; + + /* 2 byte is the minimal option len except TCPOPT_NOP and + * TCPOPT_EOL which are useless for the bpf prog to learn + * and this helper disallow loading them also. + */ + if (len < 2 || flags & ~BPF_LOAD_HDR_OPT_TCP_SYN) + return -EINVAL; + + search_kind = search[0]; + search_len = search[1]; + + if (search_len > len || search_kind == TCPOPT_NOP || + search_kind == TCPOPT_EOL) + return -EINVAL; + + if (search_kind == TCPOPT_EXP || search_kind == 253) { + /* 16 or 32 bit magic. +2 for kind and kind length */ + if (search_len != 4 && search_len != 6) + return -EINVAL; + magic = &search[2]; + magic_len = search_len - 2; + } else { + if (search_len) + return -EINVAL; + magic = NULL; + magic_len = 0; + } + + if (load_syn) { + ret = bpf_sock_ops_get_syn(bpf_sock, TCP_BPF_SYN, &op); + if (ret < 0) + return ret; + + opend = op + ret; + op += sizeof(struct tcphdr); + } else { + if (!bpf_sock->skb || + bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB) + /* This bpf_sock->op cannot call this helper */ + return -EPERM; + + opend = bpf_sock->skb_data_end; + op = bpf_sock->skb->data + sizeof(struct tcphdr); + } + + op = bpf_search_tcp_opt(op, opend, search_kind, magic, magic_len, + &eol); + if (IS_ERR(op)) + return PTR_ERR(op); + + copy_len = op[1]; + ret = copy_len; + if (copy_len > len) { + ret = -ENOSPC; + copy_len = len; + } + + memcpy(search_res, op, copy_len); + return ret; +} + +static const struct bpf_func_proto bpf_sock_ops_load_hdr_opt_proto = { + .func = bpf_sock_ops_load_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_4(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + const void *, from, u32, len, u64, flags) +{ + u8 new_kind, new_kind_len, magic_len = 0, *opend; + const u8 *op, *new_op, *magic = NULL; + struct sk_buff *skb; + bool eol; + + if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB) + return -EPERM; + + if (len < 2 || flags) + return -EINVAL; + + new_op = from; + new_kind = new_op[0]; + new_kind_len = new_op[1]; + + if (new_kind_len > len || new_kind == TCPOPT_NOP || + new_kind == TCPOPT_EOL) + return -EINVAL; + + if (new_kind_len > bpf_sock->remaining_opt_len) + return -ENOSPC; + + /* 253 is another experimental kind */ + if (new_kind == TCPOPT_EXP || new_kind == 253) { + if (new_kind_len < 4) + return -EINVAL; + /* Match for the 2 byte magic also. + * RFC 6994: the magic could be 2 or 4 bytes. + * Hence, matching by 2 byte only is on the + * conservative side but it is the right + * thing to do for the 'search-for-duplication' + * purpose. + */ + magic = &new_op[2]; + magic_len = 2; + } + + /* Check for duplication */ + skb = bpf_sock->skb; + op = skb->data + sizeof(struct tcphdr); + opend = bpf_sock->skb_data_end; + + op = bpf_search_tcp_opt(op, opend, new_kind, magic, magic_len, + &eol); + if (!IS_ERR(op)) + return -EEXIST; + + if (PTR_ERR(op) != -ENOMSG) + return PTR_ERR(op); + + if (eol) + /* The option has been ended. Treat it as no more + * header option can be written. + */ + return -ENOSPC; + + /* No duplication found. Store the header option. */ + memcpy(opend, from, new_kind_len); + + bpf_sock->remaining_opt_len -= new_kind_len; + bpf_sock->skb_data_end += new_kind_len; + + return 0; +} + +static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = { + .func = bpf_sock_ops_store_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + u32, len, u64, flags) +{ + if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB) + return -EPERM; + + if (flags || len < 2) + return -EINVAL; + + if (len > bpf_sock->remaining_opt_len) + return -ENOSPC; + + bpf_sock->remaining_opt_len -= len; + + return 0; +} + +static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = { + .func = bpf_sock_ops_reserve_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + #endif /* CONFIG_INET */ bool bpf_helper_changes_pkt_data(void *func) @@ -6193,6 +6492,9 @@ bool bpf_helper_changes_pkt_data(void *func) func == bpf_lwt_seg6_store_bytes || func == bpf_lwt_seg6_adjust_srh || func == bpf_lwt_seg6_action || +#endif +#ifdef CONFIG_INET + func == bpf_sock_ops_store_hdr_opt || #endif func == bpf_lwt_in_push_encap || func == bpf_lwt_xmit_push_encap) @@ -6565,6 +6867,12 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; #ifdef CONFIG_INET + case BPF_FUNC_load_hdr_opt: + return &bpf_sock_ops_load_hdr_opt_proto; + case BPF_FUNC_store_hdr_opt: + return &bpf_sock_ops_store_hdr_opt_proto; + case BPF_FUNC_reserve_hdr_opt: + return &bpf_sock_ops_reserve_hdr_opt_proto; case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; #endif /* CONFIG_INET */ @@ -7364,6 +7672,20 @@ static bool sock_ops_is_valid_access(int off, int size, return false; info->reg_type = PTR_TO_SOCKET_OR_NULL; break; + case offsetof(struct bpf_sock_ops, skb_data): + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_PACKET; + break; + case offsetof(struct bpf_sock_ops, skb_data_end): + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_PACKET_END; + break; + case offsetof(struct bpf_sock_ops, skb_tcp_flags): + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, + size_default); default: if (size != size_default) return false; @@ -8652,6 +8974,49 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, si->dst_reg, si->src_reg, offsetof(struct bpf_sock_ops_kern, sk)); break; + case offsetof(struct bpf_sock_ops, skb_data_end): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb_data_end), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb_data_end)); + break; + case offsetof(struct bpf_sock_ops, skb_data): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data), + si->dst_reg, si->dst_reg, + offsetof(struct sk_buff, data)); + break; + case offsetof(struct bpf_sock_ops, skb_len): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len), + si->dst_reg, si->dst_reg, + offsetof(struct sk_buff, len)); + break; + case offsetof(struct bpf_sock_ops, skb_tcp_flags): + off = offsetof(struct sk_buff, cb); + off += offsetof(struct tcp_skb_cb, tcp_flags); + *target_size = sizeof_field(struct tcp_skb_cb, tcp_flags); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb, + tcp_flags), + si->dst_reg, si->dst_reg, off); + break; } return insn - insn_buf; } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 4f5c22c4842b..40ffdfc9adf4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -146,6 +146,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG); bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); + struct bpf_sock_ops_kern sock_ops; if (likely(!unknown_opt && !parse_all_opt)) return; @@ -161,12 +162,15 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) return; } - /* BPF prog will have access to the sk and skb. - * - * The bpf running context preparation and the actual bpf prog - * calling will be implemented in a later PATCH together with - * other bpf pieces. - */ + sock_owned_by_me(sk); + + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB; + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb)); + + BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); } static void bpf_skops_established(struct sock *sk, int bpf_op, @@ -180,7 +184,9 @@ static void bpf_skops_established(struct sock *sk, int bpf_op, sock_ops.op = bpf_op; sock_ops.is_fullsock = 1; sock_ops.sk = sk; - /* skb will be passed to the bpf prog in a later patch. */ + /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */ + if (skb) + bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb)); BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); } diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 495dda2449fe..56c306e3cd2f 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, newtp->fastopen_req = NULL; RCU_INIT_POINTER(newtp->fastopen_rsk, NULL); + bpf_skops_init_child(sk, newsk); tcp_bpf_clone(sk, newsk); __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 71d262ec314a..8b0ac8b0f581 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -454,6 +454,18 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts) } #ifdef CONFIG_CGROUP_BPF +static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb, + enum tcp_synack_type synack_type) +{ + if (unlikely(!skb)) + return BPF_WRITE_HDR_TCP_CURRENT_MSS; + + if (unlikely(synack_type == TCP_SYNACK_COOKIE)) + return BPF_WRITE_HDR_TCP_SYNACK_COOKIE; + + return 0; +} + /* req, syn_skb and synack_type are used when writing synack */ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, struct request_sock *req, @@ -462,15 +474,60 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, struct tcp_out_options *opts, unsigned int *remaining) { + struct bpf_sock_ops_kern sock_ops; + int err; + if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) || !*remaining) return; - /* The bpf running context preparation and the actual bpf prog - * calling will be implemented in a later PATCH together with - * other bpf pieces. - */ + /* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */ + + /* init sock_ops */ + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + + sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB; + + if (req) { + /* The listen "sk" cannot be passed here because + * it is not locked. It would not make too much + * sense to do bpf_setsockopt(listen_sk) based + * on individual connection request also. + * + * Thus, "req" is passed here and the cgroup-bpf-progs + * of the listen "sk" will be run. + * + * "req" is also used here for fastopen even the "sk" here is + * a fullsock "child" sk. It is to keep the behavior + * consistent between fastopen and non-fastopen on + * the bpf programming side. + */ + sock_ops.sk = (struct sock *)req; + sock_ops.syn_skb = syn_skb; + } else { + sock_owned_by_me(sk); + + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + } + + sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type); + sock_ops.remaining_opt_len = *remaining; + /* tcp_current_mss() does not pass a skb */ + if (skb) + bpf_skops_init_skb(&sock_ops, skb, 0); + + err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk); + + if (err || sock_ops.remaining_opt_len == *remaining) + return; + + opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len; + /* round up to 4 bytes */ + opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3; + + *remaining -= opts->bpf_opt_len; } static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, @@ -479,13 +536,42 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, enum tcp_synack_type synack_type, struct tcp_out_options *opts) { - if (likely(!opts->bpf_opt_len)) + u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len; + struct bpf_sock_ops_kern sock_ops; + int err; + + if (likely(!max_opt_len)) return; - /* The bpf running context preparation and the actual bpf prog - * calling will be implemented in a later PATCH together with - * other bpf pieces. - */ + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + + sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB; + + if (req) { + sock_ops.sk = (struct sock *)req; + sock_ops.syn_skb = syn_skb; + } else { + sock_owned_by_me(sk); + + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + } + + sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type); + sock_ops.remaining_opt_len = max_opt_len; + first_opt_off = tcp_hdrlen(skb) - max_opt_len; + bpf_skops_init_skb(&sock_ops, skb, first_opt_off); + + err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk); + + if (err) + nr_written = 0; + else + nr_written = max_opt_len - sock_ops.remaining_opt_len; + + if (nr_written < max_opt_len) + memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP, + max_opt_len - nr_written); } #else static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7f462d852138..c6162ee573e9 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3389,6 +3389,114 @@ union bpf_attr { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. * + * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags) + * Description + * Load Header option. Support loading TCP header option + * of a particular kind in BPF_PROG_TYPE_SOCK_OPS. + * + * The first byte of the *searchby_res* specifies the + * kind that it wants to search. + * + * If the searching kind is an experimental kind + * (i.e. 253 or 254 according to RFC6994). It also + * needs to specify the "magic" which is either + * 2 bytes or 4 bytes. It then also needs to + * specify the size of the magic by using + * the 2nd byte which is "kind-length" of a TCP + * header option and the "kind-length" also + * includes the first 2 bytes "kind" and "kind-length" + * itself as a normal TCP header option also does. + * + * For example, to search experimental kind 254 with + * 2 byte magic 0xeB9F, the searchby_res should be + * [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ]. + * + * To search for the standard window scale option (3), + * the searchby_res should be [ 3, 0, 0, .... 0 ]. + * Note, kind-length must be 0 for regular option. + * + * Searching for No-Op (0) and End-of-Option-List (1) are + * not supported. + * + * *len* must be at least 2 bytes which is the minimal size + * of a header option. + * + * Supported flags: + * * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the + * saved_syn packet or the just-received syn packet. + * + * Return + * >0 when found, the header option is copied to *searchby_res*. + * The return value is the total length copied. + * + * **-EINVAL** If param is invalid + * + * **-ENOMSG** The option is not found + * + * **-ENOENT** No syn packet available when + * **BPF_LOAD_HDR_OPT_TCP_SYN** is used + * + * **-ENOSPC** Not enough space. Only *len* number of + * bytes are copied. + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags) + * Description + * Store BPF header option. The data will be copied + * from buffer *from* with length *len* to the TCP header. + * + * The buffer *from* should have the whole option that + * includes the kind, kind-length, and the actual + * option data. The *len* must be at least kind-length + * long. The kind-length does not have to be 4 byte + * aligned. The kernel will take care of the padding + * and setting the 4 bytes aligned value to th->doff. + * + * This helper will check for duplicated option + * by searching the same option in the outgoing skb. + * + * This helper can only be called during + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** If param is invalid + * + * **-ENOSPC** Not enough space. Nothing has been written + * + * **-EEXIST** The writing option has already existed + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags) + * Description + * Reserve *len* bytes for the bpf header option. The + * space will be used by bpf_store_hdr_opt() later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * If bpf_reserve_hdr_opt() is called multiple times, + * the total number of bytes will be reserved. + * + * This helper can only be called during + * BPF_SOCK_OPS_HDR_OPT_LEN_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** if param is invalid + * + * **-ENOSPC** Not enough remaining space in the header. + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3533,6 +3641,9 @@ union bpf_attr { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(load_hdr_opt), \ + FN(store_hdr_opt), \ + FN(reserve_hdr_opt), /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4152,6 +4263,10 @@ struct bpf_sock_ops { __u64 bytes_received; __u64 bytes_acked; __bpf_md_ptr(struct bpf_sock *, sk); + __bpf_md_ptr(void *, skb_data); + __bpf_md_ptr(void *, skb_data_end); + __u32 skb_len; + __u32 skb_tcp_flags; }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -4160,7 +4275,7 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), - BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ @@ -4220,6 +4335,63 @@ enum { */ BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. */ + BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option. + * It will be called to handle + * the packets received at + * an already established + * connection. + * + * sock_ops->skb_data: + * Referring to the received skb. + * It covers the TCP header only. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option. + */ + BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the + * header option later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Not available because no header has + * been written yet. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the + * outgoing skb. (e.g. SYN, ACK, FIN). + * + * bpf_reserve_hdr_opt() should + * be used to reserve space. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Referring to the outgoing skb. + * It covers the TCP header + * that has already been written + * by the kernel and the + * earlier bpf-progs. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the outgoing + * skb. (e.g. SYN, ACK, FIN). + * + * bpf_store_hdr_opt() should + * be used to write the + * option. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option that + * has already been written + * by the kernel or the + * earlier bpf-progs. + */ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect @@ -4249,6 +4421,60 @@ enum { TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ + /* Copy the SYN pkt to optval + * + * BPF_PROG_TYPE_SOCK_OPS only. It is similar to the + * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit + * to only getting from the saved_syn. It can either get the + * syn packet from: + * + * 1. the just-received SYN packet (only available when writing the + * SYNACK). It will be useful when it is not necessary to + * save the SYN packet for latter use. It is also the only way + * to get the SYN during syncookie mode because the syn + * packet cannot be saved during syncookie. + * + * OR + * + * 2. the earlier saved syn which was done by + * bpf_setsockopt(TCP_SAVE_SYN). + * + * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the + * SYN packet is obtained. + * + * If the bpf-prog does not need the IP[46] header, the + * bpf-prog can avoid parsing the IP header by using + * TCP_BPF_SYN. Otherwise, the bpf-prog can get both + * IP[46] and TCP header by using TCP_BPF_SYN_IP. + * + * >0: Total number of bytes copied + * -ENOSPC: Not enough space in optval. Only optlen number of + * bytes is copied. + * -ENOENT: The SYN skb is not available now and the earlier SYN pkt + * is not saved by setsockopt(TCP_SAVE_SYN). + */ + TCP_BPF_SYN = 1005, /* Copy the TCP header */ + TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ +}; + +enum { + BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0), +}; + +/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + */ +enum { + BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the + * total option spaces + * required for an established + * sk in order to calculate the + * the MSS. No skb is actually + * sent. + */ + BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode + * when sending a SYN. + */ }; struct bpf_perf_event_value { From patchwork Mon Aug 3 23:11:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340603 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=Ck9vIQFZ; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCZ6zz9z9sR4 for ; Tue, 4 Aug 2020 09:11:22 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729212AbgHCXLW (ORCPT ); Mon, 3 Aug 2020 19:11:22 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:50946 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729185AbgHCXLW (ORCPT ); Mon, 3 Aug 2020 19:11:22 -0400 Received: from pps.filterd (m0148460.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073MtAS8011116 for ; Mon, 3 Aug 2020 16:11:21 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=6tAT+doT1U/Hj+Jq4B5YaWu3S+rEo2R4DqnySvTWGdA=; b=Ck9vIQFZd+S4OXXUvONmPR2P1O6CBWINS3KRrO29sA0k/Dplj4rvLsHo4z0QXGQTdqs9 RlEhpM2gMdIUE9REQ+0U4N4pY6cejzoSusXLAzVf5AnwoLDrkdz9L/L5ZDzcVXQQxc5Z ADWOsx+TDQ0Y1Xstb72SMCULXh6m64YFkHQ= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 32n81jhgyv-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:21 -0700 Received: from intmgw003.03.ash8.facebook.com (2620:10d:c085:208::f) by mail.thefacebook.com (2620:10d:c085:21d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:19 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 2580D2943872; Mon, 3 Aug 2020 16:11:17 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 10/12] bpf: selftests: Add fastopen_connect to network_helpers Date: Mon, 3 Aug 2020 16:11:17 -0700 Message-ID: <20200803231117.2685614-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 impostorscore=0 spamscore=0 suspectscore=13 lowpriorityscore=0 phishscore=0 bulkscore=0 malwarescore=0 mlxscore=0 priorityscore=1501 mlxlogscore=999 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org This patch adds a fastopen_connect() helper which will be used in a later test. Signed-off-by: Martin KaFai Lau --- tools/testing/selftests/bpf/network_helpers.c | 37 +++++++++++++++++++ tools/testing/selftests/bpf/network_helpers.h | 2 + 2 files changed, 39 insertions(+) diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c index f56655690f9b..12ee40284da0 100644 --- a/tools/testing/selftests/bpf/network_helpers.c +++ b/tools/testing/selftests/bpf/network_helpers.c @@ -104,6 +104,43 @@ int start_server(int family, int type, const char *addr_str, __u16 port, return -1; } +int fastopen_connect(int server_fd, const char *data, unsigned int data_len, + int timeout_ms) +{ + struct sockaddr_storage addr; + socklen_t addrlen = sizeof(addr); + struct sockaddr_in *addr_in; + int fd, ret; + + if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) { + log_err("Failed to get server addr"); + return -1; + } + + addr_in = (struct sockaddr_in *)&addr; + fd = socket(addr_in->sin_family, SOCK_STREAM, 0); + if (fd < 0) { + log_err("Failed to create client socket"); + return -1; + } + + if (settimeo(fd, timeout_ms)) + goto error_close; + + ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr, + addrlen); + if (ret != data_len) { + log_err("sendto(data, %u) != %d\n", data_len, ret); + goto error_close; + } + + return fd; + +error_close: + save_errno_close(fd); + return -1; +} + static int connect_fd_to_addr(int fd, const struct sockaddr_storage *addr, socklen_t addrlen) diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h index c3728f6667e4..7205f8afdba1 100644 --- a/tools/testing/selftests/bpf/network_helpers.h +++ b/tools/testing/selftests/bpf/network_helpers.h @@ -37,6 +37,8 @@ int start_server(int family, int type, const char *addr, __u16 port, int timeout_ms); int connect_to_fd(int server_fd, int timeout_ms); int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms); +int fastopen_connect(int server_fd, const char *data, unsigned int data_len, + int timeout_ms); int make_sockaddr(int family, const char *addr_str, __u16 port, struct sockaddr_storage *addr, socklen_t *len); From patchwork Mon Aug 3 23:11:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340609 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=LYURxT3p; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCw4sy5z9sR4 for ; Tue, 4 Aug 2020 09:11:40 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729241AbgHCXLk (ORCPT ); Mon, 3 Aug 2020 19:11:40 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:17424 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728912AbgHCXLi (ORCPT ); Mon, 3 Aug 2020 19:11:38 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 073Mol0V013729 for ; Mon, 3 Aug 2020 16:11:35 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=qg5/q5u/Q4mq2aVHddAugJOYeI96W6BAuaOnBpJUuY8=; b=LYURxT3pdGV3Vs6HRfSyI9apAQvEsaPE7Pq59zHlfG6sooCpyG+rnmkO6zXmOYx3KP0k otIMuXO6b1K8yRv+6ltvzU7opeJudqLpXBBKExFUh9iKxOBdBoaX3138rMJB7+CsMlsZ NagPCPwCunzASrHoDr+MnCSsHLtxK90j4vs= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 32nrnny30x-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:35 -0700 Received: from intmgw001.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:32 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 9A3C22943872; Mon, 3 Aug 2020 16:11:23 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 11/12] bpf: selftests: tcp header options Date: Mon, 3 Aug 2020 16:11:23 -0700 Message-ID: <20200803231123.2686191-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235,18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 lowpriorityscore=0 adultscore=0 malwarescore=0 spamscore=0 impostorscore=0 bulkscore=0 suspectscore=15 phishscore=0 mlxscore=0 mlxlogscore=999 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds tests for the new bpf tcp header option feature. test_tcp_hdr_options.c: - It tests header option writing and parsing in 3WHS, normal regular connection establishment, fastopen, and syncookie. - In syncookie, the passive side's bpf prog is asking the active side to resend its bpf header option by specifying a RESEND bit in the outgoing SYNACK. handle_active_estab() and write_nodata_opt() has some details. - handle_passive_estab() has comments on fastopen. - It also has test for header writing and parsing in FIN packet. - Most of the tests is writing an experimental option 254 with magic 0xeB9F. - The no_exprm_estab() also tests writing a regular TCP option without any magic. test_misc_tcp_options.c: - It is an one directional test. Active side writes option and passive side parses option. The focus is to exercise the new helpers and API. - Testing the new helper, bpf_load_hdr_opt() and bpf_store_hdr_opt(). - Testing the bpf_getsockopt(TCP_BPF_SYN). - Negative tests for the above helpers. - Testing the sock_ops->skb_data. Signed-off-by: Martin KaFai Lau --- .../bpf/prog_tests/tcp_hdr_options.c | 629 +++++++++++++++++ .../bpf/progs/test_misc_tcp_hdr_options.c | 338 +++++++++ .../bpf/progs/test_tcp_hdr_options.c | 657 ++++++++++++++++++ .../selftests/bpf/test_tcp_hdr_options.h | 150 ++++ 4 files changed, 1774 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c create mode 100644 tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c new file mode 100644 index 000000000000..82b90aaa918c --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c @@ -0,0 +1,629 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include "test_progs.h" +#include "cgroup_helpers.h" +#include "network_helpers.h" +#include "test_tcp_hdr_options.h" +#include "test_tcp_hdr_options.skel.h" +#include "test_misc_tcp_hdr_options.skel.h" + +#define LO_ADDR6 "::eB9F" +#define CG_NAME "/tcpbpf-hdr-opt-test" + +struct bpf_test_option exp_passive_estab_in; +struct bpf_test_option exp_active_estab_in; +struct bpf_test_option exp_passive_fin_in; +struct bpf_test_option exp_active_fin_in; +struct hdr_stg exp_passive_hdr_stg; +struct hdr_stg exp_active_hdr_stg = { .active = true, }; + +static struct test_misc_tcp_hdr_options *misc_skel; +static struct test_tcp_hdr_options *skel; +static int lport_linum_map_fd; +static int hdr_stg_map_fd; +static __u32 duration; +static int cg_fd; + +struct sk_fds { + int srv_fd; + int passive_fd; + int active_fd; + int passive_lport; + int active_lport; +}; + +static int add_lo_addr(int family, const char *addr_str) +{ + char ip_addr_cmd[256]; + int cmdlen; + + if (family == AF_INET6) + cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd), + "ip -6 addr add %s/128 dev lo scope host", + addr_str); + else + cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd), + "ip addr add %s/32 dev lo", + addr_str); + + if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd", + "failed to add host addr %s to lo. ip cmdlen is too long\n", + addr_str)) + return -1; + + if (CHECK(system(ip_addr_cmd), "run ip cmd", + "failed to add host addr %s to lo\n", addr_str)) + return -1; + + return 0; +} + +static int create_netns(void) +{ + if (CHECK(unshare(CLONE_NEWNET), "create netns", + "unshare(CLONE_NEWNET): %s (%d)", + strerror(errno), errno)) + return -1; + + if (CHECK(system("ip link set dev lo up"), "run ip cmd", + "failed to bring lo link up\n")) + return -1; + + if (add_lo_addr(AF_INET6, LO_ADDR6)) + return -1; + + return 0; +} + +static int write_sysctl(const char *sysctl, const char *value) +{ + int fd, err, len; + + fd = open(sysctl, O_WRONLY); + if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n", + sysctl, strerror(errno), errno)) + return -1; + + len = strlen(value); + err = write(fd, value, len); + close(fd); + if (CHECK(err != len, "write sysctl", + "write(%s, %s): err:%d %s (%d)\n", + sysctl, value, err, strerror(errno), errno)) + return -1; + + return 0; +} + +static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix) +{ + fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n", + prefix ? : "", hdr_stg->active, hdr_stg->resend_syn, + hdr_stg->syncookie, hdr_stg->fastopen); +} + +static void print_option(const struct bpf_test_option *opt, const char *prefix) +{ + fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, rand:0x%x}\n", + prefix ? : "", opt->flags, opt->max_delack_ms, opt->rand); +} + +static void sk_fds_close(struct sk_fds *sk_fds) +{ + close(sk_fds->srv_fd); + close(sk_fds->passive_fd); + close(sk_fds->active_fd); +} + +static int sk_fds_shutdown(struct sk_fds *sk_fds) +{ + int ret, abyte; + + shutdown(sk_fds->active_fd, SHUT_WR); + ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte)); + if (CHECK(ret != 0, "read-after-shutdown(passive_fd):", + "ret:%d %s (%d)\n", + ret, strerror(errno), errno)) + return -1; + + shutdown(sk_fds->passive_fd, SHUT_WR); + ret = read(sk_fds->active_fd, &abyte, sizeof(abyte)); + if (CHECK(ret != 0, "read-after-shutdown(active_fd):", + "ret:%d %s (%d)\n", + ret, strerror(errno), errno)) + return -1; + + return 0; +} + +static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open) +{ + const char fast[] = "FAST!!!"; + struct sockaddr_storage addr; + struct sockaddr_in *addr_in; + socklen_t len; + + sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0); + if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n", + strerror(errno), errno)) + goto error; + + if (fast_open) + sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast, + sizeof(fast), 0); + else + sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0); + + if (CHECK_FAIL(sk_fds->active_fd == -1)) { + close(sk_fds->srv_fd); + goto error; + } + + addr_in = (struct sockaddr_in *)&addr; + len = sizeof(addr); + if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr, + &len), "getsockname(srv_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + sk_fds->passive_lport = ntohs(addr_in->sin_port); + + len = sizeof(addr); + if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr, + &len), "getsockname(active_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + sk_fds->active_lport = ntohs(addr_in->sin_port); + + sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0); + if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + + if (fast_open) { + char bytes_in[sizeof(fast)]; + int ret; + + ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in)); + if (CHECK(ret != sizeof(fast), "read fastopen syn data", + "expected=%lu actual=%d\n", sizeof(fast), ret)) { + close(sk_fds->passive_fd); + goto error_close; + } + } + + return 0; + +error_close: + close(sk_fds->active_fd); + close(sk_fds->srv_fd); + +error: + memset(sk_fds, -1, sizeof(*sk_fds)); + return -1; +} + +static int check_hdr_opt(const struct bpf_test_option *exp, + const struct bpf_test_option *act, + const char *hdr_desc) +{ + if (CHECK(memcmp(exp, act, sizeof(*exp)), + "expected-vs-actual", "unexpected %s\n", hdr_desc)) { + print_option(exp, "expected: "); + print_option(act, " actual: "); + return -1; + } + + return 0; +} + +static int check_hdr_stg(const struct hdr_stg *exp, int fd, + const char *stg_desc) +{ + struct hdr_stg act; + + if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act), + "map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n", + stg_desc, strerror(errno), errno)) + return -1; + + if (CHECK(memcmp(exp, &act, sizeof(*exp)), + "expected-vs-actual", "unexpected %s\n", stg_desc)) { + print_hdr_stg(exp, "expected: "); + print_hdr_stg(&act, " actual: "); + return -1; + } + + return 0; +} + +static int check_error_linum(const struct sk_fds *sk_fds) +{ + unsigned int nr_errors = 0; + struct linum_err linum_err; + int lport; + + lport = sk_fds->passive_lport; + if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) { + fprintf(stderr, + "bpf prog error out at lport:passive(%d), linum:%u err:%d\n", + lport, linum_err.linum, linum_err.err); + nr_errors++; + } + + lport = sk_fds->active_lport; + if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) { + fprintf(stderr, + "bpf prog error out at lport:active(%d), linum:%u err:%d\n", + lport, linum_err.linum, linum_err.err); + nr_errors++; + } + + return nr_errors; +} + +static void check_hdr_and_close_fds(struct sk_fds *sk_fds) +{ + if (sk_fds_shutdown(sk_fds)) + goto check_linum; + + if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd, + "passive_hdr_stg")) + goto check_linum; + + if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd, + "active_hdr_stg")) + goto check_linum; + + if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in, + "passive_estab_in")) + goto check_linum; + + if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in, + "active_estab_in")) + goto check_linum; + + if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in, + "passive_fin_in")) + goto check_linum; + + check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in, + "active_fin_in"); + +check_linum: + CHECK_FAIL(check_error_linum(sk_fds)); + sk_fds_close(sk_fds); +} + +static void prepare_out(void) +{ + skel->bss->active_syn_out = exp_passive_estab_in; + skel->bss->passive_synack_out = exp_active_estab_in; + + skel->bss->active_fin_out = exp_passive_fin_in; + skel->bss->passive_fin_out = exp_active_fin_in; +} + +static void reset_test(void) +{ + size_t optsize = sizeof(struct bpf_test_option); + int lport, err; + + memset(&skel->bss->passive_synack_out, 0, optsize); + memset(&skel->bss->passive_fin_out, 0, optsize); + + memset(&skel->bss->passive_estab_in, 0, optsize); + memset(&skel->bss->passive_fin_in, 0, optsize); + + memset(&skel->bss->active_syn_out, 0, optsize); + memset(&skel->bss->active_fin_out, 0, optsize); + + memset(&skel->bss->active_estab_in, 0, optsize); + memset(&skel->bss->active_fin_in, 0, optsize); + + skel->data->test_kind = TCPOPT_EXP; + skel->data->test_magic = 0xeB9F; + + memset(&exp_passive_estab_in, 0, optsize); + memset(&exp_active_estab_in, 0, optsize); + memset(&exp_passive_fin_in, 0, optsize); + memset(&exp_active_fin_in, 0, optsize); + + memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg)); + memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg)); + exp_active_hdr_stg.active = true; + + err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport); + while (!err) { + bpf_map_delete_elem(lport_linum_map_fd, &lport); + err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport); + } +} + +static void fastopen_estab(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + exp_passive_hdr_stg.fastopen = true; + + prepare_out(); + + /* Allow fastopen without fastopen cookie */ + if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, true)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void syncookie_estab(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS | + OPTION_F_RESEND; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + exp_passive_hdr_stg.syncookie = true; + exp_active_hdr_stg.resend_syn = true, + + prepare_out(); + + /* Clear the RESEND to ensure the bpf prog can learn + * want_cookie and set the RESEND by itself. + */ + skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND; + + /* Enforce syncookie mode */ + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void fin(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_fin_in.flags = OPTION_F_RAND; + exp_passive_fin_in.rand = 0xfa; + + exp_active_fin_in.flags = OPTION_F_RAND; + exp_active_fin_in.rand = 0xce; + + prepare_out(); + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void __simple_estab(bool exprm) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + prepare_out(); + + if (!exprm) { + skel->data->test_kind = 0xB9; + skel->data->test_magic = 0; + } + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void no_exprm_estab(void) +{ + __simple_estab(false); +} + +static void simple_estab(void) +{ + __simple_estab(true); +} + +static void misc(void) +{ + const char send_msg[] = "MISC!!!"; + char recv_msg[sizeof(send_msg)]; + const unsigned int nr_data = 2; + struct bpf_link *link; + struct sk_fds sk_fds; + int i, ret; + + lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map); + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(misc_skel->progs.misc_estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(misc_estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + for (i = 0; i < nr_data; i++) { + /* MSG_EOR to ensure skb will not be combined */ + ret = send(sk_fds.active_fd, send_msg, sizeof(send_msg), + MSG_EOR); + if (CHECK(ret != sizeof(send_msg), "send(msg)", "ret:%d\n", + ret)) + goto check_linum; + + ret = read(sk_fds.passive_fd, recv_msg, sizeof(recv_msg)); + if (CHECK(ret != sizeof(send_msg), "recv(msg)", "ret:%d\n", + ret)) + goto check_linum; + } + + if (sk_fds_shutdown(&sk_fds)) + goto check_linum; + + CHECK(misc_skel->bss->nr_syn != 1, "unexpected nr_syn", + "expected (1) != actual (%u)\n", + misc_skel->bss->nr_syn); + + CHECK(misc_skel->bss->nr_data != nr_data, "unexpected nr_data", + "expected (%u) != actual (%u)\n", + nr_data, misc_skel->bss->nr_data); + + /* The last ACK may have been delayed, so it is either 1 or 2. */ + CHECK(misc_skel->bss->nr_pure_ack != 1 && + misc_skel->bss->nr_pure_ack != 2, + "unexpected nr_pure_ack", + "expected (1 or 2) != actual (%u)\n", + misc_skel->bss->nr_pure_ack); + + CHECK(misc_skel->bss->nr_fin != 1, "unexpected nr_fin", + "expected (1) != actual (%u)\n", + misc_skel->bss->nr_fin); + +check_linum: + CHECK_FAIL(check_error_linum(&sk_fds)); + sk_fds_close(&sk_fds); + bpf_link__destroy(link); +} + +struct test { + const char *desc; + void (*run)(void); +}; + +#define DEF_TEST(name) { #name, name } +static struct test tests[] = { + DEF_TEST(simple_estab), + DEF_TEST(no_exprm_estab), + DEF_TEST(syncookie_estab), + DEF_TEST(fastopen_estab), + DEF_TEST(fin), + DEF_TEST(misc), +}; + +void test_tcp_hdr_options(void) +{ + int i; + + skel = test_tcp_hdr_options__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; + + misc_skel = test_misc_tcp_hdr_options__open_and_load(); + if (CHECK(!misc_skel, "open and load misc test skel", "failed")) + goto skel_destroy; + + cg_fd = test__join_cgroup(CG_NAME); + if (CHECK_FAIL(cg_fd < 0)) + goto skel_destroy; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + if (!test__start_subtest(tests[i].desc)) + continue; + + if (create_netns()) + break; + + tests[i].run(); + + reset_test(); + } + + close(cg_fd); +skel_destroy: + test_misc_tcp_hdr_options__destroy(misc_skel); + test_tcp_hdr_options__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c new file mode 100644 index 000000000000..6c33fb8a34d4 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c @@ -0,0 +1,338 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#define BPF_PROG_TEST_TCP_HDR_OPTIONS +#include "test_tcp_hdr_options.h" + +__u16 last_addr16_n = __bpf_htons(0xeB9F); +__u16 active_lport_n = 0; +__u16 active_lport_h = 0; +__u16 passive_lport_n = 0; +__u16 passive_lport_h = 0; + +/* options received at passive side */ +unsigned int nr_pure_ack = 0; +unsigned int nr_data = 0; +unsigned int nr_syn = 0; +unsigned int nr_fin = 0; + +/* Check the header received from the active side */ +static __always_inline int __check_active_hdr_in(struct bpf_sock_ops *skops, + bool check_syn) +{ + union { + struct tcphdr th; + struct ipv6hdr ip6; + struct tcp_exprm_opt exprm_opt; + struct tcp_opt reg_opt; + __u8 data[100]; /* IPv6 (40) + Max TCP hdr (60) */ + } hdr = {}; + __u64 load_flags = check_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0; + struct tcphdr *pth; + int err, ret; + + hdr.reg_opt.kind = 0xB9; + + /* The option is 4 bytes long instead of 2 bytes */ + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, 2, load_flags); + if (ret != -ENOSPC) + RET_CG_ERR(ret); + + /* Test searching magic with regular kind */ + hdr.reg_opt.len = 4; + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt), + load_flags); + if (ret != -EINVAL) + RET_CG_ERR(ret); + + hdr.reg_opt.len = 0; + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt), + load_flags); + if (ret != 4 || hdr.reg_opt.len != 4 || hdr.reg_opt.kind != 0xB9 || + hdr.reg_opt.data[0] != 0xfa || hdr.reg_opt.data[1] != 0xce) + RET_CG_ERR(ret); + + /* Test searching experimental option with invalid kind length */ + hdr.exprm_opt.kind = TCPOPT_EXP; + hdr.exprm_opt.len = 5; + hdr.exprm_opt.magic = 0; + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != -EINVAL) + RET_CG_ERR(ret); + + /* Test searching experimental option with 0 magic value */ + hdr.exprm_opt.len = 4; + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != -ENOMSG) + RET_CG_ERR(ret); + + hdr.exprm_opt.magic = __bpf_htons(0xeB9F); + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != 4 || hdr.exprm_opt.len != 4 || + hdr.exprm_opt.kind != TCPOPT_EXP || + hdr.exprm_opt.magic != __bpf_htons(0xeB9F)) + RET_CG_ERR(ret); + + if (!check_syn) + return CG_OK; + + /* Test loading from skops->syn_skb if sk_state == TCP_NEW_SYN_RECV + * (i.e. skops->sk is a req. and it means it received a syn + * and writing synack). + * + * Test loading from tp->saved_syn for other sk_state. + */ + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr.ip6, + sizeof(hdr.ip6)); + if (ret != -ENOSPC) + RET_CG_ERR(ret); + + if (hdr.ip6.saddr.s6_addr16[7] != last_addr16_n || + hdr.ip6.daddr.s6_addr16[7] != last_addr16_n) + RET_CG_ERR(0); + + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr, sizeof(hdr)); + if (ret < 0) + RET_CG_ERR(ret); + + pth = (struct tcphdr *)(&hdr.ip6 + 1); + if (pth->dest != passive_lport_n || pth->source != active_lport_n) + RET_CG_ERR(0); + + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN, &hdr, sizeof(hdr)); + if (ret < 0) + RET_CG_ERR(ret); + + if (hdr.th.dest != passive_lport_n || hdr.th.source != active_lport_n) + RET_CG_ERR(0); + + return CG_OK; +} + +static __always_inline int check_active_syn_in(struct bpf_sock_ops *skops) +{ + return __check_active_hdr_in(skops, true); +} + +static __always_inline int check_active_hdr_in(struct bpf_sock_ops *skops) +{ + struct tcphdr *th; + + if (__check_active_hdr_in(skops, false) == CG_ERR) + return CG_ERR; + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (tcp_hdrlen(th) < skops->skb_len) + nr_data++; + + if (th->fin) + nr_fin++; + + if (th->ack && !th->fin && tcp_hdrlen(th) == skops->skb_len) + nr_pure_ack++; + + return CG_OK; +} + +static __always_inline int active_opt_len(struct bpf_sock_ops *skops) +{ + int err; + + /* Reserve more than enough to allow the -EEXIST test in + * the write_active_opt(). + */ + err = bpf_reserve_hdr_opt(skops, 12, 0); + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static __always_inline int write_active_opt(struct bpf_sock_ops *skops) +{ + struct tcp_exprm_opt exprm_opt = {}; + struct tcp_opt win_scale_opt = {}; + struct tcp_opt reg_opt = {}; + struct tcphdr *th; + int err, ret; + + exprm_opt.kind = TCPOPT_EXP; + exprm_opt.len = 4; + exprm_opt.magic = __bpf_htons(0xeB9F); + + reg_opt.kind = 0xB9; + reg_opt.len = 4; + reg_opt.data[0] = 0xfa; + reg_opt.data[1] = 0xce; + + win_scale_opt.kind = TCPOPT_WINDOW; + + err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (err) + RET_CG_ERR(err); + + /* Store the same exprm option */ + err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + + err = bpf_store_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (err) + RET_CG_ERR(err); + err = bpf_store_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + + /* Check the option has been written and can be searched */ + ret = bpf_load_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (ret != 4 || exprm_opt.len != 4 || exprm_opt.kind != TCPOPT_EXP || + exprm_opt.magic != __bpf_htons(0xeB9F)) + RET_CG_ERR(ret); + + reg_opt.len = 0; + ret = bpf_load_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (ret != 4 || reg_opt.len != 4 || reg_opt.kind != 0xB9 || + reg_opt.data[0] != 0xfa || reg_opt.data[1] != 0xce) + RET_CG_ERR(ret); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (th->syn) { + active_lport_h = skops->local_port; + active_lport_n = th->source; + + /* Search the win scale option written by kernel + * in the SYN packet. + */ + ret = bpf_load_hdr_opt(skops, &win_scale_opt, + sizeof(win_scale_opt), 0); + if (ret != 3 || win_scale_opt.len != 3 || + win_scale_opt.kind != TCPOPT_WINDOW) + RET_CG_ERR(ret); + + /* Write the win scale option that kernel + * has already written. + */ + err = bpf_store_hdr_opt(skops, &win_scale_opt, + sizeof(win_scale_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + /* Check the SYN from bpf_sock_ops_kern->syn_skb */ + return check_active_syn_in(skops); + + /* Passive side should have cleared the write hdr cb by now */ + if (skops->local_port == passive_lport_h) + RET_CG_ERR(0); + + return active_opt_len(skops); +} + +static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + + /* Hdr space has not been reserved on passive-side in + * handle_hdr_opt_len(), so we should not have been called. + */ + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + RET_CG_ERR(0); + + if (skops->local_port == passive_lport_h) + RET_CG_ERR(0); + + return write_active_opt(skops); +} + +static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops) +{ + struct bpf_sock *sk = skops->sk; + + /* Passive side is not writing any non-standard/unknown + * option, so the active side should never be called. + */ + if (skops->local_port == active_lport_h) + RET_CG_ERR(0); + + return check_active_hdr_in(skops); +} + +static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops) +{ + int err; + + /* No more write hdr cb */ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); + + /* Recheck the SYN but check the tp->saved_syn this time */ + err = check_active_syn_in(skops); + if (err == CG_ERR) + return err; + + nr_syn++; + + /* The ack has header option written by the active side also */ + return check_active_hdr_in(skops); +} + +SEC("sockops/misc_estab") +int misc_estab(struct bpf_sock_ops *skops) +{ + int op, optlen, true_val = 1; + + switch (skops->op) { + case BPF_SOCK_OPS_TCP_LISTEN_CB: + passive_lport_h = skops->local_port; + passive_lport_n = __bpf_htons(passive_lport_h); + bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, + &true_val, sizeof(true_val)); + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_TCP_CONNECT_CB: + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_PARSE_HDR_OPT_CB: + return handle_parse_hdr(skops); + case BPF_SOCK_OPS_HDR_OPT_LEN_CB: + return handle_hdr_opt_len(skops); + case BPF_SOCK_OPS_WRITE_HDR_OPT_CB: + return handle_write_hdr_opt(skops); + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + return handle_passive_estab(skops); + } + + return CG_OK; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c new file mode 100644 index 000000000000..e745ce480106 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c @@ -0,0 +1,657 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#define BPF_PROG_TEST_TCP_HDR_OPTIONS +#include "test_tcp_hdr_options.h" + +#ifndef sizeof_field +#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) +#endif + +__u8 test_kind = TCPOPT_EXP; +__u16 test_magic = 0xeB9F; + +struct bpf_test_option passive_synack_out = {}; +struct bpf_test_option passive_fin_out = {}; + +struct bpf_test_option passive_estab_in = {}; +struct bpf_test_option passive_fin_in = {}; + +struct bpf_test_option active_syn_out = {}; +struct bpf_test_option active_fin_out = {}; + +struct bpf_test_option active_estab_in = {}; +struct bpf_test_option active_fin_in = {}; + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct hdr_stg); +} hdr_stg_map SEC(".maps"); + +static __always_inline bool skops_want_cookie(const struct bpf_sock_ops *skops) +{ + return skops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE; +} + +static __always_inline bool skops_current_mss(const struct bpf_sock_ops *skops) +{ + return skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS; +} + +static __always_inline __u8 option_total_len(__u8 flags) +{ + __u8 i, len = 1; /* +1 for flags */ + + if (!flags) + return 0; + + /* RESEND bit does not use a byte */ + for (i = OPTION_RESEND + 1; i < __NR_OPTION_FLAGS; i++) + len += !!TEST_OPTION_FLAGS(flags, i); + + if (test_kind == TCPOPT_EXP) + return len + TCP_BPF_EXPOPT_BASE_LEN; + else + return len + 2; +} + +static __always_inline +void write_test_option(const struct bpf_test_option *test_opt, + __u8 *data) +{ + __u8 offset = 0; + + data[offset++] = test_opt->flags; + if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_MAX_DELACK_MS)) + data[offset++] = test_opt->max_delack_ms; + + if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_RAND)) + data[offset++] = test_opt->rand; +} + +static __always_inline int store_option(struct bpf_sock_ops *skops, + const struct bpf_test_option *test_opt) +{ + union { + struct tcp_exprm_opt exprm; + struct tcp_opt regular; + } write_opt; + __u8 flags = test_opt->flags, offset = 0; + __u8 *data; + int err; + + if (test_kind == TCPOPT_EXP) { + write_opt.exprm.kind = TCPOPT_EXP; + write_opt.exprm.len = option_total_len(test_opt->flags); + write_opt.exprm.magic = __bpf_htons(test_magic); + write_opt.exprm.data32 = 0; + write_test_option(test_opt, write_opt.exprm.data); + err = bpf_store_hdr_opt(skops, &write_opt.exprm, + sizeof(write_opt.exprm), 0); + } else { + write_opt.regular.kind = test_kind; + write_opt.regular.len = option_total_len(test_opt->flags); + write_opt.regular.data32 = 0; + write_test_option(test_opt, write_opt.regular.data); + err = bpf_store_hdr_opt(skops, &write_opt.regular, + sizeof(write_opt.regular), 0); + } + + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static __always_inline int parse_test_option(struct bpf_test_option *opt, + const __u8 *start, const __u8 *end) +{ + __u8 flags; + + if (start + sizeof(opt->flags) > end) + return -EINVAL; + + flags = *start; + opt->flags = flags; + start += sizeof(opt->flags); + + if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) { + if (start + sizeof(opt->max_delack_ms) > end) + return -EINVAL; + opt->max_delack_ms = *(typeof(opt->max_delack_ms) *)start; + start += sizeof(opt->max_delack_ms); + } + + if (TEST_OPTION_FLAGS(flags, OPTION_RAND)) { + if (start + sizeof(opt->rand) > end) + return -EINVAL; + opt->rand = *(typeof(opt->rand) *)start; + start += sizeof(opt->rand); + } + + return 0; +} + +static __always_inline int load_option(struct bpf_sock_ops *skops, + struct bpf_test_option *test_opt, + bool from_syn) +{ + union { + struct tcp_exprm_opt exprm; + struct tcp_opt regular; + } search_opt; + int ret, load_flags = from_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0; + + if (test_kind == TCPOPT_EXP) { + search_opt.exprm.kind = TCPOPT_EXP; + search_opt.exprm.len = 4; + search_opt.exprm.magic = __bpf_htons(test_magic); + search_opt.exprm.data32 = 0; + ret = bpf_load_hdr_opt(skops, &search_opt.exprm, + sizeof(search_opt.exprm), load_flags); + if (ret < 0) + return ret; + return parse_test_option(test_opt, search_opt.exprm.data, + search_opt.exprm.data + + sizeof(search_opt.exprm.data)); + } else { + search_opt.regular.kind = test_kind; + search_opt.regular.len = 0; + search_opt.regular.data32 = 0; + ret = bpf_load_hdr_opt(skops, &search_opt.regular, + sizeof(search_opt.regular), load_flags); + if (ret < 0) + return ret; + return parse_test_option(test_opt, search_opt.regular.data, + search_opt.regular.data + + sizeof(search_opt.regular.data)); + } +} + +static __always_inline int synack_opt_len(struct bpf_sock_ops *skops) +{ + struct bpf_test_option test_opt = {}; + __u8 optlen, flags; + int err; + + if (!passive_synack_out.flags) + return CG_OK; + + err = load_option(skops, &test_opt, true); + + /* bpf_test_option is not found */ + if (err == -ENOMSG) + return CG_OK; + + if (err) + RET_CG_ERR(err); + + flags = passive_synack_out.flags; + if (skops_want_cookie(skops)) + SET_OPTION_FLAGS(flags, OPTION_RESEND); + + optlen = option_total_len(flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int write_synack_opt(struct bpf_sock_ops *skops) +{ + struct bpf_test_option opt; + __u8 exp_kind; + int err; + + if (!passive_synack_out.flags) + /* We should not even be called since no header + * space has been reserved. + */ + RET_CG_ERR(0); + + opt = passive_synack_out; + if (skops_want_cookie(skops)) + SET_OPTION_FLAGS(opt.flags, OPTION_RESEND); + + return store_option(skops, &opt); +} + +static __always_inline int syn_opt_len(struct bpf_sock_ops *skops) +{ + __u8 optlen; + int err; + + if (!active_syn_out.flags) + return CG_OK; + + optlen = option_total_len(active_syn_out.flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int write_syn_opt(struct bpf_sock_ops *skops) +{ + if (!active_syn_out.flags) + return CG_OK; + + return store_option(skops, &active_syn_out); +} + +static __always_inline int fin_opt_len(struct bpf_sock_ops *skops) +{ + struct bpf_test_option *opt; + struct hdr_stg *hdr_stg; + __u8 optlen; + int err; + + if (!skops->sk) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + if (hdr_stg->active) + opt = &active_fin_out; + else + opt = &passive_fin_out; + + optlen = option_total_len(opt->flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int write_fin_opt(struct bpf_sock_ops *skops) +{ + struct bpf_test_option *opt; + struct hdr_stg *hdr_stg; + + if (!skops->sk) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + return store_option(skops, hdr_stg->active ? + &active_fin_out : &passive_fin_out); +} + +static __always_inline int resend_in_ack(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + + if (!skops->sk) + return -1; + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + return -1; + + return !!hdr_stg->resend_syn; +} + +static __always_inline int nodata_opt_len(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + int resend; + + resend = resend_in_ack(skops); + if (resend < 0) + RET_CG_ERR(0); + + if (resend) + return syn_opt_len(skops); + + return CG_OK; +} + +static __always_inline int write_nodata_opt(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + int resend; + + resend = resend_in_ack(skops); + if (resend < 0) + RET_CG_ERR(0); + + if (resend) + return write_syn_opt(skops); + + return CG_OK; +} + +static __always_inline int data_opt_len(struct bpf_sock_ops *skops) +{ + return nodata_opt_len(skops); +} + +static __always_inline int write_data_opt(struct bpf_sock_ops *skops) +{ + return write_nodata_opt(skops); +} + +static __always_inline int current_mss_opt_len(struct bpf_sock_ops *skops) +{ + /* Reserve maximum that may be needed */ + int err; + + err = bpf_reserve_hdr_opt(skops, option_total_len(OPTION_MASK), 0); + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + return synack_opt_len(skops); + + if (tcp_flags & TCPHDR_SYN) + return syn_opt_len(skops); + + if (tcp_flags & TCPHDR_FIN) + return fin_opt_len(skops); + + if (skops_current_mss(skops)) + /* The kernel is calculating the MSS */ + return current_mss_opt_len(skops); + + if (skops->skb_len) + return data_opt_len(skops); + + return nodata_opt_len(skops); +} + +static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + struct tcphdr *th; + int err; + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + return write_synack_opt(skops); + + if (tcp_flags & TCPHDR_SYN) + return write_syn_opt(skops); + + if (tcp_flags & TCPHDR_FIN) + return write_fin_opt(skops); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (skops->skb_len > tcp_hdrlen(th)) + return write_data_opt(skops); + + return write_nodata_opt(skops); +} + +static __always_inline int set_delack_max(struct bpf_sock_ops *skops, + __u8 max_delack_ms) +{ + __u32 max_delack_us = max_delack_ms * 1000; + + return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX, + &max_delack_us, sizeof(max_delack_us)); +} + +static __always_inline int set_rto_min(struct bpf_sock_ops *skops, + __u8 peer_max_delack_ms) +{ + __u32 min_rto_us = peer_max_delack_ms * 1000; + + return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us, + sizeof(min_rto_us)); +} + +static __always_inline int handle_active_estab(struct bpf_sock_ops *skops) +{ + struct hdr_stg *stg, init_stg = { + .active = true, + }; + int err; + + err = load_option(skops, &active_estab_in, false); + if (err && err != -ENOMSG) + RET_CG_ERR(err); + +update_sk_stg: + init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags, + OPTION_RESEND); + if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk, + &init_stg, + BPF_SK_STORAGE_GET_F_CREATE)) + RET_CG_ERR(0); + + if (init_stg.resend_syn) + /* Don't clear the write_hdr cb now because + * the ACK may get lost and retransmit may + * be needed. + * + * PARSE_ALL_HDR cb flag is set to learn if this + * resend_syn option has received by the peer. + * + * The header option will be resent until a valid + * packet is received at handle_parse_hdr() + * and all hdr cb flags will be cleared in + * handle_parse_hdr(). + */ + set_parse_all_hdr_cb_flags(skops); + else if (!active_fin_out.flags) + /* No options will be written from now */ + clear_hdr_cb_flags(skops); + + if (active_syn_out.max_delack_ms) { + err = set_delack_max(skops, active_syn_out.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + if (active_estab_in.max_delack_ms) { + err = set_rto_min(skops, active_estab_in.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + bool syncookie = false; + __u8 *start, *end, len; + struct tcphdr *th; + int err; + + err = load_option(skops, &passive_estab_in, true); + if (err == -ENOENT) { + /* saved_syn is not found. It was in syncookie mode. + * We have asked the active side to resend the options + * in ACK, so try to find the bpf_test_option from ACK now. + */ + err = load_option(skops, &passive_estab_in, false); + syncookie = true; + } + + /* ENOMSG: The bpf_test_option is not found which is fine. + * Bail out now for all other errors. + */ + if (err && err != -ENOMSG) + RET_CG_ERR(err); + + if (!skops->sk) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, + BPF_SK_STORAGE_GET_F_CREATE); + if (!hdr_stg) + RET_CG_ERR(0); + hdr_stg->syncookie = syncookie; + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (th->syn) { + /* Fastopen */ + + /* Cannot clear cb_flags to stop write_hdr cb. + * synack is not sent yet for fast open. + * Even it was, the synack may need to be retransmitted. + * + * PARSE_ALL_HDR cb flag is set to learn + * if synack has reached the peer. + * All cb_flags will be cleared in handle_parse_hdr(). + */ + set_parse_all_hdr_cb_flags(skops); + hdr_stg->fastopen = true; + } else if (!passive_fin_out.flags) { + /* No options will be written from now */ + clear_hdr_cb_flags(skops); + } + + if (passive_synack_out.max_delack_ms) { + err = set_delack_max(skops, passive_synack_out.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + if (passive_estab_in.max_delack_ms) { + err = set_rto_min(skops, passive_estab_in.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + struct tcphdr *th; + __u32 cb_flags; + + if (!skops->sk) + RET_CG_ERR(0); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + if (hdr_stg->resend_syn || hdr_stg->fastopen) + /* The PARSE_ALL_HDR cb flag was turned on + * to ensure that the previously written + * options have reached the peer. + * Those previously written option includes: + * - Active side: resend_syn in ACK during syncookie + * or + * - Passive side: SYNACK during fastopen + * + * A valid packet has been received here after + * the 3WHS, so the PARSE_ALL_HDR cb flag + * can be cleared now. + */ + clear_parse_all_hdr_cb_flags(skops); + + if (hdr_stg->resend_syn && !active_fin_out.flags) + /* Active side resent the syn option in ACK + * because the server was in syncookie mode. + * A valid packet has been received, so + * clear header cb flags if there is no + * more option to send. + */ + clear_hdr_cb_flags(skops); + + if (hdr_stg->fastopen && !passive_fin_out.flags) + /* Passive side was in fastopen. + * A valid packet has been received, so + * the SYNACK has reached the peer. + * Clear header cb flags if there is no more + * option to send. + */ + clear_hdr_cb_flags(skops); + + if (th->fin) { + struct bpf_test_option *fin_opt; + __u8 *kind_start; + int err; + + if (hdr_stg->active) + fin_opt = &active_fin_in; + else + fin_opt = &passive_fin_in; + + err = load_option(skops, fin_opt, false); + if (err && err != -ENOMSG) + RET_CG_ERR(err); + } + + return CG_OK; +} + +SEC("sockops/estab") +int estab(struct bpf_sock_ops *skops) +{ + int op, optlen, true_val = 1; + + switch (skops->op) { + case BPF_SOCK_OPS_TCP_LISTEN_CB: + bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, + &true_val, sizeof(true_val)); + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_TCP_CONNECT_CB: + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_PARSE_HDR_OPT_CB: + return handle_parse_hdr(skops); + case BPF_SOCK_OPS_HDR_OPT_LEN_CB: + return handle_hdr_opt_len(skops); + case BPF_SOCK_OPS_WRITE_HDR_OPT_CB: + return handle_write_hdr_opt(skops); + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + return handle_passive_estab(skops); + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: + return handle_active_estab(skops); + } + + return CG_OK; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h new file mode 100644 index 000000000000..9aa19cc87146 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _TEST_TCP_HDR_OPTIONS_H +#define _TEST_TCP_HDR_OPTIONS_H + +struct bpf_test_option { + __u8 flags; + __u8 max_delack_ms; + __u8 rand; +} __attribute__((packed)); + +enum { + OPTION_RESEND, + OPTION_MAX_DELACK_MS, + OPTION_RAND, + __NR_OPTION_FLAGS, +}; + +#define OPTION_F_RESEND (1 << OPTION_RESEND) +#define OPTION_F_MAX_DELACK_MS (1 << OPTION_MAX_DELACK_MS) +#define OPTION_F_RAND (1 << OPTION_RAND) +#define OPTION_MASK ((1 << __NR_OPTION_FLAGS) - 1) + +#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option))) +#define SET_OPTION_FLAGS(flags, option) ((flags) |= (1 << (option))) + +/* Store in bpf_sk_storage */ +struct hdr_stg { + bool active; + bool resend_syn; /* active side only */ + bool syncookie; /* passive side only */ + bool fastopen; /* passive side only */ +}; + +struct linum_err { + unsigned int linum; + int err; +}; + +#define TCPHDR_FIN 0x01 +#define TCPHDR_SYN 0x02 +#define TCPHDR_RST 0x04 +#define TCPHDR_PSH 0x08 +#define TCPHDR_ACK 0x10 +#define TCPHDR_URG 0x20 +#define TCPHDR_ECE 0x40 +#define TCPHDR_CWR 0x80 +#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK) + +#define TCPOPT_EOL 0 +#define TCPOPT_NOP 1 +#define TCPOPT_WINDOW 3 +#define TCPOPT_EXP 254 + +#define TCP_BPF_EXPOPT_BASE_LEN 4 +#define MAX_TCP_HDR_LEN 60 +#define MAX_TCP_OPTION_SPACE 40 + +#ifdef BPF_PROG_TEST_TCP_HDR_OPTIONS + +#define CG_OK 1 +#define CG_ERR 0 + +#ifndef SOL_TCP +#define SOL_TCP 6 +#endif + +struct tcp_exprm_opt { + __u8 kind; + __u8 len; + __u16 magic; + union { + __u8 data[4]; + __u32 data32; + }; +} __attribute__((packed)); + +struct tcp_opt { + __u8 kind; + __u8 len; + union { + __u8 data[4]; + __u32 data32; + }; +} __attribute__((packed)); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 2); + __type(key, int); + __type(value, struct linum_err); +} lport_linum_map SEC(".maps"); + +static inline unsigned int tcp_hdrlen(const struct tcphdr *th) +{ + return th->doff << 2; +} + +static inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops) +{ + return skops->skb_tcp_flags; +} + +static inline void clear_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~(BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)); +} + +static inline void set_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags | + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); +} +static inline void +clear_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); +} + +static inline void +set_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags | + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); +} + +#define RET_CG_ERR(__err) ({ \ + struct linum_err __linum_err; \ + int __lport; \ + \ + __linum_err.linum = __LINE__; \ + __linum_err.err = __err; \ + __lport = skops->local_port; \ + bpf_map_update_elem(&lport_linum_map, &__lport, &__linum_err, BPF_NOEXIST); \ + clear_hdr_cb_flags(skops); \ + clear_parse_all_hdr_cb_flags(skops); \ + return CG_ERR; \ +}) + +#endif /* BPF_PROG_TEST_TCP_HDR_OPTIONS */ + +#endif /* _TEST_TCP_HDR_OPTIONS_H */ From patchwork Mon Aug 3 23:11:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1340607 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.a=rsa-sha256 header.s=facebook header.b=Lz5sfVaS; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BLDCt1NxMz9sR4 for ; Tue, 4 Aug 2020 09:11:38 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729230AbgHCXLh (ORCPT ); Mon, 3 Aug 2020 19:11:37 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:5996 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729226AbgHCXLh (ORCPT ); Mon, 3 Aug 2020 19:11:37 -0400 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.42/8.16.0.42) with SMTP id 073MtAao013131 for ; Mon, 3 Aug 2020 16:11:35 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=+IEUf0xZ8dTX/kRSFz42cb/OMoa7ChR09r+jtEdi8nA=; b=Lz5sfVaSKYzdB/ejJxFoONkBUQEB4X9m8Q0niNpZn0+/ybeRnypu88pGn034/GP4UcLO cPWzMSPRmlhtZlr9yRGDPfiRwkr4G5NS0REShMfjPa2OLwfkZeqKr8k2kkfR7zjMltBk ll3YV1toBe+Ai9qs6i34WqSPgjWIiUXbtO8= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net with ESMTP id 32n81y9jnq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 03 Aug 2020 16:11:35 -0700 Received: from intmgw005.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::d) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Mon, 3 Aug 2020 16:11:35 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id E01622943872; Mon, 3 Aug 2020 16:11:29 -0700 (PDT) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Lawrence Brakmo , Neal Cardwell , , Yuchung Cheng Smtp-Origin-Cluster: ftw2c04 Subject: [RFC PATCH v4 bpf-next 12/12] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN Date: Mon, 3 Aug 2020 16:11:29 -0700 Message-ID: <20200803231129.2687001-1-kafai@fb.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200803231013.2681560-1-kafai@fb.com> References: <20200803231013.2681560-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-08-03_15:2020-08-03,2020-08-03 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 malwarescore=0 phishscore=0 priorityscore=1501 mlxlogscore=998 bulkscore=0 suspectscore=3 impostorscore=0 adultscore=0 clxscore=1015 mlxscore=0 spamscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008030158 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org This patch is adapted from Eric's patch in an earlier discussion [1]. The TCP_SAVE_SYN currently only stores the network header and tcp header. This patch allows it to optionally store the mac header also if the setsockopt's optval is 2. It requires one more bit for the "save_syn" bit field in tcp_sock. This patch achieves this by moving the syn_smc bit next to the is_mptcp. The syn_smc is currently used with the TCP experimental option. Since syn_smc is only used when CONFIG_SMC is enabled, this patch also puts the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did with "IS_ENABLED(CONFIG_MPTCP)". The mac_hdrlen is also stored in the "struct saved_syn" to allow a quick offset from the bpf prog if it chooses to start getting from the network header or the tcp header. [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/ Suggested-by: Eric Dumazet Reviewed-by: Eric Dumazet Signed-off-by: Martin KaFai Lau --- include/linux/tcp.h | 13 ++++++++----- include/net/request_sock.h | 1 + include/uapi/linux/bpf.h | 1 + net/core/filter.c | 27 ++++++++++++++++++++++----- net/ipv4/tcp.c | 3 ++- net/ipv4/tcp_input.c | 14 +++++++++++++- tools/include/uapi/linux/bpf.h | 1 + 7 files changed, 48 insertions(+), 12 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 981145691f60..97cfb33305c0 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -239,14 +239,13 @@ struct tcp_sock { repair : 1, frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; - u8 syn_data:1, /* SYN includes data */ + u8 save_syn:2, /* Save headers of SYN packet */ + syn_data:1, /* SYN includes data */ syn_fastopen:1, /* SYN includes Fast Open option */ syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */ syn_fastopen_ch:1, /* Active TFO re-enabling probe */ syn_data_acked:1,/* data in SYN is acked by SYN-ACK */ - save_syn:1, /* Save headers of SYN packet */ - is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */ - syn_smc:1; /* SYN includes SMC */ + is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */ u32 tlp_high_seq; /* snd_nxt at the time of TLP */ u32 tcp_tx_delay; /* delay (in usec) added to TX packets */ @@ -393,6 +392,9 @@ struct tcp_sock { #if IS_ENABLED(CONFIG_MPTCP) bool is_mptcp; #endif +#if IS_ENABLED(CONFIG_SMC) + bool syn_smc; /* SYN includes SMC */ +#endif #ifdef CONFIG_TCP_MD5SIG /* TCP AF-Specific parts; only used by MD5 Signature support so far */ @@ -488,7 +490,8 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp) static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn) { - return saved_syn->network_hdrlen + saved_syn->tcp_hdrlen; + return saved_syn->mac_hdrlen + saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; } struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk); diff --git a/include/net/request_sock.h b/include/net/request_sock.h index b1b101814ecb..8c99af5ea891 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -42,6 +42,7 @@ struct request_sock_ops { int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req); struct saved_syn { + u32 mac_hdrlen; u32 network_hdrlen; u32 tcp_hdrlen; u8 data[]; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c6162ee573e9..68e71c948dc3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4455,6 +4455,7 @@ enum { */ TCP_BPF_SYN = 1005, /* Copy the TCP header */ TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ + TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ }; enum { diff --git a/net/core/filter.c b/net/core/filter.c index 607f87901122..477192550838 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4682,11 +4682,16 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock, if (optname == TCP_BPF_SYN) { hdr_start = syn_skb->data; ret = tcp_hdrlen(syn_skb); - } else { - /* optname == TCP_BPF_SYN_IP */ + } else if (optname == TCP_BPF_SYN_IP) { hdr_start = skb_network_header(syn_skb); ret = skb_network_header_len(syn_skb) + tcp_hdrlen(syn_skb); + } else { + /* optname == TCP_BPF_SYN_MAC */ + hdr_start = skb_mac_header(syn_skb); + ret = skb_mac_header_len(syn_skb) + + skb_network_header_len(syn_skb) + + tcp_hdrlen(syn_skb); } } else { struct sock *sk = bpf_sock->sk; @@ -4706,12 +4711,24 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock, if (optname == TCP_BPF_SYN) { hdr_start = saved_syn->data + + saved_syn->mac_hdrlen + saved_syn->network_hdrlen; ret = saved_syn->tcp_hdrlen; + } else if (optname == TCP_BPF_SYN_IP) { + hdr_start = saved_syn->data + + saved_syn->mac_hdrlen; + ret = saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; } else { - /* optname == TCP_BPF_SYN_IP */ + /* optname == TCP_BPF_SYN_MAC */ + + /* TCP_SAVE_SYN may not have saved the mac hdr */ + if (!saved_syn->mac_hdrlen) + return -ENOENT; + hdr_start = saved_syn->data; - ret = saved_syn->network_hdrlen + + ret = saved_syn->mac_hdrlen + + saved_syn->network_hdrlen + saved_syn->tcp_hdrlen; } } @@ -4724,7 +4741,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP && - optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_IP) { + optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) { int ret, copy_len = 0; const u8 *start; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 98606f94b01b..b3d275624658 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3210,7 +3210,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname, break; case TCP_SAVE_SYN: - if (val < 0 || val > 1) + /* 0: disable, 1: enable, 2: start from ether_header */ + if (val < 0 || val > 2) err = -EINVAL; else tp->save_syn = val; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 40ffdfc9adf4..af235b058d14 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6675,13 +6675,25 @@ static void tcp_reqsk_record_syn(const struct sock *sk, if (tcp_sk(sk)->save_syn) { u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb); struct saved_syn *saved_syn; + u32 mac_hdrlen; + void *base; + + if (tcp_sk(sk)->save_syn == 2) { /* Save full header. */ + base = skb_mac_header(skb); + mac_hdrlen = skb_mac_header_len(skb); + len += mac_hdrlen; + } else { + base = skb_network_header(skb); + mac_hdrlen = 0; + } saved_syn = kmalloc(struct_size(saved_syn, data, len), GFP_ATOMIC); if (saved_syn) { + saved_syn->mac_hdrlen = mac_hdrlen; saved_syn->network_hdrlen = skb_network_header_len(skb); saved_syn->tcp_hdrlen = tcp_hdrlen(skb); - memcpy(saved_syn->data, skb_network_header(skb), len); + memcpy(saved_syn->data, base, len); req->saved_syn = saved_syn; } } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index c6162ee573e9..68e71c948dc3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4455,6 +4455,7 @@ enum { */ TCP_BPF_SYN = 1005, /* Copy the TCP header */ TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ + TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ }; enum {