From patchwork Fri Oct 20 18:05:39 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lawrence Brakmo X-Patchwork-Id: 828760 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="ZyUpI+Xg"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3yJYdP3lpvz9t3l for ; Sat, 21 Oct 2017 05:06:17 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753590AbdJTSGL (ORCPT ); Fri, 20 Oct 2017 14:06:11 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:58100 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753220AbdJTSGJ (ORCPT ); Fri, 20 Oct 2017 14:06:09 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v9KI4X11005145 for ; Fri, 20 Oct 2017 11:06:09 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=QfERZMmGHNAVCoj/cLziABIGy3MxRTUjKlPLgUFDVoI=; b=ZyUpI+XgHXENMCAwUP5WNITJJjBQLoSYeRm2K/OEZGbU0fRytzHzKvOpT+lxxWoLpwXn jNAscmUtwOPD3OWCG8LUtLyyGUaakbY834lp3vxzNAgoNRpczi5jW9lm6swPPEXIE4jB e7eLt4g+xCC1Y8ZZn3twvMetYfzlc7uuV+4= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2dqmbgrd3f-2 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Fri, 20 Oct 2017 11:06:09 -0700 Received: from PRN-CHUB02.TheFacebook.com (2620:10d:c081:35::11) by PRN-CHUB12.TheFacebook.com (2620:10d:c081:35::21) with Microsoft SMTP Server (TLS) id 14.3.319.2; Fri, 20 Oct 2017 11:06:08 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB02.TheFacebook.com (192.168.16.12) with Microsoft SMTP Server id 14.3.361.1; Fri, 20 Oct 2017 11:06:07 -0700 Received: by dev15893.prn1.facebook.com (Postfix, from userid 10340) id 432ABC21BE7; Fri, 20 Oct 2017 11:05:55 -0700 (PDT) Smtp-Origin-Hostprefix: dev From: Lawrence Brakmo Smtp-Origin-Hostname: dev15893.prn1.facebook.com To: netdev CC: Kernel Team , Alexei Starovoitov , "Daniel Borkmann" , Blake Matheny , "Lawrence Brakmo" Smtp-Origin-Cluster: prn1c29 Subject: [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT Date: Fri, 20 Oct 2017 11:05:39 -0700 Message-ID: <20171020180543.4156833-2-brakmo@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <20171020180543.4156833-1-brakmo@fb.com> References: <20171020180543.4156833-1-brakmo@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-10-20_09:, , signatures=0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org A congestion control algorithm can make a call to the BPF socket_ops program to request the base RTT. The base RTT can be congestion control dependent and is meant to represent a congestion threshold such that RTTs above it indicate congestion. This is especially useful for flows within a DC where the base RTT is easy to obtain. Being provided a base RTT solves a basic problem in RTT based congestion avoidance algorithms (such as Vegas, NV and BBR). Although it is easy to get the base RTT when the network is not congested, it is very diffcult to do when it is very congested. Newer connections get an inflated value of the base RTT leading to unfariness (newer flows with a larger base RTT get more bandwidth). As a result, RTT based congestion avoidance algorithms tend to update their base RTTs to improve fairness. In very congested networks this can lead to base RTT inflation, reducing the ability of these RTT based congestion control algorithms to prevent congestion. Note that in my experiments with TCP-NV, the base RTT provided can be much larger than the actual hardware RTT. For example, experimenting with hosts within a rack where the hardware RTT is 16-20us, I've used base RTTs up to 150us. The effect of using a larger base RTT is that the congestion avoidance algorithm will allow more queueing. When there are only a few flows the main effect is larger measured RTTs and RPC latencies due to the increased queueing. When there are a lot of flows, a larger base RTT can lead to more congestion and more packet drops. For this case, where the hardware RTT is 20us, a base RTT of 80us produces good results. This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this set adds support for using it in TCP-NV. Further study and testing is needed before support can be added to other delay based congestion avoidance algorithms. Signed-off-by: Lawrence Brakmo Acked-by: Alexei Starovoitov Acked-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d83f95e..1aca744 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -955,6 +955,13 @@ enum { BPF_SOCK_OPS_NEEDS_ECN, /* If connection's congestion control * needs ECN */ + BPF_SOCK_OPS_BASE_RTT, /* Get base RTT. The correct value is + * based on the path and may be + * dependent on the congestion control + * algorithm. In general it indicates + * a congestion threshold. RTTs above + * this indicate congestion + */ }; #define TCP_BPF_IW 1001 /* Set TCP initial congestion window */ From patchwork Fri Oct 20 18:05:40 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lawrence Brakmo X-Patchwork-Id: 828759 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="pYDRSb70"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3yJYdK5lZHz9t3l for ; Sat, 21 Oct 2017 05:06:13 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753595AbdJTSGM (ORCPT ); Fri, 20 Oct 2017 14:06:12 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:41940 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753212AbdJTSGI (ORCPT ); Fri, 20 Oct 2017 14:06:08 -0400 Received: from pps.filterd (m0044008.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v9KI3wtu020498 for ; Fri, 20 Oct 2017 11:06:08 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=gTVUSoUHHqdWgS6rIR8VWVLabJovAaYHEm/uJpTjl04=; b=pYDRSb70NbvOlLcf/qedjTHOBYA3pPNW4BN1wenNBL7iHmgItNXkjh77eSRIXL09nX4+ A4hE1iWlGg64BeHXL+x0eJZfJTffbNVgo+NcZrGqhNEJ5vQTMIzzSq7hZiZP9UppDRSI aYOn2wRDaCaoz/TNphH9kNDji5C0UeF4aA0= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2dqmeh0ckj-2 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Fri, 20 Oct 2017 11:06:08 -0700 Received: from PRN-CHUB02.TheFacebook.com (2620:10d:c081:35::11) by PRN-CHUB13.TheFacebook.com (2620:10d:c081:35::22) with Microsoft SMTP Server (TLS) id 14.3.361.1; Fri, 20 Oct 2017 11:06:07 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB02.TheFacebook.com (192.168.16.12) with Microsoft SMTP Server id 14.3.361.1; Fri, 20 Oct 2017 11:06:07 -0700 Received: by dev15893.prn1.facebook.com (Postfix, from userid 10340) id 30143C21A0D; Fri, 20 Oct 2017 11:05:55 -0700 (PDT) Smtp-Origin-Hostprefix: dev From: Lawrence Brakmo Smtp-Origin-Hostname: dev15893.prn1.facebook.com To: netdev CC: Kernel Team , Alexei Starovoitov , "Daniel Borkmann" , Blake Matheny , "Lawrence Brakmo" Smtp-Origin-Cluster: prn1c29 Subject: [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops Date: Fri, 20 Oct 2017 11:05:40 -0700 Message-ID: <20171020180543.4156833-3-brakmo@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <20171020180543.4156833-1-brakmo@fb.com> References: <20171020180543.4156833-1-brakmo@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-10-20_09:, , signatures=0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Adding support for helper function bpf_getsockops to socket_ops BPF programs. This patch only supports TCP_CONGESTION. Signed-off-by: Vlad Vysotsky Acked-by: Lawrence Brakmo Acked-by: Alexei Starovoitov Acked-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 19 ++++++++++--- net/core/filter.c | 46 ++++++++++++++++++++++++++++++- tools/testing/selftests/bpf/bpf_helpers.h | 3 ++ 3 files changed, 63 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1aca744..f650346 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -613,12 +613,22 @@ union bpf_attr { * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen) * Calls setsockopt. Not all opts are available, only those with * integer optvals plus TCP_CONGESTION. - * Supported levels: SOL_SOCKET and IPROTO_TCP + * Supported levels: SOL_SOCKET and IPPROTO_TCP * @bpf_socket: pointer to bpf_socket - * @level: SOL_SOCKET or IPROTO_TCP + * @level: SOL_SOCKET or IPPROTO_TCP * @optname: option name * @optval: pointer to option value - * @optlen: length of optval in byes + * @optlen: length of optval in bytes + * Return: 0 or negative error + * + * int bpf_getsockopt(bpf_socket, level, optname, optval, optlen) + * Calls getsockopt. Not all opts are available. + * Supported levels: IPPROTO_TCP + * @bpf_socket: pointer to bpf_socket + * @level: IPPROTO_TCP + * @optname: option name + * @optval: pointer to option value + * @optlen: length of optval in bytes * Return: 0 or negative error * * int bpf_skb_adjust_room(skb, len_diff, mode, flags) @@ -721,7 +731,8 @@ union bpf_attr { FN(sock_map_update), \ FN(xdp_adjust_meta), \ FN(perf_event_read_value), \ - FN(perf_prog_read_value), + FN(perf_prog_read_value), \ + FN(getsockopt), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index 09e011f..ccf62f4 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3273,7 +3273,7 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, static const struct bpf_func_proto bpf_setsockopt_proto = { .func = bpf_setsockopt, - .gpl_only = true, + .gpl_only = false, .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_CTX, .arg2_type = ARG_ANYTHING, @@ -3282,6 +3282,48 @@ static const struct bpf_func_proto bpf_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, + int, level, int, optname, char *, optval, int, optlen) +{ + struct sock *sk = bpf_sock->sk; + int ret = 0; + + if (!sk_fullsock(sk)) + goto err_clear; + +#ifdef CONFIG_INET + if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) { + if (optname == TCP_CONGESTION) { + struct inet_connection_sock *icsk = inet_csk(sk); + + if (!icsk->icsk_ca_ops || optlen <= 1) + goto err_clear; + strncpy(optval, icsk->icsk_ca_ops->name, optlen); + optval[optlen - 1] = 0; + } else { + goto err_clear; + } + } else { + goto err_clear; + } + return ret; +#endif +err_clear: + memset(optval, 0, optlen); + return -EINVAL; +} + +static const struct bpf_func_proto bpf_getsockopt_proto = { + .func = bpf_getsockopt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_UNINIT_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + static const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { @@ -3460,6 +3502,8 @@ static const struct bpf_func_proto * switch (func_id) { case BPF_FUNC_setsockopt: return &bpf_setsockopt_proto; + case BPF_FUNC_getsockopt: + return &bpf_getsockopt_proto; case BPF_FUNC_sock_map_update: return &bpf_sock_map_update_proto; default: diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h index e25dbf6..609514f 100644 --- a/tools/testing/selftests/bpf/bpf_helpers.h +++ b/tools/testing/selftests/bpf/bpf_helpers.h @@ -67,6 +67,9 @@ static int (*bpf_xdp_adjust_meta)(void *ctx, int offset) = static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval, int optlen) = (void *) BPF_FUNC_setsockopt; +static int (*bpf_getsockopt)(void *ctx, int level, int optname, void *optval, + int optlen) = + (void *) BPF_FUNC_getsockopt; static int (*bpf_sk_redirect_map)(void *map, int key, int flags) = (void *) BPF_FUNC_sk_redirect_map; static int (*bpf_sock_map_update)(void *map, void *key, void *value, From patchwork Fri Oct 20 18:05:41 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lawrence Brakmo X-Patchwork-Id: 828758 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="FbXt+pE7"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3yJYdC66rnz9t5s for ; Sat, 21 Oct 2017 05:06:07 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753269AbdJTSGB (ORCPT ); Fri, 20 Oct 2017 14:06:01 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:58036 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753200AbdJTSF7 (ORCPT ); Fri, 20 Oct 2017 14:05:59 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v9KI4WcI005137 for ; Fri, 20 Oct 2017 11:05:58 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=Bszy/WYnuZWTCjAqKhstIAtfkdHhResimcZeHzXwCN8=; b=FbXt+pE7IbIm6IUAyMHmyDGxvhlQgQkThXJAXYUsyYzPeaUW9sHsL0sxFGCuxFYPBn0H 40BH6k22IPu14b0Hdb7EJjuyPX4lC1ZX27HFlruwzkNshpmAEU8PNZQAhR84LSziye9r s2EPSe1yxKfG7K6/f5sb+BuA+Y1vur55ld0= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2dqmbgrd2d-5 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Fri, 20 Oct 2017 11:05:58 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB01.TheFacebook.com (192.168.16.11) with Microsoft SMTP Server id 14.3.361.1; Fri, 20 Oct 2017 11:05:56 -0700 Received: by dev15893.prn1.facebook.com (Postfix, from userid 10340) id 69A5FC21A0D; Fri, 20 Oct 2017 11:05:55 -0700 (PDT) Smtp-Origin-Hostprefix: dev From: Lawrence Brakmo Smtp-Origin-Hostname: dev15893.prn1.facebook.com To: netdev CC: Kernel Team , Alexei Starovoitov , Daniel Borkmann , Blake Matheny , Lawrence Brakmo Smtp-Origin-Cluster: prn1c29 Subject: [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv Date: Fri, 20 Oct 2017 11:05:41 -0700 Message-ID: <20171020180543.4156833-4-brakmo@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <20171020180543.4156833-1-brakmo@fb.com> References: <20171020180543.4156833-1-brakmo@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-10-20_09:, , signatures=0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org TCP_NV will try to get the base RTT from a socket_ops BPF program if one is loaded. NV will then use the base RTT to bound its min RTT (its notion of the base RTT). It uses the base RTT as an upper bound and 80% of the base RTT as its lower bound. In other words, NV will consider filtered RTTs larger than base RTT as a sign of congestion. As a result, there is no minRTT inflation when there is a lot of congestion. For example, in a DC where the RTTs are less than 40us when there is no congestion, a base RTT value of 80us improves the performance of NV. The difference between the uncongested RTT and the base RTT provided represents how much queueing we are willing to have (in practice it can be higher). NV has been tunned to reduce congestion when there are many flows at the cost of one flow not achieving full bandwith utilization. When a reasonable base RTT is provided, one NV flow can now fully utilize the full bandwidth. In addition, the performance is also improved when there are many flows. In the following examples the NV results are using a kernel with this patch set (i.e. both NV results are using the new nv_loss_dec_factor). With one host sending to another host and only one flow the goodputs are: Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps With 2 hosts sending to one host (1 flow per host, the goodput per flow is: Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps But the RTTs seen by a ping process in the sender is: Cubic: 3.3ms NV: 97us, NV (baseRTT=80us): 146us With a lot of flows things look even better for NV with baseRTT. Here we have 3 hosts sending to one host. Each sending host has 6 flows: 1 stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully utilize the full available bandwidth. However, the distribution of bandwidth among the flows is very different. For the 10KB RPC flow: Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps The 99% latencies for the 10KB flows are: Cubic: 26ms, NV: 1ms, NV (baseRTT=80us): 500us The RTT seen by a ping process at the senders: Cubic: 3.2ms NV: 720us, NV (baseRTT=80us): 330us Signed-off-by: Lawrence Brakmo Acked_by: Alexei Starovoitov Acked-by: Alexei Starovoitov --- net/ipv4/tcp_nv.c | 40 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c index 1ff7398..a978e3f 100644 --- a/net/ipv4/tcp_nv.c +++ b/net/ipv4/tcp_nv.c @@ -39,7 +39,7 @@ * nv_cong_dec_mult Decrease cwnd by X% (30%) of congestion when detected * nv_ssthresh_factor On congestion set ssthresh to this * / 8 * nv_rtt_factor RTT averaging factor - * nv_loss_dec_factor Decrease cwnd by this (50%) when losses occur + * nv_loss_dec_factor Decrease cwnd to this (80%) when losses occur * nv_dec_eval_min_calls Wait this many RTT measurements before dec cwnd * nv_inc_eval_min_calls Wait this many RTT measurements before inc cwnd * nv_ssthresh_eval_min_calls Wait this many RTT measurements before stopping @@ -61,7 +61,7 @@ static int nv_min_cwnd __read_mostly = 2; static int nv_cong_dec_mult __read_mostly = 30 * 128 / 100; /* = 30% */ static int nv_ssthresh_factor __read_mostly = 8; /* = 1 */ static int nv_rtt_factor __read_mostly = 128; /* = 1/2*old + 1/2*new */ -static int nv_loss_dec_factor __read_mostly = 512; /* => 50% */ +static int nv_loss_dec_factor __read_mostly = 819; /* => 80% */ static int nv_cwnd_growth_rate_neg __read_mostly = 8; static int nv_cwnd_growth_rate_pos __read_mostly; /* 0 => fixed like Reno */ static int nv_dec_eval_min_calls __read_mostly = 60; @@ -101,6 +101,11 @@ struct tcpnv { u32 nv_last_rtt; /* last rtt */ u32 nv_min_rtt; /* active min rtt. Used to determine slope */ u32 nv_min_rtt_new; /* min rtt for future use */ + u32 nv_base_rtt; /* If non-zero it represents the threshold for + * congestion */ + u32 nv_lower_bound_rtt; /* Used in conjunction with nv_base_rtt. It is + * set to 80% of nv_base_rtt. It helps reduce + * unfairness between flows */ u32 nv_rtt_max_rate; /* max rate seen during current RTT */ u32 nv_rtt_start_seq; /* current RTT ends when packet arrives * acking beyond nv_rtt_start_seq */ @@ -132,9 +137,24 @@ static inline void tcpnv_reset(struct tcpnv *ca, struct sock *sk) static void tcpnv_init(struct sock *sk) { struct tcpnv *ca = inet_csk_ca(sk); + int base_rtt; tcpnv_reset(ca, sk); + /* See if base_rtt is available from socket_ops bpf program. + * It is meant to be used in environments, such as communication + * within a datacenter, where we have reasonable estimates of + * RTTs + */ + base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT); + if (base_rtt > 0) { + ca->nv_base_rtt = base_rtt; + ca->nv_lower_bound_rtt = (base_rtt * 205) >> 8; /* 80% */ + } else { + ca->nv_base_rtt = 0; + ca->nv_lower_bound_rtt = 0; + } + ca->nv_allow_cwnd_growth = 1; ca->nv_min_rtt_reset_jiffies = jiffies + 2 * HZ; ca->nv_min_rtt = NV_INIT_RTT; @@ -144,6 +164,19 @@ static void tcpnv_init(struct sock *sk) ca->cwnd_growth_factor = 0; } +/* If provided, apply upper (base_rtt) and lower (lower_bound_rtt) + * bounds to RTT. + */ +inline u32 nv_get_bounded_rtt(struct tcpnv *ca, u32 val) +{ + if (ca->nv_lower_bound_rtt > 0 && val < ca->nv_lower_bound_rtt) + return ca->nv_lower_bound_rtt; + else if (ca->nv_base_rtt > 0 && val > ca->nv_base_rtt) + return ca->nv_base_rtt; + else + return val; +} + static void tcpnv_cong_avoid(struct sock *sk, u32 ack, u32 acked) { struct tcp_sock *tp = tcp_sk(sk); @@ -265,6 +298,9 @@ static void tcpnv_acked(struct sock *sk, const struct ack_sample *sample) if (ca->nv_eval_call_cnt < 255) ca->nv_eval_call_cnt++; + /* Apply bounds to rtt. Only used to update min_rtt */ + avg_rtt = nv_get_bounded_rtt(ca, avg_rtt); + /* update min rtt if necessary */ if (avg_rtt < ca->nv_min_rtt) ca->nv_min_rtt = avg_rtt; From patchwork Fri Oct 20 18:05:42 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lawrence Brakmo X-Patchwork-Id: 828761 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="JsQTgwYt"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3yJYdW1z9yz9t4b for ; Sat, 21 Oct 2017 05:06:23 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753617AbdJTSGV (ORCPT ); Fri, 20 Oct 2017 14:06:21 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:48654 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753206AbdJTSGK (ORCPT ); Fri, 20 Oct 2017 14:06:10 -0400 Received: from pps.filterd (m0109333.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v9KI4qZI019219 for ; Fri, 20 Oct 2017 11:06:10 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=BKCbl7JPa/KBXd7Ciz2TSZWcClbCQCCb6yah0XlRrLU=; b=JsQTgwYt5P+TU7gThAg7JhUetkGoNl5buY3Sg+lpBU/k5f2afL0z2DP2AkMVxkt3ziNS gbaLTh6npPBifTepKkzEpuXWj0GJlQfXbIksS/rTAsOgrJda4ASzlEQfKJtKoQe6ghOB EVo0gWiMYB02IHX4jMmV7gyA99uETFAER7g= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2dqmuhg9sw-2 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Fri, 20 Oct 2017 11:06:10 -0700 Received: from PRN-CHUB02.TheFacebook.com (2620:10d:c081:35::11) by PRN-CHUB10.TheFacebook.com (2620:10d:c081:35::19) with Microsoft SMTP Server (TLS) id 14.3.361.1; Fri, 20 Oct 2017 11:06:09 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB02.TheFacebook.com (192.168.16.12) with Microsoft SMTP Server id 14.3.361.1; Fri, 20 Oct 2017 11:06:08 -0700 Received: by dev15893.prn1.facebook.com (Postfix, from userid 10340) id 54C97C21ABF; Fri, 20 Oct 2017 11:05:55 -0700 (PDT) Smtp-Origin-Hostprefix: dev From: Lawrence Brakmo Smtp-Origin-Hostname: dev15893.prn1.facebook.com To: netdev CC: Kernel Team , Alexei Starovoitov , "Daniel Borkmann" , Blake Matheny , "Lawrence Brakmo" Smtp-Origin-Cluster: prn1c29 Subject: [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program Date: Fri, 20 Oct 2017 11:05:42 -0700 Message-ID: <20171020180543.4156833-5-brakmo@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <20171020180543.4156833-1-brakmo@fb.com> References: <20171020180543.4156833-1-brakmo@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-10-20_09:, , signatures=0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Sample socket_ops BPF program to test the BPF helper function bpf_getsocketops and the new socket_ops op BPF_SOCKET_OPS_BASE_RTT. The program provides a base RTT of 80us when the calling flow is within a DC (as determined by the IPV6 prefix) and the congestion algorithm is "nv". Signed-off-by: Lawrence Brakmo Acked-by: Daniel Borkmann Acked_by: Alexei Starovoitov --- samples/bpf/Makefile | 1 + samples/bpf/tcp_basertt_kern.c | 78 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+) create mode 100644 samples/bpf/tcp_basertt_kern.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 3534ccf..ea2b9e6 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -129,6 +129,7 @@ always += tcp_bufs_kern.o always += tcp_cong_kern.o always += tcp_iw_kern.o always += tcp_clamp_kern.o +always += tcp_basertt_kern.o always += xdp_redirect_kern.o always += xdp_redirect_map_kern.o always += xdp_redirect_cpu_kern.o diff --git a/samples/bpf/tcp_basertt_kern.c b/samples/bpf/tcp_basertt_kern.c new file mode 100644 index 0000000..4bf4fc5 --- /dev/null +++ b/samples/bpf/tcp_basertt_kern.c @@ -0,0 +1,78 @@ +/* Copyright (c) 2017 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * BPF program to set base_rtt to 80us when host is running TCP-NV and + * both hosts are in the same datacenter (as determined by IPv6 prefix). + * + * Use load_sock_ops to load this BPF program. + */ + +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" +#include "bpf_endian.h" + +#define DEBUG 1 + +#define bpf_printk(fmt, ...) \ +({ \ + char ____fmt[] = fmt; \ + bpf_trace_printk(____fmt, sizeof(____fmt), \ + ##__VA_ARGS__); \ +}) + +SEC("sockops") +int bpf_basertt(struct bpf_sock_ops *skops) +{ + char cong[20]; + char nv[] = "nv"; + int rv = 0, n; + int op; + + op = (int) skops->op; + +#ifdef DEBUG + bpf_printk("BPF command: %d\n", op); +#endif + + /* Check if both hosts are in the same datacenter. For this + * example they are if the 1st 5.5 bytes in the IPv6 address + * are the same. + */ + if (skops->family == AF_INET6 && + skops->local_ip6[0] == skops->remote_ip6[0] && + (bpf_ntohl(skops->local_ip6[1]) & 0xfff00000) == + (bpf_ntohl(skops->remote_ip6[1]) & 0xfff00000)) { + switch (op) { + case BPF_SOCK_OPS_BASE_RTT: + n = bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION, + cong, sizeof(cong)); + if (!n && !__builtin_memcmp(cong, nv, sizeof(nv)+1)) { + /* Set base_rtt to 80us */ + rv = 80; + } else if (n) { + rv = n; + } else { + rv = -1; + } + break; + default: + rv = -1; + } + } else { + rv = -1; + } +#ifdef DEBUG + bpf_printk("Returning %d\n", rv); +#endif + skops->reply = rv; + return 1; +} +char _license[] SEC("license") = "GPL"; From patchwork Fri Oct 20 18:05:43 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lawrence Brakmo X-Patchwork-Id: 828756 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="n40cQSa0"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3yJYd86VKCz9t3l for ; Sat, 21 Oct 2017 05:06:04 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753561AbdJTSGC (ORCPT ); Fri, 20 Oct 2017 14:06:02 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:58038 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753223AbdJTSF7 (ORCPT ); Fri, 20 Oct 2017 14:05:59 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v9KI4WcJ005137 for ; Fri, 20 Oct 2017 11:05:59 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=oklJng8Pri5eX/1dHt8VQ8GxDCxHoVDwWmJlI7f4NNQ=; b=n40cQSa0dWuV3ZcpzIcVsP5y3fDwLSsgAzlzFfuCiT4yfPjjleM903Ac2GPincnSntQv V0gLC+QzFjBtK08hxtm8Sn1Hj8PRLmgHJgLYceb/SfLk1PdjgPh4ByJ5CpT+X62svXVu P/lvDlssnFAu8X+qx6DfEVq55QX35UbOQX0= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2dqmbgrd2d-6 (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Fri, 20 Oct 2017 11:05:59 -0700 Received: from mx-out.facebook.com (192.168.52.123) by PRN-CHUB01.TheFacebook.com (192.168.16.11) with Microsoft SMTP Server id 14.3.361.1; Fri, 20 Oct 2017 11:05:57 -0700 Received: by dev15893.prn1.facebook.com (Postfix, from userid 10340) id 7AC30C21A44; Fri, 20 Oct 2017 11:05:55 -0700 (PDT) Smtp-Origin-Hostprefix: dev From: Lawrence Brakmo Smtp-Origin-Hostname: dev15893.prn1.facebook.com To: netdev CC: Kernel Team , Alexei Starovoitov , Daniel Borkmann , Blake Matheny , Lawrence Brakmo Smtp-Origin-Cluster: prn1c29 Subject: [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme Date: Fri, 20 Oct 2017 11:05:43 -0700 Message-ID: <20171020180543.4156833-6-brakmo@fb.com> X-Mailer: git-send-email 2.9.5 In-Reply-To: <20171020180543.4156833-1-brakmo@fb.com> References: <20171020180543.4156833-1-brakmo@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-10-20_09:, , signatures=0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Readme file explaining how to create a cgroupv2 and attach one of the tcp_*_kern.o socket_ops BPF program. Signed-off-by: Lawrence Brakmo Acked-by: Daniel Borkmann Acked_by: Alexei Starovoitov --- samples/bpf/tcp_bbf.readme | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) create mode 100644 samples/bpf/tcp_bbf.readme diff --git a/samples/bpf/tcp_bbf.readme b/samples/bpf/tcp_bbf.readme new file mode 100644 index 0000000..3692056 --- /dev/null +++ b/samples/bpf/tcp_bbf.readme @@ -0,0 +1,27 @@ +This file describes how to run the tcp_*_kern.o tcp_bpf (or socket_ops) +programs. These programs attach to a cgroupv2. The following commands create +a cgroupv2 and attach a bash shell to the group. + + mkdir -p /tmp/cgroupv2 + mount -t cgroup2 none /tmp/cgroupv2 + mkdir -p /tmp/cgroupv2/foo + bash + echo $$ >> /tmp/cgroupv2/foo/cgroup.procs + +Anything that runs under this shell belongs to the foo cgroupv2 To load +(attach) one of the tcp_*_kern.o programs: + + ./load_sock_ops -l /tmp/cgroupv2/foo tcp_basertt_kern.o + +If the "-l" flag is used, the load_sock_ops program will continue to run +printing the BPF log buffer. The tcp_*_kern.o programs use special print +functions to print logging information (if enabled by the ifdef). + +If using netperf/netserver to create traffic, you need to run them under the +cgroupv2 to which the BPF programs are attached (i.e. under bash shell +attached to the cgroupv2). + +To remove (unattach) a socket_ops BPF program from a cgroupv2: + + ./load_sock_ops -r /tmp/cgroupv2/foo +