From patchwork Sat Feb 23 01:06:55 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047230
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="IIctoIcU";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnK0GTRz9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:33 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727605AbfBWBHX (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:23 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:44612 "EHLO
	mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1725814AbfBWBHW (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:22 -0500
Received: from pps.filterd (m0109332.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N0xF0N020719
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:21 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=4YIFlynjP/joAMu+4MCbb+bbIaTyAut4fGyS1r3AQSY=;
	b=IIctoIcUO1FQsQI006P5AT+Mmx1QKeOdH+OWxvcNxawrjk9KVmrTB2wFwN4PAvrSU6dH
	//hLvAmPUzFJcoIsgttbKtJ/ShgYN4kuKo4eqfAWpvLTKn4ooxnwHosmNdXwznjXQPRp
	B5AzLoTLdIjEi7oNL3g6xJ4nODerwocJc3Q=
Received: from mail.thefacebook.com ([199.201.64.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtubkr4ej-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:21 -0800
Received: from mx-out.facebook.com (2620:10d:c081:10::13) by
	mail.thefacebook.com (2620:10d:c081:35::126) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:20 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 36B605AE1524; Fri, 22 Feb 2019 17:07:19 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto
Date: Fri, 22 Feb 2019 17:06:55 -0800
Message-ID: <20190223010703.678070-2-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

From: Martin KaFai Lau <kafai@fb.com>

The next patch needs to set a bit in "prog" in
cg_skb_func_proto().  Hence, the "const struct bpf_prog *"
as a second argument will not work.

This patch removes the "const" from get_func_proto and
makes the needed changes to all get_func_proto implementations
to avoid compiler errors.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
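Note for readers: a minimal, self-contained sketch of why the "const"
has to go. The types and names below (struct prog, struct func_proto,
enum func_id, example_ops) are hypothetical stand-ins for illustration
and are not the kernel definitions. The point is only that a callback
which sets a bit in its "prog" argument cannot take a pointer to const,
and that the ops table must then carry the same non-const signature.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum func_id { FUNC_PLAIN, FUNC_NEEDS_FLAG };

struct prog {
	unsigned int flag:1;	/* stands in for a bit the callback must set */
};

struct func_proto {
	const char *name;
};

static const struct func_proto plain_proto = { "plain" };
static const struct func_proto flagged_proto = { "flagged" };

/*
 * With "const struct prog *prog" the assignment below would be rejected
 * by the compiler (write to a member of a read-only object), so the
 * parameter has to be a plain pointer.
 */
static const struct func_proto *
get_func_proto(enum func_id id, struct prog *prog)
{
	switch (id) {
	case FUNC_PLAIN:
		return &plain_proto;
	case FUNC_NEEDS_FLAG:
		prog->flag = 1;
		return &flagged_proto;
	default:
		return NULL;
	}
}

/* The ops table must use the same non-const signature for every callback. */
struct verifier_ops {
	const struct func_proto *(*get_func_proto)(enum func_id id,
						   struct prog *prog);
};

static const struct verifier_ops example_ops = {
	.get_func_proto = get_func_proto,
};
```

Because the function pointer type in the ops table changes, every
implementation assigned to it has to change in the same patch, even
those that never write to "prog".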
 drivers/media/rc/bpf-lirc.c |  2 +-
 include/linux/bpf.h         |  2 +-
 kernel/bpf/cgroup.c         |  2 +-
 kernel/trace/bpf_trace.c    | 10 +++++-----
 net/core/filter.c           | 30 +++++++++++++++---------------
 5 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index 390a722e6211..6adb7f734cb9 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -82,7 +82,7 @@ static const struct bpf_func_proto rc_pointer_rel_proto = {
 };
 
 static const struct bpf_func_proto *
-lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lirc_mode2_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_rc_repeat:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index de18227b3d95..d5ba2fc01af3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -287,7 +287,7 @@ struct bpf_verifier_ops {
 	/* return eBPF function prototype for verification */
 	const struct bpf_func_proto *
 	(*get_func_proto)(enum bpf_func_id func_id,
-			  const struct bpf_prog *prog);
+			  struct bpf_prog *prog);
 
 	/* return true if 'size' wide access at offset 'off' within bpf_context
 	 * with 'type' (read or write) is allowed
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 4e807973aa80..0de0f5d98b46 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -701,7 +701,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission);
 
 static const struct bpf_func_proto *
-cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+cgroup_dev_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f1a86a0d881d..0d2f60828d7d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -561,7 +561,7 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
 };
 
 static const struct bpf_func_proto *
-tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tracing_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
@@ -610,7 +610,7 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+kprobe_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -726,7 +726,7 @@ static const struct bpf_func_proto bpf_get_stack_proto_tp = {
 };
 
 static const struct bpf_func_proto *
-tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tp_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -790,7 +790,7 @@ static const struct bpf_func_proto bpf_perf_prog_read_value_proto = {
 };
 
 static const struct bpf_func_proto *
-pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+pe_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -873,7 +873,7 @@ static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
 };
 
 static const struct bpf_func_proto *
-raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+raw_tp_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
diff --git a/net/core/filter.c b/net/core/filter.c
index 85749f6ec789..97916eedfe69 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5508,7 +5508,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
-sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_filter_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	/* inet and inet6 sockets are created in a process
@@ -5524,7 +5524,7 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_addr_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	/* inet and inet6 sockets are created in a process
@@ -5558,7 +5558,7 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_filter_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5575,7 +5575,7 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
@@ -5592,7 +5592,7 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tc_cls_act_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_store_bytes:
@@ -5685,7 +5685,7 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+xdp_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -5723,7 +5723,7 @@ const struct bpf_func_proto bpf_sock_map_update_proto __weak;
 const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 static const struct bpf_func_proto *
-sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_ops_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_setsockopt:
@@ -5751,7 +5751,7 @@ const struct bpf_func_proto bpf_msg_redirect_map_proto __weak;
 const struct bpf_func_proto bpf_msg_redirect_hash_proto __weak;
 
 static const struct bpf_func_proto *
-sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_msg_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_msg_redirect_map:
@@ -5777,7 +5777,7 @@ const struct bpf_func_proto bpf_sk_redirect_map_proto __weak;
 const struct bpf_func_proto bpf_sk_redirect_hash_proto __weak;
 
 static const struct bpf_func_proto *
-sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_store_bytes:
@@ -5812,7 +5812,7 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+flow_dissector_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5823,7 +5823,7 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_out_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5850,7 +5850,7 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_in_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
@@ -5861,7 +5861,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_xmit_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_get_tunnel_key:
@@ -5898,7 +5898,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_seg6local_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
@@ -8124,7 +8124,7 @@ static const struct bpf_func_proto sk_reuseport_load_bytes_relative_proto = {
 
 static const struct bpf_func_proto *
 sk_reuseport_func_proto(enum bpf_func_id func_id,
-			const struct bpf_prog *prog)
+			struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_sk_select_reuseport:

From patchwork Sat Feb 23 01:06:56 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047232
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="nure6pyp";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnM0bl3z9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:35 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727633AbfBWBH1 (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:27 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:50234 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
	by vger.kernel.org with ESMTP id S1725814AbfBWBH0 (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:26 -0500
Received: from pps.filterd (m0089730.ppops.net [127.0.0.1])
	by m0089730.ppops.net (8.16.0.27/8.16.0.27) with SMTP id
	x1N0wx6J010594
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:23 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=sGG5jdokIhPNaOt3oe3IZ7B/JOfU8p02aKns0/lSHEk=;
	b=nure6pypuuFTIeovaxVCv9PodP0pzVVyn008mQNI3mq+orly+T8DlG/XQF9SVkNM7xp3
	QFUaWVxc2kWEO/uAjV4SrI0ujCOZnIIomncqR2HeP7mk7LGlfth8cqykgbx3mUC9z3Zl
	iT9jslnoY9ZY4ic0JW6negFEMZpfx77sUFM=
Received: from maileast.thefacebook.com ([199.201.65.23])
	by m0089730.ppops.net with ESMTP id 2qttb18abn-3
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:23 -0800
Received: from mx-out.facebook.com (2620:10d:c0a1:3::13) by
	mail.thefacebook.com (2620:10d:c021:18::175) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:22 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 3A9075AE1524; Fri, 22 Feb 2019 17:07:21 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
Date: Fri, 22 Feb 2019 17:06:56 -0800
Message-ID: <20190223010703.678070-3-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

From: Martin KaFai Lau <kafai@fb.com>

This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
"int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
to the egress path where the bpf prog is called by
ip_finish_output() or ip6_finish_output().  The verifier
ensures that the parameter is a tcp_sock.

This helper makes a tcp_sock enter CWR state.  It can be used
by a bpf_prog to manage the egress network bandwidth limit per
cgroupv2.  A later patch adds a sample program that shows how
it can be used to limit bandwidth usage per cgroupv2.

To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
during load time if the prog uses this new helper.
The newly added prog->enforce_expected_attach_type bit is also set
when this new helper is used.  The bit exists for backward
compatibility: until now prog->expected_attach_type has been ignored
for BPF_PROG_TYPE_CGROUP_SKB.  At attach time,
prog->expected_attach_type is enforced only if the
prog->enforce_expected_attach_type bit is set, i.e. only if the prog
uses this new helper.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
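Note for readers: the load-time and attach-time rules described above
can be modelled in plain userspace C. This is a sketch under stated
assumptions, not kernel code: struct prog, enum attach_type and both
function names are hypothetical, and the load-time function returns
-EINVAL where the patch returns a NULL prototype.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the two fields used by the rule are kept. */
enum attach_type { INET_INGRESS, INET_EGRESS };

struct prog {
	enum attach_type expected_attach_type;
	unsigned int enforce_expected_attach_type:1;
};

/*
 * Load time: models the BPF_FUNC_tcp_enter_cwr case added to
 * cg_skb_func_proto().  The helper is only handed out when the prog
 * was loaded with an egress expected_attach_type, and handing it out
 * marks the prog so the attach type is enforced later.
 */
static int request_tcp_enter_cwr(struct prog *prog)
{
	if (prog->expected_attach_type != INET_EGRESS)
		return -EINVAL;	/* the patch returns a NULL proto here */
	prog->enforce_expected_attach_type = 1;
	return 0;
}

/*
 * Attach time: models the BPF_PROG_TYPE_CGROUP_SKB case added to
 * bpf_prog_attach_check_attach_type().  Progs that never asked for
 * the helper keep the old behaviour and attach anywhere.
 */
static int check_attach(const struct prog *prog, enum attach_type attach_type)
{
	if (prog->enforce_expected_attach_type &&
	    prog->expected_attach_type != attach_type)
		return -EINVAL;
	return 0;
}
```

In this model a prog that never requests the helper attaches to either
direction regardless of its expected_attach_type, while a prog that was
granted the helper can only be attached to egress.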
 include/linux/bpf.h      |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h |  9 ++++++++-
 kernel/bpf/syscall.c     | 12 ++++++++++++
 kernel/bpf/verifier.c    |  4 ++++
 net/core/filter.c        | 25 +++++++++++++++++++++++++
 6 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d5ba2fc01af3..2d54ba7cf9dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -195,6 +195,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
 	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
+	ARG_PTR_TO_TCP_SOCK,    /* pointer to tcp_sock */
 };
 
 /* type of values returned from helper functions */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f32b3eca5a04..c6e878bdc5a6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -510,7 +510,8 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bcdd2474eee7..95b5058fa945 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2359,6 +2359,12 @@ union bpf_attr {
  *	Return
  *		A **struct bpf_tcp_sock** pointer on success, or NULL in
  *		case of failure.
+ *
+ * int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)
+ *	Description
+ *		Make a tcp_sock enter CWR state.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2457,7 +2463,8 @@ union bpf_attr {
 	FN(spin_lock),			\
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
-	FN(tcp_sock),
+	FN(tcp_sock),			\
+	FN(tcp_enter_cwr),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ec7c552af76b..9a478f2875cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1482,6 +1482,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		switch (expected_attach_type) {
+		case BPF_CGROUP_INET_INGRESS:
+		case BPF_CGROUP_INET_EGRESS:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	default:
 		return 0;
 	}
@@ -1725,6 +1733,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		return prog->enforce_expected_attach_type &&
+			prog->expected_attach_type != attach_type ?
+			-EINVAL : 0;
 	default:
 		return 0;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1b9496c41383..95fb385c6f3c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2424,6 +2424,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			return -EFAULT;
 		}
 		meta->ptr_id = reg->id;
+	} else if (arg_type == ARG_PTR_TO_TCP_SOCK) {
+		expected_type = PTR_TO_TCP_SOCK;
+		if (type != expected_type)
+			goto err_type;
 	} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
 		if (meta->func_id == BPF_FUNC_spin_lock) {
 			if (process_spin_lock(env, regno, true))
diff --git a/net/core/filter.c b/net/core/filter.c
index 97916eedfe69..ca57ef25279c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5426,6 +5426,24 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = {
 	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
 };
 
+BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
+{
+	struct sock *sk = (struct sock *)tp;
+
+	if (sk->sk_state == TCP_ESTABLISHED) {
+		tcp_enter_cwr(sk);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
+	.func        = bpf_tcp_enter_cwr,
+	.gpl_only    = false,
+	.ret_type    = RET_INTEGER,
+	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5585,6 +5603,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 #ifdef CONFIG_INET
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
+	case BPF_FUNC_tcp_enter_cwr:
+		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			prog->enforce_expected_attach_type = 1;
+			return &bpf_tcp_enter_cwr_proto;
+		} else {
+			return NULL;
+		}
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);

From patchwork Sat Feb 23 01:06:57 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047231
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="SDUycvJR";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnL1rbdz9sBR
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:34 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727622AbfBWBH0 (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:26 -0500
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:59040 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1727609AbfBWBH0 (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:26 -0500
Received: from pps.filterd (m0148461.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N13D6d029037
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:24 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=XjKGTCDWUH2vOFmdoHnjnRduI7nJTNxFKBqN7HGlQoo=;
	b=SDUycvJRMx/C97UKKDV33/bSPBySmIiblE8HSRMTrMsdiHIs61WKsg9GIQAVHYi3Q/Pp
	zOzTpDCTJ0m+CS/1mEnucheQGBFs5rm2jbV2DwDbgQvozcIGgKJs7865kgHx925rWMhl
	TY9JdWvGEA31BIcdV9Jh+T3Eyr5XL7Ilvco=
Received: from mail.thefacebook.com ([199.201.64.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtuea044m-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:24 -0800
Received: from mx-out.facebook.com (2620:10d:c081:10::13) by
	mail.thefacebook.com (2620:10d:c081:35::130) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:23 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 459B15AE1524; Fri, 22 Feb 2019 17:07:23 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 3/9] bpf: Test bpf_tcp_enter_cwr in test_verifier
Date: Fri, 22 Feb 2019 17:06:57 -0800
Message-ID: <20190223010703.678070-4-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

These tests check that the verifier requires arg1 of
BPF_FUNC_tcp_enter_cwr to be of ARG_PTR_TO_TCP_SOCK type.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
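Note for readers: the two test cases exercise one rule, which can be
sketched in standalone C as below. The enums reuse the names from the
series for readability but are local stand-ins, check_arg() is a
hypothetical function, the error code is illustrative, and the
handling of ARG_PTR_TO_SOCK_COMMON is a simplifying assumption.

```c
#include <errno.h>

/* Local stand-ins for the verifier's register and argument types. */
enum reg_type { PTR_TO_SOCK_COMMON, PTR_TO_SOCKET, PTR_TO_TCP_SOCK };
enum arg_type { ARG_PTR_TO_SOCK_COMMON, ARG_PTR_TO_TCP_SOCK };

/*
 * Returns 0 if a register of type "reg" may be passed for an argument
 * declared as "arg", or a negative error otherwise.
 */
static int check_arg(enum arg_type arg, enum reg_type reg)
{
	switch (arg) {
	case ARG_PTR_TO_TCP_SOCK:
		/* Exact match required, as in the check added by patch 2. */
		return reg == PTR_TO_TCP_SOCK ? 0 : -EACCES;
	case ARG_PTR_TO_SOCK_COMMON:
		/* Assumption: any socket pointer type is accepted here. */
		return 0;
	}
	return -EFAULT;
}
```

The first test passes skb->sk, a sock_common pointer, straight to the
helper and expects a reject. The second converts it with
bpf_tcp_sock() first, so the register holds a tcp_sock pointer and the
call is accepted.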
 tools/testing/selftests/bpf/verifier/sock.c | 33 +++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
index 0ddfdf76aba5..b07a083eeb59 100644
--- a/tools/testing/selftests/bpf/verifier/sock.c
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -382,3 +382,36 @@
 	.result = REJECT,
 	.errstr = "type=tcp_sock expected=sock",
 },
+{
+	"bpf_tcp_enter_cwr(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_enter_cwr),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=tcp_sock",
+},
+{
+	"bpf_tcp_enter_cwr(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_enter_cwr),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},

From patchwork Sat Feb 23 01:06:58 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047236
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="FHzS45dr";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnQ5Xv5z9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:38 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727686AbfBWBHh (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:37 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:44628 "EHLO
	mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1725814AbfBWBH3 (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:29 -0500
Received: from pps.filterd (m0109332.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N0x0Uj020528
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:27 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=keJb9k0TuvmHBo0p+AQA422QrszLTKbFzrT2nPDe1a0=;
	b=FHzS45drQls1FtD+P2y0RipD0juYvuf90stPq2vonoKJO5LwAeWlzYTo+eq0rQUQVAei
	/JxrTievGR/RR3UOqWgIQ+coqcpZYjSc/fgBvShq0MEmayJx8BofOwz6hdwlrjk9/AVJ
	cp379OHVjt2bjuW87lca9h4Ssbmilz2rJTs=
Received: from maileast.thefacebook.com ([199.201.65.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtubkr4es-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:27 -0800
Received: from mx-out.facebook.com (2620:10d:c0a1:3::13) by
	mail.thefacebook.com (2620:10d:c021:18::176) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:26 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 526F55AE1524; Fri, 22 Feb 2019 17:07:25 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
Date: Fri, 22 Feb 2019 17:06:58 -0800
Message-ID: <20190223010703.678070-5-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
"int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
be attached to the ingress and egress path. The helper is needed
because this type of bpf_prog cannot modify the skb directly.

This helper is used to set the ECN field of ECN capable IP packets to ce
(congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
used by a bpf_prog to manage egress or ingress network bandwidth limits
per cgroupv2 by inducing an ECN response in the TCP sender.
This works best when using DCTCP.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 10 +++++++++-
 net/core/filter.c        | 14 ++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)
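
Note (not part of the commit): the return convention described above can
be illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel implementation: model_ecn_set_ce() and
the ECN_* names are hypothetical, the model operates on a bare TOS /
traffic class byte, and it ignores the IPv4 header checksum update that
the real code has to perform.

```c
#include <stdint.h>

/* ECN codepoints in the low two bits of the IPv4 TOS / IPv6 traffic class */
#define ECN_NOT_ECT 0x0
#define ECN_ECT_1   0x1
#define ECN_ECT_0   0x2
#define ECN_CE      0x3
#define ECN_MASK    0x3

/*
 * Hypothetical userspace model: mark the byte as CE if the packet is
 * ECN capable. Returns 1 if the field is CE on return, 0 if the packet
 * is not ECN capable and was left untouched.
 */
static int model_ecn_set_ce(uint8_t *tos)
{
	if ((*tos & ECN_MASK) == ECN_NOT_ECT)
		return 0;
	*tos |= ECN_CE;	/* the upper six (DSCP) bits are preserved */
	return 1;
}
```

A caller that gets 0 back knows the packet could not be marked and has
to fall back to another congestion signal.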

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 95b5058fa945..fc646f3eaf9b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2365,6 +2365,13 @@ union bpf_attr {
  *		Make a tcp_sock enter CWR state.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ *	Description
+ *		Sets ECN of IP header to ce (congestion encountered) if
+ *		current value is ect (ECN capable). Works with IPv6 and IPv4.
+ *	Return
+ *		1 if set, 0 if not set.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2464,7 +2471,8 @@ union bpf_attr {
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
 	FN(tcp_sock),			\
-	FN(tcp_enter_cwr),
+	FN(tcp_enter_cwr),		\
+	FN(skb_ecn_set_ce),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index ca57ef25279c..955369c6ed30 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5444,6 +5444,18 @@ static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
 	.ret_type    = RET_INTEGER,
 	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
 };
+
+BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
+{
+	return INET_ECN_set_ce(skb);
+}
+
+static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
+	.func		= bpf_skb_ecn_set_ce,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5610,6 +5622,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 		} else {
 			return NULL;
 		}
+	case BPF_FUNC_skb_ecn_set_ce:
+		return &bpf_skb_ecn_set_ce_proto;
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);

From patchwork Sat Feb 23 01:06:59 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047234
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="PSX07Ck0";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnN5PN6z9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:36 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727659AbfBWBHa (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:30 -0500
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:54468 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1727648AbfBWBH3 (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:29 -0500
Received: from pps.filterd (m0109334.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N13Kf4006329
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:29 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=QLG66rWgrxCE17qC3jT5Gkv6zmMeppEQFp4+mzjL3cU=;
	b=PSX07Ck02YWbj4zPbOlxHmTRcLpoNVcAeQ+mTUKpqJhYdN1ffc6Y4LFckoz0OSzs5BMq
	ut7KGlRgxin4cJAILOc8iaIJVA552Z2r64HTNDMYRdsnRud2pXJ5iLFhWLMdhJGHvvwz
	9hk+UxTXKiDe1bUHx3yh0ViDRvYUFqnMGzM=
Received: from mail.thefacebook.com ([199.201.64.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtu86r524-3
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:29 -0800
Received: from mx-out.facebook.com (2620:10d:c081:10::13) by
	mail.thefacebook.com (2620:10d:c081:35::130) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:28 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 5ECB15AE1524; Fri, 22 Feb 2019 17:07:27 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 5/9] bpf: Add bpf helper
	bpf_tcp_check_probe_timer
Date: Fri, 22 Feb 2019 17:06:59 -0800
Message-ID: <20190223010703.678070-6-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch adds a new bpf helper BPF_FUNC_tcp_check_probe_timer
"int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, u32 when_us)".
It is added to BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently
can be attached to the ingress and egress path.

To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
during load time if the prog uses this new helper.
The newly added prog->enforce_expected_attach_type bit will also be set
if this new helper is used.  This bit exists for backward compatibility,
because prog->expected_attach_type is currently ignored for
BPF_PROG_TYPE_CGROUP_SKB.  At attach time,
prog->expected_attach_type is only enforced if the
prog->enforce_expected_attach_type bit is set,
i.e. only if this new helper is used by the prog.

The function forces when_us to be at least TCP_TIMEOUT_MIN (currently
2 jiffies) and no more than TCP_RTO_MIN (currently 200ms).

When using a bpf_prog to limit the egress bandwidth of a cgroup,
it can happen that we drop a packet of a connection that has no
packets out. In this case, the connection may not retry sending
the packet until the probe timer fires. Since the default value
of the probe timer is at least 200ms, this can introduce link
underutilization (i.e. the cgroup egress bandwidth being smaller
than the specified rate) and thus increased tail latency.
This helper function allows for setting a smaller probe timer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 12 +++++++++++-
 net/core/filter.c        | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)
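
Note (not part of the commit): the bounding of when_us described above
can be sketched in userspace as follows. This is a minimal model under
stated assumptions: HZ is assumed to be 1000, the MODEL_* constants and
both function names are hypothetical, and the usecs to jiffies conversion
is a simplified round-up rather than the kernel's usecs_to_jiffies().

```c
/* Assumed values for illustration; in the kernel these depend on HZ. */
#define MODEL_HZ		1000ULL
#define MODEL_TIMEOUT_MIN	2ULL			/* 2 jiffies */
#define MODEL_RTO_MIN		(MODEL_HZ / 5)		/* 200ms */

/* Hypothetical usecs -> jiffies conversion, rounding up. */
static unsigned long long model_usecs_to_jiffies(unsigned long long us)
{
	return (us * MODEL_HZ + 999999ULL) / 1000000ULL;
}

/* Timer value (in jiffies) that the model would program for when_us. */
static unsigned long long model_probe_when(unsigned long long when_us)
{
	unsigned long long when = model_usecs_to_jiffies(when_us);

	if (when < MODEL_TIMEOUT_MIN)
		when = MODEL_TIMEOUT_MIN;
	else if (when > MODEL_RTO_MIN)
		when = MODEL_RTO_MIN;
	return when;
}
```

With these assumptions, a request of 50000us yields 50 jiffies, while
requests of 0us and 1000000us are bounded to 2 and 200 jiffies.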

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fc646f3eaf9b..5d0bed852800 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2372,6 +2372,15 @@ union bpf_attr {
  *		current value is ect (ECN capable). Works with IPv6 and IPv4.
  *	Return
  *		1 if set, 0 if not set.
+ *
+ * int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, int when_us)
+ *	Description
+ *		Checks that there are no packets out and there is no pending
+ *		timer. If both of these are true, it bounds when_us by
+ *		TCP_TIMEOUT_MIN (2 jiffies) or TCP_RTO_MIN (200ms) and
+ *		sets the probe timer.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2472,7 +2481,8 @@ union bpf_attr {
 	FN(sk_fullsock),		\
 	FN(tcp_sock),			\
 	FN(tcp_enter_cwr),		\
-	FN(skb_ecn_set_ce),
+	FN(skb_ecn_set_ce),		\
+	FN(tcp_check_probe_timer),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 955369c6ed30..7d7026768840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5456,6 +5456,31 @@ static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 };
+
+BPF_CALL_2(bpf_tcp_check_probe_timer, struct tcp_sock *, tp, u32, when_us)
+{
+	struct sock *sk = (struct sock *) tp;
+	unsigned long when = usecs_to_jiffies(when_us);
+
+	if (!tp->packets_out && !inet_csk(sk)->icsk_pending) {
+		if (when < TCP_TIMEOUT_MIN)
+			when = TCP_TIMEOUT_MIN;
+		else if (when > TCP_RTO_MIN)
+			when = TCP_RTO_MIN;
+
+		tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+				     when, TCP_RTO_MAX, NULL);
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_tcp_check_probe_timer_proto = {
+	.func		= bpf_tcp_check_probe_timer,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TCP_SOCK,
+	.arg2_type	= ARG_ANYTHING,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5624,6 +5649,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 		}
 	case BPF_FUNC_skb_ecn_set_ce:
 		return &bpf_skb_ecn_set_ce_proto;
+	case BPF_FUNC_tcp_check_probe_timer:
+		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			prog->enforce_expected_attach_type = 1;
+			return &bpf_tcp_check_probe_timer_proto;
+		} else {
+			return NULL;
+		}
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);

From patchwork Sat Feb 23 01:07:00 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047235
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="VatzmNpY";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnP6wm9z9sBL
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:37 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727668AbfBWBHd (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:33 -0500
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:59062 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1727648AbfBWBHc (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:32 -0500
Received: from pps.filterd (m0148461.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N13CTe029034
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:31 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=QrAVmi+1v502/TIu+QAG0303RQgysxGX6imY/cMmlXw=;
	b=VatzmNpYZalKYUYVqdtZKeu1K+ehLY6YC0a95I2YJYQayrh01ua8NHkUUa4ESoQ2tDx8
	2lxPyCLdr+6LqNY16hNUO6+SgxMryb7qdZcqc+4bb8edGSzAm/LXj3cqnG8n7KE1C73c
	b726ZSe96L0mWJ18tyUoc8vc8HIt2D+Nin4=
Received: from maileast.thefacebook.com ([199.201.65.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtuea0450-2
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:31 -0800
Received: from mx-out.facebook.com (2620:10d:c0a1:3::13) by
	mail.thefacebook.com (2620:10d:c021:18::175) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:29 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 682C85AE1524; Fri, 22 Feb 2019 17:07:29 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 6/9] bpf: sync bpf.h to tools and update
	bpf_helpers.h
Date: Fri, 22 Feb 2019 17:07:00 -0800
Message-ID: <20190223010703.678070-7-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch syncs the uapi bpf.h to tools/ and also updates
bpf_helpers.h in tools/.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 tools/include/uapi/linux/bpf.h            | 27 ++++++++++++++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++++
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bcdd2474eee7..5d0bed852800 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2359,6 +2359,28 @@ union bpf_attr {
  *	Return
  *		A **struct bpf_tcp_sock** pointer on success, or NULL in
  *		case of failure.
+ *
+ * int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)
+ *	Description
+ *		Make a tcp_sock enter CWR state.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ *	Description
+ *		Sets ECN of IP header to ce (congestion encountered) if
+ *		current value is ect (ECN capable). Works with IPv6 and IPv4.
+ *	Return
+ *		1 if set, 0 if not set.
+ *
+ * int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, int when_us)
+ *	Description
+ *		Checks that there are no packets out and there is no pending
+ *		timer. If both of these are true, it bounds when_us by
+ *		TCP_TIMEOUT_MIN (2 jiffies) or TCP_RTO_MIN (200ms) and
+ *		sets the probe timer.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2457,7 +2479,10 @@ union bpf_attr {
 	FN(spin_lock),			\
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
-	FN(tcp_sock),
+	FN(tcp_sock),			\
+	FN(tcp_enter_cwr),		\
+	FN(skb_ecn_set_ce),		\
+	FN(tcp_check_probe_timer),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d9999f1ed1d2..8aec59624ebc 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -180,6 +180,12 @@ static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
 	(void *) BPF_FUNC_sk_fullsock;
 static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
 	(void *) BPF_FUNC_tcp_sock;
+static int (*bpf_tcp_enter_cwr)(struct bpf_tcp_sock *tp) =
+	(void *) BPF_FUNC_tcp_enter_cwr;
+static int (*bpf_skb_ecn_set_ce)(void *ctx) =
+	(void *) BPF_FUNC_skb_ecn_set_ce;
+static int (*bpf_tcp_check_probe_timer)(struct bpf_tcp_sock *tp, int when_us) =
+	(void *) BPF_FUNC_tcp_check_probe_timer;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions

From patchwork Sat Feb 23 01:07:01 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047237
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="Ur7rpM/4";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnR407sz9sBL
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:39 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727677AbfBWBHg (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:36 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:43788 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
	by vger.kernel.org with ESMTP id S1727648AbfBWBHf (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:35 -0500
Received: from pps.filterd (m0001303.ppops.net [127.0.0.1])
	by m0001303.ppops.net (8.16.0.27/8.16.0.27) with SMTP id
	x1N126P7003907
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:33 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=rn5jcVVPzJwQc/kogkatwerp9NWevg560PQEksl3eM0=;
	b=Ur7rpM/4G1ran3x9ywJUUophKTSnD347pe7fIsUtO5SGEesTAFiaLmr+N4sxK/k3R8Ze
	2d155M6rKrRx+uwu+auO8R7gdIg0JYTqjVfFCc5ez/7VdFH5cHgqmfpJrev5qaFzgciD
	1wZykvVeJFTdlwPxdh5I9AAss0PzYP5BMVU=
Received: from mail.thefacebook.com ([199.201.64.23])
	by m0001303.ppops.net with ESMTP id 2qtupd02qq-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:33 -0800
Received: from mx-out.facebook.com (2620:10d:c081:10::13) by
	mail.thefacebook.com (2620:10d:c081:35::127) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:32 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id 711415AE1524; Fri, 22 Feb 2019 17:07:31 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 7/9] bpf: Sample NRM BPF program to limit egress
	bw
Date: Fri, 22 Feb 2019 17:07:01 -0800
Message-ID: <20190223010703.678070-8-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

A cgroup skb BPF program to limit cgroup output bandwidth.
It uses a modified virtual token bucket queue to limit average
egress bandwidth. The implementation uses credits instead of tokens.
Negative credits imply that queueing would have happened (this is
a virtual queue, so no queueing is done by it; however, queueing may
occur at the actual qdisc, which is not used for rate limiting).

This implementation uses 3 thresholds, one to start marking packets and
the other two to drop packets:
                                 CREDIT
       - <--------------------------|------------------------> +
             |    |          |      0
             |  Large pkt    |
             |  drop thresh  |
  Small pkt drop             Mark threshold
      thresh

The effect of marking depends on the type of packet:
a) If the packet is ECN enabled and it is a TCP packet, then the packet
   is ECN marked. The current mark threshold is tuned for DCTCP.
b) If the packet is a TCP packet, then we probabilistically call tcp_cwr
   to reduce the congestion window. The current implementation uses a linear
   distribution (0% probability at marking threshold, 100% probability
   at drop threshold).
c) If the packet is not a TCP packet, then it is dropped.

If the credit is below the drop threshold, the packet is dropped. If it
is a TCP packet, then it also calls tcp_cwr since packets dropped by
a cgroup skb BPF program do not automatically trigger a call to
tcp_cwr in the current kernel code.

This BPF program actually uses 2 drop thresholds, one threshold
for larger packets (>= 120 bytes) and another for smaller packets. This
protects smaller packets such as SYNs, ACKs, etc.

The default bandwidth limit is set at 1Gbps but this can be changed by
a user program through a shared BPF map. In addition, by default this BPF
program does not limit connections using loopback. This behavior can be
overridden by the user program. There is also an option to calculate
some statistics, such as percent of packets marked or dropped, which
the user program can access.

A later patch provides such a user program (nrm.c).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile       |   2 +
 samples/bpf/nrm.h          |  31 ++++++
 samples/bpf/nrm_kern.h     | 137 ++++++++++++++++++++++++++
 samples/bpf/nrm_out_kern.c | 190 +++++++++++++++++++++++++++++++++++++
 4 files changed, 360 insertions(+)
 create mode 100644 samples/bpf/nrm.h
 create mode 100644 samples/bpf/nrm_kern.h
 create mode 100644 samples/bpf/nrm_out_kern.c
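
Note (not part of the commit): the credit accounting described above can
be modelled in userspace as below. This is a sketch, not the BPF program
itself: the model_* names are hypothetical, locking is omitted, and the
addition is done in 64 bits before the cap is applied. The rate is kept
as bytes per ns scaled by 2^20; one Mbps is 131.072 in that unit, which
the sample approximates with a factor of 128, so the default value of
1024 gives 131072, i.e. exactly 0.125 bytes per ns (10^9 bits/s).

```c
#include <stdint.h>

#define MODEL_MAX_BYTES_PER_PACKET	1500
#define MODEL_MAX_CREDIT		(100 * MODEL_MAX_BYTES_PER_PACKET)

/* Rate is stored as bytes per ns, scaled by 2^20. */
#define MODEL_CREDIT_PER_NS(delta, rate) \
	((((uint64_t)(delta)) * (rate)) >> 20)

/* Mbps -> scaled bytes per ns, mirroring the "rate * 128" in the sample. */
static unsigned int model_rate_from_mbps(unsigned int mbps)
{
	return mbps * 128;
}

/*
 * Add the credit earned during delta_ns (if time moved forward), cap it
 * at MODEL_MAX_CREDIT, then charge the packet length. A negative result
 * means the virtual queue would have been backlogged.
 */
static int model_credit_update(int credit, int64_t delta_ns,
			       unsigned int rate, int len)
{
	int64_t c = credit;

	if (delta_ns > 0) {
		c += (int64_t)MODEL_CREDIT_PER_NS(delta_ns, rate);
		if (c > MODEL_MAX_CREDIT)
			c = MODEL_MAX_CREDIT;
	}
	return (int)(c - len);
}
```

For example, starting from zero credit at the default rate, 1ms of idle
time earns 125000 bytes of credit, so a 1500 byte packet leaves 123500.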

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a0ef7eddd0b3..897b467066fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -167,6 +167,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += nrm_out_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -266,6 +267,7 @@ $(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
 $(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
+$(obj)/nrm_out_kern.o: $(src)/nrm.h $(src)/nrm_kern.h
 
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
diff --git a/samples/bpf/nrm.h b/samples/bpf/nrm.h
new file mode 100644
index 000000000000..ea89d6027ff0
--- /dev/null
+++ b/samples/bpf/nrm.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Include file for NRM programs
+ */
+struct nrm_vqueue {
+	struct bpf_spin_lock lock;
+	/* 4 byte hole */
+	unsigned long long lasttime;	/* In ns */
+	int credit;			/* In bytes */
+	unsigned int rate;		/* In bytes per NS << 20 */
+};
+
+struct nrm_queue_stats {
+	unsigned long rate;		/* in Mbps*/
+	unsigned long stats:1,		/* get NRM stats (marked, dropped,..) */
+		loopback:1;		/* also limit flows using loopback */
+	unsigned long long pkts_marked;
+	unsigned long long bytes_marked;
+	unsigned long long pkts_dropped;
+	unsigned long long bytes_dropped;
+	unsigned long long pkts_total;
+	unsigned long long bytes_total;
+	unsigned long long firstPacketTime;
+	unsigned long long lastPacketTime;
+};
diff --git a/samples/bpf/nrm_kern.h b/samples/bpf/nrm_kern.h
new file mode 100644
index 000000000000..e48d4d2944a9
--- /dev/null
+++ b/samples/bpf/nrm_kern.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Include file for sample NRM BPF programs
+ */
+#define KBUILD_MODNAME "foo"
+#include <stddef.h>
+#include <stdbool.h>
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/ipv6.h>
+#include <uapi/linux/in.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/filter.h>
+#include <uapi/linux/pkt_cls.h>
+#include <net/ipv6.h>
+#include <net/inet_ecn.h>
+#include "bpf_endian.h"
+#include "bpf_helpers.h"
+#include "nrm.h"
+
+#define DROP_PKT	0
+#define ALLOW_PKT	1
+#define TCP_ECN_OK	1
+
+#define NRM_DEBUG 0  // Set to 1 to enable debugging
+#if NRM_DEBUG
+#define bpf_printk(fmt, ...)					\
+({								\
+	char ____fmt[] = fmt;					\
+	bpf_trace_printk(____fmt, sizeof(____fmt),		\
+			 ##__VA_ARGS__);			\
+})
+#else
+#define bpf_printk(fmt, ...)
+#endif
+
+#define INITIAL_CREDIT_PACKETS	100
+#define MAX_BYTES_PER_PACKET	1500
+#define MARK_THRESH		(80 * MAX_BYTES_PER_PACKET)
+#define DROP_THRESH		(80 * 5 * MAX_BYTES_PER_PACKET)
+#define LARGE_PKT_DROP_THRESH	(DROP_THRESH - (15 * MAX_BYTES_PER_PACKET))
+#define MARK_REGION_SIZE	(LARGE_PKT_DROP_THRESH - MARK_THRESH)
+#define LARGE_PKT_THRESH	120
+#define MAX_CREDIT		(100 * MAX_BYTES_PER_PACKET)
+#define INIT_CREDIT		(INITIAL_CREDIT_PACKETS * MAX_BYTES_PER_PACKET)
+
+// rate in bytes per ns << 20
+#define CREDIT_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20)
+
+struct bpf_map_def SEC("maps") queue_state = {
+	.type = BPF_MAP_TYPE_CGROUP_STORAGE,
+	.key_size = sizeof(struct bpf_cgroup_storage_key),
+	.value_size = sizeof(struct nrm_vqueue),
+};
+BPF_ANNOTATE_KV_PAIR(queue_state, struct bpf_cgroup_storage_key,
+		     struct nrm_vqueue);
+
+struct bpf_map_def SEC("maps") queue_stats = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct nrm_queue_stats),
+	.max_entries = 1,
+};
+BPF_ANNOTATE_KV_PAIR(queue_stats, int, struct nrm_queue_stats);
+
+struct nrm_pkt_info {
+	bool	is_ip;
+	bool	is_tcp;
+	short	ecn;
+};
+
+static __always_inline void nrm_get_pkt_info(struct __sk_buff *skb,
+					     struct nrm_pkt_info *pkti)
+{
+	struct iphdr iph;
+	struct ipv6hdr *ip6h;
+
+	bpf_skb_load_bytes(skb, 0, &iph, 12);
+	if (iph.version == 6) {
+		ip6h = (struct ipv6hdr *)&iph;
+		pkti->is_ip = true;
+		pkti->is_tcp = (ip6h->nexthdr == 6);
+		pkti->ecn = (ip6h->flow_lbl[0] >> 4) & INET_ECN_MASK;
+	} else if (iph.version == 4) {
+		pkti->is_ip = true;
+		pkti->is_tcp = (iph.protocol == 6);
+		pkti->ecn = iph.tos & INET_ECN_MASK;
+	} else {
+		pkti->is_ip = false;
+		pkti->is_tcp = false;
+		pkti->ecn = 0;
+	}
+}
+
+static __always_inline void nrm_init_vqueue(struct nrm_vqueue *qdp, int rate)
+{
+		bpf_printk("Initializing queue_state, rate:%d\n", rate * 128);
+		qdp->lasttime = bpf_ktime_get_ns();
+		qdp->credit = INIT_CREDIT;
+		qdp->rate = rate * 128;
+}
+
+static __always_inline void nrm_update_stats(struct nrm_queue_stats *qsp,
+					     int len,
+					     unsigned long long curtime,
+					     bool congestion_flag,
+					     bool drop_flag)
+{
+	if (qsp != NULL) {
+		// Following is needed for work conserving
+		__sync_add_and_fetch(&(qsp->bytes_total), len);
+		if (qsp->stats) {
+			// Optionally update statistics
+			if (qsp->firstPacketTime == 0)
+				qsp->firstPacketTime = curtime;
+			qsp->lastPacketTime = curtime;
+			__sync_add_and_fetch(&(qsp->pkts_total), 1);
+			if (congestion_flag) {
+				__sync_add_and_fetch(&(qsp->pkts_marked), 1);
+				__sync_add_and_fetch(&(qsp->bytes_marked), len);
+			}
+			if (drop_flag) {
+				__sync_add_and_fetch(&(qsp->pkts_dropped), 1);
+				__sync_add_and_fetch(&(qsp->bytes_dropped),
+						     len);
+			}
+		}
+	}
+}
diff --git a/samples/bpf/nrm_out_kern.c b/samples/bpf/nrm_out_kern.c
new file mode 100644
index 000000000000..2d4c5a647daa
--- /dev/null
+++ b/samples/bpf/nrm_out_kern.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Sample Network Resource Manager (NRM) BPF program.
+ *
+ * A cgroup skb BPF egress program to limit cgroup output bandwidth.
+ * It uses a modified virtual token bucket queue to limit average
+ * egress bandwidth. The implementation uses credits instead of tokens.
+ * Negative credits imply that queueing would have happened (this is
+ * a virtual queue, so no queueing is done by it; however, queueing may
+ * occur at the actual qdisc, which is not used for rate limiting).
+ *
+ * This implementation uses 3 thresholds, one to start marking packets and
+ * the other two to drop packets:
+ *                                  CREDIT
+ *        - <--------------------------|------------------------> +
+ *              |    |          |      0
+ *              |  Large pkt    |
+ *              |  drop thresh  |
+ *   Small pkt drop             Mark threshold
+ *       thresh
+ *
+ * The effect of marking depends on the type of packet:
+ * a) If the packet is ECN enabled and it is a TCP packet, then the packet
+ *    is ECN marked.
+ * b) If the packet is a TCP packet, then we probabilistically call tcp_cwr
+ *    to reduce the congestion window. The current implementation uses a linear
+ *    distribution (0% probability at marking threshold, 100% probability
+ *    at drop threshold).
+ * c) If the packet is not a TCP packet, then it is dropped.
+ *
+ * If the credit is below the drop threshold, the packet is dropped. If it
+ * is a TCP packet, then it also calls tcp_cwr since packets dropped by
+ * a cgroup skb BPF program do not automatically trigger a call to
+ * tcp_cwr in the current kernel code.
+ *
+ * This BPF program actually uses 2 drop thresholds, one threshold
+ * for larger packets (>= 120 bytes) and another for smaller packets. This
+ * protects smaller packets such as SYNs, ACKs, etc.
+ *
+ * The default bandwidth limit is set at 1Gbps but this can be changed by
+ * a user program through a shared BPF map. In addition, by default this BPF
+ * program does not limit connections using loopback. This behavior can be
+ * overridden by the user program. There is also an option to calculate
+ * some statistics, such as percent of packets marked or dropped, which
+ * the user program can access.
+ *
+ * A later patch provides such a user program (nrm.c).
+ */
+
+#include "nrm_kern.h"
+
+SEC("cgroup_skb/egress")
+int _nrm_out_cg(struct __sk_buff *skb)
+{
+	struct nrm_pkt_info pkti;
+	int len = skb->len;
+	unsigned int queue_index = 0;
+	unsigned long long curtime;
+	int credit;
+	signed long long delta = 0, zero = 0;
+	int max_credit = MAX_CREDIT;
+	bool congestion_flag = false;
+	bool drop_flag = false;
+	bool cwr_flag = false;
+	struct nrm_vqueue *qdp;
+	struct nrm_queue_stats *qsp = NULL;
+	int rv = ALLOW_PKT;
+
+	qsp = bpf_map_lookup_elem(&queue_stats, &queue_index);
+	if (qsp != NULL && !qsp->loopback && (skb->ifindex == 1))
+		return ALLOW_PKT;
+
+	nrm_get_pkt_info(skb, &pkti);
+
+	// We may want to account for the length of headers in len
+	// calculation, like ETH header + overhead, especially if it
+	// is a gso packet. But I am not doing it right now.
+
+	qdp = bpf_get_local_storage(&queue_state, 0);
+	if (!qdp)
+		return ALLOW_PKT;
+	else if (qdp->lasttime == 0)
+		nrm_init_vqueue(qdp, 1024);
+
+	curtime = bpf_ktime_get_ns();
+
+	// Begin critical section
+	bpf_spin_lock(&qdp->lock);
+	credit = qdp->credit;
+	delta = curtime - qdp->lasttime;
+	/* delta < 0 implies that another process with a curtime greater
+	 * than ours beat us to the critical section and already added
+	 * the new credit, so we should not add it ourselves
+	 */
+	if (delta > 0) {
+		qdp->lasttime = curtime;
+		credit += CREDIT_PER_NS(delta, qdp->rate);
+		if (credit > MAX_CREDIT)
+			credit = MAX_CREDIT;
+	}
+	credit -= len;
+	qdp->credit = credit;
+	bpf_spin_unlock(&qdp->lock);
+	// End critical section
+
+	// Check if we should update rate
+	if (qsp != NULL && (qsp->rate * 128) != qdp->rate) {
+		qdp->rate = qsp->rate * 128;
+		bpf_printk("Updating rate: %d (1sec:%llu bits)\n",
+			   (int)qdp->rate,
+			   CREDIT_PER_NS(1000000000, qdp->rate) * 8);
+	}
+
+	// Set flags (drop, congestion, cwr)
+	// Dropping => we are congested, so ignore congestion flag
+	if (pkti.is_ip) {
+		if (credit < -DROP_THRESH ||
+		    (len > LARGE_PKT_THRESH &&
+		     credit < -LARGE_PKT_DROP_THRESH)) {
+			// Very congested, set drop flag
+			drop_flag = true;
+			if (pkti.is_tcp && pkti.ecn == 0)
+				cwr_flag = true;
+		} else if (credit < 0) {
+			// Congested, set congestion flag
+			if (pkti.is_tcp || pkti.ecn) {
+				if (credit < -MARK_THRESH)
+					congestion_flag = true;
+				else
+					congestion_flag = false;
+			} else {
+				congestion_flag = true;
+			}
+		}
+
+		if (congestion_flag) {
+			if (!pkti.ecn || !bpf_skb_ecn_set_ce(skb)) {
+				if (pkti.is_tcp) {
+					u32 rand = bpf_get_prandom_u32();
+
+					if (-credit >= MARK_THRESH +
+					    (rand % MARK_REGION_SIZE)) {
+						// Do cong avoidance
+						cwr_flag = true;
+					}
+				} else if (len > LARGE_PKT_THRESH) {
+					// Problem if too many small packets?
+					drop_flag = true;
+					congestion_flag = false;
+				}
+			}
+		}
+
+		if (pkti.is_tcp && (drop_flag || cwr_flag)) {
+			struct bpf_sock *sk;
+			struct bpf_tcp_sock *tp = NULL;
+
+			sk = skb->sk;
+			if (sk) {
+				sk = bpf_sk_fullsock(sk);
+				if (sk)
+					tp = bpf_tcp_sock(sk);
+			}
+			if (tp && drop_flag)
+				bpf_tcp_check_probe_timer(tp, 20000);
+			if (tp && cwr_flag)
+				bpf_tcp_enter_cwr(tp);
+		}
+
+		if (drop_flag)
+			rv = DROP_PKT;
+
+	} else if (credit < -MARK_THRESH) {
+		drop_flag = true;
+		rv = DROP_PKT;
+	}
+
+	nrm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
+
+	if (rv == DROP_PKT)
+		__sync_add_and_fetch(&(qdp->credit), len);
+
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
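
The credit accounting and the three thresholds described in the header
comment of the BPF program above can be sketched in plain user-space C.
This is only an illustration under stated assumptions, not the code in
the patch: every sk_* name and every SK_* value below is hypothetical
(the real constants live in nrm_kern.h and may differ), the marking
region is assumed to span from the mark threshold to the large packet
drop threshold, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration; not taken from nrm_kern.h. */
#define SK_MAX_CREDIT		(100 * 1000)	/* bytes */
#define SK_MARK_THRESH		(20 * 1000)
#define SK_LARGE_DROP_THRESH	(50 * 1000)
#define SK_SMALL_DROP_THRESH	(80 * 1000)
#define SK_LARGE_PKT_BYTES	120

enum sk_verdict { SK_PASS, SK_MARK, SK_DROP };

struct sk_vqueue {
	int64_t credit;			/* negative means a virtual backlog */
	uint64_t lasttime_ns;		/* last time credit was added */
	uint64_t rate_bytes_per_sec;
};

/* Add the credit earned since the last update (capped at the maximum),
 * then charge the packet length. Returns the resulting credit.
 */
int64_t sk_charge(struct sk_vqueue *q, uint64_t now_ns, int len)
{
	assert(len >= 0);
	if (now_ns > q->lasttime_ns) {
		uint64_t delta = now_ns - q->lasttime_ns;

		/* Fine for short intervals; a long idle period at a
		 * high rate could overflow this product.
		 */
		q->credit += (int64_t)(delta * q->rate_bytes_per_sec /
				       1000000000ULL);
		if (q->credit > SK_MAX_CREDIT)
			q->credit = SK_MAX_CREDIT;
		q->lasttime_ns = now_ns;
	}
	q->credit -= len;
	return q->credit;
}

/* Map the credit to a verdict. Small packets get the more lenient
 * (more negative) drop threshold, which protects SYNs and ACKs.
 */
enum sk_verdict sk_classify(int64_t credit, int len)
{
	if (credit < -SK_SMALL_DROP_THRESH)
		return SK_DROP;
	if (len > SK_LARGE_PKT_BYTES && credit < -SK_LARGE_DROP_THRESH)
		return SK_DROP;
	if (credit < -SK_MARK_THRESH)
		return SK_MARK;
	return SK_PASS;
}

/* Linear probability for non-ECN TCP: 0% at the mark threshold and
 * 100% at the large packet drop threshold. rnd is any random number.
 */
bool sk_should_cwr(int64_t credit, uint32_t rnd)
{
	int64_t region = SK_LARGE_DROP_THRESH - SK_MARK_THRESH;

	return -credit >= SK_MARK_THRESH + (int64_t)(rnd % region);
}
```

A caller would invoke sk_charge() once per packet and pass the result to
sk_classify(). The program in the patch additionally returns the packet
length to the credit when it drops a packet, which this sketch omits.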

From patchwork Sat Feb 23 01:07:02 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047238
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="Pkz0fCDY";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnT35Gcz9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:41 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727695AbfBWBHk (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:40 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:48178 "EHLO
	mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1727670AbfBWBHf (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:35 -0500
Received: from pps.filterd (m0148460.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N0xtXF010980
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:34 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=uBqjDGsUgkyEVGhWSHAznkxVIn/9boKRVECS6lnGGYI=;
	b=Pkz0fCDYUIKdM/+MV2fk2llnQyi/9TBlS711pam3vEhXBUn4+eb//QrGOep4/iCg+zHg
	lN8co9idDyuj1nCnAy5PDD3T68UnYkne8zHn4JNF0hRhQq8AvdRpruYBvfV6kSGiFQiU
	k5lzv83v4hOyEAVNkCnN78K69cNtThjj4ks=
Received: from maileast.thefacebook.com ([199.201.65.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtug703nh-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:34 -0800
Received: from mx-out.facebook.com (2620:10d:c0a1:3::13) by
	mail.thefacebook.com (2620:10d:c021:18::171) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:33 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id A02305AE1524; Fri, 22 Feb 2019 17:07:33 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 8/9] bpf: User program for testing NRM
Date: Fri, 22 Feb 2019 17:07:02 -0800
Message-ID: <20190223010703.678070-9-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

The program nrm creates a cgroup and attaches a BPF program to the
cgroup for testing NRM for egress traffic. One still needs to create
network traffic. This can be done through netesto, netperf or iperf3.
A follow-up patch contains a script to create traffic.

USAGE: nrm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>]
           [-w] [-h] [prog]
  Where:
   -d        Print BPF trace debug buffer
   -l        Also limit flows doing loopback
   -n <#>    To create cgroup "/nrm#" and attach prog. Default is /nrm1
             This is convenient when testing NRM in more than 1 cgroup
   -r <rate> Rate limit in Mbps
   -s        Get NRM stats (marked, dropped, etc.)
   -t <time> Exit after specified seconds (default is 0)
   -w        Work conserving flag. cgroup can increase its bandwidth
             beyond the rate limit specified while there is available
             bandwidth. Current implementation assumes there is only
             one NIC (eth0), but can be extended to support multiple NICs.
             Currently only supported for egress. Note, this is just
             a proof of concept.
   -h        Print this info
   prog      BPF program file name. Name defaults to nrm_out_kern.o for
             output, and nrm_in_kern.o for input.

More information about NRM can be found in the paper "BPF Host Resource
Management" presented at the 2018 Linux Plumbers Conference, Networking Track
(http://vger.kernel.org/lpc_net2018_talks/LPC%20BPF%20Network%20Resource%20Paper.pdf)

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile |   3 +
 samples/bpf/nrm.c    | 440 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 443 insertions(+)
 create mode 100644 samples/bpf/nrm.c
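
The work conserving adjustment (-w) can be summarized by the sketch
below. It condenses the loop in run_bpf_prog() into one pure function
for readability and is not part of the patch; wc_next_rate and its
parameter names are hypothetical. Rates are in Mbps, throughputs in
bits per second. The 6.25% increase, 12.5% decrease and 3% slack match
the values used in nrm.c.

```c
#include <assert.h>

/* Return the next cgroup rate limit in Mbps.
 *   nic_bps:           measured NIC transmit rate
 *   nic_threshold_bps: NIC rate above which the cgroup must back off
 *   cg_bps:            measured cgroup transmit rate
 */
int wc_next_rate(int rate, int min_rate, int max_rate,
		 long long nic_bps, long long nic_threshold_bps,
		 long long cg_bps)
{
	long long limit_bps, unused_pct;

	assert(rate > 0 && min_rate > 0 && max_rate >= min_rate);

	if (nic_bps >= nic_threshold_bps) {
		/* NIC is close to saturation: decrease by 12.5% */
		rate -= rate >> 3;
		if (rate < min_rate)
			rate = min_rate;
		return rate;
	}

	/* NIC has headroom: increase by 6.25%, but only when the cgroup
	 * is using its current limit to within 3%.
	 */
	limit_bps = (long long)rate * 1000000;
	unused_pct = (limit_bps - cg_bps) * 100 / limit_bps;
	if (unused_pct <= 3) {
		rate += rate >> 4;
		if (rate > max_rate)
			rate = max_rate;
	}
	return rate;
}
```

For example, with a limit of 1024 Mbps, a lightly loaded NIC and the
cgroup sending at 1.01 Gbps, the next limit would be 1088 Mbps.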

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 897b467066fd..6186c9fc3179 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += nrm
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+nrm-objs := bpf_load.o nrm.o $(CGROUP_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -268,6 +270,7 @@ $(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 $(obj)/nrm_out_kern.o: $(src)/nrm.h $(src)/nrm_kern.h
+$(obj)/nrm.o: $(src)/nrm.h
 
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
diff --git a/samples/bpf/nrm.c b/samples/bpf/nrm.c
new file mode 100644
index 000000000000..ae2ab61b0fb3
--- /dev/null
+++ b/samples/bpf/nrm.c
@@ -0,0 +1,440 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Example program for Network Resource Management
+ *
+ * This program loads a cgroup skb BPF program to enforce cgroup output
+ * (egress) or input (ingress) bandwidth limits.
+ *
+ * USAGE: nrm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>] [-w] [-h] [prog]
+ *   Where:
+ *    -d	Print BPF trace debug buffer
+ *    -l	Also limit flows doing loopback
+ *    -n <#>	To create cgroup \"/nrm#\" and attach prog
+ *		Default is /nrm1
+ *    -r <rate>	Rate limit in Mbps
+ *    -s	Get NRM stats (marked, dropped, etc.)
+ *    -t <time>	Exit after specified seconds (default is 0)
+ *    -w	Work conserving flag. cgroup can increase its bandwidth
+ *		beyond the rate limit specified while there is available
+ *		bandwidth. Current implementation assumes there is only
+ *		one NIC (eth0), but can be extended to support multiple NICs.
+ *		Currently only supported for egress.
+ *    -h	Print this info
+ *    prog	BPF program file name. Name defaults to nrm_out_kern.o
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/resource.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/unistd.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+
+#include "bpf_load.h"
+#include "bpf_rlimit.h"
+#include "cgroup_helpers.h"
+#include "nrm.h"
+#include "bpf_util.h"
+#include "bpf/bpf.h"
+#include "bpf/libbpf.h"
+
+bool outFlag = true;
+int minRate = 1000;		/* cgroup rate limit in Mbps */
+int rate = 1000;		/* can grow if rate conserving is enabled */
+int dur = 1;
+bool stats_flag;
+bool loopback_flag;
+bool debugFlag;
+bool work_conserving_flag;
+
+static void Usage(void);
+static void read_trace_pipe2(void);
+static void do_error(char *msg, bool errno_flag);
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+struct bpf_object *obj;
+int bpfprog_fd;
+int cgroup_storage_fd;
+
+static void read_trace_pipe2(void)
+{
+	int trace_fd;
+	FILE *outf;
+	char *outFname = "nrm_out.log";
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0) {
+		printf("Error opening trace_pipe\n");
+		return;
+	}
+
+	if (!outFlag)
+		outFname = "nrm_in.log";
+	outf = fopen(outFname, "w");
+
+	if (outf == NULL)
+		printf("Error creating %s\n", outFname);
+
+	while (1) {
+		static char buf[4097];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf) - 1);
+		if (sz > 0) {
+			buf[sz] = 0;
+			puts(buf);
+			if (outf != NULL) {
+				fprintf(outf, "%s\n", buf);
+				fflush(outf);
+			}
+		}
+	}
+}
+
+static void do_error(char *msg, bool errno_flag)
+{
+	if (errno_flag)
+		printf("ERROR: %s, errno: %d\n", msg, errno);
+	else
+		printf("ERROR: %s\n", msg);
+	exit(1);
+}
+
+static int prog_load(char *prog)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.file = prog,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int map_fd;
+	struct bpf_map *map;
+
+	int ret = 0;
+
+	if (access(prog, R_OK) < 0) {
+		printf("Error accessing file %s: %s\n", prog, strerror(errno));
+		return 1;
+	}
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &bpfprog_fd))
+		ret = 1;
+	if (!ret) {
+		map = bpf_object__find_map_by_name(obj, "queue_stats");
+		map_fd = bpf_map__fd(map);
+		if (map_fd < 0) {
+			printf("Map not found: %s\n", strerror(-map_fd));
+			ret = 1;
+		}
+	}
+
+	if (ret) {
+		printf("ERROR: load_bpf_file failed for: %s\n", prog);
+		printf("  Output from verifier:\n%s\n------\n", bpf_log_buf);
+		ret = -1;
+	} else {
+		ret = map_fd;
+	}
+
+	return ret;
+}
+
+static int run_bpf_prog(char *prog, int cg_id)
+{
+	int map_fd;
+	int rc = 0;
+	int key = 0;
+	int cg1 = 0;
+	int type = BPF_CGROUP_INET_EGRESS;
+	char cg_dir[100];
+	struct nrm_queue_stats qstats = {0};
+
+	sprintf(cg_dir, "/nrm%d", cg_id);
+	map_fd = prog_load(prog);
+	if (map_fd == -1)
+		return 1;
+
+	if (setup_cgroup_environment()) {
+		printf("ERROR: setting cgroup environment\n");
+		goto err;
+	}
+	cg1 = create_and_get_cgroup(cg_dir);
+	if (!cg1) {
+		printf("ERROR: create_and_get_cgroup\n");
+		goto err;
+	}
+	if (join_cgroup(cg_dir)) {
+		printf("ERROR: join_cgroup\n");
+		goto err;
+	}
+
+	qstats.rate = rate;
+	qstats.stats = stats_flag ? 1 : 0;
+	qstats.loopback = loopback_flag ? 1 : 0;
+	if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
+		printf("ERROR: Could not update map element\n");
+		goto err;
+	}
+
+	if (!outFlag)
+		type = BPF_CGROUP_INET_INGRESS;
+	if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
+		printf("ERROR: bpf_prog_attach fails!\n");
+		log_err("Attaching prog");
+		goto err;
+	}
+
+	if (work_conserving_flag) {
+		struct timeval t0, t_last, t_new;
+		FILE *fin;
+		unsigned long long last_eth_tx_bytes, new_eth_tx_bytes;
+		signed long long last_cg_tx_bytes, new_cg_tx_bytes;
+		signed long long delta_time, delta_bytes, delta_rate;
+		int delta_ms;
+#define DELTA_RATE_CHECK 10000		/* in us */
+#define RATE_THRESHOLD 9500000000	/* 9.5 Gbps */
+
+		bpf_map_lookup_elem(map_fd, &key, &qstats);
+		if (gettimeofday(&t0, NULL) < 0)
+			do_error("gettimeofday failed", true);
+		t_last = t0;
+		fin = fopen("/sys/class/net/eth0/statistics/tx_bytes", "r");
+		if (fscanf(fin, "%llu", &last_eth_tx_bytes) != 1)
+			do_error("fscanf fails", false);
+		fclose(fin);
+		last_cg_tx_bytes = qstats.bytes_total;
+		while (true) {
+			usleep(DELTA_RATE_CHECK);
+			if (gettimeofday(&t_new, NULL) < 0)
+				do_error("gettimeofday failed", true);
+			delta_ms = (t_new.tv_sec - t0.tv_sec) * 1000 +
+				(t_new.tv_usec - t0.tv_usec)/1000;
+			if (delta_ms > dur * 1000)
+				break;
+			delta_time = (t_new.tv_sec - t_last.tv_sec) * 1000000 +
+				(t_new.tv_usec - t_last.tv_usec);
+			if (delta_time == 0)
+				continue;
+			t_last = t_new;
+			fin = fopen("/sys/class/net/eth0/statistics/tx_bytes",
+				    "r");
+			if (fscanf(fin, "%llu", &new_eth_tx_bytes) != 1)
+				do_error("fscanf fails", false);
+			fclose(fin);
+			printf("  new_eth_tx_bytes:%llu\n",
+			       new_eth_tx_bytes);
+			bpf_map_lookup_elem(map_fd, &key, &qstats);
+			new_cg_tx_bytes = qstats.bytes_total;
+			delta_bytes = new_eth_tx_bytes - last_eth_tx_bytes;
+			last_eth_tx_bytes = new_eth_tx_bytes;
+			delta_rate = (delta_bytes * 8000000) / delta_time;
+			printf("%5d - eth_rate:%.1fGbps cg_rate:%.3fGbps",
+			       delta_ms, delta_rate/1000000000.0,
+			       rate/1000.0);
+			if (delta_rate < RATE_THRESHOLD) {
+				/* can increase cgroup rate limit, but first
+				 * check if we are using the current limit.
+				 * Currently increasing by 6.25%, unknown
+				 * if that is the optimal rate.
+				 */
+				int rate_diff100;
+
+				delta_bytes = new_cg_tx_bytes -
+					last_cg_tx_bytes;
+				last_cg_tx_bytes = new_cg_tx_bytes;
+				delta_rate = (delta_bytes * 8000000) /
+					delta_time;
+				printf(" rate:%.3fGbps",
+				       delta_rate/1000000000.0);
+				rate_diff100 = (((long long)rate)*1000000 -
+						     delta_rate) * 100 /
+					(((long long) rate) * 1000000);
+				printf("  rdiff:%d", rate_diff100);
+				if (rate_diff100 <= 3) {
+					rate += (rate >> 4);
+					if (rate > RATE_THRESHOLD / 1000000)
+						rate = RATE_THRESHOLD / 1000000;
+					qstats.rate = rate;
+					printf(" INC\n");
+				} else {
+					printf("\n");
+				}
+			} else {
+				/* Need to decrease cgroup rate limit.
+				 * Currently decreasing by 12.5%, unknown
+				 * if that is optimal
+				 */
+				printf(" DEC\n");
+				rate -= (rate >> 3);
+				if (rate < minRate)
+					rate = minRate;
+				qstats.rate = rate;
+			}
+			if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY))
+				do_error("update map element fails", false);
+		}
+	} else {
+		sleep(dur);
+	}
+	// Get stats!
+	if (stats_flag && bpf_map_lookup_elem(map_fd, &key, &qstats)) {
+		char fname[100];
+		FILE *fout;
+
+		if (!outFlag)
+			sprintf(fname, "nrm.%d.in", cg_id);
+		else
+			sprintf(fname, "nrm.%d.out", cg_id);
+		fout = fopen(fname, "w");
+		fprintf(fout, "id:%d\n", cg_id);
+		fprintf(fout, "ERROR: Could not lookup queue_stats\n");
+	} else if (stats_flag && qstats.lastPacketTime >
+		   qstats.firstPacketTime) {
+		long long delta_us = (qstats.lastPacketTime -
+				      qstats.firstPacketTime)/1000;
+		unsigned int rate_mbps = ((qstats.bytes_total -
+					   qstats.bytes_dropped) * 8 /
+					  delta_us);
+		double percent_pkts, percent_bytes;
+		char fname[100];
+		FILE *fout;
+
+		if (!outFlag)
+			sprintf(fname, "nrm.%d.in", cg_id);
+		else
+			sprintf(fname, "nrm.%d.out", cg_id);
+		fout = fopen(fname, "w");
+		fprintf(fout, "id:%d\n", cg_id);
+		fprintf(fout, "rate_mbps:%d\n", rate_mbps);
+		fprintf(fout, "duration:%.1f secs\n",
+			(qstats.lastPacketTime - qstats.firstPacketTime) /
+			1000000000.0);
+		fprintf(fout, "packets:%d\n", (int)qstats.pkts_total);
+		fprintf(fout, "bytes_MB:%d\n", (int)(qstats.bytes_total /
+						     1000000));
+		fprintf(fout, "pkts_dropped:%d\n", (int)qstats.pkts_dropped);
+		fprintf(fout, "bytes_dropped_MB:%d\n",
+			(int)(qstats.bytes_dropped /
+						       1000000));
+		// Marked Pkts and Bytes
+		percent_pkts = (qstats.pkts_marked * 100.0) /
+			(qstats.pkts_total + 1);
+		percent_bytes = (qstats.bytes_marked * 100.0) /
+			(qstats.bytes_total + 1);
+		fprintf(fout, "pkts_marked_percent:%6.2f\n", percent_pkts);
+		fprintf(fout, "bytes_marked_percent:%6.2f\n", percent_bytes);
+
+		// Dropped Pkts and Bytes
+		percent_pkts = (qstats.pkts_dropped * 100.0) /
+			(qstats.pkts_total + 1);
+		percent_bytes = (qstats.bytes_dropped * 100.0) /
+			(qstats.bytes_total + 1);
+		fprintf(fout, "pkts_dropped_percent:%6.2f\n", percent_pkts);
+		fprintf(fout, "bytes_dropped_percent:%6.2f\n", percent_bytes);
+		fclose(fout);
+	}
+
+	if (debugFlag)
+		read_trace_pipe2();
+	return rc;
+err:
+	rc = 1;
+
+	if (cg1)
+		close(cg1);
+	cleanup_cgroup_environment();
+
+	return rc;
+}
+
+static void Usage(void)
+{
+	printf("This program loads a cgroup skb BPF program to enforce\n"
+	       "cgroup output (egress) bandwidth limits.\n\n"
+	       "USAGE: nrm [-o] [-d]  [-l] [-n <id>] [-r <rate>] [-s]\n"
+	       "           [-t <secs>] [-w] [-h] [prog]\n"
+	       "  Where:\n"
+	       "    -o         indicates egress direction (default)\n"
+	       "    -d         print BPF trace debug buffer\n"
+	       "    -l         also limit flows using loopback\n"
+	       "    -n <#>     to create cgroup \"/nrm#\" and attach prog\n"
+	       "               Default is /nrm1\n"
+	       "    -r <rate>  Rate in Mbps\n"
+	       "    -s         Update NRM stats\n"
+	       "    -t <time>  Exit after specified seconds (default is 0)\n"
+	       "    -w         Work conserving flag. cgroup can increase\n"
+	       "               bandwidth beyond the rate limit specified\n"
+	       "               while there is available bandwidth. Current\n"
+	       "               implementation assumes there is only eth0\n"
+	       "               but can be extended to support multiple NICs\n"
+	       "    -h         print this info\n"
+	       "    prog       BPF program file name. Name defaults to\n"
+	       "                 nrm_out_kern.o for output, and\n"
+	       "                 nrm_in_kern.o for input.\n");
+}
+
+int main(int argc, char **argv)
+{
+	char *prog = "nrm_out_kern.o";
+	int  k;
+	int cg_id = 1;
+	char *optstring = "iodln:r:st:wh";
+
+	while ((k = getopt(argc, argv, optstring)) != -1) {
+		switch (k) {
+		case 'o':
+			break;
+		case 'd':
+			debugFlag = true;
+			break;
+		case 'l':
+			loopback_flag = true;
+			break;
+		case 'n':
+			cg_id = atoi(optarg);
+			break;
+		case 'r':
+			minRate = atoi(optarg) * 1.024;
+			rate = minRate;
+			break;
+		case 's':
+			stats_flag = true;
+			break;
+		case 't':
+			dur = atoi(optarg);
+			break;
+		case 'w':
+			work_conserving_flag = true;
+			break;
+		case '?':
+			if (optopt == 'n' || optopt == 'r' || optopt == 't')
+				fprintf(stderr,
+					"Option -%c requires an argument.\n\n",
+					optopt);
+			/* fallthrough */
+		case 'h':
+		default:
+			Usage();
+			return 0;
+		}
+	}
+
+	if (optind < argc)
+		prog = argv[optind];
+	printf("NRM prog: %s\n", prog != NULL ? prog : "NULL");
+
+	return run_bpf_prog(prog, cg_id);
+}

From patchwork Sat Feb 23 01:07:03 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lawrence Brakmo <brakmo@fb.com>
X-Patchwork-Id: 1047239
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org;
	spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
	(client-ip=209.132.180.67; helo=vger.kernel.org;
	envelope-from=netdev-owner@vger.kernel.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=fb.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=fb.com header.i=@fb.com header.b="Kv4x6DPK";
	dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 445qnW6DDVz9sBF
	for <patchwork-incoming-netdev@ozlabs.org>;
	Sat, 23 Feb 2019 12:07:43 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727705AbfBWBHm (ORCPT
	<rfc822;patchwork-incoming-netdev@ozlabs.org>);
	Fri, 22 Feb 2019 20:07:42 -0500
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:59084 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1725814AbfBWBHk (ORCPT
	<rfc822;netdev@vger.kernel.org>); Fri, 22 Feb 2019 20:07:40 -0500
Received: from pps.filterd (m0148461.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1N13D4C029047
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:38 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
	h=from : to : cc : subject
	: date : message-id : in-reply-to : references : mime-version :
	content-type; s=facebook;
	bh=FSykPDiWmwYS9A+5viAVXcv/RtarnL9HSahVQuEr1zQ=;
	b=Kv4x6DPKjykRPkujbBFLaLfsz2LmC3/2nwf1ryppq+U4CUYvaL69CZBmGlsLBMzNvkRb
	9mALsQgkWigfEWrg90V9vLoysC75M4LB8GifVnp5Y2juLqEf4viG5egkRGHWTb5LpkMe
	WHrsARN+54b62YcKdHFMA4I+gj3FQ0LEhaw=
Received: from maileast.thefacebook.com ([199.201.65.23])
	by mx0a-00082601.pphosted.com with ESMTP id 2qtuea045a-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
	for <netdev@vger.kernel.org>; Fri, 22 Feb 2019 17:07:38 -0800
Received: from mx-out.facebook.com (2620:10d:c0a1:3::13) by
	mail.thefacebook.com (2620:10d:c021:18::176) with Microsoft SMTP
	Server
	(version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id
	15.1.1531.3; Fri, 22 Feb 2019 17:07:36 -0800
Received: by devbig009.ftw2.facebook.com (Postfix, from userid 10340)
	id AAE315AE1524; Fri, 22 Feb 2019 17:07:35 -0800 (PST)
Smtp-Origin-Hostprefix: devbig
From: brakmo <brakmo@fb.com>
Smtp-Origin-Hostname: devbig009.ftw2.facebook.com
To: netdev <netdev@vger.kernel.org>
CC: Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Eric Dumazet <eric.dumazet@gmail.com>, Kernel Team <Kernel-team@fb.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH v2 bpf-next 9/9] bpf: NRM test script
Date: Fri, 22 Feb 2019 17:07:03 -0800
Message-ID: <20190223010703.678070-10-brakmo@fb.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190223010703.678070-1-brakmo@fb.com>
References: <20190223010703.678070-1-brakmo@fb.com>
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2019-02-23_01:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

Script for testing NRM (Network Resource Manager) framework.
It creates a cgroup to use for testing and loads a BPF program to limit
egress bandwidth. It then uses iperf3 or netperf to create
loads. The output is the goodput in Mbps (unless -D is used).

It can work on a single host using loopback or between two hosts (with netperf).
When using loopback, it is recommended to also introduce a delay of at least
1ms (-d=1), otherwise the assigned bandwidth is likely to be underutilized.

USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]
             [-d=<delay>|--delay=<delay>] [--debug] [-E]
             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>] [-l]
             [-N] [-p=<port>|--port=<port>] [-P] [-q=<qdisc>]
             [-R] [-s=<server>|--server=<server>] [--stats]
             [-t=<time>|--time=<time>] [-w] [cubic|dctcp]
  Where:
    out               Egress (default)
    -b or --bpf       BPF program filename to load and attach.
                      Default is nrm_out_kern.o for egress,
                      nrm_in_kern.o for ingress
    -c or --cc        TCP congestion control (cubic or dctcp)
    -d or --delay     Add a delay in ms using netem
    -D                In addition to the goodput in Mbps, it also outputs
                      other detailed information. This information is
                      test dependent (i.e. iperf3 or netperf).
    --debug           Print BPF trace buffer
    -E                Enable ECN (not required for dctcp)
    -f or --flows     Number of concurrent flows (default=1)
    -i or --id        cgroup id (an integer, default is 1)
    -l                Do not limit flows using loopback
    -N                Use netperf instead of iperf3
    -h                Help
    -p or --port      iperf3 port (default is 5201)
    -P                Use an iperf3 instance for each flow
    -q                Use the specified qdisc.
    -r or --rate      Rate in Mbps (default is 1Gbps)
    -R                Use TCP_RR for netperf. 1st flow has req
                      size of 10KB, rest of 1MB. Reply in all
                      cases is 1 byte.
                      More detailed output for each flow can be found
                      in the files netperf.<cg>.<flow>, where <cg> is the
                      cgroup id as specified with the -i flag, and <flow>
                      is the flow id starting at 1 and increasing by 1 for
                      each flow (as specified by -f).
    -s or --server    hostname of netperf server. Used to create netperf
                      test traffic between two hosts (default is within host)
                      netserver must be running on the host.
    --stats           Get NRM stats (marked, dropped, etc.)
    -t or --time      duration of iperf3 in seconds (default=5)
    -w                Work conserving flag. cgroup can increase its
                      bandwidth beyond the rate limit specified
                      while there is available bandwidth. Current
                      implementation assumes there is only one NIC
                      (eth0), but can be extended to support multiple
                      NICs. This is just a proof of concept.
    cubic or dctcp    specify TCP CC to use

Examples:
 ./do_nrm_test.sh -l -d=1 -D --stats
     Runs a 5 second test, using a single iperf3 flow and with the default
     rate limit of 1Gbps and a delay of 1ms (using netem) using the default
     TCP congestion control on the loopback device (hence we use "-l" to
     enforce bandwidth limit on loopback device). Since no direction is
     specified, it defaults to egress. Since no TCP CC algorithm is
     specified it uses the system default.
     With no -D flag, only the value of the AGGREGATE OUTPUT would show.
     id refers to the cgroup id and is useful when running multi cgroup
     tests (see do_nrm_test_multi.sh script).
   Output:
     Details for NRM in cgroup 1
     id:1
     rate_mbps:713
     duration:4.9 secs
     packets:10072
     bytes_MB:468
     pkts_dropped:491
     bytes_dropped_MB:32
     pkts_marked_percent: 28.64
     bytes_marked_percent: 29.15
     pkts_dropped_percent:  4.87
     bytes_dropped_percent:  6.86
     PING AVG DELAY:2.072
     AGGREGATE_GOODPUT:729

./do_nrm_test.sh -l -d=1 -D --stats dctcp
     Same as above but using dctcp. Note that fewer bytes are dropped
     (0.13% vs. 6.86%).
   Output:
     Details for NRM in cgroup 1
     id:1
     rate_mbps:932
     duration:4.9 secs
     packets:15514
     bytes_MB:570
     pkts_dropped:11
     bytes_dropped_MB:0
     pkts_marked_percent: 40.38
     bytes_marked_percent: 46.82
     pkts_dropped_percent:  0.07
     bytes_dropped_percent:  0.13
     PING AVG DELAY:2.069
     AGGREGATE_GOODPUT:953

./do_nrm_test.sh -d=1 -D --stats
     As first example, but without limiting loopback device (i.e. no
     "-l" flag). Since there is no bandwidth limiting, no details for
     NRM are printed out.
   Output:
     Details for NRM in cgroup 1
     PING AVG DELAY:2.021
     AGGREGATE_GOODPUT:40226

./do_nrm_test.sh -l -d=1 -D --stats -f=2
     Uses iperf3 and does 2 flows
./do_nrm_test.sh -l -d=1 -D --stats -f=4 -P
     Uses iperf3 and does 4 flows, each flow as a separate process.
./do_nrm_test.sh -l -d=1 -D --stats -f=4 -N
     Uses netperf, 4 flows
./do_nrm_test.sh -f=1 -r=2000 -t=5 -N -D --stats dctcp -s=<server-name>
     Uses netperf between two hosts. The remote host name is specified
     with -s= and you need to start the program netserver manually on
     the remote host. It will use 1 flow, a rate limit of 2Gbps and dctcp.
./do_nrm_test.sh -f=1 -r=2000 -t=5 -N -D --stats -w dctcp \
     -s=<server-name>
     As previous, but allows use of extra bandwidth. For this test the
     rate is 8Gbps vs. 1Gbps of the previous test.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/do_nrm_test.sh | 437 +++++++++++++++++++++++++++++++++++++
 1 file changed, 437 insertions(+)
 create mode 100755 samples/bpf/do_nrm_test.sh

diff --git a/samples/bpf/do_nrm_test.sh b/samples/bpf/do_nrm_test.sh
new file mode 100755
index 000000000000..91d99237aea5
--- /dev/null
+++ b/samples/bpf/do_nrm_test.sh
@@ -0,0 +1,437 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2019 Facebook
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of version 2 of the GNU General Public
+# License as published by the Free Software Foundation.
+
+Usage() {
+  echo "Script for testing NRM (Network Resource Manager) framework."
+  echo "It creates a cgroup to use for testing and loads a BPF program to limit"
+  echo "egress or ingress bandwidth. It then uses iperf3 or netperf to create"
+  echo "loads. The output is the goodput in Mbps (unless -D was used)."
+  echo ""
+  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]"
+  echo "             [-d=<delay>|--delay=<delay>] [--debug] [-E]"
+  echo "             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>]"
+  echo "             [-l] [-N] [-p=<port>|--port=<port>] [-P]"
+  echo "             [-q=<qdisc>] [-R] [-s=<server>|--server=<server>]"
+  echo "             [-S|--stats] [-t=<time>|--time=<time>] [-w] [cubic|dctcp]"
+  echo "  Where:"
+  echo "    out               egress (default)"
+  echo "    -b or --bpf       BPF program filename to load and attach."
+  echo "                      Default is nrm_out_kern.o for egress,"
+  echo "                      nrm_in_kern.o for ingress"
+  echo "    -c or --cc        TCP congestion control (cubic or dctcp)"
+  echo "    --debug           print BPF trace buffer"
+  echo "    -d or --delay     add a delay in ms using netem"
+  echo "    -D                In addition to the goodput in Mbps, it also outputs"
+  echo "                      other detailed information. This information is"
+  echo "                      test dependent (i.e. iperf3 or netperf)."
+  echo "    -E                enable ECN (not required for dctcp)"
+  echo "    -f or --flows     number of concurrent flows (default=1)"
+  echo "    -i or --id        cgroup id (an integer, default is 1)"
+  echo "    -N                use netperf instead of iperf3"
+  echo "    -l                also limit flows using loopback"
+  echo "    -h                Help"
+  echo "    -p or --port      iperf3 port (default is 5201)"
+  echo "    -P                use an iperf3 instance for each flow"
+  echo "    -q                use the specified qdisc"
+  echo "    -r or --rate      rate in Mbps (default is 1Gbps)"
+  echo "    -R                Use TCP_RR for netperf. 1st flow has req"
+  echo "                      size of 10KB, rest of 1MB. Reply in all"
+  echo "                      cases is 1 byte."
+  echo "                      More detailed output for each flow can be found"
+  echo "                      in the files netperf.<cg>.<flow>, where <cg> is the"
+  echo "                      cgroup id as specified with the -i flag, and <flow>"
+  echo "                      is the flow id starting at 1 and increasing by 1 for"
+  echo "                      each flow (as specified by -f)."
+  echo "    -s or --server    hostname of netperf server. Used to create netperf"
+  echo "                      test traffic between two hosts (default is within host)."
+  echo "                      netserver must be running on the host."
+  echo "    -S or --stats     update nrm stats (default is off)."
+  echo "    -t or --time      duration of test in seconds (default=5)"
+  echo "    -w                Work conserving flag. cgroup can increase its"
+  echo "                      bandwidth beyond the rate limit specified"
+  echo "                      while there is available bandwidth. Current"
+  echo "                      implementation assumes there is only one NIC"
+  echo "                      (eth0), but can be extended to support multiple"
+  echo "                      NICs."
+  echo "    cubic or dctcp    specify which TCP CC to use"
+  echo " "
+  exit
+}
+
+#set -x
+
+debug_flag=0
+args="$@"
+name="$0"
+netem=0
+cc=x
+dir="-o"
+dir_name="out"
+dur=5
+flows=1
+id=1
+prog=""
+port=5201
+rate=1000
+multi_iperf=0
+flow_cnt=1
+use_netperf=0
+rr=0
+ecn=0
+details=0
+server=""
+qdisc=""
+flags=""
+do_stats=0
+
+function start_nrm () {
+  rm -f nrm.out
+  echo "./nrm $dir -n $id -r $rate -t $dur $flags $dbg $prog" > nrm.out
+  echo " " >> nrm.out
+  ./nrm $dir -n $id -r $rate -t $dur $flags $dbg $prog >> nrm.out 2>&1  &
+  echo $!
+}
+
+processArgs () {
+  for i in $args ; do
+    case $i in
+    # Support for upcoming ingress rate limiting
+    #in)         # support for upcoming ingress rate limiting
+    #  dir="-i"
+    #  dir_name="in"
+    #  ;;
+    out)
+      dir="-o"
+      dir_name="out"
+      ;;
+    -b=*|--bpf=*)
+      prog="${i#*=}"
+      ;;
+    -c=*|--cc=*)
+      cc="${i#*=}"
+      ;;
+    --debug)
+      flags="$flags -d"
+      debug_flag=1
+      ;;
+    -d=*|--delay=*)
+      netem="${i#*=}"
+      ;;
+    -D)
+      details=1
+      ;;
+    -E)
+      ecn=1
+      ;;
+    # Support for upcoming fq Early Departure Time egress rate limiting
+    #--edt)
+    # prog="nrm_out_edt_kern.o"
+    # qdisc="fq"
+    # ;;
+    -f=*|--flows=*)
+      flows="${i#*=}"
+      ;;
+    -i=*|--id=*)
+      id="${i#*=}"
+      ;;
+    -l)
+      flags="$flags -l"
+      ;;
+    -N)
+      use_netperf=1
+      ;;
+    -p=*|--port=*)
+      port="${i#*=}"
+      ;;
+    -P)
+      multi_iperf=1
+      ;;
+    -q=*)
+      qdisc="${i#*=}"
+      ;;
+    -r=*|--rate=*)
+      rate="${i#*=}"
+      ;;
+    -R)
+      rr=1
+      ;;
+    -s=*|--server=*)
+      server="${i#*=}"
+      ;;
+    -S|--stats)
+      flags="$flags -s"
+      do_stats=1
+      ;;
+    -t=*|--time=*)
+      dur="${i#*=}"
+      ;;
+    -w)
+      flags="$flags -w"
+      ;;
+    cubic)
+      cc=cubic
+      ;;
+    dctcp)
+      cc=dctcp
+      ;;
+    *)
+      echo "Unknown arg:$i"
+      Usage
+      ;;
+    esac
+  done
+}
+
+processArgs
+
+if [ $debug_flag -eq 1 ] ; then
+  rm -f nrm_out.log
+fi
+
+nrm_pid=$(start_nrm)
+sleep 0.1
+
+host=`hostname`
+cg_base_dir=/sys/fs/cgroup
+cg_dir="$cg_base_dir/cgroup-test-work-dir/nrm$id"
+
+echo $$ >> $cg_dir/cgroup.procs
+
+ulimit -l unlimited
+
+rm -f ss.out
+rm -f nrm.[0-9]*.$dir_name
+if [ $ecn -ne 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_ecn=1
+fi
+
+if [ $use_netperf -eq 0 ] ; then
+  cur_cc=`sysctl -n net.ipv4.tcp_congestion_control`
+  if [ "$cc" != "x" ] ; then
+    sysctl -w -q -n net.ipv4.tcp_congestion_control=$cc
+  fi
+fi
+
+if [ "$netem" -ne "0" ] ; then
+  if [ "$qdisc" != "" ] ; then
+    echo "WARNING: Ignoring -q option because -d option was used"
+  fi
+  tc qdisc del dev lo root > /dev/null 2>&1
+  tc qdisc add dev lo root netem delay ${netem}ms > /dev/null 2>&1
+elif [ "$qdisc" != "" ] ; then
+  tc qdisc del dev lo root > /dev/null 2>&1
+  tc qdisc add dev lo root $qdisc > /dev/null 2>&1
+fi
+
+n=0
+m=$[$dur * 5]
+hn="::1"
+if [ $use_netperf -ne 0 ] ; then
+  if [ "$server" != "" ] ; then
+    hn=$server
+  fi
+fi
+
+( ping6 -i 0.2 -c $m $hn > ping.out 2>&1 ) &
+
+if [ $use_netperf -ne 0 ] ; then
+  begNetserverPid=`ps ax | grep netserver | grep --invert-match "grep" | \
+                   awk '{ print $1 }'`
+  if [ "$begNetserverPid" == "" ] ; then
+    if [ "$server" == "" ] ; then
+      ( ./netserver > /dev/null 2>&1) &
+      sleep 0.1
+    fi
+  fi
+  flow_cnt=1
+  if [ "$server" == "" ] ; then
+    np_server=$host
+  else
+    np_server=$server
+  fi
+  if [ "$cc" == "x" ] ; then
+    np_cc=""
+  else
+    np_cc="-K $cc,$cc"
+  fi
+  replySize=1
+  while [ $flow_cnt -le $flows ] ; do
+    if [ $rr -ne 0 ] ; then
+      reqSize=1M
+      if [ $flow_cnt -eq 1 ] ; then
+        reqSize=10K
+      fi
+      if [ "$dir" == "-i" ] ; then
+        replySize=$reqSize
+        reqSize=1
+      fi
+      ( ./netperf -H $np_server -l $dur -f m -j -t TCP_RR  -- -r $reqSize,$replySize $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,LOCAL_RECV_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+    else
+      if [ "$dir" == "-i" ] ; then
+        ( ./netperf -H $np_server -l $dur -f m -j -t TCP_RR -- -r 1,10M $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,REMOTE_TRANSPORT_RETRANS,REMOTE_SEND_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+      else
+        ( ./netperf -H $np_server -l $dur -f m -j -t TCP_STREAM -- $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+      fi
+    fi
+    flow_cnt=$[flow_cnt+1]
+  done
+
+  # sleep for duration of test (plus some buffer)
+  n=$[dur+2]
+  sleep $n
+
+  # force graceful termination of netperf
+  pids=`pgrep netperf`
+  for p in $pids ; do
+    kill -SIGALRM $p
+  done
+
+  flow_cnt=1
+  rate=0
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+  fi
+  while [ $flow_cnt -le $flows ] ; do
+    if [ "$dir" == "-i" ] ; then
+      r=`cat netperf.$id.$flow_cnt | grep -o "REMOTE_SEND_THROUGHPUT=[0-9]*" | grep -o "[0-9]*"`
+    else
+      r=`cat netperf.$id.$flow_cnt | grep -o "LOCAL_SEND_THROUGHPUT=[0-9]*" | grep -o "[0-9]*"`
+    fi
+    echo "rate for flow $flow_cnt: $r"
+    rate=$[rate+r]
+    if [ $details -ne 0 ] ; then
+      echo "-----"
+      echo "Details for cgroup $id, flow $flow_cnt"
+      cat netperf.$id.$flow_cnt
+    fi
+    flow_cnt=$[flow_cnt+1]
+  done
+  if [ $details -ne 0 ] ; then
+    echo ""
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+elif [ $multi_iperf -eq 0 ] ; then
+  (iperf3 -s -p $port -1 > /dev/null 2>&1) &
+  sleep 0.1
+  iperf3 -c $host -p $port -i 0 -P $flows -f m -t $dur > iperf.$id
+  rates=`grep receiver iperf.$id | grep -o "[0-9.]* Mbits" | grep -o "^[0-9]*"`
+  rate=`echo $rates | grep -o "[0-9]*$"`
+
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+else
+  flow_cnt=1
+  while [ $flow_cnt -le $flows ] ; do
+    (iperf3 -s -p $port -1 > /dev/null 2>&1) &
+    ( iperf3 -c $host -p $port -i 0 -P 1 -f m -t $dur | grep receiver | grep -o "[0-9.]* Mbits" | grep -o "^[0-9]*" | grep -o "[0-9]*$" > iperf3.$id.$flow_cnt ) &
+    port=$[port+1]
+    flow_cnt=$[flow_cnt+1]
+  done
+  n=$[dur+1]
+  sleep $n
+  flow_cnt=1
+  rate=0
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+  fi
+
+  while [ $flow_cnt -le $flows ] ; do
+    r=`cat iperf3.$id.$flow_cnt`
+    # echo "rate for flow $flow_cnt: $r"
+    if [ $details -ne 0 ] ; then
+      echo "Rate for cgroup $id, flow $flow_cnt LOCAL_SEND_THROUGHPUT=$r"
+    fi
+    rate=$[rate+r]
+    flow_cnt=$[flow_cnt+1]
+  done
+  if [ $details -ne 0 ] ; then
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+fi
+
+if [ $use_netperf -eq 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_congestion_control=$cur_cc
+fi
+if [ $ecn -ne 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_ecn=0
+fi
+if [ "$netem" -ne "0" ] ; then
+  tc qdisc del dev lo root > /dev/null 2>&1
+fi
+
+sleep 2
+
+nrmPid=`ps ax | grep "nrm " | grep --invert-match "grep" | awk '{ print $1 }'`
+if [ "$nrmPid" == "$nrm_pid" ] ; then
+  kill $nrm_pid
+fi
+
+sleep 1
+
+# Detach any BPF programs that may have lingered
+ttx=`bpftool cgroup tree | grep nrm`
+v=2
+for x in $ttx ; do
+    if [ "${x:0:36}" == "/sys/fs/cgroup/cgroup-test-work-dir/" ] ; then
+	cg=$x ; v=0
+    else
+	if [ $v -eq 0 ] ; then
+	    id=$x ; v=1
+	else
+	    if [ $v -eq 1 ] ; then
+		type=$x ; bpftool cgroup detach $cg $type id $id
+		v=0
+	    fi
+	fi
+    fi
+done
+
+if [ $use_netperf -ne 0 ] ; then
+  if [ "$server" == "" ] ; then
+    if [ "$begNetserverPid" == "" ] ; then
+      netserverPid=`ps ax | grep netserver | grep --invert-match "grep" | awk '{ print $1 }'`
+      if [ "$netserverPid" != "" ] ; then
+        kill $netserverPid
+      fi
+    fi
+  fi
+fi
+exit