From patchwork Sat Dec 14 00:47:38 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack
Date: Fri, 13 Dec 2019 16:47:38 -0800
Message-ID: <20191214004738.1652239-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>
This patch makes the verifier save the PTR_TO_BTF_ID register state when
spilling to the stack.

Signed-off-by: Martin KaFai Lau
Acked-by: Yonghong Song
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 034ef81f935b..408264c1d55b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1915,6 +1915,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_BTF_ID:
 		return true;
 	default:
 		return false;
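[Editorial illustration] A minimal sketch of the kind of program this unblocks
(hypothetical fentry target, using today's libbpf conventions; whether clang
actually spills depends on register pressure):

	/* Hedged sketch: if clang spills the PTR_TO_BTF_ID register
	 * holding "sk" to the stack across the helper calls, the
	 * verifier can now restore its type on the later fill instead
	 * of rejecting the final dereference.
	 */
	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	SEC("fentry/tcp_close")
	int BPF_PROG(trace_close, struct sock *sk, long timeout)
	{
		__u64 t0 = bpf_ktime_get_ns();
		__u64 t1 = bpf_ktime_get_ns();

		/* "sk" is live across the calls above; it may be
		 * spilled to and filled from the stack.
		 */
		bpf_printk("sndbuf %d dt %llu", sk->sk_sndbuf, t1 - t0);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";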
From patchwork Sat Dec 14 00:47:41 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id
Date: Fri, 13 Dec 2019 16:47:41 -0800
Message-ID: <20191214004741.1652332-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>

info->btf_id expects the btf_id of a struct, so it should store the
final result after skipping modifiers (if any).

It also takes this chance to add a missing newline in one of the
bpf_log() messages.

Signed-off-by: Martin KaFai Lau
Acked-by: Yonghong Song
---
 kernel/bpf/btf.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 7d40da240891..88359a4bccb0 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3696,7 +3696,6 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,

	/* this is a pointer to another type */
	info->reg_type = PTR_TO_BTF_ID;
-	info->btf_id = t->type;

	if (tgt_prog) {
		ret = btf_translate_to_vmlinux(log, btf, t, tgt_prog->type);
@@ -3707,10 +3706,14 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
			return false;
		}
	}
+
+	info->btf_id = t->type;
	t = btf_type_by_id(btf, t->type);
	/* skip modifiers */
-	while (btf_type_is_modifier(t))
+	while (btf_type_is_modifier(t)) {
+		info->btf_id = t->type;
		t = btf_type_by_id(btf, t->type);
+	}
	if (!btf_type_is_struct(t)) {
		bpf_log(log,
			"func '%s' arg%d type %s is not a struct\n",
@@ -3736,7 +3739,7 @@ int btf_struct_access(struct bpf_verifier_log *log,
 again:
	tname = __btf_name_by_offset(btf_vmlinux, t->name_off);
	if (!btf_type_is_struct(t)) {
-		bpf_log(log, "Type '%s' is not a struct", tname);
+		bpf_log(log, "Type '%s' is not a struct\n", tname);
		return -EINVAL;
	}
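[Editorial illustration] A sketch of the BTF chain being walked (type ids are
hypothetical): for an argument typed "const struct sock *", the pointer's
t->type names the modifier, not the struct:

	/* Hypothetical BTF:
	 *   [7] PTR    -> type_id=8
	 *   [8] CONST  -> type_id=9
	 *   [9] STRUCT 'sock'
	 *
	 * Before this patch, info->btf_id could be left at 8 (the
	 * CONST node).  With the change, the modifier-skipping loop
	 * advances info->btf_id to 9, the struct itself.
	 */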
From patchwork Sat Dec 14 00:47:44 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access()
Date: Fri, 13 Dec 2019 16:47:44 -0800
Message-ID: <20191214004744.1652395-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>

It allows a bpf prog (e.g. tracing) to attach to a kernel function
that takes an enum argument.
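[Editorial illustration] A minimal sketch of what now loads (the traced
function and its enum are hypothetical, assumed visible via the kernel's BTF):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* hypothetical target:
	 * void foo_set_state(struct foo *f, enum foo_state st);
	 */
	SEC("fentry/foo_set_state")
	int BPF_PROG(trace_state, struct foo *f, enum foo_state st)
	{
		/* the enum arg is now read as a scalar, like an int */
		bpf_printk("state %d", st);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";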
Signed-off-by: Martin KaFai Lau
Acked-by: Yonghong Song
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 88359a4bccb0..6e652643849b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3676,7 +3676,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
	/* skip modifiers */
	while (btf_type_is_modifier(t))
		t = btf_type_by_id(btf, t->type);
-	if (btf_type_is_int(t))
+	if (btf_type_is_int(t) || btf_type_is_enum(t))
		/* accessing a scalar */
		return true;
	if (!btf_type_is_ptr(t)) {
From patchwork Sat Dec 14 00:47:46 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access
Date: Fri, 13 Dec 2019 16:47:46 -0800
Message-ID: <20191214004746.1652586-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>

This patch allows bitfield access as a scalar.  It currently limits
the access to sizeof(u64) and up to the end of the struct.

It is needed in a later bpf-tcp-cc example that reads bitfields from
inet_connection_sock and tcp_sock.

Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/btf.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6e652643849b..011194831499 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
	}

	for_each_member(i, t, member) {
-		if (btf_member_bitfield_size(t, member))
-			/* bitfields are not supported yet */
-			continue;
-
		/* offset of the field in bytes */
		moff = btf_member_bit_offset(t, member) / 8;
		if (off + size <= moff)
@@ -3757,6 +3753,15 @@ int btf_struct_access(struct bpf_verifier_log *log,
		if (off < moff)
			continue;

+		if (btf_member_bitfield_size(t, member)) {
+			if (off == moff &&
+			    !(btf_member_bit_offset(t, member) % 8) &&
+			    size <= sizeof(u64) &&
+			    off + size <= t->size)
+				return SCALAR_VALUE;
+			continue;
+		}
+
		/* type of the field */
		mtype = btf_type_by_id(btf_vmlinux, member->type);
		mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
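[Editorial illustration] A hedged sketch of the new rule (hypothetical tracing
program; the series' real user is the later bpf-tcp-cc example): a read that
starts at a byte-aligned bitfield, is at most sizeof(u64) wide, and stays
within the struct is now allowed as a scalar:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	SEC("fentry/kfree_skb")
	int BPF_PROG(trace_kfree, struct sk_buff *skb)
	{
		/* "cloned" is a bitfield; clang emits a byte-aligned
		 * load of the containing byte, which
		 * btf_struct_access() now returns as SCALAR_VALUE.
		 */
		if (skb->cloned)
			bpf_printk("cloned skb freed");
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";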
From patchwork Sat Dec 14 00:47:48 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
Date: Fri, 13 Dec 2019 16:47:48 -0800
Message-ID: <20191214004748.1652668-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>

This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF.  The first use case in this series is the
"struct tcp_congestion_ops", which will be introduced in a later
patch.

This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
A BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
of a kernel struct.  The attr->expected_attach_type is the member
"index" of that kernel struct.  The first member of a struct starts
with member index 0.  That avoids ambiguity when a kernel struct
has multiple func ptrs with the same func signature.

For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written to implement
the "init" func ptr of the "struct tcp_congestion_ops".  The
attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel.  The attr->expected_attach_type is 3.

The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline, which will be done in the next patch
when introducing BPF_MAP_TYPE_STRUCT_OPS.

"struct bpf_struct_ops" is introduced as a common interface for the
kernel struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The
supporting kernel struct will need to implement an instance of the
"struct bpf_struct_ops".

The supporting kernel struct also needs to implement a
bpf_verifier_ops.  During BPF_PROG_LOAD, bpf_struct_ops_find() will
find the right bpf_verifier_ops by searching the
attr->attach_btf_id.
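[Editorial illustration] A hedged userspace sketch of the load-time
attributes described above (the btf id is looked up in the running kernel's
BTF; ptr_to_u64(), insns, and insn_cnt are assumed to exist):

	union bpf_attr attr = {};
	int prog_fd;

	attr.prog_type = BPF_PROG_TYPE_STRUCT_OPS;
	attr.insns = ptr_to_u64(insns);		/* the "init" prog */
	attr.insn_cnt = insn_cnt;
	attr.license = ptr_to_u64("GPL");
	/* btf id of "struct tcp_congestion_ops" in vmlinux BTF */
	attr.attach_btf_id = tcp_congestion_ops_btf_id;
	/* member index of the "init" func ptr in that struct */
	attr.expected_attach_type = 3;

	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));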
A new "btf_struct_access" is also added to the bpf_verifier_ops such that the supporting kernel struct can optionally provide its own specific check on accessing the func arg (e.g. provide limited write access). After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called to initialize some values (e.g. the btf id of the supporting kernel struct) and it can only be done once the btf_vmlinux is available. The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog if the return type of the prog->aux->attach_func_proto is "void". Signed-off-by: Martin KaFai Lau --- include/linux/bpf.h | 30 +++++++ include/linux/bpf_types.h | 4 + include/linux/btf.h | 34 ++++++++ include/uapi/linux/bpf.h | 1 + kernel/bpf/Makefile | 2 +- kernel/bpf/bpf_struct_ops.c | 124 +++++++++++++++++++++++++++ kernel/bpf/bpf_struct_ops_types.h | 4 + kernel/bpf/btf.c | 88 ++++++++++++++------ kernel/bpf/syscall.c | 17 ++-- kernel/bpf/verifier.c | 134 +++++++++++++++++++++++------- 10 files changed, 374 insertions(+), 64 deletions(-) create mode 100644 kernel/bpf/bpf_struct_ops.c create mode 100644 kernel/bpf/bpf_struct_ops_types.h diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d467983e61bb..1f0a5fc8c5ee 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -349,6 +349,10 @@ struct bpf_verifier_ops { const struct bpf_insn *src, struct bpf_insn *dst, struct bpf_prog *prog, u32 *target_size); + int (*btf_struct_access)(struct bpf_verifier_log *log, + const struct btf_type *t, int off, int size, + enum bpf_access_type atype, + u32 *next_btf_id); }; struct bpf_prog_offload_ops { @@ -667,6 +671,32 @@ struct bpf_array_aux { struct work_struct work; }; +struct btf_type; +struct btf_member; + +#define BPF_STRUCT_OPS_MAX_NR_MEMBERS 64 +struct bpf_struct_ops { + const struct bpf_verifier_ops *verifier_ops; + int (*init)(struct btf *_btf_vmlinux); + int (*check_member)(const struct btf_type *t, + const struct btf_member *member); + const struct btf_type *type; + const char *name; + struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; + u32 type_id; +}; + +#if defined(CONFIG_BPF_JIT) +const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id); +void bpf_struct_ops_init(struct btf *_btf_vmlinux); +#else +static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) +{ + return NULL; +} +static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { } +#endif + struct bpf_array { struct bpf_map map; u32 elem_size; diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 93740b3614d7..fadd243ffa2d 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -65,6 +65,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2, BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport, struct sk_reuseport_md, struct sk_reuseport_kern) #endif +#if defined(CONFIG_BPF_JIT) +BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops, + void *, void *) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/btf.h b/include/linux/btf.h index 79d4abc2556a..f74a09a7120b 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -53,6 +53,18 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, u32 expected_offset, u32 expected_size); int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t); bool btf_type_is_void(const struct btf_type *t); +s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind); 
+const struct btf_type *btf_type_skip_modifiers(const struct btf *btf, + u32 id, u32 *res_id); +const struct btf_type *btf_type_resolve_ptr(const struct btf *btf, + u32 id, u32 *res_id); +const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf, + u32 id, u32 *res_id); + +#define for_each_member(i, struct_type, member) \ + for (i = 0, member = btf_type_member(struct_type); \ + i < btf_type_vlen(struct_type); \ + i++, member++) static inline bool btf_type_is_ptr(const struct btf_type *t) { @@ -84,6 +96,28 @@ static inline bool btf_type_is_func_proto(const struct btf_type *t) return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC_PROTO; } +static inline u16 btf_type_vlen(const struct btf_type *t) +{ + return BTF_INFO_VLEN(t->info); +} + +static inline bool btf_type_kflag(const struct btf_type *t) +{ + return BTF_INFO_KFLAG(t->info); +} + +static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type, + const struct btf_member *member) +{ + return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset) + : 0; +} + +static inline const struct btf_member *btf_type_member(const struct btf_type *t) +{ + return (const struct btf_member *)(t + 1); +} + #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); const char *btf_name_by_offset(const struct btf *btf, u32 offset); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index dbbcf0b02970..12900dfa1461 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -174,6 +174,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_PROG_TYPE_TRACING, + BPF_PROG_TYPE_STRUCT_OPS, }; enum bpf_attach_type { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index d4f330351f87..0e636387db6f 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -6,7 +6,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o obj-$(CONFIG_BPF_SYSCALL) += disasm.o -obj-$(CONFIG_BPF_JIT) += trampoline.o +obj-$(CONFIG_BPF_JIT) += trampoline.o bpf_struct_ops.o obj-$(CONFIG_BPF_SYSCALL) += btf.o obj-$(CONFIG_BPF_JIT) += dispatcher.o ifeq ($(CONFIG_NET),y) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c new file mode 100644 index 000000000000..817d5aac42e5 --- /dev/null +++ b/kernel/bpf/bpf_struct_ops.c @@ -0,0 +1,124 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2019 Facebook + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define BPF_STRUCT_OPS_TYPE(_name) \ +extern struct bpf_struct_ops bpf_##_name; +#include "bpf_struct_ops_types.h" +#undef BPF_STRUCT_OPS_TYPE + +enum { +#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name, +#include "bpf_struct_ops_types.h" +#undef BPF_STRUCT_OPS_TYPE + __NR_BPF_STRUCT_OPS_TYPE, +}; + +static struct bpf_struct_ops * const bpf_struct_ops[] = { +#define BPF_STRUCT_OPS_TYPE(_name) \ + [BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name, +#include "bpf_struct_ops_types.h" +#undef BPF_STRUCT_OPS_TYPE +}; + +const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = { +}; + +const struct bpf_prog_ops bpf_struct_ops_prog_ops = { +}; + +void bpf_struct_ops_init(struct btf *_btf_vmlinux) +{ + const struct btf_member *member; + struct bpf_struct_ops *st_ops; + struct bpf_verifier_log log = {}; + const struct 
btf_type *t; + const char *mname; + s32 type_id; + u32 i, j; + + for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { + st_ops = bpf_struct_ops[i]; + + type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name, + BTF_KIND_STRUCT); + if (type_id < 0) { + pr_warn("Cannot find struct %s in btf_vmlinux\n", + st_ops->name); + continue; + } + t = btf_type_by_id(_btf_vmlinux, type_id); + if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) { + pr_warn("Cannot support #%u members in struct %s\n", + btf_type_vlen(t), st_ops->name); + continue; + } + + for_each_member(j, t, member) { + const struct btf_type *func_proto; + + mname = btf_name_by_offset(_btf_vmlinux, + member->name_off); + if (!*mname) { + pr_warn("anon member in struct %s is not supported\n", + st_ops->name); + break; + } + + if (btf_member_bitfield_size(t, member)) { + pr_warn("bit field member %s in struct %s is not supported\n", + mname, st_ops->name); + break; + } + + func_proto = btf_type_resolve_func_ptr(_btf_vmlinux, + member->type, + NULL); + if (func_proto && + btf_distill_func_proto(&log, _btf_vmlinux, + func_proto, mname, + &st_ops->func_models[j])) { + pr_warn("Error in parsing func ptr %s in struct %s\n", + mname, st_ops->name); + break; + } + } + + if (j == btf_type_vlen(t)) { + if (st_ops->init(_btf_vmlinux)) { + pr_warn("Error in init bpf_struct_ops %s\n", + st_ops->name); + } else { + st_ops->type_id = type_id; + st_ops->type = t; + } + } + } +} + +extern struct btf *btf_vmlinux; + +const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) +{ + unsigned int i; + + if (!type_id || !btf_vmlinux) + return NULL; + + for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { + if (bpf_struct_ops[i]->type_id == type_id) + return bpf_struct_ops[i]; + } + + return NULL; +} diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h new file mode 100644 index 000000000000..7bb13ff49ec2 --- /dev/null +++ b/kernel/bpf/bpf_struct_ops_types.h @@ -0,0 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* internal file - do not include directly */ + +/* To be filled in a later patch */ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 011194831499..16924e5fa126 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -180,11 +180,6 @@ */ #define BTF_MAX_SIZE (16 * 1024 * 1024) -#define for_each_member(i, struct_type, member) \ - for (i = 0, member = btf_type_member(struct_type); \ - i < btf_type_vlen(struct_type); \ - i++, member++) - #define for_each_member_from(i, from, struct_type, member) \ for (i = from, member = btf_type_member(struct_type) + from; \ i < btf_type_vlen(struct_type); \ @@ -382,6 +377,65 @@ static bool btf_type_is_datasec(const struct btf_type *t) return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC; } +s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind) +{ + const struct btf_type *t; + const char *tname; + u32 i; + + for (i = 1; i <= btf->nr_types; i++) { + t = btf->types[i]; + if (BTF_INFO_KIND(t->info) != kind) + continue; + + tname = btf_name_by_offset(btf, t->name_off); + if (!strcmp(tname, name)) + return i; + } + + return -ENOENT; +} + +const struct btf_type *btf_type_skip_modifiers(const struct btf *btf, + u32 id, u32 *res_id) +{ + const struct btf_type *t = btf_type_by_id(btf, id); + + while (btf_type_is_modifier(t)) { + id = t->type; + t = btf_type_by_id(btf, t->type); + } + + if (res_id) + *res_id = id; + + return t; +} + +const struct btf_type *btf_type_resolve_ptr(const struct btf *btf, + u32 id, u32 *res_id) +{ + const struct btf_type *t; + + t = 
btf_type_skip_modifiers(btf, id, NULL); + if (!btf_type_is_ptr(t)) + return NULL; + + return btf_type_skip_modifiers(btf, t->type, res_id); +} + +const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf, + u32 id, u32 *res_id) +{ + const struct btf_type *ptype; + + ptype = btf_type_resolve_ptr(btf, id, res_id); + if (ptype && btf_type_is_func_proto(ptype)) + return ptype; + + return NULL; +} + /* Types that act only as a source, not sink or intermediate * type when resolving. */ @@ -446,16 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding) return "UNKN"; } -static u16 btf_type_vlen(const struct btf_type *t) -{ - return BTF_INFO_VLEN(t->info); -} - -static bool btf_type_kflag(const struct btf_type *t) -{ - return BTF_INFO_KFLAG(t->info); -} - static u32 btf_member_bit_offset(const struct btf_type *struct_type, const struct btf_member *member) { @@ -463,13 +507,6 @@ static u32 btf_member_bit_offset(const struct btf_type *struct_type, : member->offset; } -static u32 btf_member_bitfield_size(const struct btf_type *struct_type, - const struct btf_member *member) -{ - return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset) - : 0; -} - static u32 btf_type_int(const struct btf_type *t) { return *(u32 *)(t + 1); @@ -480,11 +517,6 @@ static const struct btf_array *btf_type_array(const struct btf_type *t) return (const struct btf_array *)(t + 1); } -static const struct btf_member *btf_type_member(const struct btf_type *t) -{ - return (const struct btf_member *)(t + 1); -} - static const struct btf_enum *btf_type_enum(const struct btf_type *t) { return (const struct btf_enum *)(t + 1); @@ -3604,6 +3636,8 @@ struct btf *btf_parse_vmlinux(void) goto errout; } + bpf_struct_ops_init(btf); + btf_verifier_env_free(env); refcount_set(&btf->refcnt, 1); return btf; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b08c362f4e02..19b2d57f7c04 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1672,17 +1672,22 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, enum bpf_attach_type expected_attach_type, u32 btf_id, u32 prog_fd) { - switch (prog_type) { - case BPF_PROG_TYPE_TRACING: + if (btf_id) { if (btf_id > BTF_MAX_TYPE) return -EINVAL; - break; - default: - if (btf_id || prog_fd) + + switch (prog_type) { + case BPF_PROG_TYPE_TRACING: + case BPF_PROG_TYPE_STRUCT_OPS: + break; + default: return -EINVAL; - break; + } } + if (prog_fd && prog_type != BPF_PROG_TYPE_TRACING) + return -EINVAL; + switch (prog_type) { case BPF_PROG_TYPE_CGROUP_SOCK: switch (expected_attach_type) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 408264c1d55b..4c1eaa1a2965 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2858,11 +2858,6 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, u32 btf_id; int ret; - if (atype != BPF_READ) { - verbose(env, "only read is supported\n"); - return -EACCES; - } - if (off < 0) { verbose(env, "R%d is ptr_%s invalid negative access: off=%d\n", @@ -2879,17 +2874,32 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, return -EACCES; } - ret = btf_struct_access(&env->log, t, off, size, atype, &btf_id); + if (env->ops->btf_struct_access) { + ret = env->ops->btf_struct_access(&env->log, t, off, size, + atype, &btf_id); + } else { + if (atype != BPF_READ) { + verbose(env, "only read is supported\n"); + return -EACCES; + } + + ret = btf_struct_access(&env->log, t, off, size, atype, + &btf_id); + } + if (ret < 0) return ret; - if (ret == SCALAR_VALUE) { - 
mark_reg_unknown(env, regs, value_regno); - return 0; + if (atype == BPF_READ) { + if (ret == SCALAR_VALUE) { + mark_reg_unknown(env, regs, value_regno); + return 0; + } + mark_reg_known_zero(env, regs, value_regno); + regs[value_regno].type = PTR_TO_BTF_ID; + regs[value_regno].btf_id = btf_id; } - mark_reg_known_zero(env, regs, value_regno); - regs[value_regno].type = PTR_TO_BTF_ID; - regs[value_regno].btf_id = btf_id; + return 0; } @@ -6343,8 +6353,30 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) static int check_return_code(struct bpf_verifier_env *env) { struct tnum enforce_attach_type_range = tnum_unknown; + const struct bpf_prog *prog = env->prog; struct bpf_reg_state *reg; struct tnum range = tnum_range(0, 1); + int err; + + /* The struct_ops func-ptr's return type could be "void" */ + if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS && + !prog->aux->attach_func_proto->type) + return 0; + + /* eBPF calling convetion is such that R0 is used + * to return the value from eBPF program. + * Make sure that it's readable at this time + * of bpf_exit, which means that program wrote + * something into it earlier + */ + err = check_reg_arg(env, BPF_REG_0, SRC_OP); + if (err) + return err; + + if (is_pointer_value(env, BPF_REG_0)) { + verbose(env, "R0 leaks addr as return value\n"); + return -EACCES; + } switch (env->prog->type) { case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: @@ -8010,21 +8042,6 @@ static int do_check(struct bpf_verifier_env *env) if (err) return err; - /* eBPF calling convetion is such that R0 is used - * to return the value from eBPF program. - * Make sure that it's readable at this time - * of bpf_exit, which means that program wrote - * something into it earlier - */ - err = check_reg_arg(env, BPF_REG_0, SRC_OP); - if (err) - return err; - - if (is_pointer_value(env, BPF_REG_0)) { - verbose(env, "R0 leaks addr as return value\n"); - return -EACCES; - } - err = check_return_code(env); if (err) return err; @@ -8833,12 +8850,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) convert_ctx_access = bpf_xdp_sock_convert_ctx_access; break; case PTR_TO_BTF_ID: - if (type == BPF_WRITE) { + if (type == BPF_READ) { + insn->code = BPF_LDX | BPF_PROBE_MEM | + BPF_SIZE((insn)->code); + env->prog->aux->num_exentries++; + } else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) { verbose(env, "Writes through BTF pointers are not allowed\n"); return -EINVAL; } - insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code); - env->prog->aux->num_exentries++; continue; default: continue; @@ -9505,6 +9524,58 @@ static void print_verification_stats(struct bpf_verifier_env *env) env->peak_states, env->longest_mark_read_walk); } +static int check_struct_ops_btf_id(struct bpf_verifier_env *env) +{ + const struct btf_type *t, *func_proto; + const struct bpf_struct_ops *st_ops; + const struct btf_member *member; + struct bpf_prog *prog = env->prog; + u32 btf_id, member_idx; + const char *mname; + + btf_id = prog->aux->attach_btf_id; + st_ops = bpf_struct_ops_find(btf_id); + if (!st_ops) { + verbose(env, "attach_btf_id %u is not a supported struct\n", + btf_id); + return -ENOTSUPP; + } + + t = st_ops->type; + member_idx = prog->expected_attach_type; + if (member_idx >= btf_type_vlen(t)) { + verbose(env, "attach to invalid member idx %u of struct %s\n", + member_idx, st_ops->name); + return -EINVAL; + } + + member = &btf_type_member(t)[member_idx]; + mname = btf_name_by_offset(btf_vmlinux, member->name_off); + func_proto = btf_type_resolve_func_ptr(btf_vmlinux, 
member->type, + NULL); + if (!func_proto) { + verbose(env, "attach to invalid member %s(@idx %u) of struct %s\n", + mname, member_idx, st_ops->name); + return -EINVAL; + } + + if (st_ops->check_member) { + int err = st_ops->check_member(t, member); + + if (err) { + verbose(env, "attach to unsupported member %s of struct %s\n", + mname, st_ops->name); + return err; + } + } + + prog->aux->attach_func_proto = func_proto; + prog->aux->attach_func_name = mname; + env->ops = st_ops->verifier_ops; + + return 0; +} + static int check_attach_btf_id(struct bpf_verifier_env *env) { struct bpf_prog *prog = env->prog; @@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) long addr; u64 key; + if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) + return check_struct_ops_btf_id(env); + if (prog->type != BPF_PROG_TYPE_TRACING) return 0; From patchwork Sat Dec 14 00:47:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209571 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="Appw2TWg"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47ZTR43fwjz9sNH for ; Sat, 14 Dec 2019 11:48:00 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726907AbfLNAsA (ORCPT ); Fri, 13 Dec 2019 19:48:00 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:64986 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726861AbfLNAr6 (ORCPT ); Fri, 13 Dec 2019 19:47:58 -0500 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xBE0UsFg000695 for ; Fri, 13 Dec 2019 16:47:54 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=9Bp6jpkLsBCx5iEmRdZfJj0ff5C1rqDS0/5/7Sw9gXs=; b=Appw2TWghPACDArqWgq6pNtRx9OF28oWpmmdt2jARAQ5jmsz0975Fkq2WRq9cjOaSbbe UDIONZZYOrjv2pOUl2wluST/5w5toou0hakJvwz3eJejJ0dNyrfGQrW1P3Fn/KRbBpfy E/BI+v3OxgGweDC58AvdGHR2OLjcBnbw31Y= Received: from mail.thefacebook.com (mailout.thefacebook.com [199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2wv1r0vwf8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Fri, 13 Dec 2019 16:47:53 -0800 Received: from intmgw003.06.prn3.facebook.com (2620:10d:c081:10::13) by mail.thefacebook.com (2620:10d:c081:35::130) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id 15.1.1713.5; Fri, 13 Dec 2019 16:47:52 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 6FDF12943AB4; Fri, 13 Dec 2019 16:47:51 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , David Miller , , Smtp-Origin-Cluster: ftw2c04 Subject: 
From patchwork Sat Dec 14 00:47:51 2019
From: Martin KaFai Lau
Cc: Alexei Starovoitov, Daniel Borkmann, David Miller
Subject: [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
Date: Fri, 13 Dec 2019 16:47:51 -0800
Message-ID: <20191214004751.1652774-1-kafai@fb.com>
In-Reply-To: <20191214004737.1652076-1-kafai@fb.com>
References: <20191214004737.1652076-1-kafai@fb.com>

The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value is a
kernel struct with its func ptrs implemented in bpf progs.  This new
map is the interface to register/unregister/introspect a bpf
implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embedded in:

	struct __bpf_tcp_congestion_ops {
		refcount_t refcnt;
		enum bpf_struct_ops_state state;
		struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
	}

The map value is "struct __bpf_tcp_congestion_ops".  "bpftool map
dump" will then be able to show the state ("inuse"/"tobefree") and
the number of subsystem refcnts (e.g. the number of tcp_sock in the
tcp_congestion_ops case).  This "value" struct is created
automatically by a macro.  Having a separate "value" struct will
also make extending "struct __bpf_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct __bpf_XYZ" to do some initialization
work before registering the struct_ops to the kernel subsystem).
libbpf will take care of finding and populating the
"struct __bpf_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem (a hedged userspace
sketch of steps (2)-(4) follows this list):

1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s).

2. Create a BPF_MAP_TYPE_STRUCT_OPS with
   attr->btf_vmlinux_value_type_id set to the btf id of
   "struct __bpf_tcp_congestion_ops" in the running kernel.
   Instead of reusing attr->btf_value_type_id,
   btf_vmlinux_value_type_id is added such that attr->btf_fd can
   still be used as the "user" btf, which could store other useful
   sysadmin/debug info that may be introduced in the future,
   e.g. creation-date/compiler-details/map-creator...etc.

3. Create a "struct __bpf_tcp_congestion_ops" object as described in
   the running kernel btf.  Populate the value of this object.
   The func ptrs should be populated with the prog fds.

4. Call BPF_MAP_UPDATE with the object created in (3) as the map
   value.  The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in
"st_ops->init_member()" (e.g. ensure all mandatory func ptrs are
implemented).  If everything looks good, it will register this
kernel struct to the kernel subsystem.  The map will not allow
further updates from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.
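[Editorial illustration] The sketch referenced above; the value struct layout
is assumed to have been dumped from the running kernel's BTF (e.g. via
"bpftool btf dump"), and value_type_id, init_prog_fd, and ptr_to_u64() are
placeholders:

	__u32 key = 0;
	union bpf_attr attr = {};
	int map_fd;

	/* (2) create the map keyed by the vmlinux value type */
	attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size = sizeof(__u32);
	attr.value_size = sizeof(struct __bpf_tcp_congestion_ops);
	attr.max_entries = 1;
	attr.btf_vmlinux_value_type_id = value_type_id;
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	/* (3) populate the value; func ptrs carry struct_ops prog fds */
	struct __bpf_tcp_congestion_ops val = {};
	val.data.init = (void *)(long)init_prog_fd;
	/* ... fill the other members, name, etc. ... */

	/* (4) update key 0 to register with the subsystem */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = ptr_to_u64(&key);
	attr.value = ptr_to_u64(&val);
	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));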
The map value state (enum bpf_struct_ops_state) will transit from:

	INIT (map created) =>
	INUSE (map updated, i.e. reg) =>
	TOBEFREE (map value deleted, i.e. unreg)

Note that the above state is not exposed to the uapi/bpf.h.  It will
be obtained from the btf of the running kernel.

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct __bpf_XYZ".  This patch uses a separate refcnt for the
purpose of tracking the subsystem usage.  Another approach is to
reuse the map->refcnt and then "show" (i.e. during map_lookup) the
subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one count to
map->refcnt.  When the very last subsystem's refcnt is gone, it will
also release the map->refcnt.  All bpf_prog will be freed when the
map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:

	[root@arch-fb-vm1 bpf]# bpftool map show
	6: struct_ops  name dctcp  flags 0x0
		key 4B  value 256B  max_entries 1  memlock 4096B
		btf_id 6
	[root@arch-fb-vm1 bpf]# bpftool map dump id 6
	[{
	        "value": {
	            "refcnt": {
	                "refs": {
	                    "counter": 1
	                }
	            },
	            "state": 1,
	            "data": {
	                "list": {
	                    "next": 0,
	                    "prev": 0
	                },
	                "key": 0,
	                "flags": 2,
	                "init": 24,
	                "release": 0,
	                "ssthresh": 25,
	                "cong_avoid": 30,
	                "set_state": 27,
	                "cwnd_event": 28,
	                "in_ack_event": 26,
	                "undo_cwnd": 29,
	                "pkts_acked": 0,
	                "min_tso_segs": 0,
	                "sndbuf_expand": 0,
	                "cong_control": 0,
	                "get_info": 0,
	                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
	                ],
	                "owner": 0
	            }
	        }
	    }
	]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an in-place update on "*value" instead of returning a
  pointer to syscall.c.  Otherwise, it would need a separate copy of
  the "zero" value for the BPF_STRUCT_OPS_STATE_INIT case to avoid
  races.

* bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  It is because the
  "->unreg()" may require a sleepable context, e.g. the
  "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function args to avoid a compiler warning caused by this patch.
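[Editorial illustration] A sketch of the subsystem side (hedged; the actual
tcp_congestion_ops wiring arrives later in the series): a bpf struct_ops is
pinned and released through the same owner-based calls used for modules:

	/* "ca" is a registered tcp_congestion_ops.  For a
	 * bpf-implemented one, ca->owner is BPF_MODULE_OWNER, so
	 * these calls route to bpf_struct_ops_get()/bpf_struct_ops_put()
	 * instead of the module refcount.
	 */
	if (!bpf_try_module_get(ca, ca->owner))
		return -EBUSY;

	/* ... use ca ... */

	bpf_module_put(ca, ca->owner);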
Signed-off-by: Martin KaFai Lau
---
 arch/x86/net/bpf_jit_comp.c |  10 +-
 include/linux/bpf.h         |  49 +++-
 include/linux/bpf_types.h   |   3 +
 include/linux/btf.h         |  11 +
 include/uapi/linux/bpf.h    |   7 +-
 kernel/bpf/bpf_struct_ops.c | 465 +++++++++++++++++++++++++++++++++++-
 kernel/bpf/btf.c            |  20 +-
 kernel/bpf/map_in_map.c     |   3 +-
 kernel/bpf/syscall.c        |  47 ++--
 kernel/bpf/trampoline.c     |   5 +-
 kernel/bpf/verifier.c       |   5 +
 11 files changed, 585 insertions(+), 40 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4c8a2d1f8470..0b9b486432bd 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1328,7 +1328,7 @@ xadd:			if (is_imm8(insn->off))
	return proglen;
 }

-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
		      int stack_size)
 {
	int i;
@@ -1344,7 +1344,7 @@ static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
			 -(stack_size - i * 8));
 }

-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
			 int stack_size)
 {
	int i;
@@ -1361,7 +1361,7 @@ static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
			 -(stack_size - i * 8));
 }

-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
	u8 *prog = *pprog;
@@ -1456,7 +1456,7 @@ static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
 *	add rsp, 8                      // skip eth_type_trans's frame
 *	ret                             // return to its caller
 */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, const struct btf_func_model *m, u32 flags,
				struct bpf_prog **fentry_progs, int fentry_cnt,
				struct bpf_prog **fexit_progs, int fexit_cnt,
				void *orig_call)
@@ -1529,7 +1529,7 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
	 */
	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
		return -EFAULT;
-	return 0;
+	return (void *)prog - image;
 }

 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1f0a5fc8c5ee..349cedd7b97b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -17,6 +17,7 @@
 #include <linux/numa.h>
 #include <linux/wait.h>
 #include <linux/u64_stats_sync.h>
+#include <linux/module.h>

 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -106,6 +107,7 @@ struct bpf_map {
	struct btf *btf;
	struct bpf_map_memory memory;
	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
	bool unpriv_array;
	bool frozen; /* write-once; write-protected by freeze_mutex */
	/* 22 bytes hole */
@@ -183,7 +185,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map)

 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }

 int map_check_no_btf(const struct bpf_map *map,
@@ -441,7 +444,8 @@ struct btf_func_model {
 * fentry = a set of program to run before calling original function
 * fexit = a set of program to run after original function
 */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
				struct bpf_prog **fentry_progs, int fentry_cnt,
				struct bpf_prog **fexit_progs, int fexit_cnt,
				void *orig_call);
@@ -671,6 +675,7 @@ struct bpf_array_aux {
	struct work_struct work;
 };

+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;

@@ -680,21 +685,61 @@ struct bpf_struct_ops {
	int (*init)(struct btf *_btf_vmlinux);
	int (*check_member)(const struct btf_type *t,
			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
	const struct btf_type *type;
+	const struct btf_type *value_type;
	const char *name;
	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
	u32 type_id;
+	u32 value_id;
 };

 #if defined(CONFIG_BPF_JIT)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif

 struct bpf_array {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fadd243ffa2d..9f326e6ef885 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f74a09a7120b..49094564f1f1 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -60,6 +60,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
					    u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);

 #define for_each_member(i, struct_type, member)		\
	for (i = 0, member = btf_type_member(struct_type);	\
@@ -106,6 +110,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
	return BTF_INFO_KFLAG(t->info);
 }

+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
+}
+
 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
					    const struct btf_member *member)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 12900dfa1461..8809212d9d6c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@ enum bpf_map_type {
	BPF_MAP_TYPE_STACK,
	BPF_MAP_TYPE_SK_STORAGE,
	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };

 /* Note that tracing related programs such as
@@ -392,6 +393,10 @@ union bpf_attr {
		__u32	btf_fd;		/* fd pointing to a BTF type data */
		__u32	btf_key_type_id;	/* BTF type_id of the key */
		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
	};

	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -3340,7 +3345,7 @@ struct bpf_map_info {
	__u32 map_flags;
	char  name[BPF_OBJ_NAME_LEN];
	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
	__u64 netns_dev;
	__u64 netns_ino;
	__u32 btf_id;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 817d5aac42e5..00f49ac1342d 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -12,8 +12,68 @@
 #include <linux/seq_file.h>
 #include <linux/refcount.h>

+enum bpf_struct_ops_state {
+	BPF_STRUCT_OPS_STATE_INIT,
+	BPF_STRUCT_OPS_STATE_INUSE,
+	BPF_STRUCT_OPS_STATE_TOBEFREE,
+};
+
+#define BPF_STRUCT_OPS_COMMON_VALUE			\
+	refcount_t refcnt;				\
+	enum bpf_struct_ops_state state
+
+struct bpf_struct_ops_value {
+	BPF_STRUCT_OPS_COMMON_VALUE;
+	char data[0] ____cacheline_aligned_in_smp;
+};
+
+struct bpf_struct_ops_map {
+	struct bpf_map map;
+	const struct bpf_struct_ops *st_ops;
+	/* protect map_update */
+	spinlock_t lock;
+	/* progs has all the bpf_prog that is populated
+	 * to the func ptr of the kernel's struct
+	 * (in kvalue.data).
+	 */
+	struct bpf_prog **progs;
+	/* image is a page that has all the trampolines
+	 * that stores the func args before calling the bpf_prog.
+	 * A PAGE_SIZE "image" is enough to store all trampoline for
+	 * "progs[]".
+	 */
+	void *image;
+	/* uvalue->data stores the kernel struct
+	 * (e.g. tcp_congestion_ops) that is more useful
+	 * to userspace than the kvalue.  For example,
+	 * the bpf_prog's id is stored instead of the kernel
+	 * address of a func ptr.
+	 */
+	struct bpf_struct_ops_value *uvalue;
+	/* kvalue.data stores the actual kernel's struct
+	 * (e.g. tcp_congestion_ops) that will be
+	 * registered to the kernel subsystem.
+	 */
+	struct bpf_struct_ops_value kvalue;
+};
+
+#define VALUE_PREFIX "__bpf_"
+#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
+
+/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
+ * exposed to the userspace and its btf-type-id is stored
+ * at the map->btf_vmlinux_value_type_id.
+ *
+ * The *_name##_dummy is to ensure the BTF type is emitted.
+ */
 #define BPF_STRUCT_OPS_TYPE(_name)			\
-extern struct bpf_struct_ops bpf_##_name;
+extern struct bpf_struct_ops bpf_##_name;		\
+							\
+static struct __bpf_##_name {				\
+	BPF_STRUCT_OPS_COMMON_VALUE;			\
+	struct _name data ____cacheline_aligned_in_smp;	\
+} *_name##_dummy;
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
@@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
 const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 };

+static const struct btf_type *module_type;
+
 void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 {
+	char value_name[128] = VALUE_PREFIX;
+	s32 type_id, value_id, module_id;
	const struct btf_member *member;
	struct bpf_struct_ops *st_ops;
	struct bpf_verifier_log log = {};
	const struct btf_type *t;
	const char *mname;
-	s32 type_id;
	u32 i, j;

+	/* Avoid unused var compiler warning */
+#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
+					  BTF_KIND_STRUCT);
+	if (module_id < 0) {
+		pr_warn("Cannot find struct module in btf_vmlinux\n");
+		return;
+	}
+	module_type = btf_type_by_id(_btf_vmlinux, module_id);
+
	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
		st_ops = bpf_struct_ops[i];

+		value_name[VALUE_PREFIX_LEN] = '\0';
+		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
+			sizeof(value_name) - VALUE_PREFIX_LEN - 1);
+		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
+						 BTF_KIND_STRUCT);
+		if (value_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				value_name);
+			continue;
+		}
+
		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
						BTF_KIND_STRUCT);
		if (type_id < 0) {
@@ -101,6 +188,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
			} else {
				st_ops->type_id = type_id;
				st_ops->type = t;
+				st_ops->value_id = value_id;
+				st_ops->value_type =
+					btf_type_by_id(_btf_vmlinux, value_id);
			}
		}
	}
@@ -108,6 +198,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)

 extern struct btf *btf_vmlinux;

+static const struct bpf_struct_ops *
+bpf_struct_ops_find_value(u32 value_id)
+{
+	unsigned int i;
+
+	if (!value_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->value_id == value_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
+
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
	unsigned int i;
@@ -122,3 +228,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)

	return NULL;
 }
+
+static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
+					   void *next_key)
+{
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = (u32 *)next_key;
+
+	if (index >= map->max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == map->max_entries - 1)
+		return -ENOENT;
+
+	*next = index + 1;
+	return 0;
+}
+
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	enum bpf_struct_ops_state state;
+
+	if (unlikely(*(u32 *)key != 0))
+		return -ENOENT;
+
+	kvalue = &st_map->kvalue;
+	state = smp_load_acquire(&kvalue->state);
+	if (state == BPF_STRUCT_OPS_STATE_INIT) {
+		memset(value, 0, map->value_size);
+		return 0;
+	}
+
+	/* No lock is needed.  state and refcnt do not need
+	 * to be updated together under atomic context.
+ */ + uvalue = (struct bpf_struct_ops_value *)value; + memcpy(uvalue, st_map->uvalue, map->value_size); + uvalue->state = state; + refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt)); + + return 0; +} + +static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key) +{ + return ERR_PTR(-EINVAL); +} + +static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) +{ + const struct btf_type *t = st_map->st_ops->type; + u32 i; + + for (i = 0; i < btf_type_vlen(t); i++) { + if (st_map->progs[i]) { + bpf_prog_put(st_map->progs[i]); + st_map->progs[i] = NULL; + } + } +} + +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + const struct bpf_struct_ops *st_ops = st_map->st_ops; + struct bpf_struct_ops_value *uvalue, *kvalue; + const struct btf_member *member; + const struct btf_type *t = st_ops->type; + void *udata, *kdata; + int prog_fd, err = 0; + void *image; + u32 i; + + if (flags) + return -EINVAL; + + if (*(u32 *)key != 0) + return -E2BIG; + + uvalue = (struct bpf_struct_ops_value *)value; + if (uvalue->state || refcount_read(&uvalue->refcnt)) + return -EINVAL; + + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; + + spin_lock(&st_map->lock); + + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { + err = -EBUSY; + goto unlock; + } + + memcpy(uvalue, value, map->value_size); + + udata = &uvalue->data; + kdata = &kvalue->data; + image = st_map->image; + + for_each_member(i, t, member) { + const struct btf_type *mtype, *ptype; + struct bpf_prog *prog; + u32 moff; + + moff = btf_member_bit_offset(t, member) / 8; + mtype = btf_type_by_id(btf_vmlinux, member->type); + ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); + if (ptype == module_type) { + *(void **)(kdata + moff) = BPF_MODULE_OWNER; + continue; + } + + err = st_ops->init_member(t, member, kdata, udata); + if (err < 0) + goto reset_unlock; + + /* The ->init_member() has handled this member */ + if (err > 0) + continue; + + /* If st_ops->init_member does not handle it, + * we will only handle func ptrs and zero-ed members + * here. Reject everything else. 
+ */ + + /* All non func ptr member must be 0 */ + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, + NULL)) { + u32 msize; + + mtype = btf_resolve_size(btf_vmlinux, mtype, + &msize, NULL, NULL); + if (IS_ERR(mtype)) { + err = PTR_ERR(mtype); + goto reset_unlock; + } + + if (memchr_inv(udata + moff, 0, msize)) { + err = -EINVAL; + goto reset_unlock; + } + + continue; + } + + prog_fd = (int)(*(unsigned long *)(udata + moff)); + /* Similar check as the attr->attach_prog_fd */ + if (!prog_fd) + continue; + + prog = bpf_prog_get(prog_fd); + if (IS_ERR(prog)) { + err = PTR_ERR(prog); + goto reset_unlock; + } + st_map->progs[i] = prog; + + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || + prog->aux->attach_btf_id != st_ops->type_id || + prog->expected_attach_type != i) { + err = -EINVAL; + goto reset_unlock; + } + + err = arch_prepare_bpf_trampoline(image, + &st_ops->func_models[i], 0, + &prog, 1, NULL, 0, NULL); + if (err < 0) + goto reset_unlock; + + *(void **)(kdata + moff) = image; + image += err; + + /* put prog_id to udata */ + *(unsigned long *)(udata + moff) = prog->aux->id; + } + + refcount_set(&kvalue->refcnt, 1); + bpf_map_inc(map); + + err = st_ops->reg(kdata); + if (!err) { + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE); + goto unlock; + } + + /* Error during st_ops->reg() */ + bpf_map_put(map); + +reset_unlock: + bpf_struct_ops_map_put_progs(st_map); + memset(uvalue, 0, map->value_size); + memset(kvalue, 0, map->value_size); + +unlock: + spin_unlock(&st_map->lock); + return err; +} + +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) +{ + enum bpf_struct_ops_state prev_state; + struct bpf_struct_ops_map *st_map; + + st_map = (struct bpf_struct_ops_map *)map; + prev_state = cmpxchg(&st_map->kvalue.state, + BPF_STRUCT_OPS_STATE_INUSE, + BPF_STRUCT_OPS_STATE_TOBEFREE); + if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) { + st_map->st_ops->unreg(&st_map->kvalue.data); + if (refcount_dec_and_test(&st_map->kvalue.refcnt)) + bpf_map_put(map); + } + + return 0; +} + +static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + void *value; + + value = bpf_struct_ops_map_lookup_elem(map, key); + if (!value) + return; + + btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id, + value, m); + seq_puts(m, "\n"); +} + +static void bpf_struct_ops_map_free(struct bpf_map *map) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + + if (st_map->progs) + bpf_struct_ops_map_put_progs(st_map); + bpf_map_area_free(st_map->progs); + bpf_jit_free_exec(st_map->image); + bpf_map_area_free(st_map->uvalue); + bpf_map_area_free(st_map); +} + +static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) +{ + if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 || + attr->map_flags || !attr->btf_vmlinux_value_type_id) + return -EINVAL; + return 0; +} + +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) +{ + const struct bpf_struct_ops *st_ops; + size_t map_total_size, st_map_size; + struct bpf_struct_ops_map *st_map; + const struct btf_type *t, *vt; + struct bpf_map_memory mem; + struct bpf_map *map; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); + if (!st_ops) + return ERR_PTR(-ENOTSUPP); + + vt = st_ops->value_type; + if (attr->value_size != vt->size) + return ERR_PTR(-EINVAL); + + t = st_ops->type; + + st_map_size = sizeof(*st_map) + + /* kvalue stores the 
struct __bpf_tcp_congestions_ops */ + (vt->size - sizeof(struct bpf_struct_ops_value)); + map_total_size = st_map_size + + /* uvalue */ + sizeof(vt->size) + + /* struct bpf_progs **progs */ + btf_type_vlen(t) * sizeof(struct bpf_prog *); + err = bpf_map_charge_init(&mem, map_total_size); + if (err < 0) + return ERR_PTR(err); + + st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); + if (!st_map) { + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + st_map->st_ops = st_ops; + map = &st_map->map; + + st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); + st_map->progs = + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), + NUMA_NO_NODE); + st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!st_map->uvalue || !st_map->progs || !st_map->image) { + bpf_struct_ops_map_free(map); + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + + spin_lock_init(&st_map->lock); + set_vm_flush_reset_perms(st_map->image); + set_memory_x((long)st_map->image, 1); + bpf_map_init_from_attr(map, attr); + bpf_map_charge_move(&map->memory, &mem); + + return map; +} + +const struct bpf_map_ops bpf_struct_ops_map_ops = { + .map_alloc_check = bpf_struct_ops_map_alloc_check, + .map_alloc = bpf_struct_ops_map_alloc, + .map_free = bpf_struct_ops_map_free, + .map_get_next_key = bpf_struct_ops_map_get_next_key, + .map_lookup_elem = bpf_struct_ops_map_lookup_elem, + .map_delete_elem = bpf_struct_ops_map_delete_elem, + .map_update_elem = bpf_struct_ops_map_update_elem, + .map_seq_show_elem = bpf_struct_ops_map_seq_show_elem, +}; + +/* "const void *" because some subsystem is + * passing a const (e.g. const struct tcp_congestion_ops *) + */ +bool bpf_struct_ops_get(const void *kdata) +{ + struct bpf_struct_ops_value *kvalue; + + kvalue = container_of(kdata, struct bpf_struct_ops_value, data); + + return refcount_inc_not_zero(&kvalue->refcnt); +} + +void bpf_struct_ops_put(const void *kdata) +{ + struct bpf_struct_ops_value *kvalue; + + kvalue = container_of(kdata, struct bpf_struct_ops_value, data); + if (refcount_dec_and_test(&kvalue->refcnt)) { + struct bpf_struct_ops_map *st_map; + + st_map = container_of(kvalue, struct bpf_struct_ops_map, + kvalue); + bpf_map_put(&st_map->map); + } +} diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 16924e5fa126..90837aaa86d6 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -500,13 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding) return "UNKN"; } -static u32 btf_member_bit_offset(const struct btf_type *struct_type, - const struct btf_member *member) -{ - return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset) - : member->offset; -} - static u32 btf_type_int(const struct btf_type *t) { return *(u32 *)(t + 1); @@ -1089,7 +1082,7 @@ static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env) * *elem_type: same as return type ("struct X") * *total_nelems: 1 */ -static const struct btf_type * +const struct btf_type * btf_resolve_size(const struct btf *btf, const struct btf_type *type, u32 *type_size, const struct btf_type **elem_type, u32 *total_nelems) @@ -1143,8 +1136,10 @@ btf_resolve_size(const struct btf *btf, const struct btf_type *type, return ERR_PTR(-EINVAL); *type_size = nelems * size; - *total_nelems = nelems; - *elem_type = type; + if (total_nelems) + *total_nelems = nelems; + if (elem_type) + *elem_type = type; return array_type ? 
: type; } @@ -1858,7 +1853,10 @@ static void btf_modifier_seq_show(const struct btf *btf, u32 type_id, void *data, u8 bits_offset, struct seq_file *m) { - t = btf_type_id_resolve(btf, &type_id); + if (btf->resolved_ids) + t = btf_type_id_resolve(btf, &type_id); + else + t = btf_type_skip_modifiers(btf, type_id, NULL); btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); } diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 5e9366b33f0f..b3c48d1533cb 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -22,7 +22,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) */ if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY || inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE || - inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { + inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE || + inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { fdput(f); return ERR_PTR(-ENOTSUPP); } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 19b2d57f7c04..15f0505ce6d0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -628,7 +628,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, return ret; } -#define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id +#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id /* called via syscall */ static int map_create(union bpf_attr *attr) { @@ -642,6 +642,14 @@ static int map_create(union bpf_attr *attr) if (err) return -EINVAL; + if (attr->btf_vmlinux_value_type_id) { + if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS || + attr->btf_key_type_id || attr->btf_value_type_id) + return -EINVAL; + } else if (attr->btf_key_type_id && !attr->btf_value_type_id) { + return -EINVAL; + } + f_flags = bpf_get_file_flag(attr->map_flags); if (f_flags < 0) return f_flags; @@ -664,32 +672,35 @@ static int map_create(union bpf_attr *attr) atomic64_set(&map->usercnt, 1); mutex_init(&map->freeze_mutex); - if (attr->btf_key_type_id || attr->btf_value_type_id) { + map->spin_lock_off = -EINVAL; + if (attr->btf_key_type_id || attr->btf_value_type_id || + /* Even the map's value is a kernel's struct, + * the bpf_prog.o must have BTF to begin with + * to figure out the corresponding kernel's + * counter part. Thus, attr->btf_fd has + * to be valid also. 
+ */ + attr->btf_vmlinux_value_type_id) { struct btf *btf; - if (!attr->btf_value_type_id) { - err = -EINVAL; - goto free_map; - } - btf = btf_get_by_fd(attr->btf_fd); if (IS_ERR(btf)) { err = PTR_ERR(btf); goto free_map; } + map->btf = btf; - err = map_check_btf(map, btf, attr->btf_key_type_id, - attr->btf_value_type_id); - if (err) { - btf_put(btf); - goto free_map; + if (attr->btf_value_type_id) { + err = map_check_btf(map, btf, attr->btf_key_type_id, + attr->btf_value_type_id); + if (err) + goto free_map; } - map->btf = btf; map->btf_key_type_id = attr->btf_key_type_id; map->btf_value_type_id = attr->btf_value_type_id; - } else { - map->spin_lock_off = -EINVAL; + map->btf_vmlinux_value_type_id = + attr->btf_vmlinux_value_type_id; } err = security_bpf_map_alloc(map); @@ -888,6 +899,8 @@ static int map_lookup_elem(union bpf_attr *attr) } else if (map->map_type == BPF_MAP_TYPE_QUEUE || map->map_type == BPF_MAP_TYPE_STACK) { err = map->ops->map_peek_elem(map, value); + } else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { + err = bpf_struct_ops_map_sys_lookup_elem(map, key, value); } else { rcu_read_lock(); if (map->ops->map_lookup_elem_sys_only) @@ -1092,7 +1105,8 @@ static int map_delete_elem(union bpf_attr *attr) if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_delete_elem(map, key); goto out; - } else if (IS_FD_PROG_ARRAY(map)) { + } else if (IS_FD_PROG_ARRAY(map) || + map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { err = map->ops->map_delete_elem(map, key); goto out; } @@ -2822,6 +2836,7 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map, info.btf_key_type_id = map->btf_key_type_id; info.btf_value_type_id = map->btf_value_type_id; } + info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id; if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_info_fill(&info, map);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 5ee301ddbd00..610109cfc7a8 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -110,7 +110,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) fentry, fentry_cnt, fexit, fexit_cnt, tr->func.addr); - if (err) + if (err < 0) goto out; if (tr->selector) @@ -244,7 +244,8 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) } int __weak -arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags, +arch_prepare_bpf_trampoline(void *image, + const struct btf_func_model *m, u32 flags, struct bpf_prog **fentry_progs, int fentry_cnt, struct bpf_prog **fexit_progs, int fexit_cnt, void *orig_call)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 4c1eaa1a2965..990f13165c52 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8149,6 +8149,11 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, return -EINVAL; } + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { + verbose(env, "bpf_struct_ops map cannot be used in prog\n"); + return -EINVAL; + } + return 0; }
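
For illustration, creating one of these STRUCT_OPS maps from userspace looks roughly like the sketch below once the uapi above is in the installed headers. This is a minimal sketch, not part of the series: bpf(2) is invoked directly via syscall(2), and btf_fd (the prog object's BTF) together with value_type_id (the id of the __bpf_* value struct in vmlinux BTF) are assumed to be obtained elsewhere, e.g. through libbpf's BTF APIs. Per bpf_struct_ops_map_alloc_check() above, the key must be a 4-byte u32, max_entries must be 1, map_flags must be 0, and value_size must equal the value type's size.

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* needs CAP_SYS_ADMIN, per bpf_struct_ops_map_alloc() */
	static int create_struct_ops_map(int btf_fd, __u32 value_type_id,
					 __u32 value_size)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
		attr.key_size = sizeof(__u32);	/* enforced by alloc_check */
		attr.value_size = value_size;	/* must equal vt->size */
		attr.max_entries = 1;		/* a single slot, key 0 */
		attr.btf_fd = btf_fd;		/* prog BTF is still required */
		attr.btf_vmlinux_value_type_id = value_type_id;

		return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	}

Updating key 0 with a value whose function-pointer slots carry prog fds is then what drives bpf_struct_ops_map_update_elem() above to build the trampolines and call st_ops->reg().
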
From patchwork Sat Dec 14 00:47:53 2019 X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209569 X-Patchwork-Delegate: bpf@iogearbox.net From: Martin KaFai Lau Subject: [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf Date: Fri, 13 Dec 2019 16:47:53 -0800 Message-ID: <20191214004753.1653075-1-kafai@fb.com> In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> This patch makes "struct tcp_congestion_ops" the first user of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops in bpf. The BPF-implemented tcp_congestion_ops can be used like a regular kernel tcp-cc through sysctl and setsockopt, e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic net.ipv4.tcp_congestion_control = bpf_cubic There have been attempts to move TCP CC to user space (e.g. CCP in TCP). The common arguments are a faster turnaround and getting away from long-tail kernel versions in production, etc., which are legitimate points. BPF has been a continuous effort to combine the upsides of kernel and userspace (e.g. XDP gains the performance advantage without bypassing the kernel). Recent BPF advancements (in particular the BTF-aware verifier, BPF trampoline, and BPF CO-RE) have made implementing kernel struct ops (e.g. tcp cc) possible in BPF. It allows a faster turnaround for testing algorithms in production while leveraging the existing (and continually growing) BPF features/framework instead of building one specifically for userspace TCP CC. This patch allows write access to a few fields in tcp-sock (in bpf_tcp_ca_btf_struct_access()). The optional "get_info" is unsupported for now; it can be added later. One possible way is to output the info with a btf-id to describe the content. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 2 + include/net/tcp.h | 1 + kernel/bpf/bpf_struct_ops_types.h | 7 +- net/core/filter.c | 2 +- net/ipv4/Makefile | 4 + net/ipv4/bpf_tcp_ca.c | 225 ++++++++++++++++++++++++++++++ net/ipv4/tcp_cong.c | 14 +- net/ipv4/tcp_ipv4.c | 6 +- net/ipv4/tcp_minisocks.c | 4 +- net/ipv4/tcp_output.c | 4 +- 10 files changed, 254 insertions(+), 15 deletions(-) create mode 100644 net/ipv4/bpf_tcp_ca.c diff --git a/include/linux/filter.h b/include/linux/filter.h index 37ac7025031d..7c22c5e6528d 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -844,6 +844,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog); int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog, bpf_aux_classic_check_t trans, bool save_orig); void bpf_prog_destroy(struct bpf_prog *fp); +const struct bpf_func_proto * +bpf_base_func_proto(enum bpf_func_id func_id); int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_attach_bpf(u32 ufd, struct sock *sk); diff --git a/include/net/tcp.h b/include/net/tcp.h index 86b9a8766648..fd87fa1df603 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1007,6 +1007,7 @@ enum tcp_ca_ack_event_flags { #define TCP_CONG_NON_RESTRICTED 0x1 /* Requires ECN/ECT set on all packets */ #define TCP_CONG_NEEDS_ECN 0x2 +#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN) union tcp_cc_info; diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h index 7bb13ff49ec2..066d83ea1c99 100644 --- a/kernel/bpf/bpf_struct_ops_types.h +++ b/kernel/bpf/bpf_struct_ops_types.h @@ -1,4 +1,9 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* internal file - do not include directly */ -/* To be filled in a later patch */ +#ifdef CONFIG_BPF_JIT +#ifdef CONFIG_INET +#include +BPF_STRUCT_OPS_TYPE(tcp_congestion_ops) +#endif +#endif diff --git a/net/core/filter.c b/net/core/filter.c index a411f7835dee..fbb3698026bd 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5975,7 +5975,7 @@ bool bpf_helper_changes_pkt_data(void *func) return false; } -static const struct bpf_func_proto * +const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { switch (func_id) {
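
As a usage note (a sketch, not part of this patch): once a BPF tcp-cc such as the bpf_cubic in the example output above is registered, selecting it on a socket is plain socket API with nothing BPF-specific on the consumer side; the fd and the CC name below are placeholders.

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <string.h>
	#include <sys/socket.h>

	static int use_bpf_cc(int fd)
	{
		const char name[] = "bpf_cubic";

		/* same path as picking any in-kernel CC */
		return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
				  name, strlen(name));
	}

System-wide selection goes through the net.ipv4.tcp_congestion_control sysctl shown above.

diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index d57ecfaf89d4..7360d9b3eaad 100644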
--- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -65,3 +65,7 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o xfrm4_protocol.o + +ifeq ($(CONFIG_BPF_SYSCALL),y) +obj-$(CONFIG_BPF_JIT) += bpf_tcp_ca.o +endif diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c new file mode 100644 index 000000000000..967af987bc26 --- /dev/null +++ b/net/ipv4/bpf_tcp_ca.c @@ -0,0 +1,225 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2019 Facebook */ + +#include +#include +#include +#include +#include +#include + +static u32 optional_ops[] = { + offsetof(struct tcp_congestion_ops, init), + offsetof(struct tcp_congestion_ops, release), + offsetof(struct tcp_congestion_ops, set_state), + offsetof(struct tcp_congestion_ops, cwnd_event), + offsetof(struct tcp_congestion_ops, in_ack_event), + offsetof(struct tcp_congestion_ops, pkts_acked), + offsetof(struct tcp_congestion_ops, min_tso_segs), + offsetof(struct tcp_congestion_ops, sndbuf_expand), + offsetof(struct tcp_congestion_ops, cong_control), +}; + +static u32 unsupported_ops[] = { + offsetof(struct tcp_congestion_ops, get_info), +}; + +static const struct btf_type *tcp_sock_type; +static u32 tcp_sock_id, sock_id; + +static int bpf_tcp_ca_init(struct btf *_btf_vmlinux) +{ + s32 type_id; + + type_id = btf_find_by_name_kind(_btf_vmlinux, "sock", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + sock_id = type_id; + + type_id = btf_find_by_name_kind(_btf_vmlinux, "tcp_sock", + BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + tcp_sock_id = type_id; + tcp_sock_type = btf_type_by_id(_btf_vmlinux, tcp_sock_id); + + return 0; +} + +static bool check_optional(u32 member_offset) +{ + unsigned int i; + + for (i = 0; i < ARRAY_SIZE(optional_ops); i++) { + if (member_offset == optional_ops[i]) + return true; + } + + return false; +} + +static bool check_unsupported(u32 member_offset) +{ + unsigned int i; + + for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) { + if (member_offset == unsupported_ops[i]) + return true; + } + + return false; +} + +extern struct btf *btf_vmlinux; + +static bool bpf_tcp_ca_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) + return false; + if (type != BPF_READ) + return false; + if (off % size != 0) + return false; + + if (!btf_ctx_access(off, size, type, prog, info)) + return false; + + if (info->reg_type == PTR_TO_BTF_ID && info->btf_id == sock_id) + /* promote it to tcp_sock */ + info->btf_id = tcp_sock_id; + + return true; +} + +static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log, + const struct btf_type *t, int off, + int size, enum bpf_access_type atype, + u32 *next_btf_id) +{ + size_t end; + + if (atype == BPF_READ) + return btf_struct_access(log, t, off, size, atype, next_btf_id); + + if (t != tcp_sock_type) { + bpf_log(log, "only read is supported\n"); + return -EACCES; + } + + switch (off) { + case bpf_ctx_range(struct inet_connection_sock, icsk_ca_priv): + end = offsetofend(struct inet_connection_sock, icsk_ca_priv); + break; + case offsetof(struct inet_connection_sock, icsk_ack.pending): + end = offsetofend(struct inet_connection_sock, + icsk_ack.pending); + break; + case offsetof(struct tcp_sock, snd_cwnd): + end = offsetofend(struct tcp_sock, snd_cwnd); + break; + case offsetof(struct tcp_sock, snd_cwnd_cnt): + end = offsetofend(struct tcp_sock, snd_cwnd_cnt); + 
break; + case offsetof(struct tcp_sock, snd_ssthresh): + end = offsetofend(struct tcp_sock, snd_ssthresh); + break; + case offsetof(struct tcp_sock, ecn_flags): + end = offsetofend(struct tcp_sock, ecn_flags); + break; + default: + bpf_log(log, "no write support to tcp_sock at off %d\n", off); + return -EACCES; + } + + if (off + size > end) { + bpf_log(log, + "write access at off %d with size %d beyond the member of tcp_sock ended at %zu\n", + off, size, end); + return -EACCES; + } + + return NOT_INIT; +} + +static const struct bpf_func_proto * +bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id, + const struct bpf_prog *prog) +{ + return bpf_base_func_proto(func_id); +} + +static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = { + .get_func_proto = bpf_tcp_ca_get_func_proto, + .is_valid_access = bpf_tcp_ca_is_valid_access, + .btf_struct_access = bpf_tcp_ca_btf_struct_access, +}; + +static int bpf_tcp_ca_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + const struct tcp_congestion_ops *utcp_ca; + struct tcp_congestion_ops *tcp_ca; + size_t tcp_ca_name_len; + int prog_fd; + u32 moff; + + utcp_ca = (const struct tcp_congestion_ops *)udata; + tcp_ca = (struct tcp_congestion_ops *)kdata; + + moff = btf_member_bit_offset(t, member) / 8; + switch (moff) { + case offsetof(struct tcp_congestion_ops, flags): + if (utcp_ca->flags & ~TCP_CONG_MASK) + return -EINVAL; + tcp_ca->flags = utcp_ca->flags; + return 1; + case offsetof(struct tcp_congestion_ops, name): + tcp_ca_name_len = strnlen(utcp_ca->name, sizeof(utcp_ca->name)); + if (!tcp_ca_name_len || + tcp_ca_name_len == sizeof(utcp_ca->name)) + return -EINVAL; + memcpy(tcp_ca->name, utcp_ca->name, sizeof(tcp_ca->name)); + return 1; + } + + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, NULL)) + return 0; + + prog_fd = (int)(*(unsigned long *)(udata + moff)); + if (!prog_fd && !check_optional(moff) && !check_unsupported(moff)) + return -EINVAL; + + return 0; +} + +static int bpf_tcp_ca_check_member(const struct btf_type *t, + const struct btf_member *member) +{ + if (check_unsupported(btf_member_bit_offset(t, member) / 8)) + return -ENOTSUPP; + return 0; +} + +static int bpf_tcp_ca_reg(void *kdata) +{ + return tcp_register_congestion_control(kdata); +} + +static void bpf_tcp_ca_unreg(void *kdata) +{ + tcp_unregister_congestion_control(kdata); +} + +struct bpf_struct_ops bpf_tcp_congestion_ops = { + .verifier_ops = &bpf_tcp_ca_verifier_ops, + .reg = bpf_tcp_ca_reg, + .unreg = bpf_tcp_ca_unreg, + .check_member = bpf_tcp_ca_check_member, + .init_member = bpf_tcp_ca_init_member, + .init = bpf_tcp_ca_init, + .name = "tcp_congestion_ops", +}; diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index 3737ec096650..dc27f21bd815 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -162,7 +162,7 @@ void tcp_assign_congestion_control(struct sock *sk) rcu_read_lock(); ca = rcu_dereference(net->ipv4.tcp_congestion_control); - if (unlikely(!try_module_get(ca->owner))) + if (unlikely(!bpf_try_module_get(ca, ca->owner))) ca = &tcp_reno; icsk->icsk_ca_ops = ca; rcu_read_unlock(); @@ -208,7 +208,7 @@ void tcp_cleanup_congestion_control(struct sock *sk) if (icsk->icsk_ca_ops->release) icsk->icsk_ca_ops->release(sk); - module_put(icsk->icsk_ca_ops->owner); + bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner); } /* Used by sysctl to change default congestion control */ @@ -222,12 +222,12 @@ int tcp_set_default_congestion_control(struct net *net, const char *name) ca = 
tcp_ca_find_autoload(net, name); if (!ca) { ret = -ENOENT; - } else if (!try_module_get(ca->owner)) { + } else if (!bpf_try_module_get(ca, ca->owner)) { ret = -EBUSY; } else { prev = xchg(&net->ipv4.tcp_congestion_control, ca); if (prev) - module_put(prev->owner); + bpf_module_put(prev, prev->owner); ca->flags |= TCP_CONG_NON_RESTRICTED; ret = 0; @@ -366,19 +366,19 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, } else if (!load) { const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops; - if (try_module_get(ca->owner)) { + if (bpf_try_module_get(ca, ca->owner)) { if (reinit) { tcp_reinit_congestion_control(sk, ca); } else { icsk->icsk_ca_ops = ca; - module_put(old_ca->owner); + bpf_module_put(old_ca, old_ca->owner); } } else { err = -EBUSY; } } else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) { err = -EPERM; - } else if (!try_module_get(ca->owner)) { + } else if (!bpf_try_module_get(ca, ca->owner)) { err = -EBUSY; } else { tcp_reinit_congestion_control(sk, ca); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 26637fce324d..45a88358168a 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2619,7 +2619,8 @@ static void __net_exit tcp_sk_exit(struct net *net) int cpu; if (net->ipv4.tcp_congestion_control) - module_put(net->ipv4.tcp_congestion_control->owner); + bpf_module_put(net->ipv4.tcp_congestion_control, + net->ipv4.tcp_congestion_control->owner); for_each_possible_cpu(cpu) inet_ctl_sock_destroy(*per_cpu_ptr(net->ipv4.tcp_sk, cpu)); @@ -2726,7 +2727,8 @@ static int __net_init tcp_sk_init(struct net *net) /* Reno is always built in */ if (!net_eq(net, &init_net) && - try_module_get(init_net.ipv4.tcp_congestion_control->owner)) + bpf_try_module_get(init_net.ipv4.tcp_congestion_control, + init_net.ipv4.tcp_congestion_control->owner)) net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control; else net->ipv4.tcp_congestion_control = &tcp_reno; diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index c802bc80c400..ad3b56d9fa71 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -414,7 +414,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst) rcu_read_lock(); ca = tcp_ca_find_key(ca_key); - if (likely(ca && try_module_get(ca->owner))) { + if (likely(ca && bpf_try_module_get(ca, ca->owner))) { icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst); icsk->icsk_ca_ops = ca; ca_got_dst = true; @@ -425,7 +425,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst) /* If no valid choice made yet, assign current system default ca. 
*/ if (!ca_got_dst && (!icsk->icsk_ca_setsockopt || - !try_module_get(icsk->icsk_ca_ops->owner))) + !bpf_try_module_get(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner))) tcp_assign_congestion_control(sk); tcp_set_ca_state(sk, TCP_CA_Open); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index b184f03d7437..8e7187732ac1 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3356,8 +3356,8 @@ static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst) rcu_read_lock(); ca = tcp_ca_find_key(ca_key); - if (likely(ca && try_module_get(ca->owner))) { - module_put(icsk->icsk_ca_ops->owner); + if (likely(ca && bpf_try_module_get(ca, ca->owner))) { + bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner); icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst); icsk->icsk_ca_ops = ca; }
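
To make the flow above concrete: a CC implemented in bpf is a set of BPF_PROG_TYPE_STRUCT_OPS progs, one per tcp_congestion_ops member, with expected_attach_type equal to the member index and attach_btf_id naming tcp_congestion_ops; writing their fds into the map value's function-pointer slots ends up in bpf_tcp_ca_reg() -> tcp_register_congestion_control(), which still requires ssthresh, undo_cwnd, and one of cong_avoid/cong_control to be set. A hypothetical, reno-flavoured BPF-side skeleton might look like the following. This is a sketch only: the "struct_ops/" section names and the raw __u64 ctx-array calling convention mirror what the selftests later in this series adopt (SEC() as in the selftests' bpf_helpers.h), and the tcp_sock definition is assumed to come from headers matching the kernel's BTF.

	char _license[] SEC("license") = "GPL";

	SEC("struct_ops/demo_ssthresh")
	__u32 demo_ssthresh(__u64 *ctx)
	{
		/* arg0 is a struct sock *; bpf_tcp_ca_is_valid_access()
		 * above promotes it to tcp_sock for the verifier */
		struct tcp_sock *tp = (struct tcp_sock *)ctx[0];

		return tp->snd_cwnd > 2 ? tp->snd_cwnd >> 1 : 2; /* halve cwnd */
	}

	SEC("struct_ops/demo_undo_cwnd")
	__u32 demo_undo_cwnd(__u64 *ctx)
	{
		struct tcp_sock *tp = (struct tcp_sock *)ctx[0];

		return tp->snd_cwnd; /* no undo smarts in this sketch */
	}

	SEC("struct_ops/demo_cong_avoid")
	void demo_cong_avoid(__u64 *ctx)
	{
		/* args are (sk, ack, acked) */
		struct tcp_sock *tp = (struct tcp_sock *)ctx[0];
		__u32 acked = (__u32)ctx[2];

		/* snd_cwnd is one of the writable fields whitelisted in
		 * bpf_tcp_ca_btf_struct_access() above */
		tp->snd_cwnd += acked;
	}
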
From patchwork Sat Dec 14 00:47:55 2019 X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209572 X-Patchwork-Delegate: bpf@iogearbox.net From: Martin KaFai Lau Subject: [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper Date: Fri, 13 Dec 2019 16:47:55 -0800 Message-ID: <20191214004755.1653250-1-kafai@fb.com> In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> Add a helper to send out a tcp-ack. It will be used in the later bpf_dctcp implementation, which needs to send out an ack when the CE state changes. Signed-off-by: Martin KaFai Lau Acked-by: Yonghong Song --- include/uapi/linux/bpf.h | 11 ++++++++++- net/ipv4/bpf_tcp_ca.c | 24 +++++++++++++++++++++++- 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 8809212d9d6c..602449a56dde 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2827,6 +2827,14 @@ union bpf_attr { * Return * On success, the strictly positive length of the string, including * the trailing NUL character. On error, a negative value. + * + * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt) + * Description + * Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock. + * *rcv_nxt* is the ack_seq to be sent out. + * Return + * 0 on success, or a negative error in case of failure.
+ * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2944,7 +2952,8 @@ union bpf_attr { FN(probe_read_user), \ FN(probe_read_kernel), \ FN(probe_read_user_str), \ - FN(probe_read_kernel_str), + FN(probe_read_kernel_str), \ + FN(tcp_send_ack), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index 967af987bc26..1fb86f7a93c1 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -144,11 +144,33 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log, return NOT_INIT; } +BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt) +{ + /* bpf_tcp_ca prog cannot have NULL tp */ + __tcp_send_ack((struct sock *)tp, rcv_nxt); + return 0; +} + +const struct bpf_func_proto bpf_tcp_send_ack_proto = { + .func = bpf_tcp_send_ack, + .gpl_only = false, + /* In case we want to report error later */ + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID, + .arg2_type = ARG_ANYTHING, + .btf_id = &tcp_sock_id, +}; + static const struct bpf_func_proto * bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { - return bpf_base_func_proto(func_id); + switch (func_id) { + case BPF_FUNC_tcp_send_ack: + return &bpf_tcp_send_ack_proto; + default: + return bpf_base_func_proto(func_id); + } } static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
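
As a usage illustration (a sketch, not from the patch): from a bpf tcp-cc prog, the helper must be called with the verifier-tracked tcp_sock pointer, since the proto above checks arg1 as ARG_PTR_TO_BTF_ID against tcp_sock_id. A dctcp-style fragment, where the receiver acks immediately on a CE-state flip, assuming the helper declaration is hand-rolled until the uapi sync reaches the toolchain:

	static int (*bpf_tcp_send_ack)(void *tp, __u32 rcv_nxt) =
		(void *)BPF_FUNC_tcp_send_ack;

	static void demo_ce_state_flip(struct tcp_sock *tp, __u32 prior_rcv_nxt)
	{
		/* ack everything received before the CE state changed */
		bpf_tcp_send_ack(tp, prior_rcv_nxt);
	}
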
From patchwork Sat Dec 14 00:47:58 2019 X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209573 X-Patchwork-Delegate: bpf@iogearbox.net From: Martin KaFai Lau Subject: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies Date: Fri, 13 Dec 2019 16:47:58 -0800 Message-ID: <20191214004758.1653342-1-kafai@fb.com> In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> This patch adds a helper to handle jiffies. Some of the tcp_sock's timing is stored in jiffies. Although things could be deduced from CONFIG_HZ, having an easy way to get jiffies will make the later bpf-tcp-cc implementation easier. While at it, instead of only reading jiffies, the helper also takes a "flags" argument to help convert between ns and jiffies. This helper is available to CAP_SYS_ADMIN. Signed-off-by: Martin KaFai Lau --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 16 +++++++++++++++- kernel/bpf/core.c | 1 + kernel/bpf/helpers.c | 25 +++++++++++++++++++++++++ net/core/filter.c | 2 ++ 5 files changed, 44 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 349cedd7b97b..00491961421e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1371,6 +1371,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto; extern const struct bpf_func_proto bpf_strtol_proto; extern const struct bpf_func_proto bpf_strtoul_proto; extern const struct bpf_func_proto bpf_tcp_sock_proto; +extern const struct bpf_func_proto bpf_jiffies_proto; /* Shared helpers among cBPF and eBPF. */ void bpf_user_rnd_init_once(void); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 602449a56dde..cf864a5f7d61 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2835,6 +2835,16 @@ union bpf_attr { * Return * 0 on success, or a negative error in case of failure. * + * u64 bpf_jiffies(u64 in, u64 flags) + * Description + * jiffies helper. + * Return + * *flags*: 0, return the current jiffies. + * BPF_F_NS_TO_JIFFIES, convert *in* from ns to jiffies. + * BPF_F_JIFFIES_TO_NS, convert *in* from jiffies to + * ns. If *in* is zero, it returns the current + * jiffies as ns.
+ * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2953,7 +2963,8 @@ union bpf_attr { FN(probe_read_kernel), \ FN(probe_read_user_str), \ FN(probe_read_kernel_str), \ - FN(tcp_send_ack), + FN(tcp_send_ack), \ + FN(jiffies), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3032,6 +3043,9 @@ enum bpf_func_id { /* BPF_FUNC_sk_storage_get flags */ #define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0) +#define BPF_F_NS_TO_JIFFIES (1ULL << 0) +#define BPF_F_JIFFIES_TO_NS (1ULL << 1) + /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 2ff01a716128..0ffbda9a13e9 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2134,6 +2134,7 @@ const struct bpf_func_proto bpf_map_pop_elem_proto __weak; const struct bpf_func_proto bpf_map_peek_elem_proto __weak; const struct bpf_func_proto bpf_spin_lock_proto __weak; const struct bpf_func_proto bpf_spin_unlock_proto __weak; +const struct bpf_func_proto bpf_jiffies_proto __weak; const struct bpf_func_proto bpf_get_prandom_u32_proto __weak; const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index cada974c9f4e..e87c332d1b61 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "../../lib/kstrtox.h" @@ -486,4 +487,28 @@ const struct bpf_func_proto bpf_strtoul_proto = { .arg3_type = ARG_ANYTHING, .arg4_type = ARG_PTR_TO_LONG, }; + +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags) +{ + if (!flags) + return get_jiffies_64(); + + if (flags & BPF_F_NS_TO_JIFFIES) { + return nsecs_to_jiffies(in); + } else if (flags & BPF_F_JIFFIES_TO_NS) { + if (!in) + in = get_jiffies_64(); + return jiffies_to_nsecs(in); + } + + return 0; +} + +const struct bpf_func_proto bpf_jiffies_proto = { + .func = bpf_jiffies, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_ANYTHING, +}; #endif diff --git a/net/core/filter.c b/net/core/filter.c index fbb3698026bd..355746715901 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -6015,6 +6015,8 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_spin_unlock_proto; case BPF_FUNC_trace_printk: return bpf_get_trace_printk_proto(); + case BPF_FUNC_jiffies: + return &bpf_jiffies_proto; default: return NULL; }
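
For illustration, the three bpf_jiffies() modes as seen from BPF C, following the uapi doc and the implementation above (a sketch; the helper declaration is hand-rolled until the uapi sync lands in the toolchain):

	static __u64 (*bpf_jiffies)(__u64 in, __u64 flags) =
		(void *)BPF_FUNC_jiffies;

	static __u64 demo_deadline_one_rtt_out(__u64 rtt_ns)
	{
		/* current jiffies plus an RTT converted from ns to jiffies */
		return bpf_jiffies(0, 0) +
		       bpf_jiffies(rtt_ns, BPF_F_NS_TO_JIFFIES);
	}

	static __u64 demo_now_ns(void)
	{
		/* current time at jiffies granularity, expressed in ns
		 * (in == 0 means "use the current jiffies") */
		return bpf_jiffies(0, BPF_F_JIFFIES_TO_NS);
	}
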
From patchwork Sat Dec 14 00:48:00 2019 X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209577 X-Patchwork-Delegate: bpf@iogearbox.net From: Martin KaFai Lau Subject: [PATCH bpf-next 10/13] bpf: Synch uapi bpf.h to tools/ Date: Fri, 13 Dec 2019 16:48:00 -0800 Message-ID: <20191214004800.1653481-1-kafai@fb.com> In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> This patch syncs uapi bpf.h to tools/. Signed-off-by: Martin KaFai Lau --- tools/include/uapi/linux/bpf.h | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index dbbcf0b02970..cf864a5f7d61 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -136,6 +136,7 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, BPF_MAP_TYPE_SK_STORAGE, BPF_MAP_TYPE_DEVMAP_HASH, + BPF_MAP_TYPE_STRUCT_OPS, }; /* Note that tracing related programs such as @@ -174,6 +175,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_PROG_TYPE_TRACING, + BPF_PROG_TYPE_STRUCT_OPS, }; enum bpf_attach_type { @@ -391,6 +393,10 @@ union bpf_attr { __u32 btf_fd; /* fd pointing to a BTF type data */ __u32 btf_key_type_id; /* BTF type_id of the key */ __u32 btf_value_type_id; /* BTF type_id of the value */ + __u32 btf_vmlinux_value_type_id;/* BTF type_id of a kernel- * struct
stored as the + * map value + */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ @@ -2821,6 +2827,24 @@ union bpf_attr { * Return * On success, the strictly positive length of the string, including * the trailing NUL character. On error, a negative value. + * + * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt) + * Description + * Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock. + * *rcv_nxt* is the ack_seq to be sent out. + * Return + * 0 on success, or a negative error in case of failure. + * + * u64 bpf_jiffies(u64 in, u64 flags) + * Description + * jiffies helper. The operation is selected by *flags*. + * Return + * If *flags* is 0, return the current jiffies. + * If *flags* is BPF_F_NS_TO_JIFFIES, convert *in* from ns to jiffies. + * If *flags* is BPF_F_JIFFIES_TO_NS, convert *in* from jiffies to + * ns. If *in* is zero, return the current + * jiffies as ns. + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2938,7 +2962,9 @@ union bpf_attr { FN(probe_read_user), \ FN(probe_read_kernel), \ FN(probe_read_user_str), \ - FN(probe_read_kernel_str), + FN(probe_read_kernel_str), \ + FN(tcp_send_ack), \ + FN(jiffies), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3017,6 +3043,9 @@ enum bpf_func_id { /* BPF_FUNC_sk_storage_get flags */ #define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0) +#define BPF_F_NS_TO_JIFFIES (1ULL << 0) +#define BPF_F_JIFFIES_TO_NS (1ULL << 1) + /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, @@ -3339,7 +3368,7 @@ struct bpf_map_info { __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u32 :32; + __u32 btf_vmlinux_value_type_id; __u64 netns_dev; __u64 netns_ino; __u32 btf_id;
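[ For illustration (hypothetical; not part of this patch): the two helpers documented above are exercised by later patches in this series. A minimal BPF-side sketch, assuming the usual BPF program environment (__u32/__u64 types, a struct tcp_sock definition, and declarations for the two new helpers); the wrapper names are made up for the example:

static __always_inline __u64 msecs_to_jiffies(__u32 m)
{
	/* m msec = m * 1,000,000 ns; let the kernel convert ns to jiffies */
	return bpf_jiffies((__u64)m * 1000000UL, BPF_F_NS_TO_JIFFIES);
}

static __always_inline void ack_now(struct tcp_sock *tp, __u32 rcv_nxt)
{
	/* send out a tcp-ack for rcv_nxt on this in-kernel tcp_sock */
	bpf_tcp_send_ack(tp, rcv_nxt);
}
]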
From patchwork Sat Dec 14 00:48:03 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209578 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="DChBOHko"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47ZTRD5kK6z9sNH for ; Sat, 14 Dec 2019 11:48:08 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726986AbfLNAsH (ORCPT ); Fri, 13 Dec 2019 19:48:07 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:16228 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726890AbfLNAsG (ORCPT ); Fri, 13 Dec 2019 19:48:06 -0500 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xBE0UUJo009121 for ; Fri, 13 Dec 2019 16:48:06 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=ZTYK8LaX3K+RSkTNYO6iEs1RBSmx7LEmkFhZ38VrVeM=; b=DChBOHkoiFdtzbKe1jAQ8EDZgYg5A0DJ9nihaJSZDnBFlyK4fjWcoC3LN1szFBQ/ztJb Bq3fZLrM/MCL5DK6wE/YtritVOUW5alxFUfYRAj9/1/AO8+Ap4wzu0HUbJeAyXsBw+ZU y6oF5PLBLErrYtX10NpU0y5oQfl4JtWIacA= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 2wvkm20g25-4 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 13 Dec 2019 16:48:06 -0800 Received: from intmgw003.08.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Fri, 13 Dec 2019 16:48:04 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 32D662943AB4; Fri, 13 Dec 2019 16:48:03 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , David Miller , , Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support Date: Fri, 13 Dec 2019 16:48:03 -0800 Message-ID: <20191214004803.1653618-1-kafai@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-12-13_09:2019-12-13, 2019-12-13 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 impostorscore=0 phishscore=0 lowpriorityscore=0 bulkscore=0 priorityscore=1501 mlxlogscore=999 malwarescore=0 adultscore=0 spamscore=0 mlxscore=0 clxscore=1015 suspectscore=43 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1912140001 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds BPF STRUCT_OPS support to libbpf. The only sec_name convention is SEC("struct_ops") to identify the struct_ops implemented in BPF, e.g. SEC("struct_ops") struct tcp_congestion_ops dctcp = { .init = (void *)dctcp_init, /* <-- a bpf_prog */ /* ... some more func ptrs ... */ .name = "bpf_dctcp", }; In the bpf_object__open phase, libbpf looks for the "struct_ops" elf section and finds out which btf-type the "struct_ops" is implementing. Note that the btf-type here refers to a type in the bpf_prog.o's btf. It then collects (through SHT_REL) the bpf progs that the func ptrs refer to. In the bpf_object__load phase, prepare_struct_ops() loads the btf_vmlinux and obtains the corresponding kernel's btf-type. With the kernel's btf-type, it can then set the prog->type, prog->attach_btf_id and the prog->expected_attach_type. Thus, the prog's properties do not rely on its section name. Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching process is as simple as: member-name match + btf-kind match + size match. If these matching conditions fail, libbpf rejects the program. The currently supported target is "struct tcp_congestion_ops", most of whose members are function pointers. The member ordering of the bpf_prog's btf-type can be different from the btf_vmlinux's btf-type. Once the prog's properties are all set, libbpf proceeds to load all the progs. After that, register_struct_ops() creates a map, finalizes the map-value by populating it with the prog-fds, and then registers this "struct_ops" with the kernel by updating the map-value in the map. By default, libbpf does not unregister the struct_ops from the kernel during bpf_object__close(). 
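[ For illustration, a minimal hypothetical caller of this open/load flow; the function name and the bpf_dctcp.o object file are made up for the example, while bpf_object__open(), libbpf_get_error(), bpf_object__load() and bpf_object__close() are the libbpf APIs described above:

#include <bpf/libbpf.h>	/* assumed include path */

static int load_dctcp_struct_ops(void)
{
	struct bpf_object *obj;
	int err;

	/* open phase: find the "struct_ops" section and its
	 * func ptr <-> bpf_prog relocations
	 */
	obj = bpf_object__open("bpf_dctcp.o");
	err = libbpf_get_error(obj);
	if (err)
		return err;

	/* load phase: match against btf_vmlinux, load the progs,
	 * then register the "struct_ops" to the kernel
	 */
	err = bpf_object__load(obj);

	/* the registered tcp-cc stays in the kernel after close */
	bpf_object__close(obj);
	return err;
}
]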
It can be changed by setting the new "unreg_st_ops" in bpf_object_open_opts. Signed-off-by: Martin KaFai Lau --- tools/lib/bpf/bpf.c | 10 +- tools/lib/bpf/bpf.h | 5 +- tools/lib/bpf/libbpf.c | 599 +++++++++++++++++++++++++++++++++- tools/lib/bpf/libbpf.h | 3 +- tools/lib/bpf/libbpf_probes.c | 2 + 5 files changed, 612 insertions(+), 7 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 98596e15390f..ebb9d7066173 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -95,7 +95,11 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr) attr.btf_key_type_id = create_attr->btf_key_type_id; attr.btf_value_type_id = create_attr->btf_value_type_id; attr.map_ifindex = create_attr->map_ifindex; - attr.inner_map_fd = create_attr->inner_map_fd; + if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS) + attr.btf_vmlinux_value_type_id = + create_attr->btf_vmlinux_value_type_id; + else + attr.inner_map_fd = create_attr->inner_map_fd; return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } @@ -228,7 +232,9 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, memset(&attr, 0, sizeof(attr)); attr.prog_type = load_attr->prog_type; attr.expected_attach_type = load_attr->expected_attach_type; - if (attr.prog_type == BPF_PROG_TYPE_TRACING) { + if (attr.prog_type == BPF_PROG_TYPE_STRUCT_OPS) { + attr.attach_btf_id = load_attr->attach_btf_id; + } else if (attr.prog_type == BPF_PROG_TYPE_TRACING) { attr.attach_btf_id = load_attr->attach_btf_id; attr.attach_prog_fd = load_attr->attach_prog_fd; } else { diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 3c791fa8e68e..1ddbf7f33b83 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -48,7 +48,10 @@ struct bpf_create_map_attr { __u32 btf_key_type_id; __u32 btf_value_type_id; __u32 map_ifindex; - __u32 inner_map_fd; + union { + __u32 inner_map_fd; + __u32 btf_vmlinux_value_type_id; + }; }; LIBBPF_API int diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 27d5f7ecba32..ffb5cdd7db5a 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -67,6 +67,10 @@ #define __printf(a, b) __attribute__((format(printf, a, b))) +static struct btf *bpf_core_find_kernel_btf(void); +static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj, + int idx); + static int __base_pr(enum libbpf_print_level level, const char *format, va_list args) { @@ -128,6 +132,8 @@ void libbpf_print(enum libbpf_print_level level, const char *format, ...) # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ #endif +#define BPF_STRUCT_OPS_SEC_NAME "struct_ops" + static inline __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; @@ -233,6 +239,32 @@ struct bpf_map { bool reused; }; +struct bpf_struct_ops { + const char *var_name; + const char *tname; + const struct btf_type *type; + struct bpf_program **progs; + __u32 *kern_func_off; + /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */ + void *data; + /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf + * format. + * struct __bpf_tcp_congestion_ops { + * [... some other kernel fields ...] + * struct tcp_congestion_ops data; + * } + * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops). + * prepare_struct_ops() will populate the "data" into + * "kern_vdata". 
+ */ + void *kern_vdata; + __u32 type_id; + __u32 kern_vtype_id; + __u32 kern_vtype_size; + int fd; + bool unreg; +}; + struct bpf_secdata { void *rodata; void *data; @@ -251,6 +283,7 @@ struct bpf_object { size_t nr_maps; size_t maps_cap; struct bpf_secdata sections; + struct bpf_struct_ops st_ops; bool loaded; bool has_pseudo_calls; @@ -270,6 +303,7 @@ struct bpf_object { Elf_Data *data; Elf_Data *rodata; Elf_Data *bss; + Elf_Data *st_ops_data; size_t strtabidx; struct { GElf_Shdr shdr; @@ -282,6 +316,7 @@ struct bpf_object { int data_shndx; int rodata_shndx; int bss_shndx; + int st_ops_shndx; } efile; /* * All loaded bpf_object is linked in a list, which is @@ -509,6 +544,508 @@ static __u32 get_kernel_version(void) return KERNEL_VERSION(major, minor, patch); } +static int bpf_object__register_struct_ops(struct bpf_object *obj) +{ + struct bpf_create_map_attr map_attr = {}; + struct bpf_struct_ops *st_ops; + const char *tname; + __u32 i, zero = 0; + int fd, err; + + st_ops = &obj->st_ops; + if (!st_ops->kern_vdata) + return 0; + + tname = st_ops->tname; + for (i = 0; i < btf_vlen(st_ops->type); i++) { + struct bpf_program *prog = st_ops->progs[i]; + void *kern_data; + int prog_fd; + + if (!prog) + continue; + + prog_fd = bpf_program__nth_fd(prog, 0); + if (prog_fd < 0) { + pr_warn("struct_ops register %s: prog %s is not loaded\n", + tname, prog->name); + return -EINVAL; + } + + kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i]; + *(unsigned long *)kern_data = prog_fd; + } + + map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS; + map_attr.key_size = sizeof(unsigned int); + map_attr.value_size = st_ops->kern_vtype_size; + map_attr.max_entries = 1; + map_attr.btf_fd = btf__fd(obj->btf); + map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id; + map_attr.name = st_ops->var_name; + + fd = bpf_create_map_xattr(&map_attr); + if (fd < 0) { + err = -errno; + pr_warn("struct_ops register %s: Error in creating struct_ops map\n", + tname); + return err; + } + + err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0); + if (err) { + err = -errno; + close(fd); + pr_warn("struct_ops register %s: Error in updating struct_ops map\n", + tname); + return err; + } + + st_ops->fd = fd; + + return 0; +} + +static int bpf_struct_ops__unregister(struct bpf_struct_ops *st_ops) +{ + if (st_ops->fd != -1) { + __u32 zero = 0; + int err = 0; + + if (bpf_map_delete_elem(st_ops->fd, &zero)) + err = -errno; + zclose(st_ops->fd); + + return err; + } + + return 0; +} + +static const struct btf_type * +resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id); +static const struct btf_type * +resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id); + +static const struct btf_member * +find_member_by_offset(const struct btf_type *t, __u32 offset) +{ + struct btf_member *m; + int i; + + for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) { + if (btf_member_bit_offset(t, i) == offset) + return m; + } + + return NULL; +} + +static const struct btf_member * +find_member_by_name(const struct btf *btf, const struct btf_type *t, + const char *name) +{ + struct btf_member *m; + int i; + + for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) { + if (!strcmp(btf__name_by_offset(btf, m->name_off), name)) + return m; + } + + return NULL; +} + +#define STRUCT_OPS_VALUE_PREFIX "__bpf_" +#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1) + +static int +bpf_struct_ops__get_kern_types(const struct btf *btf, const char *tname, + const struct btf_type **type, __u32 *type_id, + const 
struct btf_type **vtype, __u32 *vtype_id, + const struct btf_member **data_member) +{ + const struct btf_type *kern_type, *kern_vtype; + const struct btf_member *kern_data_member; + __s32 kern_vtype_id, kern_type_id; + char vtname[128] = STRUCT_OPS_VALUE_PREFIX; + __u32 i; + + kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT); + if (kern_type_id < 0) { + pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n", + tname); + return -ENOTSUP; + } + kern_type = btf__type_by_id(btf, kern_type_id); + + /* Find the corresponding "map_value" type that will be used + * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example, + * find "struct __bpf_tcp_congestion_ops" from the btf_vmlinux. + */ + strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname, + sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1); + kern_vtype_id = btf__find_by_name_kind(btf, vtname, + BTF_KIND_STRUCT); + if (kern_vtype_id < 0) { + pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n", + vtname); + return -ENOTSUP; + } + kern_vtype = btf__type_by_id(btf, kern_vtype_id); + + /* Find "struct tcp_congestion_ops" from + * struct __bpf_tcp_congestion_ops { + * [ ... ] + * struct tcp_congestion_ops data; + * } + */ + for (i = 0, kern_data_member = btf_members(kern_vtype); + i < btf_vlen(kern_vtype); + i++, kern_data_member++) { + if (kern_data_member->type == kern_type_id) + break; + } + if (i == btf_vlen(kern_vtype)) { + pr_warn("struct_ops prepare: struct %s data is not found in struct %s\n", + tname, vtname); + return -EINVAL; + } + + *type = kern_type; + *type_id = kern_type_id; + *vtype = kern_vtype; + *vtype_id = kern_vtype_id; + *data_member = kern_data_member; + + return 0; +} + +static int bpf_object__prepare_struct_ops(struct bpf_object *obj) +{ + const struct btf_member *member, *kern_member, *kern_data_member; + const struct btf_type *type, *kern_type, *kern_vtype; + __u32 i, kern_type_id, kern_vtype_id, kern_data_off; + struct bpf_struct_ops *st_ops; + void *data, *kern_data; + const struct btf *btf; + struct btf *kern_btf; + const char *tname; + int err; + + st_ops = &obj->st_ops; + if (!st_ops->data) + return 0; + + btf = obj->btf; + type = st_ops->type; + tname = st_ops->tname; + + kern_btf = bpf_core_find_kernel_btf(); + if (IS_ERR(kern_btf)) + return PTR_ERR(kern_btf); + + err = bpf_struct_ops__get_kern_types(kern_btf, tname, + &kern_type, &kern_type_id, + &kern_vtype, &kern_vtype_id, + &kern_data_member); + if (err) + goto done; + + pr_debug("struct_ops prepare %s: type_id:%u kern_type_id:%u kern_vtype_id:%u\n", + tname, st_ops->type_id, kern_type_id, kern_vtype_id); + + kern_data_off = kern_data_member->offset / 8; + st_ops->kern_vtype_size = kern_vtype->size; + st_ops->kern_vtype_id = kern_vtype_id; + + st_ops->kern_vdata = calloc(1, st_ops->kern_vtype_size); + if (!st_ops->kern_vdata) { + err = -ENOMEM; + goto done; + } + + data = st_ops->data; + kern_data = st_ops->kern_vdata + kern_data_off; + + err = -ENOTSUP; + for (i = 0, member = btf_members(type); i < btf_vlen(type); + i++, member++) { + const struct btf_type *mtype, *kern_mtype; + __u32 mtype_id, kern_mtype_id; + void *mdata, *kern_mdata; + __s64 msize, kern_msize; + __u32 moff, kern_moff; + __u32 kern_member_idx; + const char *mname; + + mname = btf__name_by_offset(btf, member->name_off); + kern_member = find_member_by_name(kern_btf, kern_type, mname); + if (!kern_member) { + pr_warn("struct_ops prepare %s: Cannot find member %s in kernel BTF\n", + tname, mname); + goto done; + } + + kern_member_idx = kern_member - 
btf_members(kern_type); + if (btf_member_bitfield_size(type, i) || + btf_member_bitfield_size(kern_type, kern_member_idx)) { + pr_warn("struct_ops prepare %s: bitfield %s is not supported\n", + tname, mname); + goto done; + } + + moff = member->offset / 8; + kern_moff = kern_member->offset / 8; + + mdata = data + moff; + kern_mdata = kern_data + kern_moff; + + mtype_id = member->type; + kern_mtype_id = kern_member->type; + + mtype = resolve_ptr(btf, mtype_id, NULL); + kern_mtype = resolve_ptr(kern_btf, kern_mtype_id, NULL); + if (mtype && kern_mtype) { + struct bpf_program *prog; + + if (!btf_is_func_proto(mtype) || + !btf_is_func_proto(kern_mtype)) { + pr_warn("struct_ops prepare %s: non func ptr %s is not supported\n", + tname, mname); + goto done; + } + + prog = st_ops->progs[i]; + if (!prog) { + pr_debug("struct_ops prepare %s: func ptr %s is not set\n", + tname, mname); + continue; + } + + if (prog->type != BPF_PROG_TYPE_UNSPEC || + prog->attach_btf_id) { + pr_warn("struct_ops prepare %s: cannot use prog %s with attach_btf_id %d for func ptr %s\n", + tname, prog->name, prog->attach_btf_id, mname); + err = -EINVAL; + goto done; + } + + prog->type = BPF_PROG_TYPE_STRUCT_OPS; + prog->attach_btf_id = kern_type_id; + /* expected_attach_type is the member index */ + prog->expected_attach_type = kern_member_idx; + + st_ops->kern_func_off[i] = kern_data_off + kern_moff; + + pr_debug("struct_ops prepare %s: func ptr %s is set to prog %s from data(+%u) to kern_data(+%u)\n", + tname, mname, prog->name, moff, kern_moff); + + continue; + } + + mtype_id = btf__resolve_type(btf, mtype_id); + kern_mtype_id = btf__resolve_type(kern_btf, kern_mtype_id); + if (mtype_id < 0 || kern_mtype_id < 0) { + pr_warn("struct_ops prepare %s: Cannot resolve the type for %s\n", + tname, mname); + goto done; + } + + mtype = btf__type_by_id(btf, mtype_id); + kern_mtype = btf__type_by_id(kern_btf, kern_mtype_id); + if (BTF_INFO_KIND(mtype->info) != + BTF_INFO_KIND(kern_mtype->info)) { + pr_warn("struct_ops prepare %s: Unmatched member type %s %u != %u(kernel)\n", + tname, mname, + BTF_INFO_KIND(mtype->info), + BTF_INFO_KIND(kern_mtype->info)); + goto done; + } + + msize = btf__resolve_size(btf, mtype_id); + kern_msize = btf__resolve_size(kern_btf, kern_mtype_id); + if (msize < 0 || kern_msize < 0 || msize != kern_msize) { + pr_warn("struct_ops prepare %s: Error in size of member %s: %zd != %zd(kernel)\n", + tname, mname, + (ssize_t)msize, (ssize_t)kern_msize); + goto done; + } + + pr_debug("struct_ops prepare %s: copy %s %u bytes from data(+%u) to kern_data(+%u)\n", + tname, mname, (unsigned int)msize, + moff, kern_moff); + memcpy(kern_mdata, mdata, msize); + } + + err = 0; + +done: + /* On error case, bpf_object__unload() will free the + * st_ops->kern_vdata. 
+ */ + btf__free(kern_btf); + return err; +} + +static int bpf_object__collect_struct_ops_reloc(struct bpf_object *obj, + GElf_Shdr *shdr, + Elf_Data *data) +{ + const struct btf_member *member; + struct bpf_struct_ops *st_ops; + struct bpf_program *prog; + const char *name, *tname; + unsigned int shdr_idx; + const struct btf *btf; + Elf_Data *symbols; + unsigned int moff; + GElf_Sym sym; + GElf_Rel rel; + int i, nrels; + + symbols = obj->efile.symbols; + btf = obj->btf; + st_ops = &obj->st_ops; + tname = st_ops->tname; + + nrels = shdr->sh_size / shdr->sh_entsize; + for (i = 0; i < nrels; i++) { + if (!gelf_getrel(data, i, &rel)) { + pr_warn("struct_ops reloc %s: failed to get %d reloc\n", + tname, i); + return -LIBBPF_ERRNO__FORMAT; + } + + if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) { + pr_warn("struct_ops reloc %s: symbol %" PRIx64 " not found\n", + tname, GELF_R_SYM(rel.r_info)); + return -LIBBPF_ERRNO__FORMAT; + } + + name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, + sym.st_name) ? : ""; + + pr_debug("%s relo for %lld value %lld name %d (\'%s\')\n", + tname, + (long long) (rel.r_info >> 32), + (long long) sym.st_value, sym.st_name, name); + + shdr_idx = sym.st_shndx; + moff = rel.r_offset; + pr_debug("struct_ops reloc %s: moff=%u, shdr_idx=%u\n", + tname, moff, shdr_idx); + + if (shdr_idx >= SHN_LORESERVE) { + pr_warn("struct_ops reloc %s: moff=%u shdr_idx=%u unsupported non-static function\n", + tname, moff, shdr_idx); + return -LIBBPF_ERRNO__RELOC; + } + + member = find_member_by_offset(st_ops->type, moff * 8); + if (!member) { + pr_warn("struct_ops reloc %s: cannot find member at moff=%u\n", + tname, moff); + return -EINVAL; + } + name = btf__name_by_offset(btf, member->name_off); + + if (!resolve_func_ptr(btf, member->type, NULL)) { + pr_warn("struct_ops reloc %s: cannot relocate non func ptr %s\n", + tname, name); + return -EINVAL; + } + + prog = bpf_object__find_prog_by_idx(obj, shdr_idx); + if (!prog) { + pr_warn("struct_ops reloc %s: cannot find prog at shdr_idx %u to relocate func ptr %s\n", + tname, shdr_idx, name); + return -EINVAL; + } + + st_ops->progs[member - btf_members(st_ops->type)] = prog; + } + + return 0; +} + +static int bpf_object__init_struct_ops(struct bpf_object *obj) +{ + const struct btf_type *type, *datasec; + const struct btf_var_secinfo *vsi; + struct bpf_struct_ops *st_ops; + const char *tname, *var_name; + __s32 type_id, datasec_id; + const struct btf *btf; + + if (obj->efile.st_ops_shndx == -1) + return 0; + + btf = obj->btf; + st_ops = &obj->st_ops; + datasec_id = btf__find_by_name_kind(btf, BPF_STRUCT_OPS_SEC_NAME, + BTF_KIND_DATASEC); + if (datasec_id < 0) { + pr_warn("struct_ops init: DATASEC %s not found\n", + BPF_STRUCT_OPS_SEC_NAME); + return -EINVAL; + } + + datasec = btf__type_by_id(btf, datasec_id); + if (btf_vlen(datasec) != 1) { + pr_warn("struct_ops init: multiple VAR in DATASEC %s\n", + BPF_STRUCT_OPS_SEC_NAME); + return -ENOTSUP; + } + vsi = btf_var_secinfos(datasec); + + type = btf__type_by_id(obj->btf, vsi->type); + if (!btf_is_var(type)) { + pr_warn("struct_ops init: vsi->type %u is not a VAR\n", + vsi->type); + return -EINVAL; + } + var_name = btf__name_by_offset(obj->btf, type->name_off); + + type_id = btf__resolve_type(obj->btf, vsi->type); + if (type_id < 0) { + pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n", + vsi->type, BPF_STRUCT_OPS_SEC_NAME); + return -EINVAL; + } + + type = btf__type_by_id(obj->btf, type_id); + tname = btf__name_by_offset(obj->btf, type->name_off); + if 
(!btf_is_struct(type)) { + pr_warn("struct_ops init: %s is not a struct\n", tname); + return -EINVAL; + } + + if (type->size != obj->efile.st_ops_data->d_size) { + pr_warn("struct_ops init: %s unmatched size %u (BTF DATASEC) != %zu (ELF)\n", + tname, type->size, obj->efile.st_ops_data->d_size); + return -EINVAL; + } + + st_ops->data = malloc(type->size); + st_ops->progs = calloc(btf_vlen(type), sizeof(*st_ops->progs)); + st_ops->kern_func_off = malloc(btf_vlen(type) * + sizeof(*st_ops->kern_func_off)); + /* bpf_object__close() will take care of the free-ings */ + if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off) + return -ENOMEM; + memcpy(st_ops->data, obj->efile.st_ops_data->d_buf, type->size); + + st_ops->tname = tname; + st_ops->type = type; + st_ops->type_id = type_id; + st_ops->var_name = var_name; + + pr_debug("struct_ops init: %s found. type_id:%u\n", tname, type_id); + + return 0; +} + static struct bpf_object *bpf_object__new(const char *path, const void *obj_buf, size_t obj_buf_sz, @@ -550,6 +1087,9 @@ static struct bpf_object *bpf_object__new(const char *path, obj->efile.data_shndx = -1; obj->efile.rodata_shndx = -1; obj->efile.bss_shndx = -1; + obj->efile.st_ops_shndx = -1; + + obj->st_ops.fd = -1; obj->kern_version = get_kernel_version(); obj->loaded = false; @@ -572,6 +1112,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj) obj->efile.data = NULL; obj->efile.rodata = NULL; obj->efile.bss = NULL; + obj->efile.st_ops_data = NULL; zfree(&obj->efile.reloc_sects); obj->efile.nr_reloc_sects = 0; @@ -757,6 +1298,9 @@ int bpf_object__section_size(const struct bpf_object *obj, const char *name, } else if (!strcmp(name, ".rodata")) { if (obj->efile.rodata) *size = obj->efile.rodata->d_size; + } else if (!strcmp(name, BPF_STRUCT_OPS_SEC_NAME)) { + if (obj->efile.st_ops_data) + *size = obj->efile.st_ops_data->d_size; } else { ret = bpf_object_search_section_size(obj, name, &d_size); if (!ret) @@ -1060,6 +1604,30 @@ skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id) return t; } +static const struct btf_type * +resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id) +{ + const struct btf_type *t; + + t = skip_mods_and_typedefs(btf, id, NULL); + if (!btf_is_ptr(t)) + return NULL; + + return skip_mods_and_typedefs(btf, t->type, res_id); +} + +static const struct btf_type * +resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id) +{ + const struct btf_type *t; + + t = resolve_ptr(btf, id, res_id); + if (t && btf_is_func_proto(t)) + return t; + + return NULL; +} + /* * Fetch integer attribute of BTF map definition. 
Such attributes are * represented using a pointer to an array, in which dimensionality of array @@ -1509,7 +2077,7 @@ static void bpf_object__sanitize_btf_ext(struct bpf_object *obj) static bool bpf_object__is_btf_mandatory(const struct bpf_object *obj) { - return obj->efile.btf_maps_shndx >= 0; + return obj->efile.btf_maps_shndx >= 0 || obj->efile.st_ops_shndx >= 0; } static int bpf_object__init_btf(struct bpf_object *obj, @@ -1689,6 +2257,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps, } else if (strcmp(name, ".rodata") == 0) { obj->efile.rodata = data; obj->efile.rodata_shndx = idx; + } else if (strcmp(name, BPF_STRUCT_OPS_SEC_NAME) == 0) { + obj->efile.st_ops_data = data; + obj->efile.st_ops_shndx = idx; } else { pr_debug("skip section(%d) %s\n", idx, name); } @@ -1698,7 +2269,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps, int sec = sh.sh_info; /* points to other section */ /* Only do relo for section with exec instructions */ - if (!section_have_execinstr(obj, sec)) { + if (!section_have_execinstr(obj, sec) && + !strstr(name, BPF_STRUCT_OPS_SEC_NAME)) { pr_debug("skip relo %s(%d) for section(%d)\n", name, idx, sec); continue; @@ -1735,6 +2307,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps, err = bpf_object__sanitize_and_load_btf(obj); if (!err) err = bpf_object__init_prog_names(obj); + if (!err) + err = bpf_object__init_struct_ops(obj); return err; } @@ -3700,6 +4274,13 @@ static int bpf_object__collect_reloc(struct bpf_object *obj) return -LIBBPF_ERRNO__INTERNAL; } + if (idx == obj->efile.st_ops_shndx) { + err = bpf_object__collect_struct_ops_reloc(obj, shdr, data); + if (err) + return err; + continue; + } + prog = bpf_object__find_prog_by_idx(obj, idx); if (!prog) { pr_warn("relocation failed: no section(%d)\n", idx); @@ -3734,7 +4315,9 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt, load_attr.insns = insns; load_attr.insns_cnt = insns_cnt; load_attr.license = license; - if (prog->type == BPF_PROG_TYPE_TRACING) { + if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) { + load_attr.attach_btf_id = prog->attach_btf_id; + } else if (prog->type == BPF_PROG_TYPE_TRACING) { load_attr.attach_prog_fd = prog->attach_prog_fd; load_attr.attach_btf_id = prog->attach_btf_id; } else { @@ -3952,6 +4535,7 @@ __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz, if (IS_ERR(obj)) return obj; + obj->st_ops.unreg = OPTS_GET(opts, unreg_st_ops, false); obj->relaxed_core_relocs = OPTS_GET(opts, relaxed_core_relocs, false); relaxed_maps = OPTS_GET(opts, relaxed_maps, false); pin_root_path = OPTS_GET(opts, pin_root_path, NULL); @@ -4077,6 +4661,10 @@ int bpf_object__unload(struct bpf_object *obj) for (i = 0; i < obj->nr_programs; i++) bpf_program__unload(&obj->programs[i]); + if (obj->st_ops.unreg) + bpf_struct_ops__unregister(&obj->st_ops); + zfree(&obj->st_ops.kern_vdata); + return 0; } @@ -4100,7 +4688,9 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr) CHECK_ERR(bpf_object__create_maps(obj), err, out); CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out); + CHECK_ERR(bpf_object__prepare_struct_ops(obj), err, out); CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out); + CHECK_ERR(bpf_object__register_struct_ops(obj), err, out); return 0; out: @@ -4690,6 +5280,9 @@ void bpf_object__close(struct bpf_object *obj) bpf_program__exit(&obj->programs[i]); } zfree(&obj->programs); + zfree(&obj->st_ops.data); + 
zfree(&obj->st_ops.progs); + zfree(&obj->st_ops.kern_func_off); list_del(&obj->list); free(obj); diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 0dbf4bfba0c4..db255fce4948 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -109,8 +109,9 @@ struct bpf_object_open_opts { */ const char *pin_root_path; __u32 attach_prog_fd; + bool unreg_st_ops; }; -#define bpf_object_open_opts__last_field attach_prog_fd +#define bpf_object_open_opts__last_field unreg_st_ops LIBBPF_API struct bpf_object *bpf_object__open(const char *path); LIBBPF_API struct bpf_object * diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index a9eb8b322671..7f06942e9574 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -103,6 +103,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, case BPF_PROG_TYPE_CGROUP_SYSCTL: case BPF_PROG_TYPE_CGROUP_SOCKOPT: case BPF_PROG_TYPE_TRACING: + case BPF_PROG_TYPE_STRUCT_OPS: default: break; } @@ -251,6 +252,7 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex) case BPF_MAP_TYPE_XSKMAP: case BPF_MAP_TYPE_SOCKHASH: case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: + case BPF_MAP_TYPE_STRUCT_OPS: default: break; } From patchwork Sat Dec 14 00:48:05 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209581 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="cOaMjBg8"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47ZTRN5YBcz9sPL for ; Sat, 14 Dec 2019 11:48:16 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727036AbfLNAsP (ORCPT ); Fri, 13 Dec 2019 19:48:15 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:41104 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726987AbfLNAsN (ORCPT ); Fri, 13 Dec 2019 19:48:13 -0500 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xBE0UtCk000747 for ; Fri, 13 Dec 2019 16:48:11 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=MGccbpaS1x/NG3G8e+ps2cay2TtxTMkX+g+I6SGlIXQ=; b=cOaMjBg8auR2pbl+/a57s6FCSsL6eR55BXE4eAeSSV4hKcz8GB68aF01zKum6RjUpwLt 8xA0PCgzbrsgunz+WZXNv6lTrj2uFdZ66vcV4FBDjXmscsX/gFnlyOSPn8v2TGSzW+Vk S3JgyeykSIjSWucHw7kCmtRZP863K7ezjGY= Received: from mail.thefacebook.com (mailout.thefacebook.com [199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2wv1r0vwg4-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Fri, 13 Dec 2019 16:48:11 -0800 Received: from intmgw005.05.ash5.facebook.com (2620:10d:c081:10::13) by mail.thefacebook.com (2620:10d:c081:35::127) with Microsoft SMTP Server (version=TLS1_2, 
cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id 15.1.1713.5; Fri, 13 Dec 2019 16:48:09 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id AF2392943AB4; Fri, 13 Dec 2019 16:48:05 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , David Miller , , Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH bpf-next 12/13] bpf: Add bpf_dctcp example Date: Fri, 13 Dec 2019 16:48:05 -0800 Message-ID: <20191214004805.1653838-1-kafai@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-12-13_09:2019-12-13, 2019-12-13 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 priorityscore=1501 clxscore=1015 bulkscore=0 spamscore=0 mlxscore=0 lowpriorityscore=0 malwarescore=0 impostorscore=0 phishscore=0 mlxlogscore=999 suspectscore=13 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1912140001 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds a bpf_dctcp example. It currently does not implement the no-ECN fallback, but the same could be done through the cgrp2-bpf.
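[ For illustration (hypothetical; not part of this patch): once bpf_dctcp is registered, an application opts into it like any other tcp-cc, mirroring the settcpca() helper in the selftest below:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int use_bpf_dctcp(int fd)
{
	/* "bpf_dctcp" is the .name this struct_ops registers with */
	const char *ca = "bpf_dctcp";

	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, ca, strlen(ca));
}
]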
Signed-off-by: Martin KaFai Lau --- tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++ .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 198 +++++++++++++++ tools/testing/selftests/bpf/progs/bpf_dctcp.c | 194 +++++++++++++++ 3 files changed, 620 insertions(+) create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h new file mode 100644 index 000000000000..7ba8c1b4157a --- /dev/null +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h @@ -0,0 +1,228 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __BPF_TCP_HELPERS_H +#define __BPF_TCP_HELPERS_H + +#include +#include +#include +#include +#include "bpf_trace_helpers.h" + +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__) +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__) +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__) +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__) +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__) +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__) + +struct sock_common { + unsigned char skc_state; +} __attribute__((preserve_access_index)); + +struct sock { + struct sock_common __sk_common; +} __attribute__((preserve_access_index)); + +struct inet_sock { + struct sock sk; +} __attribute__((preserve_access_index)); + +struct inet_connection_sock { + struct inet_sock icsk_inet; + __u8 icsk_ca_state:6, + icsk_ca_setsockopt:1, + icsk_ca_dst_locked:1; + struct { + __u8 pending; + } icsk_ack; + __u64 icsk_ca_priv[104 / sizeof(__u64)]; +} __attribute__((preserve_access_index)); + +struct tcp_sock { + struct inet_connection_sock inet_conn; + + __u32 rcv_nxt; + __u32 snd_nxt; + __u32 snd_una; + __u8 ecn_flags; + __u32 delivered; + __u32 delivered_ce; + __u32 snd_cwnd; + __u32 snd_cwnd_cnt; + __u32 snd_cwnd_clamp; + __u32 snd_ssthresh; + __u8 syn_data:1, /* SYN includes data */ + syn_fastopen:1, /* SYN includes Fast Open option */ + syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */ + syn_fastopen_ch:1, /* Active TFO re-enabling probe */ + syn_data_acked:1,/* data in SYN is acked by SYN-ACK */ + save_syn:1, /* Save headers of SYN packet */ + is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */ + syn_smc:1; /* SYN includes SMC */ + __u32 max_packets_out; + __u32 lsndtime; + __u32 prior_cwnd; +} __attribute__((preserve_access_index)); + +static __always_inline struct inet_connection_sock *inet_csk(const struct sock *sk) +{ + return (struct inet_connection_sock *)sk; +} + +static __always_inline void *inet_csk_ca(const struct sock *sk) +{ + return (void *)inet_csk(sk)->icsk_ca_priv; +} + +static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk) +{ + return (struct tcp_sock *)sk; +} + +static __always_inline bool before(__u32 seq1, __u32 seq2) +{ + return (__s32)(seq1-seq2) < 0; +} +#define after(seq2, seq1) before(seq1, seq2) + +#define TCP_ECN_OK 1 +#define TCP_ECN_QUEUE_CWR 2 +#define TCP_ECN_DEMAND_CWR 4 +#define TCP_ECN_SEEN 8 + +enum inet_csk_ack_state_t { + ICSK_ACK_SCHED = 1, + ICSK_ACK_TIMER = 2, + ICSK_ACK_PUSHED = 4, + ICSK_ACK_PUSHED2 = 8, + ICSK_ACK_NOW = 16 /* Send the next ACK immediately (once) */ +}; + +enum tcp_ca_event { + CA_EVENT_TX_START = 0, + CA_EVENT_CWND_RESTART = 1, + CA_EVENT_COMPLETE_CWR = 2, + CA_EVENT_LOSS = 3, + CA_EVENT_ECN_NO_CE = 4, + CA_EVENT_ECN_IS_CE = 5, +}; + +enum tcp_ca_state { + TCP_CA_Open = 0, + TCP_CA_Disorder = 1, + TCP_CA_CWR = 2, + TCP_CA_Recovery = 3, + TCP_CA_Loss = 4 +}; + +struct ack_sample { + __u32 pkts_acked; + __s32 rtt_us; + __u32 in_flight; +} __attribute__((preserve_access_index)); + +struct rate_sample { + __u64 prior_mstamp; /* starting timestamp for interval */ + __u32 prior_delivered; /* tp->delivered at "prior_mstamp" */ + __s32 delivered; /* number of packets delivered over interval */ + long interval_us; /* time for tp->delivered to incr "delivered" */ + __u32 snd_interval_us; /* snd interval for delivered packets */ + __u32 rcv_interval_us; /* rcv interval for delivered packets */ + long rtt_us; /* RTT of last (S)ACKed packet (or -1) */ + int losses; /* number of packets marked lost upon ACK */ + __u32 acked_sacked; /* number of packets newly (S)ACKed upon ACK */ + __u32 prior_in_flight; /* in flight before this ACK */ + bool is_app_limited; /* is sample from packet with bubble in pipe? */ + bool is_retrans; /* is sample from retransmission? */ + bool is_ack_delayed; /* is this (likely) a delayed ACK? 
*/ +} __attribute__((preserve_access_index)); + +#define TCP_CA_NAME_MAX 16 +#define TCP_CONG_NEEDS_ECN 0x2 + +struct tcp_congestion_ops { + __u32 flags; + + /* initialize private data (optional) */ + void (*init)(struct sock *sk); + /* cleanup private data (optional) */ + void (*release)(struct sock *sk); + + /* return slow start threshold (required) */ + __u32 (*ssthresh)(struct sock *sk); + /* do new cwnd calculation (required) */ + void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked); + /* call before changing ca_state (optional) */ + void (*set_state)(struct sock *sk, __u8 new_state); + /* call when cwnd event occurs (optional) */ + void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev); + /* call when ack arrives (optional) */ + void (*in_ack_event)(struct sock *sk, __u32 flags); + /* new value of cwnd after loss (required) */ + __u32 (*undo_cwnd)(struct sock *sk); + /* hook for packet ack accounting (optional) */ + void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample); + /* override sysctl_tcp_min_tso_segs */ + __u32 (*min_tso_segs)(struct sock *sk); + /* returns the multiplier used in tcp_sndbuf_expand (optional) */ + __u32 (*sndbuf_expand)(struct sock *sk); + /* call when packets are delivered to update cwnd and pacing rate, + * after all the ca_state processing. (optional) + */ + void (*cong_control)(struct sock *sk, const struct rate_sample *rs); + + char name[TCP_CA_NAME_MAX]; +}; + +#define min(a, b) ((a) < (b) ? (a) : (b)) +#define max(a, b) ((a) > (b) ? (a) : (b)) +#define min_not_zero(x, y) ({ \ + typeof(x) __x = (x); \ + typeof(y) __y = (y); \ + __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); }) + +static __always_inline __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked) +{ + __u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh); + + acked -= cwnd - tp->snd_cwnd; + tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp); + + return acked; +} + +static __always_inline bool tcp_in_slow_start(const struct tcp_sock *tp) +{ + return tp->snd_cwnd < tp->snd_ssthresh; +} + +static __always_inline bool tcp_is_cwnd_limited(const struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + + /* If in slow start, ensure cwnd grows to twice what was ACKed. */ + if (tcp_in_slow_start(tp)) + return tp->snd_cwnd < 2 * tp->max_packets_out; + + return !!BPF_CORE_READ_BITFIELD(tp, is_cwnd_limited); +} + +static __always_inline void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked) +{ + /* If credits accumulated at a higher w, apply them gently now. */ + if (tp->snd_cwnd_cnt >= w) { + tp->snd_cwnd_cnt = 0; + tp->snd_cwnd++; + } + + tp->snd_cwnd_cnt += acked; + if (tp->snd_cwnd_cnt >= w) { + __u32 delta = tp->snd_cwnd_cnt / w; + + tp->snd_cwnd_cnt -= delta * w; + tp->snd_cwnd += delta; + } + tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp); +} + +#endif diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c new file mode 100644 index 000000000000..035de76bf8ed --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c @@ -0,0 +1,198 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2019 Facebook */ +#include +#include + +#define min(a, b) ((a) < (b) ? 
(a) : (b)) + +static const unsigned int total_bytes = 10 * 1024 * 1024; +static const struct timeval timeo_sec = { .tv_sec = 10 }; +static const size_t timeo_optlen = sizeof(timeo_sec); +static int stop, duration; + +static int settimeo(int fd) +{ + int err; + + err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo_sec, + timeo_optlen); + if (CHECK(err == -1, "setsockopt(fd, SO_RCVTIMEO)", "errno:%d\n", + errno)) + return -1; + + err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo_sec, + timeo_optlen); + if (CHECK(err == -1, "setsockopt(fd, SO_SNDTIMEO)", "errno:%d\n", + errno)) + return -1; + + return 0; +} + +static int settcpca(int fd, const char *tcp_ca) +{ + int err; + + err = setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, tcp_ca, strlen(tcp_ca)); + if (CHECK(err == -1, "setsockopt(fd, TCP_CONGESTION)", "errno:%d\n", + errno)) + return -1; + + return 0; +} + +static void *server(void *arg) +{ + int lfd = (int)(long)arg, err = 0, fd; + ssize_t nr_sent = 0, bytes = 0; + char batch[1500]; + + fd = accept(lfd, NULL, NULL); + while (fd == -1) { + if (errno == EINTR) + continue; + err = -errno; + goto done; + } + + if (settimeo(fd)) { + err = -errno; + goto done; + } + + while (bytes < total_bytes && !READ_ONCE(stop)) { + nr_sent = send(fd, &batch, + min(total_bytes - bytes, sizeof(batch)), 0); + if (nr_sent == -1 && errno == EINTR) + continue; + if (nr_sent == -1) { + err = -errno; + break; + } + bytes += nr_sent; + } + + CHECK(bytes != total_bytes, "send", "%zd != %u nr_sent:%zd errno:%d\n", + bytes, total_bytes, nr_sent, errno); + +done: + if (fd != -1) + close(fd); + if (err) { + WRITE_ONCE(stop, 1); + return ERR_PTR(err); + } + return NULL; +} + +static void do_test(const char *tcp_ca) +{ + struct sockaddr_in6 sa6 = {}; + ssize_t nr_recv = 0, bytes = 0; + int lfd = -1, fd = -1; + pthread_t srv_thread; + socklen_t addrlen = sizeof(sa6); + void *thread_ret; + char batch[1500]; + int err; + + WRITE_ONCE(stop, 0); + + lfd = socket(AF_INET6, SOCK_STREAM, 0); + if (CHECK(lfd == -1, "socket", "errno:%d\n", errno)) + return; + fd = socket(AF_INET6, SOCK_STREAM, 0); + if (CHECK(fd == -1, "socket", "errno:%d\n", errno)) { + close(lfd); + return; + } + + if (settcpca(lfd, tcp_ca) || settcpca(fd, tcp_ca) || + settimeo(lfd) || settimeo(fd)) + goto done; + + /* bind, listen and start server thread to accept */ + sa6.sin6_family = AF_INET6; + sa6.sin6_addr = in6addr_loopback; + err = bind(lfd, (struct sockaddr *)&sa6, addrlen); + if (CHECK(err == -1, "bind", "errno:%d\n", errno)) + goto done; + err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen); + if (CHECK(err == -1, "getsockname", "errno:%d\n", errno)) + goto done; + err = listen(lfd, 1); + if (CHECK(err == -1, "listen", "errno:%d\n", errno)) + goto done; + err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd); + if (CHECK(err != 0, "pthread_create", "err:%d\n", err)) + goto done; + + /* connect to server */ + err = connect(fd, (struct sockaddr *)&sa6, addrlen); + if (CHECK(err == -1, "connect", "errno:%d\n", errno)) + goto wait_thread; + + /* recv total_bytes */ + while (bytes < total_bytes && !READ_ONCE(stop)) { + nr_recv = recv(fd, &batch, + min(total_bytes - bytes, sizeof(batch)), 0); + if (nr_recv == -1 && errno == EINTR) + continue; + if (nr_recv == -1) + break; + bytes += nr_recv; + } + + CHECK(bytes != total_bytes, "recv", "%zd != %u nr_recv:%zd errno:%d\n", + bytes, total_bytes, nr_recv, errno); + +wait_thread: + WRITE_ONCE(stop, 1); + pthread_join(srv_thread, &thread_ret); + CHECK(IS_ERR(thread_ret), "pthread_join", 
"thread_ret:%ld", + PTR_ERR(thread_ret)); +done: + close(lfd); + close(fd); +} + +static struct bpf_object *load(const char *filename) +{ + DECLARE_LIBBPF_OPTS(bpf_object_open_opts, open_opts, + .unreg_st_ops = true, + ); + struct bpf_object *obj; + int err; + + obj = bpf_object__open_file(filename, &open_opts); + if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n", + PTR_ERR(obj))) + return obj; + + err = bpf_object__load(obj); + if (CHECK(err, "bpf_object__load", "err:%d\n", err)) { + bpf_object__close(obj); + return ERR_PTR(err); + } + + return obj; +} + +static void test_dctcp(void) +{ + struct bpf_object *obj; + + obj = load("bpf_dctcp.o"); + if (IS_ERR(obj)) + return; + + do_test("bpf_dctcp"); + + bpf_object__close(obj); +} + +void test_bpf_tcp_ca(void) +{ + if (test__start_subtest("dctcp")) + test_dctcp(); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c new file mode 100644 index 000000000000..794954832adc --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c @@ -0,0 +1,194 @@ +#include +#include +#include "bpf_tcp_helpers.h" + +char _license[] SEC("license") = "GPL"; + +#define DCTCP_MAX_ALPHA 1024U + +struct dctcp { + __u32 old_delivered; + __u32 old_delivered_ce; + __u32 prior_rcv_nxt; + __u32 dctcp_alpha; + __u32 next_seq; + __u32 ce_state; + __u32 loss_cwnd; +}; + +static unsigned int dctcp_shift_g = 4; /* g = 1/2^4 */ +static unsigned int dctcp_alpha_on_init = DCTCP_MAX_ALPHA; + +static __always_inline void dctcp_reset(const struct tcp_sock *tp, + struct dctcp *ca) +{ + ca->next_seq = tp->snd_nxt; + + ca->old_delivered = tp->delivered; + ca->old_delivered_ce = tp->delivered_ce; +} + +BPF_TCP_OPS_1(dctcp_init, void, struct sock *, sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + struct dctcp *ca = inet_csk_ca(sk); + + ca->prior_rcv_nxt = tp->rcv_nxt; + ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA); + ca->loss_cwnd = 0; + ca->ce_state = 0; + + dctcp_reset(tp, ca); +} + +BPF_TCP_OPS_1(dctcp_ssthresh, __u32, struct sock *, sk) +{ + struct dctcp *ca = inet_csk_ca(sk); + struct tcp_sock *tp = tcp_sk(sk); + + ca->loss_cwnd = tp->snd_cwnd; + return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U); +} + +BPF_TCP_OPS_2(dctcp_update_alpha, void, + struct sock *, sk, __u32, flags) +{ + const struct tcp_sock *tp = tcp_sk(sk); + struct dctcp *ca = inet_csk_ca(sk); + + /* Expired RTT */ + if (!before(tp->snd_una, ca->next_seq)) { + __u32 delivered_ce = tp->delivered_ce - ca->old_delivered_ce; + __u32 alpha = ca->dctcp_alpha; + + /* alpha = (1 - g) * alpha + g * F */ + + alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g); + if (delivered_ce) { + __u32 delivered = tp->delivered - ca->old_delivered; + + /* If dctcp_shift_g == 1, a 32bit value would overflow + * after 8 M packets. 
+ */ + delivered_ce <<= (10 - dctcp_shift_g); + delivered_ce /= max(1U, delivered); + + alpha = min(alpha + delivered_ce, DCTCP_MAX_ALPHA); + } + ca->dctcp_alpha = alpha; + dctcp_reset(tp, ca); + } +} + +static __always_inline void dctcp_react_to_loss(struct sock *sk) +{ + struct dctcp *ca = inet_csk_ca(sk); + struct tcp_sock *tp = tcp_sk(sk); + + ca->loss_cwnd = tp->snd_cwnd; + tp->snd_ssthresh = max(tp->snd_cwnd >> 1U, 2U); +} + +BPF_TCP_OPS_2(dctcp_state, void, struct sock *, sk, __u8, new_state) +{ + if (new_state == TCP_CA_Recovery && + new_state != BPF_CORE_READ_BITFIELD(inet_csk(sk), icsk_ca_state)) + dctcp_react_to_loss(sk); + /* We handle RTO in dctcp_cwnd_event to ensure that we perform only + * one loss-adjustment per RTT. + */ +} + +static __always_inline void dctcp_ece_ack_cwr(struct sock *sk, __u32 ce_state) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (ce_state == 1) + tp->ecn_flags |= TCP_ECN_DEMAND_CWR; + else + tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR; +} + +/* Minimal DCTP CE state machine: + * + * S: 0 <- last pkt was non-CE + * 1 <- last pkt was CE + */ +static __always_inline +void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt, + __u32 *prior_rcv_nxt, __u32 *ce_state) +{ + __u32 new_ce_state = (evt == CA_EVENT_ECN_IS_CE) ? 1 : 0; + + if (*ce_state != new_ce_state) { + /* CE state has changed, force an immediate ACK to + * reflect the new CE state. If an ACK was delayed, + * send that first to reflect the prior CE state. + */ + if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) { + dctcp_ece_ack_cwr(sk, *ce_state); + bpf_tcp_send_ack(sk, *prior_rcv_nxt); + } + inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW; + } + *prior_rcv_nxt = tcp_sk(sk)->rcv_nxt; + *ce_state = new_ce_state; + dctcp_ece_ack_cwr(sk, new_ce_state); +} + +BPF_TCP_OPS_2(dctcp_cwnd_event, void, + struct sock *, sk, enum tcp_ca_event, ev) +{ + struct dctcp *ca = inet_csk_ca(sk); + + switch (ev) { + case CA_EVENT_ECN_IS_CE: + case CA_EVENT_ECN_NO_CE: + dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state); + break; + case CA_EVENT_LOSS: + dctcp_react_to_loss(sk); + break; + default: + /* Don't care for the rest. */ + break; + } +} + +BPF_TCP_OPS_1(dctcp_cwnd_undo, __u32, struct sock *, sk) +{ + const struct dctcp *ca = inet_csk_ca(sk); + + return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd); +} + +BPF_TCP_OPS_3(tcp_reno_cong_avoid, void, + struct sock *, sk, __u32, ack, __u32, acked) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (!tcp_is_cwnd_limited(sk)) + return; + + /* In "safe" area, increase. */ + if (tcp_in_slow_start(tp)) { + acked = tcp_slow_start(tp, acked); + if (!acked) + return; + } + /* In dangerous area, increase slowly. 
*/ + tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked); +} + +SEC("struct_ops") +struct tcp_congestion_ops dctcp = { + .init = (void *)dctcp_init, + .in_ack_event = (void *)dctcp_update_alpha, + .cwnd_event = (void *)dctcp_cwnd_event, + .ssthresh = (void *)dctcp_ssthresh, + .cong_avoid = (void *)tcp_reno_cong_avoid, + .undo_cwnd = (void *)dctcp_cwnd_undo, + .set_state = (void *)dctcp_state, + .flags = TCP_CONG_NEEDS_ECN, + .name = "bpf_dctcp", +}; From patchwork Sat Dec 14 00:48:07 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 1209582 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="NsZ+2TJH"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47ZTRP46DZz9sPW for ; Sat, 14 Dec 2019 11:48:17 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727050AbfLNAsQ (ORCPT ); Fri, 13 Dec 2019 19:48:16 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:51236 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726996AbfLNAsN (ORCPT ); Fri, 13 Dec 2019 19:48:13 -0500 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xBE0UVQf009129 for ; Fri, 13 Dec 2019 16:48:12 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=PAGJ4yDo5B7Ns9C9nuZQq82ivUiRt+F9LvjZhaOJrhg=; b=NsZ+2TJH3UAn3DaD4RIPeWpvzwpuj2HIWP5I1UserQDBD2Tat4ZOTw4ZaYH8qt6d/zWU pA/uNovEMtYkkJQPQxSFpyvQJZu6TnkUHbDKxnwT4zTXi4VfIQBS63I53IoAz0s+oME/ fiRxgh1/MOlnLsKj2U+reFSjKZtNxCIhV4E= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 2wvkm20g2d-6 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 13 Dec 2019 16:48:12 -0800 Received: from intmgw003.05.ash5.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Fri, 13 Dec 2019 16:48:09 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 062F72943AB4; Fri, 13 Dec 2019 16:48:07 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Martin KaFai Lau Smtp-Origin-Hostname: devbig005.ftw2.facebook.com To: CC: Alexei Starovoitov , Daniel Borkmann , David Miller , , Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH bpf-next 13/13] bpf: Add bpf_cubic example Date: Fri, 13 Dec 2019 16:48:07 -0800 Message-ID: <20191214004807.1653955-1-kafai@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191214004737.1652076-1-kafai@fb.com> References: <20191214004737.1652076-1-kafai@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 
definitions=2019-12-13_09:2019-12-13, 2019-12-13 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 impostorscore=0 phishscore=0 lowpriorityscore=0 bulkscore=0 priorityscore=1501 mlxlogscore=999 malwarescore=0 adultscore=0 spamscore=0 mlxscore=0 clxscore=1015 suspectscore=13 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1912140001 X-FB-Internal: deliver Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds the bpf_cubic tcp sample. The CONFIG_HZ=1000 requirement will go away when the libbpf extern-var support is ready. Signed-off-by: Martin KaFai Lau --- .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 22 + tools/testing/selftests/bpf/progs/bpf_cubic.c | 502 ++++++++++++++++++ 2 files changed, 524 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/bpf_cubic.c diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c index 035de76bf8ed..3d787ecafa4a 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c @@ -178,6 +178,26 @@ static struct bpf_object *load(const char *filename) return obj; } +static void test_cubic(void) +{ + struct bpf_object *obj; + int err; + + err = system("zgrep 'CONFIG_HZ=1000$' /proc/config.gz >& /dev/null"); + if (err) { + test__skip(); + return; + } + + obj = load("bpf_cubic.o"); + if (IS_ERR(obj)) + return; + + do_test("bpf_cubic"); + + bpf_object__close(obj); +} + static void test_dctcp(void) { struct bpf_object *obj; @@ -195,4 +215,6 @@ void test_bpf_tcp_ca(void) { if (test__start_subtest("dctcp")) test_dctcp(); + if (test__start_subtest("cubic")) + test_cubic(); } diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c b/tools/testing/selftests/bpf/progs/bpf_cubic.c new file mode 100644 index 000000000000..ca77d6a34406 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c @@ -0,0 +1,502 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include +#include "bpf_tcp_helpers.h" + +char _license[] SEC("license") = "GPL"; + +#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi) + +#define BICTCP_BETA_SCALE 1024 /* Scale factor beta calculation + * max_cwnd = snd_cwnd * beta + */ +#define BICTCP_HZ 10 /* BIC HZ 2^10 = 1024 */ + +/* Two methods of hybrid slow start */ +#define HYSTART_ACK_TRAIN 0x1 +#define HYSTART_DELAY 0x2 + +/* Number of delay samples for detecting the increase of delay */ +#define HYSTART_MIN_SAMPLES 8 +#define HYSTART_DELAY_MIN (4U<<3) +#define HYSTART_DELAY_MAX (16U<<3) +#define HYSTART_DELAY_THRESH(x) clamp(x, HYSTART_DELAY_MIN, HYSTART_DELAY_MAX) + +static int fast_convergence = 1; +static const int beta = 717; /* = 717/1024 (BICTCP_BETA_SCALE) */ +static int initial_ssthresh; +static const int bic_scale = 41; +static int tcp_friendliness = 1; + +static int hystart = 1; +static int hystart_detect = HYSTART_ACK_TRAIN | HYSTART_DELAY; +static int hystart_low_window = 16; +static int hystart_ack_delta = 2; + +static __u32 cube_rtt_scale = (bic_scale * 10); /* 1024*c/rtt */ +static __u32 beta_scale = 8*(BICTCP_BETA_SCALE+beta) / 3 + / (BICTCP_BETA_SCALE - beta); +/* calculate the "K" for (wmax-cwnd) = c/rtt * K^3 + * so K = cubic_root( (wmax-cwnd)*rtt/c ) + * the unit of K is bictcp_HZ=2^10, not HZ + * + * c = bic_scale >> 10 + * rtt = 100ms + * + * the following code has been designed and tested for + * cwnd < 1 million packets + * RTT < 100 
diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c b/tools/testing/selftests/bpf/progs/bpf_cubic.c
new file mode 100644
index 000000000000..ca77d6a34406
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bpf.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
+
+#define BICTCP_BETA_SCALE	1024	/* Scale factor beta calculation
+					 * max_cwnd = snd_cwnd * beta
+					 */
+#define BICTCP_HZ		10	/* BIC HZ 2^10 = 1024 */
+
+/* Two methods of hybrid slow start */
+#define HYSTART_ACK_TRAIN	0x1
+#define HYSTART_DELAY		0x2
+
+/* Number of delay samples for detecting the increase of delay */
+#define HYSTART_MIN_SAMPLES	8
+#define HYSTART_DELAY_MIN	(4U<<3)
+#define HYSTART_DELAY_MAX	(16U<<3)
+#define HYSTART_DELAY_THRESH(x)	clamp(x, HYSTART_DELAY_MIN, HYSTART_DELAY_MAX)
+
+static int fast_convergence = 1;
+static const int beta = 717;	/* = 717/1024 (BICTCP_BETA_SCALE) */
+static int initial_ssthresh;
+static const int bic_scale = 41;
+static int tcp_friendliness = 1;
+
+static int hystart = 1;
+static int hystart_detect = HYSTART_ACK_TRAIN | HYSTART_DELAY;
+static int hystart_low_window = 16;
+static int hystart_ack_delta = 2;
+
+static __u32 cube_rtt_scale = (bic_scale * 10);	/* 1024*c/rtt */
+static __u32 beta_scale = 8*(BICTCP_BETA_SCALE+beta) / 3
+				/ (BICTCP_BETA_SCALE - beta);
+
+/* calculate the "K" for (wmax-cwnd) = c/rtt * K^3
+ * so K = cubic_root( (wmax-cwnd)*rtt/c )
+ * the unit of K is bictcp_HZ=2^10, not HZ
+ *
+ * c = bic_scale >> 10
+ * rtt = 100ms
+ *
+ * the following code has been designed and tested for
+ * cwnd < 1 million packets
+ * RTT < 100 seconds
+ * HZ < 1,000,000 (i.e. a jiffy of at least one micro-second)
+ */
+
+/* 1/c * 2^2*bictcp_HZ * srtt, 2^40 */
+static __u64 cube_factor = (__u64)(1ull << (10+3*BICTCP_HZ))
+				/ (bic_scale * 10);
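/* Aside (not part of the patch): a userspace sanity check of the
 * fixed-point math above. The 100-packet distance to wmax is an
 * assumption made up for illustration.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* cube_factor = 2^(10 + 3*BICTCP_HZ) / (bic_scale * 10)
	 *             = 2^40 / 410 ~= 2.68e9
	 */
	const unsigned long long cube_factor =
		(1ULL << (10 + 3 * 10)) / (41 * 10);
	unsigned int dist = 100;	/* wmax - cwnd, in packets */

	/* bic_K = cubic_root(cube_factor * dist), in 2^BICTCP_HZ units:
	 * cbrt(2.68e11) ~= 6447, i.e. ~6.3 scaled time units to wmax.
	 */
	printf("bic_K ~= %.0f\n", cbrt((double)(cube_factor * dist)));
	return 0;
}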
+
+/* BIC TCP Parameters */
+struct bictcp {
+	__u32	cnt;		/* increase cwnd by 1 after ACKs */
+	__u32	last_max_cwnd;	/* last maximum snd_cwnd */
+	__u32	last_cwnd;	/* the last snd_cwnd */
+	__u32	last_time;	/* time when updated last_cwnd */
+	__u32	bic_origin_point;/* origin point of bic function */
+	__u32	bic_K;		/* time to origin point
+				 * from the beginning of the current epoch
+				 */
+	__u32	delay_min;	/* min delay (msec << 3) */
+	__u32	epoch_start;	/* beginning of an epoch */
+	__u32	ack_cnt;	/* number of acks */
+	__u32	tcp_cwnd;	/* estimated tcp cwnd */
+	__u16	unused;
+	__u8	sample_cnt;	/* number of samples to decide curr_rtt */
+	__u8	found;		/* the exit point is found? */
+	__u32	round_start;	/* beginning of each round */
+	__u32	end_seq;	/* end_seq of the round */
+	__u32	last_ack;	/* last time when the ACK spacing is close */
+	__u32	curr_rtt;	/* the minimum rtt of current round */
+};
+
+static __always_inline void bictcp_reset(struct bictcp *ca)
+{
+	ca->cnt = 0;
+	ca->last_max_cwnd = 0;
+	ca->last_cwnd = 0;
+	ca->last_time = 0;
+	ca->bic_origin_point = 0;
+	ca->bic_K = 0;
+	ca->delay_min = 0;
+	ca->epoch_start = 0;
+	ca->ack_cnt = 0;
+	ca->tcp_cwnd = 0;
+	ca->found = 0;
+}
+
+#define HZ		1000UL
+#define NSEC_PER_MSEC	1000000UL
+#define USEC_PER_MSEC	1000UL
+
+static __always_inline __u64 msecs_to_jiffies(__u32 m)
+{
+	return bpf_jiffies((__u64)m * NSEC_PER_MSEC, BPF_F_NS_TO_JIFFIES);
+}
+
+static __always_inline __u32 bictcp_clock(void)
+{
+	return bpf_jiffies(0, BPF_F_JIFFIES_TO_NS) / NSEC_PER_MSEC;
+}
+
+#define tcp_jiffies32 ((__u32)bpf_jiffies(0, 0))
+
+static __always_inline void bictcp_hystart_reset(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	ca->round_start = ca->last_ack = bictcp_clock();
+	ca->end_seq = tp->snd_nxt;
+	ca->curr_rtt = 0;
+	ca->sample_cnt = 0;
+}
+
+BPF_TCP_OPS_1(bictcp_init, void, struct sock *, sk)
+{
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	bictcp_reset(ca);
+
+	if (hystart)
+		bictcp_hystart_reset(sk);
+
+	if (!hystart && initial_ssthresh)
+		tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
+}
+
+BPF_TCP_OPS_2(bictcp_cwnd_event, void, struct sock *, sk,
+	      enum tcp_ca_event, event)
+{
+	if (event == CA_EVENT_TX_START) {
+		struct bictcp *ca = inet_csk_ca(sk);
+		__u32 now = tcp_jiffies32;
+		__s32 delta;
+
+		delta = now - tcp_sk(sk)->lsndtime;
+
+		/* We were application limited (idle) for a while.
+		 * Shift epoch_start to keep cwnd growth to cubic curve.
+		 */
+		if (ca->epoch_start && delta > 0) {
+			ca->epoch_start += delta;
+			if (after(ca->epoch_start, now))
+				ca->epoch_start = now;
+		}
+		return;
+	}
+}
+
+#define BITS_PER_LONG	(sizeof(long) * 8)
+
+static __always_inline unsigned long __fls(unsigned long word)
+{
+	int num = BITS_PER_LONG - 1;
+
+	if (!(word & (~0ul << 32))) {
+		num -= 32;
+		word <<= 32;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-16)))) {
+		num -= 16;
+		word <<= 16;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-8)))) {
+		num -= 8;
+		word <<= 8;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-4)))) {
+		num -= 4;
+		word <<= 4;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-2)))) {
+		num -= 2;
+		word <<= 2;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-1))))
+		num -= 1;
+
+	return num;
+}
+
+static __always_inline int fls64(__u64 x)
+{
+	if (x == 0)
+		return 0;
+	return __fls(x) + 1;
+}
+
+static __always_inline __u64 div64_u64(__u64 dividend, __u64 divisor)
+{
+	return dividend / divisor;
+}
+
+/*
+ * cbrt(x) MSB values for x MSB values in [0..63].
+ * Precomputed then refined by hand - Willy Tarreau
+ *
+ * For x in [0..63],
+ *   v = cbrt(x << 18) - 1
+ *   cbrt(x) = (v[x] + 10) >> 6
+ */
+static const __u8 v[] = {
+	/* 0x00 */    0,   54,   54,   54,  118,  118,  118,  118,
+	/* 0x08 */  123,  129,  134,  138,  143,  147,  151,  156,
+	/* 0x10 */  157,  161,  164,  168,  170,  173,  176,  179,
+	/* 0x18 */  181,  185,  187,  190,  192,  194,  197,  199,
+	/* 0x20 */  200,  202,  204,  206,  209,  211,  213,  215,
+	/* 0x28 */  217,  219,  221,  222,  224,  225,  227,  229,
+	/* 0x30 */  231,  232,  234,  236,  237,  239,  240,  242,
+	/* 0x38 */  244,  245,  246,  248,  250,  251,  252,  254,
+};
+
+/* calculate the cubic root of x using a table lookup followed by one
+ * Newton-Raphson iteration.
+ * Avg err ~= 0.195%
+ */
+static __always_inline __u32 cubic_root(__u64 a)
+{
+	__u32 x, b, shift;
+
+	b = fls64(a);
+	if (a < 64) {
+		/* a in [0..63] */
+		return ((__u32)v[(__u32)a] + 35) >> 6;
+	}
+
+	/* b >= 7 */
+	b = ((b * 84) >> 8) - 1;
+	shift = (a >> (b * 3));
+
+	/* it is needed for verifier's bound check on v */
+	if (shift >= 64)
+		return 0;
+
+	x = ((__u32)(((__u32)v[shift] + 10) << b)) >> 6;
+
+	/*
+	 * Newton-Raphson iteration
+	 *                          2
+	 * x    = ( 2 * x  +  a / x   ) / 3
+	 *  k+1          k         k
+	 */
+	x = (2 * x + (__u32)div64_u64(a, (__u64)x * (__u64)(x - 1)));
+	x = ((x * 341) >> 10);
+
+	return x;
+}
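/* Aside (not part of the patch): a quick userspace accuracy harness
 * for cubic_root() above. It assumes cubic_root() and its helpers are
 * pasted into the same file and that __u32/__u64 are typedef'd (e.g.
 * from <asm/types.h>). Per the comment above, the worst case should
 * stay well under 1%.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	unsigned long long i;
	double worst = 0.0;

	for (i = 64; i < (1ULL << 40); i = i * 3 + 7) {
		double exact = cbrt((double)i);
		double err = fabs((double)cubic_root(i) - exact) / exact;

		if (err > worst)
			worst = err;
	}
	printf("worst relative error: %.3f%%\n", worst * 100.0);
	return 0;
}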
+
+/*
+ * Compute congestion window to use.
+ */
+static __always_inline void bictcp_update(struct bictcp *ca, __u32 cwnd,
+					  __u32 acked)
+{
+	__u32 delta, bic_target, max_cnt;
+	__u64 offs, t;
+
+	ca->ack_cnt += acked;	/* count the number of ACKed packets */
+
+	if (ca->last_cwnd == cwnd &&
+	    (__s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+		return;
+
+	/* The CUBIC function can update ca->cnt at most once per jiffy.
+	 * On all cwnd reduction events, ca->epoch_start is set to 0,
+	 * which will force a recalculation of ca->cnt.
+	 */
+	if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
+		goto tcp_friendliness;
+
+	ca->last_cwnd = cwnd;
+	ca->last_time = tcp_jiffies32;
+
+	if (ca->epoch_start == 0) {
+		ca->epoch_start = tcp_jiffies32;	/* record beginning */
+		ca->ack_cnt = acked;			/* start counting */
+		ca->tcp_cwnd = cwnd;			/* sync with cubic */
+
+		if (ca->last_max_cwnd <= cwnd) {
+			ca->bic_K = 0;
+			ca->bic_origin_point = cwnd;
+		} else {
+			/* Compute new K based on
+			 * (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)
+			 */
+			ca->bic_K = cubic_root(cube_factor
+					       * (ca->last_max_cwnd - cwnd));
+			ca->bic_origin_point = ca->last_max_cwnd;
+		}
+	}
+
+	/* cubic function - calc */
+	/* calculate c * time^3 / rtt,
+	 * while considering overflow in calculation of time^3
+	 * (so time^3 is done by using 64 bit)
+	 * and without the support of division of 64bit numbers
+	 * (so all divisions are done by using 32 bit)
+	 * also NOTE the unit of those variables
+	 *	 time = (t - K) / 2^bictcp_HZ
+	 *	 c = bic_scale >> 10
+	 *	 rtt = (srtt >> 3) / HZ
+	 * !!! The following code does not have overflow problems,
+	 * if the cwnd < 1 million packets !!!
+	 */
+
+	t = (__s32)(tcp_jiffies32 - ca->epoch_start);
+	t += msecs_to_jiffies(ca->delay_min >> 3);
+	/* change the unit from HZ to bictcp_HZ */
+	t <<= BICTCP_HZ;
+	t /= HZ;
+
+	if (t < ca->bic_K)		/* t - K */
+		offs = ca->bic_K - t;
+	else
+		offs = t - ca->bic_K;
+
+	/* c/rtt * (t-K)^3 */
+	delta = (cube_rtt_scale * offs * offs * offs) >> (10+3*BICTCP_HZ);
+	if (t < ca->bic_K)			/* below origin */
+		bic_target = ca->bic_origin_point - delta;
+	else					/* above origin */
+		bic_target = ca->bic_origin_point + delta;
+
+	/* cubic function - calc bictcp_cnt */
+	if (bic_target > cwnd) {
+		ca->cnt = cwnd / (bic_target - cwnd);
+	} else {
+		ca->cnt = 100 * cwnd;		/* very small increment */
+	}
+
+	/*
+	 * The initial growth of cubic function may be too conservative
+	 * when the available bandwidth is still unknown.
+	 */
+	if (ca->last_max_cwnd == 0 && ca->cnt > 20)
+		ca->cnt = 20;	/* increase cwnd 5% per RTT */
+
+tcp_friendliness:
+	/* TCP Friendly */
+	if (tcp_friendliness) {
+		__u32 scale = beta_scale;
+		__u32 n;
+
+		/* update tcp cwnd */
+		delta = (cwnd * scale) >> 3;
+		if (delta) {
+			n = ca->ack_cnt / delta;
+			ca->ack_cnt -= n * delta;
+			ca->tcp_cwnd += n;
+		}
+
+		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
+			delta = ca->tcp_cwnd - cwnd;
+			max_cnt = cwnd / delta;
+			if (ca->cnt > max_cnt)
+				ca->cnt = max_cnt;
+		}
+	}
+
+	/* The maximum rate of cwnd increase CUBIC allows is 1 packet per
+	 * 2 packets ACKed, meaning cwnd grows at 1.5x per RTT.
+	 */
+	ca->cnt = max(ca->cnt, 2U);
+}
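/* Aside (not part of the patch): what ca->cnt means downstream. This
 * is a simplified model of the kernel's tcp_cong_avoid_ai(): cwnd
 * grows by one packet for every cnt ACKed packets, so the
 * max(ca->cnt, 2U) clamp above caps growth at one packet per two
 * ACKs, i.e. at most 1.5x cwnd per RTT.
 */
static void cong_avoid_ai_model(unsigned int *cwnd, unsigned int *cwnd_cnt,
				unsigned int cnt, unsigned int acked)
{
	*cwnd_cnt += acked;
	if (*cwnd_cnt >= cnt) {
		*cwnd += *cwnd_cnt / cnt;	/* one packet per cnt ACKs */
		*cwnd_cnt %= cnt;
	}
}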
+
+BPF_TCP_OPS_3(bictcp_cong_avoid, void, struct sock *, sk,
+	      __u32, ack, __u32, acked)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	if (!tcp_is_cwnd_limited(sk))
+		return;
+
+	if (tcp_in_slow_start(tp)) {
+		if (hystart && after(ack, ca->end_seq))
+			bictcp_hystart_reset(sk);
+		acked = tcp_slow_start(tp, acked);
+		if (!acked)
+			return;
+	}
+	bictcp_update(ca, tp->snd_cwnd, acked);
+	tcp_cong_avoid_ai(tp, ca->cnt, acked);
+}
+
+BPF_TCP_OPS_1(bictcp_recalc_ssthresh, __u32, struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	ca->epoch_start = 0;	/* end of epoch */
+
+	/* Wmax and fast convergence */
+	if (tp->snd_cwnd < ca->last_max_cwnd && fast_convergence)
+		ca->last_max_cwnd = (tp->snd_cwnd * (BICTCP_BETA_SCALE + beta))
+			/ (2 * BICTCP_BETA_SCALE);
+	else
+		ca->last_max_cwnd = tp->snd_cwnd;
+
+	return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
+}
+
+BPF_TCP_OPS_2(bictcp_state, void, struct sock *, sk, __u8, new_state)
+{
+	if (new_state == TCP_CA_Loss) {
+		bictcp_reset(inet_csk_ca(sk));
+		bictcp_hystart_reset(sk);
+	}
+}
+
+static __always_inline void hystart_update(struct sock *sk, __u32 delay)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	if (ca->found & hystart_detect)
+		return;
+
+	if (hystart_detect & HYSTART_ACK_TRAIN) {
+		__u32 now = bictcp_clock();
+
+		/* first detection parameter - ack-train detection */
+		if ((__s32)(now - ca->last_ack) <= hystart_ack_delta) {
+			ca->last_ack = now;
+			if ((__s32)(now - ca->round_start) > ca->delay_min >> 4) {
+				ca->found |= HYSTART_ACK_TRAIN;
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
+		}
+	}
+
+	if (hystart_detect & HYSTART_DELAY) {
+		/* obtain the minimum delay of more than sampling packets */
+		if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
+			if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
+				ca->curr_rtt = delay;
+
+			ca->sample_cnt++;
+		} else {
+			if (ca->curr_rtt > ca->delay_min +
+			    HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
+				ca->found |= HYSTART_DELAY;
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
+		}
+	}
+}
+
+/* Track delayed acknowledgment ratio using sliding window
+ * ratio = (15*ratio + sample) / 16
+ */
+BPF_TCP_OPS_2(bictcp_acked, void, struct sock *, sk,
+	      const struct ack_sample *, sample)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+	__u32 delay;
+
+	/* Some calls are for duplicates without timestamps */
+	if (sample->rtt_us < 0)
+		return;
+
+	/* Discard delay samples right after fast recovery */
+	if (ca->epoch_start && (__s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
+		return;
+
+	delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
+	if (delay == 0)
+		delay = 1;
+
+	/* first time call or link delay decreases */
+	if (ca->delay_min == 0 || ca->delay_min > delay)
+		ca->delay_min = delay;
+
+	/* hystart triggers when cwnd is larger than some threshold */
+	if (hystart && tcp_in_slow_start(tp) &&
+	    tp->snd_cwnd >= hystart_low_window)
+		hystart_update(sk, delay);
+}
+
+BPF_TCP_OPS_1(tcp_reno_undo_cwnd, __u32, struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return max(tp->snd_cwnd, tp->prior_cwnd);
+}
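/* Aside (not part of the patch): units in hystart_update() above.
 * Delay values are kept in (msec << 3) fixed point, so
 * HYSTART_DELAY_MIN (4U << 3) is 4 ms. For an assumed 10 ms base RTT,
 * the delay-based exit fires once a round's minimum RTT exceeds
 * delay_min plus the clamped delay_min/8 threshold.
 */
#include <stdio.h>

int main(void)
{
	unsigned int delay_min = 10 << 3;	/* 10 ms in <<3 units */
	unsigned int thresh = delay_min >> 3;	/* one eighth of base RTT */

	if (thresh < (4U << 3))
		thresh = 4U << 3;		/* HYSTART_DELAY_MIN */
	if (thresh > (16U << 3))
		thresh = 16U << 3;		/* HYSTART_DELAY_MAX */

	/* 80 + 32 = 112 in <<3 units, i.e. exit above a 14 ms sample */
	printf("exit threshold: %u\n", delay_min + thresh);
	return 0;
}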
+
+SEC("struct_ops")
+struct tcp_congestion_ops cubictcp = {
+	.init		= (void *)bictcp_init,
+	.ssthresh	= (void *)bictcp_recalc_ssthresh,
+	.cong_avoid	= (void *)bictcp_cong_avoid,
+	.set_state	= (void *)bictcp_state,
+	.undo_cwnd	= (void *)tcp_reno_undo_cwnd,
+	.cwnd_event	= (void *)bictcp_cwnd_event,
+	.pkts_acked	= (void *)bictcp_acked,
+	.name		= "bpf_cubic",
+};
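Once the object is loaded and the map registered, "bpf_cubic" is
selectable per socket just like a native congestion control module,
which is what the selftest's do_test() exercises. A minimal sketch,
assuming fd is a TCP socket:

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <string.h>
	#include <sys/socket.h>

	int use_bpf_cubic(int fd)
	{
		static const char cc[] = "bpf_cubic";

		/* Selects the BPF congestion control registered above. */
		return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
				  cc, strlen(cc));
	}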