From patchwork Sun Nov 17 06:50:31 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 1196307 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="Hesi6yHW"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47G2m851Zyz9sP6 for ; Sun, 17 Nov 2019 17:50:48 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725901AbfKQGuq (ORCPT ); Sun, 17 Nov 2019 01:50:46 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:7688 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725867AbfKQGuq (ORCPT ); Sun, 17 Nov 2019 01:50:46 -0500 Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xAH6nQGE002542 for ; Sat, 16 Nov 2019 22:50:43 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=63+8qd7O0ydfR/0T3LG2muosFp4njM6pnyZpTOJtrkI=; b=Hesi6yHWf2SBGsZCssmtDTwQbqt6JfelouC/8NifIOdEWIJ0KheglxCryb1UFqiiCA+0 bpNfnd8AjdckGESWd57cBYugj8Rhj5ZjiZFtW4t7mkF3fNqwcfuXVWiTXyp2aii5+guC tItsfk/7zyC3qkf+QdA0rLENKcOFaPMgcD8= Received: from mail.thefacebook.com (mailout.thefacebook.com [199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2waftjsek4-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Sat, 16 Nov 2019 22:50:43 -0800 Received: from 2401:db00:12:909f:face:0:3:0 (2620:10d:c081:10::13) by mail.thefacebook.com (2620:10d:c081:35::127) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id 15.1.1713.5; Sat, 16 Nov 2019 22:50:41 -0800 Received: by devbig012.ftw2.facebook.com (Postfix, from userid 137359) id 96D432EC18AA; Sat, 16 Nov 2019 22:50:40 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: , , , CC: , , Andrii Nakryiko Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH v5 bpf-next 1/5] bpf: switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails Date: Sat, 16 Nov 2019 22:50:31 -0800 Message-ID: <20191117065035.202462-2-andriin@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191117065035.202462-1-andriin@fb.com> References: <20191117065035.202462-1-andriin@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-11-17_01:2019-11-15, 2019-11-16 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 lowpriorityscore=0 malwarescore=0 adultscore=0 phishscore=0 bulkscore=0 priorityscore=1501 mlxlogscore=999 clxscore=1015 mlxscore=0 spamscore=0 impostorscore=0 suspectscore=25 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1911170063 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit (32k). Due to using 32-bit counter, it's possible in practice to overflow refcounter and make it wrap around to 0, causing erroneous map free, while there are still references to it, causing use-after-free problems. But having a failing refcounting operations are problematic in some cases. One example is mmap() interface. After establishing initial memory-mapping, user is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily splitting it into multiple non-contiguous regions. All this happening without any control from the users of mmap subsystem. Rather mmap subsystem sends notifications to original creator of memory mapping through open/close callbacks, which are optionally specified during initial memory mapping creation. These callbacks are used to maintain accurate refcount for bpf_map (see next patch in this series). The problem is that open() callback is not supposed to fail, because memory-mapped resource is set up and properly referenced. This is posing a problem for using memory-mapping with BPF maps. One solution to this is to maintain separate refcount for just memory-mappings and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively. There are similar use cases in current work on tcp-bpf, necessitating extra counter as well. This seems like a rather unfortunate and ugly solution that doesn't scale well to various new use cases. Another approach to solve this is to use non-failing refcount_t type, which uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX, stays there. This utlimately causes memory leak, but prevents use after free. But given refcounting is not the most performance-critical operation with BPF maps (it's not used from running BPF program code), we can also just switch to 64-bit counter that can't overflow in practice, potentially disadvantaging 32-bit platforms a tiny bit. This simplifies semantics and allows above described scenarios to not worry about failing refcount increment operation. In terms of struct bpf_map size, we are still good and use the same amount of space: BEFORE (3 cache lines, 8 bytes of padding at the end): struct bpf_map { const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */ struct bpf_map * inner_map_meta; /* 8 8 */ void * security; /* 16 8 */ enum bpf_map_type map_type; /* 24 4 */ u32 key_size; /* 28 4 */ u32 value_size; /* 32 4 */ u32 max_entries; /* 36 4 */ u32 map_flags; /* 40 4 */ int spin_lock_off; /* 44 4 */ u32 id; /* 48 4 */ int numa_node; /* 52 4 */ u32 btf_key_type_id; /* 56 4 */ u32 btf_value_type_id; /* 60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btf * btf; /* 64 8 */ struct bpf_map_memory memory; /* 72 16 */ bool unpriv_array; /* 88 1 */ bool frozen; /* 89 1 */ /* XXX 38 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ atomic_t refcnt __attribute__((__aligned__(64))); /* 128 4 */ atomic_t usercnt; /* 132 4 */ struct work_struct work; /* 136 32 */ char name[16]; /* 168 16 */ /* size: 192, cachelines: 3, members: 21 */ /* sum members: 146, holes: 1, sum holes: 38 */ /* padding: 8 */ /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */ } __attribute__((__aligned__(64))); AFTER (same 3 cache lines, no extra padding now): struct bpf_map { const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */ struct bpf_map * inner_map_meta; /* 8 8 */ void * security; /* 16 8 */ enum bpf_map_type map_type; /* 24 4 */ u32 key_size; /* 28 4 */ u32 value_size; /* 32 4 */ u32 max_entries; /* 36 4 */ u32 map_flags; /* 40 4 */ int spin_lock_off; /* 44 4 */ u32 id; /* 48 4 */ int numa_node; /* 52 4 */ u32 btf_key_type_id; /* 56 4 */ u32 btf_value_type_id; /* 60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btf * btf; /* 64 8 */ struct bpf_map_memory memory; /* 72 16 */ bool unpriv_array; /* 88 1 */ bool frozen; /* 89 1 */ /* XXX 38 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */ atomic64_t usercnt; /* 136 8 */ struct work_struct work; /* 144 32 */ char name[16]; /* 176 16 */ /* size: 192, cachelines: 3, members: 21 */ /* sum members: 154, holes: 1, sum holes: 38 */ /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */ } __attribute__((__aligned__(64))); This patch, while modifying all users of bpf_map_inc, also cleans up its interface to match bpf_map_put with separate operations for bpf_map_inc and bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref, respectively). Also, given there are no users of bpf_map_inc_not_zero specifying uref=true, remove uref flag and default to uref=false internally. Acked-by: Song Liu Signed-off-by: Andrii Nakryiko --- .../net/ethernet/netronome/nfp/bpf/offload.c | 4 +- include/linux/bpf.h | 10 ++-- kernel/bpf/inode.c | 2 +- kernel/bpf/map_in_map.c | 2 +- kernel/bpf/syscall.c | 51 ++++++++----------- kernel/bpf/verifier.c | 6 +-- kernel/bpf/xskmap.c | 6 +-- net/core/bpf_sk_storage.c | 2 +- 8 files changed, 34 insertions(+), 49 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 88fab6a82acf..06927ba5a3ae 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -46,9 +46,7 @@ nfp_map_ptr_record(struct nfp_app_bpf *bpf, struct nfp_prog *nfp_prog, /* Grab a single ref to the map for our record. The prog destroy ndo * happens after free_used_maps(). */ - map = bpf_map_inc(map, false); - if (IS_ERR(map)) - return PTR_ERR(map); + bpf_map_inc(map); record = kmalloc(sizeof(*record), GFP_KERNEL); if (!record) { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5b81cde47314..34a34445c009 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -103,8 +103,8 @@ struct bpf_map { /* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. */ - atomic_t refcnt ____cacheline_aligned; - atomic_t usercnt; + atomic64_t refcnt ____cacheline_aligned; + atomic64_t usercnt; struct work_struct work; char name[BPF_OBJ_NAME_LEN]; }; @@ -783,9 +783,9 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock); struct bpf_map *bpf_map_get_with_uref(u32 ufd); struct bpf_map *__bpf_map_get(struct fd f); -struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref); -struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map, - bool uref); +void bpf_map_inc(struct bpf_map *map); +void bpf_map_inc_with_uref(struct bpf_map *map); +struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map); void bpf_map_put_with_uref(struct bpf_map *map); void bpf_map_put(struct bpf_map *map); int bpf_map_charge_memlock(struct bpf_map *map, u32 pages); diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index a70f7209cda3..2f17f24258dc 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -34,7 +34,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type) raw = bpf_prog_inc(raw); break; case BPF_TYPE_MAP: - raw = bpf_map_inc(raw, true); + bpf_map_inc_with_uref(raw); break; default: WARN_ON_ONCE(1); diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index fab4fb134547..4cbe987be35b 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -98,7 +98,7 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map, return inner_map; if (bpf_map_meta_equal(map->inner_map_meta, inner_map)) - inner_map = bpf_map_inc(inner_map, false); + bpf_map_inc(inner_map); else inner_map = ERR_PTR(-EINVAL); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c88c815c2154..20030751b7a2 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -311,7 +311,7 @@ static void bpf_map_free_deferred(struct work_struct *work) static void bpf_map_put_uref(struct bpf_map *map) { - if (atomic_dec_and_test(&map->usercnt)) { + if (atomic64_dec_and_test(&map->usercnt)) { if (map->ops->map_release_uref) map->ops->map_release_uref(map); } @@ -322,7 +322,7 @@ static void bpf_map_put_uref(struct bpf_map *map) */ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock) { - if (atomic_dec_and_test(&map->refcnt)) { + if (atomic64_dec_and_test(&map->refcnt)) { /* bpf_map_free_id() must be called first */ bpf_map_free_id(map, do_idr_lock); btf_put(map->btf); @@ -575,8 +575,8 @@ static int map_create(union bpf_attr *attr) if (err) goto free_map; - atomic_set(&map->refcnt, 1); - atomic_set(&map->usercnt, 1); + atomic64_set(&map->refcnt, 1); + atomic64_set(&map->usercnt, 1); if (attr->btf_key_type_id || attr->btf_value_type_id) { struct btf *btf; @@ -653,21 +653,19 @@ struct bpf_map *__bpf_map_get(struct fd f) return f.file->private_data; } -/* prog's and map's refcnt limit */ -#define BPF_MAX_REFCNT 32768 - -struct bpf_map *bpf_map_inc(struct bpf_map *map, bool uref) +void bpf_map_inc(struct bpf_map *map) { - if (atomic_inc_return(&map->refcnt) > BPF_MAX_REFCNT) { - atomic_dec(&map->refcnt); - return ERR_PTR(-EBUSY); - } - if (uref) - atomic_inc(&map->usercnt); - return map; + atomic64_inc(&map->refcnt); } EXPORT_SYMBOL_GPL(bpf_map_inc); +void bpf_map_inc_with_uref(struct bpf_map *map) +{ + atomic64_inc(&map->refcnt); + atomic64_inc(&map->usercnt); +} +EXPORT_SYMBOL_GPL(bpf_map_inc_with_uref); + struct bpf_map *bpf_map_get_with_uref(u32 ufd) { struct fd f = fdget(ufd); @@ -677,38 +675,30 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd) if (IS_ERR(map)) return map; - map = bpf_map_inc(map, true); + bpf_map_inc_with_uref(map); fdput(f); return map; } /* map_idr_lock should have been held */ -static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, - bool uref) +static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref) { int refold; - refold = atomic_fetch_add_unless(&map->refcnt, 1, 0); - - if (refold >= BPF_MAX_REFCNT) { - __bpf_map_put(map, false); - return ERR_PTR(-EBUSY); - } - + refold = atomic64_fetch_add_unless(&map->refcnt, 1, 0); if (!refold) return ERR_PTR(-ENOENT); - if (uref) - atomic_inc(&map->usercnt); + atomic64_inc(&map->usercnt); return map; } -struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref) +struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map) { spin_lock_bh(&map_idr_lock); - map = __bpf_map_inc_not_zero(map, uref); + map = __bpf_map_inc_not_zero(map, false); spin_unlock_bh(&map_idr_lock); return map; @@ -1455,6 +1445,9 @@ static struct bpf_prog *____bpf_prog_get(struct fd f) return f.file->private_data; } +/* prog's refcnt limit */ +#define BPF_MAX_REFCNT 32768 + struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i) { if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e9dc95a18d44..9f59f7a19dd0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8179,11 +8179,7 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) * will be used by the valid program until it's unloaded * and all maps are released in free_used_maps() */ - map = bpf_map_inc(map, false); - if (IS_ERR(map)) { - fdput(f); - return PTR_ERR(map); - } + bpf_map_inc(map); aux->map_index = env->used_map_cnt; env->used_maps[env->used_map_cnt++] = map; diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c index da16c30868f3..90c4fce1c981 100644 --- a/kernel/bpf/xskmap.c +++ b/kernel/bpf/xskmap.c @@ -11,10 +11,8 @@ int xsk_map_inc(struct xsk_map *map) { - struct bpf_map *m = &map->map; - - m = bpf_map_inc(m, false); - return PTR_ERR_OR_ZERO(m); + bpf_map_inc(&map->map); + return 0; } void xsk_map_put(struct xsk_map *map) diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c index da5639a5bd3b..458be6b3eda9 100644 --- a/net/core/bpf_sk_storage.c +++ b/net/core/bpf_sk_storage.c @@ -798,7 +798,7 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) * Try to grab map refcnt to make sure that it's still * alive and prevent concurrent removal. */ - map = bpf_map_inc_not_zero(&smap->map, false); + map = bpf_map_inc_not_zero(&smap->map); if (IS_ERR(map)) continue; From patchwork Sun Nov 17 06:50:32 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 1196306 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="rr9w0ZLZ"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47G2m75ljqz9s7T for ; Sun, 17 Nov 2019 17:50:47 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725867AbfKQGuq (ORCPT ); Sun, 17 Nov 2019 01:50:46 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:63860 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725880AbfKQGuq (ORCPT ); Sun, 17 Nov 2019 01:50:46 -0500 Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xAH6mpcY011015 for ; Sat, 16 Nov 2019 22:50:44 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=lOYiIw2L00Jmt7jzpWQlfvJwmchRN5NMUOO5xe6Lo8M=; b=rr9w0ZLZqLU7QiUisffMzDjIyK1iHWlecvGwNT0VfwzXVjP3qvzFuZMnTBNMtd2MjtYX XV0LEMXJt/F5w0ir7jxyycNr1Z/SY/3+HHnqwNr2+5Nk5CfD/uiAatvfE2xG6JNOabuI HuF9gN89RfB4AHYhb/SlgAtxb2WQYZI4IY4= Received: from mail.thefacebook.com (mailout.thefacebook.com [199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2wah9uyrmx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Sat, 16 Nov 2019 22:50:44 -0800 Received: from 2401:db00:12:9028:face:0:29:0 (2620:10d:c081:10::13) by mail.thefacebook.com (2620:10d:c081:35::129) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id 15.1.1713.5; Sat, 16 Nov 2019 22:50:43 -0800 Received: by devbig012.ftw2.facebook.com (Postfix, from userid 137359) id CD5012EC18AA; Sat, 16 Nov 2019 22:50:42 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: , , , CC: , , Andrii Nakryiko Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH v5 bpf-next 2/5] bpf: convert bpf_prog refcnt to atomic64_t Date: Sat, 16 Nov 2019 22:50:32 -0800 Message-ID: <20191117065035.202462-3-andriin@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191117065035.202462-1-andriin@fb.com> References: <20191117065035.202462-1-andriin@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-11-17_01:2019-11-15, 2019-11-16 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 impostorscore=0 mlxlogscore=999 adultscore=0 bulkscore=0 mlxscore=0 lowpriorityscore=0 clxscore=1015 suspectscore=25 phishscore=0 spamscore=0 priorityscore=1501 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1911170063 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64 and remove artificial 32k limit. This allows to make bpf_prog's refcounting non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc. Validated compilation by running allyesconfig kernel build. Suggested-by: Daniel Borkmann Signed-off-by: Andrii Nakryiko --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 ++---- .../net/ethernet/cavium/thunder/nicvf_main.c | 9 ++---- .../net/ethernet/freescale/dpaa2/dpaa2-eth.c | 7 ++--- .../net/ethernet/mellanox/mlx4/en_netdev.c | 24 ++++----------- .../net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++++------- drivers/net/ethernet/qlogic/qede/qede_main.c | 8 ++--- drivers/net/virtio_net.c | 7 ++--- include/linux/bpf.h | 13 ++++---- kernel/bpf/inode.c | 5 ++-- kernel/bpf/syscall.c | 30 ++++++------------- kernel/events/core.c | 7 ++--- 11 files changed, 40 insertions(+), 97 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index c07172429c70..9da4fbee3cf7 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -3171,13 +3171,8 @@ static int bnxt_init_one_rx_ring(struct bnxt *bp, int ring_nr) bnxt_init_rxbd_pages(ring, type); if (BNXT_RX_PAGE_MODE(bp) && bp->xdp_prog) { - rxr->xdp_prog = bpf_prog_add(bp->xdp_prog, 1); - if (IS_ERR(rxr->xdp_prog)) { - int rc = PTR_ERR(rxr->xdp_prog); - - rxr->xdp_prog = NULL; - return rc; - } + bpf_prog_add(bp->xdp_prog, 1); + rxr->xdp_prog = bp->xdp_prog; } prod = rxr->rx_prod; for (i = 0; i < bp->rx_ring_size; i++) { diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 40a44dcb3d9b..f28409279ea4 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -1876,13 +1876,8 @@ static int nicvf_xdp_setup(struct nicvf *nic, struct bpf_prog *prog) if (nic->xdp_prog) { /* Attach BPF program */ - nic->xdp_prog = bpf_prog_add(nic->xdp_prog, nic->rx_queues - 1); - if (!IS_ERR(nic->xdp_prog)) { - bpf_attached = true; - } else { - ret = PTR_ERR(nic->xdp_prog); - nic->xdp_prog = NULL; - } + bpf_prog_add(nic->xdp_prog, nic->rx_queues - 1); + bpf_attached = true; } /* Calculate Tx queues needed for XDP and network stack */ diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index c26c0a7cbb6b..acc56606d3a5 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -1807,11 +1807,8 @@ static int setup_xdp(struct net_device *dev, struct bpf_prog *prog) if (prog && !xdp_mtu_valid(priv, dev->mtu)) return -EINVAL; - if (prog) { - prog = bpf_prog_add(prog, priv->num_channels); - if (IS_ERR(prog)) - return PTR_ERR(prog); - } + if (prog) + bpf_prog_add(prog, priv->num_channels); up = netif_running(dev); need_update = (!!priv->xdp_prog != !!prog); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 40ec5acf79c0..d4697beeacc2 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2286,11 +2286,7 @@ int mlx4_en_try_alloc_resources(struct mlx4_en_priv *priv, lockdep_is_held(&priv->mdev->state_lock)); if (xdp_prog && carry_xdp_prog) { - xdp_prog = bpf_prog_add(xdp_prog, tmp->rx_ring_num); - if (IS_ERR(xdp_prog)) { - mlx4_en_free_resources(tmp); - return PTR_ERR(xdp_prog); - } + bpf_prog_add(xdp_prog, tmp->rx_ring_num); for (i = 0; i < tmp->rx_ring_num; i++) rcu_assign_pointer(tmp->rx_ring[i]->xdp_prog, xdp_prog); @@ -2782,11 +2778,9 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) * program for a new one. */ if (priv->tx_ring_num[TX_XDP] == xdp_ring_num) { - if (prog) { - prog = bpf_prog_add(prog, priv->rx_ring_num - 1); - if (IS_ERR(prog)) - return PTR_ERR(prog); - } + if (prog) + bpf_prog_add(prog, priv->rx_ring_num - 1); + mutex_lock(&mdev->state_lock); for (i = 0; i < priv->rx_ring_num; i++) { old_prog = rcu_dereference_protected( @@ -2807,13 +2801,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) if (!tmp) return -ENOMEM; - if (prog) { - prog = bpf_prog_add(prog, priv->rx_ring_num - 1); - if (IS_ERR(prog)) { - err = PTR_ERR(prog); - goto out; - } - } + if (prog) + bpf_prog_add(prog, priv->rx_ring_num - 1); mutex_lock(&mdev->state_lock); memcpy(&new_prof, priv->prof, sizeof(struct mlx4_en_port_profile)); @@ -2862,7 +2851,6 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) unlock_out: mutex_unlock(&mdev->state_lock); -out: kfree(tmp); return err; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 772bfdbdeb9c..1d4a66fb466a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -408,12 +408,9 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->stats = &c->priv->channel_stats[c->ix].rq; INIT_WORK(&rq->recover_work, mlx5e_rq_err_cqe_work); - rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL; - if (IS_ERR(rq->xdp_prog)) { - err = PTR_ERR(rq->xdp_prog); - rq->xdp_prog = NULL; - goto err_rq_wq_destroy; - } + if (params->xdp_prog) + bpf_prog_inc(params->xdp_prog); + rq->xdp_prog = params->xdp_prog; rq_xdp_ix = rq->ix; if (xsk) @@ -4406,16 +4403,11 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog) /* no need for full reset when exchanging programs */ reset = (!priv->channels.params.xdp_prog || !prog); - if (was_opened && !reset) { + if (was_opened && !reset) /* num_channels is invariant here, so we can take the * batched reference right upfront. */ - prog = bpf_prog_add(prog, priv->channels.num); - if (IS_ERR(prog)) { - err = PTR_ERR(prog); - goto unlock; - } - } + bpf_prog_add(prog, priv->channels.num); if (was_opened && reset) { struct mlx5e_channels new_channels = {}; diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c index 8d1c208f778f..1e26964fe4e9 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_main.c +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c @@ -2107,12 +2107,8 @@ static int qede_start_queues(struct qede_dev *edev, bool clear_stats) if (rc) goto out; - fp->rxq->xdp_prog = bpf_prog_add(edev->xdp_prog, 1); - if (IS_ERR(fp->rxq->xdp_prog)) { - rc = PTR_ERR(fp->rxq->xdp_prog); - fp->rxq->xdp_prog = NULL; - goto out; - } + bpf_prog_add(edev->xdp_prog, 1); + fp->rxq->xdp_prog = edev->xdp_prog; } if (fp->type & QEDE_FASTPATH_TX) { diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 5a635f028bdc..4d7d5434cc5d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2445,11 +2445,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog, if (!prog && !old_prog) return 0; - if (prog) { - prog = bpf_prog_add(prog, vi->max_queue_pairs - 1); - if (IS_ERR(prog)) - return PTR_ERR(prog); - } + if (prog) + bpf_prog_add(prog, vi->max_queue_pairs - 1); /* Make sure NAPI is not using any XDP TX queues for RX. */ if (netif_running(dev)) { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 34a34445c009..fb606dc61a3a 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -485,7 +485,7 @@ struct bpf_func_info_aux { }; struct bpf_prog_aux { - atomic_t refcnt; + atomic64_t refcnt; u32 used_map_cnt; u32 max_ctx_offset; u32 max_pkt_offset; @@ -770,9 +770,9 @@ extern const struct bpf_verifier_ops xdp_analyzer_ops; struct bpf_prog *bpf_prog_get(u32 ufd); struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type, bool attach_drv); -struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i); +void bpf_prog_add(struct bpf_prog *prog, int i); void bpf_prog_sub(struct bpf_prog *prog, int i); -struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog); +void bpf_prog_inc(struct bpf_prog *prog); struct bpf_prog * __must_check bpf_prog_inc_not_zero(struct bpf_prog *prog); void bpf_prog_put(struct bpf_prog *prog); int __bpf_prog_charge(struct user_struct *user, u32 pages); @@ -912,10 +912,8 @@ static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, return ERR_PTR(-EOPNOTSUPP); } -static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, - int i) +static inline void bpf_prog_add(struct bpf_prog *prog, int i) { - return ERR_PTR(-EOPNOTSUPP); } static inline void bpf_prog_sub(struct bpf_prog *prog, int i) @@ -926,9 +924,8 @@ static inline void bpf_prog_put(struct bpf_prog *prog) { } -static inline struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog) +static inline void bpf_prog_inc(struct bpf_prog *prog) { - return ERR_PTR(-EOPNOTSUPP); } static inline struct bpf_prog *__must_check diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2f17f24258dc..ecf42bec38c0 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -31,7 +31,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type) { switch (type) { case BPF_TYPE_PROG: - raw = bpf_prog_inc(raw); + bpf_prog_inc(raw); break; case BPF_TYPE_MAP: bpf_map_inc_with_uref(raw); @@ -534,7 +534,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type if (!bpf_prog_get_ok(prog, &type, false)) return ERR_PTR(-EINVAL); - return bpf_prog_inc(prog); + bpf_prog_inc(prog); + return prog; } struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 20030751b7a2..52fe4bacb330 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1339,7 +1339,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred) static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) { - if (atomic_dec_and_test(&prog->aux->refcnt)) { + if (atomic64_dec_and_test(&prog->aux->refcnt)) { perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0); /* bpf_prog_free_id() must be called first */ bpf_prog_free_id(prog, do_idr_lock); @@ -1445,16 +1445,9 @@ static struct bpf_prog *____bpf_prog_get(struct fd f) return f.file->private_data; } -/* prog's refcnt limit */ -#define BPF_MAX_REFCNT 32768 - -struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i) +void bpf_prog_add(struct bpf_prog *prog, int i) { - if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) { - atomic_sub(i, &prog->aux->refcnt); - return ERR_PTR(-EBUSY); - } - return prog; + atomic64_add(i, &prog->aux->refcnt); } EXPORT_SYMBOL_GPL(bpf_prog_add); @@ -1465,13 +1458,13 @@ void bpf_prog_sub(struct bpf_prog *prog, int i) * path holds a reference to the program, thus atomic_sub() can * be safely used in such cases! */ - WARN_ON(atomic_sub_return(i, &prog->aux->refcnt) == 0); + WARN_ON(atomic64_sub_return(i, &prog->aux->refcnt) == 0); } EXPORT_SYMBOL_GPL(bpf_prog_sub); -struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog) +void bpf_prog_inc(struct bpf_prog *prog) { - return bpf_prog_add(prog, 1); + atomic64_inc(&prog->aux->refcnt); } EXPORT_SYMBOL_GPL(bpf_prog_inc); @@ -1480,12 +1473,7 @@ struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog *prog) { int refold; - refold = atomic_fetch_add_unless(&prog->aux->refcnt, 1, 0); - - if (refold >= BPF_MAX_REFCNT) { - __bpf_prog_put(prog, false); - return ERR_PTR(-EBUSY); - } + refold = atomic64_fetch_add_unless(&prog->aux->refcnt, 1, 0); if (!refold) return ERR_PTR(-ENOENT); @@ -1523,7 +1511,7 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type, goto out; } - prog = bpf_prog_inc(prog); + bpf_prog_inc(prog); out: fdput(f); return prog; @@ -1714,7 +1702,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) prog->orig_prog = NULL; prog->jited = 0; - atomic_set(&prog->aux->refcnt, 1); + atomic64_set(&prog->aux->refcnt, 1); prog->gpl_compatible = is_gpl ? 1 : 0; if (bpf_prog_is_dev_bound(prog->aux)) { diff --git a/kernel/events/core.c b/kernel/events/core.c index aec8dba2bea4..73c616876597 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10477,12 +10477,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, context = parent_event->overflow_handler_context; #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING) if (overflow_handler == bpf_overflow_handler) { - struct bpf_prog *prog = bpf_prog_inc(parent_event->prog); + struct bpf_prog *prog = parent_event->prog; - if (IS_ERR(prog)) { - err = PTR_ERR(prog); - goto err_ns; - } + bpf_prog_inc(prog); event->prog = prog; event->orig_overflow_handler = parent_event->orig_overflow_handler; From patchwork Sun Nov 17 06:50:33 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 1196309 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="GZkx6X4L"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47G2mJ37C0z9sP6 for ; Sun, 17 Nov 2019 17:50:56 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726065AbfKQGux (ORCPT ); Sun, 17 Nov 2019 01:50:53 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:28352 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725959AbfKQGuw (ORCPT ); Sun, 17 Nov 2019 01:50:52 -0500 Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xAH6lwGH001774 for ; Sat, 16 Nov 2019 22:50:48 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=WNRhByeZiYuBpIfbSRQPKTOmS6F0BWGyByRaoHGmyMY=; b=GZkx6X4L0bKhRKcbclqgMM9uXDif4SBZtDyWEMTe/DsOO1VZ8ZTGSp6IvykR/kiqewOk 6Cz6XRXom8cTSUlzyD4qOUBfYDmM4hmAcYCTjikFsrRbbKUXkA5GBQ9e3aW/n710RXYf DKKT29C1o1pNvF3H9L/Ap6lHePGgdGs5PWc= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com with ESMTP id 2waftjseku-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Sat, 16 Nov 2019 22:50:48 -0800 Received: from 2401:db00:2120:80d4:face:0:39:0 (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::d) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Sat, 16 Nov 2019 22:50:47 -0800 Received: by devbig012.ftw2.facebook.com (Postfix, from userid 137359) id 0CAB72EC18AA; Sat, 16 Nov 2019 22:50:45 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: , , , CC: , , Andrii Nakryiko , Johannes Weiner , Rik van Riel Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH v5 bpf-next 3/5] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Date: Sat, 16 Nov 2019 22:50:33 -0800 Message-ID: <20191117065035.202462-4-andriin@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191117065035.202462-1-andriin@fb.com> References: <20191117065035.202462-1-andriin@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-11-17_01:2019-11-15, 2019-11-16 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 lowpriorityscore=0 malwarescore=0 adultscore=0 phishscore=0 bulkscore=0 priorityscore=1501 mlxlogscore=999 clxscore=1015 mlxscore=0 spamscore=0 impostorscore=0 suspectscore=25 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1911170063 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Add ability to memory-map contents of BPF array map. This is extremely useful for working with BPF global data from userspace programs. It allows to avoid typical bpf_map_{lookup,update}_elem operations, improving both performance and usability. There had to be special considerations for map freezing, to avoid having writable memory view into a frozen map. To solve this issue, map freezing and mmap-ing is happening under mutex now: - if map is already frozen, no writable mapping is allowed; - if map has writable memory mappings active (accounted in map->writecnt), map freezing will keep failing with -EBUSY; - once number of writable memory mappings drops to zero, map freezing can be performed again. Only non-per-CPU plain arrays are supported right now. Maps with spinlocks can't be memory mapped either. For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc() to be mmap()'able. We also need to make sure that array data memory is page-sized and page-aligned, so we over-allocate memory in such a way that struct bpf_array is at the end of a single page of memory with array->value being aligned with the start of the second page. On deallocation we need to accomodate this memory arrangement to free vmalloc()'ed memory correctly. One important consideration regarding how memory-mapping subsystem functions. Memory-mapping subsystem provides few optional callbacks, among them open() and close(). close() is called for each memory region that is unmapped, so that users can decrease their reference counters and free up resources, if necessary. open() is *almost* symmetrical: it's called for each memory region that is being mapped, **except** the very first one. So bpf_map_mmap does initial refcnt bump, while open() will do any extra ones after that. Thus number of close() calls is equal to number of open() calls plus one more. Cc: Johannes Weiner Cc: Rik van Riel Acked-by: Song Liu Acked-by: John Fastabend Acked-by: Johannes Weiner Signed-off-by: Andrii Nakryiko --- include/linux/bpf.h | 11 ++-- include/linux/vmalloc.h | 1 + include/uapi/linux/bpf.h | 3 ++ kernel/bpf/arraymap.c | 58 +++++++++++++++++--- kernel/bpf/syscall.c | 99 ++++++++++++++++++++++++++++++++-- mm/vmalloc.c | 20 +++++++ tools/include/uapi/linux/bpf.h | 3 ++ 7 files changed, 183 insertions(+), 12 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index fb606dc61a3a..e913dd5946ae 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -68,6 +69,7 @@ struct bpf_map_ops { u64 *imm, u32 off); int (*map_direct_value_meta)(const struct bpf_map *map, u64 imm, u32 *off); + int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma); }; struct bpf_map_memory { @@ -96,9 +98,10 @@ struct bpf_map { u32 btf_value_type_id; struct btf *btf; struct bpf_map_memory memory; + char name[BPF_OBJ_NAME_LEN]; bool unpriv_array; - bool frozen; /* write-once */ - /* 48 bytes hole */ + bool frozen; /* write-once; write-protected by freeze_mutex */ + /* 22 bytes hole */ /* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. @@ -106,7 +109,8 @@ struct bpf_map { atomic64_t refcnt ____cacheline_aligned; atomic64_t usercnt; struct work_struct work; - char name[BPF_OBJ_NAME_LEN]; + struct mutex freeze_mutex; + u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */ }; static inline bool map_value_has_spin_lock(const struct bpf_map *map) @@ -795,6 +799,7 @@ void bpf_map_charge_finish(struct bpf_map_memory *mem); void bpf_map_charge_move(struct bpf_map_memory *dst, struct bpf_map_memory *src); void *bpf_map_area_alloc(size_t size, int numa_node); +void *bpf_map_area_mmapable_alloc(size_t size, int numa_node); void bpf_map_area_free(void *base); void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr); diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 4e7809408073..b4c58a191eb1 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -93,6 +93,7 @@ extern void *vzalloc(unsigned long size); extern void *vmalloc_user(unsigned long size); extern void *vmalloc_node(unsigned long size, int node); extern void *vzalloc_node(unsigned long size, int node); +extern void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags); extern void *vmalloc_exec(unsigned long size); extern void *vmalloc_32(unsigned long size); extern void *vmalloc_32_user(unsigned long size); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 4842a134b202..dbbcf0b02970 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -348,6 +348,9 @@ enum bpf_attach_type { /* Clone map from listener for newly accepted socket */ #define BPF_F_CLONE (1U << 9) +/* Enable memory-mapping BPF map */ +#define BPF_F_MMAPABLE (1U << 10) + /* flags for BPF_PROG_QUERY */ #define BPF_F_QUERY_EFFECTIVE (1U << 0) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 1c65ce0098a9..a42097c36b0c 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -14,7 +14,7 @@ #include "map_in_map.h" #define ARRAY_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) + (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK) static void bpf_array_free_percpu(struct bpf_array *array) { @@ -59,6 +59,10 @@ int array_map_alloc_check(union bpf_attr *attr) (percpu && numa_node != NUMA_NO_NODE)) return -EINVAL; + if (attr->map_type != BPF_MAP_TYPE_ARRAY && + attr->map_flags & BPF_F_MMAPABLE) + return -EINVAL; + if (attr->value_size > KMALLOC_MAX_SIZE) /* if value_size is bigger, the user space won't be able to * access the elements. @@ -102,10 +106,19 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) } array_size = sizeof(*array); - if (percpu) + if (percpu) { array_size += (u64) max_entries * sizeof(void *); - else - array_size += (u64) max_entries * elem_size; + } else { + /* rely on vmalloc() to return page-aligned memory and + * ensure array->value is exactly page-aligned + */ + if (attr->map_flags & BPF_F_MMAPABLE) { + array_size = PAGE_ALIGN(array_size); + array_size += PAGE_ALIGN((u64) max_entries * elem_size); + } else { + array_size += (u64) max_entries * elem_size; + } + } /* make sure there is no u32 overflow later in round_up() */ cost = array_size; @@ -117,7 +130,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) return ERR_PTR(ret); /* allocate all map elements and zero-initialize them */ - array = bpf_map_area_alloc(array_size, numa_node); + if (attr->map_flags & BPF_F_MMAPABLE) { + void *data; + + /* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */ + data = bpf_map_area_mmapable_alloc(array_size, numa_node); + if (!data) { + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + array = data + PAGE_ALIGN(sizeof(struct bpf_array)) + - offsetof(struct bpf_array, value); + } else { + array = bpf_map_area_alloc(array_size, numa_node); + } if (!array) { bpf_map_charge_finish(&mem); return ERR_PTR(-ENOMEM); @@ -350,6 +376,11 @@ static int array_map_delete_elem(struct bpf_map *map, void *key) return -EINVAL; } +static void *array_map_vmalloc_addr(struct bpf_array *array) +{ + return (void *)round_down((unsigned long)array, PAGE_SIZE); +} + /* Called when map->refcnt goes to zero, either from workqueue or from syscall */ static void array_map_free(struct bpf_map *map) { @@ -365,7 +396,10 @@ static void array_map_free(struct bpf_map *map) if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) bpf_array_free_percpu(array); - bpf_map_area_free(array); + if (array->map.map_flags & BPF_F_MMAPABLE) + bpf_map_area_free(array_map_vmalloc_addr(array)); + else + bpf_map_area_free(array); } static void array_map_seq_show_elem(struct bpf_map *map, void *key, @@ -444,6 +478,17 @@ static int array_map_check_btf(const struct bpf_map *map, return 0; } +int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) +{ + struct bpf_array *array = container_of(map, struct bpf_array, map); + pgoff_t pgoff = PAGE_ALIGN(sizeof(*array)) >> PAGE_SHIFT; + + if (!(map->map_flags & BPF_F_MMAPABLE)) + return -EINVAL; + + return remap_vmalloc_range(vma, array_map_vmalloc_addr(array), pgoff); +} + const struct bpf_map_ops array_map_ops = { .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, @@ -455,6 +500,7 @@ const struct bpf_map_ops array_map_ops = { .map_gen_lookup = array_map_gen_lookup, .map_direct_value_addr = array_map_direct_value_addr, .map_direct_value_meta = array_map_direct_value_meta, + .map_mmap = array_map_mmap, .map_seq_show_elem = array_map_seq_show_elem, .map_check_btf = array_map_check_btf, }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 52fe4bacb330..79ae97cf18fc 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -127,7 +127,7 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) return map; } -void *bpf_map_area_alloc(size_t size, int numa_node) +static void *__bpf_map_area_alloc(size_t size, int numa_node, bool mmapable) { /* We really just want to fail instead of triggering OOM killer * under memory pressure, therefore we set __GFP_NORETRY to kmalloc, @@ -142,18 +142,33 @@ void *bpf_map_area_alloc(size_t size, int numa_node) const gfp_t flags = __GFP_NOWARN | __GFP_ZERO; void *area; - if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { + /* kmalloc()'ed memory can't be mmap()'ed */ + if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { area = kmalloc_node(size, GFP_USER | __GFP_NORETRY | flags, numa_node); if (area != NULL) return area; } - + if (mmapable) { + BUG_ON(!PAGE_ALIGNED(size)); + return vmalloc_user_node_flags(size, numa_node, GFP_KERNEL | + __GFP_RETRY_MAYFAIL | flags); + } return __vmalloc_node_flags_caller(size, numa_node, GFP_KERNEL | __GFP_RETRY_MAYFAIL | flags, __builtin_return_address(0)); } +void *bpf_map_area_alloc(size_t size, int numa_node) +{ + return __bpf_map_area_alloc(size, numa_node, false); +} + +void *bpf_map_area_mmapable_alloc(size_t size, int numa_node) +{ + return __bpf_map_area_alloc(size, numa_node, true); +} + void bpf_map_area_free(void *area) { kvfree(area); @@ -425,6 +440,74 @@ static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf, return -EINVAL; } +/* called for any extra memory-mapped regions (except initial) */ +static void bpf_map_mmap_open(struct vm_area_struct *vma) +{ + struct bpf_map *map = vma->vm_file->private_data; + + bpf_map_inc(map); + + if (vma->vm_flags & VM_WRITE) { + mutex_lock(&map->freeze_mutex); + map->writecnt++; + mutex_unlock(&map->freeze_mutex); + } +} + +/* called for all unmapped memory region (including initial) */ +static void bpf_map_mmap_close(struct vm_area_struct *vma) +{ + struct bpf_map *map = vma->vm_file->private_data; + + if (vma->vm_flags & VM_WRITE) { + mutex_lock(&map->freeze_mutex); + map->writecnt--; + mutex_unlock(&map->freeze_mutex); + } + + bpf_map_put(map); +} + +static const struct vm_operations_struct bpf_map_default_vmops = { + .open = bpf_map_mmap_open, + .close = bpf_map_mmap_close, +}; + +static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct bpf_map *map = filp->private_data; + int err; + + if (!map->ops->map_mmap || map_value_has_spin_lock(map)) + return -ENOTSUPP; + + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + + mutex_lock(&map->freeze_mutex); + + if ((vma->vm_flags & VM_WRITE) && map->frozen) { + err = -EPERM; + goto out; + } + + /* set default open/close callbacks */ + vma->vm_ops = &bpf_map_default_vmops; + vma->vm_private_data = map; + + err = map->ops->map_mmap(map, vma); + if (err) + goto out; + + bpf_map_inc(map); + + if (vma->vm_flags & VM_WRITE) + map->writecnt++; +out: + mutex_unlock(&map->freeze_mutex); + return err; +} + const struct file_operations bpf_map_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = bpf_map_show_fdinfo, @@ -432,6 +515,7 @@ const struct file_operations bpf_map_fops = { .release = bpf_map_release, .read = bpf_dummy_read, .write = bpf_dummy_write, + .mmap = bpf_map_mmap, }; int bpf_map_new_fd(struct bpf_map *map, int flags) @@ -577,6 +661,7 @@ static int map_create(union bpf_attr *attr) atomic64_set(&map->refcnt, 1); atomic64_set(&map->usercnt, 1); + mutex_init(&map->freeze_mutex); if (attr->btf_key_type_id || attr->btf_value_type_id) { struct btf *btf; @@ -1163,6 +1248,13 @@ static int map_freeze(const union bpf_attr *attr) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); + + mutex_lock(&map->freeze_mutex); + + if (map->writecnt) { + err = -EBUSY; + goto err_put; + } if (READ_ONCE(map->frozen)) { err = -EBUSY; goto err_put; @@ -1174,6 +1266,7 @@ static int map_freeze(const union bpf_attr *attr) WRITE_ONCE(map->frozen, true); err_put: + mutex_unlock(&map->freeze_mutex); fdput(f); return err; } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a3c70e275f4e..4a7d7459c4f9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2671,6 +2671,26 @@ void *vzalloc_node(unsigned long size, int node) } EXPORT_SYMBOL(vzalloc_node); +/** + * vmalloc_user_node_flags - allocate memory for userspace on a specific node + * @size: allocation size + * @node: numa node + * @flags: flags for the page level allocator + * + * The resulting memory area is zeroed so it can be mapped to userspace + * without leaking data. + * + * Return: pointer to the allocated memory or %NULL on error + */ +void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags) +{ + return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END, + flags | __GFP_ZERO, PAGE_KERNEL, + VM_USERMAP, node, + __builtin_return_address(0)); +} +EXPORT_SYMBOL(vmalloc_user_node_flags); + /** * vmalloc_exec - allocate virtually contiguous, executable memory * @size: allocation size diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4842a134b202..dbbcf0b02970 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -348,6 +348,9 @@ enum bpf_attach_type { /* Clone map from listener for newly accepted socket */ #define BPF_F_CLONE (1U << 9) +/* Enable memory-mapping BPF map */ +#define BPF_F_MMAPABLE (1U << 10) + /* flags for BPF_PROG_QUERY */ #define BPF_F_QUERY_EFFECTIVE (1U << 0) From patchwork Sun Nov 17 06:50:34 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 1196308 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="f/Btx+wj"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47G2mH6cp6z9s7T for ; Sun, 17 Nov 2019 17:50:55 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726032AbfKQGuw (ORCPT ); Sun, 17 Nov 2019 01:50:52 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:63902 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726043AbfKQGuv (ORCPT ); Sun, 17 Nov 2019 01:50:51 -0500 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.42/8.16.0.42) with SMTP id xAH6nVq8020641 for ; Sat, 16 Nov 2019 22:50:50 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=9cDQKABTP46VQhYMyTMEld1LWTgfKVYS/6ldiraza8g=; b=f/Btx+wj+7OEKJTdMBIO+1xa6ZX6ZAdFbyZ8HtDAJBTDdR+x7g6b4iJ69WiFTh2ajWmU xeX2aFaQNtR5Wo9jqJvWnZ0HFVpkcA7aVdP1+nlIs7d156tk+YU4hyBqIRWikm5lq6uU ROTwye7cCUjxLSebR3VHLXlEfrPuplH9QMA= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net with ESMTP id 2wadg0u0g0-4 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Sat, 16 Nov 2019 22:50:50 -0800 Received: from 2401:db00:2120:80d4:face:0:39:0 (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::e) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Sat, 16 Nov 2019 22:50:48 -0800 Received: by devbig012.ftw2.facebook.com (Postfix, from userid 137359) id 422592EC18AA; Sat, 16 Nov 2019 22:50:47 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: , , , CC: , , Andrii Nakryiko Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH v5 bpf-next 4/5] libbpf: make global data internal arrays mmap()-able, if possible Date: Sat, 16 Nov 2019 22:50:34 -0800 Message-ID: <20191117065035.202462-5-andriin@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191117065035.202462-1-andriin@fb.com> References: <20191117065035.202462-1-andriin@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-11-17_01:2019-11-15, 2019-11-16 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 bulkscore=0 mlxlogscore=999 mlxscore=0 suspectscore=8 malwarescore=0 impostorscore=0 adultscore=0 priorityscore=1501 clxscore=1015 phishscore=0 spamscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1911170064 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Add detection of BPF_F_MMAPABLE flag support for arrays and add it as an extra flag to internal global data maps, if supported by kernel. This allows users to memory-map global data and use it without BPF map operations, greatly simplifying user experience. Acked-by: Song Liu Acked-by: John Fastabend Signed-off-by: Andrii Nakryiko --- tools/lib/bpf/libbpf.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 7132c6bdec02..15e91a1d6c11 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -142,6 +142,8 @@ struct bpf_capabilities { __u32 btf_func:1; /* BTF_KIND_VAR and BTF_KIND_DATASEC support */ __u32 btf_datasec:1; + /* BPF_F_MMAPABLE is supported for arrays */ + __u32 array_mmap:1; }; /* @@ -857,8 +859,6 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type, pr_warn("failed to alloc map name\n"); return -ENOMEM; } - pr_debug("map '%s' (global data): at sec_idx %d, offset %zu.\n", - map_name, map->sec_idx, map->sec_offset); def = &map->def; def->type = BPF_MAP_TYPE_ARRAY; @@ -866,6 +866,12 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type, def->value_size = data->d_size; def->max_entries = 1; def->map_flags = type == LIBBPF_MAP_RODATA ? BPF_F_RDONLY_PROG : 0; + if (obj->caps.array_mmap) + def->map_flags |= BPF_F_MMAPABLE; + + pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n", + map_name, map->sec_idx, map->sec_offset, def->map_flags); + if (data_buff) { *data_buff = malloc(data->d_size); if (!*data_buff) { @@ -2161,6 +2167,27 @@ static int bpf_object__probe_btf_datasec(struct bpf_object *obj) return 0; } +static int bpf_object__probe_array_mmap(struct bpf_object *obj) +{ + struct bpf_create_map_attr attr = { + .map_type = BPF_MAP_TYPE_ARRAY, + .map_flags = BPF_F_MMAPABLE, + .key_size = sizeof(int), + .value_size = sizeof(int), + .max_entries = 1, + }; + int fd; + + fd = bpf_create_map_xattr(&attr); + if (fd >= 0) { + obj->caps.array_mmap = 1; + close(fd); + return 1; + } + + return 0; +} + static int bpf_object__probe_caps(struct bpf_object *obj) { @@ -2169,6 +2196,7 @@ bpf_object__probe_caps(struct bpf_object *obj) bpf_object__probe_global_data, bpf_object__probe_btf_func, bpf_object__probe_btf_datasec, + bpf_object__probe_array_mmap, }; int i, ret; From patchwork Sun Nov 17 06:50:35 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 1196310 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=fb.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=fb.com header.i=@fb.com header.b="OEOfFhw/"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 47G2mK1d3Vz9s7T for ; Sun, 17 Nov 2019 17:50:57 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726037AbfKQGuy (ORCPT ); Sun, 17 Nov 2019 01:50:54 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:35340 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726043AbfKQGuy (ORCPT ); Sun, 17 Nov 2019 01:50:54 -0500 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id xAH6oahx006340 for ; Sat, 16 Nov 2019 22:50:52 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=sVaT+OfRUIT8nvCiT7mvqQsEr02Sb6mu9pDM1VXI5gE=; b=OEOfFhw/8+ucdD6CrW5+eP7zdIyo6dsHZu/KJPFW+u7z6gCYM3SIOhT6HmHKa5fY1kn0 OJ8+N6gc/2uU8hMyyHJhcK3fY4Nk698Vn6ZcxHYo7SMIWTn3bDqMxDWC45q1XLgiqYAy +BTgzTnjtjBwe3N6TMXVkBYmtVDgcpqOHPA= Received: from mail.thefacebook.com (mailout.thefacebook.com [199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 2wafpe0j0h-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Sat, 16 Nov 2019 22:50:52 -0800 Received: from 2401:db00:2050:5076:face:0:7:0 (2620:10d:c081:10::13) by mail.thefacebook.com (2620:10d:c081:35::126) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA) id 15.1.1713.5; Sat, 16 Nov 2019 22:50:51 -0800 Received: by devbig012.ftw2.facebook.com (Postfix, from userid 137359) id 7538A2EC18AA; Sat, 16 Nov 2019 22:50:49 -0800 (PST) Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: , , , CC: , , Andrii Nakryiko Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH v5 bpf-next 5/5] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Date: Sat, 16 Nov 2019 22:50:35 -0800 Message-ID: <20191117065035.202462-6-andriin@fb.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20191117065035.202462-1-andriin@fb.com> References: <20191117065035.202462-1-andriin@fb.com> X-FB-Internal: Safe MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.95, 18.0.572 definitions=2019-11-17_01:2019-11-15, 2019-11-16 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 bulkscore=0 mlxscore=0 spamscore=0 suspectscore=67 priorityscore=1501 phishscore=0 adultscore=0 clxscore=1015 impostorscore=0 mlxlogscore=808 malwarescore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-1910280000 definitions=main-1911170064 X-FB-Internal: deliver Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Add selftests validating mmap()-ing BPF array maps: both single-element and multi-element ones. Check that plain bpf_map_update_elem() and bpf_map_lookup_elem() work correctly with memory-mapped array. Also convert CO-RE relocation tests to use memory-mapped views of global data. Acked-by: Song Liu Signed-off-by: Andrii Nakryiko --- .../selftests/bpf/prog_tests/core_reloc.c | 45 ++-- tools/testing/selftests/bpf/prog_tests/mmap.c | 220 ++++++++++++++++++ tools/testing/selftests/bpf/progs/test_mmap.c | 45 ++++ 3 files changed, 292 insertions(+), 18 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/mmap.c create mode 100644 tools/testing/selftests/bpf/progs/test_mmap.c diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c index f94bd071536b..ec9e2fdd6b89 100644 --- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c +++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include #include "progs/core_reloc_types.h" +#include #define STRUCT_TO_CHAR_PTR(struct_name) (const char *)&(struct struct_name) @@ -453,8 +454,15 @@ struct data { char out[256]; }; +static size_t roundup_page(size_t sz) +{ + long page_size = sysconf(_SC_PAGE_SIZE); + return (sz + page_size - 1) / page_size * page_size; +} + void test_core_reloc(void) { + const size_t mmap_sz = roundup_page(sizeof(struct data)); struct bpf_object_load_attr load_attr = {}; struct core_reloc_test_case *test_case; const char *tp_name, *probe_name; @@ -463,8 +471,8 @@ void test_core_reloc(void) struct bpf_map *data_map; struct bpf_program *prog; struct bpf_object *obj; - const int zero = 0; - struct data data; + struct data *data; + void *mmap_data = NULL; for (i = 0; i < ARRAY_SIZE(test_cases); i++) { test_case = &test_cases[i]; @@ -476,8 +484,7 @@ void test_core_reloc(void) ); obj = bpf_object__open_file(test_case->bpf_obj_file, &opts); - if (CHECK(IS_ERR_OR_NULL(obj), "obj_open", - "failed to open '%s': %ld\n", + if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n", test_case->bpf_obj_file, PTR_ERR(obj))) continue; @@ -519,24 +526,22 @@ void test_core_reloc(void) if (CHECK(!data_map, "find_data_map", "data map not found\n")) goto cleanup; - memset(&data, 0, sizeof(data)); - memcpy(data.in, test_case->input, test_case->input_len); - - err = bpf_map_update_elem(bpf_map__fd(data_map), - &zero, &data, 0); - if (CHECK(err, "update_data_map", - "failed to update .data map: %d\n", err)) + mmap_data = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE, + MAP_SHARED, bpf_map__fd(data_map), 0); + if (CHECK(mmap_data == MAP_FAILED, "mmap", + ".bss mmap failed: %d", errno)) { + mmap_data = NULL; goto cleanup; + } + data = mmap_data; + + memset(mmap_data, 0, sizeof(*data)); + memcpy(data->in, test_case->input, test_case->input_len); /* trigger test run */ usleep(1); - err = bpf_map_lookup_elem(bpf_map__fd(data_map), &zero, &data); - if (CHECK(err, "get_result", - "failed to get output data: %d\n", err)) - goto cleanup; - - equal = memcmp(data.out, test_case->output, + equal = memcmp(data->out, test_case->output, test_case->output_len) == 0; if (CHECK(!equal, "check_result", "input/output data don't match\n")) { @@ -548,12 +553,16 @@ void test_core_reloc(void) } for (j = 0; j < test_case->output_len; j++) { printf("output byte #%d: EXP 0x%02hhx GOT 0x%02hhx\n", - j, test_case->output[j], data.out[j]); + j, test_case->output[j], data->out[j]); } goto cleanup; } cleanup: + if (mmap_data) { + CHECK_FAIL(munmap(mmap_data, mmap_sz)); + mmap_data = NULL; + } if (!IS_ERR_OR_NULL(link)) { bpf_link__destroy(link); link = NULL; diff --git a/tools/testing/selftests/bpf/prog_tests/mmap.c b/tools/testing/selftests/bpf/prog_tests/mmap.c new file mode 100644 index 000000000000..051a6d48762c --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/mmap.c @@ -0,0 +1,220 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include + +struct map_data { + __u64 val[512 * 4]; +}; + +struct bss_data { + __u64 in_val; + __u64 out_val; +}; + +static size_t roundup_page(size_t sz) +{ + long page_size = sysconf(_SC_PAGE_SIZE); + return (sz + page_size - 1) / page_size * page_size; +} + +void test_mmap(void) +{ + const char *file = "test_mmap.o"; + const char *probe_name = "raw_tracepoint/sys_enter"; + const char *tp_name = "sys_enter"; + const size_t bss_sz = roundup_page(sizeof(struct bss_data)); + const size_t map_sz = roundup_page(sizeof(struct map_data)); + const int zero = 0, one = 1, two = 2, far = 1500; + const long page_size = sysconf(_SC_PAGE_SIZE); + int err, duration = 0, i, data_map_fd; + struct bpf_program *prog; + struct bpf_object *obj; + struct bpf_link *link = NULL; + struct bpf_map *data_map, *bss_map; + void *bss_mmaped = NULL, *map_mmaped = NULL, *tmp1, *tmp2; + volatile struct bss_data *bss_data; + volatile struct map_data *map_data; + __u64 val = 0; + + obj = bpf_object__open_file("test_mmap.o", NULL); + if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n", + file, PTR_ERR(obj))) + return; + prog = bpf_object__find_program_by_title(obj, probe_name); + if (CHECK(!prog, "find_probe", "prog '%s' not found\n", probe_name)) + goto cleanup; + err = bpf_object__load(obj); + if (CHECK(err, "obj_load", "failed to load prog '%s': %d\n", + probe_name, err)) + goto cleanup; + + bss_map = bpf_object__find_map_by_name(obj, "test_mma.bss"); + if (CHECK(!bss_map, "find_bss_map", ".bss map not found\n")) + goto cleanup; + data_map = bpf_object__find_map_by_name(obj, "data_map"); + if (CHECK(!data_map, "find_data_map", "data_map map not found\n")) + goto cleanup; + data_map_fd = bpf_map__fd(data_map); + + bss_mmaped = mmap(NULL, bss_sz, PROT_READ | PROT_WRITE, MAP_SHARED, + bpf_map__fd(bss_map), 0); + if (CHECK(bss_mmaped == MAP_FAILED, "bss_mmap", + ".bss mmap failed: %d\n", errno)) { + bss_mmaped = NULL; + goto cleanup; + } + /* map as R/W first */ + map_mmaped = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED, + data_map_fd, 0); + if (CHECK(map_mmaped == MAP_FAILED, "data_mmap", + "data_map mmap failed: %d\n", errno)) { + map_mmaped = NULL; + goto cleanup; + } + + bss_data = bss_mmaped; + map_data = map_mmaped; + + CHECK_FAIL(bss_data->in_val); + CHECK_FAIL(bss_data->out_val); + CHECK_FAIL(map_data->val[0]); + CHECK_FAIL(map_data->val[1]); + CHECK_FAIL(map_data->val[2]); + CHECK_FAIL(map_data->val[far]); + + link = bpf_program__attach_raw_tracepoint(prog, tp_name); + if (CHECK(IS_ERR(link), "attach_raw_tp", "err %ld\n", PTR_ERR(link))) + goto cleanup; + + bss_data->in_val = 123; + val = 111; + CHECK_FAIL(bpf_map_update_elem(data_map_fd, &zero, &val, 0)); + + usleep(1); + + CHECK_FAIL(bss_data->in_val != 123); + CHECK_FAIL(bss_data->out_val != 123); + CHECK_FAIL(map_data->val[0] != 111); + CHECK_FAIL(map_data->val[1] != 222); + CHECK_FAIL(map_data->val[2] != 123); + CHECK_FAIL(map_data->val[far] != 3 * 123); + + CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &zero, &val)); + CHECK_FAIL(val != 111); + CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &one, &val)); + CHECK_FAIL(val != 222); + CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &two, &val)); + CHECK_FAIL(val != 123); + CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &far, &val)); + CHECK_FAIL(val != 3 * 123); + + /* data_map freeze should fail due to R/W mmap() */ + err = bpf_map_freeze(data_map_fd); + if (CHECK(!err || errno != EBUSY, "no_freeze", + "data_map freeze succeeded: err=%d, errno=%d\n", err, errno)) + goto cleanup; + + /* unmap R/W mapping */ + err = munmap(map_mmaped, map_sz); + map_mmaped = NULL; + if (CHECK(err, "data_map_munmap", "data_map munmap failed: %d\n", errno)) + goto cleanup; + + /* re-map as R/O now */ + map_mmaped = mmap(NULL, map_sz, PROT_READ, MAP_SHARED, data_map_fd, 0); + if (CHECK(map_mmaped == MAP_FAILED, "data_mmap", + "data_map R/O mmap failed: %d\n", errno)) { + map_mmaped = NULL; + goto cleanup; + } + map_data = map_mmaped; + + /* map/unmap in a loop to test ref counting */ + for (i = 0; i < 10; i++) { + int flags = i % 2 ? PROT_READ : PROT_WRITE; + void *p; + + p = mmap(NULL, map_sz, flags, MAP_SHARED, data_map_fd, 0); + if (CHECK_FAIL(p == MAP_FAILED)) + goto cleanup; + err = munmap(p, map_sz); + if (CHECK_FAIL(err)) + goto cleanup; + } + + /* data_map freeze should now succeed due to no R/W mapping */ + err = bpf_map_freeze(data_map_fd); + if (CHECK(err, "freeze", "data_map freeze failed: err=%d, errno=%d\n", + err, errno)) + goto cleanup; + + /* mapping as R/W now should fail */ + tmp1 = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED, + data_map_fd, 0); + if (CHECK(tmp1 != MAP_FAILED, "data_mmap", "mmap succeeded\n")) { + munmap(tmp1, map_sz); + goto cleanup; + } + + bss_data->in_val = 321; + usleep(1); + CHECK_FAIL(bss_data->in_val != 321); + CHECK_FAIL(bss_data->out_val != 321); + CHECK_FAIL(map_data->val[0] != 111); + CHECK_FAIL(map_data->val[1] != 222); + CHECK_FAIL(map_data->val[2] != 321); + CHECK_FAIL(map_data->val[far] != 3 * 321); + + /* check some more advanced mmap() manipulations */ + + /* map all but last page: pages 1-3 mapped */ + tmp1 = mmap(NULL, 3 * page_size, PROT_READ, MAP_SHARED, + data_map_fd, 0); + if (CHECK(tmp1 == MAP_FAILED, "adv_mmap1", "errno %d\n", errno)) + goto cleanup; + + /* unmap second page: pages 1, 3 mapped */ + err = munmap(tmp1 + page_size, page_size); + if (CHECK(err, "adv_mmap2", "errno %d\n", errno)) { + munmap(tmp1, map_sz); + goto cleanup; + } + + /* map page 2 back */ + tmp2 = mmap(tmp1 + page_size, page_size, PROT_READ, + MAP_SHARED | MAP_FIXED, data_map_fd, 0); + if (CHECK(tmp2 == MAP_FAILED, "adv_mmap3", "errno %d\n", errno)) { + munmap(tmp1, page_size); + munmap(tmp1 + 2*page_size, page_size); + goto cleanup; + } + CHECK(tmp1 + page_size != tmp2, "adv_mmap4", + "tmp1: %p, tmp2: %p\n", tmp1, tmp2); + + /* re-map all 4 pages */ + tmp2 = mmap(tmp1, 4 * page_size, PROT_READ, MAP_SHARED | MAP_FIXED, + data_map_fd, 0); + if (CHECK(tmp2 == MAP_FAILED, "adv_mmap5", "errno %d\n", errno)) { + munmap(tmp1, 3 * page_size); /* unmap page 1 */ + goto cleanup; + } + CHECK(tmp1 != tmp2, "adv_mmap6", "tmp1: %p, tmp2: %p\n", tmp1, tmp2); + + map_data = tmp2; + CHECK_FAIL(bss_data->in_val != 321); + CHECK_FAIL(bss_data->out_val != 321); + CHECK_FAIL(map_data->val[0] != 111); + CHECK_FAIL(map_data->val[1] != 222); + CHECK_FAIL(map_data->val[2] != 321); + CHECK_FAIL(map_data->val[far] != 3 * 321); + + munmap(tmp2, 4 * page_size); +cleanup: + if (bss_mmaped) + CHECK_FAIL(munmap(bss_mmaped, bss_sz)); + if (map_mmaped) + CHECK_FAIL(munmap(map_mmaped, map_sz)); + if (!IS_ERR_OR_NULL(link)) + bpf_link__destroy(link); + bpf_object__close(obj); +} diff --git a/tools/testing/selftests/bpf/progs/test_mmap.c b/tools/testing/selftests/bpf/progs/test_mmap.c new file mode 100644 index 000000000000..0d2ec9fbcf61 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_mmap.c @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2019 Facebook + +#include +#include +#include "bpf_helpers.h" + +char _license[] SEC("license") = "GPL"; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 512 * 4); /* at least 4 pages of data */ + __uint(map_flags, BPF_F_MMAPABLE); + __type(key, __u32); + __type(value, __u64); +} data_map SEC(".maps"); + +static volatile __u64 in_val; +static volatile __u64 out_val; + +SEC("raw_tracepoint/sys_enter") +int test_mmap(void *ctx) +{ + int zero = 0, one = 1, two = 2, far = 1500; + __u64 val, *p; + + out_val = in_val; + + /* data_map[2] = in_val; */ + bpf_map_update_elem(&data_map, &two, (const void *)&in_val, 0); + + /* data_map[1] = data_map[0] * 2; */ + p = bpf_map_lookup_elem(&data_map, &zero); + if (p) { + val = (*p) * 2; + bpf_map_update_elem(&data_map, &one, &val, 0); + } + + /* data_map[far] = in_val * 3; */ + val = in_val * 3; + bpf_map_update_elem(&data_map, &far, &val, 0); + + return 0; +} +