From patchwork Sun Apr  1 15:00:54 2018
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 894015
X-Patchwork-Delegate: bpf@iogearbox.net
Subject: [bpf-next PATCH 1/4] bpf: sockmap, free memory on sock close with cork data
From: John Fastabend
To: ast@kernel.org, daniel@iogearbox.net
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Sun, 01 Apr 2018 08:00:54 -0700
Message-ID: <20180401150054.24727.78359.stgit@john-Precision-Tower-5810>
In-Reply-To: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
References: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
User-Agent: StGit/0.17.1-dirty

If a socket with pending cork data is closed, we do not return the
memory to the socket until the garbage collector frees the psock
structure. The garbage collector, however, can run after the sock has
completed its close operation. If this ordering happens, the sock code
will trigger a WARN_ON because there is still outstanding memory
accounted to the sock. To resolve this, ensure we return memory to the
sock when a socket is closed.

Signed-off-by: John Fastabend
Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper")
---
 kernel/bpf/sockmap.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d2bda5a..8ddf326 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -211,6 +211,12 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	close_fun = psock->save_close;
 
 	write_lock_bh(&sk->sk_callback_lock);
+	if (psock->cork) {
+		free_start_sg(psock->sock, psock->cork);
+		kfree(psock->cork);
+		psock->cork = NULL;
+	}
+
 	list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
 		list_del(&md->list);
 		free_start_sg(psock->sock, md);
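For readers skimming the archive: after this patch the close path drains
cork data with the same pattern already used for the ingress list. The
following condensed sketch is editorial (abbreviated from bpf_tcp_close()
in kernel/bpf/sockmap.c; only the function names shown in the diff above
are taken from the patch, surrounding code is elided):

	write_lock_bh(&sk->sk_callback_lock);
	/* New: if send data was corked when the socket closed, return
	 * that memory to the socket now, instead of waiting for the
	 * psock garbage collector, which may run after the sock has
	 * already finished closing and would then trip the WARN_ON. */
	if (psock->cork) {
		free_start_sg(psock->sock, psock->cork);
		kfree(psock->cork);
		psock->cork = NULL;
	}
	/* Existing: ingress buffers are drained the same way. */
	list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
		list_del(&md->list);
		free_start_sg(psock->sock, md);
		kfree(md);
	}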
From patchwork Sun Apr  1 15:00:59 2018
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 894016
X-Patchwork-Delegate: bpf@iogearbox.net
Subject: [bpf-next PATCH 2/4] bpf: sockmap, duplicates release calls may NULL sk_prot
From: John Fastabend
To: ast@kernel.org, daniel@iogearbox.net
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Sun, 01 Apr 2018 08:00:59 -0700
Message-ID: <20180401150059.24727.21318.stgit@john-Precision-Tower-5810>
In-Reply-To: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
References: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
User-Agent: StGit/0.17.1-dirty

It is possible to have multiple ULP tcp_release call paths in flight
if a sock is closed while simultaneously being removed from the
sockmap by the control path. The result would be setting sk_prot to
the saved value on the first iteration and then, on the second
iteration, setting the value to NULL. Resolve this by only resetting
the sk_prot pointer if we have a valid saved state to set.

Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8ddf326..8dd9210 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -182,8 +182,10 @@ static void bpf_tcp_release(struct sock *sk)
 		psock->cork = NULL;
 	}
 
-	sk->sk_prot = psock->sk_proto;
-	psock->sk_proto = NULL;
+	if (psock->sk_proto) {
+		sk->sk_prot = psock->sk_proto;
+		psock->sk_proto = NULL;
+	}
 out:
 	rcu_read_unlock();
 }
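The race the patch describes, shown as an editorial timeline sketch (the
two columns are hypothetical concurrent callers, not kernel thread names):

	/* Two bpf_tcp_release() paths racing on the same psock:
	 *
	 *   close path                     sockmap control path
	 *   ----------                     --------------------
	 *   sk->sk_prot = psock->sk_proto;
	 *   psock->sk_proto = NULL;
	 *                                  sk->sk_prot = psock->sk_proto;
	 *                                  (saved pointer is now NULL!)
	 *
	 * The guard makes the restore idempotent: only the path that
	 * still observes a non-NULL saved pointer performs it.
	 */
	if (psock->sk_proto) {
		sk->sk_prot = psock->sk_proto;
		psock->sk_proto = NULL;
	}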
From patchwork Sun Apr  1 15:01:04 2018
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 894017
X-Patchwork-Delegate: bpf@iogearbox.net
Subject: [bpf-next PATCH 3/4] bpf: sockmap, refactor sockmap routines to work with hashmap
From: John Fastabend
To: ast@kernel.org, daniel@iogearbox.net
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Sun, 01 Apr 2018 08:01:04 -0700
Message-ID: <20180401150104.24727.51922.stgit@john-Precision-Tower-5810>
In-Reply-To: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
References: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
User-Agent: StGit/0.17.1-dirty

This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and BPF helper code to work
for both sockmap BPF map types: the array-backed type supported today,
and the new hash-backed BPF map type, sockhash. Most of the fallout
comes from three changes:

 - Pushing BPF programs into an independent structure so we can use it
   from the htab struct in the next patch.
 - Generalizing helpers to use void *key instead of the hardcoded u32.
 - Instead of passing map/key through the metadata, we now do the
   lookup inline. This avoids storing the key in the metadata, which
   will be useful when keys can be longer than 4 bytes.

Signed-off-by: John Fastabend
---
 include/linux/filter.h |    3 -
 include/net/tcp.h      |    3 -
 kernel/bpf/sockmap.c   |  149 +++++++++++++++++++++++++++++------------------
 net/core/filter.c      |   27 +--------
 4 files changed, 95 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fc4e8f9..1d47160 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -519,9 +519,8 @@ struct sk_msg_buff {
 	int sg_end;
 	struct scatterlist sg_data[MAX_SKB_FRAGS];
 	bool sg_copy[MAX_SKB_FRAGS];
-	__u32 key;
 	__u32 flags;
-	struct bpf_map *map;
+	struct sock *sk;
 	struct sk_buff *skb;
 	struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b376..c3124d9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -810,9 +810,8 @@ struct tcp_skb_cb {
 #endif
 	} header;	/* For incoming skbs */
 	struct {
-		__u32 key;
 		__u32 flags;
-		struct bpf_map *map;
+		struct sock *sk;
 		void *data_end;
 	} bpf;
 };
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8dd9210..c006e3b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -47,14 +47,18 @@
 #define SOCK_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-	struct bpf_map map;
-	struct sock **sock_map;
+struct bpf_sock_progs {
 	struct bpf_prog *bpf_tx_msg;
 	struct bpf_prog *bpf_parse;
 	struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+	struct bpf_map map;
+	struct sock **sock_map;
+	struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -455,7 +459,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
 	return ((_rc == SK_PASS) ?
-	       (md->map ? __SK_REDIRECT : __SK_PASS) :
+	       (md->sk ? __SK_REDIRECT : __SK_PASS) :
 	       __SK_DROP);
 }
@@ -1045,7 +1049,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 	 * when we orphan the skb so that we don't have the possibility
 	 * to reference a stale map.
 	 */
-	TCP_SKB_CB(skb)->bpf.map = NULL;
+	TCP_SKB_CB(skb)->bpf.sk = NULL;
 	skb->sk = psock->sock;
 	bpf_compute_data_pointers(skb);
 	preempt_disable();
@@ -1055,7 +1059,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 
 	/* Moving return codes from UAPI namespace into internal namespace */
 	return rc == SK_PASS ?
-		(TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+		(TCP_SKB_CB(skb)->bpf.sk ? __SK_REDIRECT : __SK_PASS) :
 		__SK_DROP;
 }
@@ -1325,7 +1329,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-			    struct bpf_stab *stab,
 			    struct bpf_prog *verdict,
 			    struct bpf_prog *parse)
 {
@@ -1403,14 +1406,13 @@ static void smap_gc_work(struct work_struct *w)
 	kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
-					  struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
 	struct smap_psock *psock;
 
 	psock = kzalloc_node(sizeof(struct smap_psock),
			     GFP_ATOMIC | __GFP_NOWARN,
-			     stab->map.numa_node);
+			     node);
 	if (!psock)
 		return ERR_PTR(-ENOMEM);
 
@@ -1618,40 +1620,27 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
  * - sock_map must use READ_ONCE and (cmp)xchg operations
  * - BPF verdict/parse programs must use READ_ONCE and xchg operations
  */
-static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
-				    struct bpf_map *map,
-				    void *key, u64 flags)
+
+static int __sock_map_ctx_update_elem(struct bpf_map *map,
+				      struct bpf_sock_progs *progs,
+				      struct sock *sock,
+				      struct sock **map_link,
+				      struct sock **hash_link,
+				      void *key)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
-	struct smap_psock_map_entry *e = NULL;
 	struct bpf_prog *verdict, *parse, *tx_msg;
-	struct sock *osock, *sock;
+	struct smap_psock_map_entry *e = NULL;
 	struct smap_psock *psock;
-	u32 i = *(u32 *)key;
 	bool new = false;
 	int err;
 
-	if (unlikely(flags > BPF_EXIST))
-		return -EINVAL;
-
-	if (unlikely(i >= stab->map.max_entries))
-		return -E2BIG;
-
-	sock = READ_ONCE(stab->sock_map[i]);
-	if (flags == BPF_EXIST && !sock)
-		return -ENOENT;
-	else if (flags == BPF_NOEXIST && sock)
-		return -EEXIST;
-
-	sock = skops->sk;
-
 	/* 1. If sock map has BPF programs those will be inherited by the
 	 * sock being added. If the sock is already attached to BPF programs
 	 * this results in an error.
 	 */
-	verdict = READ_ONCE(stab->bpf_verdict);
-	parse = READ_ONCE(stab->bpf_parse);
-	tx_msg = READ_ONCE(stab->bpf_tx_msg);
+	verdict = READ_ONCE(progs->bpf_verdict);
+	parse = READ_ONCE(progs->bpf_parse);
+	tx_msg = READ_ONCE(progs->bpf_tx_msg);
 
 	if (parse && verdict) {
 		/* bpf prog refcnt may be zero if a concurrent attach operation
@@ -1659,11 +1648,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		 * we increment the refcnt. If this is the case abort with an
 		 * error.
 		 */
-		verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
+		verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
 		if (IS_ERR(verdict))
 			return PTR_ERR(verdict);
 
-		parse = bpf_prog_inc_not_zero(stab->bpf_parse);
+		parse = bpf_prog_inc_not_zero(progs->bpf_parse);
 		if (IS_ERR(parse)) {
 			bpf_prog_put(verdict);
 			return PTR_ERR(parse);
@@ -1671,7 +1660,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	}
 
 	if (tx_msg) {
-		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
+		tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
 		if (IS_ERR(tx_msg)) {
 			if (verdict)
 				bpf_prog_put(verdict);
@@ -1704,7 +1693,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 			goto out_progs;
 		}
 	} else {
-		psock = smap_init_psock(sock, stab);
+		psock = smap_init_psock(sock, map->numa_node);
 		if (IS_ERR(psock)) {
 			err = PTR_ERR(psock);
 			goto out_progs;
@@ -1719,7 +1708,6 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = -ENOMEM;
 		goto out_progs;
 	}
-	e->entry = &stab->sock_map[i];
 
 	/* 3. At this point we have a reference to a valid psock that is
 	 * running. Attach any BPF programs needed.
@@ -1736,7 +1724,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = smap_init_sock(psock, sock);
 		if (err)
 			goto out_free;
-		smap_init_progs(psock, stab, verdict, parse);
+		smap_init_progs(psock, verdict, parse);
 		smap_start_sock(psock, sock);
 	}
 
@@ -1745,19 +1733,12 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	 * it with. Because we can only have a single set of programs if
 	 * old_sock has a strp we can stop it.
 	 */
-	list_add_tail(&e->list, &psock->maps);
-	write_unlock_bh(&sock->sk_callback_lock);
-
-	osock = xchg(&stab->sock_map[i], sock);
-	if (osock) {
-		struct smap_psock *opsock = smap_psock_sk(osock);
-
-		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
-		smap_release_sock(opsock, osock);
-		write_unlock_bh(&osock->sk_callback_lock);
+	if (map_link) {
+		e->entry = map_link;
+		list_add_tail(&e->list, &psock->maps);
 	}
-	return 0;
+	write_unlock_bh(&sock->sk_callback_lock);
+	return err;
 out_free:
 	smap_release_sock(psock, sock);
 out_progs:
@@ -1772,23 +1753,69 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	return err;
 }
 
-int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				    struct bpf_map *map,
+				    void *key, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs = &stab->progs;
+	struct sock *osock, *sock;
+	u32 i = *(u32 *)key;
+	int err;
+
+	if (unlikely(flags > BPF_EXIST))
+		return -EINVAL;
+
+	if (unlikely(i >= stab->map.max_entries))
+		return -E2BIG;
+
+	sock = READ_ONCE(stab->sock_map[i]);
+	if (flags == BPF_EXIST && !sock)
+		return -ENOENT;
+	else if (flags == BPF_NOEXIST && sock)
+		return -EEXIST;
+
+	sock = skops->sk;
+	err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i],
+					 NULL, key);
+	if (err)
+		goto out;
+
+	osock = xchg(&stab->sock_map[i], sock);
+	if (osock) {
+		struct smap_psock *opsock = smap_psock_sk(osock);
+
+		write_lock_bh(&osock->sk_callback_lock);
+		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_release_sock(opsock, osock);
+		write_unlock_bh(&osock->sk_callback_lock);
+	}
+out:
+	return 0;
+}
+
+int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+{
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	if (unlikely(map->map_type != BPF_MAP_TYPE_SOCKMAP))
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
 		return -EINVAL;
+	}
 
 	switch (type) {
 	case BPF_SK_MSG_VERDICT:
-		orig = xchg(&stab->bpf_tx_msg, prog);
+		orig = xchg(&progs->bpf_tx_msg, prog);
 		break;
 	case BPF_SK_SKB_STREAM_PARSER:
-		orig = xchg(&stab->bpf_parse, prog);
+		orig = xchg(&progs->bpf_parse, prog);
 		break;
 	case BPF_SK_SKB_STREAM_VERDICT:
-		orig = xchg(&stab->bpf_verdict, prog);
+		orig = xchg(&progs->bpf_verdict, prog);
 		break;
 	default:
 		return -EOPNOTSUPP;
@@ -1837,16 +1864,18 @@ static int sock_map_update_elem(struct bpf_map *map,
 static void sock_map_release(struct bpf_map *map, struct file *map_file)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	orig = xchg(&stab->bpf_parse, NULL);
+	progs = &stab->progs;
+	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
-	orig = xchg(&stab->bpf_verdict, NULL);
+	orig = xchg(&progs->bpf_verdict, NULL);
 	if (orig)
 		bpf_prog_put(orig);
-	orig = xchg(&stab->bpf_tx_msg, NULL);
+	orig = xchg(&progs->bpf_tx_msg, NULL);
 	if (orig)
 		bpf_prog_put(orig);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff9..6456e9d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1859,9 +1859,8 @@ int skb_do_redirect(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	tcb->bpf.key = key;
 	tcb->bpf.flags = flags;
-	tcb->bpf.map = map;
+	tcb->bpf.sk = __sock_map_lookup_elem(map, key);
 
 	return SK_PASS;
 }
@@ -1869,16 +1868,8 @@ int skb_do_redirect(struct sk_buff *skb)
 struct sock *do_sk_redirect_map(struct sk_buff *skb)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-	struct sock *sk = NULL;
-	if (tcb->bpf.map) {
-		sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key);
-
-		tcb->bpf.key = 0;
-		tcb->bpf.map = NULL;
-	}
-
-	return sk;
+	return tcb->bpf.sk;
 }
 
 static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
@@ -1898,25 +1889,15 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	msg->key = key;
 	msg->flags = flags;
-	msg->map = map;
+	msg->sk = __sock_map_lookup_elem(map, key);
 
 	return SK_PASS;
 }
 
 struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 {
-	struct sock *sk = NULL;
-
-	if (msg->map) {
-		sk = __sock_map_lookup_elem(msg->map, msg->key);
-
-		msg->key = 0;
-		msg->map = NULL;
-	}
-
-	return sk;
+	return msg->sk;
 }
 
 static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
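The third bullet of the commit message is easiest to see in the redirect
helpers above. Conceptually (editorial sketch, condensed from the diff):

	/* Before: the helper stashed map+key in the cb metadata and the
	 * socket was resolved later, at redirect time. A u32 key had to
	 * fit in tcp_skb_cb. */
	tcb->bpf.key = key;
	tcb->bpf.map = map;
	/* ... later ... */
	sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key);

	/* After: the helper resolves the socket inline; the metadata
	 * carries only the sk pointer. No key is stored, so keys longer
	 * than 4 bytes (as sockhash will need) require no extra cb
	 * space, and a stale map pointer can no longer be referenced. */
	tcb->bpf.sk = __sock_map_lookup_elem(map, key);
	/* ... later ... */
	return tcb->bpf.sk;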
From patchwork Sun Apr  1 15:01:10 2018
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 894018
X-Patchwork-Delegate: bpf@iogearbox.net
Subject: [bpf-next PATCH 4/4] bpf: sockmap, add hash map support
From: John Fastabend
To: ast@kernel.org, daniel@iogearbox.net
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Sun, 01 Apr 2018 08:01:10 -0700
Message-ID: <20180401150109.24727.86658.stgit@john-Precision-Tower-5810>
In-Reply-To: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
References: <20180401145728.24727.706.stgit@john-Precision-Tower-5810>
User-Agent: StGit/0.17.1-dirty

Sockmap is currently backed by an array and enforces keys to be four
bytes. This works well for many use cases and was originally modeled
after devmap, which also uses four-byte keys. However, this has become
limiting in larger use cases where a hash would be more appropriate.
For example, users may want to use the 5-tuple of the socket as the
lookup key. To support this, add hash support.
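To make the motivation concrete, a sockhash keyed by connection fields
might look like the following BPF-C sketch. This is editorial, not part
of the series: the key struct and section names are hypothetical, the
bpf_map_def/SEC() conventions follow the selftests of this era, and the
helper call uses the argument order of the kernel-side BPF_CALL_4
definition added below, i.e. (ctx, map, key, flags).

	#include <linux/bpf.h>
	#include "bpf_helpers.h"

	/* Hypothetical key: sockhash simply hashes key_size raw bytes
	 * with jhash, so any fixed-size struct works. */
	struct sock_key {
		__u32 sip4;
		__u32 dip4;
		__u16 sport;
		__u16 dport;
	};

	struct bpf_map_def SEC("maps") sock_hash = {
		.type		= BPF_MAP_TYPE_SOCKHASH,
		.key_size	= sizeof(struct sock_key),
		.value_size	= sizeof(int),
		.max_entries	= 1024,
	};

	SEC("sk_skb") /* section name illustrative */
	int skb_verdict(struct __sk_buff *skb)
	{
		struct sock_key key = {};

		/* Fill the key from skb fields; byte-order handling
		 * is elided in this sketch. */
		key.sip4  = skb->remote_ip4;
		key.dip4  = skb->local_ip4;
		key.sport = (__u16)skb->remote_port;
		key.dport = (__u16)skb->local_port;

		return bpf_sk_redirect_hash(skb, &sock_hash, &key, 0);
	}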
Signed-off-by: John Fastabend
---
 include/linux/bpf.h                       |    8 
 include/linux/bpf_types.h                 |    1 
 include/uapi/linux/bpf.h                  |    6 
 kernel/bpf/core.c                         |    1 
 kernel/bpf/sockmap.c                      |  512 ++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c                     |   14 +
 net/core/filter.c                         |   54 +++
 tools/bpf/bpftool/map.c                   |    1 
 tools/include/uapi/linux/bpf.h            |    6 
 tools/testing/selftests/bpf/bpf_helpers.h |   13 +
 10 files changed, 598 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..d5c77f2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -645,6 +645,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
 struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
@@ -652,6 +653,12 @@ static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 	return NULL;
 }
 
+static inline struct sock *__sock_hash_lookup_elem(struct bpf_map *map,
+						   void *key)
+{
+	return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
 				struct bpf_prog *prog,
 				u32 type)
@@ -677,6 +684,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf..3101118 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..9886e88 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -115,6 +115,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -821,7 +822,10 @@ struct bpf_stack_build_id {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..dd2c906 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1769,6 +1769,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
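A note on the uapi hunk above: the FN() list is positional. In the real
include/uapi/linux/bpf.h it is expanded into enum bpf_func_id, so
appending entries at the tail assigns the next free helper IDs without
renumbering existing ones. Sketch of the expansion (from the uapi
header, abbreviated):

	#define __BPF_ENUM_FN(x) BPF_FUNC_ ## x
	enum bpf_func_id {
		__BPF_FUNC_MAPPER(__BPF_ENUM_FN)
		/* FN(sock_hash_update) ... becomes BPF_FUNC_sock_hash_update, etc. */
		__BPF_FUNC_MAX_ID,
	};
	#undef __BPF_ENUM_FN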
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index c006e3b..175fcb0 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -59,6 +59,28 @@ struct bpf_stab {
 	struct bpf_sock_progs progs;
 };
 
+struct bucket {
+	struct hlist_nulls_head head;
+	raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct bucket *buckets;
+	atomic_t count;
+	u32 n_buckets;
+	u32 elem_size;
+	struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+	struct rcu_head rcu;
+	struct hlist_nulls_node hash_node;
+	u32 hash;
+	struct sock *sk;
+	char key[0];
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -66,6 +88,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
 	struct list_head list;
 	struct sock **entry;
+	struct htab_elem *hash_link;
+	struct bpf_htab *htab;
 };
 
 struct smap_psock {
@@ -194,6 +218,27 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static void htab_elem_free_rcu(struct rcu_head *head)
+{
+	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
+
+	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
+	 * we're calling kfree, otherwise deadlock is possible if kprobes
+	 * are placed somewhere inside of slub
+	 */
+	preempt_disable();
+	__this_cpu_inc(bpf_prog_active);
+	kfree(l);
+	__this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+}
+
+static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	atomic_dec(&htab->count);
+	call_rcu(&l->rcu, htab_elem_free_rcu);
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
@@ -230,10 +275,16 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	}
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		osk = cmpxchg(e->entry, sk, NULL);
-		if (osk == sk) {
-			list_del(&e->list);
-			smap_release_sock(psock, sk);
+		if (e->entry) {
+			osk = cmpxchg(e->entry, sk, NULL);
+			if (osk == sk) {
+				list_del(&e->list);
+				smap_release_sock(psock, sk);
+			}
+		} else {
+			hlist_nulls_del_rcu(&e->hash_link->hash_node);
+			smap_release_sock(psock, e->hash_link->sk);
+			free_htab_elem(e->htab, e->hash_link);
 		}
 	}
 	write_unlock_bh(&sk->sk_callback_lock);
@@ -1483,12 +1534,14 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock, struct sock **entry)
+static void smap_list_remove(struct smap_psock *psock,
+			     struct sock **entry,
+			     struct htab_elem *hash_link)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry) {
+		if (e->entry == entry || e->hash_link == hash_link) {
 			list_del(&e->list);
 			break;
 		}
@@ -1526,7 +1579,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i]);
+			smap_list_remove(psock, &stab->sock_map[i], NULL);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1585,7 +1638,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
 
-	smap_list_remove(psock, &stab->sock_map[k]);
+	smap_list_remove(psock, &stab->sock_map[k], NULL);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1625,7 +1678,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 				      struct bpf_sock_progs *progs,
 				      struct sock *sock,
 				      struct sock **map_link,
-				      struct sock **hash_link,
+				      struct htab_elem *hash_link,
 				      void *key)
 {
 	struct bpf_prog *verdict, *parse, *tx_msg;
@@ -1737,6 +1790,11 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 		e->entry = map_link;
 		list_add_tail(&e->list, &psock->maps);
 	}
+	if (hash_link) {
+		e->hash_link = hash_link;
+		e->htab = container_of(map, struct bpf_htab, map);
+		list_add_tail(&e->list, &psock->maps);
+	}
 	write_unlock_bh(&sock->sk_callback_lock);
 	return err;
 out_free:
@@ -1786,7 +1844,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_list_remove(opsock, &stab->sock_map[i], NULL);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -1803,6 +1861,10 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
 		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 
 		progs = &stab->progs;
+	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH) {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
 	} else {
 		return -EINVAL;
 	}
@@ -1863,11 +1925,19 @@ static int sock_map_update_elem(struct bpf_map *map,
 
 static void sock_map_release(struct bpf_map *map, struct file *map_file)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	progs = &stab->progs;
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
+	}
+
 	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
@@ -1880,6 +1950,396 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
 		bpf_prog_put(orig);
 }
 
+static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
+{
+	struct bpf_htab *htab;
+	int i, err;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 ||
+	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	err = bpf_tcp_ulp_register();
+	if (err && err != -EEXIST)
+		return ERR_PTR(err);
+
+	htab = kzalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&htab->map, attr);
+
+	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8);
+
+	if (htab->n_buckets == 0 ||
+	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
+		goto free_htab;
+
+	cost = (u64) htab->n_buckets * sizeof(struct bucket) +
+	       (u64) htab->elem_size * htab->map.max_entries;
+
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_htab;
+
+	htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+	err = bpf_map_precharge_memlock(htab->map.pages);
+	if (err)
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = bpf_map_area_alloc(
+				htab->n_buckets * sizeof(struct bucket),
+				htab->map.numa_node);
+	if (!htab->buckets)
+		goto free_htab;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+		raw_spin_lock_init(&htab->buckets[i].lock);
+	}
+
+	return &htab->map;
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab,
+						     u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
+static void sock_hash_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	int i;
+
+	synchronize_rcu();
+
+	/* At this point no update, lookup or delete operations can happen.
+	 * However, be aware we can still get a socket state event updates,
+	 * and data ready callabacks that reference the psock from sk_user_data
+	 * Also psock worker threads are still in-flight. So smap_release_sock
+	 * will only free the psock after cancel_sync on the worker threads
+	 * and a grace period expire to ensure psock is really safe to remove.
+	 */
+	rcu_read_lock();
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_nulls_head *head = select_bucket(htab, i);
+		struct hlist_nulls_node *n;
+		struct htab_elem *l;
+
+		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+			struct sock *sock = l->sk;
+			struct smap_psock *psock;
+
+			hlist_nulls_del_rcu(&l->hash_node);
+			write_lock_bh(&sock->sk_callback_lock);
+			psock = smap_psock_sk(sock);
+			/* This check handles a racing sock event that can get
+			 * the sk_callback_lock before this case but after xchg
+			 * causing the refcnt to hit zero and sock user data
+			 * (psock) to be null and queued for garbage collection.
+			 */
+			if (likely(psock)) {
+				smap_list_remove(psock, NULL, l);
+				smap_release_sock(psock, sock);
+			}
+			write_unlock_bh(&sock->sk_callback_lock);
+			kfree(l);
+		}
+	}
+	rcu_read_unlock();
+	bpf_map_area_free(htab->buckets);
+	kfree(htab);
+}
+
+static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
+					      void *key, u32 key_size, u32 hash,
+					      struct sock *sk,
+					      struct htab_elem *old_elem)
+{
+	struct htab_elem *l_new;
+
+	if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
+		if (!old_elem) {
+			atomic_dec(&htab->count);
+			return ERR_PTR(-E2BIG);
+		}
+	}
+	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
+			     htab->map.numa_node);
+	if (!l_new)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(l_new->key, key, key_size);
+	l_new->sk = sk;
+	l_new->hash = hash;
+	return l_new;
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct hlist_nulls_node *n;
+	struct htab_elem *l;
+
+	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+
+	return NULL;
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
+					       u32 hash, void *key,
+					       u32 key_size, u32 n_buckets)
+{
+	struct hlist_nulls_node *n;
+	struct htab_elem *l;
+
+again:
+	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+
+	if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
+		goto again;
+
+	return NULL;
+}
+
+static int sock_hash_get_next_key(struct bpf_map *map,
+				  void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l, *next_l;
+	struct hlist_nulls_head *h;
+	u32 hash, key_size;
+	int i = 0;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+	if (!key)
+		goto find_first_elem;
+	hash = htab_map_hash(key, key_size);
+	h = select_bucket(htab, hash);
+
+	l = lookup_nulls_elem_raw(h, hash, key, key_size, htab->n_buckets);
+	if (!l)
+		goto find_first_elem;
+	next_l = hlist_nulls_entry_safe(
+		     rcu_dereference_raw(hlist_nulls_next_rcu(&l->hash_node)),
+		     struct htab_elem, hash_node);
+	if (next_l) {
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	}
+
+	/* no more elements in this hash list, go to the next bucket */
+	i = hash & (htab->n_buckets - 1);
+	i++;
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		h = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_nulls_entry_safe(
+			     rcu_dereference_raw(hlist_nulls_first_rcu(h)),
+			     struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				     struct bpf_map *map,
+				     void *key, u64 map_flags)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct bpf_sock_progs *progs = &htab->progs;
+	struct htab_elem *l_new = NULL, *l_old;
+	struct hlist_nulls_head *head;
+	struct smap_psock *psock;
+	u32 key_size, hash;
+	struct sock *sock;
+	struct bucket *b;
+	int err;
+
+	sock = skops->sk;
+
+	if (sock->sk_type != SOCK_STREAM ||
+	    sock->sk_protocol != IPPROTO_TCP)
+		return -EOPNOTSUPP;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	err = __sock_map_ctx_update_elem(map, progs, sock, NULL, NULL, key);
+	if (err)
+		goto err;
+
+	/* bpf_map_update_elem() can be called in_irq() */
+	raw_spin_lock_bh(&b->lock);
+	l_old = lookup_elem_raw(head, hash, key, key_size);
+	if (l_old && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto bucket_err;
+	}
+	if (!l_old && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto bucket_err;
+	}
+
+	l_new = alloc_sock_hash_elem(htab, key, key_size, hash, sock, l_old);
+	if (IS_ERR(l_new)) {
+		err = PTR_ERR(l_new);
+		goto bucket_err;
+	}
+
+	/* add new element to the head of the list, so that
+	 * concurrent search will find it before old elem
+	 */
+	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		psock = smap_psock_sk(l_old->sk);
+
+		hlist_nulls_del_rcu(&l_old->hash_node);
+		smap_list_remove(psock, NULL, l_old);
+		smap_release_sock(psock, l_old->sk);
+		free_htab_elem(htab, l_old);
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return 0;
+bucket_err:
+	raw_spin_unlock_bh(&b->lock);
+err:
+	psock = smap_psock_sk(sock);
+	if (psock)
+		smap_release_sock(psock, sock);
+	return err;
+}
+
+static int sock_hash_update_elem(struct bpf_map *map,
+				 void *key, void *value, u64 flags)
+{
+	struct bpf_sock_ops_kern skops;
+	u32 fd = *(u32 *)value;
+	struct socket *socket;
+	int err;
+
+	socket = sockfd_lookup(fd, &err);
+	if (!socket)
+		return err;
+
+	skops.sk = socket->sk;
+	if (!skops.sk) {
+		fput(socket->file);
+		return -EINVAL;
+	}
+
+	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
+	fput(socket->file);
+	return err;
+}
+
+static int sock_hash_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_nulls_head *head;
+	struct bucket *b;
+	struct htab_elem *l;
+	u32 hash, key_size;
+	int ret = -ENOENT;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	if (l) {
+		struct sock *sock = l->sk;
+		struct smap_psock *psock;
+
+		hlist_nulls_del_rcu(&l->hash_node);
+		write_lock_bh(&sock->sk_callback_lock);
+		psock = smap_psock_sk(sock);
+		/* This check handles a racing sock event that can get the
+		 * sk_callback_lock before this case but after xchg happens
+		 * causing the refcnt to hit zero and sock user data (psock)
+		 * to be null and queued for garbage collection.
+		 */
+		if (likely(psock)) {
+			smap_list_remove(psock, NULL, l);
+			smap_release_sock(psock, sock);
+		}
+		write_unlock_bh(&sock->sk_callback_lock);
+		free_htab_elem(htab, l);
+		ret = 0;
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return ret;
+}
+
+struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_nulls_head *head;
+	struct htab_elem *l;
+	u32 key_size, hash;
+	struct bucket *b;
+	struct sock *sk;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	sk = l ? l->sk : NULL;
+	raw_spin_unlock_bh(&b->lock);
+	return sk;
+}
+
 const struct bpf_map_ops sock_map_ops = {
 	.map_alloc = sock_map_alloc,
 	.map_free = sock_map_free,
@@ -1890,6 +2350,16 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
 	.map_release = sock_map_release,
 };
 
+const struct bpf_map_ops sock_hash_ops = {
+	.map_alloc = sock_hash_alloc,
+	.map_free = sock_hash_free,
+	.map_lookup_elem = sock_map_lookup,
+	.map_get_next_key = sock_hash_get_next_key,
+	.map_update_elem = sock_hash_update_elem,
+	.map_delete_elem = sock_hash_delete_elem,
+	.map_release = sock_map_release,
+};
+
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
@@ -1907,3 +2377,21 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
 	.arg3_type = ARG_PTR_TO_MAP_KEY,
 	.arg4_type = ARG_ANYTHING,
 };
+
+BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, bpf_sock,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
+}
+
+const struct bpf_func_proto bpf_sock_hash_update_proto = {
+	.func		= bpf_sock_hash_update,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_PTR_TO_MAP_KEY,
+	.arg4_type	= ARG_ANYTHING,
+};
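From user space, sock_hash_update_elem() above means a plain map update
works: the value is interpreted as a TCP socket fd. A minimal editorial
sketch, assuming the tools/lib/bpf wrappers of this era (bpf_create_map()
and bpf_map_update_elem(); header path may vary by tree), and noting that
sock_hash_alloc() requires CAP_NET_ADMIN:

	#include <bpf/bpf.h>	/* tools/lib/bpf wrappers */

	/* Create a sockhash keyed by a 12-byte key and add a connected
	 * TCP socket. value_size must be sizeof(int): the update path
	 * reads the value as a socket fd (u32 fd = *(u32 *)value). */
	int add_sock(int tcp_fd)
	{
		char key[12] = {0};	/* e.g. a serialized tuple */
		int map_fd;

		map_fd = bpf_create_map(BPF_MAP_TYPE_SOCKHASH, sizeof(key),
					sizeof(int), 1024, 0);
		if (map_fd < 0)
			return map_fd;

		/* BPF_NOEXIST: fail with -EEXIST if key already bound */
		return bpf_map_update_elem(map_fd, key, &tcp_fd, BPF_NOEXIST);
	}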
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..878cca6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2088,6 +2088,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_msg_redirect_map)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_SOCKHASH:
+		if (func_id != BPF_FUNC_sk_redirect_hash &&
+		    func_id != BPF_FUNC_sock_hash_update &&
+		    func_id != BPF_FUNC_map_delete_elem &&
+		    func_id != BPF_FUNC_msg_redirect_hash)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -2124,11 +2131,14 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
+	case BPF_FUNC_sock_map_update:
 		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
-	case BPF_FUNC_sock_map_update:
-		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
+	case BPF_FUNC_sk_redirect_hash:
+	case BPF_FUNC_msg_redirect_hash:
+	case BPF_FUNC_sock_hash_update:
+		if (map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
 	default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 6456e9d..d629479 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1850,6 +1850,31 @@ int skb_do_redirect(struct sk_buff *skb)
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk = __sock_hash_lookup_elem(map, key);
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_sk_redirect_hash_proto = {
+	.func           = bpf_sk_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -1882,6 +1907,29 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	.arg4_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg_buff *, msg,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk = __sock_hash_lookup_elem(map, key);
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_msg_redirect_hash_proto = {
+	.func           = bpf_msg_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -3892,6 +3940,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_sock_ops_cb_flags_set_proto;
 	case BPF_FUNC_sock_map_update:
 		return &bpf_sock_map_update_proto;
+	case BPF_FUNC_sock_hash_update:
+		return &bpf_sock_hash_update_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -3903,6 +3953,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	switch (func_id) {
 	case BPF_FUNC_msg_redirect_map:
 		return &bpf_msg_redirect_map_proto;
+	case BPF_FUNC_msg_redirect_hash:
+		return &bpf_msg_redirect_hash_proto;
 	case BPF_FUNC_msg_apply_bytes:
 		return &bpf_msg_apply_bytes_proto;
 	case BPF_FUNC_msg_cork_bytes:
@@ -3934,6 +3986,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_sk_redirect_map:
 		return &bpf_sk_redirect_map_proto;
+	case BPF_FUNC_sk_redirect_hash:
+		return &bpf_sk_redirect_hash_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index f95fa67..2fa4cbb 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -67,6 +67,7 @@
 	[BPF_MAP_TYPE_DEVMAP]		= "devmap",
 	[BPF_MAP_TYPE_SOCKMAP]		= "sockmap",
 	[BPF_MAP_TYPE_CPUMAP]		= "cpumap",
+	[BPF_MAP_TYPE_SOCKHASH]		= "sockhash",
 };
 
 static unsigned int get_possible_cpus(void)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465..1a19450 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -115,6 +115,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -821,7 +822,10 @@ struct bpf_stack_build_id {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d9..fc6cf08 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -73,7 +73,7 @@ static int (*bpf_getsockopt)(void *ctx, int level, int optname, void *optval,
 	(void *) BPF_FUNC_getsockopt;
 static int (*bpf_sock_ops_cb_flags_set)(void *ctx, int flags) =
 	(void *) BPF_FUNC_sock_ops_cb_flags_set;
-static int (*bpf_sk_redirect_map)(void *ctx, void *map, int key, int flags) =
+static int (*bpf_sk_redirect_map)(void *ctx, void *map, void *key, int flags) =
 	(void *) BPF_FUNC_sk_redirect_map;
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
 				  unsigned long long flags) =
@@ -86,7 +86,7 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
 	(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
 	(void *) BPF_FUNC_override_return;
-static int (*bpf_msg_redirect_map)(void *ctx, void *map, int key, int flags) =
+static int (*bpf_msg_redirect_map)(void *ctx, void *map, void *key, int flags) =
 	(void *) BPF_FUNC_msg_redirect_map;
 static int (*bpf_msg_apply_bytes)(void *ctx, int len) =
 	(void *) BPF_FUNC_msg_apply_bytes;
@@ -96,6 +96,15 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
 	(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
+static int (*bpf_sock_hash_update)(void *map, void *key, void *value,
+				   unsigned long long flags) =
+	(void *) BPF_FUNC_sock_hash_update;
+static int (*bpf_msg_redirect_hash)(void *map, void *key, void *value,
+				    unsigned long long flags) =
+	(void *) BPF_FUNC_msg_redirect_hash;
+static int (*bpf_sk_redirect_hash)(void *map, void *key, void *value,
+				   unsigned long long flags) =
+	(void *) BPF_FUNC_sk_redirect_hash;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
  */
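A closing editorial sketch showing the insertion side from BPF: a
sock_ops program adding established sockets via the new
bpf_sock_hash_update() helper. The key choice is illustrative, and the
arguments are passed positionally per the kernel-side BPF_CALL_4
definition, (ctx, map, key, flags); the parameter names in the
bpf_helpers.h wrappers above are only placeholders in a cast function
pointer, so positional order is what matters.

	#include <linux/bpf.h>
	#include "bpf_helpers.h"

	struct bpf_map_def SEC("maps") sock_hash = {
		.type		= BPF_MAP_TYPE_SOCKHASH,
		.key_size	= sizeof(__u32),
		.value_size	= sizeof(int),
		.max_entries	= 1024,
	};

	SEC("sockops")
	int add_established(struct bpf_sock_ops *skops)
	{
		__u32 key = skops->local_port; /* illustrative key */

		switch (skops->op) {
		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
			/* positional: (ctx, map, key, flags) */
			bpf_sock_hash_update(skops, &sock_hash, &key,
					     BPF_ANY);
			break;
		default:
			break;
		}
		return 0;
	}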