[v3,bpf-next,02/14] bpf: introduce cgroup storage maps

Message ID	20180720174558.5829-3-guro@fb.com
State	Changes Requested, archived
Delegated to:	BPF Maintainers
Series	bpf: cgroup local storage
  [v3,bpf-next,00/14] bpf: cgroup local storage
  [v3,bpf-next,01/14] bpf: add ability to charge bpf maps memory dynamically
  [v3,bpf-next,02/14] bpf: introduce cgroup storage maps
  [v3,bpf-next,03/14] bpf: pass a pointer to a cgroup storage using pcpu variable
  [v3,bpf-next,04/14] bpf: allocate cgroup storage entries on attaching bpf programs
  [v3,bpf-next,05/14] bpf: extend bpf_prog_array to store pointers to the cgroup storage
  [v3,bpf-next,06/14] bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE
  [v3,bpf-next,07/14] bpf: don't allow create maps of cgroup local storages
  [v3,bpf-next,08/14] bpf: introduce the bpf_get_local_storage() helper function
  [v3,bpf-next,09/14] bpf: sync bpf.h to tools/
  [v3,bpf-next,10/14] bpftool: add support for CGROUP_STORAGE maps
  [v3,bpf-next,11/14] bpf/test_run: support cgroup local storage
  [v3,bpf-next,12/14] selftests/bpf: add verifier cgroup storage tests
  [v3,bpf-next,13/14] selftests/bpf: add a cgroup storage test
  [v3,bpf-next,14/14] samples/bpf: extend test_cgrp2_attach2 test to use cgroup storage

Headers

From: Roman Gushchin <guro@fb.com>
To: <netdev@vger.kernel.org>
CC: <linux-kernel@vger.kernel.org>, <kernel-team@fb.com>,
	Roman Gushchin <guro@fb.com>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>
Subject: [PATCH v3 bpf-next 02/14] bpf: introduce cgroup storage maps
Date: Fri, 20 Jul 2018 10:45:46 -0700
Message-ID: <20180720174558.5829-3-guro@fb.com>
In-Reply-To: <20180720174558.5829-1-guro@fb.com>
References: <20180720174558.5829-1-guro@fb.com>
MIME-Version: 1.0
Content-Type: text/plain
Received-SPF: None (protection.outlook.com: fb.com does not designate
	permitted sender hosts)
SpamDiagnosticOutput: 1:99
SpamDiagnosticMetadata: NSPM
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 20 Jul 2018 17:46:15.4205
	(UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 9bc31e68-116c-4155-53bd-08d5ee68b14a
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 8ae927fe-1255-47a7-a2af-5f3a069daaa2
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN1PR15MB0175
X-OriginatorOrg: fb.com
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, ,
	definitions=2018-07-20_05:, , signatures=0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

Commit Message

Roman Gushchin July 20, 2018, 5:45 p.m. UTC

This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of map that implements cgroup storage.

From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.

The only difference is that some operations are restricted:
  1) a user can't create new entries,
  2) a user can't remove existing entries.

The lookup from userspace is O(log(n)).

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h |  38 +++++
 include/linux/bpf.h        |   1 +
 include/linux/bpf_types.h  |   3 +
 include/uapi/linux/bpf.h   |   6 +
 kernel/bpf/Makefile        |   1 +
 kernel/bpf/local_storage.c | 367 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c      |  12 ++
 7 files changed, 428 insertions(+)
 create mode 100644 kernel/bpf/local_storage.c
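
The key layout and ordered lookup described in the commit message can be
modeled in userspace. The following is a minimal sketch, not the kernel
code: struct storage_key mirrors struct bpf_cgroup_storage_key from the
patch, key_cmp() follows the ordering of bpf_cgroup_storage_key_cmp(), and
a sorted array searched with bsearch() stands in for the kernel rbtree. The
names storage_key, storage_entry, entries and storage_lookup() are
hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Userspace mirror of struct bpf_cgroup_storage_key from the patch. */
struct storage_key {
	uint64_t cgroup_inode_id;	/* cgroup inode id */
	uint32_t attach_type;		/* program attach type */
};

/* Hypothetical entry: a key plus a small value payload. */
struct storage_entry {
	struct storage_key key;
	long value;
};

/* Same ordering as bpf_cgroup_storage_key_cmp(): inode id, then attach type. */
static int key_cmp(const struct storage_key *a, const struct storage_key *b)
{
	if (a->cgroup_inode_id < b->cgroup_inode_id)
		return -1;
	if (a->cgroup_inode_id > b->cgroup_inode_id)
		return 1;
	if (a->attach_type < b->attach_type)
		return -1;
	if (a->attach_type > b->attach_type)
		return 1;
	return 0;
}

/* bsearch() adapter: first argument is the key, second an array element. */
static int entry_cmp(const void *k, const void *e)
{
	return key_cmp(k, &((const struct storage_entry *)e)->key);
}

/* Entries kept sorted by key_cmp(); this array stands in for the rbtree. */
static struct storage_entry entries[] = {
	{ { 10, 0 }, 100 },
	{ { 10, 1 }, 101 },
	{ { 42, 0 }, 420 },
};

/* O(log(n)) lookup; returns NULL when the key is absent. */
struct storage_entry *storage_lookup(uint64_t inode_id, uint32_t type)
{
	struct storage_key key = { inode_id, type };

	return bsearch(&key, entries, sizeof(entries) / sizeof(entries[0]),
		       sizeof(entries[0]), entry_cmp);
}
```

With this data, storage_lookup(10, 1) returns the entry holding 101, and
storage_lookup(7, 0) returns NULL. Note that the sketch has no insert or
delete path at all, which loosely matches the restriction that userspace
can neither create nor remove entries.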

Comments

Daniel Borkmann July 27, 2018, 4:11 a.m. UTC | #1

On 07/20/2018 07:45 PM, Roman Gushchin wrote:
> This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
> a special type of maps which are implementing the cgroup storage.
> 
> From the userspace point of view it's almost a generic
> hash map with the (cgroup inode id, attachment type) pair
> used as a key.
> 
> The only difference is that some operations are restricted:
>   1) a user can't create new entries,
>   2) a user can't remove existing entries.
> 
> The lookup from userspace is o(log(n)).
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Martin KaFai Lau <kafai@fb.com>

(First of all sorry for the late review, only limited availability this week
 on my side.)

> ---
>  include/linux/bpf-cgroup.h |  38 +++++
>  include/linux/bpf.h        |   1 +
>  include/linux/bpf_types.h  |   3 +
>  include/uapi/linux/bpf.h   |   6 +
>  kernel/bpf/Makefile        |   1 +
>  kernel/bpf/local_storage.c | 367 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c      |  12 ++
>  7 files changed, 428 insertions(+)
>  create mode 100644 kernel/bpf/local_storage.c
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 79795c5fa7c3..6b0e7bd4b154 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -3,19 +3,39 @@
>  #define _BPF_CGROUP_H
>  
>  #include <linux/jump_label.h>
> +#include <linux/rbtree.h>
>  #include <uapi/linux/bpf.h>
>  
[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 15d69b278277..0b089ba4595d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5140,6 +5140,14 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
>  				return -E2BIG;
>  			}
>  
> +			if (map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE &&
> +			    bpf_cgroup_storage_assign(env->prog, map)) {
> +				verbose(env,
> +					"only one cgroup storage is allowed\n");
> +				fdput(f);
> +				return -EBUSY;
> +			}
> +
>  			/* hold the map. If the program is rejected by verifier,
>  			 * the map will be released by release_maps() or it
>  			 * will be used by the valid program until it's unloaded
> @@ -5148,6 +5156,10 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
>  			map = bpf_map_inc(map, false);
>  			if (IS_ERR(map)) {
>  				fdput(f);
> +				if (map->map_type ==
> +				    BPF_MAP_TYPE_CGROUP_STORAGE)
> +					bpf_cgroup_storage_release(env->prog,
> +								   map);

I think this behavior is a bit strange: we reset the map via
bpf_cgroup_storage_release() in this case, but if we error out and exit at any
later instruction, the prior bpf_cgroup_storage_assign() is not undone. At that
point we have no choice but to destroy the map, since any later BPF prog load
that attempts bpf_cgroup_storage_assign() would fail with -EBUSY even though the
map is not assigned anywhere. Is that correct? The same applies to any other
error along the prog load path. As one example, say your verifier buffer is too
small: any retry with the very same program from the loader side would fail
above due to different prog pointers?

>  				return PTR_ERR(map);
>  			}
>  			env->used_maps[env->used_map_cnt++] = map;
>
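
The scenario raised in the review above can be reproduced with a small
userspace model. This is only a sketch under stated assumptions: model_prog
and model_map are hypothetical stand-ins for struct bpf_prog and
struct bpf_cgroup_storage_map, model_assign() copies the two checks from
bpf_cgroup_storage_assign() in the patch, and the "failed load" is simply
the absence of any release call before the retry.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the program and the storage map. */
struct model_map;

struct model_prog {
	struct model_map *cgroup_storage;	/* like prog->aux->cgroup_storage */
};

struct model_map {
	struct model_prog *prog;		/* like bpf_cgroup_storage_map.prog */
};

/* Same checks as bpf_cgroup_storage_assign() in the patch. */
int model_assign(struct model_prog *prog, struct model_map *map)
{
	if (map->prog && map->prog != prog)
		return -EBUSY;
	if (prog->cgroup_storage && prog->cgroup_storage != map)
		return -EBUSY;

	map->prog = prog;
	prog->cgroup_storage = map;
	return 0;
}

/*
 * A first load assigns the map and is then rejected by a later verifier
 * step without any release. The loader retries, which produces a fresh
 * program object with a different address. Returns the result of the
 * assign attempted by that retry.
 */
int model_retry_after_failed_load(void)
{
	struct model_map map = { NULL };
	struct model_prog first = { NULL }, retry = { NULL };

	if (model_assign(&first, &map))
		return -EINVAL;		/* not expected in this model */

	/* ... "first" is rejected here; map.prog still points to it ... */

	return model_assign(&retry, &map);
}
```

In this model the retry is refused with -EBUSY although no loaded program
owns the map, which is the state the review describes.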

Roman Gushchin July 27, 2018, 5:12 p.m. UTC | #2

On Fri, Jul 27, 2018 at 06:11:31AM +0200, Daniel Borkmann wrote:
> On 07/20/2018 07:45 PM, Roman Gushchin wrote:
> > This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
> > a special type of maps which are implementing the cgroup storage.
> > 
> > From the userspace point of view it's almost a generic
> > hash map with the (cgroup inode id, attachment type) pair
> > used as a key.
> > 
> > The only difference is that some operations are restricted:
> >   1) a user can't create new entries,
> >   2) a user can't remove existing entries.
> > 
> > The lookup from userspace is o(log(n)).
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Acked-by: Martin KaFai Lau <kafai@fb.com>
> 
> (First of all sorry for the late review, only limited availability this week
>  on my side.)

Np, thank you for the review!

> 
> > ---
> >  include/linux/bpf-cgroup.h |  38 +++++
> >  include/linux/bpf.h        |   1 +
> >  include/linux/bpf_types.h  |   3 +
> >  include/uapi/linux/bpf.h   |   6 +
> >  kernel/bpf/Makefile        |   1 +
> >  kernel/bpf/local_storage.c | 367 +++++++++++++++++++++++++++++++++++++++++++++
> >  kernel/bpf/verifier.c      |  12 ++
> >  7 files changed, 428 insertions(+)
> >  create mode 100644 kernel/bpf/local_storage.c
> > 
> > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > index 79795c5fa7c3..6b0e7bd4b154 100644
> > --- a/include/linux/bpf-cgroup.h
> > +++ b/include/linux/bpf-cgroup.h
> > @@ -3,19 +3,39 @@
> >  #define _BPF_CGROUP_H
> >  
> >  #include <linux/jump_label.h>
> > +#include <linux/rbtree.h>
> >  #include <uapi/linux/bpf.h>
> >  
> [...]
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 15d69b278277..0b089ba4595d 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -5140,6 +5140,14 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
> >  				return -E2BIG;
> >  			}
> >  
> > +			if (map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE &&
> > +			    bpf_cgroup_storage_assign(env->prog, map)) {
> > +				verbose(env,
> > +					"only one cgroup storage is allowed\n");
> > +				fdput(f);
> > +				return -EBUSY;
> > +			}
> > +
> >  			/* hold the map. If the program is rejected by verifier,
> >  			 * the map will be released by release_maps() or it
> >  			 * will be used by the valid program until it's unloaded
> > @@ -5148,6 +5156,10 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
> >  			map = bpf_map_inc(map, false);
> >  			if (IS_ERR(map)) {
> >  				fdput(f);
> > +				if (map->map_type ==
> > +				    BPF_MAP_TYPE_CGROUP_STORAGE)
> > +					bpf_cgroup_storage_release(env->prog,
> > +								   map);
> 
> I think this behavior is a bit strange: we reset the map via
> bpf_cgroup_storage_release() in this case, but if we error out and exit at any
> later instruction, the prior bpf_cgroup_storage_assign() is not undone. At that
> point we have no choice but to destroy the map, since any later BPF prog load
> that attempts bpf_cgroup_storage_assign() would fail with -EBUSY even though the
> map is not assigned anywhere. Is that correct? The same applies to any other
> error along the prog load path. As one example, say your verifier buffer is too
> small: any retry with the very same program from the loader side would fail
> above due to different prog pointers?

Yeah, I see...
We should call bpf_cgroup_storage_release() from the generic verifier error path.
I'll fix this and the leak in the other patch and resend soon.

Thanks!
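
The direction of the fix mentioned in the reply above (releasing the
storage from the generic verifier error path) can be sketched as a
userspace model. This is an illustration only, not the code that was later
posted or merged: sketch_prog, sketch_map, sketch_load() and the injected
verify_err argument are all hypothetical, and only the assign and release
logic is taken from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields needed for the model. */
struct sketch_prog;

struct sketch_map {
	struct sketch_prog *prog;	/* like bpf_cgroup_storage_map.prog */
};

struct sketch_prog {
	struct sketch_map *cgroup_storage;
};

/* Same checks as bpf_cgroup_storage_assign() in the patch. */
static int sketch_assign(struct sketch_prog *prog, struct sketch_map *map)
{
	if (map->prog && map->prog != prog)
		return -EBUSY;
	if (prog->cgroup_storage && prog->cgroup_storage != map)
		return -EBUSY;
	map->prog = prog;
	prog->cgroup_storage = map;
	return 0;
}

/* Same effect as bpf_cgroup_storage_release(): drop the back-pointer. */
static void sketch_release(struct sketch_prog *prog, struct sketch_map *map)
{
	if (map->prog == prog)
		map->prog = NULL;
}

/*
 * Model of a program load. verify_err is the injected result of the
 * later verification steps. Every failure after a successful assign
 * leaves through the same label, so the map is never left bound to a
 * program that was rejected.
 */
int sketch_load(struct sketch_prog *prog, struct sketch_map *map,
		int verify_err)
{
	int err;

	err = sketch_assign(prog, map);
	if (err)
		return err;

	err = verify_err;	/* stands in for the rest of the verifier */
	if (err)
		goto err_release;

	return 0;

err_release:
	sketch_release(prog, map);
	return err;
}
```

With the release on the common error path, a load that fails after the
assign leaves map->prog cleared, so a retry with a new program object can
bind the same map.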

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 79795c5fa7c3..6b0e7bd4b154 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -3,19 +3,39 @@ 
 #define _BPF_CGROUP_H
 
 #include <linux/jump_label.h>
+#include <linux/rbtree.h>
 #include <uapi/linux/bpf.h>
 
 struct sock;
 struct sockaddr;
 struct cgroup;
 struct sk_buff;
+struct bpf_map;
+struct bpf_prog;
 struct bpf_sock_ops_kern;
+struct bpf_cgroup_storage;
 
 #ifdef CONFIG_CGROUP_BPF
 
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+struct bpf_cgroup_storage_map;
+
+struct bpf_storage_buffer {
+	struct rcu_head rcu;
+	char data[0];
+};
+
+struct bpf_cgroup_storage {
+	struct bpf_storage_buffer *buf;
+	struct bpf_cgroup_storage_map *map;
+	struct bpf_cgroup_storage_key key;
+	struct list_head list;
+	struct rb_node node;
+	struct rcu_head rcu;
+};
+
 struct bpf_prog_list {
 	struct list_head node;
 	struct bpf_prog *prog;
@@ -76,6 +96,15 @@  int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 				      short access, enum bpf_attach_type type);
 
+struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog);
+void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage);
+void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
+			     struct cgroup *cgroup,
+			     enum bpf_attach_type type);
+void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage);
+int bpf_cgroup_storage_assign(struct bpf_prog *prog, struct bpf_map *map);
+void bpf_cgroup_storage_release(struct bpf_prog *prog, struct bpf_map *map);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
 ({									      \
@@ -220,6 +249,15 @@  static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog,
+					    struct bpf_map *map) { return 0; }
+static inline void bpf_cgroup_storage_release(struct bpf_prog *prog,
+					      struct bpf_map *map) {}
+static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(
+	struct bpf_prog *prog) { return 0; }
+static inline void bpf_cgroup_storage_free(
+	struct bpf_cgroup_storage *storage) {}
+
 #define cgroup_bpf_enabled (0)
 #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5a4a256473c3..9d1e4727495e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -282,6 +282,7 @@  struct bpf_prog_aux {
 	struct bpf_prog *prog;
 	struct user_struct *user;
 	u64 load_time; /* ns since boottime */
+	struct bpf_map *cgroup_storage;
 	char name[BPF_OBJ_NAME_LEN];
 #ifdef CONFIG_SECURITY
 	void *security;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2d5549..add08be53b6f 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -37,6 +37,9 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_PERF_EVENT_ARRAY, perf_event_array_map_ops)
 #ifdef CONFIG_CGROUPS
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
 #endif
+#ifdef CONFIG_CGROUP_BPF
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
+#endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 870113916cac..a0aa53148763 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -75,6 +75,11 @@  struct bpf_lpm_trie_key {
 	__u8	data[0];	/* Arbitrary size */
 };
 
+struct bpf_cgroup_storage_key {
+	__u64	cgroup_inode_id;	/* cgroup inode id */
+	__u32	attach_type;		/* program attach type */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
 	BPF_MAP_CREATE,
@@ -120,6 +125,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_CPUMAP,
 	BPF_MAP_TYPE_XSKMAP,
 	BPF_MAP_TYPE_SOCKHASH,
+	BPF_MAP_TYPE_CGROUP_STORAGE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f27f5496d6fe..e8906cbad81f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -3,6 +3,7 @@  obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
new file mode 100644
index 000000000000..940889eda2c7
--- /dev/null
+++ b/kernel/bpf/local_storage.c
@@ -0,0 +1,367 @@ 
+//SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf-cgroup.h>
+#include <linux/bpf.h>
+#include <linux/bug.h>
+#include <linux/filter.h>
+#include <linux/mm.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+
+#ifdef CONFIG_CGROUP_BPF
+
+struct bpf_cgroup_storage_map {
+	struct bpf_map map;
+	struct bpf_prog *prog;
+
+	spinlock_t lock;
+	struct rb_root root;
+	struct list_head list;
+};
+
+static struct bpf_cgroup_storage_map *map_to_storage(struct bpf_map *map)
+{
+	return container_of(map, struct bpf_cgroup_storage_map, map);
+}
+
+static int bpf_cgroup_storage_key_cmp(
+	const struct bpf_cgroup_storage_key *key1,
+	const struct bpf_cgroup_storage_key *key2)
+{
+	if (key1->cgroup_inode_id < key2->cgroup_inode_id)
+		return -1;
+	else if (key1->cgroup_inode_id > key2->cgroup_inode_id)
+		return 1;
+	else if (key1->attach_type < key2->attach_type)
+		return -1;
+	else if (key1->attach_type > key2->attach_type)
+		return 1;
+	return 0;
+}
+
+static struct bpf_cgroup_storage *cgroup_storage_lookup(
+	struct bpf_cgroup_storage_map *map, struct bpf_cgroup_storage_key *key,
+	bool locked)
+{
+	struct rb_root *root = &map->root;
+	struct rb_node *node;
+
+	/*
+	 * This lock protects rbtree and list of storage entries,
+	 * which are used from the syscall context only.
+	 * So, simple spin_lock()/unlock() is fine here.
+	 */
+	if (!locked)
+		spin_lock(&map->lock);
+
+	node = root->rb_node;
+	while (node) {
+		struct bpf_cgroup_storage *storage;
+
+		storage = container_of(node, struct bpf_cgroup_storage, node);
+
+		switch (bpf_cgroup_storage_key_cmp(key, &storage->key)) {
+		case -1:
+			node = node->rb_left;
+			break;
+		case 1:
+			node = node->rb_right;
+			break;
+		default:
+			if (!locked)
+				spin_unlock(&map->lock);
+			return storage;
+		}
+	}
+
+	if (!locked)
+		spin_unlock(&map->lock);
+
+	return NULL;
+}
+
+static int cgroup_storage_insert(struct bpf_cgroup_storage_map *map,
+				 struct bpf_cgroup_storage *storage)
+{
+	struct rb_root *root = &map->root;
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	while (*new) {
+		struct bpf_cgroup_storage *this;
+
+		this = container_of(*new, struct bpf_cgroup_storage, node);
+
+		parent = *new;
+		switch (bpf_cgroup_storage_key_cmp(&storage->key, &this->key)) {
+		case -1:
+			new = &((*new)->rb_left);
+			break;
+		case 1:
+			new = &((*new)->rb_right);
+			break;
+		default:
+			return -EEXIST;
+		}
+	}
+
+	rb_link_node(&storage->node, parent, new);
+	rb_insert_color(&storage->node, root);
+
+	return 0;
+}
+
+static void *cgroup_storage_lookup_elem(struct bpf_map *_map, void *_key)
+{
+	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
+	struct bpf_cgroup_storage_key *key = _key;
+	struct bpf_cgroup_storage *storage;
+
+	storage = cgroup_storage_lookup(map, key, false);
+	if (!storage)
+		return NULL;
+
+	return &READ_ONCE(storage->buf)->data[0];
+}
+
+static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
+				      void *value, u64 flags)
+{
+	struct bpf_cgroup_storage_key *key = _key;
+	struct bpf_cgroup_storage *storage;
+	struct bpf_storage_buffer *new;
+
+	if (flags & BPF_NOEXIST)
+		return -EINVAL;
+
+	storage = cgroup_storage_lookup((struct bpf_cgroup_storage_map *)map,
+					key, false);
+	if (!storage)
+		return -ENOENT;
+
+	new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
+			   map->value_size, __GFP_ZERO | GFP_USER,
+			   map->numa_node);
+	if (!new)
+		return -ENOMEM;
+
+	memcpy(&new->data[0], value, map->value_size);
+
+	new = xchg(&storage->buf, new);
+	kfree_rcu(new, rcu);
+
+	return 0;
+}
+
+static int cgroup_storage_get_next_key(struct bpf_map *_map, void *_key,
+				       void *_next_key)
+{
+	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
+	struct bpf_cgroup_storage_key *key = _key;
+	struct bpf_cgroup_storage_key *next = _next_key;
+	struct bpf_cgroup_storage *storage;
+
+	spin_lock(&map->lock);
+
+	if (list_empty(&map->list))
+		goto enoent;
+
+	if (key) {
+		storage = cgroup_storage_lookup(map, key, true);
+		if (!storage)
+			goto enoent;
+
+		storage = list_next_entry(storage, list);
+		if (!storage)
+			goto enoent;
+	} else {
+		storage = list_first_entry(&map->list,
+					 struct bpf_cgroup_storage, list);
+	}
+
+	spin_unlock(&map->lock);
+	next->attach_type = storage->key.attach_type;
+	next->cgroup_inode_id = storage->key.cgroup_inode_id;
+	return 0;
+
+enoent:
+	spin_unlock(&map->lock);
+	return -ENOENT;
+}
+
+static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
+{
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_cgroup_storage_map *map;
+
+	if (attr->key_size != sizeof(struct bpf_cgroup_storage_key))
+		return ERR_PTR(-EINVAL);
+
+	if (attr->value_size > PAGE_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map),
+			   __GFP_ZERO | GFP_USER, numa_node);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	map->map.pages = round_up(sizeof(struct bpf_cgroup_storage_map),
+				  PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* copy mandatory map attributes */
+	bpf_map_init_from_attr(&map->map, attr);
+
+	spin_lock_init(&map->lock);
+	map->root = RB_ROOT;
+	INIT_LIST_HEAD(&map->list);
+
+	return &map->map;
+}
+
+static void cgroup_storage_map_free(struct bpf_map *_map)
+{
+	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
+
+	WARN_ON(!RB_EMPTY_ROOT(&map->root));
+	WARN_ON(!list_empty(&map->list));
+
+	kfree(map);
+}
+
+static int cgroup_storage_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+const struct bpf_map_ops cgroup_storage_map_ops = {
+	.map_alloc = cgroup_storage_map_alloc,
+	.map_free = cgroup_storage_map_free,
+	.map_get_next_key = cgroup_storage_get_next_key,
+	.map_lookup_elem = cgroup_storage_lookup_elem,
+	.map_update_elem = cgroup_storage_update_elem,
+	.map_delete_elem = cgroup_storage_delete_elem,
+};
+
+/*
+ * Called by the verifier. bpf_verifier_lock must be locked.
+ */
+int bpf_cgroup_storage_assign(struct bpf_prog *prog, struct bpf_map *_map)
+{
+	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
+
+	if (map->prog && map->prog != prog)
+		return -EBUSY;
+	if (prog->aux->cgroup_storage && prog->aux->cgroup_storage != _map)
+		return -EBUSY;
+
+	map->prog = prog;
+	prog->aux->cgroup_storage = _map;
+
+	return 0;
+}
+
+/*
+ * Called by the verifier. bpf_verifier_lock must be locked.
+ */
+void bpf_cgroup_storage_release(struct bpf_prog *prog, struct bpf_map *_map)
+{
+	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
+
+	if (map->prog == prog) {
+		WARN_ON(prog->aux->cgroup_storage != _map);
+		map->prog = NULL;
+	}
+}
+
+struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog)
+{
+	struct bpf_cgroup_storage *storage;
+	struct bpf_map *map;
+	u32 pages;
+
+	map = prog->aux->cgroup_storage;
+	if (!map)
+		return NULL;
+
+	pages = round_up(sizeof(struct bpf_cgroup_storage) +
+			 sizeof(struct bpf_storage_buffer) +
+			 map->value_size, PAGE_SIZE) >> PAGE_SHIFT;
+	if (bpf_map_charge_memlock(map, pages))
+		return ERR_PTR(-EPERM);
+
+	storage = kmalloc_node(sizeof(struct bpf_cgroup_storage),
+			       __GFP_ZERO | GFP_USER, map->numa_node);
+	if (!storage) {
+		bpf_map_uncharge_memlock(map, pages);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	storage->buf = kmalloc_node(sizeof(struct bpf_storage_buffer) +
+				    map->value_size, __GFP_ZERO | GFP_USER,
+				    map->numa_node);
+	if (!storage->buf) {
+		bpf_map_uncharge_memlock(map, pages);
+		kfree(storage);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	storage->map = (struct bpf_cgroup_storage_map *)map;
+
+	return storage;
+}
+
+void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage)
+{
+	u32 pages;
+	struct bpf_map *map;
+
+	if (!storage)
+		return;
+
+	map = &storage->map->map;
+	pages = round_up(sizeof(struct bpf_cgroup_storage) +
+			 sizeof(struct bpf_storage_buffer) +
+			 map->value_size, PAGE_SIZE) >> PAGE_SHIFT;
+	bpf_map_uncharge_memlock(map, pages);
+
+	kfree_rcu(storage->buf, rcu);
+	kfree_rcu(storage, rcu);
+}
+
+void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
+			     struct cgroup *cgroup,
+			     enum bpf_attach_type type)
+{
+	struct bpf_cgroup_storage_map *map;
+
+	if (!storage)
+		return;
+
+	storage->key.attach_type = type;
+	storage->key.cgroup_inode_id = cgroup->kn->id.id;
+
+	map = storage->map;
+
+	spin_lock(&map->lock);
+	WARN_ON(cgroup_storage_insert(map, storage));
+	list_add(&storage->list, &map->list);
+	spin_unlock(&map->lock);
+}
+
+void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage)
+{
+	struct bpf_cgroup_storage_map *map;
+	struct rb_root *root;
+
+	if (!storage)
+		return;
+
+	map = storage->map;
+
+	spin_lock(&map->lock);
+	root = &map->root;
+	rb_erase(&storage->node, root);
+
+	list_del(&storage->list);
+	spin_unlock(&map->lock);
+}
+
+#endif
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 15d69b278277..0b089ba4595d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5140,6 +5140,14 @@  static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
 				return -E2BIG;
 			}
 
+			if (map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE &&
+			    bpf_cgroup_storage_assign(env->prog, map)) {
+				verbose(env,
+					"only one cgroup storage is allowed\n");
+				fdput(f);
+				return -EBUSY;
+			}
+
 			/* hold the map. If the program is rejected by verifier,
 			 * the map will be released by release_maps() or it
 			 * will be used by the valid program until it's unloaded
@@ -5148,6 +5156,10 @@  static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
 			map = bpf_map_inc(map, false);
 			if (IS_ERR(map)) {
 				fdput(f);
+				if (map->map_type ==
+				    BPF_MAP_TYPE_CGROUP_STORAGE)
+					bpf_cgroup_storage_release(env->prog,
+								   map);
 				return PTR_ERR(map);
 			}
 			env->used_maps[env->used_map_cnt++] = map;
