
[RFC,bpf-next,v3,4/7] bpf: add bpf queue and stack maps

Message ID 153724637114.7866.8418882266337744689.stgit@kernel
State RFC, archived
Delegated to: BPF Maintainers
Series: Implement bpf queue/stack maps

Commit Message

Mauricio Vasquez Sept. 18, 2018, 4:52 a.m. UTC
Implement two new kinds of maps that support the peek, push and pop
operations.

A use case for this is to keep track of a pool of elements, like
network ports in a SNAT.
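
For illustration, a minimal sketch of such a port pool from the BPF program
side, written against the helper signatures proposed in this patch (the map
layout, section name and SNAT logic are made up for the example):

#include <linux/bpf.h>
#include <linux/types.h>
#include "bpf_helpers.h"	/* selftests-style SEC() and helper declarations */

/* pool of free SNAT source ports, one __u16 per element */
struct bpf_map_def SEC("maps") free_ports = {
	.type		= BPF_MAP_TYPE_QUEUE,
	.key_size	= 0,
	.value_size	= sizeof(__u16),
	.max_entries	= 1024,
};

SEC("xdp")
int snat_example(struct xdp_md *ctx)
{
	__u16 port;

	/* take a free port from the pool; fails when the pool is empty */
	if (bpf_map_pop_elem(&free_ports, &port, sizeof(port)))
		return XDP_DROP;

	/* ... rewrite the packet using 'port' ... */

	/* when the translation is torn down, give the port back;
	 * flags == 0 means do not overwrite anything if the pool is full
	 */
	bpf_map_push_elem(&free_ports, &port, sizeof(port), 0);

	return XDP_PASS;
}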

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 include/linux/bpf.h           |    3 
 include/linux/bpf_types.h     |    2 
 include/uapi/linux/bpf.h      |   30 ++++
 kernel/bpf/Makefile           |    2 
 kernel/bpf/core.c             |    3 
 kernel/bpf/helpers.c          |   98 ++++++++++++++
 kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c         |    5 +
 net/core/filter.c             |    6 +
 9 files changed, 437 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c

Comments

Alexei Starovoitov Sept. 18, 2018, 11:27 p.m. UTC | #1
On Tue, Sep 18, 2018 at 06:52:51AM +0200, Mauricio Vasquez B wrote:
> Implement two new kind of maps that support the peek, push and pop
> operations.
> 
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  include/linux/bpf.h           |    3 
>  include/linux/bpf_types.h     |    2 
>  include/uapi/linux/bpf.h      |   30 ++++
>  kernel/bpf/Makefile           |    2 
>  kernel/bpf/core.c             |    3 
>  kernel/bpf/helpers.c          |   98 ++++++++++++++
>  kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c         |    5 +
>  net/core/filter.c             |    6 +
>  9 files changed, 437 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/bpf/queue_stack_maps.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c63a44381d3f..8e924b5c5a0e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -807,6 +807,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>  extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>  extern const struct bpf_func_proto bpf_map_update_elem_proto;
>  extern const struct bpf_func_proto bpf_map_delete_elem_proto;
> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>  
>  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>  extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 33f7f574b983..903a446f14c3 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4cda584c6640..c899386dcb2b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -128,6 +128,8 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_SOCKHASH,
>  	BPF_MAP_TYPE_CGROUP_STORAGE,
>  	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
> +	BPF_MAP_TYPE_QUEUE,
> +	BPF_MAP_TYPE_STACK,
>  };
>  
>  enum bpf_prog_type {
> @@ -460,6 +462,29 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
> + * 			 u64 flags)

since we're passing <=8 byte value there is no need for pointer and len.
Lower bits of 'u64 value' would be faster and easier to use from bpf prog side.
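
A minimal sketch of what that could look like (this is not the code that was
posted; it only illustrates the suggested signature, with flags kept as in the
patch):

BPF_CALL_3(bpf_map_push_elem, struct bpf_map *, map, u64, value, u64, flags)
{
	/* value_size <= 8 is enforced at map creation time, so the
	 * element fits in the lower bits of 'value'
	 */
	return map->ops->map_update_elem(map, NULL, &value, flags);
}

const struct bpf_func_proto bpf_map_push_elem_proto = {
	.func		= bpf_map_push_elem,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_ANYTHING,
	.arg3_type	= ARG_ANYTHING,
};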

> + * 	Description
> + * 		Push an element *value* in *map*. *flags* is one of:
> + *
> + * 		**BPF_EXIST**
> + * 		If the queue/stack is full, the oldest element is removed to
> + * 		make room for this.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
> + * 	Description
> + * 		Pop an element from *map*.
> + * Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)

for pop/peek helpers the 'void *value' makes sense,
but 'len' doesn't need to be there and doesn't need to be checked in runtime.
Verifier should be able to do that.
More below.

> + * 	Description
> + * 		Get an element from *map* without removing it.
> + * Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>   * int bpf_probe_read(void *dst, u32 size, const void *src)
>   * 	Description
>   * 		For tracing programs, safely attempt to read *size* bytes from
> @@ -2227,7 +2252,10 @@ union bpf_attr {
>  	FN(get_current_cgroup_id),	\
>  	FN(get_local_storage),		\
>  	FN(sk_select_reuseport),	\
> -	FN(skb_ancestor_cgroup_id),
> +	FN(skb_ancestor_cgroup_id),	\
> +	FN(map_push_elem),		\
> +	FN(map_pop_elem),		\
> +	FN(map_peek_elem),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index e656bce87c8f..2d77bc5b2aca 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -3,7 +3,7 @@ obj-y := core.o
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
>  ifeq ($(CONFIG_NET),y)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3f5bf1af0826..8d2db076d123 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
>  const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
>  const struct bpf_func_proto bpf_map_update_elem_proto __weak;
>  const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_push_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
>  
>  const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
>  const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 1991466b8327..5f364e6acaf1 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -76,6 +76,104 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>  	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>  };
>  
> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
> +	   u64, flags)
> +{
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	return map->ops->map_update_elem(map, NULL, value, flags);
> +}
> +
> +const struct bpf_func_proto bpf_map_push_elem_proto = {
> +	.func		= bpf_map_push_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +	.arg4_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
> +{
> +	void *ptr;
> +
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> +	if (!ptr)
> +		return -ENOENT;
> +
> +	switch (size) {
> +	case 1:
> +		*(u8 *) value = *(u8 *) ptr;
> +		break;
> +	case 2:
> +		*(u16 *) value = *(u16 *) ptr;
> +		break;
> +	case 4:
> +		*(u32 *) value = *(u32 *) ptr;
> +		break;
> +	case 8:
> +		*(u64 *) value = *(u64 *) ptr;
> +		break;

this is inefficient. can we pass value ptr into ops and let it populate it?
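
One way to read this (a sketch, not part of the posted patch) is a new
callback in struct bpf_map_ops that receives the destination buffer, so the
map implementation does the copy itself:

	/* sketch: new member of struct bpf_map_ops */
	int (*map_pop_elem)(struct bpf_map *map, void *value);

The helper would then be a thin wrapper:

BPF_CALL_2(bpf_map_pop_elem, struct bpf_map *, map, void *, value)
{
	return map->ops->map_pop_elem(map, value);
}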

> +	}
> +
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_map_pop_elem_proto = {
> +	.func		= bpf_map_pop_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +};
> +
> +BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
> +{
> +	void *ptr;
> +
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	ptr = map->ops->map_lookup_elem(map, NULL);
> +	if (!ptr)
> +		return -ENOENT;
> +
> +	switch (size) {
> +	case 1:
> +		*(u8 *) value = *(u8 *) ptr;
> +		break;
> +	case 2:
> +		*(u16 *) value = *(u16 *) ptr;
> +		break;
> +	case 4:
> +		*(u32 *) value = *(u32 *) ptr;
> +		break;
> +	case 8:
> +		*(u64 *) value = *(u64 *) ptr;
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_map_peek_elem_proto = {
> +	.func		= bpf_map_pop_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +};
> +
>  const struct bpf_func_proto bpf_get_prandom_u32_proto = {
>  	.func		= bpf_user_rnd_u32,
>  	.gpl_only	= false,
> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> new file mode 100644
> index 000000000000..10c081f3f02b
> --- /dev/null
> +++ b/kernel/bpf/queue_stack_maps.c
> @@ -0,0 +1,291 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * queue_stack_maps.c: BPF queue and stack maps
> + *
> + * Copyright (c) 2018 Politecnico di Torino
> + */
> +#include <linux/bpf.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include "percpu_freelist.h"
> +
> +#define QUEUE_STACK_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
> +
> +
> +struct bpf_queue_stack {
> +	struct bpf_map map;
> +	raw_spinlock_t lock;
> +	u32 head, tail;
> +	u32 index_mask;
> +	u32 count;
> +
> +	char elements[0] __aligned(8);
> +};
> +
> +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
> +{
> +	return container_of(map, struct bpf_queue_stack, map);
> +}
> +
> +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
> +{
> +	return qs->count == 0;
> +}
> +
> +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
> +{
> +	return qs->count == qs->map.max_entries;
> +}
> +
> +/* Called from syscall */
> +static int queue_stack_map_alloc_check(union bpf_attr *attr)
> +{
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
> +	    (attr->value_size != 1 && attr->value_size != 2 &&
> +	    attr->value_size != 4 && attr->value_size != 8) ||
> +	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
> +		return -EINVAL;
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		/* if value_size is bigger, the user space won't be able to
> +		 * access the elements.
> +		 */
> +		return -E2BIG;
> +
> +	return 0;
> +}
> +
> +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
> +{
> +	int ret, numa_node = bpf_map_attr_numa_node(attr);
> +	u32 max_entries, value_size, index_mask;
> +	u64 queue_size, cost, mask64;
> +	struct bpf_queue_stack *qs;
> +
> +	max_entries = attr->max_entries;
> +	value_size = attr->value_size;
> +
> +	/* From arraymap.c:
> +	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
> +	 * upper most bit set in u32 space is undefined behavior due to
> +	 * resulting 1U << 32, so do it manually here in u64 space.
> +	 */
> +	mask64 = fls_long(max_entries - 1);
> +	mask64 = 1ULL << mask64;
> +	mask64 -= 1;
> +
> +	index_mask = mask64;
> +
> +	/* Round up queue size to nearest power of 2 */
> +	max_entries = index_mask + 1;
> +	/* Check for overflows. */
> +	if (max_entries < attr->max_entries)
> +		return ERR_PTR(-E2BIG);
> +
> +	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
> +
> +	cost = queue_size;
> +	if (cost >= U32_MAX - PAGE_SIZE)
> +		return ERR_PTR(-E2BIG);
> +
> +	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	ret = bpf_map_precharge_memlock(cost);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	qs = bpf_map_area_alloc(queue_size, numa_node);
> +	if (!qs)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memset(qs, 0, sizeof(*qs));
> +
> +	bpf_map_init_from_attr(&qs->map, attr);
> +
> +	qs->map.pages = cost;
> +	qs->index_mask = index_mask;
> +
> +	raw_spin_lock_init(&qs->lock);
> +
> +	return &qs->map;
> +}
> +
> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
> +static void queue_stack_map_free(struct bpf_map *map)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +
> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> +	 * so the programs (can be more than one that used this map) were
> +	 * disconnected from events. Wait for outstanding critical sections in
> +	 * these programs to complete
> +	 */
> +	synchronize_rcu();
> +
> +	bpf_map_area_free(qs);
> +}
> +
> +static void *__queue_map_lookup(struct bpf_map *map, bool delete)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long flags;
> +	void *ptr = NULL;
> +
> +	raw_spin_lock_irqsave(&qs->lock, flags);
> +
> +	if (queue_stack_map_is_empty(qs))
> +		goto out;
> +
> +	ptr = &qs->elements[qs->tail * qs->map.value_size];
> +
> +	if (delete) {
> +		qs->tail = (qs->tail + 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
> +	return ptr;
> +}
> +
> +
> +static void *__stack_map_lookup(struct bpf_map *map, bool delete)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long flags;
> +	void *ptr = NULL;
> +	u32 index;
> +
> +	raw_spin_lock_irqsave(&qs->lock, flags);
> +
> +	if (queue_stack_map_is_empty(qs))
> +		goto out;
> +
> +	index = (qs->head - 1) & qs->index_mask;
> +	ptr = &qs->elements[index * qs->map.value_size];
> +
> +	if (delete) {
> +		qs->head = (qs->head - 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, flags);

here it's racing with another cpu accessing the same map.
rcu doesn't protect from this particular qs->element being reused
by another cpu.

> +	return ptr;

say, stack is full. one cpu is doing pop(), getting this ptr,
while another cpu is doing push() and overwriting this memory.

> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return __queue_map_lookup(map, false);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return __stack_map_lookup(map, false);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return __queue_map_lookup(map, true);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return __stack_map_lookup(map, true);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
> +				       void *value, u64 flags)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long irq_flags;
> +	int err = 0;
> +	void *dst;
> +
> +	/* BPF_EXIST is used to force making room for a new element in case the
> +	 * map is full
> +	 */
> +	bool replace = (flags & BPF_EXIST);
> +
> +	/* Check supported flags for queue and stqueue_map_is_preallocack maps */
> +	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
> +		return -EINVAL;
> +
> +	raw_spin_lock_irqsave(&qs->lock, irq_flags);
> +
> +	if (queue_stack_map_is_full(qs)) {
> +		if (!replace) {
> +			err = -E2BIG;
> +			goto out;
> +		}
> +		/* advance tail pointer to overwrite oldest element */
> +		qs->tail = (qs->tail + 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +	dst = &qs->elements[qs->head * qs->map.value_size];
> +
> +	switch (qs->map.value_size) {
> +	case 1:
> +		*(u8 *) dst = *(u8 *) value;
> +		break;
> +	case 2:
> +		*(u16 *) dst = *(u16 *) value;
> +		break;
> +	case 4:
> +		*(u32 *) dst = *(u32 *) value;
> +		break;
> +	case 8:
> +		*(u64 *) dst = *(u64 *) value;
> +		break;
> +	}
> +
> +	qs->head = (qs->head + 1) & qs->index_mask;
> +	qs->count++;
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
> +	return err;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -EINVAL;
> +}
> +
> +/* Called from syscall */
> +static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
> +					void *next_key)
> +{
> +	return -EINVAL;
> +}
> +
> +const struct bpf_map_ops queue_map_ops = {
> +	.map_alloc_check = queue_stack_map_alloc_check,
> +	.map_alloc = queue_stack_map_alloc,
> +	.map_free = queue_stack_map_free,
> +	.map_lookup_elem = queue_map_lookup_elem,
> +	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
> +	.map_update_elem = queue_stack_map_update_elem,
> +	.map_delete_elem = queue_stack_map_delete_elem,
> +	.map_get_next_key = queue_stack_map_get_next_key,
> +};
> +
> +const struct bpf_map_ops stack_map_ops = {
> +	.map_alloc_check = queue_stack_map_alloc_check,
> +	.map_alloc = queue_stack_map_alloc,
> +	.map_free = queue_stack_map_free,
> +	.map_lookup_elem = stack_map_lookup_elem,
> +	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
> +	.map_update_elem = queue_stack_map_update_elem,
> +	.map_delete_elem = queue_stack_map_delete_elem,
> +	.map_get_next_key = queue_stack_map_get_next_key,
> +};
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f4ff0c569e54..b9e005188f0e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2369,7 +2369,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>  	if (func_id != BPF_FUNC_tail_call &&
>  	    func_id != BPF_FUNC_map_lookup_elem &&
>  	    func_id != BPF_FUNC_map_update_elem &&
> -	    func_id != BPF_FUNC_map_delete_elem)
> +	    func_id != BPF_FUNC_map_delete_elem &&
> +	    func_id != BPF_FUNC_map_push_elem &&
> +	    func_id != BPF_FUNC_map_pop_elem &&
> +	    func_id != BPF_FUNC_map_peek_elem)
>  		return 0;
>  
>  	if (meta->map_ptr == NULL) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index feb578506009..c7b73376c23a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4839,6 +4839,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  		return &bpf_map_update_elem_proto;
>  	case BPF_FUNC_map_delete_elem:
>  		return &bpf_map_delete_elem_proto;
> +	case BPF_FUNC_map_push_elem:
> +		return &bpf_map_push_elem_proto;
> +	case BPF_FUNC_map_pop_elem:
> +		return &bpf_map_pop_elem_proto;
> +	case BPF_FUNC_map_peek_elem:
> +		return &bpf_map_peek_elem_proto;
>  	case BPF_FUNC_get_prandom_u32:
>  		return &bpf_get_prandom_u32_proto;
>  	case BPF_FUNC_get_smp_processor_id:
>
Mauricio Vasquez Sept. 19, 2018, 4:28 a.m. UTC | #2
On 09/18/2018 06:27 PM, Alexei Starovoitov wrote:
> On Tue, Sep 18, 2018 at 06:52:51AM +0200, Mauricio Vasquez B wrote:
>> Implement two new kind of maps that support the peek, push and pop
>> operations.
>>
>> A use case for this is to keep track of a pool of elements, like
>> network ports in a SNAT.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf.h           |    3
>>   include/linux/bpf_types.h     |    2
>>   include/uapi/linux/bpf.h      |   30 ++++
>>   kernel/bpf/Makefile           |    2
>>   kernel/bpf/core.c             |    3
>>   kernel/bpf/helpers.c          |   98 ++++++++++++++
>>   kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/verifier.c         |    5 +
>>   net/core/filter.c             |    6 +
>>   9 files changed, 437 insertions(+), 3 deletions(-)
>>   create mode 100644 kernel/bpf/queue_stack_maps.c
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index c63a44381d3f..8e924b5c5a0e 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -807,6 +807,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>>   extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>>   extern const struct bpf_func_proto bpf_map_update_elem_proto;
>>   extern const struct bpf_func_proto bpf_map_delete_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>>   
>>   extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>>   extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 33f7f574b983..903a446f14c3 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>>   #endif
>>   #endif
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4cda584c6640..c899386dcb2b 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -128,6 +128,8 @@ enum bpf_map_type {
>>   	BPF_MAP_TYPE_SOCKHASH,
>>   	BPF_MAP_TYPE_CGROUP_STORAGE,
>>   	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
>> +	BPF_MAP_TYPE_QUEUE,
>> +	BPF_MAP_TYPE_STACK,
>>   };
>>   
>>   enum bpf_prog_type {
>> @@ -460,6 +462,29 @@ union bpf_attr {
>>    * 	Return
>>    * 		0 on success, or a negative error in case of failure.
>>    *
>> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
>> + * 			 u64 flags)
> since we're passing <=8 byte value there is no need for pointer and len.
> Lower bits of 'u64 value' would be faster and easier to use from bpf prog side.
I was in doubt between the two solutions; I will change it to take only a u64 value.

>> + * 	Description
>> + * 		Push an element *value* in *map*. *flags* is one of:
>> + *
>> + * 		**BPF_EXIST**
>> + * 		If the queue/stack is full, the oldest element is removed to
>> + * 		make room for this.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
>> + * 	Description
>> + * 		Pop an element from *map*.
>> + * Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)
> for pop/peak helpers the 'void *value' makes sense,
> but 'len' doesn't need to be there and doesn't need to be checked in runtime.
> Verifier should be able to do that.

You are right, ARG_PTR_TO_MAP_VALUE should do these tests automatically 
based on map->value_size.
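
That is, the proto could drop the size argument entirely and look roughly like
this (a sketch; the variant for an uninitialized destination buffer is settled
later in the thread):

const struct bpf_func_proto bpf_map_peek_elem_proto = {
	.func		= bpf_map_peek_elem,	/* now takes (map, value) only */
	.gpl_only	= false,
	.pkt_access	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_VALUE,	/* verifier checks map->value_size */
};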

> More below.
>
>> + * 	Description
>> + * 		Get an element from *map* without removing it.
>> + * Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>>    * int bpf_probe_read(void *dst, u32 size, const void *src)
>>    * 	Description
>>    * 		For tracing programs, safely attempt to read *size* bytes from
>> @@ -2227,7 +2252,10 @@ union bpf_attr {
>>   	FN(get_current_cgroup_id),	\
>>   	FN(get_local_storage),		\
>>   	FN(sk_select_reuseport),	\
>> -	FN(skb_ancestor_cgroup_id),
>> +	FN(skb_ancestor_cgroup_id),	\
>> +	FN(map_push_elem),		\
>> +	FN(map_pop_elem),		\
>> +	FN(map_peek_elem),
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>    * function eBPF program intends to call
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index e656bce87c8f..2d77bc5b2aca 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -3,7 +3,7 @@ obj-y := core.o
>>   
>>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
>> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>>   obj-$(CONFIG_BPF_SYSCALL) += btf.o
>>   ifeq ($(CONFIG_NET),y)
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 3f5bf1af0826..8d2db076d123 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
>>   const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
>>   const struct bpf_func_proto bpf_map_update_elem_proto __weak;
>>   const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_push_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
>>   
>>   const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
>>   const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 1991466b8327..5f364e6acaf1 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -76,6 +76,104 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>>   	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>>   };
>>   
>> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
>> +	   u64, flags)
>> +{
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	return map->ops->map_update_elem(map, NULL, value, flags);
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_push_elem_proto = {
>> +	.func		= bpf_map_push_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +	.arg4_type	= ARG_ANYTHING,
>> +};
>> +
>> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
>> +{
>> +	void *ptr;
>> +
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
>> +	if (!ptr)
>> +		return -ENOENT;
>> +
>> +	switch (size) {
>> +	case 1:
>> +		*(u8 *) value = *(u8 *) ptr;
>> +		break;
>> +	case 2:
>> +		*(u16 *) value = *(u16 *) ptr;
>> +		break;
>> +	case 4:
>> +		*(u32 *) value = *(u32 *) ptr;
>> +		break;
>> +	case 8:
>> +		*(u64 *) value = *(u64 *) ptr;
>> +		break;
> this is inefficient. can we pass value ptr into ops and let it populate it?

I don't think so; doing that implies that lookup_and_delete will be a 
per-value op, while the other ops in maps are per-reference.
For instance, how would we change it in the case of the peek helper, which is 
using the lookup operation? We cannot change the signature of the lookup 
operation.

This is something that worries me a little bit: we are creating new 
per-value helpers based on already existing per-reference operations, 
which is probably not the cleanest way.  Here we are at the beginning of 
the discussion once again: how should we map helpers and syscalls to ops?

What about creating pop/peek/push ops, mapping helpers one to one and 
adding some logic into syscall.c to call the correct operation in case 
the map is stack/queue?
Syscall mapping would be:
bpf_map_lookup_elem() -> peek
bpf_map_lookup_and_delete_elem() -> pop
bpf_map_update_elem() -> push

Does it make sense?
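
For concreteness, a rough sketch of the ops this would add (names are
assumptions, not part of this patch):

	/* sketch: new callbacks in struct bpf_map_ops */
	int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
	int (*map_pop_elem)(struct bpf_map *map, void *value);
	int (*map_peek_elem)(struct bpf_map *map, void *value);

The push/pop/peek helpers would call these one to one, and syscall.c would
route the existing commands to them for queue/stack maps as listed above.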

>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_pop_elem_proto = {
>> +	.func		= bpf_map_pop_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +};
>> +
>> +BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
>> +{
>> +	void *ptr;
>> +
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	ptr = map->ops->map_lookup_elem(map, NULL);
>> +	if (!ptr)
>> +		return -ENOENT;
>> +
>> +	switch (size) {
>> +	case 1:
>> +		*(u8 *) value = *(u8 *) ptr;
>> +		break;
>> +	case 2:
>> +		*(u16 *) value = *(u16 *) ptr;
>> +		break;
>> +	case 4:
>> +		*(u32 *) value = *(u32 *) ptr;
>> +		break;
>> +	case 8:
>> +		*(u64 *) value = *(u64 *) ptr;
>> +		break;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_peek_elem_proto = {
>> +	.func		= bpf_map_pop_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +};
>> +
>>   const struct bpf_func_proto bpf_get_prandom_u32_proto = {
>>   	.func		= bpf_user_rnd_u32,
>>   	.gpl_only	= false,
>> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
>> new file mode 100644
>> index 000000000000..10c081f3f02b
>> --- /dev/null
>> +++ b/kernel/bpf/queue_stack_maps.c
>> @@ -0,0 +1,291 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * queue_stack_maps.c: BPF queue and stack maps
>> + *
>> + * Copyright (c) 2018 Politecnico di Torino
>> + */
>> +#include <linux/bpf.h>
>> +#include <linux/list.h>
>> +#include <linux/slab.h>
>> +#include "percpu_freelist.h"
>> +
>> +#define QUEUE_STACK_CREATE_FLAG_MASK \
>> +	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>> +
>> +
>> +struct bpf_queue_stack {
>> +	struct bpf_map map;
>> +	raw_spinlock_t lock;
>> +	u32 head, tail;
>> +	u32 index_mask;
>> +	u32 count;
>> +
>> +	char elements[0] __aligned(8);
>> +};
>> +
>> +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
>> +{
>> +	return container_of(map, struct bpf_queue_stack, map);
>> +}
>> +
>> +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
>> +{
>> +	return qs->count == 0;
>> +}
>> +
>> +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
>> +{
>> +	return qs->count == qs->map.max_entries;
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_stack_map_alloc_check(union bpf_attr *attr)
>> +{
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
>> +	    (attr->value_size != 1 && attr->value_size != 2 &&
>> +	    attr->value_size != 4 && attr->value_size != 8) ||
>> +	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
>> +		return -EINVAL;
>> +
>> +	if (attr->value_size > KMALLOC_MAX_SIZE)
>> +		/* if value_size is bigger, the user space won't be able to
>> +		 * access the elements.
>> +		 */
>> +		return -E2BIG;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
>> +{
>> +	int ret, numa_node = bpf_map_attr_numa_node(attr);
>> +	u32 max_entries, value_size, index_mask;
>> +	u64 queue_size, cost, mask64;
>> +	struct bpf_queue_stack *qs;
>> +
>> +	max_entries = attr->max_entries;
>> +	value_size = attr->value_size;
>> +
>> +	/* From arraymap.c:
>> +	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
>> +	 * upper most bit set in u32 space is undefined behavior due to
>> +	 * resulting 1U << 32, so do it manually here in u64 space.
>> +	 */
>> +	mask64 = fls_long(max_entries - 1);
>> +	mask64 = 1ULL << mask64;
>> +	mask64 -= 1;
>> +
>> +	index_mask = mask64;
>> +
>> +	/* Round up queue size to nearest power of 2 */
>> +	max_entries = index_mask + 1;
>> +	/* Check for overflows. */
>> +	if (max_entries < attr->max_entries)
>> +		return ERR_PTR(-E2BIG);
>> +
>> +	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
>> +
>> +	cost = queue_size;
>> +	if (cost >= U32_MAX - PAGE_SIZE)
>> +		return ERR_PTR(-E2BIG);
>> +
>> +	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>> +
>> +	ret = bpf_map_precharge_memlock(cost);
>> +	if (ret < 0)
>> +		return ERR_PTR(ret);
>> +
>> +	qs = bpf_map_area_alloc(queue_size, numa_node);
>> +	if (!qs)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	memset(qs, 0, sizeof(*qs));
>> +
>> +	bpf_map_init_from_attr(&qs->map, attr);
>> +
>> +	qs->map.pages = cost;
>> +	qs->index_mask = index_mask;
>> +
>> +	raw_spin_lock_init(&qs->lock);
>> +
>> +	return &qs->map;
>> +}
>> +
>> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
>> +static void queue_stack_map_free(struct bpf_map *map)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +
>> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
>> +	 * so the programs (can be more than one that used this map) were
>> +	 * disconnected from events. Wait for outstanding critical sections in
>> +	 * these programs to complete
>> +	 */
>> +	synchronize_rcu();
>> +
>> +	bpf_map_area_free(qs);
>> +}
>> +
>> +static void *__queue_map_lookup(struct bpf_map *map, bool delete)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long flags;
>> +	void *ptr = NULL;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, flags);
>> +
>> +	if (queue_stack_map_is_empty(qs))
>> +		goto out;
>> +
>> +	ptr = &qs->elements[qs->tail * qs->map.value_size];
>> +
>> +	if (delete) {
>> +		qs->tail = (qs->tail + 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
>> +	return ptr;
>> +}
>> +
>> +
>> +static void *__stack_map_lookup(struct bpf_map *map, bool delete)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long flags;
>> +	void *ptr = NULL;
>> +	u32 index;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, flags);
>> +
>> +	if (queue_stack_map_is_empty(qs))
>> +		goto out;
>> +
>> +	index = (qs->head - 1) & qs->index_mask;
>> +	ptr = &qs->elements[index * qs->map.value_size];
>> +
>> +	if (delete) {
>> +		qs->head = (qs->head - 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
> here it's racing with another cpu accessing the same map.
> rcu doesn't protect from this particular qs->element being reused
> by another cpu.
>
>> +	return ptr;
> say, stack is full. one cpu is doing pop(), getting this ptr,
> while another cpu doing push() and overwriting this memory.
Yes, you are right.  The lock must be held until the copy is done.  The 
solution depends on what ops we agree to create.
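
For example, a sketch of how the queue pop path could copy the element while
still holding qs->lock, assuming a pop op that receives the destination buffer
as discussed above:

static int queue_map_pop_elem(struct bpf_map *map, void *value)
{
	struct bpf_queue_stack *qs = bpf_queue_stack(map);
	unsigned long flags;
	int err = 0;
	void *ptr;

	raw_spin_lock_irqsave(&qs->lock, flags);

	if (queue_stack_map_is_empty(qs)) {
		err = -ENOENT;
		goto out;
	}

	ptr = &qs->elements[qs->tail * qs->map.value_size];
	/* copy out before dropping the lock, so a concurrent push
	 * cannot overwrite the element we are returning
	 */
	memcpy(value, ptr, qs->map.value_size);

	qs->tail = (qs->tail + 1) & qs->index_mask;
	qs->count--;

out:
	raw_spin_unlock_irqrestore(&qs->lock, flags);
	return err;
}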
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __queue_map_lookup(map, false);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __stack_map_lookup(map, false);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __queue_map_lookup(map, true);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __stack_map_lookup(map, true);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
>> +				       void *value, u64 flags)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long irq_flags;
>> +	int err = 0;
>> +	void *dst;
>> +
>> +	/* BPF_EXIST is used to force making room for a new element in case the
>> +	 * map is full
>> +	 */
>> +	bool replace = (flags & BPF_EXIST);
>> +
>> +	/* Check supported flags for queue and stqueue_map_is_preallocack maps */
>> +	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
>> +		return -EINVAL;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, irq_flags);
>> +
>> +	if (queue_stack_map_is_full(qs)) {
>> +		if (!replace) {
>> +			err = -E2BIG;
>> +			goto out;
>> +		}
>> +		/* advance tail pointer to overwrite oldest element */
>> +		qs->tail = (qs->tail + 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +	dst = &qs->elements[qs->head * qs->map.value_size];
>> +
>> +	switch (qs->map.value_size) {
>> +	case 1:
>> +		*(u8 *) dst = *(u8 *) value;
>> +		break;
>> +	case 2:
>> +		*(u16 *) dst = *(u16 *) value;
>> +		break;
>> +	case 4:
>> +		*(u32 *) dst = *(u32 *) value;
>> +		break;
>> +	case 8:
>> +		*(u64 *) dst = *(u64 *) value;
>> +		break;
>> +	}
>> +
>> +	qs->head = (qs->head + 1) & qs->index_mask;
>> +	qs->count++;
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
>> +	return err;
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
>> +					void *next_key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +const struct bpf_map_ops queue_map_ops = {
>> +	.map_alloc_check = queue_stack_map_alloc_check,
>> +	.map_alloc = queue_stack_map_alloc,
>> +	.map_free = queue_stack_map_free,
>> +	.map_lookup_elem = queue_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
>> +	.map_update_elem = queue_stack_map_update_elem,
>> +	.map_delete_elem = queue_stack_map_delete_elem,
>> +	.map_get_next_key = queue_stack_map_get_next_key,
>> +};
>> +
>> +const struct bpf_map_ops stack_map_ops = {
>> +	.map_alloc_check = queue_stack_map_alloc_check,
>> +	.map_alloc = queue_stack_map_alloc,
>> +	.map_free = queue_stack_map_free,
>> +	.map_lookup_elem = stack_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
>> +	.map_update_elem = queue_stack_map_update_elem,
>> +	.map_delete_elem = queue_stack_map_delete_elem,
>> +	.map_get_next_key = queue_stack_map_get_next_key,
>> +};
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c569e54..b9e005188f0e 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2369,7 +2369,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>>   	if (func_id != BPF_FUNC_tail_call &&
>>   	    func_id != BPF_FUNC_map_lookup_elem &&
>>   	    func_id != BPF_FUNC_map_update_elem &&
>> -	    func_id != BPF_FUNC_map_delete_elem)
>> +	    func_id != BPF_FUNC_map_delete_elem &&
>> +	    func_id != BPF_FUNC_map_push_elem &&
>> +	    func_id != BPF_FUNC_map_pop_elem &&
>> +	    func_id != BPF_FUNC_map_peek_elem)
>>   		return 0;
>>   
>>   	if (meta->map_ptr == NULL) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index feb578506009..c7b73376c23a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4839,6 +4839,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>>   		return &bpf_map_update_elem_proto;
>>   	case BPF_FUNC_map_delete_elem:
>>   		return &bpf_map_delete_elem_proto;
>> +	case BPF_FUNC_map_push_elem:
>> +		return &bpf_map_push_elem_proto;
>> +	case BPF_FUNC_map_pop_elem:
>> +		return &bpf_map_pop_elem_proto;
>> +	case BPF_FUNC_map_peek_elem:
>> +		return &bpf_map_peek_elem_proto;
>>   	case BPF_FUNC_get_prandom_u32:
>>   		return &bpf_get_prandom_u32_proto;
>>   	case BPF_FUNC_get_smp_processor_id:
>>
Alexei Starovoitov Oct. 2, 2018, 12:26 a.m. UTC | #3
On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
> > > > +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
> > > > value, u32, size)
> > > > +{
> > > > +    void *ptr;
> > > > +
> > > > +    if (map->value_size != size)
> > > > +        return -EINVAL;
> > > > +
> > > > +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> > > > +    if (!ptr)
> > > > +        return -ENOENT;
> > > > +
> > > > +    switch (size) {
> > > > +    case 1:
> > > > +        *(u8 *) value = *(u8 *) ptr;
> > > > +        break;
> > > > +    case 2:
> > > > +        *(u16 *) value = *(u16 *) ptr;
> > > > +        break;
> > > > +    case 4:
> > > > +        *(u32 *) value = *(u32 *) ptr;
> > > > +        break;
> > > > +    case 8:
> > > > +        *(u64 *) value = *(u64 *) ptr;
> > > > +        break;
> > > this is inefficient. can we pass value ptr into ops and let it
> > > populate it?
> > 
> > I don't think so, doing that implies that look_and_delete will be a
> > per-value op, while other ops in maps are per-reference.
> > For instance, how to change it in the case of peek helper that is using
> > the lookup operation?, we cannot change the signature of the lookup
> > operation.
> > 
> > This is something that worries me a little bit, we are creating new
> > per-value helpers based on already existing per-reference operations,
> > this is not probably the cleanest way.  Here we are at the beginning of
> > the discussion once again, how should we map helpers and syscalls to
> > ops.
> > 
> > What about creating pop/peek/push ops, mapping helpers one to one and
> > adding some logic into syscall.c to call the correct operation in case
> > the map is stack/queue?
> > Syscall mapping would be:
> > bpf_map_lookup_elem() -> peek
> > bpf_map_lookup_and_delete_elem() -> pop
> > bpf_map_update_elem() -> push
> > 
> > Does it make sense?
> 
> Hello Alexei,
> 
> Do you have any feedback on this specific part?

Indeed. It seems push/pop ops will be cleaner.
I still think that peek() is useless due to races.
So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
right?
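
A rough sketch of the syscall-side dispatch this implies (illustrative only;
helper and field names are assumptions):

	/* BPF_MAP_UPDATE_ELEM handler (sketch) */
	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
	    map->map_type == BPF_MAP_TYPE_STACK)
		err = map->ops->map_push_elem(map, value, attr->flags);

	/* new BPF_MAP_LOOKUP_AND_DELETE_ELEM handler (sketch) */
	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
	    map->map_type == BPF_MAP_TYPE_STACK)
		err = map->ops->map_pop_elem(map, value);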
Mauricio Vasquez Oct. 3, 2018, 5:01 p.m. UTC | #4
On 10/01/2018 07:26 PM, Alexei Starovoitov wrote:
> On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
>>>>> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
>>>>> value, u32, size)
>>>>> +{
>>>>> +    void *ptr;
>>>>> +
>>>>> +    if (map->value_size != size)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
>>>>> +    if (!ptr)
>>>>> +        return -ENOENT;
>>>>> +
>>>>> +    switch (size) {
>>>>> +    case 1:
>>>>> +        *(u8 *) value = *(u8 *) ptr;
>>>>> +        break;
>>>>> +    case 2:
>>>>> +        *(u16 *) value = *(u16 *) ptr;
>>>>> +        break;
>>>>> +    case 4:
>>>>> +        *(u32 *) value = *(u32 *) ptr;
>>>>> +        break;
>>>>> +    case 8:
>>>>> +        *(u64 *) value = *(u64 *) ptr;
>>>>> +        break;
>>>> this is inefficient. can we pass value ptr into ops and let it
>>>> populate it?
>>> I don't think so, doing that implies that look_and_delete will be a
>>> per-value op, while other ops in maps are per-reference.
>>> For instance, how to change it in the case of peek helper that is using
>>> the lookup operation?, we cannot change the signature of the lookup
>>> operation.
>>>
>>> This is something that worries me a little bit, we are creating new
>>> per-value helpers based on already existing per-reference operations,
>>> this is not probably the cleanest way.  Here we are at the beginning of
>>> the discussion once again, how should we map helpers and syscalls to
>>> ops.
>>>
>>> What about creating pop/peek/push ops, mapping helpers one to one and
>>> adding some logic into syscall.c to call the correct operation in case
>>> the map is stack/queue?
>>> Syscall mapping would be:
>>> bpf_map_lookup_elem() -> peek
>>> bpf_map_lookup_and_delete_elem() -> pop
>>> bpf_map_update_elem() -> push
>>>
>>> Does it make sense?
>> Hello Alexei,
>>
>> Do you have any feedback on this specific part?
> Indeed. It seems push/pop ops will be cleaner.
> I still think that peek() is useless due to races.
> So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
> and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
> right?
>
>
That's right.

While updating the push API something came to me: do we have any specific 
reason to support only 1, 2, 4 and 8 bytes? I think we could make it general 
enough to support any number of bytes; if the user is worried about the 
cost of memcpys he could use a map of 8-byte pointers, as you mentioned 
some time ago.
From an API point of view, the pop/peek helpers already expect a void 
*value (int bpf_map_[pop, peek]_elem(map, void *value)); the only change 
would be to also use a pointer in push instead of a u64.
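
A sketch of what relaxing that restriction could look like, building on the
alloc check and update path in this patch (illustrative only):

	/* queue_stack_map_alloc_check(): only require a non-zero value_size */
	if (attr->max_entries == 0 || attr->key_size != 0 ||
	    attr->value_size == 0 ||
	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
		return -EINVAL;

	/* push/pop paths: copy value_size bytes instead of switching on 1/2/4/8 */
	dst = &qs->elements[qs->head * qs->map.value_size];
	memcpy(dst, value, qs->map.value_size);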
Alexei Starovoitov Oct. 3, 2018, 9:10 p.m. UTC | #5
On Wed, Oct 03, 2018 at 12:01:37PM -0500, Mauricio Vasquez wrote:
> 
> 
> On 10/01/2018 07:26 PM, Alexei Starovoitov wrote:
> > On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
> > > > > > +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
> > > > > > value, u32, size)
> > > > > > +{
> > > > > > +    void *ptr;
> > > > > > +
> > > > > > +    if (map->value_size != size)
> > > > > > +        return -EINVAL;
> > > > > > +
> > > > > > +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> > > > > > +    if (!ptr)
> > > > > > +        return -ENOENT;
> > > > > > +
> > > > > > +    switch (size) {
> > > > > > +    case 1:
> > > > > > +        *(u8 *) value = *(u8 *) ptr;
> > > > > > +        break;
> > > > > > +    case 2:
> > > > > > +        *(u16 *) value = *(u16 *) ptr;
> > > > > > +        break;
> > > > > > +    case 4:
> > > > > > +        *(u32 *) value = *(u32 *) ptr;
> > > > > > +        break;
> > > > > > +    case 8:
> > > > > > +        *(u64 *) value = *(u64 *) ptr;
> > > > > > +        break;
> > > > > this is inefficient. can we pass value ptr into ops and let it
> > > > > populate it?
> > > > I don't think so, doing that implies that look_and_delete will be a
> > > > per-value op, while other ops in maps are per-reference.
> > > > For instance, how to change it in the case of peek helper that is using
> > > > the lookup operation?, we cannot change the signature of the lookup
> > > > operation.
> > > > 
> > > > This is something that worries me a little bit, we are creating new
> > > > per-value helpers based on already existing per-reference operations,
> > > > this is not probably the cleanest way.  Here we are at the beginning of
> > > > the discussion once again, how should we map helpers and syscalls to
> > > > ops.
> > > > 
> > > > What about creating pop/peek/push ops, mapping helpers one to one and
> > > > adding some logic into syscall.c to call the correct operation in case
> > > > the map is stack/queue?
> > > > Syscall mapping would be:
> > > > bpf_map_lookup_elem() -> peek
> > > > bpf_map_lookup_and_delete_elem() -> pop
> > > > bpf_map_update_elem() -> push
> > > > 
> > > > Does it make sense?
> > > Hello Alexei,
> > > 
> > > Do you have any feedback on this specific part?
> > Indeed. It seems push/pop ops will be cleaner.
> > I still think that peek() is useless due to races.
> > So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
> > and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
> > right?
> > 
> > 
> That's right.
> 
> While updating the push api some came to me, do we have any specific reason
> to support only 1, 2, 4, 8 bytes? I think we could do it general enough to
> support any number of bytes, if the user is worried about the cost of
> memcpys he could use a map of 8 bytes pointers as you mentioned some time
> ago.
> From an API point of view, pop/peek helpers already expect a void *value
> (int bpf_map_[pop, peek]_elem(map, void *value)), the only change would also
> to use a pointer in the push instead of a u64.

Indeed. Good idea. Full copy of the value for push and pop makes sense.
On the verifier side we probably need ARG_PTR_TO_UNINIT_MAP_VALUE
in addition to normal ARG_PTR_TO_MAP_VALUE for the case of bpf_map_pop().
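
The protos could then end up roughly like this (a sketch;
ARG_PTR_TO_UNINIT_MAP_VALUE would be a new argument type, as noted above):

const struct bpf_func_proto bpf_map_push_elem_proto = {
	.func		= bpf_map_push_elem,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_VALUE,
	.arg3_type	= ARG_ANYTHING,		/* flags */
};

const struct bpf_func_proto bpf_map_pop_elem_proto = {
	.func		= bpf_map_pop_elem,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
};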

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c63a44381d3f..8e924b5c5a0e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -807,6 +807,9 @@  static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
+extern const struct bpf_func_proto bpf_map_push_elem_proto;
+extern const struct bpf_func_proto bpf_map_pop_elem_proto;
+extern const struct bpf_func_proto bpf_map_peek_elem_proto;
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 33f7f574b983..903a446f14c3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -67,3 +67,5 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4cda584c6640..c899386dcb2b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -128,6 +128,8 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_SOCKHASH,
 	BPF_MAP_TYPE_CGROUP_STORAGE,
 	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+	BPF_MAP_TYPE_QUEUE,
+	BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -460,6 +462,29 @@  union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
+ * 			 u64 flags)
+ * 	Description
+ * 		Push an element *value* in *map*. *flags* is one of:
+ *
+ * 		**BPF_EXIST**
+ * 		If the queue/stack is full, the oldest element is removed to
+ * 		make room for this.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Pop an element from *map*.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Get an element from *map* without removing it.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * 	Description
  * 		For tracing programs, safely attempt to read *size* bytes from
@@ -2227,7 +2252,10 @@  union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(map_push_elem),		\
+	FN(map_pop_elem),		\
+	FN(map_peek_elem),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e656bce87c8f..2d77bc5b2aca 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -3,7 +3,7 @@  obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..8d2db076d123 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1783,6 +1783,9 @@  BPF_CALL_0(bpf_user_rnd_u32)
 const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
 const struct bpf_func_proto bpf_map_update_elem_proto __weak;
 const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
+const struct bpf_func_proto bpf_map_push_elem_proto __weak;
+const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
+const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
 
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1991466b8327..5f364e6acaf1 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -76,6 +76,104 @@  const struct bpf_func_proto bpf_map_delete_elem_proto = {
 	.arg2_type	= ARG_PTR_TO_MAP_KEY,
 };
 
+BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32, size,
+	   u64, flags)
+{
+	if (map->value_size != size)
+		return -EINVAL;
+
+	return map->ops->map_update_elem(map, NULL, value, flags);
+}
+
+const struct bpf_func_proto bpf_map_push_elem_proto = {
+	.func		= bpf_map_push_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
+{
+	void *ptr;
+
+	if (map->value_size != size)
+		return -EINVAL;
+
+	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
+	if (!ptr)
+		return -ENOENT;
+
+	switch (size) {
+	case 1:
+		*(u8 *) value = *(u8 *) ptr;
+		break;
+	case 2:
+		*(u16 *) value = *(u16 *) ptr;
+		break;
+	case 4:
+		*(u32 *) value = *(u32 *) ptr;
+		break;
+	case 8:
+		*(u64 *) value = *(u64 *) ptr;
+		break;
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_map_pop_elem_proto = {
+	.func		= bpf_map_pop_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
+{
+	void *ptr;
+
+	if (map->value_size != size)
+		return -EINVAL;
+
+	ptr = map->ops->map_lookup_elem(map, NULL);
+	if (!ptr)
+		return -ENOENT;
+
+	switch (size) {
+	case 1:
+		*(u8 *) value = *(u8 *) ptr;
+		break;
+	case 2:
+		*(u16 *) value = *(u16 *) ptr;
+		break;
+	case 4:
+		*(u32 *) value = *(u32 *) ptr;
+		break;
+	case 8:
+		*(u64 *) value = *(u64 *) ptr;
+		break;
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_map_peek_elem_proto = {
+	.func		= bpf_map_peek_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
 const struct bpf_func_proto bpf_get_prandom_u32_proto = {
 	.func		= bpf_user_rnd_u32,
 	.gpl_only	= false,
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
new file mode 100644
index 000000000000..10c081f3f02b
--- /dev/null
+++ b/kernel/bpf/queue_stack_maps.c
@@ -0,0 +1,291 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * queue_stack_maps.c: BPF queue and stack maps
+ *
+ * Copyright (c) 2018 Politecnico di Torino
+ */
+#include <linux/bpf.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include "percpu_freelist.h"
+
+#define QUEUE_STACK_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+
+
+struct bpf_queue_stack {
+	struct bpf_map map;
+	raw_spinlock_t lock;
+	u32 head, tail;
+	u32 index_mask;
+	u32 count;
+
+	char elements[0] __aligned(8);
+};
+
+static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
+{
+	return container_of(map, struct bpf_queue_stack, map);
+}
+
+static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
+{
+	return qs->count == 0;
+}
+
+static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
+{
+	return qs->count == qs->map.max_entries;
+}
+
+/* Called from syscall */
+static int queue_stack_map_alloc_check(union bpf_attr *attr)
+{
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 0 ||
+	    (attr->value_size != 1 && attr->value_size != 2 &&
+	    attr->value_size != 4 && attr->value_size != 8) ||
+	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
+		return -EINVAL;
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		/* if value_size is bigger, the user space won't be able to
+		 * access the elements.
+		 */
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
+{
+	int ret, numa_node = bpf_map_attr_numa_node(attr);
+	u32 max_entries, value_size, index_mask;
+	u64 queue_size, cost, mask64;
+	struct bpf_queue_stack *qs;
+
+	max_entries = attr->max_entries;
+	value_size = attr->value_size;
+
+	/* From arraymap.c:
+	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
+	 * upper most bit set in u32 space is undefined behavior due to
+	 * resulting 1U << 32, so do it manually here in u64 space.
+	 */
+	mask64 = fls_long(max_entries - 1);
+	mask64 = 1ULL << mask64;
+	mask64 -= 1;
+
+	index_mask = mask64;
+
+	/* Round up queue size to nearest power of 2 */
+	max_entries = index_mask + 1;
+	/* Check for overflows. */
+	if (max_entries < attr->max_entries)
+		return ERR_PTR(-E2BIG);
+
+	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
+
+	cost = queue_size;
+	if (cost >= U32_MAX - PAGE_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	ret = bpf_map_precharge_memlock(cost);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	qs = bpf_map_area_alloc(queue_size, numa_node);
+	if (!qs)
+		return ERR_PTR(-ENOMEM);
+
+	memset(qs, 0, sizeof(*qs));
+
+	bpf_map_init_from_attr(&qs->map, attr);
+
+	qs->map.pages = cost;
+	qs->index_mask = index_mask;
+
+	raw_spin_lock_init(&qs->lock);
+
+	return &qs->map;
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void queue_stack_map_free(struct bpf_map *map)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete
+	 */
+	synchronize_rcu();
+
+	bpf_map_area_free(qs);
+}
+
+static void *__queue_map_lookup(struct bpf_map *map, bool delete)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long flags;
+	void *ptr = NULL;
+
+	raw_spin_lock_irqsave(&qs->lock, flags);
+
+	if (queue_stack_map_is_empty(qs))
+		goto out;
+
+	ptr = &qs->elements[qs->tail * qs->map.value_size];
+
+	if (delete) {
+		qs->tail = (qs->tail + 1) & qs->index_mask;
+		qs->count--;
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, flags);
+	return ptr;
+}
+
+
+static void *__stack_map_lookup(struct bpf_map *map, bool delete)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long flags;
+	void *ptr = NULL;
+	u32 index;
+
+	raw_spin_lock_irqsave(&qs->lock, flags);
+
+	if (queue_stack_map_is_empty(qs))
+		goto out;
+
+	index = (qs->head - 1) & qs->index_mask;
+	ptr = &qs->elements[index * qs->map.value_size];
+
+	if (delete) {
+		qs->head = (qs->head - 1) & qs->index_mask;
+		qs->count--;
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, flags);
+	return ptr;
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return __queue_map_lookup(map, false);
+}
+
+/* Called from syscall or from eBPF program */
+static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return __stack_map_lookup(map, false);
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
+{
+	return __queue_map_lookup(map, true);
+}
+
+/* Called from syscall or from eBPF program */
+static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
+{
+	return __stack_map_lookup(map, true);
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 flags)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long irq_flags;
+	int err = 0;
+	void *dst;
+
+	/* BPF_EXIST is used to force making room for a new element in case the
+	 * map is full
+	 */
+	bool replace = (flags & BPF_EXIST);
+
+	/* Check supported flags for queue and stack maps */
+	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
+		return -EINVAL;
+
+	raw_spin_lock_irqsave(&qs->lock, irq_flags);
+
+	if (queue_stack_map_is_full(qs)) {
+		if (!replace) {
+			err = -E2BIG;
+			goto out;
+		}
+		/* advance tail pointer to overwrite oldest element */
+		qs->tail = (qs->tail + 1) & qs->index_mask;
+		qs->count--;
+	}
+
+	dst = &qs->elements[qs->head * qs->map.value_size];
+
+	switch (qs->map.value_size) {
+	case 1:
+		*(u8 *) dst = *(u8 *) value;
+		break;
+	case 2:
+		*(u16 *) dst = *(u16 *) value;
+		break;
+	case 4:
+		*(u32 *) dst = *(u32 *) value;
+		break;
+	case 8:
+		*(u64 *) dst = *(u64 *) value;
+		break;
+	}
+
+	qs->head = (qs->head + 1) & qs->index_mask;
+	qs->count++;
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+/* Called from syscall */
+static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
+					void *next_key)
+{
+	return -EINVAL;
+}
+
+const struct bpf_map_ops queue_map_ops = {
+	.map_alloc_check = queue_stack_map_alloc_check,
+	.map_alloc = queue_stack_map_alloc,
+	.map_free = queue_stack_map_free,
+	.map_lookup_elem = queue_map_lookup_elem,
+	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
+	.map_update_elem = queue_stack_map_update_elem,
+	.map_delete_elem = queue_stack_map_delete_elem,
+	.map_get_next_key = queue_stack_map_get_next_key,
+};
+
+const struct bpf_map_ops stack_map_ops = {
+	.map_alloc_check = queue_stack_map_alloc_check,
+	.map_alloc = queue_stack_map_alloc,
+	.map_free = queue_stack_map_free,
+	.map_lookup_elem = stack_map_lookup_elem,
+	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
+	.map_update_elem = queue_stack_map_update_elem,
+	.map_delete_elem = queue_stack_map_delete_elem,
+	.map_get_next_key = queue_stack_map_get_next_key,
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f4ff0c569e54..b9e005188f0e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2369,7 +2369,10 @@  record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	if (func_id != BPF_FUNC_tail_call &&
 	    func_id != BPF_FUNC_map_lookup_elem &&
 	    func_id != BPF_FUNC_map_update_elem &&
-	    func_id != BPF_FUNC_map_delete_elem)
+	    func_id != BPF_FUNC_map_delete_elem &&
+	    func_id != BPF_FUNC_map_push_elem &&
+	    func_id != BPF_FUNC_map_pop_elem &&
+	    func_id != BPF_FUNC_map_peek_elem)
 		return 0;
 
 	if (meta->map_ptr == NULL) {
diff --git a/net/core/filter.c b/net/core/filter.c
index feb578506009..c7b73376c23a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4839,6 +4839,12 @@  bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
 		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_map_push_elem:
+		return &bpf_map_push_elem_proto;
+	case BPF_FUNC_map_pop_elem:
+		return &bpf_map_pop_elem_proto;
+	case BPF_FUNC_map_peek_elem:
+		return &bpf_map_peek_elem_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_get_smp_processor_id: