
[bpf-next,1/3] bpf: add bpf queue map

Message ID: 153356390770.6981.4228793745105954649.stgit@kernel
State: Changes Requested, archived
Delegated to: BPF Maintainers
Series: Implement bpf map queue

Commit Message

Mauricio Vasquez Aug. 6, 2018, 1:58 p.m. UTC
The BPF queue map implements LIFO/FIFO data containers for eBPF programs.

It allows pushing an element onto the queue by using the update operation
and popping an element from the queue by using the lookup operation.

A use case for this is to keep track of a pool of elements, such as
network ports in a SNAT.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 include/linux/bpf_types.h |    1 
 include/uapi/linux/bpf.h  |    5 +
 kernel/bpf/Makefile       |    2 
 kernel/bpf/queuemap.c     |  287 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c      |   61 +++++++---
 kernel/bpf/verifier.c     |   16 ++-
 6 files changed, 353 insertions(+), 19 deletions(-)
 create mode 100644 kernel/bpf/queuemap.c
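
For illustration, a minimal sketch of the SNAT port-pool use case from the
BPF program side under the semantics above (update pushes, lookup pops, NULL
key since key_size is 0). It assumes the usual bpf_helpers.h definitions; the
map definition style and function names are only an example, not part of the
patch:

/* Pool of free source ports kept in a queue map (illustrative only). */
struct bpf_map_def SEC("maps") free_ports = {
	.type        = BPF_MAP_TYPE_QUEUE,
	.key_size    = 0,                /* queue maps take no key */
	.value_size  = sizeof(__u16),
	.max_entries = 1024,
	.map_flags   = BPF_F_QUEUE_FIFO, /* flag proposed by this patch */
};

static __always_inline int snat_alloc_port(__u16 *port)
{
	/* lookup == pop in this proposal */
	__u16 *p = bpf_map_lookup_elem(&free_ports, NULL);

	if (!p)
		return -1;	/* pool exhausted */
	*port = *p;
	return 0;
}

static __always_inline void snat_release_port(__u16 port)
{
	/* update == push in this proposal */
	bpf_map_update_elem(&free_ports, NULL, &port, 0);
}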

Comments

Daniel Borkmann Aug. 7, 2018, 1:40 p.m. UTC | #1
On 08/06/2018 03:58 PM, Mauricio Vasquez B wrote:
> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
> 
> It allows to push an element to the queue by using the update operation
> and to pop an element from the queue by using the lookup operation.
> 
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  include/linux/bpf_types.h |    1 
>  include/uapi/linux/bpf.h  |    5 +
>  kernel/bpf/Makefile       |    2 
>  kernel/bpf/queuemap.c     |  287 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c      |   61 +++++++---
>  kernel/bpf/verifier.c     |   16 ++-
>  6 files changed, 353 insertions(+), 19 deletions(-)
>  create mode 100644 kernel/bpf/queuemap.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2d5549..6c7a62f3fe43 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0ebaaf7f3568..2c171c40eb45 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -120,6 +120,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_CPUMAP,
>  	BPF_MAP_TYPE_XSKMAP,
>  	BPF_MAP_TYPE_SOCKHASH,
> +	BPF_MAP_TYPE_QUEUE,
>  };
>  
>  enum bpf_prog_type {
> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>  /* Flag for stack_map, store build_id+offset instead of pointer */
>  #define BPF_F_STACK_BUILD_ID	(1U << 5)
>  
> +/* Flags for queue_map, type of queue */
> +#define BPF_F_QUEUE_FIFO	(1U << 16)
> +#define BPF_F_QUEUE_LIFO	(2U << 16)
> +
>  enum bpf_stack_build_id_status {
>  	/* user space need an empty entry to identify end of a trace */
>  	BPF_STACK_BUILD_ID_EMPTY = 0,
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f27f5496d6fe..30f02ef66635 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -2,7 +2,7 @@
>  obj-y := core.o
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
>  ifeq ($(CONFIG_NET),y)
> diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
> new file mode 100644
> index 000000000000..ab30af43b4cc
> --- /dev/null
> +++ b/kernel/bpf/queuemap.c
> @@ -0,0 +1,287 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * queuemap.c: BPF queue map
> + *
> + * Copyright (c) 2018 Politecnico di Torino
> + */
> +#include <linux/bpf.h>
> +#include <linux/rculist.h>
> +#include <linux/slab.h>
> +#include "percpu_freelist.h"
> +
> +#define QUEUE_CREATE_FLAG_MASK \
> +	(BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
> +	 BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
> +
> +enum queue_type {
> +	QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
> +	QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
> +};
> +
> +struct bpf_queue {
> +	struct bpf_map map;
> +	struct list_head head;
> +	struct pcpu_freelist freelist;
> +	void *nodes;
> +	enum queue_type type;
> +	raw_spinlock_t lock;
> +	atomic_t count;
> +	u32 node_size;
> +};
> +
> +struct queue_node {
> +	struct pcpu_freelist_node fnode;
> +	struct bpf_queue *queue;
> +	struct list_head list;
> +	struct rcu_head rcu;
> +	char element[0] __aligned(8);
> +};
> +
> +static bool queue_map_is_prealloc(struct bpf_queue *queue)
> +{
> +	return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
> +}
> +
> +/* Called from syscall */
> +static int queue_map_alloc_check(union bpf_attr *attr)
> +{
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
> +	    attr->value_size == 0 || attr->map_flags & ~QUEUE_CREATE_FLAG_MASK)
> +		return -EINVAL;
> +
> +	if ((attr->map_flags >> 16) != QUEUE_FIFO &&
> +	    (attr->map_flags >> 16) != QUEUE_LIFO) {
> +		return -EINVAL;
> +	}
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		/* if value_size is bigger, the user space won't be able to
> +		 * access the elements.
> +		 */
> +		return -E2BIG;
> +
> +	return 0;
> +}
> +
> +static int prealloc_init(struct bpf_queue *queue)
> +{
> +	u32 node_size = sizeof(struct queue_node) +
> +			round_up(queue->map.value_size, 8);
> +	u32 num_entries = queue->map.max_entries;
> +	int err;
> +
> +	queue->nodes = bpf_map_area_alloc(node_size * num_entries,
> +					  queue->map.numa_node);
> +	if (!queue->nodes)
> +		return -ENOMEM;
> +
> +	err = pcpu_freelist_init(&queue->freelist);
> +	if (err)
> +		goto free_nodes;
> +
> +	pcpu_freelist_populate(&queue->freelist,
> +			       queue->nodes +
> +			       offsetof(struct queue_node, fnode),
> +			       node_size, num_entries);
> +
> +	return 0;
> +
> +free_nodes:
> +	bpf_map_area_free(queue->nodes);
> +	return err;
> +}
> +
> +static void prealloc_destroy(struct bpf_queue *queue)
> +{
> +	bpf_map_area_free(queue->nodes);
> +	pcpu_freelist_destroy(&queue->freelist);
> +}
> +
> +static struct bpf_map *queue_map_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_queue *queue;
> +	u64 cost = sizeof(*queue);
> +	int ret;
> +
> +	queue = kzalloc(sizeof(*queue), GFP_USER);
> +	if (!queue)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bpf_map_init_from_attr(&queue->map, attr);
> +
> +	queue->node_size = sizeof(struct queue_node) +
> +			   round_up(attr->value_size, 8);
> +	cost += (u64) attr->max_entries * queue->node_size;
> +	if (cost >= U32_MAX - PAGE_SIZE) {
> +		ret = -E2BIG;
> +		goto free_queue;
> +	}
> +
> +	queue->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	ret = bpf_map_precharge_memlock(queue->map.pages);
> +	if (ret)
> +		goto free_queue;
> +
> +	INIT_LIST_HEAD(&queue->head);
> +
> +	raw_spin_lock_init(&queue->lock);
> +
> +	queue->type = attr->map_flags >> 16;
> +
> +	if (queue_map_is_prealloc(queue))
> +		ret = prealloc_init(queue);
> +		if (ret)
> +			goto free_queue;
> +
> +	return &queue->map;
> +
> +free_queue:
> +	kfree(queue);
> +	return ERR_PTR(ret);
> +}
> +
> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
> +static void queue_map_free(struct bpf_map *map)
> +{
> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
> +	struct queue_node *l;
> +
> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> +	 * so the programs (can be more than one that used this map) were
> +	 * disconnected from events. Wait for outstanding critical sections in
> +	 * these programs to complete
> +	 */
> +	synchronize_rcu();
> +
> +	/* some of queue_elem_free_rcu() callbacks for elements of this map may
> +	 * not have executed. Wait for them.
> +	 */
> +	rcu_barrier();
> +	if (!queue_map_is_prealloc(queue))
> +		list_for_each_entry_rcu(l, &queue->head, list) {
> +			list_del_rcu(&l->list);
> +			kfree(l);
> +		}
> +	else
> +		prealloc_destroy(queue);
> +	kfree(queue);
> +}
> +
> +static void queue_elem_free_rcu(struct rcu_head *head)
> +{
> +	struct queue_node *l = container_of(head, struct queue_node, rcu);
> +	struct bpf_queue *queue = l->queue;
> +
> +	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
> +	 * we're calling kfree, otherwise deadlock is possible if kprobes
> +	 * are placed somewhere inside of slub
> +	 */
> +	preempt_disable();
> +	__this_cpu_inc(bpf_prog_active);
> +	if (queue_map_is_prealloc(queue))
> +		pcpu_freelist_push(&queue->freelist, &l->fnode);
> +	else
> +		kfree(l);
> +	__this_cpu_dec(bpf_prog_active);
> +	preempt_enable();
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
> +	unsigned long flags;
> +	struct queue_node *node;
> +
> +	raw_spin_lock_irqsave(&queue->lock, flags);

I think a per-cpu flavor of this map type would be very useful here as well,
such that we wouldn't need a lock here but only need to guarantee that
preemption is disabled, and it could be used as a temp store for the program,
for example.

> +	node = list_first_or_null_rcu(&queue->head, struct queue_node, list);
> +	if (!node) {
> +		raw_spin_unlock_irqrestore(&queue->lock, flags);
> +		return NULL;
> +	}
> +
> +	if (!queue_map_is_prealloc(queue))
> +		atomic_dec(&queue->count);
> +
> +	list_del_rcu(&node->list);
> +	call_rcu(&node->rcu, queue_elem_free_rcu);
> +
> +	raw_spin_unlock_irqrestore(&queue->lock, flags);
> +
> +	return &node->element;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_map_update_elem(struct bpf_map *map, void *key, void *value,
> +				 u64 map_flags)
> +{
> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
> +	unsigned long flags;
> +	struct queue_node *new;

Should reject invalid map update flags here.

> +	if (!queue_map_is_prealloc(queue)) {
> +		if (atomic_inc_return(&queue->count) > queue->map.max_entries) {
> +			atomic_dec(&queue->count);
> +			return -E2BIG;
> +		}
> +
> +		new = kmalloc_node(queue->node_size, GFP_ATOMIC | __GFP_NOWARN,
> +				   queue->map.numa_node);
> +		if (!new) {
> +			atomic_dec(&queue->count);
> +			return -ENOMEM;
> +		}
> +	} else {
> +		struct pcpu_freelist_node *l;
> +
> +		l = pcpu_freelist_pop(&queue->freelist);
> +		if (!l)
> +			return -E2BIG;
> +		new = container_of(l, struct queue_node, fnode);
> +	}
> +
> +	memcpy(new->element, value, queue->map.value_size);
> +	new->queue = queue;
> +
> +	raw_spin_lock_irqsave(&queue->lock, flags);
> +	switch (queue->type) {
> +	case QUEUE_FIFO:
> +		list_add_tail_rcu(&new->list, &queue->head);
> +		break;
> +
> +	case QUEUE_LIFO:
> +		list_add_rcu(&new->list, &queue->head);
> +		break;
> +	}
> +
> +	raw_spin_unlock_irqrestore(&queue->lock, flags);
> +
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -EINVAL;
> +}
> +
> +/* Called from syscall */
> +static int queue_map_get_next_key(struct bpf_map *map, void *key,
> +				  void *next_key)
> +{
> +	return -EINVAL;
> +}
> +
> +const struct bpf_map_ops queue_map_ops = {
> +	.map_alloc_check = queue_map_alloc_check,
> +	.map_alloc = queue_map_alloc,
> +	.map_free = queue_map_free,
> +	.map_lookup_elem = queue_map_lookup_elem,
> +	.map_update_elem = queue_map_update_elem,
> +	.map_delete_elem = queue_map_delete_elem,
> +	.map_get_next_key = queue_map_get_next_key,
> +};
> +
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a31a1ba0f8ea..7e9a11d69eef 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -622,11 +622,19 @@ static int map_lookup_elem(union bpf_attr *attr)
>  		err = -EPERM;
>  		goto err_put;
>  	}
> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
> +		key = memdup_user(ukey, map->key_size);
> +		if (IS_ERR(key)) {
> +			err = PTR_ERR(key);
> +			goto err_put;
> +		}
> +	} else {
> +		if (ukey) {
> +			err = -EINVAL;
> +			goto err_put;
> +		}

Given you have a fixed key_size of 0, this could probably be made more
generic and refactored a bit into a helper to check for 0 map->key_size
instead of map type.

> -	key = memdup_user(ukey, map->key_size);
> -	if (IS_ERR(key)) {
> -		err = PTR_ERR(key);
> -		goto err_put;
> +		key = NULL;
>  	}
>  
>  	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
> @@ -709,10 +717,19 @@ static int map_update_elem(union bpf_attr *attr)
>  		goto err_put;
>  	}
>  
> -	key = memdup_user(ukey, map->key_size);
> -	if (IS_ERR(key)) {
> -		err = PTR_ERR(key);
> -		goto err_put;
> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
> +		key = memdup_user(ukey, map->key_size);
> +		if (IS_ERR(key)) {
> +			err = PTR_ERR(key);
> +			goto err_put;
> +		}
> +	} else {
> +		if (ukey) {
> +			err = -EINVAL;
> +			goto err_put;
> +		}
> +
> +		key = NULL;
>  	}

Ditto, and also further below as well for the other syscall cases.

>  
>  	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
> @@ -803,10 +820,19 @@ static int map_delete_elem(union bpf_attr *attr)
>  		goto err_put;
>  	}
>  
> -	key = memdup_user(ukey, map->key_size);
> -	if (IS_ERR(key)) {
> -		err = PTR_ERR(key);
> -		goto err_put;
> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
> +		key = memdup_user(ukey, map->key_size);
> +		if (IS_ERR(key)) {
> +			err = PTR_ERR(key);
> +			goto err_put;
> +		}
> +	} else {
> +		if (ukey) {
> +			err = -EINVAL;
> +			goto err_put;
> +		}
> +
> +		key = NULL;
>  	}
>  
>  	if (bpf_map_is_dev_bound(map)) {
> @@ -855,9 +881,14 @@ static int map_get_next_key(union bpf_attr *attr)
>  	}
>  
>  	if (ukey) {
> -		key = memdup_user(ukey, map->key_size);
> -		if (IS_ERR(key)) {
> -			err = PTR_ERR(key);
> +		if (map->map_type != BPF_MAP_TYPE_QUEUE) {
> +			key = memdup_user(ukey, map->key_size);
> +			if (IS_ERR(key)) {
> +				err = PTR_ERR(key);
> +				goto err_put;
> +			}
> +		} else {
> +			err = -EINVAL;
>  			goto err_put;
>  		}
>  	} else {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 25e47c195874..83099a9a21d9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1976,8 +1976,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  		return -EACCES;
>  	}
>  
> -	if (arg_type == ARG_PTR_TO_MAP_KEY ||
> -	    arg_type == ARG_PTR_TO_MAP_VALUE) {
> +	if (arg_type == ARG_PTR_TO_MAP_KEY) {
> +		expected_type = PTR_TO_STACK;
> +		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
> +		    type != expected_type && type != SCALAR_VALUE)
> +			goto err_type;
> +	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
>  		expected_type = PTR_TO_STACK;
>  		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
>  		    type != expected_type)
> @@ -2021,6 +2025,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  		/* bpf_map_xxx(map_ptr) call: remember that map_ptr */
>  		meta->map_ptr = reg->map_ptr;
>  	} else if (arg_type == ARG_PTR_TO_MAP_KEY) {
> +		bool zero_size_allowed = false;
>  		/* bpf_map_xxx(..., map_ptr, ..., key) call:
>  		 * check that [key, key + map->key_size) are within
>  		 * stack limits and initialized
> @@ -2034,8 +2039,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  			verbose(env, "invalid map_ptr to access map->key\n");
>  			return -EACCES;
>  		}
> +
> +		if (meta->map_ptr->map_type == BPF_MAP_TYPE_QUEUE)

Here as well.

> +			zero_size_allowed = true;

Also, the verifier should rather enforce a const NULL key here than allow one to
be set (but not as the only 'valid' input option); meaning, I don't think it would
be good to allow a 'zero'-sized pointer to stack in this case.

>  		err = check_helper_mem_access(env, regno,
> -					      meta->map_ptr->key_size, false,
> +					      meta->map_ptr->key_size,
> +					      zero_size_allowed,
>  					      NULL);
>  	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
>  		/* bpf_map_xxx(..., map_ptr, ..., value) call:
>
Daniel Borkmann Aug. 7, 2018, 1:52 p.m. UTC | #2
On 08/06/2018 03:58 PM, Mauricio Vasquez B wrote:
> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
> 
> It allows to push an element to the queue by using the update operation
> and to pop an element from the queue by using the lookup operation.
> 
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
[...]
> +static int prealloc_init(struct bpf_queue *queue)
> +{
> +	u32 node_size = sizeof(struct queue_node) +
> +			round_up(queue->map.value_size, 8);
> +	u32 num_entries = queue->map.max_entries;
> +	int err;
> +
> +	queue->nodes = bpf_map_area_alloc(node_size * num_entries,
> +					  queue->map.numa_node);

That doesn't work either. If you don't set a numa node, then here in
your case you'll always use numa node 0, which is unintentional.
You need to get the node via the bpf_map_attr_numa_node(attr) helper.
Same issue in queue_map_update_elem().
Alexei Starovoitov Aug. 7, 2018, 2:42 p.m. UTC | #3
On Mon, Aug 06, 2018 at 03:58:30PM +0200, Mauricio Vasquez B wrote:
> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.

A queue/stack data structure would be a great addition.

> It allows to push an element to the queue by using the update operation
> and to pop an element from the queue by using the lookup operation.
> 
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  include/linux/bpf_types.h |    1 
>  include/uapi/linux/bpf.h  |    5 +
>  kernel/bpf/Makefile       |    2 
>  kernel/bpf/queuemap.c     |  287 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c      |   61 +++++++---
>  kernel/bpf/verifier.c     |   16 ++-
>  6 files changed, 353 insertions(+), 19 deletions(-)
>  create mode 100644 kernel/bpf/queuemap.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2d5549..6c7a62f3fe43 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0ebaaf7f3568..2c171c40eb45 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -120,6 +120,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_CPUMAP,
>  	BPF_MAP_TYPE_XSKMAP,
>  	BPF_MAP_TYPE_SOCKHASH,
> +	BPF_MAP_TYPE_QUEUE,
>  };
>  
>  enum bpf_prog_type {
> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>  /* Flag for stack_map, store build_id+offset instead of pointer */
>  #define BPF_F_STACK_BUILD_ID	(1U << 5)
>  
> +/* Flags for queue_map, type of queue */
> +#define BPF_F_QUEUE_FIFO	(1U << 16)
> +#define BPF_F_QUEUE_LIFO	(2U << 16)

The choice of flags looks odd.
Why start at bit 16, and why waste two bits?
It's either stack or queue.
Maybe instead define two map types,
BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK,
that share a common implementation?

And how about adding three new helpers, push/pop/peek, as well?
Reusing lookup/update is neat, but does lookup == pop
or does lookup == peek? I suspect it will be confusing.
Three new helpers cost nothing, but will make BPF progs easier to read.

Could you also add a flag for the replacement policy?
In this implementation, when the max_entries limit is reached,
map_update_elem (aka push) will fail with -E2BIG.
It would be useful to allow pushing and dropping elements from
the other side; then such a queue/stack can be used to keep
track of the last N events (the prog would unconditionally push
and user space would swap the queue for a new one via map-in-map
and drain the old queue).
Speaking of map-in-map, please add a test to make sure
queue/stack works with hash/array_of_maps.
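
For example, a rough user-space sketch of that swap-and-drain pattern,
assuming the outer map is an array-of-maps with the queue in slot 0 and
libbpf-style fd wrappers; handle_event() and the lookup-and-delete drain
call are only placeholders for whatever user-space pop operation we end up
with:

#include <unistd.h>
#include <bpf/bpf.h>

static void handle_event(const void *value);	/* user-defined */

/* Hypothetical sketch: rotate the inner event queue and drain the old one. */
static int rotate_and_drain(int outer_fd, int old_inner_fd,
			    unsigned int value_size, unsigned int max_entries)
{
	char value[64];	/* assumes value_size <= 64 */
	int new_inner_fd, key = 0;

	new_inner_fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0 /* key_size */,
				      value_size, max_entries, 0);
	if (new_inner_fd < 0)
		return -1;

	/* from this point on, programs push into the new queue */
	if (bpf_map_update_elem(outer_fd, &key, &new_inner_fd, BPF_ANY))
		return -1;

	/* drain whatever is left in the old queue */
	while (bpf_map_lookup_and_delete_elem(old_inner_fd, NULL, value) == 0)
		handle_event(value);

	close(old_inner_fd);
	return 0;
}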

The selftests in patch 2 need to have a kernel side as well;
it's not enough to test syscall access only.
Mauricio Vasquez Aug. 9, 2018, 2:50 a.m. UTC | #4
On 08/07/2018 08:40 AM, Daniel Borkmann wrote:
> On 08/06/2018 03:58 PM, Mauricio Vasquez B wrote:
>> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
>>
>> It allows to push an element to the queue by using the update operation
>> and to pop an element from the queue by using the lookup operation.
>>
>> A use case for this is to keep track of a pool of elements, like
>> network ports in a SNAT.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf_types.h |    1
>>   include/uapi/linux/bpf.h  |    5 +
>>   kernel/bpf/Makefile       |    2
>>   kernel/bpf/queuemap.c     |  287 +++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c      |   61 +++++++---
>>   kernel/bpf/verifier.c     |   16 ++-
>>   6 files changed, 353 insertions(+), 19 deletions(-)
>>   create mode 100644 kernel/bpf/queuemap.c
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index c5700c2d5549..6c7a62f3fe43 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>>   #endif
>>   #endif
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0ebaaf7f3568..2c171c40eb45 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -120,6 +120,7 @@ enum bpf_map_type {
>>   	BPF_MAP_TYPE_CPUMAP,
>>   	BPF_MAP_TYPE_XSKMAP,
>>   	BPF_MAP_TYPE_SOCKHASH,
>> +	BPF_MAP_TYPE_QUEUE,
>>   };
>>   
>>   enum bpf_prog_type {
>> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>>   /* Flag for stack_map, store build_id+offset instead of pointer */
>>   #define BPF_F_STACK_BUILD_ID	(1U << 5)
>>   
>> +/* Flags for queue_map, type of queue */
>> +#define BPF_F_QUEUE_FIFO	(1U << 16)
>> +#define BPF_F_QUEUE_LIFO	(2U << 16)
>> +
>>   enum bpf_stack_build_id_status {
>>   	/* user space need an empty entry to identify end of a trace */
>>   	BPF_STACK_BUILD_ID_EMPTY = 0,
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index f27f5496d6fe..30f02ef66635 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -2,7 +2,7 @@
>>   obj-y := core.o
>>   
>>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>>   obj-$(CONFIG_BPF_SYSCALL) += btf.o
>>   ifeq ($(CONFIG_NET),y)
>> diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
>> new file mode 100644
>> index 000000000000..ab30af43b4cc
>> --- /dev/null
>> +++ b/kernel/bpf/queuemap.c
>> @@ -0,0 +1,287 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * queuemap.c: BPF queue map
>> + *
>> + * Copyright (c) 2018 Politecnico di Torino
>> + */
>> +#include <linux/bpf.h>
>> +#include <linux/rculist.h>
>> +#include <linux/slab.h>
>> +#include "percpu_freelist.h"
>> +
>> +#define QUEUE_CREATE_FLAG_MASK \
>> +	(BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
>> +	 BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
>> +
>> +enum queue_type {
>> +	QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
>> +	QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
>> +};
>> +
>> +struct bpf_queue {
>> +	struct bpf_map map;
>> +	struct list_head head;
>> +	struct pcpu_freelist freelist;
>> +	void *nodes;
>> +	enum queue_type type;
>> +	raw_spinlock_t lock;
>> +	atomic_t count;
>> +	u32 node_size;
>> +};
>> +
>> +struct queue_node {
>> +	struct pcpu_freelist_node fnode;
>> +	struct bpf_queue *queue;
>> +	struct list_head list;
>> +	struct rcu_head rcu;
>> +	char element[0] __aligned(8);
>> +};
>> +
>> +static bool queue_map_is_prealloc(struct bpf_queue *queue)
>> +{
>> +	return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_map_alloc_check(union bpf_attr *attr)
>> +{
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
>> +	    attr->value_size == 0 || attr->map_flags & ~QUEUE_CREATE_FLAG_MASK)
>> +		return -EINVAL;
>> +
>> +	if ((attr->map_flags >> 16) != QUEUE_FIFO &&
>> +	    (attr->map_flags >> 16) != QUEUE_LIFO) {
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (attr->value_size > KMALLOC_MAX_SIZE)
>> +		/* if value_size is bigger, the user space won't be able to
>> +		 * access the elements.
>> +		 */
>> +		return -E2BIG;
>> +
>> +	return 0;
>> +}
>> +
>> +static int prealloc_init(struct bpf_queue *queue)
>> +{
>> +	u32 node_size = sizeof(struct queue_node) +
>> +			round_up(queue->map.value_size, 8);
>> +	u32 num_entries = queue->map.max_entries;
>> +	int err;
>> +
>> +	queue->nodes = bpf_map_area_alloc(node_size * num_entries,
>> +					  queue->map.numa_node);
>> +	if (!queue->nodes)
>> +		return -ENOMEM;
>> +
>> +	err = pcpu_freelist_init(&queue->freelist);
>> +	if (err)
>> +		goto free_nodes;
>> +
>> +	pcpu_freelist_populate(&queue->freelist,
>> +			       queue->nodes +
>> +			       offsetof(struct queue_node, fnode),
>> +			       node_size, num_entries);
>> +
>> +	return 0;
>> +
>> +free_nodes:
>> +	bpf_map_area_free(queue->nodes);
>> +	return err;
>> +}
>> +
>> +static void prealloc_destroy(struct bpf_queue *queue)
>> +{
>> +	bpf_map_area_free(queue->nodes);
>> +	pcpu_freelist_destroy(&queue->freelist);
>> +}
>> +
>> +static struct bpf_map *queue_map_alloc(union bpf_attr *attr)
>> +{
>> +	struct bpf_queue *queue;
>> +	u64 cost = sizeof(*queue);
>> +	int ret;
>> +
>> +	queue = kzalloc(sizeof(*queue), GFP_USER);
>> +	if (!queue)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	bpf_map_init_from_attr(&queue->map, attr);
>> +
>> +	queue->node_size = sizeof(struct queue_node) +
>> +			   round_up(attr->value_size, 8);
>> +	cost += (u64) attr->max_entries * queue->node_size;
>> +	if (cost >= U32_MAX - PAGE_SIZE) {
>> +		ret = -E2BIG;
>> +		goto free_queue;
>> +	}
>> +
>> +	queue->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>> +
>> +	ret = bpf_map_precharge_memlock(queue->map.pages);
>> +	if (ret)
>> +		goto free_queue;
>> +
>> +	INIT_LIST_HEAD(&queue->head);
>> +
>> +	raw_spin_lock_init(&queue->lock);
>> +
>> +	queue->type = attr->map_flags >> 16;
>> +
>> +	if (queue_map_is_prealloc(queue))
>> +		ret = prealloc_init(queue);
>> +		if (ret)
>> +			goto free_queue;
>> +
>> +	return &queue->map;
>> +
>> +free_queue:
>> +	kfree(queue);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
>> +static void queue_map_free(struct bpf_map *map)
>> +{
>> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
>> +	struct queue_node *l;
>> +
>> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
>> +	 * so the programs (can be more than one that used this map) were
>> +	 * disconnected from events. Wait for outstanding critical sections in
>> +	 * these programs to complete
>> +	 */
>> +	synchronize_rcu();
>> +
>> +	/* some of queue_elem_free_rcu() callbacks for elements of this map may
>> +	 * not have executed. Wait for them.
>> +	 */
>> +	rcu_barrier();
>> +	if (!queue_map_is_prealloc(queue))
>> +		list_for_each_entry_rcu(l, &queue->head, list) {
>> +			list_del_rcu(&l->list);
>> +			kfree(l);
>> +		}
>> +	else
>> +		prealloc_destroy(queue);
>> +	kfree(queue);
>> +}
>> +
>> +static void queue_elem_free_rcu(struct rcu_head *head)
>> +{
>> +	struct queue_node *l = container_of(head, struct queue_node, rcu);
>> +	struct bpf_queue *queue = l->queue;
>> +
>> +	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
>> +	 * we're calling kfree, otherwise deadlock is possible if kprobes
>> +	 * are placed somewhere inside of slub
>> +	 */
>> +	preempt_disable();
>> +	__this_cpu_inc(bpf_prog_active);
>> +	if (queue_map_is_prealloc(queue))
>> +		pcpu_freelist_push(&queue->freelist, &l->fnode);
>> +	else
>> +		kfree(l);
>> +	__this_cpu_dec(bpf_prog_active);
>> +	preempt_enable();
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
>> +	unsigned long flags;
>> +	struct queue_node *node;
>> +
>> +	raw_spin_lock_irqsave(&queue->lock, flags);
> I think a per-cpu flavor would be very useful here as well in this map
> type such that we wouldn't need a lock here but only guarantee that
> preemption is disabled, and it could be used as temp store for the program
> for example.
I agree. I'd focus for now on the basic implementation; once it is ready 
we could go ahead with a per-cpu version.
>
>> +	node = list_first_or_null_rcu(&queue->head, struct queue_node, list);
>> +	if (!node) {
>> +		raw_spin_unlock_irqrestore(&queue->lock, flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (!queue_map_is_prealloc(queue))
>> +		atomic_dec(&queue->count);
>> +
>> +	list_del_rcu(&node->list);
>> +	call_rcu(&node->rcu, queue_elem_free_rcu);
>> +
>> +	raw_spin_unlock_irqrestore(&queue->lock, flags);
>> +
>> +	return &node->element;
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_map_update_elem(struct bpf_map *map, void *key, void *value,
>> +				 u64 map_flags)
>> +{
>> +	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
>> +	unsigned long flags;
>> +	struct queue_node *new;
> Should reject invalid map update flags here.
Will do in the new push helper proposed by Alexei.
>
>> +	if (!queue_map_is_prealloc(queue)) {
>> +		if (atomic_inc_return(&queue->count) > queue->map.max_entries) {
>> +			atomic_dec(&queue->count);
>> +			return -E2BIG;
>> +		}
>> +
>> +		new = kmalloc_node(queue->node_size, GFP_ATOMIC | __GFP_NOWARN,
>> +				   queue->map.numa_node);
>> +		if (!new) {
>> +			atomic_dec(&queue->count);
>> +			return -ENOMEM;
>> +		}
>> +	} else {
>> +		struct pcpu_freelist_node *l;
>> +
>> +		l = pcpu_freelist_pop(&queue->freelist);
>> +		if (!l)
>> +			return -E2BIG;
>> +		new = container_of(l, struct queue_node, fnode);
>> +	}
>> +
>> +	memcpy(new->element, value, queue->map.value_size);
>> +	new->queue = queue;
>> +
>> +	raw_spin_lock_irqsave(&queue->lock, flags);
>> +	switch (queue->type) {
>> +	case QUEUE_FIFO:
>> +		list_add_tail_rcu(&new->list, &queue->head);
>> +		break;
>> +
>> +	case QUEUE_LIFO:
>> +		list_add_rcu(&new->list, &queue->head);
>> +		break;
>> +	}
>> +
>> +	raw_spin_unlock_irqrestore(&queue->lock, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_map_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_map_get_next_key(struct bpf_map *map, void *key,
>> +				  void *next_key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +const struct bpf_map_ops queue_map_ops = {
>> +	.map_alloc_check = queue_map_alloc_check,
>> +	.map_alloc = queue_map_alloc,
>> +	.map_free = queue_map_free,
>> +	.map_lookup_elem = queue_map_lookup_elem,
>> +	.map_update_elem = queue_map_update_elem,
>> +	.map_delete_elem = queue_map_delete_elem,
>> +	.map_get_next_key = queue_map_get_next_key,
>> +};
>> +
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index a31a1ba0f8ea..7e9a11d69eef 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -622,11 +622,19 @@ static int map_lookup_elem(union bpf_attr *attr)
>>   		err = -EPERM;
>>   		goto err_put;
>>   	}
>> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
>> +		key = memdup_user(ukey, map->key_size);
>> +		if (IS_ERR(key)) {
>> +			err = PTR_ERR(key);
>> +			goto err_put;
>> +		}
>> +	} else {
>> +		if (ukey) {
>> +			err = -EINVAL;
>> +			goto err_put;
>> +		}
> Given you have a fixed key_size of 0, this could probably be made more
> generic and refactored a bit into a helper to check for 0 map->key_size
> instead of map type.

With a new set of helpers the following changes are not needed anymore.
>
>> -	key = memdup_user(ukey, map->key_size);
>> -	if (IS_ERR(key)) {
>> -		err = PTR_ERR(key);
>> -		goto err_put;
>> +		key = NULL;
>>   	}
>>   
>>   	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
>> @@ -709,10 +717,19 @@ static int map_update_elem(union bpf_attr *attr)
>>   		goto err_put;
>>   	}
>>   
>> -	key = memdup_user(ukey, map->key_size);
>> -	if (IS_ERR(key)) {
>> -		err = PTR_ERR(key);
>> -		goto err_put;
>> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
>> +		key = memdup_user(ukey, map->key_size);
>> +		if (IS_ERR(key)) {
>> +			err = PTR_ERR(key);
>> +			goto err_put;
>> +		}
>> +	} else {
>> +		if (ukey) {
>> +			err = -EINVAL;
>> +			goto err_put;
>> +		}
>> +
>> +		key = NULL;
>>   	}
> Ditto, and also further below as well for the other syscall cases.
>
>>   
>>   	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
>> @@ -803,10 +820,19 @@ static int map_delete_elem(union bpf_attr *attr)
>>   		goto err_put;
>>   	}
>>   
>> -	key = memdup_user(ukey, map->key_size);
>> -	if (IS_ERR(key)) {
>> -		err = PTR_ERR(key);
>> -		goto err_put;
>> +	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
>> +		key = memdup_user(ukey, map->key_size);
>> +		if (IS_ERR(key)) {
>> +			err = PTR_ERR(key);
>> +			goto err_put;
>> +		}
>> +	} else {
>> +		if (ukey) {
>> +			err = -EINVAL;
>> +			goto err_put;
>> +		}
>> +
>> +		key = NULL;
>>   	}
>>   
>>   	if (bpf_map_is_dev_bound(map)) {
>> @@ -855,9 +881,14 @@ static int map_get_next_key(union bpf_attr *attr)
>>   	}
>>   
>>   	if (ukey) {
>> -		key = memdup_user(ukey, map->key_size);
>> -		if (IS_ERR(key)) {
>> -			err = PTR_ERR(key);
>> +		if (map->map_type != BPF_MAP_TYPE_QUEUE) {
>> +			key = memdup_user(ukey, map->key_size);
>> +			if (IS_ERR(key)) {
>> +				err = PTR_ERR(key);
>> +				goto err_put;
>> +			}
>> +		} else {
>> +			err = -EINVAL;
>>   			goto err_put;
>>   		}
>>   	} else {
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 25e47c195874..83099a9a21d9 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1976,8 +1976,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>>   		return -EACCES;
>>   	}
>>   
>> -	if (arg_type == ARG_PTR_TO_MAP_KEY ||
>> -	    arg_type == ARG_PTR_TO_MAP_VALUE) {
>> +	if (arg_type == ARG_PTR_TO_MAP_KEY) {
>> +		expected_type = PTR_TO_STACK;
>> +		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
>> +		    type != expected_type && type != SCALAR_VALUE)
>> +			goto err_type;
>> +	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
>>   		expected_type = PTR_TO_STACK;
>>   		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
>>   		    type != expected_type)
>> @@ -2021,6 +2025,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>>   		/* bpf_map_xxx(map_ptr) call: remember that map_ptr */
>>   		meta->map_ptr = reg->map_ptr;
>>   	} else if (arg_type == ARG_PTR_TO_MAP_KEY) {
>> +		bool zero_size_allowed = false;
>>   		/* bpf_map_xxx(..., map_ptr, ..., key) call:
>>   		 * check that [key, key + map->key_size) are within
>>   		 * stack limits and initialized
>> @@ -2034,8 +2039,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>>   			verbose(env, "invalid map_ptr to access map->key\n");
>>   			return -EACCES;
>>   		}
>> +
>> +		if (meta->map_ptr->map_type == BPF_MAP_TYPE_QUEUE)
> Here as well.
>
>> +			zero_size_allowed = true;
> Also, verifier should rather enforce a const NULL key here than allowing one to
> be set (but not as the only 'valid' input option), meaning, I don't think it would
> be good to allow a 'zero' sized pointer to stack in this case.
>
>>   		err = check_helper_mem_access(env, regno,
>> -					      meta->map_ptr->key_size, false,
>> +					      meta->map_ptr->key_size,
>> +					      zero_size_allowed,
>>   					      NULL);
>>   	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
>>   		/* bpf_map_xxx(..., map_ptr, ..., value) call:
>>
>
Mauricio Vasquez Aug. 9, 2018, 2:55 a.m. UTC | #5
On 08/07/2018 08:52 AM, Daniel Borkmann wrote:
> On 08/06/2018 03:58 PM, Mauricio Vasquez B wrote:
>> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
>>
>> It allows to push an element to the queue by using the update operation
>> and to pop an element from the queue by using the lookup operation.
>>
>> A use case for this is to keep track of a pool of elements, like
>> network ports in a SNAT.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> [...]
>> +static int prealloc_init(struct bpf_queue *queue)
>> +{
>> +	u32 node_size = sizeof(struct queue_node) +
>> +			round_up(queue->map.value_size, 8);
>> +	u32 num_entries = queue->map.max_entries;
>> +	int err;
>> +
>> +	queue->nodes = bpf_map_area_alloc(node_size * num_entries,
>> +					  queue->map.numa_node);
> That doesn't work either. If you don't set numa node, then here in
> your case you'll always use numa node 0, which is unintentional.
> You need to get the node via bpf_map_attr_numa_node(attr) helper.
> Same issue in queue_map_update_elem().
>
This should work: map.numa_node is initialized using 
bpf_map_attr_numa_node() in bpf_map_init_from_attr().

The htab does exactly the same.
Mauricio Vasquez Aug. 9, 2018, 3:08 a.m. UTC | #6
On 08/07/2018 09:42 AM, Alexei Starovoitov wrote:
> On Mon, Aug 06, 2018 at 03:58:30PM +0200, Mauricio Vasquez B wrote:
>> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
> queue/stack datastructure would be a great addition.
>
>> It allows to push an element to the queue by using the update operation
>> and to pop an element from the queue by using the lookup operation.
>>
>> A use case for this is to keep track of a pool of elements, like
>> network ports in a SNAT.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf_types.h |    1
>>   include/uapi/linux/bpf.h  |    5 +
>>   kernel/bpf/Makefile       |    2
>>   kernel/bpf/queuemap.c     |  287 +++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c      |   61 +++++++---
>>   kernel/bpf/verifier.c     |   16 ++-
>>   6 files changed, 353 insertions(+), 19 deletions(-)
>>   create mode 100644 kernel/bpf/queuemap.c
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index c5700c2d5549..6c7a62f3fe43 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>>   #endif
>>   #endif
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0ebaaf7f3568..2c171c40eb45 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -120,6 +120,7 @@ enum bpf_map_type {
>>   	BPF_MAP_TYPE_CPUMAP,
>>   	BPF_MAP_TYPE_XSKMAP,
>>   	BPF_MAP_TYPE_SOCKHASH,
>> +	BPF_MAP_TYPE_QUEUE,
>>   };
>>   
>>   enum bpf_prog_type {
>> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>>   /* Flag for stack_map, store build_id+offset instead of pointer */
>>   #define BPF_F_STACK_BUILD_ID	(1U << 5)
>>   
>> +/* Flags for queue_map, type of queue */
>> +#define BPF_F_QUEUE_FIFO	(1U << 16)
>> +#define BPF_F_QUEUE_LIFO	(2U << 16)
> the choice of flags looks odd.
> Why start at bit 16 and why waste two bits?
> It's either stack or queue.
> May be instead define two map_types:
> BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
> that share common implementation?

I like this solution; I'll spin a v2 with two different types of maps.

> And how about adding three new helpers: push/pop/peek as well?
> Reusing lookup/update is neat, but does lookup == pop
> or does lookup == peek ?
> I suspect it will be confusing.
> Three new helpers cost nothing, but will make bpf progs easier to read.
I agree. I have one doubt here: update/lookup/delete are implemented in 
all map types; if the operation is not supported it returns -EINVAL.
For push/pop/peek, should we implement them in all maps, or is it better to 
check the map type before invoking map->ops->push/pop/peek?
(Maybe checking whether map->ops->xxx is NULL would also work.)

> Could you also add a flag for replacement policy?
> In this implementation when max_entries limit is reached
> the map_update_elem (aka push) will fail with e2big.
> It would be useful to allow pushing and dropping elements from
> the other side then such queue/stack can be used to keep
> track of the last N events (the prog would unconditionally push
> and user space would swap the queue for new one via map-in-map
> and drain old queue).
I like it, will do.
> Speaking of map-in-map, please add a test to make sure
> queue/stack works with hash/array_of_maps.

Sure.
> selftests in patch2 needs to have kernel side as well.
> it's not enough to test syscall access only.
>
Will do.
Alexei Starovoitov Aug. 9, 2018, 4:48 a.m. UTC | #7
On Wed, Aug 08, 2018 at 10:08:47PM -0500, Mauricio Vasquez wrote:
> 
> > And how about adding three new helpers: push/pop/peek as well?
> > Reusing lookup/update is neat, but does lookup == pop
> > or does lookup == peek ?
> > I suspect it will be confusing.
> > Three new helpers cost nothing, but will make bpf progs easier to read.
> I agree. I have one doubt here, update/lookup/delete is implemented in all
> map types, if the operation is not supported it returns -EINVAL.
> For push/pop/peek, should we implement it in all maps or is it better to
> check the map type before invoking map->ops->push/pop/seek?
> (Maybe checking if map->ops->xxx is NULL will also work)

Since push/pop/peek are only for this queue/stack I thought we won't
be adding 'ops' for them and will just call the helpers from progs.
But now I'm having second thoughts, since 3 new commands for the syscall
feel like overkill.
At the same time I still don't feel that lookup == pop is the right alias.
Also, what is peek going to alias to?
Maybe let's go back to your original idea with a tweak:
push == update
peek == lookup
pop == lookup + delete
In other words, in the case of a stack, bpf_map_lookup_elem will return
the pointer to the value of the top element in the stack and
bpf_map_delete_elem will delete that top element.
Then in user space we can introduce push/pop always_inline functions like:
void bpf_push(struct bpf_map *map, void *value)
{
  bpf_map_update_elem(map, NULL/*key*/, value);
}

void *bpf_pop(struct bpf_map *map)
{
  void * val = bpf_map_lookup_elem(map, NULL/*key*/);
  bpf_map_delete_elem(map, NULL/*key*/);
  return val;
}

Thoughts?
Daniel Borkmann Aug. 9, 2018, 9:02 a.m. UTC | #8
On 08/09/2018 06:48 AM, Alexei Starovoitov wrote:
> On Wed, Aug 08, 2018 at 10:08:47PM -0500, Mauricio Vasquez wrote:
>>
>>> And how about adding three new helpers: push/pop/peek as well?
>>> Reusing lookup/update is neat, but does lookup == pop
>>> or does lookup == peek ?
>>> I suspect it will be confusing.
>>> Three new helpers cost nothing, but will make bpf progs easier to read.
>> I agree. I have one doubt here, update/lookup/delete is implemented in all
>> map types, if the operation is not supported it returns -EINVAL.
>> For push/pop/peek, should we implement it in all maps or is it better to
>> check the map type before invoking map->ops->push/pop/seek?
>> (Maybe checking if map->ops->xxx is NULL will also work)
> 
> Since push/pop/peak are only for this queue/stack I thought we won't
> be adding 'ops' for them and just call the helpers from progs.
> But now I'm having second thoughts, since 3 new commands for syscall
> feels like overkill.
> At the same time I still don't feel that lookup == pop is the right alias.
> Also what peak is going to alias to ?
> May be let's go back to your original idea with a tweak:
> push == update
> peak == lookup
> pop = lookup + delete
> In other words in case of stack the bpf_map_lookup_elem will return
> the pointer to value of top element in the stack and
> bpf_map_delete_elem will delete that top element.
> Then in user space we can introduce push/pop always_inline functions like:
> void bpf_push(struct bpf_map *map, void *value)
> {
>   bpf_map_update_elem(map, NULL/*key*/, value);
> }
> 
> void *bpf_pop(struct bpf_map *map)
> {
>   void * val = bpf_map_lookup_elem(map, NULL/*key*/);
>   bpf_map_delete_elem(map, NULL/*key*/);
>   return val;
> }
> 
> Thoughts?

I actually think that having new push/peek/pop BPF map helpers would
be fine, as well as having them sit in map->ops. Initially I'd probably
leave out the syscall ops counterparts so they can be used only from a
program.

Potentially array and per-cpu array could implement them even by having
an internal pointer to the current slot which we move on push/pop. This
would potentially also avoid all the RCU callbacks to free the elements,
elem allocations, list locking, and improve cache locality compared to
the current implementation.
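
A minimal sketch of that idea for a plain (non-per-cpu) array backing, just
to illustrate the moving-index approach; this is not a proposed
implementation, and a lock (or the per-cpu split) is still needed around the
index update:

/* Illustration only: preallocated array of slots plus a head index. */
struct bpf_stack {
	struct bpf_map map;
	raw_spinlock_t lock;
	u32 head;		/* number of elements currently stored */
	char elements[];	/* max_entries * round_up(value_size, 8) */
};

static void *stack_peek(struct bpf_stack *s)
{
	if (!s->head)
		return NULL;
	return s->elements + (s->head - 1) * round_up(s->map.value_size, 8);
}

static int stack_push(struct bpf_stack *s, void *value)
{
	unsigned long flags;
	int err = -E2BIG;

	raw_spin_lock_irqsave(&s->lock, flags);
	if (s->head < s->map.max_entries) {
		memcpy(s->elements + s->head * round_up(s->map.value_size, 8),
		       value, s->map.value_size);
		s->head++;
		err = 0;
	}
	raw_spin_unlock_irqrestore(&s->lock, flags);
	return err;
}

A pop would decrement head under the same lock; returning a pointer into the
array then raises an element-lifetime question, since another CPU's push
could overwrite that slot.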

Agree that the existing ops are not the right alias, but deferring to user
space as an inline function also doesn't really seem like a good fit, imho,
so I'd rather prefer to have something native. (Aside from that, the
above inline bpf_pop() would also race between CPUs.)

Thanks,
Daniel
Mauricio Vasquez Aug. 9, 2018, 2:51 p.m. UTC | #9
On 08/09/2018 04:02 AM, Daniel Borkmann wrote:
> On 08/09/2018 06:48 AM, Alexei Starovoitov wrote:
>> On Wed, Aug 08, 2018 at 10:08:47PM -0500, Mauricio Vasquez wrote:
>>>> And how about adding three new helpers: push/pop/peek as well?
>>>> Reusing lookup/update is neat, but does lookup == pop
>>>> or does lookup == peek ?
>>>> I suspect it will be confusing.
>>>> Three new helpers cost nothing, but will make bpf progs easier to read.
>>> I agree. I have one doubt here, update/lookup/delete is implemented in all
>>> map types, if the operation is not supported it returns -EINVAL.
>>> For push/pop/peek, should we implement it in all maps or is it better to
>>> check the map type before invoking map->ops->push/pop/seek?
>>> (Maybe checking if map->ops->xxx is NULL will also work)
>> Since push/pop/peak are only for this queue/stack I thought we won't
>> be adding 'ops' for them and just call the helpers from progs.
>> But now I'm having second thoughts, since 3 new commands for syscall
>> feels like overkill.
>> At the same time I still don't feel that lookup == pop is the right alias.
>> Also what peak is going to alias to ?
>> May be let's go back to your original idea with a tweak:
>> push == update
>> peak == lookup
>> pop = lookup + delete
>> In other words in case of stack the bpf_map_lookup_elem will return
>> the pointer to value of top element in the stack and
>> bpf_map_delete_elem will delete that top element.
>> Then in user space we can introduce push/pop always_inline functions like:
>> void bpf_push(struct bpf_map *map, void *value)
>> {
>>    bpf_map_update_elem(map, NULL/*key*/, value);
>> }
>>
>> void *bpf_pop(struct bpf_map *map)
>> {
>>    void * val = bpf_map_lookup_elem(map, NULL/*key*/);
>>    bpf_map_delete_elem(map, NULL/*key*/);
>>    return val;
>> }
>>
>> Thoughts?
> I actually think that having new push/peak/pop BPF map helpers would
> be fine, as well as having them sit in map->ops. Initially I'd probably
> leave out the syscall ops counterparts so they can be used only from
> program.

I also think that having the new helpers is good. I don't have a strong 
opinion about saving them in map->ops or not; in any case it has to be 
verified that push/pop/peek are only invoked on stack/queue maps. I 
think we could enforce it in the verifier; would that be ok?

> Potentially array and per-cpu array could implement them even by having
> an internal pointer to the current slot which we move on push/pop. This
> would potentially also avoid all the RCU callbacks to free the elements,
> elem allocations, list locking, and improve cache locality compared to
> the current implementation.

I could change the implementation: a linked list is used when 
BPF_F_NO_PREALLOC is passed, otherwise a plain array with internal 
indexes is used.
I have a question about RCU and a possible array-based implementation. 
It should be guaranteed that a pointer from a pop() operation is valid 
for the whole program duration.
If an eBPF program A calls pop() it will get a pointer to the head in 
the array, and the head index will move; it is possible that a second 
program B (running on a different CPU) pushes a new element, overwriting 
the element pointed to by A.
I think we need to use a more sophisticated mechanism than just an array 
and two indexes. Am I missing something?

> Agree that existing ops are not the right alias, but deferring to user
> space as inline function also doesn't really seem like a good fit, imho,
> so I'd prefer rather to have something native. (Aside from that, the
> above inline bpf_pop() would also race between CPUs.)

I think we should have push/pop/peek syscalls as well; having a 
bpf_pop() that is race prone would create problems. Users expect map 
operations to be safe, so having one that is not will confuse them.
> Thanks,
> Daniel
>

Thanks,
Mauricio.
Alexei Starovoitov Aug. 9, 2018, 4:23 p.m. UTC | #10
On Thu, Aug 09, 2018 at 09:51:49AM -0500, Mauricio Vasquez wrote:
> 
> > Agree that existing ops are not the right alias, but deferring to user
> > space as inline function also doesn't really seem like a good fit, imho,
> > so I'd prefer rather to have something native. (Aside from that, the
> > above inline bpf_pop() would also race between CPUs.)
> 
> I think we should have push/pop/peek syscalls as well, having a bpf_pop()
> that is race prone would create problems. Users expected maps operations to
> be safe, so having one that is not will confuse them.

Agree the races are not acceptable.
How about a mixed solution:
- introduce bpf_push/pop/peek helpers that programs will use, so
  they don't need to pass a useless key=NULL
- introduce map->ops->lookup_and_delete and map->ops->lookup_or_init
  that prog-side helpers can use and that the syscall has a 1:1 mapping for

A native lookup_or_init() helper for programs and syscall is badly missing.
Most of the bcc scripts use it, and bcc has a racy workaround.
Similarly, a lookup_and_delete() syscall is 1:1 to pop() for stack/queue
and useful for regular hash maps.

In the end, for the stack/queue map the programs will use:
int bpf_push(map, value);
value_or_null = bpf_pop(map); // guaranteed non-racy for multi-cpu
value_or_null = bpf_peek(map); // racy if 2+ cpus doing it

from syscall:
bpf_map_lookup_elem(map, NULL, &value); // returns top of stack
bpf_map_lookup_and_delete_elem(map, NULL, &value); // returns top and deletes top atomically
bpf_map_update_elem(map, NULL, &value); // pushes new value into stack atomically
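
As a rough sketch of how those syscall-side operations could be wrapped from
user space (libbpf-style fd calls; the lookup-and-delete wrapper simply
mirrors the proposed new command, so its availability here is an assumption):

/* Hypothetical user-space view of the proposed queue/stack operations. */
int push_from_user(int map_fd, const void *value)
{
	/* push == update with a NULL key */
	return bpf_map_update_elem(map_fd, NULL, value, 0);
}

int peek_from_user(int map_fd, void *value)
{
	/* peek == plain lookup of the top/front element */
	return bpf_map_lookup_elem(map_fd, NULL, value);
}

int pop_from_user(int map_fd, void *value)
{
	/* pop == atomic lookup-and-delete, the new syscall command */
	return bpf_map_lookup_and_delete_elem(map_fd, NULL, value);
}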

Eventually hash and other maps will implement bpf_map_lookup_and_delete()
for both bpf progs and syscall.

The main point is that the prog-side API doesn't have to match 1:1 with the
syscall side, since they're different enough already.
For example, lookup_or_init() is badly needed for programs, but unnecessary for the syscall.

Thoughts?
Mauricio Vasquez Aug. 9, 2018, 11:41 p.m. UTC | #11
On 08/09/2018 11:23 AM, Alexei Starovoitov wrote:
> On Thu, Aug 09, 2018 at 09:51:49AM -0500, Mauricio Vasquez wrote:
>>> Agree that existing ops are not the right alias, but deferring to user
>>> space as inline function also doesn't really seem like a good fit, imho,
>>> so I'd prefer rather to have something native. (Aside from that, the
>>> above inline bpf_pop() would also race between CPUs.)
>> I think we should have push/pop/peek syscalls as well, having a bpf_pop()
>> that is race prone would create problems. Users expected maps operations to
>> be safe, so having one that is not will confuse them.
> agree the races are not acceptable.
> How about a mixed solution:
> - introduce bpf_push/pop/peak helpers that programs will use, so
>    they don't need to pass useless key=NULL
> - introduce map->ops->lookup_and_delete and map->ops->lookup_or_init
>    that prog-side helpers can use and syscall has 1-1 mapping for
I think it is a fair solution.
> Native lookup_or_init() helper for programs and syscall is badly missing.
> Most of the bcc scripts use it and bcc has a racy workaround.
> Similarly lookup_and_delete() syscall is 1-1 to pop() for stack/queue
> and useful for regular hash maps.
>
> At the end for stack/queue map the programs will use:
> int bpf_push(map, value);

Also flags should be passed here.

> value_or_null = bpf_pop(map); // guaranteed non-racy for multi-cpu
> value_or_null = bpf_peak(map); // racy if 2+ cpus doing it
Is there any reason for it to be racy?

>
> from syscall:
> bpf_map_lookup_elem(map, NULL, &value); // returns top of stack
> bpf_map_lookup_and_delete_elem(map, NULL, &value); // returns top and deletes top atomically
> bpf_map_update_elem(map, NULL, &value); // pushes new value into stack atomically
>
> Eventually hash and other maps will implement bpf_map_lookup_and_delete()
> for both bpf progs and syscall.
>
> The main point that prog-side api doesn't have to match 1-1 to syscall-side,
> since they're different enough already.
> Like lookup_or_init() is badly needed for programs, but unnecessary for syscall.
>
> Thoughts?
>
I agree with the idea; if there are no more thoughts on this I'll 
proceed to the implementation.
Alexei Starovoitov Aug. 10, 2018, 3:09 a.m. UTC | #12
On Thu, Aug 09, 2018 at 06:41:54PM -0500, Mauricio Vasquez wrote:
> 
> 
> On 08/09/2018 11:23 AM, Alexei Starovoitov wrote:
> > On Thu, Aug 09, 2018 at 09:51:49AM -0500, Mauricio Vasquez wrote:
> > > > Agree that existing ops are not the right alias, but deferring to user
> > > > space as inline function also doesn't really seem like a good fit, imho,
> > > > so I'd prefer rather to have something native. (Aside from that, the
> > > > above inline bpf_pop() would also race between CPUs.)
> > > I think we should have push/pop/peek syscalls as well, having a bpf_pop()
> > > that is race prone would create problems. Users expected maps operations to
> > > be safe, so having one that is not will confuse them.
> > agree the races are not acceptable.
> > How about a mixed solution:
> > - introduce bpf_push/pop/peak helpers that programs will use, so
> >    they don't need to pass useless key=NULL
> > - introduce map->ops->lookup_and_delete and map->ops->lookup_or_init
> >    that prog-side helpers can use and syscall has 1-1 mapping for
> I think if is a fair solution.
> > Native lookup_or_init() helper for programs and syscall is badly missing.
> > Most of the bcc scripts use it and bcc has a racy workaround.
> > Similarly lookup_and_delete() syscall is 1-1 to pop() for stack/queue
> > and useful for regular hash maps.
> > 
> > At the end for stack/queue map the programs will use:
> > int bpf_push(map, value);
> 
> Also flags should be passed here.

Good point about flags. Indeed, flags should be there for
extensibility, but I cannot think of a use case for them at the moment.
The hash map's exist/noexist flags don't apply here.
It would be good to have at least one bit in use from the start.

> 
> > value_or_null = bpf_pop(map); // guaranteed non-racy for multi-cpu
> > value_or_null = bpf_peak(map); // racy if 2+ cpus doing it
> Is there any reason for it to be racy?

Because two CPUs will be looking at the same element?
Whether the implementation is array-based or a linked list with RCU,
the race is still there.

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2d5549..6c7a62f3fe43 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -58,3 +58,4 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0ebaaf7f3568..2c171c40eb45 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,6 +120,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_CPUMAP,
 	BPF_MAP_TYPE_XSKMAP,
 	BPF_MAP_TYPE_SOCKHASH,
+	BPF_MAP_TYPE_QUEUE,
 };
 
 enum bpf_prog_type {
@@ -255,6 +256,10 @@  enum bpf_attach_type {
 /* Flag for stack_map, store build_id+offset instead of pointer */
 #define BPF_F_STACK_BUILD_ID	(1U << 5)
 
+/* Flags for queue_map, type of queue */
+#define BPF_F_QUEUE_FIFO	(1U << 16)
+#define BPF_F_QUEUE_LIFO	(2U << 16)
+
 enum bpf_stack_build_id_status {
 	/* user space need an empty entry to identify end of a trace */
 	BPF_STACK_BUILD_ID_EMPTY = 0,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f27f5496d6fe..30f02ef66635 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,7 +2,7 @@ 
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
new file mode 100644
index 000000000000..ab30af43b4cc
--- /dev/null
+++ b/kernel/bpf/queuemap.c
@@ -0,0 +1,287 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * queuemap.c: BPF queue map
+ *
+ * Copyright (c) 2018 Politecnico di Torino
+ */
+#include <linux/bpf.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include "percpu_freelist.h"
+
+#define QUEUE_CREATE_FLAG_MASK \
+	(BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
+	 BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
+
+enum queue_type {
+	QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
+	QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
+};
+
+struct bpf_queue {
+	struct bpf_map map;
+	struct list_head head;
+	struct pcpu_freelist freelist;
+	void *nodes;
+	enum queue_type type;
+	raw_spinlock_t lock;
+	atomic_t count;
+	u32 node_size;
+};
+
+struct queue_node {
+	struct pcpu_freelist_node fnode;
+	struct bpf_queue *queue;
+	struct list_head list;
+	struct rcu_head rcu;
+	char element[0] __aligned(8);
+};
+
+static bool queue_map_is_prealloc(struct bpf_queue *queue)
+{
+	return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
+}
+
+/* Called from syscall */
+static int queue_map_alloc_check(union bpf_attr *attr)
+{
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 0 ||
+	    attr->value_size == 0 || attr->map_flags & ~QUEUE_CREATE_FLAG_MASK)
+		return -EINVAL;
+
+	if ((attr->map_flags >> 16) != QUEUE_FIFO &&
+	    (attr->map_flags >> 16) != QUEUE_LIFO) {
+		return -EINVAL;
+	}
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		/* if value_size is bigger, the user space won't be able to
+		 * access the elements.
+		 */
+		return -E2BIG;
+
+	return 0;
+}
+
+static int prealloc_init(struct bpf_queue *queue)
+{
+	u32 node_size = sizeof(struct queue_node) +
+			round_up(queue->map.value_size, 8);
+	u32 num_entries = queue->map.max_entries;
+	int err;
+
+	queue->nodes = bpf_map_area_alloc(node_size * num_entries,
+					  queue->map.numa_node);
+	if (!queue->nodes)
+		return -ENOMEM;
+
+	err = pcpu_freelist_init(&queue->freelist);
+	if (err)
+		goto free_nodes;
+
+	pcpu_freelist_populate(&queue->freelist,
+			       queue->nodes +
+			       offsetof(struct queue_node, fnode),
+			       node_size, num_entries);
+
+	return 0;
+
+free_nodes:
+	bpf_map_area_free(queue->nodes);
+	return err;
+}
+
+static void prealloc_destroy(struct bpf_queue *queue)
+{
+	bpf_map_area_free(queue->nodes);
+	pcpu_freelist_destroy(&queue->freelist);
+}
+
+static struct bpf_map *queue_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_queue *queue;
+	u64 cost = sizeof(*queue);
+	int ret;
+
+	queue = kzalloc(sizeof(*queue), GFP_USER);
+	if (!queue)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&queue->map, attr);
+
+	queue->node_size = sizeof(struct queue_node) +
+			   round_up(attr->value_size, 8);
+	cost += (u64) attr->max_entries * queue->node_size;
+	if (cost >= U32_MAX - PAGE_SIZE) {
+		ret = -E2BIG;
+		goto free_queue;
+	}
+
+	queue->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	ret = bpf_map_precharge_memlock(queue->map.pages);
+	if (ret)
+		goto free_queue;
+
+	INIT_LIST_HEAD(&queue->head);
+
+	raw_spin_lock_init(&queue->lock);
+
+	queue->type = attr->map_flags >> 16;
+
+	if (queue_map_is_prealloc(queue)) {
+		ret = prealloc_init(queue);
+		if (ret)
+			goto free_queue;
+	}
+
+	return &queue->map;
+
+free_queue:
+	kfree(queue);
+	return ERR_PTR(ret);
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void queue_map_free(struct bpf_map *map)
+{
+	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+	struct queue_node *l;
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete
+	 */
+	synchronize_rcu();
+
+	/* some of queue_elem_free_rcu() callbacks for elements of this map may
+	 * not have executed. Wait for them.
+	 */
+	rcu_barrier();
+	if (!queue_map_is_prealloc(queue))
+		list_for_each_entry_rcu(l, &queue->head, list) {
+			list_del_rcu(&l->list);
+			kfree(l);
+		}
+	else
+		prealloc_destroy(queue);
+	kfree(queue);
+}
+
+static void queue_elem_free_rcu(struct rcu_head *head)
+{
+	struct queue_node *l = container_of(head, struct queue_node, rcu);
+	struct bpf_queue *queue = l->queue;
+
+	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
+	 * we're calling kfree, otherwise deadlock is possible if kprobes
+	 * are placed somewhere inside of slub
+	 */
+	preempt_disable();
+	__this_cpu_inc(bpf_prog_active);
+	if (queue_map_is_prealloc(queue))
+		pcpu_freelist_push(&queue->freelist, &l->fnode);
+	else
+		kfree(l);
+	__this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+	unsigned long flags;
+	struct queue_node *node;
+
+	raw_spin_lock_irqsave(&queue->lock, flags);
+
+	node = list_first_or_null_rcu(&queue->head, struct queue_node, list);
+	if (!node) {
+		raw_spin_unlock_irqrestore(&queue->lock, flags);
+		return NULL;
+	}
+
+	if (!queue_map_is_prealloc(queue))
+		atomic_dec(&queue->count);
+
+	list_del_rcu(&node->list);
+	call_rcu(&node->rcu, queue_elem_free_rcu);
+
+	raw_spin_unlock_irqrestore(&queue->lock, flags);
+
+	return &node->element;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_map_update_elem(struct bpf_map *map, void *key, void *value,
+				 u64 map_flags)
+{
+	struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+	unsigned long flags;
+	struct queue_node *new;
+
+	if (!queue_map_is_prealloc(queue)) {
+		if (atomic_inc_return(&queue->count) > queue->map.max_entries) {
+			atomic_dec(&queue->count);
+			return -E2BIG;
+		}
+
+		new = kmalloc_node(queue->node_size, GFP_ATOMIC | __GFP_NOWARN,
+				   queue->map.numa_node);
+		if (!new) {
+			atomic_dec(&queue->count);
+			return -ENOMEM;
+		}
+	} else {
+		struct pcpu_freelist_node *l;
+
+		l = pcpu_freelist_pop(&queue->freelist);
+		if (!l)
+			return -E2BIG;
+		new = container_of(l, struct queue_node, fnode);
+	}
+
+	memcpy(new->element, value, queue->map.value_size);
+	new->queue = queue;
+
+	raw_spin_lock_irqsave(&queue->lock, flags);
+	switch (queue->type) {
+	case QUEUE_FIFO:
+		list_add_tail_rcu(&new->list, &queue->head);
+		break;
+
+	case QUEUE_LIFO:
+		list_add_rcu(&new->list, &queue->head);
+		break;
+	}
+
+	raw_spin_unlock_irqrestore(&queue->lock, flags);
+
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+/* Called from syscall */
+static int queue_map_get_next_key(struct bpf_map *map, void *key,
+				  void *next_key)
+{
+	return -EINVAL;
+}
+
+const struct bpf_map_ops queue_map_ops = {
+	.map_alloc_check = queue_map_alloc_check,
+	.map_alloc = queue_map_alloc,
+	.map_free = queue_map_free,
+	.map_lookup_elem = queue_map_lookup_elem,
+	.map_update_elem = queue_map_update_elem,
+	.map_delete_elem = queue_map_delete_elem,
+	.map_get_next_key = queue_map_get_next_key,
+};
+
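
To make the push/pop convention concrete from the program side, here is a minimal, hypothetical sketch: push is bpf_map_update_elem() with a NULL key and pop is bpf_map_lookup_elem() with a NULL key, which the verifier changes further below are intended to allow. The map name, section names and the SNAT-flavoured use are illustrative only and assume the legacy bpf_map_def/bpf_helpers.h loader conventions plus headers with this patch applied.

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* assumed: the usual samples/selftests helper header */

struct bpf_map_def SEC("maps") port_pool = {
	.type        = BPF_MAP_TYPE_QUEUE,
	.key_size    = 0,		/* queue maps are created with key_size == 0 */
	.value_size  = sizeof(__u32),
	.max_entries = 1024,
	.map_flags   = BPF_F_QUEUE_FIFO,
};

SEC("xdp")
int alloc_snat_port(struct xdp_md *ctx)
{
	__u32 *port, p;

	/* pop: lookup with a NULL key removes and returns the head element */
	port = bpf_map_lookup_elem(&port_pool, NULL);
	if (!port)
		return XDP_DROP;	/* pool exhausted */

	p = *port;
	/* ... use p as the translated source port ... */

	/* push: update with a NULL key inserts the value (at the tail for FIFO) */
	bpf_map_update_elem(&port_pool, NULL, &p, 0);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";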
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a31a1ba0f8ea..7e9a11d69eef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -622,11 +622,19 @@  static int map_lookup_elem(union bpf_attr *attr)
 		err = -EPERM;
 		goto err_put;
 	}
+	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
+		key = memdup_user(ukey, map->key_size);
+		if (IS_ERR(key)) {
+			err = PTR_ERR(key);
+			goto err_put;
+		}
+	} else {
+		if (ukey) {
+			err = -EINVAL;
+			goto err_put;
+		}
 
-	key = memdup_user(ukey, map->key_size);
-	if (IS_ERR(key)) {
-		err = PTR_ERR(key);
-		goto err_put;
+		key = NULL;
 	}
 
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -709,10 +717,19 @@  static int map_update_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = memdup_user(ukey, map->key_size);
-	if (IS_ERR(key)) {
-		err = PTR_ERR(key);
-		goto err_put;
+	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
+		key = memdup_user(ukey, map->key_size);
+		if (IS_ERR(key)) {
+			err = PTR_ERR(key);
+			goto err_put;
+		}
+	} else {
+		if (ukey) {
+			err = -EINVAL;
+			goto err_put;
+		}
+
+		key = NULL;
 	}
 
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -803,10 +820,19 @@  static int map_delete_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = memdup_user(ukey, map->key_size);
-	if (IS_ERR(key)) {
-		err = PTR_ERR(key);
-		goto err_put;
+	if (map->map_type != BPF_MAP_TYPE_QUEUE) {
+		key = memdup_user(ukey, map->key_size);
+		if (IS_ERR(key)) {
+			err = PTR_ERR(key);
+			goto err_put;
+		}
+	} else {
+		if (ukey) {
+			err = -EINVAL;
+			goto err_put;
+		}
+
+		key = NULL;
 	}
 
 	if (bpf_map_is_dev_bound(map)) {
@@ -855,9 +881,14 @@  static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	if (ukey) {
-		key = memdup_user(ukey, map->key_size);
-		if (IS_ERR(key)) {
-			err = PTR_ERR(key);
+		if (map->map_type != BPF_MAP_TYPE_QUEUE) {
+			key = memdup_user(ukey, map->key_size);
+			if (IS_ERR(key)) {
+				err = PTR_ERR(key);
+				goto err_put;
+			}
+		} else {
+			err = -EINVAL;
 			goto err_put;
 		}
 	} else {
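
From user space the same convention follows from the syscall changes above: the map is created with key_size == 0, an update with a NULL key pushes a value, a lookup with a NULL key pops one, and a non-NULL key is rejected with -EINVAL. A minimal, hypothetical snippet against the libbpf syscall wrappers of the time (include path and wrapper names assumed, headers with this patch applied) could look like:

#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>	/* assumed path for the bpf_create_map()/bpf_map_*_elem() wrappers */

int main(void)
{
	__u32 val = 4040, out = 0;
	int fd;

	/* key_size must be 0 for BPF_MAP_TYPE_QUEUE */
	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), 1024,
			    BPF_F_QUEUE_FIFO);
	if (fd < 0)
		return 1;

	if (bpf_map_update_elem(fd, NULL, &val, 0))	/* push */
		return 1;

	if (bpf_map_lookup_elem(fd, NULL, &out))	/* pop: removes the element */
		return 1;

	printf("popped %u\n", out);
	return 0;
}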
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 25e47c195874..83099a9a21d9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1976,8 +1976,12 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		return -EACCES;
 	}
 
-	if (arg_type == ARG_PTR_TO_MAP_KEY ||
-	    arg_type == ARG_PTR_TO_MAP_VALUE) {
+	if (arg_type == ARG_PTR_TO_MAP_KEY) {
+		expected_type = PTR_TO_STACK;
+		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
+		    type != expected_type && type != SCALAR_VALUE)
+			goto err_type;
+	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
 		expected_type = PTR_TO_STACK;
 		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
 		    type != expected_type)
@@ -2021,6 +2025,7 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		/* bpf_map_xxx(map_ptr) call: remember that map_ptr */
 		meta->map_ptr = reg->map_ptr;
 	} else if (arg_type == ARG_PTR_TO_MAP_KEY) {
+		bool zero_size_allowed = false;
 		/* bpf_map_xxx(..., map_ptr, ..., key) call:
 		 * check that [key, key + map->key_size) are within
 		 * stack limits and initialized
@@ -2034,8 +2039,13 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose(env, "invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
+
+		if (meta->map_ptr->map_type == BPF_MAP_TYPE_QUEUE)
+			zero_size_allowed = true;
+
 		err = check_helper_mem_access(env, regno,
-					      meta->map_ptr->key_size, false,
+					      meta->map_ptr->key_size,
+					      zero_size_allowed,
 					      NULL);
 	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
 		/* bpf_map_xxx(..., map_ptr, ..., value) call: