mbox series

[RFC,bpf-next,v2,0/4] Implement bpf queue/stack maps

Message ID 153575074884.30050.17670029209466860207.stgit@kernel
Headers show
Series Implement bpf queue/stack maps | expand

Message

Mauricio Vasquez Aug. 31, 2018, 9:25 p.m. UTC
In some applications this is needed have a pool of free elements, like for
example the list of free L4 ports in a SNAT.  None of the current maps allow
to do it as it is not possibleto get an any element without having they key
it is associated to.

This patchset implements two new kind of eBPF maps: queue and stack.
Those maps provide to eBPF programs the peek, push and pop operations, and for
userspace applications a new bpf_map_lookup_and_delete_elem() is added.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>

---

I am sending this as an RFC because there is still an issue I am not sure how
to solve.

The queue/stack maps have a linked list for saving the nodes, and a
preallocation schema based on the pcpu_freelist already implemented and used
in the htabmap.  Each time an element is pushed into the map, a node from the
pcpu_freelist is taken and then added to the linked list.

The pop operation takes and *removes* the first node from the linked list, then
it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
to the pcpu_freelist when the rcu callback is executed.  This is needed because
an element returned by the pop() operation should remain valid for the whole
duration of the eBPF program.

The problem is that elements are not immediately returned to the free list, so
in some cases the push operation could fail because there are not free nodes
in the pcpu_freelist.

The following code snippet exposes that problem.

...
	/* Push MAP_SIZE elements */
	for (i = 0; i < MAP_SIZE; i++)
		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);

	/* Pop all elements */
	for (i = 0; i < MAP_SIZE; i++)
		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
		       val == vals[i]);

  // sleep(1) <-- If I put this sleep, everything works.
	/* Push MAP_SIZE elements */
	for (i = 0; i < MAP_SIZE; i++)
		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
           ^^^
           This fails because there are not available elements in pcpu_freelist
...

I think a possible solution is to oversize the pcpu_freelist (no idea by how
much, maybe double or, or make it 1.5 time the max elements in the map?)
I also have concerns about it, it would waste that memory in many cases and
this is also probably that it doesn't solve the issue because that code snippet
is puhsing and popping elements too fast, so even if the pcpu_freelist is much
large a certain time instant all the elements could be used.

Is this really an important issue?
Any idea of how to solve it?

Thanks,
---

Mauricio Vasquez B (4):
      bpf: add bpf queue and stack maps
      bpf: restrict use of peek/push/pop
      Sync uapi/bpf.h to tools/include
      selftests/bpf: add test cases for queue and stack maps


 include/linux/bpf.h                                |    8 
 include/linux/bpf_types.h                          |    2 
 include/uapi/linux/bpf.h                           |   36 ++
 kernel/bpf/Makefile                                |    2 
 kernel/bpf/helpers.c                               |   44 ++
 kernel/bpf/queue_stack_maps.c                      |  353 ++++++++++++++++++++
 kernel/bpf/syscall.c                               |   96 +++++
 kernel/bpf/verifier.c                              |   20 +
 net/core/filter.c                                  |    6 
 tools/include/uapi/linux/bpf.h                     |   36 ++
 tools/lib/bpf/bpf.c                                |   12 +
 tools/lib/bpf/bpf.h                                |    1 
 tools/testing/selftests/bpf/Makefile               |    2 
 tools/testing/selftests/bpf/bpf_helpers.h          |    7 
 tools/testing/selftests/bpf/test_maps.c            |  101 ++++++
 tools/testing/selftests/bpf/test_progs.c           |   99 ++++++
 tools/testing/selftests/bpf/test_queue_map.c       |    4 
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 +++
 tools/testing/selftests/bpf/test_stack_map.c       |    4 
 19 files changed, 877 insertions(+), 15 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

--

Comments

Alexei Starovoitov Sept. 7, 2018, 12:13 a.m. UTC | #1
On Fri, Aug 31, 2018 at 11:25:48PM +0200, Mauricio Vasquez B wrote:
> In some applications this is needed have a pool of free elements, like for
> example the list of free L4 ports in a SNAT.  None of the current maps allow
> to do it as it is not possibleto get an any element without having they key
> it is associated to.
> 
> This patchset implements two new kind of eBPF maps: queue and stack.
> Those maps provide to eBPF programs the peek, push and pop operations, and for
> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> 
> ---
> 
> I am sending this as an RFC because there is still an issue I am not sure how
> to solve.
> 
> The queue/stack maps have a linked list for saving the nodes, and a
> preallocation schema based on the pcpu_freelist already implemented and used
> in the htabmap.  Each time an element is pushed into the map, a node from the
> pcpu_freelist is taken and then added to the linked list.
> 
> The pop operation takes and *removes* the first node from the linked list, then
> it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
> to the pcpu_freelist when the rcu callback is executed.  This is needed because
> an element returned by the pop() operation should remain valid for the whole
> duration of the eBPF program.
> 
> The problem is that elements are not immediately returned to the free list, so
> in some cases the push operation could fail because there are not free nodes
> in the pcpu_freelist.
> 
> The following code snippet exposes that problem.
> 
> ...
> 	/* Push MAP_SIZE elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
> 
> 	/* Pop all elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
> 		       val == vals[i]);
> 
>   // sleep(1) <-- If I put this sleep, everything works.
> 	/* Push MAP_SIZE elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>            ^^^
>            This fails because there are not available elements in pcpu_freelist
> ...
> 
> I think a possible solution is to oversize the pcpu_freelist (no idea by how
> much, maybe double or, or make it 1.5 time the max elements in the map?)
> I also have concerns about it, it would waste that memory in many cases and
> this is also probably that it doesn't solve the issue because that code snippet
> is puhsing and popping elements too fast, so even if the pcpu_freelist is much
> large a certain time instant all the elements could be used.
> 
> Is this really an important issue?
> Any idea of how to solve it?

It is important issue indeed and a difficult one to solve.
We have the same issue with hash map.
If the prog is doing:
value = lookup(key);
delete(key);
// here the prog shouldn't be accessing the value anymore, since the memory
// could have been reused, but value pointer is still valid and points to
// allocated memory

bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
validity of value without races.
With pcpu_freelist I don't think there is a solution.
We can have this queue/stack map without prealloc and use kmalloc/kfree
back and forth. Performance will not be as great, but for your use case,
I suspect, it will be good enough.
The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
If somebody starts doing push/pop for every packet, the rcu subsystem
will struggle and nothing we can do about it.

The only way I could think of to resolve this problem is to reuse
the logic that Joe is working on for socket lookups inside the program.
Joe,
how is that going? Could you repost the latest patches?

In such case the api for stack map will look like:

elem = bpf_map_pop_elem(stack);
// access elem
bpf_map_free_elem(elem);
// here prog is not allowed to access elem and verifier will catch that

elem = bpf_map_alloc_elem(stack);
// populate elem
bpf_map_push_elem(elem);
// here prog is not allowed to access elem and verifier will catch that

Then both pre-allocated elems and kmalloc/kfree will work fine
and no unbounded rcu issues in both cases.
Joe Stringer Sept. 7, 2018, 6:27 a.m. UTC | #2
On Thu, 6 Sep 2018 at 17:13, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.
> The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
> If somebody starts doing push/pop for every packet, the rcu subsystem
> will struggle and nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?

I can rebase & send them out. Was just wanting to get a little more testing in.

Cheers,
Joe
Mauricio Vasquez Sept. 7, 2018, 8:40 p.m. UTC | #3
On 09/06/2018 07:13 PM, Alexei Starovoitov wrote:
> On Fri, Aug 31, 2018 at 11:25:48PM +0200, Mauricio Vasquez B wrote:
>> In some applications this is needed have a pool of free elements, like for
>> example the list of free L4 ports in a SNAT.  None of the current maps allow
>> to do it as it is not possibleto get an any element without having they key
>> it is associated to.
>>
>> This patchset implements two new kind of eBPF maps: queue and stack.
>> Those maps provide to eBPF programs the peek, push and pop operations, and for
>> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>>
>> ---
>>
>> I am sending this as an RFC because there is still an issue I am not sure how
>> to solve.
>>
>> The queue/stack maps have a linked list for saving the nodes, and a
>> preallocation schema based on the pcpu_freelist already implemented and used
>> in the htabmap.  Each time an element is pushed into the map, a node from the
>> pcpu_freelist is taken and then added to the linked list.
>>
>> The pop operation takes and *removes* the first node from the linked list, then
>> it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
>> to the pcpu_freelist when the rcu callback is executed.  This is needed because
>> an element returned by the pop() operation should remain valid for the whole
>> duration of the eBPF program.
>>
>> The problem is that elements are not immediately returned to the free list, so
>> in some cases the push operation could fail because there are not free nodes
>> in the pcpu_freelist.
>>
>> The following code snippet exposes that problem.
>>
>> ...
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>>
>> 	/* Pop all elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> 		       val == vals[i]);
>>
>>    // sleep(1) <-- If I put this sleep, everything works.
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>>             ^^^
>>             This fails because there are not available elements in pcpu_freelist
>> ...
>>
>> I think a possible solution is to oversize the pcpu_freelist (no idea by how
>> much, maybe double or, or make it 1.5 time the max elements in the map?)
>> I also have concerns about it, it would waste that memory in many cases and
>> this is also probably that it doesn't solve the issue because that code snippet
>> is puhsing and popping elements too fast, so even if the pcpu_freelist is much
>> large a certain time instant all the elements could be used.
>>
>> Is this really an important issue?
>> Any idea of how to solve it?
> It is important issue indeed and a difficult one to solve.
> We have the same issue with hash map.
> If the prog is doing:
> value = lookup(key);
> delete(key);
> // here the prog shouldn't be accessing the value anymore, since the memory
> // could have been reused, but value pointer is still valid and points to
> // allocated memory
Just to notice that for the queue map it is a little bit worse because 
there isn't a way to mark an element to be reused, hence in some cases 
the pool of free elements could be exhausted.
> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.
I agree, for our use case we are not that worried about the performance, 
it is still in the dataplane but let's say it is not in the "hot" path.
> The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
> If somebody starts doing push/pop for every packet, the rcu subsystem
> will struggle and nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?
>
> In such case the api for stack map will look like:
>
> elem = bpf_map_pop_elem(stack);
> // access elem
> bpf_map_free_elem(elem);
> // here prog is not allowed to access elem and verifier will catch that
>
> elem = bpf_map_alloc_elem(stack);
> // populate elem
> bpf_map_push_elem(elem);
> // here prog is not allowed to access elem and verifier will catch that
>
> Then both pre-allocated elems and kmalloc/kfree will work fine
> and no unbounded rcu issues in both cases.
>
>

I read the Joe's proposal and using that for this problem looks like a 
nice solution.

I think a good trade-off for now would be to go ahead with a queue/stack 
map without preallocating support (or maybe include it having always in 
mind that this issue has to be solved in the near future) and then, as a 
separated work, try to use Joe's proposal in the map helpers.

What do you think?

Thanks,
Mauricio.