diff mbox series

[bpf,1/3] bpf: add map_lookup_elem_sys_only for lookups from syscall side

Message ID 505e5dfeea6ab7dd3719bb9863fc50e7595e06ed.1557789256.git.daniel@iogearbox.net
State Accepted
Delegated to: BPF Maintainers
Headers show
Series BPF LRU map fix | expand

Commit Message

Daniel Borkmann May 13, 2019, 11:18 p.m. UTC
Add a callback map_lookup_elem_sys_only() that map implementations
could use over map_lookup_elem() from system call side in case the
map implementation needs to handle the latter differently than from
the BPF data path. If map_lookup_elem_sys_only() is set, this will
be preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for LRU map, but once development window
opens, we can convert other map types from map_lookup_elem() (here,
the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
the callback to simplify and clean up the latter.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h  | 1 +
 kernel/bpf/syscall.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Andrii Nakryiko May 14, 2019, 5:04 a.m. UTC | #1
On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Add a callback map_lookup_elem_sys_only() that map implementations
> could use over map_lookup_elem() from system call side in case the
> map implementation needs to handle the latter differently than from
> the BPF data path. If map_lookup_elem_sys_only() is set, this will
> be preferred pick for map lookups out of user space. This hook is

This is kind of surprising behavior  w/ preferred vs default lookup
code path. Why the desired behavior can't be achieved with an extra
flag, similar to BPF_F_LOCK? It seems like it will be more explicit,
more extensible and more generic approach, avoiding duplication of
lookup semantics.

E.g., for LRU map, with flag on lookup, one can decide whether lookup
from inside BPF program (not just from syscall side!) should modify
LRU ordering or not, simply by specifying extra flag. Am I missing
some complication that prevents us from doing it that way?

> used in a follow-up fix for LRU map, but once development window
> opens, we can convert other map types from map_lookup_elem() (here,
> the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
> the callback to simplify and clean up the latter.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf.h  | 1 +
>  kernel/bpf/syscall.c | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 59631dd..4fb3aa2 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -36,6 +36,7 @@ struct bpf_map_ops {
>         void (*map_free)(struct bpf_map *map);
>         int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
>         void (*map_release_uref)(struct bpf_map *map);
> +       void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
>
>         /* funcs callable from userspace and from eBPF programs */
>         void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ad3ccf8..cb5440b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,7 +808,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>                 err = map->ops->map_peek_elem(map, value);
>         } else {
>                 rcu_read_lock();
> -               ptr = map->ops->map_lookup_elem(map, key);
> +               if (map->ops->map_lookup_elem_sys_only)
> +                       ptr = map->ops->map_lookup_elem_sys_only(map, key);
> +               else
> +                       ptr = map->ops->map_lookup_elem(map, key);
>                 if (IS_ERR(ptr)) {
>                         err = PTR_ERR(ptr);
>                 } else if (!ptr) {
> --
> 2.9.5
>
Daniel Borkmann May 14, 2019, 7:59 a.m. UTC | #2
On 05/14/2019 07:04 AM, Andrii Nakryiko wrote:
> On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> Add a callback map_lookup_elem_sys_only() that map implementations
>> could use over map_lookup_elem() from system call side in case the
>> map implementation needs to handle the latter differently than from
>> the BPF data path. If map_lookup_elem_sys_only() is set, this will
>> be preferred pick for map lookups out of user space. This hook is
> 
> This is kind of surprising behavior  w/ preferred vs default lookup
> code path. Why the desired behavior can't be achieved with an extra
> flag, similar to BPF_F_LOCK? It seems like it will be more explicit,
> more extensible and more generic approach, avoiding duplication of
> lookup semantics.

For lookup from syscall side, this is possible of course. Given the
current situation breaks heuristic with any walks of the LRU map, I
presume you are saying something like an opt-in flag such as
BPF_F_MARK_USED would be more useful? I was thinking about something
like this initially, but then I couldn't come up with a concrete use
case where it's needed/useful today for user space. Given that, my
preference was to only add such flag wait until there is an actual
need for it, and in any case, it is trivial to add it later on. Do
you have a concrete need for it today that would justify such flag?

> E.g., for LRU map, with flag on lookup, one can decide whether lookup
> from inside BPF program (not just from syscall side!) should modify
> LRU ordering or not, simply by specifying extra flag. Am I missing
> some complication that prevents us from doing it that way?

For programs it's a bit tricky. The BPF call interface is ...

  BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)

... meaning verifier does not care what argument 3 and beyond contains.
>From BPF context/pov, it could also be uninitialized register. This would
mean, we'd need to add a BPF_CALL_3(bpf_map_lookup_elem2, ...) interface
which programs would use instead (and to not break existing ones), or
some other new helper call that gets a map value argument to unmark the
element from LRU side. While all doable one way or another although bit
hacky, we should probably clarify and understand the use case for it
first, thus brings me back to the last question from above paragraph.

Thanks,
Daniel
Andrii Nakryiko May 14, 2019, 4:34 p.m. UTC | #3
On Tue, May 14, 2019 at 12:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 05/14/2019 07:04 AM, Andrii Nakryiko wrote:
> > On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> Add a callback map_lookup_elem_sys_only() that map implementations
> >> could use over map_lookup_elem() from system call side in case the
> >> map implementation needs to handle the latter differently than from
> >> the BPF data path. If map_lookup_elem_sys_only() is set, this will
> >> be preferred pick for map lookups out of user space. This hook is
> >
> > This is kind of surprising behavior  w/ preferred vs default lookup
> > code path. Why the desired behavior can't be achieved with an extra
> > flag, similar to BPF_F_LOCK? It seems like it will be more explicit,
> > more extensible and more generic approach, avoiding duplication of
> > lookup semantics.
>
> For lookup from syscall side, this is possible of course. Given the
> current situation breaks heuristic with any walks of the LRU map, I
> presume you are saying something like an opt-in flag such as
> BPF_F_MARK_USED would be more useful? I was thinking about something

To preserve existing semantics, it would be opt-out
BPF_F_DONT_MARK_USED, if you don't want to update LRU, so that
existing use cases don't break.

> like this initially, but then I couldn't come up with a concrete use
> case where it's needed/useful today for user space. Given that, my
> preference was to only add such flag wait until there is an actual
> need for it, and in any case, it is trivial to add it later on. Do
> you have a concrete need for it today that would justify such flag?

So my concern was with having two ops for lookup for maps
(map_lookup_elem() and map_lookup_elem_sys_only()) which for existing
use cases differ only in whether we are reordering LRU on lookup or
not, which felt like would be cleaner to solve with extending
ops->map_lookup_elem() to accept flags. But now I realize that there
are important implementation limitations preventing doing this cleanly
and efficiently, so I rescind my proposal.

>
> > E.g., for LRU map, with flag on lookup, one can decide whether lookup
> > from inside BPF program (not just from syscall side!) should modify
> > LRU ordering or not, simply by specifying extra flag. Am I missing
> > some complication that prevents us from doing it that way?
>
> For programs it's a bit tricky. The BPF call interface is ...
>
>   BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>
> ... meaning verifier does not care what argument 3 and beyond contains.
> From BPF context/pov, it could also be uninitialized register. This would
> mean, we'd need to add a BPF_CALL_3(bpf_map_lookup_elem2, ...) interface
> which programs would use instead (and to not break existing ones), or
> some other new helper call that gets a map value argument to unmark the
> element from LRU side. While all doable one way or another although bit
> hacky, we should probably clarify and understand the use case for it
> first, thus brings me back to the last question from above paragraph.

Yeah, if we wanted to expose this functionality from BPF side right
now, we'd have to add new helper w/ extra flags arg. As I mentioned
above, though, I assumed it wouldn't be too hard to make existing
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
translate to map->ops->map_lookup_elem(key, 0 /* flags */), filling in
default flags = 0 value, but apparently that's not that simple (and
will hurt performance).

>
> Thanks,
> Daniel
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 59631dd..4fb3aa2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -36,6 +36,7 @@  struct bpf_map_ops {
 	void (*map_free)(struct bpf_map *map);
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
 	void (*map_release_uref)(struct bpf_map *map);
+	void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
 
 	/* funcs callable from userspace and from eBPF programs */
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ad3ccf8..cb5440b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -808,7 +808,10 @@  static int map_lookup_elem(union bpf_attr *attr)
 		err = map->ops->map_peek_elem(map, value);
 	} else {
 		rcu_read_lock();
-		ptr = map->ops->map_lookup_elem(map, key);
+		if (map->ops->map_lookup_elem_sys_only)
+			ptr = map->ops->map_lookup_elem_sys_only(map, key);
+		else
+			ptr = map->ops->map_lookup_elem(map, key);
 		if (IS_ERR(ptr)) {
 			err = PTR_ERR(ptr);
 		} else if (!ptr) {