
[bpf-next,11/13] bpf: libbpf: Add STRUCT_OPS support

Message ID 20191214004803.1653618-1-kafai@fb.com
State Changes Requested
Delegated to: BPF Maintainers
Series Introduce BPF STRUCT_OPS

Commit Message

Martin KaFai Lau Dec. 14, 2019, 12:48 a.m. UTC
This patch adds BPF STRUCT_OPS support to libbpf.

The only sec_name convention is SEC("struct_ops") to identify the
struct ops implemented in BPF, e.g.
SEC("struct_ops")
struct tcp_congestion_ops dctcp = {
	.init           = (void *)dctcp_init,  /* <-- a bpf_prog */
	/* ... some more func ptrs ... */
	.name           = "bpf_dctcp",
};

In the bpf_object__open phase, libbpf will look for the "struct_ops"
elf section and find out which btf-type the "struct_ops" is
implementing.  Note that the btf-type here refers to a type in the
bpf_prog.o's btf.  It will then collect (through SHT_REL) the bpf
progs that the func ptrs refer to.

In the bpf_object__load phase, the prepare_struct_ops() will load
the btf_vmlinux and obtain the corresponding kernel's btf-type.
With the kernel's btf-type, it can then set the prog->type,
prog->attach_btf_id and the prog->expected_attach_type.  Thus,
the prog's properties do not rely on its section name.

Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
process is as simple as: member-name match + btf-kind match + size match.
If any of these matching conditions fails, libbpf will reject.
The currently supported target is "struct tcp_congestion_ops", most of
whose members are function pointers.
The member ordering of the bpf_prog's btf-type can be different from
the btf_vmlinux's btf-type.
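
As a rough illustration (this is a hand-written sketch, not the actual
libbpf code), the matching rule for one member could look like the
following, expressed with libbpf's public BTF helpers and the
find_member_by_name() helper added by this patch; the real
prepare_struct_ops() does more work, e.g. resolving func ptrs and
recording their kernel offsets:

	static bool st_ops_member_matches(const struct btf *prog_btf,
					  const struct btf_member *pm,
					  const struct btf *kern_btf,
					  const struct btf_type *kern_t)
	{
		/* member-name match: ordering may differ, so look up by name */
		const char *name = btf__name_by_offset(prog_btf, pm->name_off);
		const struct btf_member *km =
			find_member_by_name(kern_btf, kern_t, name);

		if (!km)
			return false;

		/* btf-kind match */
		if (btf_kind(btf__type_by_id(prog_btf, pm->type)) !=
		    btf_kind(btf__type_by_id(kern_btf, km->type)))
			return false;

		/* size match */
		return btf__resolve_size(prog_btf, pm->type) ==
		       btf__resolve_size(kern_btf, km->type);
	}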

Once the prog's properties are all set,
libbpf will proceed to load all the progs.

After that, register_struct_ops() will create a map, finalize the
map-value by populating it with the prog-fd, and then register this
"struct_ops" to the kernel by updating the map-value to the map.

By default, libbpf does not unregister the struct_ops from the kernel
during bpf_object__close().  This can be changed by setting the new
"unreg_st_ops" option in bpf_object_open_opts.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/lib/bpf/bpf.c           |  10 +-
 tools/lib/bpf/bpf.h           |   5 +-
 tools/lib/bpf/libbpf.c        | 599 +++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h        |   3 +-
 tools/lib/bpf/libbpf_probes.c |   2 +
 5 files changed, 612 insertions(+), 7 deletions(-)

Comments

Andrii Nakryiko Dec. 18, 2019, 3:07 a.m. UTC | #1
On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> [ ... ]

This looks pretty good to me. The two big things are exposing struct_ops
as a real struct bpf_map, so that users can interact with it using
libbpf APIs, as well as splitting struct_ops map creation and
registration. bpf_object__load() should only make sure all maps are
created and progs are loaded/verified, but none of the BPF programs can
be called yet. Then attach is the phase where registration happens.



[...]

>  LIBBPF_API int
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 27d5f7ecba32..ffb5cdd7db5a 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -67,6 +67,10 @@
>
>  #define __printf(a, b) __attribute__((format(printf, a, b)))
>
> +static struct btf *bpf_core_find_kernel_btf(void);

this is not CO-RE specific anymore, we should probably just rename it
to bpf_find_kernel_btf

> +static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
> +                                                       int idx);
> +
>  static int __base_pr(enum libbpf_print_level level, const char *format,
>                      va_list args)
>  {
> @@ -128,6 +132,8 @@ void libbpf_print(enum libbpf_print_level level, const char *format, ...)
>  # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ
>  #endif
>
> +#define BPF_STRUCT_OPS_SEC_NAME "struct_ops"

This is a special ELF section recognized by libbpf, so similarly to
".maps" (and ".kconfig", which I'm renaming from ".extern"), I think
this should be ".struct_ops" (or I'd even drop underscore and go with
".structops", but not insisting).

> +
>  static inline __u64 ptr_to_u64(const void *ptr)
>  {
>         return (__u64) (unsigned long) ptr;
> @@ -233,6 +239,32 @@ struct bpf_map {
>         bool reused;
>  };
>
> +struct bpf_struct_ops {
> +       const char *var_name;
> +       const char *tname;
> +       const struct btf_type *type;
> +       struct bpf_program **progs;
> +       __u32 *kern_func_off;
> +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> +       void *data;
> +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf

Using the __bpf_ prefix for these struct_ops-specific types is a bit too
generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
it btf_ops_ or btf_structops_?


> +        * format.
> +        * struct __bpf_tcp_congestion_ops {
> +        *      [... some other kernel fields ...]
> +        *      struct tcp_congestion_ops data;
> +        * }
> +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).

Comment isn't very clear... do you mean that the data pointed to by
kern_vdata is sizeof(...) bytes?

> +        * prepare_struct_ops() will populate the "data" into
> +        * "kern_vdata".
> +        */
> +       void *kern_vdata;
> +       __u32 type_id;
> +       __u32 kern_vtype_id;
> +       __u32 kern_vtype_size;
> +       int fd;
> +       bool unreg;

This unreg flag (and default behavior to not unregister) is bothering
me a bit... Shouldn't this be controlled by the map's lifetime, at least?
E.g., if no one pins that map, then struct_ops should be unregistered
on map destruction. If an application wants to keep BPF programs
attached, it should make sure to pin the map before the userspace part
exits. Is this problematic in any way?
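
For reference, a sketch of that pinning flow, assuming the struct_ops
is exposed as a regular bpf_map (as suggested further down) and an
arbitrary BPFFS path:

	struct bpf_map *map;

	map = bpf_object__find_map_by_name(obj, "dctcp");
	if (!map || bpf_map__pin(map, "/sys/fs/bpf/dctcp"))
		return -1;
	/* without the pin, closing the last map FD would unregister */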

> +};
> +
>  struct bpf_secdata {
>         void *rodata;
>         void *data;
> @@ -251,6 +283,7 @@ struct bpf_object {
>         size_t nr_maps;
>         size_t maps_cap;
>         struct bpf_secdata sections;
> +       struct bpf_struct_ops st_ops;

These bpf_struct_ops are strictly belonging to that special struct_ops
map, right? So I'd say we should change struct bpf_map to contain
per-map extra piece of info. We can combine that with current mmaped
pointer for internal maps;


struct bpf_map {
    ...
    union {
        void *mmaped;
        struct bpf_struct_ops *st_ops;
    };
};

That way those special maps can have extra piece of information
specific to that special map's type.


>
>         bool loaded;
>         bool has_pseudo_calls;
> @@ -270,6 +303,7 @@ struct bpf_object {
>                 Elf_Data *data;
>                 Elf_Data *rodata;
>                 Elf_Data *bss;
> +               Elf_Data *st_ops_data;
>                 size_t strtabidx;
>                 struct {
>                         GElf_Shdr shdr;
> @@ -282,6 +316,7 @@ struct bpf_object {
>                 int data_shndx;
>                 int rodata_shndx;
>                 int bss_shndx;
> +               int st_ops_shndx;
>         } efile;
>         /*
>          * All loaded bpf_object is linked in a list, which is
> @@ -509,6 +544,508 @@ static __u32 get_kernel_version(void)
>         return KERNEL_VERSION(major, minor, patch);
>  }
>
> +static int bpf_object__register_struct_ops(struct bpf_object *obj)
> +{
> +       struct bpf_create_map_attr map_attr = {};
> +       struct bpf_struct_ops *st_ops;
> +       const char *tname;
> +       __u32 i, zero = 0;
> +       int fd, err;
> +
> +       st_ops = &obj->st_ops;
> +       if (!st_ops->kern_vdata)
> +               return 0;

this shouldn't happen, right? I'd drop the check or return error at least.

> +
> +       tname = st_ops->tname;
> +       for (i = 0; i < btf_vlen(st_ops->type); i++) {
> +               struct bpf_program *prog = st_ops->progs[i];
> +               void *kern_data;
> +               int prog_fd;
> +
> +               if (!prog)
> +                       continue;
> +
> +               prog_fd = bpf_program__nth_fd(prog, 0);

nit: just bpf_program__fd(prog)

> +               if (prog_fd < 0) {
> +                       pr_warn("struct_ops register %s: prog %s is not loaded\n",
> +                               tname, prog->name);
> +                       return -EINVAL;
> +               }

This is a redundant check; register_struct_ops will not be called if any
program loading fails.

> +
> +               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> +               *(unsigned long *)kern_data = prog_fd;
> +       }
> +
> +       map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
> +       map_attr.key_size = sizeof(unsigned int);
> +       map_attr.value_size = st_ops->kern_vtype_size;
> +       map_attr.max_entries = 1;
> +       map_attr.btf_fd = btf__fd(obj->btf);
> +       map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
> +       map_attr.name = st_ops->var_name;
> +
> +       fd = bpf_create_map_xattr(&map_attr);

we should try to reuse bpf_object__init_internal_map(). This will add
a struct bpf_map which users can iterate over and look up by name, etc.
We had a similar discussion when Daniel was adding global data maps,
and we conclusively decided that these special maps have to be
represented in libbpf as struct bpf_map as well.
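
The user-visible upside would be that the struct_ops map shows up
through the regular map APIs, e.g. (sketch):

	struct bpf_map *map;

	bpf_object__for_each_map(map, obj)
		printf("map: %s, fd: %d\n",
		       bpf_map__name(map), bpf_map__fd(map));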

> +       if (fd < 0) {
> +               err = -errno;
> +               pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
> +                       tname);
> +               return err;
> +       }
> +
> +       err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);

This is what "activates" struct_ops, so this has to happen outside of
load, load shouldn't trigger execution of BPF programs. So something
like bpf_map__attach_struct_ops() or we if introduce new concept for
struct_ops: bpf_struct_ops__attach(), which can be called explicitly
by user of automatically from skeletons <skeleton>__attach().
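
A sketch of what that split could look like from the user side;
bpf_map__attach_struct_ops() does not exist at this point, and
returning a bpf_link is just a guess following the existing attach
convention:

	struct bpf_link *link;
	int err;

	/* load: maps created, progs loaded/verified, nothing registered yet */
	err = bpf_object__load(obj);
	if (err)
		return err;

	/* attach: registration (the map-value update) happens here */
	link = bpf_map__attach_struct_ops(map);
	if (libbpf_get_error(link))
		return -1;
	/* ... */
	bpf_link__destroy(link);	/* would unregister */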


> +       if (err) {
> +               err = -errno;
> +               close(fd);
> +               pr_warn("struct_ops register %s: Error in updating struct_ops map\n",
> +                       tname);
> +               return err;
> +       }
> +
> +       st_ops->fd = fd;
> +
> +       return 0;
> +}
> +
> +static int bpf_struct_ops__unregister(struct bpf_struct_ops *st_ops)
> +{
> +       if (st_ops->fd != -1) {
> +               __u32 zero = 0;
> +               int err = 0;
> +
> +               if (bpf_map_delete_elem(st_ops->fd, &zero))
> +                       err = -errno;
> +               zclose(st_ops->fd);
> +
> +               return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct btf_type *
> +resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
> +static const struct btf_type *
> +resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
> +
> +static const struct btf_member *
> +find_member_by_offset(const struct btf_type *t, __u32 offset)

nit: find_member_by_bit_offset (offset -> bit_offset)?

> +{
> +       struct btf_member *m;
> +       int i;
> +
> +       for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
> +               if (btf_member_bit_offset(t, i) == offset)
> +                       return m;
> +       }
> +
> +       return NULL;
> +}
> +
> +static const struct btf_member *
> +find_member_by_name(const struct btf *btf, const struct btf_type *t,
> +                   const char *name)
> +{
> +       struct btf_member *m;
> +       int i;
> +
> +       for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
> +               if (!strcmp(btf__name_by_offset(btf, m->name_off), name))
> +                       return m;
> +       }
> +
> +       return NULL;
> +}
> +
> +#define STRUCT_OPS_VALUE_PREFIX "__bpf_"
> +#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1)
> +
> +static int
> +bpf_struct_ops__get_kern_types(const struct btf *btf, const char *tname,

nit: there is no "bpf_struct_ops" object in libbpf and this is not its
method, so it's a violation of libbpf's naming convention; please
consider renaming to something like "find_struct_ops_kern_types"

> +                              const struct btf_type **type, __u32 *type_id,
> +                              const struct btf_type **vtype, __u32 *vtype_id,
> +                              const struct btf_member **data_member)
> +{
> +       const struct btf_type *kern_type, *kern_vtype;
> +       const struct btf_member *kern_data_member;
> +       __s32 kern_vtype_id, kern_type_id;
> +       char vtname[128] = STRUCT_OPS_VALUE_PREFIX;
> +       __u32 i;
> +
> +       kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
> +       if (kern_type_id < 0) {
> +               pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
> +                       tname);
> +               return -ENOTSUP;

just return kern_type_id (pass through btf__find_by_name_kind's
result). Same below.
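
i.e., the quoted check would become something like:

	kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
	if (kern_type_id < 0) {
		pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
			tname);
		return kern_type_id;	/* pass through the original error */
	}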

> +       }
> +       kern_type = btf__type_by_id(btf, kern_type_id);
> +
> +       /* Find the corresponding "map_value" type that will be used
> +        * in map_update(BPF_MAP_TYPE_STRUCT_OPS).  For example,
> +        * find "struct __bpf_tcp_congestion_ops" from the btf_vmlinux.
> +        */
> +       strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname,
> +               sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1);
> +       kern_vtype_id = btf__find_by_name_kind(btf, vtname,
> +                                              BTF_KIND_STRUCT);
> +       if (kern_vtype_id < 0) {
> +               pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
> +                       vtname);
> +               return -ENOTSUP;
> +       }
> +       kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> +
> +       /* Find "struct tcp_congestion_ops" from
> +        * struct __bpf_tcp_congestion_ops {
> +        *      [ ... ]
> +        *      struct tcp_congestion_ops data;
> +        * }
> +        */
> +       for (i = 0, kern_data_member = btf_members(kern_vtype);
> +            i < btf_vlen(kern_vtype);
> +            i++, kern_data_member++) {

nit: multi-line for is kind of ugly, maybe move kern_data_member
assignment out of for?
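
e.g., a sketch of the suggested restructuring:

	kern_data_member = btf_members(kern_vtype);
	for (i = 0; i < btf_vlen(kern_vtype); i++, kern_data_member++) {
		if (kern_data_member->type == kern_type_id)
			break;
	}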

> +               if (kern_data_member->type == kern_type_id)
> +                       break;
> +       }
> +       if (i == btf_vlen(kern_vtype)) {
> +               pr_warn("struct_ops prepare: struct %s data is not found in struct %s\n",
> +                       tname, vtname);
> +               return -EINVAL;
> +       }
> +

[...]

>  static int bpf_object__init_btf(struct bpf_object *obj,
> @@ -1689,6 +2257,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
>                         } else if (strcmp(name, ".rodata") == 0) {
>                                 obj->efile.rodata = data;
>                                 obj->efile.rodata_shndx = idx;
> +                       } else if (strcmp(name, BPF_STRUCT_OPS_SEC_NAME) == 0) {
> +                               obj->efile.st_ops_data = data;
> +                               obj->efile.st_ops_shndx = idx;
>                         } else {
>                                 pr_debug("skip section(%d) %s\n", idx, name);
>                         }
> @@ -1698,7 +2269,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
>                         int sec = sh.sh_info; /* points to other section */
>
>                         /* Only do relo for section with exec instructions */
> -                       if (!section_have_execinstr(obj, sec)) {
> +                       if (!section_have_execinstr(obj, sec) &&
> +                           !strstr(name, BPF_STRUCT_OPS_SEC_NAME)) {

why substring match?

>                                 pr_debug("skip relo %s(%d) for section(%d)\n",
>                                          name, idx, sec);
>                                 continue;

[...]
Martin KaFai Lau Dec. 18, 2019, 7:03 a.m. UTC | #2
On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> >
> > [ ... ]
> 
> This looks pretty good to me. The two big things are exposing struct_ops
> as a real struct bpf_map, so that users can interact with it using
> libbpf APIs, as well as splitting struct_ops map creation and
> registration. bpf_object__load() should only make sure all maps are
> created and progs are loaded/verified, but none of the BPF programs can
> be called yet. Then attach is the phase where registration happens.
Thanks for the review.

[ ... ]

> > [ ... ]
> > +struct bpf_struct_ops {
> > +       const char *var_name;
> > +       const char *tname;
> > +       const struct btf_type *type;
> > +       struct bpf_program **progs;
> > +       __u32 *kern_func_off;
> > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > +       void *data;
> > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> 
> Using the __bpf_ prefix for these struct_ops-specific types is a bit too
> generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> it btf_ops_ or btf_structops_?
Is it a concern about name collision?

The prefix pick is to use a more representative name.
struct_ops uses many bpf pieces, and btf is one of them.
Very soon, all new code will depend on BTF, and the btf_ prefix
could become generic also.

Unlike tracepoint, there is no non-btf version of struct_ops.

> 
> 
> > +        * format.
> > +        * struct __bpf_tcp_congestion_ops {
> > +        *      [... some other kernel fields ...]
> > +        *      struct tcp_congestion_ops data;
> > +        * }
> > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> 
> Comment isn't very clear.. do you mean that data pointed to by
> kern_vdata is of sizeof(...) bytes?
> 
> > +        * prepare_struct_ops() will populate the "data" into
> > +        * "kern_vdata".
> > +        */
> > +       void *kern_vdata;
> > +       __u32 type_id;
> > +       __u32 kern_vtype_id;
> > +       __u32 kern_vtype_size;
> > +       int fd;
> > +       bool unreg;
> 
> This unreg flag (and default behavior to not unregister) is bothering
> me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> E.g., if no one pins that map - then struct_ops should be unregistered
> on map destruction. If application wants to keep BPF programs
> attached, it should make sure to pin map, before userspace part exits?
> Is this problematic in any way?
I don't think it should be in the struct_ops case.  I think of the
struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
in this case), and these map-progs stay (or keep attached) until they
are detached, like any other attached bpf_prog that keeps running
without caring whether the bpf_prog is pinned or not.

About the "bool unreg;", the default can be changed to true if
it makes more sense.

[ ... ]

> 
> > +
> > +               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> > +               *(unsigned long *)kern_data = prog_fd;
> > +       }
> > +
> > +       map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
> > +       map_attr.key_size = sizeof(unsigned int);
> > +       map_attr.value_size = st_ops->kern_vtype_size;
> > +       map_attr.max_entries = 1;
> > +       map_attr.btf_fd = btf__fd(obj->btf);
> > +       map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
> > +       map_attr.name = st_ops->var_name;
> > +
> > +       fd = bpf_create_map_xattr(&map_attr);
> 
> we should try to reuse bpf_object__init_internal_map(). This will add
> a struct bpf_map which users can iterate over and look up by name, etc.
> We had a similar discussion when Daniel was adding global data maps,
> and we conclusively decided that these special maps have to be
> represented in libbpf as struct bpf_map as well.
I will take a look.

> 
> > +       if (fd < 0) {
> > +               err = -errno;
> > +               pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
> > +                       tname);
> > +               return err;
> > +       }
> > +
> > +       err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);
Martin KaFai Lau Dec. 18, 2019, 7:20 a.m. UTC | #3
On Tue, Dec 17, 2019 at 11:03:45PM -0800, Martin Lau wrote:
> On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > [ ... ]
> > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > +       void *data;
> > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > 
> > Using the __bpf_ prefix for these struct_ops-specific types is a bit too
> > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > it btf_ops_ or btf_structops_?
> Is it a concern about name collision?
> 
> The prefix pick is to use a more representative name.
> struct_ops uses many bpf pieces, and btf is one of them.
> Very soon, all new code will depend on BTF, and the btf_ prefix
> could become generic also.
> 
> Unlike tracepoint, there is no non-btf version of struct_ops.
Maybe bpf_struct_ops_?

It was my early pick, but it read quite weird:
bpf_[struct]_<ops>_[tcp_congestion]_<ops>.

Hence, I went with __bpf_<actual-name-of-the-kernel-struct> in this series.
Andrii Nakryiko Dec. 18, 2019, 4:34 p.m. UTC | #4
On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > [ ... ]
> > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > +       void *data;
> > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> >
> > Using the __bpf_ prefix for these struct_ops-specific types is a bit too
> > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > it btf_ops_ or btf_structops_?
> Is it a concern about name collision?
>
> The prefix pick is to use a more representative name.
> struct_ops uses many bpf pieces, and btf is one of them.
> Very soon, all new code will depend on BTF, and the btf_ prefix
> could become generic also.
>
> Unlike tracepoint, there is no non-btf version of struct_ops.

Not so much name collision as being able to immediately recognize
that it's used to provide type information for struct_ops. Think about
some automated tooling parsing vmlinux BTF and trying to create some
derivative types for those btf_trace_xxx and __bpf_xxx types. Having a
unique prefix that identifies what kind of type-providing struct it is
is very useful for building a generic tool like that, while __bpf_
doesn't indicate in any way that it's for struct_ops.

> > [ ... ]
> >
> > This unreg flag (and default behavior to not unregister) is bothering
> > me a bit... Shouldn't this be controlled by the map's lifetime, at least?
> > E.g., if no one pins that map, then struct_ops should be unregistered
> > on map destruction. If an application wants to keep BPF programs
> > attached, it should make sure to pin the map before the userspace part
> > exits. Is this problematic in any way?
> I don't think it should be in the struct_ops case.  I think of the
> struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
> in this case), and these map-progs stay (or keep attached) until they
> are detached, like any other attached bpf_prog that keeps running
> without caring whether the bpf_prog is pinned or not.

I'll let someone else comment on how this behaves for cgroup, xdp,
etc., but for tracing, for example, we have FD-based BPF links, which
will detach the program automatically when the FD is closed. I think
the idea is to extend this to other types of BPF programs as well, so
there is no risk of leaving some stray BPF program running after an
unintended crash of the userspace program. When an application
explicitly needs a BPF program to outlive its userspace control app,
then this can be achieved by pinning the map/program in BPFFS.

>
> About the "bool unreg;", the default can be changed to true if
> it makes more sense.
>
> [ ... ]
Andrii Nakryiko Dec. 18, 2019, 4:36 p.m. UTC | #5
On Tue, Dec 17, 2019 at 11:20 PM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 11:03:45PM -0800, Martin Lau wrote:
> > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > [ ... ]
> > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > > +       void *data;
> > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > >
> > > Using the __bpf_ prefix for these struct_ops-specific types is a bit too
> > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > > it btf_ops_ or btf_structops_?
> > Is it a concern about name collision?
> >
> > The prefix pick is to use a more representative name.
> > struct_ops uses many bpf pieces, and btf is one of them.
> > Very soon, all new code will depend on BTF, and the btf_ prefix
> > could become generic also.
> >
> > Unlike tracepoint, there is no non-btf version of struct_ops.
> Maybe bpf_struct_ops_?
>
> It was my early pick, but it read quite weird:
> bpf_[struct]_<ops>_[tcp_congestion]_<ops>.
>
> Hence, I went with __bpf_<actual-name-of-the-kernel-struct> in this series.

bpf_struct_ops_ is much better, IMO, but given this struct serves only
the purpose of providing type information to the kernel, I think
btf_struct_ops_ is more justified.
And this <ops>_xxx_<ops> duplication doesn't bother me at all, again,
because it's not directly used in C code. But believe me, having a
unique prefix is so good, even in the simplest case of grepping through
vmlinux.h.
Martin KaFai Lau Dec. 18, 2019, 5:33 p.m. UTC | #6
On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> >
> > [ ... ]
> > >
> > > This unreg flag (and default behavior to not unregister) is bothering
> > > me a bit... Shouldn't this be controlled by the map's lifetime, at least?
> > > E.g., if no one pins that map, then struct_ops should be unregistered
> > > on map destruction. If an application wants to keep BPF programs
> > > attached, it should make sure to pin the map before the userspace part
> > > exits. Is this problematic in any way?
> > I don't think it should be in the struct_ops case.  I think of the
> > struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
> > in this case), and these map-progs stay (or keep attached) until they
> > are detached, like any other attached bpf_prog that keeps running
> > without caring whether the bpf_prog is pinned or not.
> 
> I'll let someone else comment on how this behaves for cgroup, xdp,
> etc., but for tracing, for example, we have FD-based BPF links, which
> will detach the program automatically when the FD is closed. I think
> the idea is to extend this to other types of BPF programs as well, so
> there is no risk of leaving some stray BPF program running after an
> unintended
Like xdp_prog, struct_ops does not have an fd-based link yet.
This link can be created for struct_ops, xdp_prog, and others later.
I don't see a conflict here.

> crash of the userspace program. When an application explicitly needs
> a BPF program to outlive its userspace control app, then this can be
> achieved by pinning the map/program in BPFFS.
If the concern is about not leaving struct_ops behind, let's assume
there is no "detach" and a struct_ops only goes away when the very
last userspace handle (FD/pin) of the map goes away.  What would be an
easy way to remove bpf_cubic from the system then:

[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
    net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
    net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
    net.ipv4.tcp_congestion_control = bpf_cubic

Andrii Nakryiko Dec. 18, 2019, 6:14 p.m. UTC | #7
On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
>
> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> > >
> > > [ ... ]
> >
> > I'll let someone else comment on how this behaves for cgroup, xdp,
> > etc., but for tracing, for example, we have FD-based BPF links, which
> > will detach the program automatically when the FD is closed. I think
> > the idea is to extend this to other types of BPF programs as well, so
> > there is no risk of leaving some stray BPF program running after an
> > unintended
> Like xdp_prog, struct_ops does not have an fd-based link yet.
> This link can be created for struct_ops, xdp_prog, and others later.
> I don't see a conflict here.

My point was that the default behavior should be conservative: free up
resources automatically on process exit, unless specifically pinned by
the user.
But this discussion made me realize that we are missing one thing from
the general bpf_link framework. See below.

>
> > crash of the userspace program. When an application explicitly needs
> > a BPF program to outlive its userspace control app, then this can be
> > achieved by pinning the map/program in BPFFS.
> If the concern is about not leaving struct_ops behind, let's assume
> there is no "detach" and a struct_ops only goes away when the very
> last userspace handle (FD/pin) of the map goes away.  What would be an
> easy way to remove bpf_cubic from the system then:

Yeah, I think this "last map FD close frees up resources/detaches" is
a good behavior.

Where we do have a problem is with bpf_link__destroy() unconditionally
also detaching whatever was attached (tracepoint, kprobe, or whatever
was done to create the bpf_link in the first place). Now,
bpf_link__destroy() has to be called by the user (or skeleton) to at
least free up malloc()'ed structs. But it appears that it's not always
desirable that the underlying BPF program gets detached upon bpf_link
destruction. I think this will be the case for XDP and others as well.

I think the good and generic way to go about this is to have a general
concept of destroying the link without detaching BPF programs. E.g.,
what if we had a new API call `void bpf_link__unlink()`, which would
mark the link as not requiring the underlying BPF program to be
detached. When bpf_link__destroy() is called later, it will just free
the resources allocated to maintain the bpf_link itself, but won't
detach any BPF programs/resources.

With this, the user will have to explicitly specify that they don't
want to detach even when the skeleton/link is destroyed. If we get
consensus on this, I can add support for it to all the existing
bpf_links and you can build on that?
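
To make the proposed semantics concrete, here is a minimal usage
sketch. bpf_link__unlink() is only the proposed name and doesn't exist
yet; bpf_program__attach_kprobe() and libbpf_get_error() are existing
libbpf calls, and the probed kernel function name is just an example:

	struct bpf_link *link;

	link = bpf_program__attach_kprobe(prog, false /* !retprobe */,
					  "tcp_init_sock" /* example */);
	if (libbpf_get_error(link))
		return -1;

	/* opt out of auto-detach: the underlying BPF program stays
	 * attached even after the link (and the process) goes away
	 */
	bpf_link__unlink(link);

	/* with the opt-out above, this only frees the malloc()'ed
	 * bpf_link state and no longer detaches the program
	 */
	bpf_link__destroy(link);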

>
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
>     net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
>     net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
>     net.ipv4.tcp_congestion_control = bpf_cubic
>
> >
> > >
> > > About the "bool unreg;", the default can be changed to true if
> > > it makes more sense.
> > >
Martin KaFai Lau Dec. 18, 2019, 8:19 p.m. UTC | #8
On Wed, Dec 18, 2019 at 10:14:04AM -0800, Andrii Nakryiko wrote:
[ ... ]
> 
> Where we do have a problem is with bpf_link__destroy() unconditionally
> also detaching whatever was attached (tracepoint, kprobe, or whatever
> was done to create the bpf_link in the first place). Now,
> bpf_link__destroy() has to be called by the user (or skeleton) to at
> least free up malloc()'ed structs. But it appears that it's not always
> desirable that the underlying BPF program gets detached upon bpf_link
> destruction. I think this will be the case for XDP and others as well.
> 
> I think the good and generic way to go about this is to have a general
> concept of destroying the link without detaching BPF programs. E.g.,
> what if we had a new API call `void bpf_link__unlink()`, which would
> mark the link as not requiring the underlying BPF program to be
> detached. When bpf_link__destroy() is called later, it will just free
> the resources allocated to maintain the bpf_link itself, but won't
> detach any BPF programs/resources.
> 
> With this, the user will have to explicitly specify that they don't
> want to detach even when the skeleton/link is destroyed. If we get
> consensus on this, I can add support for it to all the existing
> bpf_links and you can build on that?
Keeping the current struct_ops unreg mechanism (i.e.
bpf_struct_ops__unregister(), to be renamed) and
having a way to opt-out sounds good to me.  Thanks.
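
For context: with this patch, unregistering boils down to deleting the
single element (key 0) of the struct_ops map, which is what
bpf_struct_ops__unregister() in the patch does. A minimal sketch from
the user side, assuming the map was pinned at a hypothetical BPFFS
path:

	__u32 zero = 0;
	int map_fd;

	map_fd = bpf_obj_get("/sys/fs/bpf/bpf_cubic"); /* hypothetical pin */
	if (map_fd < 0)
		return -errno;

	/* deleting the only element unregisters the struct_ops
	 * (e.g. bpf_cubic) from the kernel
	 */
	if (bpf_map_delete_elem(map_fd, &zero))
		return -errno;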
Toke Høiland-Jørgensen Dec. 19, 2019, 8:53 a.m. UTC | #9
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> [ ... ]
>
> Where we do have a problem is with bpf_link__destroy() unconditionally
> also detaching whatever was attached (tracepoint, kprobe, or whatever
> was done to create the bpf_link in the first place). Now,
> bpf_link__destroy() has to be called by the user (or skeleton) to at
> least free up malloc()'ed structs. But it appears that it's not always
> desirable that the underlying BPF program gets detached upon bpf_link
> destruction. I think this will be the case for XDP and others as well.

For XDP the model has thus far been "once attached, the program stays
until explicitly detached". Changing that would certainly be surprising,
so I agree that splitting the API is best (not that I'm sure how many
XDP programs will end up using that API, but that's a different
concern)...

-Toke
Andrii Nakryiko Dec. 19, 2019, 8:49 p.m. UTC | #10
On Thu, Dec 19, 2019 at 12:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> [ ... ]
>
> For XDP the model has thus far been "once attached, the program stays
> until explicitly detached". Changing that would certainly be surprising,
> so I agree that splitting the API is best (not that I'm sure how many
> XDP programs will end up using that API, but that's a different
> concern)...

This would be a new FD-based API for XDP; I don't think we can change
the existing API. But I think the default behavior should still be to
auto-detach, unless explicitly "pinned" in whatever way. That would
prevent surprising "leakage" of BPF programs for unsuspecting users.

>
> -Toke
>
Toke Høiland-Jørgensen Dec. 20, 2019, 10:16 a.m. UTC | #11
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> [ ... ]
>
> This would be a new FD-based API for XDP; I don't think we can change
> the existing API. But I think the default behavior should still be to
> auto-detach, unless explicitly "pinned" in whatever way. That would
> prevent surprising "leakage" of BPF programs for unsuspecting users.

But why do we need a new API for attaching XDP programs? Also, what are
the use cases where it makes sense to have this kind of "transient" XDP
program? The only one I can think about is something like xdpdump, which
moves packets to userspace (and should stop doing that when the
userspace listener goes away). But with bpf-to-bpf tracing, xdpdump
won't actually be an XDP program, so what's left? The system firewall
rules don't go away when the program that installed them exits either;
why should an XDP program?

-Toke
Andrii Nakryiko Dec. 20, 2019, 5:34 p.m. UTC | #12
On Fri, Dec 20, 2019 at 2:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> [ ... ]
>
> But why do we need a new API for attaching XDP programs? Also, what are
> the use cases where it makes sense to have this kind of "transient" XDP
> program? The only one I can think about is something like xdpdump, which

During development, for instance, when your buggy userspace program
crashes? I think by default all those attached BPF programs should be
auto-detachable, if possible. That's the direction that worked out
really well with kprobes/tracepoints/perf_events. Previously, using
the old APIs, you'd attach a kprobe and if userspace didn't clean up,
that kprobe would stay attached in the system, consuming resources
without users noticing (which is especially critical in production).
Switching to an auto-detachable FD-based interface greatly improved
that experience. I think this is a good model going forward.

In practice, for production use cases, it will be just a trivial piece
of code to keep it attached:

struct bpf_link *xdp_link = bpf_program__attach_xdp(...);
bpf_link__disconnect(xdp_link); /* now if userspace program crashes,
                                 * xdp BPF program will stay connected */
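
A slightly fuller sketch of both behaviors under the same assumptions
(bpf_program__attach_xdp() and bpf_link__disconnect() are proposed
names here, not existing libbpf API; libbpf_get_error() is the usual
error check):

	struct bpf_link *link;

	link = bpf_program__attach_xdp(prog, ifindex); /* proposed API */
	if (libbpf_get_error(link))
		return -1;

	if (keep_attached)
		/* opt out: the XDP program survives link destruction */
		bpf_link__disconnect(link);

	/* default: detaches the XDP program on destroy; after
	 * disconnect(), this only frees the bpf_link's memory
	 */
	bpf_link__destroy(link);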

> moves packets to userspace (and should stop doing that when the
> userspace listener goes away). But with bpf-to-bpf tracing, xdpdump
> won't actually be an XDP program, so what's left? The system firewall
> rules don't go away when the program that installed them exits either;
> why should an XDP program?

See above; I'm not saying that it shouldn't be possible to keep it
attached. I'm just arguing it's not a good default, because it can
catch developers off guard and cause problems, especially in
production environments. In the end, it is a resource leak, unless you
want and expect it.

>
> -Toke
>

Patch

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 98596e15390f..ebb9d7066173 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -95,7 +95,11 @@  int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
 	attr.btf_key_type_id = create_attr->btf_key_type_id;
 	attr.btf_value_type_id = create_attr->btf_value_type_id;
 	attr.map_ifindex = create_attr->map_ifindex;
-	attr.inner_map_fd = create_attr->inner_map_fd;
+	if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
+		attr.btf_vmlinux_value_type_id =
+			create_attr->btf_vmlinux_value_type_id;
+	else
+		attr.inner_map_fd = create_attr->inner_map_fd;
 
 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -228,7 +232,9 @@  int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = load_attr->prog_type;
 	attr.expected_attach_type = load_attr->expected_attach_type;
-	if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
+	if (attr.prog_type == BPF_PROG_TYPE_STRUCT_OPS) {
+		attr.attach_btf_id = load_attr->attach_btf_id;
+	} else if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
 		attr.attach_btf_id = load_attr->attach_btf_id;
 		attr.attach_prog_fd = load_attr->attach_prog_fd;
 	} else {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 3c791fa8e68e..1ddbf7f33b83 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -48,7 +48,10 @@  struct bpf_create_map_attr {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__u32 map_ifindex;
-	__u32 inner_map_fd;
+	union {
+		__u32 inner_map_fd;
+		__u32 btf_vmlinux_value_type_id;
+	};
 };
 
 LIBBPF_API int
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 27d5f7ecba32..ffb5cdd7db5a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -67,6 +67,10 @@ 
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
+static struct btf *bpf_core_find_kernel_btf(void);
+static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
+							int idx);
+
 static int __base_pr(enum libbpf_print_level level, const char *format,
 		     va_list args)
 {
@@ -128,6 +132,8 @@  void libbpf_print(enum libbpf_print_level level, const char *format, ...)
 # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ
 #endif
 
+#define BPF_STRUCT_OPS_SEC_NAME "struct_ops"
+
 static inline __u64 ptr_to_u64(const void *ptr)
 {
 	return (__u64) (unsigned long) ptr;
@@ -233,6 +239,32 @@  struct bpf_map {
 	bool reused;
 };
 
+struct bpf_struct_ops {
+	const char *var_name;
+	const char *tname;
+	const struct btf_type *type;
+	struct bpf_program **progs;
+	__u32 *kern_func_off;
+	/* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
+	void *data;
+	/* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
+	 * format.
+	 * struct __bpf_tcp_congestion_ops {
+	 *	[... some other kernel fields ...]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
+	 * prepare_struct_ops() will populate the "data" into
+	 * "kern_vdata".
+	 */
+	void *kern_vdata;
+	__u32 type_id;
+	__u32 kern_vtype_id;
+	__u32 kern_vtype_size;
+	int fd;
+	bool unreg;
+};
+
 struct bpf_secdata {
 	void *rodata;
 	void *data;
@@ -251,6 +283,7 @@  struct bpf_object {
 	size_t nr_maps;
 	size_t maps_cap;
 	struct bpf_secdata sections;
+	struct bpf_struct_ops st_ops;
 
 	bool loaded;
 	bool has_pseudo_calls;
@@ -270,6 +303,7 @@  struct bpf_object {
 		Elf_Data *data;
 		Elf_Data *rodata;
 		Elf_Data *bss;
+		Elf_Data *st_ops_data;
 		size_t strtabidx;
 		struct {
 			GElf_Shdr shdr;
@@ -282,6 +316,7 @@  struct bpf_object {
 		int data_shndx;
 		int rodata_shndx;
 		int bss_shndx;
+		int st_ops_shndx;
 	} efile;
 	/*
 	 * All loaded bpf_object is linked in a list, which is
@@ -509,6 +544,508 @@  static __u32 get_kernel_version(void)
 	return KERNEL_VERSION(major, minor, patch);
 }
 
+static int bpf_object__register_struct_ops(struct bpf_object *obj)
+{
+	struct bpf_create_map_attr map_attr = {};
+	struct bpf_struct_ops *st_ops;
+	const char *tname;
+	__u32 i, zero = 0;
+	int fd, err;
+
+	st_ops = &obj->st_ops;
+	if (!st_ops->kern_vdata)
+		return 0;
+
+	tname = st_ops->tname;
+	for (i = 0; i < btf_vlen(st_ops->type); i++) {
+		struct bpf_program *prog = st_ops->progs[i];
+		void *kern_data;
+		int prog_fd;
+
+		if (!prog)
+			continue;
+
+		prog_fd = bpf_program__nth_fd(prog, 0);
+		if (prog_fd < 0) {
+			pr_warn("struct_ops register %s: prog %s is not loaded\n",
+				tname, prog->name);
+			return -EINVAL;
+		}
+
+		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
+		*(unsigned long *)kern_data = prog_fd;
+	}
+
+	map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
+	map_attr.key_size = sizeof(unsigned int);
+	map_attr.value_size = st_ops->kern_vtype_size;
+	map_attr.max_entries = 1;
+	map_attr.btf_fd = btf__fd(obj->btf);
+	map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
+	map_attr.name = st_ops->var_name;
+
+	fd = bpf_create_map_xattr(&map_attr);
+	if (fd < 0) {
+		err = -errno;
+		pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
+			tname);
+		return err;
+	}
+
+	err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);
+	if (err) {
+		err = -errno;
+		close(fd);
+		pr_warn("struct_ops register %s: Error in updating struct_ops map\n",
+			tname);
+		return err;
+	}
+
+	st_ops->fd = fd;
+
+	return 0;
+}
+
+static int bpf_struct_ops__unregister(struct bpf_struct_ops *st_ops)
+{
+	if (st_ops->fd != -1) {
+		__u32 zero = 0;
+		int err = 0;
+
+		if (bpf_map_delete_elem(st_ops->fd, &zero))
+			err = -errno;
+		zclose(st_ops->fd);
+
+		return err;
+	}
+
+	return 0;
+}
+
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+
+static const struct btf_member *
+find_member_by_offset(const struct btf_type *t, __u32 offset)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (btf_member_bit_offset(t, i) == offset)
+			return m;
+	}
+
+	return NULL;
+}
+
+static const struct btf_member *
+find_member_by_name(const struct btf *btf, const struct btf_type *t,
+		    const char *name)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (!strcmp(btf__name_by_offset(btf, m->name_off), name))
+			return m;
+	}
+
+	return NULL;
+}
+
+#define STRUCT_OPS_VALUE_PREFIX "__bpf_"
+#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1)
+
+static int
+bpf_struct_ops__get_kern_types(const struct btf *btf, const char *tname,
+			       const struct btf_type **type, __u32 *type_id,
+			       const struct btf_type **vtype, __u32 *vtype_id,
+			       const struct btf_member **data_member)
+{
+	const struct btf_type *kern_type, *kern_vtype;
+	const struct btf_member *kern_data_member;
+	__s32 kern_vtype_id, kern_type_id;
+	char vtname[128] = STRUCT_OPS_VALUE_PREFIX;
+	__u32 i;
+
+	kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
+	if (kern_type_id < 0) {
+		pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
+			tname);
+		return -ENOTSUP;
+	}
+	kern_type = btf__type_by_id(btf, kern_type_id);
+
+	/* Find the corresponding "map_value" type that will be used
+	 * in map_update(BPF_MAP_TYPE_STRUCT_OPS).  For example,
+	 * find "struct __bpf_tcp_congestion_ops" from the btf_vmlinux.
+	 */
+	strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname,
+		sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1);
+	kern_vtype_id = btf__find_by_name_kind(btf, vtname,
+					       BTF_KIND_STRUCT);
+	if (kern_vtype_id < 0) {
+		pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
+			vtname);
+		return -ENOTSUP;
+	}
+	kern_vtype = btf__type_by_id(btf, kern_vtype_id);
+
+	/* Find "struct tcp_congestion_ops" from
+	 * struct __bpf_tcp_congestion_ops {
+	 *	[ ... ]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 */
+	for (i = 0, kern_data_member = btf_members(kern_vtype);
+	     i < btf_vlen(kern_vtype);
+	     i++, kern_data_member++) {
+		if (kern_data_member->type == kern_type_id)
+			break;
+	}
+	if (i == btf_vlen(kern_vtype)) {
+		pr_warn("struct_ops prepare: struct %s data is not found in struct %s\n",
+			tname, vtname);
+		return -EINVAL;
+	}
+
+	*type = kern_type;
+	*type_id = kern_type_id;
+	*vtype = kern_vtype;
+	*vtype_id = kern_vtype_id;
+	*data_member = kern_data_member;
+
+	return 0;
+}
+
+static int bpf_object__prepare_struct_ops(struct bpf_object *obj)
+{
+	const struct btf_member *member, *kern_member, *kern_data_member;
+	const struct btf_type *type, *kern_type, *kern_vtype;
+	__u32 i, kern_type_id, kern_vtype_id, kern_data_off;
+	struct bpf_struct_ops *st_ops;
+	void *data, *kern_data;
+	const struct btf *btf;
+	struct btf *kern_btf;
+	const char *tname;
+	int err;
+
+	st_ops = &obj->st_ops;
+	if (!st_ops->data)
+		return 0;
+
+	btf = obj->btf;
+	type = st_ops->type;
+	tname = st_ops->tname;
+
+	kern_btf = bpf_core_find_kernel_btf();
+	if (IS_ERR(kern_btf))
+		return PTR_ERR(kern_btf);
+
+	err = bpf_struct_ops__get_kern_types(kern_btf, tname,
+					     &kern_type, &kern_type_id,
+					     &kern_vtype, &kern_vtype_id,
+					     &kern_data_member);
+	if (err)
+		goto done;
+
+	pr_debug("struct_ops prepare %s: type_id:%u kern_type_id:%u kern_vtype_id:%u\n",
+		 tname, st_ops->type_id, kern_type_id, kern_vtype_id);
+
+	kern_data_off = kern_data_member->offset / 8;
+	st_ops->kern_vtype_size = kern_vtype->size;
+	st_ops->kern_vtype_id = kern_vtype_id;
+
+	st_ops->kern_vdata = calloc(1, st_ops->kern_vtype_size);
+	if (!st_ops->kern_vdata) {
+		err = -ENOMEM;
+		goto done;
+	}
+
+	data = st_ops->data;
+	kern_data = st_ops->kern_vdata + kern_data_off;
+
+	err = -ENOTSUP;
+	for (i = 0, member = btf_members(type); i < btf_vlen(type);
+	     i++, member++) {
+		const struct btf_type *mtype, *kern_mtype;
+		__u32 mtype_id, kern_mtype_id;
+		void *mdata, *kern_mdata;
+		__s64 msize, kern_msize;
+		__u32 moff, kern_moff;
+		__u32 kern_member_idx;
+		const char *mname;
+
+		mname = btf__name_by_offset(btf, member->name_off);
+		kern_member = find_member_by_name(kern_btf, kern_type, mname);
+		if (!kern_member) {
+			pr_warn("struct_ops prepare %s: Cannot find member %s in kernel BTF\n",
+				tname, mname);
+			goto done;
+		}
+
+		kern_member_idx = kern_member - btf_members(kern_type);
+		if (btf_member_bitfield_size(type, i) ||
+		    btf_member_bitfield_size(kern_type, kern_member_idx)) {
+			pr_warn("struct_ops prepare %s: bitfield %s is not supported\n",
+				tname, mname);
+			goto done;
+		}
+
+		moff = member->offset / 8;
+		kern_moff = kern_member->offset / 8;
+
+		mdata = data + moff;
+		kern_mdata = kern_data + kern_moff;
+
+		mtype_id = member->type;
+		kern_mtype_id = kern_member->type;
+
+		mtype = resolve_ptr(btf, mtype_id, NULL);
+		kern_mtype = resolve_ptr(kern_btf, kern_mtype_id, NULL);
+		if (mtype && kern_mtype) {
+			struct bpf_program *prog;
+
+			if (!btf_is_func_proto(mtype) ||
+			    !btf_is_func_proto(kern_mtype)) {
+				pr_warn("struct_ops prepare %s: non func ptr %s is not supported\n",
+					tname, mname);
+				goto done;
+			}
+
+			prog = st_ops->progs[i];
+			if (!prog) {
+				pr_debug("struct_ops prepare %s: func ptr %s is not set\n",
+					 tname, mname);
+				continue;
+			}
+
+			if (prog->type != BPF_PROG_TYPE_UNSPEC ||
+			    prog->attach_btf_id) {
+				pr_warn("struct_ops prepare %s: cannot use prog %s with attach_btf_id %d for func ptr %s\n",
+					tname, prog->name, prog->attach_btf_id, mname);
+				err = -EINVAL;
+				goto done;
+			}
+
+			prog->type = BPF_PROG_TYPE_STRUCT_OPS;
+			prog->attach_btf_id = kern_type_id;
+			/* expected_attach_type is the member index */
+			prog->expected_attach_type = kern_member_idx;
+
+			st_ops->kern_func_off[i] = kern_data_off + kern_moff;
+
+			pr_debug("struct_ops prepare %s: func ptr %s is set to prog %s from data(+%u) to kern_data(+%u)\n",
+				 tname, mname, prog->name, moff, kern_moff);
+
+			continue;
+		}
+
+		mtype_id = btf__resolve_type(btf, mtype_id);
+		kern_mtype_id = btf__resolve_type(kern_btf, kern_mtype_id);
+		if (mtype_id < 0 || kern_mtype_id < 0) {
+			pr_warn("struct_ops prepare %s: Cannot resolve the type for %s\n",
+				tname, mname);
+			goto done;
+		}
+
+		mtype = btf__type_by_id(btf, mtype_id);
+		kern_mtype = btf__type_by_id(kern_btf, kern_mtype_id);
+		if (BTF_INFO_KIND(mtype->info) !=
+		    BTF_INFO_KIND(kern_mtype->info)) {
+			pr_warn("struct_ops prepare %s: Unmatched member type %s %u != %u(kernel)\n",
+				tname, mname,
+				BTF_INFO_KIND(mtype->info),
+				BTF_INFO_KIND(kern_mtype->info));
+			goto done;
+		}
+
+		msize = btf__resolve_size(btf, mtype_id);
+		kern_msize = btf__resolve_size(kern_btf, kern_mtype_id);
+		if (msize < 0 || kern_msize < 0 || msize != kern_msize) {
+			pr_warn("struct_ops prepare %s: Error in size of member %s: %zd != %zd(kernel)\n",
+				tname, mname,
+				(ssize_t)msize, (ssize_t)kern_msize);
+			goto done;
+		}
+
+		pr_debug("struct_ops prepare %s: copy %s %u bytes from data(+%u) to kern_data(+%u)\n",
+			 tname, mname, (unsigned int)msize,
+			 moff, kern_moff);
+		memcpy(kern_mdata, mdata, msize);
+	}
+
+	err = 0;
+
+done:
+	/* On error, bpf_object__unload() will free
+	 * st_ops->kern_vdata.
+	 */
+	btf__free(kern_btf);
+	return err;
+}
+
+static int bpf_object__collect_struct_ops_reloc(struct bpf_object *obj,
+						GElf_Shdr *shdr,
+						Elf_Data *data)
+{
+	const struct btf_member *member;
+	struct bpf_struct_ops *st_ops;
+	struct bpf_program *prog;
+	const char *name, *tname;
+	unsigned int shdr_idx;
+	const struct btf *btf;
+	Elf_Data *symbols;
+	unsigned int moff;
+	GElf_Sym sym;
+	GElf_Rel rel;
+	int i, nrels;
+
+	symbols = obj->efile.symbols;
+	btf = obj->btf;
+	st_ops = &obj->st_ops;
+	tname = st_ops->tname;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+	for (i = 0; i < nrels; i++) {
+		if (!gelf_getrel(data, i, &rel)) {
+			pr_warn("struct_ops reloc %s: failed to get %d reloc\n",
+				tname, i);
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) {
+			pr_warn("struct_ops reloc %s: symbol %" PRIx64 " not found\n",
+				tname, GELF_R_SYM(rel.r_info));
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		name = elf_strptr(obj->efile.elf, obj->efile.strtabidx,
+				  sym.st_name) ? : "<?>";
+
+		pr_debug("%s relo for %lld value %lld name %d (\'%s\')\n",
+			 tname,
+			 (long long) (rel.r_info >> 32),
+			 (long long) sym.st_value, sym.st_name, name);
+
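+		/* r_offset is the byte offset of the func ptr within
+		 * the struct_ops variable; BTF member offsets are in
+		 * bits, hence the "moff * 8" lookup below.
+		 */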
+		shdr_idx = sym.st_shndx;
+		moff = rel.r_offset;
+		pr_debug("struct_ops reloc %s: moff=%u, shdr_idx=%u\n",
+			 tname, moff, shdr_idx);
+
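+		/* A reserved section index (>= SHN_LORESERVE) cannot be
+		 * mapped back to a prog section, which is the case for
+		 * non-static functions.
+		 */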
+		if (shdr_idx >= SHN_LORESERVE) {
+			pr_warn("struct_ops reloc %s: moff=%u shdr_idx=%u unsupported non-static function\n",
+				tname, moff, shdr_idx);
+			return -LIBBPF_ERRNO__RELOC;
+		}
+
+		member = find_member_by_offset(st_ops->type, moff * 8);
+		if (!member) {
+			pr_warn("struct_ops reloc %s: cannot find member at moff=%u\n",
+				tname, moff);
+			return -EINVAL;
+		}
+		name = btf__name_by_offset(btf, member->name_off);
+
+		if (!resolve_func_ptr(btf, member->type, NULL)) {
+			pr_warn("struct_ops reloc %s: cannot relocate non func ptr %s\n",
+				tname, name);
+			return -EINVAL;
+		}
+
+		prog = bpf_object__find_prog_by_idx(obj, shdr_idx);
+		if (!prog) {
+			pr_warn("struct_ops reloc %s: cannot find prog at shdr_idx %u to relocate func ptr %s\n",
+				tname, shdr_idx, name);
+			return -EINVAL;
+		}
+
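+		/* Remember the prog here.  bpf_object__prepare_struct_ops()
+		 * will set its type and attach_btf_id at load time.
+		 */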
+		st_ops->progs[member - btf_members(st_ops->type)] = prog;
+	}
+
+	return 0;
+}
+
+static int bpf_object__init_struct_ops(struct bpf_object *obj)
+{
+	const struct btf_type *type, *datasec;
+	const struct btf_var_secinfo *vsi;
+	struct bpf_struct_ops *st_ops;
+	const char *tname, *var_name;
+	__s32 type_id, datasec_id;
+	const struct btf *btf;
+
+	if (obj->efile.st_ops_shndx == -1)
+		return 0;
+
+	btf = obj->btf;
+	st_ops = &obj->st_ops;
+	datasec_id = btf__find_by_name_kind(btf, BPF_STRUCT_OPS_SEC_NAME,
+					    BTF_KIND_DATASEC);
+	if (datasec_id < 0) {
+		pr_warn("struct_ops init: DATASEC %s not found\n",
+			BPF_STRUCT_OPS_SEC_NAME);
+		return -EINVAL;
+	}
+
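+	/* Only one struct_ops variable per object is supported for
+	 * now, so the DATASEC must contain exactly one VAR.
+	 */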
+	datasec = btf__type_by_id(btf, datasec_id);
+	if (btf_vlen(datasec) != 1) {
+		pr_warn("struct_ops init: multiple VAR in DATASEC %s\n",
+			BPF_STRUCT_OPS_SEC_NAME);
+		return -ENOTSUP;
+	}
+	vsi = btf_var_secinfos(datasec);
+
+	type = btf__type_by_id(obj->btf, vsi->type);
+	if (!btf_is_var(type)) {
+		pr_warn("struct_ops init: vsi->type %u is not a VAR\n",
+			vsi->type);
+		return -EINVAL;
+	}
+	var_name = btf__name_by_offset(obj->btf, type->name_off);
+
+	type_id = btf__resolve_type(obj->btf, vsi->type);
+	if (type_id < 0) {
+		pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n",
+			vsi->type, BPF_STRUCT_OPS_SEC_NAME);
+		return -EINVAL;
+	}
+
+	type = btf__type_by_id(obj->btf, type_id);
+	tname = btf__name_by_offset(obj->btf, type->name_off);
+	if (!btf_is_struct(type)) {
+		pr_warn("struct_ops init: %s is not a struct\n", tname);
+		return -EINVAL;
+	}
+
+	if (type->size != obj->efile.st_ops_data->d_size) {
+		pr_warn("struct_ops init: %s unmatched size %u (BTF DATASEC) != %zu (ELF)\n",
+			tname, type->size, obj->efile.st_ops_data->d_size);
+		return -EINVAL;
+	}
+
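+	/* One entry per member.  progs[] is calloc'ed so that the
+	 * slots of members without a bpf prog stay NULL.
+	 */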
+	st_ops->data = malloc(type->size);
+	st_ops->progs = calloc(btf_vlen(type), sizeof(*st_ops->progs));
+	st_ops->kern_func_off = malloc(btf_vlen(type) *
+				       sizeof(*st_ops->kern_func_off));
+	/* bpf_object__close() will take care of freeing these */
+	if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off)
+		return -ENOMEM;
+	memcpy(st_ops->data, obj->efile.st_ops_data->d_buf, type->size);
+
+	st_ops->tname = tname;
+	st_ops->type = type;
+	st_ops->type_id = type_id;
+	st_ops->var_name = var_name;
+
+	pr_debug("struct_ops init: %s found. type_id:%u\n", tname, type_id);
+
+	return 0;
+}
+
 static struct bpf_object *bpf_object__new(const char *path,
 					  const void *obj_buf,
 					  size_t obj_buf_sz,
@@ -550,6 +1087,9 @@  static struct bpf_object *bpf_object__new(const char *path,
 	obj->efile.data_shndx = -1;
 	obj->efile.rodata_shndx = -1;
 	obj->efile.bss_shndx = -1;
+	obj->efile.st_ops_shndx = -1;
+
+	obj->st_ops.fd = -1;
 
 	obj->kern_version = get_kernel_version();
 	obj->loaded = false;
@@ -572,6 +1112,7 @@  static void bpf_object__elf_finish(struct bpf_object *obj)
 	obj->efile.data = NULL;
 	obj->efile.rodata = NULL;
 	obj->efile.bss = NULL;
+	obj->efile.st_ops_data = NULL;
 
 	zfree(&obj->efile.reloc_sects);
 	obj->efile.nr_reloc_sects = 0;
@@ -757,6 +1298,9 @@  int bpf_object__section_size(const struct bpf_object *obj, const char *name,
 	} else if (!strcmp(name, ".rodata")) {
 		if (obj->efile.rodata)
 			*size = obj->efile.rodata->d_size;
+	} else if (!strcmp(name, BPF_STRUCT_OPS_SEC_NAME)) {
+		if (obj->efile.st_ops_data)
+			*size = obj->efile.st_ops_data->d_size;
 	} else {
 		ret = bpf_object_search_section_size(obj, name, &d_size);
 		if (!ret)
@@ -1060,6 +1604,30 @@  skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
 	return t;
 }
 
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = skip_mods_and_typedefs(btf, id, NULL);
+	if (!btf_is_ptr(t))
+		return NULL;
+
+	return skip_mods_and_typedefs(btf, t->type, res_id);
+}
+
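+/* like resolve_ptr(), but only succeeds if the pointee is a FUNC_PROTO */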
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = resolve_ptr(btf, id, res_id);
+	if (t && btf_is_func_proto(t))
+		return t;
+
+	return NULL;
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -1509,7 +2077,7 @@  static void bpf_object__sanitize_btf_ext(struct bpf_object *obj)
 
 static bool bpf_object__is_btf_mandatory(const struct bpf_object *obj)
 {
-	return obj->efile.btf_maps_shndx >= 0;
+	return obj->efile.btf_maps_shndx >= 0 || obj->efile.st_ops_shndx >= 0;
 }
 
 static int bpf_object__init_btf(struct bpf_object *obj,
@@ -1689,6 +2257,9 @@  static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 			} else if (strcmp(name, ".rodata") == 0) {
 				obj->efile.rodata = data;
 				obj->efile.rodata_shndx = idx;
+			} else if (strcmp(name, BPF_STRUCT_OPS_SEC_NAME) == 0) {
+				obj->efile.st_ops_data = data;
+				obj->efile.st_ops_shndx = idx;
 			} else {
 				pr_debug("skip section(%d) %s\n", idx, name);
 			}
@@ -1698,7 +2269,8 @@  static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 			int sec = sh.sh_info; /* points to other section */
 
 			/* Only do relo for section with exec instructions */
-			if (!section_have_execinstr(obj, sec)) {
+			if (!section_have_execinstr(obj, sec) &&
+			    !strstr(name, BPF_STRUCT_OPS_SEC_NAME)) {
 				pr_debug("skip relo %s(%d) for section(%d)\n",
 					 name, idx, sec);
 				continue;
@@ -1735,6 +2307,8 @@  static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 		err = bpf_object__sanitize_and_load_btf(obj);
 	if (!err)
 		err = bpf_object__init_prog_names(obj);
+	if (!err)
+		err = bpf_object__init_struct_ops(obj);
 	return err;
 }
 
@@ -3700,6 +4274,13 @@  static int bpf_object__collect_reloc(struct bpf_object *obj)
 			return -LIBBPF_ERRNO__INTERNAL;
 		}
 
+		if (idx == obj->efile.st_ops_shndx) {
+			err = bpf_object__collect_struct_ops_reloc(obj, shdr, data);
+			if (err)
+				return err;
+			continue;
+		}
+
 		prog = bpf_object__find_prog_by_idx(obj, idx);
 		if (!prog) {
 			pr_warn("relocation failed: no section(%d)\n", idx);
@@ -3734,7 +4315,9 @@  load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt,
 	load_attr.insns = insns;
 	load_attr.insns_cnt = insns_cnt;
 	load_attr.license = license;
-	if (prog->type == BPF_PROG_TYPE_TRACING) {
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		load_attr.attach_btf_id = prog->attach_btf_id;
+	} else if (prog->type == BPF_PROG_TYPE_TRACING) {
 		load_attr.attach_prog_fd = prog->attach_prog_fd;
 		load_attr.attach_btf_id = prog->attach_btf_id;
 	} else {
@@ -3952,6 +4535,7 @@  __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 	if (IS_ERR(obj))
 		return obj;
 
+	obj->st_ops.unreg = OPTS_GET(opts, unreg_st_ops, false);
 	obj->relaxed_core_relocs = OPTS_GET(opts, relaxed_core_relocs, false);
 	relaxed_maps = OPTS_GET(opts, relaxed_maps, false);
 	pin_root_path = OPTS_GET(opts, pin_root_path, NULL);
@@ -4077,6 +4661,10 @@  int bpf_object__unload(struct bpf_object *obj)
 	for (i = 0; i < obj->nr_programs; i++)
 		bpf_program__unload(&obj->programs[i]);
 
+	if (obj->st_ops.unreg)
+		bpf_struct_ops__unregister(&obj->st_ops);
+	zfree(&obj->st_ops.kern_vdata);
+
 	return 0;
 }
 
@@ -4100,7 +4688,9 @@  int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
 
 	CHECK_ERR(bpf_object__create_maps(obj), err, out);
 	CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
+	CHECK_ERR(bpf_object__prepare_struct_ops(obj), err, out);
 	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
+	CHECK_ERR(bpf_object__register_struct_ops(obj), err, out);
 
 	return 0;
 out:
@@ -4690,6 +5280,9 @@  void bpf_object__close(struct bpf_object *obj)
 			bpf_program__exit(&obj->programs[i]);
 	}
 	zfree(&obj->programs);
+	zfree(&obj->st_ops.data);
+	zfree(&obj->st_ops.progs);
+	zfree(&obj->st_ops.kern_func_off);
 
 	list_del(&obj->list);
 	free(obj);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0dbf4bfba0c4..db255fce4948 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -109,8 +109,9 @@  struct bpf_object_open_opts {
 	 */
 	const char *pin_root_path;
 	__u32 attach_prog_fd;
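+	/* unregister the struct_ops from the kernel during
+	 * bpf_object__close() (it stays registered by default)
+	 */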
+	bool unreg_st_ops;
 };
-#define bpf_object_open_opts__last_field attach_prog_fd
+#define bpf_object_open_opts__last_field unreg_st_ops
 
 LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
 LIBBPF_API struct bpf_object *
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index a9eb8b322671..7f06942e9574 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -103,6 +103,7 @@  probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	case BPF_PROG_TYPE_TRACING:
+	case BPF_PROG_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}
@@ -251,6 +252,7 @@  bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
 	case BPF_MAP_TYPE_XSKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+	case BPF_MAP_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}