diff mbox series

[bpf-next,5/6] libbpf: support libbpf-provided extern variables

Message ID 20191117070807.251360-6-andriin@fb.com
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series Add libbpf-provided extern variables support | expand

Commit Message

Andrii Nakryiko Nov. 17, 2019, 7:08 a.m. UTC
Add support for extern variables, provided to BPF program by libbpf. Currently
the following extern variables are supported:
  - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
    executing, follows KERNEL_VERSION() macro convention;
  - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
    boolean, and integer values are supported. Strings are not supported at
    the moment.

All values are represented as 64-bit integers, with the follow value encoding:
  - for boolean values, y is 1, n or missing value is 0;
  - for tristate values, y is 1, m is 2, n or missing value is 0;
  - for integers, the values is 64-bit integer, sign-extended, if negative; if
    config value is missing, it's represented as 0, which makes explicit 0 and
    missing config value indistinguishable. If this will turn out to be
    a problem in practice, we'll need to deal with it somehow.

Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
expected type (bool, tristate, int), so it doesn't enforce any specific set of
values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
found in config file are set to 0. If any of declared
externs are unrecognized by libbpf, error is returned on bpf_object__open().
In the future it will be possible to extend available set of supported externs
to include new libbpf-provided values, if necessary, or linking against
kernel- or other BPF object-provided global variables/functions.

Config file itself is searched in /boot/config-$(uname -r) location with
fallback to /proc/config.gz, unless config path is specified explicitly
through bpf_object_open_opts' kernel_config_path option. Both gzipped and
plain text formats are supported. Libbpf adds explicit dependency on zlib
because of this, but this shouldn't be a problem, given libelf already depends
on zlib.

All detected extern variables, are put into a separate .extern internal map.
It, similarly to .rodata map, is marked as read-only from BPF program side, as
well as is frozen on load. This allows BPF verifier to track extern values as
constants and performe enhanced branch prediction and dead code elimination.
This can be relied upon for doing kernel version/feature detection and using
potentially unsupported field relocations or BPF helpers in a CO-RE-based BPF
program, while still having a single version of BPF program running on old and
new kernels. Selftests are validating this explicitly for unexisting BPF
helper.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/Makefile               |  17 +-
 tools/lib/bpf/libbpf.c               | 471 ++++++++++++++++++++++-----
 tools/lib/bpf/libbpf.h               |   8 +-
 tools/testing/selftests/bpf/Makefile |   2 +-
 4 files changed, 409 insertions(+), 89 deletions(-)

Comments

Alexei Starovoitov Nov. 19, 2019, 3:21 a.m. UTC | #1
On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
> Add support for extern variables, provided to BPF program by libbpf. Currently
> the following extern variables are supported:
>   - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
>     executing, follows KERNEL_VERSION() macro convention;
>   - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
>     boolean, and integer values are supported. Strings are not supported at
>     the moment.
> 
> All values are represented as 64-bit integers, with the follow value encoding:
>   - for boolean values, y is 1, n or missing value is 0;
>   - for tristate values, y is 1, m is 2, n or missing value is 0;
>   - for integers, the values is 64-bit integer, sign-extended, if negative; if
>     config value is missing, it's represented as 0, which makes explicit 0 and
>     missing config value indistinguishable. If this will turn out to be
>     a problem in practice, we'll need to deal with it somehow.

I read that statement as there is no extensibility for such api.

> Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
> expected type (bool, tristate, int), so it doesn't enforce any specific set of
> values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
> found in config file are set to 0.

This is not pretty either.

> +
> +		switch (*value) {
> +		case 'n':
> +			*ext_val = 0;
> +			break;
> +		case 'y':
> +			*ext_val = 1;
> +			break;
> +		case 'm':
> +			*ext_val = 2;
> +			break;
> +		case '"':
> +			pr_warn("extern '%s': strings are not supported\n",
> +				ext->name);
> +			err = -EINVAL;
> +			goto out;
> +		default:
> +			errno = 0;
> +			*ext_val = strtoull(value, &value_end, 10);
> +			if (errno) {
> +				err = -errno;
> +				pr_warn("extern '%s': failed to parse value: %d\n",
> +					ext->name, err);
> +				goto out;
> +			}

BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
.config has as bytes to the program and let program parse n/y/m, strings and
integers.

LINUX_KERNEL_VERSION is a special case and can stay as u64.
Andrii Nakryiko Nov. 19, 2019, 6:57 a.m. UTC | #2
On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
> > Add support for extern variables, provided to BPF program by libbpf. Currently
> > the following extern variables are supported:
> >   - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
> >     executing, follows KERNEL_VERSION() macro convention;
> >   - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
> >     boolean, and integer values are supported. Strings are not supported at
> >     the moment.
> >
> > All values are represented as 64-bit integers, with the follow value encoding:
> >   - for boolean values, y is 1, n or missing value is 0;
> >   - for tristate values, y is 1, m is 2, n or missing value is 0;
> >   - for integers, the values is 64-bit integer, sign-extended, if negative; if
> >     config value is missing, it's represented as 0, which makes explicit 0 and
> >     missing config value indistinguishable. If this will turn out to be
> >     a problem in practice, we'll need to deal with it somehow.
>
> I read that statement as there is no extensibility for such api.

What do you mean exactly?

Are you worried about 0 vs undefined case? I don't think it's going to
be a problem in practice. Looking at my .config, I see that integer
config values set to their default values are still explicitly
specified with those values. E.g.,

CONFIG_HZ_1000=y
CONFIG_HZ=1000

CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
set. So while I won't claim that it's the case for any possible
integer config, it seems to be pretty consistent in practice.

Also, I see a lot of values set to explicit 0, like:

CONFIG_BASE_SMALL=0

So it seems like integers are typically spelled out explicitly in real
configs and I think this 0 default is pretty sane.

Next, speaking about extensibility. Once we have BTF type info for
externs, our possibilities are much better. It will be possible to
support bool, int, in64 for the same bool value. Libbpf will be able
to validate the range and fail program load if declared extern type
doesn't match actual value type and value range. So I think
extensibility is there, but right now we are enforcing (logically)
everything to be uin64_t. Unfortunately, with the way externs are done
in ELF, I don't know neither type nor size, so can't be more strict
than that.

If we really need to know whether some config value is defined or not,
regardless of its value, we can have it by convention. E.g.,
CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
CONFIG_XXX is defined explicitly or not. But I don't want to add that
until we really have a use case where it matters.

>
> > Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
> > expected type (bool, tristate, int), so it doesn't enforce any specific set of
> > values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
> > found in config file are set to 0.
>
> This is not pretty either.

What exactly: defaulting to zero or not knowing config value's type?
Given all the options, defaulting to zero seems like the best way to
go.

>
> > +
> > +             switch (*value) {
> > +             case 'n':
> > +                     *ext_val = 0;
> > +                     break;
> > +             case 'y':
> > +                     *ext_val = 1;
> > +                     break;
> > +             case 'm':
> > +                     *ext_val = 2;
> > +                     break;

reading some more code from scripts/kconfig/symbol.c, I'll need to
handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
more feedback on this version.

> > +             case '"':
> > +                     pr_warn("extern '%s': strings are not supported\n",
> > +                             ext->name);
> > +                     err = -EINVAL;
> > +                     goto out;
> > +             default:
> > +                     errno = 0;
> > +                     *ext_val = strtoull(value, &value_end, 10);
> > +                     if (errno) {
> > +                             err = -errno;
> > +                             pr_warn("extern '%s': failed to parse value: %d\n",
> > +                                     ext->name, err);
> > +                             goto out;
> > +                     }
>
> BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
> .config has as bytes to the program and let program parse n/y/m, strings and
> integers.

Config value is not changing. This is an incredible waste of CPU
resources to re-parse same value over and over again. And it's
incredibly much worse usability as well. Again, once we have BTF for
externs, we can just declare values as const char[] and then user will
be able to do its own parsing. Until then, I think pre-parsing values
into convenient u64 types are much better and handles all the typical
cases.

>
> LINUX_KERNEL_VERSION is a special case and can stay as u64.
>
Andrii Nakryiko Nov. 19, 2019, 3:42 p.m. UTC | #3
On Mon, Nov 18, 2019 at 10:57 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
> > > Add support for extern variables, provided to BPF program by libbpf. Currently
> > > the following extern variables are supported:
> > >   - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
> > >     executing, follows KERNEL_VERSION() macro convention;
> > >   - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
> > >     boolean, and integer values are supported. Strings are not supported at
> > >     the moment.
> > >
> > > All values are represented as 64-bit integers, with the follow value encoding:
> > >   - for boolean values, y is 1, n or missing value is 0;
> > >   - for tristate values, y is 1, m is 2, n or missing value is 0;
> > >   - for integers, the values is 64-bit integer, sign-extended, if negative; if
> > >     config value is missing, it's represented as 0, which makes explicit 0 and
> > >     missing config value indistinguishable. If this will turn out to be
> > >     a problem in practice, we'll need to deal with it somehow.
> >
> > I read that statement as there is no extensibility for such api.
>
> What do you mean exactly?
>
> Are you worried about 0 vs undefined case? I don't think it's going to
> be a problem in practice. Looking at my .config, I see that integer
> config values set to their default values are still explicitly
> specified with those values. E.g.,
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
>
> CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
> set. So while I won't claim that it's the case for any possible
> integer config, it seems to be pretty consistent in practice.
>
> Also, I see a lot of values set to explicit 0, like:
>
> CONFIG_BASE_SMALL=0
>
> So it seems like integers are typically spelled out explicitly in real
> configs and I think this 0 default is pretty sane.
>
> Next, speaking about extensibility. Once we have BTF type info for
> externs, our possibilities are much better. It will be possible to
> support bool, int, in64 for the same bool value. Libbpf will be able
> to validate the range and fail program load if declared extern type
> doesn't match actual value type and value range. So I think
> extensibility is there, but right now we are enforcing (logically)
> everything to be uin64_t. Unfortunately, with the way externs are done
> in ELF, I don't know neither type nor size, so can't be more strict
> than that.
>
> If we really need to know whether some config value is defined or not,
> regardless of its value, we can have it by convention. E.g.,
> CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
> CONFIG_XXX is defined explicitly or not. But I don't want to add that
> until we really have a use case where it matters.
>
> >
> > > Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
> > > expected type (bool, tristate, int), so it doesn't enforce any specific set of
> > > values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
> > > found in config file are set to 0.
> >
> > This is not pretty either.
>
> What exactly: defaulting to zero or not knowing config value's type?
> Given all the options, defaulting to zero seems like the best way to
> go.
>
> >
> > > +
> > > +             switch (*value) {
> > > +             case 'n':
> > > +                     *ext_val = 0;
> > > +                     break;
> > > +             case 'y':
> > > +                     *ext_val = 1;
> > > +                     break;
> > > +             case 'm':
> > > +                     *ext_val = 2;
> > > +                     break;
>
> reading some more code from scripts/kconfig/symbol.c, I'll need to
> handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
> more feedback on this version.
>
> > > +             case '"':
> > > +                     pr_warn("extern '%s': strings are not supported\n",
> > > +                             ext->name);
> > > +                     err = -EINVAL;
> > > +                     goto out;
> > > +             default:
> > > +                     errno = 0;
> > > +                     *ext_val = strtoull(value, &value_end, 10);
> > > +                     if (errno) {
> > > +                             err = -errno;
> > > +                             pr_warn("extern '%s': failed to parse value: %d\n",
> > > +                                     ext->name, err);
> > > +                             goto out;
> > > +                     }
> >
> > BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
> > .config has as bytes to the program and let program parse n/y/m, strings and
> > integers.
>
> Config value is not changing. This is an incredible waste of CPU
> resources to re-parse same value over and over again. And it's
> incredibly much worse usability as well. Again, once we have BTF for
> externs, we can just declare values as const char[] and then user will
> be able to do its own parsing. Until then, I think pre-parsing values
> into convenient u64 types are much better and handles all the typical
> cases.


One more thing I didn't realize I didn't state explicitly, because
I've been thinking and talking about that for so long now, that it
kind of internalized completely.

These externs, including CONFIG_XXX ones, are meant to interoperate
nicely with field relocations within BPF CO-RE concept. They are,
among other things, are meant to disable parts of BPF program logic
through verifier's dead code elimination by doing something like:


if (CONFIG_SOME_FEATURES_ENABLED) {
    BPF_CORE_READ(t, some_extra_field);
    /* or */
    bpf_helper_that_only_present_when_feature_is_enabled();
} else {
    /* fallback logic */
}

With CONFIG_SOME_FEATURES_ENABLED not being a read-only integer
constant when BPF program is loaded, this is impossible. So it
absolutely must be some sort of easy to use integer constant. With BTF
for externs we can have more flexibility in the future, of course.

>
> >
> > LINUX_KERNEL_VERSION is a special case and can stay as u64.
> >
Alexei Starovoitov Nov. 19, 2019, 3:58 p.m. UTC | #4
On 11/19/19 7:42 AM, Andrii Nakryiko wrote:
> On Mon, Nov 18, 2019 at 10:57 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>> On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
>>>> Add support for extern variables, provided to BPF program by libbpf. Currently
>>>> the following extern variables are supported:
>>>>    - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
>>>>      executing, follows KERNEL_VERSION() macro convention;
>>>>    - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
>>>>      boolean, and integer values are supported. Strings are not supported at
>>>>      the moment.
>>>>
>>>> All values are represented as 64-bit integers, with the follow value encoding:
>>>>    - for boolean values, y is 1, n or missing value is 0;
>>>>    - for tristate values, y is 1, m is 2, n or missing value is 0;
>>>>    - for integers, the values is 64-bit integer, sign-extended, if negative; if
>>>>      config value is missing, it's represented as 0, which makes explicit 0 and
>>>>      missing config value indistinguishable. If this will turn out to be
>>>>      a problem in practice, we'll need to deal with it somehow.
>>>
>>> I read that statement as there is no extensibility for such api.
>>
>> What do you mean exactly?
>>
>> Are you worried about 0 vs undefined case? I don't think it's going to
>> be a problem in practice. Looking at my .config, I see that integer
>> config values set to their default values are still explicitly
>> specified with those values. E.g.,
>>
>> CONFIG_HZ_1000=y
>> CONFIG_HZ=1000
>>
>> CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
>> set. So while I won't claim that it's the case for any possible
>> integer config, it seems to be pretty consistent in practice.
>>
>> Also, I see a lot of values set to explicit 0, like:
>>
>> CONFIG_BASE_SMALL=0
>>
>> So it seems like integers are typically spelled out explicitly in real
>> configs and I think this 0 default is pretty sane.
>>
>> Next, speaking about extensibility. Once we have BTF type info for
>> externs, our possibilities are much better. It will be possible to
>> support bool, int, in64 for the same bool value. Libbpf will be able
>> to validate the range and fail program load if declared extern type
>> doesn't match actual value type and value range. So I think
>> extensibility is there, but right now we are enforcing (logically)
>> everything to be uin64_t. Unfortunately, with the way externs are done
>> in ELF, I don't know neither type nor size, so can't be more strict
>> than that.
>>
>> If we really need to know whether some config value is defined or not,
>> regardless of its value, we can have it by convention. E.g.,
>> CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
>> CONFIG_XXX is defined explicitly or not. But I don't want to add that
>> until we really have a use case where it matters.
>>
>>>
>>>> Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
>>>> expected type (bool, tristate, int), so it doesn't enforce any specific set of
>>>> values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
>>>> found in config file are set to 0.
>>>
>>> This is not pretty either.
>>
>> What exactly: defaulting to zero or not knowing config value's type?
>> Given all the options, defaulting to zero seems like the best way to
>> go.
>>
>>>
>>>> +
>>>> +             switch (*value) {
>>>> +             case 'n':
>>>> +                     *ext_val = 0;
>>>> +                     break;
>>>> +             case 'y':
>>>> +                     *ext_val = 1;
>>>> +                     break;
>>>> +             case 'm':
>>>> +                     *ext_val = 2;
>>>> +                     break;
>>
>> reading some more code from scripts/kconfig/symbol.c, I'll need to
>> handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
>> more feedback on this version.
>>
>>>> +             case '"':
>>>> +                     pr_warn("extern '%s': strings are not supported\n",
>>>> +                             ext->name);
>>>> +                     err = -EINVAL;
>>>> +                     goto out;
>>>> +             default:
>>>> +                     errno = 0;
>>>> +                     *ext_val = strtoull(value, &value_end, 10);
>>>> +                     if (errno) {
>>>> +                             err = -errno;
>>>> +                             pr_warn("extern '%s': failed to parse value: %d\n",
>>>> +                                     ext->name, err);
>>>> +                             goto out;
>>>> +                     }
>>>
>>> BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
>>> .config has as bytes to the program and let program parse n/y/m, strings and
>>> integers.
>>
>> Config value is not changing. This is an incredible waste of CPU
>> resources to re-parse same value over and over again. And it's
>> incredibly much worse usability as well. Again, once we have BTF for
>> externs, we can just declare values as const char[] and then user will
>> be able to do its own parsing. Until then, I think pre-parsing values
>> into convenient u64 types are much better and handles all the typical
>> cases.
> 
> 
> One more thing I didn't realize I didn't state explicitly, because
> I've been thinking and talking about that for so long now, that it
> kind of internalized completely.
> 
> These externs, including CONFIG_XXX ones, are meant to interoperate
> nicely with field relocations within BPF CO-RE concept. They are,
> among other things, are meant to disable parts of BPF program logic
> through verifier's dead code elimination by doing something like:
> 
> 
> if (CONFIG_SOME_FEATURES_ENABLED) {
>      BPF_CORE_READ(t, some_extra_field);
>      /* or */
>      bpf_helper_that_only_present_when_feature_is_enabled();
> } else {
>      /* fallback logic */
> }
> 
> With CONFIG_SOME_FEATURES_ENABLED not being a read-only integer
> constant when BPF program is loaded, this is impossible. So it
> absolutely must be some sort of easy to use integer constant. 

Hmm. what difference do you see between u64 and char[] ?
The const propagation logic in the verifier should work the same way.
If it doesn't it's a bug in the verifier and it's not ok to hack
extern api to workaround the bug.

What you're advocating with libbpf-side of conversion to integers
reminds me of our earlier attempts with cgroup_sysctl hooks where
we started with ints only to realize that in practice it's too
limited. Then bpf_strtol was introduced and api got much cleaner.
Same thing here. Converting char[] into ints or whatever else
is the job of the program. Not of libbpf. The verifier can be taught
to optimize bpf_strtol() into const when const char[] is passed in.

As far as is_enabled() check doing it as 0/1 the way you're proposing
has in-band signaling issues that you admitted in the commit log.
For is_enabled() may be new builtin() on llvm side would be better?
Something like __builtin_preserve_field_info(field, BPF_FIELD_EXISTS)
but can be used on _any_ extern function or variable.
Like __builtin_is_extern_resolved(extern_name);
Then on libbpf side CONFIG_* that are not in config.gz won't be seen
by the program (instead of seen as 0 in your proposal) and the code
will look like:
if (__builtin_is_extern_resolved(CONFIG_NETNS)) {
   ..do things;
} else {
}
The verifier dead code elimination will take care of branches.
The BPF program itself doesn't need to read the value of CONFIG_
it only needs to know whether it was defined.
Such builtin would match semantics better.
If CONFIG_ is tri-state doing
if (*(u8*)CONFIG_FOO == 'y' || *(u8*)CONFIG_FOO == 'm')
is cleaner than *(u64*)CONFIG_FOO == 1 || 2.
and constant propagation in the verifier should work the same way.
Daniel Borkmann Nov. 19, 2019, 11:32 p.m. UTC | #5
On 11/19/19 4:58 PM, Alexei Starovoitov wrote:
> On 11/19/19 7:42 AM, Andrii Nakryiko wrote:
>> On Mon, Nov 18, 2019 at 10:57 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
>>>>> Add support for extern variables, provided to BPF program by libbpf. Currently
>>>>> the following extern variables are supported:
>>>>>     - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
>>>>>       executing, follows KERNEL_VERSION() macro convention;
>>>>>     - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
>>>>>       boolean, and integer values are supported. Strings are not supported at
>>>>>       the moment.
>>>>>
>>>>> All values are represented as 64-bit integers, with the follow value encoding:
>>>>>     - for boolean values, y is 1, n or missing value is 0;
>>>>>     - for tristate values, y is 1, m is 2, n or missing value is 0;
>>>>>     - for integers, the values is 64-bit integer, sign-extended, if negative; if
>>>>>       config value is missing, it's represented as 0, which makes explicit 0 and
>>>>>       missing config value indistinguishable. If this will turn out to be
>>>>>       a problem in practice, we'll need to deal with it somehow.
>>>>
>>>> I read that statement as there is no extensibility for such api.
>>>
>>> What do you mean exactly?
>>>
>>> Are you worried about 0 vs undefined case? I don't think it's going to
>>> be a problem in practice. Looking at my .config, I see that integer
>>> config values set to their default values are still explicitly
>>> specified with those values. E.g.,
>>>
>>> CONFIG_HZ_1000=y
>>> CONFIG_HZ=1000
>>>
>>> CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
>>> set. So while I won't claim that it's the case for any possible
>>> integer config, it seems to be pretty consistent in practice.
>>>
>>> Also, I see a lot of values set to explicit 0, like:
>>>
>>> CONFIG_BASE_SMALL=0
>>>
>>> So it seems like integers are typically spelled out explicitly in real
>>> configs and I think this 0 default is pretty sane.
>>>
>>> Next, speaking about extensibility. Once we have BTF type info for
>>> externs, our possibilities are much better. It will be possible to
>>> support bool, int, in64 for the same bool value. Libbpf will be able
>>> to validate the range and fail program load if declared extern type
>>> doesn't match actual value type and value range. So I think
>>> extensibility is there, but right now we are enforcing (logically)
>>> everything to be uin64_t. Unfortunately, with the way externs are done
>>> in ELF, I don't know neither type nor size, so can't be more strict
>>> than that.
>>>
>>> If we really need to know whether some config value is defined or not,
>>> regardless of its value, we can have it by convention. E.g.,
>>> CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
>>> CONFIG_XXX is defined explicitly or not. But I don't want to add that
>>> until we really have a use case where it matters.
>>>
>>>>> Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
>>>>> expected type (bool, tristate, int), so it doesn't enforce any specific set of
>>>>> values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
>>>>> found in config file are set to 0.
>>>>
>>>> This is not pretty either.
>>>
>>> What exactly: defaulting to zero or not knowing config value's type?
>>> Given all the options, defaulting to zero seems like the best way to
>>> go.
>>>
>>>>> +
>>>>> +             switch (*value) {
>>>>> +             case 'n':
>>>>> +                     *ext_val = 0;
>>>>> +                     break;
>>>>> +             case 'y':
>>>>> +                     *ext_val = 1;
>>>>> +                     break;
>>>>> +             case 'm':
>>>>> +                     *ext_val = 2;
>>>>> +                     break;
>>>
>>> reading some more code from scripts/kconfig/symbol.c, I'll need to
>>> handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
>>> more feedback on this version.
>>>
>>>>> +             case '"':
>>>>> +                     pr_warn("extern '%s': strings are not supported\n",
>>>>> +                             ext->name);
>>>>> +                     err = -EINVAL;
>>>>> +                     goto out;
>>>>> +             default:
>>>>> +                     errno = 0;
>>>>> +                     *ext_val = strtoull(value, &value_end, 10);
>>>>> +                     if (errno) {
>>>>> +                             err = -errno;
>>>>> +                             pr_warn("extern '%s': failed to parse value: %d\n",
>>>>> +                                     ext->name, err);
>>>>> +                             goto out;
>>>>> +                     }
>>>>
>>>> BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
>>>> .config has as bytes to the program and let program parse n/y/m, strings and
>>>> integers.
>>>
>>> Config value is not changing. This is an incredible waste of CPU
>>> resources to re-parse same value over and over again. And it's
>>> incredibly much worse usability as well. Again, once we have BTF for
>>> externs, we can just declare values as const char[] and then user will
>>> be able to do its own parsing. Until then, I think pre-parsing values
>>> into convenient u64 types are much better and handles all the typical
>>> cases.
>>
>> One more thing I didn't realize I didn't state explicitly, because
>> I've been thinking and talking about that for so long now, that it
>> kind of internalized completely.
>>
>> These externs, including CONFIG_XXX ones, are meant to interoperate
>> nicely with field relocations within BPF CO-RE concept. They are,
>> among other things, are meant to disable parts of BPF program logic
>> through verifier's dead code elimination by doing something like:
>>
>> if (CONFIG_SOME_FEATURES_ENABLED) {
>>       BPF_CORE_READ(t, some_extra_field);
b>>       /* or */
>>       bpf_helper_that_only_present_when_feature_is_enabled();
>> } else {
>>       /* fallback logic */
>> }
>>
>> With CONFIG_SOME_FEATURES_ENABLED not being a read-only integer
>> constant when BPF program is loaded, this is impossible. So it
>> absolutely must be some sort of easy to use integer constant.
> 
> Hmm. what difference do you see between u64 and char[] ?
> The const propagation logic in the verifier should work the same way.
> If it doesn't it's a bug in the verifier and it's not ok to hack
> extern api to workaround the bug.
> 
> What you're advocating with libbpf-side of conversion to integers
> reminds me of our earlier attempts with cgroup_sysctl hooks where
> we started with ints only to realize that in practice it's too
> limited. Then bpf_strtol was introduced and api got much cleaner.
> Same thing here. Converting char[] into ints or whatever else
> is the job of the program. Not of libbpf. The verifier can be taught
> to optimize bpf_strtol() into const when const char[] is passed in.
> 
> As far as is_enabled() check doing it as 0/1 the way you're proposing
> has in-band signaling issues that you admitted in the commit log.
> For is_enabled() may be new builtin() on llvm side would be better?
> Something like __builtin_preserve_field_info(field, BPF_FIELD_EXISTS)
> but can be used on _any_ extern function or variable.
> Like __builtin_is_extern_resolved(extern_name);
> Then on libbpf side CONFIG_* that are not in config.gz won't be seen
> by the program (instead of seen as 0 in your proposal) and the code
> will look like:
> if (__builtin_is_extern_resolved(CONFIG_NETNS)) {
>     ..do things;
> } else {
> }
> The verifier dead code elimination will take care of branches.

I sort of like the option of __builtin_is_extern_resolved() better than
plain 0 to provide an option for the developer to explicitly check for
that. But I'd like to take a step back in the discussion on the topic of
bpf_object__init_extern_map(). I'm wondering why it must be part of libbpf
at all to read the kernel config and resolve CONFIG_ / LINUX_KERNEL_VERSION
automatically. This feels a bit too much assumptions and automagic resolving.
Can't the application on top of libbpf pass in a callback where the extern
resolution would be outsourced into application rather than in libbpf?
Reason I'm asking is two-fold: i) this concept feels quite generic and my
take is that this could be applied to many other things as well beyond just
plain kernel config, ii) callback would also allow to experiment first what
would work best in practice wrt kernel config as one specific example, and
in a later step, libbpf could provide this as one built-in callback option
for the user to opt-into if its found to be generic/useful enough.

> The BPF program itself doesn't need to read the value of CONFIG_
> it only needs to know whether it was defined.
> Such builtin would match semantics better.
> If CONFIG_ is tri-state doing
> if (*(u8*)CONFIG_FOO == 'y' || *(u8*)CONFIG_FOO == 'm')

Passing in raw buffer feels more natural, agree, but then, again, if we had
a callback it would be up to the one who's implementing it.

> is cleaner than *(u64*)CONFIG_FOO == 1 || 2.
> and constant propagation in the verifier should work the same way.
Andrii Nakryiko Nov. 20, 2019, 3:17 a.m. UTC | #6
On Tue, Nov 19, 2019 at 3:32 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/19/19 4:58 PM, Alexei Starovoitov wrote:
> > On 11/19/19 7:42 AM, Andrii Nakryiko wrote:
> >> On Mon, Nov 18, 2019 at 10:57 PM Andrii Nakryiko
> >> <andrii.nakryiko@gmail.com> wrote:
> >>> On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
> >>>>> Add support for extern variables, provided to BPF program by libbpf. Currently
> >>>>> the following extern variables are supported:
> >>>>>     - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
> >>>>>       executing, follows KERNEL_VERSION() macro convention;
> >>>>>     - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
> >>>>>       boolean, and integer values are supported. Strings are not supported at
> >>>>>       the moment.
> >>>>>
> >>>>> All values are represented as 64-bit integers, with the follow value encoding:
> >>>>>     - for boolean values, y is 1, n or missing value is 0;
> >>>>>     - for tristate values, y is 1, m is 2, n or missing value is 0;
> >>>>>     - for integers, the values is 64-bit integer, sign-extended, if negative; if
> >>>>>       config value is missing, it's represented as 0, which makes explicit 0 and
> >>>>>       missing config value indistinguishable. If this will turn out to be
> >>>>>       a problem in practice, we'll need to deal with it somehow.
> >>>>
> >>>> I read that statement as there is no extensibility for such api.
> >>>
> >>> What do you mean exactly?
> >>>
> >>> Are you worried about 0 vs undefined case? I don't think it's going to
> >>> be a problem in practice. Looking at my .config, I see that integer
> >>> config values set to their default values are still explicitly
> >>> specified with those values. E.g.,
> >>>
> >>> CONFIG_HZ_1000=y
> >>> CONFIG_HZ=1000
> >>>
> >>> CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
> >>> set. So while I won't claim that it's the case for any possible
> >>> integer config, it seems to be pretty consistent in practice.
> >>>
> >>> Also, I see a lot of values set to explicit 0, like:
> >>>
> >>> CONFIG_BASE_SMALL=0
> >>>
> >>> So it seems like integers are typically spelled out explicitly in real
> >>> configs and I think this 0 default is pretty sane.
> >>>
> >>> Next, speaking about extensibility. Once we have BTF type info for
> >>> externs, our possibilities are much better. It will be possible to
> >>> support bool, int, in64 for the same bool value. Libbpf will be able
> >>> to validate the range and fail program load if declared extern type
> >>> doesn't match actual value type and value range. So I think
> >>> extensibility is there, but right now we are enforcing (logically)
> >>> everything to be uin64_t. Unfortunately, with the way externs are done
> >>> in ELF, I don't know neither type nor size, so can't be more strict
> >>> than that.
> >>>
> >>> If we really need to know whether some config value is defined or not,
> >>> regardless of its value, we can have it by convention. E.g.,
> >>> CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
> >>> CONFIG_XXX is defined explicitly or not. But I don't want to add that
> >>> until we really have a use case where it matters.
> >>>
> >>>>> Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
> >>>>> expected type (bool, tristate, int), so it doesn't enforce any specific set of
> >>>>> values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
> >>>>> found in config file are set to 0.
> >>>>
> >>>> This is not pretty either.
> >>>
> >>> What exactly: defaulting to zero or not knowing config value's type?
> >>> Given all the options, defaulting to zero seems like the best way to
> >>> go.
> >>>
> >>>>> +
> >>>>> +             switch (*value) {
> >>>>> +             case 'n':
> >>>>> +                     *ext_val = 0;
> >>>>> +                     break;
> >>>>> +             case 'y':
> >>>>> +                     *ext_val = 1;
> >>>>> +                     break;
> >>>>> +             case 'm':
> >>>>> +                     *ext_val = 2;
> >>>>> +                     break;
> >>>
> >>> reading some more code from scripts/kconfig/symbol.c, I'll need to
> >>> handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
> >>> more feedback on this version.
> >>>
> >>>>> +             case '"':
> >>>>> +                     pr_warn("extern '%s': strings are not supported\n",
> >>>>> +                             ext->name);
> >>>>> +                     err = -EINVAL;
> >>>>> +                     goto out;
> >>>>> +             default:
> >>>>> +                     errno = 0;
> >>>>> +                     *ext_val = strtoull(value, &value_end, 10);
> >>>>> +                     if (errno) {
> >>>>> +                             err = -errno;
> >>>>> +                             pr_warn("extern '%s': failed to parse value: %d\n",
> >>>>> +                                     ext->name, err);
> >>>>> +                             goto out;
> >>>>> +                     }
> >>>>
> >>>> BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
> >>>> .config has as bytes to the program and let program parse n/y/m, strings and
> >>>> integers.
> >>>
> >>> Config value is not changing. This is an incredible waste of CPU
> >>> resources to re-parse same value over and over again. And it's
> >>> incredibly much worse usability as well. Again, once we have BTF for
> >>> externs, we can just declare values as const char[] and then user will
> >>> be able to do its own parsing. Until then, I think pre-parsing values
> >>> into convenient u64 types are much better and handles all the typical
> >>> cases.
> >>
> >> One more thing I didn't realize I didn't state explicitly, because
> >> I've been thinking and talking about that for so long now, that it
> >> kind of internalized completely.
> >>
> >> These externs, including CONFIG_XXX ones, are meant to interoperate
> >> nicely with field relocations within BPF CO-RE concept. They are,
> >> among other things, are meant to disable parts of BPF program logic
> >> through verifier's dead code elimination by doing something like:
> >>
> >> if (CONFIG_SOME_FEATURES_ENABLED) {
> >>       BPF_CORE_READ(t, some_extra_field);
> b>>       /* or */
> >>       bpf_helper_that_only_present_when_feature_is_enabled();
> >> } else {
> >>       /* fallback logic */
> >> }
> >>
> >> With CONFIG_SOME_FEATURES_ENABLED not being a read-only integer
> >> constant when BPF program is loaded, this is impossible. So it
> >> absolutely must be some sort of easy to use integer constant.
> >
> > Hmm. what difference do you see between u64 and char[] ?
> > The const propagation logic in the verifier should work the same way.
> > If it doesn't it's a bug in the verifier and it's not ok to hack
> > extern api to workaround the bug.
> >
> > What you're advocating with libbpf-side of conversion to integers
> > reminds me of our earlier attempts with cgroup_sysctl hooks where
> > we started with ints only to realize that in practice it's too
> > limited. Then bpf_strtol was introduced and api got much cleaner.
> > Same thing here. Converting char[] into ints or whatever else
> > is the job of the program. Not of libbpf. The verifier can be taught
> > to optimize bpf_strtol() into const when const char[] is passed in.
> >
> > As far as is_enabled() check doing it as 0/1 the way you're proposing
> > has in-band signaling issues that you admitted in the commit log.
> > For is_enabled() may be new builtin() on llvm side would be better?
> > Something like __builtin_preserve_field_info(field, BPF_FIELD_EXISTS)
> > but can be used on _any_ extern function or variable.
> > Like __builtin_is_extern_resolved(extern_name);
> > Then on libbpf side CONFIG_* that are not in config.gz won't be seen
> > by the program (instead of seen as 0 in your proposal) and the code
> > will look like:
> > if (__builtin_is_extern_resolved(CONFIG_NETNS)) {
> >     ..do things;
> > } else {
> > }
> > The verifier dead code elimination will take care of branches.
>
> I sort of like the option of __builtin_is_extern_resolved() better than
> plain 0 to provide an option for the developer to explicitly check for

I think we can actually have both and let user decide which semantics
works better for seem.
If extern is defined as a weak symbol, we'll default to 0 the way that
I proposed initially. If extern is not weak, than it will have to be
guarded with __builtin_extern_resolved(), otherwise read will cause
verifier to reject the read. Does that work?

> that. But I'd like to take a step back in the discussion on the topic of
> bpf_object__init_extern_map(). I'm wondering why it must be part of libbpf
> at all to read the kernel config and resolve CONFIG_ / LINUX_KERNEL_VERSION
> automatically. This feels a bit too much assumptions and automagic resolving.
> Can't the application on top of libbpf pass in a callback where the extern
> resolution would be outsourced into application rather than in libbpf?

What you are describing is already possible today, it's called global
data. Specifically, .rodata would have exactly the same constant
tracking capabilities, yet be completely application-provided. We
definitely need to improve the API to initialize it, though, but it's
separate from this whole extern discussion. I'm going to work on that
soon as well.

The point of externs, though, is to denote something that's not
allocated and populated by BPF program or its userspace control app,
it's something that's provided by some outside "entity": libbpf itself
(e.g., for LINUX_KERNEL_VERSION and kernel config), kernel (for
whatever we are going to expose from kernel eventually), or another
BPF object (as part of static or dynamic linking). Right now kernel-
and other BPF object-provided use cases are not yet developed, but it
fits in this model (even though it will probably have a considerably
different internal implementation). You can think about Kconfig as
being provided by kernel itself, but of course implemented by libbpf.
Think about writing kernel code. You'd expect to have CONFIG_HZ
available to you without any parsing, such that you can just use it in
your expressions. Similarly, #ifdef CONFIG_TASK_IO_ACCOUNTING form is
a norm in kernel code. With these externs, BPF programs will feel even
more like they are extension of kernel, with all the similar
"facilities" at user's disposal.

I don't think it's a good idea to require all application that would
benefit from knowing few CONFIG_XXX values to re-implement their own
config parsing logic. We'll end up with lots of slightly incompatible
implementations. While not complicated, the logic still requires quite
a lot of coding, so we are going to save a bunch of effort for
prospective users of this feature.

On the other hand, kernel config is a pretty well-defined and common
problem, that we can actually solve it nicely and provide a great and
intuitive way to get and use it. It's also extensible, and once we
have BTF types for externs, libbpf will become even more clever,
allowing users to specify whether they want to get value as bool, or
some tristate enum, or integer, or maybe just a raw string
representation. Before we have that, I think defaulting to uniform
int64_t makes a lot of sense and doesn't preclude further improvements
and extensions.

> Reason I'm asking is two-fold: i) this concept feels quite generic and my
> take is that this could be applied to many other things as well beyond just
> plain kernel config, ii) callback would also allow to experiment first what
> would work best in practice wrt kernel config as one specific example, and
> in a later step, libbpf could provide this as one built-in callback option
> for the user to opt-into if its found to be generic/useful enough.
>
> > The BPF program itself doesn't need to read the value of CONFIG_
> > it only needs to know whether it was defined.
> > Such builtin would match semantics better.
> > If CONFIG_ is tri-state doing
> > if (*(u8*)CONFIG_FOO == 'y' || *(u8*)CONFIG_FOO == 'm')
>
> Passing in raw buffer feels more natural, agree, but then, again, if we had
> a callback it would be up to the one who's implementing it.

See above, I think weak vs strong extern would give us the best of
both worlds. Applications not caring to distinguished "undefined" vs
0, would just specify it as a weak symbol. Default strong extern,
though, will enforce that whoever tries to fetch value needs to ensure
that extern was actually found and resolved.

>
> > is cleaner than *(u64*)CONFIG_FOO == 1 || 2.
> > and constant propagation in the verifier should work the same way.
Andrii Nakryiko Nov. 20, 2019, 3:44 a.m. UTC | #7
On Tue, Nov 19, 2019 at 7:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 11/19/19 7:42 AM, Andrii Nakryiko wrote:
> > On Mon, Nov 18, 2019 at 10:57 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >>
> >> On Mon, Nov 18, 2019 at 7:21 PM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Sat, Nov 16, 2019 at 11:08:06PM -0800, Andrii Nakryiko wrote:
> >>>> Add support for extern variables, provided to BPF program by libbpf. Currently
> >>>> the following extern variables are supported:
> >>>>    - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
> >>>>      executing, follows KERNEL_VERSION() macro convention;
> >>>>    - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
> >>>>      boolean, and integer values are supported. Strings are not supported at
> >>>>      the moment.
> >>>>
> >>>> All values are represented as 64-bit integers, with the follow value encoding:
> >>>>    - for boolean values, y is 1, n or missing value is 0;
> >>>>    - for tristate values, y is 1, m is 2, n or missing value is 0;
> >>>>    - for integers, the values is 64-bit integer, sign-extended, if negative; if
> >>>>      config value is missing, it's represented as 0, which makes explicit 0 and
> >>>>      missing config value indistinguishable. If this will turn out to be
> >>>>      a problem in practice, we'll need to deal with it somehow.
> >>>
> >>> I read that statement as there is no extensibility for such api.
> >>
> >> What do you mean exactly?
> >>
> >> Are you worried about 0 vs undefined case? I don't think it's going to
> >> be a problem in practice. Looking at my .config, I see that integer
> >> config values set to their default values are still explicitly
> >> specified with those values. E.g.,
> >>
> >> CONFIG_HZ_1000=y
> >> CONFIG_HZ=1000
> >>
> >> CONFIG_HZ default is 1000, if CONFIG_HZ_1000==y, but still I see it
> >> set. So while I won't claim that it's the case for any possible
> >> integer config, it seems to be pretty consistent in practice.
> >>
> >> Also, I see a lot of values set to explicit 0, like:
> >>
> >> CONFIG_BASE_SMALL=0
> >>
> >> So it seems like integers are typically spelled out explicitly in real
> >> configs and I think this 0 default is pretty sane.
> >>
> >> Next, speaking about extensibility. Once we have BTF type info for
> >> externs, our possibilities are much better. It will be possible to
> >> support bool, int, in64 for the same bool value. Libbpf will be able
> >> to validate the range and fail program load if declared extern type
> >> doesn't match actual value type and value range. So I think
> >> extensibility is there, but right now we are enforcing (logically)
> >> everything to be uin64_t. Unfortunately, with the way externs are done
> >> in ELF, I don't know neither type nor size, so can't be more strict
> >> than that.
> >>
> >> If we really need to know whether some config value is defined or not,
> >> regardless of its value, we can have it by convention. E.g.,
> >> CONFIG_DEFINED_XXX will be either 0 or 1, depending if corresponding
> >> CONFIG_XXX is defined explicitly or not. But I don't want to add that
> >> until we really have a use case where it matters.
> >>
> >>>
> >>>> Generally speaking, libbpf is not aware of which CONFIG_XXX values is of which
> >>>> expected type (bool, tristate, int), so it doesn't enforce any specific set of
> >>>> values and just parses n/y/m as 0/1/2, respectively. CONFIG_XXX values not
> >>>> found in config file are set to 0.
> >>>
> >>> This is not pretty either.
> >>
> >> What exactly: defaulting to zero or not knowing config value's type?
> >> Given all the options, defaulting to zero seems like the best way to
> >> go.
> >>
> >>>
> >>>> +
> >>>> +             switch (*value) {
> >>>> +             case 'n':
> >>>> +                     *ext_val = 0;
> >>>> +                     break;
> >>>> +             case 'y':
> >>>> +                     *ext_val = 1;
> >>>> +                     break;
> >>>> +             case 'm':
> >>>> +                     *ext_val = 2;
> >>>> +                     break;
> >>
> >> reading some more code from scripts/kconfig/symbol.c, I'll need to
> >> handle N/Y/M and 0x hexadecimals, will add in v2 after collecting some
> >> more feedback on this version.
> >>
> >>>> +             case '"':
> >>>> +                     pr_warn("extern '%s': strings are not supported\n",
> >>>> +                             ext->name);
> >>>> +                     err = -EINVAL;
> >>>> +                     goto out;
> >>>> +             default:
> >>>> +                     errno = 0;
> >>>> +                     *ext_val = strtoull(value, &value_end, 10);
> >>>> +                     if (errno) {
> >>>> +                             err = -errno;
> >>>> +                             pr_warn("extern '%s': failed to parse value: %d\n",
> >>>> +                                     ext->name, err);
> >>>> +                             goto out;
> >>>> +                     }
> >>>
> >>> BPF has bpf_strtol() helper. I think it would be cleaner to pass whatever
> >>> .config has as bytes to the program and let program parse n/y/m, strings and
> >>> integers.
> >>
> >> Config value is not changing. This is an incredible waste of CPU
> >> resources to re-parse same value over and over again. And it's
> >> incredibly much worse usability as well. Again, once we have BTF for
> >> externs, we can just declare values as const char[] and then user will
> >> be able to do its own parsing. Until then, I think pre-parsing values
> >> into convenient u64 types are much better and handles all the typical
> >> cases.
> >
> >
> > One more thing I didn't realize I didn't state explicitly, because
> > I've been thinking and talking about that for so long now, that it
> > kind of internalized completely.
> >
> > These externs, including CONFIG_XXX ones, are meant to interoperate
> > nicely with field relocations within BPF CO-RE concept. They are,
> > among other things, are meant to disable parts of BPF program logic
> > through verifier's dead code elimination by doing something like:
> >
> >
> > if (CONFIG_SOME_FEATURES_ENABLED) {
> >      BPF_CORE_READ(t, some_extra_field);
> >      /* or */
> >      bpf_helper_that_only_present_when_feature_is_enabled();
> > } else {
> >      /* fallback logic */
> > }
> >
> > With CONFIG_SOME_FEATURES_ENABLED not being a read-only integer
> > constant when BPF program is loaded, this is impossible. So it
> > absolutely must be some sort of easy to use integer constant.
>
> Hmm. what difference do you see between u64 and char[] ?
> The const propagation logic in the verifier should work the same way.
> If it doesn't it's a bug in the verifier and it's not ok to hack
> extern api to workaround the bug.

For some specific subset of cases (mostly bool and tristate), yes,
verifier should be able to track raw byte value, because there is no
transformation applied to those values. With integers it's certainly
not the case.

Imagine something like:

if (CONFIG_HZ > 1000) {
  ... do something ...
else if (CONFIG_HZ > 100) {
  ... something else ...
} else {
  ... yet another fallback ...
}

With CONFIG_HZ being integer, entire if/else if/else will be
eliminated by verifier. If you do bpf_strtoul(), it's not possible,
unless we do those more further optimizations (implementing const
versions of helpers). The latter would be a good addition, if possible
to implement generically, of course, but I'm not sure we need to block
on that.

>
> What you're advocating with libbpf-side of conversion to integers
> reminds me of our earlier attempts with cgroup_sysctl hooks where
> we started with ints only to realize that in practice it's too
> limited. Then bpf_strtol was introduced and api got much cleaner.
> Same thing here. Converting char[] into ints or whatever else
> is the job of the program. Not of libbpf. The verifier can be taught
> to optimize bpf_strtol() into const when const char[] is passed in.

Given config values are constant and won't change throughout lifetime
of kernel, it's much more practical and easier to parse any
complicated value in userspace and pass necessary well-structured and
easy to use (and thus - performant) data to BPF side through global
data. But if BPF developer knows that CONFIG_HZ has to be integer
(because it's defined in Kconfig as having a type int), then it's
going to be integer, there is no doubt about that.

So while I can imagine some extreme cases where we might need parsing
string, I think most such cases can be painlessly solved in userspace.

Having said that, I don't oppose having an option to expose strings
and allow to work with them, either from userspace or BPF side. It is
extension of what I implemented and can be easily added.

>
> As far as is_enabled() check doing it as 0/1 the way you're proposing
> has in-band signaling issues that you admitted in the commit log.
> For is_enabled() may be new builtin() on llvm side would be better?
> Something like __builtin_preserve_field_info(field, BPF_FIELD_EXISTS)
> but can be used on _any_ extern function or variable.
> Like __builtin_is_extern_resolved(extern_name);
> Then on libbpf side CONFIG_* that are not in config.gz won't be seen
> by the program (instead of seen as 0 in your proposal) and the code
> will look like:
> if (__builtin_is_extern_resolved(CONFIG_NETNS)) {
>    ..do things;
> } else {
> }
> The verifier dead code elimination will take care of branches.
> The BPF program itself doesn't need to read the value of CONFIG_
> it only needs to know whether it was defined.
> Such builtin would match semantics better.

I agree such __builtin is useful and we should add it. But also I
believe that a lot of common cases would be much simpler and nicer if
we have this not defined = 0 logic. But I think we can satisfy both
sides without sacrificing anything. If you define extern as weak:

extern __attribute__((weak)) uint64_t CONFIG_MISSING;

Libbpf will set it to zero, if it's not recognized/found. So check
like below will nicely work:

if (CONFIG_MISSING) {
  .. do something ...
}

If the extern is strong, then the above check will fail, and will have
to be written with using __builtin:

if (__builtin_extern_resolved(CONFIG_MISSING)) {
  .. do something ..
}

This puts control over semantics into users hands. WDYT?


> If CONFIG_ is tri-state doing
> if (*(u8*)CONFIG_FOO == 'y' || *(u8*)CONFIG_FOO == 'm')
> is cleaner than *(u64*)CONFIG_FOO == 1 || 2.
> and constant propagation in the verifier should work the same way.

Once we get BTF info for externs, you'll be able to do just that very cleanly:

extern bool CONFIG_FOO; - true, if =y, false if =n, false, if extern
is weak and undefined in config
extern char CONFIG_FOO; will get 'y'/'n'/'m' values
extern enum tristate CONFIG_FOO - YES/NO/MODULE (or whatever we define
in enum tristate)
extern char CONFIG_FOO[]; - will get raw zero-terminated "y\0", "n\0", or "m\0"

Until we have BTF, though, we can dictate that all those should be
defined uniformly as uin64_t and be handled as in my current patch.
Then, with introduction of BTF, libbpf will do necessary extra
transformations and enforcement of type (e.g., if defined as `extern
bool CONFIG_FOO`, but actual value is m (module) - that will be an
error and enforced by libbpf).

Only if currently users will still use something like bool or int,
instead of uint64_t, that might break later because with BTF for
externs libbpf will suddenly start enforcing more restrictions. But I
think it's just going to be a misuse of current API and shouldn't be
considered a breaking change.

So, to summarize, we proceed with uint64_t for everything, with added
bits of weak vs strong handling. Then in parallel we'll work on adding
BTF for externs and __builtin_extern_resolved (and corresponding new
kind of BTF relocation) and will keep making this whole API even
better, while already having something useful and extensible.

As for strings, I'd prefer to add them in a follow up patch, but if
you guys insist, I can add them anyways. One reservation about strings
I do still have is how better to represent strings:

extern const char *CONFIG_SOME_STRING;   /* pointer to a string
literal, stored in soem other place */
/* or */
extern const char CONFIG_SOME_STRING[];  /* inlined string contents */

The latter can be implemented and used (to some degree) today. But the
former would more seamlessly blend with exposing string-based
variables from kernel. Also,

extern __attribute__((weak)) const char *CONFIG_SOME_STRING;

would more naturally resolve to NULL, if not explicitly defined in
kernel config.

But I'm feeling less strongly about strings overall.

Thoughts?
Alexei Starovoitov Nov. 20, 2019, 9:39 p.m. UTC | #8
On 11/19/19 7:44 PM, Andrii Nakryiko wrote:
> So, to summarize, we proceed with uint64_t for everything, with added
> bits of weak vs strong handling. Then in parallel we'll work on adding
> BTF for externs and __builtin_extern_resolved (and corresponding new
> kind of BTF relocation) and will keep making this whole API even
> better, while already having something useful and extensible.

I didn't know extern can be weak. That was a good find.
Indeed undefined config_* can be represented as uint64 zero value.
But I still have an issue with 'proceed uint64_t for everything',
since strings and tri-state don't fit into uint64.

How about strtol be applied by libbpf only for things that parse 
successfully (like decimal and hex) and everything else will be 
represented raw ?
Like CONFIG=y will be 121.
CONFIG=m will be 109
CONFIG="abc" will be 0x636261
In other words CONFIG_A is an address and
'extern weak uint64_t CONFIG_A' reads raw bytes from that location.
When that CONFIG_A is undefined in /boot/config.gz the u64 read
from that address will return zero.
u8 read from that address will return zero too.
Andrii Nakryiko Nov. 20, 2019, 10:47 p.m. UTC | #9
On Wed, Nov 20, 2019 at 1:39 PM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 11/19/19 7:44 PM, Andrii Nakryiko wrote:
> > So, to summarize, we proceed with uint64_t for everything, with added
> > bits of weak vs strong handling. Then in parallel we'll work on adding
> > BTF for externs and __builtin_extern_resolved (and corresponding new
> > kind of BTF relocation) and will keep making this whole API even
> > better, while already having something useful and extensible.
>
> I didn't know extern can be weak. That was a good find.
> Indeed undefined config_* can be represented as uint64 zero value.
> But I still have an issue with 'proceed uint64_t for everything',
> since strings and tri-state don't fit into uint64.

well, great that we are at least aligned w.r.t. int/hex :)

not clear about bool, though. They are subset of tristate, so should
they be 'y'/'n' (which is certainly weird, isn't it?) raw characters
(with zero byte termination? Or just plain single char?) or C's 0 and
1?

>
> How about strtol be applied by libbpf only for things that parse
> successfully (like decimal and hex) and everything else will be
> represented raw ?
> Like CONFIG=y will be 121.
> CONFIG=m will be 109

Again, conceptually, bool is subset of tristate. So if we go with real
bools, then 0 and 1 should match to tristate's y/n as well.

For tristate we also can have 8-byte enum, if you think that's better
than short 0/1/2 (or whatever, -1, if you don't like 2):

enum kconfig_tristate {
        NO = 0,
        YES = 1,
        MODULE = 2,
        __KCONFIG_TRISTATE_INVALID = 0xFFFFFFFFFFFFFFFF
};

> CONFIG="abc" will be 0x636261
> In other words CONFIG_A is an address and
> 'extern weak uint64_t CONFIG_A' reads raw bytes from that location.
> When that CONFIG_A is undefined in /boot/config.gz the u64 read
> from that address will return zero.
> u8 read from that address will return zero too.

I'm puzzled on why strings should be represented as "extern int64_t"?
What about strings longer than 7 bytes (+1 zero terminator)? IMO,
strings should be strings, i.e., const char []. Or const char *, if we
had a support (see my previous email). Cramming arbitrary strings into
uint64_t seems weird and I fail to see the reason.


So just to summarize how I understand situation right now. Please let
me know if that's not true.

1. We proceed with extern uint64_t for what's detected as integer/hex
(at runtime).
2. Weak/strong externs determine how we handle unresolved config
values. Weak - default to 0, strong - ensure it can't be read. But
this might be complicated by our further choices below.

Things that I'd like to clarify. And I'll list options we've discussed.

1. For bools:
  a) parse them as 0/1, for now use them as uint64_t. Once BTF for
externs available, users will be able to pick bool instead.
  b) parse them as 0/1 (false, true), represent as single byte bool.
  c) leave them as raw single character: 'y' or 'n'
    c1) represent as single byte?
    c2) represent as 8 bytes to match uint64_t?
  d) leave them as a raw string (zero-terminated)
    d1) represent as fixed 8-byte string?..
    d2) represent as variable-sized raw string?

2. For tristate, pretty much the same set of questions, except there
is one more state 'm'. One extra option: represent as 8-byte enum,
maybe? We can do 4-byte enum as well, but that has all the downsides
of variable sized externs without yet having BTF type info.

3. For strings:
  a) don't add them yet, if string is detected in runtime, return
error (my original proposal);
  b) add them as up to 7 characters fixed strings you proposed? What
do we do with longer strings?
  c) add them as variable-sized zero-terminated strings (libbpf will
determine size at runtime). See below for consequences.

If we pick any combination where fields are not uniform (so, if we add
strings right now, or have bool as 1 byte, or whatever), we'll have to
specify what size value we fall back to if weakly-defined extern is
not resolved. Is it going to be zero uint64_t? This is wasteful for
what's expected to be a string (and dirty). On the other hand,
specifying it as empty single-byte string is not going to work for
CONFIG_XXX that is undefined integer. Generally speaking, having
variable sized representation is very confusing until we have expected
type information or at least expected byte size.

Given all this, I think realistically we can pick few combinations:

1. Only support int/hex as uint64_t. Anything y/n/m or a string will
fail in runtime.
Pros:
  - we get at least something to use in practice (LINUX_KERNEL_VERSION
and int CONFIG_XXXs).
Cons:
  - undefined weak extern will have to be assumed either uint64_t 0
(and succeed) or undefined string (and fail). No clearly best behavior
here.
  - no ability to do "feature detection" logic in BPF program.
2. Stick to uint64_t for bool/tristate/int/hex. Don't support strings
yet. Improve in backwards compatible way once we get BTF type info for
externs (adding strings at that time).
Pros:
  - well-defined and logical behavior
  - easily extensible once we get BTF for externs. No new flags, no
changes in behavior.
Cons:
  - no strings yet
  - Alexei doesn't like y/n/m as 0/1/2
3. Mix of the options for bool/tristate/string I layed out above.
Pros:
  - we can get reasonable strings (or up to 7-byte ones, but not sure why);
  - bool/tristate can be accessed, though, IMO, in an ugly
character-based fashion.
Cons:
  - some arbitrary choices need to be made, that couldn't be changed
even when we have BTF for externs.
  - needs further discussion and specification to narrow down all the downsides.

My preferences would be 2, if not, then 1.
Alexei Starovoitov Nov. 21, 2019, 12:18 a.m. UTC | #10
On Wed, Nov 20, 2019 at 02:47:58PM -0800, Andrii Nakryiko wrote:
> 
> Given all this, I think realistically we can pick few combinations:
> 
> 1. Only support int/hex as uint64_t. Anything y/n/m or a string will
> fail in runtime.
> Pros:
>   - we get at least something to use in practice (LINUX_KERNEL_VERSION
> and int CONFIG_XXXs).
> Cons:
>   - undefined weak extern will have to be assumed either uint64_t 0
> (and succeed) or undefined string (and fail). No clearly best behavior
> here.

what that uint64 going to be when config_ is defined?
If your answer is 1 than it's not extensible.

>   - no ability to do "feature detection" logic in BPF program.
> 2. Stick to uint64_t for bool/tristate/int/hex. Don't support strings
> yet. Improve in backwards compatible way once we get BTF type info for
> externs (adding strings at that time).
> Pros:
>   - well-defined and logical behavior
>   - easily extensible once we get BTF for externs. No new flags, no
> changes in behavior.

extensible with new flag == not extensible.
The choices for bpf program that we pick for extern must keep working
when llvm starts generating BTF for externs.

> My preferences would be 2, if not, then 1.

I'm proposing something else.
I see libbpf as a tool to pass /boot/config.gz into bpf program. From program
pov CONFIG_FOO is a label. It's not a variable of type u8 or type u64.
Example:
CONFIG_A=100
CONFIG_B=y
CONFIG_C="abcd"
will be a map of one element with value:
char ar[]= { 0x64, 0, 0, 0, 'y', 'a', 'b', 'c', 'd', 0};

CONFIG_A = &ar[0];
CONFIG_B = &ar[4];
CONFIG_C = &ar[5];

libbpf parses config.gz and converts all int/hex into 4 byte or 8 byte
integers depending on number of text digits it sees in config.gz
with alignment. All other strings and characters are passed as-is.
If program says
extern u8 CONFIG_A, CONFIG_B, CONFIG_C;
It will read 1st byte from these three labels.
Later when llvm emits BTF those u8 swill stay as-is and will read the same
things. With BTF if program says 'extern _Bool CONFIG_B;' then it will be an
explicit direction for libbpf to convert that CONFIG_B value into _Bool at
program load time and represent it as sizeof(_Bool) in map element. If program
says 'extern uint32_t CONFIG_C;' the libbpf will keep first 4 bytes of that
string in there. If program says 'extern uint64_t CONFIG_C;' the libbpf will
keep first 4 character plus one byte for zero plus 3 bytes of padding in map
element.
'extern char CONFIG_C[5];' is also ok. Either with or without BTF.
In both cases it will be 5 bytes in map element.
Without BTF doing 'extern char *CONFIG_C;' will read garbage 8 bytes from that
label. With BTF 'extern char *CONFIG_C;' will be converted to pointer to inner
bytes of map element.
In other words BTF types will be directives of how libbpf should convert
strings in text file to be consumed by bpf program. Since we don't have types
now int/hex are the only ones that libbpf will convert unconditionally. The
logic of parsing config.gz is generic. It can parse any file with 'var=value'
lines this way. All names will be preserved.
In that sense LINUX_KERNEL_VERSION is a text file that has one line:
LINUX_KERNEL_VESRION=1234
If digits fit into u32 it should be u32.
This way users can feed any configuration.txt file into libbpf and into their
programs. Without BTF the map element is a collection of raw bytes and the
program needs to be smart about reading them with correct sizes. With BTF
libbpf will be converting .txt into requested types and failing to load
if libbpf cannot convert string from text into requested type.
Andrii Nakryiko Nov. 21, 2019, 2:13 a.m. UTC | #11
On Wed, Nov 20, 2019 at 4:18 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 02:47:58PM -0800, Andrii Nakryiko wrote:
> >
> > Given all this, I think realistically we can pick few combinations:
> >
> > 1. Only support int/hex as uint64_t. Anything y/n/m or a string will
> > fail in runtime.
> > Pros:
> >   - we get at least something to use in practice (LINUX_KERNEL_VERSION
> > and int CONFIG_XXXs).
> > Cons:
> >   - undefined weak extern will have to be assumed either uint64_t 0
> > (and succeed) or undefined string (and fail). No clearly best behavior
> > here.
>
> what that uint64 going to be when config_ is defined?
> If your answer is 1 than it's not extensible.

Let's say we have

extern __attribute__((weak)) uint64_t CONFIG_VALUE;

Now let's see how different possible configs will behave:

.config                         uint64_t value
CONFIG_VALUE=123            =>  123
CONFIG_VALUE=y              =>  1
CONFIG_VALUE="abc"          =>  libbpf error (assuming no string support)
# CONFIG_VALUE not defined  =>  0

>
> >   - no ability to do "feature detection" logic in BPF program.
> > 2. Stick to uint64_t for bool/tristate/int/hex. Don't support strings
> > yet. Improve in backwards compatible way once we get BTF type info for
> > externs (adding strings at that time).
> > Pros:
> >   - well-defined and logical behavior
> >   - easily extensible once we get BTF for externs. No new flags, no
> > changes in behavior.
>
> extensible with new flag == not extensible.

please re-read two lines you just quoted: **"No new flags, no changes
in behavior"**.
So extensible by your definition.

> The choices for bpf program that we pick for extern must keep working
> when llvm starts generating BTF for externs.

Agree and that's what I'm claiming: they are going to work.

>
> > My preferences would be 2, if not, then 1.
>
> I'm proposing something else.
> I see libbpf as a tool to pass /boot/config.gz into bpf program. From program
> pov CONFIG_FOO is a label. It's not a variable of type u8 or type u64.

This is not true for usage of CONFIG_XXX values in kernel code. Why
would we have a completely different semantics for BPF programs? When
kernel code does:

int sec = jiffies * CONFIG_HZ;

CONFIG_HZ is not just a label pointing to raw string representation of
100 or 1000, it **IS** a number 1000. You cannot multiply strings in
C, yet you can multiply CONFIG_HZ.

> Example:
> CONFIG_A=100
> CONFIG_B=y
> CONFIG_C="abcd"
> will be a map of one element with value:
> char ar[]= { 0x64, 0, 0, 0, 'y', 'a', 'b', 'c', 'd', 0};
>
> CONFIG_A = &ar[0];
> CONFIG_B = &ar[4];
> CONFIG_C = &ar[5];
>
> libbpf parses config.gz and converts all int/hex into 4 byte or 8 byte
> integers depending on number of text digits it sees in config.gz
> with alignment. All other strings and characters are passed as-is.
> If program says
> extern u8 CONFIG_A, CONFIG_B, CONFIG_C;
> It will read 1st byte from these three labels.
> Later when llvm emits BTF those u8 swill stay as-is and will read the same
> things. With BTF if program says 'extern _Bool CONFIG_B;' then it will be an
> explicit direction for libbpf to convert that CONFIG_B value into _Bool at
> program load time and represent it as sizeof(_Bool) in map element. If program
> says 'extern uint32_t CONFIG_C;' the libbpf will keep first 4 bytes of that
> string in there. If program says 'extern uint64_t CONFIG_C;' the libbpf will
> keep first 4 character plus one byte for zero plus 3 bytes of padding in map
> element.

This set of rules is not consistent, if I interpret what you are
saying correctly.

.config            extern definition              w/o BTF             w/ BTF

CONFIG_VALUE=y     extern bool CONFIG_VALUE;      {'y'}
true             -- change in behavior
CONFIG_VALUE=100   extern uint32_t CONFIG_VALUE;  {0x64, 0, 0, 0}
{0x64, 0, 0, 0}  -- no change, because we parse ints
CONFIG_VALUE="abc" extern uint32_t CONFIG_VALUE;  {'a', 'b', 'c', 0}
??? still {'a', 'b', 'c', 0}? or error?

So for bools we'll start interpreting 'y' as 1 (true), but only when
we get BTF. Even though we can clearly parse y/n today as 1 (true) or
0 (false).

For integers, nothing changes and uint32_t means "I want parsed
integer and at most 4 lower bytes on little-endian or 4 upper bytes on
big-endian". Though we might want to reject it with BTF info, if it
doesn't fit into 32 bits (because now we can?).

For strings, uint32_t suddenly means "give me first 4 raw bytes". I'm
not even clear what we will do with BTF in that case.

This logic also breaks when user declares "extern uint64_t", but real
value fits in just 4 bytes. According to your rules, libbpf will
allocate just 4 bytes, and user will read extra 4 bytes of garbage. In
either low or high 4 bytes, depending on machine endianness.

This strikes me as unnecessarily convoluted behavior.

> 'extern char CONFIG_C[5];' is also ok. Either with or without BTF.
> In both cases it will be 5 bytes in map element.
> Without BTF doing 'extern char *CONFIG_C;' will read garbage 8 bytes from that
> label. With BTF 'extern char *CONFIG_C;' will be converted to pointer to inner
> bytes of map element.

Again, I don't even know how many bytes user expected, so when it's
defined - I don't know how many zeros I should substitute. Or if user
define uint64_T, but value fits just 4 bytes, how do I know user
didn't expect all 8. For strings, it's even more interesting. "a" vs
"abracadabra" will require different amount of storage, so if user
will try to read it as extern uint32_t, he will either read 'a' +
garbage, or 'abra'.

I really struggle to see how what I'm proposing is not more consistent
and logical behavior. I'm saying that before we have BTF, all uses
should be `extern uint64_t CONFIG_BLA`. There is nothing preventing
user to do `extern int CONFIG_BLA`, but we don't have abilities to
enforce that, so we are saying "don't do it because it's not
supported, it might or might not work, with random probability". Now,
assuming people follow documentation and examples, we know that it's
always 8 bytes that people are expecting. So all integers are parsed
and allocated as uint64_t, which makes it consistent for big- and
litte-endian machines. For bools, it's just natural to interpret them
as 0 and 1, I fail to see why this is controversial. y/n is just
Kconfig text representation **in text file** of true/false (bool
type). By extension, y/n/m can be mapped into 0/1/2. Sure 2 is sort of
arbitrary, but it follows y, it keeps bool a strict subset of
tristate. For strings I'd hold off until we have BTF, because then
declaring `extern const char CONFIG_BLA[]` is a clear indication you
expect string. While `extern uint32_t CONFIG_BLA;` is a clear
indication that you expect integer, so if the config value is not
parsable to integer -- that's an error. No need to arbitrarily map
first 4 character and pretend they are integer. If you want that
behavior, just declare `extern u8 CONFIG_BLA[]`.

Now, everything is currently defined as uint64_t. BTF info for extern
arrives. People go fancy and say: "screw it, I don't want to waste 8
bytes for bool". They go and change it to "extern bool CONFIG_BLA;".
Good, libbpf sees BTF, validates it's only y or n in config. If it's m
- fail, if it's 1, 0, 123, "a", "abc", "y" - fail. It's not a bool.
Just strictly better behavior.

Once we have BTF, bam, strings just works, because libbpf knows that
user expects 'const char[]'. All this type information is available to
tooling at that point as well, so BPF object auto-generated skeleton
can take all that into account, because it had guarantees that libbpf
will always allocate that amount of bytes for primitive types.

Before we have BTF type info, it's either 8 bytes for supported types,
or pain for users. I don't know why I'd pick the latter.

> In other words BTF types will be directives of how libbpf should convert
> strings in text file to be consumed by bpf program. Since we don't have types
> now int/hex are the only ones that libbpf will convert unconditionally. The
> logic of parsing config.gz is generic. It can parse any file with 'var=value'
> lines this way. All names will be preserved.
> In that sense LINUX_KERNEL_VERSION is a text file that has one line:
> LINUX_KERNEL_VESRION=1234
> If digits fit into u32 it should be u32.

See above. Libbpf doesn't know that user expects u32, so choosing 4 vs
8 bytes at runtime is, while super easy for libbpf, would be super
frustrating for users. We should just say no until we have types.

> This way users can feed any configuration.txt file into libbpf and into their
> programs. Without BTF the map element is a collection of raw bytes and the
> program needs to be smart about reading them with correct sizes. With BTF
> libbpf will be converting .txt into requested types and failing to load
> if libbpf cannot convert string from text into requested type.

This was never intended as a generic application-provided config
parser, I'm sorry if I made that impression. This is specifically
driven towards kernel config, taking into account kernel config
semantics with its explicitly supported types of values and how it's
exposed to kernel code. This is intended to provide an illusion to BPF
program of having as close environment, as we can get, to just
programming inside the kernel. Example 'var=value' will never happen
with Kconfig. It's invalid config and libbpf should reject that.

If application needs to pass custom config, application will have to
parse it on its own and pass it as global data: either mutable or
read-only. If we deem this necessary, we can provide custom libbpf
helpers to automate that, but I'd refrain from that. There are way too
many different ways to define application configs to reasonably
support anything generic. It's just not the problem that libbpf should
try to solve. Libbpf should make using of global data simpler and more
user-friendly.
Alexei Starovoitov Nov. 21, 2019, 4:39 a.m. UTC | #12
On 11/20/19 6:13 PM, Andrii Nakryiko wrote:
> We should just say no until we have types.

Ok. I'm convinced. None of the proposals are clean enough
until we have BTF with externs.
diff mbox series

Patch

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 99425d0be6ff..28020b23c4d2 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -56,8 +56,8 @@  ifndef VERBOSE
 endif
 
 FEATURE_USER = .libbpf
-FEATURE_TESTS = libelf libelf-mmap bpf reallocarray
-FEATURE_DISPLAY = libelf bpf
+FEATURE_TESTS = libelf libelf-mmap zlib bpf reallocarray
+FEATURE_DISPLAY = libelf zlib bpf
 
 INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
 FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES)
@@ -159,7 +159,7 @@  all: fixdep
 
 all_cmd: $(CMD_TARGETS) check
 
-$(BPF_IN_SHARED): force elfdep bpfdep bpf_helper_defs.h
+$(BPF_IN_SHARED): force elfdep zdep bpfdep bpf_helper_defs.h
 	@(test -f ../../include/uapi/linux/bpf.h -a -f ../../../include/uapi/linux/bpf.h && ( \
 	(diff -B ../../include/uapi/linux/bpf.h ../../../include/uapi/linux/bpf.h >/dev/null) || \
 	echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'" >&2 )) || true
@@ -177,7 +177,7 @@  $(BPF_IN_SHARED): force elfdep bpfdep bpf_helper_defs.h
 	echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/if_xdp.h' differs from latest version at 'include/uapi/linux/if_xdp.h'" >&2 )) || true
 	$(Q)$(MAKE) $(build)=libbpf OUTPUT=$(SHARED_OBJDIR) CFLAGS="$(CFLAGS) $(SHLIB_FLAGS)"
 
-$(BPF_IN_STATIC): force elfdep bpfdep bpf_helper_defs.h
+$(BPF_IN_STATIC): force elfdep zdep bpfdep bpf_helper_defs.h
 	$(Q)$(MAKE) $(build)=libbpf OUTPUT=$(STATIC_OBJDIR)
 
 bpf_helper_defs.h: $(srctree)/include/uapi/linux/bpf.h
@@ -189,7 +189,7 @@  $(OUTPUT)libbpf.so: $(OUTPUT)libbpf.so.$(LIBBPF_VERSION)
 $(OUTPUT)libbpf.so.$(LIBBPF_VERSION): $(BPF_IN_SHARED)
 	$(QUIET_LINK)$(CC) $(LDFLAGS) \
 		--shared -Wl,-soname,libbpf.so.$(LIBBPF_MAJOR_VERSION) \
-		-Wl,--version-script=$(VERSION_SCRIPT) $^ -lelf -o $@
+		-Wl,--version-script=$(VERSION_SCRIPT) $^ -lelf -lz -o $@
 	@ln -sf $(@F) $(OUTPUT)libbpf.so
 	@ln -sf $(@F) $(OUTPUT)libbpf.so.$(LIBBPF_MAJOR_VERSION)
 
@@ -197,7 +197,7 @@  $(OUTPUT)libbpf.a: $(BPF_IN_STATIC)
 	$(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^
 
 $(OUTPUT)test_libbpf: test_libbpf.c $(OUTPUT)libbpf.a
-	$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $(INCLUDES) $^ -lelf -o $@
+	$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $(INCLUDES) $^ -lelf -lz -o $@
 
 $(OUTPUT)libbpf.pc:
 	$(QUIET_GEN)sed -e "s|@PREFIX@|$(prefix)|" \
@@ -280,12 +280,15 @@  clean:
 
 
 
-PHONY += force elfdep bpfdep cscope tags
+PHONY += force elfdep zdep bpfdep cscope tags
 force:
 
 elfdep:
 	@if [ "$(feature-libelf)" != "1" ]; then echo "No libelf found"; exit 1 ; fi
 
+zdep:
+	@if [ "$(feature-zlib)" != "1" ]; then echo "No zlib found"; exit 1 ; fi
+
 bpfdep:
 	@if [ "$(feature-bpf)" != "1" ]; then echo "BPF API too old"; exit 1 ; fi
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 639c2e17df0b..d9b70483a2c7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -44,6 +44,7 @@ 
 #include <tools/libc_compat.h>
 #include <libelf.h>
 #include <gelf.h>
+#include <zlib.h>
 
 #include "libbpf.h"
 #include "bpf.h"
@@ -146,6 +147,23 @@  struct bpf_capabilities {
 	__u32 array_mmap:1;
 };
 
+enum reloc_type {
+	RELO_LD64,
+	RELO_CALL,
+	RELO_DATA,
+	RELO_EXTERN,
+};
+
+struct reloc_desc {
+	enum reloc_type type;
+	int insn_idx;
+	union {
+		int map_idx;
+		int text_off;
+		int data_off;
+	};
+};
+
 /*
  * bpf_prog should be a better name but it has been used in
  * linux/filter.h.
@@ -164,18 +182,7 @@  struct bpf_program {
 	size_t insns_cnt, main_prog_cnt;
 	enum bpf_prog_type type;
 
-	struct reloc_desc {
-		enum {
-			RELO_LD64,
-			RELO_CALL,
-			RELO_DATA,
-		} type;
-		int insn_idx;
-		union {
-			int map_idx;
-			int text_off;
-		};
-	} *reloc_desc;
+	struct reloc_desc *reloc_desc;
 	int nr_reloc;
 	int log_level;
 
@@ -209,12 +216,14 @@  enum libbpf_map_type {
 	LIBBPF_MAP_DATA,
 	LIBBPF_MAP_BSS,
 	LIBBPF_MAP_RODATA,
+	LIBBPF_MAP_EXTERN,
 };
 
 static const char * const libbpf_type_to_btf_name[] = {
 	[LIBBPF_MAP_DATA]	= ".data",
 	[LIBBPF_MAP_BSS]	= ".bss",
 	[LIBBPF_MAP_RODATA]	= ".rodata",
+	[LIBBPF_MAP_EXTERN]	= ".extern",
 };
 
 struct bpf_map {
@@ -238,6 +247,13 @@  struct bpf_map {
 struct bpf_secdata {
 	void *rodata;
 	void *data;
+	void *extern_data;
+};
+
+struct extern_desc {
+	const char *name;
+	__u32 data_off;
+	int sym_idx;
 };
 
 static LIST_HEAD(bpf_objects_list);
@@ -254,6 +270,10 @@  struct bpf_object {
 	size_t maps_cap;
 	struct bpf_secdata sections;
 
+	struct extern_desc *externs;
+	int nr_extern;
+	int extern_map_idx;
+
 	bool loaded;
 	bool has_pseudo_calls;
 	bool relaxed_core_relocs;
@@ -281,6 +301,7 @@  struct bpf_object {
 		int maps_shndx;
 		int btf_maps_shndx;
 		int text_shndx;
+		int symbols_shndx;
 		int data_shndx;
 		int rodata_shndx;
 		int bss_shndx;
@@ -839,7 +860,8 @@  static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
 
 static int
 bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
-			      int sec_idx, Elf_Data *data, void **data_buff)
+			      int sec_idx, void *data, size_t data_sz,
+			      void **data_copy)
 {
 	char map_name[BPF_OBJ_NAME_LEN];
 	struct bpf_map_def *def;
@@ -859,27 +881,30 @@  bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 		pr_warn("failed to alloc map name\n");
 		return -ENOMEM;
 	}
+	pr_debug("map '%s' (%s): at sec_idx %d, offset %zu.\n", map_name,
+		 libbpf_type_to_btf_name[type], map->sec_idx, map->sec_offset);
 
 	def = &map->def;
 	def->type = BPF_MAP_TYPE_ARRAY;
 	def->key_size = sizeof(int);
-	def->value_size = data->d_size;
+	def->value_size = data_sz;
 	def->max_entries = 1;
-	def->map_flags = type == LIBBPF_MAP_RODATA ? BPF_F_RDONLY_PROG : 0;
+	def->map_flags = type == LIBBPF_MAP_RODATA || type == LIBBPF_MAP_EXTERN
+			 ? BPF_F_RDONLY_PROG : 0;
 	if (obj->caps.array_mmap)
 		def->map_flags |= BPF_F_MMAPABLE;
 
 	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
 		 map_name, map->sec_idx, map->sec_offset, def->map_flags);
 
-	if (data_buff) {
-		*data_buff = malloc(data->d_size);
-		if (!*data_buff) {
+	if (data_copy) {
+		*data_copy = malloc(data_sz);
+		if (!*data_copy) {
 			zfree(&map->name);
 			pr_warn("failed to alloc map content buffer\n");
 			return -ENOMEM;
 		}
-		memcpy(*data_buff, data->d_buf, data->d_size);
+		memcpy(*data_copy, data, data_sz);
 	}
 
 	pr_debug("map %td is \"%s\"\n", map - obj->maps, map->name);
@@ -898,7 +923,8 @@  static int bpf_object__init_global_data_maps(struct bpf_object *obj)
 	if (obj->efile.data_shndx >= 0) {
 		err = bpf_object__init_internal_map(obj, LIBBPF_MAP_DATA,
 						    obj->efile.data_shndx,
-						    obj->efile.data,
+						    obj->efile.data->d_buf,
+						    obj->efile.data->d_size,
 						    &obj->sections.data);
 		if (err)
 			return err;
@@ -906,7 +932,8 @@  static int bpf_object__init_global_data_maps(struct bpf_object *obj)
 	if (obj->efile.rodata_shndx >= 0) {
 		err = bpf_object__init_internal_map(obj, LIBBPF_MAP_RODATA,
 						    obj->efile.rodata_shndx,
-						    obj->efile.rodata,
+						    obj->efile.rodata->d_buf,
+						    obj->efile.rodata->d_size,
 						    &obj->sections.rodata);
 		if (err)
 			return err;
@@ -914,13 +941,173 @@  static int bpf_object__init_global_data_maps(struct bpf_object *obj)
 	if (obj->efile.bss_shndx >= 0) {
 		err = bpf_object__init_internal_map(obj, LIBBPF_MAP_BSS,
 						    obj->efile.bss_shndx,
-						    obj->efile.bss, NULL);
+						    obj->efile.bss->d_buf,
+						    obj->efile.bss->d_size,
+						    NULL);
 		if (err)
 			return err;
 	}
 	return 0;
 }
 
+
+static int find_extern_by_name(const void *name, const void *_ext)
+{
+	const struct extern_desc *ext = _ext;
+
+	return strcmp(name, ext->name);
+}
+
+static int bpf_object__read_kernel_config(struct bpf_object *obj,
+					  const char *config_path)
+{
+	char buf[PATH_MAX], *sep, *value, *value_end;
+	struct extern_desc *ext;
+	int len, err = 0;
+	__u64 *ext_val;
+	gzFile file;
+
+	if (config_path) {
+		file = gzopen(config_path, "r");
+	} else {
+		struct utsname uts;
+
+		uname(&uts);
+		len = snprintf(buf, PATH_MAX, "/boot/config-%s", uts.release);
+		if (len < 0)
+			return -EINVAL;
+		else if (len >= PATH_MAX)
+			return -ENAMETOOLONG;
+		/* gzopen also accepts uncompressed files. */
+		file = gzopen(buf, "r");
+		if (!file)
+			file = gzopen("/proc/config.gz", "r");
+	}
+	if (!file) {
+		pr_warn("failed to read kernel config at '%s'\n", config_path);
+		return -ENOENT;
+	}
+
+	while (gzgets(file, buf, sizeof(buf))) {
+		if (strncmp(buf, "CONFIG_", 7))
+			continue;
+
+		sep = strchr(buf, '=');
+		if (!sep) {
+			err = -EINVAL;
+			pr_warn("failed to parse '%s': no separator\n", buf);
+			goto out;
+		}
+		/* Trim ending '\n' */
+		buf[strlen(buf) - 1] = '\0';
+		/* Split on '=' and ensure that a value is present. */
+		*sep = '\0';
+		if (!sep[1]) {
+			err = -EINVAL;
+			*sep = '=';
+			pr_warn("failed to parse '%s': no value\n", buf);
+			goto out;
+		}
+
+		ext = bsearch(buf, obj->externs, obj->nr_extern, sizeof(*ext),
+			      find_extern_by_name);
+		if (!ext)
+			continue;
+
+		ext_val = (__u64 *)(obj->sections.extern_data + ext->data_off);
+		value = sep + 1;
+
+		switch (*value) {
+		case 'n':
+			*ext_val = 0;
+			break;
+		case 'y':
+			*ext_val = 1;
+			break;
+		case 'm':
+			*ext_val = 2;
+			break;
+		case '"':
+			pr_warn("extern '%s': strings are not supported\n",
+				ext->name);
+			err = -EINVAL;
+			goto out;
+		default:
+			errno = 0;
+			*ext_val = strtoull(value, &value_end, 10);
+			if (errno) {
+				err = -errno;
+				pr_warn("extern '%s': failed to parse value: %d\n",
+					ext->name, err);
+				goto out;
+			}
+			if (*value_end && *value_end != '\n') {
+				err = -EINVAL;
+				pr_warn("extern '%s': failed to parse value\n",
+					ext->name);
+				goto out;
+			}
+			break;
+		}
+		pr_debug("extern '%s' = %llu\n", ext->name, *ext_val);
+	}
+
+out:
+	gzclose(file);
+	return err;
+}
+
+static int bpf_object__init_extern_map(struct bpf_object *obj,
+				       const char *config_path)
+{
+	bool need_config = false;
+	struct extern_desc *ext;
+	__u64 *ext_val;
+	int err, i;
+
+	if (obj->nr_extern == 0)
+		return 0;
+	if (!obj->caps.global_data)
+		return -ENOTSUP;
+
+	obj->sections.extern_data = calloc(obj->nr_extern, sizeof(__u64));
+	if (!obj->sections.extern_data)
+		return -ENOMEM;
+
+	for (i = 0; i < obj->nr_extern; i++) {
+		ext = &obj->externs[i];
+		ext_val = (__u64 *)(obj->sections.extern_data + ext->data_off);
+
+		if (strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) {
+			*ext_val = get_kernel_version();
+			if (*ext_val == 0) {
+				pr_warn("failed to get kernel version\n");
+				return -EINVAL;
+			}
+			pr_debug("extern '%s' = 0x%llx\n", ext->name, *ext_val);
+		} else if (strncmp(ext->name, "CONFIG_", 7) == 0) {
+			need_config = true;
+		} else {
+			pr_warn("unrecognized extern: '%s'\n", ext->name);
+			return -EINVAL;
+		}
+	}
+	if (need_config) {
+		err = bpf_object__read_kernel_config(obj, config_path);
+		if (err)
+			return -EINVAL;
+	}
+
+	err = bpf_object__init_internal_map(obj, LIBBPF_MAP_EXTERN,
+					    obj->efile.symbols_shndx,
+					    obj->sections.extern_data,
+					    obj->nr_extern * sizeof(__u64),
+					    NULL);
+	if (!err)
+		obj->extern_map_idx = obj->nr_maps - 1;
+	return err;
+}
+
 static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
 {
 	Elf_Data *symbols = obj->efile.symbols;
@@ -1395,12 +1582,17 @@  static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict,
 	return 0;
 }
 
-static int bpf_object__init_maps(struct bpf_object *obj, bool relaxed_maps,
-				 const char *pin_root_path)
+static int bpf_object__init_maps(struct bpf_object *obj,
+				 const struct bpf_object_open_opts *opts)
 {
-	bool strict = !relaxed_maps;
+	const char *pin_root_path, *config_path;
+	bool strict;
 	int err;
 
+	strict = !OPTS_GET(opts, relaxed_maps, false);
+	pin_root_path = OPTS_GET(opts, pin_root_path, NULL);
+	config_path = OPTS_GET(opts, kernel_config_path, NULL);
+
 	err = bpf_object__init_user_maps(obj, strict);
 	if (err)
 		return err;
@@ -1413,6 +1605,10 @@  static int bpf_object__init_maps(struct bpf_object *obj, bool relaxed_maps,
 	if (err)
 		return err;
 
+	err = bpf_object__init_extern_map(obj, config_path);
+	if (err)
+		return err;
+
 	if (obj->nr_maps) {
 		qsort(obj->maps, obj->nr_maps, sizeof(obj->maps[0]),
 		      compare_bpf_map);
@@ -1594,8 +1790,7 @@  static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj)
 	return 0;
 }
 
-static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
-				   const char *pin_root_path)
+static int bpf_object__elf_collect(struct bpf_object *obj)
 {
 	Elf *elf = obj->efile.elf;
 	GElf_Ehdr *ep = &obj->efile.ehdr;
@@ -1667,6 +1862,7 @@  static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 				return -LIBBPF_ERRNO__FORMAT;
 			}
 			obj->efile.symbols = data;
+			obj->efile.symbols_shndx = idx;
 			obj->efile.strtabidx = sh.sh_link;
 		} else if (sh.sh_type == SHT_PROGBITS && data->d_size > 0) {
 			if (sh.sh_flags & SHF_EXECINSTR) {
@@ -1730,14 +1926,81 @@  static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 		pr_warn("Corrupted ELF file: index of strtab invalid\n");
 		return -LIBBPF_ERRNO__FORMAT;
 	}
-	err = bpf_object__init_btf(obj, btf_data, btf_ext_data);
-	if (!err)
-		err = bpf_object__init_maps(obj, relaxed_maps, pin_root_path);
-	if (!err)
-		err = bpf_object__sanitize_and_load_btf(obj);
-	if (!err)
-		err = bpf_object__init_prog_names(obj);
-	return err;
+	return bpf_object__init_btf(obj, btf_data, btf_ext_data);
+}
+
+static bool sym_is_extern(const GElf_Sym *sym)
+{
+	/* externs are symbols with type=NOTYPE, bind=GLOBAL, section=UND */
+	return sym->st_shndx == SHN_UNDEF &&
+	       GELF_ST_BIND(sym->st_info) == STB_GLOBAL &&
+	       GELF_ST_TYPE(sym->st_info) == STT_NOTYPE;
+}
+
+static int cmp_extern_by_name(const void *_a, const void *_b)
+{
+	const struct extern_desc *a = _a;
+	const struct extern_desc *b = _b;
+
+	return strcmp(a->name, b->name);
+}
+
+static int bpf_object__collect_externs(struct bpf_object *obj)
+{
+	struct extern_desc *ext;
+	Elf_Scn *scn;
+	GElf_Shdr sh;
+	int i, n;
+
+	if (!obj->efile.symbols)
+		return 0;
+
+	scn = elf_getscn(obj->efile.elf, obj->efile.symbols_shndx);
+	if (!scn)
+		return -LIBBPF_ERRNO__FORMAT;
+	if (gelf_getshdr(scn, &sh) != &sh)
+		return -LIBBPF_ERRNO__FORMAT;
+	n = sh.sh_size / sh.sh_entsize;
+
+	pr_debug("looking for externs among %d symbols...\n", n);
+	for (i = 0; i < n; i++) {
+		const char *sym_name;
+		GElf_Sym sym;
+
+		if (!gelf_getsym(obj->efile.symbols, i, &sym))
+			return -LIBBPF_ERRNO__FORMAT;
+		if (!sym_is_extern(&sym))
+			continue;
+		sym_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx,
+				      sym.st_name);
+		if (!sym_name || !sym_name[0])
+			continue;
+
+		ext = obj->externs;
+		ext = reallocarray(ext, obj->nr_extern + 1, sizeof(*ext));
+		if (!ext)
+			return -ENOMEM;
+		obj->externs = ext;
+
+		ext = &ext[obj->nr_extern];
+		ext->name = strdup(sym_name);
+		ext->sym_idx = i;
+		if (!ext->name)
+			return -ENOMEM;
+		obj->nr_extern++;
+	}
+	pr_debug("collected %d externs total\n", obj->nr_extern);
+
+	/* sort externs by name and calculate their offsets within a map */
+	qsort(obj->externs, obj->nr_extern, sizeof(*ext), cmp_extern_by_name);
+	for (i = 0; i < obj->nr_extern; i++) {
+		ext = &obj->externs[i];
+		ext->data_off = i * sizeof(__u64);
+		pr_debug("extern #%d: symbol %d, off %u, name %s\n",
+			 i, ext->sym_idx, ext->data_off, ext->name);
+	}
+
+	return 0;
 }
 
 static struct bpf_program *
@@ -1791,6 +2054,8 @@  bpf_object__section_to_libbpf_map_type(const struct bpf_object *obj, int shndx)
 		return LIBBPF_MAP_BSS;
 	else if (shndx == obj->efile.rodata_shndx)
 		return LIBBPF_MAP_RODATA;
+	else if (shndx == obj->efile.symbols_shndx)
+		return LIBBPF_MAP_EXTERN;
 	else
 		return LIBBPF_MAP_UNSPEC;
 }
@@ -1830,6 +2095,30 @@  static int bpf_program__record_reloc(struct bpf_program *prog,
 			insn_idx, insn->code);
 		return -LIBBPF_ERRNO__RELOC;
 	}
+
+	if (sym_is_extern(sym)) {
+		int sym_idx = GELF_R_SYM(rel->r_info);
+		int i, n = obj->nr_extern;
+		struct extern_desc *ext;
+
+		for (i = 0; i < n; i++) {
+			ext = &obj->externs[i];
+			if (ext->sym_idx == sym_idx)
+				break;
+		}
+		if (i >= n) {
+			pr_warn("extern relo failed to find extern for sym %d\n",
+				sym_idx);
+			return -LIBBPF_ERRNO__RELOC;
+		}
+		pr_debug("found extern #%d '%s' (sym %d, off %u) for insn %u\n",
+			 i, ext->name, ext->sym_idx, ext->data_off, insn_idx);
+		reloc_desc->type = RELO_EXTERN;
+		reloc_desc->insn_idx = insn_idx;
+		reloc_desc->data_off = ext->data_off;
+		return 0;
+	}
+
 	if (!shdr_idx || shdr_idx >= SHN_LORESERVE) {
 		pr_warn("invalid relo for \'%s\' in special section 0x%x; forgot to initialize global var?..\n",
 			name, shdr_idx);
@@ -2296,23 +2585,41 @@  bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)
 	char *cp, errmsg[STRERR_BUFSIZE];
 	int err, zero = 0;
 	__u8 *data;
+	bool freeze;
 
-	/* Nothing to do here since kernel already zero-initializes .bss map. */
-	if (map->libbpf_type == LIBBPF_MAP_BSS)
+	switch (map->libbpf_type) {
+	case LIBBPF_MAP_BSS:
+		/* kernel already zero-initializes .bss map. */
 		return 0;
-
-	data = map->libbpf_type == LIBBPF_MAP_DATA ?
-	       obj->sections.data : obj->sections.rodata;
+	case LIBBPF_MAP_DATA:
+		data = obj->sections.data;
+		freeze = false;
+		break;
+	case LIBBPF_MAP_RODATA:
+		data = obj->sections.rodata;
+		freeze = true;
+		break;
+	case LIBBPF_MAP_EXTERN:
+		data = obj->sections.extern_data;
+		freeze = true;
+		break;
+	case LIBBPF_MAP_UNSPEC:
+	default:
+		return -EINVAL;
+	}
 
 	err = bpf_map_update_elem(map->fd, &zero, data, 0);
-	/* Freeze .rodata map as read-only from syscall side. */
-	if (!err && map->libbpf_type == LIBBPF_MAP_RODATA) {
+	if (err)
+		return -errno;
+
+	/* Freeze .rodata or .externs map as read-only from syscall side. */
+	if (freeze) {
 		err = bpf_map_freeze(map->fd);
 		if (err) {
+			err = -errno;
 			cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg));
 			pr_warn("Error freezing map(%s) as read-only: %s\n",
 				map->name, cp);
-			err = 0;
 		}
 	}
 	return err;
@@ -3554,9 +3861,6 @@  bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj,
 	size_t new_cnt;
 	int err;
 
-	if (relo->type != RELO_CALL)
-		return -LIBBPF_ERRNO__RELOC;
-
 	if (prog->idx == obj->efile.text_shndx) {
 		pr_warn("relo in .text insn %d into off %d\n",
 			relo->insn_idx, relo->text_off);
@@ -3617,33 +3921,38 @@  bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
 		return 0;
 
 	for (i = 0; i < prog->nr_reloc; i++) {
-		if (prog->reloc_desc[i].type == RELO_LD64 ||
-		    prog->reloc_desc[i].type == RELO_DATA) {
-			bool relo_data = prog->reloc_desc[i].type == RELO_DATA;
-			struct bpf_insn *insns = prog->insns;
-			int insn_idx, map_idx;
+		struct reloc_desc *relo = &prog->reloc_desc[i];
+		struct bpf_insn *insns = prog->insns;
+		int insn_idx = relo->insn_idx;
 
-			insn_idx = prog->reloc_desc[i].insn_idx;
-			map_idx = prog->reloc_desc[i].map_idx;
-
-			if (insn_idx + 1 >= (int)prog->insns_cnt) {
-				pr_warn("relocation out of range: '%s'\n",
-					prog->section_name);
-				return -LIBBPF_ERRNO__RELOC;
-			}
+		if (insn_idx + 1 >= (int)prog->insns_cnt) {
+			pr_warn("relo #%d: insn out of range %d\n", i, insn_idx);
+			return -LIBBPF_ERRNO__RELOC;
+		}
 
-			if (!relo_data) {
-				insns[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
-			} else {
-				insns[insn_idx].src_reg = BPF_PSEUDO_MAP_VALUE;
-				insns[insn_idx + 1].imm = insns[insn_idx].imm;
-			}
-			insns[insn_idx].imm = obj->maps[map_idx].fd;
-		} else if (prog->reloc_desc[i].type == RELO_CALL) {
-			err = bpf_program__reloc_text(prog, obj,
-						      &prog->reloc_desc[i]);
+		switch (relo->type) {
+		case RELO_LD64:
+			insns[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+			insns[insn_idx].imm = obj->maps[relo->map_idx].fd;
+			break;
+		case RELO_DATA:
+			insns[insn_idx].src_reg = BPF_PSEUDO_MAP_VALUE;
+			insns[insn_idx + 1].imm = insns[insn_idx].imm;
+			insns[insn_idx].imm = obj->maps[relo->map_idx].fd;
+			break;
+		case RELO_EXTERN:
+			insns[insn_idx].src_reg = BPF_PSEUDO_MAP_VALUE;
+			insns[insn_idx].imm = obj->maps[obj->extern_map_idx].fd;
+			insns[insn_idx + 1].imm = relo->data_off;
+			break;
+		case RELO_CALL:
+			err = bpf_program__reloc_text(prog, obj, relo);
 			if (err)
 				return err;
+			break;
+		default:
+			pr_warn("relo #%d: bad relo type %d\n", i, relo->type);
+			return -EINVAL;
 		}
 	}
 
@@ -3917,12 +4226,10 @@  static struct bpf_object *
 __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 		   struct bpf_object_open_opts *opts)
 {
-	const char *pin_root_path;
 	struct bpf_program *prog;
 	struct bpf_object *obj;
 	const char *obj_name;
 	char tmp_name[64];
-	bool relaxed_maps;
 	__u32 attach_prog_fd;
 	int err;
 
@@ -3952,16 +4259,19 @@  __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 		return obj;
 
 	obj->relaxed_core_relocs = OPTS_GET(opts, relaxed_core_relocs, false);
-	relaxed_maps = OPTS_GET(opts, relaxed_maps, false);
-	pin_root_path = OPTS_GET(opts, pin_root_path, NULL);
 	attach_prog_fd = OPTS_GET(opts, attach_prog_fd, 0);
 
-	CHECK_ERR(bpf_object__elf_init(obj), err, out);
-	CHECK_ERR(bpf_object__check_endianness(obj), err, out);
-	CHECK_ERR(bpf_object__probe_caps(obj), err, out);
-	CHECK_ERR(bpf_object__elf_collect(obj, relaxed_maps, pin_root_path),
-		  err, out);
-	CHECK_ERR(bpf_object__collect_reloc(obj), err, out);
+	err = bpf_object__elf_init(obj);
+	err = err ? : bpf_object__check_endianness(obj);
+	err = err ? : bpf_object__probe_caps(obj);
+	err = err ? : bpf_object__elf_collect(obj);
+	err = err ? : bpf_object__collect_externs(obj);
+	err = err ? : bpf_object__init_maps(obj, opts);
+	err = err ? : bpf_object__sanitize_and_load_btf(obj);
+	err = err ? : bpf_object__init_prog_names(obj);
+	err = err ? : bpf_object__collect_reloc(obj);
+	if (err)
+		goto out;
 	bpf_object__elf_finish(obj);
 
 	bpf_object__for_each_program(prog, obj) {
@@ -4681,6 +4991,9 @@  void bpf_object__close(struct bpf_object *obj)
 
 	zfree(&obj->sections.rodata);
 	zfree(&obj->sections.data);
+	zfree(&obj->sections.extern_data);
+	zfree(&obj->externs);
+	obj->nr_extern = 0;
 	zfree(&obj->maps);
 	obj->nr_maps = 0;
 
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0dbf4bfba0c4..30b920879319 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -103,14 +103,18 @@  struct bpf_object_open_opts {
 	bool relaxed_maps;
 	/* process CO-RE relocations non-strictly, allowing them to fail */
 	bool relaxed_core_relocs;
+	__u32 attach_prog_fd;
 	/* maps that set the 'pinning' attribute in their definition will have
 	 * their pin_path attribute set to a file in this directory, and be
 	 * auto-pinned to that path on load; defaults to "/sys/fs/bpf".
 	 */
 	const char *pin_root_path;
-	__u32 attach_prog_fd;
+	/* kernel config file path override (for CONFIG_ externs); can point
+	 * to either uncompressed text file or .gz file
+	 */
+	const char *kernel_config_path;
 };
-#define bpf_object_open_opts__last_field attach_prog_fd
+#define bpf_object_open_opts__last_field kernel_config_path
 
 LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
 LIBBPF_API struct bpf_object *
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 5737888cd6a7..df3f42aa9a1a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -22,7 +22,7 @@  CFLAGS += -g -Wall -O2 $(GENFLAGS) -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR)	\
 	  -I$(GENDIR) -I$(TOOLSDIR) -I$(CURDIR)				\
 	  -Dbpf_prog_load=bpf_prog_test_load				\
 	  -Dbpf_load_program=bpf_test_load_program
-LDLIBS += -lcap -lelf -lrt -lpthread
+LDLIBS += -lcap -lelf -lz -lrt -lpthread
 
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \