
[v3,bpf-next,0/4] Add support for cgroup bpf_link

Message ID 20200330030001.2312810-1-andriin@fb.com

Message

Andrii Nakryiko March 30, 2020, 2:59 a.m. UTC
The bpf_link abstraction itself was formalized in [0], with justification for
why its semantics are a good fit for attaching BPF programs of various types.
This patch set adds a bpf_link-based BPF program attachment mechanism for
cgroup BPF programs.

Cgroup bpf_link is semantically compatible with the current BPF_F_ALLOW_MULTI
semantics of attaching cgroup BPF programs directly. Thus, a cgroup bpf_link
can co-exist with legacy BPF program multi-attachment.
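
To make that co-existence concrete, here is a minimal usage sketch (not taken
from the patches themselves): one program is attached the legacy way with
BPF_F_ALLOW_MULTI, another through the bpf_link_create() wrapper added in
patch 3. The cgroup path and program FDs are placeholders.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* Sketch: legacy multi-attach and link-based attach on the same cgroup.
 * prog_fd_a/prog_fd_b are FDs of already loaded BPF_PROG_TYPE_CGROUP_SKB
 * programs; the cgroup path is just an example.
 */
static int attach_both(int prog_fd_a, int prog_fd_b)
{
	int cg_fd, link_fd = -1, err;

	cg_fd = open("/sys/fs/cgroup/unified/my-service", O_RDONLY);
	if (cg_fd < 0)
		return -errno;

	/* legacy direct attachment, multi mode */
	err = bpf_prog_attach(prog_fd_a, cg_fd, BPF_CGROUP_INET_EGRESS,
			      BPF_F_ALLOW_MULTI);
	if (err)
		goto out;

	/* new FD-based attachment; the returned FD "owns" the attachment */
	link_fd = bpf_link_create(prog_fd_b, cg_fd, BPF_CGROUP_INET_EGRESS, NULL);
	if (link_fd < 0)
		err = link_fd;
out:
	close(cg_fd);
	return err ? err : link_fd;
}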

A bpf_link is destroyed and automatically detached when the last open FD
holding a reference to it is closed. This means that, by default, when the
process that created the bpf_link exits, the attached BPF program is
automatically detached by bpf_link's clean-up code. A cgroup bpf_link, like
any other bpf_link, can be pinned in BPF FS and thereby survive the exit of
the process that created it. This is useful in many scenarios for providing
long-lived BPF program attachments. Pinning also means that there can be many
owners of a bpf_link through independent FDs.
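
A minimal sketch of that pinning flow (assuming a BPF FS mounted at
/sys/fs/bpf and reusing the existing bpf_obj_pin() helper; the pin path is
made up):

#include <unistd.h>
#include <bpf/bpf.h>

/* Sketch: keep the attachment alive past the creating process by pinning
 * the link FD in BPF FS. link_fd is an FD returned by bpf_link_create().
 */
static int pin_cgroup_link(int link_fd)
{
	int err;

	err = bpf_obj_pin(link_fd, "/sys/fs/bpf/my_cgroup_link");
	if (err)
		return err;

	/* closing our FD no longer detaches the program: the pinned file
	 * keeps a reference to the bpf_link until it is unlinked and all
	 * other FDs are closed
	 */
	close(link_fd);
	return 0;
}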

Additionally, auto-detachment of cgroup bpf_links is implemented. When a
cgroup is dying, it automatically detaches all active bpf_links. This ensures
that cgroup clean-up is not delayed by an active bpf_link, even though no BPF
program can run for that cgroup anymore. In that sense it is similar to the
existing behavior of dropping the refcnt of an attached bpf_prog. But in the
bpf_link case, the bpf_link itself is not destroyed and remains available to
the user as long as at least one active FD is still open (or it is pinned in
BPF FS).

There are two main cgroup-specific differences between bpf_link-based and
direct bpf_prog-based attachment.

First, as opposed to direct bpf_prog attachment, the cgroup itself doesn't
"own" the bpf_link, which makes it possible to auto-clean up an attached
bpf_link when the user process abruptly exits without explicitly detaching
the BPF program. This makes for a safe default behavior, already proven for
BPF tracing program types. Conversely, a bpf_link doesn't bump
cgroup->bpf.refcnt, so it doesn't prevent the cgroup from cleaning up its BPF
state.

Second, only owners of a bpf_link (those who created the bpf_link in the
first place or obtained a new FD by opening the bpf_link from BPF FS) can
detach and/or update it. This makes sure that no other process can
accidentally remove or replace the BPF program.

This patch set also implements the LINK_UPDATE sub-command, which allows
replacing a bpf_link's underlying bpf_prog, similarly to the BPF_F_REPLACE
flag behavior for direct bpf_prog cgroup attachment. Like LINK_CREATE, it is
intended to be a generic command for different types of bpf_links.

  [0] https://lore.kernel.org/bpf/20200228223948.360936-1-andriin@fb.com/
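
A sketch of driving LINK_UPDATE from user space: bpf_link_update() is the
wrapper added in patch 3, and the BPF_F_REPLACE/old_prog_fd combination
follows the replacement semantics from patch 2. The exact opts field names
here are an assumption based on the patches, not copied from them.

#include <bpf/bpf.h>

/* Sketch: atomically swap the program behind an existing cgroup bpf_link.
 * link_fd comes from bpf_link_create() (or from opening a pinned link);
 * old_fd/new_fd are FDs of loaded programs of the same type.
 */
static int swap_link_prog(int link_fd, int old_fd, int new_fd)
{
	DECLARE_LIBBPF_OPTS(bpf_link_update_opts, opts,
		/* only succeed if old_fd is still the attached program */
		.flags = BPF_F_REPLACE,
		.old_prog_fd = old_fd,
	);

	return bpf_link_update(link_fd, new_fd, &opts);
}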

v2->v3:
  - revert back to just MULTI mode (Alexei);
  - fix tinyconfig compilation warning (kbuild test robot);

v1->v2:
  - implement exclusive and overridable exclusive modes (Andrey Ignatov);
  - fix build for !CONFIG_CGROUP_BPF build;
  - add more selftests for non-multi mode and inter-operability;


Andrii Nakryiko (4):
  bpf: implement bpf_link-based cgroup BPF program attachment
  bpf: implement bpf_prog replacement for an active bpf_cgroup_link
  libbpf: add support for bpf_link-based cgroup attachment
  selftests/bpf: test FD-based cgroup attachment

 include/linux/bpf-cgroup.h                    |  41 +-
 include/linux/bpf.h                           |  10 +-
 include/uapi/linux/bpf.h                      |  22 +-
 kernel/bpf/cgroup.c                           | 395 ++++++++++++++----
 kernel/bpf/syscall.c                          | 113 ++++-
 kernel/cgroup/cgroup.c                        |  41 +-
 tools/include/uapi/linux/bpf.h                |  22 +-
 tools/lib/bpf/bpf.c                           |  34 ++
 tools/lib/bpf/bpf.h                           |  19 +
 tools/lib/bpf/libbpf.c                        |  46 ++
 tools/lib/bpf/libbpf.h                        |   8 +-
 tools/lib/bpf/libbpf.map                      |   4 +
 .../selftests/bpf/prog_tests/cgroup_link.c    | 244 +++++++++++
 .../selftests/bpf/progs/test_cgroup_link.c    |  24 ++
 14 files changed, 924 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cgroup_link.c

Comments

David Ahern March 30, 2020, 2:49 p.m. UTC | #1
On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> bpf_link abstraction itself was formalized in [0] with justifications for why
> its semantics is a good fit for attaching BPF programs of various types. This
> patch set adds bpf_link-based BPF program attachment mechanism for cgroup BPF
> programs.
> 
> Cgroup BPF link is semantically compatible with current BPF_F_ALLOW_MULTI
> semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
> co-exist with legacy BPF program multi-attachment.
> 
> bpf_link is destroyed and automatically detached when the last open FD holding
> the reference to bpf_link is closed. This means that by default, when the
> process that created bpf_link exits, attached BPF program will be
> automatically detached due to bpf_link's clean up code. Cgroup bpf_link, like
> any other bpf_link, can be pinned in BPF FS and by those means survive the
> exit of process that created the link. This is useful in many scenarios to
> provide long-living BPF program attachments. Pinning also means that there
> could be many owners of bpf_link through independent FDs.
> 
> Additionally, auto-detachment of cgroup bpf_link is implemented. When cgroup is
> dying it will automatically detach all active bpf_links. This ensures that
> cgroup clean up is not delayed due to active bpf_link even despite no chance
> for any BPF program to be run for a given cgroup. In that sense it's similar
> to existing behavior of dropping refcnt of attached bpf_prog. But in the case
> of bpf_link, bpf_link is not destroyed and is still available to user as long
> as at least one active FD is still open (or if it's pinned in BPF FS).
> 
> There are two main cgroup-specific differences between bpf_link-based and
> direct bpf_prog-based attachment.
> 
> First, as opposed to direct bpf_prog attachment, cgroup itself doesn't "own"
> bpf_link, which makes it possible to auto-clean up attached bpf_link when user
> process abruptly exits without explicitly detaching BPF program. This makes
> for a safe default behavior proven in BPF tracing program types. But bpf_link
> doesn't bump cgroup->bpf.refcnt as well and because of that doesn't prevent
> cgroup from cleaning up its BPF state.
> 
> Second, only owners of bpf_link (those who created bpf_link in the first place
> or obtained a new FD by opening bpf_link from BPF FS) can detach and/or update
> it. This makes sure that no other process can accidentally remove/replace BPF
> program.
> 
> This patch set also implements LINK_UPDATE sub-command, which allows to
> replace bpf_link's underlying bpf_prog, similarly to BPF_F_REPLACE flag
> behavior for direct bpf_prog cgroup attachment. Similarly to LINK_CREATE, it
> is supposed to be generic command for different types of bpf_links.
> 

The observability piece should go in the same release as the feature.
Andrii Nakryiko March 30, 2020, 8:20 p.m. UTC | #2
On Mon, Mar 30, 2020 at 7:49 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> > bpf_link abstraction itself was formalized in [0] with justifications for why
> > its semantics is a good fit for attaching BPF programs of various types. This
> > patch set adds bpf_link-based BPF program attachment mechanism for cgroup BPF
> > programs.
> >
> > Cgroup BPF link is semantically compatible with current BPF_F_ALLOW_MULTI
> > semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
> > co-exist with legacy BPF program multi-attachment.
> >
> > bpf_link is destroyed and automatically detached when the last open FD holding
> > the reference to bpf_link is closed. This means that by default, when the
> > process that created bpf_link exits, attached BPF program will be
> > automatically detached due to bpf_link's clean up code. Cgroup bpf_link, like
> > any other bpf_link, can be pinned in BPF FS and by those means survive the
> > exit of process that created the link. This is useful in many scenarios to
> > provide long-living BPF program attachments. Pinning also means that there
> > could be many owners of bpf_link through independent FDs.
> >
> > Additionally, auto-detachment of cgroup bpf_link is implemented. When cgroup is
> > dying it will automatically detach all active bpf_links. This ensures that
> > cgroup clean up is not delayed due to active bpf_link even despite no chance
> > for any BPF program to be run for a given cgroup. In that sense it's similar
> > to existing behavior of dropping refcnt of attached bpf_prog. But in the case
> > of bpf_link, bpf_link is not destroyed and is still available to user as long
> > as at least one active FD is still open (or if it's pinned in BPF FS).
> >
> > There are two main cgroup-specific differences between bpf_link-based and
> > direct bpf_prog-based attachment.
> >
> > First, as opposed to direct bpf_prog attachment, cgroup itself doesn't "own"
> > bpf_link, which makes it possible to auto-clean up attached bpf_link when user
> > process abruptly exits without explicitly detaching BPF program. This makes
> > for a safe default behavior proven in BPF tracing program types. But bpf_link
> > doesn't bump cgroup->bpf.refcnt as well and because of that doesn't prevent
> > cgroup from cleaning up its BPF state.
> >
> > Second, only owners of bpf_link (those who created bpf_link in the first place
> > or obtained a new FD by opening bpf_link from BPF FS) can detach and/or update
> > it. This makes sure that no other process can accidentally remove/replace BPF
> > program.
> >
> > This patch set also implements LINK_UPDATE sub-command, which allows to
> > replace bpf_link's underlying bpf_prog, similarly to BPF_F_REPLACE flag
> > behavior for direct bpf_prog cgroup attachment. Similarly to LINK_CREATE, it
> > is supposed to be generic command for different types of bpf_links.
> >
>
> The observability piece should go in the same release as the feature.

You mean LINK_QUERY command I mentioned before? Yes, I'm working on
adding it next, regardless if this patch set goes in right now or
later.
David Ahern March 30, 2020, 8:45 p.m. UTC | #3
On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
>> The observability piece should go in the same release as the feature.
> You mean LINK_QUERY command I mentioned before? Yes, I'm working on
> adding it next, regardless if this patch set goes in right now or
> later.

There was also mention of a "human override" details of which have not
been discussed. Since the query is not ready either, then the
create/update should wait so that all of it can be in the same kernel
release. As it stands it is a half-baked feature.
Alexei Starovoitov March 30, 2020, 8:50 p.m. UTC | #4
On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
> >> The observability piece should go in the same release as the feature.
> > You mean LINK_QUERY command I mentioned before? Yes, I'm working on
> > adding it next, regardless if this patch set goes in right now or
> > later.
>
> There was also mention of a "human override" details of which have not
> been discussed. Since the query is not ready either, then the
> create/update should wait so that all of it can be in the same kernel
> release. As it stands it is a half-baked feature.

"human override" was discussed only in the context of xdp.
There won't be override for cgroup-bpf.
Just like one doesn't exist today and there is no need for it.
There will be no override for clsbpf either.
None of the multi prog hooks need it.
Alexei Starovoitov March 30, 2020, 10:50 p.m. UTC | #5
On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> release. As it stands it is a half-baked feature.

Speaking of half-baked:
I think that, as it stands (even without link_query), it's already an
extremely useful addition; it doesn't take anything away from existing
cgroup-bpf and doesn't hinder observability. 'bpftool cgroup' works just fine.
So I've applied the set.

Even if it were half-baked, it would still be apply-able.
Many features are developed over the course of multiple
kernel releases. Example: your nexthops, mptcp, bpf-lsm.
David Ahern March 30, 2020, 11:43 p.m. UTC | #6
On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>> release. As it stands it is a half-baked feature.
> 
> speaking of half-baked.
> I think as it stands (even without link_query) it's already extremely
> useful addition and doesn't take anything away from existing cgroup-bpf
> and doesn't hinder observability. 'bpftool cgroup' works just fine.
> So I've applied the set.
> 
> Even if it was half-baked it would still be applie-able.
> Many features are developed over the course of multiple
> kernel releases. Example: your nexthops, mptcp, bpf-lsm.
> 

nexthops were not - refactoring in one release, and the entire feature went
into 5.4. Large features / patch sets often must be spread across kernel
versions because it is not humanly possible to send and review all the
patches at once.

This is not a large feature, and there is no reason for CREATE/UPDATE -
a mere 4 patch set - to go in without something as essential as the
QUERY for observability.
Alexei Starovoitov March 31, 2020, 12:32 a.m. UTC | #7
On Mon, Mar 30, 2020 at 05:43:52PM -0600, David Ahern wrote:
> On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> >> release. As it stands it is a half-baked feature.
> > 
> > speaking of half-baked.
> > I think as it stands (even without link_query) it's already extremely
> > useful addition and doesn't take anything away from existing cgroup-bpf
> > and doesn't hinder observability. 'bpftool cgroup' works just fine.
> > So I've applied the set.
> > 
> > Even if it was half-baked it would still be applie-able.
> > Many features are developed over the course of multiple
> > kernel releases. Example: your nexthops, mptcp, bpf-lsm.
> > 
> 
> nexthops were not - refactoring in 1 release and the entire feature went
> in to 5.4. Large features / patch sets often must be spread across
> kernel versions because it is not humanly possible to send and review
> the patches.
> 
> This is not a large feature, and there is no reason for CREATE/UPDATE -
> a mere 4 patch set - to go in without something as essential as the
> QUERY for observability.

As I said, 'bpftool cgroup' covers it. Observability is not reduced in any way.
Also, there is Documentation/bpf/drgn.rst.
The kernel is an open book. Just learn this simple tool and everything will
be observable: not only bpf objects, but the rest of the kernel too.
IMO the use case for LINK_QUERY makes sense for xdp only:
there is one program there, and it's a dispatcher program that libdispatcher
will be iteratively updating via freplace. For cgroup bpf_links I cannot
think of a reason why apps would need to query them.
David Ahern March 31, 2020, 12:57 a.m. UTC | #8
On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
>>
>> This is not a large feature, and there is no reason for CREATE/UPDATE -
>> a mere 4 patch set - to go in without something as essential as the
>> QUERY for observability.
> 
> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.

You want a feature where a process can prevent another from installing a
program on a cgroup. How do I learn which process is holding the bpf_link
reference and preventing me from installing a program? Unless I have missed
some recent change, that is not currently covered by 'bpftool cgroup', and
reading kernel code will not tell me.

###
To quote Lorenz from an earlier response:

"However, this behaviour concerns me. It's like Windows not
letting you delete a file while an application has it opened, which just
leads to randomly killing programs until you find the right one. It's
frustrating and counter productive.

You're taking power away from the operator. In your deployment scenario
this might make sense, but I think it's a really bad model in general.
If I am privileged I need to be able to exercise that privilege."
###

That is my point. You are restricting what root can do and people will
not want to resort to killing random processes trying to find the one
holding a reference. This is an essential missing piece and should go in
at the same time as this set.
Alexei Starovoitov March 31, 2020, 1:17 a.m. UTC | #9
On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>
> >> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >> a mere 4 patch set - to go in without something as essential as the
> >> QUERY for observability.
> > 
> > As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
> 
> You want a feature where a process can prevent another from installing a
> program on a cgroup. How do I learn which process is holding the
> bpf_link reference and preventing me from installing a program? Unless I
> have missed some recent change that is not currently covered by bpftool
> cgroup, and there is no way reading kernel code will tell me.

No. That's not the case at all. You misunderstood the concept.

> That is my point. You are restricting what root can do and people will
> not want to resort to killing random processes trying to find the one
> holding a reference. 

Not true either.
bpf_link = old attach with allow_multi (but with extra safety for owner)
The only thing bpf_link protects is the owner of the link from having that
link nuked by other processes.
It does _not_ prevent other processes from attaching their own cgroup-bpf
progs, either via the old interface or via bpf_link.

It will be different for xdp where only one prog is allowed per xdp hook.
There it will prevent other xdp progs, and there link_query and a
"human override" will be necessary.
Andrii Nakryiko March 31, 2020, 3:54 a.m. UTC | #10
On Mon, Mar 30, 2020 at 5:57 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>
> >> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >> a mere 4 patch set - to go in without something as essential as the
> >> QUERY for observability.
> >
> > As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
>
> You want a feature where a process can prevent another from installing a
> program on a cgroup. How do I learn which process is holding the
> bpf_link reference and preventing me from installing a program? Unless I
> have missed some recent change that is not currently covered by bpftool
> cgroup, and there is no way reading kernel code will tell me.
>
> ###
> To quote Lorenz from an earlier response:
>
> "However, this behaviour concerns me. It's like Windows not
> letting you delete a file while an application has it opened, which just
> leads to randomly killing programs until you find the right one. It's
> frustrating and counter productive.
>
> You're taking power away from the operator. In your deployment scenario
> this might make sense, but I think it's a really bad model in general.
> If I am privileged I need to be able to exercise that privilege."
> ###
>
> That is my point. You are restricting what root can do and people will
> not want to resort to killing random processes trying to find the one
> holding a reference. This is an essential missing piece and should go in
> at the same time as this set.

No need to kill random processes; you can kill only those that hold a
bpf_link FD. You can find them using the drgn tool with a script like [0].
It already gives you quite a lot of information; it should also find pinned
bpf_links, but I haven't added that yet.

Found total 11 bpf_links.
-------------------------------------------------
type: tracing
prog: 'test1' id:223 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test2' id:224 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test3' id:225 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test4' id:226 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test5' id:227 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test6' id:228 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: raw_tp
prog: '' id:237 type:BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
tp: bpf_test_finish
pids: 449462
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3/cg4
pids: 449881
-------------------------------------------------


   [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
David Ahern March 31, 2020, 4:54 p.m. UTC | #11
On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> 
>    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7

#!/usr/bin/env drgn

where do I find drgn? Google search is not turning up anything relevant.
Andrii Nakryiko March 31, 2020, 5:03 p.m. UTC | #12
On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> >    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

Right, sorry, there were announcements and discussions about drgn on the
mailing list before, so I wrongly assumed people were aware of it. It's
here:
https://github.com/osandov/drgn

There is another BPF-related tool that shows freplace'd BPF programs
attached to the original program (I stole some parts of it for my script,
thanks Andrey Ignatov!):

https://github.com/osandov/drgn/blob/master/tools/bpf_inspect.py
Alexei Starovoitov March 31, 2020, 5:11 p.m. UTC | #13
On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> >    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

It's the link I mentioned in this thread just a few emails ago:
"Also there is Documentation/bpf/drgn.rst"
Edward Cree March 31, 2020, 9:51 p.m. UTC | #14
On 31/03/2020 04:54, Andrii Nakryiko wrote:
> No need to kill random processes, you can kill only those that hold
> bpf_link FD. You can find them using drgn tool with script like [0].
For the record, I find the argument "we don't need a query feature,
 because you can just use a kernel debugger" *utterly* *horrifying*.
Now, it seems to be moot, because Alexei has given other, better
 reasons why query doesn't need to land yet; but can we please not
 ever treat debugging interfaces as a substitute for proper APIs?

</scream>
-ed
David Ahern March 31, 2020, 10:44 p.m. UTC | #15
On 3/31/20 3:51 PM, Edward Cree wrote:
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
>> No need to kill random processes, you can kill only those that hold
>> bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
>  because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
>  reasons why query doesn't need to land yet; but can we please not
>  ever treat debugging interfaces as a substitute for proper APIs?
> 
> </scream>
> -ed
> 

Just about to send the same intent: dev packages and processing /proc/kcore
are not a proper observability API for production systems.
Andrii Nakryiko April 1, 2020, 12:22 a.m. UTC | #16
On Tue, Mar 31, 2020 at 2:52 PM Edward Cree <ecree@solarflare.com> wrote:
>
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
> > No need to kill random processes, you can kill only those that hold
> > bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
>  because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
>  reasons why query doesn't need to land yet; but can we please not
>  ever treat debugging interfaces as a substitute for proper APIs?

Can you please point out where I was objecting to the observability API
(which is the LINK_QUERY thing we've discussed; I didn't oppose it, and I'm
going to add it next)?

What I'm doubtful of is this "human override" functionality. I think a tool
that shows who is using a given bpf_link (processes and mounted files in BPF
FS) is way more useful, because it allows you both to "unblock" the BPF hook
(by killing the "bad" processes and removing mounted bpf_link files) and to
know which processes (read: applications) are misbehaving.

I'll address the drgn-or-not concern in a reply to David Ahern, who's also
*utterly horrified*, apparently, so I'll try to calm him as well. ;)

>
> </scream>
> -ed
Andrii Nakryiko April 1, 2020, 12:45 a.m. UTC | #17
On Tue, Mar 31, 2020 at 3:44 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/31/20 3:51 PM, Edward Cree wrote:
> > On 31/03/2020 04:54, Andrii Nakryiko wrote:
> >> No need to kill random processes, you can kill only those that hold
> >> bpf_link FD. You can find them using drgn tool with script like [0].
> > For the record, I find the argument "we don't need a query feature,
> >  because you can just use a kernel debugger" *utterly* *horrifying*.
> > Now, it seems to be moot, because Alexei has given other, better
> >  reasons why query doesn't need to land yet; but can we please not
> >  ever treat debugging interfaces as a substitute for proper APIs?
> >
> > </scream>
> > -ed
> >
>
> just about to send the same intent. Dev packages and processing
> /proc/kcore is not a proper observability API for production systems.

I'm not against observability. LINK_QUERY is going to be added. I'm also
looking into making bpf_link a "lookup-able by ID" object, similar to
bpf_map and bpf_prog, which will make it possible to simply ask "show me all
the BPF attachments in the system", something that is impossible to do
right now, btw.

As for drgn and /proc/kcore: drgn is an awesome tool for a lot of internal
kernel observability work that is impractical to expose through stable APIs.
But you don't have to use it to get the same effect. The problem that script
solves is showing all the processes that have an open FD to a bpf_link file.
This is the same problem the fuser command solves for normal files, and the
solution is similar: fuser seems to do it by iterating over all processes
and their FDs in procfs. Not the most efficient way, but it works. Here's
what you can already get for a cgroup bpf_link file with my last patch set:

# cat /proc/1366584/fdinfo/14
pos:    0
flags:  02000000
mnt_id: 14
link_type:      cgroup
prog_tag:       9ad187367cf2b9e8
prog_id:        1649


We can extend that information further with relevant details. This is a
good, bigger discussion for the LINK_QUERY API as well, given that it and
fdinfo might be treated as two ways to get the same information. This is one
reason I didn't do it for cgroup bpf_link; there are already enough related
discussions to keep us all involved for more than a week now.

But it would be nice to start discussing and figuring out these relevant
details, instead of being horrified and terrified and spreading FUD, or
inventing ways to violate the good properties of bpf_link (e.g., by forceful
nuking) due to theoretical worries about the need to detach a bpf_link
without finding the application or pinned file that holds it. As Alexei
mentioned, what's there already (raw_tp, tracing, and now cgroup bpf_links)
is no worse than what we had before. By the time we get to XDP bpf_link,
we'll have even more observability capabilities.
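
For what it's worth, the fuser-style scan mentioned above doesn't even need
drgn. A rough sketch in plain C that walks /proc/*/fdinfo and reports FDs
whose fdinfo carries the link_type field shown in the sample output above
(the field name follows that sample; the rest is generic procfs code):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Sketch: list processes holding bpf_link FDs by scanning procfs for
 * fdinfo files that contain a "link_type:" line.
 */
int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *pid_ent;

	if (!proc)
		return 1;

	while ((pid_ent = readdir(proc))) {
		char path[512], line[256];
		struct dirent *fd_ent;
		DIR *fdinfo;

		if (!isdigit((unsigned char)pid_ent->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "/proc/%s/fdinfo", pid_ent->d_name);
		fdinfo = opendir(path);
		if (!fdinfo)
			continue;

		while ((fd_ent = readdir(fdinfo))) {
			FILE *f;

			if (fd_ent->d_name[0] == '.')
				continue;
			snprintf(path, sizeof(path), "/proc/%s/fdinfo/%s",
				 pid_ent->d_name, fd_ent->d_name);
			f = fopen(path, "r");
			if (!f)
				continue;
			while (fgets(line, sizeof(line), f)) {
				if (!strncmp(line, "link_type:", 10)) {
					printf("pid %s fd %s: %s",
					       pid_ent->d_name, fd_ent->d_name, line);
					break;
				}
			}
			fclose(f);
		}
		closedir(fdinfo);
	}
	closedir(proc);
	return 0;
}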
David Ahern April 1, 2020, 1:42 a.m. UTC | #18
On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
>> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
>>>>
>>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
>>>> a mere 4 patch set - to go in without something as essential as the
>>>> QUERY for observability.
>>>
>>> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
>>
>> You want a feature where a process can prevent another from installing a
>> program on a cgroup. How do I learn which process is holding the
>> bpf_link reference and preventing me from installing a program? Unless I
>> have missed some recent change that is not currently covered by bpftool
>> cgroup, and there is no way reading kernel code will tell me.
> 
> No. That's not the case at all. You misunderstood the concept.

I don't think so ...

> 
>> That is my point. You are restricting what root can do and people will
>> not want to resort to killing random processes trying to find the one
>> holding a reference. 
> 
> Not true either.
> bpf_link = old attach with allow_multi (but with extra safety for owner)

cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
That's a year for tools like 'ip vrf exec' to exist and be relied on.
'ip vrf exec' does not use MULTI.

I have not done a deep dive on systemd code, but on ubuntu 18.04 system:

$ sudo ~/bin/bpftool cgroup tree
CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup/unified/system.slice/systemd-udevd.service
    5        ingress
    4        egress
/sys/fs/cgroup/unified/system.slice/systemd-journald.service
    3        ingress
    2        egress
/sys/fs/cgroup/unified/system.slice/systemd-logind.service
    7        ingress
    6        egress

suggests that multi was not common with systemd either, at least at some
point in its history, so 'ip vrf exec' is not alone in not using the flag.
There are most likely many other tools.


> The only thing bpf_link protects is the owner of the link from other
> processes of nuking that link.
> It does _not_ prevent other processes attaching their own cgroup-bpf progs
> either via old interface or via bpf_link.
> 

It does when that older code does not use the MULTI flag. There is a
history that is going to create conflicts, and being able to identify which
program holds the bpf_link is essential.

And this is really just one use case. There are many other reasons for
wanting to know which process is holding a reference to something.
Alexei Starovoitov April 1, 2020, 2:04 a.m. UTC | #19
On Tue, Mar 31, 2020 at 07:42:46PM -0600, David Ahern wrote:
> On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> >> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>>>
> >>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >>>> a mere 4 patch set - to go in without something as essential as the
> >>>> QUERY for observability.
> >>>
> >>> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
> >>
> >> You want a feature where a process can prevent another from installing a
> >> program on a cgroup. How do I learn which process is holding the
> >> bpf_link reference and preventing me from installing a program? Unless I
> >> have missed some recent change that is not currently covered by bpftool
> >> cgroup, and there is no way reading kernel code will tell me.
> > 
> > No. That's not the case at all. You misunderstood the concept.
> 
> I don't think so ...
> 
> > 
> >> That is my point. You are restricting what root can do and people will
> >> not want to resort to killing random processes trying to find the one
> >> holding a reference. 
> > 
> > Not true either.
> > bpf_link = old attach with allow_multi (but with extra safety for owner)
> 
> cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
> That's a year for tools like 'ip vrf exec' to exist and be relied on.
> 'ip vrf exec' does not use MULTI.
> 
> I have not done a deep dive on systemd code, but on ubuntu 18.04 system:
> 
> $ sudo ~/bin/bpftool cgroup tree
> CgroupPath
> ID       AttachType      AttachFlags     Name
> /sys/fs/cgroup/unified/system.slice/systemd-udevd.service
>     5        ingress
>     4        egress
> /sys/fs/cgroup/unified/system.slice/systemd-journald.service
>     3        ingress
>     2        egress
> /sys/fs/cgroup/unified/system.slice/systemd-logind.service
>     7        ingress
>     6        egress
> 
> suggests that multi is not common with systemd either at some point in
> its path, so 'ip vrf exec' is not alone in not using the flag. There
> most likely are many other tools.

Please take a look at the systemd source code:
src/core/bpf-devices.c
src/core/bpf-firewall.c
It prefers to use BPF_F_ALLOW_MULTI when possible, since it's the most
sensible flag.
Since 'ip vrf exec' is not using allow_multi, it breaks several systemd
features (regardless of what bpf_link can and cannot do).

> > The only thing bpf_link protects is the owner of the link from other
> > processes of nuking that link.
> > It does _not_ prevent other processes attaching their own cgroup-bpf progs
> > either via old interface or via bpf_link.
> > 
> 
> It does when that older code does not use the MULTI flag. There is a
> history that is going to create conflicts and being able to id which
> program holds the bpf_link is essential.
> 
> And this is really just one use case. There are many other reasons for
> wanting to know what process is holding a reference to something.

I'm not disagreeing that it's useful to query what is attached where. My
point, once again, is that bpf_link for cgroup didn't change a single bit of
this logic. There are processes (like systemd) that are using allow_multi.
When they switch to using bpf_link a few years from now, nothing will change
for all the other processes in the system. Only systemd will be assured that
its bpf-device prog will not be accidentally removed by 'ip vrf'. Currently
nothing protects systemd's bpf progs; any cap_net_admin process can
_accidentally_ nuke them. It's even more weird that the bpf-cgroup-device
hook that systemd is using is under cap_net_admin, since there is nothing
networking about it. But that's a separate discussion.

Maybe you should fix 'ip vrf' first, before systemd folks start yelling, and
then we can continue arguing about the merits of observability?
Edward Cree April 1, 2020, 2:26 p.m. UTC | #20
On 01/04/2020 01:22, Andrii Nakryiko wrote:
> Can you please point out where I was objecting to observability API
> (which is LINK_QUERY thing we've discussed and I didn't oppose, and
> I'm going to add next)?
I didn't say you objected to it.
I just said that you argued that it was OK for it to not land in the
 same release as the rest of the API, because drgn could paper over
 that gap.  Which seems to me to signify a dangerous way of thinking,
 and I wanted to raise the alarm about that.
(If you _don't_ see what's wrong with that argument... well, that'd
 be even _more_ alarming.  Debuggers — and fuser, for that matter —
 are for when things go wrong _in ways the designers of the system
 failed to anticipate_.  They should not be part of a 'normal'
 workflow for dealing with problems that we already _know_ are possible;
 it's kinda-sorta like how exceptions shouldn't be used for non-
 exceptional situations.)

-ed
Andrii Nakryiko April 1, 2020, 5:39 p.m. UTC | #21
On Wed, Apr 1, 2020 at 7:26 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 01/04/2020 01:22, Andrii Nakryiko wrote:
> > Can you please point out where I was objecting to observability API
> > (which is LINK_QUERY thing we've discussed and I didn't oppose, and
> > I'm going to add next)?
> I didn't say you objected to it.
> I just said that you argued that it was OK for it to not land in the
>  same release as the rest of the API, because drgn could paper over
>  that gap.  Which seems to me to signify a dangerous way of thinking,
>  and I wanted to raise the alarm about that.

Let's have a bit more context first. BTW, please don't trim the chain of
replies (at least not so aggressively) next time.

>>> That is my point. You are restricting what root can do and people will
>>> not want to resort to killing random processes trying to find the one
>>> holding a reference. This is an essential missing piece and should go in
>>> at the same time as this set.
>> No need to kill random processes, you can kill only those that hold
>> bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
> because you can just use a kernel debugger" *utterly* *horrifying*.

I was addressing the concern of having to randomly kill processes, which is
a "human override" thing, not really an observability one. And my response
is exactly that it's better to be able to see all owners of a bpf_link than
to have a "nuke" option. It so happens that drgn is a very useful tool for
this, and I was able to prototype such a script quickly enough to share it
with others. drgn is not the only option; you can do the same by iterating
procfs and using fdinfo. It can certainly be improved for the
bpf_link-specific case by providing more information (like the cgroup path,
etc.). But in general it's a question of "which processes use a given
file", which, unfortunately, I don't think Linux has better observability
APIs for. I'm not happy about that, but it's not really a bpf_link-specific
issue either.

> (If you _don't_ see what's wrong with that argument... well, that'd
>  be even _more_ alarming.  Debuggers — and fuser, for that matter —
>  are for when things go wrong _in ways the designers of the system
>  failed to anticipate_.  They should not be part of a 'normal' work-
>  flow for dealing with problems that we already _know_ are possible;
>  it's kinda-sorta like how exceptions shouldn't be used for non-
>  exceptional situations.)

I don't think it's quite as black and white as you are stating. For
instance, I *think* lsof, netstat, bcc tools, etc. disagree with you. Also,
needing to kill an application because it's attached to XDP or a cgroup
doesn't seem like a "normal" workflow either. But I would really rather
leave it at that, if you don't mind.

>
> -ed