Message ID: 20200330030001.2312810-1-andriin@fb.com
Series: Add support for cgroup bpf_link
On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> bpf_link abstraction itself was formalized in [0] with justifications for why
> its semantics is a good fit for attaching BPF programs of various types. This
> patch set adds a bpf_link-based BPF program attachment mechanism for cgroup
> BPF programs.
>
> Cgroup BPF link is semantically compatible with the current BPF_F_ALLOW_MULTI
> semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
> co-exist with legacy BPF program multi-attachment.
>
> bpf_link is destroyed and automatically detached when the last open FD holding
> the reference to bpf_link is closed. This means that by default, when the
> process that created bpf_link exits, the attached BPF program will be
> automatically detached by bpf_link's clean-up code. Cgroup bpf_link, like any
> other bpf_link, can be pinned in BPF FS and by those means survive the exit of
> the process that created the link. This is useful in many scenarios to provide
> long-living BPF program attachments. Pinning also means that there could be
> many owners of bpf_link through independent FDs.
>
> Additionally, auto-detachment of cgroup bpf_link is implemented. When a cgroup
> is dying it will automatically detach all active bpf_links. This ensures that
> cgroup clean-up is not delayed due to active bpf_links, even though there is
> no chance for any BPF program to be run for such a cgroup. In that sense it's
> similar to the existing behavior of dropping the refcnt of an attached
> bpf_prog. But in the case of bpf_link, the bpf_link itself is not destroyed
> and is still available to the user as long as at least one active FD is still
> open (or if it's pinned in BPF FS).
>
> There are two main cgroup-specific differences between bpf_link-based and
> direct bpf_prog-based attachment.
>
> First, as opposed to direct bpf_prog attachment, the cgroup itself doesn't
> "own" the bpf_link, which makes it possible to auto-clean up an attached
> bpf_link when the user process abruptly exits without explicitly detaching the
> BPF program. This makes for a safe default behavior proven in BPF tracing
> program types. But bpf_link doesn't bump cgroup->bpf.refcnt either, and
> because of that doesn't prevent the cgroup from cleaning up its BPF state.
>
> Second, only owners of bpf_link (those who created the bpf_link in the first
> place or obtained a new FD by opening the bpf_link from BPF FS) can detach
> and/or update it. This makes sure that no other process can accidentally
> remove/replace a BPF program.
>
> This patch set also implements the LINK_UPDATE sub-command, which allows
> replacing bpf_link's underlying bpf_prog, similarly to the BPF_F_REPLACE flag
> behavior for direct bpf_prog cgroup attachment. Like LINK_CREATE, it is
> supposed to be a generic command for different types of bpf_links.

The observability piece should go in the same release as the feature.
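To make the pinning behavior described in the cover letter concrete, here
is a minimal libbpf sketch, assuming a hypothetical "cgroup_prog.o" object
with an "egress" program and illustrative cgroup/BPF FS paths:

    #include <fcntl.h>
    #include <bpf/libbpf.h>

    int attach_and_pin(void)
    {
            struct bpf_object *obj;
            struct bpf_program *prog;
            struct bpf_link *link;
            int cg_fd;

            obj = bpf_object__open_file("cgroup_prog.o", NULL);
            if (libbpf_get_error(obj) || bpf_object__load(obj))
                    return -1;

            prog = bpf_object__find_program_by_name(obj, "egress");
            cg_fd = open("/sys/fs/cgroup/unified/my-cgroup", O_RDONLY);
            if (!prog || cg_fd < 0)
                    return -1;

            /* creates a cgroup bpf_link; it is auto-detached when the
             * last FD holding it is closed, unless pinned below */
            link = bpf_program__attach_cgroup(prog, cg_fd);
            if (libbpf_get_error(link))
                    return -1;

            /* pinning keeps the attachment alive past process exit */
            return bpf_link__pin(link, "/sys/fs/bpf/my-egress-link");
    }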
On Mon, Mar 30, 2020 at 7:49 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> > bpf_link abstraction itself was formalized in [0] with justifications for why
> > its semantics is a good fit for attaching BPF programs of various types.
> > [...]
>
> The observability piece should go in the same release as the feature.

You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
adding it next, regardless of whether this patch set goes in right now or
later.
On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
>> The observability piece should go in the same release as the feature.
> You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
> adding it next, regardless of whether this patch set goes in right now or
> later.

There was also mention of a "human override", details of which have not
been discussed. Since the query is not ready either, the create/update
should wait so that all of it can be in the same kernel release. As it
stands it is a half-baked feature.
On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
> >> The observability piece should go in the same release as the feature.
> > You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
> > adding it next, regardless of whether this patch set goes in right now or
> > later.
>
> There was also mention of a "human override", details of which have not
> been discussed. Since the query is not ready either, the create/update
> should wait so that all of it can be in the same kernel release. As it
> stands it is a half-baked feature.

"human override" was discussed only in the context of xdp. There won't be
an override for cgroup-bpf, just like one doesn't exist today and there is
no need for it. There will be no override for clsbpf either. None of the
multi-prog hooks need it.
On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> release. As it stands it is a half-baked feature.
speaking of half-baked.
I think as it stands (even without link_query) it's already an extremely
useful addition and doesn't take anything away from existing cgroup-bpf
and doesn't hinder observability. 'bpftool cgroup' works just fine.
So I've applied the set.
Even if it were half-baked it would still be apply-able.
Many features are developed over the course of multiple
kernel releases. Example: your nexthops, mptcp, bpf-lsm.
On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>> release. As it stands it is a half-baked feature.
>
> speaking of half-baked.
> I think as it stands (even without link_query) it's already an extremely
> useful addition and doesn't take anything away from existing cgroup-bpf
> and doesn't hinder observability. 'bpftool cgroup' works just fine.
> So I've applied the set.
>
> Even if it were half-baked it would still be apply-able.
> Many features are developed over the course of multiple
> kernel releases. Example: your nexthops, mptcp, bpf-lsm.

nexthops were not - refactoring in 1 release, and the entire feature went
into 5.4. Large features / patch sets often must be spread across kernel
versions because it is not humanly possible to send and review all the
patches at once.

This is not a large feature, and there is no reason for CREATE/UPDATE -
a mere 4-patch set - to go in without something as essential as the
QUERY for observability.
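For reference, the UPDATE half of the CREATE/UPDATE pair being debated
maps to the LINK_UPDATE sub-command. A minimal sketch using libbpf's
bpf_link_update() wrapper, assuming link_fd, new_prog_fd, and old_prog_fd
are valid FDs obtained elsewhere:

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    /* Atomically swap the program behind an existing link. With
     * BPF_F_REPLACE, the swap only succeeds if old_prog_fd matches the
     * currently attached program, analogous to BPF_F_REPLACE for direct
     * cgroup attachment. */
    int swap_link_prog(int link_fd, int new_prog_fd, int old_prog_fd)
    {
            DECLARE_LIBBPF_OPTS(bpf_link_update_opts, opts,
                    .flags = BPF_F_REPLACE,
                    .old_prog_fd = old_prog_fd,
            );

            return bpf_link_update(link_fd, new_prog_fd, &opts);
    }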
On Mon, Mar 30, 2020 at 05:43:52PM -0600, David Ahern wrote:
> On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> >> release. As it stands it is a half-baked feature.
> >
> > speaking of half-baked.
> > I think as it stands (even without link_query) it's already an extremely
> > useful addition and doesn't take anything away from existing cgroup-bpf
> > and doesn't hinder observability. 'bpftool cgroup' works just fine.
> > So I've applied the set.
> >
> > Even if it were half-baked it would still be apply-able.
> > Many features are developed over the course of multiple
> > kernel releases. Example: your nexthops, mptcp, bpf-lsm.
>
> nexthops were not - refactoring in 1 release, and the entire feature went
> into 5.4. Large features / patch sets often must be spread across kernel
> versions because it is not humanly possible to send and review all the
> patches at once.
>
> This is not a large feature, and there is no reason for CREATE/UPDATE -
> a mere 4-patch set - to go in without something as essential as the
> QUERY for observability.

As I said, 'bpftool cgroup' covers it. Observability is not reduced in any
way. Also there is Documentation/bpf/drgn.rst. The kernel is an open book.
Just learn this simple tool and everything will be observable - not only
bpf objects, but the rest of the kernel too.

imo the use case for LINK_QUERY makes sense for xdp only. There is one
program there, and it's a dispatcher program that libdispatcher will be
iteratively updating via freplace. For cgroup bpf_links I cannot think of
a reason why apps would need to query them.
On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
>>
>> This is not a large feature, and there is no reason for CREATE/UPDATE -
>> a mere 4-patch set - to go in without something as essential as the
>> QUERY for observability.
>
> As I said, 'bpftool cgroup' covers it. Observability is not reduced in any way.

You want a feature where a process can prevent another from installing a
program on a cgroup. How do I learn which process is holding the bpf_link
reference and preventing me from installing a program? Unless I have
missed some recent change, that is not currently covered by bpftool
cgroup, and there is no way reading kernel code will tell me.

###
To quote Lorenz from an earlier response:

"However, this behaviour concerns me. It's like Windows not letting you
delete a file while an application has it opened, which just leads to
randomly killing programs until you find the right one. It's frustrating
and counter productive.

You're taking power away from the operator. In your deployment scenario
this might make sense, but I think it's a really bad model in general.
If I am privileged I need to be able to exercise that privilege."
###

That is my point. You are restricting what root can do, and people will
not want to resort to killing random processes trying to find the one
holding a reference. This is an essential missing piece and should go in
at the same time as this set.
On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>
> >> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >> a mere 4-patch set - to go in without something as essential as the
> >> QUERY for observability.
> >
> > As I said, 'bpftool cgroup' covers it. Observability is not reduced in any way.
>
> You want a feature where a process can prevent another from installing a
> program on a cgroup. How do I learn which process is holding the bpf_link
> reference and preventing me from installing a program? Unless I have
> missed some recent change, that is not currently covered by bpftool
> cgroup, and there is no way reading kernel code will tell me.

No. That's not the case at all. You misunderstood the concept.

> That is my point. You are restricting what root can do, and people will
> not want to resort to killing random processes trying to find the one
> holding a reference.

Not true either.
bpf_link = old attach with allow_multi (but with extra safety for the owner)

The only thing bpf_link does is protect the owner of the link from having
other processes nuke that link. It does _not_ prevent other processes from
attaching their own cgroup-bpf progs, either via the old interface or via
bpf_link.

It will be different for xdp, where only one prog is allowed per xdp hook.
There it will prevent other xdp progs, and there link_query and "human
override" will be necessary.
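As a concrete rendering of the "old attach with allow_multi" equivalence,
a sketch of the legacy attach path, assuming prog_fd and cg_fd are valid
program and cgroup FDs:

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    int legacy_multi_attach(int prog_fd, int cg_fd)
    {
            /* BPF_F_ALLOW_MULTI lets this prog co-exist with others on
             * the same hook, just like link-based attachments do. Unlike
             * a bpf_link, though, any process with the right capability
             * can later remove it with BPF_PROG_DETACH. */
            return bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_EGRESS,
                                   BPF_F_ALLOW_MULTI);
    }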
On Mon, Mar 30, 2020 at 5:57 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> [...]
>
> That is my point. You are restricting what root can do, and people will
> not want to resort to killing random processes trying to find the one
> holding a reference. This is an essential missing piece and should go in
> at the same time as this set.

No need to kill random processes; you can kill only those that hold a
bpf_link FD. You can find them using the drgn tool with a script like [0].
It already gives you quite a lot of information. It should also find
pinned bpf_links, but I haven't added that yet.

Found total 11 bpf_links.
-------------------------------------------------
type: tracing
prog: 'test1' id:223 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test2' id:224 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test3' id:225 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test4' id:226 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test5' id:227 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test6' id:228 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: raw_tp
prog: '' id:237 type:BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
tp: bpf_test_finish
pids: 449462
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3/cg4
pids: 449881
-------------------------------------------------

[0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
>
> [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7

#!/usr/bin/env drgn

where do I find drgn? Google search is not turning up anything relevant.
On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> > [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

Right, sorry, there were announcements and discussions about drgn on the
mailing list before, so I wrongly assumed people were aware. It's here:

https://github.com/osandov/drgn

There is another BPF-related tool that shows freplace'd BPF programs
attached to their original program (which I stole some parts of for my
script - thanks, Andrey Ignatov!):

https://github.com/osandov/drgn/blob/master/tools/bpf_inspect.py
On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> > [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

It's the link I mentioned in this thread just a few emails ago:
"Also there is Documentation/bpf/drgn.rst"
On 31/03/2020 04:54, Andrii Nakryiko wrote:
> No need to kill random processes; you can kill only those that hold a
> bpf_link FD. You can find them using the drgn tool with a script like [0].

For the record, I find the argument "we don't need a query feature,
because you can just use a kernel debugger" *utterly* *horrifying*.
Now, it seems to be moot, because Alexei has given other, better
reasons why query doesn't need to land yet; but can we please not
ever treat debugging interfaces as a substitute for proper APIs?

</scream>
-ed
On 3/31/20 3:51 PM, Edward Cree wrote:
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
> > No need to kill random processes; you can kill only those that hold a
> > bpf_link FD. You can find them using the drgn tool with a script like [0].
> For the record, I find the argument "we don't need a query feature,
> because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
> reasons why query doesn't need to land yet; but can we please not
> ever treat debugging interfaces as a substitute for proper APIs?
>
> </scream>
> -ed

I was just about to send the same sentiment. Dev packages and processing
/proc/kcore are not a proper observability API for production systems.
On Tue, Mar 31, 2020 at 2:52 PM Edward Cree <ecree@solarflare.com> wrote:
>
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
> > No need to kill random processes; you can kill only those that hold a
> > bpf_link FD. You can find them using the drgn tool with a script like [0].
> For the record, I find the argument "we don't need a query feature,
> because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
> reasons why query doesn't need to land yet; but can we please not
> ever treat debugging interfaces as a substitute for proper APIs?

Can you please point out where I was objecting to an observability API
(which is the LINK_QUERY thing we've discussed, which I didn't oppose and
which I'm going to add next)?

What I'm doubtful of is this "human override" functionality. I think a
tool that shows who's using a given bpf_link (processes and mounted files
in BPF FS) is way more useful, because it allows you to both "unblock" the
BPF hook (by killing "bad" processes and removing mounted bpf_link files)
and to know which processes (read: applications) are misbehaving.

I'll address the drgn-vs-proper-API concern in my reply to David Ahern,
who is also *utterly horrified*, apparently, so I'll try to calm him as
well. ;)

> </scream>
> -ed
On Tue, Mar 31, 2020 at 3:44 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/31/20 3:51 PM, Edward Cree wrote:
> > On 31/03/2020 04:54, Andrii Nakryiko wrote:
> > > No need to kill random processes; you can kill only those that hold a
> > > bpf_link FD. You can find them using the drgn tool with a script like [0].
> > For the record, I find the argument "we don't need a query feature,
> > because you can just use a kernel debugger" *utterly* *horrifying*.
> > Now, it seems to be moot, because Alexei has given other, better
> > reasons why query doesn't need to land yet; but can we please not
> > ever treat debugging interfaces as a substitute for proper APIs?
> >
> > </scream>
> > -ed
>
> I was just about to send the same sentiment. Dev packages and processing
> /proc/kcore are not a proper observability API for production systems.

I'm not against observability. LINK_QUERY is going to be added. I'm also
looking into making bpf_link a "lookup-able by ID" object, similar to
bpf_map and bpf_prog, which will make it easy to say "show me all the BPF
attachments in the system" - something that is impossible to do right now,
btw.

As for drgn and /proc/kcore: drgn is an awesome tool for observing lots of
internal kernel state that is impractical to expose through stable APIs.
But you don't have to use it to get the same effect. The problem that
script is solving is showing all the processes that have an open FD to a
bpf_link file. This is the same problem the fuser command solves for
normal files, and the solution is similar: fuser does it by iterating over
all processes and their FDs in procfs. Not the most efficient way, but it
works. Here's what you can get for a cgroup bpf_link file with my last
patch set already:

# cat /proc/1366584/fdinfo/14
pos:    0
flags:  02000000
mnt_id: 14
link_type:      cgroup
prog_tag:       9ad187367cf2b9e8
prog_id:        1649

We can extend that information further with relevant details. This is a
good and bigger discussion for the LINK_QUERY API as well, given that it
and fdinfo might be treated as two ways to get the same information. This
is one reason I didn't do it for cgroup bpf_link; there are already enough
related discussions to keep us all involved for more than a week now. But
it would be nice to start discussing and figuring out these relevant
details, instead of being horrified and terrified, and spreading FUD - or
inventing ways to violate good properties of bpf_link (e.g., by forceful
nuking) due to theoretical worries about the need to detach a bpf_link
without finding the application or pinned file that holds it.

As Alexei mentioned, what's there already (raw_tp, tracing, and now cgroup
bpf_links) is no worse than what we had before. By the time we get to XDP
bpf_link, we'll have even more observability capabilities.
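The fuser-style procfs scan described above can be done without drgn at
all. A self-contained sketch that walks /proc/<pid>/fdinfo and reports FDs
whose fdinfo carries a "link_type:" line (i.e., open bpf_link FDs); error
handling is kept minimal:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            DIR *proc = opendir("/proc");
            struct dirent *pe;

            if (!proc)
                    return 1;
            while ((pe = readdir(proc))) {
                    char fd_dir[300];
                    struct dirent *fe;
                    DIR *fds;

                    if (atoi(pe->d_name) <= 0)
                            continue; /* not a PID entry */
                    snprintf(fd_dir, sizeof(fd_dir), "/proc/%s/fdinfo",
                             pe->d_name);
                    fds = opendir(fd_dir);
                    if (!fds)
                            continue;
                    while ((fe = readdir(fds))) {
                            char path[600], line[256];
                            FILE *f;

                            if (fe->d_name[0] == '.')
                                    continue;
                            snprintf(path, sizeof(path), "%s/%s",
                                     fd_dir, fe->d_name);
                            f = fopen(path, "r");
                            if (!f)
                                    continue;
                            /* a bpf_link FD's fdinfo contains a
                             * "link_type:" line, per the patch set */
                            while (fgets(line, sizeof(line), f))
                                    if (!strncmp(line, "link_type:", 10))
                                            printf("pid %s fd %s: %s",
                                                   pe->d_name, fe->d_name,
                                                   line);
                            fclose(f);
                    }
                    closedir(fds);
            }
            closedir(proc);
            return 0;
    }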
On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> > [...]
>
> No. That's not the case at all. You misunderstood the concept.

I don't think so ...

> > That is my point. You are restricting what root can do, and people will
> > not want to resort to killing random processes trying to find the one
> > holding a reference.
>
> Not true either.
> bpf_link = old attach with allow_multi (but with extra safety for the owner)

cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
That's a year for tools like 'ip vrf exec' to exist and be relied on.
'ip vrf exec' does not use MULTI.

I have not done a deep dive on systemd code, but on an ubuntu 18.04 system:

$ sudo ~/bin/bpftool cgroup tree
CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup/unified/system.slice/systemd-udevd.service
    5        ingress
    4        egress
/sys/fs/cgroup/unified/system.slice/systemd-journald.service
    3        ingress
    2        egress
/sys/fs/cgroup/unified/system.slice/systemd-logind.service
    7        ingress
    6        egress

this suggests that multi is not common with systemd either, at some point
in its path, so 'ip vrf exec' is not alone in not using the flag. There
most likely are many other tools.

> The only thing bpf_link does is protect the owner of the link from having
> other processes nuke that link. It does _not_ prevent other processes from
> attaching their own cgroup-bpf progs, either via the old interface or via
> bpf_link.

It does when that older code does not use the MULTI flag. There is a
history that is going to create conflicts, and being able to identify
which program holds the bpf_link is essential.

And this is really just one use case. There are many other reasons for
wanting to know what process is holding a reference to something.
On Tue, Mar 31, 2020 at 07:42:46PM -0600, David Ahern wrote:
> On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> > [...]
> > Not true either.
> > bpf_link = old attach with allow_multi (but with extra safety for the owner)
>
> cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
> That's a year for tools like 'ip vrf exec' to exist and be relied on.
> 'ip vrf exec' does not use MULTI.
>
> I have not done a deep dive on systemd code, but on an ubuntu 18.04 system:
>
> [...]
>
> this suggests that multi is not common with systemd either, at some point
> in its path, so 'ip vrf exec' is not alone in not using the flag. There
> most likely are many other tools.

Please take a look at the systemd source code:
src/core/bpf-devices.c
src/core/bpf-firewall.c

It prefers to use BPF_F_ALLOW_MULTI when possible, since it's the most
sensible flag. Since 'ip vrf exec' is not using allow_multi, it's breaking
several systemd features (regardless of what bpf_link can and cannot do).

> > The only thing bpf_link does is protect the owner of the link from having
> > other processes nuke that link. It does _not_ prevent other processes from
> > attaching their own cgroup-bpf progs, either via the old interface or via
> > bpf_link.
>
> It does when that older code does not use the MULTI flag. There is a
> history that is going to create conflicts, and being able to identify
> which program holds the bpf_link is essential.
>
> And this is really just one use case. There are many other reasons for
> wanting to know what process is holding a reference to something.

I'm not disagreeing that it's useful to query what is attached where. My
point, once again, is that bpf_link for cgroup didn't change a single bit
in this logic. There are processes (like systemd) that are using
allow_multi. When they switch to using bpf_link a few years from now,
nothing will change for all other processes in the system. Only systemd
will be assured that their bpf-device prog will not be accidentally
removed by 'ip vrf'.

Currently nothing protects systemd's bpf progs. Any cap_net_admin process
can _accidentally_ nuke them. It's even weirder that bpf-cgroup-device,
which systemd is using, is under cap_net_admin; there is nothing
networking about it. But that's a separate discussion.

Maybe you should fix 'ip vrf' first, before systemd folks start yelling,
and then we can continue arguing about the merits of observability?
On 01/04/2020 01:22, Andrii Nakryiko wrote:
> Can you please point out where I was objecting to an observability API
> (which is the LINK_QUERY thing we've discussed, which I didn't oppose and
> which I'm going to add next)?

I didn't say you objected to it. I just said that you argued that it was
OK for it to not land in the same release as the rest of the API, because
drgn could paper over that gap. Which seems to me to signify a dangerous
way of thinking, and I wanted to raise the alarm about that.

(If you _don't_ see what's wrong with that argument... well, that'd be
even _more_ alarming. Debuggers — and fuser, for that matter — are for
when things go wrong _in ways the designers of the system failed to
anticipate_. They should not be part of a 'normal' workflow for dealing
with problems that we already _know_ are possible; it's kinda-sorta like
how exceptions shouldn't be used for non-exceptional situations.)

-ed
On Wed, Apr 1, 2020 at 7:26 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 01/04/2020 01:22, Andrii Nakryiko wrote:
> > Can you please point out where I was objecting to an observability API
> > (which is the LINK_QUERY thing we've discussed, which I didn't oppose and
> > which I'm going to add next)?
>
> I didn't say you objected to it. I just said that you argued that it was
> OK for it to not land in the same release as the rest of the API, because
> drgn could paper over that gap. Which seems to me to signify a dangerous
> way of thinking, and I wanted to raise the alarm about that.

Let's have a bit more context first. BTW, please don't trim the chain of
replies (at least not so aggressively) next time.

>>> That is my point. You are restricting what root can do, and people will
>>> not want to resort to killing random processes trying to find the one
>>> holding a reference. This is an essential missing piece and should go in
>>> at the same time as this set.
>> No need to kill random processes; you can kill only those that hold a
>> bpf_link FD. You can find them using the drgn tool with a script like [0].
> For the record, I find the argument "we don't need a query feature,
> because you can just use a kernel debugger" *utterly* *horrifying*.

I was addressing the concern of having to randomly kill processes, which
is a "human override" thing, not really an observability one. And my
response is exactly that it's better to be able to see all owners of a
bpf_link than to have a "nuke" option.

It so happens that drgn is a very useful tool for this, and I was able to
prototype such a script quickly enough to share it with others. drgn is
not the only option; you can do the same by iterating procfs and using
fdinfo. It can certainly be improved for the bpf_link-specific case by
providing more information (like cgroup path, etc.). But in general it's
a question of "which processes use a given file", for which,
unfortunately, I don't think Linux has better observability APIs. I'm not
happy about that, but it's not really a bpf_link-specific issue either.

> (If you _don't_ see what's wrong with that argument... well, that'd be
> even _more_ alarming. Debuggers — and fuser, for that matter — are for
> when things go wrong _in ways the designers of the system failed to
> anticipate_. They should not be part of a 'normal' workflow for dealing
> with problems that we already _know_ are possible; it's kinda-sorta like
> how exceptions shouldn't be used for non-exceptional situations.)

I don't think it's quite as black and white as you are stating. For
instance, I *think* lsof, netstat, bcc tools, etc. disagree with you.
Also, needing to kill an application because it's attached to XDP or a
cgroup doesn't seem like a "normal" workflow either. But I'd really
rather leave it at that, if you don't mind.

> -ed