Message ID | 1486171342-3310547-1-git-send-email-ast@fb.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
> Note that all bpf programs types are global.
I don't think this has a clear enough meaning to work with. In
particular, I think that, if you have some software that installs
cgroup+bpf programs and you run it in a container, then I have no idea
what "global" means in this context, and you may run into trouble with
this patch once namespace ids become migratable because the cgroup+bpf
program in the container would potentially see dev+ino numbers from
*outside* the container. What happens when you migrate it?
I think that this patch plus a minor change to prevent installing
cgroup+bpf programs if the installer isn't in the init netns + fs ns
would work because it would allow new, migratable semantics to be
added down the road to relax the restriction.
Eric, what do you think?
On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote: > On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote: > > Note that all bpf programs types are global. > > I don't think this has a clear enough meaning to work with. In Please clarify what you mean. The quoted part says "bpf programs are global". What is not "clear enough" there? > I think that this patch plus a minor change to prevent installing > cgroup+bpf programs if the installer isn't in the init netns + fs ns > would work because it would allow new, migratable semantics to be > added down the road to relax the restriction. Forcing installer to be in init netns is not acceptable to David who added cgroup_sock in the first place. I'm not sure why we have to discuss that bit in circles.
On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote: >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote: >> > Note that all bpf programs types are global. >> >> I don't think this has a clear enough meaning to work with. In > > Please clarify what you mean. The quoted part says > "bpf programs are global". What is not "clear enough" there? What does "bpf programs are global" mean? I am genuinely unable to figure out what you mean. Here are some example guesses of what you might mean: - BPF programs are compiled independently of a namespace. This is certainly true, but I don't think it matters. - You want BPF programs to affect everything on the system. But this doesn't seem right to be -- they only affect things in the relevant cgroup, so they're not global in that sense. - You want BPF programs to affect everything in their cgroup regardless of namespace. This does seem to be what you think, but it doesn't say *why*, which is the relevant bit. - The set of BPF program types and the verification rules are independent of cgroup and namespace. This is true, but I don't think it matters. That's all I came up with. So, I'll repeat: what does "bpf programs are global" mean? > >> I think that this patch plus a minor change to prevent installing >> cgroup+bpf programs if the installer isn't in the init netns + fs ns >> would work because it would allow new, migratable semantics to be >> added down the road to relax the restriction. > > Forcing installer to be in init netns is not acceptable to David > who added cgroup_sock in the first place. I'm not sure why > we have to discuss that bit in circles. > Because we're one week or so from 4.10 final, the 4.10-rc code is problematic even for ip vrf, and there isn't a clear solution yet. There are a bunch of requirements that seem to conflict, and something has to give. --Andy
On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote: > On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote: > >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote: > >> > Note that all bpf programs types are global. > >> > >> I don't think this has a clear enough meaning to work with. In > > > > Please clarify what you mean. The quoted part says > > "bpf programs are global". What is not "clear enough" there? > > What does "bpf programs are global" mean? I am genuinely unable to > figure out what you mean. Here are some example guesses of what you > might mean: > > - BPF programs are compiled independently of a namespace. This is > certainly true, but I don't think it matters. > > - You want BPF programs to affect everything on the system. But this > doesn't seem right to be -- they only affect things in the relevant > cgroup, so they're not global in that sense. All bpf program types are global in the sense that you can make all of them to operate across all possible scopes and namespaces. cgroup only gives a scope for the program to run, but it's not limited by it. The user can have the same program attached to two or more different cgroups, so one program will run across multiple cgroups. > - The set of BPF program types and the verification rules are > independent of cgroup and namespace. This is true, but I don't think > it matters. It matters. That's actually the key to understand. The loading part verifies correctness for particular program type. Afterwards the same program can be attached to any place. Including different cgroups and different namespaces. The 'attach' part is like 'switch on' that enables program on particular hook. The scope (whether it's socket or netdev or cgroup) is a scope that program author uses to narrow down the hook, but it's not an ultimate restriction. For example the socket program can be attached to sockets and share information with cls_bpf program attached to netdev. The kprobe tracing program can peek into kernel internal data and share it with cls_bpf or any other type as long as everything is root. The information flow is global to the whole system. > Because we're one week or so from 4.10 final, the 4.10-rc code is > problematic even for ip vrf, and there isn't a clear solution yet. > There are a bunch of requirements that seem to conflict, and something > has to give. let's go back to the beginning: - you've identified a 'malfunction' in ip vrf. It's valid one. Thank you. - can it be fixed without kernel changes ? Yes. David offered to do so. - can kernel make it easier to address? Yes. I posted v1. - proposed sk->netns_inum v1 patch together with posted iproute2 change addresses the 'malfunction' ? Yes, but Eric didn't like inode only. All fair. - proposed bpf_sk_netns_id() v3 patch together with upcoming iproute2 change will address it? Yes. What's not to like? There were several clear solutions. The last one seems to be the best. And the sooner it lands the faster I can add override disable flag.
On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote: >> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote: >> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote: >> >> > Note that all bpf programs types are global. >> >> >> >> I don't think this has a clear enough meaning to work with. In >> > >> > Please clarify what you mean. The quoted part says >> > "bpf programs are global". What is not "clear enough" there? >> >> What does "bpf programs are global" mean? I am genuinely unable to >> figure out what you mean. Here are some example guesses of what you >> might mean: >> >> - BPF programs are compiled independently of a namespace. This is >> certainly true, but I don't think it matters. >> >> - You want BPF programs to affect everything on the system. But this >> doesn't seem right to be -- they only affect things in the relevant >> cgroup, so they're not global in that sense. > > All bpf program types are global in the sense that you can > make all of them to operate across all possible scopes and namespaces. I still don't understand what you mean here. A seccomp program runs in the process that installs it and children -- it does not run in "all possible scopes". A socket filter runs on a single socket and therefore runs in a single netns. So presumably I'm still misunderstanding you > cgroup only gives a scope for the program to run, but it's > not limited by it. The user can have the same program > attached to two or more different cgroups, so one program > will run across multiple cgroups. Does this mean "BPF programs are compiled independently of a namespace?" If so, I don't see why it's relevant at all. Sure, you could compile a BPF program once and install it in two different scopes, but that doesn't mean that the kernel should *run* it globally in any sense. Can you clarify? > >> - The set of BPF program types and the verification rules are >> independent of cgroup and namespace. This is true, but I don't think >> it matters. > > It matters. That's actually the key to understand. The loading part > verifies correctness for particular program type. > Afterwards the same program can be attached to any place. > Including different cgroups and different namespaces. > The 'attach' part is like 'switch on' that enables program > on particular hook. The scope (whether it's socket or netdev or cgroup) > is a scope that program author uses to narrow down the hook, > but it's not an ultimate restriction. > For example the socket program can be attached to sockets and > share information with cls_bpf program attached to netdev. > The kprobe tracing program can peek into kernel internal data > and share it with cls_bpf or any other type as long as > everything is root. The information flow is global to the whole system. Why does any of this imply that a cgroup+bpf program that is attached once should run for all network namespaces? > >> Because we're one week or so from 4.10 final, the 4.10-rc code is >> problematic even for ip vrf, and there isn't a clear solution yet. >> There are a bunch of requirements that seem to conflict, and something >> has to give. > > let's go back to the beginning: > - you've identified a 'malfunction' in ip vrf. It's valid one. Thank you. > - can it be fixed without kernel changes ? Yes. David offered to do so. He has (I think) a somewhat kludgey fix that gets the "ip netns" case right but not the "unshare -n" case. I think the latter can't be fixed without kernel changes.
On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote: > On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote: > >> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov > >> <alexei.starovoitov@gmail.com> wrote: > >> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote: > >> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote: > >> >> > Note that all bpf programs types are global. > >> >> > >> >> I don't think this has a clear enough meaning to work with. In > >> > > >> > Please clarify what you mean. The quoted part says > >> > "bpf programs are global". What is not "clear enough" there? > >> > >> What does "bpf programs are global" mean? I am genuinely unable to > >> figure out what you mean. Here are some example guesses of what you > >> might mean: > >> > >> - BPF programs are compiled independently of a namespace. This is > >> certainly true, but I don't think it matters. > >> > >> - You want BPF programs to affect everything on the system. But this > >> doesn't seem right to be -- they only affect things in the relevant > >> cgroup, so they're not global in that sense. > > > > All bpf program types are global in the sense that you can > > make all of them to operate across all possible scopes and namespaces. > > I still don't understand what you mean here. A seccomp program runs > in the process that installs it and children -- it does not run in > "all possible scopes". seccomp and classic bpf is different, since there is no concept of the program there. cbpf is a stateless set of instructions that belong to some entity like seccomp or socket. ebpf is stateful and starts with the program, then hook and then scope. > A socket filter runs on a single socket and > therefore runs in a single netns. So presumably I'm still > misunderstanding you in classic - yes. ebpf can have the same program attached to multiple sockets in different netns. For classic - the object is the socket and the user can only manipulate that socket. For extended - the object is the program and it can exist on its own. Like the program can be pinned in bpffs while it's not attached to any hook. For classic bpf the ideas of CRIU naturally apply, since it checkpoints the socket and it happens that socket has a set of statless cbpf instructions within. So it's expected to save/restore cbpf as part of socket save/restore. In case of ebpf the program exists independently of the socket. Doing save/restore of the ebpf program attached to a socket is meaningless, since it could be pinned in bpffs, attached to other sockets, has state in bpf maps, some other process might be actively talking to that program and so on. ebpf is a graph of interconnected pieces. To criu such thing one really need to freeze the whole system, all of it processes, and stop the kernel. I don't see criu ever be applied to ebpf. > > cgroup only gives a scope for the program to run, but it's > > not limited by it. The user can have the same program > > attached to two or more different cgroups, so one program > > will run across multiple cgroups. > > Does this mean "BPF programs are compiled independently of a > namespace?" If so, I don't see why it's relevant at all. Sure, you > could compile a BPF program once and install it in two different > scopes, but that doesn't mean that the kernel should *run* it globally > in any sense. Can you clarify? if single program is attached to different sockets it exactly means that the kernel should run it for different sockets :) > >> - The set of BPF program types and the verification rules are > >> independent of cgroup and namespace. This is true, but I don't think > >> it matters. > > > > It matters. That's actually the key to understand. The loading part > > verifies correctness for particular program type. > > Afterwards the same program can be attached to any place. > > Including different cgroups and different namespaces. > > The 'attach' part is like 'switch on' that enables program > > on particular hook. The scope (whether it's socket or netdev or cgroup) > > is a scope that program author uses to narrow down the hook, > > but it's not an ultimate restriction. > > For example the socket program can be attached to sockets and > > share information with cls_bpf program attached to netdev. > > The kprobe tracing program can peek into kernel internal data > > and share it with cls_bpf or any other type as long as > > everything is root. The information flow is global to the whole system. > > Why does any of this imply that a cgroup+bpf program that is attached > once should run for all network namespaces? because cgroup is the only scope that is being used to scope this particular bpf_type_cgroup_sock program type. In the future we may add bpf_type_netns_sock program type which will use exactly the same sock create hook, but will be scoped by netns. It will be completely independent of any cgroups. The event of socket creation by user process within given netns will trigger a call to attached bpf program for that netns. All other netns (where program is not attached) won't see the program called. But the user will still be able to load single program and attach it to two netns-es. Potentially we can also allow a combination of scopes where single bpf program is attached to a hook while scoped by both netns and cgroup at the same time. I see it as an extension to prog_attach command where user will specify two fds (one for cgroup and one for netns) to make given prog_fd program scoped by both. Nothing in the current api prevents such future extensions. And from the other thread: > I'm not saying that at all. I'm saying that this use case sounds > valid, but maybe it could be solved differently. Here are some ideas: I'm happy to discuss api extension ideas as long as I don't hear a week later that they should be done for 4.10 > - Expose the actual cgroup (relative to the hooked cgroup) to the BPF > program. Then you could parse the headers, check what cgroup you're > in, and proceed accordingly. This could potentially be even faster > for your use case if done carefully because it will stress the > instruction cache less. that was considered. we thought about exposing cgroup inode or some form of relative cgroup id and use that for lookup, but for close to zero overhead network monitoring that extra lookup is quite costly, but this idea is still on the table. We might use it together with cls_bpf program type. > - Have a (non-default) flag that says "overridable". The effect is > that, if /A and /A/B both have overridable programs attached, then > sockets in /A/B don't run /A's program. If, however, /A has a > non-overridable program and /A/B has an overridable program, then > sockets in /A/B run both programs. IOW overridable programs override > other overridable programs, but non-overridable programs never > override anything and are never overridden by anything. that sounds a bit complex to maintain of the kernel side. Is this for the case where sandboxing daemon is using non-overridable and it still wants to allow inner daemon to do its monitoring? I guess that's useful. I don't mind adding something like this in the future. > Suppose someone wants to make CGROUP_BPF work right when a container > and a container manager both use it. It'll have to run both programs. > What would the semantics be if this were to be added? This is really > a question, not an indictment of your approach. For all I know, > you're proposing exactly what I suggested above. sounds like your proposed approach to let kernel keep two sets of overridable and non-overridable programs can work. Need to think about it more. There are things to resolve there. Like there gotta be a handle or some cookie that user will give to the kernel to identify particular program within a list. Otherwise i don't see how the user can detach the program later or even query whether it's attached already or not. We had a case when user daemon was restarting after an upgrade and was attaching cls_bpf to tc. After few weeks every packet was going through a ton of bpf programs. So I like single program being default, since the user doing a chain of progs would have to think about handles/cookies beforehand and would have to enable it explicitly with the flag. > And sandboxing needn't, and won't, wait until unprivileged bpf > programs are added. Plenty of sandboxes run as root. great. Last time I heard the unprivileged progs were positioned as must have for landlock. I'm happy to stay root only for now.
On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote: >> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote: >> >> What does "bpf programs are global" mean? I am genuinely unable to >> >> figure out what you mean. Here are some example guesses of what you >> >> might mean: >> >> >> >> - BPF programs are compiled independently of a namespace. This is >> >> certainly true, but I don't think it matters. >> >> >> >> - You want BPF programs to affect everything on the system. But this >> >> doesn't seem right to be -- they only affect things in the relevant >> >> cgroup, so they're not global in that sense. >> > >> > All bpf program types are global in the sense that you can >> > make all of them to operate across all possible scopes and namespaces. >> >> I still don't understand what you mean here. A seccomp program runs >> in the process that installs it and children -- it does not run in >> "all possible scopes". > > seccomp and classic bpf is different, since there is no concept > of the program there. cbpf is a stateless set of instructions > that belong to some entity like seccomp or socket. ebpf is stateful > and starts with the program, then hook and then scope. So... are you saying that a loaded eBPF object is global in the sense that if you attach the same object to more than one thing (two sockets, say), the *same* program runs and shares its state? If so, I agree, but that's still not an argument that the *same* attachment of an eBPF program to a cgroup should run in multiple network namespaces. You could also attach the (same) program once per netns and its state would be shared. I'm pretty sure I've never suggested that an eBPF program be bound to a namespace. I just think that a lot of things get more straightforward if an *attachment* of an eBPF program to a cgroup is bound to a single namespace. > >> A socket filter runs on a single socket and >> therefore runs in a single netns. So presumably I'm still >> misunderstanding you > > in classic - yes. ebpf can have the same program attached to > multiple sockets in different netns. > For classic - the object is the socket and the user can only > manipulate that socket. For extended - the object is the program > and it can exist on its own. Like the program can be pinned in bpffs > while it's not attached to any hook. > For classic bpf the ideas of CRIU naturally apply, since > it checkpoints the socket and it happens that socket has > a set of statless cbpf instructions within. So it's > expected to save/restore cbpf as part of socket save/restore. > In case of ebpf the program exists independently of the socket. True. > Doing save/restore of the ebpf program attached to a socket > is meaningless, since it could be pinned in bpffs, attached > to other sockets, has state in bpf maps, some other process > might be actively talking to that program and so on. > ebpf is a graph of interconnected pieces. To criu such thing > one really need to freeze the whole system, all of it processes, > and stop the kernel. I don't see criu ever be applied to ebpf. Not true, I think. CRIU could figure out which eBPF program is attached to a socket and snapshot the attachment and (separately) the eBPF object. If the eBPF object were to be attached to two sockets, CRIU would notice and only snaptshot the eBPF program once. I don't see why the whole system would need to be snapshotted. Obviously this code isn't written yet. But if eBPF were to be widely used for, say, seccomp, I bet the CRIU code would show up in short order. > > Potentially we can also allow a combination of scopes where > single bpf program is attached to a hook while scoped by > both netns and cgroup at the same time. > I see it as an extension to prog_attach command where > user will specify two fds (one for cgroup and one for netns) to make > given prog_fd program scoped by both. > Nothing in the current api prevents such future extensions. I think this might be sufficient for all usecases. Facebook's use case of monitoring everything (if I understood correctly) could be addressed by just attaching the program to the cgroup once for each namespace. The benefit would be a good deal of simplicity: you could relax the capable() call to ns_capable() in the future, you wouldn't need this patch, and you wouldn't be creating the first major non-netns-specific network hook in the kernel. > > And from the other thread: > >> I'm not saying that at all. I'm saying that this use case sounds >> valid, but maybe it could be solved differently. Here are some ideas: > > I'm happy to discuss api extension ideas as long as I don't hear > a week later that they should be done for 4.10 > >> - Expose the actual cgroup (relative to the hooked cgroup) to the BPF >> program. Then you could parse the headers, check what cgroup you're >> in, and proceed accordingly. This could potentially be even faster >> for your use case if done carefully because it will stress the >> instruction cache less. > > that was considered. we thought about exposing cgroup inode > or some form of relative cgroup id and use that for lookup, but for > close to zero overhead network monitoring that extra lookup > is quite costly, but this idea is still on the table. > We might use it together with cls_bpf program type. Another approach would be to let the attach call take an opaque parameter that's passed back to the eBPF program on each invocation. That would be a single mem-to-register copy, right? > >> - Have a (non-default) flag that says "overridable". The effect is >> that, if /A and /A/B both have overridable programs attached, then >> sockets in /A/B don't run /A's program. If, however, /A has a >> non-overridable program and /A/B has an overridable program, then >> sockets in /A/B run both programs. IOW overridable programs override >> other overridable programs, but non-overridable programs never >> override anything and are never overridden by anything. > > that sounds a bit complex to maintain of the kernel side. > Is this for the case where sandboxing daemon is using non-overridable > and it still wants to allow inner daemon to do its monitoring? > I guess that's useful. I don't mind adding something like this > in the future. > I'm just trying to come up with sane semantics to having non-overridable programs as well as overridable programs. It has to do *something* if you install both (or be disallowed). >> Suppose someone wants to make CGROUP_BPF work right when a container >> and a container manager both use it. It'll have to run both programs. >> What would the semantics be if this were to be added? This is really >> a question, not an indictment of your approach. For all I know, >> you're proposing exactly what I suggested above. > > sounds like your proposed approach to let kernel keep two sets > of overridable and non-overridable programs can work. > Need to think about it more. There are things to resolve there. > Like there gotta be a handle or some cookie that user will > give to the kernel to identify particular program within a list. > Otherwise i don't see how the user can detach the program later > or even query whether it's attached already or not. I thought your plan was to allow at most one program per (cgroup, hook type), in which case this isn't a problem, right? If you call detach, you detach the one and only program that's attached. > We had a case when user daemon was restarting after an upgrade > and was attaching cls_bpf to tc. After few weeks every packet > was going through a ton of bpf programs. So I like single > program being default, since the user doing a chain of progs > would have to think about handles/cookies beforehand > and would have to enable it explicitly with the flag. > >> And sandboxing needn't, and won't, wait until unprivileged bpf >> programs are added. Plenty of sandboxes run as root. > > great. Last time I heard the unprivileged progs were positioned as > must have for landlock. I'm happy to stay root only for now. > I think that letting sandboxing features work unprivileged is extremely valuable. Just because cgroup+bpf is root only for now doesn't mean it won't be used for sandboxing, though.
On Mon, Feb 06, 2017 at 06:57:45PM -0800, Andy Lutomirski wrote: > On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote: > >> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov > >> <alexei.starovoitov@gmail.com> wrote: > >> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote: > >> >> What does "bpf programs are global" mean? I am genuinely unable to > >> >> figure out what you mean. Here are some example guesses of what you > >> >> might mean: > >> >> > >> >> - BPF programs are compiled independently of a namespace. This is > >> >> certainly true, but I don't think it matters. > >> >> > >> >> - You want BPF programs to affect everything on the system. But this > >> >> doesn't seem right to be -- they only affect things in the relevant > >> >> cgroup, so they're not global in that sense. > >> > > >> > All bpf program types are global in the sense that you can > >> > make all of them to operate across all possible scopes and namespaces. > >> > >> I still don't understand what you mean here. A seccomp program runs > >> in the process that installs it and children -- it does not run in > >> "all possible scopes". > > > > seccomp and classic bpf is different, since there is no concept > > of the program there. cbpf is a stateless set of instructions > > that belong to some entity like seccomp or socket. ebpf is stateful > > and starts with the program, then hook and then scope. > > So... are you saying that a loaded eBPF object is global in the sense > that if you attach the same object to more than one thing (two > sockets, say), the *same* program runs and shares its state? If so, I > agree, but that's still not an argument that the *same* attachment of > an eBPF program to a cgroup should run in multiple network namespaces. > You could also attach the (same) program once per netns and its state > would be shared. > > I'm pretty sure I've never suggested that an eBPF program be bound to > a namespace. I just think that a lot of things get more > straightforward if an *attachment* of an eBPF program to a cgroup is > bound to a single namespace. Thank you for this whole discussion over the last few months. Frankly in the beginning I wasn't 100% sure about the api we picked, now I'm completely convinced that we absolutely made the right choice. It's clean and keeps scoping constructs clear and explicit for the users. So I am proposing that in the future we should add the ability to scope bpf programs by netns. Just like current api scopes them by cgroup. The attachment must be explicit. Current api attaches type_cgroup_* program to a hook and scopes it by a given cgroup. At that time some apps within that cgroup may already run in different netns and attaching process may be in yet another netns. There is no way to have sane semantics without explicitly specifying the scope. Currently we do it by explicitly specifying the cgroup. In the future we need to extend it by specifying netns (without cgroup). Then for container technologies that are based on netns we'll have an efficient way of scoping programs to given netns. And when afnetns is finally ready, the same scoping approach will work for afnetns as well. For cases that need to have two scopes at the same time (like cgroup and netns) the bpf_sk_netns_id() helper will work one way and some future bpf_sk_cgroup_id() helper will work from the other side. So far in multi-scope cases one dimension is dominating. Like number of cgroups is way larger than number of netns, so explicit bpf_sk_netns_id() from inside the programs is faster than doing the same in the kernel. And if in the future there will be a case with a lot of cgroups and a lot of netns at the same time, we'll extend the api further to specify two scopes to bpf_prog_attach command. The kernel side will probably need a hashtable to lookup a bpf prog based on (cgroup, netns) pair of pointers. > >> A socket filter runs on a single socket and > >> therefore runs in a single netns. So presumably I'm still > >> misunderstanding you > > > > in classic - yes. ebpf can have the same program attached to > > multiple sockets in different netns. > > For classic - the object is the socket and the user can only > > manipulate that socket. For extended - the object is the program > > and it can exist on its own. Like the program can be pinned in bpffs > > while it's not attached to any hook. > > For classic bpf the ideas of CRIU naturally apply, since > > it checkpoints the socket and it happens that socket has > > a set of statless cbpf instructions within. So it's > > expected to save/restore cbpf as part of socket save/restore. > > In case of ebpf the program exists independently of the socket. > > True. > > > Doing save/restore of the ebpf program attached to a socket > > is meaningless, since it could be pinned in bpffs, attached > > to other sockets, has state in bpf maps, some other process > > might be actively talking to that program and so on. > > ebpf is a graph of interconnected pieces. To criu such thing > > one really need to freeze the whole system, all of it processes, > > and stop the kernel. I don't see criu ever be applied to ebpf. > > Not true, I think. CRIU could figure out which eBPF program is > attached to a socket and snapshot the attachment and (separately) the > eBPF object. If the eBPF object were to be attached to two sockets, > CRIU would notice and only snaptshot the eBPF program once. I don't that's C part of CRIU... and as soon as it goes to do R part the initial graph between programs, maps, and apps is now completely different and everything is broken. criu cannot save/restore a graph by saving/restoring one edge at a time while graph keeps changing. It needs to freeze the whole thing. > need this patch, and you wouldn't be creating the first major > non-netns-specific network hook in the kernel. first?! last time i checked two old cgroupv1 controllers ignore netns. nflog has an ability to log across netns. > >> - Expose the actual cgroup (relative to the hooked cgroup) to the BPF > >> program. Then you could parse the headers, check what cgroup you're > >> in, and proceed accordingly. This could potentially be even faster > >> for your use case if done carefully because it will stress the > >> instruction cache less. > > > > that was considered. we thought about exposing cgroup inode > > or some form of relative cgroup id and use that for lookup, but for > > close to zero overhead network monitoring that extra lookup > > is quite costly, but this idea is still on the table. > > We might use it together with cls_bpf program type. > > Another approach would be to let the attach call take an opaque > parameter that's passed back to the eBPF program on each invocation. > That would be a single mem-to-register copy, right? i'm not sure what you mean here. we discussed adding new types of relocations to bpf instruction stream. Today there is only BPF_PSEUDO_MAP_FD. We thought that netns and cgroup can be two other types. Similarly to maps the user process will open an fd to netns (or cgroup) and store it inside the program as BPF_PSEUDO_NETNS_FD. At load time the kernel will increment the refcnt. The problem with this approach is that bpf programs will suddenly hold the references to normal kernel objects and debugging of netns and cgroup destruction will become a nightmare. Thefore bpf_sk_netns_id() (and future bpf_sk_cgroup_id) approach is safer and simpler. > >> - Have a (non-default) flag that says "overridable". The effect is > >> that, if /A and /A/B both have overridable programs attached, then > >> sockets in /A/B don't run /A's program. If, however, /A has a > >> non-overridable program and /A/B has an overridable program, then > >> sockets in /A/B run both programs. IOW overridable programs override > >> other overridable programs, but non-overridable programs never > >> override anything and are never overridden by anything. > > > > that sounds a bit complex to maintain of the kernel side. > > Is this for the case where sandboxing daemon is using non-overridable > > and it still wants to allow inner daemon to do its monitoring? > > I guess that's useful. I don't mind adding something like this > > in the future. > > > > I'm just trying to come up with sane semantics to having > non-overridable programs as well as overridable programs. It has to > do *something* if you install both (or be disallowed). I think what you're proposing with two sets is fine. (if i understood the idea correctly). right now we have only overridable and adding non-overridable doesn't break anything as far as i can see. For simplicity we can do it with just two pointers (no chaining). Later can add priority and chain of programs. Which probably should be an array of programs to have good performance. I think your main concern all this time was that we won't be able to extend the api to suit security/sandboxing needs. I don't share that concern and I think this discussion outlined the work in that area for the next few years that we can incrementally do. > >> Suppose someone wants to make CGROUP_BPF work right when a container > >> and a container manager both use it. It'll have to run both programs. > >> What would the semantics be if this were to be added? This is really > >> a question, not an indictment of your approach. For all I know, > >> you're proposing exactly what I suggested above. > > > > sounds like your proposed approach to let kernel keep two sets > > of overridable and non-overridable programs can work. > > Need to think about it more. There are things to resolve there. > > Like there gotta be a handle or some cookie that user will > > give to the kernel to identify particular program within a list. > > Otherwise i don't see how the user can detach the program later > > or even query whether it's attached already or not. > > I thought your plan was to allow at most one program per (cgroup, hook > type), in which case this isn't a problem, right? If you call detach, > you detach the one and only program that's attached. yes. when there is only one program the detach is unambiguous. As soon as we have a chain the user will need a handle or priority or both. > > We had a case when user daemon was restarting after an upgrade > > and was attaching cls_bpf to tc. After few weeks every packet > > was going through a ton of bpf programs. So I like single > > program being default, since the user doing a chain of progs > > would have to think about handles/cookies beforehand > > and would have to enable it explicitly with the flag. > > > >> And sandboxing needn't, and won't, wait until unprivileged bpf > >> programs are added. Plenty of sandboxes run as root. > > > > great. Last time I heard the unprivileged progs were positioned as > > must have for landlock. I'm happy to stay root only for now. > > > > I think that letting sandboxing features work unprivileged is > extremely valuable. Just because cgroup+bpf is root only for now > doesn't mean it won't be used for sandboxing, though. agree.
diff --git a/fs/nsfs.c b/fs/nsfs.c index 8c9fb29c6673..1a604bccef86 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -49,6 +49,13 @@ static void nsfs_evict(struct inode *inode) ns->ops->put(ns); } +u64 proc_get_ns_devid_inum(struct ns_common *ns) +{ + u64 dev = new_encode_dev(nsfs_mnt->mnt_sb->s_dev); + + return (dev << 32) | ns->inum; +} + static void *__ns_get_path(struct path *path, struct ns_common *ns) { struct vfsmount *mnt = nsfs_mnt; diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index 12cb8bd81d2d..b567b021e652 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -48,7 +48,7 @@ extern int pid_ns_prepare_proc(struct pid_namespace *ns); extern void pid_ns_release_proc(struct pid_namespace *ns); extern int proc_alloc_inum(unsigned int *pino); extern void proc_free_inum(unsigned int inum); - +extern u64 proc_get_ns_devid_inum(struct ns_common *ns); #else /* CONFIG_PROC_FS */ static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; } @@ -61,6 +61,7 @@ static inline int proc_alloc_inum(unsigned int *inum) } static inline void proc_free_inum(unsigned int inum) {} +static u64 proc_get_ns_devid_inum(struct ns_common *ns) { return 0; } #endif /* CONFIG_PROC_FS */ static inline int ns_alloc_inum(struct ns_common *ns) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0eb0e87dbe9f..e5b8cf16cbaf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -430,6 +430,17 @@ union bpf_attr { * @xdp_md: pointer to xdp_md * @delta: An positive/negative integer to be added to xdp_md.data * Return: 0 on success or negative on error + * + * u64 bpf_sk_netns_id(sk) + * Returns unique value that identifies netns of given socket or skb. + * The upper 32-bits of the return value contain device id where namespace + * filesystem resides and lower 32-bits contain inode number within + * that filesystem. It's the same value as: + * struct stat st; + * stat("/proc/pid/ns/net", &st); + * return (st->st_dev << 32) | st->st_ino; + * @sk: pointer to struct sock or struct __sk_buff + * Return: filesystem's device id | netns inode */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -476,7 +487,8 @@ union bpf_attr { FN(set_hash_invalid), \ FN(get_numa_node_id), \ FN(skb_change_head), \ - FN(xdp_adjust_head), + FN(xdp_adjust_head), \ + FN(sk_netns_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index 1969b3f118c1..171f0a019705 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -52,6 +52,7 @@ #include <net/dst_metadata.h> #include <net/dst.h> #include <net/sock_reuseport.h> +#include <linux/proc_ns.h> /** * sk_filter_trim_cap - run a packet through a socket filter @@ -2597,6 +2598,34 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = { .arg5_type = ARG_CONST_STACK_SIZE, }; +BPF_CALL_1(bpf_sk_netns_id, struct sock *, sk) +{ + return proc_get_ns_devid_inum(&sock_net(sk)->ns); +} + +static const struct bpf_func_proto bpf_sk_netns_id_proto = { + .func = bpf_sk_netns_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +BPF_CALL_1(bpf_skb_netns_id, struct sk_buff *, skb) +{ + struct net_device *dev = skb->dev; + + if (!dev) + return 0; + return proc_get_ns_devid_inum(&dev_net(dev)->ns); +} + +static const struct bpf_func_proto bpf_skb_netns_id_proto = { + .func = bpf_skb_netns_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + static const struct bpf_func_proto * sk_filter_func_proto(enum bpf_func_id func_id) { @@ -2620,6 +2649,8 @@ sk_filter_func_proto(enum bpf_func_id func_id) case BPF_FUNC_trace_printk: if (capable(CAP_SYS_ADMIN)) return bpf_get_trace_printk_proto(); + case BPF_FUNC_sk_netns_id: + return &bpf_skb_netns_id_proto; default: return NULL; } @@ -2700,6 +2731,17 @@ xdp_func_proto(enum bpf_func_id func_id) } static const struct bpf_func_proto * +cg_sock_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_sk_netns_id: + return &bpf_sk_netns_id_proto; + default: + return sk_filter_func_proto(func_id); + } +} + +static const struct bpf_func_proto * cg_skb_func_proto(enum bpf_func_id func_id) { switch (func_id) { @@ -3255,7 +3297,7 @@ static const struct bpf_verifier_ops lwt_xmit_ops = { }; static const struct bpf_verifier_ops cg_sock_ops = { - .get_func_proto = sk_filter_func_proto, + .get_func_proto = cg_sock_func_proto, .is_valid_access = sock_filter_is_valid_access, .convert_ctx_access = sock_filter_convert_ctx_access, }; diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h index faaffe2e139a..971649483d62 100644 --- a/samples/bpf/bpf_helpers.h +++ b/samples/bpf/bpf_helpers.h @@ -94,6 +94,8 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) = (void *) BPF_FUNC_skb_under_cgroup; static int (*bpf_skb_change_head)(void *, int len, int flags) = (void *) BPF_FUNC_skb_change_head; +static unsigned long long (*bpf_sk_netns_id)(void *) = + (void *) BPF_FUNC_sk_netns_id; #if defined(__x86_64__) diff --git a/samples/bpf/sock_flags_kern.c b/samples/bpf/sock_flags_kern.c index 533dd11a6baa..f2507d7d847b 100644 --- a/samples/bpf/sock_flags_kern.c +++ b/samples/bpf/sock_flags_kern.c @@ -9,8 +9,10 @@ SEC("cgroup/sock1") int bpf_prog1(struct bpf_sock *sk) { char fmt[] = "socket: family %d type %d protocol %d\n"; + char fmt2[] = "socket: netns dev+inode %llx\n"; bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol); + bpf_trace_printk(fmt2, sizeof(fmt2), bpf_sk_netns_id(sk)); /* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets * ie., make ping6 fail diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c index ed18e9a4909c..73b62cc0d617 100644 --- a/samples/bpf/sockex1_kern.c +++ b/samples/bpf/sockex1_kern.c @@ -16,6 +16,7 @@ int bpf_prog1(struct __sk_buff *skb) { int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); long *value; + char fmt[] = "netns dev+inode %llx\n"; if (skb->pkt_type != PACKET_OUTGOING) return 0; @@ -24,6 +25,7 @@ int bpf_prog1(struct __sk_buff *skb) if (value) __sync_fetch_and_add(value, skb->len); + bpf_trace_printk(fmt, sizeof(fmt), bpf_sk_netns_id(skb)); return 0; } char _license[] SEC("license") = "GPL";
in cases where bpf programs are looking at sockets and packets that belong to different netns, it could be useful to get an id that uniquely identify a netns within the whole system. Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns unique value that identifies netns of given socket or dev_net(skb->dev) The upper 32-bits of the return value contain device id where namespace filesystem resides and lower 32-bits contain inode number within that filesystem. It's the same as struct stat st; stat("/proc/pid/ns/net", &st); return (st->st_dev << 32) | st->st_ino; For example to disallow raw sockets in all non-init netns the bpf_type_cgroup_sock program can do: if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f0000075) return 0; where 0x3f0000075 comes from combination of st_dev and st_ino of /proc/pid/ns/net Note that all bpf programs types are global. The same socket filter program can be attached to sockets in different netns, just like cls_bpf can see ingress/egress packets of multiple net_devices in different netns. The cgroup_bpf programs are the most exposed to sockets and devices across netns, but the need to identify netns applies to all. For example, if bpf_type_cgroup_skb didn't exist the system wide monitoring daemon could have used ld_preload mechanism and attached the same program to see traffic from applications across netns. Therefore make bpf_sk_netns_id() helper available to all network related bpf program types. For socket, cls_bpf and cgroup_skb programs this helper can be considered a new feature, whereas for cgroup_sock programs that modify sk->bound_dev_if (like 'ip vrf' does) it's a bug fix, since 'ip vrf' needs to be netns aware. Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- Eric, I'v added proc_get_ns_devid_inum() to nsfs.c right next to __ns_get_path(), so when it is time in the future to make nsfs more namespace aware, it will be easy to adjust both new_inode_pseudo(mnt->mnt_sb) line and proc_get_ns_devid_inum() I thought about using ns->stashed, but it's obviously transient inode and not usable. If later we decide to store dev_t into ns_common it will be fine as well. We'll just change proc_get_ns_devid_inum() without affecting user space. --- fs/nsfs.c | 7 +++++++ include/linux/proc_ns.h | 3 ++- include/uapi/linux/bpf.h | 14 +++++++++++++- net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++- samples/bpf/bpf_helpers.h | 2 ++ samples/bpf/sock_flags_kern.c | 2 ++ samples/bpf/sockex1_kern.c | 2 ++ 7 files changed, 71 insertions(+), 3 deletions(-)