diff mbox

[v2,net] bpf: add bpf_sk_netns_id() helper

Message ID 1486171342-3310547-1-git-send-email-ast@fb.com
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Alexei Starovoitov Feb. 4, 2017, 1:22 a.m. UTC
in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to get an id
that uniquely identify a netns within the whole system.

Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns
unique value that identifies netns of given socket or dev_net(skb->dev)
The upper 32-bits of the return value contain device id where namespace
filesystem resides and lower 32-bits contain inode number within that filesystem.
It's the same as
 struct stat st;
 stat("/proc/pid/ns/net", &st);
 return (st->st_dev << 32)  | st->st_ino;

For example to disallow raw sockets in all non-init netns
the bpf_type_cgroup_sock program can do:
if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f0000075)
  return 0;
where 0x3f0000075 comes from combination of st_dev and st_ino
of /proc/pid/ns/net

Note that all bpf programs types are global. The same socket filter
program can be attached to sockets in different netns,
just like cls_bpf can see ingress/egress packets of multiple
net_devices in different netns. The cgroup_bpf programs are
the most exposed to sockets and devices across netns,
but the need to identify netns applies to all.
For example, if bpf_type_cgroup_skb didn't exist the system wide
monitoring daemon could have used ld_preload mechanism and
attached the same program to see traffic from applications
across netns. Therefore make bpf_sk_netns_id() helper available
to all network related bpf program types.
For socket, cls_bpf and cgroup_skb programs this helper
can be considered a new feature, whereas for cgroup_sock
programs that modify sk->bound_dev_if (like 'ip vrf' does)
it's a bug fix, since 'ip vrf' needs to be netns aware.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
Eric, I'v added proc_get_ns_devid_inum() to nsfs.c
right next to __ns_get_path(), so when it is time in the future
to make nsfs more namespace aware, it will be easy to adjust
both new_inode_pseudo(mnt->mnt_sb) line and proc_get_ns_devid_inum()
I thought about using ns->stashed, but it's obviously transient
inode and not usable. If later we decide to store dev_t into ns_common
it will be fine as well. We'll just change proc_get_ns_devid_inum()
without affecting user space.
---
 fs/nsfs.c                     |  7 +++++++
 include/linux/proc_ns.h       |  3 ++-
 include/uapi/linux/bpf.h      | 14 +++++++++++++-
 net/core/filter.c             | 44 ++++++++++++++++++++++++++++++++++++++++++-
 samples/bpf/bpf_helpers.h     |  2 ++
 samples/bpf/sock_flags_kern.c |  2 ++
 samples/bpf/sockex1_kern.c    |  2 ++
 7 files changed, 71 insertions(+), 3 deletions(-)

Comments

Andy Lutomirski Feb. 4, 2017, 5:15 p.m. UTC | #1
On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
> Note that all bpf programs types are global.

I don't think this has a clear enough meaning to work with.  In
particular, I think that, if you have some software that installs
cgroup+bpf programs and you run it in a container, then I have no idea
what "global" means in this context, and you may run into trouble with
this patch once namespace ids become migratable because the cgroup+bpf
program in the container would potentially see dev+ino numbers from
*outside* the container.  What happens when you migrate it?

I think that this patch plus a minor change to prevent installing
cgroup+bpf programs if the installer isn't in the init netns + fs ns
would work because it would allow new, migratable semantics to be
added down the road to relax the restriction.

Eric, what do you think?
Alexei Starovoitov Feb. 5, 2017, 3:25 a.m. UTC | #2
On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
> > Note that all bpf programs types are global.
> 
> I don't think this has a clear enough meaning to work with.  In

Please clarify what you mean. The quoted part says
"bpf programs are global". What is not "clear enough" there?

> I think that this patch plus a minor change to prevent installing
> cgroup+bpf programs if the installer isn't in the init netns + fs ns
> would work because it would allow new, migratable semantics to be
> added down the road to relax the restriction.

Forcing installer to be in init netns is not acceptable to David
who added cgroup_sock in the first place. I'm not sure why
we have to discuss that bit in circles.
Andy Lutomirski Feb. 5, 2017, 3:33 a.m. UTC | #3
On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
>> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
>> > Note that all bpf programs types are global.
>>
>> I don't think this has a clear enough meaning to work with.  In
>
> Please clarify what you mean. The quoted part says
> "bpf programs are global". What is not "clear enough" there?

What does "bpf programs are global" mean?  I am genuinely unable to
figure out what you mean.  Here are some example guesses of what you
might mean:

 - BPF programs are compiled independently of a namespace.  This is
certainly true, but I don't think it matters.

 - You want BPF programs to affect everything on the system.  But this
doesn't seem right to be -- they only affect things in the relevant
cgroup, so they're not global in that sense.

 - You want BPF programs to affect everything in their cgroup
regardless of namespace.  This does seem to be what you think, but it
doesn't say *why*, which is the relevant bit.

 - The set of BPF program types and the verification rules are
independent of cgroup and namespace.  This is true, but I don't think
it matters.

That's all I came up with.  So, I'll repeat: what does "bpf programs
are global" mean?

>
>> I think that this patch plus a minor change to prevent installing
>> cgroup+bpf programs if the installer isn't in the init netns + fs ns
>> would work because it would allow new, migratable semantics to be
>> added down the road to relax the restriction.
>
> Forcing installer to be in init netns is not acceptable to David
> who added cgroup_sock in the first place. I'm not sure why
> we have to discuss that bit in circles.
>

Because we're one week or so from 4.10 final, the 4.10-rc code is
problematic even for ip vrf, and there isn't a clear solution yet.
There are a bunch of requirements that seem to conflict, and something
has to give.

--Andy
Alexei Starovoitov Feb. 5, 2017, 4:05 a.m. UTC | #4
On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
> >> > Note that all bpf programs types are global.
> >>
> >> I don't think this has a clear enough meaning to work with.  In
> >
> > Please clarify what you mean. The quoted part says
> > "bpf programs are global". What is not "clear enough" there?
> 
> What does "bpf programs are global" mean?  I am genuinely unable to
> figure out what you mean.  Here are some example guesses of what you
> might mean:
> 
>  - BPF programs are compiled independently of a namespace.  This is
> certainly true, but I don't think it matters.
> 
>  - You want BPF programs to affect everything on the system.  But this
> doesn't seem right to be -- they only affect things in the relevant
> cgroup, so they're not global in that sense.

All bpf program types are global in the sense that you can
make all of them to operate across all possible scopes and namespaces.
cgroup only gives a scope for the program to run, but it's
not limited by it. The user can have the same program
attached to two or more different cgroups, so one program
will run across multiple cgroups.

>  - The set of BPF program types and the verification rules are
> independent of cgroup and namespace.  This is true, but I don't think
> it matters.

It matters. That's actually the key to understand. The loading part
verifies correctness for particular program type.
Afterwards the same program can be attached to any place.
Including different cgroups and different namespaces.
The 'attach' part is like 'switch on' that enables program
on particular hook. The scope (whether it's socket or netdev or cgroup)
is a scope that program author uses to narrow down the hook,
but it's not an ultimate restriction.
For example the socket program can be attached to sockets and
share information with cls_bpf program attached to netdev.
The kprobe tracing program can peek into kernel internal data
and share it with cls_bpf or any other type as long as
everything is root. The information flow is global to the whole system.

> Because we're one week or so from 4.10 final, the 4.10-rc code is
> problematic even for ip vrf, and there isn't a clear solution yet.
> There are a bunch of requirements that seem to conflict, and something
> has to give.

let's go back to the beginning:
- you've identified a 'malfunction' in ip vrf. It's valid one. Thank you.
- can it be fixed without kernel changes ? Yes. David offered to do so.
- can kernel make it easier to address? Yes. I posted v1.
- proposed sk->netns_inum v1 patch together with posted iproute2 change
addresses the 'malfunction' ? Yes, but Eric didn't like inode only. All fair.
- proposed bpf_sk_netns_id() v3 patch together with upcoming iproute2
change will address it? Yes.

What's not to like? There were several clear solutions.
The last one seems to be the best.
And the sooner it lands the faster I can add override disable flag.
Andy Lutomirski Feb. 5, 2017, 4:17 a.m. UTC | #5
On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
>> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
>> >> > Note that all bpf programs types are global.
>> >>
>> >> I don't think this has a clear enough meaning to work with.  In
>> >
>> > Please clarify what you mean. The quoted part says
>> > "bpf programs are global". What is not "clear enough" there?
>>
>> What does "bpf programs are global" mean?  I am genuinely unable to
>> figure out what you mean.  Here are some example guesses of what you
>> might mean:
>>
>>  - BPF programs are compiled independently of a namespace.  This is
>> certainly true, but I don't think it matters.
>>
>>  - You want BPF programs to affect everything on the system.  But this
>> doesn't seem right to be -- they only affect things in the relevant
>> cgroup, so they're not global in that sense.
>
> All bpf program types are global in the sense that you can
> make all of them to operate across all possible scopes and namespaces.

I still don't understand what you mean here.  A seccomp program runs
in the process that installs it and children -- it does not run in
"all possible scopes".  A socket filter runs on a single socket and
therefore runs in a single netns.  So presumably I'm still
misunderstanding you

> cgroup only gives a scope for the program to run, but it's
> not limited by it. The user can have the same program
> attached to two or more different cgroups, so one program
> will run across multiple cgroups.

Does this mean "BPF programs are compiled independently of a
namespace?"  If so, I don't see why it's relevant at all.  Sure, you
could compile a BPF program once and install it in two different
scopes, but that doesn't mean that the kernel should *run* it globally
in any sense.  Can you clarify?

>
>>  - The set of BPF program types and the verification rules are
>> independent of cgroup and namespace.  This is true, but I don't think
>> it matters.
>
> It matters. That's actually the key to understand. The loading part
> verifies correctness for particular program type.
> Afterwards the same program can be attached to any place.
> Including different cgroups and different namespaces.
> The 'attach' part is like 'switch on' that enables program
> on particular hook. The scope (whether it's socket or netdev or cgroup)
> is a scope that program author uses to narrow down the hook,
> but it's not an ultimate restriction.
> For example the socket program can be attached to sockets and
> share information with cls_bpf program attached to netdev.
> The kprobe tracing program can peek into kernel internal data
> and share it with cls_bpf or any other type as long as
> everything is root. The information flow is global to the whole system.

Why does any of this imply that a cgroup+bpf program that is attached
once should run for all network namespaces?

>
>> Because we're one week or so from 4.10 final, the 4.10-rc code is
>> problematic even for ip vrf, and there isn't a clear solution yet.
>> There are a bunch of requirements that seem to conflict, and something
>> has to give.
>
> let's go back to the beginning:
> - you've identified a 'malfunction' in ip vrf. It's valid one. Thank you.
> - can it be fixed without kernel changes ? Yes. David offered to do so.

He has (I think) a somewhat kludgey fix that gets the "ip netns" case
right but not the "unshare -n" case.  I think the latter can't be
fixed without kernel changes.
Alexei Starovoitov Feb. 7, 2017, 1:42 a.m. UTC | #6
On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> >> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <ast@fb.com> wrote:
> >> >> > Note that all bpf programs types are global.
> >> >>
> >> >> I don't think this has a clear enough meaning to work with.  In
> >> >
> >> > Please clarify what you mean. The quoted part says
> >> > "bpf programs are global". What is not "clear enough" there?
> >>
> >> What does "bpf programs are global" mean?  I am genuinely unable to
> >> figure out what you mean.  Here are some example guesses of what you
> >> might mean:
> >>
> >>  - BPF programs are compiled independently of a namespace.  This is
> >> certainly true, but I don't think it matters.
> >>
> >>  - You want BPF programs to affect everything on the system.  But this
> >> doesn't seem right to be -- they only affect things in the relevant
> >> cgroup, so they're not global in that sense.
> >
> > All bpf program types are global in the sense that you can
> > make all of them to operate across all possible scopes and namespaces.
> 
> I still don't understand what you mean here.  A seccomp program runs
> in the process that installs it and children -- it does not run in
> "all possible scopes". 

seccomp and classic bpf is different, since there is no concept
of the program there. cbpf is a stateless set of instructions
that belong to some entity like seccomp or socket. ebpf is stateful
and starts with the program, then hook and then scope.

> A socket filter runs on a single socket and
> therefore runs in a single netns.  So presumably I'm still
> misunderstanding you

in classic - yes. ebpf can have the same program attached to
multiple sockets in different netns.
For classic - the object is the socket and the user can only
manipulate that socket. For extended - the object is the program
and it can exist on its own. Like the program can be pinned in bpffs
while it's not attached to any hook.
For classic bpf the ideas of CRIU naturally apply, since
it checkpoints the socket and it happens that socket has
a set of statless cbpf instructions within. So it's
expected to save/restore cbpf as part of socket save/restore.
In case of ebpf the program exists independently of the socket.
Doing save/restore of the ebpf program attached to a socket
is meaningless, since it could be pinned in bpffs, attached
to other sockets, has state in bpf maps, some other process
might be actively talking to that program and so on.
ebpf is a graph of interconnected pieces. To criu such thing
one really need to freeze the whole system, all of it processes,
and stop the kernel. I don't see criu ever be applied to ebpf.

> > cgroup only gives a scope for the program to run, but it's
> > not limited by it. The user can have the same program
> > attached to two or more different cgroups, so one program
> > will run across multiple cgroups.
> 
> Does this mean "BPF programs are compiled independently of a
> namespace?"  If so, I don't see why it's relevant at all.  Sure, you
> could compile a BPF program once and install it in two different
> scopes, but that doesn't mean that the kernel should *run* it globally
> in any sense.  Can you clarify?

if single program is attached to different sockets it exactly
means that the kernel should run it for different sockets :)

> >>  - The set of BPF program types and the verification rules are
> >> independent of cgroup and namespace.  This is true, but I don't think
> >> it matters.
> >
> > It matters. That's actually the key to understand. The loading part
> > verifies correctness for particular program type.
> > Afterwards the same program can be attached to any place.
> > Including different cgroups and different namespaces.
> > The 'attach' part is like 'switch on' that enables program
> > on particular hook. The scope (whether it's socket or netdev or cgroup)
> > is a scope that program author uses to narrow down the hook,
> > but it's not an ultimate restriction.
> > For example the socket program can be attached to sockets and
> > share information with cls_bpf program attached to netdev.
> > The kprobe tracing program can peek into kernel internal data
> > and share it with cls_bpf or any other type as long as
> > everything is root. The information flow is global to the whole system.
> 
> Why does any of this imply that a cgroup+bpf program that is attached
> once should run for all network namespaces?

because cgroup is the only scope that is being used to scope this
particular bpf_type_cgroup_sock program type.
In the future we may add bpf_type_netns_sock program type
which will use exactly the same sock create hook, but will be scoped
by netns. It will be completely independent of any cgroups.
The event of socket creation by user process within given netns
will trigger a call to attached bpf program for that netns.
All other netns (where program is not attached) won't see the program called.
But the user will still be able to load single program
and attach it to two netns-es.

Potentially we can also allow a combination of scopes where
single bpf program is attached to a hook while scoped by
both netns and cgroup at the same time.
I see it as an extension to prog_attach command where
user will specify two fds (one for cgroup and one for netns) to make
given prog_fd program scoped by both.
Nothing in the current api prevents such future extensions.

And from the other thread:

> I'm not saying that at all.  I'm saying that this use case sounds
> valid, but maybe it could be solved differently.  Here are some ideas:

I'm happy to discuss api extension ideas as long as I don't hear
a week later that they should be done for 4.10

> - Expose the actual cgroup (relative to the hooked cgroup) to the BPF
> program.  Then you could parse the headers, check what cgroup you're
> in, and proceed accordingly.  This could potentially be even faster
> for your use case if done carefully because it will stress the
> instruction cache less.

that was considered. we thought about exposing cgroup inode
or some form of relative cgroup id and use that for lookup, but for
close to zero overhead network monitoring that extra lookup
is quite costly, but this idea is still on the table.
We might use it together with cls_bpf program type.

>  - Have a (non-default) flag that says "overridable".  The effect is
>  that, if /A and /A/B both have overridable programs attached, then
>  sockets in /A/B don't run /A's program.  If, however, /A has a
>  non-overridable program and /A/B has an overridable program, then
>  sockets in /A/B run both programs.  IOW overridable programs override
>  other overridable programs, but non-overridable programs never
>  override anything and are never overridden by anything.

that sounds a bit complex to maintain of the kernel side.
Is this for the case where sandboxing daemon is using non-overridable
and it still wants to allow inner daemon to do its monitoring?
I guess that's useful. I don't mind adding something like this
in the future.

> Suppose someone wants to make CGROUP_BPF work right when a container
> and a container manager both use it.  It'll have to run both programs.
> What would the semantics be if this were to be added?  This is really
> a question, not an indictment of your approach.  For all I know,
> you're proposing exactly what I suggested above.

sounds like your proposed approach to let kernel keep two sets
of overridable and non-overridable programs can work.
Need to think about it more. There are things to resolve there.
Like there gotta be a handle or some cookie that user will
give to the kernel to identify particular program within a list.
Otherwise i don't see how the user can detach the program later
or even query whether it's attached already or not.
We had a case when user daemon was restarting after an upgrade
and was attaching cls_bpf to tc. After few weeks every packet
was going through a ton of bpf programs. So I like single
program being default, since the user doing a chain of progs
would have to think about handles/cookies beforehand
and would have to enable it explicitly with the flag.

> And sandboxing needn't, and won't, wait until unprivileged bpf
> programs are added.  Plenty of sandboxes run as root.

great. Last time I heard the unprivileged progs were positioned as
must have for landlock. I'm happy to stay root only for now.
Andy Lutomirski Feb. 7, 2017, 2:57 a.m. UTC | #7
On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
>> >> What does "bpf programs are global" mean?  I am genuinely unable to
>> >> figure out what you mean.  Here are some example guesses of what you
>> >> might mean:
>> >>
>> >>  - BPF programs are compiled independently of a namespace.  This is
>> >> certainly true, but I don't think it matters.
>> >>
>> >>  - You want BPF programs to affect everything on the system.  But this
>> >> doesn't seem right to be -- they only affect things in the relevant
>> >> cgroup, so they're not global in that sense.
>> >
>> > All bpf program types are global in the sense that you can
>> > make all of them to operate across all possible scopes and namespaces.
>>
>> I still don't understand what you mean here.  A seccomp program runs
>> in the process that installs it and children -- it does not run in
>> "all possible scopes".
>
> seccomp and classic bpf is different, since there is no concept
> of the program there. cbpf is a stateless set of instructions
> that belong to some entity like seccomp or socket. ebpf is stateful
> and starts with the program, then hook and then scope.

So... are you saying that a loaded eBPF object is global in the sense
that if you attach the same object to more than one thing (two
sockets, say), the *same* program runs and shares its state?  If so, I
agree, but that's still not an argument that the *same* attachment of
an eBPF program to a cgroup should run in multiple network namespaces.
You could also attach the (same) program once per netns and its state
would be shared.

I'm pretty sure I've never suggested that an eBPF program be bound to
a namespace.  I just think that a lot of things get more
straightforward if an *attachment* of an eBPF program to a cgroup is
bound to a single namespace.

>
>> A socket filter runs on a single socket and
>> therefore runs in a single netns.  So presumably I'm still
>> misunderstanding you
>
> in classic - yes. ebpf can have the same program attached to
> multiple sockets in different netns.
> For classic - the object is the socket and the user can only
> manipulate that socket. For extended - the object is the program
> and it can exist on its own. Like the program can be pinned in bpffs
> while it's not attached to any hook.
> For classic bpf the ideas of CRIU naturally apply, since
> it checkpoints the socket and it happens that socket has
> a set of statless cbpf instructions within. So it's
> expected to save/restore cbpf as part of socket save/restore.
> In case of ebpf the program exists independently of the socket.

True.

> Doing save/restore of the ebpf program attached to a socket
> is meaningless, since it could be pinned in bpffs, attached
> to other sockets, has state in bpf maps, some other process
> might be actively talking to that program and so on.
> ebpf is a graph of interconnected pieces. To criu such thing
> one really need to freeze the whole system, all of it processes,
> and stop the kernel. I don't see criu ever be applied to ebpf.

Not true, I think.  CRIU could figure out which eBPF program is
attached to a socket and snapshot the attachment and (separately) the
eBPF object.  If the eBPF object were to be attached to two sockets,
CRIU would notice and only snaptshot the eBPF program once.  I don't
see why the whole system would need to be snapshotted.

Obviously this code isn't written yet.  But if eBPF were to be widely
used for, say, seccomp, I bet the CRIU code would show up in short
order.

>
> Potentially we can also allow a combination of scopes where
> single bpf program is attached to a hook while scoped by
> both netns and cgroup at the same time.
> I see it as an extension to prog_attach command where
> user will specify two fds (one for cgroup and one for netns) to make
> given prog_fd program scoped by both.
> Nothing in the current api prevents such future extensions.

I think this might be sufficient for all usecases.  Facebook's use
case of monitoring everything (if I understood correctly) could be
addressed by just attaching the program to the cgroup once for each
namespace.  The benefit would be a good deal of simplicity: you could
relax the capable() call to ns_capable() in the future, you wouldn't
need this patch, and you wouldn't be creating the first major
non-netns-specific network hook in the kernel.

>
> And from the other thread:
>
>> I'm not saying that at all.  I'm saying that this use case sounds
>> valid, but maybe it could be solved differently.  Here are some ideas:
>
> I'm happy to discuss api extension ideas as long as I don't hear
> a week later that they should be done for 4.10
>
>> - Expose the actual cgroup (relative to the hooked cgroup) to the BPF
>> program.  Then you could parse the headers, check what cgroup you're
>> in, and proceed accordingly.  This could potentially be even faster
>> for your use case if done carefully because it will stress the
>> instruction cache less.
>
> that was considered. we thought about exposing cgroup inode
> or some form of relative cgroup id and use that for lookup, but for
> close to zero overhead network monitoring that extra lookup
> is quite costly, but this idea is still on the table.
> We might use it together with cls_bpf program type.

Another approach would be to let the attach call take an opaque
parameter that's passed back to the eBPF program on each invocation.
That would be a single mem-to-register copy, right?

>
>>  - Have a (non-default) flag that says "overridable".  The effect is
>>  that, if /A and /A/B both have overridable programs attached, then
>>  sockets in /A/B don't run /A's program.  If, however, /A has a
>>  non-overridable program and /A/B has an overridable program, then
>>  sockets in /A/B run both programs.  IOW overridable programs override
>>  other overridable programs, but non-overridable programs never
>>  override anything and are never overridden by anything.
>
> that sounds a bit complex to maintain of the kernel side.
> Is this for the case where sandboxing daemon is using non-overridable
> and it still wants to allow inner daemon to do its monitoring?
> I guess that's useful. I don't mind adding something like this
> in the future.
>

I'm just trying to come up with sane semantics to having
non-overridable programs as well as overridable programs.  It has to
do *something* if you install both (or be disallowed).

>> Suppose someone wants to make CGROUP_BPF work right when a container
>> and a container manager both use it.  It'll have to run both programs.
>> What would the semantics be if this were to be added?  This is really
>> a question, not an indictment of your approach.  For all I know,
>> you're proposing exactly what I suggested above.
>
> sounds like your proposed approach to let kernel keep two sets
> of overridable and non-overridable programs can work.
> Need to think about it more. There are things to resolve there.
> Like there gotta be a handle or some cookie that user will
> give to the kernel to identify particular program within a list.
> Otherwise i don't see how the user can detach the program later
> or even query whether it's attached already or not.

I thought your plan was to allow at most one program per (cgroup, hook
type), in which case this isn't a problem, right?  If you call detach,
you detach the one and only program that's attached.

> We had a case when user daemon was restarting after an upgrade
> and was attaching cls_bpf to tc. After few weeks every packet
> was going through a ton of bpf programs. So I like single
> program being default, since the user doing a chain of progs
> would have to think about handles/cookies beforehand
> and would have to enable it explicitly with the flag.
>
>> And sandboxing needn't, and won't, wait until unprivileged bpf
>> programs are added.  Plenty of sandboxes run as root.
>
> great. Last time I heard the unprivileged progs were positioned as
> must have for landlock. I'm happy to stay root only for now.
>

I think that letting sandboxing features work unprivileged is
extremely valuable.  Just because cgroup+bpf is root only for now
doesn't mean it won't be used for sandboxing, though.
Alexei Starovoitov Feb. 7, 2017, 7 a.m. UTC | #8
On Mon, Feb 06, 2017 at 06:57:45PM -0800, Andy Lutomirski wrote:
> On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
> >> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> >> >> What does "bpf programs are global" mean?  I am genuinely unable to
> >> >> figure out what you mean.  Here are some example guesses of what you
> >> >> might mean:
> >> >>
> >> >>  - BPF programs are compiled independently of a namespace.  This is
> >> >> certainly true, but I don't think it matters.
> >> >>
> >> >>  - You want BPF programs to affect everything on the system.  But this
> >> >> doesn't seem right to be -- they only affect things in the relevant
> >> >> cgroup, so they're not global in that sense.
> >> >
> >> > All bpf program types are global in the sense that you can
> >> > make all of them to operate across all possible scopes and namespaces.
> >>
> >> I still don't understand what you mean here.  A seccomp program runs
> >> in the process that installs it and children -- it does not run in
> >> "all possible scopes".
> >
> > seccomp and classic bpf is different, since there is no concept
> > of the program there. cbpf is a stateless set of instructions
> > that belong to some entity like seccomp or socket. ebpf is stateful
> > and starts with the program, then hook and then scope.
> 
> So... are you saying that a loaded eBPF object is global in the sense
> that if you attach the same object to more than one thing (two
> sockets, say), the *same* program runs and shares its state?  If so, I
> agree, but that's still not an argument that the *same* attachment of
> an eBPF program to a cgroup should run in multiple network namespaces.
> You could also attach the (same) program once per netns and its state
> would be shared.
> 
> I'm pretty sure I've never suggested that an eBPF program be bound to
> a namespace.  I just think that a lot of things get more
> straightforward if an *attachment* of an eBPF program to a cgroup is
> bound to a single namespace.

Thank you for this whole discussion over the last few months.
Frankly in the beginning I wasn't 100% sure about the api we picked,
now I'm completely convinced that we absolutely made the right choice.
It's clean and keeps scoping constructs clear and explicit for the users.
So I am proposing that in the future we should add the ability to
scope bpf programs by netns. Just like current api scopes them
by cgroup. The attachment must be explicit.
Current api attaches type_cgroup_* program to a hook and scopes it
by a given cgroup. At that time some apps within that cgroup may
already run in different netns and attaching process may be
in yet another netns. There is no way to have sane semantics
without explicitly specifying the scope. Currently we do it
by explicitly specifying the cgroup. In the future we need
to extend it by specifying netns (without cgroup). Then
for container technologies that are based on netns we'll
have an efficient way of scoping programs to given netns.
And when afnetns is finally ready, the same scoping approach
will work for afnetns as well. For cases that need to
have two scopes at the same time (like cgroup and netns)
the bpf_sk_netns_id() helper will work one way and
some future bpf_sk_cgroup_id() helper will work from
the other side.
So far in multi-scope cases one dimension is dominating.
Like number of cgroups is way larger than number of netns,
so explicit bpf_sk_netns_id() from inside the programs
is faster than doing the same in the kernel.
And if in the future there will be a case with a lot of
cgroups and a lot of netns at the same time, we'll extend
the api further to specify two scopes to bpf_prog_attach command.
The kernel side will probably need a hashtable to lookup
a bpf prog based on (cgroup, netns) pair of pointers.

> >> A socket filter runs on a single socket and
> >> therefore runs in a single netns.  So presumably I'm still
> >> misunderstanding you
> >
> > in classic - yes. ebpf can have the same program attached to
> > multiple sockets in different netns.
> > For classic - the object is the socket and the user can only
> > manipulate that socket. For extended - the object is the program
> > and it can exist on its own. Like the program can be pinned in bpffs
> > while it's not attached to any hook.
> > For classic bpf the ideas of CRIU naturally apply, since
> > it checkpoints the socket and it happens that socket has
> > a set of statless cbpf instructions within. So it's
> > expected to save/restore cbpf as part of socket save/restore.
> > In case of ebpf the program exists independently of the socket.
> 
> True.
> 
> > Doing save/restore of the ebpf program attached to a socket
> > is meaningless, since it could be pinned in bpffs, attached
> > to other sockets, has state in bpf maps, some other process
> > might be actively talking to that program and so on.
> > ebpf is a graph of interconnected pieces. To criu such thing
> > one really need to freeze the whole system, all of it processes,
> > and stop the kernel. I don't see criu ever be applied to ebpf.
> 
> Not true, I think.  CRIU could figure out which eBPF program is
> attached to a socket and snapshot the attachment and (separately) the
> eBPF object.  If the eBPF object were to be attached to two sockets,
> CRIU would notice and only snaptshot the eBPF program once.  I don't

that's C part of CRIU... and as soon as it goes to do R part the initial
graph between programs, maps, and apps is now completely different
and everything is broken.
criu cannot save/restore a graph by saving/restoring one
edge at a time while graph keeps changing. It needs to freeze
the whole thing.

> need this patch, and you wouldn't be creating the first major
> non-netns-specific network hook in the kernel.

first?! last time i checked two old cgroupv1 controllers ignore netns.
nflog has an ability to log across netns.

> >> - Expose the actual cgroup (relative to the hooked cgroup) to the BPF
> >> program.  Then you could parse the headers, check what cgroup you're
> >> in, and proceed accordingly.  This could potentially be even faster
> >> for your use case if done carefully because it will stress the
> >> instruction cache less.
> >
> > that was considered. we thought about exposing cgroup inode
> > or some form of relative cgroup id and use that for lookup, but for
> > close to zero overhead network monitoring that extra lookup
> > is quite costly, but this idea is still on the table.
> > We might use it together with cls_bpf program type.
> 
> Another approach would be to let the attach call take an opaque
> parameter that's passed back to the eBPF program on each invocation.
> That would be a single mem-to-register copy, right?

i'm not sure what you mean here.
we discussed adding new types of relocations to bpf instruction stream.
Today there is only BPF_PSEUDO_MAP_FD.
We thought that netns and cgroup can be two other types.
Similarly to maps the user process will open an fd to netns (or cgroup)
and store it inside the program as BPF_PSEUDO_NETNS_FD.
At load time the kernel will increment the refcnt.
The problem with this approach is that bpf programs will suddenly
hold the references to normal kernel objects and debugging
of netns and cgroup destruction will become a nightmare.
Thefore bpf_sk_netns_id() (and future bpf_sk_cgroup_id) approach
is safer and simpler.

> >>  - Have a (non-default) flag that says "overridable".  The effect is
> >>  that, if /A and /A/B both have overridable programs attached, then
> >>  sockets in /A/B don't run /A's program.  If, however, /A has a
> >>  non-overridable program and /A/B has an overridable program, then
> >>  sockets in /A/B run both programs.  IOW overridable programs override
> >>  other overridable programs, but non-overridable programs never
> >>  override anything and are never overridden by anything.
> >
> > that sounds a bit complex to maintain of the kernel side.
> > Is this for the case where sandboxing daemon is using non-overridable
> > and it still wants to allow inner daemon to do its monitoring?
> > I guess that's useful. I don't mind adding something like this
> > in the future.
> >
> 
> I'm just trying to come up with sane semantics to having
> non-overridable programs as well as overridable programs.  It has to
> do *something* if you install both (or be disallowed).

I think what you're proposing with two sets is fine.
(if i understood the idea correctly).
right now we have only overridable and adding non-overridable
doesn't break anything as far as i can see. For simplicity we
can do it with just two pointers (no chaining). Later can add
priority and chain of programs. Which probably should be
an array of programs to have good performance.

I think your main concern all this time was that we won't be
able to extend the api to suit security/sandboxing needs.
I don't share that concern and I think this discussion
outlined the work in that area for the next few years that
we can incrementally do.

> >> Suppose someone wants to make CGROUP_BPF work right when a container
> >> and a container manager both use it.  It'll have to run both programs.
> >> What would the semantics be if this were to be added?  This is really
> >> a question, not an indictment of your approach.  For all I know,
> >> you're proposing exactly what I suggested above.
> >
> > sounds like your proposed approach to let kernel keep two sets
> > of overridable and non-overridable programs can work.
> > Need to think about it more. There are things to resolve there.
> > Like there gotta be a handle or some cookie that user will
> > give to the kernel to identify particular program within a list.
> > Otherwise i don't see how the user can detach the program later
> > or even query whether it's attached already or not.
> 
> I thought your plan was to allow at most one program per (cgroup, hook
> type), in which case this isn't a problem, right?  If you call detach,
> you detach the one and only program that's attached.

yes. when there is only one program the detach is unambiguous.
As soon as we have a chain the user will need a handle or priority
or both.

> > We had a case when user daemon was restarting after an upgrade
> > and was attaching cls_bpf to tc. After few weeks every packet
> > was going through a ton of bpf programs. So I like single
> > program being default, since the user doing a chain of progs
> > would have to think about handles/cookies beforehand
> > and would have to enable it explicitly with the flag.
> >
> >> And sandboxing needn't, and won't, wait until unprivileged bpf
> >> programs are added.  Plenty of sandboxes run as root.
> >
> > great. Last time I heard the unprivileged progs were positioned as
> > must have for landlock. I'm happy to stay root only for now.
> >
> 
> I think that letting sandboxing features work unprivileged is
> extremely valuable.  Just because cgroup+bpf is root only for now
> doesn't mean it won't be used for sandboxing, though.

agree.
diff mbox

Patch

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8c9fb29c6673..1a604bccef86 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -49,6 +49,13 @@  static void nsfs_evict(struct inode *inode)
 	ns->ops->put(ns);
 }
 
+u64 proc_get_ns_devid_inum(struct ns_common *ns)
+{
+	u64 dev = new_encode_dev(nsfs_mnt->mnt_sb->s_dev);
+
+	return (dev << 32) | ns->inum;
+}
+
 static void *__ns_get_path(struct path *path, struct ns_common *ns)
 {
 	struct vfsmount *mnt = nsfs_mnt;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 12cb8bd81d2d..b567b021e652 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -48,7 +48,7 @@  extern int pid_ns_prepare_proc(struct pid_namespace *ns);
 extern void pid_ns_release_proc(struct pid_namespace *ns);
 extern int proc_alloc_inum(unsigned int *pino);
 extern void proc_free_inum(unsigned int inum);
-
+extern u64 proc_get_ns_devid_inum(struct ns_common *ns);
 #else /* CONFIG_PROC_FS */
 
 static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; }
@@ -61,6 +61,7 @@  static inline int proc_alloc_inum(unsigned int *inum)
 }
 static inline void proc_free_inum(unsigned int inum) {}
 
+static u64 proc_get_ns_devid_inum(struct ns_common *ns) { return 0; }
 #endif /* CONFIG_PROC_FS */
 
 static inline int ns_alloc_inum(struct ns_common *ns)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..e5b8cf16cbaf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -430,6 +430,17 @@  union bpf_attr {
  *     @xdp_md: pointer to xdp_md
  *     @delta: An positive/negative integer to be added to xdp_md.data
  *     Return: 0 on success or negative on error
+ *
+ * u64 bpf_sk_netns_id(sk)
+ *     Returns unique value that identifies netns of given socket or skb.
+ *     The upper 32-bits of the return value contain device id where namespace
+ *     filesystem resides and lower 32-bits contain inode number within
+ *     that filesystem. It's the same value as:
+ *      struct stat st;
+ *      stat("/proc/pid/ns/net", &st);
+ *      return (st->st_dev << 32)  | st->st_ino;
+ *     @sk: pointer to struct sock or struct __sk_buff
+ *     Return: filesystem's device id | netns inode
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -476,7 +487,8 @@  union bpf_attr {
 	FN(set_hash_invalid),		\
 	FN(get_numa_node_id),		\
 	FN(skb_change_head),		\
-	FN(xdp_adjust_head),
+	FN(xdp_adjust_head),		\
+	FN(sk_netns_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 1969b3f118c1..171f0a019705 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -52,6 +52,7 @@ 
 #include <net/dst_metadata.h>
 #include <net/dst.h>
 #include <net/sock_reuseport.h>
+#include <linux/proc_ns.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2597,6 +2598,34 @@  static const struct bpf_func_proto bpf_xdp_event_output_proto = {
 	.arg5_type	= ARG_CONST_STACK_SIZE,
 };
 
+BPF_CALL_1(bpf_sk_netns_id, struct sock *, sk)
+{
+	return proc_get_ns_devid_inum(&sock_net(sk)->ns);
+}
+
+static const struct bpf_func_proto bpf_sk_netns_id_proto = {
+	.func		= bpf_sk_netns_id,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
+BPF_CALL_1(bpf_skb_netns_id, struct sk_buff *, skb)
+{
+	struct net_device *dev = skb->dev;
+
+	if (!dev)
+		return 0;
+	return proc_get_ns_devid_inum(&dev_net(dev)->ns);
+}
+
+static const struct bpf_func_proto bpf_skb_netns_id_proto = {
+	.func		= bpf_skb_netns_id,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
@@ -2620,6 +2649,8 @@  sk_filter_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_trace_printk:
 		if (capable(CAP_SYS_ADMIN))
 			return bpf_get_trace_printk_proto();
+	case BPF_FUNC_sk_netns_id:
+		return &bpf_skb_netns_id_proto;
 	default:
 		return NULL;
 	}
@@ -2700,6 +2731,17 @@  xdp_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+cg_sock_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_netns_id:
+		return &bpf_sk_netns_id_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 cg_skb_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3255,7 +3297,7 @@  static const struct bpf_verifier_ops lwt_xmit_ops = {
 };
 
 static const struct bpf_verifier_ops cg_sock_ops = {
-	.get_func_proto		= sk_filter_func_proto,
+	.get_func_proto		= cg_sock_func_proto,
 	.is_valid_access	= sock_filter_is_valid_access,
 	.convert_ctx_access	= sock_filter_convert_ctx_access,
 };
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index faaffe2e139a..971649483d62 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -94,6 +94,8 @@  static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 	(void *) BPF_FUNC_skb_under_cgroup;
 static int (*bpf_skb_change_head)(void *, int len, int flags) =
 	(void *) BPF_FUNC_skb_change_head;
+static unsigned long long (*bpf_sk_netns_id)(void *) =
+	(void *) BPF_FUNC_sk_netns_id;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/sock_flags_kern.c b/samples/bpf/sock_flags_kern.c
index 533dd11a6baa..f2507d7d847b 100644
--- a/samples/bpf/sock_flags_kern.c
+++ b/samples/bpf/sock_flags_kern.c
@@ -9,8 +9,10 @@  SEC("cgroup/sock1")
 int bpf_prog1(struct bpf_sock *sk)
 {
 	char fmt[] = "socket: family %d type %d protocol %d\n";
+	char fmt2[] = "socket: netns dev+inode %llx\n";
 
 	bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
+	bpf_trace_printk(fmt2, sizeof(fmt2), bpf_sk_netns_id(sk));
 
 	/* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets
 	 * ie., make ping6 fail
diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
index ed18e9a4909c..73b62cc0d617 100644
--- a/samples/bpf/sockex1_kern.c
+++ b/samples/bpf/sockex1_kern.c
@@ -16,6 +16,7 @@  int bpf_prog1(struct __sk_buff *skb)
 {
 	int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
 	long *value;
+	char fmt[] = "netns dev+inode %llx\n";
 
 	if (skb->pkt_type != PACKET_OUTGOING)
 		return 0;
@@ -24,6 +25,7 @@  int bpf_prog1(struct __sk_buff *skb)
 	if (value)
 		__sync_fetch_and_add(value, skb->len);
 
+	bpf_trace_printk(fmt, sizeof(fmt), bpf_sk_netns_id(skb));
 	return 0;
 }
 char _license[] SEC("license") = "GPL";