
[net-next,3/4] bpf: add support for persistent maps/progs

Message ID ab1fceb2d68876d89bb2ebb3d2b45486d3cf2388.1444956943.git.daniel@iogearbox.net
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Daniel Borkmann Oct. 16, 2015, 1:09 a.m. UTC
This work adds support for "persistent" eBPF maps/programs. Here,
"persistent" means that maps/programs have a facility that lets them
survive process termination. This is desired by various eBPF subsystem
users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object and extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate against
this resource, and therefore maps cannot easily be shared, f.e. between
the ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (f.e. programs for tail calls) among various
processes.
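
As an illustration, the sending side of such a workaround essentially
boils down to an SCM_RIGHTS transfer over a connected UDS (a sketch
only; function and variable names are made up):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Sketch: hand a bpf map/prog fd to a management daemon over a
   * connected Unix domain socket via SCM_RIGHTS.
   */
  static int send_bpf_fd(int sock, int bpf_fd)
  {
          char dummy = '\0';
          struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
          union {
                  char buf[CMSG_SPACE(sizeof(int))];
                  struct cmsghdr align;
          } u;
          struct msghdr msg = {
                  .msg_iov        = &iov,
                  .msg_iovlen     = 1,
                  .msg_control    = u.buf,
                  .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type  = SCM_RIGHTS;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &bpf_fd, sizeof(int));

          return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }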

We've been brainstorming on how we could tackle this issue, and various
approaches have been tried out so far:

The first idea was to implement a fuse backend that proxies bpf(2)
syscalls that create/load maps and programs, where the fuse
implementation would hold these fds internally and pass them to
things like tc via UDS socket passing. There could be various fuse
implementations tailored to a particular eBPF subsystem's needs. The
advantage is that this would shift the complexity entirely into user
space, but with a couple of drawbacks along the way. One is that
fuse needs extra library dependencies, and it also doesn't resolve the
issue that an extra daemon needs to run in the background. At Linux
Plumbers 2015, we've all concluded eventually that using fuse is not
an option and an in-kernel solution is needed.

The next idea I've tried out was an extension to the bpf(2) syscall
that works roughly the way we bind(2)/connect(2) to paths backed
by special inodes in the case of UDS. This works on top of any file
system that allows creating special files, where the newly created
inode operations are similar to those of S_IFSOCK inodes. The inode
would be instrumented as a lookup key in an rhashtable backend with
the prog/map stored as value. We found a couple of disadvantages to
this approach. Since the underlying implementation of the inode can
differ among file systems, we would need a periodic garbage collection
for dropping the rhashtable entry and the references to the maps/progs
held with it (it could be done by piggybacking on the rhashtable's
rehashing, though). The other issue is that this requires adding
something like S_IFBPF (to clearly identify this inode as special to
eBPF), where the available space in S_IFMT is already severely limited
and could possibly clash with future POSIX values being allocated
there. Moreover, this approach might not be flexible enough from a
functionality point of view; f.e. future debugging facilities etc.
could be added that really wouldn't fit into the bpf(2) syscall
(however, the bpf(2) syscall strictly stays the central place to
manage eBPF things).
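
For comparison, the UDS analogue referred to above, where bind(2)
creates a path-backed special S_IFSOCK inode, looks roughly like this
(the path handling is illustrative):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  /* bind(2) on AF_UNIX creates a special S_IFSOCK inode at the given
   * path; the S_IFBPF idea would have worked analogously for
   * maps/progs.
   */
  static int make_pinned_socket(const char *path)
  {
          struct sockaddr_un addr = { .sun_family = AF_UNIX };
          int fd = socket(AF_UNIX, SOCK_STREAM, 0);

          if (fd < 0)
                  return -1;
          strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                  return -1;
          return fd;
  }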

This eventually leads us to this patch, which implements a minimal
eBPF file system. The idea is a bit similar, except that these
inodes reside at one or multiple mount points. A directory hierarchy
can be tailored by the various subsystem users to a specific
application use case, with maps/progs pinned inside it. Two new eBPF
commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
order to create one or multiple special inodes from an existing file
descriptor that points to a map/program (we call it eBPF fd pinning),
or to create a new file descriptor from an existing special inode.
BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
can also be used unprivileged, given appropriate permissions on
the path.
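
To sketch the intended usage from an application's point of view (the
bpf_attr member names below are assumptions for illustration, not
necessarily the final UAPI; see the patch itself for the layout):

   /* Process A: pin an existing map fd (from a prior BPF_MAP_CREATE)
    * at a path inside the eBPF fs; path is illustrative.
    */
   union bpf_attr attr = {
     .fd       = map_fd,
     .pathname = (__u64)(unsigned long)"/sys/fs/bpf/tc/map_pkts",
   };

   bpf(BPF_PIN_FD, &attr, sizeof(attr));

   /* Process B, possibly much later: retrieve a new fd to the same
    * map from the pinned special inode.
    */
   int new_fd = bpf(BPF_NEW_FD, &attr, sizeof(attr));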

The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU, but applications
can also inspect the meta data after calling BPF_NEW_FD.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        |  21 ++
 include/uapi/linux/bpf.h   |  45 +---
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/xattr.h |   3 +
 kernel/bpf/Makefile        |   4 +-
 kernel/bpf/inode.c         | 614 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 108 ++++++++
 7 files changed, 758 insertions(+), 38 deletions(-)
 create mode 100644 kernel/bpf/inode.c

Comments

Hannes Frederic Sowa Oct. 16, 2015, 10:25 a.m. UTC | #1
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, except that these
> inodes reside at one or multiple mount points. A directory hierarchy
> can be tailored by the various subsystem users to a specific
> application use case, with maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be used unprivileged, given appropriate permissions on
> the path.

In my opinion this is very un-unixy, to say the least.

Namespaces at some point dealt with the same problem; they nowadays use
bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up their own
hierarchy with normal unix tools, not hidden inside a C program. For
file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
work out of the box nowadays.
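
(For reference, the namespace trick mentioned here works roughly as
follows; the paths are illustrative:

  touch /var/run/ns/keep
  mount --bind /proc/$$/ns/net /var/run/ns/keep

i.e. the bind mount keeps the namespace inode alive for as long as the
mount exists.)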

I don't know in terms of how many objects bpf should be able to handle
and if such a bind-mount based solution would work, I guess not.

In my opinion I still favor a user space approach. Subsystems which use
ebpf in a way that no user space program needs to be running to control
them would need to export the fds by themselves. E.g. something like
sysfs/kobject for tc? The hierarchy would then be in control of the
subsystem which could also create a proper naming hierarchy or maybe
even use an already given one. Do most other eBPF users really need to
persist file descriptors somewhere without user space control and pick
them up later? 

Sorry for the rant and thanks for posting this patchset,
Hannes
Daniel Borkmann Oct. 16, 2015, 1:36 p.m. UTC | #2
On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>> This eventually leads us to this patch, which implements a minimal
>> eBPF file system. The idea is a bit similar, except that these
>> inodes reside at one or multiple mount points. A directory hierarchy
>> can be tailored by the various subsystem users to a specific
>> application use case, with maps/progs pinned inside it. Two new eBPF
>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>> order to create one or multiple special inodes from an existing file
>> descriptor that points to a map/program (we call it eBPF fd pinning),
>> or to create a new file descriptor from an existing special inode.
>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>> can also be used unprivileged, given appropriate permissions on
>> the path.
>
> In my opinion this is very un-unixy, to say the least.
>
> Namespaces at some point dealt with the same problem; they nowadays use
> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up their own
> hierarchy with normal unix tools, not hidden inside a C program. For
> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> work out of the box nowadays.

Yes, that doesn't work out of the box, but I also don't know how usable
that would really be. The idea is roughly similar to the paths
passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
have a map/prog resource that you stick to a special inode so that you
can retrieve it at a later point in time from the same or different
processes through a new fd pointing to the resource from user side, so
that the bpf(2) syscall can be performed upon it.

With Unix tools, you could still create/remove a hierarchy or unlink
those that have maps/progs. You are correct that tools that don't
implement bpf(2) currently cannot access the content behind it, since
bpf(2) manages access to the data itself. I did like the 2nd idea though,
mentioned in the commit log, but don't know how flexible we are in
terms of adding S_IFBPF to the UAPI.

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.
>
> In my opinion I still favor a user space approach. Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by themselves. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I was thinking about a strict predefined hierarchy dictated by the kernel
as well, but was then considering a more flexible approach that could be
tailored freely to various use cases. A predefined hierarchy would most
likely need to be resolved per subsystem and it's not really easy to map
this properly. F.e. if the kernel would try to provide unique ids (as
opposed to having a name or annotation member through the syscall), it
could end up being quite cryptic. If we let the users choose names, I'm
not sure if a single hierarchy level would be enough. Then, additionally
you have facilities like tail calls that eBPF programs could do.

In such cases, one could even craft relationships where a (strict auto
generated) tree representation would not be sufficient (f.e. recirculation
up to a certain depth). The tail called programs could be changed
atomically during runtime, etc. The other issue related to a per subsystem
representation is that bpf(2) is the central management interface for
creating/accessing maps/progs, and each subsystem then has its own little
interface to "install" them internally (f.e. via netlink, setsockopt(2),
etc). That means, with tail calls, only the 'root' programs are installed
there and further transactions would be needed in order to make individual
subsystems aware, so they could potentially generate some hierarchy; don't
know, it seems rather complex.

Thanks,
Daniel
Alexei Starovoitov Oct. 16, 2015, 4:18 p.m. UTC | #3
On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> Namespaces at some point dealt with the same problem; they nowadays use
> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up their own
> hierarchy with normal unix tools, not hidden inside a C program. For
> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> work out of the box nowadays.

bind mounting of /proc/../fd was initially proposed by Andy and we've
looked at it thoroughly, but after discussion with Eric it became
apparent that it doesn't fit here. In the end, we need shell tools
to access maps.
Also I think you missed the hierarchy in this patch set _is_ built with
normal 'mkdir' and files are removed with 'rm'.
The only thing that C does is BPF_PIN_FD of an fd that was received
from the bpf syscall. That's a way cleaner api than doing a bind mount
from a C program.
We've considered letting open() of the file return a bpf-specific
anon-inode, but decided to reserve that for other, more natural file
operations. Therefore BPF_NEW_FD is needed.
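
In other words, the intended workflow with standard tools looks
roughly like this (the fs type name matches this thread; paths and the
map name are illustrative):

  mount -t bpffs bpffs /sys/fs/bpf
  mkdir /sys/fs/bpf/tc                 # hierarchy via plain mkdir
  tc filter add ... bpf obj prog.o     # tc pins fds via BPF_PIN_FD
  ls /sys/fs/bpf/tc
  rm /sys/fs/bpf/tc/map_pkts           # removal via plain rm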

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.

We definitely missed you at the last plumbers where it was discussed :)

> In my opinion I still favor a user space approach.

that's not acceptable for tracing use cases. No daemons allowed.

> Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by themselves. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I think it's way cleaner to have one way of solving it (like this patch
does) instead of asking every subsystem to solve it differently.
We've also looked at sysfs and it's ugly when it comes to removing,
since the user cannot use normal 'rm'.

Hannes Frederic Sowa Oct. 16, 2015, 4:36 p.m. UTC | #4
On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> > On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> >> This eventually leads us to this patch, which implements a minimal
> >> eBPF file system. The idea is a bit similar, except that these
> >> inodes reside at one or multiple mount points. A directory hierarchy
> >> can be tailored by the various subsystem users to a specific
> >> application use case, with maps/progs pinned inside it. Two new eBPF
> >> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> >> order to create one or multiple special inodes from an existing file
> >> descriptor that points to a map/program (we call it eBPF fd pinning),
> >> or to create a new file descriptor from an existing special inode.
> >> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> >> can also be used unprivileged, given appropriate permissions on
> >> the path.
> >
> > In my opinion this is very un-unixy, to say the least.
> >
> > Namespaces at some point dealt with the same problem; they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up their own
> > hierarchy with normal unix tools, not hidden inside a C program. For
> > file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> > work out of the box nowadays.
> 
> Yes, that doesn't work out of the box, but I also don't know how usable
> that would really be. The idea is roughly similar to the paths
> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
> have a map/prog resource that you stick to a special inode so that you
> can retrieve it at a later point in time from the same or different
> processes through a new fd pointing to the resource from user side, so
> that the bpf(2) syscall can be performed upon it.
> 
> With Unix tools, you could still create/remove a hierarchy or unlink
> those that have maps/progs. You are correct that tools that don't
> implement bpf(2) currently cannot access the content behind it, since
> bpf(2) manages access to the data itself. I did like the 2nd idea though,
> mentioned in the commit log, but don't know how flexible we are in
> terms of adding S_IFBPF to the UAPI.

I don't think it should be a problem. You referred to the POSIX
standard in your other mail, but I can't see any reason not to
establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
Solaris (e.g. Doors, Event Ports) are just examples of new modes
being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe this is a better solution architecturally, instead of
constructing a new filesystem.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> >
> > In my opinion I still favor a user space approach. Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by themselves. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I was thinking about a strict predefined hierarchy dictated by the kernel
> as well, but was then considering a more flexible approach that could be
> tailored freely to various use cases. A predefined hierarchy would most
> likely need to be resolved per subsystem and it's not really easy to map
> this properly. F.e. if the kernel would try to provide unique ids (as
> opposed to having a name or annotation member through the syscall), it
> could end up being quite cryptic. If we let the users choose names, I'm
> not sure if a single hierarchy level would be enough. Then, additionally
> you have facilities like tail calls that eBPF programs could do.

I don't think that most subsystems need to expose those file
descriptors. Seccomp probably will have a supervisor process running and
per aggregation will also have a user space process running keeping the
fd alive. So it is all about tc/sched.

And I am not sure if tc will really need a filesystem to handle all
this. The simplest approach is to just keep a name <-> fd mapping
somewhere in the net/sched/ subsystem and use this for all tc users.
Otherwise, we could somehow incorporate this into a sysfs directory
where we maybe create a kobject per installed filter, something along
those lines.

I see that tail calls make this all very difficult to show which entity
uses which ebpf entity in some way, as it looks like n:m relationships.

> In such cases, one could even craft relationships where a (strict auto
> generated) tree representation would not be sufficient (f.e.
> recirculation
> up to a certain depth). The tail called programs could be changed
> atomically during runtime, etc. The other issue related to a per
> subsystem
> representation is that bpf(2) is the central management interface for
> creating/accessing maps/progs, and each subsystem then has its own little
> interface to "install" them internally (f.e. via netlink, setsockopt(2),
> etc). That means, with tail calls, only the 'root' programs are installed
> there and further transactions would be needed in order to make
> individual
> subsystems aware, so they could potentially generate some hierarchy;
> don't
> know, it seems rather complex.

I understand, this is really not suitable to represent in its entirety
in sysfs or any kind of hierarchical structure right now. Either we
limit it somewhat (Alexei will certainly intervene here) or one of your
filesystem approaches will win.

But I still wonder why people are so against user space dependencies?

Another idea that I discussed with Daniel just to have it publicly
available: a userspace helper would be called for every ebpf entity
change so it could mirror or keep track of ebpf handles in user space. You
can think along the lines of kernel/core_pattern. This would probably
also depend on non-anon-inode usage of ebpf fds.

Will have to think about that a bit more,
Hannes
Hannes Frederic Sowa Oct. 16, 2015, 4:43 p.m. UTC | #5
Hi Alexei,

On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> > Namespaces at some point dealt with the same problem; they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up their own
> > hierarchy with normal unix tools, not hidden inside a C program. For
> > file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> > work out of the box nowadays.
> 
> bind mounting of /proc/../fd was initially proposed by Andy and we've
> looked at it thoroughly, but after discussion with Eric it became
> apparent that it doesn't fit here. In the end, we need shell tools
> to access maps.

Oh yes, I want shell tools for this very much! Maybe even have things
like strings, grep etc. work. :)

> Also I think you missed the hierarchy in this patch set _is_ built with
> normal 'mkdir' and files are removed with 'rm'.

I did not miss that, I am just concerned that if the kernel does not
enforce such a hierarchy automatically it won't really happen.

> The only thing that C does is BPF_PIN_FD of an fd that was received
> from the bpf syscall. That's a way cleaner api than doing a bind mount
> from a C program.

I am with you there. Unfortunately we don't have a "give this fd a
name" syscall so far, so I totally understand the decision here.

> We've considered letting open() of the file return a bpf-specific
> anon-inode, but decided to reserve that for other, more natural file
> operations. Therefore BPF_NEW_FD is needed.

Can't this be overloaded somehow? You could use mknod for creation and
open for regular file use; mknod is its own syscall.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> 
> We definitely missed you at the last plumbers where it was discussed :)

Yes. :(

> > In my opinion I still favor a user space approach.
> 
> that's not acceptable for tracing use cases. No daemons allowed.

Oh, tracing does not allow daemons. Why? I can only imagine embedded
users, no?

> > Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by themselves. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I think it's way cleaner to have one way of solving it (like this patch
> does) instead of asking every subsystem to solve it differently.
> We've also looked at sysfs and it's ugly when it comes to removing,
> since the user cannot use normal 'rm'.

Ah, okay. Probably it would depend on some tc node always referencing
the bpf entity. But I see that sysfs might become too problematic.

Bye,
Hannes
Hannes Frederic Sowa Oct. 16, 2015, 5:21 p.m. UTC | #6
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, except that these
> inodes reside at one or multiple mount points. A directory hierarchy
> can be tailored by the various subsystem users to a specific
> application use case, with maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be used unprivileged, given appropriate permissions on
> the path.
> 

Another question:
Should multiple mounts of the filesystem result in an empty fs (a new
instance) or in one where one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?

Thanks,
Hannes
Daniel Borkmann Oct. 16, 2015, 5:27 p.m. UTC | #7
On 10/16/2015 06:36 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
>> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
>>> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>>>> This eventually leads us to this patch, which implements a minimal
>>>> eBPF file system. The idea is a bit similar, except that these
>>>> inodes reside at one or multiple mount points. A directory hierarchy
>>>> can be tailored by the various subsystem users to a specific
>>>> application use case, with maps/progs pinned inside it. Two new eBPF
>>>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>>>> order to create one or multiple special inodes from an existing file
>>>> descriptor that points to a map/program (we call it eBPF fd pinning),
>>>> or to create a new file descriptor from an existing special inode.
>>>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>>>> can also be used unprivileged, given appropriate permissions on
>>>> the path.
>>>
>>> In my opinion this is very un-unixy, to say the least.
>>>
>>> Namespaces at some point dealt with the same problem; they nowadays use
>>> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up their own
>>> hierarchy with normal unix tools, not hidden inside a C program. For
>>> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
>>> work out of the box nowadays.
>>
>> Yes, that doesn't work out of the box, but I also don't know how usable
>> that would really be. The idea is roughly similar to the paths
>> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
>> have a map/prog resource that you stick to a special inode so that you
>> can retrieve it at a later point in time from the same or different
>> processes through a new fd pointing to the resource from user side, so
>> that the bpf(2) syscall can be performed upon it.
>>
>> With Unix tools, you could still create/remove a hierarchy or unlink
>> those that have maps/progs. You are correct that tools that don't
>> implement bpf(2) currently cannot access the content behind it, since
>> bpf(2) manages access to the data itself. I did like the 2nd idea though,
>> mentioned in the commit log, but don't know how flexible we are in
>> terms of adding S_IFBPF to the UAPI.
>
> I don't think it should be a problem. You referred to the POSIX
> standard in your other mail, but I can't see any reason not to
> establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
> Solaris (e.g. Doors, Event Ports) are just examples of new modes
> being added.
>
> mknod /bpf/map/1 m 1 1
>
> :)
>
> Yes, maybe this is a better solution architecturally, instead of
> constructing a new filesystem.

Yeah, also 'man 2 stat' lists a couple of others used by various systems.

The pros of this approach would be that no new file system would be needed
and the special inode could be placed on top of any 'regular' file system
that would support special files. I do like that as well.

I'm wondering whether this would prevent us in future from opening access
to shell tools etc on that special file, but probably one could provide a
default set of file ops via init_special_inode() that could be overloaded
by the underlying fs if required.

>>> I don't know in terms of how many objects bpf should be able to handle
>>> and if such a bind-mount based solution would work, I guess not.
>>>
>>> In my opinion I still favor a user space approach. Subsystems which use
>>> ebpf in a way that no user space program needs to be running to control
>>> them would need to export the fds by themselves. E.g. something like
>>> sysfs/kobject for tc? The hierarchy would then be in control of the
>>> subsystem which could also create a proper naming hierarchy or maybe
>>> even use an already given one. Do most other eBPF users really need to
>>> persist file descriptors somewhere without user space control and pick
>>> them up later?
>>
>> I was thinking about a strict predefined hierarchy dictated by the kernel
>> as well, but was then considering a more flexible approach that could be
>> tailored freely to various use cases. A predefined hierarchy would most
>> likely need to be resolved per subsystem and it's not really easy to map
>> this properly. F.e. if the kernel would try to provide unique ids (as
>> opposed to having a name or annotation member through the syscall), it
>> could end up being quite cryptic. If we let the users choose names, I'm
>> not sure if a single hierarchy level would be enough. Then, additionally
>> you have facilities like tail calls that eBPF programs could do.
>
> I don't think that most subsystems need to expose those file
> descriptors. Seccomp probably will have a supervisor process running and
> per aggregation will also have a user space process running keeping the
> fd alive. So it is all about tc/sched.
>
> And I am not sure if tc will really need a filesystem to handle all
> this. The simplest approach is to just keep a name <-> fd mapping
> somewhere in the net/sched/ subsystem and use this for all tc users.

Solving this on a generic level eventually felt cleaner, where a subsystem
would have the choice of whether to make use of this or not. tc/sched has
currently two types BPF_PROG_TYPE_SCHED_{CLS,ACT}, so a common facility
would be needed for both subsystems. It's a bit hard to see what other
subsystems would come in future, and we could end up with multiple
subsystem-specific APIs essentially doing the same thing.

At the very beginning, there was also the idea to just reference such an
object by name, but it would need to be made available somewhere (procfs?)
to get a picture and manage them from an admin pov. Having some object
exposed as a file like other ipc building blocks seems better, imho.
Whether as special file or file system, yeah, that's a different question.

[...]
> I see that tail calls make this all very difficult to show which entity
> uses which ebpf entity in some way, as it looks like n:m relationships.

Yes, this is indeed the case.

>> In such cases, one could even craft relationships where a (strict auto
>> generated) tree representation would not be sufficient (f.e.
>> recirculation
>> up to a certain depth). The tail called programs could be changed
>> atomically during runtime, etc. The other issue related to a per
>> subsystem
>> representation is that bpf(2) is the central management interface for
>> creating/accessing maps/progs, and each subsystem then has its own little
>> interface to "install" them internally (f.e. via netlink, setsockopt(2),
>> etc). That means, with tail calls, only the 'root' programs are installed
>> there and further transactions would be needed in order to make
>> individual
>> subsystems aware, so they could potentially generate some hierarchy;
>> don't
>> know, it seems rather complex.
>
> I understand, this is really not suitable to represent in its entirety
> in sysfs or any kind of hierarchical structure right now. Either we
> limit it somewhat (Alexei will certainly intervene here) or one of your
> filesystem approaches will win.
>
> But I still wonder why people are so against user space dependencies?
>
> Another idea that I discussed with Daniel just to have it publicly
> available: a userspace helper would be called for every ebpf entity
> change so it could mirror or keep track of ebpf handles in user space. You
> can think along the lines of kernel/core_pattern. This would probably
> also depend on non-anon-inode usage of ebpf fds.

Yes, it seems so to me, but other than that, it would also require a user
space daemon managing all these, right? At least from the consensus at
Plumbers, running an extra daemon was considered rather impractical wrt
deployment (same with fuse).

Best,
Daniel
Alexei Starovoitov Oct. 16, 2015, 5:32 p.m. UTC | #8
On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> Hi Alexei,
>
> On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
>> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
>>> Namespaces at some point dealt with the same problem; they nowadays use
>>> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up their own
>>> hierarchy with normal unix tools, not hidden inside a C program. For
>>> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
>>> work out of the box nowadays.
>>
>> bind mounting of /proc/../fd was initially proposed by Andy and we've
>> looked at it thoroughly, but after discussion with Eric it became
>> apparent that it doesn't fit here. In the end, we need shell tools
>> to access maps.
>
> Oh yes, I want shell tools for this very much! Maybe even have things
> like strings, grep etc. work. :)

yes and the only way to get there is to have it done via fs.

>> Also I think you missed the hierarchy in this patch set _is_ built with
>> normal 'mkdir' and files are removed with 'rm'.
>
> I did not miss that, I am just concerned that if the kernel does not
> enforce such a hierarchy automatically it won't really happen.

if it's easier for the user to work with a single level of directories,
they should be able to do so. It's not the job of the kernel to enforce
how user space apps should be designed.

> Oh, tracing does not allow daemons. Why? I can only imagine embedded
> users, no?

yes and for networking: restartability and HA.
cannot really do that with fuse/daemons.

Thomas Graf Oct. 16, 2015, 5:37 p.m. UTC | #9
On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> > Oh, tracing does not allow daemons. Why? I can only imagine embedded
> > users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.
Alexei Starovoitov Oct. 16, 2015, 5:37 p.m. UTC | #10
On 10/16/15 10:27 AM, Daniel Borkmann wrote:
>>> but don't know how flexible we are in
>>> terms of adding S_IFBPF to the UAPI.
>>
>> I don't think it should be a problem. You referred to the POSIX
>> standard in your other mail, but I can't see any reason not to
>> establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
>> Solaris (e.g. Doors, Event Ports) are just examples of new modes
>> being added.
>>
>> mknod /bpf/map/1 m 1 1
>>
>> :)
>>
>> Yes, maybe this is a better solution architecturally, instead of
>> constructing a new filesystem.
>
> Yeah, also 'man 2 stat' lists a couple of others used by various systems.
>
> The pros of this approach would be that no new file system would be needed
> and the special inode could be placed on top of any 'regular' file system
> that would support special files. I do like that as well.

I don't like it at all for the reasons you've just stated:
'it will prevent us doing shell style access to such files'

> I'm wondering whether this would prevent us in future from opening access
> to shell tools etc on that special file, but probably one could provide a
> default set of file ops via init_special_inode() that could be overloaded
> by the underlying fs if required.

and also because adding a new S_ISSOCK-like bit for bpf feels like the
beginning of a nightmare, since sooner or later all filesystems would
need to have a check for it like they have for the sock type.

Alexei Starovoitov Oct. 16, 2015, 5:42 p.m. UTC | #11
On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
> Another question:
> Should multiple mount of the filesystem result in an empty fs (a new
> instance) or in one were one can see other ebpf-fs entities? I think
> Daniel wanted to already use the mountpoint as some kind of hierarchy
> delimiter. I would have used directories for that and multiple mounts
> would then have resulted in the same content of the filesystem. IMHO
> this would remove some ambiguity but then the question arises how this
> is handled in a namespaced environment. Was there some specific reason
> to do so?

That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's the advantage of having the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.

Daniel Borkmann Oct. 16, 2015, 5:56 p.m. UTC | #12
On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>> Another question:
>> Should multiple mounts of the filesystem result in an empty fs (a new
>> instance) or in one where one can see other ebpf-fs entities? I think
>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>> delimiter. I would have used directories for that and multiple mounts
>> would then have resulted in the same content of the filesystem. IMHO
>> this would remove some ambiguity but then the question arises how this
>> is handled in a namespaced environment. Was there some specific reason
>> to do so?
>
> That's an interesting question!
> I think all mounts should be independent.
> I can see tracing using one and networking using another one
> with different hierarchies suitable for their own use cases.
> What's the advantage of having the same content everywhere?
> Feels harder to manage, since different users would need to
> coordinate.

I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.

Cheers,
Daniel
Eric W. Biederman Oct. 16, 2015, 6:41 p.m. UTC | #13
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>> Another question:
>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>> instance) or in one where one can see other ebpf-fs entities? I think
>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>> delimiter. I would have used directories for that and multiple mounts
>>> would then have resulted in the same content of the filesystem. IMHO
>>> this would remove some ambiguity but then the question arises how this
>>> is handled in a namespaced environment. Was there some specific reason
>>> to do so?
>>
>> That's an interesting question!
>> I think all mounts should be independent.
>> I can see tracing using one and networking using another one
>> with different hierarchies suitable for their own use cases.
>> What's the advantage of having the same content everywhere?
>> Feels harder to manage, since different users would need to
>> coordinate.
>
> I initially had it as a mount_single() file system, where I was thinking
> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
> of that mount point, but for the same reasons above I lifted that restriction.

I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.

A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".

That allows all hierarchy etc to be handled in userspace, just as with
my files for namespaces.

I do not understand why you have presented to userspace a magic
filesystem that you allow binding to.  That is not what I intended to
suggest and I do not know how that makes any sense.

Eric


Alexei Starovoitov Oct. 16, 2015, 7:27 p.m. UTC | #14
On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>> Another question:
>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>> delimiter. I would have used directories for that and multiple mounts
>>>> would then have resulted in the same content of the filesystem. IMHO
>>>> this would remove some ambiguity but then the question arises how this
>>>> is handled in a namespaced environment. Was there some specific reason
>>>> to do so?
>>>
>>> That's an interesting question!
>>> I think all mounts should be independent.
>>> I can see tracing using one and networking using another one
>>> with different hierarchies suitable for their own use cases.
>>> What's the advantage of having the same content everywhere?
>>> Feels harder to manage, since different users would need to
>>> coordinate.
>>
>> I initially had it as a mount_single() file system, where I was thinking
>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>> of that mount point, but for the same reasons above I lifted that restriction.
>
> I am missing something.
>
> When I suggested using a filesystem it was my thought there would be
> exactly one superblock per map, and the map would be specified at mount
> time.  You clearly are not implementing that.

I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.
Also map today is an fd that belongs to a process. I cannot see
an api from a C program to do 'mount of FD' that wouldn't look like
an ugly hack.

> A filesystem per map makes sense as you have a key-value store with one
> file per key.
>
> The idea is that something resembling your bpf_pin_fd function would be
> the mount system call for the filesystem.
>
> The keys in the map could be read by "ls /mountpoint/".
> Key values could be inspected with "cat /mountpoint/key".

yes. that is still the goal for follow-up patches, but contained
within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such an api would be much cleaner from a C app's point of view.
Potentially we can allow a mount of a file created via BPF_PIN_FD
that will expand into keys/values.
All of that is in our future plans.
There, actually, the main contention point is 'how to represent keys
and values': whether the key is a hex representation, or we need some
pretty-printers via a format string or via a schema? etc, etc.
We tried a few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.

Eric W. Biederman Oct. 16, 2015, 7:53 p.m. UTC | #15
Alexei Starovoitov <ast@plumgrid.com> writes:

> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
[...]
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.

What do you mean won't scale?  You want to have a name per map/prog so the
basic complexity appears the same.  Is there some crucial interaction
between the persistent doodads you are placing on a filesystem that I am
missing?

Given the fact you don't normally need any persistence without a program
I am puzzled why "scaling" is an issue of any kind.  This is for a
comparatively rare case if I am not mistaken.

> Also map today is an fd that belongs to a process. I cannot see
> an api from a C program to do 'mount of FD' that wouldn't look like
> an ugly hack.

mount -t bpffs ... -o fd=1234 

That is not at all convoluted or hacky.  Especially compared to some of the
alternatives I am seeing.

It is no problem at all to wrap something like that in a nice function
call that has the exact same complexity of use as any of the other
options that are being explored to give something that starts out
as a file descriptor a name.
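
Such a wrapper might look like the following sketch (the fd= mount
option is hypothetical here, following the one-superblock-per-map
suggestion above):

  #include <stdio.h>
  #include <sys/mount.h>

  /* Hypothetical: give a map/prog fd a name by mounting a per-object
   * bpffs instance at the given path.
   */
  static int bpf_pin_fd(int fd, const char *mountpoint)
  {
          char opts[32];

          snprintf(opts, sizeof(opts), "fd=%d", fd);
          return mount("bpffs", mountpoint, "bpffs", 0, opts);
  }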

>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow-up patches, but contained
> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such an api would be much cleaner from a C app's point of view.
> Potentially we can allow a mount of a file created via BPF_PIN_FD
> that will expand into keys/values.
> All of that is in our future plans.
> There, actually, the main contention point is 'how to represent keys
> and values': whether the key is a hex representation, or we need some
> pretty-printers via a format string or via a schema? etc, etc.
> We tried a few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

My gut feel would be to keep it simple and use the same representation
you use in your existing system calls.  Certainly ordinary filenames are
keys of arbitrary binary data that can include everything except
a '\0' byte.  That they are human readable is a nice convention, but not
at all fundamental to what they are.

Eric

Daniel Borkmann Oct. 16, 2015, 7:54 p.m. UTC | #16
On 10/16/2015 09:27 PM, Alexei Starovoitov wrote:
> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>>> Another question:
>>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>>> delimiter. I would have used directories for that and multiple mounts
>>>>> would then have resulted in the same content of the filesystem. IMHO
>>>>> this would remove some ambiguity but then the question arises how this
>>>>> is handled in a namespaced environment. Was there some specific reason
>>>>> to do so?
>>>>
>>>> That's an interesting question!
>>>> I think all mounts should be independent.
>>>> I can see tracing using one and networking using another one
>>>> with different hierarchies suitable for their own use cases.
>>>> What's the advantage of having the same content everywhere?
>>>> Feels harder to manage, since different users would need to
>>>> coordinate.
>>>
>>> I initially had it as a mount_single() file system, where I was thinking
>>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>>> of that mount point, but for the same reasons above I lifted that restriction.
>>
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.
> Also map today is an fd that belongs to a process. I cannot see
> an api from a C program to do 'mount of FD' that wouldn't look like
> an ugly hack.
>
>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow-up patches, but contained
> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such an api would be much cleaner from a C app's point of view.
> Potentially we can allow a mount of a file created via BPF_PIN_FD
> that will expand into keys/values.

Yeah, sort of making this an optional debugging facility if anything
(maybe to just get a read-only snapshot view). Having maps with a very
large number of entries might end up being problematic on its own, as
might mapping potential future map candidates such as rhashtable.

> There, actually, the main contention point is 'how to represent keys
> and values': whether the key is a hex representation, or we need some
> pretty-printers via a format string or via a schema? etc, etc.
> We tried a few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

That is unclear as well, and would need to be sorted out to make it useful.
Alexei Starovoitov Oct. 16, 2015, 8:56 p.m. UTC | #17
On 10/16/15 12:53 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> [...]
>>> I am missing something.
>>>
>>> When I suggested using a filesystem it was my thought there would be
>>> exactly one superblock per map, and the map would be specified at mount
>>> time.  You clearly are not implementing that.
>>
>> I don't think it's practical to have sb per map, since that would mean
>> sb per prog and that won't scale.
>
> What do you mean won't scale?  You want to have a name per map/prog so the
> basic complexity appears the same.  Is there some crucial interaction
> between the persistent doodads you are placing on a filesystem that I am
> missing?
>
> Given the fact you don't normally need any persistence without a program
> I am puzzled why "scaling" is an issue of any kind.  This is for a
> comparatively rare case if I am not mistaken.

representing a map as a directory tree with files as keys is indeed
'rare', since it's mainly for debugging and slow accesses, but the
'pin_fd' functionality is now popping up everywhere. Mainly because in
things like openstack there are tons of disjoint libraries written in
different languages, and the only thing they have in common is the
kernel. So pin_fd/new_fd is a mandatory feature.

>> Also map today is an fd that belongs to a process. I cannot see
>> an api from a C program to do 'mount of FD' that wouldn't look like
>> an ugly hack.
>
> mount -t bpffs ... -o fd=1234
>
> That is not at all convoluted or hacky.  Especially compared to some of the
> alternatives I am seeing.
>
> It is no problem at all to wrap something like that in a nice function
> call that has the exact same complexity of use as any of the other
> options that are being explored to give something that starts out
> as a file descriptor a name.

Frankly, I don't think parsing an 'fd=1234' string is a clean api, but
before we argue about the fs philosophy of passing options, let's
get on the same page with requirements.
The first goal that this patch is solving is providing an ability
to 'pin' an FD, so that the map/prog won't disappear when the user
app exits.
The second goal, for future patches, is to expose map internals as a
directory structure.
These two goals are independent.
We can argue about the api for the 2nd, whether it's a mount with an
fd=1234 string or something else, but for the first, mount style
doesn't make sense.

>>> A filesystem per map makes sense as you have a key-value store with one
>>> file per key.
>>>
>>> The idea is that something resembling your bpf_pin_fd function would be
>>> the mount system call for the filesystem.
>>>
>>> The keys in the map could be read by "ls /mountpoint/".
>>> Key values could be inspected with "cat /mountpoint/key".
>>
>> yes. that is still the goal for follow-up patches, but contained
>> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
>> would create files for keys in a map and allow 'cat' via open/read.
>> Such an api would be much cleaner from a C app's point of view.
>> Potentially we can allow a mount of a file created via BPF_PIN_FD
>> that will expand into keys/values.
>> All of that is in our future plans.
>> There, actually, the main contention point is 'how to represent keys
>> and values': whether the key is a hex representation, or we need some
>> pretty-printers via a format string or via a schema? etc, etc.
>> We tried a few ideas of representing keys in our fuse implementations,
>> but don't have an agreement yet.
>
> My gut feel would be to keep it simple and use the same representation
> you use in your existing system calls.  Certainly ordinary filenames are
> keys of arbitrary binary data that can included everything except
> a '\0' byte.  That they are human readable is a nice convention, but not
> at all fundamental to what they are.

that doesn't work. Map keys are never human readable; they're arbitrary
binary data. That's why representing them as file names is not trivial.
Some pretty-printer is needed.
Again, that is the 2nd goal of bpffs in general. We cannot really solve
it now, because we cannot say 'let's represent keys like X and work
from there', since that will become kernel ABI and we won't be able to
change that.
It's equally not clear that thousands of keys can even work as files.
So there's still quite a bit of brainstorming to do for this 2nd goal.
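
F.e. one trivial strawman, purely illustrative and not part of this
patch, would be hex-encoding the raw key bytes into a file name:

  #include <stdio.h>
  #include <stddef.h>

  /* Strawman pretty-printer: render an arbitrary binary map key as a
   * hex file name; 'out' must hold 2 * len + 1 bytes.
   */
  static void key_to_name(const unsigned char *key, size_t len, char *out)
  {
          size_t i;

          for (i = 0; i < len; i++)
                  sprintf(out + 2 * i, "%02x", key[i]);
  }

but even that would bake one representation into the ABI.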

Eric W. Biederman Oct. 16, 2015, 11:44 p.m. UTC | #18
Alexei Starovoitov <ast@plumgrid.com> writes:

> We can argue about the api for the 2nd, whether it's mount with an
> fd=1234 string or something else, but for the first, mount style
> doesn't make sense.

Why does mount not make sense?  It is exactly what you are looking for,
so why does it not make sense?

Eric
Alexei Starovoitov Oct. 17, 2015, 2:43 a.m. UTC | #19
On 10/16/15 4:44 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> We can argue about the api for the 2nd, whether it's mount with an
>> fd=1234 string or something else, but for the first, mount style
>> doesn't make sense.
>
> Why does mount not make sense?  It is exactly what you are looking for
> so why does it not make sense?

hmm, how do you get a new fd back after mounting it?
Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
but now it's more convoluted and empty mounts are everywhere.

Daniel Borkmann Oct. 17, 2015, 12:28 p.m. UTC | #20
On 10/17/2015 04:43 AM, Alexei Starovoitov wrote:
> On 10/16/15 4:44 PM, Eric W. Biederman wrote:
>> Alexei Starovoitov <ast@plumgrid.com> writes:
>>
>>> We can argue about the api for the 2nd, whether it's mount with an
>>> fd=1234 string or something else, but for the first, mount style
>>> doesn't make sense.
>>
>> Why does mount not make sense?  It is exactly what you are looking for
>> so why does it not make sense?
>
> hmm, how do you get a new fd back after mounting it?
> Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
> but now it's more convoluted and empty mounts are everywhere.

That would be my understanding as well, but as Alexei already said,
these are two different issues; it would be step 2 (let me get back
to that further below). But in any case, I don't really like dumping
key/value pairs somewhere as files. You have binary blobs as both, and
if your application has, say, a lookup key (for whatever reason) of
several cachelines, it all ends up getting rather messy instead of
being really useful for non-bpf(2)-aware cmdline tools to deal with.

Anyway, another idea I've been brainstorming about with Hannes today
is the following:

We register two major numbers, one for eBPF maps (X), one for eBPF
progs (Y). A user can either, via cmdline, call something like
'mknod /dev/bpf/maps/map_pkts c X Z' to create a special character
device, or alternatively do so out of an application through the
mknod(2) syscall (f.e. tc when setting up maps/progs internally from
the obj file for a classifier).

Then, we still have 2 eBPF commands to add to the bpf(2) syscall, say
(for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that
created a map (or prog) already has the map fd, and after mknod(2) it
can open(2) the special file to get the special file fd. Then it can
call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr)) where
attr looks like:

   union bpf_attr attr = {
     .bpf_fd    = bpf_fd,
     .dev_fd    = dev_fd,
   };

The bpf(2) syscall can check whether dev_fd belongs to an eBPF special
file, and it can then copy over file->private_data from the bpf_fd
to the dev_fd's underlying file, where the private_data from the
bpf_fd, as we know, already points to a proper bpf_map/bpf_prog
structure. The map/prog would then get ref'ed and live on for the char
device's lifetime. No special hashtable, gc, etc needed. The char
device has fops that we can define ourselves, and unlinking would drop
the ref from its private_data.
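
To illustrate the intended flow, here is a sketch only: BPF_BIND_DEV
and the bpf_fd/dev_fd attributes are part of this proposal and do not
exist yet, bpf() is the usual syscall wrapper, and the path and major
number are made-up placeholders:

   /* create the special file and open it */
   mknod("/dev/bpf/maps/map_pkts", S_IFCHR | 0600, makedev(map_major, 0));
   int dev_fd = open("/dev/bpf/maps/map_pkts", O_RDONLY);

   /* bind the map behind bpf_fd to the device node */
   union bpf_attr attr = {
           .bpf_fd = bpf_fd,   /* fd from BPF_MAP_CREATE */
           .dev_fd = dev_fd,
   };
   bpf(BPF_BIND_DEV, &attr, sizeof(attr));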

Now to the other part: BPF_FETCH_DEV would work similarly. The
application opens the device, and fills bpf_attr as follows again:

   union bpf_attr attr = {
     .bpf_fd    = 0,
     .dev_fd    = dev_fd,
   };

This would allow us to look up the map/prog from the dev_fd's file->
private_data, and to install a new fd via bpf_{map,prog}_new_fd() that
is returned from bpf(2) for bpf-related access. The remaining fops
of the char device could still be reserved for possibilities like
debugging in the future.
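
Again purely as a sketch under the same assumptions, the retrieval
side in a second process would then be:

   /* another process: fetch a new bpf fd back from the device node */
   int dev_fd = open("/dev/bpf/maps/map_pkts", O_RDONLY);
   union bpf_attr attr = {
           .bpf_fd = 0,        /* unused in the fetch direction */
           .dev_fd = dev_fd,
   };
   int map_fd = bpf(BPF_FETCH_DEV, &attr, sizeof(attr));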

In the future (2nd step), we could either use Eric's idea and then do
something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump
attributes or other properties of such a special file to some location
for inspection, or we could use kobjects attached to the device if the
fops from the cdev should not be sufficient.

So, closing the loop on the special files where there were concerns:

This won't preclude a future shell-style access possibility, and
it would also not end up as the nightmare you mentioned regarding the
S_ISSOCK-like bit in the other email.

The pinning mechanism would not require an extra file system to be
mounted somewhere, and yet the user can define an arbitrary hierarchy
in which to put the special files, as this facility already exists. An
approach like this looks overall cleaner to me, and would most likely
be realizable in fewer lines of code as well.

Thoughts?

Cheers,
Daniel
Alexei Starovoitov Oct. 18, 2015, 2:20 a.m. UTC | #21
On 10/17/15 5:28 AM, Daniel Borkmann wrote:
...
> The pinning mechanism would not require an extra file system to be
> mounted somewhere, and yet the user can define an arbitrary hierarchy
> in which to put the special files, as this facility already exists. An
> approach like this looks overall cleaner to me, and would most likely
> be realizable in fewer lines of code as well.
>
> Thoughts?

that indeed sounds cleaner (fewer lines of code, no fs, etc), but
I don't see how it will work yet.
For a chardev with our own ops we can be triggered on open and close
of that chardev, so won't the replaced private_data be cleared when
the user process does close(dev_fd)? There is no fop for unlink
either; that is an fs-only property?

Daniel Borkmann Oct. 18, 2015, 3:03 p.m. UTC | #22
On 10/18/2015 04:20 AM, Alexei Starovoitov wrote:
...
> that indeed sounds cleaner, less lines of code, no fs, etc, but
> I don't see how it will work yet.

I'll have some code ready very soon to show the concept. Will post it here
tonight, stay tuned. ;)

Cheers,
Daniel
diff mbox

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ae6f77..bb82764 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -8,8 +8,10 @@ 
 #define _LINUX_BPF_H 1
 
 #include <uapi/linux/bpf.h>
+
 #include <linux/workqueue.h>
 #include <linux/file.h>
+#include <linux/fs.h>
 
 struct bpf_map;
 
@@ -137,6 +139,17 @@  struct bpf_prog_aux {
 	};
 };
 
+enum bpf_fd_type {
+	BPF_FD_TYPE_PROG,
+	BPF_FD_TYPE_MAP,
+};
+
+union bpf_any {
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *raw_ptr;
+};
+
 struct bpf_array {
 	struct bpf_map map;
 	u32 elem_size;
@@ -172,6 +185,14 @@  void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type);
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type);
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type);
+
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..f9b412c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,16 @@  struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
 	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
 	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
 	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
 	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
 	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
 	BPF_PROG_LOAD,
+	BPF_PIN_FD,
+	BPF_NEW_FD,
 };
 
 enum bpf_map_type {
@@ -160,6 +126,11 @@  union bpf_attr {
 		__aligned_u64	log_buf;	/* user supplied buffer */
 		__u32		kern_version;	/* checked when prog_type=kprobe */
 	};
+
+	struct { /* anonymous struct used by BPF_{PIN,NEW}_FD command */
+		__u32		fd;
+		__aligned_u64	pathname;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 7b1425a..c1c5cf7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,5 +75,6 @@ 
 #define ANON_INODE_FS_MAGIC	0x09041934
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
+#define BPFFS_MAGIC		0xcafe4a11
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..3586d28 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -42,6 +42,9 @@ 
 #define XATTR_USER_PREFIX "user."
 #define XATTR_USER_PREFIX_LEN (sizeof(XATTR_USER_PREFIX) - 1)
 
+#define XATTR_BPF_PREFIX "bpf."
+#define XATTR_BPF_PREFIX_LEN (sizeof(XATTR_BPF_PREFIX) - 1)
+
 /* Security namespace */
 #define XATTR_EVM_SUFFIX "evm"
 #define XATTR_NAME_EVM XATTR_SECURITY_PREFIX XATTR_EVM_SUFFIX
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..1327258 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@ 
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
new file mode 100644
index 0000000..5cef673
--- /dev/null
+++ b/kernel/bpf/inode.c
@@ -0,0 +1,614 @@ 
+/*
+ * Minimal file system backend for special inodes holding eBPF maps and
+ * programs, used by eBPF fd pinning.
+ *
+ * (C) 2015 Daniel Borkmann <daniel@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/seq_file.h>
+#include <linux/fsnotify.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/security.h>
+#include <linux/xattr.h>
+
+#define BPFFS_DEFAULT_MODE 0700
+
+enum {
+	BPF_OPT_UID,
+	BPF_OPT_GID,
+	BPF_OPT_MODE,
+	BPF_OPT_ERR,
+};
+
+struct bpf_mnt_opts {
+	kuid_t uid;
+	kgid_t gid;
+	umode_t mode;
+};
+
+struct bpf_fs_info {
+	struct bpf_mnt_opts mnt_opts;
+};
+
+struct bpf_dir_state {
+	unsigned long flags;
+};
+
+static const match_table_t bpf_tokens = {
+	{ BPF_OPT_UID, "uid=%u" },
+	{ BPF_OPT_GID, "gid=%u" },
+	{ BPF_OPT_MODE, "mode=%o" },
+	{ BPF_OPT_ERR, NULL },
+};
+
+static const struct inode_operations bpf_dir_iops;
+static const struct inode_operations bpf_prog_iops;
+static const struct inode_operations bpf_map_iops;
+
+static struct inode *bpf_get_inode(struct super_block *sb,
+				   const struct inode *dir,
+				   umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (!inode)
+		return ERR_PTR(-ENOSPC);
+
+	inode->i_ino = get_next_ino();
+	inode->i_atime = CURRENT_TIME;
+	inode->i_mtime = inode->i_atime;
+	inode->i_ctime = inode->i_atime;
+	inode_init_owner(inode, dir, mode);
+
+	return inode;
+}
+
+static struct inode *bpf_mknod(struct inode *dir, umode_t mode)
+{
+	return bpf_get_inode(dir->i_sb, dir, mode);
+}
+
+static bool bpf_dentry_name_reserved(const struct dentry *dentry)
+{
+	return strchr(dentry->d_name.name, '.');
+}
+
+enum {
+	/* Directory state is 'terminating', so no subdirectories
+	 * are allowed anymore in this directory. This is being
+	 * reserved so that in future, auto-generated directories
+	 * could be added along with the special map/prog inodes.
+	 */
+	BPF_DSTATE_TERM_BIT,
+};
+
+static bool bpf_inode_is_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return test_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static bool bpf_inode_make_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return dir->i_nlink != 2 ||
+	       test_and_set_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static void bpf_inode_undo_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	clear_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static int bpf_inode_type(const struct inode *inode, enum bpf_fd_type *i_type)
+{
+	if (inode->i_op == &bpf_prog_iops)
+		*i_type = BPF_FD_TYPE_PROG;
+	else if (inode->i_op == &bpf_map_iops)
+		*i_type = BPF_FD_TYPE_MAP;
+	else
+		return -EACCES;
+
+	return 0;
+}
+
+static int bpf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	enum bpf_fd_type type;
+	bool is_fd, drop_ref;
+	int ret;
+
+	is_fd = bpf_inode_type(inode, &type) == 0;
+	drop_ref = inode->i_nlink == 1;
+
+	ret = simple_unlink(dir, dentry);
+	if (!ret && is_fd && drop_ref) {
+		union bpf_any raw;
+
+		raw.raw_ptr = i_private;
+		bpf_any_put(raw, type);
+		bpf_inode_undo_term(dir);
+	}
+
+	return ret;
+}
+
+static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct bpf_dir_state *state;
+	struct inode *inode;
+
+	if (bpf_inode_is_term(dir))
+		return -EPERM;
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		return -ENOSPC;
+
+	inode = bpf_mknod(dir, dir->i_mode);
+	if (IS_ERR(inode)) {
+		kfree(state);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = state;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_fop = &simple_dir_operations;
+
+	inc_nlink(inode);
+	inc_nlink(dir);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	int ret;
+
+	ret = simple_rmdir(dir, dentry);
+	if (!ret)
+		kfree(i_private);
+
+	return ret;
+}
+
+static const struct inode_operations bpf_dir_iops = {
+	.lookup		= simple_lookup,
+	.mkdir		= bpf_mkdir,
+	.rmdir		= bpf_rmdir,
+	.unlink		= bpf_unlink,
+};
+
+#define XATTR_TYPE_SUFFIX "type"
+
+#define XATTR_NAME_BPF_TYPE (XATTR_BPF_PREFIX XATTR_TYPE_SUFFIX)
+#define XATTR_NAME_BPF_TYPE_LEN (sizeof(XATTR_NAME_BPF_TYPE) - 1)
+
+#define XATTR_VALUE_MAP "map"
+#define XATTR_VALUE_PROG "prog"
+
+static ssize_t bpf_getxattr(struct dentry *dentry, const char *name,
+			    void *buffer, size_t size)
+{
+	enum bpf_fd_type type;
+	ssize_t ret;
+
+	if (strncmp(name, XATTR_NAME_BPF_TYPE, XATTR_NAME_BPF_TYPE_LEN))
+		return -ENODATA;
+
+	if (bpf_inode_type(d_inode(dentry), &type))
+		return -ENODATA;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = sizeof(XATTR_VALUE_PROG);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = sizeof(XATTR_VALUE_MAP);
+		break;
+	}
+
+	if (buffer) {
+		if (size < ret)
+			return -ERANGE;
+
+		switch (type) {
+		case BPF_FD_TYPE_PROG:
+			strncpy(buffer, XATTR_VALUE_PROG, ret);
+			break;
+		case BPF_FD_TYPE_MAP:
+			strncpy(buffer, XATTR_VALUE_MAP, ret);
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static ssize_t bpf_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	ssize_t len, used = 0;
+
+	len = security_inode_listsecurity(d_inode(dentry), buffer, size);
+	if (len < 0)
+		return len;
+
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		buffer += len;
+	}
+
+	len = XATTR_NAME_BPF_TYPE_LEN + 1;
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		memcpy(buffer, XATTR_NAME_BPF_TYPE, len);
+		buffer += len;
+	}
+
+	return used;
+}
+
+/* Special inodes handling map/progs currently don't allow for syscalls
+ * such as open/read/write/etc. We use the same bpf_{map,prog}_new_fd()
+ * facility for installing an fd to the user as we do on BPF_MAP_CREATE
+ * and BPF_PROG_LOAD, so applications using bpf(2) don't see any
+ * change in behaviour. In future, the set of open/read/write/etc could
+ * be used f.e. for implementing things like debugging facilities on the
+ * underlying map/prog that would work with non-bpf(2) aware tooling.
+ */
+static const struct inode_operations bpf_prog_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static const struct inode_operations bpf_map_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static int bpf_mkmap(struct inode *dir, struct dentry *dentry,
+		     struct bpf_map *map, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = map;
+	inode->i_op = &bpf_map_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_mkprog(struct inode *dir, struct dentry *dentry,
+		      struct bpf_prog *prog, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = prog;
+	inode->i_op = &bpf_prog_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static const struct bpf_mnt_opts *bpf_sb_mnt_opts(const struct super_block *sb)
+{
+	const struct bpf_fs_info *bfi = sb->s_fs_info;
+
+	return &bfi->mnt_opts;
+}
+
+static int bpf_parse_options(char *opt_data, struct bpf_mnt_opts *opts)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int opt_val, token;
+	char *opt_ptr;
+	kuid_t uid;
+	kgid_t gid;
+
+	opts->mode = BPFFS_DEFAULT_MODE;
+
+	while ((opt_ptr = strsep(&opt_data, ",")) != NULL) {
+		if (!*opt_ptr)
+			continue;
+
+		token = match_token(opt_ptr, bpf_tokens, args);
+		switch (token) {
+		case BPF_OPT_UID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			uid = make_kuid(current_user_ns(), opt_val);
+			if (!uid_valid(uid))
+				return -EINVAL;
+
+			opts->uid = uid;
+			break;
+		case BPF_OPT_GID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			gid = make_kgid(current_user_ns(), opt_val);
+			if (!gid_valid(gid))
+				return -EINVAL;
+
+			opts->gid = gid;
+			break;
+		case BPF_OPT_MODE:
+			if (match_octal(&args[0], &opt_val))
+				return -EINVAL;
+
+			opts->mode = opt_val & S_IALLUGO;
+			break;
+		default:
+			return -EINVAL;
+		};
+	}
+
+	return 0;
+}
+
+static int bpf_apply_options(struct super_block *sb)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(sb);
+	struct inode *inode = sb->s_root->d_inode;
+
+	inode->i_mode &= ~S_IALLUGO;
+	inode->i_mode |= opts->mode;
+
+	inode->i_uid = opts->uid;
+	inode->i_gid = opts->gid;
+
+	return 0;
+}
+
+static int bpf_show_options(struct seq_file *m, struct dentry *root)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(root->d_sb);
+
+	if (!uid_eq(opts->uid, GLOBAL_ROOT_UID))
+		seq_printf(m, ",uid=%u",
+			   from_kuid_munged(&init_user_ns, opts->uid));
+
+	if (!gid_eq(opts->gid, GLOBAL_ROOT_GID))
+		seq_printf(m, ",gid=%u",
+			   from_kgid_munged(&init_user_ns, opts->gid));
+
+	if (opts->mode != BPFFS_DEFAULT_MODE)
+		seq_printf(m, ",mode=%o", opts->mode);
+
+	return 0;
+}
+
+static int bpf_remount(struct super_block *sb, int *flags, char *opt_data)
+{
+	struct bpf_fs_info *bfi = sb->s_fs_info;
+	int ret;
+
+	sync_filesystem(sb);
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		return ret;
+
+	bpf_apply_options(sb);
+	return 0;
+}
+
+static const struct super_operations bpf_super_ops = {
+	.statfs		= simple_statfs,
+	.remount_fs	= bpf_remount,
+	.show_options	= bpf_show_options,
+};
+
+static int bpf_fill_super(struct super_block *sb, void *opt_data, int silent)
+{
+	static struct tree_descr bpf_files[] = { { "" } };
+	struct bpf_dir_state *state;
+	struct bpf_fs_info *bfi;
+	struct inode *inode;
+	int ret = -ENOMEM;
+
+	bfi = kzalloc(sizeof(*bfi), GFP_KERNEL);
+	if (!bfi)
+		return ret;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		goto err_bfi;
+
+	save_mount_options(sb, opt_data);
+	sb->s_fs_info = bfi;
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		goto err_state;
+
+	ret = simple_fill_super(sb, BPFFS_MAGIC, bpf_files);
+	if (ret)
+		goto err_state;
+
+	sb->s_op = &bpf_super_ops;
+
+	inode = sb->s_root->d_inode;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_private = state;
+
+	bpf_apply_options(sb);
+
+	return 0;
+err_state:
+	kfree(state);
+err_bfi:
+	kfree(bfi);
+	return ret;
+}
+
+static void bpf_kill_super(struct super_block *sb)
+{
+	kfree(sb->s_root->d_inode->i_private);
+	kfree(sb->s_fs_info);
+	kill_litter_super(sb);
+}
+
+static struct dentry *bpf_mount(struct file_system_type *type,
+				int flags, const char *dev_name,
+				void *opt_data)
+{
+	return mount_nodev(type, flags, opt_data, bpf_fill_super);
+}
+
+static struct file_system_type bpf_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "bpf",
+	.mount		= bpf_mount,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
+MODULE_ALIAS_FS("bpf");
+MODULE_ALIAS_FS("bpffs");
+
+static int __init bpf_init(void)
+{
+	return register_filesystem(&bpf_fs_type);
+}
+fs_initcall(bpf_init);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type)
+{
+	umode_t i_mode = S_IFREG | S_IRUSR | S_IWUSR;
+	struct inode *dir_inode;
+	struct dentry *dentry;
+	struct path path;
+	int ret;
+
+	dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		return ret;
+	}
+
+	ret = security_path_mknod(&path, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	dir_inode = d_inode(path.dentry);
+	if (dir_inode->i_op != &bpf_dir_iops) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = security_inode_mknod(dir_inode, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_mkprog(dir_inode, dentry, raw.prog, i_mode);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_mkmap(dir_inode, dentry, raw.map, i_mode);
+		break;
+	}
+out:
+	done_path_create(&path, dentry);
+	return ret;
+}
+
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type)
+{
+	struct inode *inode;
+	union bpf_any raw;
+	struct path path;
+	int ret;
+
+	ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto out;
+
+	inode = d_backing_inode(path.dentry);
+	ret = inode_permission(inode, MAY_WRITE);
+	if (ret)
+		goto out_path;
+
+	ret = bpf_inode_type(inode, type);
+	if (ret)
+		goto out_path;
+
+	raw.raw_ptr = inode->i_private;
+	if (!raw.raw_ptr) {
+		ret = -EACCES;
+		goto out_path;
+	}
+
+	bpf_any_get(raw, *type);
+	touch_atime(&path);
+	path_put(&path);
+
+	return raw;
+out_path:
+	path_put(&path);
+out:
+	raw.raw_ptr = ERR_PTR(ret);
+	return raw;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3fff82c..b4a93b8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -679,6 +679,108 @@  free_prog_nouncharge:
 	return err;
 }
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		atomic_inc(&raw.prog->aux->refcnt);
+		break;
+	case BPF_FD_TYPE_MAP:
+		atomic_inc(&raw.map->refcnt);
+		break;
+	}
+}
+
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		bpf_prog_put(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		bpf_map_put(raw.map);
+		break;
+	}
+}
+
+#define BPF_PIN_FD_LAST_FIELD	pathname
+#define BPF_NEW_FD_LAST_FIELD	BPF_PIN_FD_LAST_FIELD
+
+static int bpf_pin_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if (CHECK_ATTR(BPF_PIN_FD))
+		return -EINVAL;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	type = BPF_FD_TYPE_MAP;
+	raw.map = bpf_map_get(attr->fd);
+	if (IS_ERR(raw.map)) {
+		type = BPF_FD_TYPE_PROG;
+		raw.prog = bpf_prog_get(attr->fd);
+		if (IS_ERR(raw.prog)) {
+			ret = PTR_ERR(raw.raw_ptr);
+			goto out;
+		}
+	}
+
+	ret = bpf_fd_inode_add(pathname, raw, type);
+	if (ret != 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
+static int bpf_new_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if ((CHECK_ATTR(BPF_NEW_FD)) || attr->fd != 0)
+		return -EINVAL;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	raw = bpf_fd_inode_get(pathname, &type);
+	if (IS_ERR(raw.raw_ptr)) {
+		ret = PTR_ERR(raw.raw_ptr);
+		goto out;
+	}
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_prog_new_fd(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_map_new_fd(raw.map);
+		break;
+	default:
+		/* Shut up gcc. */
+		ret = -ENOENT;
+		break;
+	}
+
+	if (ret < 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -739,6 +841,12 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_PROG_LOAD:
 		err = bpf_prog_load(&attr);
 		break;
+	case BPF_PIN_FD:
+		err = bpf_pin_fd(&attr);
+		break;
+	case BPF_NEW_FD:
+		err = bpf_new_fd(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;