
[net-next,3/4] bpf: add support for persistent maps/progs

Message ID ab1fceb2d68876d89bb2ebb3d2b45486d3cf2388.1444956943.git.daniel@iogearbox.net
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Daniel Borkmann Oct. 16, 2015, 1:09 a.m. UTC
This work adds support for "persistent" eBPF maps/programs. Here,
"persistent" means that maps/programs have a facility that lets them
survive process termination. This is desired by various eBPF subsystem
users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object and extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate against
this resource, and therefore maps cannot easily be shared, f.e. between
the ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (f.e. programs for tail calls) among various
processes.
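
As an illustration, the sending side of such a workaround essentially
boils down to an SCM_RIGHTS transfer over a connected UDS (a sketch
only; function and variable names are made up):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Sketch: hand a bpf map/prog fd to a management daemon over a
   * connected Unix domain socket via SCM_RIGHTS.
   */
  static int send_bpf_fd(int sock, int bpf_fd)
  {
          char dummy = '\0';
          struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
          union {
                  char buf[CMSG_SPACE(sizeof(int))];
                  struct cmsghdr align;
          } u;
          struct msghdr msg = {
                  .msg_iov        = &iov,
                  .msg_iovlen     = 1,
                  .msg_control    = u.buf,
                  .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type  = SCM_RIGHTS;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &bpf_fd, sizeof(int));

          return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }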

We've been brainstorming on how we could tackle this issue, and various
approaches have been tried out so far:

The first idea was to implement a fuse backend that proxies bpf(2)
syscalls that create/load maps and programs, where the fuse
implementation would hold these fds internally and pass them to
things like tc via UDS socket passing. There could be various fuse
implementations tailored to a particular eBPF subsystem's needs. The
advantage is that this would shift the complexity entirely into user
space, but with a couple of drawbacks along the way. One is that
fuse needs extra library dependencies, and it also doesn't resolve the
issue that an extra daemon needs to run in the background. At Linux
Plumbers 2015, we've all concluded eventually that using fuse is not
an option and an in-kernel solution is needed.

The next idea I've tried out was an extension to the bpf(2) syscall
that works roughly the way we bind(2)/connect(2) to paths backed
by special inodes in the case of UDS. This works on top of any file
system that allows creating special files, where the newly created
inode operations are similar to those of S_IFSOCK inodes. The inode
would be instrumented as a lookup key in an rhashtable backend with
the prog/map stored as value. We found a couple of disadvantages to
this approach. Since the underlying implementation of the inode can
differ among file systems, we would need a periodic garbage collection
for dropping the rhashtable entry and the references to the maps/progs
held with it (it could be done by piggybacking on the rhashtable's
rehashing, though). The other issue is that this requires adding
something like S_IFBPF (to clearly identify this inode as special to
eBPF), where the available space in S_IFMT is already severely limited
and could possibly clash with future POSIX values being allocated
there. Moreover, this approach might not be flexible enough from a
functionality point of view; f.e. future debugging facilities etc.
could be added that really wouldn't fit into the bpf(2) syscall
(however, the bpf(2) syscall strictly stays the central place to
manage eBPF things).
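
For comparison, the UDS analogue referred to above, where bind(2)
creates a path-backed special S_IFSOCK inode, looks roughly like this
(the path handling is illustrative):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  /* bind(2) on AF_UNIX creates a special S_IFSOCK inode at the given
   * path; the S_IFBPF idea would have worked analogously for
   * maps/progs.
   */
  static int make_pinned_socket(const char *path)
  {
          struct sockaddr_un addr = { .sun_family = AF_UNIX };
          int fd = socket(AF_UNIX, SOCK_STREAM, 0);

          if (fd < 0)
                  return -1;
          strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                  return -1;
          return fd;
  }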

This eventually leads us to this patch, which implements a minimal
eBPF file system. The idea is a bit similar, except that these
inodes reside at one or multiple mount points. A directory hierarchy
can be tailored by the various subsystem users to a specific
application use case, with maps/progs pinned inside it. Two new eBPF
commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
order to create one or multiple special inodes from an existing file
descriptor that points to a map/program (we call it eBPF fd pinning),
or to create a new file descriptor from an existing special inode.
BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
can also be used unprivileged, given appropriate permissions on
the path.
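
To sketch the intended usage from an application's point of view (the
bpf_attr member names below are assumptions for illustration, not
necessarily the final UAPI; see the patch itself for the layout):

   /* Process A: pin an existing map fd (from a prior BPF_MAP_CREATE)
    * at a path inside the eBPF fs; path is illustrative.
    */
   union bpf_attr attr = {
     .fd       = map_fd,
     .pathname = (__u64)(unsigned long)"/sys/fs/bpf/tc/map_pkts",
   };

   bpf(BPF_PIN_FD, &attr, sizeof(attr));

   /* Process B, possibly much later: retrieve a new fd to the same
    * map from the pinned special inode.
    */
   int new_fd = bpf(BPF_NEW_FD, &attr, sizeof(attr));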

The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU, but applications
can also inspect the meta data after calling BPF_NEW_FD.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        |  21 ++
 include/uapi/linux/bpf.h   |  45 +---
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/xattr.h |   3 +
 kernel/bpf/Makefile        |   4 +-
 kernel/bpf/inode.c         | 614 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 108 ++++++++
 7 files changed, 758 insertions(+), 38 deletions(-)
 create mode 100644 kernel/bpf/inode.c

Comments

Hannes Frederic Sowa Oct. 16, 2015, 10:25 a.m. UTC | #1
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, except that these
> inodes reside at one or multiple mount points. A directory hierarchy
> can be tailored by the various subsystem users to a specific
> application use case, with maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be used unprivileged, given appropriate permissions on
> the path.

In my opinion this is very un-unixy, to say the least.

Namespaces at some point dealt with the same problem; they nowadays use
bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up their own
hierarchy with normal unix tools, not hidden inside a C program. For
file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
work out of the box nowadays.
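
(For reference, the namespace trick mentioned here works roughly as
follows; the paths are illustrative:

  touch /var/run/ns/keep
  mount --bind /proc/$$/ns/net /var/run/ns/keep

i.e. the bind mount keeps the namespace inode alive for as long as the
mount exists.)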

I don't know in terms of how many objects bpf should be able to handle
and if such a bind-mount based solution would work, I guess not.

In my opinion I still favor a user space approach. Subsystems which use
ebpf in a way that no user space program needs to be running to control
them would need to export the fds by themselves. E.g. something like
sysfs/kobject for tc? The hierarchy would then be in control of the
subsystem which could also create a proper naming hierarchy or maybe
even use an already given one. Do most other eBPF users really need to
persist file descriptors somewhere without user space control and pick
them up later? 

Sorry for the rant and thanks for posting this patchset,
Hannes
Daniel Borkmann Oct. 16, 2015, 1:36 p.m. UTC | #2
On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>> This eventually leads us to this patch, which implements a minimal
>> eBPF file system. The idea is a bit similar, except that these
>> inodes reside at one or multiple mount points. A directory hierarchy
>> can be tailored by the various subsystem users to a specific
>> application use case, with maps/progs pinned inside it. Two new eBPF
>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>> order to create one or multiple special inodes from an existing file
>> descriptor that points to a map/program (we call it eBPF fd pinning),
>> or to create a new file descriptor from an existing special inode.
>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>> can also be used unprivileged, given appropriate permissions on
>> the path.
>
> In my opinion this is very un-unixy, to say the least.
>
> Namespaces at some point dealt with the same problem; they nowadays use
> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up their own
> hierarchy with normal unix tools, not hidden inside a C program. For
> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> work out of the box nowadays.

Yes, that doesn't work out of the box, but I also don't know how usable
that would really be. The idea is roughly similar to the paths
passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
have a map/prog resource that you stick to a special inode so that you
can retrieve it at a later point in time from the same or different
processes through a new fd pointing to the resource from user side, so
that the bpf(2) syscall can be performed upon it.

With Unix tools, you could still create/remove a hierarchy or unlink
those that have maps/progs. You are correct that tools that don't
implement bpf(2) currently cannot access the content behind it, since
bpf(2) manages access to the data itself. I did like the 2nd idea though,
mentioned in the commit log, but don't know how flexible we are in
terms of adding S_IFBPF to the UAPI.

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.
>
> In my opinion I still favor a user space approach. Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by themselves. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I was thinking about a strict predefined hierarchy dictated by the kernel
as well, but was then considering a more flexible approach that could be
tailored freely to various use cases. A predefined hierarchy would most
likely need to be resolved per subsystem and it's not really easy to map
this properly. F.e. if the kernel would try to provide unique ids (as
opposed to having a name or annotation member through the syscall), it
could end up being quite cryptic. If we let the users choose names, I'm
not sure if a single hierarchy level would be enough. Then, additionally
you have facilities like tail calls that eBPF programs could do.

In such cases, one could even craft relationships where a (strict auto
generated) tree representation would not be sufficient (f.e. recirculation
up to a certain depth). The tail called programs could be changed
atomically during runtime, etc. The other issue related to a per subsystem
representation is that bpf(2) is the central management interface for
creating/accessing maps/progs, and each subsystem then has its own little
interface to "install" them internally (f.e. via netlink, setsockopt(2),
etc). That means, with tail calls, only the 'root' programs are installed
there and further transactions would be needed in order to make individual
subsystems aware, so they could potentially generate some hierarchy; don't
know, it seems rather complex.

Thanks,
Daniel
Alexei Starovoitov Oct. 16, 2015, 4:18 p.m. UTC | #3
On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> Namespaces at some point dealt with the same problem; they nowadays use
> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up their own
> hierarchy with normal unix tools, not hidden inside a C program. For
> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> work out of the box nowadays.

bind mounting of /proc/../fd was initially proposed by Andy and we've
looked at it thoroughly, but after discussion with Eric it became
apparent that it doesn't fit here. In the end, we need shell tools
to access maps.
Also I think you missed the hierarchy in this patch set _is_ built with
normal 'mkdir' and files are removed with 'rm'.
The only thing that C does is BPF_PIN_FD of an fd that was received
from the bpf syscall. That's a way cleaner api than doing a bind mount
from a C program.
We've considered letting open() of the file return a bpf-specific
anon-inode, but decided to reserve that for other, more natural file
operations. Therefore BPF_NEW_FD is needed.
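
In other words, the intended workflow with standard tools looks
roughly like this (the fs type name matches this thread; paths and the
map name are illustrative):

  mount -t bpffs bpffs /sys/fs/bpf
  mkdir /sys/fs/bpf/tc                 # hierarchy via plain mkdir
  tc filter add ... bpf obj prog.o     # tc pins fds via BPF_PIN_FD
  ls /sys/fs/bpf/tc
  rm /sys/fs/bpf/tc/map_pkts           # removal via plain rm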

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.

We definitely missed you at the last plumbers where it was discussed :)

> In my opinion I still favor a user space approach.

that's not acceptable for tracing use cases. No daemons allowed.

> Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by themselves. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I think it's way cleaner to have one way of solving it (like this patch
does) instead of asking every subsystem to solve it differently.
We've also looked at sysfs and it's ugly when it comes to removing,
since the user cannot use normal 'rm'.

Hannes Frederic Sowa Oct. 16, 2015, 4:36 p.m. UTC | #4
On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> > On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> >> This eventually leads us to this patch, which implements a minimal
> >> eBPF file system. The idea is a bit similar, except that these
> >> inodes reside at one or multiple mount points. A directory hierarchy
> >> can be tailored by the various subsystem users to a specific
> >> application use case, with maps/progs pinned inside it. Two new eBPF
> >> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> >> order to create one or multiple special inodes from an existing file
> >> descriptor that points to a map/program (we call it eBPF fd pinning),
> >> or to create a new file descriptor from an existing special inode.
> >> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> >> can also be used unprivileged, given appropriate permissions on
> >> the path.
> >
> > In my opinion this is very un-unixy, to say the least.
> >
> > Namespaces at some point dealt with the same problem; they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up their own
> > hierarchy with normal unix tools, not hidden inside a C program. For
> > file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> > work out of the box nowadays.
> 
> Yes, that doesn't work out of the box, but I also don't know how usable
> that would really be. The idea is roughly similar to the paths
> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
> have a map/prog resource that you stick to a special inode so that you
> can retrieve it at a later point in time from the same or different
> processes through a new fd pointing to the resource from user side, so
> that the bpf(2) syscall can be performed upon it.
> 
> With Unix tools, you could still create/remove a hierarchy or unlink
> those that have maps/progs. You are correct that tools that don't
> implement bpf(2) currently cannot access the content behind it, since
> bpf(2) manages access to the data itself. I did like the 2nd idea though,
> mentioned in the commit log, but don't know how flexible we are in
> terms of adding S_IFBPF to the UAPI.

I don't think it should be a problem. You referred to the POSIX
standard in your other mail, but I can't see any reason not to
establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
Solaris (e.g. Doors, Event Ports) are just examples of new modes
being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe this is a better solution architecturally, instead of
constructing a new filesystem.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> >
> > In my opinion I still favor a user space approach. Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by themselves. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I was thinking about a strict predefined hierarchy dictated by the kernel
> as well, but was then considering a more flexible approach that could be
> tailored freely to various use cases. A predefined hierarchy would most
> likely need to be resolved per subsystem and it's not really easy to map
> this properly. F.e. if the kernel would try to provide unique ids (as
> opposed to having a name or annotation member through the syscall), it
> could end up being quite cryptic. If we let the users choose names, I'm
> not sure if a single hierarchy level would be enough. Then, additionally
> you have facilities like tail calls that eBPF programs could do.

I don't think that most subsystems need to expose those file
descriptors. Seccomp probably will have a supervisor process running and
per aggregation will also have a user space process running keeping the
fd alive. So it is all about tc/sched.

And I am not sure if tc will really need a filesystem to handle all
this. The simplest approach is to just keep a name <-> fd mapping
somewhere in the net/sched/ subsystem and use this for all tc users.
Otherwise, we could somehow incorporate this into a sysfs directory
where we maybe create a kobject per installed filter, something along
those lines.

I see that tail calls make this all very difficult to show which entity
uses which ebpf entity in some way, as it looks like n:m relationships.

> In such cases, one could even craft relationships where a (strict auto
> generated) tree representation would not be sufficient (f.e.
> recirculation
> up to a certain depth). The tail called programs could be changed
> atomically during runtime, etc. The other issue related to a per
> subsystem
> representation is that bpf(2) is the central management interface for
> creating/accessing maps/progs, and each subsystem then has its own little
> interface to "install" them internally (f.e. via netlink, setsockopt(2),
> etc). That means, with tail calls, only the 'root' programs are installed
> there and further transactions would be needed in order to make
> individual
> subsystems aware, so they could potentially generate some hierarchy;
> don't
> know, it seems rather complex.

I understand, this is really not suitable to represent in its entirety
in sysfs or any kind of hierarchical structure right now. Either we
limit it somewhat (Alexei will certainly intervene here) or one of your
filesystem approaches will win.

But I still wonder why people are so against user space dependencies?

Another idea that I discussed with Daniel just to have it publicly
available: a userspace helper would be called for every ebpf entity
change so it could mirror or keep track of ebpf handles in user space. You
can think along the lines of kernel/core_pattern. This would probably
also depend on non-anon-inode usage of ebpf fds.

Will have to think about that a bit more,
Hannes
Hannes Frederic Sowa Oct. 16, 2015, 4:43 p.m. UTC | #5
Hi Alexei,

On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> > Namespaces at some point dealt with the same problem; they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up their own
> > hierarchy with normal unix tools, not hidden inside a C program. For
> > file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
> > work out of the box nowadays.
> 
> bind mounting of /proc/../fd was initially proposed by Andy and we've
> looked at it thoroughly, but after discussion with Eric it became
> apparent that it doesn't fit here. In the end, we need shell tools
> to access maps.

Oh yes, I want shell tools for this very much! Maybe even have things
like strings, grep etc. work. :)

> Also I think you missed the hierarchy in this patch set _is_ built with
> normal 'mkdir' and files are removed with 'rm'.

I did not miss that, I am just concerned that if the kernel does not
enforce such a hierarchy automatically it won't really happen.

> The only thing that C does is BPF_PIN_FD of an fd that was received
> from the bpf syscall. That's a way cleaner api than doing a bind mount
> from a C program.

I am with you there. Unfortunately we don't have a "give this fd a
name" syscall so far, so I totally understand the decision here.

> We've considered letting open() of the file return a bpf-specific
> anon-inode, but decided to reserve that for other, more natural file
> operations. Therefore BPF_NEW_FD is needed.

Can't this be overloaded somehow? You could use mknod for creation and
open for regular file use; mknod is its own syscall.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> 
> We definitely missed you at the last plumbers where it was discussed :)

Yes. :(

> > In my opinion I still favor a user space approach.
> 
> that's not acceptable for tracing use cases. No daemons allowed.

Oh, tracing does not allow daemons. Why? I can only imagine embedded
users, no?

> > Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by themselves. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I think it's way cleaner to have one way of solving it (like this patch
> does) instead of asking every subsystem to solve it differently.
> We've also looked at sysfs and it's ugly when it comes to removing,
> since the user cannot use normal 'rm'.

Ah, okay. Probably it would depend on some tc node always referencing
the bpf entity. But I see that sysfs might become too problematic.

Bye,
Hannes
Hannes Frederic Sowa Oct. 16, 2015, 5:21 p.m. UTC | #6
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, except that these
> inodes reside at one or multiple mount points. A directory hierarchy
> can be tailored by the various subsystem users to a specific
> application use case, with maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be used unprivileged, given appropriate permissions on
> the path.
> 

Another question:
Should multiple mounts of the filesystem result in an empty fs (a new
instance) or in one where one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?

Thanks,
Hannes
Daniel Borkmann Oct. 16, 2015, 5:27 p.m. UTC | #7
On 10/16/2015 06:36 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
>> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
>>> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>>>> This eventually leads us to this patch, which implements a minimal
>>>> eBPF file system. The idea is a bit similar, except that these
>>>> inodes reside at one or multiple mount points. A directory hierarchy
>>>> can be tailored by the various subsystem users to a specific
>>>> application use case, with maps/progs pinned inside it. Two new eBPF
>>>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>>>> order to create one or multiple special inodes from an existing file
>>>> descriptor that points to a map/program (we call it eBPF fd pinning),
>>>> or to create a new file descriptor from an existing special inode.
>>>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>>>> can also be used unprivileged, given appropriate permissions on
>>>> the path.
>>>
>>> In my opinion this is very un-unixy, to say the least.
>>>
>>> Namespaces at some point dealt with the same problem; they nowadays use
>>> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up their own
>>> hierarchy with normal unix tools, not hidden inside a C program. For
>>> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
>>> work out of the box nowadays.
>>
>> Yes, that doesn't work out of the box, but I also don't know how usable
>> that would really be. The idea is roughly similar to the paths
>> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
>> have a map/prog resource that you stick to a special inode so that you
>> can retrieve it at a later point in time from the same or different
>> processes through a new fd pointing to the resource from user side, so
>> that the bpf(2) syscall can be performed upon it.
>>
>> With Unix tools, you could still create/remove a hierarchy or unlink
>> those that have maps/progs. You are correct that tools that don't
>> implement bpf(2) currently cannot access the content behind it, since
>> bpf(2) manages access to the data itself. I did like the 2nd idea though,
>> mentioned in the commit log, but don't know how flexible we are in
>> terms of adding S_IFBPF to the UAPI.
>
> I don't think it should be a problem. You referred to the POSIX
> standard in your other mail, but I can't see any reason not to
> establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
> Solaris (e.g. Doors, Event Ports) are just examples of new modes
> being added.
>
> mknod /bpf/map/1 m 1 1
>
> :)
>
> Yes, maybe this is a better solution architecturally, instead of
> constructing a new filesystem.

Yeah, also 'man 2 stat' lists a couple of others used by various systems.

The pros of this approach would be that no new file system would be needed
and the special inode could be placed on top of any 'regular' file system
that would support special files. I do like that as well.

I'm wondering whether this would prevent us in future from opening access
to shell tools etc on that special file, but probably one could provide a
default set of file ops via init_special_inode() that could be overloaded
by the underlying fs if required.

>>> I don't know in terms of how many objects bpf should be able to handle
>>> and if such a bind-mount based solution would work, I guess not.
>>>
>>> In my opinion I still favor a user space approach. Subsystems which use
>>> ebpf in a way that no user space program needs to be running to control
>>> them would need to export the fds by themselves. E.g. something like
>>> sysfs/kobject for tc? The hierarchy would then be in control of the
>>> subsystem which could also create a proper naming hierarchy or maybe
>>> even use an already given one. Do most other eBPF users really need to
>>> persist file descriptors somewhere without user space control and pick
>>> them up later?
>>
>> I was thinking about a strict predefined hierarchy dictated by the kernel
>> as well, but was then considering a more flexible approach that could be
>> tailored freely to various use cases. A predefined hierarchy would most
>> likely need to be resolved per subsystem and it's not really easy to map
>> this properly. F.e. if the kernel would try to provide unique ids (as
>> opposed to having a name or annotation member through the syscall), it
>> could end up being quite cryptic. If we let the users choose names, I'm
>> not sure if a single hierarchy level would be enough. Then, additionally
>> you have facilities like tail calls that eBPF programs could do.
>
> I don't think that most subsystems need to expose those file
> descriptors. Seccomp probably will have a supervisor process running and
> per aggregation will also have a user space process running keeping the
> fd alive. So it is all about tc/sched.
>
> And I am not sure if tc will really need a filesystem to handle all
> this. The simplest approach is to just keep a name <-> fd mapping
> somewhere in the net/sched/ subsystem and use this for all tc users.

Solving this on a generic level eventually felt cleaner, where a subsystem
would have the choice of whether to make use of this or not. tc/sched has
currently two types BPF_PROG_TYPE_SCHED_{CLS,ACT}, so a common facility
would be needed for both subsystems. It's a bit hard to see what other
subsystems would come in future, and we could end up with multiple
subsystem-specific APIs essentially doing the same thing.

At the very beginning, there was also the idea to just reference such an
object by name, but it would need to be made available somewhere (procfs?)
to get a picture and manage them from an admin pov. Having some object
exposed as a file like other ipc building blocks seems better, imho.
Whether as special file or file system, yeah, that's a different question.

[...]
> I see that tail calls make this all very difficult to show which entity
> uses which ebpf entity in some way, as it looks like n:m relationships.

Yes, this is indeed the case.

>> In such cases, one could even craft relationships where a (strict auto
>> generated) tree representation would not be sufficient (f.e.
>> recirculation
>> up to a certain depth). The tail called programs could be changed
>> atomically during runtime, etc. The other issue related to a per
>> subsystem
>> representation is that bpf(2) is the central management interface for
>> creating/accessing maps/progs, and each subsystem then has its own little
>> interface to "install" them internally (f.e. via netlink, setsockopt(2),
>> etc). That means, with tail calls, only the 'root' programs are installed
>> there and further transactions would be needed in order to make
>> individual
>> subsystems aware, so they could potentially generate some hierarchy;
>> don't
>> know, it seems rather complex.
>
> I understand, this is really not suitable to represent in its entirety
> in sysfs or any kind of hierarchical structure right now. Either we
> limit it somewhat (Alexei will certainly intervene here) or one of your
> filesystem approaches will win.
>
> But I still wonder why people are so against user space dependencies?
>
> Another idea that I discussed with Daniel just to have it publicly
> available: a userspace helper would be called for every ebpf entity
> change so it could mirror or keep track of ebpf handles in user space. You
> can think along the lines of kernel/core_pattern. This would probably
> also depend on non-anon-inode usage of ebpf fds.

Yes, it seems so to me, but other than that, it would also require a user
space daemon managing all these, right? At least from the consensus at
Plumbers, running an extra daemon was considered rather impractical wrt
deployment (same with fuse).

Best,
Daniel
Alexei Starovoitov Oct. 16, 2015, 5:32 p.m. UTC | #8
On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> Hi Alexei,
>
> On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
>> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
>>> Namespaces at some point dealt with the same problem; they nowadays use
>>> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up their own
>>> hierarchy with normal unix tools, not hidden inside a C program. For
>>> file descriptors we already have /proc/$$/fd/*, but it seems that doesn't
>>> work out of the box nowadays.
>>
>> bind mounting of /proc/../fd was initially proposed by Andy and we've
>> looked at it thoroughly, but after discussion with Eric it became
>> apparent that it doesn't fit here. In the end, we need shell tools
>> to access maps.
>
> Oh yes, I want shell tools for this very much! Maybe even have things
> like strings, grep etc. work. :)

yes and the only way to get there is to have it done via fs.

>> Also I think you missed the hierarchy in this patch set _is_ built with
>> normal 'mkdir' and files are removed with 'rm'.
>
> I did not miss that, I am just concerned that if the kernel does not
> enforce such a hierarchy automatically it won't really happen.

if it's easier for the user to work with a single level of directories,
they should be able to do so. It's not the job of the kernel to enforce
how user space apps should be designed.

> Oh, tracing does not allow daemons. Why? I can only imagine embedded
> users, no?

yes and for networking: restartability and HA.
cannot really do that with fuse/daemons.

Thomas Graf Oct. 16, 2015, 5:37 p.m. UTC | #9
On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> > Oh, tracing does not allow daemons. Why? I can only imagine embedded
> > users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.
Alexei Starovoitov Oct. 16, 2015, 5:37 p.m. UTC | #10
On 10/16/15 10:27 AM, Daniel Borkmann wrote:
>>> but don't know how flexible we are in
>>> terms of adding S_IFBPF to the UAPI.
>>
>> I don't think it should be a problem. You referred to the POSIX
>> standard in your other mail, but I can't see any reason not to
>> establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and
>> Solaris (e.g. Doors, Event Ports) are just examples of new modes
>> being added.
>>
>> mknod /bpf/map/1 m 1 1
>>
>> :)
>>
>> Yes, maybe this is a better solution architecturally, instead of
>> constructing a new filesystem.
>
> Yeah, also 'man 2 stat' lists a couple of others used by various systems.
>
> The pros of this approach would be that no new file system would be needed
> and the special inode could be placed on top of any 'regular' file system
> that would support special files. I do like that as well.

I don't like it at all for the reasons you've just stated:
'it will prevent us doing shell style access to such files'

> I'm wondering whether this would prevent us in future from opening access
> to shell tools etc on that special file, but probably one could provide a
> default set of file ops via init_special_inode() that could be overloaded
> by the underlying fs if required.

and also because adding a new S_ISSOCK-like bit for bpf feels like the
beginning of a nightmare, since sooner or later all filesystems would
need to have a check for it like they have for the sock type.

Alexei Starovoitov Oct. 16, 2015, 5:42 p.m. UTC | #11
On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
> Another question:
> Should multiple mount of the filesystem result in an empty fs (a new
> instance) or in one were one can see other ebpf-fs entities? I think
> Daniel wanted to already use the mountpoint as some kind of hierarchy
> delimiter. I would have used directories for that and multiple mounts
> would then have resulted in the same content of the filesystem. IMHO
> this would remove some ambiguity but then the question arises how this
> is handled in a namespaced environment. Was there some specific reason
> to do so?

That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's the advantage of having the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.

Daniel Borkmann Oct. 16, 2015, 5:56 p.m. UTC | #12
On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>> Another question:
>> Should multiple mounts of the filesystem result in an empty fs (a new
>> instance) or in one where one can see other ebpf-fs entities? I think
>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>> delimiter. I would have used directories for that and multiple mounts
>> would then have resulted in the same content of the filesystem. IMHO
>> this would remove some ambiguity but then the question arises how this
>> is handled in a namespaced environment. Was there some specific reason
>> to do so?
>
> That's an interesting question!
> I think all mounts should be independent.
> I can see tracing using one and networking using another one
> with different hierarchies suitable for their own use cases.
> What's the advantage of having the same content everywhere?
> Feels harder to manage, since different users would need to
> coordinate.

I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.

Cheers,
Daniel
Eric W. Biederman Oct. 16, 2015, 6:41 p.m. UTC | #13
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>> Another question:
>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>> instance) or in one where one can see other ebpf-fs entities? I think
>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>> delimiter. I would have used directories for that and multiple mounts
>>> would then have resulted in the same content of the filesystem. IMHO
>>> this would remove some ambiguity but then the question arises how this
>>> is handled in a namespaced environment. Was there some specific reason
>>> to do so?
>>
>> That's an interesting question!
>> I think all mounts should be independent.
>> I can see tracing using one and networking using another one
>> with different hierarchies suitable for their own use cases.
>> What's the advantage of having the same content everywhere?
>> Feels harder to manage, since different users would need to
>> coordinate.
>
> I initially had it as a mount_single() file system, where I was thinking
> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
> of that mount point, but for the same reasons above I lifted that restriction.

I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.

A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".

That allows all hierarchy etc to be handled in userspace, just as with
my files for namespaces.

I do not understand why you have presented to userspace a magic
filesystem that you allow binding to.  That is not what I intended to
suggest and I do not know how that makes any sense.

Eric


Alexei Starovoitov Oct. 16, 2015, 7:27 p.m. UTC | #14
On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>> Another question:
>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>> delimiter. I would have used directories for that and multiple mounts
>>>> would then have resulted in the same content of the filesystem. IMHO
>>>> this would remove some ambiguity but then the question arises how this
>>>> is handled in a namespaced environment. Was there some specific reason
>>>> to do so?
>>>
>>> That's an interesting question!
>>> I think all mounts should be independent.
>>> I can see tracing using one and networking using another one
>>> with different hierarchies suitable for their own use cases.
>>> What's the advantage of having the same content everywhere?
>>> Feels harder to manage, since different users would need to
>>> coordinate.
>>
>> I initially had it as a mount_single() file system, where I was thinking
>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>> of that mount point, but for the same reasons above I lifted that restriction.
>
> I am missing something.
>
> When I suggested using a filesystem it was my thought there would be
> exactly one superblock per map, and the map would be specified at mount
> time.  You clearly are not implementing that.

I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.
Also map today is an fd that belongs to a process. I cannot see
an api from a C program to do 'mount of FD' that wouldn't look like
an ugly hack.

> A filesystem per map makes sense as you have a key-value store with one
> file per key.
>
> The idea is that something resembling your bpf_pin_fd function would be
> the mount system call for the filesystem.
>
> The keys in the map could be read by "ls /mountpoint/".
> Key values could be inspected with "cat /mountpoint/key".

yes. that is still the goal for follow-up patches, but contained
within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such an api would be much cleaner from a C app's point of view.
Potentially we can allow a mount of a file created via BPF_PIN_FD
that will expand into keys/values.
All of that is in our future plans.
There, actually, the main contention point is 'how to represent keys
and values': whether the key is a hex representation, or we need some
pretty-printers via a format string or via a schema? etc, etc.
We tried a few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.

Eric W. Biederman Oct. 16, 2015, 7:53 p.m. UTC | #15
Alexei Starovoitov <ast@plumgrid.com> writes:

> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
[...]
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.

What do you mean won't scale?  You want to have a name per map/prog so the
basic complexity appears the same.  Is there some crucial interaction
between the persistent doodads you are placing on a filesystem that I am
missing?

Given the fact you don't normally need any persistence without a program
I am puzzled why "scaling" is an issue of any kind.  This is for a
comparatively rare case if I am not mistaken.

> Also map today is an fd that belongs to a process. I cannot see
> an api from a C program to do 'mount of FD' that wouldn't look like
> an ugly hack.

mount -t bpffs ... -o fd=1234 

That is not at all convoluted or hacky.  Especially compared to some of the
alternatives I am seeing.

It is no problem at all to wrap something like that in a nice function
call that has the exact same complexity of use as any of the other
options that are being explored to give something that starts out
as a file descriptor a name.
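
Such a wrapper might look like the following sketch (the fd= mount
option is hypothetical here, following the one-superblock-per-map
suggestion above):

  #include <stdio.h>
  #include <sys/mount.h>

  /* Hypothetical: give a map/prog fd a name by mounting a per-object
   * bpffs instance at the given path.
   */
  static int bpf_pin_fd(int fd, const char *mountpoint)
  {
          char opts[32];

          snprintf(opts, sizeof(opts), "fd=%d", fd);
          return mount("bpffs", mountpoint, "bpffs", 0, opts);
  }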

>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow-up patches, but contained
> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such an api would be much cleaner from a C app's point of view.
> Potentially we can allow a mount of a file created via BPF_PIN_FD
> that will expand into keys/values.
> All of that is in our future plans.
> There, actually, the main contention point is 'how to represent keys
> and values': whether the key is a hex representation, or we need some
> pretty-printers via a format string or via a schema? etc, etc.
> We tried a few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

My gut feel would be to keep it simple and use the same representation
you use in your existing system calls.  Certainly ordinary filenames are
keys of arbitrary binary data that can include everything except
a '\0' byte.  That they are human readable is a nice convention, but not
at all fundamental to what they are.

Eric

Daniel Borkmann Oct. 16, 2015, 7:54 p.m. UTC | #16
On 10/16/2015 09:27 PM, Alexei Starovoitov wrote:
> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>>> Another question:
>>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>>> delimiter. I would have used directories for that and multiple mounts
>>>>> would then have resulted in the same content of the filesystem. IMHO
>>>>> this would remove some ambiguity but then the question arises how this
>>>>> is handled in a namespaced environment. Was there some specific reason
>>>>> to do so?
>>>>
>>>> That's an interesting question!
>>>> I think all mounts should be independent.
>>>> I can see tracing using one and networking using another one
>>>> with different hierarchies suitable for their own use cases.
>>>> What's the advantage of having the same content everywhere?
>>>> Feels harder to manage, since different users would need to
>>>> coordinate.
>>>
>>> I initially had it as a mount_single() file system, where I was thinking
>>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>>> of that mount point, but for the same reasons above I lifted that restriction.
>>
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.
> Also map today is an fd that belongs to a process. I cannot see
> an api from a C program to do 'mount of FD' that wouldn't look like
> an ugly hack.
>
>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow-up patches, but contained
> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such an api would be much cleaner from a C app's point of view.
> Potentially we can allow a mount of a file created via BPF_PIN_FD
> that will expand into keys/values.

Yeah, sort of making this an optional debugging facility if anything
(maybe to just get a read-only snapshot view). Having maps with a very
large number of entries might end up being problematic on its own, as
might mapping potential future map candidates such as rhashtable.

> There, actually, the main contention point is 'how to represent keys
> and values': whether the key is a hex representation, or we need some
> pretty-printers via a format string or via a schema? etc, etc.
> We tried a few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

That is unclear as well, and would need to be sorted out to make it useful.
Alexei Starovoitov Oct. 16, 2015, 8:56 p.m. UTC | #17
On 10/16/15 12:53 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> [...]
>>> I am missing something.
>>>
>>> When I suggested using a filesystem it was my thought there would be
>>> exactly one superblock per map, and the map would be specified at mount
>>> time.  You clearly are not implementing that.
>>
>> I don't think it's practical to have sb per map, since that would mean
>> sb per prog and that won't scale.
>
> What do you mean won't scale?  You want to have a name per map/prog so the
> basic complexity appears the same.  Is there some crucial interaction
> between the persistent doodads you are placing on a filesystem that I am
> missing?
>
> Given the fact you don't normally need any persistence without a program
> I am puzzled why "scaling" is an issue of any kind.  This is for a
> comparatively rare case if I am not mistaken.

representing a map as a directory tree with files as keys is indeed
'rare', since it's mainly for debugging and slow accesses, but the
'pin_fd' functionality is now popping up everywhere. Mainly because in
things like openstack there are tons of disjoint libraries written in
different languages, and the only thing they have in common is the
kernel. So pin_fd/new_fd is a mandatory feature.

>> Also map today is an fd that belongs to a process. I cannot see
>> an api from a C program to do 'mount of FD' that wouldn't look like
>> an ugly hack.
>
> mount -t bpffs ... -o fd=1234
>
> That is not at all convoluted or hacky.  Especially compared to some of the
> alternatives I am seeing.
>
> It is no problem at all to wrap something like that in a nice function
> call that has the exact same complexity of use as any of the other
> options that are being explored to give something that starts out
> as a file descriptor a name.

Frankly, I don't think parsing an 'fd=1234' string is a clean api, but
before we argue about the fs philosophy of passing options, let's
get on the same page with requirements.
The first goal that this patch is solving is providing an ability
to 'pin' an FD, so that the map/prog won't disappear when the user
app exits.
The second goal, for future patches, is to expose map internals as a
directory structure.
These two goals are independent.
We can argue about the api for the 2nd, whether it's a mount with an
fd=1234 string or something else, but for the first, mount style
doesn't make sense.

>>> A filesystem per map makes sense as you have a key-value store with one
>>> file per key.
>>>
>>> The idea is that something resembling your bpf_pin_fd function would be
>>> the mount system call for the filesystem.
>>>
>>> The keys in the map could be read by "ls /mountpoint/".
>>> Key values could be inspected with "cat /mountpoint/key".
>>
>> yes. that is still the goal for follow-up patches, but contained
>> within a given bpffs. Some bpf_pin_fd-like command for the bpf syscall
>> would create files for keys in a map and allow 'cat' via open/read.
>> Such an api would be much cleaner from a C app's point of view.
>> Potentially we can allow a mount of a file created via BPF_PIN_FD
>> that will expand into keys/values.
>> All of that is in our future plans.
>> There, actually, the main contention point is 'how to represent keys
>> and values': whether the key is a hex representation, or we need some
>> pretty-printers via a format string or via a schema? etc, etc.
>> We tried a few ideas of representing keys in our fuse implementations,
>> but don't have an agreement yet.
>
> My gut feel would be to keep it simple and use the same representation
> you use in your existing system calls.  Certainly ordinary filenames are
> keys of arbitrary binary data that can included everything except
> a '\0' byte.  That they are human readable is a nice convention, but not
> at all fundamental to what they are.

that doesn't work. Map keys are never human readable; they're arbitrary
binary data. That's why representing them as file names is not trivial.
Some pretty-printer is needed.
Again, that is the 2nd goal of bpffs in general. We cannot really solve
it now, because we cannot say 'let's represent keys like X and work
from there', since that will become kernel ABI and we won't be able to
change that.
It's equally not clear that thousands of keys can even work as files.
So there's still quite a bit of brainstorming to do for this 2nd goal.
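
F.e. one trivial strawman, purely illustrative and not part of this
patch, would be hex-encoding the raw key bytes into a file name:

  #include <stdio.h>
  #include <stddef.h>

  /* Strawman pretty-printer: render an arbitrary binary map key as a
   * hex file name; 'out' must hold 2 * len + 1 bytes.
   */
  static void key_to_name(const unsigned char *key, size_t len, char *out)
  {
          size_t i;

          for (i = 0; i < len; i++)
                  sprintf(out + 2 * i, "%02x", key[i]);
  }

but even that would bake one representation into the ABI.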

Eric W. Biederman Oct. 16, 2015, 11:44 p.m. UTC | #18
Alexei Starovoitov <ast@plumgrid.com> writes:

> We can argue about the api for the 2nd, whether it's mount with an
> fd=1234 string or something else, but for the first, mount style
> doesn't make sense.

Why does mount not make sense?  It is exactly what you are looking for,
so why does it not make sense?

Eric
Alexei Starovoitov Oct. 17, 2015, 2:43 a.m. UTC | #19
On 10/16/15 4:44 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> We can argue about the api for the 2nd, whether it's mount with an
>> fd=1234 string or something else, but for the first, mount style
>> doesn't make sense.
>
> Why does mount not make sense?  It is exactly what you are looking for
> so why does it not make sense?

hmm, how do you get a new fd back after mounting it?
Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
but now it's more convoluted and empty mounts are everywhere.

Daniel Borkmann Oct. 17, 2015, 12:28 p.m. UTC | #20
On 10/17/2015 04:43 AM, Alexei Starovoitov wrote:
> On 10/16/15 4:44 PM, Eric W. Biederman wrote:
>> Alexei Starovoitov <ast@plumgrid.com> writes:
>>
>>> We can argue about the api for the 2nd, whether it's mount with an
>>> fd=1234 string or something else, but for the first, mount style
>>> doesn't make sense.
>>
>> Why does mount not make sense?  It is exactly what you are looking for
>> so why does it not make sense?
>
> hmm, how do you get a new fd back after mounting it?
> Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
> but now it's more convoluted and empty mounts are everywhere.

That would be my understanding as well, but as Alexei already said,
these are two different issues; it would be step 2 (let me get back
to that further below). But in any case, I don't really like dumping
key/value pairs somewhere as files. You have binary blobs as both, and
if your application has, say, a lookup key (for whatever reason) of
several cachelines, it all ends up getting rather messy instead of
being really useful for non-bpf(2)-aware cmdline tools to deal with.

Anyway, another idea I've been brainstorming about with Hannes today
is the following:

We register two major numbers, one for eBPF maps (X), one for eBPF
progs (Y). A user can either, via cmdline, call something like
'mknod /dev/bpf/maps/map_pkts c X Z' to create a special character
device, or alternatively do so out of an application through the
mknod(2) syscall (f.e. tc when setting up maps/progs internally from
the obj file for a classifier).

Then, we still have 2 eBPF commands to add to the bpf(2) syscall, say
(for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that
created a map (or prog) already has the map fd, and after mknod(2) it
can open(2) the special file to get the special file fd. Then it can
call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr)) where
attr looks like:

   union bpf_attr attr = {
     .bpf_fd    = bpf_fd,
     .dev_fd    = dev_fd,
   };

The bpf(2) syscall can check whether dev_fd belongs to an eBPF special
file, and it can then copy over file->private_data from the bpf_fd
to the dev_fd's underlying file, where the private_data from the
bpf_fd, as we know, already points to a proper bpf_map/bpf_prog
structure. The map/prog would then get ref'ed and live on for the char
device's lifetime. No special hashtable, gc, etc needed. The char
device has fops that we can define ourselves, and unlinking would drop
the ref from its private_data.
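
To illustrate the intended flow, here is a sketch only: BPF_BIND_DEV
and the bpf_fd/dev_fd attributes are part of this proposal and do not
exist yet, bpf() is the usual syscall wrapper, and the path and major
number are made-up placeholders:

   /* create the special file and open it */
   mknod("/dev/bpf/maps/map_pkts", S_IFCHR | 0600, makedev(map_major, 0));
   int dev_fd = open("/dev/bpf/maps/map_pkts", O_RDONLY);

   /* bind the map behind bpf_fd to the device node */
   union bpf_attr attr = {
           .bpf_fd = bpf_fd,   /* fd from BPF_MAP_CREATE */
           .dev_fd = dev_fd,
   };
   bpf(BPF_BIND_DEV, &attr, sizeof(attr));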

Now to the other part: BPF_FETCH_DEV would work similarly. The
application opens the device, and fills bpf_attr as follows again:

   union bpf_attr attr = {
     .bpf_fd    = 0,
     .dev_fd    = dev_fd,
   };

This would allow us to look up the map/prog from the dev_fd's file->
private_data, and to install a new fd via bpf_{map,prog}_new_fd() that
is returned from bpf(2) for bpf-related access. The remaining fops
of the char device could still be reserved for possibilities like
debugging in the future.
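
Again purely as a sketch under the same assumptions, the retrieval
side in a second process would then be:

   /* another process: fetch a new bpf fd back from the device node */
   int dev_fd = open("/dev/bpf/maps/map_pkts", O_RDONLY);
   union bpf_attr attr = {
           .bpf_fd = 0,        /* unused in the fetch direction */
           .dev_fd = dev_fd,
   };
   int map_fd = bpf(BPF_FETCH_DEV, &attr, sizeof(attr));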

In the future (2nd step), we could either use Eric's idea and then do
something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump
attributes or other properties of such a special file to some location
for inspection, or we could use kobjects attached to the device if the
fops from the cdev should not be sufficient.

So, closing the loop on the special files where there were concerns:

This won't preclude a future shell-style access possibility, and
it would also not end up as the nightmare you mentioned regarding the
S_ISSOCK-like bit in the other email.

The pinning mechanism would not require an extra file system to be
mounted somewhere, and yet the user can define an arbitrary hierarchy
in which to put the special files, as this facility already exists. An
approach like this looks overall cleaner to me, and would most likely
be realizable in fewer lines of code as well.

Thoughts?

Cheers,
Daniel
Alexei Starovoitov Oct. 18, 2015, 2:20 a.m. UTC | #21
On 10/17/15 5:28 AM, Daniel Borkmann wrote:
...
> The pinning mechanism would not require an extra file system to be
> mounted somewhere, and yet the user can define an arbitrary hierarchy
> in which to put the special files, as this facility already exists. An
> approach like this looks overall cleaner to me, and would most likely
> be realizable in fewer lines of code as well.
>
> Thoughts?

that indeed sounds cleaner (fewer lines of code, no fs, etc), but
I don't see how it will work yet.
For a chardev with our own ops we can be triggered on open and close
of that chardev, so won't the replaced private_data be cleared when
the user process does close(dev_fd)? There is no fop for unlink
either; that is an fs-only property?

Daniel Borkmann Oct. 18, 2015, 3:03 p.m. UTC | #22
On 10/18/2015 04:20 AM, Alexei Starovoitov wrote:
...
> that indeed sounds cleaner, less lines of code, no fs, etc, but
> I don't see how it will work yet.

I'll have some code ready very soon to show the concept. Will post it here
tonight, stay tuned. ;)

Cheers,
Daniel
diff mbox

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ae6f77..bb82764 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -8,8 +8,10 @@ 
 #define _LINUX_BPF_H 1
 
 #include <uapi/linux/bpf.h>
+
 #include <linux/workqueue.h>
 #include <linux/file.h>
+#include <linux/fs.h>
 
 struct bpf_map;
 
@@ -137,6 +139,17 @@  struct bpf_prog_aux {
 	};
 };
 
+enum bpf_fd_type {
+	BPF_FD_TYPE_PROG,
+	BPF_FD_TYPE_MAP,
+};
+
+union bpf_any {
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *raw_ptr;
+};
+
 struct bpf_array {
 	struct bpf_map map;
 	u32 elem_size;
@@ -172,6 +185,14 @@  void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type);
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type);
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type);
+
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..f9b412c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,16 @@  struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
 	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
 	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
 	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
 	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
 	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
 	BPF_PROG_LOAD,
+	BPF_PIN_FD,
+	BPF_NEW_FD,
 };
 
 enum bpf_map_type {
@@ -160,6 +126,11 @@  union bpf_attr {
 		__aligned_u64	log_buf;	/* user supplied buffer */
 		__u32		kern_version;	/* checked when prog_type=kprobe */
 	};
+
+	struct { /* anonymous struct used by BPF_{PIN,NEW}_FD command */
+		__u32		fd;
+		__aligned_u64	pathname;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 7b1425a..c1c5cf7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,5 +75,6 @@ 
 #define ANON_INODE_FS_MAGIC	0x09041934
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
+#define BPFFS_MAGIC		0xcafe4a11
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..3586d28 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -42,6 +42,9 @@ 
 #define XATTR_USER_PREFIX "user."
 #define XATTR_USER_PREFIX_LEN (sizeof(XATTR_USER_PREFIX) - 1)
 
+#define XATTR_BPF_PREFIX "bpf."
+#define XATTR_BPF_PREFIX_LEN (sizeof(XATTR_BPF_PREFIX) - 1)
+
 /* Security namespace */
 #define XATTR_EVM_SUFFIX "evm"
 #define XATTR_NAME_EVM XATTR_SECURITY_PREFIX XATTR_EVM_SUFFIX
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..1327258 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@ 
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
new file mode 100644
index 0000000..5cef673
--- /dev/null
+++ b/kernel/bpf/inode.c
@@ -0,0 +1,614 @@ 
+/*
+ * Minimal file system backend for special inodes holding eBPF maps and
+ * programs, used by eBPF fd pinning.
+ *
+ * (C) 2015 Daniel Borkmann <daniel@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/seq_file.h>
+#include <linux/fsnotify.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/security.h>
+#include <linux/xattr.h>
+
+#define BPFFS_DEFAULT_MODE 0700
+
+enum {
+	BPF_OPT_UID,
+	BPF_OPT_GID,
+	BPF_OPT_MODE,
+	BPF_OPT_ERR,
+};
+
+struct bpf_mnt_opts {
+	kuid_t uid;
+	kgid_t gid;
+	umode_t mode;
+};
+
+struct bpf_fs_info {
+	struct bpf_mnt_opts mnt_opts;
+};
+
+struct bpf_dir_state {
+	unsigned long flags;
+};
+
+static const match_table_t bpf_tokens = {
+	{ BPF_OPT_UID, "uid=%u" },
+	{ BPF_OPT_GID, "gid=%u" },
+	{ BPF_OPT_MODE, "mode=%o" },
+	{ BPF_OPT_ERR, NULL },
+};
+
+static const struct inode_operations bpf_dir_iops;
+static const struct inode_operations bpf_prog_iops;
+static const struct inode_operations bpf_map_iops;
+
+static struct inode *bpf_get_inode(struct super_block *sb,
+				   const struct inode *dir,
+				   umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (!inode)
+		return ERR_PTR(-ENOSPC);
+
+	inode->i_ino = get_next_ino();
+	inode->i_atime = CURRENT_TIME;
+	inode->i_mtime = inode->i_atime;
+	inode->i_ctime = inode->i_atime;
+	inode_init_owner(inode, dir, mode);
+
+	return inode;
+}
+
+static struct inode *bpf_mknod(struct inode *dir, umode_t mode)
+{
+	return bpf_get_inode(dir->i_sb, dir, mode);
+}
+
+static bool bpf_dentry_name_reserved(const struct dentry *dentry)
+{
+	return strchr(dentry->d_name.name, '.');
+}
+
+enum {
+	/* Directory state is 'terminating', so no subdirectories
+	 * are allowed anymore in this directory. This is being
+	 * reserved so that in future, auto-generated directories
+	 * could be added along with the special map/prog inodes.
+	 */
+	BPF_DSTATE_TERM_BIT,
+};
+
+static bool bpf_inode_is_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return test_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static bool bpf_inode_make_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return dir->i_nlink != 2 ||
+	       test_and_set_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static void bpf_inode_undo_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	clear_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static int bpf_inode_type(const struct inode *inode, enum bpf_fd_type *i_type)
+{
+	if (inode->i_op == &bpf_prog_iops)
+		*i_type = BPF_FD_TYPE_PROG;
+	else if (inode->i_op == &bpf_map_iops)
+		*i_type = BPF_FD_TYPE_MAP;
+	else
+		return -EACCES;
+
+	return 0;
+}
+
+static int bpf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	enum bpf_fd_type type;
+	bool is_fd, drop_ref;
+	int ret;
+
+	is_fd = bpf_inode_type(inode, &type) == 0;
+	drop_ref = inode->i_nlink == 1;
+
+	ret = simple_unlink(dir, dentry);
+	if (!ret && is_fd && drop_ref) {
+		union bpf_any raw;
+
+		raw.raw_ptr = i_private;
+		bpf_any_put(raw, type);
+		bpf_inode_undo_term(dir);
+	}
+
+	return ret;
+}
+
+static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct bpf_dir_state *state;
+	struct inode *inode;
+
+	if (bpf_inode_is_term(dir))
+		return -EPERM;
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		return -ENOSPC;
+
+	inode = bpf_mknod(dir, dir->i_mode);
+	if (IS_ERR(inode)) {
+		kfree(state);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = state;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_fop = &simple_dir_operations;
+
+	inc_nlink(inode);
+	inc_nlink(dir);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	int ret;
+
+	ret = simple_rmdir(dir, dentry);
+	if (!ret)
+		kfree(i_private);
+
+	return ret;
+}
+
+static const struct inode_operations bpf_dir_iops = {
+	.lookup		= simple_lookup,
+	.mkdir		= bpf_mkdir,
+	.rmdir		= bpf_rmdir,
+	.unlink		= bpf_unlink,
+};
+
+#define XATTR_TYPE_SUFFIX "type"
+
+#define XATTR_NAME_BPF_TYPE (XATTR_BPF_PREFIX XATTR_TYPE_SUFFIX)
+#define XATTR_NAME_BPF_TYPE_LEN (sizeof(XATTR_NAME_BPF_TYPE) - 1)
+
+#define XATTR_VALUE_MAP "map"
+#define XATTR_VALUE_PROG "prog"
+
+static ssize_t bpf_getxattr(struct dentry *dentry, const char *name,
+			    void *buffer, size_t size)
+{
+	enum bpf_fd_type type;
+	ssize_t ret;
+
+	if (strncmp(name, XATTR_NAME_BPF_TYPE, XATTR_NAME_BPF_TYPE_LEN))
+		return -ENODATA;
+
+	if (bpf_inode_type(d_inode(dentry), &type))
+		return -ENODATA;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = sizeof(XATTR_VALUE_PROG);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = sizeof(XATTR_VALUE_MAP);
+		break;
+	}
+
+	if (buffer) {
+		if (size < ret)
+			return -ERANGE;
+
+		switch (type) {
+		case BPF_FD_TYPE_PROG:
+			strncpy(buffer, XATTR_VALUE_PROG, ret);
+			break;
+		case BPF_FD_TYPE_MAP:
+			strncpy(buffer, XATTR_VALUE_MAP, ret);
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static ssize_t bpf_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	ssize_t len, used = 0;
+
+	len = security_inode_listsecurity(d_inode(dentry), buffer, size);
+	if (len < 0)
+		return len;
+
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		buffer += len;
+	}
+
+	len = XATTR_NAME_BPF_TYPE_LEN + 1;
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		memcpy(buffer, XATTR_NAME_BPF_TYPE, len);
+		buffer += len;
+	}
+
+	return used;
+}
+
+/* Special inodes handling map/progs currently don't allow for syscalls
+ * such as open/read/write/etc. We use the same bpf_{map,prog}_new_fd()
+ * facility for installing an fd to the user as we do on BPF_MAP_CREATE
+ * and BPF_PROG_LOAD, so applications using bpf(2) don't see any
+ * change in behaviour. In future, the set of open/read/write/etc could
+ * be used f.e. for implementing things like debugging facilities on the
+ * underlying map/prog that would work with non-bpf(2) aware tooling.
+ */
+static const struct inode_operations bpf_prog_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static const struct inode_operations bpf_map_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static int bpf_mkmap(struct inode *dir, struct dentry *dentry,
+		     struct bpf_map *map, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = map;
+	inode->i_op = &bpf_map_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_mkprog(struct inode *dir, struct dentry *dentry,
+		      struct bpf_prog *prog, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = prog;
+	inode->i_op = &bpf_prog_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static const struct bpf_mnt_opts *bpf_sb_mnt_opts(const struct super_block *sb)
+{
+	const struct bpf_fs_info *bfi = sb->s_fs_info;
+
+	return &bfi->mnt_opts;
+}
+
+static int bpf_parse_options(char *opt_data, struct bpf_mnt_opts *opts)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int opt_val, token;
+	char *opt_ptr;
+	kuid_t uid;
+	kgid_t gid;
+
+	opts->mode = BPFFS_DEFAULT_MODE;
+
+	while ((opt_ptr = strsep(&opt_data, ",")) != NULL) {
+		if (!*opt_ptr)
+			continue;
+
+		token = match_token(opt_ptr, bpf_tokens, args);
+		switch (token) {
+		case BPF_OPT_UID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			uid = make_kuid(current_user_ns(), opt_val);
+			if (!uid_valid(uid))
+				return -EINVAL;
+
+			opts->uid = uid;
+			break;
+		case BPF_OPT_GID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			gid = make_kgid(current_user_ns(), opt_val);
+			if (!gid_valid(gid))
+				return -EINVAL;
+
+			opts->gid = gid;
+			break;
+		case BPF_OPT_MODE:
+			if (match_octal(&args[0], &opt_val))
+				return -EINVAL;
+
+			opts->mode = opt_val & S_IALLUGO;
+			break;
+		default:
+			return -EINVAL;
+		};
+	}
+
+	return 0;
+}
+
+static int bpf_apply_options(struct super_block *sb)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(sb);
+	struct inode *inode = sb->s_root->d_inode;
+
+	inode->i_mode &= ~S_IALLUGO;
+	inode->i_mode |= opts->mode;
+
+	inode->i_uid = opts->uid;
+	inode->i_gid = opts->gid;
+
+	return 0;
+}
+
+static int bpf_show_options(struct seq_file *m, struct dentry *root)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(root->d_sb);
+
+	if (!uid_eq(opts->uid, GLOBAL_ROOT_UID))
+		seq_printf(m, ",uid=%u",
+			   from_kuid_munged(&init_user_ns, opts->uid));
+
+	if (!gid_eq(opts->gid, GLOBAL_ROOT_GID))
+		seq_printf(m, ",gid=%u",
+			   from_kgid_munged(&init_user_ns, opts->gid));
+
+	if (opts->mode != BPFFS_DEFAULT_MODE)
+		seq_printf(m, ",mode=%o", opts->mode);
+
+	return 0;
+}
+
+static int bpf_remount(struct super_block *sb, int *flags, char *opt_data)
+{
+	struct bpf_fs_info *bfi = sb->s_fs_info;
+	int ret;
+
+	sync_filesystem(sb);
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		return ret;
+
+	bpf_apply_options(sb);
+	return 0;
+}
+
+static const struct super_operations bpf_super_ops = {
+	.statfs		= simple_statfs,
+	.remount_fs	= bpf_remount,
+	.show_options	= bpf_show_options,
+};
+
+static int bpf_fill_super(struct super_block *sb, void *opt_data, int silent)
+{
+	static struct tree_descr bpf_files[] = { { "" } };
+	struct bpf_dir_state *state;
+	struct bpf_fs_info *bfi;
+	struct inode *inode;
+	int ret = -ENOMEM;
+
+	bfi = kzalloc(sizeof(*bfi), GFP_KERNEL);
+	if (!bfi)
+		return ret;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		goto err_bfi;
+
+	save_mount_options(sb, opt_data);
+	sb->s_fs_info = bfi;
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		goto err_state;
+
+	ret = simple_fill_super(sb, BPFFS_MAGIC, bpf_files);
+	if (ret)
+		goto err_state;
+
+	sb->s_op = &bpf_super_ops;
+
+	inode = sb->s_root->d_inode;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_private = state;
+
+	bpf_apply_options(sb);
+
+	return 0;
+err_state:
+	kfree(state);
+err_bfi:
+	kfree(bfi);
+	return ret;
+}
+
+static void bpf_kill_super(struct super_block *sb)
+{
+	kfree(sb->s_root->d_inode->i_private);
+	kfree(sb->s_fs_info);
+	kill_litter_super(sb);
+}
+
+static struct dentry *bpf_mount(struct file_system_type *type,
+				int flags, const char *dev_name,
+				void *opt_data)
+{
+	return mount_nodev(type, flags, opt_data, bpf_fill_super);
+}
+
+static struct file_system_type bpf_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "bpf",
+	.mount		= bpf_mount,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
+MODULE_ALIAS_FS("bpf");
+MODULE_ALIAS_FS("bpffs");
+
+static int __init bpf_init(void)
+{
+	return register_filesystem(&bpf_fs_type);
+}
+fs_initcall(bpf_init);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type)
+{
+	umode_t i_mode = S_IFREG | S_IRUSR | S_IWUSR;
+	struct inode *dir_inode;
+	struct dentry *dentry;
+	struct path path;
+	int ret;
+
+	dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		return ret;
+	}
+
+	ret = security_path_mknod(&path, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	dir_inode = d_inode(path.dentry);
+	if (dir_inode->i_op != &bpf_dir_iops) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = security_inode_mknod(dir_inode, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_mkprog(dir_inode, dentry, raw.prog, i_mode);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_mkmap(dir_inode, dentry, raw.map, i_mode);
+		break;
+	}
+out:
+	done_path_create(&path, dentry);
+	return ret;
+}
+
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type)
+{
+	struct inode *inode;
+	union bpf_any raw;
+	struct path path;
+	int ret;
+
+	ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto out;
+
+	inode = d_backing_inode(path.dentry);
+	ret = inode_permission(inode, MAY_WRITE);
+	if (ret)
+		goto out_path;
+
+	ret = bpf_inode_type(inode, type);
+	if (ret)
+		goto out_path;
+
+	raw.raw_ptr = inode->i_private;
+	if (!raw.raw_ptr) {
+		ret = -EACCES;
+		goto out_path;
+	}
+
+	bpf_any_get(raw, *type);
+	touch_atime(&path);
+	path_put(&path);
+
+	return raw;
+out_path:
+	path_put(&path);
+out:
+	raw.raw_ptr = ERR_PTR(ret);
+	return raw;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3fff82c..b4a93b8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -679,6 +679,108 @@  free_prog_nouncharge:
 	return err;
 }
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		atomic_inc(&raw.prog->aux->refcnt);
+		break;
+	case BPF_FD_TYPE_MAP:
+		atomic_inc(&raw.map->refcnt);
+		break;
+	}
+}
+
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		bpf_prog_put(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		bpf_map_put(raw.map);
+		break;
+	}
+}
+
+#define BPF_PIN_FD_LAST_FIELD	pathname
+#define BPF_NEW_FD_LAST_FIELD	BPF_PIN_FD_LAST_FIELD
+
+static int bpf_pin_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if (CHECK_ATTR(BPF_PIN_FD))
+		return -EINVAL;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	type = BPF_FD_TYPE_MAP;
+	raw.map = bpf_map_get(attr->fd);
+	if (IS_ERR(raw.map)) {
+		type = BPF_FD_TYPE_PROG;
+		raw.prog = bpf_prog_get(attr->fd);
+		if (IS_ERR(raw.prog)) {
+			ret = PTR_ERR(raw.raw_ptr);
+			goto out;
+		}
+	}
+
+	ret = bpf_fd_inode_add(pathname, raw, type);
+	if (ret != 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
+static int bpf_new_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if ((CHECK_ATTR(BPF_NEW_FD)) || attr->fd != 0)
+		return -EINVAL;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	raw = bpf_fd_inode_get(pathname, &type);
+	if (IS_ERR(raw.raw_ptr)) {
+		ret = PTR_ERR(raw.raw_ptr);
+		goto out;
+	}
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_prog_new_fd(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_map_new_fd(raw.map);
+		break;
+	default:
+		/* Shut up gcc. */
+		ret = -ENOENT;
+		break;
+	}
+
+	if (ret < 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -739,6 +841,12 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_PROG_LOAD:
 		err = bpf_prog_load(&attr);
 		break;
+	case BPF_PIN_FD:
+		err = bpf_pin_fd(&attr);
+		break;
+	case BPF_NEW_FD:
+		err = bpf_new_fd(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;