
[RFC,3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

Message ID 1471442448-1248-4-git-send-email-daniel@zonque.org
State RFC, archived
Delegated to: David Miller

Commit Message

Daniel Mack Aug. 17, 2016, 2 p.m. UTC
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching eBPF programs to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the field in union bpf_attr is called 'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory. These
are the only use-cases implemented by this patch at this point, but
more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program before installing a new one, which would leave a
gap in which no program is installed at all.

The current implementation walks the tree from the passed cgroup up
to the root. If there is any program of the given type installed in
any of the ancestors, the installation is rejected. This is because
programs subject to restrictions should have no way of escaping if
a higher-level cgroup has installed a program already. This restriction
can be revisited at some later point in time.

The API is guarded by CAP_NET_ADMIN right now, which is also something
that can be relaxed in the future.

The new bpf commands will return -EINVAL for !CONFIG_CGROUP_BPF.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h |  14 +++++
 kernel/bpf/syscall.c     | 132 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+)

Comments

Tejun Heo Aug. 17, 2016, 2:20 p.m. UTC | #1
Hello, Daniel.

On Wed, Aug 17, 2016 at 04:00:46PM +0200, Daniel Mack wrote:
> The current implementation walks the tree from the passed cgroup up
> to the root. If there is any program of the given type installed in
> any of the ancestors, the installation is rejected. This is because
> programs subject to restrictions should have no way of escaping if
> a higher-level cgroup has installed a program already. This restriction
> can be revisited at some later point in time.

This seems a bit too restrictive to me.  Given that an entity which
can install a bpf program can remove any programs anyway, wouldn't it
make more sense to just execute the program on the closest ancestor to
the cgroup?  That'd allow selectively overriding programs for specific
subhierarchies while applying generic policies to the rest.

> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
...
> +		/* Reject installation of a program if any ancestor has one. */
> +		for (pos = cgrp->self.parent; pos; pos = pos->parent) {
> +			struct cgroup *parent;
> +
> +			css_get(pos);

This refcnting pattern doesn't make sense.  If @pos is already
accessible (which it is in this case), getting an extra ref before
accessing it doesn't do anything.  This would only be necessary if the
pointer is to be used outside the scope which is protected by the ref
on @cgrp.

> +			parent = container_of(pos, struct cgroup, self);
> +
> +			if ((is_ingress  && parent->bpf_ingress) ||
> +			    (!is_ingress && parent->bpf_egress))
> +				err = -EEXIST;
> +
> +			css_put(pos);
> +		}
> +
> +		if (err < 0) {
> +			bpf_prog_put(prog);
> +			return err;
> +		}
> +
> +		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> +
> +		rcu_read_lock();
> +		old_prog = rcu_dereference(*progp);
> +		rcu_assign_pointer(*progp, prog);

Wouldn't it make sense to update the descendants to point to the
programs directly so that there's no need to traverse the tree on each
packet?  It's more state to maintain but I think it would make total
sense given that it would reduce per-packet cost.

Otherwise, looks good to me.

Thanks.
Daniel Mack Aug. 17, 2016, 2:35 p.m. UTC | #2
Hi Tejun,

On 08/17/2016 04:20 PM, Tejun Heo wrote:
> On Wed, Aug 17, 2016 at 04:00:46PM +0200, Daniel Mack wrote:
>> The current implementation walks the tree from the passed cgroup up
>> to the root. If there is any program of the given type installed in
>> any of the ancestors, the installation is rejected. This is because
>> programs subject to restrictions should have no way of escaping if
>> a higher-level cgroup has installed a program already. This restriction
>> can be revisited at some later point in time.
> 
> This seems a bit too restrictive to me.  Given that an entity which
> can install a bpf program can remove any programs anyway, wouldn't it
> make more sense to just execute the program on the closest ancestor to
> the cgroup?  That'd allow selectively overriding programs for specific
> subhierarchies while applying generic policies to the rest.

The idea is not to walk the cgroup tree for each packet, to avoid the
per-packet cost. But yes, that also means we don't have to be overly
restrictive.

>> +static int bpf_prog_attach(const union bpf_attr *attr)
>> +{
> ...
>> +		/* Reject installation of a program if any ancestor has one. */
>> +		for (pos = cgrp->self.parent; pos; pos = pos->parent) {
>> +			struct cgroup *parent;
>> +
>> +			css_get(pos);
> 
> This refcnting pattern doesn't make sense.  If @pos is already
> accessible (which it is in this case), getting an extra ref before
> accessing it doesn't do anything.  This would only be necessary if the
> pointer is to be used outside the scope which is protected by the ref
> on @cgrp.

Right, that was a left-over from an older series. Will drop.

>> +			parent = container_of(pos, struct cgroup, self);
>> +
>> +			if ((is_ingress  && parent->bpf_ingress) ||
>> +			    (!is_ingress && parent->bpf_egress))
>> +				err = -EEXIST;
>> +
>> +			css_put(pos);
>> +		}
>> +
>> +		if (err < 0) {
>> +			bpf_prog_put(prog);
>> +			return err;
>> +		}
>> +
>> +		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
>> +
>> +		rcu_read_lock();
>> +		old_prog = rcu_dereference(*progp);
>> +		rcu_assign_pointer(*progp, prog);
> 
> Wouldn't it make sense to update the descendants to point to the
> programs directly so that there's no need to traverse the tree on each
> packet?  It's more state to maintain but I think it would make total
> sense given that it would reduce per-packet cost.

The idea is to only look at the actual v2 cgroup a task is in, and not
traverse in any way, to keep the per-packet cost at a minimum. That of
course means that in a deeper level of the hierarchy, the programs are
no longer executed and need to be reinstalled after a new cgroup is
created. To bring this more in line with how cgroups usually work, I
guess we should add code to copy over the bpf programs from the ancestor
once a new cgroup is instantiated.


Thanks,
Daniel
Tejun Heo Aug. 17, 2016, 3:06 p.m. UTC | #3
Hello,

On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> > Wouldn't it make sense to update the descendants to point to the
> > programs directly so that there's no need to traverse the tree on each
> > packet?  It's more state to maintain but I think it would make total
> > sense given that it would reduce per-packet cost.
> 
> The idea is to only look at the actual v2 cgroup a task is in, and not
> traverse in any way, to keep the per-packet cost at a minimum. That of
> course means that in a deeper level of the hierarchy, the programs are
> no longer executed and need to be reinstalled after a new cgroup is
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.

So, yeah, just executing the program from the exact cgroup wouldn't
work because it'd be easy for a delegatee to escape enforced policies
by simply creating subcgroups.

What I'd suggest is keeping two sets of pointers, one set for the
programs explicitly attached to the cgroup and the other for programs
which are to be executed for the cgroup.  The two pointers in the
latter set will point to the closest non-null program in its ancestry.
Note that the pointers may point to programs belonging to different
ancestors.

	A - B - C
              \ D - E

Let's say B has both ingress and egress programs and D only has
egress.  E's execution pointers should point to D's egress program and
B's ingress program.

The pointers would need to be updated

1. When a program is installed or removed.  For simplicity's sake, we
   can just walk the whole hierarchy.  If that's too expensive
   (shouldn't be), we can find out the closest ancestor where both
   programs can be determined and then walk from there.

2. When a new cgroup is created, it should inherit its execution table
   from its parent.

Thanks.
Tejun Heo Aug. 17, 2016, 3:08 p.m. UTC | #4
Hello, again.

Just one addition.

On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.

Please don't copy the actual configuration or program.  Every case
where a cgroup controller took this approach turned out pretty badly.
It gets really confusing.

Thanks.
Daniel Mack Aug. 17, 2016, 3:51 p.m. UTC | #5
On 08/17/2016 05:06 PM, Tejun Heo wrote:
> What I'd suggest is keeping two sets of pointers, one set for the
> programs explicitly attached to the cgroup and the other for programs
> which are to be executed for the cgroup.  The two pointers in the
> latter set will point to the closest non-null program in its ancestry.
> Note that the pointers may point to programs belonging to different
> ancestors.
> 
> 	A - B - C
>               \ D - E
> 
> Let's say B has both ingress and egress programs and D only has
> egress.  E's execution pointers should point to D's egress program and
> B's ingress program.
> 
> The pointers would need to be updated
> 
> 1. When a program is installed or removed.  For simplicity's sake, we
>    can just walk the whole hierarchy.  If that's too expensive
>    (shouldn't be), we can find out the closest ancestor where both
>    programs can be determined and then walk from there.
> 
> 2. When a new cgroup is created, it should inherit its execution table
>    from its parent.

Ok, yes, that sounds sane to me. Will implement that scheme.


Thanks!
Daniel
Eric Dumazet Aug. 17, 2016, 4:16 p.m. UTC | #6
On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:

> +		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> +
> +		rcu_read_lock();
> +		old_prog = rcu_dereference(*progp);
> +		rcu_assign_pointer(*progp, prog);
> +
> +		if (old_prog)
> +			bpf_prog_put(old_prog);
> +
> +		rcu_read_unlock();


This is a bogus locking strategy.

You do not want to use rcu_read_lock()/rcu_read_unlock() here, but
appropriate writer exclusion (a mutex probably, or a spinlock).

Then use rcu_dereference_protected() instead of rcu_dereference(*progp);
Alexei Starovoitov Aug. 17, 2016, 5:48 p.m. UTC | #7
On Wed, Aug 17, 2016 at 05:51:44PM +0200, Daniel Mack wrote:
> On 08/17/2016 05:06 PM, Tejun Heo wrote:
> > What I'd suggest is keeping two sets of pointers, one set for the
> > programs explicitly attached to the cgroup and the other for programs
> > which are to be executed for the cgroup.  The two pointers in the
> > latter set will point to the closest non-null program in its ancestry.
> > Note that the pointers may point to programs belonging to different
> > ancestors.
> > 
> > 	A - B - C
> >               \ D - E
> > 
> > Let's say B has both ingress and egress programs and D only has
> > egress.  E's execution pointers should point to D's egress program and
> > B's ingress program.
> > 
> > The pointers would need to be updated
> > 
> > 1. When a program is installed or removed.  For simplicity's sake, we
> >    can just walk the whole hierarchy.  If that's too expensive
> >    (shouldn't be), we can find out the closest ancestor where both
> >    programs can be determined and then walk from there.
> > 
> > 2. When a new cgroup is created, it should inherit its execution table
> >    from its parent.
> 
> Ok, yes, that sounds sane to me. Will implement that scheme.

That makes sense to me as well. Sounds quite flexible and fast
with only one check per packet per direction.
Alexei Starovoitov Aug. 17, 2016, 6:10 p.m. UTC | #8
On Wed, Aug 17, 2016 at 09:16:02AM -0700, Eric Dumazet wrote:
> On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:
> 
> > +		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> > +
> > +		rcu_read_lock();
> > +		old_prog = rcu_dereference(*progp);
> > +		rcu_assign_pointer(*progp, prog);
> > +
> > +		if (old_prog)
> > +			bpf_prog_put(old_prog);
> > +
> > +		rcu_read_unlock();
> 
> 
> This is a bogus locking strategy.

yep. this rcu_lock/unlock won't solve the race between parallel
bpf_prog_attach calls.
please use xchg() similar to bpf_fd_array_map_update_elem()
Then the prog pointer will be swapped atomically and bpf_prog_put()
will free it via call_rcu().
The reader side in sk_filter_cgroup_bpf() looks correct.
Daniel Mack Aug. 18, 2016, 3:17 p.m. UTC | #9
On 08/17/2016 08:10 PM, Alexei Starovoitov wrote:
> On Wed, Aug 17, 2016 at 09:16:02AM -0700, Eric Dumazet wrote:
>> On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:
>>
>>> +		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
>>> +
>>> +		rcu_read_lock();
>>> +		old_prog = rcu_dereference(*progp);
>>> +		rcu_assign_pointer(*progp, prog);
>>> +
>>> +		if (old_prog)
>>> +			bpf_prog_put(old_prog);
>>> +
>>> +		rcu_read_unlock();
>>
>>
>> This is a bogus locking strategy.
> 
> yep. this rcu_lock/unlock won't solve the race between parallel
> bpf_prog_attach calls.
> please use xchg() similar to bpf_fd_array_map_update_elem()
> Then the prog pointer will be swapped atomically and bpf_prog_put()
> will free it via call_rcu().
> The reader side in sk_filter_cgroup_bpf() looks correct.

Thanks! I reworked all the bits I got comments on, and fixed some other
details as well. I'll wait some days to see what else shakes out of this
thread, and then post again.

FWIW, the current code can be found here:

  https://github.com/zonque/linux/commits/cg-bpf-syscall


Thanks,
Daniel

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 913b147..b8b8925 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@  enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_PROG_ATTACH,
+	BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -98,6 +100,11 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
 };
 
+enum bpf_attach_type {
+	BPF_ATTACH_TYPE_CGROUP_INGRESS,
+	BPF_ATTACH_TYPE_CGROUP_EGRESS,
+};
+
 #define BPF_PSEUDO_MAP_FD	1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
@@ -141,6 +148,13 @@  union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32		target_fd;	/* container object to attach to */
+		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
+		__u64		attach_flags;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..036465d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,132 @@  static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	bool is_ingress = false;
+	int err = 0;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	/* Flags are unused for now */
+	if (attr->attach_flags != 0)
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+		is_ingress = true;
+		/* fall through */
+
+	case BPF_ATTACH_TYPE_CGROUP_EGRESS: {
+		struct bpf_prog *prog, *old_prog, **progp;
+		struct cgroup_subsys_state *pos;
+		struct cgroup *cgrp;
+
+		prog = bpf_prog_get_type(attr->attach_bpf_fd,
+					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp)) {
+			err = PTR_ERR(cgrp);
+			bpf_prog_put(prog);
+			return err;
+		}
+
+		/* Reject installation of a program if any ancestor has one. */
+		for (pos = cgrp->self.parent; pos; pos = pos->parent) {
+			struct cgroup *parent;
+
+			css_get(pos);
+			parent = container_of(pos, struct cgroup, self);
+
+			if ((is_ingress  && parent->bpf_ingress) ||
+			    (!is_ingress && parent->bpf_egress))
+				err = -EEXIST;
+
+			css_put(pos);
+		}
+
+		if (err < 0) {
+			bpf_prog_put(prog);
+			return err;
+		}
+
+		progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
+
+		rcu_read_lock();
+		old_prog = rcu_dereference(*progp);
+		rcu_assign_pointer(*progp, prog);
+
+		if (old_prog)
+			bpf_prog_put(old_prog);
+
+		rcu_read_unlock();
+		cgroup_put(cgrp);
+
+		break;
+	}
+#endif /* CONFIG_CGROUP_BPF */
+
+	default:
+		return -EINVAL;
+	}
+
+	return err;
+}
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	int err = 0;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	switch (attr->attach_type) {
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+	case BPF_ATTACH_TYPE_CGROUP_EGRESS: {
+		struct bpf_prog *prog, **progp;
+		struct cgroup *cgrp;
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp))
+			return PTR_ERR(cgrp);
+
+		progp = attr->attach_type == BPF_ATTACH_TYPE_CGROUP_INGRESS ?
+			&cgrp->bpf_ingress :
+			&cgrp->bpf_egress;
+
+		rcu_read_lock();
+		prog = rcu_dereference(*progp);
+
+		if (prog) {
+			rcu_assign_pointer(*progp, NULL);
+			bpf_prog_put(prog);
+		} else {
+			err = -ENOENT;
+		}
+
+		rcu_read_unlock();
+		cgroup_put(cgrp);
+
+		break;
+	}
+#endif /* CONFIG_CGROUP_BPF */
+
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -888,6 +1014,12 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
+	case BPF_PROG_ATTACH:
+		err = bpf_prog_attach(&attr);
+		break;
+	case BPF_PROG_DETACH:
+		err = bpf_prog_detach(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;