
[v3,5/6] net: core: run cgroup eBPF egress programs

Message ID 1472241532-11682-6-git-send-email-daniel@zonque.org
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Daniel Mack Aug. 26, 2016, 7:58 p.m. UTC
If the cgroup associated with the receiving socket has eBPF programs
installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to return 1 to let the
packet pass, or any other value to drop it. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/core/dev.c | 6 ++++++
 1 file changed, 6 insertions(+)
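
The return convention and the two guards described above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions:
the struct, the model_* names and the single-argument signature are
hypothetical, a plain bool stands in for the static key, and mapping a drop
verdict to -EPERM is an assumption rather than something this patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these definitions are the kernel's. */
#define CONFIG_CGROUP_BPF 1	/* remove this line to build the no-op stub */

struct sk_buff { size_t len; };	/* hypothetical stand-in for the real skb */
typedef int (*filter_prog_t)(const struct sk_buff *skb);

#ifdef CONFIG_CGROUP_BPF
/* Plain flag standing in for the static key (assumption). */
static bool model_bpf_enabled;
static filter_prog_t model_egress_prog;	/* NULL: nothing attached */

static void model_attach_egress(filter_prog_t prog)
{
	model_egress_prog = prog;
	model_bpf_enabled = (prog != NULL);
}

static inline int model_run_filter(const struct sk_buff *skb)
{
	if (!model_bpf_enabled)
		return 0;	/* feature unused: cheap early exit */
	assert(model_egress_prog != NULL);
	/* Verdict 1 passes; any other value drops (error code assumed). */
	return model_egress_prog(skb) == 1 ? 0 : -EPERM;
}
#else
static inline void model_attach_egress(filter_prog_t prog)
{
	(void)prog;
}

static inline int model_run_filter(const struct sk_buff *skb)
{
	(void)skb;
	return 0;	/* compiled out: always pass */
}
#endif

/* Example program: pass only packets of at most 1500 bytes. */
static int small_packets_only(const struct sk_buff *skb)
{
	return skb->len <= 1500 ? 1 : 0;
}
```

With no program attached every packet passes. After
model_attach_egress(small_packets_only), a 9000-byte packet yields a nonzero
return code that the caller is expected to treat as a drop.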

Comments

Daniel Borkmann Aug. 29, 2016, 10:03 p.m. UTC | #1
On 08/26/2016 09:58 PM, Daniel Mack wrote:
> If the cgroup associated with the receiving socket has an eBPF
> programs installed, run them from __dev_queue_xmit().
>
> eBPF programs used in this context are expected to either return 1 to
> let the packet pass, or != 1 to drop them. The programs have access to
> the full skb, including the MAC headers.
>
> Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> the feature is unused.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>   net/core/dev.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a75df86..17484e6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -141,6 +141,7 @@
>   #include <linux/netfilter_ingress.h>
>   #include <linux/sctp.h>
>   #include <linux/crash_dump.h>
> +#include <linux/bpf-cgroup.h>
>
>   #include "net-sysfs.h"
>
> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>   	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>   		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>
> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
> +	if (rc)
> +		return rc;

This would leak the whole skb by the way.

Apart from that, could this be modeled w/o affecting the forwarding path (at some
local output point where we know to have a valid socket)? Then you could also drop
the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
just zeroes since not available at that point?

>   	/* Disable soft irqs for various locks below. Also
>   	 * stops preemption for RCU.
>   	 */
>
Sargun Dhillon Aug. 29, 2016, 10:23 p.m. UTC | #2
On Tue, Aug 30, 2016 at 12:03:23AM +0200, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
> >If the cgroup associated with the receiving socket has an eBPF
> >programs installed, run them from __dev_queue_xmit().
> >
> >eBPF programs used in this context are expected to either return 1 to
> >let the packet pass, or != 1 to drop them. The programs have access to
> >the full skb, including the MAC headers.
> >
> >Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> >for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> >the feature is unused.
> >
> >Signed-off-by: Daniel Mack <daniel@zonque.org>
> >---
> >  net/core/dev.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index a75df86..17484e6 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -141,6 +141,7 @@
> >  #include <linux/netfilter_ingress.h>
> >  #include <linux/sctp.h>
> >  #include <linux/crash_dump.h>
> >+#include <linux/bpf-cgroup.h>
> >
> >  #include "net-sysfs.h"
> >
> >@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >  	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> >  		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
> >
> >+	rc = cgroup_bpf_run_filter(skb->sk, skb,
> >+				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
> >+	if (rc)
> >+		return rc;
> 
> This would leak the whole skb by the way.
> 
> Apart from that, could this be modeled w/o affecting the forwarding path (at some
> local output point where we know to have a valid socket)? Then you could also drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
> just zeroes since not available at that point?
> 
> >  	/* Disable soft irqs for various locks below. Also
> >  	 * stops preemption for RCU.
> >  	 */
> >
Given this patchset only affects AF_INET and AF_INET6, why not put the hooks at
ip_output() and ip6_output()?
Daniel Mack Sept. 5, 2016, 2:22 p.m. UTC | #3
On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index a75df86..17484e6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -141,6 +141,7 @@
>>   #include <linux/netfilter_ingress.h>
>>   #include <linux/sctp.h>
>>   #include <linux/crash_dump.h>
>> +#include <linux/bpf-cgroup.h>
>>
>>   #include "net-sysfs.h"
>>
>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>>   	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>   		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>
>> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
>> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>> +	if (rc)
>> +		return rc;
> 
> This would leak the whole skb by the way.

Ah, right.

> Apart from that, could this be modeled w/o affecting the forwarding path (at some
> local output point where we know to have a valid socket)? Then you could also drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
> just zeroes since not available at that point?

Hmm, I wonder where this hook could be put instead then. When placed in
ip_output() and ip6_output(), the mac headers cannot be pushed before
running the program, resulting in bogus skb data from the eBPF program.

Also, if I read the code correctly, ip[6]_output is not called for
multicast packets.

Any other ideas?


Thanks,
Daniel
Daniel Borkmann Sept. 6, 2016, 5:14 p.m. UTC | #4
On 09/05/2016 04:22 PM, Daniel Mack wrote:
> On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index a75df86..17484e6 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -141,6 +141,7 @@
>>>    #include <linux/netfilter_ingress.h>
>>>    #include <linux/sctp.h>
>>>    #include <linux/crash_dump.h>
>>> +#include <linux/bpf-cgroup.h>
>>>
>>>    #include "net-sysfs.h"
>>>
>>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>>>    	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>>    		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>>
>>> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
>>> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>>> +	if (rc)
>>> +		return rc;
>>
>> This would leak the whole skb by the way.
>
> Ah, right.
>
>> Apart from that, could this be modeled w/o affecting the forwarding path (at some
>> local output point where we know to have a valid socket)? Then you could also drop
>> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
>> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
>> just zeroes since not available at that point?
>
> Hmm, I wonder where this hook could be put instead then. When placed in
> ip_output() and ip6_output(), the mac headers cannot be pushed before
> running the program, resulting in bogus skb data from the eBPF program.

But as it stands right now, RX will only see a subset of packets in the
sk_filter() layer (depending on where it's called in the proto handler
implementation, so it might not even include all control messages, for
example). The TX hook, by contrast, goes as far as 'seeing' everything,
including forwarded packets, even though we know a priori that such skbs
will just skip the cgroup_bpf_run_filter() handler anyway when the hook is
enabled. What about letting such progs see /only/ local skbs for RX and TX,
with skb->data from L3 onwards (iirc, that would be similar to what current
sk_filter() programs see)?

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index a75df86..17484e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@ 
 #include <linux/netfilter_ingress.h>
 #include <linux/sctp.h>
 #include <linux/crash_dump.h>
+#include <linux/bpf-cgroup.h>
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,11 @@  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
 		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+	rc = cgroup_bpf_run_filter(skb->sk, skb,
+				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
+	if (rc)
+		return rc;
+
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
 	 */
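
The review comment about the leak concerns ownership: the transmit path
consumes the skb, so an early return has to free it (in the kernel,
kfree_skb() would be the usual call). A minimal userspace sketch of that rule
follows. The struct pkt type, the live_pkts counter and all function names are
hypothetical, and -EPERM as the drop code is an assumption; this is not the
fix that was eventually applied.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace model of skb ownership; not kernel code. */
struct pkt { int verdict; };	/* verdict the filter program would return */

static int live_pkts;		/* packets allocated but not yet freed */

static struct pkt *pkt_alloc(int verdict)
{
	struct pkt *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->verdict = verdict;
	live_pkts++;
	return p;
}

static void pkt_free(struct pkt *p)	/* plays the role of kfree_skb() */
{
	assert(p != NULL);
	free(p);
	live_pkts--;
}

static int run_filter(const struct pkt *p)
{
	/* 1 means pass; anything else is a drop (error code assumed). */
	return p->verdict == 1 ? 0 : -EPERM;
}

/* Transmit path that owns p: every exit must consume the packet. */
static int xmit(struct pkt *p)
{
	int rc = run_filter(p);

	if (rc) {
		pkt_free(p);	/* the step the posted hunk is missing */
		return rc;
	}

	pkt_free(p);		/* stands in for handing p to the device */
	return 0;
}
```

Calling xmit() on a packet whose verdict is not 1 returns the error and still
brings live_pkts back down, which is the property the posted hunk lacks.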