
[RFC,v4,net-next,17/26] tracing: allow eBPF programs to be attached to events

Message ID 1407916658-8731-18-git-send-email-ast@plumgrid.com
State RFC, archived
Delegated to: David Miller

Commit Message

Alexei Starovoitov Aug. 13, 2014, 7:57 a.m. UTC
User interface:
fd = open("/sys/kernel/debug/tracing/__event__/filter")

write(fd, "bpf_123")

where 123 is the process-local FD associated with a previously loaded eBPF program.
__event__ is a static tracepoint event or syscall.
(kprobe support is in the next patch)
Once the program is successfully attached to a tracepoint event, the tracepoint
will be auto-enabled.

close(fd)
auto-disables the tracepoint event and detaches the eBPF program from it.
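
The open/write/close sequence above can be sketched as a small userspace helper. This is a hypothetical illustration of the RFC's interface, not code from the patch; the exact event path layout and the return-the-fd convention are assumptions:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper mirroring the RFC's flow: open the event's
 * filter file, write "bpf_<prog_fd>" to attach the program, keep
 * the fd open while tracing, close() it later to detach. */
int attach_bpf_prog(const char *event, int prog_fd)
{
    char path[256], cmd[32];
    int fd;

    snprintf(path, sizeof(path),
             "/sys/kernel/debug/tracing/%s/filter", event);
    snprintf(cmd, sizeof(cmd), "bpf_%d", prog_fd);

    fd = open(path, O_WRONLY);   /* needs root and mounted debugfs */
    if (fd < 0)
        return -1;
    if (write(fd, cmd, strlen(cmd)) != (ssize_t)strlen(cmd)) {
        close(fd);
        return -1;
    }
    return fd;   /* caller close()s this fd to auto-disable/detach */
}
```

Closing the returned fd is what triggers the auto-disable described above.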

eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- trace_printk
- dump_stack
- fetch_ptr/u64/u32/u16/u8 values from unsafe address via probe_kernel_read(),
  so that eBPF program can walk any kernel data structures
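
The helper list above suggests the general shape of a tracing program. Below is a userspace mock, not eBPF: the helper names and signatures are stand-ins stubbed out so the sketch compiles on its own (the real helpers live in the patch's bpf_trace.c, and the real fetch helpers go through probe_kernel_read() so a bad pointer cannot crash the kernel):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stubs standing in for in-kernel helpers; names and
 * signatures here are illustrative guesses, not the patch's API. */
static long stub_trace_printk(const char *fmt, long v) { return printf(fmt, v); }
static uint64_t stub_fetch_u64(const void *ptr) { return *(const uint64_t *)ptr; }

struct bpf_context { unsigned long arg1, arg2, arg3, arg4, arg5, arg6, ret; };

/* Shape of a tracing program: take the context, fetch a value
 * behind an argument pointer, log it, and return. */
static int sample_prog(struct bpf_context *ctx)
{
    uint64_t val = stub_fetch_u64((const void *)ctx->arg1);
    stub_trace_printk("val=%ld\n", (long)val);
    return 0;
}
```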

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/ftrace_event.h       |    5 +
 include/trace/bpf_trace.h          |   30 +++++
 include/trace/ftrace.h             |   10 ++
 include/uapi/linux/bpf.h           |    9 ++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  241 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   36 +++++-
 kernel/trace/trace_events_filter.c |   72 ++++++++++-
 kernel/trace/trace_syscalls.c      |   18 +++
 11 files changed, 424 insertions(+), 2 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c

Comments

Brendan Gregg Aug. 14, 2014, 9:20 p.m. UTC | #1
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
[...]
> +/* For tracing filters save first six arguments of tracepoint events.
> + * On 64-bit architectures argN fields will match one to one to arguments passed
> + * to tracepoint events.
> + * On 32-bit architectures u64 arguments to events will be seen into two
> + * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
> + * match one to one
> + */
> +struct bpf_context {
> +       unsigned long arg1;
> +       unsigned long arg2;
> +       unsigned long arg3;
> +       unsigned long arg4;
> +       unsigned long arg5;
> +       unsigned long arg6;
> +       unsigned long ret;
> +};

While this works, the argN+1 shift on 32-bit is a gotcha to learn.
Let's say arg1 was 64-bit, and my program only examined arg2. I'd need
two programs: one for 64-bit (using arg2) and one for 32-bit (using
arg3). If there were a way not to shift arguments, I could have one
program for both. E.g., additional arg1hi, arg2hi, ... fields for the
high-order u32s.
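
A hypothetical sketch of this suggestion (the arg1hi field and accessor are the proposed naming, not anything in the patch): keep argN as the low word and add argNhi for the high 32 bits, so one program can reassemble a u64 argument identically on either arch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alternative layout per the suggestion above. */
struct bpf_context_alt {
    unsigned long arg1;   /* low word (the full value on 64-bit) */
    uint32_t      arg1hi; /* high 32 bits on 32-bit archs, else 0 */
    /* ... arg2/arg2hi .. arg6/arg6hi, ret ... */
};

/* Same reassembly code works on 32- and 64-bit kernels. */
static uint64_t ctx_arg1_u64(const struct bpf_context_alt *c)
{
    if (sizeof(long) == 8)                 /* 64-bit: already whole */
        return (uint64_t)c->arg1;
    return ((uint64_t)c->arg1hi << 32) |   /* 32-bit: join halves */
           (uint32_t)c->arg1;
}
```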

Brendan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Aug. 15, 2014, 6:08 a.m. UTC | #2
On Thu, Aug 14, 2014 at 2:20 PM, Brendan Gregg
<brendan.d.gregg@gmail.com> wrote:
> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> [...]
>> +/* For tracing filters save first six arguments of tracepoint events.
>> + * On 64-bit architectures argN fields will match one to one to arguments passed
>> + * to tracepoint events.
>> + * On 32-bit architectures u64 arguments to events will be seen into two
>> + * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
>> + * match one to one
>> + */
>> +struct bpf_context {
>> +       unsigned long arg1;
>> +       unsigned long arg2;
>> +       unsigned long arg3;
>> +       unsigned long arg4;
>> +       unsigned long arg5;
>> +       unsigned long arg6;
>> +       unsigned long ret;
>> +};
>
> While this works, the argN+1 shift for 32-bit is a gotcha to learn.
> Lets say arg1 was 64-bit, and my program only examined arg2. I'd need
> two programs, one for 64-bit (using arg2) and 32-bit (arg3). If there

correct.
I've picked the 'long' type for these tracepoint 'arguments' to match
what is going on at the assembler level.
32-bit archs pass 64-bit values in two consecutive registers
or two stack slots, so this partially exposes architectural details.
I tried using u64 here, but it complicated the tracepoint+eBPF patch
a lot, since I'd need per-architecture support for moving C arguments
into u64 variables and would have to hack tracepoint event definitions
in nasty ways. This 'long' type approach is the least intrusive I could find.
Also, out of 1842 total tracepoint fields, only 144 are 64-bit,
so one would rarely need to deal with u64. Most tracepoint
arguments are longs, ints or pointers, which fits this approach
best.
In general the eBPF design approach is to keep the kernel bits as simple
as possible and move complexity to user space.
In this case a language higher-level than C for writing scripts can
hide this oddity.
Andy Lutomirski Aug. 15, 2014, 5:20 p.m. UTC | #3
On Thu, Aug 14, 2014 at 11:08 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Aug 14, 2014 at 2:20 PM, Brendan Gregg
> <brendan.d.gregg@gmail.com> wrote:
>> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> [...]
>>> +/* For tracing filters save first six arguments of tracepoint events.
>>> + * On 64-bit architectures argN fields will match one to one to arguments passed
>>> + * to tracepoint events.
>>> + * On 32-bit architectures u64 arguments to events will be seen into two
>>> + * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
>>> + * match one to one
>>> + */
>>> +struct bpf_context {
>>> +       unsigned long arg1;
>>> +       unsigned long arg2;
>>> +       unsigned long arg3;
>>> +       unsigned long arg4;
>>> +       unsigned long arg5;
>>> +       unsigned long arg6;
>>> +       unsigned long ret;
>>> +};
>>
>> While this works, the argN+1 shift for 32-bit is a gotcha to learn.
>> Lets say arg1 was 64-bit, and my program only examined arg2. I'd need
>> two programs, one for 64-bit (using arg2) and 32-bit (arg3). If there
>
> correct.
> I've picked 'long' type for these tracepoint 'arguments' to match
> what is going on at assembler level.
> 32-bit archs are passing 64-bit values in two consecutive registers
> or two stack slots. So it's partially exposing architectural details.
> I've tried to use u64 here, but it complicated tracepoint+ebpf patch
> a lot, since I need per-architecture support for moving C arguments
> into u64 variables and hacking tracepoint event definitions in a nasty
> ways. This 'long' type approach is the least intrusive I could find.
> Also out of 1842 total tracepoint fields, only 144 fields are 64-bit,
> so rarely one would need to deal with u64. Most of the tracepoint
> arguments are either longs, ints or pointers, which fits this approach
> the best.
> In general the eBPF design approach is to keep kernel bits as simple
> as possible and move complexity to user space.
> In this case some higher language than C for writing scripts can
> hide this oddity.

The downside of this approach is that compat support might be
difficult or impossible.

--Andy
Andy Lutomirski Aug. 15, 2014, 5:25 p.m. UTC | #4
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> User interface:
> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>
> write(fd, "bpf_123")

I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem.  In general, write(2) should never do anything that involves
any security-relevant context of the caller.

Ideally, you would look up fd 123 in the file table of whomever called
open.  If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.

This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.

--Andy
Alexei Starovoitov Aug. 15, 2014, 5:36 p.m. UTC | #5
On Fri, Aug 15, 2014 at 10:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> The downside of this approach is that compat support might be
> difficult or impossible.

What do you mean by compat? 32-bit programs on 64-bit kernels?
There is no such concept for eBPF. All eBPF programs are always
operating on 64-bit registers.
Alexei Starovoitov Aug. 15, 2014, 5:51 p.m. UTC | #6
On Fri, Aug 15, 2014 at 10:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> User interface:
>> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>>
>> write(fd, "bpf_123")
>
> I didn't follow all the code flow leading to parsing the "bpf_123"
> string, but if it works the way I imagine it does, it's a security
> problem.  In general, write(2) should never do anything that involves
> any security-relevant context of the caller.
>
> Ideally, you would look up fd 123 in the file table of whomever called
> open.  If that's difficult to implement efficiently, then it would be
> nice to have some check that the callers of write(2) and open(2) are
> the same task and that exec wasn't called in between.
>
> This isn't a very severe security issue because you need privilege to
> open the thing in the first place, but it would still be nice to
> address.

hmm. you need to be root to open the events anyway.
Pretty much the whole of tracing is root-only, since any kernel data
structures can be printed, stored into maps and so on.
So I don't quite follow your security concern here.

Even say root opens a tracepoint and does exec() of another
app that uploads ebpf program, gets program_fd and does
write into tracepoint fd. The root app that did this open() is
doing exec() on purpose. It's not like it's exec-ing something
it doesn't know about.

Remember, FDs were your idea in the first place ;)
I had global ids and everything root-only initially.
Andy Lutomirski Aug. 15, 2014, 6:50 p.m. UTC | #7
On Aug 15, 2014 10:36 AM, "Alexei Starovoitov" <ast@plumgrid.com> wrote:
>
> On Fri, Aug 15, 2014 at 10:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > The downside of this approach is that compat support might be
> > difficult or impossible.
>
> Would do you mean by compat? 32-bit programs on 64-bit kernels?
> There is no such concept for eBPF. All eBPF programs are always
> operating on 64-bit registers.

Doesn't the eBPF program need to know sizeof(long) to read these
fields correctly?  Or am I misunderstanding what the code does?

--Andy
Andy Lutomirski Aug. 15, 2014, 6:53 p.m. UTC | #8
On Fri, Aug 15, 2014 at 10:51 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Aug 15, 2014 at 10:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> User interface:
>>> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>>>
>>> write(fd, "bpf_123")
>>
>> I didn't follow all the code flow leading to parsing the "bpf_123"
>> string, but if it works the way I imagine it does, it's a security
>> problem.  In general, write(2) should never do anything that involves
>> any security-relevant context of the caller.
>>
>> Ideally, you would look up fd 123 in the file table of whomever called
>> open.  If that's difficult to implement efficiently, then it would be
>> nice to have some check that the callers of write(2) and open(2) are
>> the same task and that exec wasn't called in between.
>>
>> This isn't a very severe security issue because you need privilege to
>> open the thing in the first place, but it would still be nice to
>> address.
>
> hmm. you need to be root to open the events anyway.
> pretty much the whole tracing for root only, since any kernel data
> structures can be printed, stored into maps and so on.
> So I don't quite follow your security concern here.
>
> Even say root opens a tracepoint and does exec() of another
> app that uploads ebpf program, gets program_fd and does
> write into tracepoint fd. The root app that did this open() is
> doing exec() on purpose. It's not like it's exec-ing something
> it doesn't know about.

As long as everyone who can access debugfs/tracing/whatever has all
privileges, then this is fine.

If not, then it's a minor capability or MAC bypass.  Suppose you only
have one capability or, more realistically, limited MAC permissions.
You can still open the tracing file, pass it to an unwitting program
with elevated permission (e.g. using selinux's entrypoint mechanism),
and trick that program into writing bpf_123.

Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
attack is possible.

I don't think that fixing this should be a prerequisite for merging,
since the risk is so small.  Nonetheless, it would be nice.  (This
family of attacks has led to several root vulnerabilities in the
past.)

--Andy

>
> Remember, FDs was your idea in the first place ;)
> I had global ids and everything root initially.
Alexei Starovoitov Aug. 15, 2014, 6:56 p.m. UTC | #9
On Fri, Aug 15, 2014 at 11:50 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Aug 15, 2014 10:36 AM, "Alexei Starovoitov" <ast@plumgrid.com> wrote:
>>
>> On Fri, Aug 15, 2014 at 10:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > The downside of this approach is that compat support might be
>> > difficult or impossible.
>>
>> Would do you mean by compat? 32-bit programs on 64-bit kernels?
>> There is no such concept for eBPF. All eBPF programs are always
>> operating on 64-bit registers.
>
> Doesn't the eBPF program need to know sizeof(long) to read these
> fields correctly?  Or am I misunderstanding what the code does?

correct. An eBPF program would use an 8-byte read on a 64-bit kernel
and a 4-byte read on a 32-bit kernel. Same with access to ptrace fields
and pretty much all other fields in the kernel. The program will be
different on different kernels.
Say this bpf_context struct didn't exist at all: the programs would
still need to be different to walk in-kernel data structures...
Andy Lutomirski Aug. 15, 2014, 7:02 p.m. UTC | #10
On Fri, Aug 15, 2014 at 11:56 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Aug 15, 2014 at 11:50 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Aug 15, 2014 10:36 AM, "Alexei Starovoitov" <ast@plumgrid.com> wrote:
>>>
>>> On Fri, Aug 15, 2014 at 10:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> > The downside of this approach is that compat support might be
>>> > difficult or impossible.
>>>
>>> Would do you mean by compat? 32-bit programs on 64-bit kernels?
>>> There is no such concept for eBPF. All eBPF programs are always
>>> operating on 64-bit registers.
>>
>> Doesn't the eBPF program need to know sizeof(long) to read these
>> fields correctly?  Or am I misunderstanding what the code does?
>
> correct. eBPF program would be using 8-byte read on 64-bit kernel
> and 4-byte read on 32-bit kernel. Same with access to ptrace fields
> and pretty much all other fields in the kernel. The program will be
> different on different kernels.
> Say, this bpf_context struct doesn't exist at all. The programs would
> still need to be different to walk in-kernel data structures...

Hmm.  I guess this isn't so bad.

What's the actual difficulty with using u64?  ISTM that, if the clang
front-end can't deal with u64, there's a bigger problem.  Or is it
something else I don't understand.

--Andy
Alexei Starovoitov Aug. 15, 2014, 7:07 p.m. UTC | #11
On Fri, Aug 15, 2014 at 11:53 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Aug 15, 2014 at 10:51 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Fri, Aug 15, 2014 at 10:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> User interface:
>>>> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>>>>
>>>> write(fd, "bpf_123")
>>>
>>> I didn't follow all the code flow leading to parsing the "bpf_123"
>>> string, but if it works the way I imagine it does, it's a security
>>> problem.  In general, write(2) should never do anything that involves
>>> any security-relevant context of the caller.
>>>
>>> Ideally, you would look up fd 123 in the file table of whomever called
>>> open.  If that's difficult to implement efficiently, then it would be
>>> nice to have some check that the callers of write(2) and open(2) are
>>> the same task and that exec wasn't called in between.
>>>
>>> This isn't a very severe security issue because you need privilege to
>>> open the thing in the first place, but it would still be nice to
>>> address.
>>
>> hmm. you need to be root to open the events anyway.
>> pretty much the whole tracing for root only, since any kernel data
>> structures can be printed, stored into maps and so on.
>> So I don't quite follow your security concern here.
>>
>> Even say root opens a tracepoint and does exec() of another
>> app that uploads ebpf program, gets program_fd and does
>> write into tracepoint fd. The root app that did this open() is
>> doing exec() on purpose. It's not like it's exec-ing something
>> it doesn't know about.
>
> As long as everyone who can debugfs/tracing/whatever has all
> privileges, then this is fine.
>
> If not, then it's a minor capability or MAC bypass.  Suppose you only
> have one capability or, more realistically, limited MAC permissions.

Hard to think of MAC abbreviation other than in networking way... ;)
MAC bypass... kinda sounds like L3 networking without L2... ;)

> You can still open the tracing file, pass it to an unwitting program
> with elevated permission (e.g. using selinux's entrypoint mechanism),
> and trick that program into writing bpf_123.

hmm, but to open the tracing file you'd need to be root already...
otherwise yeah, if non-root could open it and pass it, then it
would be nasty.

> Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
> attack is possible.
>
> I don't think that fixing this should be a prerequisite for merging,
> since the risk is so small.  Nonetheless, it would be nice.  (This
> family of attacks has lead to several root vulnerabilities in the
> past.)

Ok. I think keeping track of the pid between open and write is kinda
ugly. Should we add some new CAP flag and check it for all file
ops? Another option is to conditionally make open() of tracing
files cloexec...
Alexei Starovoitov Aug. 15, 2014, 7:16 p.m. UTC | #12
On Fri, Aug 15, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> correct. eBPF program would be using 8-byte read on 64-bit kernel
>> and 4-byte read on 32-bit kernel. Same with access to ptrace fields
>> and pretty much all other fields in the kernel. The program will be
>> different on different kernels.
>> Say, this bpf_context struct doesn't exist at all. The programs would
>> still need to be different to walk in-kernel data structures...
>
> Hmm.  I guess this isn't so bad.
>
> What's the actual difficulty with using u64?  ISTM that, if the clang
> front-end can't deal with u64, there's a bigger problem.  Or is it
> something else I don't understand.

clang/llvm has no problem with u64 :)
This bpf_context struct for tracing is trying to answer the question:
 'what's the most convenient way to access tracepoint arguments
from a script'.
When kernel code has something like:
 trace_kfree_skb(skb, net_tx_action);
the script needs to be able to access this 'skb' and 'net_tx_action'
values through _single_ data structure.
In this proposal they are ctx->arg1 and ctx->arg2.
I've considered having a different bpf_context for every event, but
the complexity explodes: I'd need to hack all event definitions and so on.
IMO it's better to move complexity to userspace, so the program author
or a high-level language abstracts these details.
Andy Lutomirski Aug. 15, 2014, 7:18 p.m. UTC | #13
On Fri, Aug 15, 2014 at 12:16 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Aug 15, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> correct. eBPF program would be using 8-byte read on 64-bit kernel
>>> and 4-byte read on 32-bit kernel. Same with access to ptrace fields
>>> and pretty much all other fields in the kernel. The program will be
>>> different on different kernels.
>>> Say, this bpf_context struct doesn't exist at all. The programs would
>>> still need to be different to walk in-kernel data structures...
>>
>> Hmm.  I guess this isn't so bad.
>>
>> What's the actual difficulty with using u64?  ISTM that, if the clang
>> front-end can't deal with u64, there's a bigger problem.  Or is it
>> something else I don't understand.
>
> clang/llvm has no problem with u64 :)
> This bpf_context struct for tracing is trying to answer the question:
>  'what's the most convenient way to access tracepoint arguments
> from a script'.
> When kernel code has something like:
>  trace_kfree_skb(skb, net_tx_action);
> the script needs to be able to access this 'skb' and 'net_tx_action'
> values through _single_ data structure.
> In this proposal they are ctx->arg1 and ctx->arg2.
> I've considered having different bpf_context's for every event, but
> the complexity explodes. I need to hack all event definitions and so on.
> imo it's better to move complexity to userspace, so program author
> or high level language abstracts these details.

I still don't understand why making them long instead of u64 is
helpful, though.  I feel like I'm missing something obvious here.
Andy Lutomirski Aug. 15, 2014, 7:20 p.m. UTC | #14
On Fri, Aug 15, 2014 at 12:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Aug 15, 2014 at 11:53 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Aug 15, 2014 at 10:51 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Fri, Aug 15, 2014 at 10:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> User interface:
>>>>> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>>>>>
>>>>> write(fd, "bpf_123")
>>>>
>>>> I didn't follow all the code flow leading to parsing the "bpf_123"
>>>> string, but if it works the way I imagine it does, it's a security
>>>> problem.  In general, write(2) should never do anything that involves
>>>> any security-relevant context of the caller.
>>>>
>>>> Ideally, you would look up fd 123 in the file table of whomever called
>>>> open.  If that's difficult to implement efficiently, then it would be
>>>> nice to have some check that the callers of write(2) and open(2) are
>>>> the same task and that exec wasn't called in between.
>>>>
>>>> This isn't a very severe security issue because you need privilege to
>>>> open the thing in the first place, but it would still be nice to
>>>> address.
>>>
>>> hmm. you need to be root to open the events anyway.
>>> pretty much the whole tracing for root only, since any kernel data
>>> structures can be printed, stored into maps and so on.
>>> So I don't quite follow your security concern here.
>>>
>>> Even say root opens a tracepoint and does exec() of another
>>> app that uploads ebpf program, gets program_fd and does
>>> write into tracepoint fd. The root app that did this open() is
>>> doing exec() on purpose. It's not like it's exec-ing something
>>> it doesn't know about.
>>
>> As long as everyone who can debugfs/tracing/whatever has all
>> privileges, then this is fine.
>>
>> If not, then it's a minor capability or MAC bypass.  Suppose you only
>> have one capability or, more realistically, limited MAC permissions.
>
> Hard to think of MAC abbreviation other than in networking way... ;)
> MAC bypass... kinda sounds like L3 networking without L2... ;)
>
>> You can still open the tracing file, pass it to an unwitting program
>> with elevated permission (e.g. using selinux's entrypoint mechanism),
>> and trick that program into writing bpf_123.
>
> hmm, but to open tracing file you'd need to be root already...
> otherwise yeah, if non-root could open it and pass it, then it
> would be nasty.
>
>> Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
>> attack is possible.
>>
>> I don't think that fixing this should be a prerequisite for merging,
>> since the risk is so small.  Nonetheless, it would be nice.  (This
>> family of attacks has lead to several root vulnerabilities in the
>> past.)
>
> Ok. I think keeping a track of pid between open and write is kinda
> ugly.

Agreed.

TBH, I would just add a comment to the open implementation saying
that, if unprivileged or less privileged open is allowed, then this
needs to be fixed.

> Should we add some new CAP flag and check it for all file
> ops? Another option is to conditionally make open() of tracing
> files as cloexec...

That won't help.  The same attack can be done with SCM_RIGHTS, and
cloexec can be cleared.
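
The SCM_RIGHTS point is worth making concrete: any fd can be shipped to another process over a unix-domain socket regardless of close-on-exec, which is why cloexec cannot pin an fd to its opener. A minimal self-contained demonstration (single process, with a socketpair standing in for two cooperating tasks):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over a unix socket in an SCM_RIGHTS control message. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive an fd sent with send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver ends up with its own fd referring to the same open file description, cloexec or not.
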
Alexei Starovoitov Aug. 15, 2014, 7:29 p.m. UTC | #15
On Fri, Aug 15, 2014 at 12:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> I don't think that fixing this should be a prerequisite for merging,
>>> since the risk is so small.  Nonetheless, it would be nice.  (This
>>> family of attacks has lead to several root vulnerabilities in the
>>> past.)
>>
>> Ok. I think keeping a track of pid between open and write is kinda
>> ugly.
>
> Agreed.
>
> TBH, I would just add a comment to the open implementation saying
> that, if unprivileged or less privileged open is allowed, then this
> needs to be fixed.

ok. will do.

>> Should we add some new CAP flag and check it for all file
>> ops? Another option is to conditionally make open() of tracing
>> files as cloexec...
>
> That won't help.  The same attack can be done with SCM_RIGHTS, and
> cloexec can be cleared.

ouch, can we then make eBPF FDs and maybe debugfs FDs
not passable at all? Otherwise it feels like the generality and
flexibility of FDs is becoming a burden.
Andy Lutomirski Aug. 15, 2014, 7:32 p.m. UTC | #16
On Fri, Aug 15, 2014 at 12:29 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Aug 15, 2014 at 12:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>
>>>> I don't think that fixing this should be a prerequisite for merging,
>>>> since the risk is so small.  Nonetheless, it would be nice.  (This
>>>> family of attacks has lead to several root vulnerabilities in the
>>>> past.)
>>>
>>> Ok. I think keeping a track of pid between open and write is kinda
>>> ugly.
>>
>> Agreed.
>>
>> TBH, I would just add a comment to the open implementation saying
>> that, if unprivileged or less privileged open is allowed, then this
>> needs to be fixed.
>
> ok. will do.
>
>>> Should we add some new CAP flag and check it for all file
>>> ops? Another option is to conditionally make open() of tracing
>>> files as cloexec...
>>
>> That won't help.  The same attack can be done with SCM_RIGHTS, and
>> cloexec can be cleared.
>
> ouch, can we then make ebpf FDs and may be debugfs FDs
> not passable at all? Otherwise it feels that generality and
> flexibility of FDs is becoming a burden.

I'm not sure there's much of a general problem.  The issue is when
there's an fd for which write(2) (or other
assumed-not-to-check-permissions calls like read, pread, pwrite, etc.)
depends on context.  This has historically been an issue for netlink
and various /proc files.

--Andy
Alexei Starovoitov Aug. 15, 2014, 7:35 p.m. UTC | #17
On Fri, Aug 15, 2014 at 12:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> clang/llvm has no problem with u64 :)
>> This bpf_context struct for tracing is trying to answer the question:
>>  'what's the most convenient way to access tracepoint arguments
>> from a script'.
>> When kernel code has something like:
>>  trace_kfree_skb(skb, net_tx_action);
>> the script needs to be able to access this 'skb' and 'net_tx_action'
>> values through _single_ data structure.
>> In this proposal they are ctx->arg1 and ctx->arg2.
>> I've considered having different bpf_context's for every event, but
>> the complexity explodes. I need to hack all event definitions and so on.
>> imo it's better to move complexity to userspace, so program author
>> or high level language abstracts these details.
>
> I still don't understand why making them long instead of u64 is
> helpful, though.  I feel like I'm missing obvious here.

I promise to come back to this... Have to go off grid...
will think of it in the mean time... Appreciate this discussion!!
Alexei Starovoitov Aug. 19, 2014, 6:39 p.m. UTC | #18
On Fri, Aug 15, 2014 at 12:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Aug 15, 2014 at 12:16 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> This bpf_context struct for tracing is trying to answer the question:
>>  'what's the most convenient way to access tracepoint arguments
>> from a script'.
>> When kernel code has something like:
>>  trace_kfree_skb(skb, net_tx_action);
>> the script needs to be able to access this 'skb' and 'net_tx_action'
>> values through _single_ data structure.
>> In this proposal they are ctx->arg1 and ctx->arg2.
>> I've considered having different bpf_context's for every event, but
>> the complexity explodes. I need to hack all event definitions and so on.
>> imo it's better to move complexity to userspace, so program author
>> or high level language abstracts these details.
>
> I still don't understand why making them long instead of u64 is
> helpful, though.  I feel like I'm missing obvious here.

The problem statement:
- tracepoint events are defined as:
  TRACE_EVENT(sock_exceed_buf_limit,
    TP_PROTO(struct sock *sk, struct proto *prot, long allocated),
    TP_ARGS(sk, prot, allocated),

- from eBPF C program (or higher level language) I would like
  to access these tracepoint arguments as
  ctx->arg1, ctx->arg2, ctx->arg3

- accessing tracepoint fields after buffer copy is not an option,
  since it's unnecessary alloc/copy/free of a lot of values
  (including strings) per event that programs will mostly ignore.
  If needed, programs can fetch them on demand.

Bad approach #1
- define different bpf_context per event and customize eBPF verifier
  to have different program types per event, so particular program
  can be attached to one particular event only
Cons: quite complex, requires hacking trace/events/*.h, and
  one eBPF program cannot be used to attach to multiple events.
So #1 is a no-go.

Approach #2
- define bpf_context once for all tracepoint events as:
  struct bpf_context {
    unsigned long arg1, arg2, arg3, ...
  };
  and main ftrace.h macro as:
    struct bpf_context ctx;
    populate_bpf_context(&ctx, args, 0, 0, 0, 0, 0);
    trace_filter_call_bpf(ftrace_file->filter, &ctx);
  where 'args' is a macro taken from TP_ARGS above and
  /* called from ftrace_raw_event_*() to copy args */
  void populate_bpf_context(struct bpf_context *ctx, ...)
  {
    va_start(args, ctx);
    ctx->arg1 = va_arg(args, unsigned long);
    ctx->arg2 = va_arg(args, unsigned long);
this approach relies on type promotion when args are passed
into a vararg function.
On 64-bit archs our tracepoint arguments 'sk, prot, allocated' will
get stored into arg1, arg2, arg3 and the rest of the argX will be zeros.
On 32-bit archs, 'u64' types will be passed in two 'long' slots of the vararg.
Obviously changing 'long' to 'u64' in bpf_context and in
populate_bpf_context() will not work, because the types
are promoted to 'long'.
The disadvantage of this approach is that 32-bit and 64-bit archs
need to deal with argX differently.
That's what you saw in this patch.

New approach #3
just discovered __MAPx() macro used by syscalls.h which can
be massaged for this use case, so define:
struct bpf_context {
  u64 arg1, arg2, arg3,...
};
and argument casting macro as:
#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
..
#define __BPF_CAST(a,...)  __CAST_TO_U64(a), __BPF_CAST6(__VA_ARGS__)

so main ftrace.h macro becomes:
struct bpf_context __ctx = ((struct bpf_context) {
  __BPF_CAST(args, 0, 0, 0, 0, 0, 0)
});
trace_filter_call_bpf(ftrace_file->filter, &__ctx);

where 'args' is still the same 'sk, prot, allocated' from TP_ARGS.
The casting macro will cast 'sk' to u64 and assign it to arg1,
cast 'prot' to u64 and assign it to arg2, and so on.

All good, but the tricky part is how to cast all arguments passed
into tracepoint events into u64 without warnings on
32-bit and 64-bit architectures.
The following:
#define __CAST_TO_U64(expr) (u64) expr
will not work, since the compiler will spew warnings about
casting a pointer to an integer...
The following:
#define __CAST_TO_U64(expr) (u64) (long) expr
will work fine on 64-bit architectures, since all integer and
pointer types will be cast warning-free and stored
in arg1, arg2, arg3, ...
but it won't work on 32-bit architectures, since full u64
tracepoint arguments will be chopped to 'long'.

It took a lot of macro wizardry to come up with the following:
/* cast any integer or pointer type to u64 without warnings */
#define __CAST_TO_U64(expr) \
         __builtin_choose_expr(sizeof(long) < sizeof(expr), \
                               (u64) (expr - ((typeof(expr))0)), \
                               (u64) (long) expr)
The tricky part is that GCC syntax-parses and warns
on both sides of __builtin_choose_expr(), so the u64 case on 32-bit
archs needs to be fancy, so that all types can go through it
warning-free.
Though it's tricky, the end result is nice.
The disadvantages of approach #2 are solved and tracepoint
arguments are stored into 'u64 argX' on both 32 and 64-bit archs.
The extra benefit is that this casting macro is way faster than
the vararg approach #2.
So in V5 of this series I'm planning to use this new approach
unless there are better ideas.
full diff of this approach:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=235d87dd7afd8d5262556cba7882d6efb25d8305
Patch

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 06c6faa9e5cc..94b896ef6b31 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -237,6 +237,7 @@  enum {
 	TRACE_EVENT_FL_WAS_ENABLED_BIT,
 	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
+	TRACE_EVENT_FL_BPF_BIT,
 };
 
 /*
@@ -259,6 +260,7 @@  enum {
 	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
 	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+	TRACE_EVENT_FL_BPF		= (1 << TRACE_EVENT_FL_BPF_BIT),
 };
 
 struct ftrace_event_call {
@@ -533,6 +535,9 @@  event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
 		event_triggers_post_call(file, tt);
 }
 
+struct bpf_context;
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx);
+
 enum {
 	FILTER_OTHER = 0,
 	FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..4dfdf738bd12
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,30 @@ 
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be seen into two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+	unsigned long arg1;
+	unsigned long arg2;
+	unsigned long arg3;
+	unsigned long arg4;
+	unsigned long arg5;
+	unsigned long arg6;
+	unsigned long ret;
+};
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...);
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..ad4987ac68bb 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@ 
  */
 
 #include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>
 
 /*
  * DECLARE_EVENT_CLASS can be used to add a generic function
@@ -634,6 +635,15 @@  ftrace_raw_event_##call(void *__data, proto)				\
 	if (ftrace_trigger_soft_disabled(ftrace_file))			\
 		return;							\
 									\
+	if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&	\
+	    unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+		struct bpf_context __ctx;				\
+									\
+		populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0);	\
+		trace_filter_call_bpf(ftrace_file->filter, &__ctx);	\
+		return;							\
+	}								\
+									\
 	__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
 									\
 	entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file,	\
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c18ac0c1e3e5..76d7196e518a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -396,6 +396,7 @@  enum bpf_prog_attributes {
 enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
 	BPF_PROG_TYPE_SOCKET_FILTER,
+	BPF_PROG_TYPE_TRACING_FILTER,
 };
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -406,6 +407,14 @@  enum bpf_func_id {
 	BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
 	BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value) */
 	BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
+	BPF_FUNC_fetch_ptr,       /* void *bpf_fetch_ptr(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u64,       /* u64 bpf_fetch_u64(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u32,       /* u32 bpf_fetch_u32(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u16,       /* u16 bpf_fetch_u16(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u8,        /* u8 bpf_fetch_u8(void *unsafe_ptr) */
+	BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
+	BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
+	BPF_FUNC_printk,          /* int bpf_printk(const char *fmt, int fmt_size, ...) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c899dd..c816cd779697 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -75,6 +75,7 @@  config FTRACE_NMI_ENTER
 
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
+	depends on NET
 	bool
 
 config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 67d6369ddf83..fe897168a19e 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@  obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM_RUNTIME),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..eac13a14dd26
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,241 @@ 
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...)
+{
+	va_list args;
+
+	va_start(args, ctx);
+
+	ctx->arg1 = va_arg(args, unsigned long);
+	ctx->arg2 = va_arg(args, unsigned long);
+	ctx->arg3 = va_arg(args, unsigned long);
+	ctx->arg4 = va_arg(args, unsigned long);
+	ctx->arg5 = va_arg(args, unsigned long);
+	ctx->arg6 = va_arg(args, unsigned long);
+
+	va_end(args);
+}
+EXPORT_SYMBOL_GPL(populate_bpf_context);
+
+static u64 bpf_fetch_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) r1;
+	void *ptr = NULL;
+
+	probe_kernel_read(&ptr, unsafe_ptr, sizeof(ptr));
+	return (u64) (unsigned long) ptr;
+}
+
+#define FETCH(SIZE) \
+static u64 bpf_fetch_##SIZE(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)	\
+{									\
+	void *unsafe_ptr = (void *) r1;					\
+	SIZE val = 0;							\
+									\
+	probe_kernel_read(&val, unsafe_ptr, sizeof(val));		\
+	return (u64) (SIZE) val;					\
+}
+FETCH(u64)
+FETCH(u32)
+FETCH(u16)
+FETCH(u8)
+#undef FETCH
+
+static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) r1;
+	void *safe_ptr = (void *) r2;
+	u32 size = (u32) r3;
+	char buf[64];
+	int err;
+
+	if (size < 64) {
+		err = probe_kernel_read(buf, unsafe_ptr, size);
+		if (err)
+			return err;
+		return memcmp(buf, safe_ptr, size);
+	}
+	return -1;
+}
+
+static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	trace_dump_stack(0);
+	return 0;
+}
+
+/* limited printk()
+ * only %d %u %x conversion specifiers allowed
+ */
+static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+	char *fmt = (char *) r1;
+	int fmt_cnt = 0;
+	int i;
+
+	/* bpf_check() guarantees that fmt points to bpf program stack and
+	 * fmt_size bytes of it were initialized by bpf program
+	 */
+	if (fmt[fmt_size - 1] != 0)
+		return -EINVAL;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++)
+		if (fmt[i] == '%') {
+			if (i + 1 >= fmt_size)
+				return -EINVAL;
+			if (fmt[i + 1] != 'd' && fmt[i + 1] != 'u' &&
+			    fmt[i + 1] != 'x')
+				return -EINVAL;
+			fmt_cnt++;
+		}
+
+	if (fmt_cnt > 3)
+		return -EINVAL;
+
+	return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
+			      (u32) r3, (u32) r4, (u32) r5);
+}
+
+static struct bpf_func_proto tracing_filter_funcs[] = {
+#define FETCH(SIZE)				\
+	[BPF_FUNC_fetch_##SIZE] = {		\
+		.func = bpf_fetch_##SIZE,	\
+		.gpl_only = false,		\
+		.ret_type = RET_INTEGER,	\
+	},
+	FETCH(ptr)
+	FETCH(u64)
+	FETCH(u32)
+	FETCH(u16)
+	FETCH(u8)
+#undef FETCH
+	[BPF_FUNC_memcmp] = {
+		.func = bpf_memcmp,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_ANYTHING,
+		.arg2_type = ARG_PTR_TO_STACK,
+		.arg3_type = ARG_CONST_STACK_SIZE,
+	},
+	[BPF_FUNC_dump_stack] = {
+		.func = bpf_dump_stack,
+		.gpl_only = false,
+		.ret_type = RET_VOID,
+	},
+	[BPF_FUNC_printk] = {
+		.func = bpf_printk,
+		.gpl_only = true,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_PTR_TO_STACK,
+		.arg2_type = ARG_CONST_STACK_SIZE,
+	},
+	[BPF_FUNC_map_lookup_elem] = {
+		.func = bpf_map_lookup_elem,
+		.gpl_only = false,
+		.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+		.arg1_type = ARG_CONST_MAP_PTR,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+	[BPF_FUNC_map_update_elem] = {
+		.func = bpf_map_update_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_PTR,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+		.arg3_type = ARG_PTR_TO_MAP_VALUE,
+	},
+	[BPF_FUNC_map_delete_elem] = {
+		.func = bpf_map_delete_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_PTR,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+};
+
+static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
+{
+	if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
+		return NULL;
+	return &tracing_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+	int size;
+	enum bpf_access_type type;
+} tracing_filter_ctx_access[] = {
+	[offsetof(struct bpf_context, arg1)] = {
+		FIELD_SIZEOF(struct bpf_context, arg1),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg2)] = {
+		FIELD_SIZEOF(struct bpf_context, arg2),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg3)] = {
+		FIELD_SIZEOF(struct bpf_context, arg3),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg4)] = {
+		FIELD_SIZEOF(struct bpf_context, arg4),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg5)] = {
+		FIELD_SIZEOF(struct bpf_context, arg5),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg6)] = {
+		FIELD_SIZEOF(struct bpf_context, arg6),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, ret)] = {
+		FIELD_SIZEOF(struct bpf_context, ret),
+		BPF_READ
+	},
+};
+
+static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	const struct bpf_context_access *access;
+
+	if (off < 0 || off >= ARRAY_SIZE(tracing_filter_ctx_access))
+		return false;
+
+	access = &tracing_filter_ctx_access[off];
+	if (access->size == size && (access->type & type))
+		return true;
+
+	return false;
+}
+
+static struct bpf_verifier_ops tracing_filter_ops = {
+	.get_func_proto = tracing_filter_func_proto,
+	.is_valid_access = tracing_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &tracing_filter_ops,
+	.type = BPF_PROG_TYPE_TRACING_FILTER,
+};
+
+static int __init register_tracing_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_tracing_filter_ops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 385391fb1d3b..f0b7caa71b9d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -986,12 +986,15 @@  struct ftrace_event_field {
 	int			is_signed;
 };
 
+struct bpf_prog;
+
 struct event_filter {
 	int			n_preds;	/* Number assigned */
 	int			a_preds;	/* allocated */
 	struct filter_pred	*preds;
 	struct filter_pred	*root;
 	char			*filter_string;
+	struct bpf_prog		*prog;
 };
 
 struct event_subsystem {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index ef06ce7e9cf8..d79f0ee98881 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1051,6 +1051,26 @@  event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
 	return r;
 }
 
+static int event_filter_release(struct inode *inode, struct file *filp)
+{
+	struct ftrace_event_file *file;
+	char buf[2] = "0";
+
+	mutex_lock(&event_mutex);
+	file = event_file_data(filp);
+	if (file) {
+		if (file->event_call->flags & TRACE_EVENT_FL_BPF) {
+			/* auto-disable the filter */
+			ftrace_event_enable_disable(file, 0);
+
+			/* if BPF filter was used, clear it on fd close */
+			apply_event_filter(file, buf);
+		}
+	}
+	mutex_unlock(&event_mutex);
+	return 0;
+}
+
 static ssize_t
 event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 		   loff_t *ppos)
@@ -1074,10 +1094,23 @@  event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 
 	mutex_lock(&event_mutex);
 	file = event_file_data(filp);
-	if (file)
+	if (file) {
 		err = apply_event_filter(file, buf);
+		if (!err && file->event_call->flags & TRACE_EVENT_FL_BPF)
+			/* once filter is applied, auto-enable it */
+			ftrace_event_enable_disable(file, 1);
+	}
+
 	mutex_unlock(&event_mutex);
 
+	if (file && file->event_call->flags & TRACE_EVENT_FL_BPF) {
+		/*
+		 * allocate per-cpu printk buffers, since eBPF program
+		 * might be calling bpf_trace_printk
+		 */
+		trace_printk_init_buffers();
+	}
+
 	free_page((unsigned long) buf);
 	if (err < 0)
 		return err;
@@ -1328,6 +1361,7 @@  static const struct file_operations ftrace_event_filter_fops = {
 	.open = tracing_open_generic,
 	.read = event_filter_read,
 	.write = event_filter_write,
+	.release = event_filter_release,
 	.llseek = default_llseek,
 };
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 7a8c1528e141..401fca436054 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,9 @@ 
 #include <linux/mutex.h>
 #include <linux/perf_event.h>
 #include <linux/slab.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <linux/filter.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -535,6 +538,16 @@  static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
 	return WALK_PRED_DEFAULT;
 }
 
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx)
+{
+	BUG_ON(!filter || !filter->prog);
+
+	rcu_read_lock();
+	BPF_PROG_RUN(filter->prog, (void *) ctx);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
+
 /* return 1 if event matches, 0 otherwise (discard) */
 int filter_match_preds(struct event_filter *filter, void *rec)
 {
@@ -789,6 +802,8 @@  static void __free_filter(struct event_filter *filter)
 	if (!filter)
 		return;
 
+	if (filter->prog)
+		bpf_prog_put(filter->prog);
 	__free_preds(filter);
 	kfree(filter->filter_string);
 	kfree(filter);
@@ -1857,6 +1872,48 @@  static int create_filter_start(char *filter_str, bool set_str,
 	return err;
 }
 
+static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+{
+	struct event_filter *filter;
+	struct bpf_prog *prog;
+	long ufd;
+	int err = 0;
+
+	*filterp = NULL;
+
+	filter = __alloc_filter();
+	if (!filter)
+		return -ENOMEM;
+
+	err = replace_filter_string(filter, filter_str);
+	if (err)
+		goto free_filter;
+
+	err = kstrtol(filter_str + 4, 0, &ufd);
+	if (err)
+		goto free_filter;
+
+	err = -ESRCH;
+	prog = bpf_prog_get(ufd);
+	if (!prog)
+		goto free_filter;
+
+	filter->prog = prog;
+
+	err = -EINVAL;
+	if (prog->info->prog_type != BPF_PROG_TYPE_TRACING_FILTER)
+		/* prog_id is valid, but it's not a tracing filter program */
+		goto free_filter;
+
+	*filterp = filter;
+
+	return 0;
+
+free_filter:
+	__free_filter(filter);
+	return err;
+}
+
 static void create_filter_finish(struct filter_parse_state *ps)
 {
 	if (ps) {
@@ -1966,7 +2023,20 @@  int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 		return 0;
 	}
 
-	err = create_filter(call, filter_string, true, &filter);
+	/*
+	 * 'bpf_123' string is a request to attach eBPF program with id == 123
+	 * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
+	 */
+	if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
+	    filter_string[4] != 0) {
+		err = create_filter_bpf(filter_string, &filter);
+		if (!err)
+			call->flags |= TRACE_EVENT_FL_BPF;
+	} else {
+		err = create_filter(call, filter_string, true, &filter);
+		if (!err)
+			call->flags &= ~TRACE_EVENT_FL_BPF;
+	}
 
 	/*
 	 * Always swap the call filter with the new filter
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 759d5e004517..a8d61a685480 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -7,6 +7,7 @@ 
 #include <linux/ftrace.h>
 #include <linux/perf_event.h>
 #include <asm/syscall.h>
+#include <trace/bpf_trace.h>
 
 #include "trace_output.h"
 #include "trace.h"
@@ -328,6 +329,14 @@  static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	if (!sys_data)
 		return;
 
+	if (ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF) {
+		struct bpf_context __ctx;
+		syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+				      &__ctx.arg1);
+		trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+		return;
+	}
+
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
 	local_save_flags(irq_flags);
@@ -375,6 +384,15 @@  static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
+	if (ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF) {
+		struct bpf_context __ctx;
+		syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+				      &__ctx.arg1);
+		__ctx.ret = syscall_get_return_value(current, regs);
+		trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+		return;
+	}
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();