diff mbox series

[v6,bpf-next,08/11] bpf: introduce BPF_RAW_TRACEPOINT

Message ID 20180327024706.2064725-9-ast@fb.com
State Changes Requested, archived
Delegated to: BPF Maintainers
Headers show
Series bpf, tracing: introduce bpf raw tracepoints | expand

Commit Message

Alexei Starovoitov March 27, 2018, 2:47 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

From bpf program point of view the access to the arguments look like:
struct bpf_raw_tracepoint_args {
       __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // program can read args[N] where N depends on tracepoint
  // and statically verified at program load+attach time
}

kprobe+bpf infrastructure allows programs access function arguments.
This feature allows programs access raw tracepoint arguments.

Similar to proposed 'dynamic ftrace events' there are no abi guarantees
to what the tracepoints arguments are and what their meaning is.
The program needs to type cast args properly and use bpf_probe_read()
helper to access struct fields when argument is a pointer.

For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:     mov    %ecx,%ecx
   0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since the probe funcs need to appear in kallsyms.
The alternative of having __bpf_trace_##call being global in kallsyms
could have been to keep them static and add another pointer to these
static functions to 'struct trace_event_class' and 'struct trace_event_call',
but keeping them global simplifies implementation and keeps it indepedent
from the tracing side.

Also such approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf are used at speeds of 1M+ events per second
this is very valuable optimization.

Since ftrace and perf side are not involved the new
BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.

The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint with prog attached
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.

On the kernel side for_each_kernel_tracepoint() is used
to find a tracepoint with "xdp_exception" name
(that would be __tracepoint_xdp_exception record)

Then kallsyms_lookup_name() is used to find the addr
of __bpf_trace_xdp_exception() probe function.

And finally tracepoint_probe_register() is used to connect probe
with tracepoint.

Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
Each with its own bpf program. The kernel will execute
all tracepoint probes and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_types.h    |   1 +
 include/linux/trace_events.h |  37 +++++++++
 include/trace/bpf_probe.h    |  87 ++++++++++++++++++++
 include/trace/define_trace.h |   1 +
 include/uapi/linux/bpf.h     |  11 +++
 kernel/bpf/syscall.c         |  78 ++++++++++++++++++
 kernel/trace/bpf_trace.c     | 188 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 403 insertions(+)
 create mode 100644 include/trace/bpf_probe.h

Comments

Steven Rostedt March 27, 2018, 5:02 p.m. UTC | #1
On Mon, 26 Mar 2018 19:47:03 -0700
Alexei Starovoitov <ast@fb.com> wrote:


> Ctrl-C of tracing daemon or cmdline tool that uses this feature
> will automatically detach bpf program, unload it and
> unregister tracepoint probe.
> 
> On the kernel side for_each_kernel_tracepoint() is used

You need to update the change log to state
kernel_tracepoint_find_by_name().

But looking at the code, I really think you should do it properly and
not rely on a hack that finds your function via kallsyms lookup and
then executing the address it returns.

> to find a tracepoint with "xdp_exception" name
> (that would be __tracepoint_xdp_exception record)
> 
> Then kallsyms_lookup_name() is used to find the addr
> of __bpf_trace_xdp_exception() probe function.
> 
> And finally tracepoint_probe_register() is used to connect probe
> with tracepoint.
> 
> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
> tracepoint mechanisms. perf_event_open() can be used in parallel
> on the same tracepoint.
> Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
> Each with its own bpf program. The kernel will execute
> all tracepoint probes and all attached bpf programs.
> 
> In the future bpf_raw_tracepoints can be extended with
> query/introspection logic.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf_types.h    |   1 +
>  include/linux/trace_events.h |  37 +++++++++
>  include/trace/bpf_probe.h    |  87 ++++++++++++++++++++
>  include/trace/define_trace.h |   1 +
>  include/uapi/linux/bpf.h     |  11 +++
>  kernel/bpf/syscall.c         |  78 ++++++++++++++++++
>  kernel/trace/bpf_trace.c     | 188 +++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 403 insertions(+)
>  create mode 100644 include/trace/bpf_probe.h
> 
>


> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -468,6 +468,8 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
>  int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
>  void perf_event_detach_bpf_prog(struct perf_event *event);
>  int perf_event_query_prog_array(struct perf_event *event, void __user *info);
> +int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
> +int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
>  #else
>  static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
> @@ -487,6 +489,14 @@ perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  {
>  	return -EOPNOTSUPP;
>  }
> +static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif
>  
>  enum {
> @@ -546,6 +556,33 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4);
> +void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5);
> +void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6);
> +void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
> +void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8);
> +void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8, u64 arg9);
> +void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10);
> +void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11);
> +void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
>  void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
>  			       struct trace_event_call *call, u64 count,
>  			       struct pt_regs *regs, struct hlist_head *head,
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> new file mode 100644
> index 000000000000..d2cc0663e618
> --- /dev/null
> +++ b/include/trace/bpf_probe.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#undef TRACE_SYSTEM_VAR
> +
> +#ifdef CONFIG_BPF_EVENTS
> +
> +#undef __entry
> +#define __entry entry
> +
> +#undef __get_dynamic_array
> +#define __get_dynamic_array(field)	\
> +		((void *)__entry + (__entry->__data_loc_##field & 0xffff))
> +
> +#undef __get_dynamic_array_len
> +#define __get_dynamic_array_len(field)	\
> +		((__entry->__data_loc_##field >> 16) & 0xffff)
> +
> +#undef __get_str
> +#define __get_str(field) ((char *)__get_dynamic_array(field))
> +
> +#undef __get_bitmask
> +#define __get_bitmask(field) (char *)__get_dynamic_array(field)
> +
> +#undef __perf_count
> +#define __perf_count(c)	(c)
> +
> +#undef __perf_task
> +#define __perf_task(t)	(t)
> +
> +/* cast any integer, pointer, or small struct to u64 */
> +#define UINTTYPE(size) \
> +	__typeof__(__builtin_choose_expr(size == 1,  (u8)1, \
> +		   __builtin_choose_expr(size == 2, (u16)2, \
> +		   __builtin_choose_expr(size == 4, (u32)3, \
> +		   __builtin_choose_expr(size == 8, (u64)4, \
> +					 (void)5)))))
> +#define __CAST_TO_U64(x) ({ \
> +	typeof(x) __src = (x); \
> +	UINTTYPE(sizeof(x)) __dst; \
> +	memcpy(&__dst, &__src, sizeof(__dst)); \
> +	(u64)__dst; })
> +
> +#define __CAST1(a,...) __CAST_TO_U64(a)
> +#define __CAST2(a,...) __CAST_TO_U64(a), __CAST1(__VA_ARGS__)
> +#define __CAST3(a,...) __CAST_TO_U64(a), __CAST2(__VA_ARGS__)
> +#define __CAST4(a,...) __CAST_TO_U64(a), __CAST3(__VA_ARGS__)
> +#define __CAST5(a,...) __CAST_TO_U64(a), __CAST4(__VA_ARGS__)
> +#define __CAST6(a,...) __CAST_TO_U64(a), __CAST5(__VA_ARGS__)
> +#define __CAST7(a,...) __CAST_TO_U64(a), __CAST6(__VA_ARGS__)
> +#define __CAST8(a,...) __CAST_TO_U64(a), __CAST7(__VA_ARGS__)
> +#define __CAST9(a,...) __CAST_TO_U64(a), __CAST8(__VA_ARGS__)
> +#define __CAST10(a,...) __CAST_TO_U64(a), __CAST9(__VA_ARGS__)
> +#define __CAST11(a,...) __CAST_TO_U64(a), __CAST10(__VA_ARGS__)
> +#define __CAST12(a,...) __CAST_TO_U64(a), __CAST11(__VA_ARGS__)
> +/* tracepoints with more than 12 arguments will hit build error */
> +#define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> +
> +#undef DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> +/* no 'static' here. The bpf probe functions are global */		\
> +notrace void								\

I'm curious to why you have notrace here? Since it is separate from
perf and ftrace, for debugging purposes, it could be useful to allow
function tracing to this function.

> +__bpf_trace_##call(void *__data, proto)					\
> +{									\
> +	struct bpf_prog *prog = __data;					\
> +	\
> +	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));	\
> +}
> +
> +/*
> + * This part is compiled out, it is only here as a build time check
> + * to make sure that if the tracepoint handling changes, the
> + * bpf probe will fail to compile unless it too is updated.
> + */
> +#undef DEFINE_EVENT
> +#define DEFINE_EVENT(template, call, proto, args)			\
> +static inline void bpf_test_probe_##call(void)				\
> +{									\
> +	check_trace_callback_type_##call(__bpf_trace_##template);	\
> +}
> +
> +
> +#undef DEFINE_EVENT_PRINT
> +#define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
> +	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
> +
> +#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> +#endif /* CONFIG_BPF_EVENTS */
> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> index 96b22ace9ae7..5f8216bc261f 100644
> --- a/include/trace/define_trace.h
> +++ b/include/trace/define_trace.h
> @@ -95,6 +95,7 @@
>  #ifdef TRACEPOINTS_ENABLED
>  #include <trace/trace_events.h>
>  #include <trace/perf.h>
> +#include <trace/bpf_probe.h>
>  #endif
>  
>  #undef TRACE_EVENT
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 18b7c510c511..1878201c2d77 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,7 @@ enum bpf_cmd {
>  	BPF_MAP_GET_FD_BY_ID,
>  	BPF_OBJ_GET_INFO_BY_FD,
>  	BPF_PROG_QUERY,
> +	BPF_RAW_TRACEPOINT_OPEN,
>  };
>  
>  enum bpf_map_type {
> @@ -134,6 +135,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_SK_SKB,
>  	BPF_PROG_TYPE_CGROUP_DEVICE,
>  	BPF_PROG_TYPE_SK_MSG,
> +	BPF_PROG_TYPE_RAW_TRACEPOINT,
>  };
>  
>  enum bpf_attach_type {
> @@ -344,6 +346,11 @@ union bpf_attr {
>  		__aligned_u64	prog_ids;
>  		__u32		prog_cnt;
>  	} query;
> +
> +	struct {
> +		__u64 name;
> +		__u32 prog_fd;
> +	} raw_tracepoint;
>  } __attribute__((aligned(8)));
>  
>  /* BPF helper function descriptions:
> @@ -1152,4 +1159,8 @@ struct bpf_cgroup_dev_ctx {
>  	__u32 minor;
>  };
>  
> +struct bpf_raw_tracepoint_args {
> +	__u64 args[0];
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3aeb4ea2a93a..7486b450672e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1311,6 +1311,81 @@ static int bpf_obj_get(const union bpf_attr *attr)
>  				attr->file_flags);
>  }
>  
> +struct bpf_raw_tracepoint {
> +	struct tracepoint *tp;
> +	struct bpf_prog *prog;
> +};
> +
> +static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
> +{
> +	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
> +
> +	if (raw_tp->prog) {
> +		bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
> +		bpf_prog_put(raw_tp->prog);
> +	}
> +	kfree(raw_tp);
> +	return 0;
> +}
> +
> +static const struct file_operations bpf_raw_tp_fops = {
> +	.release	= bpf_raw_tracepoint_release,
> +	.read		= bpf_dummy_read,
> +	.write		= bpf_dummy_write,
> +};
> +
> +#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
> +
> +static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
> +{
> +	struct bpf_raw_tracepoint *raw_tp;
> +	struct tracepoint *tp;
> +	struct bpf_prog *prog;
> +	char tp_name[128];
> +	int tp_fd, err;
> +
> +	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
> +			      sizeof(tp_name) - 1) < 0)
> +		return -EFAULT;
> +	tp_name[sizeof(tp_name) - 1] = 0;
> +
> +	tp = kernel_tracepoint_find_by_name(tp_name);
> +	if (!tp)
> +		return -ENOENT;
> +
> +	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);

Please use kzalloc(), instead of open coding the "__GPF_ZERO"

> +	if (!raw_tp)
> +		return -ENOMEM;
> +	raw_tp->tp = tp;
> +
> +	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
> +				 BPF_PROG_TYPE_RAW_TRACEPOINT);
> +	if (IS_ERR(prog)) {
> +		err = PTR_ERR(prog);
> +		goto out_free_tp;
> +	}
> +
> +	err = bpf_probe_register(raw_tp->tp, prog);
> +	if (err)
> +		goto out_put_prog;
> +
> +	raw_tp->prog = prog;
> +	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
> +				 O_CLOEXEC);
> +	if (tp_fd < 0) {
> +		bpf_probe_unregister(raw_tp->tp, prog);
> +		err = tp_fd;
> +		goto out_put_prog;
> +	}
> +	return tp_fd;
> +
> +out_put_prog:
> +	bpf_prog_put(prog);
> +out_free_tp:
> +	kfree(raw_tp);
> +	return err;
> +}
> +
>  #ifdef CONFIG_CGROUP_BPF
>  
>  #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
> @@ -1921,6 +1996,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>  	case BPF_OBJ_GET_INFO_BY_FD:
>  		err = bpf_obj_get_info_by_fd(&attr, uattr);
>  		break;
> +	case BPF_RAW_TRACEPOINT_OPEN:
> +		err = bpf_raw_tracepoint_open(&attr);
> +		break;
>  	default:
>  		err = -EINVAL;
>  		break;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index c634e093951f..00e86aa11360 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -723,6 +723,86 @@ const struct bpf_verifier_ops tracepoint_verifier_ops = {
>  const struct bpf_prog_ops tracepoint_prog_ops = {
>  };
>  
> +/*
> + * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
> + * to avoid potential recursive reuse issue when/if tracepoints are added
> + * inside bpf_*_event_output and/or bpf_get_stack_id
> + */
> +static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
> +BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
> +	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
> +{
> +	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
> +
> +	perf_fetch_caller_regs(regs);
> +	return ____bpf_perf_event_output(regs, map, flags, data, size);
> +}
> +
> +static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = {
> +	.func		= bpf_perf_event_output_raw_tp,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_CONST_MAP_PTR,
> +	.arg3_type	= ARG_ANYTHING,
> +	.arg4_type	= ARG_PTR_TO_MEM,
> +	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
> +};
> +
> +BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
> +	   struct bpf_map *, map, u64, flags)
> +{
> +	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
> +
> +	perf_fetch_caller_regs(regs);
> +	/* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
> +	return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
> +			       flags, 0, 0);
> +}
> +
> +static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
> +	.func		= bpf_get_stackid_raw_tp,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_CONST_MAP_PTR,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto *raw_tp_prog_func_proto(enum bpf_func_id func_id)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_perf_event_output:
> +		return &bpf_perf_event_output_proto_raw_tp;
> +	case BPF_FUNC_get_stackid:
> +		return &bpf_get_stackid_proto_raw_tp;
> +	default:
> +		return tracing_func_proto(func_id);
> +	}
> +}
> +
> +static bool raw_tp_prog_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					struct bpf_insn_access_aux *info)
> +{
> +	/* largest tracepoint in the kernel has 12 args */
> +	if (off < 0 || off >= sizeof(__u64) * 12)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	return true;
> +}
> +
> +const struct bpf_verifier_ops raw_tracepoint_verifier_ops = {
> +	.get_func_proto  = raw_tp_prog_func_proto,
> +	.is_valid_access = raw_tp_prog_is_valid_access,
> +};
> +
> +const struct bpf_prog_ops raw_tracepoint_prog_ops = {
> +};
> +
>  static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
>  				    struct bpf_insn_access_aux *info)
>  {
> @@ -896,3 +976,111 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  
>  	return ret;
>  }
> +
> +static __always_inline
> +void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> +{
> +	rcu_read_lock();
> +	preempt_disable();
> +	(void) BPF_PROG_RUN(prog, args);
> +	preempt_enable();
> +	rcu_read_unlock();
> +}
> +

Could you add some comments here to explain what the below is doing.

> +#define UNPACK(...)			__VA_ARGS__
> +#define REPEAT_1(FN, DL, X, ...)	FN(X)
> +#define REPEAT_2(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__)
> +#define REPEAT_3(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__)
> +#define REPEAT_4(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__)
> +#define REPEAT_5(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__)
> +#define REPEAT_6(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__)
> +#define REPEAT_7(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__)
> +#define REPEAT_8(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__)
> +#define REPEAT_9(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__)
> +#define REPEAT_10(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_9(FN, DL, __VA_ARGS__)
> +#define REPEAT_11(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_10(FN, DL, __VA_ARGS__)
> +#define REPEAT_12(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_11(FN, DL, __VA_ARGS__)
> +#define REPEAT(X, FN, DL, ...)		REPEAT_##X(FN, DL, __VA_ARGS__)
> +
> +#define SARG(X)		u64 arg##X
> +#define COPY(X)		args[X] = arg##X
> +
> +#define __DL_COM	(,)
> +#define __DL_SEM	(;)
> +
> +#define __SEQ_0_11	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
> +
> +#define BPF_TRACE_DEFN_x(x)						\
> +	void bpf_trace_run##x(struct bpf_prog *prog,			\
> +			      REPEAT(x, SARG, __DL_COM, __SEQ_0_11))	\
> +	{								\
> +		u64 args[x];						\
> +		REPEAT(x, COPY, __DL_SEM, __SEQ_0_11);			\
> +		__bpf_trace_run(prog, args);				\
> +	}								\
> +	EXPORT_SYMBOL_GPL(bpf_trace_run##x)
> +BPF_TRACE_DEFN_x(1);
> +BPF_TRACE_DEFN_x(2);
> +BPF_TRACE_DEFN_x(3);
> +BPF_TRACE_DEFN_x(4);
> +BPF_TRACE_DEFN_x(5);
> +BPF_TRACE_DEFN_x(6);
> +BPF_TRACE_DEFN_x(7);
> +BPF_TRACE_DEFN_x(8);
> +BPF_TRACE_DEFN_x(9);
> +BPF_TRACE_DEFN_x(10);
> +BPF_TRACE_DEFN_x(11);
> +BPF_TRACE_DEFN_x(12);
> +
> +static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
> +{
> +	unsigned long addr;
> +	char buf[128];
> +
> +	/*
> +	 * check that program doesn't access arguments beyond what's
> +	 * available in this tracepoint
> +	 */
> +	if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
> +		return -EINVAL;
> +
> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> +	addr = kallsyms_lookup_name(buf);
> +	if (!addr)
> +		return -ENOENT;
> +
> +	return tracepoint_probe_register(tp, (void *)addr, prog);

You are putting in a hell of a lot of trust with kallsyms returning
properly. I can see this being very fragile. This is calling a function
based on the result of kallsyms. I'm sure the security folks would love
this.

There's a few things to make this a bit more robust. One is to add a
table that points to all __bpf_trace_* functions, and verify that the
result from kallsyms is in that table.

Honestly, I think this is too much of a short cut and a hack. I know
you want to keep it "simple" and save space, but you really should do
it the same way ftrace and perf do it. That is, create a section and
have all tracepoints create a structure that holds a pointer to the
tracepoint and to the bpf probe function. Then you don't even need the
kernel_tracepoint_find_by_name(), you just iterate over your table and
you get the tracepoint and the bpf function associated to it.

Relying on kallsyms to return an address to execute is just way too
extreme and fragile for my liking.

-- Steve



> +}
> +
> +int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
> +{
> +	int err;
> +
> +	mutex_lock(&bpf_event_mutex);
> +	err = __bpf_probe_register(tp, prog);
> +	mutex_unlock(&bpf_event_mutex);
> +	return err;
> +}
> +
> +static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
> +{
> +	unsigned long addr;
> +	char buf[128];
> +
> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> +	addr = kallsyms_lookup_name(buf);
> +	if (!addr)
> +		return -ENOENT;
> +
> +	return tracepoint_probe_unregister(tp, (void *)addr, prog);
> +}
> +
> +int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
> +{
> +	int err;
> +
> +	mutex_lock(&bpf_event_mutex);
> +	err = __bpf_probe_unregister(tp, prog);
> +	mutex_unlock(&bpf_event_mutex);
> +	return err;
> +}
Steven Rostedt March 27, 2018, 5:11 p.m. UTC | #2
On Tue, 27 Mar 2018 13:02:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Honestly, I think this is too much of a short cut and a hack. I know
> you want to keep it "simple" and save space, but you really should do
> it the same way ftrace and perf do it. That is, create a section and
> have all tracepoints create a structure that holds a pointer to the
> tracepoint and to the bpf probe function. Then you don't even need the
> kernel_tracepoint_find_by_name(), you just iterate over your table and
> you get the tracepoint and the bpf function associated to it.

Also, if you do it the perf/ftrace way, you get support for module
tracepoints pretty much for free. Which would include tracepoints in
networking code that is loaded by a module.

-- Steve
Alexei Starovoitov March 27, 2018, 6:45 p.m. UTC | #3
On 3/27/18 10:02 AM, Steven Rostedt wrote:
> On Mon, 26 Mar 2018 19:47:03 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>
>> Ctrl-C of tracing daemon or cmdline tool that uses this feature
>> will automatically detach bpf program, unload it and
>> unregister tracepoint probe.
>>
>> On the kernel side for_each_kernel_tracepoint() is used
>
> You need to update the change log to state
> kernel_tracepoint_find_by_name().

ahh. right. will do.

>> +#undef DECLARE_EVENT_CLASS
>> +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>> +/* no 'static' here. The bpf probe functions are global */		\
>> +notrace void								\
>
> I'm curious to why you have notrace here? Since it is separate from
> perf and ftrace, for debugging purposes, it could be useful to allow
> function tracing to this function.

To avoid unnecessary overhead. And I don't think it's useful to trace 
them. They're tiny jump functions of one or two instructions.
Really no point wasting mentry on them.


>> +static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
>> +{
>> +	struct bpf_raw_tracepoint *raw_tp;
>> +	struct tracepoint *tp;
>> +	struct bpf_prog *prog;
>> +	char tp_name[128];
>> +	int tp_fd, err;
>> +
>> +	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
>> +			      sizeof(tp_name) - 1) < 0)
>> +		return -EFAULT;
>> +	tp_name[sizeof(tp_name) - 1] = 0;
>> +
>> +	tp = kernel_tracepoint_find_by_name(tp_name);
>> +	if (!tp)
>> +		return -ENOENT;
>> +
>> +	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
>
> Please use kzalloc(), instead of open coding the "__GPF_ZERO"

right. will do

>
> Could you add some comments here to explain what the below is doing.

To write a proper comment I need to understand it and I don't.
That's the reason why I didn't put in in common header,
because it would require proper comment on what it is and
how one can use it.
I'm expecting Daniel to follow up on this.

>> +#define UNPACK(...)			__VA_ARGS__
>> +#define REPEAT_1(FN, DL, X, ...)	FN(X)
>> +#define REPEAT_2(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__)
>> +#define REPEAT_3(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__)
>> +#define REPEAT_4(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__)
>> +#define REPEAT_5(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__)
>> +#define REPEAT_6(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__)
>> +#define REPEAT_7(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__)
>> +#define REPEAT_8(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__)
>> +#define REPEAT_9(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__)
>> +#define REPEAT_10(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_9(FN, DL, __VA_ARGS__)
>> +#define REPEAT_11(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_10(FN, DL, __VA_ARGS__)
>> +#define REPEAT_12(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_11(FN, DL, __VA_ARGS__)
>> +#define REPEAT(X, FN, DL, ...)		REPEAT_##X(FN, DL, __VA_ARGS__)
>> +

>> +
>> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
>> +	addr = kallsyms_lookup_name(buf);
>> +	if (!addr)
>> +		return -ENOENT;
>> +
>> +	return tracepoint_probe_register(tp, (void *)addr, prog);
>
> You are putting in a hell of a lot of trust with kallsyms returning
> properly. I can see this being very fragile. This is calling a function
> based on the result of kallsyms. I'm sure the security folks would love
> this.
>
> There's a few things to make this a bit more robust. One is to add a
> table that points to all __bpf_trace_* functions, and verify that the
> result from kallsyms is in that table.
>
> Honestly, I think this is too much of a short cut and a hack. I know
> you want to keep it "simple" and save space, but you really should do
> it the same way ftrace and perf do it. That is, create a section and
> have all tracepoints create a structure that holds a pointer to the
> tracepoint and to the bpf probe function. Then you don't even need the
> kernel_tracepoint_find_by_name(), you just iterate over your table and
> you get the tracepoint and the bpf function associated to it.
>
> Relying on kallsyms to return an address to execute is just way too
> extreme and fragile for my liking.

Wasting extra 8bytes * number_of_tracepoints just for lack of trust
in kallsyms doesn't sound like good trade off to me.
If kallsyms are inaccurate all sorts of things will break:
kprobes, livepatch, etc.
I'd rather suggest for ftrace to use kallsyms approach as well
and reduce memory footprint.
Steven Rostedt March 27, 2018, 6:58 p.m. UTC | #4
On Tue, 27 Mar 2018 13:11:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 27 Mar 2018 13:02:11 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Honestly, I think this is too much of a short cut and a hack. I know
> > you want to keep it "simple" and save space, but you really should do
> > it the same way ftrace and perf do it. That is, create a section and
> > have all tracepoints create a structure that holds a pointer to the
> > tracepoint and to the bpf probe function. Then you don't even need the
> > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > you get the tracepoint and the bpf function associated to it.  
> 
> Also, if you do it the perf/ftrace way, you get support for module
> tracepoints pretty much for free. Which would include tracepoints in
> networking code that is loaded by a module.

This doesn't include module code (but that wouldn't be too hard to set
up), but I compiled and booted this. I didn't test if it works (I
didn't have the way to test bpf here). But this patch applies on top of
this patch (patch 8). You can remove patch 7 and fold this into this
patch. And then you can also make the __bpf_trace_* function static.

This would be much more robust and less error prone.

-- Steve

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6fc..4fab7392e237 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -178,6 +178,15 @@
 #define TRACE_SYSCALLS()
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8);		\
+			 VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;	\
+			 KEEP(*(__bpf_raw_tp_map))			\
+			 VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
+#else
+#define BPF_RAW_TP()
+#endif
+
 #ifdef CONFIG_SERIAL_EARLYCON
 #define EARLYCON_TABLE() STRUCT_ALIGN();			\
 			 VMLINUX_SYMBOL(__earlycon_table) = .;	\
@@ -576,6 +585,7 @@
 	*(.init.rodata)							\
 	FTRACE_EVENTS()							\
 	TRACE_SYSCALLS()						\
+	BPF_RAW_TP()							\
 	KPROBE_BLACKLIST()						\
 	ERROR_INJECT_WHITELIST()					\
 	MEM_DISCARD(init.rodata)					\
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 399ebe6f90cf..fb4778c0a248 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -470,8 +470,9 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
@@ -491,14 +492,18 @@ perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *p)
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *p)
 {
 	return -EOPNOTSUPP;
 }
+static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+	return NULL;
+}
 #endif
 
 enum {
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 39a283c61c51..35db8dd48c4c 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -36,4 +36,9 @@ struct tracepoint {
 	u32 num_args;
 };
 
+struct bpf_raw_event_map {
+	struct tracepoint	*tp;
+	void			*bpf_func;
+};
+
 #endif
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f100c63ff19e..6037a2f0108a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1312,7 +1312,7 @@ static int bpf_obj_get(const union bpf_attr *attr)
 }
 
 struct bpf_raw_tracepoint {
-	struct tracepoint *tp;
+	struct bpf_raw_event_map *btp;
 	struct bpf_prog *prog;
 };
 
@@ -1321,7 +1321,7 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
 	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
 
 	if (raw_tp->prog) {
-		bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
+		bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
 		bpf_prog_put(raw_tp->prog);
 	}
 	kfree(raw_tp);
@@ -1339,7 +1339,7 @@ static const struct file_operations bpf_raw_tp_fops = {
 static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 {
 	struct bpf_raw_tracepoint *raw_tp;
-	struct tracepoint *tp;
+	struct bpf_raw_event_map *btp;
 	struct bpf_prog *prog;
 	char tp_name[128];
 	int tp_fd, err;
@@ -1349,14 +1349,14 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 		return -EFAULT;
 	tp_name[sizeof(tp_name) - 1] = 0;
 
-	tp = kernel_tracepoint_find_by_name(tp_name);
-	if (!tp)
+	btp = bpf_find_raw_tracepoint(tp_name);
+	if (!btp)
 		return -ENOENT;
 
 	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
 	if (!raw_tp)
 		return -ENOMEM;
-	raw_tp->tp = tp;
+	raw_tp->btp = btp;
 
 	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
 				 BPF_PROG_TYPE_RAW_TRACEPOINT);
@@ -1365,7 +1365,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 		goto out_free_tp;
 	}
 
-	err = bpf_probe_register(raw_tp->tp, prog);
+	err = bpf_probe_register(raw_tp->btp, prog);
 	if (err)
 		goto out_put_prog;
 
@@ -1373,7 +1373,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
 				 O_CLOEXEC);
 	if (tp_fd < 0) {
-		bpf_probe_unregister(raw_tp->tp, prog);
+		bpf_probe_unregister(raw_tp->btp, prog);
 		err = tp_fd;
 		goto out_put_prog;
 	}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index eb58ef156d36..e578b173fe1d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -965,6 +965,19 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 	return ret;
 }
 
+extern struct bpf_raw_event_map *__start__bpf_raw_tp[];
+extern struct bpf_raw_event_map *__stop__bpf_raw_tp[];
+
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+	struct bpf_raw_event_map* const *btp = __start__bpf_raw_tp;
+
+	for (; btp < __stop__bpf_raw_tp; btp++)
+		if (!strcmp((*btp)->tp->name, name))
+			return *btp;
+	return NULL;
+}
+
 static __always_inline
 void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
 {
@@ -1020,10 +1033,9 @@ BPF_TRACE_DEFN_x(10);
 BPF_TRACE_DEFN_x(11);
 BPF_TRACE_DEFN_x(12);
 
-static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
-	unsigned long addr;
-	char buf[128];
+	struct tracepoint *tp = btp->tp;
 
 	/*
 	 * check that program doesn't access arguments beyond what's
@@ -1032,43 +1044,25 @@ static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
 	if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
 		return -EINVAL;
 
-	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
-	addr = kallsyms_lookup_name(buf);
-	if (!addr)
-		return -ENOENT;
-
-	return tracepoint_probe_register(tp, (void *)addr, prog);
+	return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
 }
 
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
 	int err;
 
 	mutex_lock(&bpf_event_mutex);
-	err = __bpf_probe_register(tp, prog);
+	err = __bpf_probe_register(btp, prog);
 	mutex_unlock(&bpf_event_mutex);
 	return err;
 }
 
-static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
-{
-	unsigned long addr;
-	char buf[128];
-
-	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
-	addr = kallsyms_lookup_name(buf);
-	if (!addr)
-		return -ENOENT;
-
-	return tracepoint_probe_unregister(tp, (void *)addr, prog);
-}
-
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
 	int err;
 
 	mutex_lock(&bpf_event_mutex);
-	err = __bpf_probe_unregister(tp, prog);
+	err = tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
 	mutex_unlock(&bpf_event_mutex);
 	return err;
 }
Steven Rostedt March 27, 2018, 7 p.m. UTC | #5
On Tue, 27 Mar 2018 11:45:34 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> >> +
> >> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> >> +	addr = kallsyms_lookup_name(buf);
> >> +	if (!addr)
> >> +		return -ENOENT;
> >> +
> >> +	return tracepoint_probe_register(tp, (void *)addr, prog);  
> >
> > You are putting in a hell of a lot of trust with kallsyms returning
> > properly. I can see this being very fragile. This is calling a function
> > based on the result of kallsyms. I'm sure the security folks would love
> > this.
> >
> > There's a few things to make this a bit more robust. One is to add a
> > table that points to all __bpf_trace_* functions, and verify that the
> > result from kallsyms is in that table.
> >
> > Honestly, I think this is too much of a short cut and a hack. I know
> > you want to keep it "simple" and save space, but you really should do
> > it the same way ftrace and perf do it. That is, create a section and
> > have all tracepoints create a structure that holds a pointer to the
> > tracepoint and to the bpf probe function. Then you don't even need the
> > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > you get the tracepoint and the bpf function associated to it.
> >
> > Relying on kallsyms to return an address to execute is just way too
> > extreme and fragile for my liking.  
> 
> Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> in kallsyms doesn't sound like good trade off to me.
> If kallsyms are inaccurate all sorts of things will break:
> kprobes, livepatch, etc.
> I'd rather suggest for ftrace to use kallsyms approach as well
> and reduce memory footprint.

If Linus, Thomas, Peter, Ingo, and the security folks trust kallsyms to
return a valid function pointer from a name, then sure, we can try
going that way.

-- Steve
Steven Rostedt March 27, 2018, 7:07 p.m. UTC | #6
On Tue, 27 Mar 2018 15:00:41 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> >  Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> > in kallsyms doesn't sound like good trade off to me.
> > If kallsyms are inaccurate all sorts of things will break:
> > kprobes, livepatch, etc.

And if kallsyms breaks, these will break by failing to attach, or some
other benign error. Ftrace uses kallsyms to find functions too, but it
only enables functions based on the result, it doesn't use the result
for anything but to compare it to what it already knows.

This is a first that I know of to trust that kallsyms returns something
that you expect to execute with no other validation. It may be valid,
but it also makes me very nervous too. If others are fine with such an
approach, then OK, we can enter a new chapter of development where we
can use kallsyms to find the functions we want to call and use it.

-- Steve
Steven Rostedt March 27, 2018, 7:10 p.m. UTC | #7
[ Added Andrew Morton too ]

On Tue, 27 Mar 2018 15:00:41 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 27 Mar 2018 11:45:34 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
> 
> > >> +
> > >> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> > >> +	addr = kallsyms_lookup_name(buf);
> > >> +	if (!addr)
> > >> +		return -ENOENT;
> > >> +
> > >> +	return tracepoint_probe_register(tp, (void *)addr, prog);    
> > >
> > > You are putting in a hell of a lot of trust with kallsyms returning
> > > properly. I can see this being very fragile. This is calling a function
> > > based on the result of kallsyms. I'm sure the security folks would love
> > > this.
> > >
> > > There's a few things to make this a bit more robust. One is to add a
> > > table that points to all __bpf_trace_* functions, and verify that the
> > > result from kallsyms is in that table.
> > >
> > > Honestly, I think this is too much of a short cut and a hack. I know
> > > you want to keep it "simple" and save space, but you really should do
> > > it the same way ftrace and perf do it. That is, create a section and
> > > have all tracepoints create a structure that holds a pointer to the
> > > tracepoint and to the bpf probe function. Then you don't even need the
> > > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > > you get the tracepoint and the bpf function associated to it.
> > >
> > > Relying on kallsyms to return an address to execute is just way too
> > > extreme and fragile for my liking.    
> > 
> > Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> > in kallsyms doesn't sound like good trade off to me.
> > If kallsyms are inaccurate all sorts of things will break:
> > kprobes, livepatch, etc.
> > I'd rather suggest for ftrace to use kallsyms approach as well
> > and reduce memory footprint.  
> 
> If Linus, Thomas, Peter, Ingo, and the security folks trust kallsyms to
> return a valid function pointer from a name, then sure, we can try
> going that way.

I would like an ack from Linus and/or Andrew before we go further down
this road.

-- Steve
Mathieu Desnoyers March 27, 2018, 7:10 p.m. UTC | #8
----- On Mar 27, 2018, at 3:00 PM, rostedt rostedt@goodmis.org wrote:

> On Tue, 27 Mar 2018 11:45:34 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
> 
>> >> +
>> >> +	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
>> >> +	addr = kallsyms_lookup_name(buf);
>> >> +	if (!addr)
>> >> +		return -ENOENT;
>> >> +
>> >> +	return tracepoint_probe_register(tp, (void *)addr, prog);
>> >
>> > You are putting in a hell of a lot of trust with kallsyms returning
>> > properly. I can see this being very fragile. This is calling a function
>> > based on the result of kallsyms. I'm sure the security folks would love
>> > this.
>> >
>> > There's a few things to make this a bit more robust. One is to add a
>> > table that points to all __bpf_trace_* functions, and verify that the
>> > result from kallsyms is in that table.
>> >
>> > Honestly, I think this is too much of a short cut and a hack. I know
>> > you want to keep it "simple" and save space, but you really should do
>> > it the same way ftrace and perf do it. That is, create a section and
>> > have all tracepoints create a structure that holds a pointer to the
>> > tracepoint and to the bpf probe function. Then you don't even need the
>> > kernel_tracepoint_find_by_name(), you just iterate over your table and
>> > you get the tracepoint and the bpf function associated to it.
>> >
>> > Relying on kallsyms to return an address to execute is just way too
>> > extreme and fragile for my liking.
>> 
>> Wasting extra 8bytes * number_of_tracepoints just for lack of trust
>> in kallsyms doesn't sound like good trade off to me.
>> If kallsyms are inaccurate all sorts of things will break:
>> kprobes, livepatch, etc.
>> I'd rather suggest for ftrace to use kallsyms approach as well
>> and reduce memory footprint.
> 
> If Linus, Thomas, Peter, Ingo, and the security folks trust kallsyms to
> return a valid function pointer from a name, then sure, we can try
> going that way.

This will crash on ARM Thumb2 kernels. Also, how is this expected to
work on PowerPC ABIv1 without KALLSYMS_ALL ?

Thanks,

Mathieu
Steven Rostedt March 27, 2018, 9:04 p.m. UTC | #9
On Tue, 27 Mar 2018 14:58:24 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> +extern struct bpf_raw_event_map *__start__bpf_raw_tp[];
> +extern struct bpf_raw_event_map *__stop__bpf_raw_tp[];
> +
> +struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
> +{
> +	struct bpf_raw_event_map* const *btp = __start__bpf_raw_tp;
> +
> +	for (; btp < __stop__bpf_raw_tp; btp++)
> +		if (!strcmp((*btp)->tp->name, name))
> +			return *btp;
> +	return NULL;
> +}
> +

OK, this part is broken, and for some reason it didn't include my
changes to bpf_probe.h. I also tested this without setting BPF_EVENTS,
so I wasn't actually testing it.

I added a test in event_trace_init() to make sure that it worked:
(Not included in the patch below)

{
	struct bpf_raw_event_map *btp;
	btp = bpf_find_raw_tracepoint("sched_switch");
	if (btp)
		printk("found BPF_RAW_TRACEPOINT: %s %pS\n",
		       btp->tp->name, btp->bpf_func);
	else
		printk("COULD NOT FIND BPF_RAW_TRACEPOINT\n");
}

And it found the tracepoint.

Here's take two....

You can add my: Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6fc..4fab7392e237 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -178,6 +178,15 @@
 #define TRACE_SYSCALLS()
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8);		\
+			 VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;	\
+			 KEEP(*(__bpf_raw_tp_map))			\
+			 VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
+#else
+#define BPF_RAW_TP()
+#endif
+
 #ifdef CONFIG_SERIAL_EARLYCON
 #define EARLYCON_TABLE() STRUCT_ALIGN();			\
 			 VMLINUX_SYMBOL(__earlycon_table) = .;	\
@@ -576,6 +585,7 @@
 	*(.init.rodata)							\
 	FTRACE_EVENTS()							\
 	TRACE_SYSCALLS()						\
+	BPF_RAW_TP()							\
 	KPROBE_BLACKLIST()						\
 	ERROR_INJECT_WHITELIST()					\
 	MEM_DISCARD(init.rodata)					\
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 399ebe6f90cf..fb4778c0a248 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -470,8 +470,9 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
@@ -491,14 +492,18 @@ perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *p)
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *p)
 {
 	return -EOPNOTSUPP;
 }
+static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+	return NULL;
+}
 #endif
 
 enum {
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 39a283c61c51..35db8dd48c4c 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -36,4 +36,9 @@ struct tracepoint {
 	u32 num_args;
 };
 
+struct bpf_raw_event_map {
+	struct tracepoint	*tp;
+	void			*bpf_func;
+};
+
 #endif
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index d2cc0663e618..bb8ed2f530ad 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -76,7 +76,13 @@ __bpf_trace_##call(void *__data, proto)					\
 static inline void bpf_test_probe_##call(void)				\
 {									\
 	check_trace_callback_type_##call(__bpf_trace_##template);	\
-}
+}									\
+static struct bpf_raw_event_map	__used					\
+   __attribute__((section("__bpf_raw_tp_map")))				\
+__bpf_trace_tp_map_##call= {						\
+	.tp		= &__tracepoint_##call,				\
+	.bpf_func	= (void *)__bpf_trace_##template,		\
+};
 
 
 #undef DEFINE_EVENT_PRINT
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f100c63ff19e..6037a2f0108a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1312,7 +1312,7 @@ static int bpf_obj_get(const union bpf_attr *attr)
 }
 
 struct bpf_raw_tracepoint {
-	struct tracepoint *tp;
+	struct bpf_raw_event_map *btp;
 	struct bpf_prog *prog;
 };
 
@@ -1321,7 +1321,7 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
 	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
 
 	if (raw_tp->prog) {
-		bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
+		bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
 		bpf_prog_put(raw_tp->prog);
 	}
 	kfree(raw_tp);
@@ -1339,7 +1339,7 @@ static const struct file_operations bpf_raw_tp_fops = {
 static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 {
 	struct bpf_raw_tracepoint *raw_tp;
-	struct tracepoint *tp;
+	struct bpf_raw_event_map *btp;
 	struct bpf_prog *prog;
 	char tp_name[128];
 	int tp_fd, err;
@@ -1349,14 +1349,14 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 		return -EFAULT;
 	tp_name[sizeof(tp_name) - 1] = 0;
 
-	tp = kernel_tracepoint_find_by_name(tp_name);
-	if (!tp)
+	btp = bpf_find_raw_tracepoint(tp_name);
+	if (!btp)
 		return -ENOENT;
 
 	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
 	if (!raw_tp)
 		return -ENOMEM;
-	raw_tp->tp = tp;
+	raw_tp->btp = btp;
 
 	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
 				 BPF_PROG_TYPE_RAW_TRACEPOINT);
@@ -1365,7 +1365,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 		goto out_free_tp;
 	}
 
-	err = bpf_probe_register(raw_tp->tp, prog);
+	err = bpf_probe_register(raw_tp->btp, prog);
 	if (err)
 		goto out_put_prog;
 
@@ -1373,7 +1373,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
 				 O_CLOEXEC);
 	if (tp_fd < 0) {
-		bpf_probe_unregister(raw_tp->tp, prog);
+		bpf_probe_unregister(raw_tp->btp, prog);
 		err = tp_fd;
 		goto out_put_prog;
 	}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index eb58ef156d36..d0975094cff7 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -965,6 +965,22 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 	return ret;
 }
 
+extern struct bpf_raw_event_map __start__bpf_raw_tp;
+extern struct bpf_raw_event_map __stop__bpf_raw_tp;
+
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+	const struct bpf_raw_event_map *btp = &__start__bpf_raw_tp;
+	int i = 0;
+
+	for (; btp < &__stop__bpf_raw_tp; btp++) {
+		i++;
+		if (!strcmp(btp->tp->name, name))
+			return btp;
+	}
+	return NULL;
+}
+
 static __always_inline
 void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
 {
@@ -1020,10 +1036,9 @@ BPF_TRACE_DEFN_x(10);
 BPF_TRACE_DEFN_x(11);
 BPF_TRACE_DEFN_x(12);
 
-static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
-	unsigned long addr;
-	char buf[128];
+	struct tracepoint *tp = btp->tp;
 
 	/*
 	 * check that program doesn't access arguments beyond what's
@@ -1032,43 +1047,25 @@ static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
 	if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
 		return -EINVAL;
 
-	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
-	addr = kallsyms_lookup_name(buf);
-	if (!addr)
-		return -ENOENT;
-
-	return tracepoint_probe_register(tp, (void *)addr, prog);
+	return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
 }
 
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
 	int err;
 
 	mutex_lock(&bpf_event_mutex);
-	err = __bpf_probe_register(tp, prog);
+	err = __bpf_probe_register(btp, prog);
 	mutex_unlock(&bpf_event_mutex);
 	return err;
 }
 
-static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
-{
-	unsigned long addr;
-	char buf[128];
-
-	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
-	addr = kallsyms_lookup_name(buf);
-	if (!addr)
-		return -ENOENT;
-
-	return tracepoint_probe_unregister(tp, (void *)addr, prog);
-}
-
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 {
 	int err;
 
 	mutex_lock(&bpf_event_mutex);
-	err = __bpf_probe_unregister(tp, prog);
+	err = tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
 	mutex_unlock(&bpf_event_mutex);
 	return err;
 }
Alexei Starovoitov March 27, 2018, 10:48 p.m. UTC | #10
On 3/27/18 2:04 PM, Steven Rostedt wrote:
>
> +#ifdef CONFIG_BPF_EVENTS
> +#define BPF_RAW_TP() . = ALIGN(8);		\
> +			 VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;	\
> +			 KEEP(*(__bpf_raw_tp_map))			\
> +			 VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;

that looks to be correct, but something wrong with it.

Can you try your mini test with kasan on ?

I'm seeing this crash:
test_stacktrace_[   18.760662] start ffffffff84642438 stop ffffffff84644f60
map_raw_tp:PASS:[   18.761467] i 1 btp->tp cccccccccccccccc
prog_load raw tp[   18.762064] kasan: CONFIG_KASAN_INLINE enabled
  0 nsec
[   18.762704] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   18.765125] general protection fault: 0000 [#1] SMP KASAN PTI
[   18.765830] Modules linked in:
[   18.778358] Call Trace:
[   18.778674]  bpf_raw_tracepoint_open.isra.27+0x92/0x380

for some reason the start_bpf_raw_tp is off by 8.
Not sure how it works for you.

(gdb) p &__bpf_trace_tp_map_sys_exit
$10 = (struct bpf_raw_event_map *) 0xffffffff84642440 
<__bpf_trace_tp_map_sys_exit>

(gdb)  p &__start__bpf_raw_tp
$7 = (<data variable, no debug info> *) 0xffffffff84642438

(gdb)  p (void*)(&__start__bpf_raw_tp)+8
$11 = (void *) 0xffffffff84642440 <__bpf_trace_tp_map_sys_exit>
Mathieu Desnoyers March 27, 2018, 11:13 p.m. UTC | #11
----- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov ast@fb.com wrote:

> On 3/27/18 2:04 PM, Steven Rostedt wrote:
>>
>> +#ifdef CONFIG_BPF_EVENTS
>> +#define BPF_RAW_TP() . = ALIGN(8);		\

Given that the section consists of a 16-bytes structure elements
on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
be turned into a STRUCT_ALIGN(), especially given that the compiler
is free to up-align the structure on 32 bytes.

This could explain the kasan splat you are experiencing.

Thanks,

Mathieu


>> +			 VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;	\
>> +			 KEEP(*(__bpf_raw_tp_map))			\
>> +			 VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
> 
> that looks to be correct, but something wrong with it.
> 
> Can you try your mini test with kasan on ?
> 
> I'm seeing this crash:
> test_stacktrace_[   18.760662] start ffffffff84642438 stop ffffffff84644f60
> map_raw_tp:PASS:[   18.761467] i 1 btp->tp cccccccccccccccc
> prog_load raw tp[   18.762064] kasan: CONFIG_KASAN_INLINE enabled
>  0 nsec
> [   18.762704] kasan: GPF could be caused by NULL-ptr deref or user
> memory access
> [   18.765125] general protection fault: 0000 [#1] SMP KASAN PTI
> [   18.765830] Modules linked in:
> [   18.778358] Call Trace:
> [   18.778674]  bpf_raw_tracepoint_open.isra.27+0x92/0x380
> 
> for some reason the start_bpf_raw_tp is off by 8.
> Not sure how it works for you.
> 
> (gdb) p &__bpf_trace_tp_map_sys_exit
> $10 = (struct bpf_raw_event_map *) 0xffffffff84642440
> <__bpf_trace_tp_map_sys_exit>
> 
> (gdb)  p &__start__bpf_raw_tp
> $7 = (<data variable, no debug info> *) 0xffffffff84642438
> 
> (gdb)  p (void*)(&__start__bpf_raw_tp)+8
> $11 = (void *) 0xffffffff84642440 <__bpf_trace_tp_map_sys_exit>
Alexei Starovoitov March 28, 2018, midnight UTC | #12
On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:
> ----- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov ast@fb.com wrote:
>
>> On 3/27/18 2:04 PM, Steven Rostedt wrote:
>>>
>>> +#ifdef CONFIG_BPF_EVENTS
>>> +#define BPF_RAW_TP() . = ALIGN(8);		\
>
> Given that the section consists of a 16-bytes structure elements
> on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
> be turned into a STRUCT_ALIGN(), especially given that the compiler
> is free to up-align the structure on 32 bytes.

STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
but it fails without kasan too.
For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
comes inited with cccc:
[   22.703562] i 1 btp ffffffff8288e530 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.704638] i 2 btp ffffffff8288e540 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.705599] i 3 btp ffffffff8288e550 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.706551] i 4 btp ffffffff8288e560 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.707503] i 5 btp ffffffff8288e570 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.708452] i 6 btp ffffffff8288e580 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.709406] i 7 btp ffffffff8288e590 btp->tp cccccccccccccccc func 
cccccccccccccccc
[   22.710368] i 8 btp ffffffff8288e5a0 btp->tp cccccccccccccccc func 
cccccccccccccccc

while gdb shows that everything is good inside vmlinux
for exactly these addresses.
Some other linker magic missing?
Mathieu Desnoyers March 28, 2018, 12:44 a.m. UTC | #13
----- On Mar 27, 2018, at 8:00 PM, Alexei Starovoitov ast@fb.com wrote:

> On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:
>> ----- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov ast@fb.com wrote:
>>
>>> On 3/27/18 2:04 PM, Steven Rostedt wrote:
>>>>
>>>> +#ifdef CONFIG_BPF_EVENTS
>>>> +#define BPF_RAW_TP() . = ALIGN(8);		\
>>
>> Given that the section consists of a 16-bytes structure elements
>> on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
>> be turned into a STRUCT_ALIGN(), especially given that the compiler
>> is free to up-align the structure on 32 bytes.
> 
> STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
> but it fails without kasan too.
> For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
> comes inited with cccc:
> [   22.703562] i 1 btp ffffffff8288e530 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.704638] i 2 btp ffffffff8288e540 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.705599] i 3 btp ffffffff8288e550 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.706551] i 4 btp ffffffff8288e560 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.707503] i 5 btp ffffffff8288e570 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.708452] i 6 btp ffffffff8288e580 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.709406] i 7 btp ffffffff8288e590 btp->tp cccccccccccccccc func
> cccccccccccccccc
> [   22.710368] i 8 btp ffffffff8288e5a0 btp->tp cccccccccccccccc func
> cccccccccccccccc
> 
> while gdb shows that everything is good inside vmlinux
> for exactly these addresses.
> Some other linker magic missing?

No, Steven's iteration code is incorrect.

+extern struct bpf_raw_event_map __start__bpf_raw_tp;
+extern struct bpf_raw_event_map __stop__bpf_raw_tp;

That should be:

extern struct bpf_raw_event_map __start__bpf_raw_tp[];
extern struct bpf_raw_event_map __stop__bpf_raw_tp[];


+
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+        const struct bpf_raw_event_map *btp = &__start__bpf_raw_tp;

const struct bpf_raw_event_map *btp = __start__bpf_raw_tp;

+        int i = 0;
+
+        for (; btp < &__stop__bpf_raw_tp; btp++) {

for (; btp < __stop__bpf_raw_tp; btp++) {

Those start/stop symbols are given their address by the linker
automatically (this is a GNU linker extension). We don't want
pointers to the symbols, but rather the symbols per se to act
as start/stop addresses.

Thanks,

Mathieu

+                i++;
+                if (!strcmp(btp->tp->name, name))
+                        return btp;
+        }
+        return NULL;
+}
Alexei Starovoitov March 28, 2018, 12:51 a.m. UTC | #14
On 3/27/18 5:44 PM, Mathieu Desnoyers wrote:
> ----- On Mar 27, 2018, at 8:00 PM, Alexei Starovoitov ast@fb.com wrote:
>
>> On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:
>>> ----- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov ast@fb.com wrote:
>>>
>>>> On 3/27/18 2:04 PM, Steven Rostedt wrote:
>>>>>
>>>>> +#ifdef CONFIG_BPF_EVENTS
>>>>> +#define BPF_RAW_TP() . = ALIGN(8);		\
>>>
>>> Given that the section consists of a 16-bytes structure elements
>>> on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
>>> be turned into a STRUCT_ALIGN(), especially given that the compiler
>>> is free to up-align the structure on 32 bytes.
>>
>> STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
>> but it fails without kasan too.
>> For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
>> comes inited with cccc:
>> [   22.703562] i 1 btp ffffffff8288e530 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.704638] i 2 btp ffffffff8288e540 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.705599] i 3 btp ffffffff8288e550 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.706551] i 4 btp ffffffff8288e560 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.707503] i 5 btp ffffffff8288e570 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.708452] i 6 btp ffffffff8288e580 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.709406] i 7 btp ffffffff8288e590 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>> [   22.710368] i 8 btp ffffffff8288e5a0 btp->tp cccccccccccccccc func
>> cccccccccccccccc
>>
>> while gdb shows that everything is good inside vmlinux
>> for exactly these addresses.
>> Some other linker magic missing?
>
> No, Steven's iteration code is incorrect.
>
> +extern struct bpf_raw_event_map __start__bpf_raw_tp;
> +extern struct bpf_raw_event_map __stop__bpf_raw_tp;
>
> That should be:
>
> extern struct bpf_raw_event_map __start__bpf_raw_tp[];
> extern struct bpf_raw_event_map __stop__bpf_raw_tp[];
>
>
> +
> +struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
> +{
> +        const struct bpf_raw_event_map *btp = &__start__bpf_raw_tp;
>
> const struct bpf_raw_event_map *btp = __start__bpf_raw_tp;
>
> +        int i = 0;
> +
> +        for (; btp < &__stop__bpf_raw_tp; btp++) {
>
> for (; btp < __stop__bpf_raw_tp; btp++) {
>
> Those start/stop symbols are given their address by the linker
> automatically (this is a GNU linker extension). We don't want
> pointers to the symbols, but rather the symbols per se to act
> as start/stop addresses.

right. that part I fixed first.

Turned out it was in init.data section and got poisoned.
this fixes it:
@@ -258,6 +258,7 @@
         LIKELY_PROFILE()                                                \
         BRANCH_PROFILE()                                                \
         TRACE_PRINTKS()                                                 \
+       BPF_RAW_TP()                                                    \
         TRACEPOINT_STR()

  /*
@@ -585,7 +586,6 @@
         *(.init.rodata)                                                 \
         FTRACE_EVENTS()                                                 \
         TRACE_SYSCALLS()                                                \
-       BPF_RAW_TP()                                                    \
         KPROBE_BLACKLIST()                                              \
         ERROR_INJECT_WHITELIST()                                        \
         MEM_DISCARD(init.rodata)                                        \

and it works :)
I will clean few other nits I found while debugging and respin.
Steven Rostedt March 28, 2018, 2:06 p.m. UTC | #15
On Tue, 27 Mar 2018 17:51:55 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> Turned out it was in init.data section and got poisoned.
> this fixes it:
> @@ -258,6 +258,7 @@
>          LIKELY_PROFILE()                                                \
>          BRANCH_PROFILE()                                                \
>          TRACE_PRINTKS()                                                 \
> +       BPF_RAW_TP()                                                    \
>          TRACEPOINT_STR()
> 
>   /*
> @@ -585,7 +586,6 @@
>          *(.init.rodata)                                                 \
>          FTRACE_EVENTS()                                                 \
>          TRACE_SYSCALLS()                                                \
> -       BPF_RAW_TP()                                                    \
>          KPROBE_BLACKLIST()                                              \
>          ERROR_INJECT_WHITELIST()                                        \
>          MEM_DISCARD(init.rodata)                                        \
> 
> and it works :)
> I will clean few other nits I found while debugging and respin.

Getting it properly working was an exercise left to the reader ;-)

Sorry about that, I did a bit of copy and paste to get it working, and
copied from code that did things a bit differently, so I massaged it by
hand, and doing it quickly as I had other things to work on.

-- Steve
diff mbox series

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5e2e8a49fb21..6d7243bfb0ff 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -19,6 +19,7 @@  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
 BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
+BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 8a1442c4e513..e37fcd7505da 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -468,6 +468,8 @@  unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
@@ -487,6 +489,14 @@  perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	return -EOPNOTSUPP;
 }
+static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+{
+	return -EOPNOTSUPP;
+}
+static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 enum {
@@ -546,6 +556,33 @@  extern void ftrace_profile_free_filter(struct perf_event *event);
 void perf_trace_buf_update(void *record, u16 type);
 void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
 
+void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
+void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
+void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3);
+void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4);
+void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5);
+void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6);
+void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
+void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8);
+void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8, u64 arg9);
+void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10);
+void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11);
+void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
 void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct trace_event_call *call, u64 count,
 			       struct pt_regs *regs, struct hlist_head *head,
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
new file mode 100644
index 000000000000..d2cc0663e618
--- /dev/null
+++ b/include/trace/bpf_probe.h
@@ -0,0 +1,87 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef TRACE_SYSTEM_VAR
+
+#ifdef CONFIG_BPF_EVENTS
+
+#undef __entry
+#define __entry entry
+
+#undef __get_dynamic_array
+#define __get_dynamic_array(field)	\
+		((void *)__entry + (__entry->__data_loc_##field & 0xffff))
+
+#undef __get_dynamic_array_len
+#define __get_dynamic_array_len(field)	\
+		((__entry->__data_loc_##field >> 16) & 0xffff)
+
+#undef __get_str
+#define __get_str(field) ((char *)__get_dynamic_array(field))
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __perf_count
+#define __perf_count(c)	(c)
+
+#undef __perf_task
+#define __perf_task(t)	(t)
+
+/* cast any integer, pointer, or small struct to u64 */
+#define UINTTYPE(size) \
+	__typeof__(__builtin_choose_expr(size == 1,  (u8)1, \
+		   __builtin_choose_expr(size == 2, (u16)2, \
+		   __builtin_choose_expr(size == 4, (u32)3, \
+		   __builtin_choose_expr(size == 8, (u64)4, \
+					 (void)5)))))
+#define __CAST_TO_U64(x) ({ \
+	typeof(x) __src = (x); \
+	UINTTYPE(sizeof(x)) __dst; \
+	memcpy(&__dst, &__src, sizeof(__dst)); \
+	(u64)__dst; })
+
+#define __CAST1(a,...) __CAST_TO_U64(a)
+#define __CAST2(a,...) __CAST_TO_U64(a), __CAST1(__VA_ARGS__)
+#define __CAST3(a,...) __CAST_TO_U64(a), __CAST2(__VA_ARGS__)
+#define __CAST4(a,...) __CAST_TO_U64(a), __CAST3(__VA_ARGS__)
+#define __CAST5(a,...) __CAST_TO_U64(a), __CAST4(__VA_ARGS__)
+#define __CAST6(a,...) __CAST_TO_U64(a), __CAST5(__VA_ARGS__)
+#define __CAST7(a,...) __CAST_TO_U64(a), __CAST6(__VA_ARGS__)
+#define __CAST8(a,...) __CAST_TO_U64(a), __CAST7(__VA_ARGS__)
+#define __CAST9(a,...) __CAST_TO_U64(a), __CAST8(__VA_ARGS__)
+#define __CAST10(a,...) __CAST_TO_U64(a), __CAST9(__VA_ARGS__)
+#define __CAST11(a,...) __CAST_TO_U64(a), __CAST10(__VA_ARGS__)
+#define __CAST12(a,...) __CAST_TO_U64(a), __CAST11(__VA_ARGS__)
+/* tracepoints with more than 12 arguments will hit build error */
+#define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+/* no 'static' here. The bpf probe functions are global */		\
+notrace void								\
+__bpf_trace_##call(void *__data, proto)					\
+{									\
+	struct bpf_prog *prog = __data;					\
+	\
+	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));	\
+}
+
+/*
+ * This part is compiled out, it is only here as a build time check
+ * to make sure that if the tracepoint handling changes, the
+ * bpf probe will fail to compile unless it too is updated.
+ */
+#undef DEFINE_EVENT
+#define DEFINE_EVENT(template, call, proto, args)			\
+static inline void bpf_test_probe_##call(void)				\
+{									\
+	check_trace_callback_type_##call(__bpf_trace_##template);	\
+}
+
+
+#undef DEFINE_EVENT_PRINT
+#define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
+	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
+
+#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+#endif /* CONFIG_BPF_EVENTS */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 96b22ace9ae7..5f8216bc261f 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -95,6 +95,7 @@ 
 #ifdef TRACEPOINTS_ENABLED
 #include <trace/trace_events.h>
 #include <trace/perf.h>
+#include <trace/bpf_probe.h>
 #endif
 
 #undef TRACE_EVENT
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 18b7c510c511..1878201c2d77 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@  enum bpf_cmd {
 	BPF_MAP_GET_FD_BY_ID,
 	BPF_OBJ_GET_INFO_BY_FD,
 	BPF_PROG_QUERY,
+	BPF_RAW_TRACEPOINT_OPEN,
 };
 
 enum bpf_map_type {
@@ -134,6 +135,7 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_SK_MSG,
+	BPF_PROG_TYPE_RAW_TRACEPOINT,
 };
 
 enum bpf_attach_type {
@@ -344,6 +346,11 @@  union bpf_attr {
 		__aligned_u64	prog_ids;
 		__u32		prog_cnt;
 	} query;
+
+	struct {
+		__u64 name;
+		__u32 prog_fd;
+	} raw_tracepoint;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
@@ -1152,4 +1159,8 @@  struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_raw_tracepoint_args {
+	__u64 args[0];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3aeb4ea2a93a..7486b450672e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1311,6 +1311,81 @@  static int bpf_obj_get(const union bpf_attr *attr)
 				attr->file_flags);
 }
 
+struct bpf_raw_tracepoint {
+	struct tracepoint *tp;
+	struct bpf_prog *prog;
+};
+
+static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
+
+	if (raw_tp->prog) {
+		bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
+		bpf_prog_put(raw_tp->prog);
+	}
+	kfree(raw_tp);
+	return 0;
+}
+
+static const struct file_operations bpf_raw_tp_fops = {
+	.release	= bpf_raw_tracepoint_release,
+	.read		= bpf_dummy_read,
+	.write		= bpf_dummy_write,
+};
+
+#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
+
+static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
+{
+	struct bpf_raw_tracepoint *raw_tp;
+	struct tracepoint *tp;
+	struct bpf_prog *prog;
+	char tp_name[128];
+	int tp_fd, err;
+
+	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
+			      sizeof(tp_name) - 1) < 0)
+		return -EFAULT;
+	tp_name[sizeof(tp_name) - 1] = 0;
+
+	tp = kernel_tracepoint_find_by_name(tp_name);
+	if (!tp)
+		return -ENOENT;
+
+	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
+	if (!raw_tp)
+		return -ENOMEM;
+	raw_tp->tp = tp;
+
+	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
+				 BPF_PROG_TYPE_RAW_TRACEPOINT);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto out_free_tp;
+	}
+
+	err = bpf_probe_register(raw_tp->tp, prog);
+	if (err)
+		goto out_put_prog;
+
+	raw_tp->prog = prog;
+	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
+				 O_CLOEXEC);
+	if (tp_fd < 0) {
+		bpf_probe_unregister(raw_tp->tp, prog);
+		err = tp_fd;
+		goto out_put_prog;
+	}
+	return tp_fd;
+
+out_put_prog:
+	bpf_prog_put(prog);
+out_free_tp:
+	kfree(raw_tp);
+	return err;
+}
+
 #ifdef CONFIG_CGROUP_BPF
 
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
@@ -1921,6 +1996,9 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET_INFO_BY_FD:
 		err = bpf_obj_get_info_by_fd(&attr, uattr);
 		break;
+	case BPF_RAW_TRACEPOINT_OPEN:
+		err = bpf_raw_tracepoint_open(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c634e093951f..00e86aa11360 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -723,6 +723,86 @@  const struct bpf_verifier_ops tracepoint_verifier_ops = {
 const struct bpf_prog_ops tracepoint_prog_ops = {
 };
 
+/*
+ * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
+ * to avoid potential recursive reuse issue when/if tracepoints are added
+ * inside bpf_*_event_output and/or bpf_get_stack_id
+ */
+static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	return ____bpf_perf_event_output(regs, map, flags, data, size);
+}
+
+static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = {
+	.func		= bpf_perf_event_output_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
+BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   struct bpf_map *, map, u64, flags)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	/* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
+	return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
+			       flags, 0, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
+	.func		= bpf_get_stackid_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *raw_tp_prog_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+		return &bpf_perf_event_output_proto_raw_tp;
+	case BPF_FUNC_get_stackid:
+		return &bpf_get_stackid_proto_raw_tp;
+	default:
+		return tracing_func_proto(func_id);
+	}
+}
+
+static bool raw_tp_prog_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					struct bpf_insn_access_aux *info)
+{
+	/* largest tracepoint in the kernel has 12 args */
+	if (off < 0 || off >= sizeof(__u64) * 12)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+	return true;
+}
+
+const struct bpf_verifier_ops raw_tracepoint_verifier_ops = {
+	.get_func_proto  = raw_tp_prog_func_proto,
+	.is_valid_access = raw_tp_prog_is_valid_access,
+};
+
+const struct bpf_prog_ops raw_tracepoint_prog_ops = {
+};
+
 static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
 				    struct bpf_insn_access_aux *info)
 {
@@ -896,3 +976,111 @@  int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 
 	return ret;
 }
+
+static __always_inline
+void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
+{
+	rcu_read_lock();
+	preempt_disable();
+	(void) BPF_PROG_RUN(prog, args);
+	preempt_enable();
+	rcu_read_unlock();
+}
+
+#define UNPACK(...)			__VA_ARGS__
+#define REPEAT_1(FN, DL, X, ...)	FN(X)
+#define REPEAT_2(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__)
+#define REPEAT_3(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__)
+#define REPEAT_4(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__)
+#define REPEAT_5(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__)
+#define REPEAT_6(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__)
+#define REPEAT_7(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__)
+#define REPEAT_8(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__)
+#define REPEAT_9(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__)
+#define REPEAT_10(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_9(FN, DL, __VA_ARGS__)
+#define REPEAT_11(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_10(FN, DL, __VA_ARGS__)
+#define REPEAT_12(FN, DL, X, ...)	FN(X) UNPACK DL REPEAT_11(FN, DL, __VA_ARGS__)
+#define REPEAT(X, FN, DL, ...)		REPEAT_##X(FN, DL, __VA_ARGS__)
+
+#define SARG(X)		u64 arg##X
+#define COPY(X)		args[X] = arg##X
+
+#define __DL_COM	(,)
+#define __DL_SEM	(;)
+
+#define __SEQ_0_11	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
+
+#define BPF_TRACE_DEFN_x(x)						\
+	void bpf_trace_run##x(struct bpf_prog *prog,			\
+			      REPEAT(x, SARG, __DL_COM, __SEQ_0_11))	\
+	{								\
+		u64 args[x];						\
+		REPEAT(x, COPY, __DL_SEM, __SEQ_0_11);			\
+		__bpf_trace_run(prog, args);				\
+	}								\
+	EXPORT_SYMBOL_GPL(bpf_trace_run##x)
+BPF_TRACE_DEFN_x(1);
+BPF_TRACE_DEFN_x(2);
+BPF_TRACE_DEFN_x(3);
+BPF_TRACE_DEFN_x(4);
+BPF_TRACE_DEFN_x(5);
+BPF_TRACE_DEFN_x(6);
+BPF_TRACE_DEFN_x(7);
+BPF_TRACE_DEFN_x(8);
+BPF_TRACE_DEFN_x(9);
+BPF_TRACE_DEFN_x(10);
+BPF_TRACE_DEFN_x(11);
+BPF_TRACE_DEFN_x(12);
+
+static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	unsigned long addr;
+	char buf[128];
+
+	/*
+	 * check that program doesn't access arguments beyond what's
+	 * available in this tracepoint
+	 */
+	if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
+		return -EINVAL;
+
+	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+	addr = kallsyms_lookup_name(buf);
+	if (!addr)
+		return -ENOENT;
+
+	return tracepoint_probe_register(tp, (void *)addr, prog);
+}
+
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	int err;
+
+	mutex_lock(&bpf_event_mutex);
+	err = __bpf_probe_register(tp, prog);
+	mutex_unlock(&bpf_event_mutex);
+	return err;
+}
+
+static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	unsigned long addr;
+	char buf[128];
+
+	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+	addr = kallsyms_lookup_name(buf);
+	if (!addr)
+		return -ENOENT;
+
+	return tracepoint_probe_unregister(tp, (void *)addr, prog);
+}
+
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	int err;
+
+	mutex_lock(&bpf_event_mutex);
+	err = __bpf_probe_unregister(tp, prog);
+	mutex_unlock(&bpf_event_mutex);
+	return err;
+}