
[RFC,bpf-next,09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support

Message ID 20201022082138.2322434-10-jolsa@kernel.org
State Not Applicable
Delegated to: BPF Maintainers
Headers show
Series bpf: Speed up trampoline attach | expand

Checks

Context Check Description
jkicinski/cover_letter success Link
jkicinski/fixes_present success Link
jkicinski/patch_count fail Series longer than 15 patches
jkicinski/tree_selection success Clearly marked for bpf-next
jkicinski/subject_prefix success Link
jkicinski/source_inline success Was 0 now: 0
jkicinski/verify_signedoff success Link
jkicinski/module_param success Was 0 now: 0
jkicinski/build_32bit fail Errors and warnings before: 16029 this patch: 16030
jkicinski/kdoc success Errors and warnings before: 0 this patch: 0
jkicinski/verify_fixes success Link
jkicinski/checkpatch warning CHECK: Unbalanced braces around else statement CHECK: braces {} should be used on all arms of this statement
jkicinski/build_allmodconfig_warn success Errors and warnings before: 16067 this patch: 16067
jkicinski/header_inline success Link
jkicinski/stable success Stable not CCed

Commit Message

Jiri Olsa Oct. 22, 2020, 8:21 a.m. UTC
Adding BPF_TRAMPOLINE_BATCH_ATTACH support, which allows attaching
multiple tracing fentry/fexit programs to trampolines within one
syscall.

Currently each tracing program is attached in a separate bpf syscall
and, more importantly, by a separate register_ftrace_direct call, which
registers the trampoline in the ftrace subsystem. We can save some cycles
by simply using its batch variant, register_ftrace_direct_ips.

Before:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     2,199,433,771      cycles:k               ( +-  0.55% )
       936,105,469      cycles:u               ( +-  0.37% )

             26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

After:

 Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
    { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):

     1,456,854,867      cycles:k               ( +-  0.57% )
       937,737,431      cycles:u               ( +-  0.13% )

             12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )

The new BPF_TRAMPOLINE_BATCH_ATTACH syscall command expects the
following data in union bpf_attr:

  struct {
          __aligned_u64   in;
          __aligned_u64   out;
          __u32           count;
  } trampoline_batch;

  in    - pointer to a user space array with file descriptors of loaded bpf
          programs to attach
  out   - pointer to a user space array for the resulting link file descriptors
  count - number of 'in/out' file descriptors

Basically the new code gets programs from the 'in' file descriptors and
attaches them the same way the current code does, apart from the last
step that registers the probe ip with the trampoline. This is done at the
end with the new register_ftrace_direct_ips function.

The resulting link file descriptors are written to the 'out' array and
match the order of the 'in' array file descriptors.
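
For illustration, a minimal user space sketch of invoking the new command
(assuming the bpf_cmd and bpf_attr additions from this patch are in the
installed headers; prog_fds[] holds descriptors of already loaded tracing
programs):

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* sketch only: batch-attach already loaded tracing programs */
  static int trampoline_batch_attach(const __u32 *prog_fds, __u32 *link_fds,
                                     __u32 count)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.trampoline_batch.in    = (__u64)(unsigned long)prog_fds;
          attr.trampoline_batch.out   = (__u64)(unsigned long)link_fds;
          attr.trampoline_batch.count = count;

          /* on success, link_fds[] holds one link fd per attached program,
           * in the same order as prog_fds[]
           */
          return syscall(__NR_bpf, BPF_TRAMPOLINE_BATCH_ATTACH, &attr,
                         sizeof(attr));
  }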

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/bpf.h      | 15 ++++++-
 include/uapi/linux/bpf.h |  7 ++++
 kernel/bpf/syscall.c     | 88 ++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/trampoline.c  | 69 +++++++++++++++++++++++++++----
 4 files changed, 164 insertions(+), 15 deletions(-)

Comments

Andrii Nakryiko Oct. 23, 2020, 8:03 p.m. UTC | #1
On Thu, Oct 22, 2020 at 8:01 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding BPF_TRAMPOLINE_BATCH_ATTACH support, which allows attaching
> multiple tracing fentry/fexit programs to trampolines within one
> syscall.
>
> Currently each tracing program is attached in a separate bpf syscall
> and, more importantly, by a separate register_ftrace_direct call, which
> registers the trampoline in the ftrace subsystem. We can save some cycles
> by simply using its batch variant, register_ftrace_direct_ips.
>
> Before:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      2,199,433,771      cycles:k               ( +-  0.55% )
>        936,105,469      cycles:u               ( +-  0.37% )
>
>              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
>
> After:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      1,456,854,867      cycles:k               ( +-  0.57% )
>        937,737,431      cycles:u               ( +-  0.13% )
>
>              12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )
>
> The new BPF_TRAMPOLINE_BATCH_ATTACH syscall command expects the
> following data in union bpf_attr:
>
>   struct {
>           __aligned_u64   in;
>           __aligned_u64   out;
>           __u32           count;
>   } trampoline_batch;
>
>   in    - pointer to a user space array with file descriptors of loaded bpf
>           programs to attach
>   out   - pointer to a user space array for the resulting link file descriptors
>   count - number of 'in/out' file descriptors
>
> Basically the new code gets programs from the 'in' file descriptors and
> attaches them the same way the current code does, apart from the last
> step that registers the probe ip with the trampoline. This is done at the
> end with the new register_ftrace_direct_ips function.
>
> The resulting link file descriptors are written to the 'out' array and
> match the order of the 'in' array file descriptors.
>

I think this is a pretty hard API to use correctly from user-space.
Think about all those partially attached and/or partially detached BPF
programs. And subsequent clean up for them. Also there is nothing even
close to atomicity, so you might get a spurious invocation a few times
before batch-attach fails mid-way and the kernel (hopefully) will
detach those already attached programs in an attempt to clean
everything up. Debugging and handling that is a big pain for users,
IMO.

Here's a raw idea, let's think if it would be possible to implement
something like this. It seems like what you need is to create a set of
logically-grouped placeholders for multiple functions you are about to
attach to. Until the BPF program is attached, those placeholders are
just no-ops (e.g., they might jump to an "inactive" single trampoline,
which just immediately returns). Then you attach the BPF program
atomically into a single place, and all those no-op jumps to a
trampoline start to call the BPF program at the same time. It's not
strictly atomic, but is much closer in time with each other. Also,
because it's still a single trampoline, you get a nice mapping to a
single bpf_link, so detaching is not an issue.

Basically, maybe ftrace subsystem could provide a set of APIs to
prepare a set of functions to attach to. Then BPF subsystem would just
do what it does today, except instead of attaching to a specific
kernel function, it would attach to ftrace's placeholder. I don't know
anything about ftrace implementation, so this might be far off. But I
thought that looking at this problem from a bit of a different angle
would benefit the discussion. Thoughts?
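
A hypothetical sketch of such an ftrace-side API, purely for illustration
(the names below are made up and do not exist today):

  struct ftrace_direct_set *set;

  /* hypothetical: patch all listed ips to jump to an inactive trampoline */
  set = ftrace_direct_set_prepare(ips, ip_cnt, (long)nop_tramp);

  /* hypothetical: later, flip every prepared entry over to the real BPF
   * trampoline in one step
   */
  ftrace_direct_set_activate(set, (long)bpf_tramp);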


> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/bpf.h      | 15 ++++++-
>  include/uapi/linux/bpf.h |  7 ++++
>  kernel/bpf/syscall.c     | 88 ++++++++++++++++++++++++++++++++++++++--
>  kernel/bpf/trampoline.c  | 69 +++++++++++++++++++++++++++----
>  4 files changed, 164 insertions(+), 15 deletions(-)
>

[...]
Steven Rostedt Oct. 23, 2020, 8:31 p.m. UTC | #2
On Fri, 23 Oct 2020 13:03:22 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> Basically, maybe ftrace subsystem could provide a set of APIs to
> prepare a set of functions to attach to. Then BPF subsystem would just
> do what it does today, except instead of attaching to a specific
> kernel function, it would attach to ftrace's placeholder. I don't know
> anything about ftrace implementation, so this might be far off. But I
> thought that looking at this problem from a bit of a different angle
> would benefit the discussion. Thoughts?

I probably understand bpf internals as much as you understand ftrace
internals ;-)

Anyway, what I'm currently working on is a fast way to get to the
arguments of a function. For now, I'm just focused on x86_64, and only add
6 arguments.

The main issue that Alexei had with using the ftrace trampoline was that
the only way to get to the arguments was to set the "REGS" flag, which
would give a regs parameter that contained a full pt_regs. The problem with
this approach is that it required saving *all* regs for every function
traced. Alexei felt that this was too much overhead.

Looking at Jiri's patch, I took a look at the creation of the bpf
trampoline, and noticed that it's copying the regs on a stack (at least
what is used, which I think could be an issue).

For tracing a function, one must store all argument registers used, and
restore them, as that's how they are passed from caller to callee. And
since they are stored anyway, I figure, that should also be sent to the
function callbacks, so that they have access to them too.

I'm working on a set of patches to make this a reality.

-- Steve
Andrii Nakryiko Oct. 23, 2020, 10:23 p.m. UTC | #3
On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 23 Oct 2020 13:03:22 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > Basically, maybe ftrace subsystem could provide a set of APIs to
> > prepare a set of functions to attach to. Then BPF subsystem would just
> > do what it does today, except instead of attaching to a specific
> > kernel function, it would attach to ftrace's placeholder. I don't know
> > anything about ftrace implementation, so this might be far off. But I
> > thought that looking at this problem from a bit of a different angle
> > would benefit the discussion. Thoughts?
>
> I probably understand bpf internals as much as you understand ftrace
> internals ;-)
>

Heh :) But while we are here, what do you think about this idea of
preparing a no-op trampoline that a bunch (thousands, potentially) of
function entries will jump to. And once all that is ready and patched
through the kernel functions' entry points, then allow attaching a BPF
program or ftrace callback (if I get the terminology right) in one
fast and simple operation? For users that would mean that they will
either get calls for all or none of the attached kfuncs, with simple and
reliable semantics.

Something like this, where bpf_prog attachment (which replaces nop)
happens as step 2:

+------------+  +----------+  +----------+
|  kfunc1    |  |  kfunc2  |  |  kfunc3  |
+------+-----+  +----+-----+  +----+-----+
       |             |             |
       |             |             |
       +---------------------------+
                     |
                     v
                 +---+---+           +-----------+
                 |  nop  +----------->  bpf_prog |
                 +-------+           +-----------+


> Anyway, what I'm currently working on is a fast way to get to the
> arguments of a function. For now, I'm just focused on x86_64, and only add
> 6 arguments.
>
> The main issue that Alexei had with using the ftrace trampoline was that
> the only way to get to the arguments was to set the "REGS" flag, which
> would give a regs parameter that contained a full pt_regs. The problem with
> this approach is that it required saving *all* regs for every function
> traced. Alexei felt that this was too much overhead.
>
> Looking at Jiri's patch, I took a look at the creation of the bpf
> trampoline, and noticed that it's copying the regs on a stack (at least
> what is used, which I think could be an issue).

Right. And BPF doesn't get access to the entire pt_regs struct, so it
doesn't have to pay the price of saving it.

But just FYI. Alexei is out till next week, so don't expect him to
reply in the next few days. But he's probably best to discuss these
nitty-gritty details with :)

>
> For tracing a function, one must store all argument registers used, and
> restore them, as that's how they are passed from caller to callee. And
> since they are stored anyway, I figure, that should also be sent to the
> function callbacks, so that they have access to them too.
>
> I'm working on a set of patches to make this a reality.
>
> -- Steve
Jiri Olsa Oct. 25, 2020, 7:41 p.m. UTC | #4
On Fri, Oct 23, 2020 at 03:23:10PM -0700, Andrii Nakryiko wrote:
> On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Fri, 23 Oct 2020 13:03:22 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > Basically, maybe ftrace subsystem could provide a set of APIs to
> > > prepare a set of functions to attach to. Then BPF subsystem would just
> > > do what it does today, except instead of attaching to a specific
> > > kernel function, it would attach to ftrace's placeholder. I don't know
> > > anything about ftrace implementation, so this might be far off. But I
> > > thought that looking at this problem from a bit of a different angle
> > > would benefit the discussion. Thoughts?
> >
> > I probably understand bpf internals as much as you understand ftrace
> > internals ;-)
> >
> 
> Heh :) But while we are here, what do you think about this idea of
> preparing a no-op trampoline that a bunch (thousands, potentially) of
> function entries will jump to. And once all that is ready and patched
> through the kernel functions' entry points, then allow attaching a BPF
> program or ftrace callback (if I get the terminology right) in one
> fast and simple operation? For users that would mean that they will
> either get calls for all or none of the attached kfuncs, with simple and
> reliable semantics.

so the main pain point the batch interface is addressing is that
every attach (BPF_RAW_TRACEPOINT_OPEN command) calls register_ftrace_direct,
and you'd need to do the same for the nop trampoline, no?

I wonder if we could create some 'transaction object' represented
by an fd and add it to bpf_attr::raw_tracepoint

then attach (BPF_RAW_TRACEPOINT_OPEN command) would add the program to this
new 'transaction object' instead of updating ftrace directly

and when the collection is done (all BPF_RAW_TRACEPOINT_OPEN commands
are executed), we'd call a new bpf syscall command on that transaction
and it would call the ftrace interface

something like:

  bpf(TRANSACTION_NEW) = fd
  bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_1, fd
  bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_2, fd
  ...
  bpf(TRANSACTION_DONE) for fd
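
A purely hypothetical user space sketch of that flow (the TRANSACTION_*
commands and the raw_tracepoint.trans_fd field below are made-up names,
not part of this series; bpf() is the usual syscall(__NR_bpf, ...) wrapper):

  union bpf_attr attr;
  int trans_fd, i;

  memset(&attr, 0, sizeof(attr));
  trans_fd = bpf(BPF_TRANSACTION_NEW, &attr, sizeof(attr));

  for (i = 0; i < prog_cnt; i++) {
          memset(&attr, 0, sizeof(attr));
          attr.raw_tracepoint.prog_fd  = prog_fds[i];
          attr.raw_tracepoint.trans_fd = trans_fd; /* collect, do not attach yet */
          link_fds[i] = bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
  }

  memset(&attr, 0, sizeof(attr));
  attr.transaction.fd = trans_fd;          /* made-up field for this sketch */
  /* the single register_ftrace_direct_ips() call would happen in here */
  bpf(BPF_TRANSACTION_DONE, &attr, sizeof(attr));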

jirka

> 
> Something like this, where bpf_prog attachment (which replaces nop)
> happens as step 2:
> 
> +------------+  +----------+  +----------+
> |  kfunc1    |  |  kfunc2  |  |  kfunc3  |
> +------+-----+  +----+-----+  +----+-----+
>        |             |             |
>        |             |             |
>        +---------------------------+
>                      |
>                      v
>                  +---+---+           +-----------+
>                  |  nop  +----------->  bpf_prog |
>                  +-------+           +-----------+
> 
> 
> > Anyway, what I'm currently working on is a fast way to get to the
> > arguments of a function. For now, I'm just focused on x86_64, and only add
> > 6 arguments.
> >
> > The main issue that Alexei had with using the ftrace trampoline was that
> > the only way to get to the arguments was to set the "REGS" flag, which
> > would give a regs parameter that contained a full pt_regs. The problem with
> > this approach is that it required saving *all* regs for every function
> > traced. Alexei felt that this was too much overhead.
> >
> > Looking at Jiri's patch, I took a look at the creation of the bpf
> > trampoline, and noticed that it's copying the regs on a stack (at least
> > what is used, which I think could be an issue).
> 
> Right. And BPF doesn't get access to the entire pt_regs struct, so it
> doesn't have to pay the price of saving it.
> 
> But just FYI. Alexei is out till next week, so don't expect him to
> reply in the next few days. But he's probably best to discuss these
> nitty-gritty details with :)
> 
> >
> > For tracing a function, one must store all argument registers used, and
> > restore them, as that's how they are passed from caller to callee. And
> > since they are stored anyway, I figure, that should also be sent to the
> > function callbacks, so that they have access to them too.
> >
> > I'm working on a set of patches to make this a reality.
> >
> > -- Steve
>
Andrii Nakryiko Oct. 26, 2020, 11:19 p.m. UTC | #5
On Sun, Oct 25, 2020 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Oct 23, 2020 at 03:23:10PM -0700, Andrii Nakryiko wrote:
> > On Fri, Oct 23, 2020 at 1:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Fri, 23 Oct 2020 13:03:22 -0700
> > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > > Basically, maybe ftrace subsystem could provide a set of APIs to
> > > > prepare a set of functions to attach to. Then BPF subsystem would just
> > > > do what it does today, except instead of attaching to a specific
> > > > kernel function, it would attach to ftrace's placeholder. I don't know
> > > > anything about ftrace implementation, so this might be far off. But I
> > > > thought that looking at this problem from a bit of a different angle
> > > > would benefit the discussion. Thoughts?
> > >
> > > I probably understand bpf internals as much as you understand ftrace
> > > internals ;-)
> > >
> >
> > Heh :) But while we are here, what do you think about this idea of
> > preparing a no-op trampoline that a bunch (thousands, potentially) of
> > function entries will jump to. And once all that is ready and patched
> > through the kernel functions' entry points, then allow attaching a BPF
> > program or ftrace callback (if I get the terminology right) in one
> > fast and simple operation? For users that would mean that they will
> > either get calls for all or none of the attached kfuncs, with simple and
> > reliable semantics.
>
> so the main pain point the batch interface is addressing is that
> every attach (BPF_RAW_TRACEPOINT_OPEN command) calls register_ftrace_direct,
> and you'd need to do the same for the nop trampoline, no?

I guess I was hoping that if we know it's a nop that we are
installing, then we can do it without extra waiting, which should
speed it up quite a bit.

>
> I wonder if we could create some 'transaction object' represented
> by an fd and add it to bpf_attr::raw_tracepoint
>
> then attach (BPF_RAW_TRACEPOINT_OPEN command) would add the program to this
> new 'transaction object' instead of updating ftrace directly
>
> and when the collection is done (all BPF_RAW_TRACEPOINT_OPEN commands
> are executed), we'd call a new bpf syscall command on that transaction
> and it would call the ftrace interface
>

This is conceptually close to what I had in mind, but I was thinking of
a single BPF program attached to many kernel functions. That's something
that's impossible today, as you mentioned in another thread.

> something like:
>
>   bpf(TRANSACTION_NEW) = fd
>   bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_1, fd
>   bpf(BPF_RAW_TRACEPOINT_OPEN) for prog_fd_2, fd
>   ...
>   bpf(TRANSACTION_DONE) for fd
>
> jirka
>
> >
> > Something like this, where bpf_prog attachment (which replaces nop)
> > happens as step 2:
> >
> > +------------+  +----------+  +----------+
> > |  kfunc1    |  |  kfunc2  |  |  kfunc3  |
> > +------+-----+  +----+-----+  +----+-----+
> >        |             |             |
> >        |             |             |
> >        +---------------------------+
> >                      |
> >                      v
> >                  +---+---+           +-----------+
> >                  |  nop  +----------->  bpf_prog |
> >                  +-------+           +-----------+
> >
> >
> > > Anyway, what I'm currently working on is a fast way to get to the
> > > arguments of a function. For now, I'm just focused on x86_64, and only add
> > > 6 arguments.
> > >
> > > The main issue that Alexei had with using the ftrace trampoline was that
> > > the only way to get to the arguments was to set the "REGS" flag, which
> > > would give a regs parameter that contained a full pt_regs. The problem with
> > > this approach is that it required saving *all* regs for every function
> > > traced. Alexei felt that this was too much overhead.
> > >
> > > Looking at Jiri's patch, I took a look at the creation of the bpf
> > > trampoline, and noticed that it's copying the regs on a stack (at least
> > > what is used, which I think could be an issue).
> >
> > Right. And BPF doesn't get access to the entire pt_regs struct, so it
> > doesn't have to pay the price of saving it.
> >
> > But just FYI. Alexei is out till next week, so don't expect him to
> > reply in the next few days. But he's probably best to discuss these
> > nitty-gritty details with :)
> >
> > >
> > > For tracing a function, one must store all argument registers used, and
> > > restore them, as that's how they are passed from caller to callee. And
> > > since they are stored anyway, I figure, that should also be sent to the
> > > function callbacks, so that they have access to them too.
> > >
> > > I'm working on a set of patches to make this a reality.
> > >
> > > -- Steve
> >
>

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2b16bf48aab6..d28c7ac3af3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -583,6 +583,13 @@  enum bpf_tramp_prog_type {
 	BPF_TRAMP_REPLACE, /* more than MAX */
 };
 
+struct bpf_trampoline_batch {
+	int count;
+	int idx;
+	unsigned long *ips;
+	unsigned long *addrs;
+};
+
 struct bpf_trampoline {
 	/* hlist for trampoline_table */
 	struct hlist_node hlist;
@@ -644,11 +651,14 @@  static __always_inline unsigned int bpf_dispatcher_nop_func(
 	return bpf_func(ctx, insnsi);
 }
 #ifdef CONFIG_BPF_JIT
-int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
+int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			     struct bpf_trampoline_batch *batch);
 int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
 struct bpf_trampoline *bpf_trampoline_get(u64 key,
 					  struct bpf_attach_target_info *tgt_info);
 void bpf_trampoline_put(struct bpf_trampoline *tr);
+struct bpf_trampoline_batch *bpf_trampoline_batch_alloc(int count);
+void bpf_trampoline_batch_free(struct bpf_trampoline_batch *batch);
 #define BPF_DISPATCHER_INIT(_name) {				\
 	.mutex = __MUTEX_INITIALIZER(_name.mutex),		\
 	.func = &_name##_func,					\
@@ -693,7 +703,8 @@  void bpf_ksym_add(struct bpf_ksym *ksym);
 void bpf_ksym_del(struct bpf_ksym *ksym);
 #else
 static inline int bpf_trampoline_link_prog(struct bpf_prog *prog,
-					   struct bpf_trampoline *tr)
+					   struct bpf_trampoline *tr,
+					   struct bpf_trampoline_batch *batch)
 {
 	return -ENOTSUPP;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bf5a99d803e4..04df4d576fd4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -125,6 +125,7 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TRAMPOLINE_BATCH_ATTACH,
 };
 
 enum bpf_map_type {
@@ -631,6 +632,12 @@  union bpf_attr {
 		__u32 prog_fd;
 	} raw_tracepoint;
 
+	struct { /* anonymous struct used by BPF_TRAMPOLINE_BATCH_ATTACH */
+		__aligned_u64	in;
+		__aligned_u64	out;
+		__u32		count;
+	} trampoline_batch;
+
 	struct { /* anonymous struct for BPF_BTF_LOAD */
 		__aligned_u64	btf;
 		__aligned_u64	btf_log_buf;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 61ef29f9177d..e370b37e3e8e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2553,7 +2553,8 @@  static const struct bpf_link_ops bpf_tracing_link_lops = {
 
 static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 				   int tgt_prog_fd,
-				   u32 btf_id)
+				   u32 btf_id,
+				   struct bpf_trampoline_batch *batch)
 {
 	struct bpf_link_primer link_primer;
 	struct bpf_prog *tgt_prog = NULL;
@@ -2678,7 +2679,7 @@  static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 	if (err)
 		goto out_unlock;
 
-	err = bpf_trampoline_link_prog(prog, tr);
+	err = bpf_trampoline_link_prog(prog, tr, batch);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
 		link = NULL;
@@ -2826,7 +2827,7 @@  static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 			tp_name = prog->aux->attach_func_name;
 			break;
 		}
-		return bpf_tracing_prog_attach(prog, 0, 0);
+		return bpf_tracing_prog_attach(prog, 0, 0, NULL);
 	case BPF_PROG_TYPE_RAW_TRACEPOINT:
 	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
 		if (strncpy_from_user(buf,
@@ -2879,6 +2880,81 @@  static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	return err;
 }
 
+#define BPF_RAW_TRACEPOINT_OPEN_BATCH_LAST_FIELD trampoline_batch.count
+
+static int bpf_trampoline_batch(const union bpf_attr *attr, int cmd)
+{
+	void __user *uout = u64_to_user_ptr(attr->trampoline_batch.out);
+	void __user *uin = u64_to_user_ptr(attr->trampoline_batch.in);
+	struct bpf_trampoline_batch *batch = NULL;
+	struct bpf_prog *prog;
+	int count, ret, i, fd;
+	u32 *in, *out;
+
+	if (CHECK_ATTR(BPF_RAW_TRACEPOINT_OPEN_BATCH))
+		return -EINVAL;
+
+	if (!uin || !uout)
+		return -EINVAL;
+
+	count = attr->trampoline_batch.count;
+
+	in = kcalloc(count, sizeof(u32), GFP_KERNEL);
+	out = kcalloc(count, sizeof(u32), GFP_KERNEL);
+	if (!in || !out) {
+		kfree(in);
+		kfree(out);
+		return -ENOMEM;
+	}
+
+	ret = copy_from_user(in, uin, count * sizeof(u32));
+	if (ret)
+		goto out_clean;
+
+	/* test read out array */
+	ret = copy_to_user(uout, out, count * sizeof(u32));
+	if (ret)
+		goto out_clean;
+
+	batch = bpf_trampoline_batch_alloc(count);
+	if (!batch)
+		goto out_clean;
+
+	for (i = 0; i < count; i++) {
+		if (cmd == BPF_TRAMPOLINE_BATCH_ATTACH) {
+			prog = bpf_prog_get(in[i]);
+			if (IS_ERR(prog)) {
+				ret = PTR_ERR(prog);
+				goto out_clean;
+			}
+
+			ret = -EINVAL;
+			if (prog->type != BPF_PROG_TYPE_TRACING)
+				goto out_clean;
+			if (prog->type == BPF_PROG_TYPE_TRACING &&
+			    prog->expected_attach_type == BPF_TRACE_RAW_TP)
+				goto out_clean;
+
+			fd = bpf_tracing_prog_attach(prog, 0, 0, batch);
+			if (fd < 0)
+				goto out_clean;
+
+			out[i] = fd;
+		}
+	}
+
+	ret = register_ftrace_direct_ips(batch->ips, batch->addrs, batch->idx);
+	if (!ret)
+		WARN_ON_ONCE(copy_to_user(uout, out, count * sizeof(u32)));
+
+out_clean:
+	/* XXX cleanup partialy attached array */
+	bpf_trampoline_batch_free(batch);
+	kfree(in);
+	kfree(out);
+	return ret;
+}
+
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 					     enum bpf_attach_type attach_type)
 {
@@ -4018,7 +4094,8 @@  static int tracing_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *
 	else if (prog->type == BPF_PROG_TYPE_EXT)
 		return bpf_tracing_prog_attach(prog,
 					       attr->link_create.target_fd,
-					       attr->link_create.target_btf_id);
+					       attr->link_create.target_btf_id,
+					       NULL);
 	return -EINVAL;
 }
 
@@ -4437,6 +4514,9 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_RAW_TRACEPOINT_OPEN:
 		err = bpf_raw_tracepoint_open(&attr);
 		break;
+	case BPF_TRAMPOLINE_BATCH_ATTACH:
+		err = bpf_trampoline_batch(&attr, cmd);
+		break;
 	case BPF_BTF_LOAD:
 		err = bpf_btf_load(&attr);
 		break;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 35c5887d82ff..3383644eccc8 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -107,6 +107,51 @@  static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 	return tr;
 }
 
+static int bpf_trampoline_batch_add(struct bpf_trampoline_batch *batch,
+				    unsigned long ip, unsigned long addr)
+{
+	int idx = batch->idx;
+
+	if (idx >= batch->count)
+		return -EINVAL;
+
+	batch->ips[idx] = ip;
+	batch->addrs[idx] = addr;
+	batch->idx++;
+	return 0;
+}
+
+struct bpf_trampoline_batch *bpf_trampoline_batch_alloc(int count)
+{
+	struct bpf_trampoline_batch *batch;
+
+	batch = kmalloc(sizeof(*batch), GFP_KERNEL);
+	if (!batch)
+		return NULL;
+
+	batch->ips = kcalloc(count, sizeof(batch->ips[0]), GFP_KERNEL);
+	batch->addrs = kcalloc(count, sizeof(batch->addrs[0]), GFP_KERNEL);
+	if (!batch->ips || !batch->addrs) {
+		kfree(batch->ips);
+		kfree(batch->addrs);
+		kfree(batch);
+		return NULL;
+	}
+
+	batch->count = count;
+	batch->idx = 0;
+	return batch;
+}
+
+void bpf_trampoline_batch_free(struct bpf_trampoline_batch *batch)
+{
+	if (!batch)
+		return;
+	kfree(batch->ips);
+	kfree(batch->addrs);
+	kfree(batch);
+}
+
 static int is_ftrace_location(void *ip)
 {
 	long addr;
@@ -144,7 +189,8 @@  static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_ad
 }
 
 /* first time registering */
-static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
+static int register_fentry(struct bpf_trampoline *tr, void *new_addr,
+			   struct bpf_trampoline_batch *batch)
 {
 	void *ip = tr->func.addr;
 	int ret;
@@ -154,9 +200,12 @@  static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
 		return ret;
 	tr->func.ftrace_managed = ret;
 
-	if (tr->func.ftrace_managed)
-		ret = register_ftrace_direct((long)ip, (long)new_addr);
-	else
+	if (tr->func.ftrace_managed) {
+		if (batch)
+			ret = bpf_trampoline_batch_add(batch, (long)ip, (long)new_addr);
+		else
+			ret = register_ftrace_direct((long)ip, (long)new_addr);
+	} else
 		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
 	return ret;
 }
@@ -185,7 +234,8 @@  bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total)
 	return tprogs;
 }
 
-static int bpf_trampoline_update(struct bpf_trampoline *tr)
+static int bpf_trampoline_update(struct bpf_trampoline *tr,
+				 struct bpf_trampoline_batch *batch)
 {
 	void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
 	void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
@@ -230,7 +280,7 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr)
 		err = modify_fentry(tr, old_image, new_image);
 	else
 		/* first time registering */
-		err = register_fentry(tr, new_image);
+		err = register_fentry(tr, new_image, batch);
 	if (err)
 		goto out;
 	tr->selector++;
@@ -261,7 +311,8 @@  static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(struct bpf_prog *prog)
 	}
 }
 
-int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
+int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr,
+			     struct bpf_trampoline_batch *batch)
 {
 	enum bpf_tramp_prog_type kind;
 	int err = 0;
@@ -299,7 +350,7 @@  int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
 	}
 	hlist_add_head(&prog->aux->tramp_hlist, &tr->progs_hlist[kind]);
 	tr->progs_cnt[kind]++;
-	err = bpf_trampoline_update(tr);
+	err = bpf_trampoline_update(tr, batch);
 	if (err) {
 		hlist_del(&prog->aux->tramp_hlist);
 		tr->progs_cnt[kind]--;
@@ -326,7 +377,7 @@  int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
 	}
 	hlist_del(&prog->aux->tramp_hlist);
 	tr->progs_cnt[kind]--;
-	err = bpf_trampoline_update(tr);
+	err = bpf_trampoline_update(tr, NULL);
 out:
 	mutex_unlock(&tr->mutex);
 	return err;