
[bpf-next,v2,1/6] bpf: introduce BPF dispatcher

Message ID: 20191123071226.6501-2-bjorn.topel@gmail.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: Introduce the BPF dispatcher and xdp_call.h

Commit Message

Björn Töpel Nov. 23, 2019, 7:12 a.m. UTC
From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher is a multiway branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transforms multiple
indirect calls into direct calls, and therefore avoids the retpoline.
The dispatcher is generated using the BPF JIT, and relies on text
poking provided by bpf_arch_text_poke().

The dispatcher hijacks a trampoline function via the __fentry__ nop
of the trampoline. One dispatcher instance currently supports up to 16
dispatch points. This can be extended in the future.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each unique XDP program has a slot in the
dispatcher, registered by a netdev. The netdev then uses the
dispatcher to call the correct program with a direct call.
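
A sketch of the intended driver-side usage (illustrative only;
DEFINE_XDP_CALL() and xdp_call_run() come from the xdp_call.h wrapper
later in this series, and xdp_call_update() is a hypothetical name for
the slot-update helper, not code in this patch):

/* Hypothetical driver using the dispatcher via xdp_call.h. */
DEFINE_XDP_CALL(mydrv_xdp_call);

static void mydrv_swap_prog(struct bpf_prog *old, struct bpf_prog *new)
{
	/* Move the dispatcher slot from the old program to the new. */
	xdp_call_update(mydrv_xdp_call, old, new);
}

static u32 mydrv_run_prog(struct bpf_prog *prog, struct xdp_buff *xdp)
{
	/* Direct call when prog occupies a dispatcher slot. */
	return xdp_call_run(mydrv_xdp_call, prog, xdp);
}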

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 135 +++++++++++++++++++++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/dispatcher.c     | 208 ++++++++++++++++++++++++++++++++++++
 3 files changed, 344 insertions(+)
 create mode 100644 kernel/bpf/dispatcher.c

Comments

Alexei Starovoitov Nov. 24, 2019, 1:55 a.m. UTC | #1
On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);

could you please add a comment that this is gcc-specific and gate it
with a build_bug_on?
I think even if the compiler stays the same, a change of flags:
RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
may change the name of this helper?
I wonder whether it's possible to make it compiler independent.
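
One possible shape (a sketch only, reusing emit_jump()/EMIT2 from this
file): keep the thunk behind CONFIG_RETPOLINE and emit a plain
indirect jmp otherwise, so the compiler-specific symbol is confined to
a single helper:

static int emit_fallback_jump(u8 **pprog)
{
	u8 *prog = *pprog;
	int err = 0;

#ifdef CONFIG_RETPOLINE
	/* Assumes the compiler emits an external thunk with exactly
	 * this name.
	 */
	err = emit_jump(&prog, __x86_indirect_thunk_rdx, prog);
#else
	int cnt = 0;

	EMIT2(0xFF, 0xE2);	/* jmp rdx */
#endif
	*pprog = prog;
	return err;
}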

> diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> new file mode 100644
> index 000000000000..385dd76ab6d2
> --- /dev/null
> +++ b/kernel/bpf/dispatcher.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2019 Intel Corporation. */
> +
> +#ifdef CONFIG_RETPOLINE

I'm worried that such strong gating will make the code rot, especially
since it's not covered by selftests.
Could you please add xdp_call_run() to generic xdp and add a selftest ?
Also could you please benchmark it without retpoline?
iirc direct call is often faster than indirect, so I suspect this optimization
may benefit non-mitigated kernels.
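
For the selftest, something as small as attaching in generic (skb)
mode would exercise the dispatcher path without any driver support; a
sketch of the userspace side (standard libbpf calls of that era, error
handling trimmed):

#include <bpf/libbpf.h>
#include <linux/if_link.h>
#include <net/if.h>

/* Attach prog in generic XDP mode on the given interface. */
static int attach_generic_xdp(struct bpf_program *prog, const char *ifname)
{
	int ifindex = if_nametoindex(ifname);

	if (!ifindex)
		return -1;
	/* XDP_FLAGS_SKB_MODE selects generic XDP. */
	return bpf_set_link_xdp_fd(ifindex, bpf_program__fd(prog),
				   XDP_FLAGS_SKB_MODE);
}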

> +#define DISPATCHER_HASH_BITS 10
> +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> +
> +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];

there is one DEFINE_XDP_CALL per driver, so total number of such
dispatch routines is pretty small. 1<<10 hash table is overkill.
The hash table itself is overkill :)

How about adding the below:

> +#define BPF_DISPATCHER_MAX 16
> +
> +struct bpf_dispatcher {
> +	struct hlist_node hlist;
> +	void *func;
> +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> +	int num_progs;
> +	void *image;
> +	u64 selector;
> +};

without the hlist and func, to the DEFINE_XDP_CALL() macro?
Then bpf_dispatcher_lookup() would become bpf_dispatcher_init()
and the rest would become a bit simpler?
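
Something like this, perhaps (a sketch; the macro name and exact
layout are guesses):

/* One statically defined dispatcher per DEFINE_XDP_CALL() site:
 * no hash table and no lookup needed.
 */
struct bpf_dispatcher {
	struct mutex mutex;
	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
	int num_progs;
	void *image;
	u64 selector;
};

#define DEFINE_BPF_DISPATCHER(name)					\
	struct bpf_dispatcher bpf_dispatcher_##name = {			\
		.mutex = __MUTEX_INITIALIZER(bpf_dispatcher_##name.mutex), \
	}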

> +
> +	set_vm_flush_reset_perms(image);
> +	set_memory_x((long)image, 1);
> +	d->image = image;

Can you add a common helper for this bit to share between
bpf dispatch and bpf trampoline?
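
E.g. (a sketch; the helper name is made up):

/* Shared executable-image setup for trampolines and dispatchers. */
static void *bpf_image_alloc_exec(void)
{
	void *image = bpf_jit_alloc_exec(PAGE_SIZE);

	if (!image)
		return NULL;

	set_vm_flush_reset_perms(image);
	set_memory_x((long)image, 1);
	return image;
}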

> +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> +	int i, err;
> +
> +	if (!d->num_progs) {
> +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> +				   old_image, NULL);
> +		return;

how does it work? Without doing d->selector = 0; the next addition
will try to do JUMP_TO_JUMP and will fail...

> +	}
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (d->progs[i])
> +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> +	}
> +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> +	if (err)
> +		return;
> +
> +	if (d->selector) {
> +		/* progs already running at this address */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> +					 old_image, new_image);
> +	} else {
> +		/* first time registering */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> +					 NULL, new_image);
> +	}
> +	if (err)
> +		return;
> +	d->selector++;
> +}

Not sure how to share selector logic between dispatch and trampoline.
But above selector=0; weirdness is a sign that sharing is probably necessary?
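
A sketch of the minimal fix in bpf_dispatcher_update(), if sharing
turns out to be too invasive (reset the selector on teardown so the
next add takes the NOP_TO_JUMP path again):

	if (!d->num_progs) {
		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
				   old_image, NULL);
		d->selector = 0;	/* next add is "first time" again */
		return;
	}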

> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +				struct bpf_prog *to)
> +{
> +	struct bpf_dispatcher *d;
> +	bool changed = false;
> +
> +	if (from == to)
> +		return;
> +
> +	mutex_lock(&dispatcher_mutex);
> +	d = bpf_dispatcher_lookup(func);
> +	if (!d)
> +		goto out;
> +
> +	changed |= bpf_dispatcher_remove_prog(d, from);
> +	changed |= bpf_dispatcher_add_prog(d, to);
> +
> +	if (!changed)
> +		goto out;
> +
> +	bpf_dispatcher_update(d);
> +	if (!d->num_progs)
> +		bpf_dispatcher_free(d);

I think I got why it works.
Every time the prog cnt goes to zero you free the trampoline right away
and next time it will be allocated again and kzalloc() will zero selector.
That's hard to spot.
Also if user space does for(;;) attach/detach;
it will keep stressing bpf_jit_alloc_exec.
In case of bpf trampoline attach/detach won't be stressing it.
Only load/unload which are much slower due to verification.
I guess such difference is ok.
Björn Töpel Nov. 24, 2019, 6:55 a.m. UTC | #2
On Sun, 24 Nov 2019 at 02:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
>
> could you please add a comment that this is gcc-specific and gate it
> with a build_bug_on?
> I think even if the compiler stays the same, a change of flags:
> RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
> may change the name of this helper?
> I wonder whether it's possible to make it compiler independent.
>
> > diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> > new file mode 100644
> > index 000000000000..385dd76ab6d2
> > --- /dev/null
> > +++ b/kernel/bpf/dispatcher.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +
> > +#ifdef CONFIG_RETPOLINE
>
> I'm worried that such strong gating will make the code rot, especially
> since it's not covered by selftests.
> Could you please add xdp_call_run() to generic xdp and add a selftest ?
> Also could you please benchmark it without retpoline?
> iirc direct call is often faster than indirect, so I suspect this optimization
> may benefit non-mitigated kernels.
>
> > +#define DISPATCHER_HASH_BITS 10
> > +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> > +
> > +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
>
> there is one DEFINE_XDP_CALL per driver, so total number of such
> dispatch routines is pretty small. 1<<10 hash table is overkill.
> The hash table itself is overkill :)
>
> How about adding the below:
>
> > +#define BPF_DISPATCHER_MAX 16
> > +
> > +struct bpf_dispatcher {
> > +     struct hlist_node hlist;
> > +     void *func;
> > +     struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> > +     int num_progs;
> > +     void *image;
> > +     u64 selector;
> > +};
>
> without the hlist and func, to the DEFINE_XDP_CALL() macro?
> Then bpf_dispatcher_lookup() would become bpf_dispatcher_init()
> and the rest would become a bit simpler?
>
> > +
> > +     set_vm_flush_reset_perms(image);
> > +     set_memory_x((long)image, 1);
> > +     d->image = image;
>
> Can you add a common helper for this bit to share between
> bpf dispatch and bpf trampoline?
>
> > +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> > +{
> > +     void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> > +     void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> > +     s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> > +     int i, err;
> > +
> > +     if (!d->num_progs) {
> > +             bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> > +                                old_image, NULL);
> > +             return;
>
> how does it work? Without doing d->selector = 0; the next addition
> will try to do JUMP_TO_JUMP and will fail...
>
> > +     }
> > +
> > +     for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +             if (d->progs[i])
> > +                     *ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> > +     }
> > +     err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> > +     if (err)
> > +             return;
> > +
> > +     if (d->selector) {
> > +             /* progs already running at this address */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> > +                                      old_image, new_image);
> > +     } else {
> > +             /* first time registering */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> > +                                      NULL, new_image);
> > +     }
> > +     if (err)
> > +             return;
> > +     d->selector++;
> > +}
>
> Not sure how to share selector logic between dispatch and trampoline.
> But above selector=0; weirdness is a sign that sharing is probably necessary?
>
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +                             struct bpf_prog *to)
> > +{
> > +     struct bpf_dispatcher *d;
> > +     bool changed = false;
> > +
> > +     if (from == to)
> > +             return;
> > +
> > +     mutex_lock(&dispatcher_mutex);
> > +     d = bpf_dispatcher_lookup(func);
> > +     if (!d)
> > +             goto out;
> > +
> > +     changed |= bpf_dispatcher_remove_prog(d, from);
> > +     changed |= bpf_dispatcher_add_prog(d, to);
> > +
> > +     if (!changed)
> > +             goto out;
> > +
> > +     bpf_dispatcher_update(d);
> > +     if (!d->num_progs)
> > +             bpf_dispatcher_free(d);
>
> I think I got why it works.
> Every time the prog cnt goes to zero you free the trampoline right away
> and next time it will be allocated again and kzalloc() will zero selector.
> That's hard to spot.
> Also if user space does for(;;) attach/detach;
> it will keep stressing bpf_jit_alloc_exec.
> In case of bpf trampoline attach/detach won't be stressing it.
> Only load/unload which are much slower due to verification.
> I guess such difference is ok.
>

Alexei, thanks for all the feedback (on the weekend)! I agree with all
of the above, especially about the missing selftests and the code
duplication.

I'll do a respin, but that'll be in the next window, given that Linus
will (probably) tag the release today.


Björn
Alexei Starovoitov Nov. 24, 2019, 5:08 p.m. UTC | #3
On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> >
> > I think I got why it works.
> > Every time the prog cnt goes to zero you free the trampoline right away
> > and next time it will be allocated again and kzalloc() will zero selector.
> > That's hard to spot.
> > Also if user space does for(;;) attach/detach;
> > it will keep stressing bpf_jit_alloc_exec.
> > In case of bpf trampoline attach/detach won't be stressing it.
> > Only load/unload which are much slower due to verification.
> > I guess such difference is ok.
> >
> 
> Alexei, thanks for all the feedback (on the weekend)! I agree with all
> of the above, especially about the missing selftests and the code
> duplication.
> 
> I'll do a respin, but that'll be in the next window, given that Linus
> will (probably) tag the release today.

I want it to land just as much as you do :) Two weeks is not a big deal. We
backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
wait for patches to reach Linus's tree. So this dispatch logic will be running
on our servers way sooner than you'd expect. I guess that explains my obsession
with quality. Same goes for libbpf.
Björn Töpel Nov. 24, 2019, 5:16 p.m. UTC | #4
On Sun, 24 Nov 2019 at 18:08, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> > >
> > > I think I got why it works.
> > > Every time the prog cnt goes to zero you free the trampoline right away
> > > and next time it will be allocated again and kzalloc() will zero selector.
> > > That's hard to spot.
> > > Also if user space does for(;;) attach/detach;
> > > it will keep stressing bpf_jit_alloc_exec.
> > > In case of bpf trampoline attach/detach won't be stressing it.
> > > Only load/unload which are much slower due to verification.
> > > I guess such difference is ok.
> > >
> >
> > Alexei, thanks for all the feedback (on the weekend)! I agree with all
> > of the above, especially about the missing selftests and the code
> > duplication.
> >
> > I'll do a respin, but that'll be in the next window, given that Linus
> > will (probably) tag the release today.
>
> I want it to land just as much as you do :) Two weeks is not a big deal. We
> backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
> wait for patches to reach Linus's tree. So this dispatch logic will be running
> on our servers way sooner than you'd expect. I guess that explains my obsession
> with quality. Same goes for libbpf.
>

No reason to rush it in! It's just a week back and forth, and your
comments were spot on.


Cheers,
Björn


Daniel Borkmann Nov. 25, 2019, 10:53 a.m. UTC | #5
On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher is a multiway branch code generator, mainly
> targeted for XDP programs. When an XDP program is executed via
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transforms multiple
> indirect calls into direct calls, and therefore avoids the retpoline.
> The dispatcher is generated using the BPF JIT, and relies on text
> poking provided by bpf_arch_text_poke().
> 
> The dispatcher hijacks a trampoline function via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to 16
> dispatch points. This can be extended in the future.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each unique XDP program has a slot in the
> dispatcher, registered by a netdev. The netdev then uses the
> dispatcher to call the correct program with a direct call.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
[...]
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> +{
> +	u8 *jg_reloc, *jg_target, *prog = *pprog;
> +	int pivot, err, jg_bytes = 1, cnt = 0;
> +	s64 jg_offset;
> +
> +	if (a == b) {
> +		/* Leaf node of recursion, i.e. not a range of indices
> +		 * anymore.
> +		 */
> +		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> +		if (!is_simm32(progs[a]))
> +			return -1;
> +		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> +			    progs[a]);
> +		err = emit_cond_near_jump(&prog,	/* je func */
> +					  (void *)progs[a], prog,
> +					  X86_JE);
> +		if (err)
> +			return err;
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);
> +		if (err)
> +			return err;
> +
> +		*pprog = prog;
> +		return 0;
> +	}
> +
> +	/* Not a leaf node, so we pivot, and recursively descend into
> +	 * the lower and upper ranges.
> +	 */
> +	pivot = (b - a) / 2;
> +	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
> +	if (!is_simm32(progs[a + pivot]))
> +		return -1;
> +	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> +
> +	if (pivot > 2) {				/* jg upper_part */
> +		/* Require near jump. */
> +		jg_bytes = 4;
> +		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> +	} else {
> +		EMIT2(X86_JG, 0);
> +	}
> +	jg_reloc = prog;
> +
> +	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
> +				  progs);
> +	if (err)
> +		return err;
> +
> +	/* Intel 64 and IA-32 Architectures Optimization Reference
> +	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> +	 * Rule 12. (M impact, H generality) All branch targets should
> +	 * be 16-byte aligned.

Isn't this section 3.4.1.4, rule 11, or are you reading a newer manual
than the one on the website [0]? :) Just wondering: in your IXIA tests,
did you see any noticeable slowdowns if you don't do the 16-byte
alignments, as in the rest of the kernel [1,2]?

  [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
  [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
  [2] https://lore.kernel.org/patchwork/patch/560050/

> +	 */
> +	jg_target = PTR_ALIGN(prog, 16);
> +	if (jg_target != prog)
> +		emit_nops(&prog, jg_target - prog);
> +	jg_offset = prog - jg_reloc;
> +	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> +
> +	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
> +				  b, progs);
> +	if (err)
> +		return err;
> +
> +	*pprog = prog;
> +	return 0;
> +}
Björn Töpel Nov. 25, 2019, 3:20 p.m. UTC | #6
On Mon, 25 Nov 2019 at 11:54, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher is a multiway branch code generator, mainly
> > targeted for XDP programs. When an XDP program is executed via
> > bpf_prog_run_xdp(), it is invoked via an indirect call. With
> > retpolines enabled, the indirect call has a substantial performance
> > impact. The dispatcher is a mechanism that transforms multiple
> > indirect calls into direct calls, and therefore avoids the retpoline.
> > The dispatcher is generated using the BPF JIT, and relies on text
> > poking provided by bpf_arch_text_poke().
> >
> > The dispatcher hijacks a trampoline function via the __fentry__ nop
> > of the trampoline. One dispatcher instance currently supports up to 16
> > dispatch points. This can be extended in the future.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each unique XDP program has a slot in the
> > dispatcher, registered by a netdev. The netdev then uses the
> > dispatcher to call the correct program with a direct call.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> [...]
> > +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> > +{
> > +     u8 *jg_reloc, *jg_target, *prog = *pprog;
> > +     int pivot, err, jg_bytes = 1, cnt = 0;
> > +     s64 jg_offset;
> > +
> > +     if (a == b) {
> > +             /* Leaf node of recursion, i.e. not a range of indices
> > +              * anymore.
> > +              */
> > +             EMIT1(add_1mod(0x48, BPF_REG_3));       /* cmp rdx,func */
> > +             if (!is_simm32(progs[a]))
> > +                     return -1;
> > +             EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> > +                         progs[a]);
> > +             err = emit_cond_near_jump(&prog,        /* je func */
> > +                                       (void *)progs[a], prog,
> > +                                       X86_JE);
> > +             if (err)
> > +                     return err;
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
> > +             if (err)
> > +                     return err;
> > +
> > +             *pprog = prog;
> > +             return 0;
> > +     }
> > +
> > +     /* Not a leaf node, so we pivot, and recursively descend into
> > +      * the lower and upper ranges.
> > +      */
> > +     pivot = (b - a) / 2;
> > +     EMIT1(add_1mod(0x48, BPF_REG_3));               /* cmp rdx,func */
> > +     if (!is_simm32(progs[a + pivot]))
> > +             return -1;
> > +     EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> > +
> > +     if (pivot > 2) {                                /* jg upper_part */
> > +             /* Require near jump. */
> > +             jg_bytes = 4;
> > +             EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> > +     } else {
> > +             EMIT2(X86_JG, 0);
> > +     }
> > +     jg_reloc = prog;
> > +
> > +     err = emit_bpf_dispatcher(&prog, a, a + pivot,  /* emit lower_part */
> > +                               progs);
> > +     if (err)
> > +             return err;
> > +
> > +     /* Intel 64 and IA-32 Architectures Optimization Reference
> > +      * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> > +      * Rule 12. (M impact, H generality) All branch targets should
> > +      * be 16-byte aligned.
>
> Isn't this section 3.4.1.4, rule 11, or are you reading a newer manual
> than the one on the website [0]? :)

Argh, no, you're right. Typo. Thanks!

> Just wondering: in your IXIA tests, did you
> see any noticeable slowdowns if you don't do the 16-byte alignments,
> as in the rest of the kernel [1,2]?
>
>   [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>   [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
>   [2] https://lore.kernel.org/patchwork/patch/560050/
>

Interesting! Thanks for the pointers. I'll do more benchmarking for
the next rev, and hopefully we can leave the nops out. Let's see.


Björn

> > +      */
> > +     jg_target = PTR_ALIGN(prog, 16);
> > +     if (jg_target != prog)
> > +             emit_nops(&prog, jg_target - prog);
> > +     jg_offset = prog - jg_reloc;
> > +     emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> > +
> > +     err = emit_bpf_dispatcher(&prog, a + pivot + 1, /* emit upper_part */
> > +                               b, progs);
> > +     if (err)
> > +             return err;
> > +
> > +     *pprog = prog;
> > +     return 0;
> > +}

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 15615c94804f..9ca81bc9e7f3 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -10,10 +10,12 @@ 
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
 #include <linux/memory.h>
+#include <linux/sort.h>
 #include <asm/extable.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
+#include <asm/asm-prototypes.h>
 
 static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 {
@@ -1565,6 +1567,139 @@  int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	return 0;
 }
 
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
+
+static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+	s64 offset;
+
+	offset = func - (ip + 2 + 4);
+	if (!is_simm32(offset)) {
+		pr_err("Target %p is out of range\n", func);
+		return -EINVAL;
+	}
+	EMIT2_off32(0x0F, jmp_cond + 0x10, offset);
+	*pprog = prog;
+	return 0;
+}
+
+static void emit_nops(u8 **pprog, unsigned int len)
+{
+	unsigned int i, noplen;
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	while (len > 0) {
+		noplen = len;
+
+		if (noplen > ASM_NOP_MAX)
+			noplen = ASM_NOP_MAX;
+
+		for (i = 0; i < noplen; i++)
+			EMIT1(ideal_nops[noplen][i]);
+		len -= noplen;
+	}
+
+	*pprog = prog;
+}
+
+static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
+{
+	u8 *jg_reloc, *jg_target, *prog = *pprog;
+	int pivot, err, jg_bytes = 1, cnt = 0;
+	s64 jg_offset;
+
+	if (a == b) {
+		/* Leaf node of recursion, i.e. not a range of indices
+		 * anymore.
+		 */
+		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
+		if (!is_simm32(progs[a]))
+			return -1;
+		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
+			    progs[a]);
+		err = emit_cond_near_jump(&prog,	/* je func */
+					  (void *)progs[a], prog,
+					  X86_JE);
+		if (err)
+			return err;
+
+		err = emit_jump(&prog,			/* jmp thunk */
+				__x86_indirect_thunk_rdx, prog);
+		if (err)
+			return err;
+
+		*pprog = prog;
+		return 0;
+	}
+
+	/* Not a leaf node, so we pivot, and recursively descend into
+	 * the lower and upper ranges.
+	 */
+	pivot = (b - a) / 2;
+	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
+	if (!is_simm32(progs[a + pivot]))
+		return -1;
+	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
+
+	if (pivot > 2) {				/* jg upper_part */
+		/* Require near jump. */
+		jg_bytes = 4;
+		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
+	} else {
+		EMIT2(X86_JG, 0);
+	}
+	jg_reloc = prog;
+
+	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
+				  progs);
+	if (err)
+		return err;
+
+	/* Intel 64 and IA-32 Architectures Optimization Reference
+	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
+	 * Rule 12. (M impact, H generality) All branch targets should
+	 * be 16-byte aligned.
+	 */
+	jg_target = PTR_ALIGN(prog, 16);
+	if (jg_target != prog)
+		emit_nops(&prog, jg_target - prog);
+	jg_offset = prog - jg_reloc;
+	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
+
+	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
+				  b, progs);
+	if (err)
+		return err;
+
+	*pprog = prog;
+	return 0;
+}
+
+static int cmp_ips(const void *a, const void *b)
+{
+	const s64 *ipa = a;
+	const s64 *ipb = b;
+
+	if (*ipa > *ipb)
+		return 1;
+	if (*ipa < *ipb)
+		return -1;
+	return 0;
+}
+
+int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
+{
+	u8 *prog = image;
+
+	sort(funcs, num_funcs, sizeof(funcs[0]), cmp_ips, NULL);
+	return emit_bpf_dispatcher(&prog, 0, num_funcs - 1, funcs);
+}
+
+#endif
+
 struct x64_jit_data {
 	struct bpf_binary_header *header;
 	int *addrs;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 3f671bf617e8..d4f330351f87 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,7 @@  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
new file mode 100644
index 000000000000..385dd76ab6d2
--- /dev/null
+++ b/kernel/bpf/dispatcher.c
@@ -0,0 +1,208 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2019 Intel Corporation. */
+
+#ifdef CONFIG_RETPOLINE
+
+#include <linux/hash.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+/* The BPF dispatcher is a multiway branch code generator. The
+ * dispatcher is a mechanism to avoid the performance penalty of an
+ * indirect call when retpolines are enabled. A dispatch client tries
+ * to register a BPF program into the dispatcher, and if there is
+ * available room in the dispatcher a direct call to the BPF program
+ * will be generated. All calls to the BPF programs called via the
+ * dispatcher will then be direct calls, instead of indirect ones. The
+ * dispatcher hijacks a trampoline function via the __fentry__ of
+ * the trampoline. The trampoline function has the following
+ * signature:
+ *	unsigned int trampoline(
+ *		const void *xdp_ctx,
+ *		const struct bpf_insn *insnsi,
+ *		unsigned int (*bpf_func)(const void *,
+ *					 const struct bpf_insn *));
+ *
+ * Direct use of the dispatcher is discouraged, and instead a wrapper
+ * such as xdp_call.h should be used.
+ */
+
+#define DISPATCHER_HASH_BITS 10
+#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
+
+static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
+
+#define BPF_DISPATCHER_MAX 16
+
+struct bpf_dispatcher {
+	struct hlist_node hlist;
+	void *func;
+	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
+	int num_progs;
+	void *image;
+	u64 selector;
+};
+
+static DEFINE_MUTEX(dispatcher_mutex);
+
+static struct bpf_dispatcher *bpf_dispatcher_lookup(void *func)
+{
+	struct bpf_dispatcher *d;
+	struct hlist_head *head;
+	void *image;
+
+	head = &dispatcher_table[hash_ptr(func, DISPATCHER_HASH_BITS)];
+	hlist_for_each_entry(d, head, hlist) {
+		if (d->func == func)
+			return d;
+	}
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return NULL;
+
+	image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!image) {
+		kfree(d);
+		return NULL;
+	}
+
+	d->func = func;
+	INIT_HLIST_NODE(&d->hlist);
+	hlist_add_head(&d->hlist, head);
+
+	set_vm_flush_reset_perms(image);
+	set_memory_x((long)image, 1);
+	d->image = image;
+	return d;
+}
+
+static void bpf_dispatcher_free(struct bpf_dispatcher *d)
+{
+	bpf_jit_free_exec(d->image);
+	hlist_del(&d->hlist);
+	kfree(d);
+}
+
+static bool bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
+				    struct bpf_prog *prog)
+{
+	struct bpf_prog **entry = NULL;
+	int i;
+
+	if (!prog)
+		return false;
+
+	if (d->num_progs == BPF_DISPATCHER_MAX)
+		return false;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (!entry && !d->progs[i])
+			entry = &d->progs[i];
+		if (d->progs[i] == prog)
+			return false;
+	}
+
+	bpf_prog_inc(prog);
+	*entry = prog;
+	d->num_progs++;
+	return true;
+}
+
+static bool bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
+				       struct bpf_prog *prog)
+{
+	int i;
+
+	if (!prog)
+		return false;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (d->progs[i] == prog) {
+			bpf_prog_put(prog);
+			d->progs[i] = NULL;
+			d->num_progs--;
+			return true;
+		}
+	}
+	return false;
+}
+
+int __weak arch_prepare_bpf_dispatcher(void *image, s64 *funcs,
+				       int num_funcs)
+{
+	return -ENOTSUPP;
+}
+
+static void bpf_dispatcher_update(struct bpf_dispatcher *d)
+{
+	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
+	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
+	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
+	int i, err;
+
+	if (!d->num_progs) {
+		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
+				   old_image, NULL);
+		return;
+	}
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (d->progs[i])
+			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
+	}
+	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
+	if (err)
+		return;
+
+	if (d->selector) {
+		/* progs already running at this address */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
+					 old_image, new_image);
+	} else {
+		/* first time registering */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
+					 NULL, new_image);
+	}
+	if (err)
+		return;
+	d->selector++;
+}
+
+void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
+				struct bpf_prog *to)
+{
+	struct bpf_dispatcher *d;
+	bool changed = false;
+
+	if (from == to)
+		return;
+
+	mutex_lock(&dispatcher_mutex);
+	d = bpf_dispatcher_lookup(func);
+	if (!d)
+		goto out;
+
+	changed |= bpf_dispatcher_remove_prog(d, from);
+	changed |= bpf_dispatcher_add_prog(d, to);
+
+	if (!changed)
+		goto out;
+
+	bpf_dispatcher_update(d);
+	if (!d->num_progs)
+		bpf_dispatcher_free(d);
+out:
+	mutex_unlock(&dispatcher_mutex);
+}
+
+static int __init init_dispatchers(void)
+{
+	int i;
+
+	for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
+		INIT_HLIST_HEAD(&dispatcher_table[i]);
+	return 0;
+}
+late_initcall(init_dispatchers);
+
+#endif