[bpf-next,v2,1/6] bpf: introduce BPF dispatcher

Message ID	20191123071226.6501-2-bjorn.topel@gmail.com
State	Changes Requested
Delegated to:	BPF Maintainers
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: =?utf-8?b?QmrDtnJuIFTDtnBlbA==?= <bjorn.topel@gmail.com> To: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net Cc: =?utf-8?b?QmrDtnJuIFTDtnBlbA==?= <bjorn.topel@intel.com>, bpf@vger.kernel.org, magnus.karlsson@gmail.com, magnus.karlsson@intel.com, jonathan.lemon@gmail.com, ecree@solarflare.com, thoiland@redhat.com, andrii.nakryiko@gmail.com, tariqt@mellanox.com, saeedm@mellanox.com, maximmi@mellanox.com Subject: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Date: Sat, 23 Nov 2019 08:12:20 +0100 Message-Id: <20191123071226.6501-2-bjorn.topel@gmail.com> In-Reply-To: <20191123071226.6501-1-bjorn.topel@gmail.com> References: <20191123071226.6501-1-bjorn.topel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk
Series	Introduce the BPF dispatcher and xdp_call.h \| expand [bpf-next,v2,0/6] Introduce the BPF dispatcher and xdp_call.h [bpf-next,v2,1/6] bpf: introduce BPF dispatcher [bpf-next,v2,2/6] xdp: introduce xdp_call [bpf-next,v2,3/6] i40e: start using xdp_call.h [bpf-next,v2,4/6] ixgbe: start using xdp_call.h [bpf-next,v2,5/6] net/mlx4_en: start using xdp_call.h [bpf-next,v2,6/6] net/mlx5e: Start using xdp_call.h

Björn Töpel Nov. 23, 2019, 7:12 a.m. UTC

From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher is a multiway branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via the
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transform multiple indirect
calls to direct calls, and therefore avoids the retpoline. The
dispatcher is generated using the BPF JIT, and relies on text poking
provided by bpf_arch_text_poke().

The dispatcher hijacks a trampoline function it via the __fentry__ nop
of the trampoline. One dispatcher instance currently supports up to 16
dispatch points. This can be extended in the future.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each unique XDP program has a slot in the
dispatcher, registered by a netdev. The netdev then uses the
dispatcher to call the correct program with a direct call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 135 +++++++++++++++++++++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/dispatcher.c     | 208 ++++++++++++++++++++++++++++++++++++
 3 files changed, 344 insertions(+)
 create mode 100644 kernel/bpf/dispatcher.c

Alexei Starovoitov Nov. 24, 2019, 1:55 a.m. UTC | #1

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);

could you please add a comment that this is gcc specific and gate it
by build_bug_on ?
I think even if compiler stays the change of flags:
RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
may change the name of this helper?
I wonder whether it's possible to make it compiler independent.

> diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> new file mode 100644
> index 000000000000..385dd76ab6d2
> --- /dev/null
> +++ b/kernel/bpf/dispatcher.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2019 Intel Corporation. */
> +
> +#ifdef CONFIG_RETPOLINE

I'm worried that such strong gating will make the code rot. Especially it's not
covered by selftests.
Could you please add xdp_call_run() to generic xdp and add a selftest ?
Also could you please benchmark it without retpoline?
iirc direct call is often faster than indirect, so I suspect this optimization
may benefit non-mitigated kernels.

> +#define DISPATCHER_HASH_BITS 10
> +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> +
> +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];

there is one DEFINE_XDP_CALL per driver, so total number of such
dispatch routines is pretty small. 1<<10 hash table is overkill.
The hash table itself is overkill :)

How about adding below:

> +#define BPF_DISPATCHER_MAX 16
> +
> +struct bpf_dispatcher {
> +	struct hlist_node hlist;
> +	void *func;
> +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> +	int num_progs;
> +	void *image;
> +	u64 selector;
> +};

without hlist and without func to DEFINE_XDP_CALL() macro?
Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
and the rest will become a bit simpler?

> +
> +	set_vm_flush_reset_perms(image);
> +	set_memory_x((long)image, 1);
> +	d->image = image;

Can you add a common helper for this bit to share between
bpf dispatch and bpf trampoline?

> +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> +	int i, err;
> +
> +	if (!d->num_progs) {
> +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> +				   old_image, NULL);
> +		return;

how does it work? Without doing d->selector = 0; the next addition
will try to do JUMP_TO_JUMP and will fail...

> +	}
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (d->progs[i])
> +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> +	}
> +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> +	if (err)
> +		return;
> +
> +	if (d->selector) {
> +		/* progs already running at this address */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> +					 old_image, new_image);
> +	} else {
> +		/* first time registering */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> +					 NULL, new_image);
> +	}
> +	if (err)
> +		return;
> +	d->selector++;
> +}

Not sure how to share selector logic between dispatch and trampoline.
But above selector=0; weirdness is a sign that sharing is probably necessary?

> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +				struct bpf_prog *to)
> +{
> +	struct bpf_dispatcher *d;
> +	bool changed = false;
> +
> +	if (from == to)
> +		return;
> +
> +	mutex_lock(&dispatcher_mutex);
> +	d = bpf_dispatcher_lookup(func);
> +	if (!d)
> +		goto out;
> +
> +	changed |= bpf_dispatcher_remove_prog(d, from);
> +	changed |= bpf_dispatcher_add_prog(d, to);
> +
> +	if (!changed)
> +		goto out;
> +
> +	bpf_dispatcher_update(d);
> +	if (!d->num_progs)
> +		bpf_dispatcher_free(d);

I think I got it why it works.
Every time the prog cnt goes to zero you free the trampoline right away
and next time it will be allocated again and kzalloc() will zero selector.
That's hard to spot.
Also if user space does for(;;) attach/detach;
it will keep stressing bpf_jit_alloc_exec.
In case of bpf trampoline attach/detach won't be stressing it.
Only load/unload which are much slower due to verification.
I guess such difference is ok.

Björn Töpel Nov. 24, 2019, 6:55 a.m. UTC | #2

On Sun, 24 Nov 2019 at 02:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
>
> could you please add a comment that this is gcc specific and gate it
> by build_bug_on ?
> I think even if compiler stays the change of flags:
> RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
> may change the name of this helper?
> I wonder whether it's possible to make it compiler independent.
>
> > diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> > new file mode 100644
> > index 000000000000..385dd76ab6d2
> > --- /dev/null
> > +++ b/kernel/bpf/dispatcher.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +
> > +#ifdef CONFIG_RETPOLINE
>
> I'm worried that such strong gating will make the code rot. Especially it's not
> covered by selftests.
> Could you please add xdp_call_run() to generic xdp and add a selftest ?
> Also could you please benchmark it without retpoline?
> iirc direct call is often faster than indirect, so I suspect this optimization
> may benefit non-mitigated kernels.
>
> > +#define DISPATCHER_HASH_BITS 10
> > +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> > +
> > +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
>
> there is one DEFINE_XDP_CALL per driver, so total number of such
> dispatch routines is pretty small. 1<<10 hash table is overkill.
> The hash table itself is overkill :)
>
> How about adding below:
>
> > +#define BPF_DISPATCHER_MAX 16
> > +
> > +struct bpf_dispatcher {
> > +     struct hlist_node hlist;
> > +     void *func;
> > +     struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> > +     int num_progs;
> > +     void *image;
> > +     u64 selector;
> > +};
>
> without hlist and without func to DEFINE_XDP_CALL() macro?
> Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
> and the rest will become a bit simpler?
>
> > +
> > +     set_vm_flush_reset_perms(image);
> > +     set_memory_x((long)image, 1);
> > +     d->image = image;
>
> Can you add a common helper for this bit to share between
> bpf dispatch and bpf trampoline?
>
> > +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> > +{
> > +     void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> > +     void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> > +     s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> > +     int i, err;
> > +
> > +     if (!d->num_progs) {
> > +             bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> > +                                old_image, NULL);
> > +             return;
>
> how does it work? Without doing d->selector = 0; the next addition
> will try to do JUMP_TO_JUMP and will fail...
>
> > +     }
> > +
> > +     for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +             if (d->progs[i])
> > +                     *ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> > +     }
> > +     err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> > +     if (err)
> > +             return;
> > +
> > +     if (d->selector) {
> > +             /* progs already running at this address */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> > +                                      old_image, new_image);
> > +     } else {
> > +             /* first time registering */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> > +                                      NULL, new_image);
> > +     }
> > +     if (err)
> > +             return;
> > +     d->selector++;
> > +}
>
> Not sure how to share selector logic between dispatch and trampoline.
> But above selector=0; weirdness is a sign that sharing is probably necessary?
>
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +                             struct bpf_prog *to)
> > +{
> > +     struct bpf_dispatcher *d;
> > +     bool changed = false;
> > +
> > +     if (from == to)
> > +             return;
> > +
> > +     mutex_lock(&dispatcher_mutex);
> > +     d = bpf_dispatcher_lookup(func);
> > +     if (!d)
> > +             goto out;
> > +
> > +     changed |= bpf_dispatcher_remove_prog(d, from);
> > +     changed |= bpf_dispatcher_add_prog(d, to);
> > +
> > +     if (!changed)
> > +             goto out;
> > +
> > +     bpf_dispatcher_update(d);
> > +     if (!d->num_progs)
> > +             bpf_dispatcher_free(d);
>
> I think I got it why it works.
> Every time the prog cnt goes to zero you free the trampoline right away
> and next time it will be allocated again and kzalloc() will zero selector.
> That's hard to spot.
> Also if user space does for(;;) attach/detach;
> it will keep stressing bpf_jit_alloc_exec.
> In case of bpf trampoline attach/detach won't be stressing it.
> Only load/unload which are much slower due to verification.
> I guess such difference is ok.
>

Alexei, thanks for all feedback (on the weekend)! I agree with all of
above, and especially missing selftests and too much code duplication.

I'll do a respin, but that'll be in the next window, given that Linus
will (probably) tag the release today.


Björn

Alexei Starovoitov Nov. 24, 2019, 5:08 p.m. UTC | #3

On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> >
> > I think I got it why it works.
> > Every time the prog cnt goes to zero you free the trampoline right away
> > and next time it will be allocated again and kzalloc() will zero selector.
> > That's hard to spot.
> > Also if user space does for(;;) attach/detach;
> > it will keep stressing bpf_jit_alloc_exec.
> > In case of bpf trampoline attach/detach won't be stressing it.
> > Only load/unload which are much slower due to verification.
> > I guess such difference is ok.
> >
> 
> Alexei, thanks for all feedback (on the weekend)! I agree with all of
> above, and especially missing selftests and too much code duplication.
> 
> I'll do a respin, but that'll be in the next window, given that Linus
> will (probably) tag the release today.

I want it to land just as much as you do :) Two weeks is not a big deal. We
backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
wait for patches to reach Linus's tree. So this dispatch logic will be running
on our servers way sooner than you'd expect. I guess that explains my obsession
with quality. Same goes for libbpf.

Björn Töpel Nov. 24, 2019, 5:16 p.m. UTC | #4

On Sun, 24 Nov 2019 at 18:08, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> > >
> > > I think I got it why it works.
> > > Every time the prog cnt goes to zero you free the trampoline right away
> > > and next time it will be allocated again and kzalloc() will zero selector.
> > > That's hard to spot.
> > > Also if user space does for(;;) attach/detach;
> > > it will keep stressing bpf_jit_alloc_exec.
> > > In case of bpf trampoline attach/detach won't be stressing it.
> > > Only load/unload which are much slower due to verification.
> > > I guess such difference is ok.
> > >
> >
> > Alexei, thanks for all feedback (on the weekend)! I agree with all of
> > above, and especially missing selftests and too much code duplication.
> >
> > I'll do a respin, but that'll be in the next window, given that Linus
> > will (probably) tag the release today.
>
> I want it to land just as much as you do :) Two weeks is not a big deal. We
> backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
> wait for patches to reach Linus's tree. So this dispatch logic will be running
> on our servers way sooner than you'd expect. I guess that explains my obsession
> with quality. Same goes for libbpf.
>

No reason to rush it in! It's just a week back and forth, and your
comments were spot on.


Cheers,
Björn


>

Daniel Borkmann Nov. 25, 2019, 10:53 a.m. UTC | #5

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher is a multiway branch code generator, mainly
> targeted for XDP programs. When an XDP program is executed via the
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transform multiple indirect
> calls to direct calls, and therefore avoids the retpoline. The
> dispatcher is generated using the BPF JIT, and relies on text poking
> provided by bpf_arch_text_poke().
> 
> The dispatcher hijacks a trampoline function it via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to 16
> dispatch points. This can be extended in the future.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each unique XDP program has a slot in the
> dispatcher, registered by a netdev. The netdev then uses the
> dispatcher to call the correct program with a direct call.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
[...]
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> +{
> +	u8 *jg_reloc, *jg_target, *prog = *pprog;
> +	int pivot, err, jg_bytes = 1, cnt = 0;
> +	s64 jg_offset;
> +
> +	if (a == b) {
> +		/* Leaf node of recursion, i.e. not a range of indices
> +		 * anymore.
> +		 */
> +		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> +		if (!is_simm32(progs[a]))
> +			return -1;
> +		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> +			    progs[a]);
> +		err = emit_cond_near_jump(&prog,	/* je func */
> +					  (void *)progs[a], prog,
> +					  X86_JE);
> +		if (err)
> +			return err;
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);
> +		if (err)
> +			return err;
> +
> +		*pprog = prog;
> +		return 0;
> +	}
> +
> +	/* Not a leaf node, so we pivot, and recursively descend into
> +	 * the lower and upper ranges.
> +	 */
> +	pivot = (b - a) / 2;
> +	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
> +	if (!is_simm32(progs[a + pivot]))
> +		return -1;
> +	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> +
> +	if (pivot > 2) {				/* jg upper_part */
> +		/* Require near jump. */
> +		jg_bytes = 4;
> +		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> +	} else {
> +		EMIT2(X86_JG, 0);
> +	}
> +	jg_reloc = prog;
> +
> +	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
> +				  progs);
> +	if (err)
> +		return err;
> +
> +	/* Intel 64 and IA-32 ArchitecturesOptimization Reference
> +	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> +	 * Rule 12. (M impact, H generality) All branch targets should
> +	 * be 16-byte aligned.

Isn't this section 3.4.1.4, rule 11 or are you reading a newer manual
than on the website [0]? :) Just wondering, in your IXIA tests, did you
see any noticeable slowdowns if you don't do the 16-byte alignments as
in the rest of the kernel [1,2]?

  [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
  [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
  [2] https://lore.kernel.org/patchwork/patch/560050/

> +	 */
> +	jg_target = PTR_ALIGN(prog, 16);
> +	if (jg_target != prog)
> +		emit_nops(&prog, jg_target - prog);
> +	jg_offset = prog - jg_reloc;
> +	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> +
> +	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
> +				  b, progs);
> +	if (err)
> +		return err;
> +
> +	*pprog = prog;
> +	return 0;
> +}

Björn Töpel Nov. 25, 2019, 3:20 p.m. UTC | #6

On Mon, 25 Nov 2019 at 11:54, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher is a multiway branch code generator, mainly
> > targeted for XDP programs. When an XDP program is executed via the
> > bpf_prog_run_xdp(), it is invoked via an indirect call. With
> > retpolines enabled, the indirect call has a substantial performance
> > impact. The dispatcher is a mechanism that transform multiple indirect
> > calls to direct calls, and therefore avoids the retpoline. The
> > dispatcher is generated using the BPF JIT, and relies on text poking
> > provided by bpf_arch_text_poke().
> >
> > The dispatcher hijacks a trampoline function it via the __fentry__ nop
> > of the trampoline. One dispatcher instance currently supports up to 16
> > dispatch points. This can be extended in the future.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each unique XDP program has a slot in the
> > dispatcher, registered by a netdev. The netdev then uses the
> > dispatcher to call the correct program with a direct call.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> [...]
> > +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> > +{
> > +     u8 *jg_reloc, *jg_target, *prog = *pprog;
> > +     int pivot, err, jg_bytes = 1, cnt = 0;
> > +     s64 jg_offset;
> > +
> > +     if (a == b) {
> > +             /* Leaf node of recursion, i.e. not a range of indices
> > +              * anymore.
> > +              */
> > +             EMIT1(add_1mod(0x48, BPF_REG_3));       /* cmp rdx,func */
> > +             if (!is_simm32(progs[a]))
> > +                     return -1;
> > +             EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> > +                         progs[a]);
> > +             err = emit_cond_near_jump(&prog,        /* je func */
> > +                                       (void *)progs[a], prog,
> > +                                       X86_JE);
> > +             if (err)
> > +                     return err;
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
> > +             if (err)
> > +                     return err;
> > +
> > +             *pprog = prog;
> > +             return 0;
> > +     }
> > +
> > +     /* Not a leaf node, so we pivot, and recursively descend into
> > +      * the lower and upper ranges.
> > +      */
> > +     pivot = (b - a) / 2;
> > +     EMIT1(add_1mod(0x48, BPF_REG_3));               /* cmp rdx,func */
> > +     if (!is_simm32(progs[a + pivot]))
> > +             return -1;
> > +     EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> > +
> > +     if (pivot > 2) {                                /* jg upper_part */
> > +             /* Require near jump. */
> > +             jg_bytes = 4;
> > +             EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> > +     } else {
> > +             EMIT2(X86_JG, 0);
> > +     }
> > +     jg_reloc = prog;
> > +
> > +     err = emit_bpf_dispatcher(&prog, a, a + pivot,  /* emit lower_part */
> > +                               progs);
> > +     if (err)
> > +             return err;
> > +
> > +     /* Intel 64 and IA-32 ArchitecturesOptimization Reference
> > +      * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> > +      * Rule 12. (M impact, H generality) All branch targets should
> > +      * be 16-byte aligned.
>
> Isn't this section 3.4.1.4, rule 11 or are you reading a newer manual
> than on the website [0]? :)

Argh, no, you're right. Typo. Thanks!

> Just wondering, in your IXIA tests, did you
> see any noticeable slowdowns if you don't do the 16-byte alignments as
> in the rest of the kernel [1,2]?
>
>   [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>   [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
>   [2] https://lore.kernel.org/patchwork/patch/560050/
>

Interesting! Thanks for the pointers. I'll do more benchmarking for
the next rev, and hopefully we can leave the nops out. Let's see.


Björn

> > +      */
> > +     jg_target = PTR_ALIGN(prog, 16);
> > +     if (jg_target != prog)
> > +             emit_nops(&prog, jg_target - prog);
> > +     jg_offset = prog - jg_reloc;
> > +     emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> > +
> > +     err = emit_bpf_dispatcher(&prog, a + pivot + 1, /* emit upper_part */
> > +                               b, progs);
> > +     if (err)
> > +             return err;
> > +
> > +     *pprog = prog;
> > +     return 0;
> > +}

[bpf-next,v2,1/6] bpf: introduce BPF dispatcher

Commit Message

Comments

Patch