[RFC,bpf-next,2/4] bpf: introduce BPF dispatcher

Message ID	20191113204737.31623-3-bjorn.topel@gmail.com
State	RFC
Delegated to:	BPF Maintainers
Headers	show Return-Path: <bpf-owner@vger.kernel.org> From: =?utf-8?b?QmrDtnJuIFTDtnBlbA==?= <bjorn.topel@gmail.com> To: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net Cc: =?utf-8?b?QmrDtnJuIFTDtnBlbA==?= <bjorn.topel@intel.com>, bpf@vger.kernel.org, magnus.karlsson@gmail.com, magnus.karlsson@intel.com, jonathan.lemon@gmail.com Subject: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Date: Wed, 13 Nov 2019 21:47:35 +0100 Message-Id: <20191113204737.31623-3-bjorn.topel@gmail.com> In-Reply-To: <20191113204737.31623-1-bjorn.topel@gmail.com> References: <20191113204737.31623-1-bjorn.topel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: bpf-owner@vger.kernel.org Precedence: bulk
Series	Introduce xdp_call.h and the BPF dispatcher \| expand [RFC,bpf-next,0/4] Introduce xdp_call.h and the BPF dispatcher [RFC,bpf-next,1/4] bpf: teach bpf_arch_text_poke() jumps [RFC,bpf-next,2/4] bpf: introduce BPF dispatcher [RFC,bpf-next,3/4] xdp: introduce xdp_call [RFC,bpf-next,4/4] i40e: start using xdp_call.h

Björn Töpel Nov. 13, 2019, 8:47 p.m. UTC

From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher builds on top of the BPF trampoline ideas;
Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
code. The dispatcher builds a dispatch table for XDP programs, for
retpoline avoidance. The table is a simple binary search model, so
lookup is O(log n). Here, the dispatch table is limited to four
entries (for laziness reason -- only 1B relative jumps :-P). If the
dispatch table is full, it will fallback to the retpoline path.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each netdev allocate a slot in the dispatcher
and a BPF program. The netdev then uses the dispatcher to call the
correct program with a direct call (actually a tail-call).

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 kernel/bpf/dispatcher.c

Edward Cree Nov. 13, 2019, 9:40 p.m. UTC | #1

On 13/11/2019 20:47, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
The first-come-first-served model for dispatcher slots might mean that
 a low-traffic user ends up getting priority while a higher-traffic
 user is stuck with the retpoline fallback.  Have you considered using
 a learning mechanism, like in my dynamic call RFC [1] earlier this
 year?  (Though I'm sure a better learning mechanism than the one I
 used there could be devised.)

> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +				   struct bpf_prog *prog)
> +{
> +	struct bpf_prog **entry = NULL;
> +	int i, err = 0;
> +
> +	if (d->num_progs == BPF_DISPATCHER_MAX)
> +		return err;
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (!entry && !d->progs[i])
> +			entry = &d->progs[i];
> +		if (d->progs[i] == prog)
> +			return err;
> +	}
> +
> +	prog = bpf_prog_inc(prog);
> +	if (IS_ERR(prog))
> +		return err;
> +
> +	*entry = prog;
> +	d->num_progs++;
> +	return err;
> +}
If I'm reading this function right, it always returns zero; was that
 the intention, and if so why isn't it void?

-Ed

[1] https://lkml.org/lkml/2019/2/1/948

Björn Töpel Nov. 14, 2019, 6:29 a.m. UTC | #2

On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>
> On 13/11/2019 20:47, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher builds on top of the BPF trampoline ideas;
> > Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> > code. The dispatcher builds a dispatch table for XDP programs, for
> > retpoline avoidance. The table is a simple binary search model, so
> > lookup is O(log n). Here, the dispatch table is limited to four
> > entries (for laziness reason -- only 1B relative jumps :-P). If the
> > dispatch table is full, it will fallback to the retpoline path.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each netdev allocate a slot in the dispatcher
> > and a BPF program. The netdev then uses the dispatcher to call the
> > correct program with a direct call (actually a tail-call).
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> The first-come-first-served model for dispatcher slots might mean that
>  a low-traffic user ends up getting priority while a higher-traffic
>  user is stuck with the retpoline fallback.  Have you considered using
>  a learning mechanism, like in my dynamic call RFC [1] earlier this
>  year?  (Though I'm sure a better learning mechanism than the one I
>  used there could be devised.)
>

My rationale was that this mechanism would almost exclusively be used
by physical HW NICs using XDP. My hunch was that the number of netdevs
would be ~4, and typically less using XDP, so a more sophisticated
mechanism didn't really make sense IMO. However, your approach is more
generic and doesn't require any arch specific work. What was the push
back for your work? I'll read up on the thread. I'm intrigued to take
your series for a spin!

> > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> > +                                struct bpf_prog *prog)
> > +{
> > +     struct bpf_prog **entry = NULL;
> > +     int i, err = 0;
> > +
> > +     if (d->num_progs == BPF_DISPATCHER_MAX)
> > +             return err;
> > +
> > +     for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +             if (!entry && !d->progs[i])
> > +                     entry = &d->progs[i];
> > +             if (d->progs[i] == prog)
> > +                     return err;
> > +     }
> > +
> > +     prog = bpf_prog_inc(prog);
> > +     if (IS_ERR(prog))
> > +             return err;
> > +
> > +     *entry = prog;
> > +     d->num_progs++;
> > +     return err;
> > +}
> If I'm reading this function right, it always returns zero; was that
>  the intention, and if so why isn't it void?
>

Ugh, yeah. In general the error handling should be improved in this
RFC. If it makes sense to move forward with this series, I'll make
sure to address that. Thanks for taking a look, and for the pointers
to your work!


Cheers,
Björn

> -Ed
>
> [1] https://lkml.org/lkml/2019/2/1/948

Edward Cree Nov. 14, 2019, 10:18 a.m. UTC | #3

On 14/11/2019 06:29, Björn Töpel wrote:
> On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>> On 13/11/2019 20:47, Björn Töpel wrote:
>> The first-come-first-served model for dispatcher slots might mean that
>>  a low-traffic user ends up getting priority while a higher-traffic
>>  user is stuck with the retpoline fallback.  Have you considered using
>>  a learning mechanism, like in my dynamic call RFC [1] earlier this
>>  year?  (Though I'm sure a better learning mechanism than the one I
>>  used there could be devised.)
> My rationale was that this mechanism would almost exclusively be used
> by physical HW NICs using XDP. My hunch was that the number of netdevs
> would be ~4, and typically less using XDP, so a more sophisticated
> mechanism didn't really make sense IMO.
That seems reasonable in most cases, although I can imagine systems with
 a couple of four-port boards being a thing.  I suppose the netdevs are
 likely to all have the same XDP prog, though, and if I'm reading your
 code right it seems they'd share a slot in that case.

> However, your approach is more
> generic and doesn't require any arch specific work. What was the push
> back for your work?
Mainly that I couldn't demonstrate a performance benefit from the few
 call sites I annotated, and others working in the area felt that
 manual annotation wouldn't scale — Nadav Amit had a different approach
 [2] that used a GCC plugin to apply a dispatcher on an opt-out basis
 to all the indirect calls in the kernel; the discussion on that got
 bogged down in interactions between text patching and perf tracing
 which all went *waaaay* over my head.  AFAICT the static_call series I
 was depending on never got merged, and I'm not sure if anyone's still
 working on it.

-Ed

[2] https://lkml.org/lkml/2018/12/31/19

Björn Töpel Nov. 14, 2019, 11:21 a.m. UTC | #4

On Thu, 14 Nov 2019 at 11:19, Edward Cree <ecree@solarflare.com> wrote:
>
> On 14/11/2019 06:29, Björn Töpel wrote:
[...]
> > My rationale was that this mechanism would almost exclusively be used
> > by physical HW NICs using XDP. My hunch was that the number of netdevs
> > would be ~4, and typically less using XDP, so a more sophisticated
> > mechanism didn't really make sense IMO.
> That seems reasonable in most cases, although I can imagine systems with
>  a couple of four-port boards being a thing.  I suppose the netdevs are
>  likely to all have the same XDP prog, though, and if I'm reading your
>  code right it seems they'd share a slot in that case.
>

Yup, correct!

> > However, your approach is more
> > generic and doesn't require any arch specific work. What was the push
> > back for your work?
> Mainly that I couldn't demonstrate a performance benefit from the few
>  call sites I annotated, and others working in the area felt that
>  manual annotation wouldn't scale — Nadav Amit had a different approach
>  [2] that used a GCC plugin to apply a dispatcher on an opt-out basis
>  to all the indirect calls in the kernel; the discussion on that got
>  bogged down in interactions between text patching and perf tracing
>  which all went *waaaay* over my head.  AFAICT the static_call series I
>  was depending on never got merged, and I'm not sure if anyone's still
>  working on it.
>

Again, thanks for the pointers. PeterZ is (hopefully) still working on
the static_call stuff [3]. The static_call_inline would be a good fit
here, and maybe even using static_call as a patchpad/dispatcher like
you did is a better route. I will checkout Nadav's work!


Björn

[3] https://lore.kernel.org/lkml/20191007090225.44108711.6@infradead.org/#r

> -Ed
>
> [2] https://lkml.org/lkml/2018/12/31/19

Toke Høiland-Jørgensen Nov. 14, 2019, 12:31 p.m. UTC | #5

Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.

So it's O(log n) with n == 4? Have you compared the performance of just
doing four linear compare-and-jumps? Seems to me it may not be that big
of a difference for such a small N?

> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).

Is it really accurate to call it a tail call? To me, that would imply
that it increments the tail call limit counter and all that? Isn't this
just a direct jump using the trampoline stuff?

-Toke

Daniel Borkmann Nov. 14, 2019, 1:03 p.m. UTC | #6

On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> code. The dispatcher builds a dispatch table for XDP programs, for
>> retpoline avoidance. The table is a simple binary search model, so
>> lookup is O(log n). Here, the dispatch table is limited to four
>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> dispatch table is full, it will fallback to the retpoline path.
> 
> So it's O(log n) with n == 4? Have you compared the performance of just
> doing four linear compare-and-jumps? Seems to me it may not be that big
> of a difference for such a small N?

Did you perform some microbenchmarks wrt search tree? Mainly wondering
since for code emission for switch/case statements, clang/gcc turns off
indirect calls entirely under retpoline, see [0] from back then.

>> An example: A module/driver allocates a dispatcher. The dispatcher is
>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>> and a BPF program. The netdev then uses the dispatcher to call the
>> correct program with a direct call (actually a tail-call).
> 
> Is it really accurate to call it a tail call? To me, that would imply
> that it increments the tail call limit counter and all that? Isn't this
> just a direct jump using the trampoline stuff?

Not meant in BPF context here, but more general [1].

(For actual BPF tail calls I have a series close to ready for getting
rid of most indirect calls which I'll post later today.)

Best,
Daniel

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
   [1] https://en.wikipedia.org/wiki/Tail_call

Toke Høiland-Jørgensen Nov. 14, 2019, 1:09 p.m. UTC | #7

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>>> code. The dispatcher builds a dispatch table for XDP programs, for
>>> retpoline avoidance. The table is a simple binary search model, so
>>> lookup is O(log n). Here, the dispatch table is limited to four
>>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>>> dispatch table is full, it will fallback to the retpoline path.
>> 
>> So it's O(log n) with n == 4? Have you compared the performance of just
>> doing four linear compare-and-jumps? Seems to me it may not be that big
>> of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.

Yes, this was exactly the example I had in mind :)

>>> An example: A module/driver allocates a dispatcher. The dispatcher is
>>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>>> and a BPF program. The netdev then uses the dispatcher to call the
>>> correct program with a direct call (actually a tail-call).
>> 
>> Is it really accurate to call it a tail call? To me, that would imply
>> that it increments the tail call limit counter and all that? Isn't this
>> just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].

Ah, right, that makes more sense.

> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)

Cool!

-Toke

Björn Töpel Nov. 14, 2019, 1:56 p.m. UTC | #8

On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> > Björn Töpel <bjorn.topel@gmail.com> writes:
> >> From: Björn Töpel <bjorn.topel@intel.com>
> >>
> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> >> code. The dispatcher builds a dispatch table for XDP programs, for
> >> retpoline avoidance. The table is a simple binary search model, so
> >> lookup is O(log n). Here, the dispatch table is limited to four
> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
> >> dispatch table is full, it will fallback to the retpoline path.
> >
> > So it's O(log n) with n == 4? Have you compared the performance of just
> > doing four linear compare-and-jumps? Seems to me it may not be that big
> > of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.
>

As Toke stated, binsearch is not needed for 4 entries. I started out
with 16 (and explicit ids instead of pointers), and there it made more
sense. If folks think it's a good idea to move forward -- and with 4
entries, it makes sense to make the code generator easier, or maybe
based on static_calls like Ed did.

As for ubenchmarks I only compared with 1 cmp, vs 3 vs 4 + retpoline
stated in the cover. For a proper patch I can do more in-depth
analysis. Or was it anything particular you were looking for?

For switch/case code generation there's a great paper on that here [3]
from the 2008 GCC dev summit ("A Superoptimizer Analysis of Multiway
Branch Code Generation")

[3] http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=968AE756567863243AC7B1728915861A?doi=10.1.1.602.1875&rep=rep1&type=pdf


> >> An example: A module/driver allocates a dispatcher. The dispatcher is
> >> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> >> and a BPF program. The netdev then uses the dispatcher to call the
> >> correct program with a direct call (actually a tail-call).
> >
> > Is it really accurate to call it a tail call? To me, that would imply
> > that it increments the tail call limit counter and all that? Isn't this
> > just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].
>
> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)
>

Thanks for the clarification, Daniel! (call vs jmp)

> Best,
> Daniel
>
>    [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
>    [1] https://en.wikipedia.org/wiki/Tail_call

Peter Zijlstra Nov. 14, 2019, 1:58 p.m. UTC | #9

On Thu, Nov 14, 2019 at 12:21:27PM +0100, Björn Töpel wrote:
> Again, thanks for the pointers. PeterZ is (hopefully) still working on
> the static_call stuff [3]. The static_call_inline would be a good fit
> here, and maybe even using static_call as a patchpad/dispatcher like
> you did is a better route. I will checkout Nadav's work!

Yes, I'll repost that once the current pile of text_poke and ftrace
changes lands.

Toke Høiland-Jørgensen Nov. 14, 2019, 2:55 p.m. UTC | #10

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> > Björn Töpel <bjorn.topel@gmail.com> writes:
>> >> From: Björn Töpel <bjorn.topel@intel.com>
>> >>
>> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> >> code. The dispatcher builds a dispatch table for XDP programs, for
>> >> retpoline avoidance. The table is a simple binary search model, so
>> >> lookup is O(log n). Here, the dispatch table is limited to four
>> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> >> dispatch table is full, it will fallback to the retpoline path.
>> >
>> > So it's O(log n) with n == 4? Have you compared the performance of just
>> > doing four linear compare-and-jumps? Seems to me it may not be that big
>> > of a difference for such a small N?
>>
>> Did you perform some microbenchmarks wrt search tree? Mainly wondering
>> since for code emission for switch/case statements, clang/gcc turns off
>> indirect calls entirely under retpoline, see [0] from back then.
>>
>
> As Toke stated, binsearch is not needed for 4 entries. I started out
> with 16 (and explicit ids instead of pointers), and there it made more
> sense. If folks think it's a good idea to move forward -- and with 4
> entries, it makes sense to make the code generator easier, or maybe
> based on static_calls like Ed did.

I don't really have anything to back it up, but my hunch is that only 4
entries will end up being a limit that people are going to end up
hitting. And since the performance falls off quite the cliff after
hitting that limit, I do fear that this is something we will hear about
quite emphatically :)

-Toke

Björn Töpel Nov. 14, 2019, 3:03 p.m. UTC | #11

On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
[...]
> I don't really have anything to back it up, but my hunch is that only 4
> entries will end up being a limit that people are going to end up
> hitting. And since the performance falls off quite the cliff after
> hitting that limit, I do fear that this is something we will hear about
> quite emphatically :)
>

Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
but I don't think that's normal for deployments. ;-)

Björn

Toke Høiland-Jørgensen Nov. 14, 2019, 3:12 p.m. UTC | #12

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
> [...]
>> I don't really have anything to back it up, but my hunch is that only 4
>> entries will end up being a limit that people are going to end up
>> hitting. And since the performance falls off quite the cliff after
>> hitting that limit, I do fear that this is something we will hear about
>> quite emphatically :)
>>
>
> Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
> but I don't think that's normal for deployments. ;-)

Well, the fact that products like this exist kinda indicates that this
is something people are doing, no?

https://small-tree.net/products/10gbe-switches/p3e10g-6-sr/

-Toke

Alexei Starovoitov Nov. 15, 2019, 12:30 a.m. UTC | #13

On Wed, Nov 13, 2019 at 09:47:35PM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>  
>  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
>  {
> @@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
>  	return 0;
>  }
>  
> +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
> +
> +/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX,
> + * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log
> + * n).
> + */
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs,
> +			       u8 *fb)
> +{
> +	u8 *prog = *pprog, *jg_reloc;
> +	int pivot, err, cnt = 0;
> +	s64 jmp_offset;
> +
> +	if (a == b) {
> +		emit_mov_imm64(&prog, BPF_REG_0,	/* movabs func,%rax */
> +			       progs[a] >> 32,
> +			       (progs[a] << 32) >> 32);

Could you try optimizing emit_mov_imm64() to recognize s32 ?
iirc there was a single x86 insns that could move and sign extend.
That should cut down on bytecode size and probably make things a bit faster?
Another alternative is compare lower 32-bit only, since on x86-64 upper 32
should be ~0 anyway for bpf prog pointers.
Looking at bookkeeping code, I think I should be able to generalize bpf
trampoline a bit and share the code for bpf dispatch.
Could you also try aligning jmp target a bit by inserting nops?
Some x86 cpus are sensitive to jmp target alignment. Even without considering
JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
here that will be pushing millions of calls through bpf dispatch.

Björn Töpel Nov. 15, 2019, 7:56 a.m. UTC | #14

On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
[...]
>
> Could you try optimizing emit_mov_imm64() to recognize s32 ?
> iirc there was a single x86 insns that could move and sign extend.
> That should cut down on bytecode size and probably make things a bit faster?
> Another alternative is compare lower 32-bit only, since on x86-64 upper 32
> should be ~0 anyway for bpf prog pointers.

Good ideas, thanks! I'll do the optimization, extend it to >4 entries
(as Toke suggested), and do a non-RFC respin.

> Looking at bookkeeping code, I think I should be able to generalize bpf
> trampoline a bit and share the code for bpf dispatch.

Ok, good!

> Could you also try aligning jmp target a bit by inserting nops?
> Some x86 cpus are sensitive to jmp target alignment. Even without considering
> JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
> here that will be pushing millions of calls through bpf dispatch.
>

Yeah, I need to address the Jcc bug anyway, so that makes sense.

Another thought; I'm using the fentry nop as patch point, so it wont
play nice with other users of fentry atm -- but the plan is to move to
Steve's *_ftrace_direct work at some point, correct?

Björn

Alexei Starovoitov Nov. 15, 2019, 9:58 p.m. UTC | #15

On Fri, Nov 15, 2019 at 08:56:59AM +0100, Björn Töpel wrote:
> On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> [...]
> >
> > Could you try optimizing emit_mov_imm64() to recognize s32 ?
> > iirc there was a single x86 insns that could move and sign extend.
> > That should cut down on bytecode size and probably make things a bit faster?
> > Another alternative is compare lower 32-bit only, since on x86-64 upper 32
> > should be ~0 anyway for bpf prog pointers.
> 
> Good ideas, thanks! I'll do the optimization, extend it to >4 entries
> (as Toke suggested), and do a non-RFC respin.
> 
> > Looking at bookkeeping code, I think I should be able to generalize bpf
> > trampoline a bit and share the code for bpf dispatch.
> 
> Ok, good!
> 
> > Could you also try aligning jmp target a bit by inserting nops?
> > Some x86 cpus are sensitive to jmp target alignment. Even without considering
> > JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
> > here that will be pushing millions of calls through bpf dispatch.
> >
> 
> Yeah, I need to address the Jcc bug anyway, so that makes sense.
> 
> Another thought; I'm using the fentry nop as patch point, so it wont
> play nice with other users of fentry atm -- but the plan is to move to
> Steve's *_ftrace_direct work at some point, correct?

Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
Steven has a bunch more in his tree for merging, so I cannot just pull
all of ftrace api features into bpf-next. So "be nice to other fentry users"
would have to be done during merge window or shortly after in bpf-next tree
after window closes. I think it's fine. In bpf dispatch case it's really
one dummy function we're talking about. If it was marked 'notrace'
from get go no one would blink. It's a dummy function not interesting
for ftrac-ing and not interesting from live patching pov.

Björn Töpel Nov. 18, 2019, 10:03 a.m. UTC | #16

On Fri, 15 Nov 2019 at 22:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
[...]
> > Another thought; I'm using the fentry nop as patch point, so it wont
> > play nice with other users of fentry atm -- but the plan is to move to
> > Steve's *_ftrace_direct work at some point, correct?
>
> Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
> Steven has a bunch more in his tree for merging, so I cannot just pull
> all of ftrace api features into bpf-next. So "be nice to other fentry users"
> would have to be done during merge window or shortly after in bpf-next tree
> after window closes. I think it's fine.

Yup, I agree.

> In bpf dispatch case it's really
> one dummy function we're talking about. If it was marked 'notrace'
> from get go no one would blink. It's a dummy function not interesting
> for ftrac-ing and not interesting from live patching pov.
>

...but marking it with 'notrace' would remove the __fentry__ nop.
Anyways, the "be nice" approach is OK.

Andrii Nakryiko Nov. 18, 2019, 7:36 p.m. UTC | #17

On Wed, Nov 13, 2019 at 12:48 PM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
>
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>

[...]

> +
> +int arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                               int num_progs)
> +{
> +       u64 ips[BPF_DISPATCHER_MAX] = {};
> +       u8 *fallback, *prog = image;
> +       int i, err, cnt = 0;
> +
> +       if (!num_progs || num_progs > BPF_DISPATCHER_MAX)
> +               return -EINVAL;
> +
> +       for (i = 0; i < num_progs; i++)
> +               ips[i] = (u64)progs[i]->bpf_func;
> +
> +       EMIT2(0xEB, 5); /* jmp rip+5 (skip retpoline) */
> +       fallback = prog;
> +       err = emit_jmp(&prog,   /* jmp retpoline */
> +                      __x86_indirect_thunk_rdx, prog);
> +       if (err)
> +               return err;
> +
> +       sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);

nit: sizeof(ips[i]) looks weird...

> +       return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
> +}
> +
>  struct x64_jit_data {
>         struct bpf_binary_header *header;
>         int *addrs;

[...]

> +
> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +                                  struct bpf_prog *prog)
> +{
> +       struct bpf_prog **entry = NULL;
> +       int i, err = 0;
> +
> +       if (d->num_progs == BPF_DISPATCHER_MAX)
> +               return err;

err == 0, not what you want, probably

> +
> +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +               if (!entry && !d->progs[i])
> +                       entry = &d->progs[i];
> +               if (d->progs[i] == prog)
> +                       return err;
> +       }
> +
> +       prog = bpf_prog_inc(prog);
> +       if (IS_ERR(prog))
> +               return err;
> +
> +       *entry = prog;
> +       d->num_progs++;
> +       return err;
> +}
> +
> +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
> +                                      struct bpf_prog *prog)
> +{
> +       int i;
> +
> +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +               if (d->progs[i] == prog) {
> +                       bpf_prog_put(prog);
> +                       d->progs[i] = NULL;
> +                       d->num_progs--;

instead of allowing holes, why not swap removed prog with the last on
in d->progs?

> +                       break;
> +               }
> +       }
> +}
> +
> +int __weak arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                                      int num_ids)
> +{
> +       return -ENOTSUPP;
> +}
> +
> +/* NB! bpf_dispatcher_update() might free the dispatcher! */
> +static int bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +       void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +       void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +       int err;
> +
> +       if (d->num_progs == 0) {
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_NOP,
> +                                        old_image, NULL);
> +               bpf_dispatcher_free(d);
> +               goto out;
> +       }
> +
> +       err = arch_prepare_bpf_dispatcher(new_image, &d->progs[0],
> +                                         d->num_progs);
> +       if (err)
> +               goto out;
> +
> +       if (d->selector)
> +               /* progs already running at this address */
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_JMP,
> +                                        old_image, new_image);
> +       else
> +               /* first time registering */
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JMP,
> +                                        NULL, new_image);
> +
> +       if (err)
> +               goto out;
> +       d->selector++;
> +
> +out:
> +       return err;
> +}
> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +                               struct bpf_prog *to)
> +{
> +       struct bpf_dispatcher *d;
> +
> +       if (!from && !to)
> +               return;
> +
> +       mutex_lock(&dispatcher_mutex);
> +       d = bpf_dispatcher_lookup(func);
> +       if (!d)
> +               goto out;
> +
> +       if (from)
> +               bpf_dispatcher_remove_prog(d, from);
> +
> +       if (to)
> +               bpf_dispatcher_add_prog(d, to);

this can fail

> +
> +       WARN_ON(bpf_dispatcher_update(d));

shouldn't dispatcher be removed from the list before freed? It seems
like handling dispatches freeing is better done here explicitly (and
you won't need to leave a NB remark)

> +
> +out:
> +       mutex_unlock(&dispatcher_mutex);
> +}
> +
> +static int __init init_dispatchers(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
> +               INIT_HLIST_HEAD(&dispatcher_table[i]);
> +       return 0;
> +}
> +late_initcall(init_dispatchers);
> +
> +#endif
> --
> 2.20.1
>

Björn Töpel Nov. 18, 2019, 8:11 p.m. UTC | #18

On Mon, 18 Nov 2019 at 20:36, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
[...]
> > +       sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);
>
> nit: sizeof(ips[i]) looks weird...
>

Ick! Thanks for spotting.

> > +       return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
> > +}
> > +
> >  struct x64_jit_data {
> >         struct bpf_binary_header *header;
> >         int *addrs;
>
> [...]
>
> > +
> > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> > +                                  struct bpf_prog *prog)
> > +{
> > +       struct bpf_prog **entry = NULL;
> > +       int i, err = 0;
> > +
> > +       if (d->num_progs == BPF_DISPATCHER_MAX)
> > +               return err;
>
> err == 0, not what you want, probably
>

No, the error handling in this RFC is bad. I'll fix that in the patch proper!

[...]
> > +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
> > +                                      struct bpf_prog *prog)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +               if (d->progs[i] == prog) {
> > +                       bpf_prog_put(prog);
> > +                       d->progs[i] = NULL;
> > +                       d->num_progs--;
>
> instead of allowing holes, why not swap removed prog with the last on
> in d->progs?
>

Yeah, holes is a no go. I'll redo that.

[...]

> > +
> > +       WARN_ON(bpf_dispatcher_update(d));
>
> shouldn't dispatcher be removed from the list before freed? It seems
> like handling dispatches freeing is better done here explicitly (and
> you won't need to leave a NB remark)
>

I agree. Let's make that explicit. I'll send out a patch proper in a day or two.

Thanks for looking at the code, Andrii!


Björn

[RFC,bpf-next,2/4] bpf: introduce BPF dispatcher

Commit Message

Comments

Patch