Message ID:   20191113204737.31623-3-bjorn.topel@gmail.com
State:        RFC
Delegated to: BPF Maintainers
Series:       Introduce xdp_call.h and the BPF dispatcher
On 13/11/2019 20:47, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
>
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
The first-come-first-served model for dispatcher slots might mean that
a low-traffic user ends up getting priority while a higher-traffic
user is stuck with the retpoline fallback. Have you considered using
a learning mechanism, like in my dynamic call RFC [1] earlier this
year? (Though I'm sure a better learning mechanism than the one I
used there could be devised.)

> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +                                   struct bpf_prog *prog)
> +{
> +        struct bpf_prog **entry = NULL;
> +        int i, err = 0;
> +
> +        if (d->num_progs == BPF_DISPATCHER_MAX)
> +                return err;
> +
> +        for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +                if (!entry && !d->progs[i])
> +                        entry = &d->progs[i];
> +                if (d->progs[i] == prog)
> +                        return err;
> +        }
> +
> +        prog = bpf_prog_inc(prog);
> +        if (IS_ERR(prog))
> +                return err;
> +
> +        *entry = prog;
> +        d->num_progs++;
> +        return err;
> +}
If I'm reading this function right, it always returns zero; was that
the intention, and if so why isn't it void?

-Ed

[1] https://lkml.org/lkml/2019/2/1/948
On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>
> On 13/11/2019 20:47, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher builds on top of the BPF trampoline ideas;
> > Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> > code. The dispatcher builds a dispatch table for XDP programs, for
> > retpoline avoidance. The table is a simple binary search model, so
> > lookup is O(log n). Here, the dispatch table is limited to four
> > entries (for laziness reason -- only 1B relative jumps :-P). If the
> > dispatch table is full, it will fallback to the retpoline path.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each netdev allocate a slot in the dispatcher
> > and a BPF program. The netdev then uses the dispatcher to call the
> > correct program with a direct call (actually a tail-call).
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> The first-come-first-served model for dispatcher slots might mean that
> a low-traffic user ends up getting priority while a higher-traffic
> user is stuck with the retpoline fallback. Have you considered using
> a learning mechanism, like in my dynamic call RFC [1] earlier this
> year? (Though I'm sure a better learning mechanism than the one I
> used there could be devised.)
>
My rationale was that this mechanism would almost exclusively be used
by physical HW NICs using XDP. My hunch was that the number of netdevs
would be ~4, and typically less using XDP, so a more sophisticated
mechanism didn't really make sense IMO. However, your approach is more
generic and doesn't require any arch specific work. What was the push
back for your work? I'll read up on the thread. I'm intrigued to take
your series for a spin!
> > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> > +                                   struct bpf_prog *prog)
> > +{
> > +        struct bpf_prog **entry = NULL;
> > +        int i, err = 0;
> > +
> > +        if (d->num_progs == BPF_DISPATCHER_MAX)
> > +                return err;
> > +
> > +        for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +                if (!entry && !d->progs[i])
> > +                        entry = &d->progs[i];
> > +                if (d->progs[i] == prog)
> > +                        return err;
> > +        }
> > +
> > +        prog = bpf_prog_inc(prog);
> > +        if (IS_ERR(prog))
> > +                return err;
> > +
> > +        *entry = prog;
> > +        d->num_progs++;
> > +        return err;
> > +}
> If I'm reading this function right, it always returns zero; was that
> the intention, and if so why isn't it void?
>
Ugh, yeah. In general the error handling should be improved in this
RFC. If it makes sense to move forward with this series, I'll make
sure to address that.

Thanks for taking a look, and for the pointers to your work!

Cheers,
Björn

> -Ed
>
> [1] https://lkml.org/lkml/2019/2/1/948
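[Editor's note: to make the error-handling point concrete, here is a small userspace model of the slot bookkeeping with distinct error returns instead of an always-zero `err`. The structure mirrors the quoted RFC function, but the names, error-code choices, and the omission of `bpf_prog_inc()`-style reference counting are illustrative assumptions, not the eventual kernel fix.]

```c
#include <errno.h>
#include <stddef.h>

#define BPF_DISPATCHER_MAX 4

struct dispatcher {
	void *progs[BPF_DISPATCHER_MAX];
	int num_progs;
};

/* Sketch of the add path with real error codes: -ENOSPC when the
 * table is full, 0 when the program is already present (idempotent
 * add, mirroring the RFC's early return), and the first free slot
 * claimed otherwise. Reference counting is elided in this model.
 */
static int dispatcher_add_prog(struct dispatcher *d, void *prog)
{
	void **entry = NULL;
	int i;

	if (d->num_progs == BPF_DISPATCHER_MAX)
		return -ENOSPC;

	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
		if (!entry && !d->progs[i])
			entry = &d->progs[i];
		if (d->progs[i] == prog)
			return 0;	/* already installed */
	}

	*entry = prog;
	d->num_progs++;
	return 0;
}
```

Note that, like the RFC code, this still checks fullness before checking for an existing entry; a production version might want to reorder those tests.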
On 14/11/2019 06:29, Björn Töpel wrote:
> On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>> On 13/11/2019 20:47, Björn Töpel wrote:
>> The first-come-first-served model for dispatcher slots might mean that
>> a low-traffic user ends up getting priority while a higher-traffic
>> user is stuck with the retpoline fallback. Have you considered using
>> a learning mechanism, like in my dynamic call RFC [1] earlier this
>> year? (Though I'm sure a better learning mechanism than the one I
>> used there could be devised.)
> My rationale was that this mechanism would almost exclusively be used
> by physical HW NICs using XDP. My hunch was that the number of netdevs
> would be ~4, and typically less using XDP, so a more sophisticated
> mechanism didn't really make sense IMO.
That seems reasonable in most cases, although I can imagine systems with
a couple of four-port boards being a thing. I suppose the netdevs are
likely to all have the same XDP prog, though, and if I'm reading your
code right it seems they'd share a slot in that case.

> However, your approach is more
> generic and doesn't require any arch specific work. What was the push
> back for your work?
Mainly that I couldn't demonstrate a performance benefit from the few
call sites I annotated, and others working in the area felt that
manual annotation wouldn't scale — Nadav Amit had a different approach
[2] that used a GCC plugin to apply a dispatcher on an opt-out basis
to all the indirect calls in the kernel; the discussion on that got
bogged down in interactions between text patching and perf tracing
which all went *waaaay* over my head. AFAICT the static_call series I
was depending on never got merged, and I'm not sure if anyone's still
working on it.

-Ed

[2] https://lkml.org/lkml/2018/12/31/19
On Thu, 14 Nov 2019 at 11:19, Edward Cree <ecree@solarflare.com> wrote:
>
> On 14/11/2019 06:29, Björn Töpel wrote:
[...]
> > My rationale was that this mechanism would almost exclusively be used
> > by physical HW NICs using XDP. My hunch was that the number of netdevs
> > would be ~4, and typically less using XDP, so a more sophisticated
> > mechanism didn't really make sense IMO.
> That seems reasonable in most cases, although I can imagine systems with
> a couple of four-port boards being a thing. I suppose the netdevs are
> likely to all have the same XDP prog, though, and if I'm reading your
> code right it seems they'd share a slot in that case.
>
Yup, correct!

> > However, your approach is more
> > generic and doesn't require any arch specific work. What was the push
> > back for your work?
> Mainly that I couldn't demonstrate a performance benefit from the few
> call sites I annotated, and others working in the area felt that
> manual annotation wouldn't scale — Nadav Amit had a different approach
> [2] that used a GCC plugin to apply a dispatcher on an opt-out basis
> to all the indirect calls in the kernel; the discussion on that got
> bogged down in interactions between text patching and perf tracing
> which all went *waaaay* over my head. AFAICT the static_call series I
> was depending on never got merged, and I'm not sure if anyone's still
> working on it.
>
Again, thanks for the pointers. PeterZ is (hopefully) still working on
the static_call stuff [3]. The static_call_inline would be a good fit
here, and maybe even using static_call as a patchpad/dispatcher like
you did is a better route. I will checkout Nadav's work!

Björn

[3] https://lore.kernel.org/lkml/20191007090225.44108711.6@infradead.org/#r

> -Ed
>
> [2] https://lkml.org/lkml/2018/12/31/19
Björn Töpel <bjorn.topel@gmail.com> writes:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.

So it's O(log n) with n == 4? Have you compared the performance of just
doing four linear compare-and-jumps? Seems to me it may not be that big
of a difference for such a small N?

> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).

Is it really accurate to call it a tail call? To me, that would imply
that it increments the tail call limit counter and all that? Isn't this
just a direct jump using the trampoline stuff?

-Toke
On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> code. The dispatcher builds a dispatch table for XDP programs, for
>> retpoline avoidance. The table is a simple binary search model, so
>> lookup is O(log n). Here, the dispatch table is limited to four
>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> dispatch table is full, it will fallback to the retpoline path.
>
> So it's O(log n) with n == 4? Have you compared the performance of just
> doing four linear compare-and-jumps? Seems to me it may not be that big
> of a difference for such a small N?

Did you perform some microbenchmarks wrt search tree? Mainly wondering
since for code emission for switch/case statements, clang/gcc turns off
indirect calls entirely under retpoline, see [0] from back then.

>> An example: A module/driver allocates a dispatcher. The dispatcher is
>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>> and a BPF program. The netdev then uses the dispatcher to call the
>> correct program with a direct call (actually a tail-call).
>
> Is it really accurate to call it a tail call? To me, that would imply
> that it increments the tail call limit counter and all that? Isn't this
> just a direct jump using the trampoline stuff?

Not meant in BPF context here, but more general [1].

(For actual BPF tail calls I have a series close to ready for getting
rid of most indirect calls which I'll post later today.)

Best,
Daniel

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
[1] https://en.wikipedia.org/wiki/Tail_call
Daniel Borkmann <daniel@iogearbox.net> writes:
> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>>> code. The dispatcher builds a dispatch table for XDP programs, for
>>> retpoline avoidance. The table is a simple binary search model, so
>>> lookup is O(log n). Here, the dispatch table is limited to four
>>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>>> dispatch table is full, it will fallback to the retpoline path.
>>
>> So it's O(log n) with n == 4? Have you compared the performance of just
>> doing four linear compare-and-jumps? Seems to me it may not be that big
>> of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.

Yes, this was exactly the example I had in mind :)

>>> An example: A module/driver allocates a dispatcher. The dispatcher is
>>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>>> and a BPF program. The netdev then uses the dispatcher to call the
>>> correct program with a direct call (actually a tail-call).
>>
>> Is it really accurate to call it a tail call? To me, that would imply
>> that it increments the tail call limit counter and all that? Isn't this
>> just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].

Ah, right, that makes more sense.

> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)

Cool!

-Toke
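[Editor's note: the "four linear compare-and-jumps" shape Toke suggests — the same shape clang/gcc emit for small retpoline'd switch statements — can be modelled in plain C. This is an illustrative userspace sketch of the dispatch logic only, not the JIT emitter; in the emitted code each table entry would be an immediate and each hit a direct call rather than C's indirect call through `table[i]`.]

```c
#include <stddef.h>

typedef int (*prog_fn)(void *ctx);

/* Stand-in "BPF programs" for the sketch. */
static int prog_a(void *ctx) { (void)ctx; return 1; }
static int prog_b(void *ctx) { (void)ctx; return 2; }
static int prog_c(void *ctx) { (void)ctx; return 3; }

/* The retpoline path: one guarded indirect call. */
static int fallback(prog_fn fn, void *ctx) { return fn(ctx); }

/* Model of a four-entry compare-and-call chain: each slot is tested
 * in order; a miss on all four falls back to the retpoline.
 */
static int dispatch_linear(prog_fn table[4], prog_fn fn, void *ctx)
{
	if (fn == table[0]) return table[0](ctx);
	if (fn == table[1]) return table[1](ctx);
	if (fn == table[2]) return table[2](ctx);
	if (fn == table[3]) return table[3](ctx);
	return fallback(fn, ctx);	/* table full or unknown prog */
}
```

For n = 4 this chain costs at most four compares, versus two or three for the balanced tree, which is the gist of the "may not be that big of a difference" argument.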
On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> > Björn Töpel <bjorn.topel@gmail.com> writes:
> >> From: Björn Töpel <bjorn.topel@intel.com>
> >>
> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> >> code. The dispatcher builds a dispatch table for XDP programs, for
> >> retpoline avoidance. The table is a simple binary search model, so
> >> lookup is O(log n). Here, the dispatch table is limited to four
> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
> >> dispatch table is full, it will fallback to the retpoline path.
> >
> > So it's O(log n) with n == 4? Have you compared the performance of just
> > doing four linear compare-and-jumps? Seems to me it may not be that big
> > of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.
>
As Toke stated, binsearch is not needed for 4 entries. I started out
with 16 (and explicit ids instead of pointers), and there it made more
sense. If folks think it's a good idea to move forward -- and with 4
entries, it makes sense to make the code generator easier, or maybe
based on static_calls like Ed did.

As for ubenchmarks I only compared with 1 cmp, vs 3 vs 4 + retpoline
stated in the cover. For a proper patch I can do more in-depth
analysis. Or was it anything particular you were looking for?

For switch/case code generation there's a great paper on that here [3]
from the 2008 GCC dev summit ("A Superoptimizer Analysis of Multiway
Branch Code Generation")

[3] http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=968AE756567863243AC7B1728915861A?doi=10.1.1.602.1875&rep=rep1&type=pdf

> >> An example: A module/driver allocates a dispatcher. The dispatcher is
> >> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> >> and a BPF program. The netdev then uses the dispatcher to call the
> >> correct program with a direct call (actually a tail-call).
> >
> > Is it really accurate to call it a tail call? To me, that would imply
> > that it increments the tail call limit counter and all that? Isn't this
> > just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].
>
> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)
>
Thanks for the clarification, Daniel! (call vs jmp)

> Best,
> Daniel
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
> [1] https://en.wikipedia.org/wiki/Tail_call
On Thu, Nov 14, 2019 at 12:21:27PM +0100, Björn Töpel wrote:
> Again, thanks for the pointers. PeterZ is (hopefully) still working on
> the static_call stuff [3]. The static_call_inline would be a good fit
> here, and maybe even using static_call as a patchpad/dispatcher like
> you did is a better route. I will checkout Nadav's work!

Yes, I'll repost that once the current pile of text_poke and ftrace
changes lands.
Björn Töpel <bjorn.topel@gmail.com> writes:
> On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> > Björn Töpel <bjorn.topel@gmail.com> writes:
>> >> From: Björn Töpel <bjorn.topel@intel.com>
>> >>
>> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> >> code. The dispatcher builds a dispatch table for XDP programs, for
>> >> retpoline avoidance. The table is a simple binary search model, so
>> >> lookup is O(log n). Here, the dispatch table is limited to four
>> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> >> dispatch table is full, it will fallback to the retpoline path.
>> >
>> > So it's O(log n) with n == 4? Have you compared the performance of just
>> > doing four linear compare-and-jumps? Seems to me it may not be that big
>> > of a difference for such a small N?
>>
>> Did you perform some microbenchmarks wrt search tree? Mainly wondering
>> since for code emission for switch/case statements, clang/gcc turns off
>> indirect calls entirely under retpoline, see [0] from back then.
>>
> As Toke stated, binsearch is not needed for 4 entries. I started out
> with 16 (and explicit ids instead of pointers), and there it made more
> sense. If folks think it's a good idea to move forward -- and with 4
> entries, it makes sense to make the code generator easier, or maybe
> based on static_calls like Ed did.

I don't really have anything to back it up, but my hunch is that only 4
entries will end up being a limit that people are going to end up
hitting. And since the performance falls off quite the cliff after
hitting that limit, I do fear that this is something we will hear about
quite emphatically :)

-Toke
On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
[...]
> I don't really have anything to back it up, but my hunch is that only 4
> entries will end up being a limit that people are going to end up
> hitting. And since the performance falls off quite the cliff after
> hitting that limit, I do fear that this is something we will hear about
> quite emphatically :)
>
Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
but I don't think that's normal for deployments. ;-)

Björn
Björn Töpel <bjorn.topel@gmail.com> writes:
> On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
> [...]
>> I don't really have anything to back it up, but my hunch is that only 4
>> entries will end up being a limit that people are going to end up
>> hitting. And since the performance falls off quite the cliff after
>> hitting that limit, I do fear that this is something we will hear about
>> quite emphatically :)
>>
> Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
> but I don't think that's normal for deployments. ;-)

Well, the fact that products like this exist kinda indicates that this
is something people are doing, no?
https://small-tree.net/products/10gbe-switches/p3e10g-6-sr/

-Toke
On Wed, Nov 13, 2019 at 09:47:35PM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
>
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>
>  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
>  {
> @@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
>  	return 0;
>  }
>
> +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
> +
> +/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX,
> + * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log
> + * n).
> + */
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs,
> +			       u8 *fb)
> +{
> +	u8 *prog = *pprog, *jg_reloc;
> +	int pivot, err, cnt = 0;
> +	s64 jmp_offset;
> +
> +	if (a == b) {
> +		emit_mov_imm64(&prog, BPF_REG_0, /* movabs func,%rax */
> +			       progs[a] >> 32,
> +			       (progs[a] << 32) >> 32);

Could you try optimizing emit_mov_imm64() to recognize s32 ?
iirc there was a single x86 insns that could move and sign extend.
That should cut down on bytecode size and probably make things a bit faster?
Another alternative is compare lower 32-bit only, since on x86-64 upper 32
should be ~0 anyway for bpf prog pointers.

Looking at bookkeeping code, I think I should be able to generalize bpf
trampoline a bit and share the code for bpf dispatch.

Could you also try aligning jmp target a bit by inserting nops?
Some x86 cpus are sensitive to jmp target alignment. Even without considering
JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
here that will be pushing millions of calls through bpf dispatch.
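[Editor's note: the single move-and-sign-extend instruction Alexei alludes to is presumably `mov r64, imm32` (REX.W + C7 /0), which encodes in 7 bytes and sign-extends a 32-bit immediate to 64 bits, versus 10 bytes for `movabs`. A userspace sketch of the eligibility check such an `emit_mov_imm64()` optimization would need — the helper name is invented for illustration:]

```c
#include <stdint.h>
#include <stdbool.h>

/* A 64-bit immediate can be emitted with the 7-byte sign-extending
 * "mov rax, imm32" exactly when it survives a round trip through
 * sign extension of its low 32 bits. Kernel pointers (upper bits
 * all ones) typically qualify, which is Alexei's point about bpf
 * prog pointers.
 */
static bool imm_fits_s32(uint64_t imm)
{
	return (int64_t)imm == (int64_t)(int32_t)(uint32_t)imm;
}
```

When the check fails, the emitter would fall back to the 10-byte `movabs` as before.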
On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
[...]
>
> Could you try optimizing emit_mov_imm64() to recognize s32 ?
> iirc there was a single x86 insns that could move and sign extend.
> That should cut down on bytecode size and probably make things a bit faster?
> Another alternative is compare lower 32-bit only, since on x86-64 upper 32
> should be ~0 anyway for bpf prog pointers.

Good ideas, thanks! I'll do the optimization, extend it to >4 entries
(as Toke suggested), and do a non-RFC respin.

> Looking at bookkeeping code, I think I should be able to generalize bpf
> trampoline a bit and share the code for bpf dispatch.

Ok, good!

> Could you also try aligning jmp target a bit by inserting nops?
> Some x86 cpus are sensitive to jmp target alignment. Even without considering
> JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
> here that will be pushing millions of calls through bpf dispatch.
>
Yeah, I need to address the Jcc bug anyway, so that makes sense.

Another thought; I'm using the fentry nop as patch point, so it wont
play nice with other users of fentry atm -- but the plan is to move to
Steve's *_ftrace_direct work at some point, correct?

Björn
On Fri, Nov 15, 2019 at 08:56:59AM +0100, Björn Töpel wrote:
> On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> [...]
> >
> > Could you try optimizing emit_mov_imm64() to recognize s32 ?
> > iirc there was a single x86 insns that could move and sign extend.
> > That should cut down on bytecode size and probably make things a bit faster?
> > Another alternative is compare lower 32-bit only, since on x86-64 upper 32
> > should be ~0 anyway for bpf prog pointers.
>
> Good ideas, thanks! I'll do the optimization, extend it to >4 entries
> (as Toke suggested), and do a non-RFC respin.
>
> > Looking at bookkeeping code, I think I should be able to generalize bpf
> > trampoline a bit and share the code for bpf dispatch.
>
> Ok, good!
>
> > Could you also try aligning jmp target a bit by inserting nops?
> > Some x86 cpus are sensitive to jmp target alignment. Even without considering
> > JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
> > here that will be pushing millions of calls through bpf dispatch.
> >
> Yeah, I need to address the Jcc bug anyway, so that makes sense.
>
> Another thought; I'm using the fentry nop as patch point, so it wont
> play nice with other users of fentry atm -- but the plan is to move to
> Steve's *_ftrace_direct work at some point, correct?

Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
Steven has a bunch more in his tree for merging, so I cannot just pull
all of ftrace api features into bpf-next. So "be nice to other fentry users"
would have to be done during merge window or shortly after in bpf-next tree
after window closes. I think it's fine. In bpf dispatch case it's really
one dummy function we're talking about. If it was marked 'notrace'
from get go no one would blink. It's a dummy function not interesting
for ftrac-ing and not interesting from live patching pov.
On Fri, 15 Nov 2019 at 22:58, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
[...]
> > Another thought; I'm using the fentry nop as patch point, so it wont
> > play nice with other users of fentry atm -- but the plan is to move to
> > Steve's *_ftrace_direct work at some point, correct?
>
> Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
> Steven has a bunch more in his tree for merging, so I cannot just pull
> all of ftrace api features into bpf-next. So "be nice to other fentry users"
> would have to be done during merge window or shortly after in bpf-next tree
> after window closes. I think it's fine.

Yup, I agree.

> In bpf dispatch case it's really
> one dummy function we're talking about. If it was marked 'notrace'
> from get go no one would blink. It's a dummy function not interesting
> for ftrac-ing and not interesting from live patching pov.
>
...but marking it with 'notrace' would remove the __fentry__ nop.
Anyways, the "be nice" approach is OK.
On Wed, Nov 13, 2019 at 12:48 PM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
>
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>

[...]

> +
> +int arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                                int num_progs)
> +{
> +        u64 ips[BPF_DISPATCHER_MAX] = {};
> +        u8 *fallback, *prog = image;
> +        int i, err, cnt = 0;
> +
> +        if (!num_progs || num_progs > BPF_DISPATCHER_MAX)
> +                return -EINVAL;
> +
> +        for (i = 0; i < num_progs; i++)
> +                ips[i] = (u64)progs[i]->bpf_func;
> +
> +        EMIT2(0xEB, 5); /* jmp rip+5 (skip retpoline) */
> +        fallback = prog;
> +        err = emit_jmp(&prog,   /* jmp retpoline */
> +                       __x86_indirect_thunk_rdx, prog);
> +        if (err)
> +                return err;
> +
> +        sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);

nit: sizeof(ips[i]) looks weird...

> +        return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
> +}
> +
>  struct x64_jit_data {
>          struct bpf_binary_header *header;
>          int *addrs;

[...]

> +
> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +                                   struct bpf_prog *prog)
> +{
> +        struct bpf_prog **entry = NULL;
> +        int i, err = 0;
> +
> +        if (d->num_progs == BPF_DISPATCHER_MAX)
> +                return err;

err == 0, not what you want, probably

> +
> +        for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +                if (!entry && !d->progs[i])
> +                        entry = &d->progs[i];
> +                if (d->progs[i] == prog)
> +                        return err;
> +        }
> +
> +        prog = bpf_prog_inc(prog);
> +        if (IS_ERR(prog))
> +                return err;
> +
> +        *entry = prog;
> +        d->num_progs++;
> +        return err;
> +}
> +
> +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
> +                                       struct bpf_prog *prog)
> +{
> +        int i;
> +
> +        for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +                if (d->progs[i] == prog) {
> +                        bpf_prog_put(prog);
> +                        d->progs[i] = NULL;
> +                        d->num_progs--;

instead of allowing holes, why not swap removed prog with the last on
in d->progs?

> +                        break;
> +                }
> +        }
> +}
> +
> +int __weak arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                                       int num_ids)
> +{
> +        return -ENOTSUPP;
> +}
> +
> +/* NB! bpf_dispatcher_update() might free the dispatcher! */
> +static int bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +        void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +        void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +        int err;
> +
> +        if (d->num_progs == 0) {
> +                err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_NOP,
> +                                         old_image, NULL);
> +                bpf_dispatcher_free(d);
> +                goto out;
> +        }
> +
> +        err = arch_prepare_bpf_dispatcher(new_image, &d->progs[0],
> +                                          d->num_progs);
> +        if (err)
> +                goto out;
> +
> +        if (d->selector)
> +                /* progs already running at this address */
> +                err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_JMP,
> +                                         old_image, new_image);
> +        else
> +                /* first time registering */
> +                err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JMP,
> +                                         NULL, new_image);
> +
> +        if (err)
> +                goto out;
> +        d->selector++;
> +
> +out:
> +        return err;
> +}
> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +                                struct bpf_prog *to)
> +{
> +        struct bpf_dispatcher *d;
> +
> +        if (!from && !to)
> +                return;
> +
> +        mutex_lock(&dispatcher_mutex);
> +        d = bpf_dispatcher_lookup(func);
> +        if (!d)
> +                goto out;
> +
> +        if (from)
> +                bpf_dispatcher_remove_prog(d, from);
> +
> +        if (to)
> +                bpf_dispatcher_add_prog(d, to);

this can fail

> +
> +        WARN_ON(bpf_dispatcher_update(d));

shouldn't dispatcher be removed from the list before freed? It seems
like handling dispatches freeing is better done here explicitly (and
you won't need to leave a NB remark)

> +
> +out:
> +        mutex_unlock(&dispatcher_mutex);
> +}
> +
> +static int __init init_dispatchers(void)
> +{
> +        int i;
> +
> +        for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
> +                INIT_HLIST_HEAD(&dispatcher_table[i]);
> +        return 0;
> +}
> +late_initcall(init_dispatchers);
> +
> +#endif
> --
> 2.20.1
>
On Mon, 18 Nov 2019 at 20:36, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > [...] > > + sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL); > > nit: sizeof(ips[i]) looks weird... > Ick! Thanks for spotting. > > + return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback); > > +} > > + > > struct x64_jit_data { > > struct bpf_binary_header *header; > > int *addrs; > > [...] > > > + > > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d, > > + struct bpf_prog *prog) > > +{ > > + struct bpf_prog **entry = NULL; > > + int i, err = 0; > > + > > + if (d->num_progs == BPF_DISPATCHER_MAX) > > + return err; > > err == 0, not what you want, probably > No, the error handling in this RFC is bad. I'll fix that in the patch proper! [...] > > +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d, > > + struct bpf_prog *prog) > > +{ > > + int i; > > + > > + for (i = 0; i < BPF_DISPATCHER_MAX; i++) { > > + if (d->progs[i] == prog) { > > + bpf_prog_put(prog); > > + d->progs[i] = NULL; > > + d->num_progs--; > > instead of allowing holes, why not swap removed prog with the last on > in d->progs? > Yeah, holes is a no go. I'll redo that. [...] > > + > > + WARN_ON(bpf_dispatcher_update(d)); > > shouldn't dispatcher be removed from the list before freed? It seems > like handling dispatches freeing is better done here explicitly (and > you won't need to leave a NB remark) > I agree. Let's make that explicit. I'll send out a patch proper in a day or two. Thanks for looking at the code, Andrii! Björn
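One possible shape for the fixed error handling in the add path, as a userspace sketch (the specific error values, the `fake_*` types, and the infallible `fake_prog_inc()` refcount stand-in are my assumptions, not the final patch): each exit now reports a distinct result instead of unconditionally returning 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BPF_DISPATCHER_MAX 4

/* Stand-ins for the kernel types/helpers; only control flow matters. */
struct fake_prog { int refcnt; };

struct fake_dispatcher {
	struct fake_prog *progs[BPF_DISPATCHER_MAX];
	int num_progs;
};

/* Models bpf_prog_inc(); the real helper can fail, this one cannot. */
static struct fake_prog *fake_prog_inc(struct fake_prog *prog)
{
	prog->refcnt++;
	return prog;
}

/* -EBUSY when the table is full (caller falls back to the retpoline
 * path), 0 when the prog is already installed or was added. */
static int dispatcher_add_prog(struct fake_dispatcher *d,
			       struct fake_prog *prog)
{
	struct fake_prog **entry = NULL;
	int i;

	if (d->num_progs == BPF_DISPATCHER_MAX)
		return -EBUSY;

	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
		if (!entry && !d->progs[i])
			entry = &d->progs[i];
		if (d->progs[i] == prog)
			return 0; /* already installed */
	}

	*entry = fake_prog_inc(prog);
	d->num_progs++;
	return 0;
}
```

In the kernel version the `bpf_prog_inc()` failure would also be propagated (e.g. `PTR_ERR(prog)`) rather than swallowed.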
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 28782a1c386e..d75aebf508b8 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -10,10 +10,12 @@ #include <linux/if_vlan.h> #include <linux/bpf.h> #include <linux/memory.h> +#include <linux/sort.h> #include <asm/extable.h> #include <asm/set_memory.h> #include <asm/nospec-branch.h> #include <asm/text-patching.h> +#include <asm/asm-prototypes.h> static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len) { @@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags return 0; } +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE) + +/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX, + * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log + * n). + */ +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs, + u8 *fb) +{ + u8 *prog = *pprog, *jg_reloc; + int pivot, err, cnt = 0; + s64 jmp_offset; + + if (a == b) { + emit_mov_imm64(&prog, BPF_REG_0, /* movabs func,%rax */ + progs[a] >> 32, + (progs[a] << 32) >> 32); + EMIT3(0x48, 0x39, 0xC2); /* cmp rdx, rax */ + jmp_offset = fb - (prog + 2); + if (!is_imm8(jmp_offset)) + return -1; + EMIT2(X86_JNE, jmp_offset); /* jne retpoline */ + err = emit_jmp(&prog, (void *)progs[a], /* jmp bpf_prog */ + prog); + if (err) + return err; + goto out; + } + + pivot = (b - a) / 2; + emit_mov_imm64(&prog, BPF_REG_0, progs[a + pivot] >> 32, + (progs[a + pivot] << 32) >> 32); + EMIT3(0x48, 0x39, 0xC2); /* cmp rdx, rax */ + + jg_reloc = prog; + EMIT2(X86_JG, 0); /* jg pivot + 1-part */ + + err = emit_bpf_dispatcher(&prog, a, a + pivot, progs, fb); + if (err) + return err; + + jmp_offset = prog - (jg_reloc + 2); + if (!is_imm8(jmp_offset)) + return -1; + emit_code(jg_reloc, X86_JG + (jmp_offset << 8), 2); + + err = emit_bpf_dispatcher(&prog, a + pivot + 1, b, progs, fb); + if (err) + return err; +out: + *pprog = prog; + return 0; +} + +#endif + 
+static int cmp_ips(const void *a, const void *b) +{ + const u64 *ipa = a; + const u64 *ipb = b; + + if (*ipa > *ipb) + return 1; + if (*ipa < *ipb) + return -1; + return 0; +} + +#define BPF_DISPATCHER_MAX 4 + +int arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs, + int num_progs) +{ + u64 ips[BPF_DISPATCHER_MAX] = {}; + u8 *fallback, *prog = image; + int i, err, cnt = 0; + + if (!num_progs || num_progs > BPF_DISPATCHER_MAX) + return -EINVAL; + + for (i = 0; i < num_progs; i++) + ips[i] = (u64)progs[i]->bpf_func; + + EMIT2(0xEB, 5); /* jmp rip+5 (skip retpoline) */ + fallback = prog; + err = emit_jmp(&prog, /* jmp retpoline */ + __x86_indirect_thunk_rdx, prog); + if (err) + return err; + + sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL); + return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback); +} + struct x64_jit_data { struct bpf_binary_header *header; int *addrs; diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 3f671bf617e8..d4f330351f87 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -8,6 +8,7 @@ obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o obj-$(CONFIG_BPF_SYSCALL) += disasm.o obj-$(CONFIG_BPF_JIT) += trampoline.o obj-$(CONFIG_BPF_SYSCALL) += btf.o +obj-$(CONFIG_BPF_JIT) += dispatcher.o ifeq ($(CONFIG_NET),y) obj-$(CONFIG_BPF_SYSCALL) += devmap.o obj-$(CONFIG_BPF_SYSCALL) += cpumap.o diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c new file mode 100644 index 000000000000..691898640720 --- /dev/null +++ b/kernel/bpf/dispatcher.c @@ -0,0 +1,197 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright(c) 2019 Intel Corporation. */ + +#ifdef CONFIG_RETPOLINE + +#include <linux/hash.h> +#include <linux/bpf.h> +#include <linux/filter.h> + +/* The BPF dispatcher is a multiway branch code generator. A user + * registers a slot (id) and can then update the BPF program for that + * slot. 
The dispatcher is jited, and will be rejited every time a + * slot is allocated/deallocated for performance reasons. An example: + * A module provides code for multiple netdevs. Each netdev can have + * one XDP program. The module code will allocate a dispatcher, and + * when the netdev enables XDP it allocates a new slot. + * + * Nothing like STATIC_CALL_INLINE is supported yet, so an explicit + * trampoline is needed: + * + * unsigned int dispatcher_trampoline(void *ctx, void *insn, int id) + */ + +#define DISPATCHER_HASH_BITS 10 +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS) + +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE]; + +#define BPF_DISPATCHER_MAX 4 + +struct bpf_dispatcher { + struct hlist_node hlist; + void *func; + struct bpf_prog *progs[BPF_DISPATCHER_MAX]; + int num_progs; + void *image; + u64 selector; +}; + +static DEFINE_MUTEX(dispatcher_mutex); + +struct bpf_dispatcher *bpf_dispatcher_lookup(void *func) +{ + struct bpf_dispatcher *d; + struct hlist_head *head; + void *image; + + head = &dispatcher_table[hash_ptr(func, DISPATCHER_HASH_BITS)]; + hlist_for_each_entry(d, head, hlist) { + if (d->func == func) + return d; + } + d = kzalloc(sizeof(*d), GFP_KERNEL); + if (!d) + return NULL; + + image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!image) { + kfree(d); + return NULL; + } + + d->func = func; + INIT_HLIST_NODE(&d->hlist); + hlist_add_head(&d->hlist, head); + + set_vm_flush_reset_perms(image); + set_memory_x((long)image, 1); + d->image = image; + return d; +} + +static void bpf_dispatcher_free(struct bpf_dispatcher *d) +{ + bpf_jit_free_exec(d->image); + hlist_del(&d->hlist); + kfree(d); +} + +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d, + struct bpf_prog *prog) +{ + struct bpf_prog **entry = NULL; + int i, err = 0; + + if (d->num_progs == BPF_DISPATCHER_MAX) + return err; + + for (i = 0; i < BPF_DISPATCHER_MAX; i++) { + if (!entry && !d->progs[i]) + entry = &d->progs[i]; + if (d->progs[i] == prog) + 
return err; + } + + prog = bpf_prog_inc(prog); + if (IS_ERR(prog)) + return err; + + *entry = prog; + d->num_progs++; + return err; +} + +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d, + struct bpf_prog *prog) +{ + int i; + + for (i = 0; i < BPF_DISPATCHER_MAX; i++) { + if (d->progs[i] == prog) { + bpf_prog_put(prog); + d->progs[i] = NULL; + d->num_progs--; + break; + } + } +} + +int __weak arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs, + int num_ids) +{ + return -ENOTSUPP; +} + +/* NB! bpf_dispatcher_update() might free the dispatcher! */ +static int bpf_dispatcher_update(struct bpf_dispatcher *d) +{ + void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2; + void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2; + int err; + + if (d->num_progs == 0) { + err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_NOP, + old_image, NULL); + bpf_dispatcher_free(d); + goto out; + } + + err = arch_prepare_bpf_dispatcher(new_image, &d->progs[0], + d->num_progs); + if (err) + goto out; + + if (d->selector) + /* progs already running at this address */ + err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_JMP, + old_image, new_image); + else + /* first time registering */ + err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JMP, + NULL, new_image); + + if (err) + goto out; + d->selector++; + +out: + return err; +} + +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from, + struct bpf_prog *to) +{ + struct bpf_dispatcher *d; + + if (!from && !to) + return; + + mutex_lock(&dispatcher_mutex); + d = bpf_dispatcher_lookup(func); + if (!d) + goto out; + + if (from) + bpf_dispatcher_remove_prog(d, from); + + if (to) + bpf_dispatcher_add_prog(d, to); + + WARN_ON(bpf_dispatcher_update(d)); + +out: + mutex_unlock(&dispatcher_mutex); +} + +static int __init init_dispatchers(void) +{ + int i; + + for (i = 0; i < DISPATCHER_TABLE_SIZE; i++) + INIT_HLIST_HEAD(&dispatcher_table[i]); + return 0; +} +late_initcall(init_dispatchers); 
+ +#endif
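For readers following what `emit_bpf_dispatcher()` generates: a rough userspace model of the emitted code is a binary search over the sorted address table, where a hit becomes a direct call and a miss takes the fallback path (the retpoline thunk in the real JITed code). The function names below are mine, plain C standing in for the generated x86:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*xdp_func_t)(void *ctx);

/* Same ordering as cmp_ips() in the patch. */
static int cmp_u64(const void *a, const void *b)
{
	const uint64_t *x = a, *y = b;

	if (*x > *y)
		return 1;
	if (*x < *y)
		return -1;
	return 0;
}

/* Model of the emitted dispatcher: O(log n) binary search over the
 * sorted address table. A hit turns the call direct; a miss would go
 * through the retpoline thunk in the real generated code. */
static int dispatch(const uint64_t *ips, int n, xdp_func_t target, void *ctx)
{
	int lo = 0, hi = n - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (ips[mid] == (uint64_t)target)
			return target(ctx);	/* direct call */
		if (ips[mid] < (uint64_t)target)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return target(ctx);	/* fallback: indirect (retpoline) path */
}

static int prog_a(void *ctx) { (void)ctx; return 1; }
static int prog_b(void *ctx) { (void)ctx; return 2; }
static int prog_c(void *ctx) { (void)ctx; return 3; }
```

The sorted table is what makes the 1-byte relative `jg` jumps in the emitted code work for up to BPF_DISPATCHER_MAX entries.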