diff mbox series

x86, retpolines: raise limit for generating indirect calls from switch-case

Message ID 20190221221941.29358-1-daniel@iogearbox.net
State Not Applicable
Delegated to: David Miller
Headers show
Series x86, retpolines: raise limit for generating indirect calls from switch-case | expand

Commit Message

Daniel Borkmann Feb. 21, 2019, 10:19 p.m. UTC
From networking side, there are numerous attempts to get rid of
indirect calls in fast-path wherever feasible in order to avoid
the cost of retpolines, for example, just to name a few:

  * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
  * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
  * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
  * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
  * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
  * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
  [...]

Recent work on XDP from Björn and Magnus additionally found that
manually transforming the XDP return code switch statement with
more than 5 cases into if-else combination would result in a
considerable speedup in XDP layer due to avoidance of indirect
calls in CONFIG_RETPOLINE enabled builds. On i40e driver with
XDP prog attached, a 20-26% speedup has been observed [0]. Aside
from XDP, there are many other places later in the networking
stack's critical path with similar switch-case processing. Rather
than fixing every XDP-enabled driver and locations in stack by
hand, it would be good to instead raise the limit where gcc would
emit expensive indirect calls from the switch under retpolines
and stick with the default as-is in case of !retpoline configured
kernels. This would also have the advantage that for archs where
this is not necessary, we let compiler select the underlying
target optimization for these constructs and avoid potential
slow-downs by if-else hand-rewrite.

In case of gcc, this setting is controlled by case-values-threshold
which has an architecture global default that selects 4 or 5 (latter
if target does not have a case insn that compares the bounds) where
some arch back ends like arm64 or s390 override it with their own
target hooks, for example, in gcc commit db7a90aa0de5 ("S/390:
Disable prediction of indirect branches") the threshold pretty
much disables jump tables by limit of 20 under retpoline builds.
Comparing gcc's and clang's default code generation on x86-64 under
O2 level with retpoline build results in the following outcome
for 5 switch cases:

* gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400be0 <+0>:     cmp    $0x4,%edi
   0x0000000000400be3 <+3>:     ja     0x400c35 <dispatch+85>
   0x0000000000400be5 <+5>:     lea    0x915f8(%rip),%rdx        # 0x4921e4
   0x0000000000400bec <+12>:    mov    %edi,%edi
   0x0000000000400bee <+14>:    movslq (%rdx,%rdi,4),%rax
   0x0000000000400bf2 <+18>:    add    %rdx,%rax
   0x0000000000400bf5 <+21>:    callq  0x400c01 <dispatch+33>
   0x0000000000400bfa <+26>:    pause
   0x0000000000400bfc <+28>:    lfence
   0x0000000000400bff <+31>:    jmp    0x400bfa <dispatch+26>
   0x0000000000400c01 <+33>:    mov    %rax,(%rsp)
   0x0000000000400c05 <+37>:    retq
   0x0000000000400c06 <+38>:    nopw   %cs:0x0(%rax,%rax,1)
   0x0000000000400c10 <+48>:    jmpq   0x400c90 <fn_3>
   0x0000000000400c15 <+53>:    nopl   (%rax)
   0x0000000000400c18 <+56>:    jmpq   0x400c70 <fn_2>
   0x0000000000400c1d <+61>:    nopl   (%rax)
   0x0000000000400c20 <+64>:    jmpq   0x400c50 <fn_1>
   0x0000000000400c25 <+69>:    nopl   (%rax)
   0x0000000000400c28 <+72>:    jmpq   0x400c40 <fn_0>
   0x0000000000400c2d <+77>:    nopl   (%rax)
   0x0000000000400c30 <+80>:    jmpq   0x400cb0 <fn_4>
   0x0000000000400c35 <+85>:    push   %rax
   0x0000000000400c36 <+86>:    callq  0x40dd80 <abort>
  End of assembler dump.

* clang with -mretpoline emitting search tree:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400b30 <+0>:     cmp    $0x1,%edi
   0x0000000000400b33 <+3>:     jle    0x400b44 <dispatch+20>
   0x0000000000400b35 <+5>:     cmp    $0x2,%edi
   0x0000000000400b38 <+8>:     je     0x400b4d <dispatch+29>
   0x0000000000400b3a <+10>:    cmp    $0x3,%edi
   0x0000000000400b3d <+13>:    jne    0x400b52 <dispatch+34>
   0x0000000000400b3f <+15>:    jmpq   0x400c50 <fn_3>
   0x0000000000400b44 <+20>:    test   %edi,%edi
   0x0000000000400b46 <+22>:    jne    0x400b5c <dispatch+44>
   0x0000000000400b48 <+24>:    jmpq   0x400c20 <fn_0>
   0x0000000000400b4d <+29>:    jmpq   0x400c40 <fn_2>
   0x0000000000400b52 <+34>:    cmp    $0x4,%edi
   0x0000000000400b55 <+37>:    jne    0x400b66 <dispatch+54>
   0x0000000000400b57 <+39>:    jmpq   0x400c60 <fn_4>
   0x0000000000400b5c <+44>:    cmp    $0x1,%edi
   0x0000000000400b5f <+47>:    jne    0x400b66 <dispatch+54>
   0x0000000000400b61 <+49>:    jmpq   0x400c30 <fn_1>
   0x0000000000400b66 <+54>:    push   %rax
   0x0000000000400b67 <+55>:    callq  0x40dd20 <abort>
  End of assembler dump.

  For sake of comparison, clang without -mretpoline:

  # gdb -batch -ex 'disassemble dispatch' ./c-switch
  Dump of assembler code for function dispatch:
   0x0000000000400b30 <+0>:	cmp    $0x4,%edi
   0x0000000000400b33 <+3>:	ja     0x400b57 <dispatch+39>
   0x0000000000400b35 <+5>:	mov    %edi,%eax
   0x0000000000400b37 <+7>:	jmpq   *0x492148(,%rax,8)
   0x0000000000400b3e <+14>:	jmpq   0x400bf0 <fn_0>
   0x0000000000400b43 <+19>:	jmpq   0x400c30 <fn_4>
   0x0000000000400b48 <+24>:	jmpq   0x400c10 <fn_2>
   0x0000000000400b4d <+29>:	jmpq   0x400c20 <fn_3>
   0x0000000000400b52 <+34>:	jmpq   0x400c00 <fn_1>
   0x0000000000400b57 <+39>:	push   %rax
   0x0000000000400b58 <+40>:	callq  0x40dcf0 <abort>
  End of assembler dump.

Raising the cases to a high number (e.g. 100) will still result
in similar code generation pattern with clang and gcc as above,
in other words clang generally turns off jump table emission by
having an extra expansion pass under retpoline build to turn
indirectbr instructions from their IR into switch instructions
as a built-in -mno-jump-table lowering of a switch (in this
case, even if IR input already contained an indirect branch).

For gcc, adding --param=case-values-threshold=20 as in similar
fashion as s390 in order to raise the limit for x86 retpoline
enabled builds results in a small vmlinux size increase of only
0.13% (before=18,027,528 after=18,051,192). For clang this
option is ignored due to i) not being needed as mentioned and
ii) not having above cmdline parameter. Non-retpoline-enabled
builds with gcc continue to use the default case-values-threshold
setting, so nothing changes here.

  [0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/
      and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
       - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
       - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/Makefile | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Linus Torvalds Feb. 21, 2019, 10:27 p.m. UTC | #1
On Thu, Feb 21, 2019 at 2:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> In case of gcc, this setting is controlled by case-values-threshold
> which has an architecture global default that selects 4 or 5 (

Ack. For retpoline, that's much too low.

Patch looks sane, although it would be good to verify just which
versions of gcc this works for. All versions with retpoline?

                        Linus
Daniel Borkmann Feb. 21, 2019, 10:56 p.m. UTC | #2
On 02/21/2019 11:27 PM, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 2:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> In case of gcc, this setting is controlled by case-values-threshold
>> which has an architecture global default that selects 4 or 5 (
> 
> Ack. For retpoline, that's much too low.
> 
> Patch looks sane, although it would be good to verify just which
> versions of gcc this works for. All versions with retpoline?

The feature was first added in gcc 4.7 [0], under "General Optimizer Improvements":

  Support for a new parameter --param case-values-threshold=n was added to allow
  users to control the cutoff between doing switch statements as a series of if
  statements and using a jump table.

From what I can tell, original author (H.J. Lu) provided backports up to gcc 4.8
and distros seem to have pulled it from his github branch [1] as upstream gcc does
not handle backports for stable versions that old.

Thanks,
Daniel

  [0] https://www.gnu.org/software/gcc/gcc-4.7/changes.html
  [1] https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1749261
      https://github.com/hjl-tools/gcc/tree/hjl/indirect/gcc-4_8-branch/master
Jesper Dangaard Brouer Feb. 22, 2019, 7:31 a.m. UTC | #3
On Thu, 21 Feb 2019 23:19:41 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Recent work on XDP from Björn and Magnus additionally found that
> manually transforming the XDP return code switch statement with
> more than 5 cases into if-else combination would result in a
> considerable speedup in XDP layer due to avoidance of indirect
> calls in CONFIG_RETPOLINE enabled builds. On i40e driver with
> XDP prog attached, a 20-26% speedup has been observed [0]. Aside
> from XDP, there are many other places later in the networking
> stack's critical path with similar switch-case processing. Rather
> than fixing every XDP-enabled driver and locations in stack by
> hand, it would be good to instead raise the limit where gcc would
> emit expensive indirect calls from the switch under retpolines

I'm very happy to see this.  Thanks to Björn for finding, analyzing and
providing hand-coded-if-else code that demonstrated the performance
issue for XDP.  But I do think this GCC case-values-threshold param is
a better and more generic solution to the issue we observed and
measured in XDP land. And hopefully other parts of the network stack
and kernel will also benefit.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Thanks for following up on this Daniel,
Björn Töpel Feb. 26, 2019, 6:20 p.m. UTC | #4
On Fri, 22 Feb 2019 at 08:32, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> On Thu, 21 Feb 2019 23:19:41 +0100
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> > Recent work on XDP from Björn and Magnus additionally found that
> > manually transforming the XDP return code switch statement with
> > more than 5 cases into if-else combination would result in a
> > considerable speedup in XDP layer due to avoidance of indirect
> > calls in CONFIG_RETPOLINE enabled builds. On i40e driver with
> > XDP prog attached, a 20-26% speedup has been observed [0]. Aside
> > from XDP, there are many other places later in the networking
> > stack's critical path with similar switch-case processing. Rather
> > than fixing every XDP-enabled driver and locations in stack by
> > hand, it would be good to instead raise the limit where gcc would
> > emit expensive indirect calls from the switch under retpolines
>
> I'm very happy to see this.  Thanks to Björn for finding, analyzing and
> providing hand-coded-if-else code that demonstrated the performance
> issue for XDP.  But I do think this GCC case-values-threshold param is
> a better and more generic solution to the issue we observed and
> measured in XDP land. And hopefully other parts of the network stack
> and kernel will also benefit.
>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> Thanks for following up on this Daniel,

I definitely prefer a switch-statement over the if-else-messiness in
this context. Thanks for doing the deep-dive, Daniel!

FWIW,
Acked-by: Björn Töpel <bjorn.topel@intel.com>

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
diff mbox series

Patch

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9c5a67d1b9c1..f55420a67164 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -217,6 +217,11 @@  KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
 # Avoid indirect branches in kernel to deal with Spectre
 ifdef CONFIG_RETPOLINE
   KBUILD_CFLAGS += $(RETPOLINE_CFLAGS)
+  # Additionally, avoid generating expensive indirect jumps which
+  # are subject to retpolines for small number of switch cases.
+  # clang turns off jump table generation by default when under
+  # retpoline builds, however, gcc does not for x86.
+  KBUILD_CFLAGS += $(call cc-option,--param=case-values-threshold=20)
 endif
 
 archscripts: scripts_basic