[bpf-next,v3,0/6] Introduce the BPF dispatcher

Message ID 20191209135522.16576-1-bjorn.topel@gmail.com

Message

Björn Töpel Dec. 9, 2019, 1:55 p.m. UTC
Overview
========

This is the 4th iteration of the series that introduces the BPF
dispatcher, which is a mechanism to avoid indirect calls.

The BPF dispatcher is a multi-way branch code generator, targeted at
BPF programs. For example, when an XDP program is executed via
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transforms indirect calls
into direct calls, and therefore avoids the retpoline. The dispatcher
is generated using the BPF JIT, and relies on the text poking provided
by bpf_arch_text_poke().
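
To illustrate the idea outside of the JIT, here is a hand-written,
userspace C sketch of what a two-entry dispatcher amounts to: a chain
of compares and direct calls, with the original indirect call kept as
a fallback. The names (prog_a, prog_b, dispatch) are made up for the
example and are not part of this series.

/* Sketch: the dispatch structure expressed in plain C.  prog_a and
 * prog_b stand in for JITed BPF programs; the final branch is the
 * original (retpolined) indirect call, taken for unknown programs.
 */
#include <stdio.h>

typedef unsigned int (*bpf_func_t)(const void *ctx);

static unsigned int prog_a(const void *ctx) { return 1; }
static unsigned int prog_b(const void *ctx) { return 2; }

static unsigned int dispatch(const void *ctx, bpf_func_t bpf_func)
{
        /* Known targets: compare and call directly, no retpoline. */
        if (bpf_func == prog_a)
                return prog_a(ctx);
        if (bpf_func == prog_b)
                return prog_b(ctx);
        /* Unknown target: fall back to the indirect call. */
        return bpf_func(ctx);
}

int main(void)
{
        printf("%u\n", dispatch(NULL, prog_a)); /* direct-call path */
        printf("%u\n", dispatch(NULL, prog_b)); /* direct-call path */
        return 0;
}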

The dispatcher hijacks a trampoline function via the trampoline's
__fentry__ nop. One dispatcher instance currently supports up to 48
dispatch points. This can be extended in the future.
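
In other words (a sketch with a made-up name; the real helpers are
introduced in patch two): the trampoline is an ordinary C function
that just performs the indirect call, and bpf_arch_text_poke()
rewrites the __fentry__ nop at its start into a jump to the JITed
dispatcher image.

/* Sketch only.  While no dispatcher image is installed, the
 * __fentry__ nop is left in place and the body below runs, i.e. the
 * plain (retpolined) indirect call.  When an image is installed, the
 * nop is patched into "jmp <image>", and the image dispatches via
 * direct jumps instead.
 */
struct bpf_insn;

unsigned int
bpf_dispatcher_trampoline(const void *xdp_ctx, const struct bpf_insn *insnsi,
                          unsigned int (*bpf_func)(const void *,
                                                   const struct bpf_insn *))
{
        return bpf_func(xdp_ctx, insnsi);
}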

In this series, only one dispatcher instance is supported, and the
only user is XDP. The dispatcher is updated when an XDP program is
attached to or detached from a netdev. An alternative would have been
to update the dispatcher at program load time, but since there are
usually more XDP programs loaded than attached, the attach-time
approach was chosen.

The XDP dispatcher is always enabled, if available, because it helps
even when retpolines are disabled. Please refer to the "Performance"
section below.

The first patch moves the image allocation out of the BPF trampoline
code. Patch two introduces the dispatcher, and patch three wires up
the XDP control- and fast-path. Patch four adds the dispatcher to
BPF_TEST_RUN. Patch five adds a simple selftest, and the last patch
aligns the dispatcher's jump targets.

I have rebased the series on commit e7096c131e51 ("net: WireGuard
secure network tunnel").

Discussion/feedback
===================

My measurements did not show any improvement from the 16 B jump
target alignment. Maybe it would make sense to leave the alignment
out?

As the performance results show, when the dispatch table is full (48
entries), there is a performance degradation for the micro-benchmarks
(XDP_DRV and xdp-perf). The dispatcher does O(log n) compares/jumps;
if we assume linear scaling, this means that having more than 16
entries in the dispatch table will degrade performance for the
"mitigations=off" case, and even more so for the "mitigations=auto"
case.

Generated code, x86-64
======================

The dispatcher currently has a maximum of 48 entries, where each
entry corresponds to one unique BPF program. Multiple users of a
dispatcher instance using the same BPF program will share that entry.
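
A rough sketch of what such an entry table amounts to (the structure
and field names below are made up for illustration; the real
bookkeeping lives in kernel/bpf/dispatcher.c in patch two, which also
re-JITs the image and text pokes the trampoline on every change):

#include <stdbool.h>

#define DISPATCHER_MAX 48

struct dispatcher_entry {
        const void *prog;       /* stand-in for struct bpf_prog * */
        unsigned int users;     /* users sharing this entry */
};

struct dispatcher {
        struct dispatcher_entry entries[DISPATCHER_MAX];
        int num_entries;
};

/* Returns true if the set of dispatch targets changed, i.e. a re-JIT
 * of the image is needed.
 */
static bool dispatcher_add(struct dispatcher *d, const void *prog)
{
        int i;

        for (i = 0; i < d->num_entries; i++) {
                if (d->entries[i].prog == prog) {
                        d->entries[i].users++;  /* share the entry */
                        return false;
                }
        }
        if (d->num_entries == DISPATCHER_MAX)
                return false;   /* full: program uses the fallback */

        d->entries[d->num_entries].prog = prog;
        d->entries[d->num_entries].users = 1;
        d->num_entries++;
        return true;
}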

The program/slot lookup is performed by a binary search, O(log
n). Let's have a look at the generated code.

The trampoline function has the following signature:

  unsigned int tramp(const void *xdp_ctx,
                     const struct bpf_insn *insnsi,
                     unsigned int (*bpf_func)(const void *,
                                              const struct bpf_insn *))

On Intel x86-64 this means that rdx will contain the bpf_func. To
make it easier to read, I've let the BPF programs have the following
range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0
(-16). 0xffffffff81c00f10 is the retpoline thunk, in this case
__x86_indirect_thunk_rdx. If retpolines are disabled, this is a
regular indirect call instead.

The minimal dispatcher will then look like this:

ffffffffc0002000: cmp    rdx,0xffffffffffffffff
ffffffffc0002007: je     0xffffffffffffffff ; -1
ffffffffc000200d: jmp    0xffffffff81c00f10

A 16 entry dispatcher looks like this:

ffffffffc0020000: cmp    rdx,0xfffffffffffffff7 ; -9
ffffffffc0020007: jg     0xffffffffc0020130
ffffffffc002000d: cmp    rdx,0xfffffffffffffff3 ; -13
ffffffffc0020014: jg     0xffffffffc00200a0
ffffffffc002001a: cmp    rdx,0xfffffffffffffff1 ; -15
ffffffffc0020021: jg     0xffffffffc0020060
ffffffffc0020023: cmp    rdx,0xfffffffffffffff0 ; -16
ffffffffc002002a: jg     0xffffffffc0020040
ffffffffc002002c: cmp    rdx,0xfffffffffffffff0 ; -16
ffffffffc0020033: je     0xfffffffffffffff0 ; -16
ffffffffc0020039: jmp    0xffffffff81c00f10
ffffffffc002003e: xchg   ax,ax
ffffffffc0020040: cmp    rdx,0xfffffffffffffff1 ; -15
ffffffffc0020047: je     0xfffffffffffffff1 ; -15
ffffffffc002004d: jmp    0xffffffff81c00f10
ffffffffc0020052: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002005a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020060: cmp    rdx,0xfffffffffffffff2 ; -14
ffffffffc0020067: jg     0xffffffffc0020080
ffffffffc0020069: cmp    rdx,0xfffffffffffffff2 ; -14
ffffffffc0020070: je     0xfffffffffffffff2 ; -14
ffffffffc0020076: jmp    0xffffffff81c00f10
ffffffffc002007b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020080: cmp    rdx,0xfffffffffffffff3 ; -13
ffffffffc0020087: je     0xfffffffffffffff3 ; -13
ffffffffc002008d: jmp    0xffffffff81c00f10
ffffffffc0020092: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002009a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00200a0: cmp    rdx,0xfffffffffffffff5 ; -11
ffffffffc00200a7: jg     0xffffffffc00200f0
ffffffffc00200a9: cmp    rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b0: jg     0xffffffffc00200d0
ffffffffc00200b2: cmp    rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b9: je     0xfffffffffffffff4 ; -12
ffffffffc00200bf: jmp    0xffffffff81c00f10
ffffffffc00200c4: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00200cc: nop    DWORD PTR [rax+0x0]
ffffffffc00200d0: cmp    rdx,0xfffffffffffffff5 ; -11
ffffffffc00200d7: je     0xfffffffffffffff5 ; -11
ffffffffc00200dd: jmp    0xffffffff81c00f10
ffffffffc00200e2: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00200ea: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00200f0: cmp    rdx,0xfffffffffffffff6 ; -10
ffffffffc00200f7: jg     0xffffffffc0020110
ffffffffc00200f9: cmp    rdx,0xfffffffffffffff6 ; -10
ffffffffc0020100: je     0xfffffffffffffff6 ; -10
ffffffffc0020106: jmp    0xffffffff81c00f10
ffffffffc002010b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020110: cmp    rdx,0xfffffffffffffff7 ; -9
ffffffffc0020117: je     0xfffffffffffffff7 ; -9
ffffffffc002011d: jmp    0xffffffff81c00f10
ffffffffc0020122: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002012a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020130: cmp    rdx,0xfffffffffffffffb ; -5
ffffffffc0020137: jg     0xffffffffc00201d0
ffffffffc002013d: cmp    rdx,0xfffffffffffffff9 ; -7
ffffffffc0020144: jg     0xffffffffc0020190
ffffffffc0020146: cmp    rdx,0xfffffffffffffff8 ; -8
ffffffffc002014d: jg     0xffffffffc0020170
ffffffffc002014f: cmp    rdx,0xfffffffffffffff8 ; -8
ffffffffc0020156: je     0xfffffffffffffff8 ; -8
ffffffffc002015c: jmp    0xffffffff81c00f10
ffffffffc0020161: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020169: nop    DWORD PTR [rax+0x0]
ffffffffc0020170: cmp    rdx,0xfffffffffffffff9 ; -7
ffffffffc0020177: je     0xfffffffffffffff9 ; -7
ffffffffc002017d: jmp    0xffffffff81c00f10
ffffffffc0020182: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002018a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020190: cmp    rdx,0xfffffffffffffffa ; -6
ffffffffc0020197: jg     0xffffffffc00201b0
ffffffffc0020199: cmp    rdx,0xfffffffffffffffa ; -6
ffffffffc00201a0: je     0xfffffffffffffffa ; -6
ffffffffc00201a6: jmp    0xffffffff81c00f10
ffffffffc00201ab: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201b0: cmp    rdx,0xfffffffffffffffb ; -5
ffffffffc00201b7: je     0xfffffffffffffffb ; -5
ffffffffc00201bd: jmp    0xffffffff81c00f10
ffffffffc00201c2: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201ca: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00201d0: cmp    rdx,0xfffffffffffffffd ; -3
ffffffffc00201d7: jg     0xffffffffc0020220
ffffffffc00201d9: cmp    rdx,0xfffffffffffffffc ; -4
ffffffffc00201e0: jg     0xffffffffc0020200
ffffffffc00201e2: cmp    rdx,0xfffffffffffffffc ; -4
ffffffffc00201e9: je     0xfffffffffffffffc ; -4
ffffffffc00201ef: jmp    0xffffffff81c00f10
ffffffffc00201f4: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201fc: nop    DWORD PTR [rax+0x0]
ffffffffc0020200: cmp    rdx,0xfffffffffffffffd ; -3
ffffffffc0020207: je     0xfffffffffffffffd ; -3
ffffffffc002020d: jmp    0xffffffff81c00f10
ffffffffc0020212: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002021a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020220: cmp    rdx,0xfffffffffffffffe ; -2
ffffffffc0020227: jg     0xffffffffc0020240
ffffffffc0020229: cmp    rdx,0xfffffffffffffffe ; -2
ffffffffc0020230: je     0xfffffffffffffffe ; -2
ffffffffc0020236: jmp    0xffffffff81c00f10
ffffffffc002023b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020240: cmp    rdx,0xffffffffffffffff ; -1
ffffffffc0020247: je     0xffffffffffffffff ; -1
ffffffffc002024d: jmp    0xffffffff81c00f10

The nops are there to align jump targets to 16 B.
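
The shape of the cmp/jg tree above comes from a recursive binary split
over the sorted program addresses. Below is a rough, self-contained
sketch of that generation scheme; it prints pseudo-assembly instead of
emitting machine code and is not the actual emitter from patch two.

#include <stdio.h>

/* Print the dispatch tree for progs[a..b] (sorted, inclusive range).
 * Internal nodes emit a cmp/jg pair that splits the range in half;
 * leaves emit the guarded direct jump plus the indirect-call
 * fallback, matching the listings above.
 */
static void emit_node(const unsigned long *progs, int a, int b)
{
        if (a == b) {
                printf("cmp rdx, %#lx\n", progs[a]);
                printf("je  %#lx            ; direct jump to the program\n",
                       progs[a]);
                printf("jmp retpoline_thunk ; fallback, indirect call\n");
                return;
        }

        int pivot = (a + b) / 2;

        printf("cmp rdx, %#lx\n", progs[pivot]);
        printf("jg  node_%d_%d\n", pivot + 1, b);
        emit_node(progs, a, pivot);
        printf("node_%d_%d:\n", pivot + 1, b);
        emit_node(progs, pivot + 1, b);
}

int main(void)
{
        /* Four sorted "program addresses", as in the examples above. */
        const unsigned long progs[] = {
                0xfffffffffffffffcUL, 0xfffffffffffffffdUL,
                0xfffffffffffffffeUL, 0xffffffffffffffffUL,
        };

        emit_node(progs, 0, 3);
        return 0;
}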

Performance
===========

The tests were performed using the xdp_rxq_info sample program and
the xdp_perf selftest, with the following command lines:

1. XDP_DRV:
  # xdp_rxq_info --dev eth0 --action XDP_DROP
2. XDP_SKB:
  # xdp_rxq_info --dev eth0 -S --action XDP_DROP
3. xdp-perf, from selftests/bpf:
  # test_progs -v -t xdp_perf


Run with mitigations=auto
-------------------------

Baseline:
1. 22.0 Mpps
2. 3.8 Mpps
3. 15 ns

Dispatcher:
1. 29.4 Mpps (+34%)
2. 4.0 Mpps  (+5%)
3. 5 ns      (+66%)

Dispatcher (full; walk all entries, and fallback):
1. 20.4 Mpps (-7%)
2. 3.8 Mpps  
3. 18 ns     (-20%)

Run with mitigations=off
------------------------

Baseline:
1. 29.6 Mpps
2. 4.1 Mpps
3. 5 ns

Dispatcher:
1. 30.7 Mpps (+4%)
2. 4.1 Mpps
3. 5 ns

Dispatcher (full; walk all entries, and fallback):
1. 27.2 Mpps (-8%)
2. 4.1 Mpps
3. 7 ns      (-40%)

Multiple xdp-perf baseline with mitigations=auto
------------------------------------------------

 Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):

             16.69 msec task-clock                #    0.984 CPUs utilized            ( +-  0.08% )
                 2      context-switches          #    0.123 K/sec                    ( +-  1.11% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 70.68% )
                97      page-faults               #    0.006 M/sec                    ( +-  0.05% )
        49,254,635      cycles                    #    2.951 GHz                      ( +-  0.09% )  (12.28%)
        42,138,558      instructions              #    0.86  insn per cycle           ( +-  0.02% )  (36.15%)
         7,315,291      branches                  #  438.300 M/sec                    ( +-  0.01% )  (59.43%)
         1,011,201      branch-misses             #   13.82% of all branches          ( +-  0.01% )  (83.31%)
        15,440,788      L1-dcache-loads           #  925.143 M/sec                    ( +-  0.00% )  (99.40%)
            39,067      L1-dcache-load-misses     #    0.25% of all L1-dcache hits    ( +-  0.04% )
             6,531      LLC-loads                 #    0.391 M/sec                    ( +-  0.05% )
               442      LLC-load-misses           #    6.76% of all LL-cache hits     ( +-  0.77% )
   <not supported>      L1-icache-loads                                             
            57,964      L1-icache-load-misses                                         ( +-  0.06% )
        15,442,496      dTLB-loads                #  925.246 M/sec                    ( +-  0.00% )
               514      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  0.73% )  (40.57%)
               130      iTLB-loads                #    0.008 M/sec                    ( +-  2.75% )  (16.69%)
     <not counted>      iTLB-load-misses                                              ( +-  8.71% )  (0.60%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

         0.0169558 +- 0.0000127 seconds time elapsed  ( +-  0.07% )

Multiple xdp-perf dispatcher with mitigations=auto
--------------------------------------------------

Note that this includes generating the dispatcher.

 Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):

              4.80 msec task-clock                #    0.953 CPUs utilized            ( +-  0.06% )
                 1      context-switches          #    0.258 K/sec                    ( +-  1.57% )
                 0      cpu-migrations            #    0.000 K/sec                  
                97      page-faults               #    0.020 M/sec                    ( +-  0.05% )
        14,185,861      cycles                    #    2.955 GHz                      ( +-  0.17% )  (50.49%)
        45,691,935      instructions              #    3.22  insn per cycle           ( +-  0.01% )  (99.19%)
         8,346,008      branches                  # 1738.709 M/sec                    ( +-  0.00% )
            13,046      branch-misses             #    0.16% of all branches          ( +-  0.10% )
        15,443,735      L1-dcache-loads           # 3217.365 M/sec                    ( +-  0.00% )
            39,585      L1-dcache-load-misses     #    0.26% of all L1-dcache hits    ( +-  0.05% )
             7,138      LLC-loads                 #    1.487 M/sec                    ( +-  0.06% )
               671      LLC-load-misses           #    9.40% of all LL-cache hits     ( +-  0.73% )
   <not supported>      L1-icache-loads                                             
            56,213      L1-icache-load-misses                                         ( +-  0.08% )
        15,443,735      dTLB-loads                # 3217.365 M/sec                    ( +-  0.00% )
     <not counted>      dTLB-load-misses                                              (0.00%)
     <not counted>      iTLB-loads                                                    (0.00%)
     <not counted>      iTLB-load-misses                                              (0.00%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

        0.00503705 +- 0.00000546 seconds time elapsed  ( +-  0.11% )


Revisions
=========

v2->v3: [1]
  * Removed xdp_call, and instead made the dispatcher available to all
    XDP users via bpf_prog_run_xdp() and dev_xdp_install(). (Toke)
  * Always enable the dispatcher, if available (Alexei)
  * Reuse BPF trampoline image allocator (Alexei)
  * Make sure the dispatcher is exercised in selftests (Alexei)
  * Only allow one dispatcher, and wire it to XDP

v1->v2: [2]
  * Fixed i386 build warning (kbuild robot)
  * Made bpf_dispatcher_lookup() static (kbuild robot)
  * Make sure xdp_call.h is only enabled for builtins
  * Add xdp_call() to ixgbe, mlx4, and mlx5

RFC->v1: [3]
  * Improved error handling (Edward and Andrii)
  * Explicit cleanup (Andrii)
  * Use 32B with sext cmp (Alexei)
  * Align jump targets to 16B (Alexei)
  * 4 to 16 entries (Toke)
  * Added stats to xdp_call_run()

[1] https://lore.kernel.org/bpf/20191123071226.6501-1-bjorn.topel@gmail.com/
[2] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/
[3] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/

Björn Töpel (6):
  bpf: move trampoline JIT image allocation to a function
  bpf: introduce BPF dispatcher
  bpf, xdp: start using the BPF dispatcher for XDP
  bpf: start using the BPF dispatcher in BPF_TEST_RUN
  selftests: bpf: add xdp_perf test
  bpf, x86: align dispatcher branch targets to 16B

 arch/x86/net/bpf_jit_comp.c                   | 150 +++++++++++++
 include/linux/bpf.h                           |   8 +
 include/linux/filter.h                        |  56 +++--
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/dispatcher.c                       | 207 ++++++++++++++++++
 kernel/bpf/syscall.c                          |  26 ++-
 kernel/bpf/trampoline.c                       |  24 +-
 net/bpf/test_run.c                            |  15 +-
 net/core/dev.c                                |  19 +-
 net/core/filter.c                             |  22 ++
 .../selftests/bpf/prog_tests/xdp_perf.c       |  25 +++
 11 files changed, 516 insertions(+), 37 deletions(-)
 create mode 100644 kernel/bpf/dispatcher.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_perf.c

Comments

Toke Høiland-Jørgensen Dec. 9, 2019, 3 p.m. UTC | #1
Björn Töpel <bjorn.topel@gmail.com> writes:

> Overview
> ========
>
> This is the 4th iteration of the series that introduces the BPF
> dispatcher, which is a mechanism to avoid indirect calls.
>
> The BPF dispatcher is a multi-way branch code generator, targeted for
> BPF programs. E.g. when an XDP program is executed via the
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transform indirect calls to
> direct calls, and therefore avoids the retpoline. The dispatcher is
> generated using the BPF JIT, and relies on text poking provided by
> bpf_arch_text_poke().
>
> The dispatcher hijacks a trampoline function it via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to 48
> dispatch points. This can be extended in the future.
>
> In this series, only one dispatcher instance is supported, and the
> only user is XDP. The dispatcher is updated when an XDP program is
> attached/detached to/from a netdev. An alternative to this could have
> been to update the dispatcher at program load point, but as there are
> usually more XDP programs loaded than attached, so the latter was
> picked.

I like the new version where it's integrated into bpf_prog_run_xdp();
nice! :)

> The XDP dispatcher is always enabled, if available, because it helps
> even when retpolines are disabled. Please refer to the "Performance"
> section below.

Looking at those numbers, I think I would moderate "helps" to "doesn't
hurt" - a difference of less than 1ns is basically in the noise.

You mentioned in the earlier version that this would impact the time it
takes to attach an XDP program. Got any numbers for this?

-Toke
Jesper Dangaard Brouer Dec. 9, 2019, 5 p.m. UTC | #2
On Mon,  9 Dec 2019 14:55:16 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> Performance
> ===========
> 
> The tests were performed using the xdp_rxq_info sample program with
> the following command-line:
> 
> 1. XDP_DRV:
>   # xdp_rxq_info --dev eth0 --action XDP_DROP
> 2. XDP_SKB:
>   # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> 3. xdp-perf, from selftests/bpf:
>   # test_progs -v -t xdp_perf
> 
> 
> Run with mitigations=auto
> -------------------------
> 
> Baseline:
> 1. 22.0 Mpps
> 2. 3.8 Mpps
> 3. 15 ns
> 
> Dispatcher:
> 1. 29.4 Mpps (+34%)
> 2. 4.0 Mpps  (+5%)
> 3. 5 ns      (+66%)

Thanks for providing these extra measurement points.  This is good
work.  I just want to remind people that when working at these high
speeds, it is easy to get amazed by a +34% improvement, but we have to
be careful to understand that this is saving approx 10 ns time or
cycles.

In reality cycles or time saved in #2 (3.8 Mpps -> 4.0 Mpps) is larger
(1/3.8-1/4)*1000 = 13.15 ns.  Than #1 (22.0 Mpps -> 29.4 Mpps)
(1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest 15 ns -> 5 ns =
10 ns.  The 10 ns improvement is a big deal in XDP context, and also
correspond to my own experience with retpoline (approx 12 ns overhead).

To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
more accuracy on my checks-and-balances I described above.  I suspect
the 3.8 Mpps -> 4.0 Mpps will be closer to the other numbers when we
get more accuracy.

 
> Dispatcher (full; walk all entries, and fallback):
> 1. 20.4 Mpps (-7%)
> 2. 3.8 Mpps  
> 3. 18 ns     (-20%)
> 
> Run with mitigations=off
> ------------------------
> 
> Baseline:
> 1. 29.6 Mpps
> 2. 4.1 Mpps
> 3. 5 ns
> 
> Dispatcher:
> 1. 30.7 Mpps (+4%)
> 2. 4.1 Mpps
> 3. 5 ns

While +4% sounds good, but could be measurement noise ;-)

 (1/29.6-1/30.7)*1000 = 1.21 ns

As both #3 says 5 ns.
Björn Töpel Dec. 9, 2019, 5:42 p.m. UTC | #3
On Mon, 9 Dec 2019 at 16:00, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
[...]
>
> I like the new version where it's integrated into bpf_prog_run_xdp();
> nice! :)
>

Yes, me too! Nice suggestion!

> > The XDP dispatcher is always enabled, if available, because it helps
> > even when retpolines are disabled. Please refer to the "Performance"
> > section below.
>
> Looking at those numbers, I think I would moderate "helps" to "doesn't
> hurt" - a difference of less than 1ns is basically in the noise.
>
> You mentioned in the earlier version that this would impact the time it
> takes to attach an XDP program. Got any numbers for this?
>

Ah, no, I forgot to measure that. I'll get back with that. So, when a
new program is entered or removed from dispatcher, it needs to be
re-jited, but more importantly -- a text poke is needed. I don't know
if this is a concern or not, but let's measure it.


Björn

> -Toke
>
Björn Töpel Dec. 9, 2019, 5:45 p.m. UTC | #4
On Mon, 9 Dec 2019 at 18:00, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> On Mon,  9 Dec 2019 14:55:16 +0100
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > Performance
> > ===========
> >
> > The tests were performed using the xdp_rxq_info sample program with
> > the following command-line:
> >
> > 1. XDP_DRV:
> >   # xdp_rxq_info --dev eth0 --action XDP_DROP
> > 2. XDP_SKB:
> >   # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > 3. xdp-perf, from selftests/bpf:
> >   # test_progs -v -t xdp_perf
> >
> >
> > Run with mitigations=auto
> > -------------------------
> >
> > Baseline:
> > 1. 22.0 Mpps
> > 2. 3.8 Mpps
> > 3. 15 ns
> >
> > Dispatcher:
> > 1. 29.4 Mpps (+34%)
> > 2. 4.0 Mpps  (+5%)
> > 3. 5 ns      (+66%)
>
> Thanks for providing these extra measurement points.  This is good
> work.  I just want to remind people that when working at these high
> speeds, it is easy to get amazed by a +34% improvement, but we have to
> be careful to understand that this is saving approx 10 ns time or
> cycles.
>
> In reality cycles or time saved in #2 (3.8 Mpps -> 4.0 Mpps) is larger
> (1/3.8-1/4)*1000 = 13.15 ns.  Than #1 (22.0 Mpps -> 29.4 Mpps)
> (1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest 15 ns -> 5 ns =
> 10 ns.  The 10 ns improvement is a big deal in XDP context, and also
> correspond to my own experience with retpoline (approx 12 ns overhead).
>

Ok, good! :-)

> To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
> more accuracy on my checks-and-balances I described above.  I suspect
> the 3.8 Mpps -> 4.0 Mpps will be closer to the other numbers when we
> get more accuracy.
>

Ok! Let me re-run them. If you have some spare cycles, it would be
great if you could try it out as well on your Mellanox setup.
Historically you've always been able to get more stable numbers than
I. :-)

>
> > Dispatcher (full; walk all entries, and fallback):
> > 1. 20.4 Mpps (-7%)
> > 2. 3.8 Mpps
> > 3. 18 ns     (-20%)
> >
> > Run with mitigations=off
> > ------------------------
> >
> > Baseline:
> > 1. 29.6 Mpps
> > 2. 4.1 Mpps
> > 3. 5 ns
> >
> > Dispatcher:
> > 1. 30.7 Mpps (+4%)
> > 2. 4.1 Mpps
> > 3. 5 ns
>
> While +4% sounds good, but could be measurement noise ;-)
>
>  (1/29.6-1/30.7)*1000 = 1.21 ns
>
> As both #3 says 5 ns.
>

True. Maybe that simply hints that we shouldn't use the dispatcher here?


Thanks for the comments!
Björn


> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
Jesper Dangaard Brouer Dec. 9, 2019, 7:50 p.m. UTC | #5
On Mon, 9 Dec 2019 18:45:12 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> On Mon, 9 Dec 2019 at 18:00, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > On Mon,  9 Dec 2019 14:55:16 +0100
> > Björn Töpel <bjorn.topel@gmail.com> wrote:
> >  
> > > Performance
> > > ===========
> > >
> > > The tests were performed using the xdp_rxq_info sample program with
> > > the following command-line:
> > >
> > > 1. XDP_DRV:
> > >   # xdp_rxq_info --dev eth0 --action XDP_DROP
> > > 2. XDP_SKB:
> > >   # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > > 3. xdp-perf, from selftests/bpf:
> > >   # test_progs -v -t xdp_perf
> > >
> > >
> > > Run with mitigations=auto
> > > -------------------------
> > >
> > > Baseline:
> > > 1. 22.0 Mpps
> > > 2. 3.8 Mpps
> > > 3. 15 ns
> > >
> > > Dispatcher:
> > > 1. 29.4 Mpps (+34%)
> > > 2. 4.0 Mpps  (+5%)
> > > 3. 5 ns      (+66%)  
> >
> > Thanks for providing these extra measurement points.  This is good
> > work.  I just want to remind people that when working at these high
> > speeds, it is easy to get amazed by a +34% improvement, but we have to
> > be careful to understand that this is saving approx 10 ns time or
> > cycles.
> >
> > In reality cycles or time saved in #2 (3.8 Mpps -> 4.0 Mpps) is larger
> > (1/3.8-1/4)*1000 = 13.15 ns.  Than #1 (22.0 Mpps -> 29.4 Mpps)
> > (1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest 15 ns -> 5 ns =
> > 10 ns.  The 10 ns improvement is a big deal in XDP context, and also
> > correspond to my own experience with retpoline (approx 12 ns overhead).
> >  
> 
> Ok, good! :-)
> 
> > To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
> > more accuracy on my checks-and-balances I described above.  I suspect
> > the 3.8 Mpps -> 4.0 Mpps will be closer to the other numbers when we
> > get more accuracy.
> >  
> 
> Ok! Let me re-run them. 

Well, I don't think you should waste your time re-running these...

It clearly shows a significant improvement.  I'm just complaining that
I didn't have enough digits to do accurate checks-and-balances, they
are close enough that I believe them.


> If you have some spare cycles, it would be
> great if you could try it out as well on your Mellanox setup.

I'll add it to my TODO list... but no promises.


> Historically you've always been able to get more stable numbers than
> I. :-)
> 
> >  
> > > Dispatcher (full; walk all entries, and fallback):
> > > 1. 20.4 Mpps (-7%)
> > > 2. 3.8 Mpps
> > > 3. 18 ns     (-20%)
> > >
> > > Run with mitigations=off
> > > ------------------------
> > >
> > > Baseline:
> > > 1. 29.6 Mpps
> > > 2. 4.1 Mpps
> > > 3. 5 ns
> > >
> > > Dispatcher:
> > > 1. 30.7 Mpps (+4%)
> > > 2. 4.1 Mpps
> > > 3. 5 ns  
> >
> > While +4% sounds good, but could be measurement noise ;-)
> >
> >  (1/29.6-1/30.7)*1000 = 1.21 ns
> >
> > As both #3 says 5 ns.
> >  
> 
> True. Maybe that simply hints that we shouldn't use the dispatcher here?

No. I actually think it is worth exposing this code as much as
possible. And if it really is 1.2 ns improvement, then I'll gladly take
that as well ;-)


I think this is awesome work! -- thanks for doing this!!!
Samudrala, Sridhar Dec. 10, 2019, 7:28 p.m. UTC | #6
On 12/9/2019 5:55 AM, Björn Töpel wrote:
> Overview
> ========
> 
> This is the 4th iteration of the series that introduces the BPF
> dispatcher, which is a mechanism to avoid indirect calls.

Good to see the progress with getting a mechanism to avoid indirect
calls upstream.

[...]


> Performance
> ===========
> 
> The tests were performed using the xdp_rxq_info sample program with
> the following command-line:
> 
> 1. XDP_DRV:
>    # xdp_rxq_info --dev eth0 --action XDP_DROP
> 2. XDP_SKB:
>    # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> 3. xdp-perf, from selftests/bpf:
>    # test_progs -v -t xdp_perf

What is this test_progs? I don't see such an app under selftests/bpf


> Run with mitigations=auto
> -------------------------
> 
> Baseline:
> 1. 22.0 Mpps
> 2. 3.8 Mpps
> 3. 15 ns
> 
> Dispatcher:
> 1. 29.4 Mpps (+34%)
> 2. 4.0 Mpps  (+5%)
> 3. 5 ns      (+66%)
> 
> Dispatcher (full; walk all entries, and fallback):
> 1. 20.4 Mpps (-7%)
> 2. 3.8 Mpps
> 3. 18 ns     (-20%)

Are these packets received on a single queue? Or multiple queues?
Do you see similar improvements even with xdpsock?

Thanks
Sridhar
Björn Töpel Dec. 10, 2019, 7:59 p.m. UTC | #7
On Mon, 9 Dec 2019 at 14:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
>
> Discussion/feedback
> ===================
>
> My measurements did not show any improvements for the jump target 16 B
> alignments. Maybe it would make sense to leave alignment out?
>

I did a micro benchmark with "test_progs -t xdp_prog" for all sizes
(max 48 aligned, max 64 non-aligned) of the dispatcher. I couldn't
measure any difference at all, so I will leave the last patch out
(aligning jump targets). If a workload appears where this is
measurable, it can be added at that point.

The micro benchmark also shows that it makes little sense to disable
the dispatcher when "mitigations=off". The diff is within 1 ns(!), when
mitigations are off. I'll post the data as a reply to the v4 cover
letter, so that the cover isn't clobbered with data. :-P


Björn
Björn Töpel Dec. 10, 2019, 8:04 p.m. UTC | #8
On Tue, 10 Dec 2019 at 20:28, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
[...]
> > The tests were performed using the xdp_rxq_info sample program with
> > the following command-line:
> >
> > 1. XDP_DRV:
> >    # xdp_rxq_info --dev eth0 --action XDP_DROP
> > 2. XDP_SKB:
> >    # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > 3. xdp-perf, from selftests/bpf:
> >    # test_progs -v -t xdp_perf
>
> What is this test_progs? I don't see such an app under selftests/bpf
>

The "test_progs" program resides in tools/testing/selftests/bpf. The
xdp_perf test is part of the series!

>
> > Run with mitigations=auto
> > -------------------------
> >
> > Baseline:
> > 1. 22.0 Mpps
> > 2. 3.8 Mpps
> > 3. 15 ns
> >
> > Dispatcher:
> > 1. 29.4 Mpps (+34%)
> > 2. 4.0 Mpps  (+5%)
> > 3. 5 ns      (+66%)
> >
> > Dispatcher (full; walk all entries, and fallback):
> > 1. 20.4 Mpps (-7%)
> > 2. 3.8 Mpps
> > 3. 18 ns     (-20%)
>
> Are these packets received on a single queue? Or multiple queues?
> Do you see similar improvements even with xdpsock?
>

Yes, just a single queue, regular XDP. I left out xdpsock for now, and
only focus on the micro benchmark and XDP. I'll get back with xdpsock
benchmarks.


Cheers,
Björn
Björn Töpel Dec. 11, 2019, 12:38 p.m. UTC | #9
On Mon, 9 Dec 2019 at 18:42, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
> > You mentioned in the earlier version that this would impact the time it
> > takes to attach an XDP program. Got any numbers for this?
> >
>
> Ah, no, I forgot to measure that. I'll get back with that. So, when a
> new program is entered or removed from dispatcher, it needs to be
> re-jited, but more importantly -- a text poke is needed. I don't know
> if this is a concern or not, but let's measure it.
>

Toke, I tried to measure the impact, but didn't really get anything
useful out. :-(

My concern was mainly that text-poking is a point of contention, and
it messes with the icache. As for contention, we're already
synchronized around the rtnl-lock. As for the icache-flush effects...
well... I'm open to suggestions how to measure the impact in a useful
way.

>
> Björn
>
> > -Toke
> >
Toke Høiland-Jørgensen Dec. 11, 2019, 1:17 p.m. UTC | #10
Björn Töpel <bjorn.topel@gmail.com> writes:

> On Mon, 9 Dec 2019 at 18:42, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
> [...]
>> > You mentioned in the earlier version that this would impact the time it
>> > takes to attach an XDP program. Got any numbers for this?
>> >
>>
>> Ah, no, I forgot to measure that. I'll get back with that. So, when a
>> new program is entered or removed from dispatcher, it needs to be
>> re-jited, but more importantly -- a text poke is needed. I don't know
>> if this is a concern or not, but let's measure it.
>>
>
> Toke, I tried to measure the impact, but didn't really get anything
> useful out. :-(
>
> My concern was mainly that text-poking is a point of contention, and
> it messes with the icache. As for contention, we're already
> synchronized around the rtnl-lock. As for the icache-flush effects...
> well... I'm open to suggestions how to measure the impact in a useful
> way.

Hmm, how about:

Test 1:

- Run a test with a simple drop program (like you have been) on a
  physical interface (A), sampling the PPS with interval I.
- Load a new XDP program on interface B (which could just be a veth I
  guess?)
- Record the PPS delta in the sampling interval on which the program was
  loaded on interface B.

You could also record for how many intervals the throughput drops, but I
would guess you'd need a fairly short sampling interval to see anything
for this.

Test 2:

- Run an XDP_TX program that just reflects the packets.
- Have the traffic generator measure per-packet latency (from when it's
  transmitted until the same packet comes back).
- As above, load a program on a different interface and look for a blip
  in the recorded latency.


Both of these tests could also be done with the program being replaced
being the one that processes packets on the physical interface (instead
of on another interface). That way you could also see if there's any
difference for that before/after patch...

-Toke