[ovs-dev,PATCHv2,RFC,0/3] AF_XDP support for OVS

Message ID 1535757856-51123-1-git-send-email-u9012063@gmail.com

William Tu Aug. 31, 2018, 11:24 p.m. UTC
From: root <ovs-smartnic@vmware.com>

The patch series introduces AF_XDP support for OVS netdevs.
AF_XDP is a new address family that works together with eBPF.
In short, a socket of the AF_XDP family can receive and send
packets through an eBPF/XDP program attached to the netdev.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst

OVS has a couple of netdev types, e.g., system, tap, or
internal.  This patch series first adds a new netdev type called
"afxdp" and implements its configuration, packet reception,
and transmit functions.  Since the AF_XDP socket, xsk,
operates in userspace, once ovs-vswitchd receives packets
from the xsk, the proposed architecture re-uses the existing
userspace dpif-netdev datapath.  As a result, most of the
packet processing happens in userspace instead of in the
Linux kernel.

Architecture
===========
               _
              |   +-------------------+
              |   |    ovs-vswitchd   |<-->ovsdb-server
              |   +-------------------+
              |   |      ofproto      |<-->OpenFlow controllers
              |   +--------+-+--------+ 
              |   | netdev | |ofproto-|
    userspace |   +--------+ |  dpif  |
              |   | netdev | +--------+
              |   |provider| |  dpif  |
              |   +---||---+ +--------+
              |       ||     |  dpif- |
              |       ||     | netdev |
              |_      ||     +--------+  
                      ||         
               _  +---||-----+--------+
              |   | af_xdp prog +     |
       kernel |   |   xsk_map         |
              |_  +--------||---------+
                           ||
                        physical
                           NIC

To get started simply, create an OVS userspace bridge using dpif-netdev
by setting the datapath_type to netdev:
# ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev

Then attach a Linux netdev with type afxdp:
# ovs-vsctl add-port br0 afxdp-p0 -- \
    set interface afxdp-p0 type="afxdp"

Most of the implementation follows the AF_XDP sample code in the
Linux kernel under samples/bpf/xdpsock_user.c.

Configuration
=============
When a new afxdp netdev is added to OVS, the patch does the
following configuration (a sketch of steps 2-4 follows this list):
1) attach the afxdp program and map to the netdev (see bpf/xdp.h)
   (currently a maximum of 4 afxdp netdevs is supported)
2) create an AF_XDP socket (XSK) for the afxdp netdev
3) allocate a virtually contiguous memory region, called umem, and
   register this memory to the XSK
4) set up the rx/tx rings, and the umem's fill/completion rings
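
As a rough sketch (not the code from this patch), steps 2-4 map onto
the AF_XDP socket API roughly as below, in the style of
samples/bpf/xdpsock_user.c.  The xsk_setup() name, the frame/ring
sizes, and the missing error handling are illustrative only; the
sockopt and struct names come from the kernel uapi <linux/if_xdp.h>
(see also the lib/if_xdp.h added by this series):

#include <linux/if_xdp.h>
#include <net/if.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

#define NUM_FRAMES 1024
#define FRAME_SIZE 2048
#define RING_SIZE  1024

static int
xsk_setup(const char *ifname, uint32_t queue_id)
{
    /* 2) create an AF_XDP socket (XSK). */
    int sfd = socket(AF_XDP, SOCK_RAW, 0);

    /* 3) allocate a virtually contiguous umem and register it. */
    void *bufs;
    posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);

    struct xdp_umem_reg mr;
    memset(&mr, 0, sizeof mr);
    mr.addr = (uint64_t)(uintptr_t)bufs;
    mr.len = NUM_FRAMES * FRAME_SIZE;
    mr.chunk_size = FRAME_SIZE;
    mr.headroom = 0;
    setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof mr);

    /* 4) size the rx/tx rings and the umem's fill/completion rings. */
    int ring_size = RING_SIZE;
    setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size, sizeof ring_size);
    setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size, sizeof ring_size);
    setsockopt(sfd, SOL_XDP, XDP_RX_RING, &ring_size, sizeof ring_size);
    setsockopt(sfd, SOL_XDP, XDP_TX_RING, &ring_size, sizeof ring_size);

    /* Bind the XSK to the netdev's queue; the rings themselves are then
     * mmap()ed using the offsets from getsockopt(XDP_MMAP_OFFSETS). */
    struct sockaddr_xdp sxdp;
    memset(&sxdp, 0, sizeof sxdp);
    sxdp.sxdp_family = AF_XDP;
    sxdp.sxdp_ifindex = if_nametoindex(ifname);
    sxdp.sxdp_queue_id = queue_id;
    bind(sfd, (struct sockaddr *)&sxdp, sizeof sxdp);

    return sfd;
}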

Packet Flow
===========
Currently, the af_xdp eBPF program loaded onto the netdev does
nothing but forward each packet to the XSK bound to its receive
queue id (sketched below).
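
Such a program is typically just a bpf_redirect_map() into an XSKMAP
keyed by the rx queue id, as in the kernel's samples/bpf/xdpsock_kern.c.
The sketch below is illustrative only (map name, size, and section name
are assumptions); the program actually used by this series is in
bpf/xdp.h:

#include <linux/bpf.h>
#include "bpf_helpers.h"    /* SEC() etc., from the kernel samples/tools */

/* One slot per rx queue; each slot holds an AF_XDP socket fd. */
struct bpf_map_def SEC("maps") xsks_map = {
    .type        = BPF_MAP_TYPE_XSKMAP,
    .key_size    = sizeof(int),
    .value_size  = sizeof(int),
    .max_entries = 4,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
    /* Redirect the packet to the XSK bound to its receive queue. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}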

The v1 patch simplified buffer/ring management by copying from the
umem to OVS's internal buffer when receiving a packet, and by
copying the packet into the destination netdev's umem when sending
it out to another netdev.

The v2 patch implements a simple push/pop umem list, which holds
the unused umem elements.  Before receiving any packets, N elements
are popped from the list and placed into the FILL queue.  After a
batch of packets is received, e.g., 32 packets, another 32 unused
umem elements are popped to refill the FILL queue.

When there are N packets to send, N elements are popped from the
list, the packet data is copied into each umem element, and a send
is issued.  Once finished, the sent packets' umem elements are
recycled from the COMPLETION queue and pushed back onto the umem
list (see the sketch below).
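
A minimal sketch of that umem element list, assuming a fixed-size
LIFO stack of frame addresses.  The umem_elem_push/pop names match
the symbols visible in the perf profile further below, but the layout
and the lack of locking here are simplifications of whatever the
series actually does (presumably in lib/xdpsock.c):

#include <stddef.h>

#define NUM_FRAMES 1024

/* LIFO stack of unused umem frame addresses. */
struct umem_elem_stack {
    void *addrs[NUM_FRAMES];
    unsigned int count;
};

static void
umem_elem_push(struct umem_elem_stack *stack, void *addr)
{
    stack->addrs[stack->count++] = addr;
}

static void *
umem_elem_pop(struct umem_elem_stack *stack)
{
    return stack->count ? stack->addrs[--stack->count] : NULL;
}

/* RX: pop N addresses and post them on the FILL ring before receiving.
 * TX: pop N addresses, copy the packets in, issue the send, then push
 * the addresses reaped from the COMPLETION ring back onto the stack. */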

With v2, AF_XDP packet forwarding from one netdev (ovs-p0) to
another netdev (ovs-p1) goes through the following path,
considering the driver mode:
1) the xdp program at ovs-p0 copies the packet into the netdev's rx umem
2) ovs-vswitchd receives the packet from ovs-p0
3) ovs dpif-netdev parses the packet, looks up the flow table,
   and the resulting action forwards it to another afxdp port
4) ovs-vswitchd copies the packet into the umem of ovs-p1 and kicks tx
   (see the kick_tx sketch further below)
5) the kernel copies the packet from the umem to ovs-p1's tx queue

Thus, the total number of copies between two ports is 3.
I haven't tried the AF_XDP zero-copy mode driver.
Hopefully, by using AF_XDP zero-copy mode, copies 1) and 5) will
be removed, so the best case will be one copy between two
netdevs.
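
For reference, the tx kick in step 4) is just a zero-length sendto()
on the XSK, as done by kick_tx() in samples/bpf/xdpsock_user.c; this
is the sendto() traffic that dominates the l2fwd perf profile below
(sketch only, error handling omitted):

#include <sys/socket.h>

/* Ask the kernel to transmit whatever descriptors are on the XSK's
 * tx ring; a zero-length, non-blocking sendto() is the "tx kick". */
static void
kick_tx(int xsk_fd)
{
    sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}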

Performance
===========
Performance is measured by setting up 2 machines with a back-to-back
dual-port 10Gbps link using the ixgbe driver.  Each machine has an
Intel Xeon CPU E5-2440 v2 at 1.90GHz with 8 cores.
One machine sends 14Mpps, single flow, 64-byte packets, to the
other machine, which runs OVS with 2 afxdp ports.  One port is used
for receiving packets, and the other port for sending packets back.
All experiments use a single core.

Baseline performance using xdpsock (Mpps, 64-byte)
Benchmark   XDP_SKB    XDP_DRV
rxdrop       1.1        2.9 
txpush       1.5        1.5
l2fwd        1.08       1.12

With this patch, OVS with AF_XDP (Mpps, 64-byte)
Benchmark   XDP_SKB    XDP_DRV
rxdrop       1.4        3.3  *[1]
txpush       N/A        N/A
l2fwd        0.3        0.4

The rxdrop case uses the rule
  ovs-ofctl add-flow br0 "in_port=1 actions=drop"
The l2fwd case uses
  ovs-ofctl add-flow br0 "in_port=1 actions=output:2"
where ports 1 and 2 are ixgbe 10Gbps ports.

Apparently we need to find out why l2fwd is so slow.

[1] the number is higher than the baseline;
    I guess it's due to OVS's pmd-mode netdev.

Evaluation1: perf
=================
1) RX drop 3Mpps
  15.43%  pmd7          ovs-vswitchd        [.] miniflow_extract
  11.63%  pmd7          ovs-vswitchd        [.] dp_netdev_input__
   9.69%  pmd7          libc-2.23.so        [.] __clock_gettime
   7.23%  pmd7          ovs-vswitchd        [.] umem_elem_push
   7.12%  pmd7          ovs-vswitchd        [.] pmd_thread_main
   6.76%  pmd7          [vdso]              [.] __vdso_clock_gettime
   6.65%  pmd7          ovs-vswitchd        [.] netdev_linux_rxq_xsk
   6.46%  pmd7          [kernel.vmlinux]    [k] nmi
   6.36%  pmd7          ovs-vswitchd        [.] odp_execute_actions
   5.85%  pmd7          ovs-vswitchd        [.] netdev_linux_rxq_recv
   4.62%  pmd7          ovs-vswitchd        [.] time_timespec__
   3.77%  pmd7          ovs-vswitchd        [.] dp_netdev_process_rxq_port

2) L2fwd 0.4Mpps
  20.05%  pmd7          [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
  11.40%  pmd7          [kernel.vmlinux]    [k] syscall_return_via_sysret 
  11.27%  pmd7          [kernel.vmlinux]    [k] __sys_sendto
   6.10%  pmd7          [kernel.vmlinux]    [k] __fget
   4.81%  pmd7          [kernel.vmlinux]    [k] xsk_sendmsg
   3.62%  pmd7          [kernel.vmlinux]    [k] sockfd_lookup_light
   3.29%  pmd7          [kernel.vmlinux]    [k] aa_label_sk_perm
   3.16%  pmd7          libpthread-2.23.so  [.] __GI___libc_sendto
   3.07%  pmd7          [kernel.vmlinux]    [k] nmi
   2.90%  pmd7          [kernel.vmlinux]    [k] entry_SYSCALL_64_after_hwframe
   2.89%  pmd7          [kernel.vmlinux]    [k] do_syscall_64

note: lots of sendto syscalls on l2fwd due to tx;
      lots of time is spent in [kernel.vmlinux], not in ovs-vswitchd

Evaluation2: strace -c 
======================
1) RX drop 3Mpps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.94    0.000199           2       114       108 recvmsg
 23.06    0.000184          11        17           poll
 19.67    0.000157           3        54        54 accept
 14.41    0.000115           2        54        40 read
  5.39    0.000043           7         6           sendmsg
  4.76    0.000038           2        21        18 recvfrom
  3.88    0.000031           2        19           getrusage
  3.01    0.000024          24         1           restart_syscall
  0.88    0.000007           4         2           sendto
------ ----------- ----------- --------- --------- ----------------
100.00    0.000798                   288       220 total

2) L2fwd 0.4Mpps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.35   11.180104      272685        41           poll
  0.64    0.072016       72016         1           restart_syscall
  0.00    0.000256           1       264       252 recvmsg
  0.00    0.000253           2       126       126 accept
  0.00    0.000102           1       126        92 read
  0.00    0.000071           6        12           sendmsg
  0.00    0.000054           1        50        42 recvfrom
  0.00    0.000035           1        44           getrusage
  0.00    0.000015           4         4           sendto
------ ----------- ----------- --------- --------- ----------------
100.00   11.252906                   668       512 total

note: strace shows lots of poll syscalls, but using ftrace,
I saw lots of sendto syscalls.

Evaluation3: perf stat
======================
1) RX drop 3Mpps
       20,010.33 msec cpu-clock                 #    1.000 CPUs utilized          
         20,010.35 msec task-clock                #    1.000 CPUs utilized          
                93      context-switches          #    4.648 M/sec                  
       346,728,358      cache-references          # 17327754.023 M/sec                (19.06%)
            13,206      cache-misses              #    0.004 % of all cache refs      (19.11%)
    45,404,902,649      cycles                    # 2269110.577 GHz                   (19.13%)
    79,272,455,182      instructions              #    1.75  insn per cycle         
                                                  #    0.29  stalled cycles per insn  (23.88%)
    16,481,635,945      branches                  # 823669962.269 M/sec               (23.88%)
                 0      faults                    #    0.000 K/sec                  
    22,691,286,646      stalled-cycles-frontend   #   49.98% frontend cycles idle     (23.86%)
   <not supported>      stalled-cycles-backend                                      
        14,650,538      branch-misses             #    0.09% of all branches          (23.81%)
     1,982,308,381      bus-cycles                # 99065886.107 M/sec                (23.79%)
    37,669,050,133      ref-cycles                # 1882511251.024 M/sec              (28.55%)
       195,627,994      LLC-loads                 # 9776511.444 M/sec                 (23.79%)
             1,804      LLC-load-misses           #    0.00% of all LL-cache hits     (23.79%)
       150,563,094      LLC-stores                # 7524392.504 M/sec                 (9.51%)
               126      LLC-store-misses          #    6.297 M/sec                    (9.51%)
       807,489,043      LLC-prefetches            # 40354275.012 M/sec                (9.51%)
             3,794      LLC-prefetch-misses       #  189.605 M/sec                    (9.51%)
    22,972,900,217      dTLB-loads                # 1148070975.362 M/sec              (9.51%)
         4,386,366      dTLB-load-misses          #    0.02% of all dTLB cache hits   (9.51%)
    15,018,543,088      dTLB-stores               # 750551878.461 M/sec               (9.51%)
            22,187      dTLB-store-misses         # 1108.796 M/sec                    (9.51%)
   <not supported>      dTLB-prefetches                                             
   <not supported>      dTLB-prefetch-misses                                        
         5,248,771      iTLB-loads                # 262307.396 M/sec                  (9.51%)
            74,790      iTLB-load-misses          #    1.42% of all iTLB cache hits   (14.27%)

2) L2fwd 0.4Mpps
 Performance counter stats for process id '7095':

         10,005.64 msec cpu-clock                 #    1.000 CPUs utilized          
         10,005.64 msec task-clock                #    1.000 CPUs utilized          
                47      context-switches          #    4.698 M/sec                  
        33,231,952      cache-references          # 3321534.433 M/sec                 (19.09%)
             4,093      cache-misses              #    0.012 % of all cache refs      (19.13%)
    21,940,530,397      cycles                    # 2192956.561 GHz                   (19.13%)
    15,082,324,838      instructions              #    0.69  insn per cycle         
                                                  #    0.95  stalled cycles per insn  (23.89%)
     3,180,559,568      branches                  # 317897008.296 M/sec               (23.89%)
                 0      faults                    #    0.000 K/sec                  
    14,284,571,595      stalled-cycles-frontend   #   65.11% frontend cycles idle     (23.83%)
   <not supported>      stalled-cycles-backend                                      
        73,279,708      branch-misses             #    2.30% of all branches          (23.79%)
       997,678,140      bus-cycles                # 99717955.022 M/sec                (23.79%)
    18,958,361,307      ref-cycles                # 1894888686.357 M/sec              (28.54%)
        20,159,851      LLC-loads                 # 2014977.611 M/sec                 (23.79%)
               458      LLC-load-misses           #    0.00% of all LL-cache hits     (23.79%)
        11,159,143      LLC-stores                # 1115356.622 M/sec                 (9.51%)
                42      LLC-store-misses          #    4.198 M/sec                    (9.51%)
        54,652,812      LLC-prefetches            # 5462549.925 M/sec                 (9.51%)
             1,650      LLC-prefetch-misses       #  164.918 M/sec                    (9.51%)
     4,662,831,793      dTLB-loads                # 466050154.223 M/sec               (9.51%)
       137,048,612      dTLB-load-misses          #    2.94% of all dTLB cache hits   (9.51%)
     3,427,752,805      dTLB-stores               # 342603978.511 M/sec               (9.51%)
         1,083,813      dTLB-store-misses         # 108327.136 M/sec                  (9.51%)
   <not supported>      dTLB-prefetches                                             
   <not supported>      dTLB-prefetch-misses                                        
         4,982,148      iTLB-loads                # 497965.817 M/sec                  (9.51%)
        14,543,131      iTLB-load-misses          #  291.90% of all iTLB cache hits   (14.27%)

note: perf highlights 2 stats in red: stalled-cycles-frontend and iTLB-load-misses.
      I think l2fwd issues many syscalls, causing the high iTLB-load-misses.
      I don't know why, in both cases, the frontend cycles, which cover decode,
      are pretty high (49% and 65%).

Evaluation4: ovs-appctl dpif-netdev/pmd-stats-show
==================================================
1) RX drop 3Mpps
pmd thread numa_id 0 core_id 11:
	packets received: 82106476
	packet recirculations: 0
	avg. datapath passes per packet: 1.00
	emc hits: 82106156
	megaflow hits: 318
	avg. subtable lookups per megaflow hit: 1.00
	miss with success upcall: 2
	miss with failed upcall: 0
	avg. packets per output batch: 32.00
	idle cycles: 198086106617 (60.53%)
	processing cycles: 129181332198 (39.47%)
	avg cycles per packet: 3985.89 (327267438815/82106476)
	avg processing cycles per packet: 1573.34 (129181332198/82106476)

2) L2fwd 0.4Mpps
pmd thread numa_id 0 core_id 11:
	packets received: 8555669
	packet recirculations: 0
	avg. datapath passes per packet: 1.00
	emc hits: 8555509
	megaflow hits: 159
	avg. subtable lookups per megaflow hit: 1.00
	miss with success upcall: 1
	miss with failed upcall: 0
	avg. packets per output batch: 32.00
	idle cycles: 89532679391 (57.74%)
	processing cycles: 65538391726 (42.26%)
	avg cycles per packet: 18124.95 (155071071117/8555669)
	avg processing cycles per packet: 7660.23 (65538391726/8555669)

note: avg cycles per packet: 3985 vs. 18124

Next Step
=========
1) optimize the tx part as well as l2fwd
2) try the zero copy mode driver

v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd-mode netdev in OVS (huge performance improvement)
- remove malloc of dp_packet; instead, put the dp_packet in the umem

William Tu (3):
  afxdp: add ebpf code for afxdp and xskmap.
  netdev-linux: add new netdev type afxdp.
  tests: add afxdp test cases.

 acinclude.m4                    |   1 +
 bpf/api.h                       |   6 +
 bpf/helpers.h                   |   2 +
 bpf/maps.h                      |  12 +
 bpf/xdp.h                       |  42 ++-
 lib/automake.mk                 |   5 +-
 lib/bpf.c                       |  41 +-
 lib/bpf.h                       |   6 +-
 lib/dp-packet.c                 |  20 +
 lib/dp-packet.h                 |  27 +-
 lib/dpif-netdev-perf.h          |  16 +-
 lib/dpif-netdev.c               |  59 ++-
 lib/if_xdp.h                    |  79 ++++
 lib/netdev-dummy.c              |   1 +
 lib/netdev-linux.c              | 808 +++++++++++++++++++++++++++++++++++++++-
 lib/netdev-provider.h           |   2 +
 lib/netdev-vport.c              |   1 +
 lib/netdev.c                    |  11 +
 lib/netdev.h                    |   1 +
 lib/xdpsock.c                   |  70 ++++
 lib/xdpsock.h                   |  82 ++++
 tests/automake.mk               |  17 +
 tests/ofproto-macros.at         |   1 +
 tests/system-afxdp-macros.at    | 148 ++++++++
 tests/system-afxdp-testsuite.at |  25 ++
 tests/system-afxdp-traffic.at   |  38 ++
 vswitchd/bridge.c               |   1 +
 27 files changed, 1492 insertions(+), 30 deletions(-)
 create mode 100644 lib/if_xdp.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
 create mode 100644 tests/system-afxdp-traffic.at