Message ID | 1559070064-7211-1-git-send-email-u9012063@gmail.com |
---|---|
State | Superseded |
Series | [ovs-dev,PATCHv10] netdev-afxdp: add new netdev type for AF_XDP. |
On 28.05.2019 22:01, William Tu wrote: > The patch introduces experimental AF_XDP support for OVS netdev. > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket > type built upon the eBPF and XDP technology. It is aims to have comparable > performance to DPDK but cooperate better with existing kernel's networking > stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program > attached to the netdev, by-passing a couple of Linux kernel's subsystems > As a result, AF_XDP socket shows much better performance than AF_PACKET > For more details about AF_XDP, please see linux kernel's > Documentation/networking/af_xdp.rst. Note that by default, this feature is > not compiled in. > > Signed-off-by: William Tu <u9012063@gmail.com> > --- > v1->v2: > - add a list to maintain unused umem elements > - remove copy from rx umem to ovs internal buffer > - use hugetlb to reduce misses (not much difference) > - use pmd mode netdev in OVS (huge performance improve) > - remove malloc dp_packet, instead put dp_packet in umem > > v2->v3: > - rebase on the OVS master, 7ab4b0653784 > ("configure: Check for more specific function to pull in pthread library.") > - remove the dependency on libbpf and dpif-bpf. > instead, use the built-in XDP_ATTACH feature. 
> - data structure optimizations for better performance, see[1] > - more test cases support > v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html > > v3->v4: > - Use AF_XDP API provided by libbpf > - Remove the dependency on XDP_ATTACH kernel patch set > - Add documentation, bpf.rst > > v4->v5: > - rebase to master > - remove rfc, squash all into a single patch > - add --enable-afxdp, so by default, AF_XDP is not compiled > - add options: xdpmode=drv,skb > - add multiple queue and multiple PMD support, with options: n_rxq > - improve documentation, rename bpf.rst to af_xdp.rst > > v5->v6 > - rebase to master, commit 0cdd5b13de91b98 > - address errors from sparse and clang > - pass travis-ci test > - address feedback from Ben > - fix issues reported by 0-day robot > - improved documentation > > v6-v7 > - rebase to master, commit abf11558c1515bf3b1 > - address feedbacks from Ilya, Ben, and Eelco, see: > https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html > - add XDP mode change, implement get/set_config, reconfigure > - Fix reconfiguration/crash issue caused by libbpf, see patch: > [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown > - perf optimization for batching umem_push/pop > - perf optimization for batching kick_tx > - test build with dpdk > - fix/refactor atomic operation > - make AF_XDP x86 specific, otherwise fail at build time > - lots of code refactoring > - add PVP setup in documentation > > v7-v8: > - Address feedback from Ilya at: > https://protect2.fireeye.com/url?k=56282ea945466a02.5629a5e6-0b1830ef36465620&u=https://patchwork.ozlabs.org/patch/1095019/ > - add netdev-linux-private.h > - fix afxdp reconfigure issue > - sort include headers > - remove unnecessary OVS_UNUSED > - coding style fixes > - error case handling and memory leak > > v8-v9: > - rebase to master 180bbbed3a3867d52 > - Address review feedback from Ben, Ilya and Eelco, at: > 
https://protect2.fireeye.com/url?k=b08e5d041ce72fc6.b08fd64b-56e9484896ad35db&u=https://patchwork.ozlabs.org/patch/1097740/ > - == From Ilya == > - Optimize the reconfiguration logic > - Implement .rxq_recv and .send for afxdp > - Remove system-afxdp-traffic.at, reuse existing code > - Use Ilya's rdtsc code > - remove --disable-system > - == From Eelco == > - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111: > assertion !fd != !wevent failed > - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT... > - Clear xdp program when receive signal, ctrl+c > - Add options to vswitch.xml, set xdpmode default to skb-mode > - No support for ARM and PPC, now x86_64 only > - remove redundant header includes and function/macro definitions > - remove some ifdef HAVE_AF_XDP > - == From others/both about afxdp rx and tx == > - Several umem push/pop error handling improvement/fixes > - add lock to address concurrent_txq case > - improve error handling > - add stats > - Things that are not done yet > - MTU limitation > - n_txq_desc/n_rxq_desc option. 
> > v9-v10 > - remove x86_64 limitation, suggested by Ben and Eelco > - add xmalloc_pagealign, free_pagealign > - minor refector > --- > Documentation/automake.mk | 1 + > Documentation/index.rst | 1 + > Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++ > Documentation/intro/install/index.rst | 1 + > acinclude.m4 | 35 ++ > configure.ac | 1 + > lib/automake.mk | 14 + > lib/dp-packet.c | 28 ++ > lib/dp-packet.h | 18 +- > lib/dpif-netdev-perf.h | 28 ++ > lib/netdev-afxdp.c | 850 ++++++++++++++++++++++++++++++++++ > lib/netdev-afxdp.h | 74 +++ > lib/netdev-linux-private.h | 139 ++++++ > lib/netdev-linux.c | 121 ++--- > lib/netdev-provider.h | 3 + > lib/netdev.c | 11 + > lib/spinlock.h | 70 +++ > lib/util.c | 43 ++ > lib/util.h | 5 + > lib/xdpsock.c | 179 +++++++ > lib/xdpsock.h | 101 ++++ > tests/automake.mk | 16 + > tests/system-afxdp-macros.at | 20 + > tests/system-afxdp-testsuite.at | 26 ++ > vswitchd/vswitch.xml | 15 + > 25 files changed, 2150 insertions(+), 83 deletions(-) > create mode 100644 Documentation/intro/install/afxdp.rst > create mode 100644 lib/netdev-afxdp.c > create mode 100644 lib/netdev-afxdp.h > create mode 100644 lib/netdev-linux-private.h > create mode 100644 lib/spinlock.h > create mode 100644 lib/xdpsock.c > create mode 100644 lib/xdpsock.h > create mode 100644 tests/system-afxdp-macros.at > create mode 100644 tests/system-afxdp-testsuite.at > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > index 082438e09a33..11cc59efc881 100644 > --- a/Documentation/automake.mk > +++ b/Documentation/automake.mk > @@ -10,6 +10,7 @@ DOC_SOURCE = \ > Documentation/intro/why-ovs.rst \ > Documentation/intro/install/index.rst \ > Documentation/intro/install/bash-completion.rst \ > + Documentation/intro/install/afxdp.rst \ > Documentation/intro/install/debian.rst \ > Documentation/intro/install/documentation.rst \ > Documentation/intro/install/distributions.rst \ > diff --git a/Documentation/index.rst b/Documentation/index.rst > 
index 46261235c732..aa9e7c49f179 100644 > --- a/Documentation/index.rst > +++ b/Documentation/index.rst > @@ -59,6 +59,7 @@ vSwitch? Start here. > :doc:`intro/install/windows` | > :doc:`intro/install/xenserver` | > :doc:`intro/install/dpdk` | > + :doc:`intro/install/afxdp` | > :doc:`Installation FAQs <faq/releases>` > > - **Tutorials:** :doc:`tutorials/faucet` | > diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst > new file mode 100644 > index 000000000000..a2bff5733d0a > --- /dev/null > +++ b/Documentation/intro/install/afxdp.rst > @@ -0,0 +1,433 @@ > +.. > + Licensed under the Apache License, Version 2.0 (the "License"); you may > + not use this file except in compliance with the License. You may obtain > + a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, software > + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the > + License for the specific language governing permissions and limitations > + under the License. > + > + Convention for heading levels in Open vSwitch documentation: > + > + ======= Heading 0 (reserved for the title in a document) > + ------- Heading 1 > + ~~~~~~~ Heading 2 > + +++++++ Heading 3 > + ''''''' Heading 4 > + > + Avoid deeper levels because they do not render well. > + > + > +======================== > +Open vSwitch with AF_XDP > +======================== > + > +This document describes how to build and install Open vSwitch using > +AF_XDP netdev. > + > +.. warning:: > + The AF_XDP support of Open vSwitch is considered 'experimental', > + and it is not compiled in by default. > + > + > +Introduction > +------------ > +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type > +built upon the eBPF and XDP technology. 
It is aims to have comparable > +performance to DPDK but cooperate better with existing kernel's networking > +stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program > +attached to the netdev, by-passing a couple of Linux kernel's subsystems. > +As a result, AF_XDP socket shows much better performance than AF_PACKET. > +For more details about AF_XDP, please see linux kernel's > +Documentation/networking/af_xdp.rst > + > + > +AF_XDP Netdev > +------------- > +OVS has a couple of netdev types, i.e., system, tap, or > +dpdk. The AF_XDP feature adds a new netdev types called > +"afxdp", and implement its configuration, packet reception, > +and transmit functions. Since the AF_XDP socket, called xsk, > +operates in userspace, once ovs-vswitchd receives packets > +from xsk, the afxdp netdev re-uses the existing userspace > +dpif-netdev datapath. As a result, most of the packet processing > +happens at the userspace instead of linux kernel. > + > +:: > + > + | +-------------------+ > + | | ovs-vswitchd |<-->ovsdb-server > + | +-------------------+ > + | | ofproto |<-->OpenFlow controllers > + | +--------+-+--------+ > + | | netdev | |ofproto-| > + userspace | +--------+ | dpif | > + | | afxdp | +--------+ > + | | netdev | | dpif | > + | +---||---+ +--------+ > + | || | dpif- | > + | || | netdev | > + |_ || +--------+ > + || > + _ +---||-----+--------+ > + | | AF_XDP prog + | > + kernel | | xsk_map | > + |_ +--------||---------+ > + || > + physical > + NIC > + > + > +Build requirements > +------------------ > + > +In addition to the requirements described in :doc:`general`, building Open > +vSwitch with AF_XDP will require the following: > + > +- libbpf from kernel source tree (kernel 5.0.0 or later) > + > +- Linux kernel XDP support, with the following options (required) > + > + * CONFIG_BPF=y > + > + * CONFIG_BPF_SYSCALL=y > + > + * CONFIG_XDP_SOCKETS=y > + > + > +- The following optional Kconfig options are also recommended, but not > + required: > 
+
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> +  This is an OVS indepedent benchmark tools for AF_XDP.

typo: s/indepedent/independent/

> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +Frist, clone a recent version of Linux bpf-next tree::

s/Frist/First/

> +
> +    git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +    cd bpf-next/
> +    cd tools/lib/bpf/
> +    make && make install
> +    make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +    ldconfig
> +    ldconfig -p | grep libbpf
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +    ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +    make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +    uname -a # make sure having 5.0+ kernel
> +    make check-afxdp TESTSUITEFLAGS='1'
> +
> +If a test case fails, check the log at::
> +
> +    cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +    ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +    ovs-vswitchd ...
> + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev > + > +Make sure your device driver support AF_XDP, and to use 1 PMD (on core 4) > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask, > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb":: > + > + ethtool -L enp2s0 combined 1 > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=1 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:4" > + > +Or, use 4 pmds/cores and 4 queues by doing:: > + > + ethtool -L enp2s0 combined 4 > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=4 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4" > + > +.. note:: > + pmd-rxq-affinity is optional. If not specified, system will auto-assign. > + > +To validate that the bridge has successfully instantiated, you can use the:: > + > + ovs-vsctl show > + > +Should show something like:: > + > + Port "ens802f0" > + Interface "ens802f0" > + type: afxdp > + options: {n_rxq="1", xdpmode=drv} > + > +Otherwise, enable debugging by:: > + > + ovs-appctl vlog/set netdev_afxdp::dbg > + > + > +References > +---------- > +Most of the design details are described in the paper presented at > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1], > +section 4, and slides[2][4]. > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction > +about AF_XDP current and future work. 
> + > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf > + > +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf > + > +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf > + > +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp > + > + > +Performance Tuning > +------------------ > +The name of the game is to keep your CPU running in userspace, allowing PMD > +to keep polling the AF_XDP queues without any interferences from kernel. > + > +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd > + running cores, device plug-in slot) > + > +#. Isolate your CPU by doing isolcpu at grub configure. > + > +#. IRQ should not set to pmd running core. > + > +#. The Spectre and Meltdown fixes increase the overhead of system calls. > + > + > +Debugging performance issue > +~~~~~~~~~~~~~~~~~~~~~~~~~~~ > +While running the traffic, use linux perf tool to see where your cpu > +spends its cycle:: > + > + cd bpf-next/tools/perf > + make > + ./perf record -p `pidof ovs-vswitchd` sleep 10 > + ./perf report > + > +Measure your system call rate by doing:: > + > + pstree -p `pidof ovs-vswitchd` > + strace -c -p <your pmd's PID> > + > +Or, use OVS pmd tool:: > + > + ovs-appctl dpif-netdev/pmd-stats-show > + > + > +Example Script > +-------------- > + > +Below is a script using namespaces and veth peer:: > + > + #!/bin/bash > + ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \ > + --disable-system --detach \ > + ovs-vsctl -- add-br br0 -- set Bridge br0 \ > + protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \ > + fail-mode=secure datapath_type=netdev > + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev > + > + ip netns add at_ns0 > + ovs-appctl vlog/set netdev_afxdp::dbg > + > + ip link add p0 type veth peer name afxdp-p0 > + ip link set p0 netns at_ns0 > + ip link set dev afxdp-p0 up > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + 
set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp" > + > + ip netns exec at_ns0 sh << NS_EXEC_HEREDOC > + ip addr add "10.1.1.1/24" dev p0 > + ip link set dev p0 up > + NS_EXEC_HEREDOC > + > + ip netns add at_ns1 > + ip link add p1 type veth peer name afxdp-p1 > + ip link set p1 netns at_ns1 > + ip link set dev afxdp-p1 up > + > + ovs-vsctl add-port br0 afxdp-p1 -- \ > + set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp" > + ip netns exec at_ns1 sh << NS_EXEC_HEREDOC > + ip addr add "10.1.1.2/24" dev p1 > + ip link set dev p1 up > + NS_EXEC_HEREDOC > + > + ip netns exec at_ns0 ping -i .2 10.1.1.2 > + > + > +Limitations/Known Issues > +------------------------ > +#. Device's numa ID is always 0, need a way to find numa id from a netdev. > +#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A possible > + work-around is to use OpenFlow meter action. > +#. AF_XDP device added to bridge, remove, and added again will fail. > +#. Most of the tests are done using i40e single port. Multiple ports and > + also ixgbe driver also needs to be tested. > +#. No latency test result (TODO items) > + > + > +PVP using tap device > +-------------------- > +Assume you have enp2s0 as physical nic, and a tap device connected to VM. > +First, start OVS, then add physical port:: > + > + ethtool -L enp2s0 combined 1 > + ovs-vsctl set Open_vSwitch . 
other_config:pmd-cpu-mask=0x10 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=1 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:4" > + > +Start a VM with virtio and tap device:: > + > + qemu-system-x86_64 -hda ubuntu1810.qcow \ > + -m 4096 \ > + -cpu host,+x2apic -enable-kvm \ > + -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\ > + vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \ > + -netdev type=tap,id=net0,vhost=on,queues=8 \ > + -object memory-backend-file,id=mem,size=4096M,\ > + mem-path=/dev/hugepages,share=on \ > + -numa node,memdev=mem -mem-prealloc -smp 2 > + > +Create OpenFlow rules:: > + > + ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp" > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0" > + ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0" > + > +Inside the VM, use xdp_rxq_info to bounce back the traffic:: > + > + ./xdp_rxq_info --dev ens3 --action XDP_TX > + > +The performance number I got is around 1.6Mpps. > +This is due to using the kernel's tap interface, which requires copying > +packet into kernel from the umem buffer in userspace. > + > + > +PVP using vhostuser device > +-------------------------- > +First, build OVS with DPDK and AFXDP:: > + > + ./configure --enable-afxdp --with-dpdk=<dpdk path> > + make -j4 && make install > + > +Create a vhost-user port from OVS:: > + > + ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-init=true > + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \ > + other_config:pmd-cpu-mask=0xfff > + ovs-vsctl add-port br0 vhost-user-1 \ > + -- set Interface vhost-user-1 type=dpdkvhostuser > + > +Start VM using vhost-user mode:: > + > + qemu-system-x86_64 -hda ubuntu1810.qcow \ > + -m 4096 \ > + -cpu host,+x2apic -enable-kvm \ > + -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \ > + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \ > + -device virtio-net-pci,mac=00:00:00:00:00:01,\ > + netdev=mynet1,mq=on,vectors=10 \ > + -object memory-backend-file,id=mem,size=4096M,\ > + mem-path=/dev/hugepages,share=on \ > + -numa node,memdev=mem -mem-prealloc -smp 2 > + > +Setup the OpenFlow ruls:: > + > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1" > + ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0" > + > +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic:: > + > + ./xdp_rxq_info --dev ens3 --action XDP_DROP > + ./xdp_rxq_info --dev ens3 --action XDP_TX > + > +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps > + > + > +PCP container using veth > +------------------------ > +Create namespace and veth peer devices:: > + > + ip netns add at_ns0 > + ip link add p0 type veth peer name afxdp-p0 > + ip link set p0 netns at_ns0 > + ip link set dev afxdp-p0 up > + ip netns exec at_ns0 ip link set dev p0 up > + > +Attach the veth port to br0 (linux kernel mode):: > + > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + set interface afxdp-p0 options:n_rxq=1 > + > +Or, use AF_XDP with skb mode:: > + > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb > + > +Setup the OpenFlow rules:: > + > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0" > + ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0" > + > +In 
the namespace, run drop or bounce back the packet:: > + > + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP > + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX > + > +Performace: for RX_DROP: 800Kpps, TX: 700Kpps > + > + > +Bug Reporting > +------------- > + > +Please report problems to dev@openvswitch.org. > diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst > index 3193c736cf17..c27a9c9d16ff 100644 > --- a/Documentation/intro/install/index.rst > +++ b/Documentation/intro/install/index.rst > @@ -45,6 +45,7 @@ Installation from Source > xenserver > userspace > dpdk > + afxdp > > Installation from Packages > -------------------------- > diff --git a/acinclude.m4 b/acinclude.m4 > index f8fc5bcd7b4c..b9eacd7c0f3c 100644 > --- a/acinclude.m4 > +++ b/acinclude.m4 > @@ -221,6 +221,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [ > ]) > ]) > > +dnl OVS_CHECK_LINUX_AF_XDP > +dnl > +dnl Check both Linux kernel AF_XDP and libbpf support > +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [ > + AC_ARG_ENABLE([afxdp], > + [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])], > + [], [enable_afxdp=no]) > + AC_MSG_CHECKING([whether AF_XDP is enabled]) > + if test "$enable_afxdp" != yes; then > + AC_MSG_RESULT([no]) > + AF_XDP_ENABLE=false > + else > + AC_MSG_RESULT([yes]) > + AF_XDP_ENABLE=true > + > + AC_CHECK_HEADER([bpf/libbpf.h], [], > + [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([linux/if_xdp.h], [], > + [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([bpf/xsk.h], [], > + [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([bpf/libbpf_util.h], [], > + [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])]) > + > + AC_DEFINE([HAVE_AF_XDP], [1], > + [Define to 1 if AF_XDP support is available and enabled.]) > + LIBBPF_LDADD=" -lbpf -lelf" > + AC_SUBST([LIBBPF_LDADD]) > + fi > + 
AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true) > +]) > + > dnl OVS_CHECK_DPDK > dnl > dnl Configure DPDK source tree > diff --git a/configure.ac b/configure.ac > index 505e3d041e93..29c90b73f836 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX > OVS_CHECK_DOT > OVS_CHECK_IF_DL > OVS_CHECK_STRTOK_R > +OVS_CHECK_LINUX_AF_XDP > AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]]) > AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec], > [], [], [[#include <sys/stat.h>]]) > diff --git a/lib/automake.mk b/lib/automake.mk > index cc5dccf39d6b..b31e28f6e1f5 100644 > --- a/lib/automake.mk > +++ b/lib/automake.mk > @@ -14,6 +14,10 @@ if WIN32 > lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS} > endif > > +if HAVE_AF_XDP > +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD) > +endif > + > lib_libopenvswitch_la_LDFLAGS = \ > $(OVS_LTINFO) \ > -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \ > @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \ > lib/if-notifier.h \ > lib/netdev-linux.c \ > lib/netdev-linux.h \ > + lib/netdev-linux-private.h \ > lib/netdev-tc-offloads.c \ > lib/netdev-tc-offloads.h \ > lib/netlink-conntrack.c \ > @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \ > lib/tc.h > endif > > +if HAVE_AF_XDP > +lib_libopenvswitch_la_SOURCES += \ > + lib/xdpsock.c \ > + lib/xdpsock.h \ > + lib/netdev-afxdp.c \ > + lib/netdev-afxdp.h \ > + lib/spinlock.h > +endif > + > if DPDK_NETDEV > lib_libopenvswitch_la_SOURCES += \ > lib/dpdk.c \ > diff --git a/lib/dp-packet.c b/lib/dp-packet.c > index 0976a35e758b..e6a7947076b4 100644 > --- a/lib/dp-packet.c > +++ b/lib/dp-packet.c > @@ -19,6 +19,7 @@ > #include <string.h> > > #include "dp-packet.h" > +#include "netdev-afxdp.h" > #include "netdev-dpdk.h" > #include "openvswitch/dynamic-string.h" > #include "util.h" > @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated) > dp_packet_use__(b, base, 
allocated, DPBUF_MALLOC); > } > > +#if HAVE_AF_XDP > +/* Initialize 'b' as an empty dp_packet that contains > + * memory starting at AF_XDP umem base. > + */ > +void > +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated) > +{ > + dp_packet_set_base(b, base); > + dp_packet_set_data(b, base); > + dp_packet_set_size(b, 0); > + > + dp_packet_set_allocated(b, allocated); > + b->source = DPBUF_AFXDP; > + dp_packet_reset_offsets(b); > + pkt_metadata_init(&b->md, 0); > + dp_packet_reset_cutlen(b); > + dp_packet_reset_offload(b); > + b->packet_type = htonl(PT_ETH); > +} > +#endif > + > /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of > * memory starting at 'base'. 'base' should point to a buffer on the stack. > * (Nothing actually relies on 'base' being allocated on the stack. It could > @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b) > * created as a dp_packet */ > free_dpdk_buf((struct dp_packet*) b); > #endif > + } else if (b->source == DPBUF_AFXDP) { > + free_afxdp_buf(b); > } > } > } > @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom > case DPBUF_STACK: > OVS_NOT_REACHED(); > > + case DPBUF_AFXDP: > + OVS_NOT_REACHED(); > + > case DPBUF_STUB: > b->source = DPBUF_MALLOC; > new_base = xmalloc(new_allocated); > @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b) > { > void *p; > ovs_assert(b->source != DPBUF_DPDK); > + ovs_assert(b->source != DPBUF_AFXDP); > > if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) { > p = dp_packet_data(b); > diff --git a/lib/dp-packet.h b/lib/dp-packet.h > index a5e9ade1244a..e3438226e360 100644 > --- a/lib/dp-packet.h > +++ b/lib/dp-packet.h > @@ -25,6 +25,7 @@ > #include <rte_mbuf.h> > #endif > > +#include "netdev-afxdp.h" > #include "netdev-dpdk.h" > #include "openvswitch/list.h" > #include "packets.h" > @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source { > DPBUF_DPDK, /* buffer data is from 
DPDK allocated memory.
>                         * ref to dp_packet_init_dpdk() in dp-packet.c.
>                         */
> +    DPBUF_AFXDP,      /* buffer data from XDP frame */
>  };
>
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +91,13 @@ struct dp_packet {
>      };
>  };
>
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>          return;
>      }
>
> +    if (b->source == DPBUF_AFXDP) {
> +        free_afxdp_buf(b);
> +        return;
> +    }
> +
>      dp_packet_uninit(b);
>      free(b);
>  }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..a33b9a7353ba 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -21,6 +21,7 @@
>  #include <stddef.h>
>  #include <stdint.h>
>  #include <string.h>
> +#include <time.h>
>  #include <math.h>
>
>  #ifdef DPDK_NETDEV
> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>      char *log_reason;
>  };
>
> +#ifdef HAVE_AF_XDP

I'd like to change this to "#ifdef __linux__". 'clock_gettime' is posix
compliant, but CLOCK_MONOTONIC_RAW is Linux specific.

> +static inline uint64_t
> +rdtsc_syscall(struct pmd_perf_stats *s)
> +{
> +    struct timespec val;
> +    uint64_t v;
> +
> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> +       return s->last_tsc = 0;

Maybe it's better to just return the value and let the caller assign it?
That way you won't need to pass any arguments here.

> +    }
> +
> +    v = (uint64_t) val.tv_sec * 1000000000LL;
> +    v += (uint64_t) val.tv_nsec;
> +
> +    return s->last_tsc = v;
> +}
> +#endif
> +
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>   * These functions are intended to be invoked in the context of pmd threads. */
>
> @@ -198,6 +217,15 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif defined(HAVE_AF_XDP) && defined(__x86_64__)

And this should be:

#elif !defined(_MSC_VER) && defined(__x86_64__)

Visual Studio doesn't support inline assembly this way. Everything else is
portable as long as we're on x86_64.

> +    /* This is x86-specific instructions. */
> +    uint32_t h, l;
> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> +
> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> +#elif defined(HAVE_AF_XDP)

#elif defined(__linux__)

> +    /* non-x86_64 architecture uses syscall */
> +    return rdtsc_syscall(s);
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..e20ee31c00f3
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,850 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > + > +#include <config.h> > + > +#include "netdev-linux-private.h" > +#include "netdev-linux.h" > +#include "netdev-afxdp.h" > + > +#include <errno.h> > +#include <inttypes.h> > +#include <linux/rtnetlink.h> > +#include <linux/if_xdp.h> > +#include <net/if.h> > +#include <stdlib.h> > +#include <sys/resource.h> > +#include <sys/socket.h> > +#include <sys/types.h> > +#include <unistd.h> > + > +#include "dp-packet.h" > +#include "dpif-netdev.h" > +#include "openvswitch/dynamic-string.h" > +#include "openvswitch/vlog.h" > +#include "packets.h" > +#include "socket-util.h" > +#include "spinlock.h" > +#include "util.h" > +#include "xdpsock.h" > + > +#ifndef SOL_XDP > +#define SOL_XDP 283 > +#endif > + > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp); > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); > + > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base)) > +#define UMEM2XPKT(base, i) \ > + ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \ > + i * sizeof(struct dp_packet_afxdp)) > + > +static uint32_t prog_id; > +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id, > + int mode); > +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode); > +static void xsk_destroy(struct xsk_socket_info *xsk); > +static int xsk_configure_all(struct netdev *netdev); > +static void xsk_destroy_all(struct netdev *netdev); > + > +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size, > + int xdpmode) > +{ > + struct xsk_umem_config uconfig OVS_UNUSED; > + 
struct xsk_umem_info *umem; > + int ret; > + int i; > + > + umem = xcalloc(1, sizeof(*umem)); No need to parenthesize the argument of 'sizeof'. > + ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, > + NULL); > + if (ret) { > + VLOG_ERR("xsk_umem__create failed (%s) mode: %s", > + ovs_strerror(errno), > + xdpmode == XDP_COPY ? "SKB": "DRV"); > + free(umem); > + return NULL; > + } > + > + umem->buffer = buffer; > + > + /* set-up umem pool */ > + if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) { > + VLOG_ERR("umem_pool_init failed"); > + if (xsk_umem__delete(umem->umem)) { > + VLOG_ERR("xsk_umem__delete failed"); > + } > + free(umem); > + return NULL; > + } > + > + for (i = NUM_FRAMES - 1; i >= 0; i--) { > + struct umem_elem *elem; > + > + elem = ALIGNED_CAST(struct umem_elem *, > + (char *)umem->buffer + i * FRAME_SIZE); > + umem_elem_push(&umem->mpool, elem); > + } > + > + /* set-up metadata */ > + if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) { > + VLOG_ERR("xpacket_pool_init failed"); > + umem_pool_cleanup(&umem->mpool); > + if (xsk_umem__delete(umem->umem)) { > + VLOG_ERR("xsk_umem__delete failed"); > + } > + free(umem); > + return NULL; > + } > + > + VLOG_DBG("%s xpacket pool from %p to %p", __func__, > + umem->xpool.array, > + (char *)umem->xpool.array + > + NUM_FRAMES * sizeof(struct dp_packet_afxdp)); > + > + for (i = NUM_FRAMES - 1; i >= 0; i--) { > + struct dp_packet_afxdp *xpacket; > + struct dp_packet *packet; > + > + xpacket = UMEM2XPKT(umem->xpool.array, i); > + xpacket->mpool = &umem->mpool; > + > + packet = &xpacket->packet; > + packet->source = DPBUF_AFXDP; > + } > + > + return umem; > +} > + > +static struct xsk_socket_info * > +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex, > + uint32_t queue_id, int xdpmode) > +{ > + struct xsk_socket_config cfg; > + struct xsk_socket_info *xsk; > + char devname[IF_NAMESIZE]; > + uint32_t idx = 0; > + int ret; > + int i; > + > + xsk = xcalloc(1, sizeof(*xsk)); > 
+ xsk->umem = umem; > + cfg.rx_size = CONS_NUM_DESCS; > + cfg.tx_size = PROD_NUM_DESCS; > + cfg.libbpf_flags = 0; > + > + if (xdpmode == XDP_ZEROCOPY) { > + cfg.bind_flags = XDP_ZEROCOPY; > + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; > + } else { > + cfg.bind_flags = XDP_COPY; > + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; > + } > + > + if (if_indextoname(ifindex, devname) == NULL) { > + VLOG_ERR("ifindex %d to devname failed (%s)", > + ifindex, ovs_strerror(errno)); > + free(xsk); > + return NULL; > + } > + > + ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem, > + &xsk->rx, &xsk->tx, &cfg); > + if (ret) { > + VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d", > + ovs_strerror(errno), > + xdpmode == XDP_COPY ? "SKB": "DRV", > + queue_id); > + free(xsk); > + return NULL; > + } > + > + /* Make sure the built-in AF_XDP program is loaded */ > + ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags); > + if (ret) { > + VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno)); > + xsk_socket__delete(xsk->xsk); > + free(xsk); > + return NULL; > + } > + > + /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */ > + while (!xsk_ring_prod__reserve(&xsk->umem->fq, > + PROD_NUM_DESCS - BATCH_SIZE, &idx)) { > + VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue"); > + } > + > + for (i = 0; > + i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE; > + i += FRAME_SIZE) { > + struct umem_elem *elem; > + uint64_t addr; > + > + elem = umem_elem_pop(&xsk->umem->mpool); > + addr = UMEM2DESC(elem, xsk->umem->buffer); > + > + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr; > + } > + > + xsk_ring_prod__submit(&xsk->umem->fq, > + PROD_NUM_DESCS - BATCH_SIZE); > + return xsk; > +} > + > +static struct xsk_socket_info * > +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode) > +{ > + struct xsk_socket_info *xsk; > + struct xsk_umem_info *umem; > + void *bufs; > + > + /* umem 
memory region */ > + bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE); > + memset(bufs, 0, NUM_FRAMES * FRAME_SIZE); > + > + /* create AF_XDP socket */ > + umem = xsk_configure_umem(bufs, > + NUM_FRAMES * FRAME_SIZE, > + xdpmode); > + if (!umem) { > + free_pagealign(bufs); > + return NULL; > + } > + > + xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode); > + if (!xsk) { > + /* clean up umem and xpacket pool */ > + if (xsk_umem__delete(umem->umem)) { > + VLOG_ERR("xsk_umem__delete failed"); > + } > + free_pagealign(bufs); > + umem_pool_cleanup(&umem->mpool); > + xpacket_pool_cleanup(&umem->xpool); > + free(umem); > + } > + return xsk; > +} > + > +static int > +xsk_configure_all(struct netdev *netdev) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + struct xsk_socket_info *xsk; > + int i, ifindex; > + > + ifindex = linux_get_ifindex(netdev_get_name(netdev)); > + > + /* configure each queue */ > + for (i = 0; i < netdev->n_rxq; i++) { > + VLOG_INFO("%s configure queue %d mode %s", __func__, i, > + dev->xdpmode == XDP_COPY ? 
"SKB" : "DRV"); > + xsk = xsk_configure(ifindex, i, dev->xdpmode); > + if (!xsk) { > + VLOG_ERR("failed to create AF_XDP socket on queue %d", i); > + goto err; > + } > + dev->xsk[i] = xsk; > + xsk->rx_dropped = 0; > + xsk->tx_dropped = 0; > + } > + > + return 0; > + > +err: > + xsk_destroy_all(netdev); > + return EINVAL; > +} > + > +static void > +xsk_destroy(struct xsk_socket_info *xsk) > +{ > + struct xsk_umem *umem; > + > + if (!xsk) { > + return; > + } > + > + umem = xsk->umem->umem; > + xsk_socket__delete(xsk->xsk); > + if (xsk_umem__delete(umem)) { > + VLOG_ERR("xsk_umem__delete failed"); > + } > + > + /* free the packet buffer */ > + free_pagealign(xsk->umem->buffer); > + > + /* cleanup umem pool */ > + umem_pool_cleanup(&xsk->umem->mpool); > + > + /* cleanup metadata pool */ > + xpacket_pool_cleanup(&xsk->umem->xpool); > + > + free(xsk->umem); > + free(xsk); > +} > + > +static void > +xsk_destroy_all(struct netdev *netdev) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + int i, ifindex; > + > + ifindex = linux_get_ifindex(netdev_get_name(netdev)); > + > + for (i = 0; i < MAX_XSKQ; i++) { > + if (dev->xsk[i]) { > + VLOG_INFO("destroy xsk[%d]", i); > + xsk_destroy(dev->xsk[i]); > + dev->xsk[i] = NULL; > + dev->xsk[i]->rx_dropped = 0; > + dev->xsk[i]->tx_dropped = 0; Dereferencing of a just assigned NULL poiner. Something is definitely wrong here. 
> + } > + } > + VLOG_INFO("remove xdp program"); > + xsk_remove_xdp_program(ifindex, dev->xdpmode); > +} > + > +static inline void OVS_UNUSED > +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) { > + struct xdp_statistics stat; > + socklen_t optlen; > + > + optlen = sizeof stat; > + ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS, > + &stat, &optlen) == 0); > + > + VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu", > + stat.rx_dropped, > + stat.rx_invalid_descs, > + stat.tx_invalid_descs); > +} > + > +int > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, > + char **errp OVS_UNUSED) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + const char *str_xdpmode; > + int xdpmode, new_n_rxq; > + > + ovs_mutex_lock(&dev->mutex); > + new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1); > + if (new_n_rxq > MAX_XSKQ) { > + ovs_mutex_unlock(&dev->mutex); > + VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).", > + netdev_get_name(netdev), new_n_rxq, MAX_XSKQ); > + return EINVAL; > + } > + > + str_xdpmode = smap_get_def(args, "xdpmode", "skb"); > + if (!strcasecmp(str_xdpmode, "drv")) { > + xdpmode = XDP_ZEROCOPY; > + } else if (!strcasecmp(str_xdpmode, "skb")) { > + xdpmode = XDP_COPY; > + } else { > + VLOG_ERR("%s: Incorrect xdpmode (%s).", > + netdev_get_name(netdev), str_xdpmode); > + ovs_mutex_unlock(&dev->mutex); > + return EINVAL; > + } > + > + if (dev->requested_n_rxq != new_n_rxq > + || dev->requested_xdpmode != xdpmode) { > + dev->requested_n_rxq = new_n_rxq; > + dev->requested_xdpmode = xdpmode; > + netdev_request_reconfigure(netdev); > + } > + ovs_mutex_unlock(&dev->mutex); > + return 0; > +} > + > +int > +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + > + ovs_mutex_lock(&dev->mutex); > + smap_add_format(args, "n_rxq", "%d", netdev->n_rxq); > + smap_add_format(args, "xdpmode", "%s", > + 
dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb"); > + ovs_mutex_unlock(&dev->mutex); > + return 0; > +} > + > +int > +netdev_afxdp_reconfigure(struct netdev *netdev) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; > + int err = 0; > + > + ovs_mutex_lock(&dev->mutex); > + > + if (netdev->n_rxq == dev->requested_n_rxq > + && dev->xdpmode == dev->requested_xdpmode) { > + goto out; > + } > + > + xsk_destroy_all(netdev); > + netdev->n_rxq = dev->requested_n_rxq; > + > + if (dev->requested_xdpmode == XDP_ZEROCOPY) { > + VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev)); > + /* From SKB mode to DRV mode */ > + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; > + dev->xdp_bind_flags = XDP_ZEROCOPY; > + dev->xdpmode = XDP_ZEROCOPY; > + > + if (setrlimit(RLIMIT_MEMLOCK, &r)) { > + VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s", > + ovs_strerror(errno)); > + } > + } else { > + VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev)); > + /* From DRV mode to SKB mode */ > + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; > + dev->xdp_bind_flags = XDP_COPY; > + dev->xdpmode = XDP_COPY; > + /* TODO: set rlimit back to previous value > + * when no device is in DRV mode. > + */ > + } > + > + err = xsk_configure_all(netdev); > + if (err) { > + VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev)); > + } > + netdev_change_seq_changed(netdev); > +out: > + ovs_mutex_unlock(&dev->mutex); > + return err; > +} > + > +int > +netdev_afxdp_get_numa_id(const struct netdev *netdev) > +{ > + /* FIXME: Get netdev's PCIe device ID, then find > + * its NUMA node id. 
> + */ > + VLOG_INFO("FIXME: Device %s always use numa id 0", > + netdev_get_name(netdev)); > + return 0; > +} > + > +static void > +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode) > +{ > + uint32_t curr_prog_id = 0; > + uint32_t flags; > + > + /* remove_xdp_program() */ > + if (xdpmode == XDP_COPY) { > + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; > + } else { > + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; > + } > + > + if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) { > + bpf_set_link_xdp_fd(ifindex, -1, flags); > + } > + if (prog_id == curr_prog_id) { > + bpf_set_link_xdp_fd(ifindex, -1, flags); > + } else if (!curr_prog_id) { > + VLOG_INFO("couldn't find a prog id on a given interface"); > + } else { > + VLOG_INFO("program on interface changed, not removing"); > + } > +} > + > +void > +signal_remove_xdp(struct netdev *netdev) > +{> + struct netdev_linux *dev = netdev_linux_cast(netdev); > + int ifindex; > + > + ifindex = linux_get_ifindex(netdev_get_name(netdev)); > + > + VLOG_WARN("force remove xdp program"); > + xsk_remove_xdp_program(ifindex, dev->xdpmode); > +} > + > +static struct dp_packet_afxdp * > +dp_packet_cast_afxdp(const struct dp_packet *d) > +{ > + ovs_assert(d->source == DPBUF_AFXDP); > + return CONTAINER_OF(d, struct dp_packet_afxdp, packet); > +} > + > +void > +free_afxdp_buf(struct dp_packet *p) > +{ > + struct dp_packet_afxdp *xpacket; > + unsigned long addr; > + > + xpacket = dp_packet_cast_afxdp(p); > + if (xpacket->mpool) { > + void *base = dp_packet_base(p); > + > + addr = (unsigned long)base & (~FRAME_SHIFT_MASK); > + umem_elem_push(xpacket->mpool, (void *)addr); > + } > +} > + > +static void > +free_afxdp_buf_batch(struct dp_packet_batch *batch) > +{ > + struct dp_packet_afxdp *xpacket = NULL; > + struct dp_packet *packet; > + void *elems[BATCH_SIZE]; > + unsigned long addr; > + > + /* all packets are AF_XDP, so handles its own delete in batch */ This comment should be somewhere else. 
BTW, shift it right by one space.

> + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > + xpacket = dp_packet_cast_afxdp(packet); > + if (xpacket->mpool) { > + void *base = dp_packet_base(packet); > + > + addr = (unsigned long)base & (~FRAME_SHIFT_MASK);

Shouldn't this be uintptr_t? The same probably applies in some other places too.

> + elems[i] = (void *)addr; > + } > + } > + umem_elem_push_n(xpacket->mpool, batch->count, elems); > + dp_packet_batch_init(batch); > +} > + > +int > +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, > + int *qfill) > +{ > + struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); > + struct netdev *netdev = rx->up.netdev; > + struct netdev_linux *dev = netdev_linux_cast(netdev); > + struct umem_elem *elems[BATCH_SIZE]; > + uint32_t idx_rx = 0, idx_fq = 0; > + struct xsk_socket_info *xsk; > + int qid = rxq_->queue_id; > + unsigned int rcvd, i; > + int ret = 0; > + > + xsk = dev->xsk[qid]; > + rx->fd = xsk_socket__fd(xsk->xsk); > + > + /* See if there is any packet on RX queue, > + * if yes, idx_rx is the index having the packet. > + */ > + rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); > + if (!rcvd) { > + return 0; > + } > + > + ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems); > + if (OVS_UNLIKELY(ret)) {

We need to return the rx buffers to the mpool before releasing the ring entries. Otherwise they will be lost.

    for (i = 0; i < rcvd; i++) {
        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, i)->addr;
        elems[i] = xsk_umem__get_data(xsk->umem->buffer, addr);
    }
    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);

Please re-check the above code snippet before using it.

> + xsk_ring_cons__release(&xsk->rx, rcvd); > + xsk->rx_dropped += rcvd; > + return ENOMEM; > + } > + > + /* Prepare for the FILL queue */ > + if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) { > + /* The FILL queue is full, don't retry or process rx. Wait for kernel > + * to move received packets from FILL queue to RX queue.
> + */ > + umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); Same here. > + xsk_ring_cons__release(&xsk->rx, rcvd); > + xsk->rx_dropped += rcvd; > + return ENOMEM; > + } > + > + /* Setup a dp_packet batch from descriptors in RX queue */ > + for (i = 0; i < rcvd; i++) { > + uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; > + uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len; > + char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr); > + uint64_t index; > + > + struct dp_packet_afxdp *xpacket; > + struct dp_packet *packet; > + > + index = addr >> FRAME_SHIFT; > + xpacket = UMEM2XPKT(xsk->umem->xpool.array, index); > + packet = &xpacket->packet; > + > + /* Initialize the struct dp_packet */ > + dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM); > + dp_packet_set_size(packet, len); > + > + /* Add packet into batch, increase batch->count */ > + dp_packet_batch_add(batch, packet); > + > + idx_rx++; > + } > + /* Release the RX queue */ > + xsk_ring_cons__release(&xsk->rx, rcvd); > + > + for (i = 0; i < rcvd; i++) { > + uint64_t index; > + struct umem_elem *elem; > + > + /* Get one free umem, program it into FILL queue */ > + elem = elems[i]; > + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); > + ovs_assert((index & FRAME_SHIFT_MASK) == 0); > + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index; > + > + idx_fq++; > + } > + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); > + > + if (qfill) { > + /* TODO: return the number of remaining packets in the queue. */ > + *qfill = 0; > + } > + > +#ifdef AFXDP_DEBUG > + log_xsk_stat(xsk); > +#endif > + return 0; > +} > + > +static inline int > +kick_tx(struct xsk_socket_info *xsk) > +{ > + int ret; > + > + /* This causes system call into kernel's xsk_sendmsg, and > + * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode). 
> + */ > + ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); > + if (OVS_UNLIKELY(ret < 0)) { > + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { > + return errno; > + } > + } > + /* no error, or EBUSY or EAGAIN */ > + return 0; > +} > + > +static inline bool > +check_free_batch(struct dp_packet_batch *batch) > +{ > + struct umem_pool *first_mpool = NULL; > + struct dp_packet_afxdp *xpacket; > + struct dp_packet *packet; > + > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > + if (packet->source != DPBUF_AFXDP) { > + return false; > + } > + xpacket = dp_packet_cast_afxdp(packet); > + if (i == 0) { > + first_mpool = xpacket->mpool; > + continue; > + } > + if (xpacket->mpool != first_mpool) { > + return false; > + } > + } > + /* All packets are DPBUF_AFXDP and from the same mpool */ > + return true; > +} > + > +static inline void > +afxdp_complete_tx(struct xsk_socket_info *xsk) > +{ > + struct umem_elem *elems_push[BATCH_SIZE]; > + uint32_t idx_cq = 0; > + int tx_done, j, ret; > + > + if (!xsk->outstanding_tx) { > + return; > + } > + > + ret = kick_tx(xsk); > + if (OVS_UNLIKELY(ret)) { > + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", > + ovs_strerror(ret)); > + } > + > + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq); > + if (tx_done > 0) { > + xsk_ring_cons__release(&xsk->umem->cq, tx_done); > + xsk->outstanding_tx -= tx_done; > + } > + > + /* Recycle back to umem pool */ > + for (j = 0; j < tx_done; j++) { > + struct umem_elem *elem; > + uint64_t addr; > + > + addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++); > + elem = ALIGNED_CAST(struct umem_elem *, > + (char *)xsk->umem->buffer + addr); > + elems_push[j] = elem; > + } > + > + umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push); > +} > + > +int > +netdev_afxdp_batch_send(struct netdev *netdev_, int qid, > + struct dp_packet_batch *batch, > + bool concurrent_txq) > +{ > + struct netdev_linux *dev = 
netdev_linux_cast(netdev_); > + struct xsk_socket_info *xsk = dev->xsk[qid]; > + struct umem_elem *elems_pop[BATCH_SIZE]; > + struct dp_packet *packet; > + bool free_batch = true; > + uint32_t idx = 0; > + int error = 0; > + int ret; > + > + if (OVS_UNLIKELY(concurrent_txq)) { > + ovs_spin_lock(&dev->tx_lock);

Using the same lock for all queues will produce a lot of unnecessary contention. It's better to allocate an array of locks, one per tx queue. You may re-allocate it in the reconfigure() implementation.

> + } > + > + /* Process CQ first. */ > + afxdp_complete_tx(xsk); > + > + free_batch = check_free_batch(batch); > + > + ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); > + if (OVS_UNLIKELY(ret)) { > + xsk->tx_dropped += batch->count; > + error = ENOMEM; > + goto out; > + } > + > + /* Make sure we have enough TX descs */ > + ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx); > + if (OVS_UNLIKELY(ret == 0)) { > + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); > + xsk->tx_dropped += batch->count; > + error = ENOMEM; > + goto out; > + } > + > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > + struct umem_elem *elem; > + uint64_t index; > + > + elem = elems_pop[i]; > + /* Copy the packet to the umem we just pop from umem pool. > + * TODO: avoid this copy if the packet and the pop umem > + * are located in the same umem. > + */ > + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet)); > + > + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); > + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index; > + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len > + = dp_packet_size(packet); > + } > + xsk_ring_prod__submit(&xsk->tx, batch->count); > + xsk->outstanding_tx += batch->count; > + > + ret = kick_tx(xsk); > + if (OVS_UNLIKELY(ret)) { > + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);

Are we really able to re-use these buffers?
They are already in the tx ring and will probably be sent on the next kick_tx().

> + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", > + ovs_strerror(ret)); > + } > + > +out: > + if (free_batch) { > + free_afxdp_buf_batch(batch); > + } else { > + dp_packet_delete_batch(batch, true); > + } > + > + if (OVS_UNLIKELY(concurrent_txq)) { > + ovs_spin_unlock(&dev->tx_lock); > + } > + return error; > +} > + > +int > +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED) > +{ > + /* Done at reconfigure */ > + return 0; > +} > + > +void > +netdev_afxdp_destruct(struct netdev *netdev_) > +{ > + struct netdev_linux *netdev = netdev_linux_cast(netdev_); > + > + /* Note: tc is by-passed when using drv-mode, but when using > + * skb-mode, we might need to clean up tc. */ > + > + xsk_destroy_all(netdev_); > + ovs_mutex_destroy(&netdev->mutex); > +} > + > +int > +netdev_afxdp_get_stats(const struct netdev *netdev_,

You don't need an underscore here.

> + struct netdev_stats *stats) > +{ > + struct netdev_linux *dev = netdev_linux_cast(netdev_); > + struct netdev_stats dev_stats; > + struct xsk_socket_info *xsk; > + int error, i; > + > + ovs_mutex_lock(&dev->mutex); > + > + error = get_stats_via_netlink(netdev_, &dev_stats); > + if (error) { > + VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics"); > + } else { > + /* Use kernel netdev's packet and byte counts */ > + stats->rx_packets = dev_stats.rx_packets; > + stats->rx_bytes = dev_stats.rx_bytes; > + stats->tx_packets = dev_stats.tx_packets; > + stats->tx_bytes = dev_stats.tx_bytes; > + > + stats->rx_errors += dev_stats.rx_errors; > + stats->tx_errors += dev_stats.tx_errors; > + stats->rx_dropped += dev_stats.rx_dropped; > + stats->tx_dropped += dev_stats.tx_dropped; > + stats->multicast += dev_stats.multicast; > + stats->collisions += dev_stats.collisions; > + stats->rx_length_errors += dev_stats.rx_length_errors; > + stats->rx_over_errors += dev_stats.rx_over_errors; > + stats->rx_crc_errors += dev_stats.rx_crc_errors; > +
stats->rx_frame_errors += dev_stats.rx_frame_errors; > + stats->rx_fifo_errors += dev_stats.rx_fifo_errors; > + stats->rx_missed_errors += dev_stats.rx_missed_errors; > + stats->tx_aborted_errors += dev_stats.tx_aborted_errors; > + stats->tx_carrier_errors += dev_stats.tx_carrier_errors; > + stats->tx_fifo_errors += dev_stats.tx_fifo_errors; > + stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors; > + stats->tx_window_errors += dev_stats.tx_window_errors; > + > + /* Account the dropped in each xsk */ > + for (i = 0; i < MAX_XSKQ; i++) { i < netdev_n_rxq(netdev) > + xsk = dev->xsk[i]; > + if (xsk) { > + stats->rx_dropped += xsk->rx_dropped; > + stats->tx_dropped += xsk->tx_dropped; > + } > + } > + } > + ovs_mutex_unlock(&dev->mutex); > + > + return error; > +} > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h > new file mode 100644 > index 000000000000..dd2dc1a2064d > --- /dev/null > +++ b/lib/netdev-afxdp.h > @@ -0,0 +1,74 @@ > +/* > + * Copyright (c) 2018, 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > + > +#ifndef NETDEV_AFXDP_H > +#define NETDEV_AFXDP_H 1 > + > +#include <config.h> > + > +#ifdef HAVE_AF_XDP > + > +#include <stdint.h> > +#include <stdbool.h> > + > +/* These functions are Linux AF_XDP specific, so they should be used directly > + * only by Linux-specific code. 
*/ > + > +#define MAX_XSKQ 16 > + > +struct netdev; > +struct xsk_socket_info; > +struct xdp_umem; > +struct dp_packet_batch; > +struct smap; > +struct dp_packet; > +struct netdev_rxq; > +struct netdev_stats; > + > +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_); > +void netdev_afxdp_destruct(struct netdev *netdev_); > + > +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, > + struct dp_packet_batch *batch, > + int *qfill); > +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid, > + struct dp_packet_batch *batch, > + bool concurrent_txq); > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, > + char **errp); > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args); > +int netdev_afxdp_get_numa_id(const struct netdev *netdev); > +int netdev_afxdp_get_stats(const struct netdev *netdev_, > + struct netdev_stats *stats); > + > +void free_afxdp_buf(struct dp_packet *p); > +int netdev_afxdp_reconfigure(struct netdev *netdev); > +void signal_remove_xdp(struct netdev *netdev); > + > +#else /* !HAVE_AF_XDP */ > + > +#include "openvswitch/compiler.h" > + > +struct dp_packet; > + > +static inline void > +free_afxdp_buf(struct dp_packet *p OVS_UNUSED) > +{ > + /* Nothing */ > +} > + > +#endif /* HAVE_AF_XDP */ > +#endif /* netdev-afxdp.h */ > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h > new file mode 100644 > index 000000000000..d43f79e6aa41 > --- /dev/null > +++ b/lib/netdev-linux-private.h > @@ -0,0 +1,139 @@ > +/* > + * Copyright (c) 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. 
> + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > + > +#ifndef NETDEV_LINUX_PRIVATE_H > +#define NETDEV_LINUX_PRIVATE_H 1 > + > +#include <config.h> > + > +#include <linux/filter.h> > +#include <linux/gen_stats.h> > +#include <linux/if_ether.h> > +#include <linux/if_tun.h> > +#include <linux/types.h> > +#include <linux/ethtool.h> > +#include <linux/mii.h> > +#include <stdint.h> > +#include <stdbool.h> > + > +#include "netdev-afxdp.h" > +#include "netdev-provider.h" > +#include "netdev-tc-offloads.h" > +#include "netdev-vport.h" > +#include "openvswitch/thread.h" > +#include "ovs-atomic.h" > +#include "timer.h" > +#include "xdpsock.h" > + > +/* These functions are Linux specific, so they should be used directly only by > + * Linux-specific code. */ > + > +struct netdev; > + > +struct netdev_rxq_linux { > + struct netdev_rxq up; > + bool is_tap; > + int fd; > +}; > + > +void netdev_linux_run(const struct netdev_class *); > + > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, > + const char *flag_name, bool enable); > + > +int get_stats_via_netlink(const struct netdev *netdev_, > + struct netdev_stats *stats); > + > +struct netdev_linux { > + struct netdev up; > + > + /* Protects all members below. */ > + struct ovs_mutex mutex; > + > + unsigned int cache_valid; > + > + bool miimon; /* Link status of last poll. */ > + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ > + struct timer miimon_timer; > + > + int netnsid; /* Network namespace ID. */ > + /* The following are figured out "on demand" only. 
They are only valid > + * when the corresponding VALID_* bit in 'cache_valid' is set. */ > + int ifindex; > + struct eth_addr etheraddr; > + int mtu; > + unsigned int ifi_flags; > + long long int carrier_resets; > + uint32_t kbits_rate; /* Policing data. */ > + uint32_t kbits_burst; > + int vport_stats_error; /* Cached error code from vport_get_stats(). > + 0 or an errno value. */ > + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU > + * or SIOCSIFMTU. > + */ > + int ether_addr_error; /* Cached error code from set/get etheraddr. */ > + int netdev_policing_error; /* Cached error code from set policing. */ > + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ > + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ > + > + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ > + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ > + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ > + > + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ > + struct tc *tc; > + > + /* For devices of class netdev_tap_class only. */ > + int tap_fd; > + bool present; /* If the device is present in the namespace */ > + uint64_t tx_dropped; /* tap device can drop if the iface is down */ > + > + /* LAG information. */ > + bool is_lag_master; /* True if the netdev is a LAG master. */ > + > + /* AF_XDP information */ > +#ifdef HAVE_AF_XDP > + struct xsk_socket_info *xsk[MAX_XSKQ]; You may allocate this array dynamically based on the n_rxq while performing reconfiguration. This way you will also have no limit on the number of rxqs. > + int requested_n_rxq; > + int xdpmode, requested_xdpmode; /* detect mode changed */ > + int xdp_flags, xdp_bind_flags; > + ovs_spinlock_t tx_lock; This also should be an array to avoid unnecessary contention. 
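
For illustration only, the per-queue lock array suggested above could look roughly like the sketch below. It is self-contained and uses C11 atomics as a stand-in for ovs_spinlock_t. `demo_dev`, `demo_spin`, `demo_reconfigure_locks()` and the lock helpers are hypothetical names made up for this example, and it assumes no thread holds a lock while the array is being re-allocated. The same pattern would apply to a dynamically sized xsk[] array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal spinlock; the real code would use ovs_spinlock_t. */
struct demo_spin {
    atomic_bool locked;
};

/* Hypothetical device state: one lock per tx queue instead of one
 * device-wide lock. */
struct demo_dev {
    int n_txq;
    struct demo_spin *tx_locks;
};

/* Re-allocates the lock array for 'n_txq' queues, as a reconfigure()
 * implementation might do.  Returns 0 on success, -1 on bad input or
 * allocation failure (the old array is kept in that case). */
static int
demo_reconfigure_locks(struct demo_dev *dev, int n_txq)
{
    struct demo_spin *locks;

    if (n_txq < 1) {
        return -1;
    }
    locks = malloc((size_t) n_txq * sizeof *locks);
    if (!locks) {
        return -1;
    }
    for (int i = 0; i < n_txq; i++) {
        atomic_init(&locks[i].locked, false);
    }
    free(dev->tx_locks);
    dev->tx_locks = locks;
    dev->n_txq = n_txq;
    return 0;
}

/* Tries to take the lock of queue 'qid' only; other queues are unaffected. */
static bool
demo_txq_trylock(struct demo_dev *dev, int qid)
{
    assert(qid >= 0 && qid < dev->n_txq);
    return !atomic_exchange_explicit(&dev->tx_locks[qid].locked, true,
                                     memory_order_acquire);
}

/* Spins until the lock of queue 'qid' is acquired. */
static void
demo_txq_lock(struct demo_dev *dev, int qid)
{
    while (!demo_txq_trylock(dev, qid)) {
        /* Busy-wait. */
    }
}

/* Releases the lock of queue 'qid'. */
static void
demo_txq_unlock(struct demo_dev *dev, int qid)
{
    assert(qid >= 0 && qid < dev->n_txq);
    atomic_store_explicit(&dev->tx_locks[qid].locked, false,
                          memory_order_release);
}
```

With this layout, two threads sending on different queue ids never contend with each other. Only senders sharing a queue serialize.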
> +#endif > +}; > + > +static bool > +is_netdev_linux_class(const struct netdev_class *netdev_class) > +{ > + return netdev_class->run == netdev_linux_run; > +} > + > +static struct netdev_linux * > +netdev_linux_cast(const struct netdev *netdev) > +{ > + ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); > + > + return CONTAINER_OF(netdev, struct netdev_linux, up); > +} > + > +static struct netdev_rxq_linux * > +netdev_rxq_linux_cast(const struct netdev_rxq *rx) > +{ > + ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); > + > + return CONTAINER_OF(rx, struct netdev_rxq_linux, up); > +} > + > +#endif /* netdev-linux-private.h */ > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c > index f75d73fd39f8..2883cf1f2586 100644 > --- a/lib/netdev-linux.c > +++ b/lib/netdev-linux.c > @@ -17,6 +17,7 @@ > #include <config.h> > > #include "netdev-linux.h" > +#include "netdev-linux-private.h" > > #include <errno.h> > #include <fcntl.h> > @@ -54,6 +55,7 @@ > #include "fatal-signal.h" > #include "hash.h" > #include "openvswitch/hmap.h" > +#include "netdev-afxdp.h" > #include "netdev-provider.h" > #include "netdev-tc-offloads.h" > #include "netdev-vport.h" > @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu); > static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu); > static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes); > > -struct netdev_linux { > - struct netdev up; > - > - /* Protects all members below. */ > - struct ovs_mutex mutex; > - > - unsigned int cache_valid; > - > - bool miimon; /* Link status of last poll. */ > - long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ > - struct timer miimon_timer; > - > - int netnsid; /* Network namespace ID. */ > - /* The following are figured out "on demand" only. They are only valid > - * when the corresponding VALID_* bit in 'cache_valid' is set. 
*/ > - int ifindex; > - struct eth_addr etheraddr; > - int mtu; > - unsigned int ifi_flags; > - long long int carrier_resets; > - uint32_t kbits_rate; /* Policing data. */ > - uint32_t kbits_burst; > - int vport_stats_error; /* Cached error code from vport_get_stats(). > - 0 or an errno value. */ > - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */ > - int ether_addr_error; /* Cached error code from set/get etheraddr. */ > - int netdev_policing_error; /* Cached error code from set policing. */ > - int get_features_error; /* Cached error code from ETHTOOL_GSET. */ > - int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ > - > - enum netdev_features current; /* Cached from ETHTOOL_GSET. */ > - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ > - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ > - > - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ > - struct tc *tc; > - > - /* For devices of class netdev_tap_class only. */ > - int tap_fd; > - bool present; /* If the device is present in the namespace */ > - uint64_t tx_dropped; /* tap device can drop if the iface is down */ > - > - /* LAG information. */ > - bool is_lag_master; /* True if the netdev is a LAG master. */ > -}; > - > -struct netdev_rxq_linux { > - struct netdev_rxq up; > - bool is_tap; > - int fd; > -}; > > /* This is set pretty low because we probably won't learn anything from the > * additional log messages. */ > @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); > * changes in the device miimon status, so we can use atomic_count. 
*/ > static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); > > -static void netdev_linux_run(const struct netdev_class *); > - > static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *, > int cmd, const char *cmd_name); > static int get_flags(const struct netdev *, unsigned int *flags); > @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev, > struct in_addr addr); > static int get_etheraddr(const char *netdev_name, struct eth_addr *ea); > static int set_etheraddr(const char *netdev_name, const struct eth_addr); > -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *); > static int af_packet_sock(void); > static bool netdev_linux_miimon_enabled(void); > static void netdev_linux_miimon_run(void); > @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void); > static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup); > > static bool > -is_netdev_linux_class(const struct netdev_class *netdev_class) > -{ > - return netdev_class->run == netdev_linux_run; > -} > - > -static bool > is_tap_netdev(const struct netdev *netdev) > { > return netdev_get_class(netdev) == &netdev_tap_class; > } > - > -static struct netdev_linux * > -netdev_linux_cast(const struct netdev *netdev) > -{ > - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); > - > - return CONTAINER_OF(netdev, struct netdev_linux, up); > -} > - > -static struct netdev_rxq_linux * > -netdev_rxq_linux_cast(const struct netdev_rxq *rx) > -{ > - ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); > - return CONTAINER_OF(rx, struct netdev_rxq_linux, up); > -} > > static int > netdev_linux_netnsid_update__(struct netdev_linux *netdev) > @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change) > } > } > > -static void > +void > netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED) > { > struct nl_sock *sock; > @@ -3279,9 +3206,7 @@ exit: > .run = netdev_linux_run, \ > .wait = 
netdev_linux_wait, \ > .alloc = netdev_linux_alloc, \ > - .destruct = netdev_linux_destruct, \ > .dealloc = netdev_linux_dealloc, \ > - .send = netdev_linux_send, \ > .send_wait = netdev_linux_send_wait, \ > .set_etheraddr = netdev_linux_set_etheraddr, \ > .get_etheraddr = netdev_linux_get_etheraddr, \ > @@ -3312,10 +3237,8 @@ exit: > .arp_lookup = netdev_linux_arp_lookup, \ > .update_flags = netdev_linux_update_flags, \ > .rxq_alloc = netdev_linux_rxq_alloc, \ > - .rxq_construct = netdev_linux_rxq_construct, \ > .rxq_destruct = netdev_linux_rxq_destruct, \ > .rxq_dealloc = netdev_linux_rxq_dealloc, \ > - .rxq_recv = netdev_linux_rxq_recv, \ > .rxq_wait = netdev_linux_rxq_wait, \ > .rxq_drain = netdev_linux_rxq_drain > > @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = { > NETDEV_LINUX_CLASS_COMMON, > LINUX_FLOW_OFFLOAD_API, > .type = "system", > + .is_pmd = false, > .construct = netdev_linux_construct, > + .destruct = netdev_linux_destruct, > .get_stats = netdev_linux_get_stats, > .get_features = netdev_linux_get_features, > .get_status = netdev_linux_get_status, > - .get_block_id = netdev_linux_get_block_id > + .get_block_id = netdev_linux_get_block_id, > + .send = netdev_linux_send, > + .rxq_construct = netdev_linux_rxq_construct, > + .rxq_recv = netdev_linux_rxq_recv, > }; > > const struct netdev_class netdev_tap_class = { > NETDEV_LINUX_CLASS_COMMON, > .type = "tap", > + .is_pmd = false, > .construct = netdev_linux_construct_tap, > + .destruct = netdev_linux_destruct, > .get_stats = netdev_tap_get_stats, > .get_features = netdev_linux_get_features, > .get_status = netdev_linux_get_status, > + .send = netdev_linux_send, > + .rxq_construct = netdev_linux_rxq_construct, > + .rxq_recv = netdev_linux_rxq_recv, > }; > > const struct netdev_class netdev_internal_class = { > NETDEV_LINUX_CLASS_COMMON, > LINUX_FLOW_OFFLOAD_API, > .type = "internal", > + .is_pmd = false, > .construct = netdev_linux_construct, > + .destruct = netdev_linux_destruct, 
> .get_stats = netdev_internal_get_stats, > .get_status = netdev_internal_get_status, > + .send = netdev_linux_send, > + .rxq_construct = netdev_linux_rxq_construct, > + .rxq_recv = netdev_linux_rxq_recv, > }; > + > +#ifdef HAVE_AF_XDP > +const struct netdev_class netdev_afxdp_class = { > + NETDEV_LINUX_CLASS_COMMON, > + .type = "afxdp", > + .is_pmd = true, > + .construct = netdev_linux_construct, > + .destruct = netdev_afxdp_destruct, > + .get_stats = netdev_afxdp_get_stats, > + .get_status = netdev_linux_get_status, > + .set_config = netdev_afxdp_set_config, > + .get_config = netdev_afxdp_get_config, > + .reconfigure = netdev_afxdp_reconfigure, > + .get_numa_id = netdev_afxdp_get_numa_id, > + .send = netdev_afxdp_batch_send, > + .rxq_construct = netdev_afxdp_rxq_construct, > + .rxq_recv = netdev_afxdp_rxq_recv, > +}; > +#endif > > > #define CODEL_N_QUEUES 0x0000 > @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst, > dst->tx_window_errors = src->tx_window_errors; > } > > -static int > +int > get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats) > { > struct ofpbuf request; > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h > index fb0c27e6e8e8..91e6a9e2bfc0 100644 > --- a/lib/netdev-provider.h > +++ b/lib/netdev-provider.h > @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class; > extern const struct netdev_class netdev_internal_class; > extern const struct netdev_class netdev_tap_class; > > +#ifdef HAVE_AF_XDP > +extern const struct netdev_class netdev_afxdp_class; > +#endif > #ifdef __cplusplus > } > #endif > diff --git a/lib/netdev.c b/lib/netdev.c > index 7d7ecf6f0946..0fac117cc602 100644 > --- a/lib/netdev.c > +++ b/lib/netdev.c > @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); > > static void restore_all_flags(void *aux OVS_UNUSED); > void update_device_args(struct netdev *, const struct shash *args); > +#ifdef HAVE_AF_XDP > +void 
signal_remove_xdp(struct netdev *netdev); > +#endif > > int > netdev_n_txq(const struct netdev *netdev) > @@ -146,6 +149,9 @@ netdev_initialize(void) > netdev_register_provider(&netdev_internal_class); > netdev_register_provider(&netdev_tap_class); > netdev_vport_tunnel_register(); > +#ifdef HAVE_AF_XDP > + netdev_register_provider(&netdev_afxdp_class); > +#endif > #endif > #if defined(__FreeBSD__) || defined(__NetBSD__) > netdev_register_provider(&netdev_tap_class); > @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED) > saved_flags & ~saved_values, > &old_flags); > } > +#ifdef HAVE_AF_XDP > + if (netdev->netdev_class == &netdev_afxdp_class) { > + signal_remove_xdp(netdev); > + } > +#endif > } > } > > diff --git a/lib/spinlock.h b/lib/spinlock.h > new file mode 100644 > index 000000000000..17d79f217410 > --- /dev/null > +++ b/lib/spinlock.h > @@ -0,0 +1,70 @@ > +/* > + * Copyright (c) 2018, 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > +#ifndef SPINLOCK_H > +#define SPINLOCK_H 1 > + > +#include <config.h> > + > +#include <ctype.h> > +#include <errno.h> > +#include <fcntl.h> > +#include <stdarg.h> > +#include <stdlib.h> > +#include <unistd.h> > + > +#include "ovs-atomic.h" > + > +typedef struct { It's probably better to not use 'typedef'. OVS doesn't use typedefs for structures, unions and enums usually. For example we have no typedef for 'struct ovs_mutex'. 
So, this should be just 'struct ovs_spinlock'. We may also add some annotations like OVS_LOCKABLE and clang thread safety annotations: OVS_ACQUIRES, OVS_TRY_LOCK, OVS_RELEASES. However, this could be done later. > + atomic_int locked; > +} ovs_spinlock_t;> + > +static inline void > +ovs_spinlock_init(ovs_spinlock_t *sl) > +{ > + atomic_init(&sl->locked, 0); > +} > + > +static inline void > +ovs_spin_lock(ovs_spinlock_t *sl) > +{ > + int exp = 0, locked = 0; > + > + while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, > + memory_order_acquire, > + memory_order_relaxed)) { > + locked = 1; > + while (locked) { > + atomic_read_relaxed(&sl->locked, &locked); > + } > + exp = 0; > + } > +} > + > +static inline void > +ovs_spin_unlock(ovs_spinlock_t *sl) > +{ > + atomic_store_explicit(&sl->locked, 0, memory_order_release); > +} > + > +static inline int OVS_UNUSED Not sure that we need UNUSED annotation since we're in header now. > +ovs_spin_trylock(ovs_spinlock_t *sl) > +{ > + int exp = 0; > + return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, > + memory_order_acquire, > + memory_order_relaxed); > +} > +#endif > diff --git a/lib/util.c b/lib/util.c > index 5679232ffc5f..060b1e287bce 100644 > --- a/lib/util.c > +++ b/lib/util.c > @@ -277,6 +277,49 @@ free_cacheline(void *p) > #endif > } > > +#ifdef HAVE_AF_XDP I don't think that we need 'ifdef' here. How about re-naming 'xmalloc_cacheline' to 'xmalloc_size_align' making it allocate memory aligned to a specified size and in a dedicated cachelines? And implement two functions: xmalloc_cacheline(size) { return xmalloc_size_align(size, CACHE_LINE_SIZE); } xmalloc_pagealign(size) { return xmalloc_size_align(size, get_page_size()); } > +void * > +xmalloc_pagealign(size_t size) > +{ > +#ifdef HAVE_POSIX_MEMALIGN > + void *p; > + int error; > + > + COVERAGE_INC(util_xalloc); > + error = posix_memalign(&p, get_page_size(), size ? 
size : 1); > + if (error != 0) { > + out_of_memory(); > + } > + return p; > +#else > + /* Similar to xmalloc_cacheline, but replace > + * CACHE_LINE_SIZE with get_page_size() */ > + void *p = xmalloc((get_page_size() - 1) > + + sizeof(void *) > + + ROUND_UP(size, get_page_size())); I think that you don't need to round up to a page size. You probably need to round up to CACHE_LINE_SIZE instead. There is no point in allocating that much extra memory. The code below should be re-checked too. > + bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *); > + void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? get_page_size() : 0), > + get_page_size()); > + void **q = (void **) r - 1; > + *q = p; > + return r; > +#endif > +} > + > +void > +free_pagealign(void *p) > +{ > +#ifdef HAVE_POSIX_MEMALIGN > + free(p); > +#else > + if (p) { > + void **q = (void **) p - 1; > + free(*q); > + } > +#endif > +} > +#endif > + > char * > xasprintf(const char *format, ...) > { > diff --git a/lib/util.h b/lib/util.h > index 53354f1c6f0f..3cd8cf87fba8 100644 > --- a/lib/util.h > +++ b/lib/util.h > @@ -163,6 +163,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size); > > int string_ends_with(const char *str, const char *suffix); > > +#ifdef HAVE_AF_XDP > +void *xmalloc_pagealign(size_t) MALLOC_LIKE; > +void free_pagealign(void *); > +#endif > + > /* The C standards say that neither the 'dst' nor 'src' argument to > * memcpy() may be null, even if 'n' is zero. This wrapper tolerates > * the null case. */ > diff --git a/lib/xdpsock.c b/lib/xdpsock.c > new file mode 100644 > index 000000000000..ffdb54dfcd27 > --- /dev/null > +++ b/lib/xdpsock.c > @@ -0,0 +1,179 @@ > +/* > + * Copyright (c) 2018, 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > +#include <config.h> > + > +#include "xdpsock.h" > +#include "dp-packet.h" > +#include "openvswitch/compiler.h" > + > +/* Note: > + * umem_elem_push* shouldn't overflow because we always pop > + * elem first, then push back to the stack. > + */ > +static inline void > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) > +{ > + void *ptr; > + > + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) { > + OVS_NOT_REACHED(); > + } > + > + ptr = &umemp->array[umemp->index]; > + memcpy(ptr, addrs, n * sizeof(void *)); > + umemp->index += n; > +} > + > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) > +{ > + ovs_spin_lock(&umemp->mutex); > + __umem_elem_push_n(umemp, n, addrs); > + ovs_spin_unlock(&umemp->mutex); > +} > + > +static inline void > +__umem_elem_push(struct umem_pool *umemp, void *addr) > +{ > + if (OVS_UNLIKELY(umemp->index + 1) > umemp->size) { > + OVS_NOT_REACHED(); > + } > + > + umemp->array[umemp->index++] = addr; > +} > + > +void > +umem_elem_push(struct umem_pool *umemp, void *addr) > +{ > + > + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0); > + > + ovs_spin_lock(&umemp->mutex); > + __umem_elem_push(umemp, addr); > + ovs_spin_unlock(&umemp->mutex); > +} > + > +static inline int > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) > +{ > + void *ptr; > + > + if (OVS_UNLIKELY(umemp->index - n < 0)) { > + return -ENOMEM; > + } > + > + umemp->index -= n; > + ptr = &umemp->array[umemp->index]; > + memcpy(addrs, ptr, n * sizeof(void *)); > + > + return 0; > +} 
> + > +int > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) > +{ > + int ret; > + > + ovs_spin_lock(&umemp->mutex); > + ret = __umem_elem_pop_n(umemp, n, addrs); > + ovs_spin_unlock(&umemp->mutex); > + > + return ret; > +} > + > +static inline void * > +__umem_elem_pop(struct umem_pool *umemp) > +{ > + if (OVS_UNLIKELY(umemp->index - 1 < 0)) { > + return NULL; > + } > + > + return umemp->array[--umemp->index]; > +} > + > +void * > +umem_elem_pop(struct umem_pool *umemp) > +{ > + void *ptr; > + > + ovs_spin_lock(&umemp->mutex); > + ptr = __umem_elem_pop(umemp); > + ovs_spin_unlock(&umemp->mutex); > + > + return ptr; > +} > + > +static void ** > +__umem_pool_alloc(unsigned int size) > +{ > + void *bufs; > + int ret; > + > + ret = posix_memalign(&bufs, getpagesize(), > + size * sizeof(void *)); xmalloc_pagealign ? > + if (ret) { > + return NULL; > + } > + > + memset(bufs, 0, size * sizeof(void *)); > + return (void **)bufs; > +} > + > +int > +umem_pool_init(struct umem_pool *umemp, unsigned int size) > +{ > + umemp->array = __umem_pool_alloc(size); > + if (!umemp->array) { > + return -ENOMEM; > + } > + > + umemp->size = size; > + umemp->index = 0; > + ovs_spinlock_init(&umemp->mutex); > + return 0; > +} > + > +void > +umem_pool_cleanup(struct umem_pool *umemp) > +{ > + free(umemp->array); free_pagealign ? 
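To illustrate the xmalloc_pagealign / free_pagealign suggestion, here is a minimal standalone sketch, not the patch's code. It assumes posix_memalign() is available, calls abort() where OVS would call out_of_memory(), assumes a 64-byte CACHE_LINE_SIZE, and uses a hypothetical helper name umem_pool_alloc. xmalloc_size_align is the name proposed earlier in this review; its body here is only a guess at what it could look like.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CACHE_LINE_SIZE 64   /* Assumption: OVS derives this per platform. */

/* Allocates 'size' bytes aligned to 'alignment'.  posix_memalign() needs a
 * power-of-two alignment that is a multiple of sizeof(void *). */
static void *
xmalloc_size_align(size_t size, size_t alignment)
{
    void *p;

    assert(alignment >= sizeof(void *)
           && (alignment & (alignment - 1)) == 0);
    if (posix_memalign(&p, alignment, size ? size : 1)) {
        abort();             /* Stand-in for OVS out_of_memory(). */
    }
    return p;
}

static void *
xmalloc_cacheline(size_t size)
{
    return xmalloc_size_align(size, CACHE_LINE_SIZE);
}

static void *
xmalloc_pagealign(size_t size)
{
    return xmalloc_size_align(size, (size_t) sysconf(_SC_PAGESIZE));
}

/* Memory from posix_memalign() is released with plain free(). */
static void
free_pagealign(void *p)
{
    free(p);
}

/* Hypothetical pool allocation that always pairs with free_pagealign(). */
static void **
umem_pool_alloc(unsigned int size)
{
    void **bufs = xmalloc_pagealign(size * sizeof *bufs);

    memset(bufs, 0, size * sizeof *bufs);
    return bufs;
}
```

With helpers like these, the pool init and cleanup paths use one matched allocate/free pair, and the non-posix_memalign fallback only has to be maintained in one place.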
> + umemp->array = NULL; > +} > + > +/* AF_XDP metadata init/destroy */ > +int > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size) > +{ > + void *bufs; > + int ret; > + > + ret = posix_memalign(&bufs, getpagesize(), > + size * sizeof(struct dp_packet_afxdp)); > + if (ret) { > + return -ENOMEM; > + } > + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp)); > + > + xp->array = bufs; > + xp->size = size; > + return 0; > +} > + > +void > +xpacket_pool_cleanup(struct xpacket_pool *xp) > +{ > + free(xp->array); > + xp->array = NULL; > +} > diff --git a/lib/xdpsock.h b/lib/xdpsock.h > new file mode 100644 > index 000000000000..72578e383812 > --- /dev/null > +++ b/lib/xdpsock.h > @@ -0,0 +1,101 @@ > +/* > + * Copyright (c) 2018, 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. 
> + */ > + > +#ifndef XDPSOCK_H > +#define XDPSOCK_H 1 > + > +#include <config.h> > + > +#ifdef HAVE_AF_XDP > + > +#include <bpf/xsk.h> > +#include <errno.h> > +#include <stdbool.h> > +#include <stdio.h> > + > +#include "openvswitch/thread.h" > +#include "ovs-atomic.h" > +#include "spinlock.h" > + > +#define FRAME_HEADROOM XDP_PACKET_HEADROOM > +#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE > +#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT > +#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1) > + > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS > + > +/* The worst case is all 4 queues TX/CQ/RX/FILL are full. > + * Setting NUM_FRAMES to this makes sure umem_pop always successes. > + */ > +#define NUM_FRAMES (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)) > + > +#define BATCH_SIZE NETDEV_MAX_BURST > + > +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES)); > +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS); > +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)); > + > +/* LIFO ptr_array */ > +struct umem_pool { > + int index; /* point to top */ > + unsigned int size; > + ovs_spinlock_t mutex; It's a bit confusing to name it a 'mutex'. Sounds like it's 'ovs_mutex'. Probably, it'll be better to name it 'spinlock' or just 'lock'. 
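For illustration, a sketch of how the spinlock and the pool could look without the typedef and with the member renamed to 'lock'. This is only a sketch under assumptions: it uses C11 <stdatomic.h> directly instead of ovs-atomic.h so that it compiles on its own, it leaves out the OVS_LOCKABLE / OVS_ACQUIRES style annotations, and it makes trylock return bool (true when the lock was taken), which is a choice made here, not something the patch defines.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Plain struct, no typedef, in line with 'struct ovs_mutex'. */
struct ovs_spinlock {
    atomic_int locked;
};

/* LIFO pointer array; the member is called 'lock' so that it is not
 * mistaken for an ovs_mutex. */
struct umem_pool {
    int index;                  /* Points to the top of the stack. */
    unsigned int size;
    struct ovs_spinlock lock;
    void **array;               /* Pointers to umem buffers. */
};

static inline void
ovs_spinlock_init(struct ovs_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

static inline void
ovs_spin_lock(struct ovs_spinlock *sl)
{
    int exp = 0;

    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Spin on a relaxed load until the lock looks free, then retry. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
        exp = 0;
    }
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool
ovs_spin_trylock(struct ovs_spinlock *sl)
{
    int exp = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
ovs_spin_unlock(struct ovs_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Callers would then write ovs_spin_lock(&umemp->lock) and ovs_spin_unlock(&umemp->lock) around the push and pop helpers.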
> + void **array; /* a pointer array, point to umem buf */ > +}; > + > +/* array-based dp_packet_afxdp */ > +struct xpacket_pool { > + unsigned int size; > + struct dp_packet_afxdp **array; > +}; > + > +struct xsk_umem_info { > + struct umem_pool mpool; > + struct xpacket_pool xpool; > + struct xsk_ring_prod fq; > + struct xsk_ring_cons cq; > + struct xsk_umem *umem; > + void *buffer; > +}; > + > +struct xsk_socket_info { > + struct xsk_ring_cons rx; > + struct xsk_ring_prod tx; > + struct xsk_umem_info *umem; > + struct xsk_socket *xsk; > + unsigned long rx_dropped; > + unsigned long tx_dropped; > + uint32_t outstanding_tx; > +}; > + > +struct umem_elem { > + struct umem_elem *next; > +}; > + > +void umem_elem_push(struct umem_pool *umemp, void *addr); > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs); > + > +void *umem_elem_pop(struct umem_pool *umemp); > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs); > + > +int umem_pool_init(struct umem_pool *umemp, unsigned int size); > +void umem_pool_cleanup(struct umem_pool *umemp); > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size); > +void xpacket_pool_cleanup(struct xpacket_pool *xp); > + > +#endif > +#endif > diff --git a/tests/automake.mk b/tests/automake.mk > index bc906fb79b46..7db64faabc71 100644 > --- a/tests/automake.mk > +++ b/tests/automake.mk > @@ -4,12 +4,14 @@ EXTRA_DIST += \ > $(SYSTEM_TESTSUITE_AT) \ > $(SYSTEM_KMOD_TESTSUITE_AT) \ > $(SYSTEM_USERSPACE_TESTSUITE_AT) \ > + $(SYSTEM_AFXDP_TESTSUITE_AT) \ > $(SYSTEM_OFFLOADS_TESTSUITE_AT) \ > $(SYSTEM_DPDK_TESTSUITE_AT) \ > $(OVSDB_CLUSTER_TESTSUITE_AT) \ > $(TESTSUITE) \ > $(SYSTEM_KMOD_TESTSUITE) \ > $(SYSTEM_USERSPACE_TESTSUITE) \ > + $(SYSTEM_AFXDP_TESTSUITE) \ > $(SYSTEM_OFFLOADS_TESTSUITE) \ > $(SYSTEM_DPDK_TESTSUITE) \ > $(OVSDB_CLUSTER_TESTSUITE) \ > @@ -159,6 +161,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \ > tests/system-userspace-macros.at \ > tests/system-userspace-packet-type-aware.at > > 
+SYSTEM_AFXDP_TESTSUITE_AT = \ > + tests/system-afxdp-testsuite.at \ > + tests/system-afxdp-macros.at > + > SYSTEM_TESTSUITE_AT = \ > tests/system-common-macros.at \ > tests/system-ovn.at \ > @@ -183,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite > TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch > SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite > SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite > +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite > SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite > SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite > OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite > @@ -316,6 +323,11 @@ check-system-userspace: all > set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ > "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) > > +check-afxdp: all > + $(MAKE) install > + set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \ > + "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) > + > check-offloads: all > set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ > "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) > @@ -353,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP > $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at > $(AM_V_at)mv $@.tmp $@ > > +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT) > + $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at > + $(AM_V_at)mv $@.tmp $@ > + > $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT) > $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at > $(AM_V_at)mv $@.tmp $@ > diff --git a/tests/system-afxdp-macros.at 
b/tests/system-afxdp-macros.at > new file mode 100644 > index 000000000000..1e6f7a46b4b7 > --- /dev/null > +++ b/tests/system-afxdp-macros.at > @@ -0,0 +1,20 @@ > +# Add port to ovs bridge by using afxdp mode. > +# This will use generic XDP support in the veth driver. > +m4_define([ADD_VETH], > + [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77]) > + CONFIGURE_VETH_OFFLOADS([$1]) > + AT_CHECK([ip link set $1 netns $2]) > + AT_CHECK([ip link set dev ovs-$1 up]) > + AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \ > + set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"]) > + NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7]) > + NS_CHECK_EXEC([$2], [ip link set dev $1 up]) > + if test -n "$5"; then > + NS_CHECK_EXEC([$2], [ip link set dev $1 address $5]) > + fi > + if test -n "$6"; then > + NS_CHECK_EXEC([$2], [ip route add default via $6]) > + fi > + on_exit 'ip link del ovs-$1' > + ] > +) > diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at > new file mode 100644 > index 000000000000..9b7a29066614 > --- /dev/null > +++ b/tests/system-afxdp-testsuite.at > @@ -0,0 +1,26 @@ > +AT_INIT > + > +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc. > + > +Licensed under the Apache License, Version 2.0 (the "License"); > +you may not use this file except in compliance with the License. > +You may obtain a copy of the License at: > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > +Unless required by applicable law or agreed to in writing, software > +distributed under the License is distributed on an "AS IS" BASIS, > +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
> +See the License for the specific language governing permissions and > +limitations under the License.]) > + > +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS]) > + > +m4_include([tests/ovs-macros.at]) > +m4_include([tests/ovsdb-macros.at]) > +m4_include([tests/ofproto-macros.at]) > +m4_include([tests/system-common-macros.at]) > +m4_include([tests/system-userspace-macros.at]) > +m4_include([tests/system-afxdp-macros.at]) > + > +m4_include([tests/system-traffic.at]) > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index 08001dbce3d3..6195a8fd41cf 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -3082,6 +3082,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \ > </p> > </column> > > + <column name="other_config" key="xdpmode" > + type='{"type": "string", > + "enum": ["set", ["skb", "drv"]]}'> > + <p> > + Specifies the operational mode of the XDP program. > + If "drv", the XDP program is loaded into the device driver with > + zero-copy RX and TX enabled. This mode requires device driver with > + AF_XDP support and has the best performance. > + If "skb", the XDP program is using generic XDP mode in kernel with > + extra data copying between userspace and kernel. No device driver > + support is needed. Note that this is afxdp netdev type only. > + Defaults to "skb" mode. > + </p> > + </column> > + > <column name="options" key="vhost-server-path" > type='{"type": "string"}'> > <p> >
Hi Ilya, Thanks for your review. On Thu, May 30, 2019 at 8:57 AM Ilya Maximets <i.maximets@samsung.com> wrote: > > On 28.05.2019 22:01, William Tu wrote: > > The patch introduces experimental AF_XDP support for OVS netdev. > > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h > > index 859c05613ddf..a33b9a7353ba 100644 > > --- a/lib/dpif-netdev-perf.h > > +++ b/lib/dpif-netdev-perf.h > > @@ -21,6 +21,7 @@ > > #include <stddef.h> > > #include <stdint.h> > > #include <string.h> > > +#include <time.h> > > #include <math.h> > > > > #ifdef DPDK_NETDEV > > @@ -186,6 +187,24 @@ struct pmd_perf_stats { > > char *log_reason; > > }; > > > > +#ifdef HAVE_AF_XDP > > I'd like to change this to "#ifdef __linux__". > 'clock_gettime' is posix compliant, but CLOCK_MONOTONIC_RAW is > Linux specific. Yes, thanks, will do it. > > > +static inline uint64_t > > +rdtsc_syscall(struct pmd_perf_stats *s) > > +{ > > + struct timespec val; > > + uint64_t v; > > + > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) { > > + return s->last_tsc = 0; > > Maybe it's better to just return the value and allow caller to assign? Do you mean just: return s->last_tsc; > This way you'll not need to pass any arguments here. I don't understand, I still need to pass &val, right? > > > + } > > + > > + v = (uint64_t) val.tv_sec * 1000000000LL; > > + v += (uint64_t) val.tv_nsec; > > + > > + return s->last_tsc = v; > > +} > > +#endif > > + > > /* Support for accurate timing of PMD execution on TSC clock cycle level. > > * These functions are intended to be invoked in the context of pmd threads. */ > > > > @@ -198,6 +217,15 @@ cycles_counter_update(struct pmd_perf_stats *s) > > { > > #ifdef DPDK_NETDEV > > return s->last_tsc = rte_get_tsc_cycles(); > > +#elif defined(HAVE_AF_XDP) && defined(__x86_64__) > > And this should be: > #elif !defined(_MSC_VER) && defined(__x86_64__) > > Visual Studio doesn't support inline assembly this way. > Other things are portable until we're on x86_64. 
right, thanks! > > > + /* This is x86-specific instructions. */ > > + uint32_t h, l; > > + asm volatile("rdtsc" : "=a" (l), "=d" (h)); > > + > > + return s->last_tsc = ((uint64_t) h << 32) | l; > > +#elif defined(HAVE_AF_XDP) > > #elif defined(__linux__) > > > + /* non-x86_64 architecture uses syscall */ > > + return rdtsc_syscall(s); > > #else > > return s->last_tsc = 0; > > #endif > > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c > > new file mode 100644 <snip> > > + > > +static void > > +xsk_destroy_all(struct netdev *netdev) > > +{ > > + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + int i, ifindex; > > + > > + ifindex = linux_get_ifindex(netdev_get_name(netdev)); > > + > > + for (i = 0; i < MAX_XSKQ; i++) { > > + if (dev->xsk[i]) { > > + VLOG_INFO("destroy xsk[%d]", i); > > + xsk_destroy(dev->xsk[i]); > > + dev->xsk[i] = NULL; > > + dev->xsk[i]->rx_dropped = 0; > > + dev->xsk[i]->tx_dropped = 0; > > Dereferencing a just-assigned NULL pointer. Something is definitely > wrong here. oh, thanks a lot.
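One possible shape of the fix, as a standalone sketch rather than the actual follow-up patch: clear the slot as the last step and never touch it afterwards. The types below are stand-ins, MAX_XSKQ is assumed to be 16, and xsk_destroy() is assumed to free the per-queue structure, in which case resetting rx_dropped and tx_dropped is unnecessary. If xsk_destroy() did not free it, the resets would have to happen before the pointer is cleared.

```c
#include <stdlib.h>

#define MAX_XSKQ 16   /* Assumption: the real value comes from the patch. */

/* Stand-in for the per-queue socket state. */
struct xsk_socket_info {
    unsigned long rx_dropped;
    unsigned long tx_dropped;
};

/* Stand-in for the relevant part of struct netdev_linux. */
struct afxdp_dev {
    struct xsk_socket_info *xsk[MAX_XSKQ];
};

/* Assumed behaviour: destroying a socket releases its memory. */
static void
xsk_destroy(struct xsk_socket_info *xsk)
{
    free(xsk);
}

/* Destroys every configured queue and returns how many were destroyed.
 * The slot is set to NULL last and is not dereferenced after that. */
static int
xsk_destroy_all(struct afxdp_dev *dev)
{
    int n = 0;

    for (int i = 0; i < MAX_XSKQ; i++) {
        if (dev->xsk[i]) {
            xsk_destroy(dev->xsk[i]);
            dev->xsk[i] = NULL;
            n++;
        }
    }
    return n;
}
```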
> > > + } > > + } > > + VLOG_INFO("remove xdp program"); > > + xsk_remove_xdp_program(ifindex, dev->xdpmode); > > +} > > + > > +static inline void OVS_UNUSED > > +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) { > > + struct xdp_statistics stat; > > + socklen_t optlen; > > + > > + optlen = sizeof stat; > > + ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS, > > + &stat, &optlen) == 0); > > + > > + VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu", > > + stat.rx_dropped, > > + stat.rx_invalid_descs, > > + stat.tx_invalid_descs); > > +} > > + > > +int > > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, > > + char **errp OVS_UNUSED) > > +{ > > + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + const char *str_xdpmode; > > + int xdpmode, new_n_rxq; > > + > > + ovs_mutex_lock(&dev->mutex); > > + new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1); > > + if (new_n_rxq > MAX_XSKQ) { > > + ovs_mutex_unlock(&dev->mutex); > > + VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).", > > + netdev_get_name(netdev), new_n_rxq, MAX_XSKQ); > > + return EINVAL; > > + } > > + > > + str_xdpmode = smap_get_def(args, "xdpmode", "skb"); > > + if (!strcasecmp(str_xdpmode, "drv")) { > > + xdpmode = XDP_ZEROCOPY; > > + } else if (!strcasecmp(str_xdpmode, "skb")) { > > + xdpmode = XDP_COPY; > > + } else { > > + VLOG_ERR("%s: Incorrect xdpmode (%s).", > > + netdev_get_name(netdev), str_xdpmode); > > + ovs_mutex_unlock(&dev->mutex); > > + return EINVAL; > > + } > > + > > + if (dev->requested_n_rxq != new_n_rxq > > + || dev->requested_xdpmode != xdpmode) { > > + dev->requested_n_rxq = new_n_rxq; > > + dev->requested_xdpmode = xdpmode; > > + netdev_request_reconfigure(netdev); > > + } > > + ovs_mutex_unlock(&dev->mutex); > > + return 0; > > +} > > + > > +int > > +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args) > > +{ > > + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + > > + 
ovs_mutex_lock(&dev->mutex); > > + smap_add_format(args, "n_rxq", "%d", netdev->n_rxq); > > + smap_add_format(args, "xdpmode", "%s", > > + dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb"); > > + ovs_mutex_unlock(&dev->mutex); > > + return 0; > > +} > > + > > +int > > +netdev_afxdp_reconfigure(struct netdev *netdev) > > +{ > > + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; > > + int err = 0; > > + > > + ovs_mutex_lock(&dev->mutex); > > + > > + if (netdev->n_rxq == dev->requested_n_rxq > > + && dev->xdpmode == dev->requested_xdpmode) { > > + goto out; > > + } > > + > > + xsk_destroy_all(netdev); > > + netdev->n_rxq = dev->requested_n_rxq; > > + > > + if (dev->requested_xdpmode == XDP_ZEROCOPY) { > > + VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev)); > > + /* From SKB mode to DRV mode */ > > + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; > > + dev->xdp_bind_flags = XDP_ZEROCOPY; > > + dev->xdpmode = XDP_ZEROCOPY; > > + > > + if (setrlimit(RLIMIT_MEMLOCK, &r)) { > > + VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s", > > + ovs_strerror(errno)); > > + } > > + } else { > > + VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev)); > > + /* From DRV mode to SKB mode */ > > + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; > > + dev->xdp_bind_flags = XDP_COPY; > > + dev->xdpmode = XDP_COPY; > > + /* TODO: set rlimit back to previous value > > + * when no device is in DRV mode. > > + */ > > + } > > + > > + err = xsk_configure_all(netdev); > > + if (err) { > > + VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev)); > > + } > > + netdev_change_seq_changed(netdev); > > +out: > > + ovs_mutex_unlock(&dev->mutex); > > + return err; > > +} > > + > > +int > > +netdev_afxdp_get_numa_id(const struct netdev *netdev) > > +{ > > + /* FIXME: Get netdev's PCIe device ID, then find > > + * its NUMA node id. 
> > + */ > > + VLOG_INFO("FIXME: Device %s always use numa id 0", > > + netdev_get_name(netdev)); > > + return 0; > > +} > > + > > +static void > > +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode) > > +{ > > + uint32_t curr_prog_id = 0; > > + uint32_t flags; > > + > > + /* remove_xdp_program() */ > > + if (xdpmode == XDP_COPY) { > > + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; > > + } else { > > + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; > > + } > > + > > + if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) { > > + bpf_set_link_xdp_fd(ifindex, -1, flags); > > + } > > + if (prog_id == curr_prog_id) { > > + bpf_set_link_xdp_fd(ifindex, -1, flags); > > + } else if (!curr_prog_id) { > > + VLOG_INFO("couldn't find a prog id on a given interface"); > > + } else { > > + VLOG_INFO("program on interface changed, not removing"); > > + } > > +} > > + > > +void > > +signal_remove_xdp(struct netdev *netdev) > > +{> + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + int ifindex; > > + > > + ifindex = linux_get_ifindex(netdev_get_name(netdev)); > > + > > + VLOG_WARN("force remove xdp program"); > > + xsk_remove_xdp_program(ifindex, dev->xdpmode); > > +} > > + > > +static struct dp_packet_afxdp * > > +dp_packet_cast_afxdp(const struct dp_packet *d) > > +{ > > + ovs_assert(d->source == DPBUF_AFXDP); > > + return CONTAINER_OF(d, struct dp_packet_afxdp, packet); > > +} > > + > > +void > > +free_afxdp_buf(struct dp_packet *p) > > +{ > > + struct dp_packet_afxdp *xpacket; > > + unsigned long addr; > > + > > + xpacket = dp_packet_cast_afxdp(p); > > + if (xpacket->mpool) { > > + void *base = dp_packet_base(p); > > + > > + addr = (unsigned long)base & (~FRAME_SHIFT_MASK); > > + umem_elem_push(xpacket->mpool, (void *)addr); > > + } > > +} > > + > > +static void > > +free_afxdp_buf_batch(struct dp_packet_batch *batch) > > +{ > > + struct dp_packet_afxdp *xpacket = NULL; > > + struct dp_packet *packet; > > + void *elems[BATCH_SIZE]; > > + 
unsigned long addr; > > + > > + /* all packets are AF_XDP, so handles its own delete in batch */ > > This comment should be somewhere else. > > BTW, shift right by 1 space. > > > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > > + xpacket = dp_packet_cast_afxdp(packet); > > + if (xpacket->mpool) { > > + void *base = dp_packet_base(packet); > > + > > + addr = (unsigned long)base & (~FRAME_SHIFT_MASK); > > Shouldn't it be uintptr_t ? Probably in some other places too. right, I will use uintptr_t here and other places. > > > + elems[i] = (void *)addr; > > + } > > + } > > + umem_elem_push_n(xpacket->mpool, batch->count, elems); > > + dp_packet_batch_init(batch); > > +} > > + > > +int > > +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, > > + int *qfill) > > +{ > > + struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); > > + struct netdev *netdev = rx->up.netdev; > > + struct netdev_linux *dev = netdev_linux_cast(netdev); > > + struct umem_elem *elems[BATCH_SIZE]; > > + uint32_t idx_rx = 0, idx_fq = 0; > > + struct xsk_socket_info *xsk; > > + int qid = rxq_->queue_id; > > + unsigned int rcvd, i; > > + int ret = 0; > > + > > + xsk = dev->xsk[qid]; > > + rx->fd = xsk_socket__fd(xsk->xsk); > > + > > + /* See if there is any packet on RX queue, > > + * if yes, idx_rx is the index having the packet. > > + */ > > + rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); > > + if (!rcvd) { > > + return 0; > > + } > > + > > + ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems); > > + if (OVS_UNLIKELY(ret)) { > > We need to return rx buffers to mpool before releasing. > Otherwise they will be lost. > > for (i = 0; i < rcvd; i++) { > uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, i)->addr; > > elems[i] = xsk_umem__get_data(xsk->umem->buffer, addr); > } > umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); > > Please, re-check above code snippet before using. 
good point, thanks > > > + xsk_ring_cons__release(&xsk->rx, rcvd); > > + xsk->rx_dropped += rcvd; > > + return ENOMEM; > > + } > > + > > + /* Prepare for the FILL queue */ > > + if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) { > > + /* The FILL queue is full, don't retry or process rx. Wait for kernel > > + * to move received packets from FILL queue to RX queue. > > + */ > > + umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); > > Same here. > > > + xsk_ring_cons__release(&xsk->rx, rcvd); > > + xsk->rx_dropped += rcvd; > > + return ENOMEM; > > + } > > + > > + /* Setup a dp_packet batch from descriptors in RX queue */ > > + for (i = 0; i < rcvd; i++) { > > + uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; > > + uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len; > > + char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr); > > + uint64_t index; > > + > > + struct dp_packet_afxdp *xpacket; > > + struct dp_packet *packet; > > + > > + index = addr >> FRAME_SHIFT; > > + xpacket = UMEM2XPKT(xsk->umem->xpool.array, index); > > + packet = &xpacket->packet; > > + > > + /* Initialize the struct dp_packet */ > > + dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM); > > + dp_packet_set_size(packet, len); > > + > > + /* Add packet into batch, increase batch->count */ > > + dp_packet_batch_add(batch, packet); > > + > > + idx_rx++; > > + } > > + /* Release the RX queue */ > > + xsk_ring_cons__release(&xsk->rx, rcvd); > > + > > + for (i = 0; i < rcvd; i++) { > > + uint64_t index; > > + struct umem_elem *elem; > > + > > + /* Get one free umem, program it into FILL queue */ > > + elem = elems[i]; > > + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); > > + ovs_assert((index & FRAME_SHIFT_MASK) == 0); > > + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index; > > + > > + idx_fq++; > > + } > > + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); > > + > > + if (qfill) { > > + /* TODO: return the number of 
remaining packets in the queue. */ > > + *qfill = 0; > > + } > > + > > +#ifdef AFXDP_DEBUG > > + log_xsk_stat(xsk); > > +#endif > > + return 0; > > +} > > + > > +static inline int > > +kick_tx(struct xsk_socket_info *xsk) > > +{ > > + int ret; > > + > > + /* This causes system call into kernel's xsk_sendmsg, and > > + * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode). > > + */ > > + ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); > > + if (OVS_UNLIKELY(ret < 0)) { > > + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { > > + return errno; > > + } > > + } > > + /* no error, or EBUSY or EAGAIN */ > > + return 0; > > +} > > + > > +static inline bool > > +check_free_batch(struct dp_packet_batch *batch) > > +{ > > + struct umem_pool *first_mpool = NULL; > > + struct dp_packet_afxdp *xpacket; > > + struct dp_packet *packet; > > + > > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { > > + if (packet->source != DPBUF_AFXDP) { > > + return false; > > + } > > + xpacket = dp_packet_cast_afxdp(packet); > > + if (i == 0) { > > + first_mpool = xpacket->mpool; > > + continue; > > + } > > + if (xpacket->mpool != first_mpool) { > > + return false; > > + } > > + } > > + /* All packets are DPBUF_AFXDP and from the same mpool */ > > + return true; > > +} > > + > > +static inline void > > +afxdp_complete_tx(struct xsk_socket_info *xsk) > > +{ > > + struct umem_elem *elems_push[BATCH_SIZE]; > > + uint32_t idx_cq = 0; > > + int tx_done, j, ret; > > + > > + if (!xsk->outstanding_tx) { > > + return; > > + } > > + > > + ret = kick_tx(xsk); > > + if (OVS_UNLIKELY(ret)) { > > + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", > > + ovs_strerror(ret)); > > + } > > + > > + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq); > > + if (tx_done > 0) { > > + xsk_ring_cons__release(&xsk->umem->cq, tx_done); > > + xsk->outstanding_tx -= tx_done; > > + } > > + > > + /* Recycle back to umem pool */ > > + for (j = 0; j < tx_done; j++) 
{
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +}
> > +
> > +int
> > +netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> > +                        struct dp_packet_batch *batch,
> > +                        bool concurrent_txq)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +    struct xsk_socket_info *xsk = dev->xsk[qid];
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct dp_packet *packet;
> > +    bool free_batch = true;
> > +    uint32_t idx = 0;
> > +    int error = 0;
> > +    int ret;
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_lock(&dev->tx_lock);
>
> Using the same lock for all queues will produce a lot of unnecessary
> contention. It's better to allocate an array of locks, one per tx queue.
> You may re-allocate it in the reconfigure() implementation.

Right, will do a per-tx-queue lock in the next version.

>
> > +    }
> > +
> > +    /* Process CQ first. */
> > +    afxdp_complete_tx(xsk);
> > +
> > +    free_batch = check_free_batch(batch);
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * TODO: avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>
> Are we really able to re-use these buffers? They are already in the tx ring
> and will probably be sent on the next kick_tx().
>
Right, so probably I should just print the warning message and continue.

> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +out:
> > +    if (free_batch) {
> > +        free_afxdp_buf_batch(batch);
> > +    } else {
> > +        dp_packet_delete_batch(batch, true);
> > +    }
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_unlock(&dev->tx_lock);
> > +    }
> > +    return error;
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> > +{
> > +    /* Done at reconfigure */
> > +    return 0;
> > +}
> > +
> > +void
> > +netdev_afxdp_destruct(struct netdev *netdev_)
> > +{
> > +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > +
> > +    /* Note: tc is by-passed when using drv-mode, but when using
> > +     * skb-mode, we might need to clean up tc. */
> > +
> > +    xsk_destroy_all(netdev_);
> > +    ovs_mutex_destroy(&netdev->mutex);
> > +}
> > +
> > +int
> > +netdev_afxdp_get_stats(const struct netdev *netdev_,
>
> You don't need an underscore here.
> > > + struct netdev_stats *stats) > > +{ > > + struct netdev_linux *dev = netdev_linux_cast(netdev_); > > + struct netdev_stats dev_stats; > > + struct xsk_socket_info *xsk; > > + int error, i; > > + > > + ovs_mutex_lock(&dev->mutex); > > + > > + error = get_stats_via_netlink(netdev_, &dev_stats); > > + if (error) { > > + VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics"); > > + } else { > > + /* Use kernel netdev's packet and byte counts */ > > + stats->rx_packets = dev_stats.rx_packets; > > + stats->rx_bytes = dev_stats.rx_bytes; > > + stats->tx_packets = dev_stats.tx_packets; > > + stats->tx_bytes = dev_stats.tx_bytes; > > + > > + stats->rx_errors += dev_stats.rx_errors; > > + stats->tx_errors += dev_stats.tx_errors; > > + stats->rx_dropped += dev_stats.rx_dropped; > > + stats->tx_dropped += dev_stats.tx_dropped; > > + stats->multicast += dev_stats.multicast; > > + stats->collisions += dev_stats.collisions; > > + stats->rx_length_errors += dev_stats.rx_length_errors; > > + stats->rx_over_errors += dev_stats.rx_over_errors; > > + stats->rx_crc_errors += dev_stats.rx_crc_errors; > > + stats->rx_frame_errors += dev_stats.rx_frame_errors; > > + stats->rx_fifo_errors += dev_stats.rx_fifo_errors; > > + stats->rx_missed_errors += dev_stats.rx_missed_errors; > > + stats->tx_aborted_errors += dev_stats.tx_aborted_errors; > > + stats->tx_carrier_errors += dev_stats.tx_carrier_errors; > > + stats->tx_fifo_errors += dev_stats.tx_fifo_errors; > > + stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors; > > + stats->tx_window_errors += dev_stats.tx_window_errors; > > + > > + /* Account the dropped in each xsk */ > > + for (i = 0; i < MAX_XSKQ; i++) { > > i < netdev_n_rxq(netdev) > > > + xsk = dev->xsk[i]; > > + if (xsk) { > > + stats->rx_dropped += xsk->rx_dropped; > > + stats->tx_dropped += xsk->tx_dropped; > > + } > > + } > > + } > > + ovs_mutex_unlock(&dev->mutex); > > + > > + return error; > > +} > > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h > > new 
file mode 100644 > > index 000000000000..dd2dc1a2064d > > --- /dev/null > > +++ b/lib/netdev-afxdp.h > > @@ -0,0 +1,74 @@ > > +/* > > + * Copyright (c) 2018, 2019 Nicira, Inc. > > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > + > > +#ifndef NETDEV_AFXDP_H > > +#define NETDEV_AFXDP_H 1 > > + > > +#include <config.h> > > + > > +#ifdef HAVE_AF_XDP > > + > > +#include <stdint.h> > > +#include <stdbool.h> > > + > > +/* These functions are Linux AF_XDP specific, so they should be used directly > > + * only by Linux-specific code. 
*/ > > + > > +#define MAX_XSKQ 16 > > + > > +struct netdev; > > +struct xsk_socket_info; > > +struct xdp_umem; > > +struct dp_packet_batch; > > +struct smap; > > +struct dp_packet; > > +struct netdev_rxq; > > +struct netdev_stats; > > + > > +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_); > > +void netdev_afxdp_destruct(struct netdev *netdev_); > > + > > +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, > > + struct dp_packet_batch *batch, > > + int *qfill); > > +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid, > > + struct dp_packet_batch *batch, > > + bool concurrent_txq); > > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, > > + char **errp); > > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args); > > +int netdev_afxdp_get_numa_id(const struct netdev *netdev); > > +int netdev_afxdp_get_stats(const struct netdev *netdev_, > > + struct netdev_stats *stats); > > + > > +void free_afxdp_buf(struct dp_packet *p); > > +int netdev_afxdp_reconfigure(struct netdev *netdev); > > +void signal_remove_xdp(struct netdev *netdev); > > + > > +#else /* !HAVE_AF_XDP */ > > + > > +#include "openvswitch/compiler.h" > > + > > +struct dp_packet; > > + > > +static inline void > > +free_afxdp_buf(struct dp_packet *p OVS_UNUSED) > > +{ > > + /* Nothing */ > > +} > > + > > +#endif /* HAVE_AF_XDP */ > > +#endif /* netdev-afxdp.h */ > > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h > > new file mode 100644 > > index 000000000000..d43f79e6aa41 > > --- /dev/null > > +++ b/lib/netdev-linux-private.h > > @@ -0,0 +1,139 @@ > > +/* > > + * Copyright (c) 2019 Nicira, Inc. > > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. 
> > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > + > > +#ifndef NETDEV_LINUX_PRIVATE_H > > +#define NETDEV_LINUX_PRIVATE_H 1 > > + > > +#include <config.h> > > + > > +#include <linux/filter.h> > > +#include <linux/gen_stats.h> > > +#include <linux/if_ether.h> > > +#include <linux/if_tun.h> > > +#include <linux/types.h> > > +#include <linux/ethtool.h> > > +#include <linux/mii.h> > > +#include <stdint.h> > > +#include <stdbool.h> > > + > > +#include "netdev-afxdp.h" > > +#include "netdev-provider.h" > > +#include "netdev-tc-offloads.h" > > +#include "netdev-vport.h" > > +#include "openvswitch/thread.h" > > +#include "ovs-atomic.h" > > +#include "timer.h" > > +#include "xdpsock.h" > > + > > +/* These functions are Linux specific, so they should be used directly only by > > + * Linux-specific code. */ > > + > > +struct netdev; > > + > > +struct netdev_rxq_linux { > > + struct netdev_rxq up; > > + bool is_tap; > > + int fd; > > +}; > > + > > +void netdev_linux_run(const struct netdev_class *); > > + > > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, > > + const char *flag_name, bool enable); > > + > > +int get_stats_via_netlink(const struct netdev *netdev_, > > + struct netdev_stats *stats); > > + > > +struct netdev_linux { > > + struct netdev up; > > + > > + /* Protects all members below. */ > > + struct ovs_mutex mutex; > > + > > + unsigned int cache_valid; > > + > > + bool miimon; /* Link status of last poll. */ > > + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. 
*/ > > + struct timer miimon_timer; > > + > > + int netnsid; /* Network namespace ID. */ > > + /* The following are figured out "on demand" only. They are only valid > > + * when the corresponding VALID_* bit in 'cache_valid' is set. */ > > + int ifindex; > > + struct eth_addr etheraddr; > > + int mtu; > > + unsigned int ifi_flags; > > + long long int carrier_resets; > > + uint32_t kbits_rate; /* Policing data. */ > > + uint32_t kbits_burst; > > + int vport_stats_error; /* Cached error code from vport_get_stats(). > > + 0 or an errno value. */ > > + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU > > + * or SIOCSIFMTU. > > + */ > > + int ether_addr_error; /* Cached error code from set/get etheraddr. */ > > + int netdev_policing_error; /* Cached error code from set policing. */ > > + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ > > + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ > > + > > + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ > > + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ > > + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ > > + > > + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ > > + struct tc *tc; > > + > > + /* For devices of class netdev_tap_class only. */ > > + int tap_fd; > > + bool present; /* If the device is present in the namespace */ > > + uint64_t tx_dropped; /* tap device can drop if the iface is down */ > > + > > + /* LAG information. */ > > + bool is_lag_master; /* True if the netdev is a LAG master. */ > > + > > + /* AF_XDP information */ > > +#ifdef HAVE_AF_XDP > > + struct xsk_socket_info *xsk[MAX_XSKQ]; > > You may allocate this array dynamically based on the n_rxq while performing > reconfiguration. This way you will also have no limit on the number of rxqs. make sense, thanks. <snip> > > --- /dev/null > > +++ b/lib/spinlock.h > > @@ -0,0 +1,70 @@ > > +/* > > + * Copyright (c) 2018, 2019 Nicira, Inc. 
> > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > +#ifndef SPINLOCK_H > > +#define SPINLOCK_H 1 > > + > > +#include <config.h> > > + > > +#include <ctype.h> > > +#include <errno.h> > > +#include <fcntl.h> > > +#include <stdarg.h> > > +#include <stdlib.h> > > +#include <unistd.h> > > + > > +#include "ovs-atomic.h" > > + > > +typedef struct { > > It's probably better to not use 'typedef'. OVS doesn't use > typedefs for structures, unions and enums usually. > For example we have no typedef for 'struct ovs_mutex'. > So, this should be just 'struct ovs_spinlock'. > > > We may also add some annotations like OVS_LOCKABLE and clang > thread safety annotations: OVS_ACQUIRES, OVS_TRY_LOCK, OVS_RELEASES. > However, this could be done later. > OK, will do it. 
> > +    atomic_int locked;
> > +} ovs_spinlock_t;
> > +
> > +static inline void
> > +ovs_spinlock_init(ovs_spinlock_t *sl)
> > +{
> > +    atomic_init(&sl->locked, 0);
> > +}
> > +
> > +static inline void
> > +ovs_spin_lock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0, locked = 0;
> > +
> > +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed)) {
> > +        locked = 1;
> > +        while (locked) {
> > +            atomic_read_relaxed(&sl->locked, &locked);
> > +        }
> > +        exp = 0;
> > +    }
> > +}
> > +
> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
>
> Not sure that we need UNUSED annotation since we're in header now.
>
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
> > +#endif
> > diff --git a/lib/util.c b/lib/util.c
> > index 5679232ffc5f..060b1e287bce 100644
> > --- a/lib/util.c
> > +++ b/lib/util.c
> > @@ -277,6 +277,49 @@ free_cacheline(void *p)
> >  #endif
> >  }
> >
> > +#ifdef HAVE_AF_XDP
>
> I don't think that we need 'ifdef' here.
>
> How about re-naming 'xmalloc_cacheline' to 'xmalloc_size_align'
> making it allocate memory aligned to a specified size and in
> a dedicated cachelines?
>
> And implement two functions:
> xmalloc_cacheline(size)
> {
>     return xmalloc_size_align(size, CACHE_LINE_SIZE);
> }
> xmalloc_pagealign(size)
> {
>     return xmalloc_size_align(size, get_page_size());
> }
>
I think it's better, will do it.

> > +void *
> > +xmalloc_pagealign(size_t size)
> > +{
> > +#ifdef HAVE_POSIX_MEMALIGN
> > +    void *p;
> > +    int error;
> > +
> > +    COVERAGE_INC(util_xalloc);
> > +    error = posix_memalign(&p, get_page_size(), size ?
size : 1); > > + if (error != 0) { > > + out_of_memory(); > > + } > > + return p; > > +#else > > + /* Similar to xmalloc_cacheline, but replace > > + * CACHE_LINE_SIZE with get_page_size() */ > > + void *p = xmalloc((get_page_size() - 1) > > + + sizeof(void *) > > + + ROUND_UP(size, get_page_size())); > > I think that you don't need to round up to a page size. > You need to round up to a CACHE_LINE_SIZE, probably. > There is no point to allocate so much memory more. > Below code should be re-checked too. > Right, in worst case we should waste (page_size() - 1) bytes. > > + bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *); > > + void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? get_page_size() : 0), > > + get_page_size()); > > + void **q = (void **) r - 1; > > + *q = p; > > + return r; > > +#endif > > +} > > + > > +void > > +free_pagealign(void *p) > > +{ > > +#ifdef HAVE_POSIX_MEMALIGN > > + free(p); > > +#else > > + if (p) { > > + void **q = (void **) p - 1; > > + free(*q); > > + } > > +#endif > > +} > > +#endif > > + > > char * > > xasprintf(const char *format, ...) > > { > > diff --git a/lib/util.h b/lib/util.h > > index 53354f1c6f0f..3cd8cf87fba8 100644 > > --- a/lib/util.h > > +++ b/lib/util.h > > @@ -163,6 +163,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size); > > > > int string_ends_with(const char *str, const char *suffix); > > > > +#ifdef HAVE_AF_XDP > > +void *xmalloc_pagealign(size_t) MALLOC_LIKE; > > +void free_pagealign(void *); > > +#endif > > + > > /* The C standards say that neither the 'dst' nor 'src' argument to > > * memcpy() may be null, even if 'n' is zero. This wrapper tolerates > > * the null case. */ > > diff --git a/lib/xdpsock.c b/lib/xdpsock.c > > new file mode 100644 > > index 000000000000..ffdb54dfcd27 > > --- /dev/null > > +++ b/lib/xdpsock.c > > @@ -0,0 +1,179 @@ > > +/* > > + * Copyright (c) 2018, 2019 Nicira, Inc. 
> > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > +#include <config.h> > > + > > +#include "xdpsock.h" > > +#include "dp-packet.h" > > +#include "openvswitch/compiler.h" > > + > > +/* Note: > > + * umem_elem_push* shouldn't overflow because we always pop > > + * elem first, then push back to the stack. > > + */ > > +static inline void > > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) > > +{ > > + void *ptr; > > + > > + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) { > > + OVS_NOT_REACHED(); > > + } > > + > > + ptr = &umemp->array[umemp->index]; > > + memcpy(ptr, addrs, n * sizeof(void *)); > > + umemp->index += n; > > +} > > + > > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) > > +{ > > + ovs_spin_lock(&umemp->mutex); > > + __umem_elem_push_n(umemp, n, addrs); > > + ovs_spin_unlock(&umemp->mutex); > > +} > > + > > +static inline void > > +__umem_elem_push(struct umem_pool *umemp, void *addr) > > +{ > > + if (OVS_UNLIKELY(umemp->index + 1) > umemp->size) { > > + OVS_NOT_REACHED(); > > + } > > + > > + umemp->array[umemp->index++] = addr; > > +} > > + > > +void > > +umem_elem_push(struct umem_pool *umemp, void *addr) > > +{ > > + > > + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0); > > + > > + ovs_spin_lock(&umemp->mutex); > > + __umem_elem_push(umemp, addr); > > + ovs_spin_unlock(&umemp->mutex); > > +} > > + > > +static inline int > > 
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) > > +{ > > + void *ptr; > > + > > + if (OVS_UNLIKELY(umemp->index - n < 0)) { > > + return -ENOMEM; > > + } > > + > > + umemp->index -= n; > > + ptr = &umemp->array[umemp->index]; > > + memcpy(addrs, ptr, n * sizeof(void *)); > > + > > + return 0; > > +} > > + > > +int > > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) > > +{ > > + int ret; > > + > > + ovs_spin_lock(&umemp->mutex); > > + ret = __umem_elem_pop_n(umemp, n, addrs); > > + ovs_spin_unlock(&umemp->mutex); > > + > > + return ret; > > +} > > + > > +static inline void * > > +__umem_elem_pop(struct umem_pool *umemp) > > +{ > > + if (OVS_UNLIKELY(umemp->index - 1 < 0)) { > > + return NULL; > > + } > > + > > + return umemp->array[--umemp->index]; > > +} > > + > > +void * > > +umem_elem_pop(struct umem_pool *umemp) > > +{ > > + void *ptr; > > + > > + ovs_spin_lock(&umemp->mutex); > > + ptr = __umem_elem_pop(umemp); > > + ovs_spin_unlock(&umemp->mutex); > > + > > + return ptr; > > +} > > + > > +static void ** > > +__umem_pool_alloc(unsigned int size) > > +{ > > + void *bufs; > > + int ret; > > + > > + ret = posix_memalign(&bufs, getpagesize(), > > + size * sizeof(void *)); > > xmalloc_pagealign ? > > > + if (ret) { > > + return NULL; > > + } > > + > > + memset(bufs, 0, size * sizeof(void *)); > > + return (void **)bufs; > > +} > > + > > +int > > +umem_pool_init(struct umem_pool *umemp, unsigned int size) > > +{ > > + umemp->array = __umem_pool_alloc(size); > > + if (!umemp->array) { > > + return -ENOMEM; > > + } > > + > > + umemp->size = size; > > + umemp->index = 0; > > + ovs_spinlock_init(&umemp->mutex); > > + return 0; > > +} > > + > > +void > > +umem_pool_cleanup(struct umem_pool *umemp) > > +{ > > + free(umemp->array); > > free_pagealign ? 
> > > + umemp->array = NULL; > > +} > > + > > +/* AF_XDP metadata init/destroy */ > > +int > > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size) > > +{ > > + void *bufs; > > + int ret; > > + > > + ret = posix_memalign(&bufs, getpagesize(), > > + size * sizeof(struct dp_packet_afxdp)); > > + if (ret) { > > + return -ENOMEM; > > + } > > + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp)); > > + > > + xp->array = bufs; > > + xp->size = size; > > + return 0; > > +} > > + > > +void > > +xpacket_pool_cleanup(struct xpacket_pool *xp) > > +{ > > + free(xp->array); > > + xp->array = NULL; > > +} > > diff --git a/lib/xdpsock.h b/lib/xdpsock.h > > new file mode 100644 > > index 000000000000..72578e383812 > > --- /dev/null > > +++ b/lib/xdpsock.h > > @@ -0,0 +1,101 @@ > > +/* > > + * Copyright (c) 2018, 2019 Nicira, Inc. > > + * > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > + * you may not use this file except in compliance with the License. > > + * You may obtain a copy of the License at: > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. 
> > + */ > > + > > +#ifndef XDPSOCK_H > > +#define XDPSOCK_H 1 > > + > > +#include <config.h> > > + > > +#ifdef HAVE_AF_XDP > > + > > +#include <bpf/xsk.h> > > +#include <errno.h> > > +#include <stdbool.h> > > +#include <stdio.h> > > + > > +#include "openvswitch/thread.h" > > +#include "ovs-atomic.h" > > +#include "spinlock.h" > > + > > +#define FRAME_HEADROOM XDP_PACKET_HEADROOM > > +#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE > > +#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT > > +#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1) > > + > > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS > > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS > > + > > +/* The worst case is all 4 queues TX/CQ/RX/FILL are full. > > + * Setting NUM_FRAMES to this makes sure umem_pop always successes. > > + */ > > +#define NUM_FRAMES (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)) > > + > > +#define BATCH_SIZE NETDEV_MAX_BURST > > + > > +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES)); > > +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS); > > +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)); > > + > > +/* LIFO ptr_array */ > > +struct umem_pool { > > + int index; /* point to top */ > > + unsigned int size; > > + ovs_spinlock_t mutex; > > It's a bit confusing to name it a 'mutex'. Sounds like it's 'ovs_mutex'. > Probably, it'll be better to name it 'spinlock' or just 'lock'. > OK <snip> Regards, William
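For reference, the comments above on lib/spinlock.h and struct umem_pool (a plain struct instead of a typedef for the spinlock, and a member called 'lock' rather than 'mutex') could be combined roughly as in the sketch below. It is only a standalone illustration that uses C11 <stdatomic.h> in place of ovs-atomic.h and returns positive errno values instead of calling OVS_NOT_REACHED(). All sketch_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for 'struct ovs_spinlock' (no typedef). */
struct sketch_spinlock {
    atomic_int locked;
};

static inline void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

static inline void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    int exp = 0;

    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Spin on relaxed loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
        exp = 0;
    }
}

static inline void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}

/* LIFO pointer stack; 'array' is caller-provided storage of 'size' slots. */
struct sketch_pool {
    int index;                   /* Number of elements currently stored. */
    unsigned int size;           /* Capacity of 'array'. */
    struct sketch_spinlock lock; /* Named 'lock', not 'mutex'. */
    void **array;
};

static inline void
sketch_pool_init(struct sketch_pool *p, void **storage, unsigned int size)
{
    assert(storage != NULL && size > 0);
    p->array = storage;
    p->size = size;
    p->index = 0;
    sketch_spin_init(&p->lock);
}

/* Pushes 'n' pointers.  Returns 0, or ENOSPC if they would not fit. */
static inline int
sketch_pool_push_n(struct sketch_pool *p, int n, void **addrs)
{
    int error = 0;

    sketch_spin_lock(&p->lock);
    if (n < 0 || (unsigned int) p->index + (unsigned int) n > p->size) {
        error = ENOSPC;
    } else {
        memcpy(&p->array[p->index], addrs, (size_t) n * sizeof *addrs);
        p->index += n;
    }
    sketch_spin_unlock(&p->lock);
    return error;
}

/* Pops the 'n' most recently pushed pointers into 'addrs'.
 * Returns 0, or ENOMEM if fewer than 'n' are stored. */
static inline int
sketch_pool_pop_n(struct sketch_pool *p, int n, void **addrs)
{
    int error = 0;

    sketch_spin_lock(&p->lock);
    if (n < 0 || p->index - n < 0) {
        error = ENOMEM;
    } else {
        p->index -= n;
        memcpy(addrs, &p->array[p->index], (size_t) n * sizeof *addrs);
    }
    sketch_spin_unlock(&p->lock);
    return error;
}
```

A caller would initialize the pool once with backing storage, then pair every successful pop with a later push, which is the invariant the patch relies on to keep the push path from overflowing.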
On 02.06.2019 16:43, William Tu wrote:
> Hi Ilya,
>
> Thanks for your review.
>
> On Thu, May 30, 2019 at 8:57 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> On 28.05.2019 22:01, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>> index 859c05613ddf..a33b9a7353ba 100644
>>> --- a/lib/dpif-netdev-perf.h
>>> +++ b/lib/dpif-netdev-perf.h
>>> @@ -21,6 +21,7 @@
>>>  #include <stddef.h>
>>>  #include <stdint.h>
>>>  #include <string.h>
>>> +#include <time.h>
>>>  #include <math.h>
>>>
>>>  #ifdef DPDK_NETDEV
>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>>>      char *log_reason;
>>>  };
>>>
>>> +#ifdef HAVE_AF_XDP
>>
>> I'd like to change this to "#ifdef __linux__".
>> 'clock_gettime' is posix compliant, but CLOCK_MONOTONIC_RAW is
>> Linux specific.
>
> Yes, thanks, will do it.
>
>>
>>> +static inline uint64_t
>>> +rdtsc_syscall(struct pmd_perf_stats *s)
>>> +{
>>> +    struct timespec val;
>>> +    uint64_t v;
>>> +
>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
>>> +       return s->last_tsc = 0;
>>
>> Maybe it's better to just return the value and allow caller to assign?
>
> Do you mean just:
> return s->last_tsc;

Yes.

>
>> This way you'll not need to pass any arguments here.
>
> I don't understand, I still need to pass &val, right?

I meant not passing the argument to rdtsc_syscall, i.e. rdtsc_syscall(void).

Best regards, Ilya Maximets.
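The suggestion above could look roughly like the sketch below: the helper takes no arguments and only returns the timestamp, leaving the assignment to 's->last_tsc' to the caller, and it is guarded by __linux__ because CLOCK_MONOTONIC_RAW is Linux specific. The conversion to nanoseconds and the name sketch_rdtsc_syscall are assumptions, since the quoted hunk stops before showing how 'v' is computed.

```c
/* Needed for clock_gettime() under strict -std=c11; harmless otherwise. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdint.h>
#include <time.h>

#ifdef __linux__
/* Hypothetical replacement for rdtsc_syscall(struct pmd_perf_stats *).
 * Returns a monotonic timestamp in nanoseconds, or 0 on failure. */
static inline uint64_t
sketch_rdtsc_syscall(void)
{
    struct timespec val;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
        return 0;
    }
    return (uint64_t) val.tv_sec * 1000000000ULL + (uint64_t) val.tv_nsec;
}
#endif
```

The caller would then write something like 's->last_tsc = sketch_rdtsc_syscall();', so the helper no longer needs to know about struct pmd_perf_stats.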
diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@ DOC_SOURCE = \
     Documentation/intro/why-ovs.rst \
     Documentation/intro/install/index.rst \
     Documentation/intro/install/bash-completion.rst \
+    Documentation/intro/install/afxdp.rst \
     Documentation/intro/install/debian.rst \
     Documentation/intro/install/documentation.rst \
     Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@ vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`

 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..a2bff5733d0a
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,433 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+ + +======================== +Open vSwitch with AF_XDP +======================== + +This document describes how to build and install Open vSwitch using +AF_XDP netdev. + +.. warning:: + The AF_XDP support of Open vSwitch is considered 'experimental', + and it is not compiled in by default. + + +Introduction +------------ +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type +built upon the eBPF and XDP technology. It aims to have comparable +performance to DPDK but cooperate better with the existing kernel networking +stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program +attached to the netdev, bypassing a couple of Linux kernel subsystems. +As a result, an AF_XDP socket shows much better performance than AF_PACKET. +For more details about AF_XDP, please see the Linux kernel's +Documentation/networking/af_xdp.rst + + +AF_XDP Netdev +------------- +OVS has a couple of netdev types, e.g., system, tap, or +dpdk. The AF_XDP feature adds a new netdev type called +"afxdp", and implements its configuration, packet reception, +and transmit functions. Since the AF_XDP socket, called xsk, +operates in userspace, once ovs-vswitchd receives packets +from xsk, the afxdp netdev re-uses the existing userspace +dpif-netdev datapath. As a result, most of the packet processing +happens in userspace instead of in the Linux kernel. 
+ +:: + + | +-------------------+ + | | ovs-vswitchd |<-->ovsdb-server + | +-------------------+ + | | ofproto |<-->OpenFlow controllers + | +--------+-+--------+ + | | netdev | |ofproto-| + userspace | +--------+ | dpif | + | | afxdp | +--------+ + | | netdev | | dpif | + | +---||---+ +--------+ + | || | dpif- | + | || | netdev | + |_ || +--------+ + || + _ +---||-----+--------+ + | | AF_XDP prog + | + kernel | | xsk_map | + |_ +--------||---------+ + || + physical + NIC + + +Build requirements +------------------ + +In addition to the requirements described in :doc:`general`, building Open +vSwitch with AF_XDP will require the following: + +- libbpf from the kernel source tree (kernel 5.0.0 or later) + +- Linux kernel XDP support, with the following options (required) + + * CONFIG_BPF=y + + * CONFIG_BPF_SYSCALL=y + + * CONFIG_XDP_SOCKETS=y + + +- The following optional Kconfig options are also recommended, but not + required: + + * CONFIG_BPF_JIT=y (Performance) + + * CONFIG_HAVE_BPF_JIT=y (Performance) + + * CONFIG_XDP_SOCKETS_DIAG=y (Debugging) + +- Once your AF_XDP-enabled kernel is ready, if possible, run + **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf. + This is an OVS-independent benchmark tool for AF_XDP. + It makes sure your basic kernel requirements are met for AF_XDP. + + +Installing +---------- +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support. +First, clone a recent version of the Linux bpf-next tree:: + + git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git + +Second, go into the Linux source directory and build libbpf in the tools +directory:: + + cd bpf-next/ + cd tools/lib/bpf/ + make && make install + make install_headers + +.. note:: + Make sure xsk.h and bpf.h are installed in the system's include path, + e.g. 
/usr/local/include/bpf/ or /usr/include/bpf/ + +Make sure libbpf.so is installed correctly:: + + ldconfig + ldconfig -p | grep libbpf + +Third, ensure the standard OVS requirements are installed and +bootstrap/configure the package:: + + ./boot.sh && ./configure --enable-afxdp + +Finally, build and install OVS:: + + make && make install + +To kick start end-to-end autotesting:: + + uname -a # make sure the kernel is 5.0 or later + make check-afxdp TESTSUITEFLAGS='1' + +If a test case fails, check the log at:: + + cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log + + +Setup AF_XDP netdev +------------------- +Before running OVS with AF_XDP, make sure libbpf and libelf are +set up correctly:: + + ldd vswitchd/ovs-vswitchd + +Open vSwitch should be started using the userspace datapath as described +in :doc:`general`:: + + ovs-vswitchd ... + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev + +Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4) +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask, +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb":: + + ethtool -L enp2s0 combined 1 + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10 + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ + options:n_rxq=1 options:xdpmode=drv \ + other_config:pmd-rxq-affinity="0:4" + +Or, use 4 pmds/cores and 4 queues by doing:: + + ethtool -L enp2s0 combined 4 + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36 + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ + options:n_rxq=4 options:xdpmode=drv \ + other_config:pmd-rxq-affinity="0:1,1:2,2:4,3:5" + +.. note:: + pmd-rxq-affinity is optional. If not specified, the system will auto-assign. 
+ +To validate that the bridge has been successfully instantiated, you can use:: + + ovs-vsctl show + +It should show something like:: + + Port "ens802f0" + Interface "ens802f0" + type: afxdp + options: {n_rxq="1", xdpmode=drv} + +Otherwise, enable debugging by:: + + ovs-appctl vlog/set netdev_afxdp::dbg + + +References +---------- +Most of the design details are described in the paper presented at +Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1], +section 4, and slides[2][4]. +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction +to current and future AF_XDP work. + +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf + +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf + +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf + +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp + + +Performance Tuning +------------------ +The name of the game is to keep your CPU running in userspace, allowing the PMD +threads to keep polling the AF_XDP queues without any interference from the kernel. + +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd + running cores, device plug-in slot) + +#. Isolate your CPUs by setting isolcpus in the GRUB configuration. + +#. IRQs should not be assigned to cores running PMD threads. + +#. The Spectre and Meltdown fixes increase the overhead of system calls. 
+ + +Debugging performance issue +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +While running the traffic, use the Linux perf tool to see where your CPU +spends its cycles:: + + cd bpf-next/tools/perf + make + ./perf record -p `pidof ovs-vswitchd` sleep 10 + ./perf report + +Measure your system call rate by doing:: + + pstree -p `pidof ovs-vswitchd` + strace -c -p <your pmd's PID> + +Or, use the OVS pmd tool:: + + ovs-appctl dpif-netdev/pmd-stats-show + + +Example Script +-------------- + +Below is a script using namespaces and veth peer:: + + #!/bin/bash + ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \ + --disable-system --detach + ovs-vsctl -- add-br br0 -- set Bridge br0 \ + protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \ + fail-mode=secure datapath_type=netdev + ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev + + ip netns add at_ns0 + ovs-appctl vlog/set netdev_afxdp::dbg + + ip link add p0 type veth peer name afxdp-p0 + ip link set p0 netns at_ns0 + ip link set dev afxdp-p0 up + ovs-vsctl add-port br0 afxdp-p0 -- \ + set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp" + + ip netns exec at_ns0 sh << NS_EXEC_HEREDOC + ip addr add "10.1.1.1/24" dev p0 + ip link set dev p0 up + NS_EXEC_HEREDOC + + ip netns add at_ns1 + ip link add p1 type veth peer name afxdp-p1 + ip link set p1 netns at_ns1 + ip link set dev afxdp-p1 up + + ovs-vsctl add-port br0 afxdp-p1 -- \ + set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp" + ip netns exec at_ns1 sh << NS_EXEC_HEREDOC + ip addr add "10.1.1.2/24" dev p1 + ip link set dev p1 up + NS_EXEC_HEREDOC + + ip netns exec at_ns0 ping -i .2 10.1.1.2 + + +Limitations/Known Issues +------------------------ +#. The device's NUMA ID is always reported as 0; a way to find the NUMA ID from a netdev is needed. +#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer. A possible + work-around is to use the OpenFlow meter action. +#. Adding an AF_XDP device to a bridge, removing it, and adding it again will fail. +#.
 Most of the tests were done using a single i40e port. Multiple ports and + the ixgbe driver also need to be tested. +#. No latency test results yet (TODO item) + + +PVP using tap device +-------------------- +Assume you have enp2s0 as the physical NIC, and a tap device connected to the VM. +First, start OVS, then add the physical port:: + + ethtool -L enp2s0 combined 1 + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10 + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ + options:n_rxq=1 options:xdpmode=drv \ + other_config:pmd-rxq-affinity="0:4" + +Start a VM with virtio and tap device:: + + qemu-system-x86_64 -hda ubuntu1810.qcow \ + -m 4096 \ + -cpu host,+x2apic -enable-kvm \ + -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\ + vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \ + -netdev type=tap,id=net0,vhost=on,queues=8 \ + -object memory-backend-file,id=mem,size=4096M,\ + mem-path=/dev/hugepages,share=on \ + -numa node,memdev=mem -mem-prealloc -smp 2 + +Create OpenFlow rules:: + + ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp" + ovs-ofctl del-flows br0 + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0" + ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0" + +Inside the VM, use xdp_rxq_info to bounce back the traffic:: + + ./xdp_rxq_info --dev ens3 --action XDP_TX + +The performance number I got is around 1.6Mpps. +This is due to using the kernel's tap interface, which requires copying +packets into the kernel from the umem buffer in userspace. + + +PVP using vhostuser device +-------------------------- +First, build OVS with DPDK and AFXDP:: + + ./configure --enable-afxdp --with-dpdk=<dpdk path> + make -j4 && make install + +Create a vhost-user port from OVS:: + + ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-init=true + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \ + other_config:pmd-cpu-mask=0xfff + ovs-vsctl add-port br0 vhost-user-1 \ + -- set Interface vhost-user-1 type=dpdkvhostuser + +Start the VM using vhost-user mode:: + + qemu-system-x86_64 -hda ubuntu1810.qcow \ + -m 4096 \ + -cpu host,+x2apic -enable-kvm \ + -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \ + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \ + -device virtio-net-pci,mac=00:00:00:00:00:01,\ + netdev=mynet1,mq=on,vectors=10 \ + -object memory-backend-file,id=mem,size=4096M,\ + mem-path=/dev/hugepages,share=on \ + -numa node,memdev=mem -mem-prealloc -smp 2 + +Set up the OpenFlow rules:: + + ovs-ofctl del-flows br0 + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1" + ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0" + +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic:: + + ./xdp_rxq_info --dev ens3 --action XDP_DROP + ./xdp_rxq_info --dev ens3 --action XDP_TX + +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps + + +PCP container using veth +------------------------ +Create a namespace and veth peer devices:: + + ip netns add at_ns0 + ip link add p0 type veth peer name afxdp-p0 + ip link set p0 netns at_ns0 + ip link set dev afxdp-p0 up + ip netns exec at_ns0 ip link set dev p0 up + +Attach the veth port to br0 (linux kernel mode):: + + ovs-vsctl add-port br0 afxdp-p0 -- \ + set interface afxdp-p0 options:n_rxq=1 + +Or, use AF_XDP with skb mode:: + + ovs-vsctl add-port br0 afxdp-p0 -- \ + set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb + +Set up the OpenFlow rules:: + + ovs-ofctl del-flows br0 + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0" + ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0" + +In the namespace, drop or bounce back the packets:: + + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action 
XDP_DROP + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX + +Performance: for RX_DROP: 800Kpps, TX: 700Kpps + + +Bug Reporting +------------- + +Please report problems to dev@openvswitch.org. diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst index 3193c736cf17..c27a9c9d16ff 100644 --- a/Documentation/intro/install/index.rst +++ b/Documentation/intro/install/index.rst @@ -45,6 +45,7 @@ Installation from Source xenserver userspace dpdk + afxdp Installation from Packages -------------------------- diff --git a/acinclude.m4 b/acinclude.m4 index f8fc5bcd7b4c..b9eacd7c0f3c 100644 --- a/acinclude.m4 +++ b/acinclude.m4 @@ -221,6 +221,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [ ]) ]) +dnl OVS_CHECK_LINUX_AF_XDP +dnl +dnl Check both Linux kernel AF_XDP and libbpf support +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [ + AC_ARG_ENABLE([afxdp], + [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])], + [], [enable_afxdp=no]) + AC_MSG_CHECKING([whether AF_XDP is enabled]) + if test "$enable_afxdp" != yes; then + AC_MSG_RESULT([no]) + AF_XDP_ENABLE=false + else + AC_MSG_RESULT([yes]) + AF_XDP_ENABLE=true + + AC_CHECK_HEADER([bpf/libbpf.h], [], + [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])]) + + AC_CHECK_HEADER([linux/if_xdp.h], [], + [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])]) + + AC_CHECK_HEADER([bpf/xsk.h], [], + [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])]) + + AC_CHECK_HEADER([bpf/libbpf_util.h], [], + [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])]) + + AC_DEFINE([HAVE_AF_XDP], [1], + [Define to 1 if AF_XDP support is available and enabled.]) + LIBBPF_LDADD=" -lbpf -lelf" + AC_SUBST([LIBBPF_LDADD]) + fi + AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true) +]) + dnl OVS_CHECK_DPDK dnl dnl Configure DPDK source tree diff --git a/configure.ac b/configure.ac index 505e3d041e93..29c90b73f836 100644 --- a/configure.ac +++ b/configure.ac @@ 
-99,6 +99,7 @@ OVS_CHECK_SPHINX OVS_CHECK_DOT OVS_CHECK_IF_DL OVS_CHECK_STRTOK_R +OVS_CHECK_LINUX_AF_XDP AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]]) AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec], [], [], [[#include <sys/stat.h>]]) diff --git a/lib/automake.mk b/lib/automake.mk index cc5dccf39d6b..b31e28f6e1f5 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -14,6 +14,10 @@ if WIN32 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS} endif +if HAVE_AF_XDP +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD) +endif + lib_libopenvswitch_la_LDFLAGS = \ $(OVS_LTINFO) \ -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \ @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \ lib/if-notifier.h \ lib/netdev-linux.c \ lib/netdev-linux.h \ + lib/netdev-linux-private.h \ lib/netdev-tc-offloads.c \ lib/netdev-tc-offloads.h \ lib/netlink-conntrack.c \ @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \ lib/tc.h endif +if HAVE_AF_XDP +lib_libopenvswitch_la_SOURCES += \ + lib/xdpsock.c \ + lib/xdpsock.h \ + lib/netdev-afxdp.c \ + lib/netdev-afxdp.h \ + lib/spinlock.h +endif + if DPDK_NETDEV lib_libopenvswitch_la_SOURCES += \ lib/dpdk.c \ diff --git a/lib/dp-packet.c b/lib/dp-packet.c index 0976a35e758b..e6a7947076b4 100644 --- a/lib/dp-packet.c +++ b/lib/dp-packet.c @@ -19,6 +19,7 @@ #include <string.h> #include "dp-packet.h" +#include "netdev-afxdp.h" #include "netdev-dpdk.h" #include "openvswitch/dynamic-string.h" #include "util.h" @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated) dp_packet_use__(b, base, allocated, DPBUF_MALLOC); } +#if HAVE_AF_XDP +/* Initialize 'b' as an empty dp_packet that contains + * memory starting at AF_XDP umem base. 
+ */ +void +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated) +{ + dp_packet_set_base(b, base); + dp_packet_set_data(b, base); + dp_packet_set_size(b, 0); + + dp_packet_set_allocated(b, allocated); + b->source = DPBUF_AFXDP; + dp_packet_reset_offsets(b); + pkt_metadata_init(&b->md, 0); + dp_packet_reset_cutlen(b); + dp_packet_reset_offload(b); + b->packet_type = htonl(PT_ETH); +} +#endif + /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of * memory starting at 'base'. 'base' should point to a buffer on the stack. * (Nothing actually relies on 'base' being allocated on the stack. It could @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b) * created as a dp_packet */ free_dpdk_buf((struct dp_packet*) b); #endif + } else if (b->source == DPBUF_AFXDP) { + free_afxdp_buf(b); } } } @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom case DPBUF_STACK: OVS_NOT_REACHED(); + case DPBUF_AFXDP: + OVS_NOT_REACHED(); + case DPBUF_STUB: b->source = DPBUF_MALLOC; new_base = xmalloc(new_allocated); @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b) { void *p; ovs_assert(b->source != DPBUF_DPDK); + ovs_assert(b->source != DPBUF_AFXDP); if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) { p = dp_packet_data(b); diff --git a/lib/dp-packet.h b/lib/dp-packet.h index a5e9ade1244a..e3438226e360 100644 --- a/lib/dp-packet.h +++ b/lib/dp-packet.h @@ -25,6 +25,7 @@ #include <rte_mbuf.h> #endif +#include "netdev-afxdp.h" #include "netdev-dpdk.h" #include "openvswitch/list.h" #include "packets.h" @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source { DPBUF_DPDK, /* buffer data is from DPDK allocated memory. * ref to dp_packet_init_dpdk() in dp-packet.c. 
*/ + DPBUF_AFXDP, /* buffer data from XDP frame */ }; #define DP_PACKET_CONTEXT_SIZE 64 @@ -89,6 +91,13 @@ struct dp_packet { }; }; +#if HAVE_AF_XDP +struct dp_packet_afxdp { + struct umem_pool *mpool; + struct dp_packet packet; +}; +#endif + static inline void *dp_packet_data(const struct dp_packet *); static inline void dp_packet_set_data(struct dp_packet *, void *); static inline void *dp_packet_base(const struct dp_packet *); @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *); void dp_packet_use(struct dp_packet *, void *, size_t); void dp_packet_use_stub(struct dp_packet *, void *, size_t); void dp_packet_use_const(struct dp_packet *, const void *, size_t); - +#if HAVE_AF_XDP +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t); +#endif void dp_packet_init_dpdk(struct dp_packet *); void dp_packet_init(struct dp_packet *, size_t); @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b) return; } + if (b->source == DPBUF_AFXDP) { + free_afxdp_buf(b); + return; + } + dp_packet_uninit(b); free(b); } diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 859c05613ddf..a33b9a7353ba 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -21,6 +21,7 @@ #include <stddef.h> #include <stdint.h> #include <string.h> +#include <time.h> #include <math.h> #ifdef DPDK_NETDEV @@ -186,6 +187,24 @@ struct pmd_perf_stats { char *log_reason; }; +#ifdef HAVE_AF_XDP +static inline uint64_t +rdtsc_syscall(struct pmd_perf_stats *s) +{ + struct timespec val; + uint64_t v; + + if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) { + return s->last_tsc = 0; + } + + v = (uint64_t) val.tv_sec * 1000000000LL; + v += (uint64_t) val.tv_nsec; + + return s->last_tsc = v; +} +#endif + /* Support for accurate timing of PMD execution on TSC clock cycle level. * These functions are intended to be invoked in the context of pmd threads. 
*/ @@ -198,6 +217,15 @@ cycles_counter_update(struct pmd_perf_stats *s) { #ifdef DPDK_NETDEV return s->last_tsc = rte_get_tsc_cycles(); +#elif defined(HAVE_AF_XDP) && defined(__x86_64__) + /* This is x86-specific instructions. */ + uint32_t h, l; + asm volatile("rdtsc" : "=a" (l), "=d" (h)); + + return s->last_tsc = ((uint64_t) h << 32) | l; +#elif defined(HAVE_AF_XDP) + /* non-x86_64 architecture uses syscall */ + return rdtsc_syscall(s); #else return s->last_tsc = 0; #endif diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c new file mode 100644 index 000000000000..e20ee31c00f3 --- /dev/null +++ b/lib/netdev-afxdp.c @@ -0,0 +1,850 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include <config.h> + +#include "netdev-linux-private.h" +#include "netdev-linux.h" +#include "netdev-afxdp.h" + +#include <errno.h> +#include <inttypes.h> +#include <linux/rtnetlink.h> +#include <linux/if_xdp.h> +#include <net/if.h> +#include <stdlib.h> +#include <sys/resource.h> +#include <sys/socket.h> +#include <sys/types.h> +#include <unistd.h> + +#include "dp-packet.h" +#include "dpif-netdev.h" +#include "openvswitch/dynamic-string.h" +#include "openvswitch/vlog.h" +#include "packets.h" +#include "socket-util.h" +#include "spinlock.h" +#include "util.h" +#include "xdpsock.h" + +#ifndef SOL_XDP +#define SOL_XDP 283 +#endif + +VLOG_DEFINE_THIS_MODULE(netdev_afxdp); +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); + +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base)) +#define UMEM2XPKT(base, i) \ + ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \ + i * sizeof(struct dp_packet_afxdp)) + +static uint32_t prog_id; +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id, + int mode); +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode); +static void xsk_destroy(struct xsk_socket_info *xsk); +static int xsk_configure_all(struct netdev *netdev); +static void xsk_destroy_all(struct netdev *netdev); + +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size, + int xdpmode) +{ + struct xsk_umem_config uconfig OVS_UNUSED; + struct xsk_umem_info *umem; + int ret; + int i; + + umem = xcalloc(1, sizeof(*umem)); + ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, + NULL); + if (ret) { + VLOG_ERR("xsk_umem__create failed (%s) mode: %s", + ovs_strerror(errno), + xdpmode == XDP_COPY ? 
"SKB": "DRV"); + free(umem); + return NULL; + } + + umem->buffer = buffer; + + /* set-up umem pool */ + if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) { + VLOG_ERR("umem_pool_init failed"); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct umem_elem *elem; + + elem = ALIGNED_CAST(struct umem_elem *, + (char *)umem->buffer + i * FRAME_SIZE); + umem_elem_push(&umem->mpool, elem); + } + + /* set-up metadata */ + if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) { + VLOG_ERR("xpacket_pool_init failed"); + umem_pool_cleanup(&umem->mpool); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + VLOG_DBG("%s xpacket pool from %p to %p", __func__, + umem->xpool.array, + (char *)umem->xpool.array + + NUM_FRAMES * sizeof(struct dp_packet_afxdp)); + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + xpacket = UMEM2XPKT(umem->xpool.array, i); + xpacket->mpool = &umem->mpool; + + packet = &xpacket->packet; + packet->source = DPBUF_AFXDP; + } + + return umem; +} + +static struct xsk_socket_info * +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex, + uint32_t queue_id, int xdpmode) +{ + struct xsk_socket_config cfg; + struct xsk_socket_info *xsk; + char devname[IF_NAMESIZE]; + uint32_t idx = 0; + int ret; + int i; + + xsk = xcalloc(1, sizeof(*xsk)); + xsk->umem = umem; + cfg.rx_size = CONS_NUM_DESCS; + cfg.tx_size = PROD_NUM_DESCS; + cfg.libbpf_flags = 0; + + if (xdpmode == XDP_ZEROCOPY) { + cfg.bind_flags = XDP_ZEROCOPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + } else { + cfg.bind_flags = XDP_COPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + } + + if (if_indextoname(ifindex, devname) == NULL) { + VLOG_ERR("ifindex %d to devname failed (%s)", + ifindex, 
ovs_strerror(errno)); + free(xsk); + return NULL; + } + + ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem, + &xsk->rx, &xsk->tx, &cfg); + if (ret) { + VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d", + ovs_strerror(errno), + xdpmode == XDP_COPY ? "SKB": "DRV", + queue_id); + free(xsk); + return NULL; + } + + /* Make sure the built-in AF_XDP program is loaded */ + ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags); + if (ret) { + VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno)); + xsk_socket__delete(xsk->xsk); + free(xsk); + return NULL; + } + + /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */ + while (!xsk_ring_prod__reserve(&xsk->umem->fq, + PROD_NUM_DESCS - BATCH_SIZE, &idx)) { + VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue"); + } + + for (i = 0; + i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE; + i += FRAME_SIZE) { + struct umem_elem *elem; + uint64_t addr; + + elem = umem_elem_pop(&xsk->umem->mpool); + addr = UMEM2DESC(elem, xsk->umem->buffer); + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, + PROD_NUM_DESCS - BATCH_SIZE); + return xsk; +} + +static struct xsk_socket_info * +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode) +{ + struct xsk_socket_info *xsk; + struct xsk_umem_info *umem; + void *bufs; + + /* umem memory region */ + bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE); + memset(bufs, 0, NUM_FRAMES * FRAME_SIZE); + + /* create AF_XDP socket */ + umem = xsk_configure_umem(bufs, + NUM_FRAMES * FRAME_SIZE, + xdpmode); + if (!umem) { + free_pagealign(bufs); + return NULL; + } + + xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode); + if (!xsk) { + /* clean up umem and xpacket pool */ + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free_pagealign(bufs); + umem_pool_cleanup(&umem->mpool); + xpacket_pool_cleanup(&umem->xpool); + free(umem); + } + return xsk; 
+} + +static int +xsk_configure_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk; + int i, ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + /* configure each queue */ + for (i = 0; i < netdev->n_rxq; i++) { + VLOG_INFO("%s configure queue %d mode %s", __func__, i, + dev->xdpmode == XDP_COPY ? "SKB" : "DRV"); + xsk = xsk_configure(ifindex, i, dev->xdpmode); + if (!xsk) { + VLOG_ERR("failed to create AF_XDP socket on queue %d", i); + goto err; + } + dev->xsk[i] = xsk; + xsk->rx_dropped = 0; + xsk->tx_dropped = 0; + } + + return 0; + +err: + xsk_destroy_all(netdev); + return EINVAL; +} + +static void +xsk_destroy(struct xsk_socket_info *xsk) +{ + struct xsk_umem *umem; + + if (!xsk) { + return; + } + + umem = xsk->umem->umem; + xsk_socket__delete(xsk->xsk); + if (xsk_umem__delete(umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + + /* free the packet buffer */ + free_pagealign(xsk->umem->buffer); + + /* cleanup umem pool */ + umem_pool_cleanup(&xsk->umem->mpool); + + /* cleanup metadata pool */ + xpacket_pool_cleanup(&xsk->umem->xpool); + + free(xsk->umem); + free(xsk); +} + +static void +xsk_destroy_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int i, ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + for (i = 0; i < MAX_XSKQ; i++) { + if (dev->xsk[i]) { + VLOG_INFO("destroy xsk[%d]", i); + xsk_destroy(dev->xsk[i]); + dev->xsk[i] = NULL; + /* rx_dropped and tx_dropped lived in the freed xsk, so there is + * nothing left to reset here. 
*/ + } + } + VLOG_INFO("remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); +} + +static inline void OVS_UNUSED +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) { + struct xdp_statistics stat; + socklen_t optlen; + + optlen = sizeof stat; + ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS, + &stat, &optlen) == 0); + + VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, 
tx_invalid %llu", + stat.rx_dropped, + stat.rx_invalid_descs, + stat.tx_invalid_descs); +} + +int +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp OVS_UNUSED) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + const char *str_xdpmode; + int xdpmode, new_n_rxq; + + ovs_mutex_lock(&dev->mutex); + new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1); + if (new_n_rxq > MAX_XSKQ) { + ovs_mutex_unlock(&dev->mutex); + VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).", + netdev_get_name(netdev), new_n_rxq, MAX_XSKQ); + return EINVAL; + } + + str_xdpmode = smap_get_def(args, "xdpmode", "skb"); + if (!strcasecmp(str_xdpmode, "drv")) { + xdpmode = XDP_ZEROCOPY; + } else if (!strcasecmp(str_xdpmode, "skb")) { + xdpmode = XDP_COPY; + } else { + VLOG_ERR("%s: Incorrect xdpmode (%s).", + netdev_get_name(netdev), str_xdpmode); + ovs_mutex_unlock(&dev->mutex); + return EINVAL; + } + + if (dev->requested_n_rxq != new_n_rxq + || dev->requested_xdpmode != xdpmode) { + dev->requested_n_rxq = new_n_rxq; + dev->requested_xdpmode = xdpmode; + netdev_request_reconfigure(netdev); + } + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +int +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + + ovs_mutex_lock(&dev->mutex); + smap_add_format(args, "n_rxq", "%d", netdev->n_rxq); + smap_add_format(args, "xdpmode", "%s", + dev->xdp_bind_flags == XDP_ZEROCOPY ? 
"drv" : "skb"); + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +int +netdev_afxdp_reconfigure(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + int err = 0; + + ovs_mutex_lock(&dev->mutex); + + if (netdev->n_rxq == dev->requested_n_rxq + && dev->xdpmode == dev->requested_xdpmode) { + goto out; + } + + xsk_destroy_all(netdev); + netdev->n_rxq = dev->requested_n_rxq; + + if (dev->requested_xdpmode == XDP_ZEROCOPY) { + VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev)); + /* From SKB mode to DRV mode */ + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + dev->xdp_bind_flags = XDP_ZEROCOPY; + dev->xdpmode = XDP_ZEROCOPY; + + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s", + ovs_strerror(errno)); + } + } else { + VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev)); + /* From DRV mode to SKB mode */ + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + dev->xdp_bind_flags = XDP_COPY; + dev->xdpmode = XDP_COPY; + /* TODO: set rlimit back to previous value + * when no device is in DRV mode. + */ + } + + err = xsk_configure_all(netdev); + if (err) { + VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev)); + } + netdev_change_seq_changed(netdev); +out: + ovs_mutex_unlock(&dev->mutex); + return err; +} + +int +netdev_afxdp_get_numa_id(const struct netdev *netdev) +{ + /* FIXME: Get netdev's PCIe device ID, then find + * its NUMA node id. 
+ */ + VLOG_INFO("FIXME: Device %s always use numa id 0", + netdev_get_name(netdev)); + return 0; +} + +static void +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode) +{ + uint32_t curr_prog_id = 0; + uint32_t flags; + + /* remove_xdp_program() */ + if (xdpmode == XDP_COPY) { + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + } else { + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + } + + if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) { + bpf_set_link_xdp_fd(ifindex, -1, flags); + } + if (prog_id == curr_prog_id) { + bpf_set_link_xdp_fd(ifindex, -1, flags); + } else if (!curr_prog_id) { + VLOG_INFO("couldn't find a prog id on a given interface"); + } else { + VLOG_INFO("program on interface changed, not removing"); + } +} + +void +signal_remove_xdp(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + VLOG_WARN("force remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); +} + +static struct dp_packet_afxdp * +dp_packet_cast_afxdp(const struct dp_packet *d) +{ + ovs_assert(d->source == DPBUF_AFXDP); + return CONTAINER_OF(d, struct dp_packet_afxdp, packet); +} + +void +free_afxdp_buf(struct dp_packet *p) +{ + struct dp_packet_afxdp *xpacket; + unsigned long addr; + + xpacket = dp_packet_cast_afxdp(p); + if (xpacket->mpool) { + void *base = dp_packet_base(p); + + addr = (unsigned long)base & (~FRAME_SHIFT_MASK); + umem_elem_push(xpacket->mpool, (void *)addr); + } +} + +static void +free_afxdp_buf_batch(struct dp_packet_batch *batch) +{ + struct dp_packet_afxdp *xpacket = NULL; + struct dp_packet *packet; + void *elems[BATCH_SIZE]; + unsigned long addr; + + /* all packets are AF_XDP, so handles its own delete in batch */ + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + xpacket = dp_packet_cast_afxdp(packet); + if (xpacket->mpool) { + void *base = dp_packet_base(packet); + + addr = (unsigned long)base & 
(~FRAME_SHIFT_MASK); + elems[i] = (void *)addr; + } + } + umem_elem_push_n(xpacket->mpool, batch->count, elems); + dp_packet_batch_init(batch); +} + +int +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, + int *qfill) +{ + struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); + struct netdev *netdev = rx->up.netdev; + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct umem_elem *elems[BATCH_SIZE]; + uint32_t idx_rx = 0, idx_fq = 0; + struct xsk_socket_info *xsk; + int qid = rxq_->queue_id; + unsigned int rcvd, i; + int ret = 0; + + xsk = dev->xsk[qid]; + rx->fd = xsk_socket__fd(xsk->xsk); + + /* See if there is any packet on RX queue, + * if yes, idx_rx is the index having the packet. + */ + rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); + if (!rcvd) { + return 0; + } + + ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems); + if (OVS_UNLIKELY(ret)) { + xsk_ring_cons__release(&xsk->rx, rcvd); + xsk->rx_dropped += rcvd; + return ENOMEM; + } + + /* Prepare for the FILL queue */ + if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) { + /* The FILL queue is full, don't retry or process rx. Wait for kernel + * to move received packets from FILL queue to RX queue. 
+ */ + umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); + xsk_ring_cons__release(&xsk->rx, rcvd); + xsk->rx_dropped += rcvd; + return ENOMEM; + } + + /* Setup a dp_packet batch from descriptors in RX queue */ + for (i = 0; i < rcvd; i++) { + uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; + uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len; + char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr); + uint64_t index; + + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + index = addr >> FRAME_SHIFT; + xpacket = UMEM2XPKT(xsk->umem->xpool.array, index); + packet = &xpacket->packet; + + /* Initialize the struct dp_packet */ + dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM); + dp_packet_set_size(packet, len); + + /* Add packet into batch, increase batch->count */ + dp_packet_batch_add(batch, packet); + + idx_rx++; + } + /* Release the RX queue */ + xsk_ring_cons__release(&xsk->rx, rcvd); + + for (i = 0; i < rcvd; i++) { + uint64_t index; + struct umem_elem *elem; + + /* Get one free umem, program it into FILL queue */ + elem = elems[i]; + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); + ovs_assert((index & FRAME_SHIFT_MASK) == 0); + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index; + + idx_fq++; + } + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); + + if (qfill) { + /* TODO: return the number of remaining packets in the queue. */ + *qfill = 0; + } + +#ifdef AFXDP_DEBUG + log_xsk_stat(xsk); +#endif + return 0; +} + +static inline int +kick_tx(struct xsk_socket_info *xsk) +{ + int ret; + + /* This causes system call into kernel's xsk_sendmsg, and + * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode). 
+ */ + ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); + if (OVS_UNLIKELY(ret < 0)) { + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { + return errno; + } + } + /* no error, or EBUSY or EAGAIN */ + return 0; +} + +static inline bool +check_free_batch(struct dp_packet_batch *batch) +{ + struct umem_pool *first_mpool = NULL; + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + if (packet->source != DPBUF_AFXDP) { + return false; + } + xpacket = dp_packet_cast_afxdp(packet); + if (i == 0) { + first_mpool = xpacket->mpool; + continue; + } + if (xpacket->mpool != first_mpool) { + return false; + } + } + /* All packets are DPBUF_AFXDP and from the same mpool */ + return true; +} + +static inline void +afxdp_complete_tx(struct xsk_socket_info *xsk) +{ + struct umem_elem *elems_push[BATCH_SIZE]; + uint32_t idx_cq = 0; + int tx_done, j, ret; + + if (!xsk->outstanding_tx) { + return; + } + + ret = kick_tx(xsk); + if (OVS_UNLIKELY(ret)) { + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", + ovs_strerror(ret)); + } + + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq); + if (tx_done > 0) { + xsk_ring_cons__release(&xsk->umem->cq, tx_done); + xsk->outstanding_tx -= tx_done; + } + + /* Recycle back to umem pool */ + for (j = 0; j < tx_done; j++) { + struct umem_elem *elem; + uint64_t addr; + + addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++); + elem = ALIGNED_CAST(struct umem_elem *, + (char *)xsk->umem->buffer + addr); + elems_push[j] = elem; + } + + umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push); +} + +int +netdev_afxdp_batch_send(struct netdev *netdev_, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev_); + struct xsk_socket_info *xsk = dev->xsk[qid]; + struct umem_elem *elems_pop[BATCH_SIZE]; + struct dp_packet *packet; + bool free_batch = true; + 
uint32_t idx = 0; + int error = 0; + int ret; + + if (OVS_UNLIKELY(concurrent_txq)) { + ovs_spin_lock(&dev->tx_lock); + } + + /* Process CQ first. */ + afxdp_complete_tx(xsk); + + free_batch = check_free_batch(batch); + + ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); + if (OVS_UNLIKELY(ret)) { + xsk->tx_dropped += batch->count; + error = ENOMEM; + goto out; + } + + /* Make sure we have enough TX descs */ + ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx); + if (OVS_UNLIKELY(ret == 0)) { + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); + xsk->tx_dropped += batch->count; + error = ENOMEM; + goto out; + } + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + struct umem_elem *elem; + uint64_t index; + + elem = elems_pop[i]; + /* Copy the packet to the umem we just pop from umem pool. + * TODO: avoid this copy if the packet and the pop umem + * are located in the same umem. + */ + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet)); + + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index; + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len + = dp_packet_size(packet); + } + xsk_ring_prod__submit(&xsk->tx, batch->count); + xsk->outstanding_tx += batch->count; + + ret = kick_tx(xsk); + if (OVS_UNLIKELY(ret)) { + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", + ovs_strerror(ret)); + } + +out: + if (free_batch) { + free_afxdp_buf_batch(batch); + } else { + dp_packet_delete_batch(batch, true); + } + + if (OVS_UNLIKELY(concurrent_txq)) { + ovs_spin_unlock(&dev->tx_lock); + } + return error; +} + +int +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED) +{ + /* Done at reconfigure */ + return 0; +} + +void +netdev_afxdp_destruct(struct netdev *netdev_) +{ + struct netdev_linux *netdev = netdev_linux_cast(netdev_); + + /* Note: tc is by-passed when using 
drv-mode, but when using + * skb-mode, we might need to clean up tc. */ + + xsk_destroy_all(netdev_); + ovs_mutex_destroy(&netdev->mutex); +} + +int +netdev_afxdp_get_stats(const struct netdev *netdev_, + struct netdev_stats *stats) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev_); + struct netdev_stats dev_stats; + struct xsk_socket_info *xsk; + int error, i; + + ovs_mutex_lock(&dev->mutex); + + error = get_stats_via_netlink(netdev_, &dev_stats); + if (error) { + VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics"); + } else { + /* Use kernel netdev's packet and byte counts */ + stats->rx_packets = dev_stats.rx_packets; + stats->rx_bytes = dev_stats.rx_bytes; + stats->tx_packets = dev_stats.tx_packets; + stats->tx_bytes = dev_stats.tx_bytes; + + stats->rx_errors += dev_stats.rx_errors; + stats->tx_errors += dev_stats.tx_errors; + stats->rx_dropped += dev_stats.rx_dropped; + stats->tx_dropped += dev_stats.tx_dropped; + stats->multicast += dev_stats.multicast; + stats->collisions += dev_stats.collisions; + stats->rx_length_errors += dev_stats.rx_length_errors; + stats->rx_over_errors += dev_stats.rx_over_errors; + stats->rx_crc_errors += dev_stats.rx_crc_errors; + stats->rx_frame_errors += dev_stats.rx_frame_errors; + stats->rx_fifo_errors += dev_stats.rx_fifo_errors; + stats->rx_missed_errors += dev_stats.rx_missed_errors; + stats->tx_aborted_errors += dev_stats.tx_aborted_errors; + stats->tx_carrier_errors += dev_stats.tx_carrier_errors; + stats->tx_fifo_errors += dev_stats.tx_fifo_errors; + stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors; + stats->tx_window_errors += dev_stats.tx_window_errors; + + /* Account the dropped in each xsk */ + for (i = 0; i < MAX_XSKQ; i++) { + xsk = dev->xsk[i]; + if (xsk) { + stats->rx_dropped += xsk->rx_dropped; + stats->tx_dropped += xsk->tx_dropped; + } + } + } + ovs_mutex_unlock(&dev->mutex); + + return error; +} diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h new file mode 100644 index 
000000000000..dd2dc1a2064d --- /dev/null +++ b/lib/netdev-afxdp.h @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_AFXDP_H +#define NETDEV_AFXDP_H 1 + +#include <config.h> + +#ifdef HAVE_AF_XDP + +#include <stdint.h> +#include <stdbool.h> + +/* These functions are Linux AF_XDP specific, so they should be used directly + * only by Linux-specific code. */ + +#define MAX_XSKQ 16 + +struct netdev; +struct xsk_socket_info; +struct xdp_umem; +struct dp_packet_batch; +struct smap; +struct dp_packet; +struct netdev_rxq; +struct netdev_stats; + +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_); +void netdev_afxdp_destruct(struct netdev *netdev_); + +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, + struct dp_packet_batch *batch, + int *qfill); +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq); +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp); +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args); +int netdev_afxdp_get_numa_id(const struct netdev *netdev); +int netdev_afxdp_get_stats(const struct netdev *netdev_, + struct netdev_stats *stats); + +void free_afxdp_buf(struct dp_packet *p); +int netdev_afxdp_reconfigure(struct netdev *netdev); +void signal_remove_xdp(struct netdev *netdev); + +#else /* !HAVE_AF_XDP */ + +#include 
"openvswitch/compiler.h" + +struct dp_packet; + +static inline void +free_afxdp_buf(struct dp_packet *p OVS_UNUSED) +{ + /* Nothing */ +} + +#endif /* HAVE_AF_XDP */ +#endif /* netdev-afxdp.h */ diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h new file mode 100644 index 000000000000..d43f79e6aa41 --- /dev/null +++ b/lib/netdev-linux-private.h @@ -0,0 +1,139 @@ +/* + * Copyright (c) 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_LINUX_PRIVATE_H +#define NETDEV_LINUX_PRIVATE_H 1 + +#include <config.h> + +#include <linux/filter.h> +#include <linux/gen_stats.h> +#include <linux/if_ether.h> +#include <linux/if_tun.h> +#include <linux/types.h> +#include <linux/ethtool.h> +#include <linux/mii.h> +#include <stdint.h> +#include <stdbool.h> + +#include "netdev-afxdp.h" +#include "netdev-provider.h" +#include "netdev-tc-offloads.h" +#include "netdev-vport.h" +#include "openvswitch/thread.h" +#include "ovs-atomic.h" +#include "timer.h" +#include "xdpsock.h" + +/* These functions are Linux specific, so they should be used directly only by + * Linux-specific code. 
*/ + +struct netdev; + +struct netdev_rxq_linux { + struct netdev_rxq up; + bool is_tap; + int fd; +}; + +void netdev_linux_run(const struct netdev_class *); + +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, + const char *flag_name, bool enable); + +int get_stats_via_netlink(const struct netdev *netdev_, + struct netdev_stats *stats); + +struct netdev_linux { + struct netdev up; + + /* Protects all members below. */ + struct ovs_mutex mutex; + + unsigned int cache_valid; + + bool miimon; /* Link status of last poll. */ + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ + struct timer miimon_timer; + + int netnsid; /* Network namespace ID. */ + /* The following are figured out "on demand" only. They are only valid + * when the corresponding VALID_* bit in 'cache_valid' is set. */ + int ifindex; + struct eth_addr etheraddr; + int mtu; + unsigned int ifi_flags; + long long int carrier_resets; + uint32_t kbits_rate; /* Policing data. */ + uint32_t kbits_burst; + int vport_stats_error; /* Cached error code from vport_get_stats(). + 0 or an errno value. */ + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU + * or SIOCSIFMTU. + */ + int ether_addr_error; /* Cached error code from set/get etheraddr. */ + int netdev_policing_error; /* Cached error code from set policing. */ + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ + + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ + + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ + struct tc *tc; + + /* For devices of class netdev_tap_class only. */ + int tap_fd; + bool present; /* If the device is present in the namespace */ + uint64_t tx_dropped; /* tap device can drop if the iface is down */ + + /* LAG information. 
*/ + bool is_lag_master; /* True if the netdev is a LAG master. */ + + /* AF_XDP information */ +#ifdef HAVE_AF_XDP + struct xsk_socket_info *xsk[MAX_XSKQ]; + int requested_n_rxq; + int xdpmode, requested_xdpmode; /* detect mode changed */ + int xdp_flags, xdp_bind_flags; + ovs_spinlock_t tx_lock; +#endif +}; + +static bool +is_netdev_linux_class(const struct netdev_class *netdev_class) +{ + return netdev_class->run == netdev_linux_run; +} + +static struct netdev_linux * +netdev_linux_cast(const struct netdev *netdev) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); + + return CONTAINER_OF(netdev, struct netdev_linux, up); +} + +static struct netdev_rxq_linux * +netdev_rxq_linux_cast(const struct netdev_rxq *rx) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); + + return CONTAINER_OF(rx, struct netdev_rxq_linux, up); +} + +#endif /* netdev-linux-private.h */ diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index f75d73fd39f8..2883cf1f2586 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -17,6 +17,7 @@ #include <config.h> #include "netdev-linux.h" +#include "netdev-linux-private.h" #include <errno.h> #include <fcntl.h> @@ -54,6 +55,7 @@ #include "fatal-signal.h" #include "hash.h" #include "openvswitch/hmap.h" +#include "netdev-afxdp.h" #include "netdev-provider.h" #include "netdev-tc-offloads.h" #include "netdev-vport.h" @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu); static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu); static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes); -struct netdev_linux { - struct netdev up; - - /* Protects all members below. */ - struct ovs_mutex mutex; - - unsigned int cache_valid; - - bool miimon; /* Link status of last poll. */ - long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ - struct timer miimon_timer; - - int netnsid; /* Network namespace ID. 
*/ - /* The following are figured out "on demand" only. They are only valid - * when the corresponding VALID_* bit in 'cache_valid' is set. */ - int ifindex; - struct eth_addr etheraddr; - int mtu; - unsigned int ifi_flags; - long long int carrier_resets; - uint32_t kbits_rate; /* Policing data. */ - uint32_t kbits_burst; - int vport_stats_error; /* Cached error code from vport_get_stats(). - 0 or an errno value. */ - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */ - int ether_addr_error; /* Cached error code from set/get etheraddr. */ - int netdev_policing_error; /* Cached error code from set policing. */ - int get_features_error; /* Cached error code from ETHTOOL_GSET. */ - int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ - - enum netdev_features current; /* Cached from ETHTOOL_GSET. */ - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ - - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ - struct tc *tc; - - /* For devices of class netdev_tap_class only. */ - int tap_fd; - bool present; /* If the device is present in the namespace */ - uint64_t tx_dropped; /* tap device can drop if the iface is down */ - - /* LAG information. */ - bool is_lag_master; /* True if the netdev is a LAG master. */ -}; - -struct netdev_rxq_linux { - struct netdev_rxq up; - bool is_tap; - int fd; -}; /* This is set pretty low because we probably won't learn anything from the * additional log messages. */ @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); * changes in the device miimon status, so we can use atomic_count. 
*/ static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); -static void netdev_linux_run(const struct netdev_class *); - static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *, int cmd, const char *cmd_name); static int get_flags(const struct netdev *, unsigned int *flags); @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev, struct in_addr addr); static int get_etheraddr(const char *netdev_name, struct eth_addr *ea); static int set_etheraddr(const char *netdev_name, const struct eth_addr); -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *); static int af_packet_sock(void); static bool netdev_linux_miimon_enabled(void); static void netdev_linux_miimon_run(void); @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void); static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup); static bool -is_netdev_linux_class(const struct netdev_class *netdev_class) -{ - return netdev_class->run == netdev_linux_run; -} - -static bool is_tap_netdev(const struct netdev *netdev) { return netdev_get_class(netdev) == &netdev_tap_class; } - -static struct netdev_linux * -netdev_linux_cast(const struct netdev *netdev) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); - - return CONTAINER_OF(netdev, struct netdev_linux, up); -} - -static struct netdev_rxq_linux * -netdev_rxq_linux_cast(const struct netdev_rxq *rx) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); - return CONTAINER_OF(rx, struct netdev_rxq_linux, up); -} static int netdev_linux_netnsid_update__(struct netdev_linux *netdev) @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change) } } -static void +void netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED) { struct nl_sock *sock; @@ -3279,9 +3206,7 @@ exit: .run = netdev_linux_run, \ .wait = netdev_linux_wait, \ .alloc = netdev_linux_alloc, \ - .destruct = netdev_linux_destruct, \ .dealloc = netdev_linux_dealloc, \ - 
.send = netdev_linux_send, \ .send_wait = netdev_linux_send_wait, \ .set_etheraddr = netdev_linux_set_etheraddr, \ .get_etheraddr = netdev_linux_get_etheraddr, \ @@ -3312,10 +3237,8 @@ exit: .arp_lookup = netdev_linux_arp_lookup, \ .update_flags = netdev_linux_update_flags, \ .rxq_alloc = netdev_linux_rxq_alloc, \ - .rxq_construct = netdev_linux_rxq_construct, \ .rxq_destruct = netdev_linux_rxq_destruct, \ .rxq_dealloc = netdev_linux_rxq_dealloc, \ - .rxq_recv = netdev_linux_rxq_recv, \ .rxq_wait = netdev_linux_rxq_wait, \ .rxq_drain = netdev_linux_rxq_drain @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = { NETDEV_LINUX_CLASS_COMMON, LINUX_FLOW_OFFLOAD_API, .type = "system", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_linux_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, - .get_block_id = netdev_linux_get_block_id + .get_block_id = netdev_linux_get_block_id, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_tap_class = { NETDEV_LINUX_CLASS_COMMON, .type = "tap", + .is_pmd = false, .construct = netdev_linux_construct_tap, + .destruct = netdev_linux_destruct, .get_stats = netdev_tap_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_internal_class = { NETDEV_LINUX_CLASS_COMMON, LINUX_FLOW_OFFLOAD_API, .type = "internal", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_internal_get_stats, .get_status = netdev_internal_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; + +#ifdef HAVE_AF_XDP +const struct 
netdev_class netdev_afxdp_class = { + NETDEV_LINUX_CLASS_COMMON, + .type = "afxdp", + .is_pmd = true, + .construct = netdev_linux_construct, + .destruct = netdev_afxdp_destruct, + .get_stats = netdev_afxdp_get_stats, + .get_status = netdev_linux_get_status, + .set_config = netdev_afxdp_set_config, + .get_config = netdev_afxdp_get_config, + .reconfigure = netdev_afxdp_reconfigure, + .get_numa_id = netdev_afxdp_get_numa_id, + .send = netdev_afxdp_batch_send, + .rxq_construct = netdev_afxdp_rxq_construct, + .rxq_recv = netdev_afxdp_rxq_recv, +}; +#endif #define CODEL_N_QUEUES 0x0000 @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst, dst->tx_window_errors = src->tx_window_errors; } -static int +int get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats) { struct ofpbuf request; diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h index fb0c27e6e8e8..91e6a9e2bfc0 100644 --- a/lib/netdev-provider.h +++ b/lib/netdev-provider.h @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class; extern const struct netdev_class netdev_internal_class; extern const struct netdev_class netdev_tap_class; +#ifdef HAVE_AF_XDP +extern const struct netdev_class netdev_afxdp_class; +#endif #ifdef __cplusplus } #endif diff --git a/lib/netdev.c b/lib/netdev.c index 7d7ecf6f0946..0fac117cc602 100644 --- a/lib/netdev.c +++ b/lib/netdev.c @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); static void restore_all_flags(void *aux OVS_UNUSED); void update_device_args(struct netdev *, const struct shash *args); +#ifdef HAVE_AF_XDP +void signal_remove_xdp(struct netdev *netdev); +#endif int netdev_n_txq(const struct netdev *netdev) @@ -146,6 +149,9 @@ netdev_initialize(void) netdev_register_provider(&netdev_internal_class); netdev_register_provider(&netdev_tap_class); netdev_vport_tunnel_register(); +#ifdef HAVE_AF_XDP + netdev_register_provider(&netdev_afxdp_class); +#endif #endif #if 
defined(__FreeBSD__) || defined(__NetBSD__) netdev_register_provider(&netdev_tap_class); @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED) saved_flags & ~saved_values, &old_flags); } +#ifdef HAVE_AF_XDP + if (netdev->netdev_class == &netdev_afxdp_class) { + signal_remove_xdp(netdev); + } +#endif } } diff --git a/lib/spinlock.h b/lib/spinlock.h new file mode 100644 index 000000000000..17d79f217410 --- /dev/null +++ b/lib/spinlock.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#ifndef SPINLOCK_H +#define SPINLOCK_H 1 + +#include <config.h> + +#include <ctype.h> +#include <errno.h> +#include <fcntl.h> +#include <stdarg.h> +#include <stdlib.h> +#include <unistd.h> + +#include "ovs-atomic.h" + +typedef struct { + atomic_int locked; +} ovs_spinlock_t; + +static inline void +ovs_spinlock_init(ovs_spinlock_t *sl) +{ + atomic_init(&sl->locked, 0); +} + +static inline void +ovs_spin_lock(ovs_spinlock_t *sl) +{ + int exp = 0, locked = 0; + + while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, + memory_order_acquire, + memory_order_relaxed)) { + locked = 1; + while (locked) { + atomic_read_relaxed(&sl->locked, &locked); + } + exp = 0; + } +} + +static inline void +ovs_spin_unlock(ovs_spinlock_t *sl) +{ + atomic_store_explicit(&sl->locked, 0, memory_order_release); +} + +static inline int OVS_UNUSED +ovs_spin_trylock(ovs_spinlock_t *sl) +{ + int exp = 0; + return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, + memory_order_acquire, + memory_order_relaxed); +} +#endif diff --git a/lib/util.c b/lib/util.c index 5679232ffc5f..060b1e287bce 100644 --- a/lib/util.c +++ b/lib/util.c @@ -277,6 +277,49 @@ free_cacheline(void *p) #endif } +#ifdef HAVE_AF_XDP +void * +xmalloc_pagealign(size_t size) +{ +#ifdef HAVE_POSIX_MEMALIGN + void *p; + int error; + + COVERAGE_INC(util_xalloc); + error = posix_memalign(&p, get_page_size(), size ? size : 1); + if (error != 0) { + out_of_memory(); + } + return p; +#else + /* Similar to xmalloc_cacheline, but replace + * CACHE_LINE_SIZE with get_page_size() */ + void *p = xmalloc((get_page_size() - 1) + + sizeof(void *) + + ROUND_UP(size, get_page_size())); + bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *); + void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? 
get_page_size() : 0), + get_page_size()); + void **q = (void **) r - 1; + *q = p; + return r; +#endif +} + +void +free_pagealign(void *p) +{ +#ifdef HAVE_POSIX_MEMALIGN + free(p); +#else + if (p) { + void **q = (void **) p - 1; + free(*q); + } +#endif +} +#endif + char * xasprintf(const char *format, ...) { diff --git a/lib/util.h b/lib/util.h index 53354f1c6f0f..3cd8cf87fba8 100644 --- a/lib/util.h +++ b/lib/util.h @@ -163,6 +163,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size); int string_ends_with(const char *str, const char *suffix); +#ifdef HAVE_AF_XDP +void *xmalloc_pagealign(size_t) MALLOC_LIKE; +void free_pagealign(void *); +#endif + /* The C standards say that neither the 'dst' nor 'src' argument to * memcpy() may be null, even if 'n' is zero. This wrapper tolerates * the null case. */ diff --git a/lib/xdpsock.c b/lib/xdpsock.c new file mode 100644 index 000000000000..ffdb54dfcd27 --- /dev/null +++ b/lib/xdpsock.c @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#include <config.h> + +#include "xdpsock.h" +#include "dp-packet.h" +#include "openvswitch/compiler.h" + +/* Note: + * umem_elem_push* shouldn't overflow because we always pop + * elem first, then push back to the stack. 
+ */ +static inline void +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) +{ + void *ptr; + + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) { + OVS_NOT_REACHED(); + } + + ptr = &umemp->array[umemp->index]; + memcpy(ptr, addrs, n * sizeof(void *)); + umemp->index += n; +} + +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) +{ + ovs_spin_lock(&umemp->mutex); + __umem_elem_push_n(umemp, n, addrs); + ovs_spin_unlock(&umemp->mutex); +} + +static inline void +__umem_elem_push(struct umem_pool *umemp, void *addr) +{ + if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) { + OVS_NOT_REACHED(); + } + + umemp->array[umemp->index++] = addr; +} + +void +umem_elem_push(struct umem_pool *umemp, void *addr) +{ + + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0); + + ovs_spin_lock(&umemp->mutex); + __umem_elem_push(umemp, addr); + ovs_spin_unlock(&umemp->mutex); +} + +static inline int +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) +{ + void *ptr; + + if (OVS_UNLIKELY(umemp->index - n < 0)) { + return -ENOMEM; + } + + umemp->index -= n; + ptr = &umemp->array[umemp->index]; + memcpy(addrs, ptr, n * sizeof(void *)); + + return 0; +} + +int +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) +{ + int ret; + + ovs_spin_lock(&umemp->mutex); + ret = __umem_elem_pop_n(umemp, n, addrs); + ovs_spin_unlock(&umemp->mutex); + + return ret; +} + +static inline void * +__umem_elem_pop(struct umem_pool *umemp) +{ + if (OVS_UNLIKELY(umemp->index - 1 < 0)) { + return NULL; + } + + return umemp->array[--umemp->index]; +} + +void * +umem_elem_pop(struct umem_pool *umemp) +{ + void *ptr; + + ovs_spin_lock(&umemp->mutex); + ptr = __umem_elem_pop(umemp); + ovs_spin_unlock(&umemp->mutex); + + return ptr; +} + +static void ** +__umem_pool_alloc(unsigned int size) +{ + void *bufs; + int ret; + + ret = posix_memalign(&bufs, getpagesize(), + size * sizeof(void *)); + if (ret) { + return NULL; + } + + memset(bufs, 0, size * sizeof(void 
*)); + return (void **)bufs; +} + +int +umem_pool_init(struct umem_pool *umemp, unsigned int size) +{ + umemp->array = __umem_pool_alloc(size); + if (!umemp->array) { + return -ENOMEM; + } + + umemp->size = size; + umemp->index = 0; + ovs_spinlock_init(&umemp->mutex); + return 0; +} + +void +umem_pool_cleanup(struct umem_pool *umemp) +{ + free(umemp->array); + umemp->array = NULL; +} + +/* AF_XDP metadata init/destroy */ +int +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size) +{ + void *bufs; + int ret; + + ret = posix_memalign(&bufs, getpagesize(), + size * sizeof(struct dp_packet_afxdp)); + if (ret) { + return -ENOMEM; + } + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp)); + + xp->array = bufs; + xp->size = size; + return 0; +} + +void +xpacket_pool_cleanup(struct xpacket_pool *xp) +{ + free(xp->array); + xp->array = NULL; +} diff --git a/lib/xdpsock.h b/lib/xdpsock.h new file mode 100644 index 000000000000..72578e383812 --- /dev/null +++ b/lib/xdpsock.h @@ -0,0 +1,101 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdio.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "spinlock.h"
+
+#define FRAME_HEADROOM XDP_PACKET_HEADROOM
+#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
+ * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
+ */
+#define NUM_FRAMES (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+
+#define BATCH_SIZE NETDEV_MAX_BURST
+
+BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
+BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
+BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
+
+/* LIFO ptr_array */
+struct umem_pool {
+    int index; /* point to top */
+    unsigned int size;
+    ovs_spinlock_t mutex;
+    void **array; /* a pointer array, point to umem buf */
+};
+
+/* array-based dp_packet_afxdp */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_dropped;
+    unsigned long tx_dropped;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *umem_elem_pop(struct umem_pool *umemp);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
+#endif
diff --git a/tests/automake.mk b/tests/automake.mk
index bc906fb79b46..7db64faabc71 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@ EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -159,6 +161,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -183,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -316,6 +323,11 @@ check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -353,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..1e6f7a46b4b7
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,20 @@
+# Add port to ovs bridge by using afxdp mode.
+# This will use generic XDP support in the veth driver.
+m4_define([ADD_VETH],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..9b7a29066614
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-common-macros.at])
+m4_include([tests/system-userspace-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+
+m4_include([tests/system-traffic.at])
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 08001dbce3d3..6195a8fd41cf 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3082,6 +3082,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="xdpmode"
+              type='{"type": "string",
+                     "enum": ["set", ["skb", "drv"]]}'>
+        <p>
+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled. This mode requires a device driver with
+          AF_XDP support and has the best performance.
+          If "skb", the XDP program uses generic XDP mode in the kernel, with
+          extra data copying between userspace and kernel. No device driver
+          support is needed. Note that this applies to the afxdp netdev type only.
+          Defaults to "skb" mode.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
         <p>
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the kernel's existing networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst.  Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improvement)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6:
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6->v7:
- rebase to master, commit abf11558c1515bf3b1
- address feedback from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7->v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak

v8->v9:
- rebase to master 180bbbed3a3867d52
- Address review feedback from Ben, Ilya and Eelco, at:
  https://patchwork.ozlabs.org/patch/1097740/
- == From Ilya ==
- Optimize the reconfiguration logic
- Implement .rxq_recv and .send for afxdp
- Remove system-afxdp-traffic.at, reuse existing code
- Use Ilya's rdtsc code
- remove --disable-system
- == From Eelco ==
- Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
  assertion !fd != !wevent failed
- Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
- Clear xdp program when receive signal, ctrl+c
- Add options to vswitch.xml, set xdpmode default to skb-mode
- No support for ARM and PPC, now x86_64 only
- remove redundant header includes and function/macro definitions
- remove some ifdef HAVE_AF_XDP
- == From others/both about afxdp rx and tx ==
- Several umem push/pop error handling improvement/fixes
- add lock to address concurrent_txq case
- improve error handling
- add stats
- Things that are not done yet
- MTU limitation
- n_txq_desc/n_rxq_desc option.

v9->v10:
- remove x86_64 limitation, suggested by Ben and Eelco
- add xmalloc_pagealign, free_pagealign
- minor refactor
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  35 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  14 +
 lib/dp-packet.c                       |  28 ++
 lib/dp-packet.h                       |  18 +-
 lib/dpif-netdev-perf.h                |  28 ++
 lib/netdev-afxdp.c                    | 850 ++++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  74 +++
 lib/netdev-linux-private.h            | 139 ++++++
 lib/netdev-linux.c                    | 121 ++---
 lib/netdev-provider.h                 |   3 +
 lib/netdev.c                          |  11 +
 lib/spinlock.h                        |  70 +++
 lib/util.c                            |  43 ++
 lib/util.h                            |   5 +
 lib/xdpsock.c                         | 179 +++++++
 lib/xdpsock.h                         | 101 ++++
 tests/automake.mk                     |  16 +
 tests/system-afxdp-macros.at          |  20 +
 tests/system-afxdp-testsuite.at       |  26 ++
 vswitchd/vswitch.xml                  |  15 +
 25 files changed, 2150 insertions(+), 83 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/spinlock.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
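
As a reading aid for the umem_pool helpers in lib/xdpsock.c: the pool is a
LIFO array of frame pointers, and the batched push/pop variants take the lock
once per batch rather than once per element. A minimal standalone sketch of
that idea follows. It is not the patch's code: the ptr_stack names are
hypothetical, a C11 atomic_flag stands in for ovs_spinlock_t, plain calloc()
replaces the page-aligned allocation, and an overflowing push returns -ENOSPC
instead of hitting OVS_NOT_REACHED().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's umem_pool: a LIFO stack of pointers. */
struct ptr_stack {
    int index;           /* Number of stored pointers, i.e. next free slot. */
    unsigned int size;   /* Capacity of 'array'. */
    atomic_flag lock;    /* Minimal spinlock (assumption, not ovs_spinlock_t). */
    void **array;
};

static void
ptr_stack_lock(struct ptr_stack *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire)) {
        /* Busy-wait until the current holder clears the flag. */
    }
}

static void
ptr_stack_unlock(struct ptr_stack *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

int
ptr_stack_init(struct ptr_stack *s, unsigned int size)
{
    assert(s != NULL);
    s->array = calloc(size, sizeof *s->array);
    if (!s->array) {
        return -ENOMEM;
    }
    s->size = size;
    s->index = 0;
    atomic_flag_clear(&s->lock);
    return 0;
}

void
ptr_stack_destroy(struct ptr_stack *s)
{
    free(s->array);
    s->array = NULL;
}

/* Pushes 'n' pointers in one critical section; -ENOSPC if they do not fit. */
int
ptr_stack_push_n(struct ptr_stack *s, int n, void **addrs)
{
    int ret = 0;

    ptr_stack_lock(s);
    if (n < 0 || (unsigned int) s->index + (unsigned int) n > s->size) {
        ret = -ENOSPC;
    } else {
        memcpy(&s->array[s->index], addrs, n * sizeof *addrs);
        s->index += n;
    }
    ptr_stack_unlock(s);
    return ret;
}

/* Pops the top 'n' pointers in one critical section; -ENOMEM if fewer than
 * 'n' are stored.  The batch keeps its stored order. */
int
ptr_stack_pop_n(struct ptr_stack *s, int n, void **addrs)
{
    int ret = 0;

    ptr_stack_lock(s);
    if (n < 0 || s->index - n < 0) {
        ret = -ENOMEM;
    } else {
        s->index -= n;
        memcpy(addrs, &s->array[s->index], n * sizeof *addrs);
    }
    ptr_stack_unlock(s);
    return ret;
}
```

With one lock acquisition per batch, the per-packet locking cost drops roughly
in proportion to the batch size, which is presumably the motivation behind the
"batching umem_push/pop" item in the v6->v7 notes.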