From patchwork Wed Jun 5 20:47:51 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: William Tu
X-Patchwork-Id: 1110729
From: William Tu
To: dev@openvswitch.org, i.maximets@samsung.com, echaudro@redhat.com
Date: Wed, 5 Jun 2019 13:47:51 -0700
Message-Id: <1559767671-6175-1-git-send-email-u9012063@gmail.com>
X-Mailer: git-send-email 2.7.4
Subject: [ovs-dev] [PATCHv11] netdev-afxdp: add new netdev type for AF_XDP.
Sender: ovs-dev-bounces@openvswitch.org

The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel
networking stack.  An AF_XDP socket receives and sends packets from an
eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
kernel's subsystems.  As a result, an AF_XDP socket shows much better
performance than AF_PACKET.  For more details about AF_XDP, please see the
Linux kernel's Documentation/networking/af_xdp.rst.  Note that by default,
this feature is not compiled in.

Signed-off-by: William Tu
---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improvement)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6-v7
- rebase to master, commit abf11558c1515bf3b1
- address feedback from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7-v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak

v8-v9:
- rebase to master 180bbbed3a3867d52
- Address review feedback from Ben, Ilya and Eelco, at:
  https://patchwork.ozlabs.org/patch/1097740/
- == From Ilya ==
- Optimize the reconfiguration logic
- Implement .rxq_recv and .send for afxdp
- Remove system-afxdp-traffic.at, reuse existing code
- Use Ilya's rdtsc code
- remove --disable-system
- == From Eelco ==
- Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
  assertion !fd != !wevent failed
- Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
- Clear xdp program when receive signal, ctrl+c
- Add options to vswitch.xml, set xdpmode default to skb-mode
- No support for ARM and PPC, now x86_64 only
- remove redundant header includes and function/macro definitions
- remove some ifdef HAVE_AF_XDP
- == From others/both about afxdp rx and tx ==
- Several umem push/pop error handling improvement/fixes
- add lock to address concurrent_txq case
- improve error handling
- add stats
- Things that are not done yet
- MTU limitation
- n_txq_desc/n_rxq_desc option.

v9-v10
- remove x86_64 limitation, suggested by Ben and Eelco
- add xmalloc_pagealign, free_pagealign
- minor refactor

v10-v11
- address feedback from Ilya at
  https://patchwork.ozlabs.org/patch/1106495/
- fix typos, and some refactoring
- refactor existing code and introduce xmalloc pagealign
- fix a couple of error handling cases
- allocate per-txq lock
- dynamically allocate xsk array
- fix cycle_counter_update() for non-x86/non-linux case
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  35 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  14 +
 lib/dp-packet.c                       |  28 ++
 lib/dp-packet.h                       |  18 +-
 lib/dpif-netdev-perf.h                |  26 +
 lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  74 +++
 lib/netdev-linux-private.h            | 139 ++++++
 lib/netdev-linux.c                    | 121 ++---
 lib/netdev-provider.h                 |   3 +
 lib/netdev.c                          |  11 +
 lib/spinlock.h                        |  70 +++
 lib/util.c                            |  92 +++-
 lib/util.h                            |   5 +
 lib/xdpsock.c                         | 170 +++++++
 lib/xdpsock.h                         | 101 ++++
 tests/automake.mk                     |  16 +
 tests/system-afxdp-macros.at          |  20 +
 tests/system-afxdp-testsuite.at       |  26 +
 vswitchd/vswitch.xml                  |  15 +
 25 files changed, 2204 insertions(+), 108 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/spinlock.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@ DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@ vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs `
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..554964396353
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,433 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with the existing kernel networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
+As a result, an AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see the Linux kernel's
+Documentation/networking/af_xdp.rst
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+dpdk.  The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of in the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i ** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+First, clone a recent version of Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure the kernel is 5.0 or later
+  make check-afxdp TESTSUITEFLAGS='1'
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure the libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a 1-queue device (queue 0), configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:4,3:5"
+
+.. note::
+   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
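The pmd-cpu-mask values in the examples above are bitmasks in which bit N selects CPU core N: 0x10 selects core 4, and 0x36 selects cores 1, 2, 4 and 5. As an illustration only (this is not part of the patch, and the helper name `cores_to_pmd_mask` is hypothetical), a minimal C sketch of that mapping could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of OVS or of this patch: build a
 * pmd-cpu-mask value from a list of CPU core IDs.  Bit N of the result
 * selects core N. */
static uint64_t
cores_to_pmd_mask(const unsigned int *cores, size_t n_cores)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n_cores; i++) {
        if (cores[i] < 64) {    /* Skip IDs that do not fit in 64 bits. */
            mask |= UINT64_C(1) << cores[i];
        }
    }
    return mask;
}
```

Every core named in pmd-rxq-affinity should also be set in pmd-cpu-mask; otherwise no PMD thread runs on that core to serve the pinned queue.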
+
+To validate that the bridge has been successfully instantiated, use::
+
+  ovs-vsctl show
+
+It should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debugging by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
+section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to current and future AF_XDP work.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing PMD
+to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
+   running cores, device plug-in slot)
+
+#. Isolate your CPUs with the isolcpus kernel parameter in the GRUB config.
+
+#. IRQs should not be assigned to cores that run PMD threads.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
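For the first tuning item, one way to check which NUMA node a physical NIC sits on is to read its `numa_node` attribute from sysfs. The sketch below is an illustration only, not code from this patch; the function name `netdev_numa_node` is hypothetical, and it assumes a Linux system where `/sys/class/net/<ifname>/device/numa_node` exists for PCI devices:

```c
#include <stdio.h>

/* Hypothetical helper, not part of this patch: read the NUMA node of a
 * network device from sysfs.  Returns the node number, or -1 when it
 * cannot be determined (for example, virtual devices such as veth or tap
 * have no 'device' directory, and the kernel itself reports -1 on
 * systems without NUMA information). */
static int
netdev_numa_node(const char *ifname)
{
    char path[256];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof path, "/sys/class/net/%s/device/numa_node",
             ifname);
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    if (fscanf(f, "%d", &node) != 1) {
        node = -1;
    }
    fclose(f);
    return node;
}
```

The returned node can then be compared against the node of the cores chosen in pmd-cpu-mask, for example as listed by `lscpu`.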
+
+
+Debugging performance issue
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p 
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peer::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's numa ID is always 0, need a way to find numa id from a netdev.
+#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer.  A
+   possible work-around is to use the OpenFlow meter action.
+#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
+#. Most of the tests are done using i40e single port. Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test result (TODO items)
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC, and a tap device connected to a VM.
+First, start OVS, then add the physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+The performance number I got is around 1.6Mpps.
+The rate is limited by the kernel's tap interface, which requires copying
+each packet into the kernel from the umem buffer in userspace.
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure --enable-afxdp --with-dpdk=
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+Performance: for RX_DROP: 800Kpps, TX: 700Kpps
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@ Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index cf9cc8b8b0de..721653ab0ec0 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index 2dbe9a9178e3..9e23e1c6958c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include ]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include ]])
diff --git a/lib/automake.mk b/lib/automake.mk
index cc5dccf39d6b..b31e28f6e1f5 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@ if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-tc-offloads.c \
 	lib/netdev-tc-offloads.h \
 	lib/netlink-conntrack.c \
@@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h \
+	lib/spinlock.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..e6a7947076b4 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@
 #include
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initialize 'b' as an empty dp_packet that contains
+ * memory starting at AF_XDP umem base.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+{
+    dp_packet_set_base(b, base);
+    dp_packet_set_data(b, base);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_set_allocated(b, allocated);
+    b->source = DPBUF_AFXDP;
+    dp_packet_reset_offsets(b);
+    pkt_metadata_init(&b->md, 0);
+    dp_packet_reset_cutlen(b);
+    dp_packet_reset_offload(b);
+    b->packet_type = htonl(PT_ETH);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..e3438226e360 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@
 #include
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* buffer data from XDP frame */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +91,13 @@ struct dp_packet {
     };
 };
 
+#if HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#if HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..6b6dfda7db1c 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifdef DPDK_NETDEV
@@ -186,6 +187,24 @@ struct pmd_perf_stats {
     char *log_reason;
 };
 
+#ifdef __linux__
+static inline uint64_t
+rdtsc_syscall(struct pmd_perf_stats *s)
+{
+    struct timespec val;
+    uint64_t v;
+
+    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
+        return s->last_tsc;
+    }
+
+    v = (uint64_t) val.tv_sec * 1000000000LL;
+    v += (uint64_t) val.tv_nsec;
+
+    return s->last_tsc = v;
+}
+#endif
+
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
  * These functions are intended to be invoked in the context of pmd threads.
  */
@@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif !defined(_MSC_VER) && defined(__x86_64__)
+    uint32_t h, l;
+    asm volatile("rdtsc" : "=a" (l), "=d" (h));
+
+    return s->last_tsc = ((uint64_t) h << 32) | l;
+#elif defined(__linux__)
+    return rdtsc_syscall(s);
 #else
     return s->last_tsc = 0;
 #endif
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..a6543e8f5126
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,891 @@
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +#include + +#include "netdev-linux-private.h" +#include "netdev-linux.h" +#include "netdev-afxdp.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dp-packet.h" +#include "dpif-netdev.h" +#include "openvswitch/dynamic-string.h" +#include "openvswitch/vlog.h" +#include "packets.h" +#include "socket-util.h" +#include "spinlock.h" +#include "util.h" +#include "xdpsock.h" + +#ifndef SOL_XDP +#define SOL_XDP 283 +#endif + +VLOG_DEFINE_THIS_MODULE(netdev_afxdp); +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); + +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base)) +#define UMEM2XPKT(base, i) \ + ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \ + i * sizeof(struct dp_packet_afxdp)) + +static uint32_t prog_id; +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id, + int mode); +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode); +static void xsk_destroy(struct xsk_socket_info *xsk); +static int xsk_configure_all(struct netdev *netdev); +static void xsk_destroy_all(struct netdev *netdev); + +static struct xsk_umem_info * +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode) +{ + struct xsk_umem_config uconfig OVS_UNUSED; + struct xsk_umem_info *umem; + int ret; + int i; + + umem = xcalloc(1, sizeof *umem); + ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, + NULL); + if (ret) { + VLOG_ERR("xsk_umem__create failed (%s) mode: %s", + ovs_strerror(errno), + xdpmode == XDP_COPY ? 
"SKB": "DRV"); + free(umem); + return NULL; + } + + umem->buffer = buffer; + + /* set-up umem pool */ + if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) { + VLOG_ERR("umem_pool_init failed"); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct umem_elem *elem; + + elem = ALIGNED_CAST(struct umem_elem *, + (char *)umem->buffer + i * FRAME_SIZE); + umem_elem_push(&umem->mpool, elem); + } + + /* set-up metadata */ + if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) { + VLOG_ERR("xpacket_pool_init failed"); + umem_pool_cleanup(&umem->mpool); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + VLOG_DBG("%s xpacket pool from %p to %p", __func__, + umem->xpool.array, + (char *)umem->xpool.array + + NUM_FRAMES * sizeof(struct dp_packet_afxdp)); + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + xpacket = UMEM2XPKT(umem->xpool.array, i); + xpacket->mpool = &umem->mpool; + + packet = &xpacket->packet; + packet->source = DPBUF_AFXDP; + } + + return umem; +} + +static struct xsk_socket_info * +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex, + uint32_t queue_id, int xdpmode) +{ + struct xsk_socket_config cfg; + struct xsk_socket_info *xsk; + char devname[IF_NAMESIZE]; + uint32_t idx = 0; + int ret; + int i; + + xsk = xcalloc(1, sizeof(*xsk)); + xsk->umem = umem; + cfg.rx_size = CONS_NUM_DESCS; + cfg.tx_size = PROD_NUM_DESCS; + cfg.libbpf_flags = 0; + + if (xdpmode == XDP_ZEROCOPY) { + cfg.bind_flags = XDP_ZEROCOPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + } else { + cfg.bind_flags = XDP_COPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + } + + if (if_indextoname(ifindex, devname) == NULL) { + VLOG_ERR("ifindex %d to devname failed (%s)", + ifindex, 
ovs_strerror(errno)); + free(xsk); + return NULL; + } + + ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem, + &xsk->rx, &xsk->tx, &cfg); + if (ret) { + VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d", + ovs_strerror(errno), + xdpmode == XDP_COPY ? "SKB": "DRV", + queue_id); + free(xsk); + return NULL; + } + + /* Make sure the built-in AF_XDP program is loaded */ + ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags); + if (ret) { + VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno)); + xsk_socket__delete(xsk->xsk); + free(xsk); + return NULL; + } + + /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */ + while (!xsk_ring_prod__reserve(&xsk->umem->fq, + PROD_NUM_DESCS - BATCH_SIZE, &idx)) { + VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue"); + } + + for (i = 0; + i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE; + i += FRAME_SIZE) { + struct umem_elem *elem; + uint64_t addr; + + elem = umem_elem_pop(&xsk->umem->mpool); + addr = UMEM2DESC(elem, xsk->umem->buffer); + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, + PROD_NUM_DESCS - BATCH_SIZE); + return xsk; +} + +static struct xsk_socket_info * +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode) +{ + struct xsk_socket_info *xsk; + struct xsk_umem_info *umem; + void *bufs; + + /* umem memory region */ + bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE); + memset(bufs, 0, NUM_FRAMES * FRAME_SIZE); + + /* create AF_XDP socket */ + umem = xsk_configure_umem(bufs, + NUM_FRAMES * FRAME_SIZE, + xdpmode); + if (!umem) { + free_pagealign(bufs); + return NULL; + } + + xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode); + if (!xsk) { + /* clean up umem and xpacket pool */ + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free_pagealign(bufs); + umem_pool_cleanup(&umem->mpool); + xpacket_pool_cleanup(&umem->xpool); + free(umem); + } + return xsk; 
+} + +static int +xsk_configure_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk; + int i, ifindex, n_rxq; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + n_rxq = netdev_n_rxq(netdev); + dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *)); + + /* configure each queue */ + for (i = 0; i < n_rxq; i++) { + VLOG_INFO("%s configure queue %d mode %s", __func__, i, + dev->xdpmode == XDP_COPY ? "SKB" : "DRV"); + xsk = xsk_configure(ifindex, i, dev->xdpmode); + if (!xsk) { + VLOG_ERR("failed to create AF_XDP socket on queue %d", i); + dev->xsks[i] = NULL; + goto err; + } + dev->xsks[i] = xsk; + xsk->rx_dropped = 0; + xsk->tx_dropped = 0; + } + + return 0; + +err: + xsk_destroy_all(netdev); + return EINVAL; +} + +static void +xsk_destroy(struct xsk_socket_info *xsk) +{ + struct xsk_umem *umem; + + umem = xsk->umem->umem; + xsk_socket__delete(xsk->xsk); + if (xsk_umem__delete(umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + + /* free the packet buffer */ + free_pagealign(xsk->umem->buffer); + + /* cleanup umem pool */ + umem_pool_cleanup(&xsk->umem->mpool); + + /* cleanup metadata pool */ + xpacket_pool_cleanup(&xsk->umem->xpool); + + free(xsk->umem); + free(xsk); +} + +static void +xsk_destroy_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int i, ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + for (i = 0; i < netdev_n_rxq(netdev); i++) { + if (dev->xsks && dev->xsks[i]) { + VLOG_INFO("destroy xsk[%d]", i); + xsk_destroy(dev->xsks[i]); + dev->xsks[i] = NULL; + } + } + + VLOG_INFO("remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); + + free(dev->xsks); +} + +static inline void OVS_UNUSED +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) { + struct xdp_statistics stat; + socklen_t optlen; + + optlen = sizeof stat; + ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS, + &stat, 
&optlen) == 0); + + VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu", + stat.rx_dropped, + stat.rx_invalid_descs, + stat.tx_invalid_descs); +} + +int +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp OVS_UNUSED) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + const char *str_xdpmode; + int xdpmode, new_n_rxq; + + ovs_mutex_lock(&dev->mutex); + new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1); + if (new_n_rxq > MAX_XSKQ) { + ovs_mutex_unlock(&dev->mutex); + VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).", + netdev_get_name(netdev), new_n_rxq, MAX_XSKQ); + return EINVAL; + } + + str_xdpmode = smap_get_def(args, "xdpmode", "skb"); + if (!strcasecmp(str_xdpmode, "drv")) { + xdpmode = XDP_ZEROCOPY; + } else if (!strcasecmp(str_xdpmode, "skb")) { + xdpmode = XDP_COPY; + } else { + VLOG_ERR("%s: Incorrect xdpmode (%s).", + netdev_get_name(netdev), str_xdpmode); + ovs_mutex_unlock(&dev->mutex); + return EINVAL; + } + + if (dev->requested_n_rxq != new_n_rxq + || dev->requested_xdpmode != xdpmode) { + dev->requested_n_rxq = new_n_rxq; + dev->requested_xdpmode = xdpmode; + netdev_request_reconfigure(netdev); + } + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +int +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + + ovs_mutex_lock(&dev->mutex); + smap_add_format(args, "n_rxq", "%d", netdev->n_rxq); + smap_add_format(args, "xdpmode", "%s", + dev->xdp_bind_flags == XDP_ZEROCOPY ? 
"drv" : "skb"); + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +static void +netdev_afxdp_alloc_txq(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int n_txqs = netdev_n_rxq(netdev); + int i; + + dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock)); + + for (i = 0; i < n_txqs; i++) { + ovs_spinlock_init(&dev->tx_locks[i]); + } +} + +int +netdev_afxdp_reconfigure(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + int err = 0; + + ovs_mutex_lock(&dev->mutex); + + if (netdev->n_rxq == dev->requested_n_rxq + && dev->xdpmode == dev->requested_xdpmode) { + goto out; + } + + xsk_destroy_all(netdev); + free(dev->tx_locks); + + netdev->n_rxq = dev->requested_n_rxq; + netdev_afxdp_alloc_txq(netdev); + + if (dev->requested_xdpmode == XDP_ZEROCOPY) { + VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev)); + /* From SKB mode to DRV mode */ + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + dev->xdp_bind_flags = XDP_ZEROCOPY; + dev->xdpmode = XDP_ZEROCOPY; + + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s", + ovs_strerror(errno)); + } + } else { + VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev)); + /* From DRV mode to SKB mode */ + dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + dev->xdp_bind_flags = XDP_COPY; + dev->xdpmode = XDP_COPY; + /* TODO: set rlimit back to previous value + * when no device is in DRV mode. + */ + } + + err = xsk_configure_all(netdev); + if (err) { + VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev)); + } + netdev_change_seq_changed(netdev); +out: + ovs_mutex_unlock(&dev->mutex); + return err; +} + +int +netdev_afxdp_get_numa_id(const struct netdev *netdev) +{ + /* FIXME: Get netdev's PCIe device ID, then find + * its NUMA node id. 
+ */ + VLOG_INFO("FIXME: Device %s always use numa id 0", + netdev_get_name(netdev)); + return 0; +} + +static void +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode) +{ + uint32_t curr_prog_id = 0; + uint32_t flags; + + /* remove_xdp_program() */ + if (xdpmode == XDP_COPY) { + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + } else { + flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + } + + if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) { + bpf_set_link_xdp_fd(ifindex, -1, flags); + } + if (prog_id == curr_prog_id) { + bpf_set_link_xdp_fd(ifindex, -1, flags); + } else if (!curr_prog_id) { + VLOG_INFO("couldn't find a prog id on a given interface"); + } else { + VLOG_INFO("program on interface changed, not removing"); + } +} + +void +signal_remove_xdp(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + VLOG_WARN("force remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); +} + +static struct dp_packet_afxdp * +dp_packet_cast_afxdp(const struct dp_packet *d) +{ + ovs_assert(d->source == DPBUF_AFXDP); + return CONTAINER_OF(d, struct dp_packet_afxdp, packet); +} + +void +free_afxdp_buf(struct dp_packet *p) +{ + struct dp_packet_afxdp *xpacket; + uintptr_t addr; + + xpacket = dp_packet_cast_afxdp(p); + if (xpacket->mpool) { + void *base = dp_packet_base(p); + + addr = (uintptr_t)base & (~FRAME_SHIFT_MASK); + umem_elem_push(xpacket->mpool, (void *)addr); + } +} + +static void +free_afxdp_buf_batch(struct dp_packet_batch *batch) +{ + struct dp_packet_afxdp *xpacket = NULL; + struct dp_packet *packet; + void *elems[BATCH_SIZE]; + uintptr_t addr; + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + xpacket = dp_packet_cast_afxdp(packet); + if (xpacket->mpool) { + void *base = dp_packet_base(packet); + + addr = (uintptr_t)base & (~FRAME_SHIFT_MASK); + elems[i] = (void *)addr; + } + } + umem_elem_push_n(xpacket->mpool, 
batch->count, elems); + dp_packet_batch_init(batch); +} + +static inline void +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx) +{ + void *elems[BATCH_SIZE]; + int i; + + for (i = 0; i < rcvd; i++) { + uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx + i)->addr; + char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr); + + elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK)); + } + umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); + + xsk_ring_cons__release(&xsk->rx, rcvd); + xsk->rx_dropped += rcvd; +} + +int +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, + int *qfill) +{ + struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); + struct netdev *netdev = rx->up.netdev; + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct umem_elem *elems[BATCH_SIZE]; + uint32_t idx_rx = 0, idx_fq = 0; + struct xsk_socket_info *xsk; + int qid = rxq_->queue_id; + unsigned int rcvd, i; + int ret = 0; + + xsk = dev->xsks[qid]; + if (!xsk) { + return 0; + } + + rx->fd = xsk_socket__fd(xsk->xsk); + + /* See if there is any packet on RX queue, + * if yes, idx_rx is the index having the packet. + */ + rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); + if (!rcvd) { + return 0; + } + + ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems); + if (OVS_UNLIKELY(ret)) { + handle_rx_fail(xsk, rcvd, idx_rx); + return ENOMEM; + } + + /* Prepare for the FILL queue */ + if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) { + /* The FILL queue is full, don't retry or process rx. Wait for kernel + * to move received packets from FILL queue to RX queue.
+ */ + umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems); + handle_rx_fail(xsk, rcvd, idx_rx); + return ENOMEM; + } + + /* Setup a dp_packet batch from descriptors in RX queue */ + for (i = 0; i < rcvd; i++) { + uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; + uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len; + char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr); + uint64_t index; + + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + index = addr >> FRAME_SHIFT; + xpacket = UMEM2XPKT(xsk->umem->xpool.array, index); + packet = &xpacket->packet; + + /* Initialize the struct dp_packet */ + dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM); + dp_packet_set_size(packet, len); + + /* Add packet into batch, increase batch->count */ + dp_packet_batch_add(batch, packet); + + idx_rx++; + } + /* Release the RX queue */ + xsk_ring_cons__release(&xsk->rx, rcvd); + + for (i = 0; i < rcvd; i++) { + uint64_t index; + struct umem_elem *elem; + + /* Get one free umem, program it into FILL queue */ + elem = elems[i]; + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); + ovs_assert((index & FRAME_SHIFT_MASK) == 0); + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index; + + idx_fq++; + } + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); + + if (qfill) { + /* TODO: return the number of remaining packets in the queue. */ + *qfill = 0; + } + +#ifdef AFXDP_DEBUG + log_xsk_stat(xsk); +#endif + return 0; +} + +static inline int +kick_tx(struct xsk_socket_info *xsk) +{ + int ret; + + /* This causes system call into kernel's xsk_sendmsg, and + * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode). 
+ */ + ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); + if (OVS_UNLIKELY(ret < 0)) { + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { + return errno; + } + } + /* no error, or EBUSY or EAGAIN */ + return 0; +} + +static inline bool +check_free_batch(struct dp_packet_batch *batch) +{ + struct umem_pool *first_mpool = NULL; + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + if (packet->source != DPBUF_AFXDP) { + return false; + } + xpacket = dp_packet_cast_afxdp(packet); + if (i == 0) { + first_mpool = xpacket->mpool; + continue; + } + if (xpacket->mpool != first_mpool) { + return false; + } + } + /* All packets are DPBUF_AFXDP and from the same mpool */ + return true; +} + +static inline void +afxdp_complete_tx(struct xsk_socket_info *xsk) +{ + struct umem_elem *elems_push[BATCH_SIZE]; + uint32_t idx_cq = 0; + int tx_done, j, ret; + + if (!xsk->outstanding_tx) { + return; + } + + ret = kick_tx(xsk); + if (OVS_UNLIKELY(ret)) { + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", + ovs_strerror(ret)); + } + + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq); + if (tx_done > 0) { + xsk_ring_cons__release(&xsk->umem->cq, tx_done); + xsk->outstanding_tx -= tx_done; + } + + /* Recycle back to umem pool */ + for (j = 0; j < tx_done; j++) { + struct umem_elem *elem; + uint64_t addr; + + addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++); + elem = ALIGNED_CAST(struct umem_elem *, + (char *)xsk->umem->buffer + addr); + elems_push[j] = elem; + } + + umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push); +} + +int +netdev_afxdp_batch_send(struct netdev *netdev, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk = dev->xsks[qid]; + struct umem_elem *elems_pop[BATCH_SIZE]; + struct dp_packet *packet; + bool free_batch = true; + 
uint32_t idx = 0; + int error = 0; + int ret; + + if (!xsk) { + goto out; + } + + if (OVS_UNLIKELY(concurrent_txq)) { + qid = qid % dev->up.n_txq; + ovs_spin_lock(&dev->tx_locks[qid]); + } + + /* Process CQ first. */ + afxdp_complete_tx(xsk); + + free_batch = check_free_batch(batch); + + ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); + if (OVS_UNLIKELY(ret)) { + xsk->tx_dropped += batch->count; + error = ENOMEM; + goto out; + } + + /* Make sure we have enough TX descs */ + ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx); + if (OVS_UNLIKELY(ret == 0)) { + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop); + xsk->tx_dropped += batch->count; + error = ENOMEM; + goto out; + } + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + struct umem_elem *elem; + uint64_t index; + + elem = elems_pop[i]; + /* Copy the packet to the umem we just pop from umem pool. + * TODO: avoid this copy if the packet and the pop umem + * are located in the same umem. 
+ */ + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet)); + + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index; + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len + = dp_packet_size(packet); + } + xsk_ring_prod__submit(&xsk->tx, batch->count); + xsk->outstanding_tx += batch->count; + + ret = kick_tx(xsk); + if (OVS_UNLIKELY(ret)) { + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", + ovs_strerror(ret)); + } + +out: + if (free_batch) { + free_afxdp_buf_batch(batch); + } else { + dp_packet_delete_batch(batch, true); + } + + if (OVS_UNLIKELY(concurrent_txq)) { + ovs_spin_unlock(&dev->tx_locks[qid]); + } + return error; +} + +int +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED) +{ + /* Done at reconfigure */ + return 0; +} + +void +netdev_afxdp_destruct(struct netdev *netdev_) +{ + struct netdev_linux *netdev = netdev_linux_cast(netdev_); + + /* Note: tc is by-passed when using drv-mode, but when using + * skb-mode, we might need to clean up tc. 
*/ + + xsk_destroy_all(netdev_); + ovs_mutex_destroy(&netdev->mutex); +} + +int +netdev_afxdp_get_stats(const struct netdev *netdev, + struct netdev_stats *stats) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct netdev_stats dev_stats; + struct xsk_socket_info *xsk; + int error, i; + + ovs_mutex_lock(&dev->mutex); + + error = get_stats_via_netlink(netdev, &dev_stats); + if (error) { + VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics"); + } else { + /* Use kernel netdev's packet and byte counts */ + stats->rx_packets = dev_stats.rx_packets; + stats->rx_bytes = dev_stats.rx_bytes; + stats->tx_packets = dev_stats.tx_packets; + stats->tx_bytes = dev_stats.tx_bytes; + + stats->rx_errors += dev_stats.rx_errors; + stats->tx_errors += dev_stats.tx_errors; + stats->rx_dropped += dev_stats.rx_dropped; + stats->tx_dropped += dev_stats.tx_dropped; + stats->multicast += dev_stats.multicast; + stats->collisions += dev_stats.collisions; + stats->rx_length_errors += dev_stats.rx_length_errors; + stats->rx_over_errors += dev_stats.rx_over_errors; + stats->rx_crc_errors += dev_stats.rx_crc_errors; + stats->rx_frame_errors += dev_stats.rx_frame_errors; + stats->rx_fifo_errors += dev_stats.rx_fifo_errors; + stats->rx_missed_errors += dev_stats.rx_missed_errors; + stats->tx_aborted_errors += dev_stats.tx_aborted_errors; + stats->tx_carrier_errors += dev_stats.tx_carrier_errors; + stats->tx_fifo_errors += dev_stats.tx_fifo_errors; + stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors; + stats->tx_window_errors += dev_stats.tx_window_errors; + + /* Account the dropped in each xsk */ + for (i = 0; i < netdev_n_rxq(netdev); i++) { + xsk = dev->xsks[i]; + if (xsk) { + stats->rx_dropped += xsk->rx_dropped; + stats->tx_dropped += xsk->tx_dropped; + } + } + } + ovs_mutex_unlock(&dev->mutex); + + return error; +} diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h new file mode 100644 index 000000000000..dd2dc1a2064d --- /dev/null +++ b/lib/netdev-afxdp.h @@ -0,0 
+1,74 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_AFXDP_H +#define NETDEV_AFXDP_H 1 + +#include + +#ifdef HAVE_AF_XDP + +#include +#include + +/* These functions are Linux AF_XDP specific, so they should be used directly + * only by Linux-specific code. */ + +#define MAX_XSKQ 16 + +struct netdev; +struct xsk_socket_info; +struct xdp_umem; +struct dp_packet_batch; +struct smap; +struct dp_packet; +struct netdev_rxq; +struct netdev_stats; + +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_); +void netdev_afxdp_destruct(struct netdev *netdev_); + +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, + struct dp_packet_batch *batch, + int *qfill); +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq); +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp); +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args); +int netdev_afxdp_get_numa_id(const struct netdev *netdev); +int netdev_afxdp_get_stats(const struct netdev *netdev_, + struct netdev_stats *stats); + +void free_afxdp_buf(struct dp_packet *p); +int netdev_afxdp_reconfigure(struct netdev *netdev); +void signal_remove_xdp(struct netdev *netdev); + +#else /* !HAVE_AF_XDP */ + +#include "openvswitch/compiler.h" + +struct dp_packet; + +static inline void +free_afxdp_buf(struct dp_packet *p OVS_UNUSED) +{ + 
/* Nothing */ +} + +#endif /* HAVE_AF_XDP */ +#endif /* netdev-afxdp.h */ diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h new file mode 100644 index 000000000000..6a0388cf9dc3 --- /dev/null +++ b/lib/netdev-linux-private.h @@ -0,0 +1,139 @@ +/* + * Copyright (c) 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_LINUX_PRIVATE_H +#define NETDEV_LINUX_PRIVATE_H 1 + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "netdev-afxdp.h" +#include "netdev-provider.h" +#include "netdev-tc-offloads.h" +#include "netdev-vport.h" +#include "openvswitch/thread.h" +#include "ovs-atomic.h" +#include "timer.h" +#include "xdpsock.h" + +/* These functions are Linux specific, so they should be used directly only by + * Linux-specific code. */ + +struct netdev; + +struct netdev_rxq_linux { + struct netdev_rxq up; + bool is_tap; + int fd; +}; + +void netdev_linux_run(const struct netdev_class *); + +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, + const char *flag_name, bool enable); + +int get_stats_via_netlink(const struct netdev *netdev_, + struct netdev_stats *stats); + +struct netdev_linux { + struct netdev up; + + /* Protects all members below. */ + struct ovs_mutex mutex; + + unsigned int cache_valid; + + bool miimon; /* Link status of last poll. */ + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. 
*/ + struct timer miimon_timer; + + int netnsid; /* Network namespace ID. */ + /* The following are figured out "on demand" only. They are only valid + * when the corresponding VALID_* bit in 'cache_valid' is set. */ + int ifindex; + struct eth_addr etheraddr; + int mtu; + unsigned int ifi_flags; + long long int carrier_resets; + uint32_t kbits_rate; /* Policing data. */ + uint32_t kbits_burst; + int vport_stats_error; /* Cached error code from vport_get_stats(). + 0 or an errno value. */ + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU + * or SIOCSIFMTU. + */ + int ether_addr_error; /* Cached error code from set/get etheraddr. */ + int netdev_policing_error; /* Cached error code from set policing. */ + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ + + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ + + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ + struct tc *tc; + + /* For devices of class netdev_tap_class only. */ + int tap_fd; + bool present; /* If the device is present in the namespace */ + uint64_t tx_dropped; /* tap device can drop if the iface is down */ + + /* LAG information. */ + bool is_lag_master; /* True if the netdev is a LAG master. 
*/ + + /* AF_XDP information */ +#ifdef HAVE_AF_XDP + struct xsk_socket_info **xsks; + int requested_n_rxq; + int xdpmode, requested_xdpmode; /* detect mode changed */ + int xdp_flags, xdp_bind_flags; + struct ovs_spinlock *tx_locks; +#endif +}; + +static bool +is_netdev_linux_class(const struct netdev_class *netdev_class) +{ + return netdev_class->run == netdev_linux_run; +} + +static struct netdev_linux * +netdev_linux_cast(const struct netdev *netdev) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); + + return CONTAINER_OF(netdev, struct netdev_linux, up); +} + +static struct netdev_rxq_linux * +netdev_rxq_linux_cast(const struct netdev_rxq *rx) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); + + return CONTAINER_OF(rx, struct netdev_rxq_linux, up); +} + +#endif /* netdev-linux-private.h */ diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index f75d73fd39f8..2883cf1f2586 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -17,6 +17,7 @@ #include #include "netdev-linux.h" +#include "netdev-linux-private.h" #include #include @@ -54,6 +55,7 @@ #include "fatal-signal.h" #include "hash.h" #include "openvswitch/hmap.h" +#include "netdev-afxdp.h" #include "netdev-provider.h" #include "netdev-tc-offloads.h" #include "netdev-vport.h" @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu); static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu); static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes); -struct netdev_linux { - struct netdev up; - - /* Protects all members below. */ - struct ovs_mutex mutex; - - unsigned int cache_valid; - - bool miimon; /* Link status of last poll. */ - long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ - struct timer miimon_timer; - - int netnsid; /* Network namespace ID. */ - /* The following are figured out "on demand" only. They are only valid - * when the corresponding VALID_* bit in 'cache_valid' is set. 
*/ - int ifindex; - struct eth_addr etheraddr; - int mtu; - unsigned int ifi_flags; - long long int carrier_resets; - uint32_t kbits_rate; /* Policing data. */ - uint32_t kbits_burst; - int vport_stats_error; /* Cached error code from vport_get_stats(). - 0 or an errno value. */ - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */ - int ether_addr_error; /* Cached error code from set/get etheraddr. */ - int netdev_policing_error; /* Cached error code from set policing. */ - int get_features_error; /* Cached error code from ETHTOOL_GSET. */ - int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ - - enum netdev_features current; /* Cached from ETHTOOL_GSET. */ - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ - - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ - struct tc *tc; - - /* For devices of class netdev_tap_class only. */ - int tap_fd; - bool present; /* If the device is present in the namespace */ - uint64_t tx_dropped; /* tap device can drop if the iface is down */ - - /* LAG information. */ - bool is_lag_master; /* True if the netdev is a LAG master. */ -}; - -struct netdev_rxq_linux { - struct netdev_rxq up; - bool is_tap; - int fd; -}; /* This is set pretty low because we probably won't learn anything from the * additional log messages. */ @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); * changes in the device miimon status, so we can use atomic_count. 
*/ static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); -static void netdev_linux_run(const struct netdev_class *); - static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *, int cmd, const char *cmd_name); static int get_flags(const struct netdev *, unsigned int *flags); @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev, struct in_addr addr); static int get_etheraddr(const char *netdev_name, struct eth_addr *ea); static int set_etheraddr(const char *netdev_name, const struct eth_addr); -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *); static int af_packet_sock(void); static bool netdev_linux_miimon_enabled(void); static void netdev_linux_miimon_run(void); @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void); static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup); static bool -is_netdev_linux_class(const struct netdev_class *netdev_class) -{ - return netdev_class->run == netdev_linux_run; -} - -static bool is_tap_netdev(const struct netdev *netdev) { return netdev_get_class(netdev) == &netdev_tap_class; } - -static struct netdev_linux * -netdev_linux_cast(const struct netdev *netdev) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); - - return CONTAINER_OF(netdev, struct netdev_linux, up); -} - -static struct netdev_rxq_linux * -netdev_rxq_linux_cast(const struct netdev_rxq *rx) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); - return CONTAINER_OF(rx, struct netdev_rxq_linux, up); -} static int netdev_linux_netnsid_update__(struct netdev_linux *netdev) @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change) } } -static void +void netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED) { struct nl_sock *sock; @@ -3279,9 +3206,7 @@ exit: .run = netdev_linux_run, \ .wait = netdev_linux_wait, \ .alloc = netdev_linux_alloc, \ - .destruct = netdev_linux_destruct, \ .dealloc = netdev_linux_dealloc, \ - 
.send = netdev_linux_send, \ .send_wait = netdev_linux_send_wait, \ .set_etheraddr = netdev_linux_set_etheraddr, \ .get_etheraddr = netdev_linux_get_etheraddr, \ @@ -3312,10 +3237,8 @@ exit: .arp_lookup = netdev_linux_arp_lookup, \ .update_flags = netdev_linux_update_flags, \ .rxq_alloc = netdev_linux_rxq_alloc, \ - .rxq_construct = netdev_linux_rxq_construct, \ .rxq_destruct = netdev_linux_rxq_destruct, \ .rxq_dealloc = netdev_linux_rxq_dealloc, \ - .rxq_recv = netdev_linux_rxq_recv, \ .rxq_wait = netdev_linux_rxq_wait, \ .rxq_drain = netdev_linux_rxq_drain @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = { NETDEV_LINUX_CLASS_COMMON, LINUX_FLOW_OFFLOAD_API, .type = "system", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_linux_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, - .get_block_id = netdev_linux_get_block_id + .get_block_id = netdev_linux_get_block_id, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_tap_class = { NETDEV_LINUX_CLASS_COMMON, .type = "tap", + .is_pmd = false, .construct = netdev_linux_construct_tap, + .destruct = netdev_linux_destruct, .get_stats = netdev_tap_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_internal_class = { NETDEV_LINUX_CLASS_COMMON, LINUX_FLOW_OFFLOAD_API, .type = "internal", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_internal_get_stats, .get_status = netdev_internal_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; + +#ifdef HAVE_AF_XDP +const struct 
netdev_class netdev_afxdp_class = { + NETDEV_LINUX_CLASS_COMMON, + .type = "afxdp", + .is_pmd = true, + .construct = netdev_linux_construct, + .destruct = netdev_afxdp_destruct, + .get_stats = netdev_afxdp_get_stats, + .get_status = netdev_linux_get_status, + .set_config = netdev_afxdp_set_config, + .get_config = netdev_afxdp_get_config, + .reconfigure = netdev_afxdp_reconfigure, + .get_numa_id = netdev_afxdp_get_numa_id, + .send = netdev_afxdp_batch_send, + .rxq_construct = netdev_afxdp_rxq_construct, + .rxq_recv = netdev_afxdp_rxq_recv, +}; +#endif #define CODEL_N_QUEUES 0x0000 @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst, dst->tx_window_errors = src->tx_window_errors; } -static int +int get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats) { struct ofpbuf request; diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h index fb0c27e6e8e8..91e6a9e2bfc0 100644 --- a/lib/netdev-provider.h +++ b/lib/netdev-provider.h @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class; extern const struct netdev_class netdev_internal_class; extern const struct netdev_class netdev_tap_class; +#ifdef HAVE_AF_XDP +extern const struct netdev_class netdev_afxdp_class; +#endif #ifdef __cplusplus } #endif diff --git a/lib/netdev.c b/lib/netdev.c index 7d7ecf6f0946..0fac117cc602 100644 --- a/lib/netdev.c +++ b/lib/netdev.c @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); static void restore_all_flags(void *aux OVS_UNUSED); void update_device_args(struct netdev *, const struct shash *args); +#ifdef HAVE_AF_XDP +void signal_remove_xdp(struct netdev *netdev); +#endif int netdev_n_txq(const struct netdev *netdev) @@ -146,6 +149,9 @@ netdev_initialize(void) netdev_register_provider(&netdev_internal_class); netdev_register_provider(&netdev_tap_class); netdev_vport_tunnel_register(); +#ifdef HAVE_AF_XDP + netdev_register_provider(&netdev_afxdp_class); +#endif #endif #if 
defined(__FreeBSD__) || defined(__NetBSD__) netdev_register_provider(&netdev_tap_class); @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED) saved_flags & ~saved_values, &old_flags); } +#ifdef HAVE_AF_XDP + if (netdev->netdev_class == &netdev_afxdp_class) { + signal_remove_xdp(netdev); + } +#endif } } diff --git a/lib/spinlock.h b/lib/spinlock.h new file mode 100644 index 000000000000..1ae634f23a6b --- /dev/null +++ b/lib/spinlock.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#ifndef SPINLOCK_H +#define SPINLOCK_H 1 + +#include + +#include +#include +#include +#include +#include +#include + +#include "ovs-atomic.h" + +struct ovs_spinlock { + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked; +}; + +static inline void +ovs_spinlock_init(struct ovs_spinlock *sl) +{ + atomic_init(&sl->locked, 0); +} + +static inline void +ovs_spin_lock(struct ovs_spinlock *sl) +{ + int exp = 0, locked = 0; + + while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, + memory_order_acquire, + memory_order_relaxed)) { + locked = 1; + while (locked) { + atomic_read_relaxed(&sl->locked, &locked); + } + exp = 0; + } +} + +static inline void +ovs_spin_unlock(struct ovs_spinlock *sl) +{ + atomic_store_explicit(&sl->locked, 0, memory_order_release); +} + +static inline int +ovs_spin_trylock(struct ovs_spinlock *sl) +{ + int exp = 0; + return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, + memory_order_acquire, + memory_order_relaxed); +} +#endif diff --git a/lib/util.c b/lib/util.c index 7b8ab81f6ee1..5eb20995b370 100644 --- a/lib/util.c +++ b/lib/util.c @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s) return xrealloc(p, *n * s); } -/* Allocates and returns 'size' bytes of memory aligned to a cache line and in - * dedicated cache lines. That is, the memory block returned will not share a - * cache line with other data, avoiding "false sharing". +/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes. + * 'alignment' must be a power of two and a multiple of sizeof(void *). * - * Use free_cacheline() to free the returned memory block. */ + * Use free_size_align() to free the returned memory block. */ void * -xmalloc_cacheline(size_t size) +xmalloc_size_align(size_t size, size_t alignment) { #ifdef HAVE_POSIX_MEMALIGN void *p; int error; COVERAGE_INC(util_xalloc); - error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1); + error = posix_memalign(&p, alignment, size ? 
size : 1);
     if (error != 0) {
         out_of_memory();
     }
@@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
 #else
     /* Allocate room for:
      *
-     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
-     *       pointer to be aligned exactly sizeof(void *) bytes before the
-     *       beginning of a cache line.
+     *     - Header padding: Up to alignment - 1 bytes, to allow the
+     *       pointer 'q' to be aligned exactly sizeof(void *) bytes before the
+     *       beginning of the alignment.
      *
      *     - Pointer: A pointer to the start of the header padding, to allow us
      *       to free() the block later.
      *
      *     - User data: 'size' bytes.
      *
-     *     - Trailer padding: Enough to bring the user data up to a cache line
+     *     - Trailer padding: Enough to bring the user data up to an alignment
      *       multiple.
      *
      * +---------------+---------+------------------------+---------+
@@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
      * p               q         r
      *
      */
-    void *p = xmalloc((CACHE_LINE_SIZE - 1)
-                      + sizeof(void *)
-                      + ROUND_UP(size, CACHE_LINE_SIZE));
-    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
-    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
-                                CACHE_LINE_SIZE);
-    void **q = (void **) r - 1;
+    void *p, *r, **q;
+    bool runt;
+
+    COVERAGE_INC(util_xalloc);
+    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
+        ovs_abort(0, "Invalid alignment");
+    }
+
+    p = xmalloc((alignment - 1)
+                + sizeof(void *)
+                + ROUND_UP(size, alignment));
+
+    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
+    /* When the padding size < sizeof(void*), we don't have enough room for
+     * pointer 'q'. As a result, need to move 'r' to the next alignment.
+     * So ROUND_UP when xmalloc above, and ROUND_UP again when calculate 'r'
+     * below.
+     */
+    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
+    q = (void **) r - 1;
     *q = p;
+
     return r;
 #endif
 }
 
+void
+free_size_align(void *p)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    free(p);
+#else
+    if (p) {
+        void **q = (void **) p - 1;
+        free(*q);
+    }
+#endif
+}
+
+/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
+ * dedicated cache lines.  That is, the memory block returned will not share a
+ * cache line with other data, avoiding "false sharing".
+ *
+ * Use free_cacheline() to free the returned memory block. */
+void *
+xmalloc_cacheline(size_t size)
+{
+    return xmalloc_size_align(size, CACHE_LINE_SIZE);
+}
+
 /* Like xmalloc_cacheline() but clears the allocated memory to all zero
  * bytes. */
 void *
@@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
 void
 free_cacheline(void *p)
 {
-#ifdef HAVE_POSIX_MEMALIGN
-    free(p);
-#else
-    if (p) {
-        void **q = (void **) p - 1;
-        free(*q);
-    }
-#endif
+    free_size_align(p);
+}
+
+void *
+xmalloc_pagealign(size_t size)
+{
+    return xmalloc_size_align(size, get_page_size());
+}
+
+void
+free_pagealign(void *p)
+{
+    free_size_align(p);
 }
 
 char *
diff --git a/lib/util.h b/lib/util.h
index c26605abdce3..33665748274c 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size);
 
 int string_ends_with(const char *str, const char *suffix);
 
+void *xmalloc_pagealign(size_t) MALLOC_LIKE;
+void free_pagealign(void *);
+void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
+void free_size_align(void *);
+
 /* The C standards say that neither the 'dst' nor 'src' argument to
  * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
  * the null case. */
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..ea39fa557290
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include
+
+#include "xdpsock.h"
+#include "dp-packet.h"
+#include "openvswitch/compiler.h"
+
+/* Note:
+ * umem_elem_push* shouldn't overflow because we always pop
+ * elem first, then push back to the stack.
+ */
+static inline void
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+}
+
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->lock);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ret;
+}
+
+static inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
+        return NULL;
+    }
+
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->lock);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ptr;
+}
+
+static void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(void *));
+    memset(bufs, 0, size * sizeof(void *));
+
+    return (void **)bufs;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        return -ENOMEM;
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->lock);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free_pagealign(umemp->array);
+    umemp->array = NULL;
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free_pagealign(xp->array);
+    xp->array = NULL;
+}
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..1a1093381243
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,101 @@
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include
+
+#ifdef HAVE_AF_XDP
+
+#include
+#include
+#include
+#include
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "spinlock.h"
+
+#define FRAME_HEADROOM XDP_PACKET_HEADROOM
+#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
+ * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
+ */ +#define NUM_FRAMES (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)) + +#define BATCH_SIZE NETDEV_MAX_BURST + +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES)); +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS); +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS)); + +/* LIFO ptr_array */ +struct umem_pool { + int index; /* point to top */ + unsigned int size; + struct ovs_spinlock lock; + void **array; /* a pointer array, point to umem buf */ +}; + +/* array-based dp_packet_afxdp */ +struct xpacket_pool { + unsigned int size; + struct dp_packet_afxdp **array; +}; + +struct xsk_umem_info { + struct umem_pool mpool; + struct xpacket_pool xpool; + struct xsk_ring_prod fq; + struct xsk_ring_cons cq; + struct xsk_umem *umem; + void *buffer; +}; + +struct xsk_socket_info { + struct xsk_ring_cons rx; + struct xsk_ring_prod tx; + struct xsk_umem_info *umem; + struct xsk_socket *xsk; + unsigned long rx_dropped; + unsigned long tx_dropped; + uint32_t outstanding_tx; +}; + +struct umem_elem { + struct umem_elem *next; +}; + +void umem_elem_push(struct umem_pool *umemp, void *addr); +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs); + +void *umem_elem_pop(struct umem_pool *umemp); +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs); + +int umem_pool_init(struct umem_pool *umemp, unsigned int size); +void umem_pool_cleanup(struct umem_pool *umemp); +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size); +void xpacket_pool_cleanup(struct xpacket_pool *xp); + +#endif +#endif diff --git a/tests/automake.mk b/tests/automake.mk index 2956e68b242c..131564bb0bd3 100644 --- a/tests/automake.mk +++ b/tests/automake.mk @@ -4,12 +4,14 @@ EXTRA_DIST += \ $(SYSTEM_TESTSUITE_AT) \ $(SYSTEM_KMOD_TESTSUITE_AT) \ $(SYSTEM_USERSPACE_TESTSUITE_AT) \ + $(SYSTEM_AFXDP_TESTSUITE_AT) \ $(SYSTEM_OFFLOADS_TESTSUITE_AT) \ $(SYSTEM_DPDK_TESTSUITE_AT) \ $(OVSDB_CLUSTER_TESTSUITE_AT) \ $(TESTSUITE) \ $(SYSTEM_KMOD_TESTSUITE) \ 
$(SYSTEM_USERSPACE_TESTSUITE) \ + $(SYSTEM_AFXDP_TESTSUITE) \ $(SYSTEM_OFFLOADS_TESTSUITE) \ $(SYSTEM_DPDK_TESTSUITE) \ $(OVSDB_CLUSTER_TESTSUITE) \ @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \ tests/system-userspace-macros.at \ tests/system-userspace-packet-type-aware.at +SYSTEM_AFXDP_TESTSUITE_AT = \ + tests/system-afxdp-testsuite.at \ + tests/system-afxdp-macros.at + SYSTEM_TESTSUITE_AT = \ tests/system-common-macros.at \ tests/system-ovn.at \ @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite @@ -317,6 +324,11 @@ check-system-userspace: all set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) +check-afxdp: all + $(MAKE) install + set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \ + "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) + check-offloads: all set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at $(AM_V_at)mv $@.tmp $@ +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT) + $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at + $(AM_V_at)mv $@.tmp $@ + $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 
$(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT) $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at $(AM_V_at)mv $@.tmp $@ diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at new file mode 100644 index 000000000000..1e6f7a46b4b7 --- /dev/null +++ b/tests/system-afxdp-macros.at @@ -0,0 +1,20 @@ +# Add port to ovs bridge by using afxdp mode. +# This will use generic XDP support in the veth driver. +m4_define([ADD_VETH], + [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77]) + CONFIGURE_VETH_OFFLOADS([$1]) + AT_CHECK([ip link set $1 netns $2]) + AT_CHECK([ip link set dev ovs-$1 up]) + AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \ + set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"]) + NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7]) + NS_CHECK_EXEC([$2], [ip link set dev $1 up]) + if test -n "$5"; then + NS_CHECK_EXEC([$2], [ip link set dev $1 address $5]) + fi + if test -n "$6"; then + NS_CHECK_EXEC([$2], [ip route add default via $6]) + fi + on_exit 'ip link del ovs-$1' + ] +) diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at new file mode 100644 index 000000000000..9b7a29066614 --- /dev/null +++ b/tests/system-afxdp-testsuite.at @@ -0,0 +1,26 @@ +AT_INIT + +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at: + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License.]) + +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS]) + +m4_include([tests/ovs-macros.at]) +m4_include([tests/ovsdb-macros.at]) +m4_include([tests/ofproto-macros.at]) +m4_include([tests/system-common-macros.at]) +m4_include([tests/system-userspace-macros.at]) +m4_include([tests/system-afxdp-macros.at]) + +m4_include([tests/system-traffic.at]) diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 89c06a1b7877..1e3acbbb8075 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \

+ +

+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled. This mode requires a device driver with
+          AF_XDP support and has the best performance.
+          If "skb", the XDP program uses generic XDP mode in the kernel, with
+          extra data copying between userspace and kernel. No device driver
+          support is needed. This option applies to the afxdp netdev type only.
+          Defaults to "skb" mode.
+

+
+
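A note on the umem_pool in lib/xdpsock.c: it is a fixed-size LIFO stack of frame addresses guarded by the spinlock from lib/spinlock.h. The push/pop discipline can be sketched standalone as below. This is only an illustration under stated assumptions: it uses plain C11 <stdatomic.h> in place of ovs-atomic.h, every sketch_* name and the fixed capacity are hypothetical, and an overflowing push returns an error here instead of aborting as the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ovs_spinlock. */
struct sketch_spinlock {
    atomic_int locked;
};

static void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    int expected = 0;

    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Spin on relaxed loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
        }
        expected = 0;
    }
}

static void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}

#define SKETCH_POOL_CAP 8   /* Assumed capacity, for illustration only. */

struct sketch_pool {
    int index;                      /* Count of stored entries (top). */
    int size;                       /* Capacity of 'array'. */
    struct sketch_spinlock lock;
    void *array[SKETCH_POOL_CAP];
};

static void
sketch_pool_init(struct sketch_pool *pool)
{
    pool->index = 0;
    pool->size = SKETCH_POOL_CAP;
    atomic_init(&pool->lock.locked, 0);
    memset(pool->array, 0, sizeof pool->array);
}

/* Pushes 'n' addresses.  Returns 0, or -1 if the pool would overflow. */
static int
sketch_pool_push_n(struct sketch_pool *pool, int n, void **addrs)
{
    int ret = -1;

    sketch_spin_lock(&pool->lock);
    if (pool->index + n <= pool->size) {
        memcpy(&pool->array[pool->index], addrs, n * sizeof(void *));
        pool->index += n;
        ret = 0;
    }
    sketch_spin_unlock(&pool->lock);
    return ret;
}

/* Pops the top 'n' addresses.  Returns 0, or -1 if fewer than 'n' remain. */
static int
sketch_pool_pop_n(struct sketch_pool *pool, int n, void **addrs)
{
    int ret = -1;

    sketch_spin_lock(&pool->lock);
    if (pool->index - n >= 0) {
        pool->index -= n;
        memcpy(addrs, &pool->array[pool->index], n * sizeof(void *));
        ret = 0;
    }
    sketch_spin_unlock(&pool->lock);
    return ret;
}
```

Popping two of three pushed addresses returns the two most recently pushed ones in their stored order, which mirrors how umem_elem_pop_n() copies from the top of the array.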
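A note on the lib/util.c change: when posix_memalign() is unavailable, xmalloc_size_align() stores the pointer returned by the underlying allocator just below the aligned block, so that free_size_align() can recover it. The idea can be sketched as follows. This is a simplified variant, not the patch's code: it assumes plain malloc()/free(), the sketch_* names are hypothetical, it always reserves room for the stored pointer instead of using the 'runt' check, and it omits the trailer padding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', or NULL on failure.
 * 'alignment' must be a power of two and a multiple of sizeof(void *). */
static void *
sketch_aligned_alloc(size_t size, size_t alignment)
{
    if (alignment == 0
        || (alignment & (alignment - 1)) != 0
        || alignment % sizeof(void *) != 0) {
        return NULL;
    }

    /* Worst case: alignment - 1 bytes of padding plus one stored pointer. */
    void *p = malloc((alignment - 1) + sizeof(void *) + size);
    if (!p) {
        return NULL;
    }

    /* Skip past the slot for the stored pointer, then round up. */
    uintptr_t base = (uintptr_t) p + sizeof(void *);
    uintptr_t r = (base + alignment - 1) & ~(uintptr_t) (alignment - 1);

    /* Store the malloc() result directly below the aligned block. */
    ((void **) r)[-1] = p;
    return (void *) r;
}

static void
sketch_aligned_free(void *r)
{
    if (r) {
        /* Recover the malloc() result and release the whole block. */
        free(((void **) r)[-1]);
    }
}
```

Because the aligned address is a multiple of sizeof(void *), the slot directly below it is itself suitably aligned for a pointer, which is why the alignment check rejects values that are not multiples of sizeof(void *).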