[ovs-dev,PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1557446095-127446-1-git-send-email-u9012063@gmail.com
State Superseded

Commit Message

William Tu May 9, 2019, 11:54 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst.  Note that by default, this is not
compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>

---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improvement)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6:
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6->v7:
- rebase to master, commit abf11558c1515bf3b1
- address feedback from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7->v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 479 +++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  32 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  13 +
 lib/dp-packet.c                       |  33 ++
 lib/dp-packet.h                       |  22 +-
 lib/dpif-netdev-perf.h                |  14 +
 lib/netdev-afxdp.c                    | 727 +++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  53 ++
 lib/netdev-linux-private.h            | 124 +++++
 lib/netdev-linux.c                    | 137 +++--
 lib/netdev-linux.h                    |  14 +
 lib/netdev-provider.h                 |   4 +-
 lib/netdev.c                          |   3 +
 lib/xdpsock.c                         | 239 +++++++++
 lib/xdpsock.h                         | 123 +++++
 tests/automake.mk                     |  17 +
 tests/system-afxdp-macros.at          | 153 ++++++
 tests/system-afxdp-testsuite.at       |  26 +
 tests/system-afxdp-traffic.at         | 978 ++++++++++++++++++++++++++++++++++
 23 files changed, 3137 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
 create mode 100644 tests/system-afxdp-traffic.at

Comments

Ilya Maximets May 13, 2019, 5:48 p.m. UTC | #1
On 10.05.2019 2:54, William Tu wrote:
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>  
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..1222b433dbbb
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,479 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It aims to have comparable
> +performance to DPDK but cooperate better with the existing kernel networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see the Linux kernel's
> +Documentation/networking/af_xdp.rst.
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, e.g., system, tap, or
> +internal.  The AF_XDP feature adds a new netdev type called
> +"afxdp", and implements its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the proposed architecture re-uses the existing
> +userspace dpif-netdev datapath.  As a result, most of
> +the packet processing happens in userspace instead of
> +in the Linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- If possible, run **./xdpsock -r -N -z -i <your device>** under
> +  linux/samples/bpf.  This is an OVS-independent benchmark tool for AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
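The required Kconfig options listed above can also be checked against the running kernel's config file before building. The following is only a sketch: `check_kconfig` is a hypothetical helper (not part of OVS or the kernel tree), and the path `/boot/config-$(uname -r)` is an assumption that holds on many distributions but not all (some expose `/proc/config.gz` instead).

```shell
#!/bin/bash
# check_kconfig FILE OPTION...
# Hypothetical helper, not part of OVS.  Prints one line per option and
# returns non-zero if any option is not set to "y" in FILE.
check_kconfig() {
    local cfg=$1
    shift
    local opt missing=0
    for opt in "$@"; do
        if grep -q "^${opt}=y\$" "$cfg"; then
            echo "ok       $opt"
        else
            echo "missing  $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Assumption: the running kernel's config is installed under /boot.
kcfg="/boot/config-$(uname -r)"
if [ -r "$kcfg" ]; then
    check_kconfig "$kcfg" CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_XDP_SOCKETS ||
        echo "kernel lacks an option required for AF_XDP"
fi
```

This only inspects the build-time configuration; running xdpsock as described above is still the better end-to-end check.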
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +First, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick off end-to-end autotesting::
> +
> +  uname -a # make sure you have a 5.0+ kernel
> +  make check-afxdp
> +
> +If a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure libbpf and libelf are
> +set up correctly::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd --disable-system
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +.. note::
> +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
> +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
> +   and datapath_type=netdev when adding a new bridge.
> +
> +Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
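The pmd-cpu-mask value is a bitmask with bit N set for each core N that should run a PMD thread, so it has to cover every core named in pmd-rxq-affinity. A small sketch for deriving the mask from a list of core IDs follows; `cores_to_mask` is a hypothetical helper written for illustration, not an OVS tool.

```shell
#!/bin/bash
# cores_to_mask CORE...
# Hypothetical helper, not part of OVS.  Prints the hexadecimal CPU mask
# that has one bit set for each core ID given as an argument.
cores_to_mask() {
    local mask=0 core
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '0x%x\n' "$mask"
}

cores_to_mask 4          # one PMD on core 4: prints 0x10
cores_to_mask 1 2 3 4    # four PMDs on cores 1-4: prints 0x1e
```

If a core listed in pmd-rxq-affinity is missing from the mask, the queue cannot be pinned to it, so it is worth checking the two settings against each other.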
> +
> +To validate that the bridge has been successfully instantiated, run::
> +
> +  ovs-vsctl show
> +
> +It should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debug by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> +section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +about AF_XDP current and future work.
> +
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing PMD
> +to keep polling the AF_XDP queues without any interference from the kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPUs by setting isolcpus on the kernel command line in GRUB.
> +
> +#. IRQs should not be assigned to cores running PMD threads.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +Debugging performance issues
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use the Linux perf tool to see where your CPU
> +spends its cycles::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
> +#. No QoS support because AF_XDP netdev bypasses the Linux TC layer. A possible
> +   work-around is to use OpenFlow meter action.
> +#. An AF_XDP device added to a bridge, removed, and added again will fail.
> +#. Most of the tests are done using i40e single port. Multiple ports and
> +   the ixgbe driver also need to be tested.
> +#. No latency test result (TODO items)
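On the first limitation: for PCI-backed devices, Linux exposes the NUMA node through sysfs at `/sys/class/net/<dev>/device/numa_node`, which is one possible source for the netdev's NUMA ID. The sketch below only illustrates reading that file from a shell; `numa_node_of` is a hypothetical helper, and falling back to -1 when the file is absent (for example on veth devices, which have no `device` directory) is an assumption of this sketch, not behaviour defined by the patch.

```shell
#!/bin/bash
# numa_node_of DEV [SYSFS_ROOT]
# Hypothetical helper, not part of OVS.  Prints the NUMA node recorded in
# sysfs for network device DEV, or -1 if sysfs has no entry for it.
# SYSFS_ROOT defaults to /sys/class/net and exists only to ease testing.
numa_node_of() {
    local dev=$1
    local root=${2:-/sys/class/net}
    local f="$root/$dev/device/numa_node"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo -1
    fi
}

numa_node_of enp2s0
```

Note that the kernel itself reports -1 in this file when the platform provides no NUMA information, so callers would need to treat -1 as "unknown" in both cases.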
> +
> +
> +make check-afxdp
> +----------------
> +When executing 'make check-afxdp', OVS creates namespaces, sets up AF_XDP on
> +veth devices and kicks off the tests.  So far we have the following test
> +cases::
> +
> + AF_XDP netdev datapath-sanity
> +
> +  1: datapath - ping between two ports               ok
> +  2: datapath - ping between two ports on vlan       ok
> +  3: datapath - ping6 between two ports              ok
> +  4: datapath - ping6 between two ports on vlan      ok
> +  5: datapath - ping over vxlan tunnel               ok
> +  6: datapath - ping over vxlan6 tunnel              ok
> +  7: datapath - ping over gre tunnel                 ok
> +  8: datapath - ping over erspan v1 tunnel           ok
> +  9: datapath - ping over erspan v2 tunnel           ok
> + 10: datapath - ping over ip6erspan v1 tunnel        ok
> + 11: datapath - ping over ip6erspan v2 tunnel        ok
> + 12: datapath - ping over geneve tunnel              ok
> + 13: datapath - ping over geneve6 tunnel             ok
> + 14: datapath - clone action                         ok
> + 15: datapath - basic truncate action                ok
> +
> + conntrack
> +
> + 16: conntrack - controller                          ok
> + 17: conntrack - force commit                        ok
> + 18: conntrack - ct flush by 5-tuple                 ok
> + 19: conntrack - IPv4 ping                           ok
> + 20: conntrack - get_nconns and get/set_maxconns     ok
> + 21: conntrack - IPv6 ping                           ok
> +
> + system-ovn
> +
> + 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
> + 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
> + 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
> + 25: ovn -- load-balancing                           ok
> + 26: ovn -- load-balancing - same subnet.            ok
> + 27: ovn -- load balancing in gateway router         ok
> + 28: ovn -- multiple gateway routers, load-balancing ok
> + 29: ovn -- load balancing in router with gateway router port ok
> + 30: ovn -- DNAT and SNAT on distributed router - N/S ok
> + 31: ovn -- DNAT and SNAT on distributed router - E/W ok
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +The performance number I got is around 700Kpps.
> +This is due to using the kernel's tap interface, which requires copying
> +packets into the kernel from the umem buffer in userspace.
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
> +
> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>  
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index b532a4579266..5782f7e4bc2e 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -221,6 +221,38 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>  
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 505e3d041e93..29c90b73f836 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index cc5dccf39d6b..686e57f8c472 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-tc-offloads.c \
>  	lib/netdev-tc-offloads.h \
>  	lib/netlink-conntrack.c \
> @@ -409,6 +414,14 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..7d086dc5e860 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -22,6 +22,9 @@
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
>  
>  static void
>  dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source source)
> @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>  
> +#ifdef HAVE_AF_XDP
> +/* Initialize 'b' as an empty dp_packet that contains
> + * memory starting at AF_XDP umem base.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> @@ -122,6 +146,11 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +#ifdef HAVE_AF_XDP
> +            free_afxdp_buf(b);
> +#endif
> +            return;
>          }
>      }
>  }
> @@ -248,6 +277,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>  
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +465,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>  
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..0f533201f956 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,10 @@
>  #include <rte_mbuf.h>
>  #endif
>  
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
> +
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +46,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>  
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +94,13 @@ struct dp_packet {
>      };
>  };
>  
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +134,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>  
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +198,12 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>  
> +#ifdef HAVE_AF_XDP
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +#endif
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..cc91720fad6e 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif HAVE_AF_XDP
> +    /* This is x86-specific instructions. */
> +    union {
> +        uint64_t tsc_64;
> +        struct {
> +            uint32_t lo_32;
> +            uint32_t hi_32;
> +        };
> +    } tsc;
> +    asm volatile("rdtsc" :
> +             "=a" (tsc.lo_32),
> +             "=d" (tsc.hi_32));
> +
> +    return s->last_tsc = tsc.tsc_64;
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..cd1b9ca8be77
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,727 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#if !defined(__i386__) && !defined(__x86_64__)
> +#error AF_XDP supported only for Linux on x86 or x86_64
> +#endif
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <arpa/inet.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/sockios.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <net/if_arp.h>
> +#include <net/route.h>
> +#include <netinet/in.h>
> +#include <netpacket/packet.h>
> +#include <poll.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/utsname.h>
> +#include <unistd.h>
> +
> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netlink.h"
> +#include "dpif-netdev.h"
> +#include "fatal-signal.h"
> +#include "hash.h"
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "netlink-notifier.h"
> +#include "netlink-socket.h"
> +#include "netlink.h"
> +#include "netnsid.h"
> +#include "openflow/openflow.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/hmap.h"
> +#include "openvswitch/ofpbuf.h"
> +#include "openvswitch/poll-loop.h"
> +#include "openvswitch/vlog.h"
> +#include "openvswitch/shash.h"
> +#include "ovs-atomic.h"
> +#include "packets.h"
> +#include "rtnetlink.h"
> +#include "socket-util.h"
> +#include "sset.h"
> +#include "tc.h"
> +#include "timer.h"
> +#include "unaligned.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static uint32_t prog_id;
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +
> +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
> +                                                int xdpmode)
> +{
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof(*umem));
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +
> +    if (ret) {
> +        VLOG_ERR("xsk umem create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    umem_pool_init(&umem->mpool, NUM_FRAMES);
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    xpacket_pool_init(&umem->xpool, NUM_FRAMES);
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket_create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
> +
> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +    int ret;
> +
> +    /* umem memory region */
> +    ret = posix_memalign(&bufs, get_page_size(),
> +                         NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +    ovs_assert(!ret);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        (void)xsk_umem__delete(umem->umem);
> +        free(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk;
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    /* configure each queue */
> +    for (i = 0; i < netdev->n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            goto err;
> +        }
> +        dev->xsk[i] = xsk;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
> +{
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +    ds_put_hex_dump(&ds, buf, count, 0, false);
> +    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
> +    ds_destroy(&ds);
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk)
> +{
> +    struct xsk_umem *umem;
> +
> +    if (!xsk) {
> +        return;
> +    }
> +
> +    umem = xsk->umem->umem;
> +    xsk_socket__delete(xsk->xsk);
> +    (void)xsk_umem__delete(umem);
> +
> +    /* free the packet buffer */
> +    free(xsk->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> +
> +    free(xsk->umem);
> +    free(xsk);
> +}
> +
> +void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < MAX_XSKQ; i++) {
> +        if (dev->xsk[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsk[i]);
> +            dev->xsk[i] = NULL;
> +        }
> +    }
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static inline void OVS_UNUSED
> +print_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *xdpmode;
> +    int new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (new_n_rxq != netdev->n_rxq) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        netdev_request_reconfigure(netdev);
> +    }
> +
> +    xdpmode = smap_get(args, "xdpmode");
> +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
> +        dev->requested_xdpmode = XDP_ZEROCOPY;
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            netdev_request_reconfigure(netdev);
> +        }
> +    } else {
> +        dev->requested_xdpmode = XDP_COPY;
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            netdev_request_reconfigure(netdev);
> +        }
> +    }

The code above will request reconfiguration on every set_config call until
the reconfiguration has actually finished. This could cause multiple
reconfigurations in a row for the same configuration change. A better version
could look like this:

    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
    if (new_n_rxq > MAX_XSKQ) {
        ovs_mutex_unlock(&dev->mutex);
        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
        return EINVAL;
    }

    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
    if (!strcasecmp(str_xdpmode, "drv")) {
        xdpmode = XDP_ZEROCOPY;
    } else if (!strcasecmp(str_xdpmode, "skb")) {
        xdpmode = XDP_COPY;
    } else {
        VLOG_ERR("%s: Incorrect xdpmode (%s).",
                 netdev_get_name(netdev), str_xdpmode);
        ovs_mutex_unlock(&dev->mutex);
        return EINVAL;
    }

    if (dev->requested_n_rxq != new_n_rxq
        || dev->requested_xdpmode != xdpmode) {
        dev->requested_n_rxq = new_n_rxq;
        dev->requested_xdpmode = xdpmode;
        netdev_request_reconfigure(netdev);
    }

The main difference is that "new" is compared with "requested" rather than
with "current". This way reconfiguration is requested only once for each
change. I also made a few cosmetic changes that you may find useful; however,
it's up to you.
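
To illustrate the point, here is a minimal stand-alone model of the two
comparisons. It is only a sketch: 'struct model_dev', its fields and the
helper names are hypothetical stand-ins, not the real netdev_linux code, and
the counter simply replaces the call to netdev_request_reconfigure().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant netdev_linux fields. */
struct model_dev {
    int n_rxq;                  /* Value currently applied. */
    int requested_n_rxq;        /* Value asked for by the last set_config. */
    int reconfigure_requests;   /* Counts simulated reconfigure requests. */
};

/* Patch behaviour: compare the new value with the *current* one. */
static void
set_config_vs_current(struct model_dev *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_requests++;
    }
}

/* Suggested behaviour: compare the new value with the *requested* one. */
static void
set_config_vs_requested(struct model_dev *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->requested_n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_requests++;
    }
}

/* Calls set_config 'calls' times with the same new value (4) while the
 * reconfiguration itself has not run yet, so 'n_rxq' stays at 1.
 * Returns how many reconfigure requests were issued. */
static int
count_requests(bool compare_with_requested, int calls)
{
    struct model_dev dev = { 1, 1, 0 };
    int i;

    for (i = 0; i < calls; i++) {
        if (compare_with_requested) {
            set_config_vs_requested(&dev, 4);
        } else {
            set_config_vs_current(&dev, 4);
        }
    }
    return dev.reconfigure_requests;
}
```

With three back-to-back calls the first variant issues a request each time,
while the second one issues a request only for the first call.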

> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    netdev->n_rxq = dev->requested_n_rxq;
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t curr_prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    }
> +    if (prog_id == curr_prog_id) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    } else if (!curr_prog_id) {
> +        VLOG_INFO("couldn't find a prog id on a given interface");
> +    } else {
> +        VLOG_INFO("program on interface changed, not removing");
> +    }
> +}
> +
> +struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    unsigned long addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +        struct dp_packet_afxdp *xpacket = NULL;
> +        struct dp_packet *packet;
> +        void *elems[BATCH_SIZE];
> +        unsigned long addr;
> +
> +       /* all packets are AF_XDP, so handles its own delete in batch */
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            xpacket = dp_packet_cast_afxdp(packet);
> +            if (xpacket->mpool) {
> +                void *base = dp_packet_base(packet);
> +
> +                addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +                elems[i] = (void *)addr;
> +            }
> +        }
> +        umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +        dp_packet_batch_init(batch);
> +}
> +
> +/* Receive packet from AF_XDP socket */
> +int
> +netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                     struct dp_packet_batch *batch)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    uint32_t idx_rx = 0, idx_fq = 0;
> +    unsigned int rcvd, i;
> +    int ret = 0;
> +
> +    /* See if there is any packet on RX queue,
> +     * if yes, idx_rx is the index having the packet.
> +     */
> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;
> +    }
> +
> +    /* Form a dp_packet batch from descriptor in RX queue */

s/From/To/ ?

> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> +
> +        packet = &xpacket->packet;
> +        xpacket->mpool = &xsk->umem->mpool;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +
> +    /* We've consume rcvd packets in RX, now re-fill the
> +     * same number back to FILL queue.
> +     */
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return -ENOMEM;
> +    }

Can this be done before actually receiving packets? I.e., don't receive
anything if we can't refill.
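
One possible ordering, sketched as a toy model rather than real code: take
the refill buffers from the pool first, and touch the RX ring only if that
succeeded. 'struct model_pool', 'struct model_rx_ring' and model_rx() are
hypothetical and only count elements, and the BATCH_SIZE value of 32 is an
assumption made for this sketch.

```c
#include <stddef.h>

#define BATCH_SIZE 32   /* Assumed value, for this sketch only. */

/* Hypothetical models: only element counts are tracked. */
struct model_pool { size_t n_free; };        /* Free umem elements. */
struct model_rx_ring { size_t n_pending; };  /* Descriptors waiting in RX. */

/* Returns the number of packets received.  Returns 0 and leaves the RX
 * ring untouched when the pool cannot supply a full batch of refills. */
static size_t
model_rx(struct model_pool *pool, struct model_rx_ring *rx)
{
    size_t rcvd;

    /* Step 1: secure the refill buffers before receiving anything. */
    if (pool->n_free < BATCH_SIZE) {
        return 0;
    }
    pool->n_free -= BATCH_SIZE;

    /* Step 2: only now consume descriptors from the RX ring. */
    rcvd = rx->n_pending < BATCH_SIZE ? rx->n_pending : BATCH_SIZE;
    rx->n_pending -= rcvd;

    /* Step 3: hand back the buffers that are not needed for the refill. */
    pool->n_free += BATCH_SIZE - rcvd;

    return rcvd;
}
```

The cost of this ordering is an extra push of the unused elements on every
call, so it may or may not be worth it on the fast path.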

> +
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        while (OVS_UNLIKELY(ret == 0)) {
> +            /* The FILL queue is full, so retry. (or skip)? */
> +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        }
> +
> +        /* Get one free umem, program it into FILL queue */
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> +
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> +    xsk->rx_npkts += rcvd;
> +
> +#ifdef AFXDP_DEBUG
> +    print_xsk_stat(xsk);
> +#endif
> +    return 0;
> +}
> +
> +static inline int kick_tx(struct xsk_socket_info *xsk)
> +{
> +    int ret;
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +int
> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                              struct dp_packet_batch *batch)
> +{
> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t tx_done, idx_cq = 0;
> +    struct dp_packet *packet;
> +    uint32_t idx = 0;
> +    int j, ret, retry_count = 0;
> +    const int max_retry = 4;
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        return EAGAIN;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        return EAGAIN;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * We can avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +        return ret;
> +    }
> +
> +retry:
> +    /* Process CQ */
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> +    if (tx_done > 0) {
> +        xsk->outstanding_tx -= tx_done;
> +        xsk->tx_npkts += tx_done;
> +    }
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +
> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> +    ovs_assert(ret == 0);
> +
> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +
> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> +        /* If there are still a lot not transmitted, try harder. */
> +        if (retry_count++ > max_retry) {
> +            return 0;
> +        }
> +        goto retry;
> +    }
> +
> +    return 0;
> +}
> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> new file mode 100644
> index 000000000000..6518d8fca0b5
> --- /dev/null
> +++ b/lib/netdev-afxdp.h
> @@ -0,0 +1,53 @@
> +/*
> + * Copyright (c) 2018 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_AFXDP_H
> +#define NETDEV_AFXDP_H 1
> +
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +/* These functions are Linux AF_XDP specific, so they should be used directly
> + * only by Linux-specific code. */
> +#define MAX_XSKQ 16
> +struct netdev;
> +struct xsk_socket_info;
> +struct xdp_umem;
> +struct dp_packet_batch;
> +struct smap;
> +struct dp_packet;
> +
> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet *d);
> +
> +int xsk_configure_all(struct netdev *netdev);
> +
> +void xsk_destroy_all(struct netdev *netdev);
> +
> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                         struct dp_packet_batch *batch);
> +
> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                                  struct dp_packet_batch *batch);
> +
> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                            char **errp);
> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> +
> +void free_afxdp_buf(struct dp_packet *p);
> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> +#endif /* netdev-afxdp.h */
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> new file mode 100644
> index 000000000000..3dd3d902b3c4
> --- /dev/null
> +++ b/lib/netdev-linux-private.h
> @@ -0,0 +1,124 @@
> +/*
> + * Copyright (c) 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_LINUX_PRIVATE_H
> +#define NETDEV_LINUX_PRIVATE_H 1
> +
> +#include <config.h>
> +
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"
> +
> +#if HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
> +
> +/* These functions are Linux specific, so they should be used directly only by
> + * Linux-specific code. */
> +
> +struct netdev;
> +
> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> +                                  const char *flag_name, bool enable);
> +int linux_get_ifindex(const char *netdev_name);
> +
> +#define LINUX_FLOW_OFFLOAD_API                          \
> +   .flow_flush = netdev_tc_flow_flush,                  \
> +   .flow_dump_create = netdev_tc_flow_dump_create,      \
> +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
> +   .flow_dump_next = netdev_tc_flow_dump_next,          \
> +   .flow_put = netdev_tc_flow_put,                      \
> +   .flow_get = netdev_tc_flow_get,                      \
> +   .flow_del = netdev_tc_flow_del,                      \
> +   .init_flow_api = netdev_tc_init_flow_api
> +
> +struct netdev_linux {
> +    struct netdev up;
> +
> +    /* Protects all members below. */
> +    struct ovs_mutex mutex;
> +
> +    unsigned int cache_valid;
> +
> +    bool miimon;                    /* Link status of last poll. */
> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> +    struct timer miimon_timer;
> +
> +    int netnsid;                    /* Network namespace ID. */
> +    /* The following are figured out "on demand" only.  They are only valid
> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> +    int ifindex;
> +    struct eth_addr etheraddr;
> +    int mtu;
> +    unsigned int ifi_flags;
> +    long long int carrier_resets;
> +    uint32_t kbits_rate;        /* Policing data. */
> +    uint32_t kbits_burst;
> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> +                                   0 or an errno value. */
> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> +                                 * or SIOCSIFMTU.
> +                                 */
> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> +    int netdev_policing_error;  /* Cached error code from set policing. */
> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> +
> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> +
> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> +    struct tc *tc;
> +
> +    /* For devices of class netdev_tap_class only. */
> +    int tap_fd;
> +    bool present;               /* If the device is present in the namespace */
> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> +
> +    /* LAG information. */
> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> +
> +    /* AF_XDP information */
> +#ifdef HAVE_AF_XDP
> +    struct xsk_socket_info *xsk[MAX_XSKQ];
> +    int requested_n_rxq;
> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> +    int xdp_flags, xdp_bind_flags;
> +#endif
> +};
> +
> +static struct netdev_linux *
> +netdev_linux_cast(const struct netdev *netdev)
> +{
> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +}
> +
> +#endif /* netdev-linux-private.h */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index f75d73fd39f8..1f190406d145 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -17,6 +17,7 @@
>  #include <config.h>
>  
>  #include "netdev-linux.h"
> +#include "netdev-linux-private.h"
>  
>  #include <errno.h>
>  #include <fcntl.h>
> @@ -54,6 +55,7 @@
>  #include "fatal-signal.h"
>  #include "hash.h"
>  #include "openvswitch/hmap.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-provider.h"
>  #include "netdev-tc-offloads.h"
>  #include "netdev-vport.h"
> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>  
> -struct netdev_linux {
> -    struct netdev up;
> -
> -    /* Protects all members below. */
> -    struct ovs_mutex mutex;
> -
> -    unsigned int cache_valid;
> -
> -    bool miimon;                    /* Link status of last poll. */
> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> -    struct timer miimon_timer;
> -
> -    int netnsid;                    /* Network namespace ID. */
> -    /* The following are figured out "on demand" only.  They are only valid
> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> -    int ifindex;
> -    struct eth_addr etheraddr;
> -    int mtu;
> -    unsigned int ifi_flags;
> -    long long int carrier_resets;
> -    uint32_t kbits_rate;        /* Policing data. */
> -    uint32_t kbits_burst;
> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> -                                   0 or an errno value. */
> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> -    int netdev_policing_error;  /* Cached error code from set policing. */
> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> -
> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> -
> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> -    struct tc *tc;
> -
> -    /* For devices of class netdev_tap_class only. */
> -    int tap_fd;
> -    bool present;               /* If the device is present in the namespace */
> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> -
> -    /* LAG information. */
> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> -};
>  
>  struct netdev_rxq_linux {
>      struct netdev_rxq up;
> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
>      return netdev_class->run == netdev_linux_run;
>  }
>  
> +#if HAVE_AF_XDP
>  static bool
> -is_tap_netdev(const struct netdev *netdev)
> +is_afxdp_netdev(const struct netdev *netdev)
>  {
> -    return netdev_get_class(netdev) == &netdev_tap_class;
> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
>  }
> -
> -static struct netdev_linux *
> -netdev_linux_cast(const struct netdev *netdev)
> +#else
> +static bool
> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
>  {
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> -
> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +    return false;
> +}
> +#endif
> +static bool
> +is_tap_netdev(const struct netdev *netdev)
> +{
> +    return netdev_get_class(netdev) == &netdev_tap_class;
>  }
>  
>  static struct netdev_rxq_linux *
> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>          atomic_count_dec(&miimon_cnt);
>      }
>  
> +#if HAVE_AF_XDP
> +    if (is_afxdp_netdev(netdev_)) {
> +        xsk_destroy_all(netdev_);
> +    }
> +#endif
>      ovs_mutex_destroy(&netdev->mutex);
>  }
>  
> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>      rx->is_tap = is_tap_netdev(netdev_);
>      if (rx->is_tap) {
>          rx->fd = netdev->tap_fd;
> -    } else {
> +    } else if (!is_afxdp_netdev(netdev_)) {
>          struct sockaddr_ll sll;
>          int ifindex, val;
>          /* Result of tcpdump -dd inbound */
> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>  {
>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>      struct netdev *netdev = rx->up.netdev;
> -    struct dp_packet *buffer;
> +    struct dp_packet *buffer = NULL;
>      ssize_t retval;
>      int mtu;
>  
> +#if HAVE_AF_XDP
> +    if (is_afxdp_netdev(netdev)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> +        int qid = rxq_->queue_id;
> +
> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> +    }

It might be better to implement '.rxq_recv' inside netdev-afxdp.c instead
of branching here.  Also, this early return never sets '*qfill'.
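
To make the suggestion concrete, here is a rough, untested sketch of a
dedicated receive callback.  Everything in it is a stand-in so that it builds
on its own: the struct layouts, 'stub_rxq_xsk' and the function name are
hypothetical, and writing -ENOTSUP into '*qfill' is only my assumption about
what "clearing" should mean for a netdev that cannot report queue fill level.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OVS types, just enough to compile. */
struct dp_packet_batch { size_t count; };
struct xsk_socket_info { int pending; };
struct netdev_rxq {
    int queue_id;
    struct xsk_socket_info **xsk;   /* One socket per queue (stand-in). */
};

/* Stand-in for the real AF_XDP receive path: moves the "pending" packets
 * into the batch and reports EAGAIN when nothing was received. */
static int
stub_rxq_xsk(struct xsk_socket_info *xsk, struct dp_packet_batch *batch)
{
    batch->count = (size_t) xsk->pending;
    xsk->pending = 0;
    return batch->count ? 0 : EAGAIN;
}

/* Sketch of a separate '.rxq_recv' for the afxdp class.  The point is that
 * '*qfill' is always written before returning, so callers never read an
 * uninitialized value. */
int
netdev_afxdp_rxq_recv_sketch(struct netdev_rxq *rxq,
                             struct dp_packet_batch *batch, int *qfill)
{
    assert(rxq && batch);

    if (qfill) {
        *qfill = -ENOTSUP;          /* Assumption: fill level not supported. */
    }
    return stub_rxq_xsk(rxq->xsk[rxq->queue_id], batch);
}
```

With something like this in netdev-afxdp.c, netdev_linux_rxq_recv() would not
need the HAVE_AF_XDP branch at all.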

> +#endif
>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>          mtu = ETH_PAYLOAD_MAX;
>      }
> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>      /* Assume Ethernet port. No need to set packet_type. */
>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>                                             DP_NETDEV_HEADROOM);
> +
>      retval = (rx->is_tap
>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>      int error = 0;
>      int sock = 0;
>  
> -    if (!is_tap_netdev(netdev_)) {
> +    if (!is_tap_netdev(netdev_) &&
> +        !is_afxdp_netdev(netdev_)) {
>          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>              error = EOPNOTSUPP;
>              goto free_batch;
> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>          }
>  
>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +#if HAVE_AF_XDP
> +    } else if (is_afxdp_netdev(netdev_)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> +        struct dp_packet_afxdp *xpacket;
> +        struct umem_pool *first_mpool;
> +        struct dp_packet *packet;
> +
> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> +
> +        /* All packets must come from the same umem pool
> +         * and have DPBUF_AFXDP type, otherwise free one-by-one.
> +         */
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            if (packet->source != DPBUF_AFXDP) {
> +                goto free_batch;
> +            }
> +
> +            xpacket = dp_packet_cast_afxdp(packet);
> +            if (i == 0) {
> +                first_mpool = xpacket->mpool;
> +                continue;
> +            }
> +            if (xpacket->mpool != first_mpool) {
> +                goto free_batch;
> +            }
> +        }
> +        /* free in batch */
> +        free_afxdp_buf_batch(batch);
> +        return error;


There is a lot of afxdp-specific code here, and 'netdev_linux_send' doesn't
provide any magic, i.e. it has no real code shared by all netdev types.
Maybe it's better to implement a separate '.send' function inside
netdev-afxdp.c?
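
As a rough illustration of how a standalone '.send' could be structured, the
same-pool check from the hunk above can become a small helper that decides
which free path to take.  This is only a sketch under my own assumptions: the
types, the enum values and both function names are hypothetical stand-ins,
and the real transmit call is replaced by a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for DPBUF_* and the umem pool / packet types. */
enum pkt_source { SRC_MALLOC, SRC_AFXDP };
struct pool { int id; };
struct pkt {
    enum pkt_source source;
    struct pool *mpool;
};

enum free_path { FREE_BATCH, FREE_ONE_BY_ONE };

/* Returns true if every packet is umem-backed and all of them come from the
 * same pool.  An empty batch trivially satisfies this. */
static bool
batch_shares_one_pool(struct pkt *const *pkts, size_t n)
{
    const struct pool *first = NULL;

    for (size_t i = 0; i < n; i++) {
        if (pkts[i]->source != SRC_AFXDP) {
            return false;
        }
        if (i == 0) {
            first = pkts[i]->mpool;
        } else if (pkts[i]->mpool != first) {
            return false;
        }
    }
    return true;
}

/* Sketch of a dedicated send callback: transmit, then report which free
 * path applies.  'error' stands in for the result of the real batch send. */
int
netdev_afxdp_send_sketch(struct pkt *const *pkts, size_t n,
                         enum free_path *path)
{
    int error = 0;                  /* Placeholder for the real tx result. */

    assert(pkts || n == 0);
    *path = batch_shares_one_pool(pkts, n) ? FREE_BATCH : FREE_ONE_BY_ONE;
    return error;
}
```

That would keep netdev_linux_send() free of '#if HAVE_AF_XDP' blocks and of
the extra 'goto free_batch' paths.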

> +#endif
>      } else {
>          error = netdev_linux_tap_batch_send(netdev_, batch);
>      }
> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "system",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_linux_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
>  const struct netdev_class netdev_tap_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      .type = "tap",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct_tap,
>      .get_stats = netdev_tap_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3343,10 +3352,26 @@ const struct netdev_class netdev_internal_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "internal",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_internal_get_stats,
>      .get_status = netdev_internal_get_status,
>  };
> +
> +#ifdef HAVE_AF_XDP
> +const struct netdev_class netdev_afxdp_class = {
> +    NETDEV_LINUX_CLASS_COMMON,
> +    .type = "afxdp",
> +    .is_pmd = true,
> +    .construct = netdev_linux_construct,
> +    .get_stats = netdev_linux_get_stats,
> +    .get_status = netdev_linux_get_status,
> +    .set_config = netdev_afxdp_set_config,
> +    .get_config = netdev_afxdp_get_config,
> +    .reconfigure = netdev_afxdp_reconfigure,
> +    .get_numa_id = netdev_afxdp_get_numa_id,
> +};
> +#endif
>  
>  
>  #define CODEL_N_QUEUES 0x0000
> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> index 17ca9120168a..b812e64cb078 100644
> --- a/lib/netdev-linux.h
> +++ b/lib/netdev-linux.h
> @@ -19,6 +19,20 @@
>  
>  #include <stdint.h>
>  #include <stdbool.h>
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"
>  
>  /* These functions are Linux specific, so they should be used directly only by
>   * Linux-specific code. */
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e6e8e8..d433818f7064 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class;
>  #endif
>  extern const struct netdev_class netdev_internal_class;
>  extern const struct netdev_class netdev_tap_class;
> -
> +#if HAVE_AF_XDP
> +extern const struct netdev_class netdev_afxdp_class;
> +#endif
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6f0946..e2fae37d5a5e 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -146,6 +146,9 @@ netdev_initialize(void)
>          netdev_register_provider(&netdev_internal_class);
>          netdev_register_provider(&netdev_tap_class);
>          netdev_vport_tunnel_register();
> +#ifdef HAVE_AF_XDP
> +        netdev_register_provider(&netdev_afxdp_class);
> +#endif
>  #endif
>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>          netdev_register_provider(&netdev_tap_class);
> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> new file mode 100644
> index 000000000000..2d80e74d69e4
> --- /dev/null
> +++ b/lib/xdpsock.c
> @@ -0,0 +1,239 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#include <config.h>
> +
> +#include "xdpsock.h"
> +
> +#include <ctype.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <syslog.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include "async-append.h"
> +#include "coverage.h"
> +#include "dirs.h"
> +#include "dp-packet.h"
> +#include "openvswitch/compiler.h"
> +#include "openvswitch/vlog.h"
> +#include "ovs-atomic.h"
> +#include "ovs-thread.h"
> +#include "sat-math.h"
> +#include "socket-util.h"
> +#include "svec.h"
> +#include "syslog-direct.h"
> +#include "syslog-libc.h"
> +#include "syslog-provider.h"
> +#include "timeval.h"
> +#include "unixctl.h"
> +#include "util.h"
> +
> +static inline void
> +ovs_spinlock_init(ovs_spinlock_t *sl)
> +{
> +    atomic_init(&sl->locked, 0);
> +}
> +
> +static inline void
> +ovs_spin_lock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0, locked = 0;
> +
> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed)) {
> +        locked = 1;
> +        while (locked) {
> +            atomic_read_relaxed(&sl->locked, &locked);
> +        }
> +        exp = 0;
> +    }
> +}
> +
> +static inline void
> +ovs_spin_unlock(ovs_spinlock_t *sl)
> +{
> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> +}
> +
> +static inline int OVS_UNUSED
> +ovs_spin_trylock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0;
> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed);
> +}
> +
> +inline int
> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> +        return -ENOMEM;
> +    }
> +
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(ptr, addrs, n * sizeof(void *));
> +    umemp->index += n;
> +
> +    return 0;
> +}
> +
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_push_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void
> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +    umemp->array[umemp->index++] = addr;
> +}
> +
> +void
> +umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +
> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> +        /* stack is overflow, this should not happen */
> +        OVS_NOT_REACHED();
> +    }
> +
> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    __umem_elem_push(umemp, addr);
> +    ovs_spin_unlock(&umemp->mutex);
> +}
> +
> +inline int
> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->index -= n;
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(addrs, ptr, n * sizeof(void *));
> +
> +    return 0;
> +}
> +
> +int
> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void *
> +__umem_elem_pop(struct umem_pool *umemp)
> +{
> +    return umemp->array[--umemp->index];
> +}
> +
> +void *
> +umem_elem_pop(struct umem_pool *umemp)
> +{
> +    void *ptr;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ptr = __umem_elem_pop(umemp);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ptr;
> +}
> +
> +void **
> +__umem_pool_alloc(unsigned int size)
> +{
> +    void *bufs;
> +
> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(void *)) == 0);
> +    memset(bufs, 0, size * sizeof(void *));
> +    return (void **)bufs;
> +}
> +
> +unsigned int
> +umem_elem_count(struct umem_pool *mpool)
> +{
> +    return mpool->index;
> +}
> +
> +int
> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> +{
> +    umemp->array = __umem_pool_alloc(size);
> +    if (!umemp->array) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    umemp->size = size;
> +    umemp->index = 0;
> +    ovs_spinlock_init(&umemp->mutex);
> +    return 0;
> +}
> +
> +void
> +umem_pool_cleanup(struct umem_pool *umemp)
> +{
> +    free(umemp->array);
> +}
> +
> +/* AF_XDP metadata init/destroy */
> +int
> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> +{
> +    void *bufs;
> +
> +    /* TODO: check HAVE_POSIX_MEMALIGN  */
> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(struct dp_packet_afxdp)) == 0);
> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> +
> +    xp->array = bufs;
> +    xp->size = size;
> +    return 0;
> +}
> +
> +void
> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> +{
> +    free(xp->array);
> +}
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> new file mode 100644
> index 000000000000..aabaa8e5df24
> --- /dev/null
> +++ b/lib/xdpsock.h
> @@ -0,0 +1,123 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef XDPSOCK_H
> +#define XDPSOCK_H 1
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/xsk.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <locale.h>
> +#include <net/if.h>
> +#include <poll.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +
> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> +#define BATCH_SIZE      NETDEV_MAX_BURST
> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> +
> +#define NUM_FRAMES      4096
> +#define PROD_NUM_DESCS  512
> +#define CONS_NUM_DESCS  512
> +
> +#ifdef USE_XSK_DEFAULT
> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> +#endif

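Should this be #ifdef/#else/#endif?  PROD_NUM_DESCS and CONS_NUM_DESCS are
already defined just above, so building with USE_XSK_DEFAULT redefines them.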
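
Something along these lines is what I had in mind.  It is a sketch only: the
2048 fallbacks are hypothetical placeholders so that the fragment builds
without <bpf/xsk.h>, and the static_assert reflects my understanding that the
kernel only accepts ring sizes that are a power of two.

```c
#include <assert.h>

/* Hypothetical placeholders, used only when <bpf/xsk.h> is not included. */
#ifndef XSK_RING_PROD__DEFAULT_NUM_DESCS
#define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048
#endif
#ifndef XSK_RING_CONS__DEFAULT_NUM_DESCS
#define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048
#endif

/* Each macro gets exactly one definition, whichever branch is taken. */
#ifdef USE_XSK_DEFAULT
#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
#else
#define PROD_NUM_DESCS 512
#define CONS_NUM_DESCS 512
#endif

/* Compile-time sanity check on the chosen sizes. */
static_assert((PROD_NUM_DESCS & (PROD_NUM_DESCS - 1)) == 0,
              "PROD_NUM_DESCS must be a power of two");
static_assert((CONS_NUM_DESCS & (CONS_NUM_DESCS - 1)) == 0,
              "CONS_NUM_DESCS must be a power of two");
```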

> +
> +typedef struct {
> +    atomic_int locked;
> +} ovs_spinlock_t;
> +
> +/* LIFO ptr_array */
> +struct umem_pool {
> +    int index;      /* point to top */
> +    unsigned int size;
> +    ovs_spinlock_t mutex;
> +    void **array;   /* a pointer array, point to umem buf */
> +};
> +
> +/* array-based dp_packet_afxdp */
> +struct xpacket_pool {
> +    unsigned int size;
> +    struct dp_packet_afxdp **array;
> +};
> +
> +struct xsk_umem_info {
> +    struct umem_pool mpool;
> +    struct xpacket_pool xpool;
> +    struct xsk_ring_prod fq;
> +    struct xsk_ring_cons cq;
> +    struct xsk_umem *umem;
> +    void *buffer;
> +};
> +
> +struct xsk_socket_info {
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_umem_info *umem;
> +    struct xsk_socket *xsk;
> +    unsigned long rx_npkts;
> +    unsigned long tx_npkts;
> +    unsigned long prev_rx_npkts;
> +    unsigned long prev_tx_npkts;
> +    uint32_t outstanding_tx;
> +};
> +
> +struct umem_elem {
> +    struct umem_elem *next;
> +};
> +
> +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void *__umem_elem_pop(struct umem_pool *umemp);
> +void *umem_elem_pop(struct umem_pool *umemp);
> +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void **__umem_pool_alloc(unsigned int size);
> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> +void umem_pool_cleanup(struct umem_pool *umemp);
> +unsigned int umem_elem_count(struct umem_pool *mpool);
> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> +
> +#endif
> diff --git a/tests/automake.mk b/tests/automake.mk
> index ea16532dd2a0..715cef9a6b3b 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>  	$(SYSTEM_TESTSUITE_AT) \
>  	$(SYSTEM_KMOD_TESTSUITE_AT) \
>  	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
> +	$(SYSTEM_AFXDP_TESTSUITE_AT) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>  	$(SYSTEM_DPDK_TESTSUITE_AT) \
>  	$(OVSDB_CLUSTER_TESTSUITE_AT) \
>  	$(TESTSUITE) \
>  	$(SYSTEM_KMOD_TESTSUITE) \
>  	$(SYSTEM_USERSPACE_TESTSUITE) \
> +	$(SYSTEM_AFXDP_TESTSUITE) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE) \
>  	$(SYSTEM_DPDK_TESTSUITE) \
>  	$(OVSDB_CLUSTER_TESTSUITE) \
> @@ -158,6 +160,11 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>  	tests/system-userspace-macros.at \
>  	tests/system-userspace-packet-type-aware.at
>  
> +SYSTEM_AFXDP_TESTSUITE_AT = \
> +	tests/system-afxdp-testsuite.at \
> +	tests/system-afxdp-traffic.at \
> +	tests/system-afxdp-macros.at
> +
>  SYSTEM_TESTSUITE_AT = \
>  	tests/system-common-macros.at \
>  	tests/system-ovn.at \
> @@ -182,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>  SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> @@ -315,6 +323,11 @@ check-system-userspace: all
>  	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>  
> +check-afxdp: all
> +	$(MAKE) install
> +	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> +	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> +
>  check-offloads: all
>  	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> @@ -352,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
>  
> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> +	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> +	$(AM_V_at)mv $@.tmp $@
> +
>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> new file mode 100644
> index 000000000000..2c58c2d6554b
> --- /dev/null
> +++ b/tests/system-afxdp-macros.at
> @@ -0,0 +1,153 @@
> +# _ADD_BR([name])
> +#
> +# Expands into the proper ovs-vsctl commands to create a bridge with the
> +# appropriate type and properties
> +m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
> +
> +# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
> +#
> +# Creates a database and starts ovsdb-server, starts ovs-vswitchd
> +# connected to that database, calls ovs-vsctl to create a bridge named
> +# br0 with predictable settings, passing 'vsctl-args' as additional
> +# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
> +# output (e.g. because it includes "create" commands) then 'vsctl-output'
> +# specifies the expected output after filtering through uuidfilt.
> +m4_define([OVS_TRAFFIC_VSWITCHD_START],
> +  [
> +   export OVS_PKGDATADIR=$(`pwd`)
> +   _OVS_VSWITCHD_START([--disable-system])
> +   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
> +])
> +
> +# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
> +#
> +# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
> +# for messages with severity WARN or higher and signaling an error if any
> +# is present.  The optional WHITELIST may contain shell-quoted "sed"
> +# commands to delete any warnings that are actually expected, e.g.:
> +#
> +#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
> +#
> +# 'extra_cmds' are shell commands to be executed after OVS_VSWITCHD_STOP() is
> +# invoked. They can be used to perform additional cleanups such as name space
> +# removal.
> +m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
> +  [OVS_VSWITCHD_STOP([dnl
> +$1";/netdev_linux.*obtaining netdev stats via vport failed/d
> +/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d
> +/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
> +/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
> +"])
> +   AT_CHECK([:; $2])
> +  ])
> +
> +m4_define([ADD_VETH_AFXDP],
> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> +      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
> +      AT_CHECK([ip link set $1 netns $2])
> +      AT_CHECK([ip link set dev ovs-$1 up])
> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> +      if test -n "$5"; then
> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> +      fi
> +      if test -n "$6"; then
> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> +      fi
> +      on_exit 'ip link del ovs-$1'
> +    ]
> +)
> +
> +# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
> +m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
> +    ]
> +)
> +
> +# CONFIGURE_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads for veths.  The userspace datapath uses the AF_PACKET
> +# socket to receive packets for veths.  Unfortunately, the AF_PACKET socket
> +# doesn't play well with offloads:
> +# 1. GSO packets are received without segmentation and therefore discarded.
> +# 2. Packets with offloaded partial checksum are received with the wrong
> +#    checksum, therefore discarded by the receiver.
> +#
> +# By disabling tx offloads in the non-OVS side of the veth peer we make sure
> +# that the AF_PACKET socket will not receive bad packets.
> +#
> +# This is a workaround, and should be removed when offloads are properly
> +# supported in netdev-linux.
> +m4_define([CONFIGURE_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
> +)
> +
> +# CHECK_CONNTRACK()
> +#
> +# Perform requirements checks for running conntrack tests.
> +#
> +m4_define([CHECK_CONNTRACK],
> +    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
> +)
> +
> +# CHECK_CONNTRACK_ALG()
> +#
> +# Perform requirements checks for running conntrack ALG tests. The userspace
> +# supports FTP and TFTP.
> +#
> +m4_define([CHECK_CONNTRACK_ALG])
> +
> +# CHECK_CONNTRACK_FRAG()
> +#
> +# Perform requirements checks for running conntrack fragmentations tests.
> +# The userspace doesn't support fragmentation yet, so skip the tests.
> +m4_define([CHECK_CONNTRACK_FRAG],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_LOCAL_STACK()
> +#
> +# Perform requirements checks for running conntrack tests with local stack.
> +# While the kernel connection tracker automatically passes all the connection
> +# tracking state from an internal port to the OpenvSwitch kernel module, there
> +# is simply no way of doing that with the userspace, so skip the tests.
> +m4_define([CHECK_CONNTRACK_LOCAL_STACK],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_NAT()
> +#
> +# Perform requirements checks for running conntrack NAT tests. The userspace
> +# datapath supports NAT.
> +#
> +m4_define([CHECK_CONNTRACK_NAT])
> +
> +# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
> +#
> +# Perform requirements checks for running ovs-dpctl flush-conntrack by
> +# conntrack 5-tuple test. The userspace datapath does not support
> +# this feature yet.
> +m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-set-maxconns or
> +# ovs-dpctl ct-get-maxconns. The userspace datapath does support this feature.
> +m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
> +
> +# CHECK_CT_DPIF_GET_NCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-get-nconns. The
> +# userspace datapath does support this feature.
> +m4_define([CHECK_CT_DPIF_GET_NCONNS])
> diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
> new file mode 100644
> index 000000000000..538c0d15d556
> --- /dev/null
> +++ b/tests/system-afxdp-testsuite.at
> @@ -0,0 +1,26 @@
> +AT_INIT
> +
> +AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
> +
> +Licensed under the Apache License, Version 2.0 (the "License");
> +you may not use this file except in compliance with the License.
> +You may obtain a copy of the License at:
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> +
> +Unless required by applicable law or agreed to in writing, software
> +distributed under the License is distributed on an "AS IS" BASIS,
> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +See the License for the specific language governing permissions and
> +limitations under the License.])
> +
> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> +
> +m4_include([tests/ovs-macros.at])
> +m4_include([tests/ovsdb-macros.at])
> +m4_include([tests/ofproto-macros.at])
> +m4_include([tests/system-afxdp-macros.at])
> +m4_include([tests/system-common-macros.at])
> +
> +m4_include([tests/system-afxdp-traffic.at])
> +m4_include([tests/system-ovn.at])
> diff --git a/tests/system-afxdp-traffic.at b/tests/system-afxdp-traffic.at
> new file mode 100644
> index 000000000000..26f72acf48ef
> --- /dev/null
> +++ b/tests/system-afxdp-traffic.at

Why not use the common 'tests/system-traffic.at'?
If 'ADD_VETH_AFXDP' is the only macro you need to replace,
you could move 'ADD_VETH' out into each of the
tests/system-*-macros.at files instead. That would mean much less
code duplication.
Another option is to rename 'ADD_VETH' to '_ADD_VETH' with some
additional arguments and implement wrappers in each of the specific
tests/system-*-macros.at files.
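
For illustration, the wrapper option might look roughly like the sketch
below. It is untested and rests on assumptions: the two extra arguments
($8, the offload macro to call, and $9, extra interface options) are
hypothetical, the body is abbreviated, and 'type="afxdp"' is assumed to be
what ADD_VETH_AFXDP sets on the interface today.

```m4
# Hypothetical shared helper, e.g. in tests/system-common-macros.at.
# _ADD_VETH([port], [namespace], [ovs-br], [ip_addr], [mac_addr], [gateway],
#           [ip_addr_flags], [offload_macro], [iface_options])
m4_define([_ADD_VETH],
    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
      dnl Call whichever offload macro the wrapper passed in as $8.
      m4_indir([$8], [$1])
      AT_CHECK([ip link set $1 netns $2])
      AT_CHECK([ip link set dev ovs-$1 up])
      dnl $9 carries datapath-specific interface options (may be empty).
      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
                set interface ovs-$1 external-ids:iface-id="$1" $9])
      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
      dnl The MAC address and gateway steps would stay as they are today.
      on_exit 'ip link del ovs-$1'
    ]
)

# Wrapper for the existing userspace/kernel macro files.
m4_define([ADD_VETH],
    [_ADD_VETH([$1], [$2], [$3], [$4], [$5], [$6], [$7],
               [CONFIGURE_VETH_OFFLOADS], [])]
)

# Wrapper for tests/system-afxdp-macros.at (assumed interface type).
m4_define([ADD_VETH],
    [_ADD_VETH([$1], [$2], [$3], [$4], [$5], [$6], [$7],
               [CONFIGURE_AFXDP_VETH_OFFLOADS], [type="afxdp"])]
)
```

With something like this, the tests could keep calling 'ADD_VETH' and
tests/system-afxdp-testsuite.at could include the common
tests/system-traffic.at instead of carrying a copy.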

> @@ -0,0 +1,978 @@
> +AT_BANNER([AF_XDP netdev datapath-sanity])
> +
> +AT_SETUP([datapath - ping between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ulimit -l unlimited
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
> +ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
> +ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan tunnel])
> +OVS_CHECK_VXLAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> +                  [id 0 dstport 4789])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan6 tunnel])
> +OVS_CHECK_VXLAN_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
> +                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over gre tunnel])
> +OVS_CHECK_GRE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=1 options:erspan_idx=7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=2 options:erspan_dir=1 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> +                [options:key=123 options:erspan_ver=1 options:erspan_idx=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> +                   [10.1.1.1/24], [local fc00:100::1 seq key 123 erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> +                [options:key=121 options:erspan_ver=2 options:erspan_dir=0 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> +                   [10.1.1.1/24],
> +                   [local fc00:100::1 seq key 121 erspan_ver 2 erspan_dir ingress erspan_hwid 0x7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve tunnel])
> +OVS_CHECK_GENEVE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> +                  [vni 0])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve6 tunnel])
> +OVS_CHECK_GENEVE_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
> +                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - clone action])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
> +                    -- set interface ovs-p1 ofport_request=2])
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1 actions=NORMAL
> +priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
> +priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - basic truncate action])
> +AT_SKIP_IF([test $HAVE_NC = no])
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-ofctl del-flows br0])
> +
> +dnl Create p0 and ovs-p0(1)
> +ADD_NAMESPACES(at_ns0)
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
> +NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
> +
> +dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
> +AT_CHECK([ip link add p1 type veth peer name ovs-p1])
> +on_exit 'ip link del ovs-p1'
> +AT_CHECK([ip link set dev ovs-p1 up])
> +AT_CHECK([ip link set dev p1 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
> +dnl Use p1 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
> +
> +dnl Create p2(5) and ovs-p2(4)
> +AT_CHECK([ip link add p2 type veth peer name ovs-p2])
> +on_exit 'ip link del ovs-p2'
> +AT_CHECK([ip link set dev ovs-p2 up])
> +AT_CHECK([ip link set dev p2 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
> +dnl Use p2 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
> +
> +dnl basic test
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl use this file as payload file for ncat
> +AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
> +on_exit 'rm -f payload200.bin'
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl packet with truncated size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" |  sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=100
> +])
> +dnl packet with original size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=242
> +])
> +
> +dnl more complicated output actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +dnl SLOW_ACTION: disable kernel datapath truncate support
> +dnl Repeat the test above, but exercise the SLOW_ACTION code path
> +AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
> +
> +dnl SLOW_ACTION test1: check datapatch actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
> +AT_CHECK([tail -3 stdout], [0],
> +[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
> +This flow is handled by the userspace slow path because it:
> +  - Uses action(s) not supported by datapath.
> +])
> +
> +dnl SLOW_ACTION test2: check actual packet truncate
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +
> +AT_BANNER([conntrack])
> +
> +AT_SETUP([conntrack - controller])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - force commit])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(force,commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
> +table=1,in_port=2,ct_state=+trk,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> +])
> +
> +dnl
> +dnl Check that the directionality has been changed by force commit.
> +dnl
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
> +])
> +
> +dnl OK, now send another packet from port 1 and see that it switches again
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - ct flush by 5-tuple])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),2
> +priority=100,in_port=2,udp,action=ct(zone=5,commit),1
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Test UDP from port 1
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [1], [dnl
> +])
> +
> +dnl Test UDP from port 2
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +])
> +
> +dnl Test ICMP traffic
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [stdout])
> +AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
> +icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
> +])
> +
> +ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
> +ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [1], [dnl
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv4 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - get_nconns and get/set_maxconns])
> +CHECK_CONNTRACK()
> +CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +CHECK_CT_DPIF_GET_NCONNS()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +1
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +3000000
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
> +setting maxconns successful
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +0
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv6 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +AT_DATA([flows.txt], [dnl
> +
> +dnl ICMPv6 echo request and reply go to table 1.  The rest of the traffic goes
> +dnl through normal action.
> +table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
> +table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
> +table=0,priority=1,action=normal
> +
> +dnl Allow everything from ns0->ns1. Only allow return traffic from ns1->ns0.
> +table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
> +table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
> +table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
> +table=1,priority=1,action=drop
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +dnl The above ping creates state in the connection tracker.  We're not
> +dnl interested in that state.
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], [dnl
> +icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
>
Ilya Maximets May 14, 2019, 12:09 p.m. UTC | #2
A few more comments inline.

On 10.05.2019 2:54, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> built upon the eBPF and XDP technology.  It is aims to have comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems
> As a result, AF_XDP socket shows much better performance than AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this is not
> compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> 
> ---

<snip>

> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd --disable-system
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +.. note::
> +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
> +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
> +   and datapath_type=netdev when adding a new bridge.

I don't think that '--disable-system' is needed. It doesn't affect anything.

<snip>

> +int
> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                              struct dp_packet_batch *batch)
> +{

One important issue here: netdev_linux_send() is thread-safe because
all the syscalls and memory allocations it uses are thread-safe.
However, none of the xsk_ring_* APIs are thread-safe, so if two
threads try to send packets to the same tx queue they might
corrupt the rings. So it's necessary to start using the 'concurrent_txq'
flag with per-queue locks.
Note that 'concurrent_txq' == 'false' only if 'n_txq' > 'n_pmd_threads'.
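
To make the locking suggestion concrete, here is a minimal sketch, not the
actual fix. All names in it (txq_sketch, txqs_init, txq_send, N_TXQ) are
hypothetical, a plain counter stands in for the xsk tx ring, and a pthread
mutex is used only for illustration; OVS code would presumably use its own
lock wrappers. The lock is taken only when the caller says the queue may be
shared:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define N_TXQ 4                 /* Hypothetical number of tx queues. */

/* Hypothetical per-queue state.  The counter stands in for the tx ring,
 * which is not safe to update from two threads at once. */
struct txq_sketch {
    pthread_mutex_t lock;
    unsigned long n_sent;
};

static struct txq_sketch txqs[N_TXQ];

static void
txqs_init(void)
{
    for (int i = 0; i < N_TXQ; i++) {
        pthread_mutex_init(&txqs[i].lock, NULL);
        txqs[i].n_sent = 0;
    }
}

/* Sends 'n_pkts' on queue 'qid'.  The per-queue lock is taken only when
 * 'concurrent_txq' is true, i.e. when several threads may share the queue. */
static int
txq_send(int qid, int n_pkts, bool concurrent_txq)
{
    struct txq_sketch *q;

    assert(qid >= 0 && qid < N_TXQ);
    q = &txqs[qid];

    if (concurrent_txq) {
        pthread_mutex_lock(&q->lock);
    }
    q->n_sent += n_pkts;        /* Ring reserve/submit would happen here. */
    if (concurrent_txq) {
        pthread_mutex_unlock(&q->lock);
    }
    return 0;
}
```

Keeping one lock per queue rather than one per device means that senders
working on different queues never contend with each other.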

> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t tx_done, idx_cq = 0;
> +    struct dp_packet *packet;
> +    uint32_t idx = 0;
> +    int j, ret, retry_count = 0;
> +    const int max_retry = 4;
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        return EAGAIN;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        return EAGAIN;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * We can avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +        return ret;
> +    }
> +
> +retry:
> +    /* Process CQ */
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> +    if (tx_done > 0) {
> +        xsk->outstanding_tx -= tx_done;
> +        xsk->tx_npkts += tx_done;
> +    }
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +
> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> +    ovs_assert(ret == 0);
> +
> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +
> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> +        /* If there are still a lot not transmitted, try harder. */
> +        if (retry_count++ > max_retry) {
> +            return 0;
> +        }
> +        goto retry;
> +    }
> +
> +    return 0;
> +}
William Tu May 15, 2019, 11:20 p.m. UTC | #3
> > +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
> > +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
> > +   and datapath_type=netdev when adding a new bridge.
>
> I don't think that '--disable-system' is needed. It doesn't affect anything.
>

Thanks, I will remove it.

> <snip>
>
> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                              struct dp_packet_batch *batch)
> > +{
>
> One important issue here: netdev_linux_send() is thread-safe because
> all the syscalls and memory allocations it uses are thread-safe.
> However, none of the xsk_ring_* APIs are thread-safe, so if two
> threads try to send packets to the same tx queue they might
> corrupt the rings. So it's necessary to start using the 'concurrent_txq'
> flag with per-queue locks.
> Note that 'concurrent_txq' == 'false' only if 'n_txq' > 'n_pmd_threads'.
>

Thanks!
I have one question. For example, if I have n_txq=4 and n_pmd_threads=2,
then concurrent_txq = false.

Assume pmd1 processes rx queue0 on port1 and pmd2 processes rx queue0 on port2.
What if both pmd1 and pmd2 try to send AF_XDP packets to tx queue0 on port2?
Then both pmd threads are calling the send function on port2 queue0
concurrently.
Does that mean I have to unconditionally add a per-queue lock?

Regards,
William

> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t tx_done, idx_cq = 0;
> > +    struct dp_packet *packet;
> > +    uint32_t idx = 0;
> > +    int j, ret, retry_count = 0;
> > +    const int max_retry = 4;
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return EAGAIN;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        return EAGAIN;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * We can avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +        return ret;
> > +    }
> > +
> > +retry:
> > +    /* Process CQ */
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk->outstanding_tx -= tx_done;
> > +        xsk->tx_npkts += tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +    ovs_assert(ret == 0);
> > +
> > +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> > +        /* If there are still a lot not transmitted, try harder. */
> > +        if (retry_count++ > max_retry) {
> > +            return 0;
> > +        }
> > +        goto retry;
> > +    }
> > +
> > +    return 0;
> > +}
William Tu May 15, 2019, 11:27 p.m. UTC | #4
Hi Ilya,

Thanks for your feedback.

On Mon, May 13, 2019 at 10:48 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 10.05.2019 2:54, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> > built upon the eBPF and XDP technology.  It is aims to have comparable
> > performance to DPDK but cooperate better with existing kernel's networking
> > stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > attached to the netdev, by-passing a couple of Linux kernel's subsystems
> > As a result, AF_XDP socket shows much better performance than AF_PACKET
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst. Note that by default, this is not
> > compiled in.
> >
> > Signed-off-by: William Tu <u9012063@gmail.com>
> >
> > ---

snip

> > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                        char **errp OVS_UNUSED)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    const char *xdpmode;
> > +    int new_n_rxq;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > +    if (new_n_rxq > MAX_XSKQ) {
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        return EINVAL;
> > +    }
> > +
> > +    if (new_n_rxq != netdev->n_rxq) {
> > +        dev->requested_n_rxq = new_n_rxq;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +
> > +    xdpmode = smap_get(args, "xdpmode");
> > +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
> > +        dev->requested_xdpmode = XDP_ZEROCOPY;
> > +        if (dev->xdpmode != dev->requested_xdpmode) {
> > +            netdev_request_reconfigure(netdev);
> > +        }
> > +    } else {
> > +        dev->requested_xdpmode = XDP_COPY;
> > +        if (dev->xdpmode != dev->requested_xdpmode) {
> > +            netdev_request_reconfigure(netdev);
> > +        }
> > +    }
>
> The above code will keep requesting reconfiguration until the reconfiguration
> has finished. This could cause multiple reconfigurations in a row for the same
> configuration change. A better version could look like this:
>
>     new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>     if (new_n_rxq > MAX_XSKQ) {
>         ovs_mutex_unlock(&dev->mutex);
>         VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
>                  netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
>         return EINVAL;
>     }
>
>     str_xdpmode = smap_get_def(args, "xdpmode", "skb");
>     if (!strcasecmp(str_xdpmode, "drv")) {
>         xdpmode = XDP_ZEROCOPY;
>     } else if (!strcasecmp(str_xdpmode, "skb")) {
>         xdpmode = XDP_COPY;
>     } else {
>         VLOG_ERR("%s: Incorrect xdpmode (%s).",
>                  netdev_get_name(netdev), str_xdpmode);
>         ovs_mutex_unlock(&dev->mutex);
>         return EINVAL;
>     }
>
>     if (dev->requested_n_rxq != new_n_rxq
>         || dev->requested_xdpmode != xdpmode) {
>         dev->requested_n_rxq = new_n_rxq;
>         dev->requested_xdpmode = xdpmode;
>         netdev_request_reconfigure(netdev);
>     }
>
> The main difference is comparing "new" with "requested" rather than "new" with
> "current". This allows us to request reconfiguration only once for each
> change. I also made a few cosmetic changes which you may find useful; however,
> it's up to you.

Thanks, will fix it in the next version.
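
To illustrate why the comparison target matters, here is a small
self-contained sketch. It is not OVS code: dev_sketch, the two set_config_*
helpers and count_requests are hypothetical names, and a counter replaces
netdev_request_reconfigure(). It calls each variant three times with the same
new value before any reconfiguration has been applied:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down device state for illustration only. */
struct dev_sketch {
    int n_rxq;                  /* Value currently applied. */
    int requested_n_rxq;        /* Value waiting to be applied. */
    int reconfigure_requests;   /* Counts reconfiguration requests. */
};

/* Compares the new value with the current one, as in the posted patch. */
static void
set_config_vs_current(struct dev_sketch *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_requests++;
    }
}

/* Compares the new value with the requested one, as suggested in review. */
static void
set_config_vs_requested(struct dev_sketch *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->requested_n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_requests++;
    }
}

/* Calls 'set_config' three times with the same value while the current
 * value stays unchanged, and returns how many requests were raised. */
static int
count_requests(void (*set_config)(struct dev_sketch *, int))
{
    struct dev_sketch dev = { .n_rxq = 1, .requested_n_rxq = 1,
                              .reconfigure_requests = 0 };

    assert(set_config != NULL);
    for (int i = 0; i < 3; i++) {
        set_config(&dev, 4);
    }
    return dev.reconfigure_requests;
}
```

In this sketch the first variant raises three requests for one configuration
change, while the second raises exactly one.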

>
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> > +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > +        uint64_t index;
> > +
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        index = addr >> FRAME_SHIFT;
> > +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> > +
> > +        packet = &xpacket->packet;
> > +        xpacket->mpool = &xsk->umem->mpool;
> > +
> > +        /* Initialize the struct dp_packet */
> > +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> > +        dp_packet_set_size(packet, len);
> > +
> > +        /* Add packet into batch, increase batch->count */
> > +        dp_packet_batch_add(batch, packet);
> > +
> > +        idx_rx++;
> > +    }
> > +
> > +    /* We've consume rcvd packets in RX, now re-fill the
> > +     * same number back to FILL queue.
> > +     */
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return -ENOMEM;
> > +    }
>
> Can this be done before actually receiving packets? I.e., don't receive
> anything if we can't refill.

I'm not sure I understand your point.
Do you suggest moving this umem_elem_pop_n to the beginning?
I think at this point rcvd is > 0; otherwise the function would have returned.
So we already know there are packets in the rx ring.
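
One possible reading of the suggestion, sketched under assumptions: secure the
refill buffers first and leave the rx ring untouched if the pool cannot supply
them. Nothing below is libbpf or OVS API. pool_sketch, pool_pop_n and
rx_sketch are hypothetical, and a simple 'pending' counter models the
descriptors waiting in the rx ring:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 8             /* Hypothetical pool capacity. */

/* Hypothetical LIFO pool standing in for the umem mpool. */
struct pool_sketch {
    void *elems[POOL_SIZE];
    size_t count;
};

/* Pops 'n' elements or none at all.  Returns 0 on success, -1 otherwise. */
static int
pool_pop_n(struct pool_sketch *pool, size_t n, void **out)
{
    assert(pool->count <= POOL_SIZE);
    if (pool->count < n) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = pool->elems[--pool->count];
    }
    return 0;
}

/* Receives up to 'batch' packets.  '*pending' models the number of
 * descriptors waiting in the rx ring.  Returns the number consumed. */
static size_t
rx_sketch(struct pool_sketch *pool, size_t *pending, size_t batch)
{
    void *refill[POOL_SIZE];
    size_t rcvd = *pending < batch ? *pending : batch;

    if (rcvd == 0 || pool_pop_n(pool, rcvd, refill)) {
        return 0;               /* Nothing consumed; retry on a later call. */
    }
    *pending -= rcvd;           /* Consume only once refill is guaranteed. */
    /* 'refill' would be programmed into the FILL queue here. */
    return rcvd;
}
```

With this ordering a shortage in the pool leaves the packets queued for a
later call instead of failing after they have already been added to the batch.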

>
> > +
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t index;
> > +        struct umem_elem *elem;
> > +
> > +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> > +        while (OVS_UNLIKELY(ret == 0)) {
> > +            /* The FILL queue is full, so retry. (or skip)? */
> > +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> > +        }

And if we can't refill, we will keep trying.

> > +
> > +        /* Get one free umem, program it into FILL queue */
> > +        elem = elems[i];
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> > +
> > +        idx_fq++;
> > +    }
> > +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> > +
> > +    /* Release the RX queue */
> > +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > +    xsk->rx_npkts += rcvd;
> > +
> > +#ifdef AFXDP_DEBUG
> > +    print_xsk_stat(xsk);
> > +#endif
> > +    return 0;
> > +}
> > +
> > +static inline int kick_tx(struct xsk_socket_info *xsk)
> > +{
> > +    int ret;
> > +
> > +    /* This causes system call into kernel's xsk_sendmsg, and
> > +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> > +     */
> > +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +    if (OVS_UNLIKELY(ret < 0)) {
> > +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> > +            return errno;
> > +        }
> > +    }
> > +    /* no error, or EBUSY or EAGAIN */
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                              struct dp_packet_batch *batch)
> > +{
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t tx_done, idx_cq = 0;
> > +    struct dp_packet *packet;
> > +    uint32_t idx = 0;
> > +    int j, ret, retry_count = 0;
> > +    const int max_retry = 4;
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return EAGAIN;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        return EAGAIN;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * We can avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +        return ret;
> > +    }
> > +
> > +retry:
> > +    /* Process CQ */
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk->outstanding_tx -= tx_done;
> > +        xsk->tx_npkts += tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +    ovs_assert(ret == 0);
> > +
> > +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> > +        /* If there are still a lot not transmitted, try harder. */
> > +        if (retry_count++ > max_retry) {
> > +            return 0;
> > +        }
> > +        goto retry;
> > +    }
> > +
> > +    return 0;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..6518d8fca0b5
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,53 @@
> > +/*
> > + * Copyright (c) 2018 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used directly
> > + * only by Linux-specific code. */
> > +#define MAX_XSKQ 16
> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +
> > +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet *d);
> > +
> > +int xsk_configure_all(struct netdev *netdev);
> > +
> > +void xsk_destroy_all(struct netdev *netdev);
> > +
> > +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > +                         struct dp_packet_batch *batch);
> > +
> > +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                                  struct dp_packet_batch *batch);
> > +
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..3dd3d902b3c4
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,124 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
> > +
> > +#if HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
> > +
> > +/* These functions are Linux specific, so they should be used directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> > +                                  const char *flag_name, bool enable);
> > +int linux_get_ifindex(const char *netdev_name);
> > +
> > +#define LINUX_FLOW_OFFLOAD_API                          \
> > +   .flow_flush = netdev_tc_flow_flush,                  \
> > +   .flow_dump_create = netdev_tc_flow_dump_create,      \
> > +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
> > +   .flow_dump_next = netdev_tc_flow_dump_next,          \
> > +   .flow_put = netdev_tc_flow_put,                      \
> > +   .flow_get = netdev_tc_flow_get,                      \
> > +   .flow_del = netdev_tc_flow_del,                      \
> > +   .init_flow_api = netdev_tc_init_flow_api
> > +
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set policing. */
> > +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info *xsk[MAX_XSKQ];
> > +    int requested_n_rxq;
> > +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > +    int xdp_flags, xdp_bind_flags;
> > +#endif
> > +};
> > +
> > +static struct netdev_linux *
> > +netdev_linux_cast(const struct netdev *netdev)
> > +{
> > +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +}
> > +
> > +#endif /* netdev-linux-private.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..1f190406d145 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -17,6 +17,7 @@
> >  #include <config.h>
> >
> >  #include "netdev-linux.h"
> > +#include "netdev-linux-private.h"
> >
> >  #include <errno.h>
> >  #include <fcntl.h>
> > @@ -54,6 +55,7 @@
> >  #include "fatal-signal.h"
> >  #include "hash.h"
> >  #include "openvswitch/hmap.h"
> > +#include "netdev-afxdp.h"
> >  #include "netdev-provider.h"
> >  #include "netdev-tc-offloads.h"
> >  #include "netdev-vport.h"
> > @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
> >  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
> >
> > -struct netdev_linux {
> > -    struct netdev up;
> > -
> > -    /* Protects all members below. */
> > -    struct ovs_mutex mutex;
> > -
> > -    unsigned int cache_valid;
> > -
> > -    bool miimon;                    /* Link status of last poll. */
> > -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > -    struct timer miimon_timer;
> > -
> > -    int netnsid;                    /* Network namespace ID. */
> > -    /* The following are figured out "on demand" only.  They are only valid
> > -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > -    int ifindex;
> > -    struct eth_addr etheraddr;
> > -    int mtu;
> > -    unsigned int ifi_flags;
> > -    long long int carrier_resets;
> > -    uint32_t kbits_rate;        /* Policing data. */
> > -    uint32_t kbits_burst;
> > -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > -                                   0 or an errno value. */
> > -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> > -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > -    int netdev_policing_error;  /* Cached error code from set policing. */
> > -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > -
> > -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > -
> > -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > -    struct tc *tc;
> > -
> > -    /* For devices of class netdev_tap_class only. */
> > -    int tap_fd;
> > -    bool present;               /* If the device is present in the namespace */
> > -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > -
> > -    /* LAG information. */
> > -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > -};
> >
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> > @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
> >      return netdev_class->run == netdev_linux_run;
> >  }
> >
> > +#if HAVE_AF_XDP
> >  static bool
> > -is_tap_netdev(const struct netdev *netdev)
> > +is_afxdp_netdev(const struct netdev *netdev)
> >  {
> > -    return netdev_get_class(netdev) == &netdev_tap_class;
> > +    return netdev_get_class(netdev) == &netdev_afxdp_class;
> >  }
> > -
> > -static struct netdev_linux *
> > -netdev_linux_cast(const struct netdev *netdev)
> > +#else
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> >  {
> > -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > -
> > -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +    return false;
> > +}
> > +#endif
> > +static bool
> > +is_tap_netdev(const struct netdev *netdev)
> > +{
> > +    return netdev_get_class(netdev) == &netdev_tap_class;
> >  }
> >
> >  static struct netdev_rxq_linux *
> > @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
> >          atomic_count_dec(&miimon_cnt);
> >      }
> >
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev_)) {
> > +        xsk_destroy_all(netdev_);
> > +    }
> > +#endif
> >      ovs_mutex_destroy(&netdev->mutex);
> >  }
> >
> > @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >      rx->is_tap = is_tap_netdev(netdev_);
> >      if (rx->is_tap) {
> >          rx->fd = netdev->tap_fd;
> > -    } else {
> > +    } else if (!is_afxdp_netdev(netdev_)) {
> >          struct sockaddr_ll sll;
> >          int ifindex, val;
> >          /* Result of tcpdump -dd inbound */
> > @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >  {
> >      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >      struct netdev *netdev = rx->up.netdev;
> > -    struct dp_packet *buffer;
> > +    struct dp_packet *buffer = NULL;
> >      ssize_t retval;
> >      int mtu;
> >
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +        int qid = rxq_->queue_id;
> > +
> > +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> > +    }
>
> Maybe it's better to just implement '.rxq_recv' inside netdev-afxdp.c ?
> Also, you missed clearing the '*qfill'.
>
> > +#endif
> >      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
> >          mtu = ETH_PAYLOAD_MAX;
> >      }
> > @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >      /* Assume Ethernet port. No need to set packet_type. */
> >      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> >                                             DP_NETDEV_HEADROOM);
> > +
> >      retval = (rx->is_tap
> >                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> >                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >      int error = 0;
> >      int sock = 0;
> >
> > -    if (!is_tap_netdev(netdev_)) {
> > +    if (!is_tap_netdev(netdev_) &&
> > +        !is_afxdp_netdev(netdev_)) {
> >          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >              error = EOPNOTSUPP;
> >              goto free_batch;
> > @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >          }
> >
> >          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +#if HAVE_AF_XDP
> > +    } else if (is_afxdp_netdev(netdev_)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct umem_pool *first_mpool;
> > +        struct dp_packet *packet;
> > +
> > +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> > +
> > +        /* All packets must come from the same umem pool
> > +         * and have the DPBUF_AFXDP type, otherwise free one-by-one.
> > +         */
> > +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +            if (packet->source != DPBUF_AFXDP) {
> > +                goto free_batch;
> > +            }
> > +
> > +            xpacket = dp_packet_cast_afxdp(packet);
> > +            if (i == 0) {
> > +                first_mpool = xpacket->mpool;
> > +                continue;
> > +            }
> > +            if (xpacket->mpool != first_mpool) {
> > +                goto free_batch;
> > +            }
> > +        }
> > +        /* free in batch */
> > +        free_afxdp_buf_batch(batch);
> > +        return error;
>
>
> There is a lot of afxdp-specific code here and 'netdev_linux_send' doesn't
> provide any magic, i.e. it has no real code suitable for all netdev types.
> Maybe it's better to just implement a separate '.send' function inside netdev-afxdp.c ?

Yes, I will do that.

>
> > +#endif
> >      } else {
> >          error = netdev_linux_tap_batch_send(netdev_, batch);
> >      }
> > @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "system",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_linux_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
> >  const struct netdev_class netdev_tap_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      .type = "tap",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct_tap,
> >      .get_stats = netdev_tap_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3343,10 +3352,26 @@ const struct netdev_class netdev_internal_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "internal",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_internal_get_stats,
> >      .get_status = netdev_internal_get_status,
> >  };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > +    NETDEV_LINUX_CLASS_COMMON,
> > +    .type = "afxdp",
> > +    .is_pmd = true,
> > +    .construct = netdev_linux_construct,
> > +    .get_stats = netdev_linux_get_stats,
> > +    .get_status = netdev_linux_get_status,
> > +    .set_config = netdev_afxdp_set_config,
> > +    .get_config = netdev_afxdp_get_config,
> > +    .reconfigure = netdev_afxdp_reconfigure,
> > +    .get_numa_id = netdev_afxdp_get_numa_id,
> > +};
> > +#endif
> >
> >
> >  #define CODEL_N_QUEUES 0x0000
> > diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> > index 17ca9120168a..b812e64cb078 100644
> > --- a/lib/netdev-linux.h
> > +++ b/lib/netdev-linux.h
> > @@ -19,6 +19,20 @@
> >
> >  #include <stdint.h>
> >  #include <stdbool.h>
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
> >
> >  /* These functions are Linux specific, so they should be used directly only by
> >   * Linux-specific code. */
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index fb0c27e6e8e8..d433818f7064 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class;
> >  #endif
> >  extern const struct netdev_class netdev_internal_class;
> >  extern const struct netdev_class netdev_tap_class;
> > -
> > +#if HAVE_AF_XDP
> > +extern const struct netdev_class netdev_afxdp_class;
> > +#endif
> >  #ifdef  __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 7d7ecf6f0946..e2fae37d5a5e 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -146,6 +146,9 @@ netdev_initialize(void)
> >          netdev_register_provider(&netdev_internal_class);
> >          netdev_register_provider(&netdev_tap_class);
> >          netdev_vport_tunnel_register();
> > +#ifdef HAVE_AF_XDP
> > +        netdev_register_provider(&netdev_afxdp_class);
> > +#endif
> >  #endif
> >  #if defined(__FreeBSD__) || defined(__NetBSD__)
> >          netdev_register_provider(&netdev_tap_class);
> > diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> > new file mode 100644
> > index 000000000000..2d80e74d69e4
> > --- /dev/null
> > +++ b/lib/xdpsock.c
> > @@ -0,0 +1,239 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +#include <config.h>
> > +
> > +#include "xdpsock.h"
> > +
> > +#include <ctype.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdarg.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <syslog.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +
> > +#include "async-append.h"
> > +#include "coverage.h"
> > +#include "dirs.h"
> > +#include "dp-packet.h"
> > +#include "openvswitch/compiler.h"
> > +#include "openvswitch/vlog.h"
> > +#include "ovs-atomic.h"
> > +#include "ovs-thread.h"
> > +#include "sat-math.h"
> > +#include "socket-util.h"
> > +#include "svec.h"
> > +#include "syslog-direct.h"
> > +#include "syslog-libc.h"
> > +#include "syslog-provider.h"
> > +#include "timeval.h"
> > +#include "unixctl.h"
> > +#include "util.h"
> > +
> > +static inline void
> > +ovs_spinlock_init(ovs_spinlock_t *sl)
> > +{
> > +    atomic_init(&sl->locked, 0);
> > +}
> > +
> > +static inline void
> > +ovs_spin_lock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0, locked = 0;
> > +
> > +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed)) {
> > +        locked = 1;
> > +        while (locked) {
> > +            atomic_read_relaxed(&sl->locked, &locked);
> > +        }
> > +        exp = 0;
> > +    }
> > +}
> > +
> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
> > +
> > +inline int
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +
> > +    return 0;
> > +}
> > +
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> > +        /* stack is overflow, this should not happen */
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ptr;
> > +}
> > +
> > +void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(void *)) == 0);
> > +    memset(bufs, 0, size * sizeof(void *));
> > +    return (void **)bufs;
> > +}
> > +
> > +unsigned int
> > +umem_elem_count(struct umem_pool *mpool)
> > +{
> > +    return mpool->index;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->mutex);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > +    free(umemp->array);
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    /* TODO: check HAVE_POSIX_MEMALIGN  */
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(struct dp_packet_afxdp)) == 0);
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free(xp->array);
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..aabaa8e5df24
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,123 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <bpf/libbpf.h>
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <getopt.h>
> > +#include <libgen.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/if_ether.h>
> > +#include <locale.h>
> > +#include <net/if.h>
> > +#include <poll.h>
> > +#include <pthread.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <sys/mman.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define NUM_FRAMES      4096
> > +#define PROD_NUM_DESCS  512
> > +#define CONS_NUM_DESCS  512
> > +
> > +#ifdef USE_XSK_DEFAULT
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#endif
>
> Should there be ifdef-else-endif ?

good catch, thanks.
William
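
For reference, a minimal sketch of the ifdef-else-endif form suggested above. As quoted, defining USE_XSK_DEFAULT would redefine PROD_NUM_DESCS and CONS_NUM_DESCS after they were already set to 512, which compilers normally diagnose. The two XSK_RING_*__DEFAULT_NUM_DESCS definitions below are stand-in values so the fragment compiles without libbpf (in the patch they come from <bpf/xsk.h>), and tx_retry_threshold() is a hypothetical helper that only restates the "PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)" expression from the send path.

```c
/* Stand-ins so this sketch builds without libbpf; in the patch these
 * macros are provided by <bpf/xsk.h>.  The values are assumptions. */
#ifndef XSK_RING_PROD__DEFAULT_NUM_DESCS
#define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048
#endif
#ifndef XSK_RING_CONS__DEFAULT_NUM_DESCS
#define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048
#endif

/* Exactly one definition of each ring size is visible, whichever way
 * USE_XSK_DEFAULT is set. */
#ifdef USE_XSK_DEFAULT
#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
#else
#define PROD_NUM_DESCS 512
#define CONS_NUM_DESCS 512
#endif

/* Hypothetical helper: the outstanding-tx level (three quarters of the
 * producer ring) above which the send path retries draining the
 * completion queue. */
static inline unsigned int
tx_retry_threshold(void)
{
    return PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2);
}
```

With USE_XSK_DEFAULT undefined, the threshold works out to 512 - 128 = 384 descriptors.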
Ilya Maximets May 16, 2019, 6:58 a.m. UTC | #5
On 16.05.2019 2:20, William Tu wrote:
>>> +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
>>> +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
>>> +   and datapath_type=netdev when adding a new bridge.
>>
>> I don't think that '--disable-system' is needed. It doesn't affect anything.
>>
> 
> Thanks I will remove it.
> 
>> <snip>
>>
>>> +int
>>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                              struct dp_packet_batch *batch)
>>> +{
>>
>> One important issue here. netdev_linux_send() is thread-safe, because
>> all the syscalls and memory allocations there are thread-safe.
>> However, the xsk_ring_* APIs are not thread-safe, and if two
>> threads try to send packets to the same tx queue they might
>> corrupt the rings. So, it's necessary to start using the 'concurrent_txq'
>> flag with per-queue locks.
>> Note that 'concurrent_txq' == 'false' only if 'n_txq' > 'n_pmd_threads'.
>>
> 
> Thanks!
> I have one question. For example, if I have n_txq=4 and n_pmd_threads=2,
> then concurrent_txq = false.
> 
> Assume pmd1 processes rx queue0 on port1 and pmd2 processes rx queue0 on port2.
> What if both pmd1 and pmd2 try to send AF_XDP packets to tx queue0 on port2?
> Then both pmd threads are calling the send function on port2 queue0
> concurrently.
> Does that mean I have to unconditionally add a per-queue lock?

No. You don't need that. dpif-netdev manages Tx queues in a way that
two threads will never use the same Tx queue of the same netdev unless
'concurrent_txq' is set to 'true'.

In your case above you have 'n_txq' > 'n_pmd_threads' and dpif-netdev
will use static Tx queue ids, i.e. pmd1 will always use Tx queue #0,
pmd2 will always use Tx queue #1, and the main thread will always use
Tx queue #2.

If you have 'n_txq' <= 'n_pmd_threads', dpif-netdev will use dynamic
Tx queue ids with some kind of XPS mechanism, i.e. threads will allocate
a Tx queue id before sending. XPS will try, but it doesn't guarantee, that
another thread will not use the same queue, so 'concurrent_txq' will always
be "true" for this port.

You may use 'netdev_dpdk_send__()' as a reference.
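
A rough sketch of the per-queue locking pattern described above, not the actual netdev-dpdk or netdev-afxdp code. All names here (MAX_TXQ_SKETCH, struct txq_sketch, struct port_sketch, tx_burst_unlocked, port_sketch_send) are hypothetical, and a pthread mutex stands in for whatever lock type the real implementation would choose. The point is only that the lock is taken when the caller passes 'concurrent_txq' as true and skipped otherwise, since in that case the caller guarantees exclusive use of the queue.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_TXQ_SKETCH 16           /* Hypothetical per-port queue limit. */

/* Hypothetical per-queue state: one lock per Tx queue. */
struct txq_sketch {
    pthread_mutex_t lock;
    unsigned long n_sent;           /* Stand-in for the ring state. */
};

struct port_sketch {
    struct txq_sketch txq[MAX_TXQ_SKETCH];
};

static void
port_sketch_init(struct port_sketch *port)
{
    for (int i = 0; i < MAX_TXQ_SKETCH; i++) {
        pthread_mutex_init(&port->txq[i].lock, NULL);
        port->txq[i].n_sent = 0;
    }
}

/* Stand-in for the ring operations that are not thread-safe by
 * themselves (the xsk_ring_* calls in the patch). */
static int
tx_burst_unlocked(struct txq_sketch *q, int n_pkts)
{
    q->n_sent += n_pkts;
    return 0;
}

/* Takes the per-queue lock only when several threads may share 'qid'. */
static int
port_sketch_send(struct port_sketch *port, int qid, int n_pkts,
                 bool concurrent_txq)
{
    struct txq_sketch *q = &port->txq[qid];
    int error;

    if (concurrent_txq) {
        pthread_mutex_lock(&q->lock);
    }
    error = tx_burst_unlocked(q, n_pkts);
    if (concurrent_txq) {
        pthread_mutex_unlock(&q->lock);
    }
    return error;
}
```

In the static case ('n_txq' > 'n_pmd_threads') each thread owns its queue id, so the uncontended path never touches the lock.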

> 
> Regards,
> William
> 
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t tx_done, idx_cq = 0;
>>> +    struct dp_packet *packet;
>>> +    uint32_t idx = 0;
>>> +    int j, ret, retry_count = 0;
>>> +    const int max_retry = 4;
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>> +         * We can avoid this copy if the packet and the pop umem
>>> +         * are located in the same umem.
>>> +         */
>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
>>> +
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +        return ret;
>>> +    }
>>> +
>>> +retry:
>>> +    /* Process CQ */
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk->outstanding_tx -= tx_done;
>>> +        xsk->tx_npkts += tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
>>> +    ovs_assert(ret == 0);
>>> +
>>> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +
>>> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
>>> +        /* If there are still a lot not transmitted, try harder. */
>>> +        if (retry_count++ > max_retry) {
>>> +            return 0;
>>> +        }
>>> +        goto retry;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
> 
>
Ilya Maximets May 16, 2019, 7:27 a.m. UTC | #6
On 16.05.2019 2:27, William Tu wrote:
> Hi Ilya,
> 
> Thanks for your feedback.
> 
> On Mon, May 13, 2019 at 10:48 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> On 10.05.2019 2:54, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
>>> built upon the eBPF and XDP technology.  It aims to have comparable
>>> performance to DPDK but cooperate better with existing kernel's networking
>>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
>>> attached to the netdev, by-passing a couple of Linux kernel's subsystems.
>>> As a result, AF_XDP socket shows much better performance than AF_PACKET.
>>> For more details about AF_XDP, please see linux kernel's
>>> Documentation/networking/af_xdp.rst. Note that by default, this is not
>>> compiled in.
>>>
>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>>
>>> ---
> 
> snip
> 
>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>> +                        char **errp OVS_UNUSED)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    const char *xdpmode;
>>> +    int new_n_rxq;
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +
>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>> +    if (new_n_rxq > MAX_XSKQ) {
>>> +        ovs_mutex_unlock(&dev->mutex);
>>> +        return EINVAL;
>>> +    }
>>> +
>>> +    if (new_n_rxq != netdev->n_rxq) {
>>> +        dev->requested_n_rxq = new_n_rxq;
>>> +        netdev_request_reconfigure(netdev);
>>> +    }
>>> +
>>> +    xdpmode = smap_get(args, "xdpmode");
>>> +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
>>> +        dev->requested_xdpmode = XDP_ZEROCOPY;
>>> +        if (dev->xdpmode != dev->requested_xdpmode) {
>>> +            netdev_request_reconfigure(netdev);
>>> +        }
>>> +    } else {
>>> +        dev->requested_xdpmode = XDP_COPY;
>>> +        if (dev->xdpmode != dev->requested_xdpmode) {
>>> +            netdev_request_reconfigure(netdev);
>>> +        }
>>> +    }
>>
>> The code above will keep requesting reconfiguration until the
>> reconfiguration has finished. This could cause multiple reconfigurations in
>> a row for the same configuration change. A better version could look like this:
>>
>>     new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>     if (new_n_rxq > MAX_XSKQ) {
>>         ovs_mutex_unlock(&dev->mutex);
>>         VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
>>                  netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
>>         return EINVAL;
>>     }
>>
>>     str_xdpmode = smap_get_def(args, "xdpmode", "skb");
>>     if (!strcasecmp(str_xdpmode, "drv")) {
>>         xdpmode = XDP_ZEROCOPY;
>>     } else if (!strcasecmp(str_xdpmode, "skb")) {
>>         xdpmode = XDP_COPY;
>>     } else {
>>         VLOG_ERR("%s: Incorrect xdpmode (%s).",
>>                  netdev_get_name(netdev), str_xdpmode);
>>         ovs_mutex_unlock(&dev->mutex);
>>         return EINVAL;
>>     }
>>
>>     if (dev->requested_n_rxq != new_n_rxq
>>         || dev->requested_xdpmode != xdpmode) {
>>         dev->requested_n_rxq = new_n_rxq;
>>         dev->requested_xdpmode = xdpmode;
>>         netdev_request_reconfigure(netdev);
>>     }
>>
>> The main difference is checking "new" with "requested", not the "new" with
>> "current". This allows us to request reconfiguration only once for each
>> change. I also made a few cosmetic changes which you may find useful;
>> however, it's up to you.
> 
> Thanks, will fix it in next version.
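
To illustrate the "new vs. requested" point, here is a minimal toy sketch. It assumes a simplified device structure (struct toy_dev and all toy_* names are hypothetical, not OVS code) and counts how often a reconfiguration would be requested when set_config is called repeatedly before the reconfiguration actually runs:

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant netdev_linux fields. */
struct toy_dev {
    int n_rxq;              /* Value currently applied. */
    int requested_n_rxq;    /* Value most recently requested. */
    int reconfigure_calls;  /* How many times reconfigure was requested. */
};

/* Compares "new" with "current": re-requests on every call until the
 * reconfiguration has actually been applied. */
static void
toy_set_config_vs_current(struct toy_dev *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_calls++;
    }
}

/* Compares "new" with "requested": requests once per change. */
static void
toy_set_config_vs_requested(struct toy_dev *dev, int new_n_rxq)
{
    if (new_n_rxq != dev->requested_n_rxq) {
        dev->requested_n_rxq = new_n_rxq;
        dev->reconfigure_calls++;
    }
}

/* Applies the pending request, as a reconfigure callback would. */
static void
toy_reconfigure(struct toy_dev *dev)
{
    assert(dev->requested_n_rxq >= 1);
    dev->n_rxq = dev->requested_n_rxq;
}
```

With three set_config calls for the same value before the reconfiguration runs, the first variant requests three reconfigurations and the second only one.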
> 
>>
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>> +        uint64_t index;
>>> +
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct dp_packet *packet;
>>> +
>>> +        index = addr >> FRAME_SHIFT;
>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
>>> +
>>> +        packet = &xpacket->packet;
>>> +        xpacket->mpool = &xsk->umem->mpool;
>>> +
>>> +        /* Initialize the struct dp_packet */
>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>>> +        dp_packet_set_size(packet, len);
>>> +
>>> +        /* Add packet into batch, increase batch->count */
>>> +        dp_packet_batch_add(batch, packet);
>>> +
>>> +        idx_rx++;
>>> +    }
>>> +
>>> +    /* We've consumed rcvd packets in RX, now re-fill the
>>> +     * same number back to FILL queue.
>>> +     */
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return -ENOMEM;
>>> +    }
>>
>> Can this be done before actually receiving packets, i.e. don't receive
>> anything if we can't refill?
> 
> I'm not sure I understand your point.
> Do you suggest moving this umem_elem_pop_n in the beginning?


Yes. I meant this.
It's just an optimization. We don't need to iterate over 'xsk->rx'
if 'umem_elem_pop_n' failed because you're returning error and the
batch will not be used.

Actually, I looked at the code again and I see that in case of
'umem_elem_pop_n' failure you're not releasing the packets, i.e.
xsk_ring_cons__peek() reserved some packets, but xsk_ring_cons__release()
is never called. This will break the rx ring on the next call.

I think that code should look like this:

rcvd = xsk_ring_cons__peek(&xsk->rx ...)
if (!rcvd) {
    return 0;
}

ret = umem_elem_pop_n(rcvd)
if (OVS_UNLIKELY(ret)) {
    xsk_ring_cons__release(&xsk->rx, rcvd);
    /* TODO: Account packets as dropped. */
    return -ENOMEM;
}

for (i = 0; i < rcvd; i++) {
    ...
    dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
    ...
    dp_packet_batch_add(batch, packet);
}

/* Release the RX queue */
xsk_ring_cons__release(&xsk->rx, rcvd);

for (i = 0; i < rcvd; i++) {
    ...
    *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
}
xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
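
The peek/release pairing above can be sketched with a toy model. This is not the libbpf implementation: struct toy_cons_ring, struct toy_pool and the toy_* helpers are hypothetical and only mimic the idea that peek advances a private cursor while release publishes it, so every successful peek must be followed by a release, on the error path too:

```c
#include <assert.h>
#include <stdint.h>

/* Toy consumer ring: peek moves 'cached_cons', release moves 'cons'. */
struct toy_cons_ring {
    uint32_t prod;          /* Entries produced so far. */
    uint32_t cached_cons;   /* Entries peeked so far. */
    uint32_t cons;          /* Entries released so far. */
};

/* Toy buffer pool holding only a count of free elements. */
struct toy_pool {
    int free_elems;
};

static uint32_t
toy_peek(struct toy_cons_ring *r, uint32_t max)
{
    uint32_t avail = r->prod - r->cached_cons;
    uint32_t n = avail < max ? avail : max;

    r->cached_cons += n;
    return n;
}

static void
toy_release(struct toy_cons_ring *r, uint32_t n)
{
    r->cons += n;
    assert(r->cons <= r->cached_cons);
}

static int
toy_pop_n(struct toy_pool *p, int n)
{
    if (p->free_elems < n) {
        return -1;
    }
    p->free_elems -= n;
    return 0;
}

/* Returns the number of packets delivered, 0 if the ring is empty, or
 * -1 if no refill buffers were available (the packets are dropped). */
static int
toy_rx(struct toy_cons_ring *rx, struct toy_pool *pool, uint32_t batch)
{
    uint32_t rcvd = toy_peek(rx, batch);

    if (!rcvd) {
        return 0;
    }
    if (toy_pop_n(pool, (int) rcvd)) {
        toy_release(rx, rcvd);   /* Keep peek and release balanced. */
        return -1;
    }
    /* ... build the batch and refill the FILL queue here ... */
    toy_release(rx, rcvd);
    return (int) rcvd;
}
```

In this model the two cursors stay equal after every call, whether or not the pool had enough elements.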


> I think at this point the rcvd is > 0, otherwise it will return.
> So we already know there are packets at rx ring.
> 
>>
>>> +
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t index;
>>> +        struct umem_elem *elem;
>>> +
>>> +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
>>> +        while (OVS_UNLIKELY(ret == 0)) {
>>> +            /* The FILL queue is full, so retry. (or skip)? */
>>> +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
>>> +        }
> 
> And if we can't refill, we will keep trying.

One additional point here is that you don't need to call prod__reserve
for each packet. You could use:

while (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
    /* The FILL queue is full, so retry. */
}

xsk_ring_prod__reserve() doesn't support partial allocations, i.e. it will
reserve rcvd or 0 elements.
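
As an illustration of the all-or-nothing behaviour, here is a toy producer ring. It is only a sketch under the assumption stated above (reserve returns either the full count or 0); struct toy_prod_ring and toy_reserve() are hypothetical names, not the libbpf code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy producer ring with free-running 32-bit indices. */
struct toy_prod_ring {
    uint32_t size;   /* Total number of descriptors. */
    uint32_t prod;   /* Descriptors reserved so far. */
    uint32_t cons;   /* Descriptors consumed by the other side. */
};

/* Reserves exactly 'nb' descriptors or none.  On success stores the first
 * reserved index in '*idx' and returns 'nb'; otherwise returns 0 and
 * leaves the ring untouched. */
static uint32_t
toy_reserve(struct toy_prod_ring *r, uint32_t nb, uint32_t *idx)
{
    uint32_t used = r->prod - r->cons;

    assert(used <= r->size);
    if (r->size - used < nb) {
        return 0;
    }
    *idx = r->prod;
    r->prod += nb;
    return nb;
}
```

Because a failed call changes nothing, retrying with the same 'nb' in a loop is safe; it simply waits for the consumer to free enough descriptors.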

> 
>>> +
>>> +        /* Get one free umem, program it into FILL queue */
>>> +        elem = elems[i];
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
>>> +
>>> +        idx_fq++;
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>>> +
>>> +    /* Release the RX queue */
>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>> +    xsk->rx_npkts += rcvd;
>>> +
>>> +#ifdef AFXDP_DEBUG
>>> +    print_xsk_stat(xsk);
>>> +#endif
>>> +    return 0;
>>> +}
>>> +
>>> +static inline int kick_tx(struct xsk_socket_info *xsk)
>>> +{
>>> +    int ret;
>>> +
>>> +    /* This causes system call into kernel's xsk_sendmsg, and
>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
>>> +     */
>>> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
>>> +    if (OVS_UNLIKELY(ret < 0)) {
>>> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
>>> +            return errno;
>>> +        }
>>> +    }
>>> +    /* no error, or EBUSY or EAGAIN */
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                              struct dp_packet_batch *batch)
>>> +{
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t tx_done, idx_cq = 0;
>>> +    struct dp_packet *packet;
>>> +    uint32_t idx = 0;
>>> +    int j, ret, retry_count = 0;
>>> +    const int max_retry = 4;
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>> +         * We can avoid this copy if the packet and the pop umem
>>> +         * are located in the same umem.
>>> +         */
>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
>>> +
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +        return ret;
>>> +    }
>>> +
>>> +retry:
>>> +    /* Process CQ */
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk->outstanding_tx -= tx_done;
>>> +        xsk->tx_npkts += tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
>>> +    ovs_assert(ret == 0);
>>> +
>>> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +
>>> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
>>> +        /* If there are still a lot not transmitted, try harder. */
>>> +        if (retry_count++ > max_retry) {
>>> +            return 0;
>>> +        }
>>> +        goto retry;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
>>> new file mode 100644
>>> index 000000000000..6518d8fca0b5
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.h
>>> @@ -0,0 +1,53 @@
>>> +/*
>>> + * Copyright (c) 2018 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_AFXDP_H
>>> +#define NETDEV_AFXDP_H 1
>>> +
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +/* These functions are Linux AF_XDP specific, so they should be used directly
>>> + * only by Linux-specific code. */
>>> +#define MAX_XSKQ 16
>>> +struct netdev;
>>> +struct xsk_socket_info;
>>> +struct xdp_umem;
>>> +struct dp_packet_batch;
>>> +struct smap;
>>> +struct dp_packet;
>>> +
>>> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet *d);
>>> +
>>> +int xsk_configure_all(struct netdev *netdev);
>>> +
>>> +void xsk_destroy_all(struct netdev *netdev);
>>> +
>>> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
>>> +                         struct dp_packet_batch *batch);
>>> +
>>> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                                  struct dp_packet_batch *batch);
>>> +
>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>> +                            char **errp);
>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
>>> +
>>> +void free_afxdp_buf(struct dp_packet *p);
>>> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
>>> +#endif /* netdev-afxdp.h */
>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>> new file mode 100644
>>> index 000000000000..3dd3d902b3c4
>>> --- /dev/null
>>> +++ b/lib/netdev-linux-private.h
>>> @@ -0,0 +1,124 @@
>>> +/*
>>> + * Copyright (c) 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>> +
>>> +#if HAVE_AF_XDP
>>> +#include "netdev-afxdp.h"
>>> +#endif
>>> +
>>> +/* These functions are Linux specific, so they should be used directly only by
>>> + * Linux-specific code. */
>>> +
>>> +struct netdev;
>>> +
>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
>>> +                                  const char *flag_name, bool enable);
>>> +int linux_get_ifindex(const char *netdev_name);
>>> +
>>> +#define LINUX_FLOW_OFFLOAD_API                          \
>>> +   .flow_flush = netdev_tc_flow_flush,                  \
>>> +   .flow_dump_create = netdev_tc_flow_dump_create,      \
>>> +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
>>> +   .flow_dump_next = netdev_tc_flow_dump_next,          \
>>> +   .flow_put = netdev_tc_flow_put,                      \
>>> +   .flow_get = netdev_tc_flow_get,                      \
>>> +   .flow_del = netdev_tc_flow_del,                      \
>>> +   .init_flow_api = netdev_tc_init_flow_api
>>> +
>>> +struct netdev_linux {
>>> +    struct netdev up;
>>> +
>>> +    /* Protects all members below. */
>>> +    struct ovs_mutex mutex;
>>> +
>>> +    unsigned int cache_valid;
>>> +
>>> +    bool miimon;                    /* Link status of last poll. */
>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> +    struct timer miimon_timer;
>>> +
>>> +    int netnsid;                    /* Network namespace ID. */
>>> +    /* The following are figured out "on demand" only.  They are only valid
>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> +    int ifindex;
>>> +    struct eth_addr etheraddr;
>>> +    int mtu;
>>> +    unsigned int ifi_flags;
>>> +    long long int carrier_resets;
>>> +    uint32_t kbits_rate;        /* Policing data. */
>>> +    uint32_t kbits_burst;
>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> +                                   0 or an errno value. */
>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
>>> +                                 * or SIOCSIFMTU.
>>> +                                 */
>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> +
>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> +
>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> +    struct tc *tc;
>>> +
>>> +    /* For devices of class netdev_tap_class only. */
>>> +    int tap_fd;
>>> +    bool present;               /* If the device is present in the namespace */
>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> +
>>> +    /* LAG information. */
>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> +
>>> +    /* AF_XDP information */
>>> +#ifdef HAVE_AF_XDP
>>> +    struct xsk_socket_info *xsk[MAX_XSKQ];
>>> +    int requested_n_rxq;
>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
>>> +    int xdp_flags, xdp_bind_flags;
>>> +#endif
>>> +};
>>> +
>>> +static struct netdev_linux *
>>> +netdev_linux_cast(const struct netdev *netdev)
>>> +{
>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +}
>>> +
>>> +#endif /* netdev-linux-private.h */
>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>> index f75d73fd39f8..1f190406d145 100644
>>> --- a/lib/netdev-linux.c
>>> +++ b/lib/netdev-linux.c
>>> @@ -17,6 +17,7 @@
>>>  #include <config.h>
>>>
>>>  #include "netdev-linux.h"
>>> +#include "netdev-linux-private.h"
>>>
>>>  #include <errno.h>
>>>  #include <fcntl.h>
>>> @@ -54,6 +55,7 @@
>>>  #include "fatal-signal.h"
>>>  #include "hash.h"
>>>  #include "openvswitch/hmap.h"
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-provider.h"
>>>  #include "netdev-tc-offloads.h"
>>>  #include "netdev-vport.h"
>>> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>>>
>>> -struct netdev_linux {
>>> -    struct netdev up;
>>> -
>>> -    /* Protects all members below. */
>>> -    struct ovs_mutex mutex;
>>> -
>>> -    unsigned int cache_valid;
>>> -
>>> -    bool miimon;                    /* Link status of last poll. */
>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> -    struct timer miimon_timer;
>>> -
>>> -    int netnsid;                    /* Network namespace ID. */
>>> -    /* The following are figured out "on demand" only.  They are only valid
>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> -    int ifindex;
>>> -    struct eth_addr etheraddr;
>>> -    int mtu;
>>> -    unsigned int ifi_flags;
>>> -    long long int carrier_resets;
>>> -    uint32_t kbits_rate;        /* Policing data. */
>>> -    uint32_t kbits_burst;
>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> -                                   0 or an errno value. */
>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> -
>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> -
>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> -    struct tc *tc;
>>> -
>>> -    /* For devices of class netdev_tap_class only. */
>>> -    int tap_fd;
>>> -    bool present;               /* If the device is present in the namespace */
>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> -
>>> -    /* LAG information. */
>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> -};
>>>
>>>  struct netdev_rxq_linux {
>>>      struct netdev_rxq up;
>>> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
>>>      return netdev_class->run == netdev_linux_run;
>>>  }
>>>
>>> +#if HAVE_AF_XDP
>>>  static bool
>>> -is_tap_netdev(const struct netdev *netdev)
>>> +is_afxdp_netdev(const struct netdev *netdev)
>>>  {
>>> -    return netdev_get_class(netdev) == &netdev_tap_class;
>>> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
>>>  }
>>> -
>>> -static struct netdev_linux *
>>> -netdev_linux_cast(const struct netdev *netdev)
>>> +#else
>>> +static bool
>>> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
>>>  {
>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>> -
>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +    return false;
>>> +}
>>> +#endif
>>> +static bool
>>> +is_tap_netdev(const struct netdev *netdev)
>>> +{
>>> +    return netdev_get_class(netdev) == &netdev_tap_class;
>>>  }
>>>
>>>  static struct netdev_rxq_linux *
>>> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>>>          atomic_count_dec(&miimon_cnt);
>>>      }
>>>
>>> +#if HAVE_AF_XDP
>>> +    if (is_afxdp_netdev(netdev_)) {
>>> +        xsk_destroy_all(netdev_);
>>> +    }
>>> +#endif
>>>      ovs_mutex_destroy(&netdev->mutex);
>>>  }
>>>
>>> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>>>      rx->is_tap = is_tap_netdev(netdev_);
>>>      if (rx->is_tap) {
>>>          rx->fd = netdev->tap_fd;
>>> -    } else {
>>> +    } else if (!is_afxdp_netdev(netdev_)) {
>>>          struct sockaddr_ll sll;
>>>          int ifindex, val;
>>>          /* Result of tcpdump -dd inbound */
>>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>  {
>>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>>      struct netdev *netdev = rx->up.netdev;
>>> -    struct dp_packet *buffer;
>>> +    struct dp_packet *buffer = NULL;
>>>      ssize_t retval;
>>>      int mtu;
>>>
>>> +#if HAVE_AF_XDP
>>> +    if (is_afxdp_netdev(netdev)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +        int qid = rxq_->queue_id;
>>> +
>>> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
>>> +    }
>>
>> Maybe it's better to just implement '.rxq_recv' inside netdev-afxdp.c ?
>> Also, you missed clearing the '*qfill'.
>>
>>> +#endif
>>>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>>>          mtu = ETH_PAYLOAD_MAX;
>>>      }
>>> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>      /* Assume Ethernet port. No need to set packet_type. */
>>>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>>>                                             DP_NETDEV_HEADROOM);
>>> +
>>>      retval = (rx->is_tap
>>>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>>>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
>>> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>>      int error = 0;
>>>      int sock = 0;
>>>
>>> -    if (!is_tap_netdev(netdev_)) {
>>> +    if (!is_tap_netdev(netdev_) &&
>>> +        !is_afxdp_netdev(netdev_)) {
>>>          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>>>              error = EOPNOTSUPP;
>>>              goto free_batch;
>>> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>>          }
>>>
>>>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
>>> +#if HAVE_AF_XDP
>>> +    } else if (is_afxdp_netdev(netdev_)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct umem_pool *first_mpool;
>>> +        struct dp_packet *packet;
>>> +
>>> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
>>> +
>>> +        /* all packets must come from the same umem pool
>>> +         * and have DPBUF_AFXDP type, otherwise free one-by-one
>>> +         */
>>> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +            if (packet->source != DPBUF_AFXDP) {
>>> +                goto free_batch;
>>> +            }
>>> +
>>> +            xpacket = dp_packet_cast_afxdp(packet);
>>> +            if (i == 0) {
>>> +                first_mpool = xpacket->mpool;
>>> +                continue;
>>> +            }
>>> +            if (xpacket->mpool != first_mpool) {
>>> +                goto free_batch;
>>> +            }
>>> +        }
>>> +        /* free in batch */
>>> +        free_afxdp_buf_batch(batch);
>>> +        return error;
>>
>>
>> There is a lot of afxdp-specific code here and 'netdev_linux_send' doesn't
>> provide any magic, i.e. has no real code suitable for all netdev types.
>> Maybe it's better to just implement its own '.send' function inside netdev-afxdp.c ?
> 
> Yes, I will do that.
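
A rough sketch of what a per-class '.send' buys, using a toy function-pointer table. All toy_* names are hypothetical; the real struct netdev_class has many more members and a different send signature:

```c
#include <assert.h>
#include <stddef.h>

struct toy_netdev;

/* Toy class table: each netdev type supplies its own send callback. */
struct toy_class {
    const char *type;
    int (*send)(struct toy_netdev *dev, int qid, int n_pkts);
};

struct toy_netdev {
    const struct toy_class *class_;
    int sent_by_sock;   /* Packets sent through the socket path. */
    int sent_by_xsk;    /* Packets sent through the AF_XDP path. */
};

/* Socket path: knows nothing about AF_XDP. */
static int
toy_linux_send(struct toy_netdev *dev, int qid, int n_pkts)
{
    (void) qid;
    dev->sent_by_sock += n_pkts;
    return 0;
}

/* AF_XDP path: would live in its own file, next to the xsk helpers. */
static int
toy_afxdp_send(struct toy_netdev *dev, int qid, int n_pkts)
{
    (void) qid;
    dev->sent_by_xsk += n_pkts;
    return 0;
}

static const struct toy_class toy_linux_class = {
    .type = "system",
    .send = toy_linux_send,
};

static const struct toy_class toy_afxdp_class = {
    .type = "afxdp",
    .send = toy_afxdp_send,
};

/* Generic layer: dispatches through the class, no type checks needed. */
static int
toy_netdev_send(struct toy_netdev *dev, int qid, int n_pkts)
{
    assert(dev->class_->send != NULL);
    return dev->class_->send(dev, qid, n_pkts);
}
```

With this layout the socket send path needs no is_afxdp_netdev()-style branch, and the AF_XDP path needs no #if guards inside the shared function.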
> 
>>
>>> +#endif
>>>      } else {
>>>          error = netdev_linux_tap_batch_send(netdev_, batch);
>>>      }
>>> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "system",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_linux_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
>>>  const struct netdev_class netdev_tap_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      .type = "tap",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct_tap,
>>>      .get_stats = netdev_tap_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3343,10 +3352,26 @@ const struct netdev_class netdev_internal_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "internal",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_internal_get_stats,
>>>      .get_status = netdev_internal_get_status,
>>>  };
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +const struct netdev_class netdev_afxdp_class = {
>>> +    NETDEV_LINUX_CLASS_COMMON,
>>> +    .type = "afxdp",
>>> +    .is_pmd = true,
>>> +    .construct = netdev_linux_construct,
>>> +    .get_stats = netdev_linux_get_stats,
>>> +    .get_status = netdev_linux_get_status,
>>> +    .set_config = netdev_afxdp_set_config,
>>> +    .get_config = netdev_afxdp_get_config,
>>> +    .reconfigure = netdev_afxdp_reconfigure,
>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
>>> +};
>>> +#endif
>>>
>>>
>>>  #define CODEL_N_QUEUES 0x0000
>>> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
>>> index 17ca9120168a..b812e64cb078 100644
>>> --- a/lib/netdev-linux.h
>>> +++ b/lib/netdev-linux.h
>>> @@ -19,6 +19,20 @@
>>>
>>>  #include <stdint.h>
>>>  #include <stdbool.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>>
>>>  /* These functions are Linux specific, so they should be used directly only by
>>>   * Linux-specific code. */
>>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>>> index fb0c27e6e8e8..d433818f7064 100644
>>> --- a/lib/netdev-provider.h
>>> +++ b/lib/netdev-provider.h
>>> @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class;
>>>  #endif
>>>  extern const struct netdev_class netdev_internal_class;
>>>  extern const struct netdev_class netdev_tap_class;
>>> -
>>> +#if HAVE_AF_XDP
>>> +extern const struct netdev_class netdev_afxdp_class;
>>> +#endif
>>>  #ifdef  __cplusplus
>>>  }
>>>  #endif
>>> diff --git a/lib/netdev.c b/lib/netdev.c
>>> index 7d7ecf6f0946..e2fae37d5a5e 100644
>>> --- a/lib/netdev.c
>>> +++ b/lib/netdev.c
>>> @@ -146,6 +146,9 @@ netdev_initialize(void)
>>>          netdev_register_provider(&netdev_internal_class);
>>>          netdev_register_provider(&netdev_tap_class);
>>>          netdev_vport_tunnel_register();
>>> +#ifdef HAVE_AF_XDP
>>> +        netdev_register_provider(&netdev_afxdp_class);
>>> +#endif
>>>  #endif
>>>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>>>          netdev_register_provider(&netdev_tap_class);
>>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
>>> new file mode 100644
>>> index 000000000000..2d80e74d69e4
>>> --- /dev/null
>>> +++ b/lib/xdpsock.c
>>> @@ -0,0 +1,239 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +#include <config.h>
>>> +
>>> +#include "xdpsock.h"
>>> +
>>> +#include <ctype.h>
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <stdarg.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/stat.h>
>>> +#include <sys/types.h>
>>> +#include <syslog.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "async-append.h"
>>> +#include "coverage.h"
>>> +#include "dirs.h"
>>> +#include "dp-packet.h"
>>> +#include "openvswitch/compiler.h"
>>> +#include "openvswitch/vlog.h"
>>> +#include "ovs-atomic.h"
>>> +#include "ovs-thread.h"
>>> +#include "sat-math.h"
>>> +#include "socket-util.h"
>>> +#include "svec.h"
>>> +#include "syslog-direct.h"
>>> +#include "syslog-libc.h"
>>> +#include "syslog-provider.h"
>>> +#include "timeval.h"
>>> +#include "unixctl.h"
>>> +#include "util.h"
>>> +
>>> +static inline void
>>> +ovs_spinlock_init(ovs_spinlock_t *sl)
>>> +{
>>> +    atomic_init(&sl->locked, 0);
>>> +}
>>> +
>>> +static inline void
>>> +ovs_spin_lock(ovs_spinlock_t *sl)
>>> +{
>>> +    int exp = 0, locked = 0;
>>> +
>>> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed)) {
>>> +        locked = 1;
>>> +        while (locked) {
>>> +            atomic_read_relaxed(&sl->locked, &locked);
>>> +        }
>>> +        exp = 0;
>>> +    }
>>> +}
>>> +
>>> +static inline void
>>> +ovs_spin_unlock(ovs_spinlock_t *sl)
>>> +{
>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
>>> +}
>>> +
>>> +static inline int OVS_UNUSED
>>> +ovs_spin_trylock(ovs_spinlock_t *sl)
>>> +{
>>> +    int exp = 0;
>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed);
>>> +}
>>> +
>>> +inline int
>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(ptr, addrs, n * sizeof(void *));
>>> +    umemp->index += n;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_push_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void
>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +    umemp->array[umemp->index++] = addr;
>>> +}
>>> +
>>> +void
>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
>>> +        /* stack is overflow, this should not happen */
>>> +        OVS_NOT_REACHED();
>>> +    }
>>> +
>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    __umem_elem_push(umemp, addr);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +}
>>> +
>>> +inline int
>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umemp->index -= n;
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(addrs, ptr, n * sizeof(void *));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void *
>>> +__umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    return umemp->array[--umemp->index];
>>> +}
>>> +
>>> +void *
>>> +umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ptr = __umem_elem_pop(umemp);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ptr;
>>> +}
>>> +
>>> +void **
>>> +__umem_pool_alloc(unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(void *)) == 0);
>>> +    memset(bufs, 0, size * sizeof(void *));
>>> +    return (void **)bufs;
>>> +}
>>> +
>>> +unsigned int
>>> +umem_elem_count(struct umem_pool *mpool)
>>> +{
>>> +    return mpool->index;
>>> +}
>>> +
>>> +int
>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
>>> +{
>>> +    umemp->array = __umem_pool_alloc(size);
>>> +    if (!umemp->array) {
>>> +        OVS_NOT_REACHED();
>>> +    }
>>> +
>>> +    umemp->size = size;
>>> +    umemp->index = 0;
>>> +    ovs_spinlock_init(&umemp->mutex);
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +umem_pool_cleanup(struct umem_pool *umemp)
>>> +{
>>> +    free(umemp->array);
>>> +}
>>> +
>>> +/* AF_XDP metadata init/destroy */
>>> +int
>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    /* TODO: check HAVE_POSIX_MEMALIGN  */
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(struct dp_packet_afxdp)) == 0);
>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
>>> +
>>> +    xp->array = bufs;
>>> +    xp->size = size;
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
>>> +{
>>> +    free(xp->array);
>>> +}
>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>> new file mode 100644
>>> index 000000000000..aabaa8e5df24
>>> --- /dev/null
>>> +++ b/lib/xdpsock.h
>>> @@ -0,0 +1,123 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <bpf/libbpf.h>
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <getopt.h>
>>> +#include <libgen.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/if_link.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <locale.h>
>>> +#include <net/if.h>
>>> +#include <poll.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <sys/mman.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define NUM_FRAMES      4096
>>> +#define PROD_NUM_DESCS  512
>>> +#define CONS_NUM_DESCS  512
>>> +
>>> +#ifdef USE_XSK_DEFAULT
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +#endif
>>
>> Should there be ifdef-else-endif ?
> 
> good catch, thanks.
> William
> 
>
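
For reference, the ifdef-else-endif structure suggested above might look like the sketch below. The numeric stand-ins for the two XSK_RING_*__DEFAULT_NUM_DESCS macros are assumptions added only so the fragment compiles without <bpf/xsk.h>, and ring_sizes_are_valid() is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>

/* Stand-ins so this compiles without libbpf; the real values come from
 * <bpf/xsk.h>.  2048 is an assumption used for illustration only. */
#ifndef XSK_RING_PROD__DEFAULT_NUM_DESCS
#define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048
#endif
#ifndef XSK_RING_CONS__DEFAULT_NUM_DESCS
#define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048
#endif

/* Exactly one pair of definitions is active, so nothing is redefined. */
#ifdef USE_XSK_DEFAULT
#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
#else
#define PROD_NUM_DESCS 512
#define CONS_NUM_DESCS 512
#endif

/* Hypothetical helper: AF_XDP ring sizes must be powers of two. */
int
ring_sizes_are_valid(void)
{
    assert(PROD_NUM_DESCS > 0 && CONS_NUM_DESCS > 0);
    return (PROD_NUM_DESCS & (PROD_NUM_DESCS - 1)) == 0
           && (CONS_NUM_DESCS & (CONS_NUM_DESCS - 1)) == 0;
}
```

With this layout the fixed 512-entry rings are only the fallback, so building with -DUSE_XSK_DEFAULT no longer triggers a macro redefinition.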
Eelco Chaudron May 17, 2019, 10:23 a.m. UTC | #7
Hi William,

First a list of issues I found during some basic testing...

- When I restart or stop OVS (using the systemctl interface as found in
   RHEL), it does not clean up the BPF program, causing the restart to fail:

   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface eno1 new configuration

   I need to manually run "ip link set dev eno1 xdp off" to make it recover.


- When I remove a bridge, I get an EMER-level assertion failure in the
   revalidator:

   2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
   2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111: assertion !fd != !wevent failed in poll_create_node()

   Easy to replicate with this:

     $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0 datapath_type=netdev
     $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1 type="afxdp" options:xdpmode=drv
     $ ovs-vsctl del-br ovs_pvp_br0


- High PMD usage in the statistics, even with no packets. Is this expected?

   $ ovs-appctl dpif-netdev/pmd-rxq-show
   pmd thread numa_id 0 core_id 1:
     isolated : false
     port: dpdk0             queue-id:  0  pmd usage:  0 %
     port: eno1              queue-id:  0  pmd usage: 49 %

   It goes up slowly and gets stuck at 49%


- When doing the PVP testing I noticed that the physical port has odd/no
   tx statistics:

   $ ovs-ofctl dump-ports ovs_pvp_br0
   OFPST_PORT reply (xid=0x2): 3 ports
     port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
              tx pkts=0, bytes=0, drop=0, errs=0, coll=0
     port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0, frame=0, over=0, crc=0
              tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
     port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0, over=0, crc=0
              tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0


- Packets larger than 1028 bytes are dropped. I guess this needs to be
   fixed, and we need to state that jumbo frames are not supported. Are you
   planning on adding this?

   Currently I can find no mention of an MTU limitation in the
   documentation, nor any code that prevents the MTU from being changed
   above the supported limit.
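
   To illustrate the kind of guard I mean, here is a rough sketch of a
   set_mtu-style check. The frame size (2048) and headroom (256) are
   assumed values standing in for XSK_UMEM__DEFAULT_FRAME_SIZE and
   XDP_PACKET_HEADROOM, and afxdp_max_mtu() and afxdp_check_mtu() are
   hypothetical names, not functions from the patch. It does not explain
   the 1028-byte drop reported above; it only shows how an MTU change
   could be rejected up front.

```c
#include <assert.h>
#include <errno.h>

/* Assumed values for illustration; the patch takes them from
 * XSK_UMEM__DEFAULT_FRAME_SIZE and XDP_PACKET_HEADROOM. */
#define SKETCH_FRAME_SIZE      2048
#define SKETCH_FRAME_HEADROOM  256
#define SKETCH_L2_OVERHEAD     (14 + 4)   /* Ethernet header + one VLAN tag. */

/* Hypothetical helper: largest MTU whose frame still fits in one umem
 * chunk after the headroom and L2 overhead are subtracted. */
int
afxdp_max_mtu(void)
{
    return SKETCH_FRAME_SIZE - SKETCH_FRAME_HEADROOM - SKETCH_L2_OVERHEAD;
}

/* Hypothetical set_mtu-style check: 0 if acceptable, EINVAL otherwise. */
int
afxdp_check_mtu(int mtu)
{
    assert(afxdp_max_mtu() > 0);

    if (mtu <= 0 || mtu > afxdp_max_mtu()) {
        return EINVAL;
    }
    return 0;
}
```

   A real implementation would also need to account for whatever metadata
   is stored alongside the packet in the umem chunk.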


- ovs-vswitchd still crashes or stops forwarding packets when trying to do
   PVP testing with Qemu that has a TAP interface doing XDP and running
   packets at wire speed to the 10G interface.

   With lower-volume traffic it seems to work: at a 1% traffic rate it
   forwards packets without any problems (148,771 pps). If I go to 10%,
   the first couple of packets pass, then it stops forwarding. If it's not
   crashing, I still see packets being received by the eno1 flow rules,
   but no packets make it to the VM.

     Program terminated with signal SIGSEGV, Segmentation fault.
     #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0, 
batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
     654	    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, 
(void **)elems_pop);
     [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
     Missing separate debuginfos, use: dnf debuginfo-install 
openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
     (gdb) bt
     #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0, 
batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
     #1  0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540, 
qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized 
out>) at lib/netdev-linux.c:1486
     #2  0x0000000000906051 in netdev_send (netdev=<optimized out>, 
qid=qid@entry=0, batch=batch@entry=0x7fc928005570, 
concurrent_txq=concurrent_txq@entry=true)
         at lib/netdev.c:797
     #3  0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port 
(pmd=pmd@entry=0x7fc95e735010, p=p@entry=0x7fc928005540) at 
lib/dpif-netdev.c:4185
     #4  0x00000000008d2faf in dp_netdev_pmd_flush_output_packets 
(pmd=pmd@entry=0x7fc95e735010, force=force@entry=false) at 
lib/dpif-netdev.c:4225
     #5  0x00000000008db317 in dp_netdev_pmd_flush_output_packets 
(force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
     #6  dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fc95e735010, 
rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
     #7  0x00000000008db67d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5446
     #8  0x000000000095c96d in ovsthread_wrapper (aux_=<optimized out>) 
at lib/ovs-thread.c:352
     #9  0x00007fc9789d62de in start_thread () from 
/lib64/libpthread.so.0
     #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
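
   Frame #0 shows xsk=0x0 being dereferenced. Independent of why the
   socket is missing, the send path could refuse to run without one. A
   minimal sketch of such a guard follows; struct xsk_sketch and
   batch_send_checked() are hypothetical stand-ins for the patch's
   per-queue socket state and batch-send function, and the choice of
   ENXIO is only an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-queue socket state in the patch. */
struct xsk_sketch {
    int free_elems;     /* Number of umem elements available for tx. */
};

/* Returns 0 on success or a positive errno value.  A NULL 'xsk' is
 * reported as an error instead of being dereferenced. */
int
batch_send_checked(struct xsk_sketch *xsk, int count)
{
    assert(count >= 0);

    if (xsk == NULL) {
        return ENXIO;       /* Socket not (yet) configured for this queue. */
    }
    if (count > xsk->free_elems) {
        return ENOMEM;      /* Not enough umem elements for the batch. */
    }
    xsk->free_elems -= count;
    return 0;
}
```

   This only turns the crash into an error return; the root cause (why
   the queue has no socket during PVP traffic) still needs to be found.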


- make check-afxdp is failing for me; however, make check-kernel works
   fine. I did not dive into it too much, but it fails here for all test
   cases. This is the same build I use for testing.

   ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure  --
   --- /dev/null	2019-05-16 09:09:33.445562692 -0400
   +++ /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr	2019-05-17 05:46:20.506814939 -0400
   @@ -0,0 +1,2 @@
   +ovs-vsctl: Error detected while setting up 'br0'.  See ovs-vswitchd log for details.
   +ovs-vsctl: The default log directory is "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
   ovsdb-server.log:
  > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
  > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.90
   ovs-vswitchd.log:
  > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
  > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU cores on NUMA node 0
  > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 28 CPU cores
  > 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock: connecting...
  > 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock: connected
  > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.90
  > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev: creating tap device failed: Device or resource busy
  > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev already exists but cannot be opened: No such device
  > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open datapath of type netdev: No such device
  > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open datapath br0: No such device
  > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge br0: No such device
   1. system-afxdp-traffic.at:3:  FAILED (system-afxdp-traffic.at:4)




The following might be useful when combining DPDK and AF_XDP:

   Currently, DPDK and AF_XDP polling can be combined on a single PMD
   thread. It might be nice to have an option to not do this, i.e. have
   separate PMD threads for each type. I know we can do this by assigning
   specific PMDs to queues, but this will disable auto-balancing. This
   will also help later if we add poll() mode support for AF_XDP.


For other review comments, see inline below. I reviewed the code, not the
unit tests or automake changes.

Cheers,


Eelco


On 10 May 2019, at 1:54, William Tu wrote:

> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket 
> type
> built upon the eBPF and XDP technology.  It is aims to have comparable
> performance to DPDK but cooperate better with existing kernel's 
> networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP 
> program
> attached to the netdev, by-passing a couple of Linux kernel's 
> subsystems
> As a result, AF_XDP socket shows much better performance than 
> AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this is not
> compiled in.
>
> Signed-off-by: William Tu <u9012063@gmail.com>
>
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improve)
> - remove malloc dp_packet, instead put dp_packet in umem
>
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread 
> library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: 
> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
>
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
>
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
>
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
>
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedbacks from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
>
> v7-v8:
> - Address feedback from Ilya at:
>   https://patchwork.ozlabs.org/patch/1095019/
> - add netdev-linux-private.h
> - fix afxdp reconfigure issue
> - sort include headers
> - remove unnecessary OVS_UNUSED
> - coding style fixes
> - error case handling and memory leak
> ---
>  Documentation/automake.mk             |   1 +
>  Documentation/index.rst               |   1 +
>  Documentation/intro/install/afxdp.rst | 479 +++++++++++++++++
>  Documentation/intro/install/index.rst |   1 +
>  acinclude.m4                          |  32 ++
>  configure.ac                          |   1 +
>  lib/automake.mk                       |  13 +
>  lib/dp-packet.c                       |  33 ++
>  lib/dp-packet.h                       |  22 +-
>  lib/dpif-netdev-perf.h                |  14 +
>  lib/netdev-afxdp.c                    | 727 +++++++++++++++++++++++++
>  lib/netdev-afxdp.h                    |  53 ++
>  lib/netdev-linux-private.h            | 124 +++++
>  lib/netdev-linux.c                    | 137 +++--
>  lib/netdev-linux.h                    |  14 +
>  lib/netdev-provider.h                 |   4 +-
>  lib/netdev.c                          |   3 +
>  lib/xdpsock.c                         | 239 +++++++++
>  lib/xdpsock.h                         | 123 +++++
>  tests/automake.mk                     |  17 +
>  tests/system-afxdp-macros.at          | 153 ++++++
>  tests/system-afxdp-testsuite.at       |  26 +
>  tests/system-afxdp-traffic.at         | 978 
> ++++++++++++++++++++++++++++++++++
>  23 files changed, 3137 insertions(+), 58 deletions(-)
>  create mode 100644 Documentation/intro/install/afxdp.rst
>  create mode 100644 lib/netdev-afxdp.c
>  create mode 100644 lib/netdev-afxdp.h
>  create mode 100644 lib/netdev-linux-private.h
>  create mode 100644 lib/xdpsock.c
>  create mode 100644 lib/xdpsock.h
>  create mode 100644 tests/system-afxdp-macros.at
>  create mode 100644 tests/system-afxdp-testsuite.at
>  create mode 100644 tests/system-afxdp-traffic.at
>
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst 
> b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..1222b433dbbb
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,479 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); 
> you may
> +      not use this file except in compliance with the License. You 
> may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, 
> software
> +      distributed under the License is distributed on an "AS IS" 
> BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied. See the
> +      License for the specific language governing permissions and 
> limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux 
> socket type
> +built upon the eBPF and XDP technology.  It is aims to have 
> comparable
> +performance to DPDK but cooperate better with existing kernel's 
> networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP 
> program
> +attached to the netdev, by-passing a couple of Linux kernel's 
> subsystems.
> +As a result, AF_XDP socket shows much better performance than 
> AF_PACKET.
> +For more details about AF_XDP, please see linux kernel's
> +Documentation/networking/af_xdp.rst
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, i.e., system, tap, or
> +internal.  The AF_XDP feature adds a new netdev types called
> +"afxdp", and implement its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the proposed architecture re-uses the existing
> +userspace dpif-netdev datapath.  As a result, most of
> +the packet processing happens at the userspace instead of
> +linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building 
> Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but 
> not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- If possible, run **./xdpsock -r -N -z -i <your device>** under
> +  linux/samples/bpf.  This is the OVS indepedent benchmark tools for 
> AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF 
> support.
> +Frist, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the 
> tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure having 5.0+ kernel
> +  make check-afxdp
> +
> +if a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd --disable-system
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +.. note::
> +   OVS AF_XDP netdev is using the userspace datapath, the same 
> datapath
> +   as used by OVS-DPDK.  So it requires --disable-system for 
> ovs-vswitchd
> +   and datapath_type=netdev when adding a new bridge.

As mentioned earlier offline, I think --disable-system can be removed, as
the kernel and userspace datapaths can run at the same time.

> +
> +Make sure your device driver support AF_XDP, and to use 1 PMD (on 
> core 4)
> +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or 
> "skb"::

How should options:xdpmode behave when it is not specified? I would prefer
that, if the option is not set, OVS tries drv first and falls back to skb
if that fails.

We also need to add these new options to the vswitch.xml file.
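
To make the suggested default concrete, here is a rough sketch of the drv-then-skb fallback. All names are hypothetical (attach_with_fallback(), the attach_fn callback, the xdp_mode_sketch enum); none of them come from the patch or from libbpf, and the demo stub only pretends that native mode is unsupported.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum xdp_mode_sketch { XDP_SKETCH_DRV, XDP_SKETCH_SKB };

/* Hypothetical attach callback: returns 0 on success, a positive errno
 * value on failure. */
typedef int (*attach_fn)(enum xdp_mode_sketch mode);

/* Try native (drv) mode first, then fall back to generic (skb) mode.
 * On success, stores the mode that worked in '*chosen' and returns 0.
 * Otherwise returns the error from the last mode tried. */
int
attach_with_fallback(attach_fn attach, enum xdp_mode_sketch *chosen)
{
    static const enum xdp_mode_sketch order[] = {
        XDP_SKETCH_DRV, XDP_SKETCH_SKB
    };
    int error = EINVAL;
    size_t i;

    assert(attach != NULL && chosen != NULL);

    for (i = 0; i < sizeof order / sizeof order[0]; i++) {
        error = attach(order[i]);
        if (!error) {
            *chosen = order[i];
            return 0;
        }
    }
    return error;
}

/* Demo stub: pretends the driver lacks native XDP support. */
int
attach_no_drv_support(enum xdp_mode_sketch mode)
{
    return mode == XDP_SKETCH_DRV ? EOPNOTSUPP : 0;
}
```

An explicitly configured xdpmode would bypass the loop and attach only in the requested mode, so a user who asks for drv still gets a hard failure rather than a silent downgrade.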

> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> +

Add some text saying that pmd-rxq-affinity is not a requirement; without
it, the system will automatically (re)assign queues to PMDs.
Also note that cores used by pmd-rxq-affinity are not shared with or used
by floating PMDs.

> +To validate that the bridge has successfully instantiated, you can 
> use the::
> +
> +  ovs-vsctl show
> +
> +should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debug by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> +section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good 
> introduction
> +about AF_XDP current and future work.
> +
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] 
> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] 
> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] 
> https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, 
> allowing PMD
> +to keep polling the AF_XDP queues without any interferences from 
> kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by 
> AF_XDP, pmd
> +   running cores, device plug-in slot)

How can you do this? The code does not take NUMA into account, and the
memory is allocated with posix_memalign(), so there is no way to know
which NUMA node it ends up on.

> +#. Isolate your CPU by doing isolcpu at grub configure.
> +
> +#. IRQ should not set to pmd running core.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system 
> calls.
> +

Maybe be more consistent: either one or two blank lines before a heading?

> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use linux perf tool to see where your cpu
> +spends its cycle::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach \
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's numa ID is always 0, need a way to find numa id from a 
> netdev.
> +#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A 
> possible
> +   work-around is to use OpenFlow meter action.
> +#. AF_XDP device added to bridge, remove, and added again will fail.
> +#. Most of the tests are done using i40e single port. Multiple ports 
> and
> +   also ixgbe driver also needs to be tested.
> +#. No latency test result (TODO items)
> +
> +
> +make check-afxdp
> +----------------
> +When executing 'make check-afxdp', OVS creates namespaces, sets up 
> AF_XDP on
> +veth devices and kicks start the testing.  So far we have the 
> following test
> +cases::
> +
> + AF_XDP netdev datapath-sanity
> +
> +  1: datapath - ping between two ports               ok
> +  2: datapath - ping between two ports on vlan       ok
> +  3: datapath - ping6 between two ports              ok
> +  4: datapath - ping6 between two ports on vlan      ok
> +  5: datapath - ping over vxlan tunnel               ok
> +  6: datapath - ping over vxlan6 tunnel              ok
> +  7: datapath - ping over gre tunnel                 ok
> +  8: datapath - ping over erspan v1 tunnel           ok
> +  9: datapath - ping over erspan v2 tunnel           ok
> + 10: datapath - ping over ip6erspan v1 tunnel        ok
> + 11: datapath - ping over ip6erspan v2 tunnel        ok
> + 12: datapath - ping over geneve tunnel              ok
> + 13: datapath - ping over geneve6 tunnel             ok
> + 14: datapath - clone action                         ok
> + 15: datapath - basic truncate action                ok
> +
> + conntrack
> +
> + 16: conntrack - controller                          ok
> + 17: conntrack - force commit                        ok
> + 18: conntrack - ct flush by 5-tuple                 ok
> + 19: conntrack - IPv4 ping                           ok
> + 20: conntrack - get_nconns and get/set_maxconns     ok
> + 21: conntrack - IPv6 ping                           ok
> +
> + system-ovn
> +
> + 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
> + 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
> + 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
> + 25: ovn -- load-balancing                           ok
> + 26: ovn -- load-balancing - same subnet.            ok
> + 27: ovn -- load balancing in gateway router         ok
> + 28: ovn -- multiple gateway routers, load-balancing ok
> + 29: ovn -- load balancing in router with gateway router port ok
> + 30: ovn -- DNAT and SNAT on distributed router - N/S ok
> + 31: ovn -- DNAT and SNAT on distributed router - E/W ok
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as the physical NIC and a tap device connected to
> +the VM.  First, start OVS, then add the physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0

Maybe add the tap as an AF_XDP port as well; otherwise it will be an
AF_PACKET interface polled in the main thread.

> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +The performance number I got is around 700Kpps.
> +This is due to using the kernel's tap interface, which requires copying
> +each packet from the umem buffer in userspace into the kernel.
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Set up the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
> +

Remove the xdpmode=skb above. Also, see the comment above about the
PF_PACKET interface in bridge_run(). I would advise against using this,
and you might want to remove it.

> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Set up the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst 
> b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index b532a4579266..5782f7e4bc2e 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -221,6 +221,38 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP 
> support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP 
> support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP 
> support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and 
> enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 505e3d041e93..29c90b73f836 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct 
> stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index cc5dccf39d6b..686e57f8c472 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-tc-offloads.c \
>  	lib/netdev-tc-offloads.h \
>  	lib/netlink-conntrack.c \
> @@ -409,6 +414,14 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..7d086dc5e860 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -22,6 +22,9 @@
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif

Why the protection above? You do not do this in netdev-linux.c.
Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?

>  static void
>  dp_packet_init__(struct dp_packet *b, size_t allocated, enum 
> dp_packet_source source)
> @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base, 
> size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>
> +#if HAVE_AF_XDP
> +/* Initialize 'b' as an empty dp_packet that contains
> + * memory starting at AF_XDP umem base.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t 
> allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif

I guess the above #ifdef saves some bytes if not built with AF_XDP, but we
do not seem to do this for other functions either, such as
dp_packet_init_dpdk().

>  /* Initializes 'b' as an empty dp_packet that contains the 
> 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the 
> stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  
> It could
> @@ -122,6 +146,11 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +#ifdef HAVE_AF_XDP
> +            free_afxdp_buf(b);
> +#endif

If you move the #ifdef HAVE_AF_XDP check to the include file (see 
comment above), you can use the DPDK inline trick and remove the #ifdef 
above.
See lib/netdev-dpdk.h
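
To make the suggestion concrete, here is a minimal sketch of the stub pattern, assuming netdev-afxdp.h owns the HAVE_AF_XDP check. The no-op body and the include-guard name are assumptions for illustration; only free_afxdp_buf() itself comes from the patch:

```c
/* Sketch of netdev-afxdp.h: the header owns the HAVE_AF_XDP check, so
 * callers can include it and call free_afxdp_buf() unconditionally. */
#ifndef NETDEV_AFXDP_H
#define NETDEV_AFXDP_H 1

struct dp_packet;

#ifdef HAVE_AF_XDP
/* Real implementation lives in netdev-afxdp.c. */
void free_afxdp_buf(struct dp_packet *);
#else
/* AF_XDP is not compiled in: provide a no-op so that callers do not need
 * their own #ifdef around the call. */
static inline void
free_afxdp_buf(struct dp_packet *p)
{
    (void) p;   /* Nothing to free. */
}
#endif

#endif /* netdev-afxdp.h */
```

With this in place, the DPBUF_AFXDP branch in dp_packet_uninit() can call free_afxdp_buf(b) without any preprocessor guard.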

> +            return;
>          }
>      }
>  }
> @@ -248,6 +277,9 @@ dp_packet_resize__(struct dp_packet *b, size_t 
> new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +465,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == 
> dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..0f533201f956 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,10 @@
>  #include <rte_mbuf.h>
>  #endif
>
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
> +

See the comment in dp-packet.c; if that is done, all #ifdef HAVE_AF_XDP in
this file can be removed.

>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +46,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated 
> memory.
>                                  * ref to dp_packet_init_dpdk() in 
> dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +94,13 @@ struct dp_packet {
>      };
>  };
>
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +134,9 @@ static inline const void 
> *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +198,12 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>
> +#ifdef HAVE_AF_XDP
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +#endif
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..cc91720fad6e 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif HAVE_AF_XDP

We need to add support for at least ARM and PPC; I am not sure how to do
this nicely.
This code is already a quick cut/paste from DPDK. What about the license?
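
One possible shape for a portable counter, as a rough sketch only. The helper name afxdp_read_cycles() is hypothetical, and the non-x86 branches are my assumption of what would be acceptable here. Note that the ARM generic timer and the PPC timebase tick at a fixed timer frequency rather than the CPU clock, so the raw values are not comparable across architectures:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper, not part of the patch: read a free-running counter
 * on each architecture, falling back to the monotonic clock elsewhere. */
static inline uint64_t
afxdp_read_cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
    /* Compiler builtin for RDTSC (GCC and Clang). */
    return __builtin_ia32_rdtsc();
#elif defined(__aarch64__)
    uint64_t cnt;

    /* Virtual count of the generic timer; ticks at CNTFRQ_EL0. */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (cnt));
    return cnt;
#elif defined(__powerpc64__)
    /* Compiler builtin that reads the timebase register. */
    return __builtin_ppc_get_timebase();
#else
    struct timespec ts;

    /* Generic fallback: nanoseconds from the monotonic clock. */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
#endif
}
```

cycles_counter_update() could then do `return s->last_tsc = afxdp_read_cycles();` in the HAVE_AF_XDP branch, which also avoids carrying the DPDK-derived inline assembly.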

> +    /* This is x86-specific instructions. */
> +    union {
> +        uint64_t tsc_64;
> +        struct {
> +            uint32_t lo_32;
> +            uint32_t hi_32;
> +        };
> +    } tsc;
> +    asm volatile("rdtsc" :
> +             "=a" (tsc.lo_32),
> +             "=d" (tsc.hi_32));
> +
> +    return s->last_tsc = tsc.tsc_64;
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..cd1b9ca8be77
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,727 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, 
> software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> + * See the License for the specific language governing permissions 
> and
> + * limitations under the License.
> + */
> +
> +#if !defined(__i386__) && !defined(__x86_64__)
> +#error AF_XDP supported only for Linux on x86 or x86_64

Any reason why we do not support PPC and ARM?

> +#endif
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"

Swap the two above, see comment in netdev-linux-private.h

> +#include "netdev-afxdp.h"
> +
> +#include <arpa/inet.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/sockios.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <net/if_arp.h>
> +#include <net/route.h>
> +#include <netinet/in.h>
> +#include <netpacket/packet.h>
> +#include <poll.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/utsname.h>
> +#include <unistd.h>
> +

Some of these includes are included by netdev-linux(-private).h already 
so why not remove them?

> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netlink.h"
> +#include "dpif-netdev.h"
> +#include "fatal-signal.h"
> +#include "hash.h"
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "netlink-notifier.h"
> +#include "netlink-socket.h"
> +#include "netlink.h"
> +#include "netnsid.h"
> +#include "openflow/openflow.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/hmap.h"
> +#include "openvswitch/ofpbuf.h"
> +#include "openvswitch/poll-loop.h"
> +#include "openvswitch/vlog.h"
> +#include "openvswitch/shash.h"
> +#include "ovs-atomic.h"
> +#include "packets.h"
> +#include "rtnetlink.h"
> +#include "socket-util.h"
> +#include "sset.h"
> +#include "tc.h"
> +#include "timer.h"
> +#include "unaligned.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif

Do we really need to include the above? Or should we update the install 
instruction to move them over from the kernel headers?

> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char 
> *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base 
> + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static uint32_t prog_id;
> +static struct xsk_socket_info *xsk_configure(int ifindex, int 
> xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +
> +static struct xsk_umem_info *xsk_configure_umem(void *buffer, 
> uint64_t size,
> +                                                int xdpmode)
> +{
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof(*umem));
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, 
> &umem->cq,
> +                           NULL);

Here you pass NULL as the config, so this call will use the defaults,
XSK_RING_PROD__DEFAULT_NUM_DESCS and XSK_RING_CONS__DEFAULT_NUM_DESCS,
not the values you define in xdpsock.h.

> +
> +    if (ret) {
> +        VLOG_ERR("xsk umem create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    umem_pool_init(&umem->mpool, NUM_FRAMES);

We should check the return value here; see also the note in xdpsock.c.

> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    xpacket_pool_init(&umem->xpool, NUM_FRAMES);

Check return value and cleanup/return NULL on error, see 
xpacket_pool_init()

> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | 
> XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | 
> XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, 
> umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket_create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
> +

We should check if we got the entries we requested

> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);

Error check?
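
To illustrate the kind of check meant here, a minimal self-contained sketch. The frame_pool type and both function names are hypothetical stand-ins (the real free list is in xdpsock.c and is not reproduced); the point is only that a pop can fail and the fill loop should report how many descriptors it really produced:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the umem free list. */
struct frame_pool {
    void **slots;       /* Stack of free frame pointers. */
    size_t n_free;      /* Number of valid entries in 'slots'. */
};

/* Returns NULL when the pool is empty instead of handing out garbage. */
static void *
frame_pool_pop(struct frame_pool *pool)
{
    return pool->n_free ? pool->slots[--pool->n_free] : NULL;
}

/* Writes up to 'wanted' frame offsets (relative to 'base') into 'descs'
 * and returns how many were actually written. */
static size_t
fill_descs(struct frame_pool *pool, const char *base,
           uint64_t *descs, size_t wanted)
{
    size_t n;

    for (n = 0; n < wanted; n++) {
        const char *elem = frame_pool_pop(pool);

        if (!elem) {
            break;      /* Pool exhausted: caller must handle n < wanted. */
        }
        descs[n] = (uint64_t) (elem - base);
    }
    return n;
}
```

The caller would then submit only the returned count to the fill queue, or fail the socket setup if it is smaller than what was reserved.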

> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +    int ret;
> +
> +    /* umem memory region */
> +    ret = posix_memalign(&bufs, get_page_size(),
> +                         NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +    ovs_assert(!ret);

We should not assert, just report out of memory and return NULL.
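
Note also that the memset() above runs before the return value is checked. A minimal sketch of the non-asserting variant, assuming a hypothetical helper umem_region_alloc(); it uses fprintf() and sysconf() only to stay self-contained, where the patch would use VLOG_ERR() and get_page_size():

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: allocate a zeroed, page-aligned umem region.
 * Returns NULL on failure instead of asserting. */
static void *
umem_region_alloc(size_t n_frames, size_t frame_size)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *bufs = NULL;
    int ret;

    /* Reject empty requests and sizes that would overflow size_t. */
    if (page_size <= 0 || !n_frames || !frame_size
        || n_frames > SIZE_MAX / frame_size) {
        return NULL;
    }

    /* posix_memalign() returns the error number; it does not set errno. */
    ret = posix_memalign(&bufs, (size_t) page_size, n_frames * frame_size);
    if (ret) {
        fprintf(stderr, "umem allocation failed: %s\n", strerror(ret));
        return NULL;
    }

    /* Only touch the buffer once the allocation is known to be good. */
    memset(bufs, 0, n_frames * frame_size);
    return bufs;
}
```

xsk_configure() would then return NULL when the helper does, and the existing error path in xsk_configure_all() takes care of the rest.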

> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        (void)xsk_umem__delete(umem->umem);
> +        free(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk;
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    /* configure each queue */
> +    for (i = 0; i < netdev->n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", 
> i);
> +            goto err;
> +        }
> +        dev->xsk[i] = xsk;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
> +{
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +    ds_put_hex_dump(&ds, buf, count, 0, false);
> +    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
> +    ds_destroy(&ds);
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk)
> +{
> +    struct xsk_umem *umem;
> +
> +    if (!xsk) {
> +        return;
> +    }
> +
> +    umem = xsk->umem->umem;
> +    xsk_socket__delete(xsk->xsk);
> +    (void)xsk_umem__delete(umem);

I would log any errors here, especially if we ever support sharing of
umem.

> +
> +    /* free the packet buffer */
> +    free(xsk->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> +
> +    free(xsk->umem);
> +    free(xsk);
> +}
> +
> +void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < MAX_XSKQ; i++) {
> +        if (dev->xsk[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsk[i]);
> +            dev->xsk[i] = NULL;
> +        }
> +    }
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static inline void OVS_UNUSED
> +print_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, 
> XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid 
> %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}

Do we want to move this to some specific statistics dump, like
"ovs-vsctl get Interface eno1 statistics"?
If you want to keep it, maybe rename it to log_xsk_stats().

> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap 
> *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *xdpmode;
> +    int new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (new_n_rxq != netdev->n_rxq) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        netdev_request_reconfigure(netdev);
> +    }
> +
> +    xdpmode = smap_get(args, "xdpmode");
> +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
> +        dev->requested_xdpmode = XDP_ZEROCOPY;
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            netdev_request_reconfigure(netdev);
> +        }
> +    } else {
> +        dev->requested_xdpmode = XDP_COPY;
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            netdev_request_reconfigure(netdev);
> +        }
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap 
> *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    netdev->n_rxq = dev->requested_n_rxq;
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", 
> netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | 
> XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", 
> netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | 
> XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", 
> netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t curr_prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    }
> +    if (prog_id == curr_prog_id) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    } else if (!curr_prog_id) {
> +        VLOG_INFO("couldn't find a prog id on a given interface");
> +    } else {
> +        VLOG_INFO("program on interface changed, not removing");
> +    }
> +}
> +
> +struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    unsigned long addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +        struct dp_packet_afxdp *xpacket = NULL;
> +        struct dp_packet *packet;
> +        void *elems[BATCH_SIZE];
> +        unsigned long addr;
> +
> +       /* all packets are AF_XDP, so handles its own delete in batch 
> */
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            xpacket = dp_packet_cast_afxdp(packet);
> +            if (xpacket->mpool) {
> +                void *base = dp_packet_base(packet);
> +
> +                addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +                elems[i] = (void *)addr;
> +            }
> +        }
> +        umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +        dp_packet_batch_init(batch);
> +}
> +
> +/* Receive packet from AF_XDP socket */
> +int
> +netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                     struct dp_packet_batch *batch)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    uint32_t idx_rx = 0, idx_fq = 0;
> +    unsigned int rcvd, i;
> +    int ret = 0;
> +
> +    /* See if there is any packet on RX queue,
> +     * if yes, idx_rx is the index having the packet.
> +     */
> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;
> +    }
> +
> +    /* Form a dp_packet batch from descriptor in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, 
> idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> +
> +        packet = &xpacket->packet;
> +        xpacket->mpool = &xsk->umem->mpool;

Do we need to set this up again for every packet? This should be static
and set up once in xsk_configure_umem().

> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - 
> FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +
> +    /* We've consume rcvd packets in RX, now re-fill the
> +     * same number back to FILL queue.
> +     */
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return -ENOMEM;
> +    }
> +

I saw Ilya's comments on this section also, but should we not continue 
to process the batch even if we can't stock the kernel with new buffers? 
Other PMDs may have a bunch of packets pending (send and receive), so we 
may only be temporarily out of buffers.
Maybe we can re-stock later...
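
One possible shape for this, as a rough sketch only: keep a per-socket
count of FILL entries still owed to the kernel and make up for the
shortfall on a later poll. Every name below (toy_pool, toy_xsk,
refill_debt, toy_rx_refill) is hypothetical, and the pool and FILL ring
are modeled with plain arrays, not the libbpf ring API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_CAP 8

/* Hypothetical stand-in for the umem free-buffer pool (a LIFO stack). */
struct toy_pool {
    void *slots[TOY_POOL_CAP];
    unsigned int count;
};

/* Hypothetical per-socket state. */
struct toy_xsk {
    struct toy_pool *pool;
    unsigned int refill_debt;   /* FILL entries still owed to the kernel. */
};

/* Pops up to 'n' buffers into 'out'; returns how many were obtained. */
static unsigned int
toy_pool_pop_upto(struct toy_pool *p, unsigned int n, void **out)
{
    unsigned int got = 0;

    assert(p->count <= TOY_POOL_CAP);
    while (got < n && p->count > 0) {
        out[got++] = p->slots[--p->count];
    }
    return got;
}

/* Called after 'rcvd' packets were consumed from RX.  Refills as many FILL
 * entries as the pool allows and records the shortfall instead of failing,
 * so the received batch can still be processed.  Returns the number of
 * buffers handed out in this call. */
static unsigned int
toy_rx_refill(struct toy_xsk *xsk, unsigned int rcvd)
{
    void *bufs[TOY_POOL_CAP];
    unsigned int owed = xsk->refill_debt + rcvd;
    unsigned int ask = owed < TOY_POOL_CAP ? owed : TOY_POOL_CAP;
    unsigned int got = toy_pool_pop_upto(xsk->pool, ask, bufs);

    /* A real implementation would write bufs[0..got) into the FILL ring. */
    xsk->refill_debt = owed - got;
    return got;
}
```

The batch is still delivered in the short-pool case; the price is that
the kernel temporarily has fewer RX buffers to fill.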

> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        while (OVS_UNLIKELY(ret == 0)) {
> +            /* The FILL queue is full, so retry. (or skip)? */
> +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        }
> +
> +        /* Get one free umem, program it into FILL queue */
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> +
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk->rx, rcvd);

We should move this further up, so the entries are available for the 
kernel to fill...

> +    xsk->rx_npkts += rcvd;
> +
> +#ifdef AFXDP_DEBUG
> +    print_xsk_stat(xsk);
> +#endif
> +    return 0;
> +}
> +
> +static inline int kick_tx(struct xsk_socket_info *xsk)
> +{
> +    int ret;
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, 
> NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == 
> EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +int
> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                              struct dp_packet_batch *batch)
> +{

See Ilya's comment on thread safety on the ring APIs.

> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t tx_done, idx_cq = 0;
> +    struct dp_packet *packet;
> +    uint32_t idx = 0;
> +    int j, ret, retry_count = 0;
> +    const int max_retry = 4;
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void 
> **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        return EAGAIN;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void 
> **)elems_pop);
> +        return EAGAIN;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * We can avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */

The comment mentions that the copy can be avoided, but that is not 
implemented in the code. Is this correct, or was something removed?

> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void 
> **)elems_pop);
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +        return ret;

I think we should still try to recover the CQ below, even on failure.

> +    }
> +
> +retry:
> +    /* Process CQ */
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, 
> &idx_cq);
> +    if (tx_done > 0) {
> +        xsk->outstanding_tx -= tx_done;
> +        xsk->tx_npkts += tx_done;
> +    }
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +
> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void 
> **)elems_push);
> +    ovs_assert(ret == 0);
> +
> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +
> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) 
> {
> +        /* If there are still a lot not transmitted, try harder. */
> +        if (retry_count++ > max_retry) {
> +            return 0;
> +        }
> +        goto retry;
> +    }
> +

I think the code above is causing the lockup at wire speed that I 
mentioned above...
I guess the retry_count expires on every transmit when sending packets 
to the TAP interface.
Now all buffers are used up... This causes the umem_elem_pop_n() at the 
beginning to fail, hence the buffers are never returned!

Guess we might need some reclaim at the beginning, or maybe even in the 
rx loop?
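
To illustrate what I mean by reclaiming first, here is a toy model, not
a patch. The names (toy_tx, toy_reclaim, toy_send) are made up, the
completion queue is just a second stack, and the "kernel" completes
every transmit immediately, which the real one does not.

```c
#include <assert.h>

#define TOY_NBUF 4

struct toy_stack {
    int items[TOY_NBUF];
    int count;
};

struct toy_tx {
    struct toy_stack free_bufs;   /* Buffers available for new transmits. */
    struct toy_stack completed;   /* Stand-in for the completion queue. */
};

static void
toy_push(struct toy_stack *s, int v)
{
    assert(s->count < TOY_NBUF);
    s->items[s->count++] = v;
}

static int
toy_pop(struct toy_stack *s)
{
    assert(s->count > 0);
    return s->items[--s->count];
}

/* Moves every completed buffer back to the free pool; returns how many. */
static int
toy_reclaim(struct toy_tx *tx)
{
    int n = 0;

    while (tx->completed.count > 0) {
        toy_push(&tx->free_bufs, toy_pop(&tx->completed));
        n++;
    }
    return n;
}

/* Sends 'n' buffers.  Reclaiming first means that buffers completed since
 * the previous call are usable again before the pool is checked.  Returns 0
 * on success or -1 if the pool is still too small. */
static int
toy_send(struct toy_tx *tx, int n)
{
    toy_reclaim(tx);
    if (tx->free_bufs.count < n) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        /* Toy model: the "kernel" completes each transmit immediately. */
        toy_push(&tx->completed, toy_pop(&tx->free_bufs));
    }
    return 0;
}
```

Without the toy_reclaim() call at the top, the second full-size send
would fail forever, which looks like the situation I am hitting.
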
> +    return 0;
> +}
> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> new file mode 100644
> index 000000000000..6518d8fca0b5
> --- /dev/null
> +++ b/lib/netdev-afxdp.h
> @@ -0,0 +1,53 @@
> +/*
> + * Copyright (c) 2018 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, 
> software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> + * See the License for the specific language governing permissions 
> and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_AFXDP_H
> +#define NETDEV_AFXDP_H 1
> +
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +/* These functions are Linux AF_XDP specific, so they should be used 
> directly
> + * only by Linux-specific code. */

Extra enter?

> +#define MAX_XSKQ 16

Extra enter?

> +struct netdev;
> +struct xsk_socket_info;
> +struct xdp_umem;
> +struct dp_packet_batch;
> +struct smap;
> +struct dp_packet;
> +
> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet 
> *d);
> +
> +int xsk_configure_all(struct netdev *netdev);
> +
> +void xsk_destroy_all(struct netdev *netdev);
> +
> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                         struct dp_packet_batch *batch);
> +
> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                                  struct dp_packet_batch *batch);
> +
> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap 
> *args,
> +                            char **errp);
> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap 
> *args);
> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> +
> +void free_afxdp_buf(struct dp_packet *p);
> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> +#endif /* netdev-afxdp.h */
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> new file mode 100644
> index 000000000000..3dd3d902b3c4
> --- /dev/null
> +++ b/lib/netdev-linux-private.h
> @@ -0,0 +1,124 @@
> +/*
> + * Copyright (c) 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, 
> software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> + * See the License for the specific language governing permissions 
> and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_LINUX_PRIVATE_H
> +#define NETDEV_LINUX_PRIVATE_H 1
> +
> +#include <config.h>
> +
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"

Why include all the above? They were just added to netdev-linux.h, so 
if you make sure you include netdev-linux.h before -private it should 
work out.

> +
> +#if HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif

See earlier comment

> +
> +/* These functions are Linux specific, so they should be used 
> directly only by
> + * Linux-specific code. */
> +
> +struct netdev;
> +
> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t 
> flag,
> +                                  const char *flag_name, bool 
> enable);
> +int linux_get_ifindex(const char *netdev_name);
> +

These functions are now declared in both netdev-linux.h and 
netdev-linux-private.h.

> +#define LINUX_FLOW_OFFLOAD_API                          \
> +   .flow_flush = netdev_tc_flow_flush,                  \
> +   .flow_dump_create = netdev_tc_flow_dump_create,      \
> +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
> +   .flow_dump_next = netdev_tc_flow_dump_next,          \
> +   .flow_put = netdev_tc_flow_put,                      \
> +   .flow_get = netdev_tc_flow_get,                      \
> +   .flow_del = netdev_tc_flow_del,                      \
> +   .init_flow_api = netdev_tc_init_flow_api
> +

Same here, this define is in both include files.

> +struct netdev_linux {
> +    struct netdev up;
> +
> +    /* Protects all members below. */
> +    struct ovs_mutex mutex;
> +
> +    unsigned int cache_valid;
> +
> +    bool miimon;                    /* Link status of last poll. */
> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if 
> <= 0. */
> +    struct timer miimon_timer;
> +
> +    int netnsid;                    /* Network namespace ID. */
> +    /* The following are figured out "on demand" only.  They are only 
> valid
> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> +    int ifindex;
> +    struct eth_addr etheraddr;
> +    int mtu;
> +    unsigned int ifi_flags;
> +    long long int carrier_resets;
> +    uint32_t kbits_rate;        /* Policing data. */
> +    uint32_t kbits_burst;
> +    int vport_stats_error;      /* Cached error code from 
> vport_get_stats().
> +                                   0 or an errno value. */
> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> +                                 * or SIOCSIFMTU.
> +                                 */
> +    int ether_addr_error;       /* Cached error code from set/get 
> etheraddr. */
> +    int netdev_policing_error;  /* Cached error code from set 
> policing. */
> +    int get_features_error;     /* Cached error code from 
> ETHTOOL_GSET. */
> +    int get_ifindex_error;      /* Cached error code from 
> SIOCGIFINDEX. */
> +
> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> +
> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. 
> */
> +    struct tc *tc;
> +
> +    /* For devices of class netdev_tap_class only. */
> +    int tap_fd;
> +    bool present;               /* If the device is present in the 
> namespace */
> +    uint64_t tx_dropped;        /* tap device can drop if the iface 
> is down */
> +
> +    /* LAG information. */
> +    bool is_lag_master;         /* True if the netdev is a LAG 
> master. */
> +
> +    /* AF_XDP information */
> +#ifdef HAVE_AF_XDP
> +    struct xsk_socket_info *xsk[MAX_XSKQ];
> +    int requested_n_rxq;
> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> +    int xdp_flags, xdp_bind_flags;
> +#endif
> +};
> +
> +static struct netdev_linux *
> +netdev_linux_cast(const struct netdev *netdev)
> +{

In the original definition there was an assert() here, was it removed by 
accident?
> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +}
> +
> +#endif /* netdev-linux-private.h */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index f75d73fd39f8..1f190406d145 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -17,6 +17,7 @@
>  #include <config.h>
>
>  #include "netdev-linux.h"
> +#include "netdev-linux-private.h"
>
>  #include <errno.h>
>  #include <fcntl.h>
> @@ -54,6 +55,7 @@
>  #include "fatal-signal.h"
>  #include "hash.h"
>  #include "openvswitch/hmap.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-provider.h"
>  #include "netdev-tc-offloads.h"
>  #include "netdev-vport.h"
> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int 
> mtu);
>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t 
> burst_bytes);
>  
> -struct netdev_linux {
> -    struct netdev up;
> -
> -    /* Protects all members below. */
> -    struct ovs_mutex mutex;
> -
> -    unsigned int cache_valid;
> -
> -    bool miimon;                    /* Link status of last poll. */
> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if 
> <= 0. */
> -    struct timer miimon_timer;
> -
> -    int netnsid;                    /* Network namespace ID. */
> -    /* The following are figured out "on demand" only.  They are only 
> valid
> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> -    int ifindex;
> -    struct eth_addr etheraddr;
> -    int mtu;
> -    unsigned int ifi_flags;
> -    long long int carrier_resets;
> -    uint32_t kbits_rate;        /* Policing data. */
> -    uint32_t kbits_burst;
> -    int vport_stats_error;      /* Cached error code from 
> vport_get_stats().
> -                                   0 or an errno value. */
> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU 
> or SIOCSIFMTU. */
> -    int ether_addr_error;       /* Cached error code from set/get 
> etheraddr. */
> -    int netdev_policing_error;  /* Cached error code from set 
> policing. */
> -    int get_features_error;     /* Cached error code from 
> ETHTOOL_GSET. */
> -    int get_ifindex_error;      /* Cached error code from 
> SIOCGIFINDEX. */
> -
> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> -
> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. 
> */
> -    struct tc *tc;
> -
> -    /* For devices of class netdev_tap_class only. */
> -    int tap_fd;
> -    bool present;               /* If the device is present in the 
> namespace */
> -    uint64_t tx_dropped;        /* tap device can drop if the iface 
> is down */
> -
> -    /* LAG information. */
> -    bool is_lag_master;         /* True if the netdev is a LAG 
> master. */
> -};
>
>  struct netdev_rxq_linux {
>      struct netdev_rxq up;
> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class 
> *netdev_class)
>      return netdev_class->run == netdev_linux_run;
>  }
>
> +#if HAVE_AF_XDP
>  static bool
> -is_tap_netdev(const struct netdev *netdev)
> +is_afxdp_netdev(const struct netdev *netdev)
>  {
> -    return netdev_get_class(netdev) == &netdev_tap_class;
> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
>  }
> -
> -static struct netdev_linux *
> -netdev_linux_cast(const struct netdev *netdev)
> +#else
> +static bool
> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
>  {
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> -
> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +    return false;
> +}
> +#endif
> +static bool
> +is_tap_netdev(const struct netdev *netdev)
> +{
> +    return netdev_get_class(netdev) == &netdev_tap_class;
>  }
>
>  static struct netdev_rxq_linux *
> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>          atomic_count_dec(&miimon_cnt);
>      }
>
> +#if HAVE_AF_XDP
> +    if (is_afxdp_netdev(netdev_)) {
> +        xsk_destroy_all(netdev_);
> +    }
> +#endif

Think you can remove the HAVE_AF_XDP here, as you do not use it below 
either.

>      ovs_mutex_destroy(&netdev->mutex);
>  }
>
> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq 
> *rxq_)
>      rx->is_tap = is_tap_netdev(netdev_);
>      if (rx->is_tap) {
>          rx->fd = netdev->tap_fd;
> -    } else {
> +    } else if (!is_afxdp_netdev(netdev_)) {
>          struct sockaddr_ll sll;
>          int ifindex, val;
>          /* Result of tcpdump -dd inbound */
> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, 
> struct dp_packet_batch *batch,
>  {
>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>      struct netdev *netdev = rx->up.netdev;
> -    struct dp_packet *buffer;
> +    struct dp_packet *buffer = NULL;
>      ssize_t retval;
>      int mtu;
>
> +#if HAVE_AF_XDP

Think this #if HAVE_AF_XDP can be removed as the compiler should 
optimize out the if (false).
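
A small stand-alone illustration of the pattern, with made-up names
(TOY_HAVE_AF_XDP, toy_is_afxdp, toy_recv). Note the assumption it relies
on: everything referenced inside the branch must still be declared when
the feature is off, and at -O0 the dead call may not be eliminated, so
the callee also needs a definition or a stub.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef TOY_HAVE_AF_XDP
#define TOY_HAVE_AF_XDP 0
#endif

#if TOY_HAVE_AF_XDP
static bool
toy_is_afxdp(int type)
{
    return type == 2;
}
#else
/* Feature compiled out: always false, so callers need no #if of their own. */
static bool
toy_is_afxdp(int type)
{
    (void) type;
    return false;
}
#endif

static int
toy_afxdp_recv(void)
{
    return 42;
}

static int
toy_sock_recv(void)
{
    return 7;
}

static int
toy_recv(int type)
{
    assert(type >= 0);
    /* With the stub above this condition is a constant 'false', so an
     * optimizing compiler can drop the branch entirely. */
    if (toy_is_afxdp(type)) {
        return toy_afxdp_recv();
    }
    return toy_sock_recv();
}
```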

> +    if (is_afxdp_netdev(netdev)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> +        int qid = rxq_->queue_id;
> +
> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> +    }
> +#endif
>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>          mtu = ETH_PAYLOAD_MAX;
>      }
> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, 
> struct dp_packet_batch *batch,
>      /* Assume Ethernet port. No need to set packet_type. */
>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>                                             DP_NETDEV_HEADROOM);
> +
>      retval = (rx->is_tap
>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int 
> qid OVS_UNUSED,
>      int error = 0;
>      int sock = 0;
>
> -    if (!is_tap_netdev(netdev_)) {
> +    if (!is_tap_netdev(netdev_) &&
> +        !is_afxdp_netdev(netdev_)) {
>          if 
> (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>              error = EOPNOTSUPP;
>              goto free_batch;
> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int 
> qid OVS_UNUSED,
>          }
>
>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +#if HAVE_AF_XDP

Same here, remove the #if HAVE_AF_XDP.

> +    } else if (is_afxdp_netdev(netdev_)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> +        struct dp_packet_afxdp *xpacket;
> +        struct umem_pool *first_mpool;
> +        struct dp_packet *packet;
> +
> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> +
> +        /* all packets must come frome the same umem pool
> +         * and has DPBUF_AFXDP type, otherwise free on-by-one
> +         */
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            if (packet->source != DPBUF_AFXDP) {
> +                goto free_batch;
> +            }
> +
> +            xpacket = dp_packet_cast_afxdp(packet);
> +            if (i == 0) {
> +                first_mpool = xpacket->mpool;
> +                continue;
> +            }
> +            if (xpacket->mpool != first_mpool) {
> +                goto free_batch;
> +            }
> +        }

Why don't we move all the packet type checks into 
free_afxdp_buf_batch()?
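
Roughly what I have in mind, as a toy sketch with hypothetical types
(toy_packet, toy_pool, toy_free_batch). It checks each packet inside the
free routine and returns it to its own pool, so the caller needs no
pre-scan. It assumes a per-packet push is acceptable; the real code may
prefer to group by pool to keep the bulk push.

```c
#include <assert.h>
#include <stddef.h>

enum toy_source { TOY_MALLOC, TOY_AFXDP };

struct toy_pool {
    int returned;             /* Number of buffers pushed back so far. */
};

struct toy_packet {
    enum toy_source source;
    struct toy_pool *pool;    /* Only meaningful for TOY_AFXDP. */
    int freed;                /* Set when released one by one. */
};

/* Releases every packet in 'pkts'.  AF_XDP packets go back to their own
 * pool; anything else takes the ordinary per-packet free path.  Returns
 * the number of packets that were returned to a pool. */
static size_t
toy_free_batch(struct toy_packet *pkts[], size_t n)
{
    size_t pooled = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_packet *p = pkts[i];

        assert(p != NULL);
        if (p->source == TOY_AFXDP && p->pool) {
            p->pool->returned++;
            pooled++;
        } else {
            p->freed = 1;
        }
    }
    return pooled;
}
```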

> +        /* free in batch */
> +        free_afxdp_buf_batch(batch);
> +        return error;
> +#endif
>      } else {
>          error = netdev_linux_tap_batch_send(netdev_, batch);
>      }
> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "system",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_linux_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
>  const struct netdev_class netdev_tap_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      .type = "tap",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct_tap,
>      .get_stats = netdev_tap_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3343,10 +3352,26 @@ const struct netdev_class 
> netdev_internal_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "internal",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_internal_get_stats,
>      .get_status = netdev_internal_get_status,
>  };
> +
> +#ifdef HAVE_AF_XDP
> +const struct netdev_class netdev_afxdp_class = {
> +    NETDEV_LINUX_CLASS_COMMON,
> +    .type = "afxdp",
> +    .is_pmd = true,
> +    .construct = netdev_linux_construct,
> +    .get_stats = netdev_linux_get_stats,
> +    .get_status = netdev_linux_get_status,
> +    .set_config = netdev_afxdp_set_config,
> +    .get_config = netdev_afxdp_get_config,
> +    .reconfigure = netdev_afxdp_reconfigure,
> +    .get_numa_id = netdev_afxdp_get_numa_id,
> +};
> +#endif
>  
>
>  #define CODEL_N_QUEUES 0x0000
> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> index 17ca9120168a..b812e64cb078 100644
> --- a/lib/netdev-linux.h
> +++ b/lib/netdev-linux.h
> @@ -19,6 +19,20 @@
>
>  #include <stdint.h>
>  #include <stdbool.h>
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"

Is there a reason why you moved all these includes here? If there is, 
you might as well remove the duplicates from the .c files that include 
netdev-linux.h, for example netdev-linux.c.

>  /* These functions are Linux specific, so they should be used 
> directly only by
>   * Linux-specific code. */
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e6e8e8..d433818f7064 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -902,7 +902,9 @@ extern const struct netdev_class 
> netdev_linux_class;
>  #endif
>  extern const struct netdev_class netdev_internal_class;
>  extern const struct netdev_class netdev_tap_class;
> -
> +#if HAVE_AF_XDP
> +extern const struct netdev_class netdev_afxdp_class;
> +#endif
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6f0946..e2fae37d5a5e 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -146,6 +146,9 @@ netdev_initialize(void)
>          netdev_register_provider(&netdev_internal_class);
>          netdev_register_provider(&netdev_tap_class);
>          netdev_vport_tunnel_register();
> +#ifdef HAVE_AF_XDP
> +        netdev_register_provider(&netdev_afxdp_class);
> +#endif
>  #endif
>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>          netdev_register_provider(&netdev_tap_class);
> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> new file mode 100644
> index 000000000000..2d80e74d69e4
> --- /dev/null
> +++ b/lib/xdpsock.c
> @@ -0,0 +1,239 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, 
> software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> + * See the License for the specific language governing permissions 
> and
> + * limitations under the License.
> + */
> +#include <config.h>
> +
> +#include "xdpsock.h"
> +
> +#include <ctype.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <syslog.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include "async-append.h"
> +#include "coverage.h"
> +#include "dirs.h"
> +#include "dp-packet.h"
> +#include "openvswitch/compiler.h"
> +#include "openvswitch/vlog.h"
> +#include "ovs-atomic.h"
> +#include "ovs-thread.h"
> +#include "sat-math.h"
> +#include "socket-util.h"
> +#include "svec.h"
> +#include "syslog-direct.h"
> +#include "syslog-libc.h"
> +#include "syslog-provider.h"
> +#include "timeval.h"
> +#include "unixctl.h"
> +#include "util.h"
> +
> +static inline void
> +ovs_spinlock_init(ovs_spinlock_t *sl)
> +{
> +    atomic_init(&sl->locked, 0);
> +}
> +
> +static inline void
> +ovs_spin_lock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0, locked = 0;
> +
> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, 
> &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed)) {
> +        locked = 1;
> +        while (locked) {
> +            atomic_read_relaxed(&sl->locked, &locked);
> +        }
> +        exp = 0;
> +    }
> +}
> +
> +static inline void
> +ovs_spin_unlock(ovs_spinlock_t *sl)
> +{
> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> +}
> +
> +static inline int OVS_UNUSED
> +ovs_spin_trylock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0;
> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 
> 1,
> +                memory_order_acquire,
> +                memory_order_relaxed);
> +}

Move the spinlock functions out to a common file.
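
For reference, a self-contained version of the same test-and-test-and-set
lock, as it might look in a shared header. This sketch uses plain C11
<stdatomic.h> and toy_ names as an assumption; an OVS version would use
the ovs-atomic.h wrappers as the patch already does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
    atomic_int locked;
};

static inline void
toy_spin_init(struct toy_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was taken. */
static inline bool
toy_spin_trylock(struct toy_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
toy_spin_lock(struct toy_spinlock *sl)
{
    while (!toy_spin_trylock(sl)) {
        /* Spin on a relaxed load so waiters do not keep issuing CAS. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
toy_spin_unlock(struct toy_spinlock *sl)
{
    assert(atomic_load_explicit(&sl->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```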

> +
> +inline int
> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {

This is a stack overflow

> +        return -ENOMEM;
> +    }
> +
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(ptr, addrs, n * sizeof(void *));
> +    umemp->index += n;
> +
> +    return 0;
> +}
> +
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_push_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void
> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +    umemp->array[umemp->index++] = addr;
> +}
> +
> +void
> +umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +
> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> +        /* stack is overflow, this should not happen */
> +        OVS_NOT_REACHED();
> +    }

Should this not be moved after the spinlock, i.e. into __umem_elem_push()?
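
The reason is that umemp->index can change between an unlocked check and
the locked push. A minimal sketch of doing the check under the lock,
with hypothetical names (toy_pool, toy_push) and an atomic_flag standing
in for the spinlock; returning ENOSPC instead of aborting is my
assumption, not something the patch does:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_CAP 2

struct toy_pool {
    atomic_flag lock;
    void *array[TOY_CAP];
    unsigned int index;
};

static void
toy_pool_init(struct toy_pool *p)
{
    atomic_flag_clear(&p->lock);
    p->index = 0;
}

static void
toy_lock(struct toy_pool *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock,
                                             memory_order_acquire)) {
        continue;
    }
}

static void
toy_unlock(struct toy_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Pushes 'addr'.  The capacity check happens while the lock is held, so
 * no other thread can change 'index' between the check and the store. */
static int
toy_push(struct toy_pool *p, void *addr)
{
    int error = 0;

    assert(addr != NULL);
    toy_lock(p);
    if (p->index >= TOY_CAP) {
        error = ENOSPC;
    } else {
        p->array[p->index++] = addr;
    }
    toy_unlock(p);
    return error;
}
```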

> +
> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    __umem_elem_push(umemp, addr);
> +    ovs_spin_unlock(&umemp->mutex);
> +}
> +
> +inline int
> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->index -= n;
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(addrs, ptr, n * sizeof(void *));
> +
> +    return 0;
> +}
> +
> +int
> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void *
> +__umem_elem_pop(struct umem_pool *umemp)
> +{

There is no check here to see if there are actually any elements left, 
like there is for pop_n, so we could corrupt memory or the umem_pool.
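
Something like the following would do, shown on a toy structure with
hypothetical names. Returning NULL on an empty pool is my assumption
about the desired behavior; the caller then has to handle it.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pool {
    void **array;
    unsigned int index;   /* Number of elements currently on the stack. */
    unsigned int size;    /* Capacity of 'array'. */
};

/* Pops one element, or returns NULL if the pool is empty.  Without the
 * emptiness check, '--index' would wrap around on an unsigned index and
 * read far outside 'array'. */
static void *
toy_pop(struct toy_pool *p)
{
    assert(p->index <= p->size);
    if (p->index == 0) {
        return NULL;
    }
    return p->array[--p->index];
}
```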

> +    return umemp->array[--umemp->index];
> +}
> +
> +void *
> +umem_elem_pop(struct umem_pool *umemp)
> +{
> +    void *ptr;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ptr = __umem_elem_pop(umemp);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ptr;
> +}
> +
> +void **
> +__umem_pool_alloc(unsigned int size)
> +{
> +    void *bufs;
> +
> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(void *)) == 0);

We should not assert, just return NULL here.

> +    memset(bufs, 0, size * sizeof(void *));
> +    return (void **)bufs;
> +}
> +
> +unsigned int
> +umem_elem_count(struct umem_pool *mpool)
> +{
> +    return mpool->index;
> +}
> +
> +int
> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> +{
> +    umemp->array = __umem_pool_alloc(size);
> +    if (!umemp->array) {
> +        OVS_NOT_REACHED();

If NULL is returned, return ENOMEM instead of aborting.
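
Putting this comment and the one on __umem_pool_alloc() together, a minimal sketch could look as follows. The *_sketch names are hypothetical, sysconf(_SC_PAGESIZE) is used in place of getpagesize() only to keep the example portable, and the spinlock init is omitted:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* For posix_memalign(). */
#endif
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for struct umem_pool, without the lock. */
struct pool_alloc_sketch {
    int index;
    unsigned int size;
    void **array;
};

/* Returns a zeroed, page-aligned array of 'size' pointers, or NULL if
 * the allocation fails.  No assertion, so the caller decides what to do. */
static void **
pool_alloc_sketch(unsigned int size)
{
    void *bufs;

    if (posix_memalign(&bufs, (size_t) sysconf(_SC_PAGESIZE),
                       size * sizeof(void *))) {
        return NULL;
    }
    memset(bufs, 0, size * sizeof(void *));
    return bufs;
}

/* Returns 0 on success, ENOMEM if the array could not be allocated. */
static int
pool_init_sketch(struct pool_alloc_sketch *p, unsigned int size)
{
    p->array = pool_alloc_sketch(size);
    if (!p->array) {
        return ENOMEM;
    }
    p->size = size;
    p->index = 0;
    return 0;
}
```

The caller in netdev-afxdp would then have to propagate the error rather than assume success.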

> +    }
> +
> +    umemp->size = size;
> +    umemp->index = 0;
> +    ovs_spinlock_init(&umemp->mutex);
> +    return 0;
> +}
> +
> +void
> +umem_pool_cleanup(struct umem_pool *umemp)
> +{
> +    free(umemp->array);
        umemp->array = NULL;
> +}
> +
> +/* AF_XDP metadata init/destroy */
> +int
> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> +{
> +    void *bufs;
> +
> +    /* TODO: check HAVE_POSIX_MEMALIGN  */

Guess the above needs to be done

> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(struct dp_packet_afxdp)) == 0);

We should not assert, just return an error (the function returns int, so something like ENOMEM rather than false).

> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> +
> +    xp->array = bufs;
> +    xp->size = size;
> +    return 0;
> +}
> +
> +void
> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> +{
> +    free(xp->array);
        xp->array = NULL;
> +}
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> new file mode 100644
> index 000000000000..aabaa8e5df24
> --- /dev/null
> +++ b/lib/xdpsock.h
> @@ -0,0 +1,123 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef XDPSOCK_H
> +#define XDPSOCK_H 1
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/xsk.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <locale.h>
> +#include <net/if.h>
> +#include <poll.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +
> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> +#define BATCH_SIZE      NETDEV_MAX_BURST

Move this item to the bottom, so you have the FRAME-specific defines first.

> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> +
> +#define NUM_FRAMES      4096

Should we add a note/check to make sure this value is a power of 2?
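
A compile-time check would be cheap. The sketch below uses plain C11 so it stands alone; IS_POW2_SKETCH is a made-up name, and in the tree I would expect the existing OVS build-assert and power-of-2 helpers to be used instead:

```c
#define NUM_FRAMES 4096

/* Nonzero if 'x' is a power of two.  Hypothetical helper for this
 * sketch only. */
#define IS_POW2_SKETCH(x) ((x) != 0 && (((x) & ((x) - 1)) == 0))

/* Fails the build, not the run, if someone changes NUM_FRAMES to a
 * value that is not a power of two. */
_Static_assert(IS_POW2_SKETCH(NUM_FRAMES),
               "NUM_FRAMES must be a power of 2");
```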

> +#define PROD_NUM_DESCS  512
> +#define CONS_NUM_DESCS  512
> +
> +#ifdef USE_XSK_DEFAULT
> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> +#endif

Any reason for having this? Should we use the default values? They are
4x larger than what you have; did it make any difference in the
performance results?
We could make it configurable like for DPDK, using the
n_txq_desc/n_rxq_desc options.

> +
> +typedef struct {
> +    atomic_int locked;
> +} ovs_spinlock_t;
> +

Think we should move the ovs_spinlock code and includes to some global 
place, maybe util or thread
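
For reference, a generic spinlock along these lines needs nothing AF_XDP specific. This is a sketch with C11 <stdatomic.h> and hypothetical spinlock_sketch_* names, not the ovs_spin_* code from the patch, which would use ovs-atomic.h:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Same shape as the ovs_spinlock_t in the patch: 0 means unlocked,
 * 1 means locked. */
typedef struct {
    atomic_int locked;
} spinlock_sketch_t;

static inline void
spinlock_sketch_init(spinlock_sketch_t *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool
spinlock_sketch_trylock(spinlock_sketch_t *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected,
                                                   1, memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
spinlock_sketch_lock(spinlock_sketch_t *sl)
{
    while (!spinlock_sketch_trylock(sl)) {
        /* Busy-wait until the holder releases the lock. */
    }
}

static inline void
spinlock_sketch_unlock(spinlock_sketch_t *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Nothing here depends on libbpf or xsk.h, which is why a shared header looks like the better home.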

> +/* LIFO ptr_array */
> +struct umem_pool {
> +    int index;      /* point to top */
> +    unsigned int size;
> +    ovs_spinlock_t mutex;
> +    void **array;   /* a pointer array, point to umem buf */
> +};
> +
> +/* array-based dp_packet_afxdp */
> +struct xpacket_pool {
> +    unsigned int size;
> +    struct dp_packet_afxdp **array;
> +};
> +
> +struct xsk_umem_info {
> +    struct umem_pool mpool;
> +    struct xpacket_pool xpool;
> +    struct xsk_ring_prod fq;
> +    struct xsk_ring_cons cq;
> +    struct xsk_umem *umem;
> +    void *buffer;
> +};
> +
> +struct xsk_socket_info {
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_umem_info *umem;
> +    struct xsk_socket *xsk;
> +    unsigned long rx_npkts;
> +    unsigned long tx_npkts;
> +    unsigned long prev_rx_npkts;
> +    unsigned long prev_tx_npkts;
> +    uint32_t outstanding_tx;
> +};
> +
> +struct umem_elem {
> +    struct umem_elem *next;
> +};
> +
> +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void *__umem_elem_pop(struct umem_pool *umemp);
> +void *umem_elem_pop(struct umem_pool *umemp);
> +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void **__umem_pool_alloc(unsigned int size);
> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> +void umem_pool_cleanup(struct umem_pool *umemp);
> +unsigned int umem_elem_count(struct umem_pool *mpool);
> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> +

Think all the __umem_* functions are only used internally, so they should
become static and be removed from this header.

> +#endif
> diff --git a/tests/automake.mk b/tests/automake.mk
> index ea16532dd2a0..715cef9a6b3b 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>  	$(SYSTEM_TESTSUITE_AT) \
>  	$(SYSTEM_KMOD_TESTSUITE_AT) \
>  	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
> +	$(SYSTEM_AFXDP_TESTSUITE_AT) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>  	$(SYSTEM_DPDK_TESTSUITE_AT) \
>  	$(OVSDB_CLUSTER_TESTSUITE_AT) \
>  	$(TESTSUITE) \
>  	$(SYSTEM_KMOD_TESTSUITE) \
>  	$(SYSTEM_USERSPACE_TESTSUITE) \
> +	$(SYSTEM_AFXDP_TESTSUITE) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE) \
>  	$(SYSTEM_DPDK_TESTSUITE) \
>  	$(OVSDB_CLUSTER_TESTSUITE) \
> @@ -158,6 +160,11 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>  	tests/system-userspace-macros.at \
>  	tests/system-userspace-packet-type-aware.at
>
> +SYSTEM_AFXDP_TESTSUITE_AT = \
> +	tests/system-afxdp-testsuite.at \
> +	tests/system-afxdp-traffic.at \
> +	tests/system-afxdp-macros.at
> +
>  SYSTEM_TESTSUITE_AT = \
>  	tests/system-common-macros.at \
>  	tests/system-ovn.at \
> @@ -182,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>  SYSTEM_USERSPACE_TESTSUITE = 
> $(srcdir)/tests/system-userspace-testsuite
> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> @@ -315,6 +323,11 @@ check-system-userspace: all
>  	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  
> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" 
> --recheck)
>
> +check-afxdp: all
> +	$(MAKE) install
> +	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  
> AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> +	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> +
>  check-offloads: all
>  	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  
> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" 
> --recheck)
> @@ -352,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 
> $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
>
> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) 
> $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> +	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> +	$(AM_V_at)mv $@.tmp $@
> +
>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) 
> $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
> diff --git a/tests/system-afxdp-macros.at 
> b/tests/system-afxdp-macros.at
> new file mode 100644
> index 000000000000..2c58c2d6554b
> --- /dev/null
> +++ b/tests/system-afxdp-macros.at
> @@ -0,0 +1,153 @@
> +# _ADD_BR([name])
> +#
> +# Expands into the proper ovs-vsctl commands to create a bridge with 
> the
> +# appropriate type and properties
> +m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 
> datapath_type=netdev 
> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 
> fail-mode=secure ]])
> +
> +# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], 
> [=override])
> +#
> +# Creates a database and starts ovsdb-server, starts ovs-vswitchd
> +# connected to that database, calls ovs-vsctl to create a bridge 
> named
> +# br0 with predictable settings, passing 'vsctl-args' as additional
> +# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
> +# output (e.g. because it includes "create" commands) then 
> 'vsctl-output'
> +# specifies the expected output after filtering through uuidfilt.
> +m4_define([OVS_TRAFFIC_VSWITCHD_START],
> +  [
> +   export OVS_PKGDATADIR=$(`pwd`)
> +   _OVS_VSWITCHD_START([--disable-system])
> +   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| 
> uuidfilt])], [0], [$2])
> +])
> +
> +# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
> +#
> +# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log 
> files
> +# for messages with severity WARN or higher and signaling an error if 
> any
> +# is present.  The optional WHITELIST may contain shell-quoted "sed"
> +# commands to delete any warnings that are actually expected, e.g.:
> +#
> +#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
> +#
> +# 'extra_cmds' are shell commands to be executed afte 
> OVS_VSWITCHD_STOP() is
> +# invoked. They can be used to perform additional cleanups such as 
> name space
> +# removal.
> +m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
> +  [OVS_VSWITCHD_STOP([dnl
> +$1";/netdev_linux.*obtaining netdev stats via vport failed/d
> +/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. 
> The Open vSwitch kernel module is probably not loaded./d
> +/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
> +/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
> +"])
> +   AT_CHECK([:; $2])
> +  ])
> +
> +m4_define([ADD_VETH_AFXDP],
> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 
> 77])
> +      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
> +      AT_CHECK([ip link set $1 netns $2])
> +      AT_CHECK([ip link set dev ovs-$1 up])
> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> +                set interface ovs-$1 external-ids:iface-id="$1" 
> type="afxdp"])
> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> +      if test -n "$5"; then
> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> +      fi
> +      if test -n "$6"; then
> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> +      fi
> +      on_exit 'ip link del ovs-$1'
> +    ]
> +)
> +
> +# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
> +m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
> +    ]
> +)
> +
> +# CONFIGURE_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads for veths.  The userspace datapath uses the 
> AF_PACKET
> +# socket to receive packets for veths.  Unfortunately, the AF_PACKET 
> socket
> +# doesn't play well with offloads:
> +# 1. GSO packets are received without segmentation and therefore 
> discarded.
> +# 2. Packets with offloaded partial checksum are received with the 
> wrong
> +#    checksum, therefore discarded by the receiver.
> +#
> +# By disabling tx offloads in the non-OVS side of the veth peer we 
> make sure
> +# that the AF_PACKET socket will not receive bad packets.
> +#
> +# This is a workaround, and should be removed when offloads are 
> properly
> +# supported in netdev-linux.
> +m4_define([CONFIGURE_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
> +)
> +
> +# CHECK_CONNTRACK()
> +#
> +# Perform requirements checks for running conntrack tests.
> +#
> +m4_define([CHECK_CONNTRACK],
> +    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
> +)
> +
> +# CHECK_CONNTRACK_ALG()
> +#
> +# Perform requirements checks for running conntrack ALG tests. The 
> userspace
> +# supports FTP and TFTP.
> +#
> +m4_define([CHECK_CONNTRACK_ALG])
> +
> +# CHECK_CONNTRACK_FRAG()
> +#
> +# Perform requirements checks for running conntrack fragmentations 
> tests.
> +# The userspace doesn't support fragmentation yet, so skip the tests.
> +m4_define([CHECK_CONNTRACK_FRAG],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_LOCAL_STACK()
> +#
> +# Perform requirements checks for running conntrack tests with local 
> stack.
> +# While the kernel connection tracker automatically passes all the 
> connection
> +# tracking state from an internal port to the OpenvSwitch kernel 
> module, there
> +# is simply no way of doing that with the userspace, so skip the 
> tests.
> +m4_define([CHECK_CONNTRACK_LOCAL_STACK],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_NAT()
> +#
> +# Perform requirements checks for running conntrack NAT tests. The 
> userspace
> +# datapath supports NAT.
> +#
> +m4_define([CHECK_CONNTRACK_NAT])
> +
> +# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
> +#
> +# Perform requirements checks for running ovs-dpctl flush-conntrack 
> by
> +# conntrack 5-tuple test. The userspace datapath does not support
> +# this feature yet.
> +m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-set-maxconns 
> or
> +# ovs-dpctl ct-get-maxconns. The userspace datapath does support this 
> feature.
> +m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
> +
> +# CHECK_CT_DPIF_GET_NCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-get-nconns. 
> The
> +# userspace datapath does support this feature.
> +m4_define([CHECK_CT_DPIF_GET_NCONNS])
> diff --git a/tests/system-afxdp-testsuite.at 
> b/tests/system-afxdp-testsuite.at
> new file mode 100644
> index 000000000000..538c0d15d556
> --- /dev/null
> +++ b/tests/system-afxdp-testsuite.at
> @@ -0,0 +1,26 @@
> +AT_INIT
> +
> +AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
> +
> +Licensed under the Apache License, Version 2.0 (the "License");
> +you may not use this file except in compliance with the License.
> +You may obtain a copy of the License at:
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> +
> +Unless required by applicable law or agreed to in writing, software
> +distributed under the License is distributed on an "AS IS" BASIS,
> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> +See the License for the specific language governing permissions and
> +limitations under the License.])
> +
> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> +
> +m4_include([tests/ovs-macros.at])
> +m4_include([tests/ovsdb-macros.at])
> +m4_include([tests/ofproto-macros.at])
> +m4_include([tests/system-afxdp-macros.at])
> +m4_include([tests/system-common-macros.at])
> +
> +m4_include([tests/system-afxdp-traffic.at])
> +m4_include([tests/system-ovn.at])
> diff --git a/tests/system-afxdp-traffic.at 
> b/tests/system-afxdp-traffic.at
> new file mode 100644
> index 000000000000..26f72acf48ef
> --- /dev/null
> +++ b/tests/system-afxdp-traffic.at
> @@ -0,0 +1,978 @@
> +AT_BANNER([AF_XDP netdev datapath-sanity])
> +
> +AT_SETUP([datapath - ping between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ulimit -l unlimited
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
> +ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. 
> Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
> +ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. 
> Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan tunnel])
> +OVS_CHECK_VXLAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], 
> [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], 
> [10.1.1.1/24],
> +                  [id 0 dstport 4789])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], 
> [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan6 tunnel])
> +OVS_CHECK_VXLAN_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], 
> "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], 
> [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], 
> [10.1.1.1/24],
> +                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], 
> [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over gre tunnel])
> +OVS_CHECK_GRE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], 
> [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], 
> [10.1.1.1/24])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], 
> [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], 
> [10.1.1.100/24], [options:key=1 options:erspan_ver=1 
> options:erspan_idx=7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], 
> [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], 
> [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], 
> [10.1.1.100/24], [options:key=1 options:erspan_ver=2 
> options:erspan_dir=1 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], 
> [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 
> 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], 
> [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], 
> nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], 
> [10.1.1.100/24],
> +                [options:key=123 options:erspan_ver=1 
> options:erspan_idx=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], 
> [fc00:100::100],
> +                   [10.1.1.1/24], [local fc00:100::1 seq key 123 
> erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], 
> [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], 
> nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], 
> [10.1.1.100/24],
> +                [options:key=121 options:erspan_ver=2 
> options:erspan_dir=0 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], 
> [fc00:100::100],
> +                   [10.1.1.1/24],
> +                   [local fc00:100::1 seq key 121 erspan_ver 2 
> erspan_dir ingress erspan_hwid 0x7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], 
> [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve tunnel])
> +OVS_CHECK_GENEVE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], 
> [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], 
> [10.1.1.1/24],
> +                  [vni 0])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], 
> [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve6 tunnel])
> +OVS_CHECK_GENEVE_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth 
> pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], 
> "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a 
> native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], 
> [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], 
> [10.1.1.1/24],
> +                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], 
> [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 
> | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - clone action])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
> +                    -- set interface ovs-p1 ofport_request=2])
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1 actions=NORMAL
> +priority=10 
> in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), 
> output:2
> +priority=10 
> in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, 
> controller), output:1
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir 
> --pidfile 2> ofctl_monitor.log])
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 
> icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 
> icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 
> icmp_csum: <skip>
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - basic truncate action])
> +AT_SKIP_IF([test $HAVE_NC = no])
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-ofctl del-flows br0])
> +
> +dnl Create p0 and ovs-p0(1)
> +ADD_NAMESPACES(at_ns0)
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address 
> e6:66:c1:11:11:11])
> +NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
> +
> +dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will 
> appear in p1
> +AT_CHECK([ip link add p1 type veth peer name ovs-p1])
> +on_exit 'ip link del ovs-p1'
> +AT_CHECK([ip link set dev ovs-p1 up])
> +AT_CHECK([ip link set dev p1 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 
> ofport_request=2])
> +dnl Use p1 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 
> ofport_request=3])
> +
> +dnl Create p2(5) and ovs-p2(4)
> +AT_CHECK([ip link add p2 type veth peer name ovs-p2])
> +on_exit 'ip link del ovs-p2'
> +AT_CHECK([ip link set dev ovs-p2 up])
> +AT_CHECK([ip link set dev p2 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 
> ofport_request=4])
> +dnl Use p2 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 
> ofport_request=5])
> +
> +dnl basic test
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 
> actions=output(port=2,max_len=100),output:4
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl use this file as payload file for ncat
> +AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> 
> /dev/null])
> +on_exit 'rm -f payload200.bin'
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < 
> payload200.bin])
> +
> +dnl packet with truncated size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" |  sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=100
> +])
> +dnl packet with original size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=242
> +])
> +
> +dnl more complicated output actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 
> actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < 
> payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +dnl SLOW_ACTION: disable kernel datapath truncate support
> +dnl Repeat the test above, but exercise the SLOW_ACTION code path
> +AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
> +
> +dnl SLOW_ACTION test1: check datapatch actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-appctl ofproto/trace br0 
> "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], 
> [0], [stdout])
> +AT_CHECK([tail -3 stdout], [0],
> +[Datapath actions: 
> trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
> +This flow is handled by the userspace slow path because it:
> +  - Uses action(s) not supported by datapath.
> +])
> +
> +dnl SLOW_ACTION test2: check actual packet truncate
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < 
> payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed 
> -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +
> +AT_BANNER([conntrack])
> +
> +AT_SETUP([conntrack - controller])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg 
> ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic 
> from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir 
> --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 
> '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 
> ct\(commit\),controller 
> '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 
> '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl Check this output. We only see the latter two packets, not the 
> first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) 
> data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 
> udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 
> ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 
> (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 
> udp_csum:0
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - force commit])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg 
> ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(force,commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
> +table=1,in_port=2,ct_state=+trk,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir 
> --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 
> actions=resubmit(,0)"])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 
> actions=resubmit(,0)"])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 
> actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +dnl Check this output. We only see the latter two packets, not the 
> first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via 
> action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 
> udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 
> ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 
> (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 
> udp_csum:0
> +])
> +
> +dnl
> +dnl Check that the directionality has been changed by force commit.
> +dnl
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.2,"], [], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
> +])
> +
> +dnl OK, now send another packet from port 1 and see that it switches 
> again
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 
> actions=resubmit(,0)"])
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - ct flush by 5-tuple])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),2
> +priority=100,in_port=2,udp,action=ct(zone=5,commit),1
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Test UDP from port 1
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 
> actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack 
> 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.1,"], [1], [dnl
> +])
> +
> +dnl Test UDP from port 2
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 
> packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 
> actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.2,"], [0], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 
> 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], 
> [0], [dnl
> +])
> +
> +dnl Test ICMP traffic
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.2,"], [0], [stdout])
> +AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
> +icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
> +])
> +
> +ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
> +ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 
> "orig=.src=10\.1\.1\.2,"], [1], [dnl
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv4 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic 
> from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], 
> [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | 
> FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - get_nconns and get/set_maxconns])
> +CHECK_CONNTRACK()
> +CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +CHECK_CT_DPIF_GET_NCONNS()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic 
> from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], 
> [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], 
> [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +1
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +3000000
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
> +setting maxconns successful
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +0
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv6 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +AT_DATA([flows.txt], [dnl
> +
> +dnl ICMPv6 echo request and reply go to table 1.  The rest of the 
> traffic goes
> +dnl through normal action.
> +table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
> +table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
> +table=0,priority=1,action=normal
> +
> +dnl Allow everything from ns0->ns1. Only allow return traffic from 
> ns1->ns0.
> +table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
> +table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
> +table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
> +table=1,priority=1,action=drop
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +dnl The above ping creates state in the connection tracker.  We're 
> not
> +dnl interested in that state.
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | 
> FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | 
> FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], 
> [dnl
> +icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> -- 
> 2.7.4
Ilya Maximets May 17, 2019, 12:39 p.m. UTC | #8
Hi.
Just a few comments on the issues you've listed.

Best regards, Ilya Maximets.

On 17.05.2019 13:23, Eelco Chaudron wrote:
> Hi William,
> 
> First a list of issues I found during some basic testing...
> 
> - When I restart or stop OVS (using the systemctl interface as found in RHEL) it does not clean up the BPF program, causing the restart to fail:
> 
>   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
>   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface eno1 new configuration
> 
>   I need to manually run "ip link set dev eno1 xdp off" to make it recover.

The userspace datapath requires the '--cleanup' option to be passed to
'ovs-appctl exit' to clean up allocated resources. Otherwise the datapath will
not be destroyed, i.e. the netdev will not be destroyed --> no XDP program unloading.

> 
> - When I remove a bridge, I get an emer in the revalidator:
> 
>   2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
>   2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111: assertion !fd != !wevent failed in poll_create_node()
> 

This actually should never happen. Looks like a memory corruption.
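
For reference, the failed assertion is an exclusive-or check: exactly one of
'fd' and 'wevent' must be set. A minimal sketch of the idiom follows;
exactly_one_set() is a hypothetical name used for illustration, not the OVS
code itself:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not OVS code: returns true when exactly one of 'fd'
 * and 'wevent' is set, which is the condition "!fd != !wevent" expresses. */
static bool
exactly_one_set(int fd, const void *wevent)
{
    /* '!' maps each argument to 0 or 1, so '!=' acts as a logical XOR. */
    return !fd != !wevent;
}
```

When no wait event is passed, the check can only fail if 'fd' is 0, which
would be consistent with a descriptor field that was zeroed or overwritten.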

>   Easy to replicate with this:
> 
>     $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0 datapath_type=netdev
>     $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1 type="afxdp" options:xdpmode=drv
>     $ ovs-vsctl del-br ovs_pvp_br0
> 
> 
> - High pmd usage in the statistics, even with no packets. Is this expected?
> 
>   $ ovs-appctl dpif-netdev/pmd-rxq-show
>   pmd thread numa_id 0 core_id 1:
>     isolated : false
>     port: dpdk0             queue-id:  0  pmd usage:  0 %
>     port: eno1              queue-id:  0  pmd usage: 49 %
> 
>   It goes up slowly and gets stuck at 49%
> 
> 
> - When doing the PVP testing I noticed that the physical port has odd/no
>   tx statistics:
> 
>   $ ovs-ofctl dump-ports ovs_pvp_br0
>   OFPST_PORT reply (xid=0x2): 3 ports
>     port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>     port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>     port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
> 
> 
> - Packets larger than 1028 bytes are dropped. Guess this needs to be fixed, and we need to state that jumbo frames are not supported. Are you planning on adding this?
> 
>   Currently I can find no mention of an MTU limitation in the documentation, or any code to prevent it from being changed above the supported limit.

Actually, jumbo frames are supported, but yes, the packet size
is limited by the page size. So, jumbo frames up to ~3.5K should
be supported without issues.
We'll need to determine the upper limit and reject a requested MTU
if it's larger.
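
A rough sketch of what such a check could look like. All names and constants
here are assumptions for illustration (a 4 KiB frame, 256 bytes of XDP
headroom, 256 bytes of per-packet metadata); the real values have to come from
the actual umem layout in the patch:

```c
#include <errno.h>

/* All constants below are assumptions for illustration only. */
#define FRAME_SIZE      4096  /* Assumed umem frame size: one 4 KiB page. */
#define FRAME_HEADROOM   256  /* Assumed headroom reserved for XDP. */
#define PACKET_METADATA  256  /* Assumed space for per-packet metadata. */
#define L2_OVERHEAD       18  /* Ethernet header (14) + one VLAN tag (4). */

/* Largest MTU whose full frame still fits in a single umem frame. */
static int
afxdp_max_mtu(void)
{
    return FRAME_SIZE - FRAME_HEADROOM - PACKET_METADATA - L2_OVERHEAD;
}

/* Hypothetical set_mtu check: returns 0 if 'mtu' fits, EINVAL otherwise. */
static int
afxdp_check_mtu(int mtu)
{
    if (mtu <= 0 || mtu > afxdp_max_mtu()) {
        return EINVAL;
    }
    return 0;
}
```

With these assumed numbers the limit comes out at 3566 bytes, in line with
the ~3.5K estimate, so a request for MTU 9000 would be rejected.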

> 
> 
> - ovs-vswitchd is still crashing or stops forwarding packets when trying to do
>   PVP testing with Qemu that has a TAP interface doing XDP and running packets
>   at wire speed to the 10G interface.

Actually, there are a lot of places in the current version where rings/umems could
be corrupted, leading to unpredictable memory corruptions, crashes, or time wasted
trying to allocate exhausted resources.

> 
>   When trying with lower volume packets it seems to work, so with 1% traffic
>   rate, it forwards packets without any problems (148,771 pps). If I go to
>   10% the first couple of packets pass, then it stops forwarding. If it's not
>   crashing I still see packets being received by eno1 flow rules, but no
>   packets make it to the VM.

<snip>

> 
> 
> 
> The following might be useful when combining DPDK and AF_XDP:
> 
>   Currently, DPDK and AF_XDP polling can be combined on a single PMD thread, it
>   might be nice to have an option to not do this, i.e. have separate PMD
>   threads for each type. I know we can do this with assigning specific PMDs to
>   queues, but this will disable auto-balancing. This will also help later if
>   we would add poll() mode support for AF_XDP.

This might make some sense, but certainly not at this stage of development.
I don't think that we should expect any production-level performance or a fine-grained
solution that must work perfectly in all corner cases.
For now it's better to focus on making it reliable: at least handling all the control
sequences (stop/start/restart/reconfigure) and being able to forward traffic without
hangs/crashes/memory corruptions.


<snip>

>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>> index 859c05613ddf..cc91720fad6e 100644
>> --- a/lib/dpif-netdev-perf.h
>> +++ b/lib/dpif-netdev-perf.h
>> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>>  {
>>  #ifdef DPDK_NETDEV
>>      return s->last_tsc = rte_get_tsc_cycles();
>> +#elif HAVE_AF_XDP
> 
> We need to add support for at least ARM and PPC, not sure how to do this nicely.

IMHO, it's not required for the experimental feature. But yes,
we'll need to add support later. I'm thinking about a fallback to
CLOCK_MONOTONIC_RAW (not that portable either) and further to just
time_usec().
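
A minimal sketch of how such a fallback chain might look. read_cycles() is a
hypothetical name, and the last branch simply returns 0 where OVS would
presumably call time_usec(). Note that the units differ per branch (TSC ticks
vs. nanoseconds), which real code would have to account for:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: reads a free-running counter.  On x86_64 this is the
 * TSC; elsewhere it is nanoseconds from CLOCK_MONOTONIC_RAW (Linux-specific),
 * so the units are not comparable across platforms. */
static uint64_t
read_cycles(void)
{
#if defined(__x86_64__)
    uint32_t h, l;

    /* rdtsc returns the low 32 bits in EAX and the high 32 bits in EDX. */
    __asm__ volatile("rdtsc" : "=a" (l), "=d" (h));
    return ((uint64_t) h << 32) | l;
#elif defined(CLOCK_MONOTONIC_RAW)
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts)) {
        return 0;
    }
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
#else
    /* Placeholder for the final fallback (time_usec() in OVS). */
    return 0;
#endif
}
```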

> This code is already a quick cut/paste from DPDK, license?

Good question.
If you're worried, I could gift a version written from scratch (I swear):

---
    uint32_t h, l;

    asm volatile("rdtsc" : "=a" (l), "=d" (h));

    return s->last_tsc = ((uint64_t) h << 32) | l;
---

> 
>> +    /* This is x86-specific instructions. */
>> +    union {
>> +        uint64_t tsc_64;
>> +        struct {
>> +            uint32_t lo_32;
>> +            uint32_t hi_32;
>> +        };
>> +    } tsc;
>> +    asm volatile("rdtsc" :
>> +             "=a" (tsc.lo_32),
>> +             "=d" (tsc.hi_32));
>> +
>> +    return s->last_tsc = tsc.tsc_64;
>>  #else
>>      return s->last_tsc = 0;
>>  #endif
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> new file mode 100644
>> index 000000000000..cd1b9ca8be77
>> --- /dev/null
>> +++ b/lib/netdev-afxdp.c
>> @@ -0,0 +1,727 @@
>> +/*
>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#if !defined(__i386__) && !defined(__x86_64__)
>> +#error AF_XDP supported only for Linux on x86 or x86_64
> 
> Any reason why we do not support PPC and ARM?

Simple: rdtsc.

Actually, right now we need to restrict support to x86_64 only, because
the rdtsc code above is in 64-bit form and will not work on a 32-bit CPU.

<snip>

>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>> new file mode 100644
>> index 000000000000..3dd3d902b3c4
>> --- /dev/null
>> +++ b/lib/netdev-linux-private.h
>> @@ -0,0 +1,124 @@
>> +/*
>> + * Copyright (c) 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef NETDEV_LINUX_PRIVATE_H
>> +#define NETDEV_LINUX_PRIVATE_H 1
>> +
>> +#include <config.h>
>> +
>> +#include <linux/filter.h>
>> +#include <linux/gen_stats.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/if_tun.h>
>> +#include <linux/types.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/mii.h>
>> +#include <stdint.h>
>> +#include <stdbool.h>
>> +
>> +#include "netdev-provider.h"
>> +#include "netdev-tc-offloads.h"
>> +#include "netdev-vport.h"
>> +#include "openvswitch/thread.h"
>> +#include "ovs-atomic.h"
>> +#include "timer.h"
> 
> Why include all the above? They were just added to netdev-linux.h, so if you make sure you include netdev-linux.h before -private it should work out.

This doesn't look right. A file should include everything it uses.
If something from the above headers is used in this file, the headers
should stay. But if a header is not used in this file, it should
not be included here. Otherwise we'll mess up all the includes.

<snip>

>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>  {
>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>      struct netdev *netdev = rx->up.netdev;
>> -    struct dp_packet *buffer;
>> +    struct dp_packet *buffer = NULL;
>>      ssize_t retval;
>>      int mtu;
>>
>> +#if HAVE_AF_XDP
> 
> Think this #if HAVE_AF_XDP can be removed as the compiler should optimize out the if (false).

I guess this will cause build failure with -O0.
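
One common way to drop the #if at the call site without depending on
dead-code elimination is to let the header provide a stub when the
feature is compiled out. A minimal sketch follows; all sketch_* names
are hypothetical stand-ins and this is not the layout the patch uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_netdev;   /* Stand-in for OVS 'struct netdev'. */

#ifdef HAVE_AF_XDP
/* The real implementation would live in the AF_XDP source file. */
int sketch_afxdp_reconfigure(struct sketch_netdev *netdev);
#else
/* Stub: keeps callers compiling and linking when AF_XDP is disabled,
 * regardless of the optimization level. */
static inline int
sketch_afxdp_reconfigure(struct sketch_netdev *netdev)
{
    (void) netdev;
    return EOPNOTSUPP;
}
#endif

/* The caller needs no #if of its own. */
static int
sketch_caller(struct sketch_netdev *netdev)
{
    return sketch_afxdp_reconfigure(netdev);
}
```

With this pattern the symbol always exists, so an -O0 build does not
hit an undefined reference even though the branch is never taken.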

<snip>

>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>> new file mode 100644
>> index 000000000000..aabaa8e5df24
>> --- /dev/null
>> +++ b/lib/xdpsock.h
>> @@ -0,0 +1,123 @@
>> +/*
>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef XDPSOCK_H
>> +#define XDPSOCK_H 1
>> +
>> +#include <bpf/libbpf.h>
>> +#include <bpf/xsk.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <libgen.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_link.h>
>> +#include <linux/if_xdp.h>
>> +#include <linux/if_ether.h>
>> +#include <locale.h>
>> +#include <net/if.h>
>> +#include <poll.h>
>> +#include <pthread.h>
>> +#include <signal.h>
>> +#include <stdbool.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/resource.h>
>> +#include <sys/socket.h>
>> +#include <sys/types.h>
>> +#include <sys/mman.h>
>> +#include <time.h>
>> +#include <unistd.h>
>> +
>> +#include "openvswitch/thread.h"
>> +#include "ovs-atomic.h"
>> +
>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>> +#define BATCH_SIZE      NETDEV_MAX_BURST
> 
> Move this item to the bottom, so you have FRAME specific define's first
> 
>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>> +
>> +#define NUM_FRAMES      4096
> 
> Should we add a note/check to make sure this value is a power of 2?
> 
>> +#define PROD_NUM_DESCS  512
>> +#define CONS_NUM_DESCS  512
>> +
>> +#ifdef USE_XSK_DEFAULT
>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>> +#endif
> 
> Any reason for having this? Should we use the default values? They are 4x larger than you have, did it make any difference in performance results?
> We could make it configurable like for DPDK, using the n_txq_desc/n_rxq_desc option.

I think this is not necessary right now and could be done later.
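
On the separate question of checking that NUM_FRAMES is a power of 2,
a compile-time check is cheap. A minimal sketch using plain C11
follows; the SKETCH_* names are hypothetical, and OVS may prefer its
own build-assert macros over _Static_assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_FRAMES 4096

/* True if 'x' is nonzero and has exactly one bit set. */
#define SKETCH_IS_POW2(x) ((x) != 0 && (((x) & ((x) - 1)) == 0))

/* Fails the build if someone changes the constant to e.g. 3000. */
_Static_assert(SKETCH_IS_POW2(SKETCH_NUM_FRAMES),
               "NUM_FRAMES must be a power of 2");

/* Runtime variant, useful if the value ever becomes configurable. */
static bool
sketch_is_pow2(uint32_t x)
{
    return SKETCH_IS_POW2(x);
}
```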

<snip>
William Tu May 18, 2019, 1:16 a.m. UTC | #9
Hi Eelco,

Thanks for all the feedback. There are some issues in the driver, some
in libbpf, and some in my implementation. I will work on them ASAP.

On Fri, May 17, 2019 at 3:23 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> Hi William,
>
> First a list of issues I found during some basic testing...
>
> - When I restart or stop OVS (using the systemctl interface as found in
> RHEL) it does not clean up the BPF program, causing the restart to fail:
>
>    2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> reconfig fails
>    2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
> interface eno1 new configuration
>
>    I need to manually run "ip link set dev eno1 xdp off" to make it
> recover.

I think this is a bug in libbpf, see
[PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown

>
>
> - When I remove a bridge, I get an EMER in the revalidator:
>
>    2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
>    2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
> assertion !fd != !wevent failed in poll_create_node()
>
>    Easy to replicate with this:
>
>      $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
> datapath_type=netdev
>      $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
> type="afxdp" options:xdpmode=drv
>      $ ovs-vsctl del-br ovs_pvp_br0
>
Thanks, I can reproduce it. Will make sure to fix it.

>
> - High pmd usage in the statistics, even with no packets. Is this
> expected?
>
>    $ ovs-appctl dpif-netdev/pmd-rxq-show
>    pmd thread numa_id 0 core_id 1:
>      isolated : false
>      port: dpdk0             queue-id:  0  pmd usage:  0 %
>      port: eno1              queue-id:  0  pmd usage: 49 %
>
>    It goes up slowly and gets stuck at 49%
>
>
> - When doing the PVP testing I noticed that the physical port has odd/no
>    tx statistics:
>
>    $ ovs-ofctl dump-ports ovs_pvp_br0
>    OFPST_PORT reply (xid=0x2): 3 ports
>      port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0,
> crc=0
>               tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>      port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0,
> frame=0, over=0, crc=0
>               tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>      port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0,
> over=0, crc=0
>               tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>
I think the ixgbe driver has some issue.
If you run skb-mode, I think the stats are correct.
See patch [1/2] ixgbe: fix AF_XDP tx byte count

>
> - Packets larger than 1028 bytes are dropped. Guess this needs to be
> fixed, and we need to state that jumbo frames are not supported. Are you
> planning on adding this?
>
>    Currently I can find no mention of an MTU limitation in the
> documentation, or any code to prevent it from being changed above the
> supported limit.
>
>
> - ovs-vswitchd is still crashing or stops forwarding packets when trying
>    to do PVP testing with Qemu that has a TAP interface doing XDP and
>    running packets at wire speed to the 10G interface.
>
>    When trying with lower volume packets it seems to work, so with 1%
>    traffic rate, it forwards packets without any problems (148,771 pps).
>    If I go to 10% the first couple of packets pass, then it stops
>    forwarding. If it's not crashing I still see packets being received
>    by eno1 flow rules, but no packets make it to the VM.
>
>      Program terminated with signal SIGSEGV, Segmentation fault.
>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>      654            ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count,
> (void **)elems_pop);
>      [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
>      Missing separate debuginfos, use: dnf debuginfo-install
> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
>      (gdb) bt
>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>      #1  0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540,
> qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized
> out>) at lib/netdev-linux.c:1486
>      #2  0x0000000000906051 in netdev_send (netdev=<optimized out>,
> qid=qid@entry=0, batch=batch@entry=0x7fc928005570,
> concurrent_txq=concurrent_txq@entry=true)
>          at lib/netdev.c:797
>      #3  0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port
> (pmd=pmd@entry=0x7fc95e735010, p=p@entry=0x7fc928005540) at
> lib/dpif-netdev.c:4185
>      #4  0x00000000008d2faf in dp_netdev_pmd_flush_output_packets
> (pmd=pmd@entry=0x7fc95e735010, force=force@entry=false) at
> lib/dpif-netdev.c:4225
>      #5  0x00000000008db317 in dp_netdev_pmd_flush_output_packets
> (force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
>      #6  dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fc95e735010,
> rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
>      #7  0x00000000008db67d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5446
>      #8  0x000000000095c96d in ovsthread_wrapper (aux_=<optimized out>)
> at lib/ovs-thread.c:352
>      #9  0x00007fc9789d62de in start_thread () from
> /lib64/libpthread.so.0
>      #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
>
>
> - make check-afxdp is failing for me, however, make check-kernel works
> fine.
>    Did not dive into it too much, but it fails here for all test cases,
> this is the same build I use for testing.
>
>    ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge
> br0 datapath_type=netdev
> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
> fail-mode=secure  --
>    --- /dev/null        2019-05-16 09:09:33.445562692 -0400
>    +++
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr        2019-05-17
> 05:46:20.506814939 -0400
>    @@ -0,0 +1,2 @@
>    +ovs-vsctl: Error detected while setting up 'br0'.  See ovs-vswitchd
> log for details.
>    +ovs-vsctl: The default log directory is
> "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
>    ovsdb-server.log:
>   > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
>   > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server (Open
> vSwitch) 2.11.90
>    ovs-vswitchd.log:
>   > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
>   > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU cores
> on NUMA node 0
>   > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes
> and 28 CPU cores
>   >
> 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connecting...
>   >
> 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connected
>   > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open
> vSwitch) 2.11.90
>   > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev:
> creating tap device failed: Device or resource busy

I think the tap device was not cleaned up from a previous run.
Try
# ip link del dev ovs-netdev
then check again.

>   > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev already
> exists but cannot be opened: No such device
>   > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open
> datapath of type netdev: No such device
>   > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open datapath
> br0: No such device
>   > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge
> br0: No such device
>    1. system-afxdp-traffic.at:3:  FAILED (system-afxdp-traffic.at:4)
>
>
>
>
> The following might be useful when combining DPDK and AF_XDP:
>
>    Currently, DPDK and AF_XDP polling can be combined on a single PMD
>    thread. It might be nice to have an option to not do this, i.e. have
>    separate PMD threads for each type. I know we can do this by assigning
>    specific PMDs to queues, but this will disable auto-balancing. This
>    will also help later if we would add poll() mode support for AF_XDP.
>
>
> Other review comments see inline below. I reviewed the code, not the
> unit tests or automake changes.
>
<snip>

> > +.. note::
> > +   OVS AF_XDP netdev is using the userspace datapath, the same
> > datapath
> > +   as used by OVS-DPDK.  So it requires --disable-system for
> > ovs-vswitchd
> > +   and datapath_type=netdev when adding a new bridge.
>
> As mentioned earlier offline I think --disable-system can be removed as
> the Kernel and userspace datapath can be run at the same time.
>
Yes, thanks

> > +
> > +Make sure your device driver support AF_XDP, and to use 1 PMD (on
> > core 4)
> > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
> > "skb"::
>
> Wondering how options:xdpmode should operate without it being specified?
> I would prefer that if the option is not specified it would try drv, and
> if it fails fallback to skb.
>
By default it uses skb mode. I prefer skb mode as the default since
it has fewer driver-related issues.
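
For comparison, the "try drv, then fall back to skb" behaviour
suggested above could look roughly like the sketch below. It is only
an illustration: the attach callback and all sketch_* names are
hypothetical, and no real libbpf or OVS call is used.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_xdp_mode { SKETCH_XDP_DRV, SKETCH_XDP_SKB };

/* Hypothetical callback that attaches an XDP program in 'mode' and
 * returns 0 on success or a positive errno value on failure. */
typedef int (*sketch_attach_fn)(const char *ifname,
                                enum sketch_xdp_mode mode);

/* Tries native (drv) mode first and falls back to generic (skb) mode.
 * On success stores the mode that worked in '*used'. */
static int
sketch_attach_with_fallback(const char *ifname, sketch_attach_fn attach,
                            enum sketch_xdp_mode *used)
{
    int error = attach(ifname, SKETCH_XDP_DRV);

    if (!error) {
        *used = SKETCH_XDP_DRV;
        return 0;
    }
    error = attach(ifname, SKETCH_XDP_SKB);
    if (!error) {
        *used = SKETCH_XDP_SKB;
    }
    return error;
}

/* Fake callback for illustration: pretends the driver has no native
 * XDP support, so only skb mode succeeds. */
static int
sketch_attach_skb_only(const char *ifname, enum sketch_xdp_mode mode)
{
    (void) ifname;
    return mode == SKETCH_XDP_DRV ? EOPNOTSUPP : 0;
}
```

The trade-off is the one noted above: an automatic drv attempt exposes
users to driver-specific problems unless they explicitly ask for skb.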

> We need to add these new options to the vswitch.xml file

OK
>
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Or, use 4 pmds/cores and 4 queues by doing::
> > +
> > +  ethtool -L enp2s0 combined 4
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > +    options:n_rxq=4 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > +
>
> Add some text that pmd-rxq-affinity is not a requirement, the system
> will auto (re)assign.
> Also, note that cores used by pmd-rxq-affinity are not shared/used by
> floating PMDs.

Good point, thanks
>
> > +To validate that the bridge has successfully instantiated, you can
> > use the::
> > +
> > +  ovs-vsctl show
> > +
> > +should show something like::
> > +
> > +  Port "ens802f0"
> > +   Interface "ens802f0"
> > +      type: afxdp
> > +      options: {n_rxq="1", xdpmode=drv}
> > +
> > +Otherwise, enable debug by::
> > +
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +References
> > +----------
> > +Most of the design details are described in the paper presented at
> > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> > +section 4, and slides[2][4].
> > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
> > introduction
> > +about AF_XDP current and future work.
> > +
> > +
> > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > +
> > +[2]
> > http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > +
> > +[3]
> > http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > +
> > +[4]
> > https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > +
> > +
> > +Performance Tuning
> > +------------------
> > +The name of the game is to keep your CPU running in userspace,
> > allowing PMD
> > +to keep polling the AF_XDP queues without any interferences from
> > kernel.
> > +
> > +#. Make sure everything is in the same NUMA node (memory used by
> > AF_XDP, pmd
> > +   running cores, device plug-in slot)
>
> How can you do this? The code is not taking care of NUMA, and memory is
> allocated with posix_memalign so no idea which NUMA node it gets
> allocated.

Right... I'm hoping that users are aware of this and run ovs-vswitchd
using taskset -p <cpu mask>, so that the process runs on the correct
NUMA node.

>
> > +#. Isolate your CPU by doing isolcpu at grub configure.
> > +
> > +#. IRQ should not set to pmd running core.
> > +
> > +#. The Spectre and Meltdown fixes increase the overhead of system
> > calls.
> > +
>
> Maybe be more consistent, either one or two newlines before a heading?
OK

<snip>

> > +
> > +Create OpenFlow rules::
> > +
> > +  ovs-vsctl add-port br0 tap0
>
> Maybe add tap as XDP or else it will be an AF_PACKET interface polling
> in the main thread.

I think it should work (XDP on tap). Let me try.
What's your concern here about polling AF_PACKET?

>
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > +
> > +
> > +Attach the veth port to br0 (linux kernel mode)::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
> > +
>
> Remove the xdpmode=skb above... Also, see above on the PF_PACKET
> interface in the bridge_run(),
> I would advise against using this, and you might want to remove it.

OK
>
> > +
> > +Or, use AF_XDP with skb mode::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
> > options:xdpmode=skb
> > +
> > +Setup the OpenFlow rules::
> > +
<snip>
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index 0976a35e758b..7d086dc5e860 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -22,6 +22,9 @@
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "util.h"
> > +#ifdef HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> Why the protection above? You do not do this in netdev-linux.c.
> Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?
>

OK will fix it

> >  static void
> >  dp_packet_init__(struct dp_packet *b, size_t allocated, enum
> > dp_packet_source source)
> > @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base,
> > size_t allocated)
> >      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> >  }
> >

<snip>

> > --- /dev/null
> > +++ b/lib/netdev-afxdp.c
> > @@ -0,0 +1,727 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#if !defined(__i386__) && !defined(__x86_64__)
> > +#error AF_XDP supported only for Linux on x86 or x86_64
>
> Any reason why we do not support PPC and ARM?
>
> > +#endif
> > +
> > +#include <config.h>
> > +
> > +#include "netdev-linux-private.h"
> > +#include "netdev-linux.h"
>
> Swap the two above, see comment in netdev-linux-private.h
>
> > +#include "netdev-afxdp.h"
> > +
> > +#include <arpa/inet.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <inttypes.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <linux/rtnetlink.h>
> > +#include <linux/sockios.h>
> > +#include <linux/if_xdp.h>
> > +#include <net/if.h>
> > +#include <net/if_arp.h>
> > +#include <net/route.h>
> > +#include <netinet/in.h>
> > +#include <netpacket/packet.h>
> > +#include <poll.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/utsname.h>
> > +#include <unistd.h>
> > +
>
> Some of these includes are included by netdev-linux(-private).h already
> so why not remove them?

OK will remove them.
>
> > +#include "coverage.h"
> > +#include "dp-packet.h"
> > +#include "dpif-netlink.h"
> > +#include "dpif-netdev.h"
> > +#include "fatal-signal.h"
> > +#include "hash.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "netlink-notifier.h"
> > +#include "netlink-socket.h"
> > +#include "netlink.h"
> > +#include "netnsid.h"
> > +#include "openflow/openflow.h"
> > +#include "openvswitch/dynamic-string.h"
> > +#include "openvswitch/hmap.h"
> > +#include "openvswitch/ofpbuf.h"
> > +#include "openvswitch/poll-loop.h"
> > +#include "openvswitch/vlog.h"
> > +#include "openvswitch/shash.h"
> > +#include "ovs-atomic.h"
> > +#include "packets.h"
> > +#include "rtnetlink.h"
> > +#include "socket-util.h"
> > +#include "sset.h"
> > +#include "tc.h"
> > +#include "timer.h"
> > +#include "unaligned.h"
> > +#include "util.h"
> > +#include "xdpsock.h"
> > +
> > +#ifndef SOL_XDP
> > +#define SOL_XDP 283
> > +#endif
> > +#ifndef AF_XDP
> > +#define AF_XDP 44
> > +#endif
> > +#ifndef PF_XDP
> > +#define PF_XDP AF_XDP
> > +#endif
>
> Do we really need to include the above? Or should we update the install
> instruction to move them over from the kernel headers?
>
I think we only need SOL_XDP.

> > +
> > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > +
> > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
> > *)base))
> > +#define UMEM2XPKT(base, i) \
> > +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base
> > + \
> > +                               i * sizeof(struct dp_packet_afxdp))
> > +

<snip>
Some comments here about umem and queue are discussed later with Ilya.
Will address them together.

> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                              struct dp_packet_batch *batch)
> > +{
>
> See Ilya's comment on thread safety on the ring APIs.
>
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t tx_done, idx_cq = 0;
> > +    struct dp_packet *packet;
> > +    uint32_t idx = 0;
> > +    int j, ret, retry_count = 0;
> > +    const int max_retry = 4;
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return EAGAIN;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +        return EAGAIN;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * We can avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
>
> The comment mentions the copy can be avoided, but it's not implemented
> in the code, is this correct or was something removed?
>
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));

It's not implemented yet. For now it always makes a copy.
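
A first step towards skipping the copy would be a check that the
packet data already lives inside this socket's umem region. A minimal
sketch of such a range check follows; the function name and
parameters are hypothetical and this is not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if 'p' points inside the region that starts at
 * 'umem_base' and is 'umem_len' bytes long. */
static bool
sketch_ptr_in_umem(const void *p, const void *umem_base, size_t umem_len)
{
    uintptr_t addr = (uintptr_t) p;
    uintptr_t base = (uintptr_t) umem_base;

    return addr >= base && addr - base < umem_len;
}
```

Even when the check passes, the packet's frame would still have to be
handed over to the TX ring instead of a freshly popped element, so the
buffer ownership rules need more thought than this test alone.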
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +        return ret;
>
> I think we should still try to recover the CQ below, even on failure.
>
> > +    }
> > +
> > +retry:
> > +    /* Process CQ */
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count,
> > &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk->outstanding_tx -= tx_done;
> > +        xsk->tx_npkts += tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void
> > **)elems_push);
> > +    ovs_assert(ret == 0);
> > +
> > +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2))
> > {
> > +        /* If there are still a lot not transmitted, try harder. */
> > +        if (retry_count++ > max_retry) {
> > +            return 0;
> > +        }
> > +        goto retry;
> > +    }
> > +
>
> I think the code above is causing my lockup at wire speed mentioned
> above...
> I guess the retry_count expires every transmit sending packets to the
> TAP interface.
> Now all buffers are used... This is causing the umem_elem_pop_n() in the
> beginning to fail, hence the buffers are never returned!
>
> Guess we might need some reclaim in the beginning, or maybe even in the
> rx loop?

Right, let me rework this part of the code.
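
To make the failure mode concrete, here is a toy model of the
"reclaim at the beginning" idea. It only tracks counters; the struct
and the sketch_* functions are hypothetical and do not touch the real
xsk rings or the umem pool.

```c
#include <assert.h>
#include <errno.h>

struct sketch_txq {
    unsigned int free_elems;   /* Buffers available in the umem pool. */
    unsigned int outstanding;  /* Buffers handed to the kernel for TX. */
    unsigned int completed;    /* Entries waiting in the completion queue. */
};

/* Moves every completed buffer back to the pool.  Returns the number
 * of buffers recycled. */
static unsigned int
sketch_reclaim(struct sketch_txq *q)
{
    unsigned int n = q->completed;

    assert(n <= q->outstanding);
    q->completed = 0;
    q->outstanding -= n;
    q->free_elems += n;
    return n;
}

/* Reclaims first, then reserves 'count' buffers for sending.  Without
 * the reclaim step, a pool drained by earlier sends would return
 * EAGAIN forever, because completions are only processed after a
 * successful reservation. */
static int
sketch_send(struct sketch_txq *q, unsigned int count)
{
    sketch_reclaim(q);
    if (q->free_elems < count) {
        return EAGAIN;
    }
    q->free_elems -= count;
    q->outstanding += count;
    return 0;
}
```

In this model a queue with an empty pool but pending completions can
send again, which is exactly the case that locks up in the quoted code.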

> > +    return 0;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..6518d8fca0b5
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,53 @@
> > +/*
> > + * Copyright (c) 2018 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used
> > directly
> > + * only by Linux-specific code. */
>
> Extra enter?
>
> > +#define MAX_XSKQ 16
>
> Extra enter?
>

OK

> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +
> > +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet
> > *d);
> > +
> > +int xsk_configure_all(struct netdev *netdev);
> > +
> > +void xsk_destroy_all(struct netdev *netdev);
> > +
> > +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > +                         struct dp_packet_batch *batch);
> > +
> > +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                                  struct dp_packet_batch *batch);
> > +
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap
> > *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap
> > *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..3dd3d902b3c4
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,124 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Why include all the above? They were just added to netdev-linux.h, so
> if you make sure you include netdev-linux.h before -private it should
> work out.
>
> > +
> > +#if HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> See earlier comment
>
> > +
> > +/* These functions are Linux specific, so they should be used
> > directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t
> > flag,
> > +                                  const char *flag_name, bool
> > enable);
> > +int linux_get_ifindex(const char *netdev_name);
> > +
>
> These functions are now both specified in netdev-linux.h and
> netdev-linux-private.h
>
> > +#define LINUX_FLOW_OFFLOAD_API                          \
> > +   .flow_flush = netdev_tc_flow_flush,                  \
> > +   .flow_dump_create = netdev_tc_flow_dump_create,      \
> > +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
> > +   .flow_dump_next = netdev_tc_flow_dump_next,          \
> > +   .flow_put = netdev_tc_flow_put,                      \
> > +   .flow_get = netdev_tc_flow_get,                      \
> > +   .flow_del = netdev_tc_flow_del,                      \
> > +   .init_flow_api = netdev_tc_init_flow_api
> > +
>
> Same here, this define is in both include files.
>
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if
> > <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only
> > valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from
> > vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get
> > etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set
> > policing. */
> > +    int get_features_error;     /* Cached error code from
> > ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from
> > SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO.
> > */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the
> > namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface
> > is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG
> > master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info *xsk[MAX_XSKQ];
> > +    int requested_n_rxq;
> > +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > +    int xdp_flags, xdp_bind_flags;
> > +#endif
> > +};
> > +
> > +static struct netdev_linux *
> > +netdev_linux_cast(const struct netdev *netdev)
> > +{
>
> In the original definition there was an assert() here, was it removed by
> accident?
> > +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +}
> > +
> > +#endif /* netdev-linux-private.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..1f190406d145 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -17,6 +17,7 @@
> >  #include <config.h>
> >
> >  #include "netdev-linux.h"
> > +#include "netdev-linux-private.h"
> >
> >  #include <errno.h>
> >  #include <fcntl.h>
> > @@ -54,6 +55,7 @@
> >  #include "fatal-signal.h"
> >  #include "hash.h"
> >  #include "openvswitch/hmap.h"
> > +#include "netdev-afxdp.h"
> >  #include "netdev-provider.h"
> >  #include "netdev-tc-offloads.h"
> >  #include "netdev-vport.h"
> > @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int
> > mtu);
> >  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t
> > burst_bytes);
> >
> > -struct netdev_linux {
> > -    struct netdev up;
> > -
> > -    /* Protects all members below. */
> > -    struct ovs_mutex mutex;
> > -
> > -    unsigned int cache_valid;
> > -
> > -    bool miimon;                    /* Link status of last poll. */
> > -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if
> > <= 0. */
> > -    struct timer miimon_timer;
> > -
> > -    int netnsid;                    /* Network namespace ID. */
> > -    /* The following are figured out "on demand" only.  They are only
> > valid
> > -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > -    int ifindex;
> > -    struct eth_addr etheraddr;
> > -    int mtu;
> > -    unsigned int ifi_flags;
> > -    long long int carrier_resets;
> > -    uint32_t kbits_rate;        /* Policing data. */
> > -    uint32_t kbits_burst;
> > -    int vport_stats_error;      /* Cached error code from
> > vport_get_stats().
> > -                                   0 or an errno value. */
> > -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > or SIOCSIFMTU. */
> > -    int ether_addr_error;       /* Cached error code from set/get
> > etheraddr. */
> > -    int netdev_policing_error;  /* Cached error code from set
> > policing. */
> > -    int get_features_error;     /* Cached error code from
> > ETHTOOL_GSET. */
> > -    int get_ifindex_error;      /* Cached error code from
> > SIOCGIFINDEX. */
> > -
> > -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > -
> > -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO.
> > */
> > -    struct tc *tc;
> > -
> > -    /* For devices of class netdev_tap_class only. */
> > -    int tap_fd;
> > -    bool present;               /* If the device is present in the
> > namespace */
> > -    uint64_t tx_dropped;        /* tap device can drop if the iface
> > is down */
> > -
> > -    /* LAG information. */
> > -    bool is_lag_master;         /* True if the netdev is a LAG
> > master. */
> > -};
> >
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> > @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class
> > *netdev_class)
> >      return netdev_class->run == netdev_linux_run;
> >  }
> >
> > +#if HAVE_AF_XDP
> >  static bool
> > -is_tap_netdev(const struct netdev *netdev)
> > +is_afxdp_netdev(const struct netdev *netdev)
> >  {
> > -    return netdev_get_class(netdev) == &netdev_tap_class;
> > +    return netdev_get_class(netdev) == &netdev_afxdp_class;
> >  }
> > -
> > -static struct netdev_linux *
> > -netdev_linux_cast(const struct netdev *netdev)
> > +#else
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> >  {
> > -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > -
> > -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +    return false;
> > +}
> > +#endif
> > +static bool
> > +is_tap_netdev(const struct netdev *netdev)
> > +{
> > +    return netdev_get_class(netdev) == &netdev_tap_class;
> >  }
> >
> >  static struct netdev_rxq_linux *
> > @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
> >          atomic_count_dec(&miimon_cnt);
> >      }
> >
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev_)) {
> > +        xsk_destroy_all(netdev_);
> > +    }
> > +#endif
>
> Think you can remove the HAVE_AF_XDP here, as you do not use it below
> either.

Yes

>
> >      ovs_mutex_destroy(&netdev->mutex);
> >  }
> >
> > @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq
> > *rxq_)
> >      rx->is_tap = is_tap_netdev(netdev_);
> >      if (rx->is_tap) {
> >          rx->fd = netdev->tap_fd;
> > -    } else {
> > +    } else if (!is_afxdp_netdev(netdev_)) {
> >          struct sockaddr_ll sll;
> >          int ifindex, val;
> >          /* Result of tcpdump -dd inbound */
> > @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> >  {
> >      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >      struct netdev *netdev = rx->up.netdev;
> > -    struct dp_packet *buffer;
> > +    struct dp_packet *buffer = NULL;
> >      ssize_t retval;
> >      int mtu;
> >
> > +#if HAVE_AF_XDP
>
> Think this #if HAVE_AF_XDP can be removed as the compiler should
> optimize out the if (false).
>
> > +    if (is_afxdp_netdev(netdev)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +        int qid = rxq_->queue_id;
> > +
> > +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> > +    }
> > +#endif
> >      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
> >          mtu = ETH_PAYLOAD_MAX;
> >      }
> > @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> >      /* Assume Ethernet port. No need to set packet_type. */
> >      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> >                                             DP_NETDEV_HEADROOM);
> > +
> >      retval = (rx->is_tap
> >                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> >                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> >      int error = 0;
> >      int sock = 0;
> >
> > -    if (!is_tap_netdev(netdev_)) {
> > +    if (!is_tap_netdev(netdev_) &&
> > +        !is_afxdp_netdev(netdev_)) {
> >          if
> > (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >              error = EOPNOTSUPP;
> >              goto free_batch;
> > @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> >          }
> >
> >          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +#if HAVE_AF_XDP
>
> Same here remove the #if HAVE_AF_XDP
>
> > +    } else if (is_afxdp_netdev(netdev_)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct umem_pool *first_mpool;
> > +        struct dp_packet *packet;
> > +
> > +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> > +
> > +        /* all packets must come frome the same umem pool
> > +         * and has DPBUF_AFXDP type, otherwise free on-by-one
> > +         */
> > +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +            if (packet->source != DPBUF_AFXDP) {
> > +                goto free_batch;
> > +            }
> > +
> > +            xpacket = dp_packet_cast_afxdp(packet);
> > +            if (i == 0) {
> > +                first_mpool = xpacket->mpool;
> > +                continue;
> > +            }
> > +            if (xpacket->mpool != first_mpool) {
> > +                goto free_batch;
> > +            }
> > +        }
>
> Why not move all the packet type checks into
> free_afxdp_buf_batch()?

Here I plan to move them into the afxdp-specific send function, so they
won't be part of netdev_linux_send(). Hopefully that is cleaner.
>
> > +        /* free in batch */
> > +        free_afxdp_buf_batch(batch);
> > +        return error;
> > +#endif
> >      } else {
> >          error = netdev_linux_tap_batch_send(netdev_, batch);
> >      }
> > @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "system",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_linux_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
> >  const struct netdev_class netdev_tap_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      .type = "tap",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct_tap,
> >      .get_stats = netdev_tap_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3343,10 +3352,26 @@ const struct netdev_class
> > netdev_internal_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "internal",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_internal_get_stats,
> >      .get_status = netdev_internal_get_status,
> >  };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > +    NETDEV_LINUX_CLASS_COMMON,
> > +    .type = "afxdp",
> > +    .is_pmd = true,
> > +    .construct = netdev_linux_construct,
> > +    .get_stats = netdev_linux_get_stats,
> > +    .get_status = netdev_linux_get_status,
> > +    .set_config = netdev_afxdp_set_config,
> > +    .get_config = netdev_afxdp_get_config,
> > +    .reconfigure = netdev_afxdp_reconfigure,
> > +    .get_numa_id = netdev_afxdp_get_numa_id,
> > +};
> > +#endif
> >
> >
> >  #define CODEL_N_QUEUES 0x0000
> > diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> > index 17ca9120168a..b812e64cb078 100644
> > --- a/lib/netdev-linux.h
> > +++ b/lib/netdev-linux.h
> > @@ -19,6 +19,20 @@
> >
> >  #include <stdint.h>
> >  #include <stdbool.h>
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Is there a reason why you move all these includes here? If there is you
> might as well remove the duplicates from .c files that include
> netdev-linux.h, for example, netdev-linux.c

<snip>
Will work on this part later.

> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp,
> > 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
>
> Move spinlock function out to a common file
>
OK, I plan to add lib/spinlock.h and move to it.
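
As a rough idea of what such a header could contain, here is a minimal
sketch. It is not the code from the patch: it uses plain C11 <stdatomic.h>
instead of "ovs-atomic.h" so that it compiles on its own, and the
spinlock_sketch_* names are made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type; the patch calls its equivalent ovs_spinlock_t. */
struct spinlock_sketch {
    atomic_int locked;          /* 0 = free, 1 = held. */
};

static inline void
spinlock_sketch_init(struct spinlock_sketch *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool
spinlock_sketch_trylock(struct spinlock_sketch *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
spinlock_sketch_lock(struct spinlock_sketch *sl)
{
    while (!spinlock_sketch_trylock(sl)) {
        /* Busy-wait.  A real version would likely add a CPU pause hint. */
    }
}

static inline void
spinlock_sketch_unlock(struct spinlock_sketch *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

The acquire ordering on a successful compare-exchange pairs with the
release store in unlock, which is the same ordering the quoted
ovs_spin_trylock()/ovs_spin_unlock() use.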

> > +
> > +inline int
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>
> This is a stack overflow
>
> > +        return -ENOMEM;
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +
> > +    return 0;
> > +}
> > +
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> > +        /* stack is overflow, this should not happen */
> > +        OVS_NOT_REACHED();
> > +    }
>
> Should this not be moved after the spinlock, i.e. to __umem_elem_push?
OK
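
For illustration, the bounds check could sit inside the locked region
roughly as below. This is only a sketch under assumptions: pool_sketch is a
simplified stand-in for struct umem_pool, an atomic_flag stands in for the
spinlock, and the overflow case returns ENOMEM here whereas the patch
aborts with OVS_NOT_REACHED().

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical, simplified stand-in for the patch's struct umem_pool. */
struct pool_sketch {
    unsigned int index;         /* Number of stored elements. */
    unsigned int size;          /* Capacity of 'array'. */
    atomic_flag lock;           /* Stand-in for ovs_spinlock_t. */
    void **array;
};

static void
pool_sketch_init(struct pool_sketch *p, void **array, unsigned int size)
{
    p->index = 0;
    p->size = size;
    p->array = array;
    atomic_flag_clear(&p->lock);
}

static void
pool_sketch_lock(struct pool_sketch *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock,
                                             memory_order_acquire)) {
        /* Spin until the flag is cleared. */
    }
}

static void
pool_sketch_unlock(struct pool_sketch *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Pushes 'addr'.  Returns 0 on success, ENOMEM if the pool is full.
 * 'index' is only read and written while the lock is held. */
static int
pool_sketch_push(struct pool_sketch *p, void *addr)
{
    int error = 0;

    pool_sketch_lock(p);
    if (p->index < p->size) {
        p->array[p->index++] = addr;
    } else {
        error = ENOMEM;
    }
    pool_sketch_unlock(p);

    return error;
}
```

With the check outside the lock, another thread could fill the last slot
between the check and the store; doing both under the lock closes that
window.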

>
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
>
> There is no check here to see if there are actually any elements left,
> like there is for pop_n, so we could corrupt memory/umem_pool.
>
Yes, I will add a check.
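
A minimal sketch of what such a check might look like, assuming the
unlocked helper is allowed to return NULL when the pool is empty (the
lifo_sketch names are hypothetical, and the caller is assumed to hold the
pool lock as in the patch):

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's struct umem_pool. */
struct lifo_sketch {
    int index;                  /* Number of stored elements. */
    unsigned int size;          /* Capacity of 'array'. */
    void **array;
};

/* Unlocked pop.  Returns the most recently pushed pointer, or NULL if
 * there is nothing left, instead of decrementing 'index' below zero. */
static void *
lifo_sketch_pop__(struct lifo_sketch *p)
{
    if (p->index <= 0) {
        return NULL;
    }
    return p->array[--p->index];
}
```

Callers would then have to treat NULL as "out of umem elements", much like
the -ENOMEM return of the pop_n variant.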

> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ptr;
> > +}
> > +
> > +void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(void *)) == 0);
>
> We should not assert, just return NULL here.
>
> > +    memset(bufs, 0, size * sizeof(void *));
> > +    return (void **)bufs;
> > +}
> > +
> > +unsigned int
> > +umem_elem_count(struct umem_pool *mpool)
> > +{
> > +    return mpool->index;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        OVS_NOT_REACHED();
>
> If NULL is returned return ENOMEM
>
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->mutex);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > +    free(umemp->array);
>         umemp->array = NULL;
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    /* TODO: check HAVE_POSIX_MEMALIGN  */
>
> Guess the above needs to be done
>
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(struct dp_packet_afxdp))
> > == 0);
>
> We should not assert, just return false
>
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free(xp->array);
>         xp->array = NULL;
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..aabaa8e5df24
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,123 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <bpf/libbpf.h>
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <getopt.h>
> > +#include <libgen.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/if_ether.h>
> > +#include <locale.h>
> > +#include <net/if.h>
> > +#include <poll.h>
> > +#include <pthread.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <sys/mman.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
>
> Move this item to the bottom, so you have the FRAME-specific defines first
>
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define NUM_FRAMES      4096
>
> Should we add a note/check to make sure this value is a power of 2?

Sure, will do it.
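
One possible shape for the check, sketched with the standard C11
_Static_assert so that it is self-contained; in the tree, OVS's own
build-time assertion helpers would presumably be used instead. The
*_SKETCH names are made up for this example.

```c
#define NUM_FRAMES_SKETCH 4096  /* Same value as NUM_FRAMES in the patch. */

/* Evaluates to nonzero if X is a nonzero power of two. */
#define IS_POW2_SKETCH(X) ((X) != 0 && ((X) & ((X) - 1)) == 0)

/* Breaks the build if someone changes the value to a non-power-of-2. */
_Static_assert(IS_POW2_SKETCH(NUM_FRAMES_SKETCH),
               "NUM_FRAMES must be a power of 2");
```

A compile-time check costs nothing at runtime and catches the mistake
before the value ever reaches the ring setup code.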

>
> > +#define PROD_NUM_DESCS  512
> > +#define CONS_NUM_DESCS  512
> > +
> > +#ifdef USE_XSK_DEFAULT
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#endif
>
> Any reason for having this? Should we use the default values? They are
> 4x larger than you have, did it make any difference in performance
> results?
> We could make it configurable like for DPDK, using the
> n_txq_desc/n_rxq_desc option.

Will add this option later.

>
> > +
> > +typedef struct {
> > +    atomic_int locked;
> > +} ovs_spinlock_t;
> > +
>
> Think we should move the ovs_spinlock code and includes to some global
> place, maybe util or thread

ok, will move to lib/spinlock.h

>
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > +    int index;      /* point to top */
> > +    unsigned int size;
> > +    ovs_spinlock_t mutex;
> > +    void **array;   /* a pointer array, point to umem buf */
> > +};
> > +
> > +/* array-based dp_packet_afxdp */
> > +struct xpacket_pool {
> > +    unsigned int size;
> > +    struct dp_packet_afxdp **array;
> > +};
> > +
> > +struct xsk_umem_info {
> > +    struct umem_pool mpool;
> > +    struct xpacket_pool xpool;
> > +    struct xsk_ring_prod fq;
> > +    struct xsk_ring_cons cq;
> > +    struct xsk_umem *umem;
> > +    void *buffer;
> > +};
> > +
> > +struct xsk_socket_info {
> > +    struct xsk_ring_cons rx;
> > +    struct xsk_ring_prod tx;
> > +    struct xsk_umem_info *umem;
> > +    struct xsk_socket *xsk;
> > +    unsigned long rx_npkts;
> > +    unsigned long tx_npkts;
> > +    unsigned long prev_rx_npkts;
> > +    unsigned long prev_tx_npkts;
> > +    uint32_t outstanding_tx;
> > +};
> > +
> > +struct umem_elem {
> > +    struct umem_elem *next;
> > +};
> > +
> > +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> > +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void *__umem_elem_pop(struct umem_pool *umemp);
> > +void *umem_elem_pop(struct umem_pool *umemp);
> > +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void **__umem_pool_alloc(unsigned int size);
> > +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > +void umem_pool_cleanup(struct umem_pool *umemp);
> > +unsigned int umem_elem_count(struct umem_pool *mpool);
> > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > +
>
> Think all the __umem_* functions are only used internally, so they should
> become static and be removed here.
>
OK.

Regards,
William
Eelco Chaudron May 20, 2019, 10:38 a.m. UTC | #10
On 17 May 2019, at 14:39, Ilya Maximets wrote:

> Hi.
> Just a few comments on the issues you've listed.
>
> Best regards, Ilya Maximets.
>
> On 17.05.2019 13:23, Eelco Chaudron wrote:
>> Hi William,
>>
>> First a list of issues I found during some basic testing...
>>
>> - When I restart or stop OVS (using the systemctl interface as found
>> in RHEL) it does not clean up the BPF program, causing the restart to
>> fail:
>>
>>   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
>>   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface eno1 new configuration
>>
>>   I need to manually run "ip link set dev eno1 xdp off" to make it 
>> recover.
>
> Userspace datapath requires '--cleanup' option passed to 'ovs-appctl 
> exit'
> to clean up allocated resources. Otherwise datapath will not be 
> destroyed,
> i.e. netdev will not be destroyed --> no xdp program unloading.

Maybe we should try to reload/cleanup at startup?

>>
>> - When I remove a bridge, I get an emer in the revalidator:
>>
>>   2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
>>   2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111: assertion !fd != !wevent failed in poll_create_node()
>>
>
> This actually should never happen. Looks like a memory corruption.
>
>>   Easy to replicate with this:
>>
>>     $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0 datapath_type=netdev
>>     $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1 type="afxdp" options:xdpmode=drv
>>     $ ovs-vsctl del-br ovs_pvp_br0
>>
>>
>> - High pmd usage in the statistics, even with no packets. Is this
>> expected?
>>
>>   $ ovs-appctl dpif-netdev/pmd-rxq-show
>>   pmd thread numa_id 0 core_id 1:
>>     isolated : false
>>     port: dpdk0             queue-id:  0  pmd usage:  0 %
>>     port: eno1              queue-id:  0  pmd usage: 49 %
>>
>>   It goes up slowly and gets stuck at 49%
>>
>>
>> - When doing the PVP testing I noticed that the physical port has 
>> odd/no
>>   tx statistics:
>>
>>   $ ovs-ofctl dump-ports ovs_pvp_br0
>>   OFPST_PORT reply (xid=0x2): 3 ports
>>     port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
>>              tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>>     port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0, frame=0, over=0, crc=0
>>              tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>>     port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0, over=0, crc=0
>>              tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>>
>>
>> - Packets larger than 1028 bytes are dropped. Guess this needs to be 
>> fixed, and we need to state that jumbo frames are not supported. Are 
>> you planning on adding this?
>>
>>   Currently I can find no mention of an MTU limitation in the
>> documentation, or any code to prevent the MTU from being changed above
>> the supported limit.
>
> Actually Jumbo frames are supported, but yes, the packet size
> is limited by the page size. So, jumbo frames up to ~3.5K should
> be supported without issues.
> We'll need to determine the upper limit and reject requested mtu
> if it's larger.

Currently, non-jumbo frames are not even working, and I think a jumbo
check should be added as we allocate chunks of 2048.
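
To make the idea concrete, here is a rough sketch of such a check. The
numbers are assumptions, not values taken from the patch: a 2048-byte
chunk, 256 bytes of headroom (the kernel's XDP_PACKET_HEADROOM), and room
for an Ethernet header plus one VLAN tag. Any per-packet metadata kept
inside the chunk would lower the limit further, and all the *_SKETCH and
afxdp_sketch_* names are hypothetical.

```c
#include <errno.h>

#define FRAME_SIZE_SKETCH      2048  /* Chunk size mentioned above. */
#define FRAME_HEADROOM_SKETCH  256   /* Assumed: XDP_PACKET_HEADROOM. */
#define ETH_HEADER_LEN_SKETCH  14    /* Ethernet header. */
#define VLAN_HEADER_LEN_SKETCH 4     /* One 802.1Q tag. */

/* Largest MTU whose whole frame still fits in one chunk after the
 * headroom, under the assumptions listed above. */
static int
afxdp_sketch_max_mtu(void)
{
    return FRAME_SIZE_SKETCH - FRAME_HEADROOM_SKETCH
           - ETH_HEADER_LEN_SKETCH - VLAN_HEADER_LEN_SKETCH;
}

/* Returns 0 if 'mtu' can be supported, EINVAL otherwise, so that a
 * set-MTU request can be rejected up front. */
static int
afxdp_sketch_check_mtu(int mtu)
{
    return mtu > 0 && mtu <= afxdp_sketch_max_mtu() ? 0 : EINVAL;
}
```

Under these assumptions the limit works out to 2048 - 256 - 14 - 4 = 1774
bytes, so a standard 1500-byte MTU would pass and a 9000-byte one would be
rejected.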

>>
>>
>> - ovs-vswitchd is still crashing or stops forwarding packets when 
>> trying to do
>>   PVP testing with Qemu that has a TAP interface doing XDP and 
>> running packets
>>   at wire speed to the 10G interface.
>
> Actually, there are a lot of places in current version where 
> rings/umems could
> be corrupted leading to unpredictable memory corruptions/crashes/time 
> wasting
> trying to allocate exhausted resources.

ACK :)

>>
>>   When trying with lower volume packets it seems to work, so with 1% 
>> traffic
>>   rate, it forwards packets without any problems (148,771 pps). If I 
>> go to
>>   10% the first couple of packet pass, then it stops forwarding. If 
>> it's not
>>   crashing I still see packets being received by eno1 flow rules, 
>> but no
>>   packets make it to the VM.
>
> <snip>
>
>>
>>
>>
>> The following might be useful when combining DPDK and AF_XDP:
>>
>>   Currently, DPDK and AF_XDP polling can be combined on a single PMD 
>> thread, it
>>   might be nice to have an option to not do this, i.e. have separate 
>> PMD
>>   threads for each type. I know we can do this with assigning 
>> specific PMDs to
>>   queues, but this will disable auto-balancing. This will also help 
>> later if
>>   we would add poll() mode support for AF_XDP.
>
> This might make some sense, but certainly not at this stage of
> development.
> I don't think that we should expect any production level performance 
> or fine grained
> solution that must work perfectly in all corner cases.
> For now it's better to focus on making it reliable. At least to handle 
> all the control
> sequences (stop/start/restart/reconfigure) and ability to forward 
> traffic without
> hangs/crashes/memory corruptions.

Agreed, this was just something to take into consideration for further 
development…

>
> <snip>
>
>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>> index 859c05613ddf..cc91720fad6e 100644
>>> --- a/lib/dpif-netdev-perf.h
>>> +++ b/lib/dpif-netdev-perf.h
>>> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>>>  {
>>>  #ifdef DPDK_NETDEV
>>>      return s->last_tsc = rte_get_tsc_cycles();
>>> +#elif HAVE_AF_XDP
>>
>> We need to add support for at least ARM and PPC, not sure how to do 
>> this nicely.
>
> IMHO, it's not required for the experimental feature. But yes,
> we'll need to add support later. I'm thinking about fallback to
> CLOCK_MONOTONIC_RAW (not that portable too) and further to just
> time_usec().

This might add some system calls and this function is called quite 
often.
Maybe just return 0 in these cases (if we do not divide by it :).
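
Something along these lines, as a sketch only: rdtsc on x86_64, the
generic timer counter on 64-bit ARM, and 0 everywhere else so that no
system call is needed. The cycles_sketch() name is made up, PPC is left to
the fallback here, and note that the ARM counter ticks at a fixed timer
frequency rather than at the CPU clock.

```c
#include <stdint.h>

static inline uint64_t
cycles_sketch(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;

    /* rdtsc returns the low half in EAX and the high half in EDX. */
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t cnt;

    /* Virtual count register of the ARM generic timer. */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (cnt));
    return cnt;
#else
    /* No cheap counter known for this architecture: report 0. */
    return 0;
#endif
}
```

Any consumer of the value would need to cope with a constant 0, for
example by skipping per-cycle statistics rather than dividing by the
difference.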

>> This code is already a quick cut/paste from DPDK, license?
>
> Good question.

I guess copying is fine as long as we give them credit. I was more
hinting that if we do it for Intel 64-bit, we might as well do it for ARM
and PPC and add experimental support for those…

> If you're worried, I could gift a version written from scratch (I swear):
>
> ---
>     uint32_t h, l;
>
>     asm volatile("rdtsc" : "=a" (l), "=d" (h));
>
>     return s->last_tsc = ((uint64_t) h << 32) | l;
> ---
>
>>
>>> +    /* This is x86-specific instructions. */
>>> +    union {
>>> +        uint64_t tsc_64;
>>> +        struct {
>>> +            uint32_t lo_32;
>>> +            uint32_t hi_32;
>>> +        };
>>> +    } tsc;
>>> +    asm volatile("rdtsc" :
>>> +             "=a" (tsc.lo_32),
>>> +             "=d" (tsc.hi_32));
>>> +
>>> +    return s->last_tsc = tsc.tsc_64;
>>>  #else
>>>      return s->last_tsc = 0;
>>>  #endif
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> new file mode 100644
>>> index 000000000000..cd1b9ca8be77
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -0,0 +1,727 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, 
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>> implied.
>>> + * See the License for the specific language governing permissions 
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#if !defined(__i386__) && !defined(__x86_64__)
>>> +#error AF_XDP supported only for Linux on x86 or x86_64
>>
>> Any reason why we do not support PPC and ARM?
>
> Simple: rdtsc.
>
> Actually, right now we need to restrict support to only x86_64, 
> because
> above rdtsc is in 64bit form and will not work for 32bit cpu.

If this is all, why not copy the rest from DPDK and support all…

> <snip>
>
>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>> new file mode 100644
>>> index 000000000000..3dd3d902b3c4
>>> --- /dev/null
>>> +++ b/lib/netdev-linux-private.h
>>> @@ -0,0 +1,124 @@
>>> +/*
>>> + * Copyright (c) 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, 
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>> implied.
>>> + * See the License for the specific language governing permissions 
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>
>> Why include all the above? They were just added to netdev-linux.h, 
>> so if you make sure you include netdev-linux.h before -private it 
>> should work out.
>
> This doesn't look right. File should include everything it uses.
> If something from above headers used in this file, headers should
> stay. But if the header is not used in this file, it should be
> not included here. Otherwise we'll mess up all the includes.
>

ACK, guess just some checking/cleanup is needed.

> <snip>
>
>>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq 
>>> *rxq_, struct dp_packet_batch *batch,
>>>  {
>>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>>      struct netdev *netdev = rx->up.netdev;
>>> -    struct dp_packet *buffer;
>>> +    struct dp_packet *buffer = NULL;
>>>      ssize_t retval;
>>>      int mtu;
>>>
>>> +#if HAVE_AF_XDP
>>
>> Think this #if HAVE_AF_XDP can be removed as the compiler should 
>> optimize out the if (false).
>
> I guess this will cause build failure with -O0.
>
Guess you are right, looking at it again, it will fail at:

return netdev_linux_rxq_xsk(dev->xsk[qid], batch);

as the dev structure does not have dev->xsk…
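
A stripped-down illustration of why the guard has to stay, as a sketch with made-up names (struct demo_netdev, demo_rxq_recv_path() and DEMO_MAX_XSKQ are not from the patch): when the struct member itself is compiled conditionally, any code touching it must sit under the same #if, because an `if (false)` body still has to compile even if the optimizer would drop it later.

```c
#include <assert.h>
#include <stddef.h>

#ifndef HAVE_AF_XDP
#define HAVE_AF_XDP 0          /* Assumed default for this sketch. */
#endif

#define DEMO_MAX_XSKQ 16

struct demo_netdev {
    int fd;
#if HAVE_AF_XDP
    void *xsk[DEMO_MAX_XSKQ];  /* Only exists in AF_XDP builds. */
#endif
};

/* Returns 1 if the queue would take the AF_XDP path, 0 for the regular
 * socket path, -1 if the device has no valid fd. */
static int
demo_rxq_recv_path(const struct demo_netdev *dev, int qid)
{
    assert(dev && qid >= 0 && qid < DEMO_MAX_XSKQ);
#if HAVE_AF_XDP
    /* Without the #if this line fails to compile in non-AF_XDP builds,
     * regardless of optimization level: dev->xsk is not declared. */
    if (dev->xsk[qid]) {
        return 1;
    }
#endif
    return dev->fd >= 0 ? 0 : -1;
}
```

The alternative would be to keep the member unconditionally and only guard its users, at the cost of a few unused bytes per netdev.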


> <snip>
>
>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>> new file mode 100644
>>> index 000000000000..aabaa8e5df24
>>> --- /dev/null
>>> +++ b/lib/xdpsock.h
>>> @@ -0,0 +1,123 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, 
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>> implied.
>>> + * See the License for the specific language governing permissions 
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <bpf/libbpf.h>
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <getopt.h>
>>> +#include <libgen.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/if_link.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <locale.h>
>>> +#include <net/if.h>
>>> +#include <poll.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <sys/mman.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>
>> Move this item to the bottom, so you have FRAME specific define's 
>> first
>>
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define NUM_FRAMES      4096
>>
>> Should we add a note/check to make sure this value is a power of 2?
>>
>>> +#define PROD_NUM_DESCS  512
>>> +#define CONS_NUM_DESCS  512
>>> +
>>> +#ifdef USE_XSK_DEFAULT
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +#endif
>>
>> Any reason for having this? Should we use the default values? They 
>> are 4x larger than you have, did it make any difference in 
>> performance results?
>> We could make it configurable like for DPDK, using the 
>> n_txq_desc/n_rxq_desc option.
>
> I think, this is not necessary right now and could be done later.

Guess we can do this later, but the vswitch.xml document needs some 
clarification, as all options are under the “PMD (Poll Mode Driver) 
Options” section, which, strictly speaking, AF_XDP is not. However, we 
do support the pmd-rxq-affinity/n_rxq options.
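
If the descriptor counts do become configurable later, the value will need validating, which also covers my earlier power-of-2 question. A minimal sketch, assuming a requested count is simply replaced by a known-good default when it is not a power of two; demo_is_pow2() and demo_ring_size() are hypothetical names, and the compile-time check uses plain C11 _Static_assert where OVS code would more likely use its own BUILD_ASSERT_DECL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NUM_FRAMES 4096

/* Compile-time guard for the fixed constant. */
_Static_assert((DEMO_NUM_FRAMES & (DEMO_NUM_FRAMES - 1)) == 0,
               "DEMO_NUM_FRAMES must be a power of 2");

/* True if 'x' is a non-zero power of two. */
static bool
demo_is_pow2(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Returns 'requested' if it is usable as a ring size, else 'fallback'.
 * 'fallback' itself must be a power of two. */
static uint32_t
demo_ring_size(uint32_t requested, uint32_t fallback)
{
    assert(demo_is_pow2(fallback));
    return demo_is_pow2(requested) ? requested : fallback;
}
```

For example, demo_ring_size(2048, 512) keeps 2048, while demo_ring_size(1000, 512) falls back to 512. Whether to silently fall back or to reject the configuration with an error is a separate design choice.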
> <snip>
Eelco Chaudron May 20, 2019, 1:37 p.m. UTC | #11
On 18 May 2019, at 3:16, William Tu wrote:

> Hi Eelco,
>
> Thanks for all the feedback. There are some issues in the driver, some
> in libbpf, and some in my implementation. I will work on it ASAP.

My pleasure, see answers to your questions below…
>
> On Fri, May 17, 2019 at 3:23 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>> Hi William,
>>
>> First a list of issues I found during some basic testing...
>>
>> - When I restart or stop OVS (using the systemctl interface as found 
>> in
>> RHEL) it does not clean up the BPF program, causing the restart to 
>> fail:
>>
>>    2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
>> reconfig fails
>>    2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
>> interface eno1 new configuration
>>
>>    I need to manually run "ip link set dev eno1 xdp off" to make it
>> recover.
>
> I think this is a bug in libbpf, see
> [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
>
>>
>>
>> - When I remove a bridge, I get an emer in the revalidator:
>>
>>    2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp 
>> program
>>    2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
>> assertion !fd != !wevent failed in poll_create_node()
>>
>>    Easy to replicate with this:
>>
>>      $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
>> datapath_type=netdev
>>      $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
>> type="afxdp" options:xdpmode=drv
>>      $ ovs-vsctl del-br ovs_pvp_br0
>>
> Thanks, I can reproduce it. Will make sure to fix it.
>
>>
>> - High pmd usage on the statistics, even with no packets is this
>> expected?
>>
>>    $ ovs-appctl dpif-netdev/pmd-rxq-show
>>    pmd thread numa_id 0 core_id 1:
>>      isolated : false
>>      port: dpdk0             queue-id:  0  pmd usage:  0 %
>>      port: eno1              queue-id:  0  pmd usage: 49 %
>>
>>    It goes up slowly and gets stuck at 49%
>>
>>
>> - When doing the PVP testing I noticed that the physical port has 
>> odd/no
>>    tx statistics:
>>
>>    $ ovs-ofctl dump-ports ovs_pvp_br0
>>    OFPST_PORT reply (xid=0x2): 3 ports
>>      port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0,
>> crc=0
>>               tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>>      port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0,
>> frame=0, over=0, crc=0
>>               tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>>      port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, 
>> frame=0,
>> over=0, crc=0
>>               tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>>
> I think the ixgbe driver has some issue.

Cool, this is the driver I’m using…

> If you run skb-mode, I think the stats are correct.
> See patch [1/2] ixgbe: fix AF_XDP tx byte count
>
>>
>> - Packets larger than 1028 bytes are dropped. Guess this needs to be
>> fixed, and we need to state that jumbo frames are not supported. Are 
>> you
>> planning on adding this?
>>
>>    Currently I can find no mention of an MTU limitation in the
>> documentation, or any code to prevent it from being changed above the
>> supported limit.
>>
>>
>> - ovs-vswitchd is still crashing or stops forwarding packets when 
>> trying
>> to do
>>    PVP testing with Qemu that has a TAP interface doing XDP and 
>> running
>> packets
>>    at wire speed to the 10G interface.
>>
>>    When trying with lower volume packets it seems to work, so with 1%
>> traffic
>>    rate, it forwards packets without any problems (148,771 pps). If I 
>> go
>> to
>>    10% the first couple of packet pass, then it stops forwarding. If
>> it's not
>>    crashing I still see packets being received by eno1 flow rules, 
>> but
>> no
>>    packets make it to the VM.
>>
>>      Program terminated with signal SIGSEGV, Segmentation fault.
>>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send 
>> (xsk=0x0,
>> batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>>      654            ret = umem_elem_pop_n(&xsk->umem->mpool, 
>> batch->count,
>> (void **)elems_pop);
>>      [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
>>      Missing separate debuginfos, use: dnf debuginfo-install
>> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
>>      (gdb) bt
>>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send 
>> (xsk=0x0,
>> batch=batch@entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>>      #1  0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540,
>> qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized
>> out>) at lib/netdev-linux.c:1486
>>      #2  0x0000000000906051 in netdev_send (netdev=<optimized out>,
>> qid=qid@entry=0, batch=batch@entry=0x7fc928005570,
>> concurrent_txq=concurrent_txq@entry=true)
>>          at lib/netdev.c:797
>>      #3  0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port
>> (pmd=pmd@entry=0x7fc95e735010, p=p@entry=0x7fc928005540) at
>> lib/dpif-netdev.c:4185
>>      #4  0x00000000008d2faf in dp_netdev_pmd_flush_output_packets
>> (pmd=pmd@entry=0x7fc95e735010, force=force@entry=false) at
>> lib/dpif-netdev.c:4225
>>      #5  0x00000000008db317 in dp_netdev_pmd_flush_output_packets
>> (force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
>>      #6  dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fc95e735010,
>> rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
>>      #7  0x00000000008db67d in pmd_thread_main (f_=<optimized out>) 
>> at
>> lib/dpif-netdev.c:5446
>>      #8  0x000000000095c96d in ovsthread_wrapper (aux_=<optimized 
>> out>)
>> at lib/ovs-thread.c:352
>>      #9  0x00007fc9789d62de in start_thread () from
>> /lib64/libpthread.so.0
>>      #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
>>
>>
>> - make check-afxdp is failing for me, however, make check-kernel 
>> works
>> fine.
>>    Did not dive into it too much, but it fails here for all test 
>> cases,
>> this is the same build I use for testing.
>>
>>    ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge
>> br0 datapath_type=netdev
>> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
>> fail-mode=secure  --
>>    --- /dev/null        2019-05-16 09:09:33.445562692 -0400
>>    +++
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr 
>>        2019-05-17
>> 05:46:20.506814939 -0400
>>    @@ -0,0 +1,2 @@
>>    +ovs-vsctl: Error detected while setting up 'br0'.  See 
>> ovs-vswitchd
>> log for details.
>>    +ovs-vsctl: The default log directory is
>> "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
>>    ovsdb-server.log:
>>   > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
>>   > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server 
>> (Open
>> vSwitch) 2.11.90
>>    ovs-vswitchd.log:
>>   > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
>>   > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU 
>> cores
>> on NUMA node 0
>>   > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA 
>> nodes
>> and 28 CPU cores
>>   >
>> 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
>> connecting...
>>   >
>> 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
>> connected
>>   > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open
>> vSwitch) 2.11.90
>>   > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev:
>> creating tap device failed: Device or resource busy
>
> I think the tap device is not cleared from previous settings.
> Or do
> #ip link del dev ovs-netdev
> then check again.

If the first thing I do after a reboot is run the check, it seems to work. 
If OVS is already running, it fails all tests, which was my scenario.

>>   > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev 
>> already
>> exists but cannot be opened: No such device
>>   > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open
>> datapath of type netdev: No such device
>>   > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open 
>> datapath
>> br0: No such device
>>   > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge
>> br0: No such device
>>    1. system-afxdp-traffic.at:3:  FAILED (system-afxdp-traffic.at:4)
>>
>>
>>
>>
>> The following might be useful when combining DPDK and AF_XDP:
>>
>>    Currently, DPDK and AF_XDP polling can be combined on a single PMD
>> thread, it
>>    might be nice to have an option to not do this, i.e. have separate
>> PMD
>>    threads for each type. I know we can do this with assigning 
>> specific
>> PMDs to
>>    queues, but this will disable auto-balancing. This will also help
>> later if
>>    we would add poll() mode support for AF_XDP.
>>
>>
>> Other review comments see inline below. I reviewed the code, not the
>> unit tests or automake changes.
>>
> <snip>
>
>>> +.. note::
>>> +   OVS AF_XDP netdev is using the userspace datapath, the same
>>> datapath
>>> +   as used by OVS-DPDK.  So it requires --disable-system for
>>> ovs-vswitchd
>>> +   and datapath_type=netdev when adding a new bridge.
>>
>> As mentioned earlier offline I think --disable-system can be removed 
>> as
>> the Kernel and userspace datapath can be run at the same time.
>>
> Yes, thanks
>
>>> +
>>> +Make sure your device driver support AF_XDP, and to use 1 PMD (on
>>> core 4)
>>> +on 1 queue (queue 0) device, configure these options: 
>>> **pmd-cpu-mask,
>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
>>> "skb"::
>>
>> Wondering how options:xdpmode should operate without it being 
>> specified?
>> I would prefer that if the option is not specified it would try drv, 
>> and
>> if it fails fallback to skb.
>>
> by default it is using skb mode. I prefer skb-mode as default since
> it has less driver-related issues.

I see it the other way around: we should use HW support as much as 
possible.
We get better performance, and as this is experimental we will find the 
driver issues earlier.
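
Roughly what I have in mind for the unspecified case, as a sketch only. The attach callback, the enum, and demo_attach_skb_only() are all hypothetical stand-ins; the real code would call whatever the netdev uses to load the XDP program and create the socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_xdp_mode { DEMO_XDP_DRV, DEMO_XDP_SKB };

/* Hypothetical attach callback: 0 on success, positive errno on failure. */
typedef int (*demo_attach_fn)(enum demo_xdp_mode mode, void *aux);

/* Try driver mode first and fall back to skb mode.  On success stores
 * the mode that worked in '*used'; otherwise returns the last error. */
static int
demo_attach_best_mode(demo_attach_fn attach, void *aux,
                      enum demo_xdp_mode *used)
{
    static const enum demo_xdp_mode order[] = { DEMO_XDP_DRV, DEMO_XDP_SKB };
    int error = 0;

    assert(attach && used);
    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        error = attach(order[i], aux);
        if (!error) {
            *used = order[i];
            return 0;
        }
    }
    return error;
}

/* Stub standing in for a NIC whose driver lacks native XDP support. */
static int
demo_attach_skb_only(enum demo_xdp_mode mode, void *aux)
{
    (void) aux;
    return mode == DEMO_XDP_DRV ? EOPNOTSUPP : 0;
}
```

An explicit options:xdpmode would still bypass the fallback, so anyone who wants to stay on skb mode because of driver issues can do so.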

>> We need to add these new options to the vswitch.xml file
>
> OK
>>
>>> +
>>> +  ethtool -L enp2s0 combined 1
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>> type="afxdp"
>>> \
>>> +    options:n_rxq=1 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:4"
>>> +
>>> +Or, use 4 pmds/cores and 4 queues by doing::
>>> +
>>> +  ethtool -L enp2s0 combined 4
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>> type="afxdp"
>>> \
>>> +    options:n_rxq=4 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
>>> +
>>
>> Add some text that pmd-rxq-affinity is not a requirement, the system
>> will auto (re)assign.
>> Also, note that cores used by pmd-rxq-affinity are not shared/used by
>> floating PMDs.
>
> Good point, thanks
>>
>>> +To validate that the bridge has successfully instantiated, you can
>>> use the::
>>> +
>>> +  ovs-vsctl show
>>> +
>>> +should show something like::
>>> +
>>> +  Port "ens802f0"
>>> +   Interface "ens802f0"
>>> +      type: afxdp
>>> +      options: {n_rxq="1", xdpmode=drv}
>>> +
>>> +Otherwise, enable debug by::
>>> +
>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>> +
>>> +References
>>> +----------
>>> +Most of the design details are described in the paper presented at
>>> +Linux Plumber 2018, "Bringing the Power of eBPF to Open 
>>> vSwitch"[1],
>>> +section 4, and slides[2][4].
>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
>>> introduction
>>> +about AF_XDP current and future work.
>>> +
>>> +
>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
>>> +
>>> +[2]
>>> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
>>> +
>>> +[3]
>>> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
>>> +
>>> +[4]
>>> https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
>>> +
>>> +
>>> +Performance Tuning
>>> +------------------
>>> +The name of the game is to keep your CPU running in userspace,
>>> allowing PMD
>>> +to keep polling the AF_XDP queues without any interferences from
>>> kernel.
>>> +
>>> +#. Make sure everything is in the same NUMA node (memory used by
>>> AF_XDP, pmd
>>> +   running cores, device plug-in slot)
>>
>> How can you do this? The code is not taking care of NUMA, and memory 
>> is
>> allocated with posix_memalign so no idea which NUMA node it gets
>> allocated.
>
> right... I'm hoping that users can be aware of this and run 
> ovs-vswitchd
> using taskset -p <cpu mask>
> So running the process on the correct NUMA node.

I'm not too familiar with the posix_memalign backend; will this force it 
to use memory from the correct NUMA node?

This will not work in practice, as ovs-vswitchd gets started by, for 
example, systemd. It will also not work if you have NICs on 
different NUMA nodes.
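
As a starting point, the node a NIC is attached to can be read from sysfs, so the code does not have to rely on how the process was launched. A minimal sketch with hypothetical helper names; it only detects the node. Actually placing the umem there would still need something like libnuma's numa_alloc_onnode() or mbind(), which this sketch does not attempt.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses the contents of a sysfs numa_node file.  Returns the node id,
 * or -1 if the text is not a number or the kernel reported no node. */
static int
demo_parse_numa_node(const char *text)
{
    char *end;
    long v;

    assert(text);
    v = strtol(text, &end, 10);
    if (end == text || v < 0) {
        return -1;
    }
    return (int) v;
}

/* Returns the NUMA node of network device 'ifname', or -1 if unknown
 * (for example a virtual device without a 'device' directory). */
static int
demo_netdev_numa_node(const char *ifname)
{
    char path[256], buf[32];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof path, "/sys/class/net/%s/device/numa_node",
             ifname);
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    if (fgets(buf, sizeof buf, f)) {
        node = demo_parse_numa_node(buf);
    }
    fclose(f);
    return node;
}
```

The patch already declares netdev_afxdp_get_numa_id(), so a lookup like this might fit there, with the result also used when choosing where to allocate the umem.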

>>
>>> +#. Isolate your CPU by doing isolcpu at grub configure.
>>> +
>>> +#. IRQ should not set to pmd running core.
>>> +
>>> +#. The Spectre and Meltdown fixes increase the overhead of system
>>> calls.
>>> +
>>
>> Maybe be more consistent, either one or two newlines before a 
>> heading?
> OK
>
> <snip>
>
>>> +
>>> +Create OpenFlow rules::
>>> +
>>> +  ovs-vsctl add-port br0 tap0
>>
>> Maybe add tap as XDP or else it will be an AF_PACKET interface 
>> polling
>> in the main thread.
>
> I think it should work (XDP on tap). Let me try.
> What's your concern here about polling AF_PACKET?

This matters when mixing native kernel drivers with the userspace 
datapath. The kernel interfaces are serviced by an AF_PACKET socket in 
the main OVS thread, and not in the PMD threads. This could delay general 
tasks, or general tasks will delay packet processing. If you use an XDP 
version of the TAP, it’s polled in the PMD thread.

>
>>
>>> +  ovs-ofctl del-flows br0
>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
>>> +
>>> +
>>> +Attach the veth port to br0 (linux kernel mode)::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
>>> +
>>
>> Remove the xdpmode=skb above... Also, see above on the PF_PACKET
>> interface in the bridge_run(),
>> I would advise against using this, and you might want to remove it.
>
> OK
>>
>>> +
>>> +Or, use AF_XDP with skb mode::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
>>> options:xdpmode=skb
>>> +
>>> +Setup the OpenFlow rules::
>>> +
> <snip>
>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>>> index 0976a35e758b..7d086dc5e860 100644
>>> --- a/lib/dp-packet.c
>>> +++ b/lib/dp-packet.c
>>> @@ -22,6 +22,9 @@
>>>  #include "netdev-dpdk.h"
>>>  #include "openvswitch/dynamic-string.h"
>>>  #include "util.h"
>>> +#ifdef HAVE_AF_XDP
>>> +#include "netdev-afxdp.h"
>>> +#endif
>>
>> Why the protection above? You do not do this in netdev-linux.c.
>> Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?
>>
>
> OK will fix it
>
>>>  static void
>>>  dp_packet_init__(struct dp_packet *b, size_t allocated, enum
>>> dp_packet_source source)
>>> @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base,
>>> size_t allocated)
>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>>>  }
>>>
>
> <snip>
>
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -0,0 +1,727 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#if !defined(__i386__) && !defined(__x86_64__)
>>> +#error AF_XDP supported only for Linux on x86 or x86_64
>>
>> Any reason why we do not support PPC and ARM?
>>
>>> +#endif
>>> +
>>> +#include <config.h>
>>> +
>>> +#include "netdev-linux-private.h"
>>> +#include "netdev-linux.h"
>>
>> Swap the two above, see comment in netdev-linux-private.h
>>
>>> +#include "netdev-afxdp.h"
>>> +
>>> +#include <arpa/inet.h>
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <inttypes.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <linux/rtnetlink.h>
>>> +#include <linux/sockios.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <net/if.h>
>>> +#include <net/if_arp.h>
>>> +#include <net/route.h>
>>> +#include <netinet/in.h>
>>> +#include <netpacket/packet.h>
>>> +#include <poll.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/ioctl.h>
>>> +#include <sys/types.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/utsname.h>
>>> +#include <unistd.h>
>>> +
>>
>> Some of these includes are included by netdev-linux(-private).h 
>> already
>> so why not remove them?
>
> OK will remove them.
>>
>>> +#include "coverage.h"
>>> +#include "dp-packet.h"
>>> +#include "dpif-netlink.h"
>>> +#include "dpif-netdev.h"
>>> +#include "fatal-signal.h"
>>> +#include "hash.h"
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "netlink-notifier.h"
>>> +#include "netlink-socket.h"
>>> +#include "netlink.h"
>>> +#include "netnsid.h"
>>> +#include "openflow/openflow.h"
>>> +#include "openvswitch/dynamic-string.h"
>>> +#include "openvswitch/hmap.h"
>>> +#include "openvswitch/ofpbuf.h"
>>> +#include "openvswitch/poll-loop.h"
>>> +#include "openvswitch/vlog.h"
>>> +#include "openvswitch/shash.h"
>>> +#include "ovs-atomic.h"
>>> +#include "packets.h"
>>> +#include "rtnetlink.h"
>>> +#include "socket-util.h"
>>> +#include "sset.h"
>>> +#include "tc.h"
>>> +#include "timer.h"
>>> +#include "unaligned.h"
>>> +#include "util.h"
>>> +#include "xdpsock.h"
>>> +
>>> +#ifndef SOL_XDP
>>> +#define SOL_XDP 283
>>> +#endif
>>> +#ifndef AF_XDP
>>> +#define AF_XDP 44
>>> +#endif
>>> +#ifndef PF_XDP
>>> +#define PF_XDP AF_XDP
>>> +#endif
>>
>> Do we really need to include the above? Or should we update the 
>> install
>> instruction to move them over from the kernel headers?
>>
> I think we only need SOL_XDP.
>
>>> +
>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>> +
>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
>>> *)base))
>>> +#define UMEM2XPKT(base, i) \
>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char 
>>> *)base
>>> + \
>>> +                               i * sizeof(struct dp_packet_afxdp))
>>> +
>
> <snip>
> Some comments here about umem and queue are discussed later with Ilya.
> Will address them together.
>
>>> +int
>>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                              struct dp_packet_batch *batch)
>>> +{
>>
>> See Ilya's comment on thread safety on the ring APIs.
>>
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t tx_done, idx_cq = 0;
>>> +    struct dp_packet *packet;
>>> +    uint32_t idx = 0;
>>> +    int j, ret, retry_count = 0;
>>> +    const int max_retry = 4;
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>> +         * We can avoid this copy if the packet and the pop umem
>>> +         * are located in the same umem.
>>> +         */
>>
>> The comment mentions the copy can be avoided, but it's not 
>> implemented
>> in the code, is this correct or was something removed?
>>
>>> +        memcpy(elem, dp_packet_data(packet), 
>>> dp_packet_size(packet));
>
> it's not implemented yet. now it's always making a copy
>>> +
>>> +        index = (uint64_t)((char *)elem - (char 
>>> *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +        return ret;
>>
>> I think we should still try to recover the CQ below, even on failure.
>>
>>> +    }
>>> +
>>> +retry:
>>> +    /* Process CQ */
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count,
>>> &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk->outstanding_tx -= tx_done;
>>> +        xsk->tx_npkts += tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void
>>> **)elems_push);
>>> +    ovs_assert(ret == 0);
>>> +
>>> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +
>>> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 
>>> 2))
>>> {
>>> +        /* If there are still a lot not transmitted, try harder. */
>>> +        if (retry_count++ > max_retry) {
>>> +            return 0;
>>> +        }
>>> +        goto retry;
>>> +    }
>>> +
>>
>> I think the code above is causing my lockup at wire speed mentioned
>> above...
>> I guess the retry_count expires every transmit sending packets to the
>> TAP interface.
>> Now all buffers are used... This is causing the umem_elem_pop_n() in 
>> the
>> beginning to fail, hence the buffers are never returned!
>>
>> Guess we might need some reclaim in the beginning, or maybe even in 
>> the
>> rx loop?
>
> right, let me re-work this part of code.
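
To illustrate the "reclaim in the beginning" idea, here is a toy model that uses plain counters instead of the real umem pool and completion ring. Everything in it (struct demo_tx, demo_reclaim(), demo_send(), demo_kernel_complete()) is hypothetical and only shows the ordering: drain whatever has completed before trying to pop new frames, so an empty pool can recover on the next call instead of failing forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct demo_tx {
    uint32_t free_frames;   /* Frames available in the umem pool. */
    uint32_t outstanding;   /* Submitted to TX, not yet reclaimed. */
    uint32_t completed;     /* Completions waiting in the CQ. */
};

/* Moves all completed frames back into the pool; returns how many. */
static uint32_t
demo_reclaim(struct demo_tx *tx)
{
    uint32_t n = tx->completed;

    assert(n <= tx->outstanding);
    tx->completed = 0;
    tx->outstanding -= n;
    tx->free_frames += n;
    return n;
}

/* Sends 'count' packets.  Reclaims first, so frames completed since the
 * previous call are usable even if the pool was empty. */
static int
demo_send(struct demo_tx *tx, uint32_t count)
{
    demo_reclaim(tx);
    if (tx->free_frames < count) {
        return EAGAIN;
    }
    tx->free_frames -= count;
    tx->outstanding += count;
    return 0;
}

/* Stand-in for the kernel finishing 'n' transmissions. */
static void
demo_kernel_complete(struct demo_tx *tx, uint32_t n)
{
    assert(tx->completed + n <= tx->outstanding);
    tx->completed += n;
}
```

In this model a pool of 8 frames accepts two batches of 4, refuses a third with EAGAIN, and accepts it again once the completions have arrived. In the current patch the reclaim only runs after a successful pop, which is the step that can no longer be reached once the pool is empty.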
>
>>> +    return 0;
>>> +}
>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
>>> new file mode 100644
>>> index 000000000000..6518d8fca0b5
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.h
>>> @@ -0,0 +1,53 @@
>>> +/*
>>> + * Copyright (c) 2018 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_AFXDP_H
>>> +#define NETDEV_AFXDP_H 1
>>> +
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +/* These functions are Linux AF_XDP specific, so they should be 
>>> used
>>> directly
>>> + * only by Linux-specific code. */
>>
>> Extra enter?
>>
>>> +#define MAX_XSKQ 16
>>
>> Extra enter?
>>
>
> OK
>
>>> +struct netdev;
>>> +struct xsk_socket_info;
>>> +struct xdp_umem;
>>> +struct dp_packet_batch;
>>> +struct smap;
>>> +struct dp_packet;
>>> +
>>> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct 
>>> dp_packet
>>> *d);
>>> +
>>> +int xsk_configure_all(struct netdev *netdev);
>>> +
>>> +void xsk_destroy_all(struct netdev *netdev);
>>> +
>>> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
>>> +                         struct dp_packet_batch *batch);
>>> +
>>> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                                  struct dp_packet_batch *batch);
>>> +
>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct 
>>> smap
>>> *args,
>>> +                            char **errp);
>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct 
>>> smap
>>> *args);
>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
>>> +
>>> +void free_afxdp_buf(struct dp_packet *p);
>>> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
>>> +#endif /* netdev-afxdp.h */
>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>> new file mode 100644
>>> index 000000000000..3dd3d902b3c4
>>> --- /dev/null
>>> +++ b/lib/netdev-linux-private.h
>>> @@ -0,0 +1,124 @@
>>> +/*
>>> + * Copyright (c) 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>
>> Why include all the above? They were just added to netdev-linux.h, so
>> if you make sure you include netdev-linux.h before -private it should
>> work out.
>>
>>> +
>>> +#if HAVE_AF_XDP
>>> +#include "netdev-afxdp.h"
>>> +#endif
>>
>> See earlier comment
>>
>>> +
>>> +/* These functions are Linux specific, so they should be used directly only by
>>> + * Linux-specific code. */
>>> +
>>> +struct netdev;
>>> +
>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
>>> +                                  const char *flag_name, bool enable);
>>> +int linux_get_ifindex(const char *netdev_name);
>>> +
>>
>> These functions are now both specified in netdev-linux.h and
>> netdev-linux-private.h
>>
>>> +#define LINUX_FLOW_OFFLOAD_API                          \
>>> +   .flow_flush = netdev_tc_flow_flush,                  \
>>> +   .flow_dump_create = netdev_tc_flow_dump_create,      \
>>> +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
>>> +   .flow_dump_next = netdev_tc_flow_dump_next,          \
>>> +   .flow_put = netdev_tc_flow_put,                      \
>>> +   .flow_get = netdev_tc_flow_get,                      \
>>> +   .flow_del = netdev_tc_flow_del,                      \
>>> +   .init_flow_api = netdev_tc_init_flow_api
>>> +
>>
>> Same here, this define is in both include files.
>>
>>> +struct netdev_linux {
>>> +    struct netdev up;
>>> +
>>> +    /* Protects all members below. */
>>> +    struct ovs_mutex mutex;
>>> +
>>> +    unsigned int cache_valid;
>>> +
>>> +    bool miimon;                    /* Link status of last poll. */
>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> +    struct timer miimon_timer;
>>> +
>>> +    int netnsid;                    /* Network namespace ID. */
>>> +    /* The following are figured out "on demand" only.  They are only valid
>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> +    int ifindex;
>>> +    struct eth_addr etheraddr;
>>> +    int mtu;
>>> +    unsigned int ifi_flags;
>>> +    long long int carrier_resets;
>>> +    uint32_t kbits_rate;        /* Policing data. */
>>> +    uint32_t kbits_burst;
>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> +                                   0 or an errno value. */
>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
>>> +                                 * or SIOCSIFMTU.
>>> +                                 */
>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> +
>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> +
>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> +    struct tc *tc;
>>> +
>>> +    /* For devices of class netdev_tap_class only. */
>>> +    int tap_fd;
>>> +    bool present;               /* If the device is present in the namespace */
>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> +
>>> +    /* LAG information. */
>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> +
>>> +    /* AF_XDP information */
>>> +#ifdef HAVE_AF_XDP
>>> +    struct xsk_socket_info *xsk[MAX_XSKQ];
>>> +    int requested_n_rxq;
>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
>>> +    int xdp_flags, xdp_bind_flags;
>>> +#endif
>>> +};
>>> +
>>> +static struct netdev_linux *
>>> +netdev_linux_cast(const struct netdev *netdev)
>>> +{
>>
>> In the original definition there was an assert() here. Was it removed
>> by accident?
>> netdev_linux_rxq_xsk
>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +}
>>> +
>>> +#endif /* netdev-linux-private.h */
>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>> index f75d73fd39f8..1f190406d145 100644
>>> --- a/lib/netdev-linux.c
>>> +++ b/lib/netdev-linux.c
>>> @@ -17,6 +17,7 @@
>>>  #include <config.h>
>>>
>>>  #include "netdev-linux.h"
>>> +#include "netdev-linux-private.h"
>>>
>>>  #include <errno.h>
>>>  #include <fcntl.h>
>>> @@ -54,6 +55,7 @@
>>>  #include "fatal-signal.h"
>>>  #include "hash.h"
>>>  #include "openvswitch/hmap.h"
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-provider.h"
>>>  #include "netdev-tc-offloads.h"
>>>  #include "netdev-vport.h"
>>> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>>>
>>> -struct netdev_linux {
>>> -    struct netdev up;
>>> -
>>> -    /* Protects all members below. */
>>> -    struct ovs_mutex mutex;
>>> -
>>> -    unsigned int cache_valid;
>>> -
>>> -    bool miimon;                    /* Link status of last poll. */
>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> -    struct timer miimon_timer;
>>> -
>>> -    int netnsid;                    /* Network namespace ID. */
>>> -    /* The following are figured out "on demand" only.  They are only valid
>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> -    int ifindex;
>>> -    struct eth_addr etheraddr;
>>> -    int mtu;
>>> -    unsigned int ifi_flags;
>>> -    long long int carrier_resets;
>>> -    uint32_t kbits_rate;        /* Policing data. */
>>> -    uint32_t kbits_burst;
>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> -                                   0 or an errno value. */
>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> -
>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> -
>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> -    struct tc *tc;
>>> -
>>> -    /* For devices of class netdev_tap_class only. */
>>> -    int tap_fd;
>>> -    bool present;               /* If the device is present in the namespace */
>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> -
>>> -    /* LAG information. */
>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> -};
>>>
>>>  struct netdev_rxq_linux {
>>>      struct netdev_rxq up;
>>> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
>>>      return netdev_class->run == netdev_linux_run;
>>>  }
>>>
>>> +#if HAVE_AF_XDP
>>>  static bool
>>> -is_tap_netdev(const struct netdev *netdev)
>>> +is_afxdp_netdev(const struct netdev *netdev)
>>>  {
>>> -    return netdev_get_class(netdev) == &netdev_tap_class;
>>> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
>>>  }
>>> -
>>> -static struct netdev_linux *
>>> -netdev_linux_cast(const struct netdev *netdev)
>>> +#else
>>> +static bool
>>> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
>>>  {
>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>> -
>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +    return false;
>>> +}
>>> +#endif
>>> +static bool
>>> +is_tap_netdev(const struct netdev *netdev)
>>> +{
>>> +    return netdev_get_class(netdev) == &netdev_tap_class;
>>>  }
>>>
>>>  static struct netdev_rxq_linux *
>>> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>>>          atomic_count_dec(&miimon_cnt);
>>>      }
>>>
>>> +#if HAVE_AF_XDP
>>> +    if (is_afxdp_netdev(netdev_)) {
>>> +        xsk_destroy_all(netdev_);
>>> +    }
>>> +#endif
>>
>> Think you can remove the HAVE_AF_XDP here, as you do not use it below
>> either.
>
> Yes
>
>>
>>>      ovs_mutex_destroy(&netdev->mutex);
>>>  }
>>>
>>> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>>>      rx->is_tap = is_tap_netdev(netdev_);
>>>      if (rx->is_tap) {
>>>          rx->fd = netdev->tap_fd;
>>> -    } else {
>>> +    } else if (!is_afxdp_netdev(netdev_)) {
>>>          struct sockaddr_ll sll;
>>>          int ifindex, val;
>>>          /* Result of tcpdump -dd inbound */
>>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>  {
>>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>>      struct netdev *netdev = rx->up.netdev;
>>> -    struct dp_packet *buffer;
>>> +    struct dp_packet *buffer = NULL;
>>>      ssize_t retval;
>>>      int mtu;
>>>
>>> +#if HAVE_AF_XDP
>>
>> Think this #if HAVE_AF_XDP can be removed as the compiler should
>> optimize out the if (false).
>>
>>> +    if (is_afxdp_netdev(netdev)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +        int qid = rxq_->queue_id;
>>> +
>>> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
>>> +    }
>>> +#endif
>>>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>>>          mtu = ETH_PAYLOAD_MAX;
>>>      }
>>> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>      /* Assume Ethernet port. No need to set packet_type. */
>>>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>>>                                             DP_NETDEV_HEADROOM);
>>> +
>>>      retval = (rx->is_tap
>>>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>>>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
>>> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>>      int error = 0;
>>>      int sock = 0;
>>>
>>> -    if (!is_tap_netdev(netdev_)) {
>>> +    if (!is_tap_netdev(netdev_) &&
>>> +        !is_afxdp_netdev(netdev_)) {
>>>          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>>>              error = EOPNOTSUPP;
>>>              goto free_batch;
>>> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>>          }
>>>
>>>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
>>> +#if HAVE_AF_XDP
>>
>> Same here remove the #if HAVE_AF_XDP
>>
>>> +    } else if (is_afxdp_netdev(netdev_)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct umem_pool *first_mpool;
>>> +        struct dp_packet *packet;
>>> +
>>> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
>>> +
>>> +        /* All packets must come from the same umem pool
>>> +         * and have DPBUF_AFXDP type, otherwise free one-by-one.
>>> +         */
>>> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +            if (packet->source != DPBUF_AFXDP) {
>>> +                goto free_batch;
>>> +            }
>>> +
>>> +            xpacket = dp_packet_cast_afxdp(packet);
>>> +            if (i == 0) {
>>> +                first_mpool = xpacket->mpool;
>>> +                continue;
>>> +            }
>>> +            if (xpacket->mpool != first_mpool) {
>>> +                goto free_batch;
>>> +            }
>>> +        }
>>
>> Why do we not move all the packet type checks to
>> free_afxdp_buf_batch()?
>
> Here I plan to move them into the afxdp-specific send function, so they
> won't be part of netdev_linux_send(). Hopefully that is cleaner.
>>
>>> +        /* free in batch */
>>> +        free_afxdp_buf_batch(batch);
>>> +        return error;
>>> +#endif
>>>      } else {
>>>          error = netdev_linux_tap_batch_send(netdev_, batch);
>>>      }
>>> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "system",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_linux_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
>>>  const struct netdev_class netdev_tap_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      .type = "tap",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct_tap,
>>>      .get_stats = netdev_tap_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3343,10 +3352,26 @@ const struct netdev_class netdev_internal_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "internal",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_internal_get_stats,
>>>      .get_status = netdev_internal_get_status,
>>>  };
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +const struct netdev_class netdev_afxdp_class = {
>>> +    NETDEV_LINUX_CLASS_COMMON,
>>> +    .type = "afxdp",
>>> +    .is_pmd = true,
>>> +    .construct = netdev_linux_construct,
>>> +    .get_stats = netdev_linux_get_stats,
>>> +    .get_status = netdev_linux_get_status,
>>> +    .set_config = netdev_afxdp_set_config,
>>> +    .get_config = netdev_afxdp_get_config,
>>> +    .reconfigure = netdev_afxdp_reconfigure,
>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
>>> +};
>>> +#endif
>>>
>>>
>>>  #define CODEL_N_QUEUES 0x0000
>>> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
>>> index 17ca9120168a..b812e64cb078 100644
>>> --- a/lib/netdev-linux.h
>>> +++ b/lib/netdev-linux.h
>>> @@ -19,6 +19,20 @@
>>>
>>>  #include <stdint.h>
>>>  #include <stdbool.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>
>> Is there a reason why you move all these includes here? If there is, you
>> might as well remove the duplicates from .c files that include
>> netdev-linux.h, for example, netdev-linux.c
>
> <snip>
> Will work on this part later.
>
>>> +static inline void
>>> +ovs_spin_unlock(ovs_spinlock_t *sl)
>>> +{
>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
>>> +}
>>> +
>>> +static inline int OVS_UNUSED
>>> +ovs_spin_trylock(ovs_spinlock_t *sl)
>>> +{
>>> +    int exp = 0;
>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed);
>>> +}
>>
>> Move spinlock function out to a common file
>>
> OK, I plan to add lib/spinlock.h and move to it.
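For reference, a minimal sketch of what such a standalone spinlock header could look like. It is not taken from the patch: it assumes plain C11 <stdatomic.h> instead of the ovs-atomic.h wrappers, and the sketch_spin_* names are hypothetical. The memory orderings mirror the ones used in the quoted ovs_spin_trylock()/ovs_spin_unlock().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical standalone spinlock; the patch itself uses ovs_spinlock_t
 * together with the OVS atomic wrappers. */
struct sketch_spinlock {
    atomic_int locked;          /* 0 = free, 1 = held. */
};

static inline void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was acquired, false if it is already held. */
static inline bool
sketch_spin_trylock(struct sketch_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    while (!sketch_spin_trylock(sl)) {
        /* Busy-wait until the holder releases the lock. */
    }
}

static inline void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Keeping everything static inline in one header would let both xdpsock.c and other users share it without a new object file.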
>
>>> +
>>> +inline int
>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>>
>> This is a stack overflow
>>
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(ptr, addrs, n * sizeof(void *));
>>> +    umemp->index += n;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_push_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void
>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +    umemp->array[umemp->index++] = addr;
>>> +}
>>> +
>>> +void
>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
>>> +        /* stack is overflow, this should not happen */
>>> +        OVS_NOT_REACHED();
>>> +    }
>>
>> Should this not be moved after the spinlock, i.e. to __umem_elem_push?
> OK
>
>>
>>> +
>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    __umem_elem_push(umemp, addr);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +}
>>> +
>>> +inline int
>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umemp->index -= n;
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(addrs, ptr, n * sizeof(void *));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void *
>>> +__umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>
>> There is no check here to see if there are actually any elements left,
>> like there is for pop_n,
>> so we could corrupt memory/umem_pool
>>
> Yes, I will add a check.
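A minimal sketch of the bounds checks discussed above, for both the single-element push and pop. The struct is a hypothetical stand-in for the patch's struct umem_pool with the lock left out, and returning a positive errno value (rather than the -ENOMEM used by the quoted push_n/pop_n) is an assumption made for illustration only.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct umem_pool; locking omitted. */
struct sketch_pool {
    int index;          /* Number of stored elements; also the top of stack. */
    unsigned int size;  /* Capacity of 'array'. */
    void **array;
};

/* Pushes 'addr'.  Returns 0 on success, ENOMEM if the pool is full. */
static int
sketch_pool_push(struct sketch_pool *p, void *addr)
{
    if (p->index < 0 || (unsigned int) p->index >= p->size) {
        return ENOMEM;
    }
    p->array[p->index++] = addr;
    return 0;
}

/* Pops the most recently pushed element, or returns NULL if the pool is
 * empty, so 'index' can never go negative. */
static void *
sketch_pool_pop(struct sketch_pool *p)
{
    if (p->index <= 0) {
        return NULL;
    }
    return p->array[--p->index];
}
```

With the check inside the unlocked helper, the locked wrapper only needs to take the lock and forward the result.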
>
>>> +    return umemp->array[--umemp->index];
>>> +}
>>> +
>>> +void *
>>> +umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ptr = __umem_elem_pop(umemp);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ptr;
>>> +}
>>> +
>>> +void **
>>> +__umem_pool_alloc(unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(void *)) == 0);
>>
>> We should not assert, just return NULL here.
>>
>>> +    memset(bufs, 0, size * sizeof(void *));
>>> +    return (void **)bufs;
>>> +}
>>> +
>>> +unsigned int
>>> +umem_elem_count(struct umem_pool *mpool)
>>> +{
>>> +    return mpool->index;
>>> +}
>>> +
>>> +int
>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
>>> +{
>>> +    umemp->array = __umem_pool_alloc(size);
>>> +    if (!umemp->array) {
>>> +        OVS_NOT_REACHED();
>>
>> If NULL is returned return ENOMEM
>>
>>> +    }
>>> +
>>> +    umemp->size = size;
>>> +    umemp->index = 0;
>>> +    ovs_spinlock_init(&umemp->mutex);
>>> +    return 0;
>>> +}
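The two review comments above (return NULL from the allocator instead of asserting, and return ENOMEM from the init function) could be combined roughly as follows. This is only a sketch: the sketch_* names and the struct are hypothetical, the spinlock init is left out, and sysconf(_SC_PAGESIZE) is used in place of getpagesize().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* For posix_memalign(). */
#endif

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for struct umem_pool; lock omitted. */
struct sketch_ptr_pool {
    int index;
    unsigned int size;
    void **array;
};

/* Returns a zeroed, page-aligned array of 'size' pointers, or NULL if the
 * allocation fails. */
static void **
sketch_pool_alloc(unsigned int size)
{
    void *bufs;

    if (posix_memalign(&bufs, (size_t) sysconf(_SC_PAGESIZE),
                       size * sizeof(void *))) {
        return NULL;
    }
    memset(bufs, 0, size * sizeof(void *));
    return bufs;
}

/* Returns 0 on success, ENOMEM if the backing array cannot be allocated. */
static int
sketch_pool_init(struct sketch_ptr_pool *p, unsigned int size)
{
    p->array = sketch_pool_alloc(size);
    if (!p->array) {
        return ENOMEM;
    }
    p->size = size;
    p->index = 0;
    return 0;
}
```

The caller then decides how to react to ENOMEM, for example by failing the port reconfiguration rather than aborting the whole daemon.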
>>> +
>>> +void
>>> +umem_pool_cleanup(struct umem_pool *umemp)
>>> +{
>>> +    free(umemp->array);
>>         umemp->array = NULL;
>>> +}
>>> +
>>> +/* AF_XDP metadata init/destroy */
>>> +int
>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    /* TODO: check HAVE_POSIX_MEMALIGN  */
>>
>> Guess the above needs to be done
>>
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(struct dp_packet_afxdp)) == 0);
>>
>> We should not assert, just return false
>>
>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
>>> +
>>> +    xp->array = bufs;
>>> +    xp->size = size;
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
>>> +{
>>> +    free(xp->array);
>>         xp->array = NULL;
>>> +}
>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>> new file mode 100644
>>> index 000000000000..aabaa8e5df24
>>> --- /dev/null
>>> +++ b/lib/xdpsock.h
>>> @@ -0,0 +1,123 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <bpf/libbpf.h>
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <getopt.h>
>>> +#include <libgen.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/if_link.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <locale.h>
>>> +#include <net/if.h>
>>> +#include <poll.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <sys/mman.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>
>> Move this item to the bottom, so you have the FRAME-specific defines
>> first
>>
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define NUM_FRAMES      4096
>>
>> Should we add a note/check to make sure this value is a power of 2?
>
> Sure, will do it.
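One possible shape for that check, as a sketch only. It uses the C11 _Static_assert keyword and hypothetical SKETCH_* names; inside the OVS tree the existing build-assert and power-of-2 helpers would presumably be used instead.

```c
#include <stdbool.h>

#define SKETCH_NUM_FRAMES 4096  /* Same value as NUM_FRAMES in the patch. */

/* True if 'x' is a nonzero power of two. */
#define SKETCH_IS_POW2(x) ((x) != 0 && (((x) & ((x) - 1)) == 0))

/* Fails the build if the frame count is changed to a non-power-of-2. */
_Static_assert(SKETCH_IS_POW2(SKETCH_NUM_FRAMES),
               "SKETCH_NUM_FRAMES must be a power of 2");

/* Runtime variant, e.g. if the frame count later becomes configurable. */
static inline bool
sketch_is_pow2(unsigned int x)
{
    return SKETCH_IS_POW2(x);
}
```

The compile-time form costs nothing at runtime; the runtime helper only matters once the value can come from user configuration.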
>
>>
>>> +#define PROD_NUM_DESCS  512
>>> +#define CONS_NUM_DESCS  512
>>> +
>>> +#ifdef USE_XSK_DEFAULT
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +#endif
>>
>> Any reason for having this? Should we use the default values? They are
>> 4x larger than what you have. Did it make any difference in performance
>> results?
>> We could make it configurable like for DPDK, using the
>> n_txq_desc/n_rxq_desc option.
>
> Will add this option later.
>
>>
>>> +
>>> +typedef struct {
>>> +    atomic_int locked;
>>> +} ovs_spinlock_t;
>>> +
>>
>> Think we should move the ovs_spinlock code and includes to some global
>> place, maybe util or thread
>
> ok, will move to lib/spinlock.h
>
>>
>>> +/* LIFO ptr_array */
>>> +struct umem_pool {
>>> +    int index;      /* point to top */
>>> +    unsigned int size;
>>> +    ovs_spinlock_t mutex;
>>> +    void **array;   /* a pointer array, point to umem buf */
>>> +};
>>> +
>>> +/* array-based dp_packet_afxdp */
>>> +struct xpacket_pool {
>>> +    unsigned int size;
>>> +    struct dp_packet_afxdp **array;
>>> +};
>>> +
>>> +struct xsk_umem_info {
>>> +    struct umem_pool mpool;
>>> +    struct xpacket_pool xpool;
>>> +    struct xsk_ring_prod fq;
>>> +    struct xsk_ring_cons cq;
>>> +    struct xsk_umem *umem;
>>> +    void *buffer;
>>> +};
>>> +
>>> +struct xsk_socket_info {
>>> +    struct xsk_ring_cons rx;
>>> +    struct xsk_ring_prod tx;
>>> +    struct xsk_umem_info *umem;
>>> +    struct xsk_socket *xsk;
>>> +    unsigned long rx_npkts;
>>> +    unsigned long tx_npkts;
>>> +    unsigned long prev_rx_npkts;
>>> +    unsigned long prev_tx_npkts;
>>> +    uint32_t outstanding_tx;
>>> +};
>>> +
>>> +struct umem_elem {
>>> +    struct umem_elem *next;
>>> +};
>>> +
>>> +void __umem_elem_push(struct umem_pool *umemp, void *addr);
>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
>>> +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
>>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +void *__umem_elem_pop(struct umem_pool *umemp);
>>> +void *umem_elem_pop(struct umem_pool *umemp);
>>> +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +void **__umem_pool_alloc(unsigned int size);
>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
>>> +void umem_pool_cleanup(struct umem_pool *umemp);
>>> +unsigned int umem_elem_count(struct umem_pool *mpool);
>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
>>> +
>>
>> Think all the __umem_* functions are only used internally so they should
>> become static and be removed here.
>>
> OK.
>
> Regards,
> William
William Tu May 20, 2019, 4:52 p.m. UTC | #12
On Mon, May 20, 2019 at 3:38 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 17 May 2019, at 14:39, Ilya Maximets wrote:
>
> > Hi.
> > Just a few comments on the issues you've listed.
> >
> > Best regards, Ilya Maximets.
> >
> > On 17.05.2019 13:23, Eelco Chaudron wrote:
> >> Hi William,
> >>
> >> First a list of issues I found during some basic testing...
> >>
> >> - When I restart or stop OVS (using the systemctl interface as found
> >> in RHEL) it does not clean up the BPF program causing the restart to
> >> fail:
> >>
> >>   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> >> reconfig fails
> >>   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
> >> interface eno1 new configuration
> >>
> >>   I need to manually run "ip link set dev eno1 xdp off" to make it
> >> recover.
> >
> > Userspace datapath requires '--cleanup' option passed to 'ovs-appctl
> > exit'
> > to clean up allocated resources. Otherwise datapath will not be
> > destroyed,
> > i.e. netdev will not be destroyed --> no xdp program unloading.
>
> Maybe we should try to reload/cleanup at startup?

OK, will add it into the make check-afxdp cleanup process.
>
> >>
> >> - When I remove a bridge, I get an emer in the revalidator:
> >>
> >>   2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp
> >> program
> >>
> >> 2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
> >> assertion !fd != !wevent failed in poll_create_node()
> >>
> >
> > This actually should never happen. Looks like a memory corruption.
> >
> >>   Easy to replicate with this:
> >>
> >>     $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
> >> datapath_type=netdev
> >>     $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
> >> type="afxdp" options:xdpmode=drv
> >>     $ ovs-vsctl del-br ovs_pvp_br0
> >>
> >>
> >> - High pmd usage in the statistics, even with no packets. Is this
> >> expected?
> >>
> >>   $ ovs-appctl dpif-netdev/pmd-rxq-show
> >>   pmd thread numa_id 0 core_id 1:
> >>     isolated : false
> >>     port: dpdk0             queue-id:  0  pmd
> >> usage:  0 %
> >>     port: eno1              queue-id:  0  pmd
> >> usage: 49 %
> >>
> >>   It goes up slowly and gets stuck at 49%
> >>
> >>
> >> - When doing the PVP testing I noticed that the physical port has
> >> odd/no
> >>   tx statistics:
> >>
> >>   $ ovs-ofctl dump-ports ovs_pvp_br0
> >>   OFPST_PORT reply (xid=0x2): 3 ports
> >>     port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0,
> >> over=0, crc=0
> >>              tx pkts=0, bytes=0, drop=0, errs=0, coll=0
> >>     port  eno1: rx pkts=103256197, bytes=6195630508, drop=0,
> >> errs=0, frame=0, over=0, crc=0
> >>              tx pkts=0, bytes=19789272440056, drop=0,
> >> errs=0, coll=0
> >>     port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0,
> >> frame=0, over=0, crc=0
> >>              tx pkts=4058, bytes=502504, drop=0, errs=0,
> >> coll=0
> >>
> >>
> >> - Packets larger than 1028 bytes are dropped. Guess this needs to be
> >> fixed, and we need to state that jumbo frames are not supported. Are
> >> you planning on adding this?
> >>
> >>   Currently I can find no mention of an MTU limitation in the
> >> documentation, or any code to prevent it from being changed above the
> >> supported limit.
> >
> > Actually Jumbo frames are supported, but yes, the packet size
> > is limited by the page size. So, jumbo frames up to ~3.5K should
> > be supported without issues.
> > We'll need to determine the upper limit and reject requested mtu
> > if it's larger.
>
> Currently, non-jumbo frames are not even working, and I think a jumbo
> check should be added as we allocate chunks of 2048.

Right, we can re-allocate the chunk to a larger size.
But for now, I will limit it to 2048 and error out when mtu is larger.
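A rough sketch of such a check, under stated assumptions: the 2048-byte chunk size comes from this thread, the 256-byte headroom is assumed to match XDP_PACKET_HEADROOM, and subtracting the headroom and an Ethernet+VLAN header from the chunk size is one possible way to derive the limit, not necessarily what the next version will do. All SKETCH_* and sketch_* names are hypothetical.

```c
#include <errno.h>

#define SKETCH_FRAME_SIZE     2048  /* umem chunk size mentioned in this thread. */
#define SKETCH_FRAME_HEADROOM 256   /* Assumed equal to XDP_PACKET_HEADROOM. */
#define SKETCH_ETH_VLAN_HDR   18    /* 14-byte Ethernet header + 4-byte VLAN tag. */

/* Returns 0 if a frame with this MTU fits in a single umem chunk, otherwise
 * EINVAL so that the caller can reject the requested MTU. */
static int
sketch_afxdp_check_mtu(int mtu)
{
    int max_mtu = SKETCH_FRAME_SIZE - SKETCH_FRAME_HEADROOM
                  - SKETCH_ETH_VLAN_HDR;

    if (mtu <= 0 || mtu > max_mtu) {
        return EINVAL;
    }
    return 0;
}
```

If the chunk size is made larger later, only the first constant would need to change.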

>
> >>
> >>
> >> - ovs-vswitchd is still crashing or stops forwarding packets when
> >> trying to do
> >>   PVP testing with Qemu that has a TAP interface doing XDP and
> >> running packets
> >>   at wire speed to the 10G interface.
> >
> > Actually, there are a lot of places in current version where
> > rings/umems could
> > be corrupted leading to unpredictable memory corruptions/crashes/time
> > wasting
> > trying to allocate exhausted resources.
>
> ACK :)

Yes, I'm working on a newer version.

> > Simple: rdtsc.
> >
> > Actually, right now we need to restrict support to only x86_64,
> > because
> > above rdtsc is in 64bit form and will not work for 32bit cpu.
>
> If this is all, why not copy the rest from DPDK and support all…

I don't know whether AF_XDP is supported on 32-bit or not.
For the next version, let's just assume x86_64.

Thanks for the feedback
William
Ben Pfaff May 20, 2019, 10:57 p.m. UTC | #13
On Fri, May 17, 2019 at 12:23:35PM +0200, Eelco Chaudron wrote:
> - When I restart or stop OVS (using the systemctl interface as found in
> RHEL) it does not clean up the BPF program causing the restart to fail:
> 
>   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> reconfig fails
>   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface
> eno1 new configuration
> 
>   I need to manually run "ip link set dev eno1 xdp off" to make it recover.

William, in case you don't know about it, fatal_signal_add_hook() is
likely the right way to fix this.
Ben Pfaff May 20, 2019, 10:58 p.m. UTC | #14
On Mon, May 20, 2019 at 03:57:21PM -0700, Ben Pfaff wrote:
> On Fri, May 17, 2019 at 12:23:35PM +0200, Eelco Chaudron wrote:
> > - When I restart or stop OVS (using the systemctl interface as found in
> > RHEL) it does not clean up the BPF program causing the restart to fail:
> > 
> >   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> > reconfig fails
> >   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface
> > eno1 new configuration
> > 
> >   I need to manually run "ip link set dev eno1 xdp off" to make it recover.
> 
> William, in case you don't know about it, fatal_signal_add_hook() is
> likely the right way to fix this.

William, I failed to read your followup before I responded, so maybe
this advice was not helpful.
William Tu May 20, 2019, 11:07 p.m. UTC | #15
On Mon, May 20, 2019 at 3:58 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Mon, May 20, 2019 at 03:57:21PM -0700, Ben Pfaff wrote:
> > On Fri, May 17, 2019 at 12:23:35PM +0200, Eelco Chaudron wrote:
> > > - When I restart or stop OVS (using the systemctl interface as found in
> > > RHEL) it does not clean up the BPF program causing the restart to fail:
> > >
> > >   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> > > reconfig fails
> > >   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface
> > > eno1 new configuration
> > >
> > >   I need to manually run "ip link set dev eno1 xdp off" to make it recover.
> >
> > William, in case you don't know about it, fatal_signal_add_hook() is
> > likely the right way to fix this.
>
> William, I failed to read your followup before I responded, so maybe
> this advice was not helpful.

Thanks, Ben. I'm still experimenting to find the root cause.
I will see whether fatal_signal_add_hook() helps and, if so, add it to the
next version.

William
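
A rough sketch of the cleanup-hook idea, under stated assumptions. The registry and all `demo_*` names are hypothetical, the detach callback here only records the ifindex instead of touching the kernel, and the wiring into fatal_signal_add_hook() is deliberately left out. What it shows is the bookkeeping: remember every interface that got an XDP program, and let one callback taking a `void *aux` detach whatever is still registered, which is the automated equivalent of the manual "ip link set dev eno1 xdp off".

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEVS 8             /* Illustrative table size. */

/* Hypothetical callback type: detaches the XDP program from 'ifindex'. */
typedef void demo_detach_fn(unsigned int ifindex);

/* Interfaces that still have an XDP program attached by this process. */
struct demo_xdp_registry {
    unsigned int ifindex[DEMO_MAX_DEVS];
    size_t n;
    demo_detach_fn *detach;
};

/* Stand-in detach function: only logs the ifindex it was asked to detach. */
static unsigned int demo_detached[DEMO_MAX_DEVS];
static size_t demo_n_detached;

void
demo_detach_record(unsigned int ifindex)
{
    if (demo_n_detached < DEMO_MAX_DEVS) {
        demo_detached[demo_n_detached++] = ifindex;
    }
}

void
demo_registry_init(struct demo_xdp_registry *r, demo_detach_fn *detach)
{
    assert(detach != NULL);
    r->n = 0;
    r->detach = detach;
}

/* Call after a successful attach.  Returns 0, or -1 if the table is full. */
int
demo_registry_add(struct demo_xdp_registry *r, unsigned int ifindex)
{
    if (r->n == DEMO_MAX_DEVS) {
        return -1;
    }
    r->ifindex[r->n++] = ifindex;
    return 0;
}

/* Call after a normal detach (for example, port removal). */
void
demo_registry_remove(struct demo_xdp_registry *r, unsigned int ifindex)
{
    for (size_t i = 0; i < r->n; i++) {
        if (r->ifindex[i] == ifindex) {
            r->ifindex[i] = r->ifindex[--r->n];
            return;
        }
    }
}

/* Cleanup hook body.  Detaches everything still registered; calling it a
 * second time is a no-op because the table is empty by then. */
void
demo_cleanup_hook(void *aux)
{
    struct demo_xdp_registry *r = aux;

    while (r->n > 0) {
        r->detach(r->ifindex[--r->n]);
    }
}
```

In a real implementation the detach callback would remove the program from the device, and the hook would need to respect whatever restrictions apply to the context it runs in.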

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..1222b433dbbb
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,479 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with the kernel's existing networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
+As a result, an AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see the Linux kernel's
+Documentation/networking/af_xdp.rst.
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+internal.  The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the proposed architecture re-uses the existing
+userspace dpif-netdev datapath.  As a result, most of
+the packet processing happens in userspace instead of
+in the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- If possible, run **./xdpsock -r -N -z -i <your device>** under
+  linux/samples/bpf.  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+First, clone a recent version of the Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick-start the end-to-end autotests::
+
+  uname -a # make sure the kernel is 5.0 or later
+  make check-afxdp
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd --disable-system
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+.. note::
+   The OVS AF_XDP netdev uses the userspace datapath, the same datapath
+   used by OVS-DPDK, so it requires --disable-system for ovs-vswitchd
+   and datapath_type=netdev when adding a new bridge.
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a device with 1 queue (queue 0), configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+To validate that the bridge has been instantiated successfully, run::
+
+  ovs-vsctl show
+
+The output should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debug logging with::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
+section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to current and future AF_XDP work.
+
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing the
+PMD to keep polling the AF_XDP queues without interference from the kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, cores
+   running PMDs, device plug-in slot).
+
+#. Isolate your CPUs by setting isolcpus in the GRUB configuration.
+
+#. Do not route IRQs to the cores running PMDs.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+Debugging performance issues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peers::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's NUMA ID is always 0; need a way to find the NUMA ID of a netdev.
+#. No QoS support, because the AF_XDP netdev bypasses the Linux TC layer.  A
+   possible workaround is to use the OpenFlow meter action.
+#. Adding an AF_XDP device to a bridge, removing it, and re-adding it fails.
+#. Most of the tests were done using a single i40e port.  Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test results yet (TODO item).
+
+
+make check-afxdp
+----------------
+When executing 'make check-afxdp', OVS creates namespaces, sets up AF_XDP on
+veth devices, and kick-starts the testing.  So far we have the following test
+cases::
+
+ AF_XDP netdev datapath-sanity
+
+  1: datapath - ping between two ports               ok
+  2: datapath - ping between two ports on vlan       ok
+  3: datapath - ping6 between two ports              ok
+  4: datapath - ping6 between two ports on vlan      ok
+  5: datapath - ping over vxlan tunnel               ok
+  6: datapath - ping over vxlan6 tunnel              ok
+  7: datapath - ping over gre tunnel                 ok
+  8: datapath - ping over erspan v1 tunnel           ok
+  9: datapath - ping over erspan v2 tunnel           ok
+ 10: datapath - ping over ip6erspan v1 tunnel        ok
+ 11: datapath - ping over ip6erspan v2 tunnel        ok
+ 12: datapath - ping over geneve tunnel              ok
+ 13: datapath - ping over geneve6 tunnel             ok
+ 14: datapath - clone action                         ok
+ 15: datapath - basic truncate action                ok
+
+ conntrack
+
+ 16: conntrack - controller                          ok
+ 17: conntrack - force commit                        ok
+ 18: conntrack - ct flush by 5-tuple                 ok
+ 19: conntrack - IPv4 ping                           ok
+ 20: conntrack - get_nconns and get/set_maxconns     ok
+ 21: conntrack - IPv6 ping                           ok
+
+ system-ovn
+
+ 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
+ 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
+ 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
+ 25: ovn -- load-balancing                           ok
+ 26: ovn -- load-balancing - same subnet.            ok
+ 27: ovn -- load balancing in gateway router         ok
+ 28: ovn -- multiple gateway routers, load-balancing ok
+ 29: ovn -- load balancing in router with gateway router port ok
+ 30: ovn -- DNAT and SNAT on distributed router - N/S ok
+ 31: ovn -- DNAT and SNAT on distributed router - E/W ok
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC and a tap device connected to a VM.
+First, start OVS, then add the physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+The measured performance is around 700Kpps.  The rate is limited by
+the kernel's tap interface, which requires copying each packet from
+the umem buffer in userspace into the kernel.
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
+
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+Performance: for RX_DROP: 800Kpps, TX: 700Kpps
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index b532a4579266..5782f7e4bc2e 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -221,6 +221,38 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index 505e3d041e93..29c90b73f836 100644
--- a/configure.ac
+++ b/configure.ac
@@ -99,6 +99,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index cc5dccf39d6b..686e57f8c472 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@  if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -392,6 +396,7 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-tc-offloads.c \
 	lib/netdev-tc-offloads.h \
 	lib/netlink-conntrack.c \
@@ -409,6 +414,14 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..7d086dc5e860 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -22,6 +22,9 @@ 
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
+#ifdef HAVE_AF_XDP
+#include "netdev-afxdp.h"
+#endif
 
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source source)
@@ -59,6 +62,27 @@  dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes
+ * of memory starting at 'base', which lies inside an AF_XDP umem.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+{
+    dp_packet_set_base(b, base);
+    dp_packet_set_data(b, base);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_set_allocated(b, allocated);
+    b->source = DPBUF_AFXDP;
+    dp_packet_reset_offsets(b);
+    pkt_metadata_init(&b->md, 0);
+    dp_packet_reset_cutlen(b);
+    dp_packet_reset_offload(b);
+    b->packet_type = htonl(PT_ETH);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +146,11 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+#ifdef HAVE_AF_XDP
+            free_afxdp_buf(b);
+#endif
+            return;
         }
     }
 }
@@ -248,6 +277,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +465,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..0f533201f956 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,10 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#ifdef HAVE_AF_XDP
+#include "netdev-afxdp.h"
+#endif
+
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +46,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* buffer data from XDP frame */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +94,13 @@  struct dp_packet {
     };
 };
 
+#if HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +134,9 @@  static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#if HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +198,12 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+#ifdef HAVE_AF_XDP
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+#endif
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..cc91720fad6e 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -198,6 +198,20 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif HAVE_AF_XDP
+    /* rdtsc is an x86-specific instruction. */
+    union {
+        uint64_t tsc_64;
+        struct {
+            uint32_t lo_32;
+            uint32_t hi_32;
+        };
+    } tsc;
+    asm volatile("rdtsc" :
+             "=a" (tsc.lo_32),
+             "=d" (tsc.hi_32));
+
+    return s->last_tsc = tsc.tsc_64;
 #else
     return s->last_tsc = 0;
 #endif
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..cd1b9ca8be77
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,727 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if !defined(__i386__) && !defined(__x86_64__)
+#error AF_XDP supported only for Linux on x86 or x86_64
+#endif
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-afxdp.h"
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <linux/rtnetlink.h>
+#include <linux/sockios.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <net/if_arp.h>
+#include <net/route.h>
+#include <netinet/in.h>
+#include <netpacket/packet.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/utsname.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dpif-netlink.h"
+#include "dpif-netdev.h"
+#include "fatal-signal.h"
+#include "hash.h"
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "netlink-notifier.h"
+#include "netlink-socket.h"
+#include "netlink.h"
+#include "netnsid.h"
+#include "openflow/openflow.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/ofpbuf.h"
+#include "openvswitch/poll-loop.h"
+#include "openvswitch/vlog.h"
+#include "openvswitch/shash.h"
+#include "ovs-atomic.h"
+#include "packets.h"
+#include "rtnetlink.h"
+#include "socket-util.h"
+#include "sset.h"
+#include "tc.h"
+#include "timer.h"
+#include "unaligned.h"
+#include "util.h"
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
+#define UMEM2XPKT(base, i) \
+                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
+                               i * sizeof(struct dp_packet_afxdp))
+
+static uint32_t prog_id;
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+
+static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
+                                                int xdpmode)
+{
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xcalloc(1, sizeof(*umem));
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           NULL);
+
+    if (ret) {
+        VLOG_ERR("xsk umem create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV");
+        free(umem);
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* set-up umem pool */
+    umem_pool_init(&umem->mpool, NUM_FRAMES);
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct umem_elem *elem;
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* set-up metadata */
+    xpacket_pool_init(&umem->xpool, NUM_FRAMES);
+
+    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = UMEM2XPKT(umem->xpool.array, i);
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0;
+    int ret;
+    int i;
+
+    xsk = xcalloc(1, sizeof(*xsk));
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        free(xsk);
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB" : "DRV",
+                 queue_id);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        free(xsk);
+        return NULL;
+    }
+
+    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
+
+    for (i = 0;
+         i < PROD_NUM_DESCS * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+    int ret;
+
+    /* umem memory region */
+    ret = posix_memalign(&bufs, get_page_size(),
+                         NUM_FRAMES * FRAME_SIZE);
+    ovs_assert(!ret);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+
+    /* create AF_XDP socket */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        free(bufs);
+        return NULL;
+    }
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* clean up umem and xpacket pool */
+        (void)xsk_umem__delete(umem->umem);
+        free(bufs);
+        umem_pool_cleanup(&umem->mpool);
+        xpacket_pool_cleanup(&umem->xpool);
+        free(umem);
+    }
+    return xsk;
+}
+
+int
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk;
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    /* configure each queue */
+    for (i = 0; i < netdev->n_rxq; i++) {
+        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
+                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk) {
+            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
+            goto err;
+        }
+        dev->xsk[i] = xsk;
+    }
+
+    return 0;
+
+err:
+    xsk_destroy_all(netdev);
+    return EINVAL;
+}
+
+static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
+{
+    struct ds ds = DS_EMPTY_INITIALIZER;
+    ds_put_hex_dump(&ds, buf, count, 0, false);
+    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
+    ds_destroy(&ds);
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk)
+{
+    struct xsk_umem *umem;
+
+    if (!xsk) {
+        return;
+    }
+
+    umem = xsk->umem->umem;
+    xsk_socket__delete(xsk->xsk);
+    (void)xsk_umem__delete(umem);
+
+    /* free the packet buffer */
+    free(xsk->umem->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&xsk->umem->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&xsk->umem->xpool);
+
+    free(xsk->umem);
+    free(xsk);
+}
+
+void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    for (i = 0; i < MAX_XSKQ; i++) {
+        if (dev->xsk[i]) {
+            VLOG_INFO("destroy xsk[%d]", i);
+            xsk_destroy(dev->xsk[i]);
+            dev->xsk[i] = NULL;
+        }
+    }
+    VLOG_INFO("remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static inline void OVS_UNUSED
+print_xsk_stat(struct xsk_socket_info *xsk) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+               &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx_dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                stat.rx_dropped,
+                stat.rx_invalid_descs,
+                stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    const char *xdpmode;
+    int new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (new_n_rxq != netdev->n_rxq) {
+        dev->requested_n_rxq = new_n_rxq;
+        netdev_request_reconfigure(netdev);
+    }
+
+    xdpmode = smap_get(args, "xdpmode");
+    if (xdpmode && !strcmp(xdpmode, "drv")) {
+        dev->requested_xdpmode = XDP_ZEROCOPY;
+        if (dev->xdpmode != dev->requested_xdpmode) {
+            netdev_request_reconfigure(netdev);
+        }
+    } else {
+        dev->requested_xdpmode = XDP_COPY;
+        if (dev->xdpmode != dev->requested_xdpmode) {
+            netdev_request_reconfigure(netdev);
+        }
+    }
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+    netdev->n_rxq = dev->requested_n_rxq;
+
+    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
+        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
+        /* From SKB mode to DRV mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+        dev->xdp_bind_flags = XDP_ZEROCOPY;
+        dev->xdpmode = XDP_ZEROCOPY;
+
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
+                      ovs_strerror(errno));
+        }
+    } else {
+        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
+        /* From DRV mode to SKB mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+        dev->xdp_bind_flags = XDP_COPY;
+        dev->xdpmode = XDP_COPY;
+        /* TODO: set rlimit back to previous value
+         * when no device is in DRV mode.
+         */
+    }
+
+    err = xsk_configure_all(netdev);
+    if (err) {
+        VLOG_ERR("AF_XDP device %s reconfig failed", netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always uses NUMA id 0",
+              netdev_get_name(netdev));
+    return 0;
+}
+
+void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t curr_prog_id = 0;
+    uint32_t flags;
+
+    /* remove_xdp_program() */
+    if (xdpmode == XDP_COPY) {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    } else {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    }
+
+    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    }
+    if (prog_id == curr_prog_id) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    } else if (!curr_prog_id) {
+        VLOG_INFO("couldn't find a prog id on a given interface");
+    } else {
+        VLOG_INFO("program on interface changed, not removing");
+    }
+}
+
+struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    unsigned long addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
+void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+    struct dp_packet_afxdp *xpacket = NULL;
+    struct dp_packet *packet;
+    void *elems[BATCH_SIZE];
+    unsigned long addr;
+
+    /* All packets are AF_XDP buffers, so free them in one batch. */
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (xpacket->mpool) {
+            void *base = dp_packet_base(packet);
+
+            addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+            elems[i] = (void *)addr;
+        }
+    }
+    umem_elem_push_n(xpacket->mpool, batch->count, elems);
+    dp_packet_batch_init(batch);
+}
+
+/* Receive packet from AF_XDP socket */
+int
+netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
+                     struct dp_packet_batch *batch)
+{
+    struct umem_elem *elems[BATCH_SIZE];
+    uint32_t idx_rx = 0, idx_fq = 0;
+    unsigned int rcvd, i;
+    int ret = 0;
+
+    /* See if there is any packet on RX queue,
+     * if yes, idx_rx is the index having the packet.
+     */
+    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return 0;
+    }
+
+    /* Form a dp_packet batch from descriptor in RX queue */
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
+        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
+        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+        uint64_t index;
+
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        index = addr >> FRAME_SHIFT;
+        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
+
+        packet = &xpacket->packet;
+        xpacket->mpool = &xsk->umem->mpool;
+
+        /* Initialize the struct dp_packet */
+        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+
+    /* We've consumed rcvd packets from the RX queue, so refill the
+     * same number of frames into the FILL queue.
+     */
+    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
+    if (OVS_UNLIKELY(ret)) {
+        return -ENOMEM;
+    }
+
+    for (i = 0; i < rcvd; i++) {
+        uint64_t index;
+        struct umem_elem *elem;
+
+        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
+        while (OVS_UNLIKELY(ret == 0)) {
+            /* The FILL queue is full, so retry. (or skip)? */
+            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
+        }
+
+        /* Get one free umem, program it into FILL queue */
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+
+    /* Release the RX queue */
+    xsk_ring_cons__release(&xsk->rx, rcvd);
+    xsk->rx_npkts += rcvd;
+
+#ifdef AFXDP_DEBUG
+    print_xsk_stat(xsk);
+#endif
+    return 0;
+}
+
+static inline int kick_tx(struct xsk_socket_info *xsk)
+{
+    int ret;
+
+    /* This causes system call into kernel's xsk_sendmsg, and
+     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
+     */
+    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
+    if (OVS_UNLIKELY(ret < 0)) {
+        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
+            return errno;
+        }
+    }
+    /* no error, or EBUSY or EAGAIN */
+    return 0;
+}
+
+int
+netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
+                              struct dp_packet_batch *batch)
+{
+    struct umem_elem *elems_pop[BATCH_SIZE];
+    struct umem_elem *elems_push[BATCH_SIZE];
+    uint32_t tx_done, idx_cq = 0;
+    struct dp_packet *packet;
+    uint32_t idx = 0;
+    int j, ret, retry_count = 0;
+    const int max_retry = 4;
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        return EAGAIN;
+    }
+
+    /* Make sure we have enough TX descs */
+    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+        return EAGAIN;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct umem_elem *elem;
+        uint64_t index;
+
+        elem = elems_pop[i];
+        /* Copy the packet into the umem element just popped from the
+         * umem pool.  This copy could be avoided if the packet already
+         * resides in the same umem.
+         */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk->tx, batch->count);
+    xsk->outstanding_tx += batch->count;
+
+    ret = kick_tx(xsk);
+    if (OVS_UNLIKELY(ret)) {
+        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+        return ret;
+    }
+
+retry:
+    /* Process CQ */
+    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
+    if (tx_done > 0) {
+        xsk->outstanding_tx -= tx_done;
+        xsk->tx_npkts += tx_done;
+    }
+
+    /* Recycle back to umem pool */
+    for (j = 0; j < tx_done; j++) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)xsk->umem->buffer + addr);
+        elems_push[j] = elem;
+    }
+
+    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
+    ovs_assert(ret == 0);
+
+    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
+
+    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
+        /* If there are still a lot not transmitted, try harder. */
+        if (retry_count++ > max_retry) {
+            return 0;
+        }
+        goto retry;
+    }
+
+    return 0;
+}
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..6518d8fca0b5
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,53 @@ 
+/*
+ * Copyright (c) 2018 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+#define MAX_XSKQ 16
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+
+struct dp_packet_afxdp *dp_packet_cast_afxdp(const struct dp_packet *d);
+
+int xsk_configure_all(struct netdev *netdev);
+
+void xsk_destroy_all(struct netdev *netdev);
+
+int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
+                         struct dp_packet_batch *batch);
+
+int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
+                                  struct dp_packet_batch *batch);
+
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+
+void free_afxdp_buf(struct dp_packet *p);
+void free_afxdp_buf_batch(struct dp_packet_batch *batch);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
new file mode 100644
index 000000000000..3dd3d902b3c4
--- /dev/null
+++ b/lib/netdev-linux-private.h
@@ -0,0 +1,124 @@ 
+/*
+ * Copyright (c) 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_LINUX_PRIVATE_H
+#define NETDEV_LINUX_PRIVATE_H 1
+
+#include <config.h>
+
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
+
+#ifdef HAVE_AF_XDP
+#include "netdev-afxdp.h"
+#endif
+
+/* These functions are Linux specific, so they should be used directly only by
+ * Linux-specific code. */
+
+struct netdev;
+
+int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
+                                  const char *flag_name, bool enable);
+int linux_get_ifindex(const char *netdev_name);
+
+#define LINUX_FLOW_OFFLOAD_API                          \
+   .flow_flush = netdev_tc_flow_flush,                  \
+   .flow_dump_create = netdev_tc_flow_dump_create,      \
+   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
+   .flow_dump_next = netdev_tc_flow_dump_next,          \
+   .flow_put = netdev_tc_flow_put,                      \
+   .flow_get = netdev_tc_flow_get,                      \
+   .flow_del = netdev_tc_flow_del,                      \
+   .init_flow_api = netdev_tc_init_flow_api
+
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+    /* AF_XDP information. */
+#ifdef HAVE_AF_XDP
+    struct xsk_socket_info *xsk[MAX_XSKQ];
+    int requested_n_rxq;
+    int xdpmode, requested_xdpmode; /* To detect mode changes. */
+    int xdp_flags, xdp_bind_flags;
+#endif
+};
+
+static inline struct netdev_linux *
+netdev_linux_cast(const struct netdev *netdev)
+{
+    return CONTAINER_OF(netdev, struct netdev_linux, up);
+}
+
+#endif /* netdev-linux-private.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index f75d73fd39f8..1f190406d145 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -17,6 +17,7 @@ 
 #include <config.h>
 
 #include "netdev-linux.h"
+#include "netdev-linux-private.h"
 
 #include <errno.h>
 #include <fcntl.h>
@@ -54,6 +55,7 @@ 
 #include "fatal-signal.h"
 #include "hash.h"
 #include "openvswitch/hmap.h"
+#include "netdev-afxdp.h"
 #include "netdev-provider.h"
 #include "netdev-tc-offloads.h"
 #include "netdev-vport.h"
@@ -487,51 +489,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
 
 struct netdev_rxq_linux {
     struct netdev_rxq up;
@@ -579,18 +536,23 @@  is_netdev_linux_class(const struct netdev_class *netdev_class)
     return netdev_class->run == netdev_linux_run;
 }
 
+#if HAVE_AF_XDP
 static bool
-is_tap_netdev(const struct netdev *netdev)
+is_afxdp_netdev(const struct netdev *netdev)
 {
-    return netdev_get_class(netdev) == &netdev_tap_class;
+    return netdev_get_class(netdev) == &netdev_afxdp_class;
 }
-
-static struct netdev_linux *
-netdev_linux_cast(const struct netdev *netdev)
+#else
+static bool
+is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
 {
-    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
-
-    return CONTAINER_OF(netdev, struct netdev_linux, up);
+    return false;
+}
+#endif
+static bool
+is_tap_netdev(const struct netdev *netdev)
+{
+    return netdev_get_class(netdev) == &netdev_tap_class;
 }
 
 static struct netdev_rxq_linux *
@@ -1084,6 +1046,11 @@  netdev_linux_destruct(struct netdev *netdev_)
         atomic_count_dec(&miimon_cnt);
     }
 
+#if HAVE_AF_XDP
+    if (is_afxdp_netdev(netdev_)) {
+        xsk_destroy_all(netdev_);
+    }
+#endif
     ovs_mutex_destroy(&netdev->mutex);
 }
 
@@ -1113,7 +1080,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     rx->is_tap = is_tap_netdev(netdev_);
     if (rx->is_tap) {
         rx->fd = netdev->tap_fd;
-    } else {
+    } else if (!is_afxdp_netdev(netdev_)) {
         struct sockaddr_ll sll;
         int ifindex, val;
         /* Result of tcpdump -dd inbound */
@@ -1318,10 +1285,18 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
 {
     struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
     struct netdev *netdev = rx->up.netdev;
-    struct dp_packet *buffer;
+    struct dp_packet *buffer = NULL;
     ssize_t retval;
     int mtu;
 
+#if HAVE_AF_XDP
+    if (is_afxdp_netdev(netdev)) {
+        struct netdev_linux *dev = netdev_linux_cast(netdev);
+        int qid = rxq_->queue_id;
+
+        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
+    }
+#endif
     if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
         mtu = ETH_PAYLOAD_MAX;
     }
@@ -1329,6 +1304,7 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     /* Assume Ethernet port. No need to set packet_type. */
     buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
                                            DP_NETDEV_HEADROOM);
+
     retval = (rx->is_tap
               ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
               : netdev_linux_rxq_recv_sock(rx->fd, buffer));
@@ -1480,7 +1456,8 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
     int error = 0;
     int sock = 0;
 
-    if (!is_tap_netdev(netdev_)) {
+    if (!is_tap_netdev(netdev_) &&
+        !is_afxdp_netdev(netdev_)) {
         if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
             error = EOPNOTSUPP;
             goto free_batch;
@@ -1499,6 +1476,36 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
         }
 
         error = netdev_linux_sock_batch_send(sock, ifindex, batch);
+#if HAVE_AF_XDP
+    } else if (is_afxdp_netdev(netdev_)) {
+        struct netdev_linux *dev = netdev_linux_cast(netdev_);
+        struct dp_packet_afxdp *xpacket;
+        struct umem_pool *first_mpool = NULL;
+        struct dp_packet *packet;
+
+        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
+
+        /* All packets must come from the same umem pool and have the
+         * DPBUF_AFXDP type; otherwise, free them one by one.
+         */
+        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+            if (packet->source != DPBUF_AFXDP) {
+                goto free_batch;
+            }
+
+            xpacket = dp_packet_cast_afxdp(packet);
+            if (i == 0) {
+                first_mpool = xpacket->mpool;
+                continue;
+            }
+            if (xpacket->mpool != first_mpool) {
+                goto free_batch;
+            }
+        }
+        /* free in batch */
+        free_afxdp_buf_batch(batch);
+        return error;
+#endif
     } else {
         error = netdev_linux_tap_batch_send(netdev_, batch);
     }
@@ -3323,6 +3330,7 @@  const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
@@ -3333,6 +3341,7 @@  const struct netdev_class netdev_linux_class = {
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
@@ -3343,10 +3352,26 @@  const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .get_stats = netdev_linux_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
index 17ca9120168a..b812e64cb078 100644
--- a/lib/netdev-linux.h
+++ b/lib/netdev-linux.h
@@ -19,6 +19,20 @@ 
 
 #include <stdint.h>
 #include <stdbool.h>
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
 
 /* These functions are Linux specific, so they should be used directly only by
  * Linux-specific code. */
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e6e8e8..d433818f7064 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -902,7 +902,9 @@  extern const struct netdev_class netdev_linux_class;
 #endif
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
-
+#if HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 7d7ecf6f0946..e2fae37d5a5e 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -146,6 +146,9 @@  netdev_initialize(void)
         netdev_register_provider(&netdev_internal_class);
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..2d80e74d69e4
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,239 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+
+#include "xdpsock.h"
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <syslog.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "async-append.h"
+#include "coverage.h"
+#include "dirs.h"
+#include "dp-packet.h"
+#include "openvswitch/compiler.h"
+#include "openvswitch/vlog.h"
+#include "ovs-atomic.h"
+#include "ovs-thread.h"
+#include "sat-math.h"
+#include "socket-util.h"
+#include "svec.h"
+#include "syslog-direct.h"
+#include "syslog-libc.h"
+#include "syslog-provider.h"
+#include "timeval.h"
+#include "unixctl.h"
+#include "util.h"
+
+static inline void
+ovs_spinlock_init(ovs_spinlock_t *sl)
+{
+    atomic_init(&sl->locked, 0);
+}
+
+static inline void
+ovs_spin_lock(ovs_spinlock_t *sl)
+{
+    int exp = 0, locked = 0;
+
+    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed)) {
+        locked = 1;
+        while (locked) {
+            atomic_read_relaxed(&sl->locked, &locked);
+        }
+        exp = 0;
+    }
+}
+
+static inline void
+ovs_spin_unlock(ovs_spinlock_t *sl)
+{
+    atomic_store_explicit(&sl->locked, 0, memory_order_release);
+}
+
+static inline int OVS_UNUSED
+ovs_spin_trylock(ovs_spinlock_t *sl)
+{
+    int exp = 0;
+    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed);
+}
+
+inline int
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        return -ENOMEM;
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+
+    return 0;
+}
+
+int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->mutex);
+    ret = __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ret;
+}
+
+inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+
+    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
+        /* Stack overflow; this should not happen. */
+        OVS_NOT_REACHED();
+    }
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->mutex);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->mutex);
+}
+
+inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->mutex);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ret;
+}
+
+inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->mutex);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ptr;
+}
+
+void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+
+    ovs_assert(posix_memalign(&bufs, getpagesize(),
+                              size * sizeof(void *)) == 0);
+    memset(bufs, 0, size * sizeof(void *));
+    return (void **)bufs;
+}
+
+unsigned int
+umem_elem_count(struct umem_pool *mpool)
+{
+    return mpool->index;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->mutex);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free(umemp->array);
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+
+    /* TODO: Check HAVE_POSIX_MEMALIGN. */
+    ovs_assert(posix_memalign(&bufs, getpagesize(),
+                              size * sizeof(struct dp_packet_afxdp)) == 0);
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free(xp->array);
+}
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..aabaa8e5df24
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,123 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <bpf/libbpf.h>
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <locale.h>
+#include <net/if.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+
+#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
+#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
+#define BATCH_SIZE      NETDEV_MAX_BURST
+#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define NUM_FRAMES      4096
+#ifdef USE_XSK_DEFAULT
+#define PROD_NUM_DESCS  XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS  XSK_RING_CONS__DEFAULT_NUM_DESCS
+#else
+#define PROD_NUM_DESCS  512
+#define CONS_NUM_DESCS  512
+#endif
+
+typedef struct {
+    atomic_int locked;
+} ovs_spinlock_t;
+
+/* LIFO stack of pointers to unused umem buffers. */
+struct umem_pool {
+    int index;      /* Index of the next free slot (top of the stack). */
+    unsigned int size;
+    ovs_spinlock_t mutex;
+    void **array;   /* Array of pointers to umem buffers. */
+};
+
+/* array-based dp_packet_afxdp */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_npkts;
+    unsigned long tx_npkts;
+    unsigned long prev_rx_npkts;
+    unsigned long prev_tx_npkts;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void __umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *__umem_elem_pop(struct umem_pool *umemp);
+void *umem_elem_pop(struct umem_pool *umemp);
+int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+void **__umem_pool_alloc(unsigned int size);
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+unsigned int umem_elem_count(struct umem_pool *mpool);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
diff --git a/tests/automake.mk b/tests/automake.mk
index ea16532dd2a0..715cef9a6b3b 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -158,6 +160,11 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-traffic.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -182,6 +189,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -315,6 +323,11 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -352,6 +365,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..2c58c2d6554b
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,153 @@ 
+# _ADD_BR([name])
+#
+# Expands into the proper ovs-vsctl commands to create a bridge with the
+# appropriate type and properties
+m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
+
+# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
+#
+# Creates a database and starts ovsdb-server, starts ovs-vswitchd
+# connected to that database, calls ovs-vsctl to create a bridge named
+# br0 with predictable settings, passing 'vsctl-args' as additional
+# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
+# output (e.g. because it includes "create" commands) then 'vsctl-output'
+# specifies the expected output after filtering through uuidfilt.
+m4_define([OVS_TRAFFIC_VSWITCHD_START],
+  [
+   export OVS_PKGDATADIR=$(pwd)
+   _OVS_VSWITCHD_START([--disable-system])
+   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
+])
+
+# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
+#
+# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
+# for messages with severity WARN or higher and signaling an error if any
+# is present.  The optional WHITELIST may contain shell-quoted "sed"
+# commands to delete any warnings that are actually expected, e.g.:
+#
+#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
+#
+# 'extra_cmds' are shell commands to be executed after OVS_VSWITCHD_STOP() is
+# invoked. They can be used to perform additional cleanups such as name space
+# removal.
+m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
+  [OVS_VSWITCHD_STOP([dnl
+$1";/netdev_linux.*obtaining netdev stats via vport failed/d
+/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d
+/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
+/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
+"])
+   AT_CHECK([:; $2])
+  ])
+
+m4_define([ADD_VETH_AFXDP],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
+
+# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
+m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
+    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
+     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
+     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
+    ]
+)
+
+# CONFIGURE_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads for veths.  The userspace datapath uses the AF_PACKET
+# socket to receive packets for veths.  Unfortunately, the AF_PACKET socket
+# doesn't play well with offloads:
+# 1. GSO packets are received without segmentation and therefore discarded.
+# 2. Packets with offloaded partial checksum are received with the wrong
+#    checksum, therefore discarded by the receiver.
+#
+# By disabling tx offloads in the non-OVS side of the veth peer we make sure
+# that the AF_PACKET socket will not receive bad packets.
+#
+# This is a workaround, and should be removed when offloads are properly
+# supported in netdev-linux.
+m4_define([CONFIGURE_VETH_OFFLOADS],
+    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
+)
+
+# CHECK_CONNTRACK()
+#
+# Perform requirements checks for running conntrack tests.
+#
+m4_define([CHECK_CONNTRACK],
+    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
+)
+
+# CHECK_CONNTRACK_ALG()
+#
+# Perform requirements checks for running conntrack ALG tests. The userspace
+# supports FTP and TFTP.
+#
+m4_define([CHECK_CONNTRACK_ALG])
+
+# CHECK_CONNTRACK_FRAG()
+#
+# Perform requirements checks for running conntrack fragmentation tests.
+# The userspace doesn't support fragmentation yet, so skip the tests.
+m4_define([CHECK_CONNTRACK_FRAG],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_LOCAL_STACK()
+#
+# Perform requirements checks for running conntrack tests with local stack.
+# While the kernel connection tracker automatically passes all the connection
+# tracking state from an internal port to the Open vSwitch kernel module, there
+# is simply no way of doing that with the userspace, so skip the tests.
+m4_define([CHECK_CONNTRACK_LOCAL_STACK],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_NAT()
+#
+# Perform requirements checks for running conntrack NAT tests. The userspace
+# datapath supports NAT.
+#
+m4_define([CHECK_CONNTRACK_NAT])
+
+# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
+#
+# Perform requirements checks for running ovs-dpctl flush-conntrack by
+# conntrack 5-tuple test. The userspace datapath does not support
+# this feature yet.
+m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CT_DPIF_SET_GET_MAXCONNS()
+#
+# Perform requirements checks for running ovs-dpctl ct-set-maxconns or
+# ovs-dpctl ct-get-maxconns. The userspace datapath does support this feature.
+m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
+
+# CHECK_CT_DPIF_GET_NCONNS()
+#
+# Perform requirements checks for running ovs-dpctl ct-get-nconns. The
+# userspace datapath does support this feature.
+m4_define([CHECK_CT_DPIF_GET_NCONNS])
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..538c0d15d556
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+m4_include([tests/system-common-macros.at])
+
+m4_include([tests/system-afxdp-traffic.at])
+m4_include([tests/system-ovn.at])
diff --git a/tests/system-afxdp-traffic.at b/tests/system-afxdp-traffic.at
new file mode 100644
index 000000000000..26f72acf48ef
--- /dev/null
+++ b/tests/system-afxdp-traffic.at
@@ -0,0 +1,978 @@ 
+AT_BANNER([AF_XDP netdev datapath-sanity])
+
+AT_SETUP([datapath - ping between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+ulimit -l unlimited
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
+ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
+ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan tunnel])
+OVS_CHECK_VXLAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [id 0 dstport 4789])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan6 tunnel])
+OVS_CHECK_VXLAN_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
+                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over gre tunnel])
+OVS_CHECK_GRE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over erspan v1 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=1 options:erspan_idx=7])
+ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over erspan v2 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=2 options:erspan_dir=1 options:erspan_hwid=0x7])
+ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
+AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
+                [options:key=123 options:erspan_ver=1 options:erspan_idx=0x7])
+ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
+                   [10.1.1.1/24], [local fc00:100::1 seq key 123 erspan_ver 1 erspan 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
+AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
+                [options:key=121 options:erspan_ver=2 options:erspan_dir=0 options:erspan_hwid=0x7])
+ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
+                   [10.1.1.1/24],
+                   [local fc00:100::1 seq key 121 erspan_ver 2 erspan_dir ingress erspan_hwid 0x7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve tunnel])
+OVS_CHECK_GENEVE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [vni 0])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve6 tunnel])
+OVS_CHECK_GENEVE_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
+                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - clone action])
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
+                    -- set interface ovs-p1 ofport_request=2])
+
+AT_DATA([flows.txt], [dnl
+priority=1 actions=NORMAL
+priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
+priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - basic truncate action])
+AT_SKIP_IF([test $HAVE_NC = no])
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-ofctl del-flows br0])
+
+dnl Create p0 and ovs-p0(1)
+ADD_NAMESPACES(at_ns0)
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
+NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
+
+dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
+AT_CHECK([ip link add p1 type veth peer name ovs-p1])
+on_exit 'ip link del ovs-p1'
+AT_CHECK([ip link set dev ovs-p1 up])
+AT_CHECK([ip link set dev p1 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
+dnl Use p1 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
+
+dnl Create p2(5) and ovs-p2(4)
+AT_CHECK([ip link add p2 type veth peer name ovs-p2])
+on_exit 'ip link del ovs-p2'
+AT_CHECK([ip link set dev ovs-p2 up])
+AT_CHECK([ip link set dev p2 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
+dnl Use p2 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
+
+dnl basic test
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl use this file as payload file for ncat
+AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
+on_exit 'rm -f payload200.bin'
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl packet with truncated size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=100
+])
+dnl packet with original size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=242
+])
+
+dnl more complicated output actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
+
+dnl SLOW_ACTION: disable kernel datapath truncate support
+dnl Repeat the test above, but exercise the SLOW_ACTION code path
+AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
+
+dnl SLOW_ACTION test1: check datapath actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
+AT_CHECK([tail -3 stdout], [0],
+[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
+This flow is handled by the userspace slow path because it:
+  - Uses action(s) not supported by datapath.
+])
+
+dnl SLOW_ACTION test2: check actual packet truncate
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+
+AT_BANNER([conntrack])
+
+AT_SETUP([conntrack - controller])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow ARP and return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(commit),controller
+priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
+priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl Send an unsolicited reply from port 2. This should be dropped.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
+
+dnl OK, now start a new connection from port 1.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
+
+dnl Now try a reply from port 2.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
+NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - force commit])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(force,commit),controller
+priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
+priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
+table=1,in_port=2,ct_state=+trk,udp,action=controller
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl Send an unsolicited reply from port 2. This should be dropped.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+dnl OK, now start a new connection from port 1.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+
+dnl Now try a reply from port 2.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
+NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
+])
+
+dnl
+dnl Check that the directionality has been changed by force commit.
+dnl
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [], [dnl
+udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
+])
+
+dnl OK, now send another packet from port 1 and see that it switches again
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
+udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - ct flush by 5-tuple])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(commit),2
+priority=100,in_port=2,udp,action=ct(zone=5,commit),1
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Test UDP from port 1
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
+udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [1], [dnl
+])
+
+dnl Test UDP from port 2
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [dnl
+udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+])
+
+dnl Test ICMP traffic
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [stdout])
+AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
+icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
+])
+
+ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
+ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [1], [dnl
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv4 ping])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow ARP and return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
+priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+dnl Pings from ns1->ns0 should fail.
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
+7 packets transmitted, 0 received, 100% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - get_nconns and get/set_maxconns])
+CHECK_CONNTRACK()
+CHECK_CT_DPIF_SET_GET_MAXCONNS()
+CHECK_CT_DPIF_GET_NCONNS()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow ARP and return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
+priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: maxconns missing or malformed (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
+ovs-vswitchd: maxconns missing or malformed (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
+1
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+3000000
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
+setting maxconns successful
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+10
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
+0
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+10
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv6 ping])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+AT_DATA([flows.txt], [dnl
+
+dnl ICMPv6 echo request and reply go to table 1.  The rest of the traffic goes
+dnl through normal action.
+table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
+table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
+table=0,priority=1,action=normal
+
+dnl Allow everything from ns0->ns1. Only allow return traffic from ns1->ns0.
+table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
+table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
+table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
+table=1,priority=1,action=drop
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
+
+dnl The above ping creates state in the connection tracker.  We're not
+dnl interested in that state.
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+dnl Pings from ns1->ns0 should fail.
+NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | FORMAT_PING], [0], [dnl
+7 packets transmitted, 0 received, 100% packet loss, time 0ms
+])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], [dnl
+icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP