diff mbox series

[ovs-dev,PATCHv7] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1556910148-52443-1-git-send-email-u9012063@gmail.com
State Superseded
Headers show
Series [ovs-dev,PATCHv7] netdev-afxdp: add new netdev type for AF_XDP. | expand

Commit Message

William Tu May 3, 2019, 7:02 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
built upon the eBPF and XDP technology.  It is aims to have comparable
performance to DPDK but cooperate better with existing kernel's networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's subsystems
As a result, AF_XDP socket shows much better performance than AF_PACKET
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst

Signed-off-by: William Tu <u9012063@gmail.com>

---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improve)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6-v7
- rebase to master, commit abf11558c1515bf3b1
- address feedbacks from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation
---
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 469 ++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  32 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  12 +
 lib/dp-packet.c                       |  12 +
 lib/dp-packet.h                       |  18 +
 lib/dpif-netdev-perf.h                |  14 +
 lib/netdev-afxdp.c                    | 698 ++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  51 ++
 lib/netdev-linux.c                    | 118 ++--
 lib/netdev-linux.h                    |  72 +++
 lib/netdev-provider.h                 |   4 +-
 lib/netdev.c                          |   3 +
 lib/xdpsock.c                         | 236 ++++++++
 lib/xdpsock.h                         | 127 +++++
 tests/automake.mk                     |  17 +
 tests/system-afxdp-macros.at          | 153 ++++++
 tests/system-afxdp-testsuite.at       |  26 +
 tests/system-afxdp-traffic.at         | 978 ++++++++++++++++++++++++++++++++++
 22 files changed, 2993 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
 create mode 100644 tests/system-afxdp-traffic.at

Comments

Ilya Maximets May 6, 2019, 12:37 p.m. UTC | #1
Hi. Thanks for a new version.

Quick review inline.

Best regards, Ilya Maximets.

On 03.05.2019 22:02, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> built upon the eBPF and XDP technology.  It is aims to have comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems
> As a result, AF_XDP socket shows much better performance than AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> 
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improve)
> - remove malloc dp_packet, instead put dp_packet in umem
> 
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> 
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
> 
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
> 
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
> 
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedbacks from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
> ---
> ---
>  Documentation/automake.mk             |   1 +
>  Documentation/index.rst               |   1 +
>  Documentation/intro/install/afxdp.rst | 469 ++++++++++++++++
>  Documentation/intro/install/index.rst |   1 +
>  acinclude.m4                          |  32 ++
>  configure.ac                          |   1 +
>  lib/automake.mk                       |  12 +
>  lib/dp-packet.c                       |  12 +
>  lib/dp-packet.h                       |  18 +
>  lib/dpif-netdev-perf.h                |  14 +
>  lib/netdev-afxdp.c                    | 698 ++++++++++++++++++++++++
>  lib/netdev-afxdp.h                    |  51 ++
>  lib/netdev-linux.c                    | 118 ++--
>  lib/netdev-linux.h                    |  72 +++
>  lib/netdev-provider.h                 |   4 +-
>  lib/netdev.c                          |   3 +
>  lib/xdpsock.c                         | 236 ++++++++
>  lib/xdpsock.h                         | 127 +++++
>  tests/automake.mk                     |  17 +
>  tests/system-afxdp-macros.at          | 153 ++++++
>  tests/system-afxdp-testsuite.at       |  26 +
>  tests/system-afxdp-traffic.at         | 978 ++++++++++++++++++++++++++++++++++
>  22 files changed, 2993 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/intro/install/afxdp.rst
>  create mode 100644 lib/netdev-afxdp.c
>  create mode 100644 lib/netdev-afxdp.h
>  create mode 100644 lib/xdpsock.c
>  create mode 100644 lib/xdpsock.h
>  create mode 100644 tests/system-afxdp-macros.at
>  create mode 100644 tests/system-afxdp-testsuite.at
>  create mode 100644 tests/system-afxdp-traffic.at
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>  
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..d68a4ac7ff8b
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,469 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It is aims to have comparable
> +performance to DPDK but cooperate better with existing kernel's networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, by-passing a couple of Linux kernel's subsystems.
> +As a result, AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see linux kernel's
> +Documentation/networking/af_xdp.rst
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, i.e., system, tap, or
> +internal.  The AF_XDP feature adds a new netdev types called
> +"afxdp", and implement its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the proposed architecture re-uses the existing
> +userspace dpif-netdev datapath.  As a result, most of
> +the packet processing happens at the userspace instead of
> +linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)

Empty line will be good here.

> +  ``_CONFIG_BPF=y``
> +
> +  ``_CONFIG_BPF_SYSCALL=y``
> +
> +  ``_CONFIG_XDP_SOCKETS=y``

It also be better to make a dot list instead. Like:

* item1
* item2

And why these underscores here?

> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  ``_CONFIG_BPF_JIT=y`` (Performance)
> +
> +  ``_CONFIG_HAVE_BPF_JIT=y`` (Performance)
> +
> +  ``_CONFIG_XDP_SOCKETS_DIAG=y`` (Debugging)
> +
> +- If possible, run **./xdpsock -r -N -z -i <your device>** under
> +  linux/samples/bpf.  This is the OVS indepedent benchmark tools for AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +Frist, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure having 5.0+ kernel
> +  make check-afxdp
> +
> +if a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd --disable-system
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +.. note::
> +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
> +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
> +   and datapath_type=netdev when adding a new bridge.
> +
> +Make sure your device driver support AF_XDP, and to use 1 PMD (on core 4)
> +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> +
> +To validate that the bridge has successfully instantiated, you can use the::
> +
> +  ovs-vsctl show
> +
> +should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debug by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> +section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +about AF_XDP current and future work.
> +
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing PMD
> +to keep polling the AF_XDP queues without any interferences from kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPU by doing isolcpu at grub configure.
> +
> +#. IRQ should not set to pmd running core.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use linux perf tool to see where your cpu
> +spends its cycle::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach \
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
> +#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A possible
> +   work-around is to use OpenFlow meter action.
> +#. AF_XDP device added to bridge, remove, and added again will fail.
> +#. Most of the tests are done using i40e single port. Multiple ports and
> +   also ixgbe driver also needs to be tested.
> +#. No latency test result (TODO items)
> +
> +
> +make check-afxdp
> +----------------
> +When executing 'make check-afxdp', OVS creates namespaces, sets up AF_XDP on
> +veth devices and kicks start the testing.  So far we have the following test
> +cases::
> +
> + AF_XDP netdev datapath-sanity
> +
> +  1: datapath - ping between two ports               ok
> +  2: datapath - ping between two ports on vlan       ok
> +  3: datapath - ping6 between two ports              ok
> +  4: datapath - ping6 between two ports on vlan      ok
> +  5: datapath - ping over vxlan tunnel               ok
> +  6: datapath - ping over vxlan6 tunnel              ok
> +  7: datapath - ping over gre tunnel                 ok
> +  8: datapath - ping over erspan v1 tunnel           ok
> +  9: datapath - ping over erspan v2 tunnel           ok
> + 10: datapath - ping over ip6erspan v1 tunnel        ok
> + 11: datapath - ping over ip6erspan v2 tunnel        ok
> + 12: datapath - ping over geneve tunnel              ok
> + 13: datapath - ping over geneve6 tunnel             ok
> + 14: datapath - clone action                         ok
> + 15: datapath - basic truncate action                ok
> +
> + conntrack
> +
> + 16: conntrack - controller                          ok
> + 17: conntrack - force commit                        ok
> + 18: conntrack - ct flush by 5-tuple                 ok
> + 19: conntrack - IPv4 ping                           ok
> + 20: conntrack - get_nconns and get/set_maxconns     ok
> + 21: conntrack - IPv6 ping                           ok
> +
> + system-ovn
> +
> + 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
> + 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
> + 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
> + 25: ovn -- load-balancing                           ok
> + 26: ovn -- load-balancing - same subnet.            ok
> + 27: ovn -- load balancing in gateway router         ok
> + 28: ovn -- multiple gateway routers, load-balancing ok
> + 29: ovn -- load balancing in router with gateway router port ok
> + 30: ovn -- DNAT and SNAT on distributed router - N/S ok
> + 31: ovn -- DNAT and SNAT on distributed router - E/W ok
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +The performance number I got is around 700Kpps.
> +This is due to using the kernel's tap interface, which requires copying
> +packet into kernel from the umem buffer in userspace.
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Setup the OpenFlow ruls::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1 type="afxdp" options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, run drop or bounce back the packet::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>  
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index b532a4579266..5782f7e4bc2e 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -221,6 +221,38 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>  
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 505e3d041e93..29c90b73f836 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index cc5dccf39d6b..e3c1d9cbf363 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -409,6 +413,14 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..c50f88e6e056 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -22,6 +22,9 @@
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
>  
>  static void
>  dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source source)
> @@ -122,6 +125,11 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +#ifdef HAVE_AF_XDP
> +            free_afxdp_buf(b);
> +#endif
> +            return;
>          }
>      }
>  }
> @@ -248,6 +256,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>  
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +444,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>  
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..91dcb886899f 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,10 @@
>  #include <rte_mbuf.h>
>  #endif
>  
> +#ifdef HAVE_AF_XDP
> +#include "netdev-afxdp.h"
> +#endif
> +
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +46,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>  
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +94,13 @@ struct dp_packet {
>      };
>  };
>  
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -184,6 +196,12 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>  
> +#ifdef HAVE_AF_XDP
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf((struct dp_packet *)b);

I think that pointer cast is not needed here. BTW, I don't know
why it exists for dpdk case above.

> +            return;
> +        }
> +#endif
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..cc91720fad6e 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif HAVE_AF_XDP
> +    /* This is x86-specific instructions. */
> +    union {
> +        uint64_t tsc_64;
> +        struct {
> +            uint32_t lo_32;
> +            uint32_t hi_32;
> +        };
> +    } tsc;
> +    asm volatile("rdtsc" :
> +             "=a" (tsc.lo_32),
> +             "=d" (tsc.hi_32));
> +
> +    return s->last_tsc = tsc.tsc_64;
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..48de4eaaeed3
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,698 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#if !defined(__i386__) && !defined(__x86_64__)
> +#error AF_XDP supported only for Linux on x86 or x86_64
> +#endif
> +
> +#include <config.h>

Some space here.

> +#include "netdev-linux.h"

And here.

> +#include <errno.h>
> +#include <fcntl.h>
> +#include <sys/types.h>
> +#include <netinet/in.h>
> +#include <arpa/inet.h>
> +#include <inttypes.h>
> +#include <sys/ioctl.h>
> +#include <sys/socket.h>
> +#include <sys/utsname.h>
> +#include <netpacket/packet.h>
> +#include <net/if.h>
> +#include <net/if_arp.h>
> +#include <net/route.h>
> +#include <poll.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netlink.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "fatal-signal.h"
> +#include "hash.h"
> +#include "openvswitch/hmap.h"
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "netlink-notifier.h"
> +#include "netlink-socket.h"
> +#include "netlink.h"
> +#include "netnsid.h"
> +#include "openvswitch/ofpbuf.h"
> +#include "openflow/openflow.h"
> +#include "ovs-atomic.h"
> +#include "packets.h"
> +#include "openvswitch/poll-loop.h"
> +#include "rtnetlink.h"
> +#include "openvswitch/shash.h"
> +#include "socket-util.h"
> +#include "sset.h"
> +#include "tc.h"
> +#include "timer.h"
> +#include "unaligned.h"
> +#include "openvswitch/vlog.h"
> +#include "util.h"
> +#include "netdev-afxdp.h"

Above header should be at the top, near to 'netdev-linux'.

> +
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/sockios.h>
> +#include <linux/if_xdp.h>
> +#include "xdpsock.h"

All above headers should be within corresponding blocks.
i.e. system headers along with system headers, ovs headers
with ovs headers above. In lexicographical order, if possible.

> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +    ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +    i * sizeof(struct dp_packet_afxdp))

Please, align this line to the first parenthesis.

> +
> +static uint32_t prog_id;
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +
> +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
> +                                                int xdpmode)
> +{
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof(*umem));
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +
> +    if (ret) {
> +        VLOG_ERR("xsk umem create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");

free(umem);

> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    umem_pool_init(&umem->mpool, NUM_FRAMES);
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    xpacket_pool_init(&umem->xpool, NUM_FRAMES);
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));

free(xsk);

> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket_create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);

free(xsk);

> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);

xsk_socket__delete(xsk->xsk);
free(xsk);

> +        return NULL;
> +    }
> +
> +    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
> +
> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +    int ret;
> +
> +    /* umem memory region */
> +    ret = posix_memalign(&bufs, getpagesize(),
> +                         NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +    ovs_assert(!ret);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {

free(bufs);

> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        free(bufs);
> +        (void)xsk_umem__delete(umem->umem);

'umem' created on 'bufs', so you need to delete 'umem' before
freeing the 'bufs'.

> +        umem_pool_cleanup(&xsk->umem->mpool);
> +        xpacket_pool_cleanup(&xsk->umem->xpool);

There is no xsk here:

umem_pool_cleanup(&umem->mpool);
xpacket_pool_cleanup(&umem->xpool);

And:

free(umem);

> +    }
> +    return xsk;
> +}
> +
> +void
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk;
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev->name);
> +
> +    /* configure each queue */
> +    for (i = 0; i < netdev->n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                dev->xdpmode == XDP_COPY ? "SKB" : "DRV");

Please, align to the first parenthesis.

> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);

Need to destroy sockets that was configured.

> +            return;

We also need to return the error and check the result in caller functions.

> +        }
> +        dev->xsk[i] = xsk;
> +    }
> +}
> +
> +static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
> +{
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +    ds_put_hex_dump(&ds, buf, count, 0, false);
> +    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
> +    ds_destroy(&ds);
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk)
> +{
> +    struct xsk_umem *umem;
> +
> +    if (!xsk) {
> +        return;
> +    }
> +
> +    umem = xsk->umem->umem;
> +    xsk_socket__delete(xsk->xsk);
> +    (void)xsk_umem__delete(umem);
> +
> +    /* free the packet buffer */
> +    free(xsk->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk->umem->xpool);

free(xsk->umem);
free(xsk);

> +}
> +
> +void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev->name);
> +
> +    for (i = 0; i < MAX_XSKQ; i++) {
> +        if (dev->xsk[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsk[i]);

dev->xsk[i] = NULL;
To avoid double destroy on multiple reconfigurations.

> +        }
> +    }
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static inline void OVS_UNUSED
> +print_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +                &stat, &optlen) == 0);

Please, align to arguments of 'getsockopt'.

> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                     stat.rx_dropped,
> +                     stat.rx_invalid_descs,
> +                     stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    const char *xdpmode;
> +    int new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (new_n_rxq != netdev->n_rxq) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        netdev_request_reconfigure(netdev);
> +    }
> +
> +    xdpmode = smap_get(args, "xdpmode");
> +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
> +        dev->requested_xdpmode = XDP_ZEROCOPY;
> +
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            VLOG_INFO("AF_XDP device %s in DRV mode", netdev->name);
> +
> +            /* From SKB mode to DRV mode */
> +            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +            dev->xdp_bind_flags = XDP_ZEROCOPY;
> +            dev->xdpmode = XDP_ZEROCOPY;

Some of these flags are used while device destruction. Also, they are used
while reporting the current device status. So, we should not update them
before the actual reconfiguration. This should be done after the xsk_destroy_all().
Same for the 'else' case and the 'setrlimit'.

> +            netdev_request_reconfigure(netdev);
> +        }
> +    } else {
> +        dev->requested_xdpmode = XDP_COPY;
> +        if (dev->xdpmode != dev->requested_xdpmode) {
> +            VLOG_INFO("AF_XDP device %s in SKB mode", netdev->name);
> +
> +            /* From DRV mode to SKB mode */
> +            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +            dev->xdp_bind_flags = XDP_COPY;
> +            dev->xdpmode = XDP_COPY;
> +            netdev_request_reconfigure(netdev);
> +        }
> +    }
> +
> +    if (dev->xdpmode == XDP_ZEROCOPY) {
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
> +                      ovs_strerror(errno));
> +        }
> +    }
> +
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +
> +    netdev->n_rxq = dev->requested_n_rxq;
> +    dev->xdpmode = dev->requested_xdpmode;
> +
> +    xsk_configure_all(netdev);

Need to get the actual status of xsk_configure_all() and set the 'err'
accordingly to avoid using broken device.

Another thought is that destroy/configure of each xsk should be implemented
as rxq_construct/destruct() callbacks, so the datapath will handle rxqs, i.e.
open and close them when needed. But this could be implemented later.

> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0", netdev->name);

s/netdev->name/netdev_get_name(netdev)/g

For all the other places in the code.

> +    return 0;
> +}
> +
> +void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t curr_prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    }
> +    if (prog_id == curr_prog_id) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    } else if (!curr_prog_id) {
> +        VLOG_INFO("couldn't find a prog id on a given interface");
> +    } else {
> +        VLOG_INFO("program on interface changed, not removing");
> +    }
> +}
> +
> +static inline struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d OVS_UNUSED)

'd' is not UNUSED.

> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    unsigned long addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +        struct dp_packet_afxdp *xpacket = NULL;
> +        struct dp_packet *packet;
> +        void *elems[BATCH_SIZE];
> +        unsigned long addr;
> +
> +       /* all packets are AF_XDP, so handles its own delete in batch */
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            xpacket = dp_packet_cast_afxdp(packet);
> +            if (xpacket->mpool) {
> +                void *base = dp_packet_base(packet);
> +
> +                addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +                elems[i] = (void *)addr;
> +            }
> +        }
> +        umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +        dp_packet_batch_init(batch);
> +}
> +
> +/* Receive packet from AF_XDP socket */
> +int
> +netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                     struct dp_packet_batch *batch)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    uint32_t idx_rx = 0, idx_fq = 0;
> +    unsigned int rcvd, i;
> +    int ret = 0;
> +
> +    /* See if there is any packet on RX queue,
> +     * if yes, idx_rx is the index having the packet.
> +     */
> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;
> +    }
> +
> +    /* Form a dp_packet batch from descriptor in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> +
> +        packet = &xpacket->packet;
> +        xpacket->mpool = &xsk->umem->mpool;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_set_base(packet, pkt);
> +        dp_packet_set_data(packet, pkt);
> +        dp_packet_set_size(packet, len);

There must be some more work done. We need to clear all the data
that left from the previous packets.

You may initialize source and call dp_packet_init_specific() on xpool
initialization, but base, data, size, packet_type, cutlen and offload
flags should be initialized for each packet.

You probably may implement your own dp_packet_use_afxdp() for this.

> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +
> +    /* We've consume rcvd packets in RX, now re-fill the
> +     * same number back to FILL queue.
> +     */
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return -ENOMEM;
> +    }
> +
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        while (OVS_UNLIKELY(ret == 0)) {
> +            /* The FILL queue is full, so retry. (or skip)? */
> +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> +        }
> +
> +        /* Get one free umem, program it into FILL queue */
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> +
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> +    xsk->rx_npkts += rcvd;
> +
> +#ifdef AFXDP_DEBUG
> +    print_xsk_stat(xsk);
> +#endif
> +    return 0;
> +}
> +
> +static void kick_tx(struct xsk_socket_info *xsk)
> +{
> +    int ret;
> +
> +    /* This causes system call into kernel, avoid calling
> +     * this as much as we can.
> +     */
> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> +    if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN || errno == EBUSY) {
> +        return;

This makes no much sense. Did you want to print something on error?

> +    }
> +}
> +
> +int
> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                              struct dp_packet_batch *batch)
> +{
> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t tx_done, idx_cq = 0;
> +    struct dp_packet *packet;
> +    uint32_t idx = 0;
> +    int j, ret;
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        return -EAGAIN;
> +    }
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        return -EAGAIN;
> +    }

We should probably make umem_elem_pop_n() first before the
xsk_ring_prod__reserve(), because we can't undo the xsk_ring_prod__reserve(),
but we can push umem_elems back in case of xsk_ring_prod__reserve() failure.

> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        if (OVS_UNLIKELY(!elem)) {
> +            return -EAGAIN;
> +        }
> +
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * We can avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    kick_tx(xsk);
> +retry:
> +
> +    /* Process CQ */
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> +    if (tx_done > 0) {
> +        xsk->outstanding_tx -= tx_done;
> +        xsk->tx_npkts += tx_done;
> +    }
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        goto out;
> +    }
> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +
> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> +        /* If there are still a lot not transmitted,
> +         * try harder.
> +         */
> +        goto retry;
> +    }
> +out:
> +    return 0;
> +}
> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> new file mode 100644
> index 000000000000..e0a49a89accf
> --- /dev/null
> +++ b/lib/netdev-afxdp.h
> @@ -0,0 +1,51 @@
> +/*
> + * Copyright (c) 2018 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_AFXDP_H
> +#define NETDEV_AFXDP_H 1
> +
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +/* These functions are Linux AF_XDP specific, so they should be used directly
> + * only by Linux-specific code. */
> +#define MAX_XSKQ 16
> +struct netdev;
> +struct xsk_socket_info;
> +struct xdp_umem;
> +struct dp_packet_batch;
> +struct smap;
> +struct dp_packet;
> +
> +void xsk_configure_all(struct netdev *netdev);
> +
> +void xsk_destroy_all(struct netdev *netdev);
> +
> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> +                         struct dp_packet_batch *batch);
> +
> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> +                                  struct dp_packet_batch *batch);
> +
> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                            char **errp);
> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> +
> +void free_afxdp_buf(struct dp_packet *p);
> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> +#endif /* netdev-afxdp.h */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index f75d73fd39f8..a17cf614a00c 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -75,6 +75,7 @@
>  #include "unaligned.h"
>  #include "openvswitch/vlog.h"
>  #include "util.h"
> +#include "netdev-afxdp.h"

Headers should be added in lexicographical order.

>  
>  VLOG_DEFINE_THIS_MODULE(netdev_linux);
>  
> @@ -487,51 +488,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>  
> -struct netdev_linux {
> -    struct netdev up;
> -
> -    /* Protects all members below. */
> -    struct ovs_mutex mutex;
> -
> -    unsigned int cache_valid;
> -
> -    bool miimon;                    /* Link status of last poll. */
> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> -    struct timer miimon_timer;
> -
> -    int netnsid;                    /* Network namespace ID. */
> -    /* The following are figured out "on demand" only.  They are only valid
> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> -    int ifindex;
> -    struct eth_addr etheraddr;
> -    int mtu;
> -    unsigned int ifi_flags;
> -    long long int carrier_resets;
> -    uint32_t kbits_rate;        /* Policing data. */
> -    uint32_t kbits_burst;
> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> -                                   0 or an errno value. */
> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> -    int netdev_policing_error;  /* Cached error code from set policing. */
> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> -
> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> -
> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> -    struct tc *tc;
> -
> -    /* For devices of class netdev_tap_class only. */
> -    int tap_fd;
> -    bool present;               /* If the device is present in the namespace */
> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> -
> -    /* LAG information. */
> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> -};
>  
>  struct netdev_rxq_linux {
>      struct netdev_rxq up;
> @@ -579,13 +535,26 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
>      return netdev_class->run == netdev_linux_run;
>  }
>  
> +#if HAVE_AF_XDP
> +static bool
> +is_afxdp_netdev(const struct netdev *netdev)
> +{
> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
> +}
> +#else
> +static bool
> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> +{
> +    return false;
> +}
> +#endif
>  static bool
>  is_tap_netdev(const struct netdev *netdev)
>  {
>      return netdev_get_class(netdev) == &netdev_tap_class;
>  }
>  
> -static struct netdev_linux *
> +struct netdev_linux *
>  netdev_linux_cast(const struct netdev *netdev)
>  {
>      ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> @@ -1083,7 +1052,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>      if (netdev->miimon_interval > 0) {
>          atomic_count_dec(&miimon_cnt);
>      }
> -
> +#if HAVE_AF_XDP
> +    if (is_afxdp_netdev(netdev_)) {
> +        xsk_destroy_all(netdev_);
> +    }
> +#endif
>      ovs_mutex_destroy(&netdev->mutex);
>  }
>  
> @@ -1113,7 +1086,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>      rx->is_tap = is_tap_netdev(netdev_);
>      if (rx->is_tap) {
>          rx->fd = netdev->tap_fd;
> -    } else {
> +    } else if (!is_afxdp_netdev(netdev_)) {
>          struct sockaddr_ll sll;
>          int ifindex, val;
>          /* Result of tcpdump -dd inbound */
> @@ -1318,10 +1291,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>  {
>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>      struct netdev *netdev = rx->up.netdev;
> -    struct dp_packet *buffer;
> +    struct dp_packet *buffer = NULL;
>      ssize_t retval;
>      int mtu;
>  
> +#if HAVE_AF_XDP
> +    if (is_afxdp_netdev(netdev)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> +        int qid = rxq_->queue_id;
> +
> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> +    }
> +#endif
>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>          mtu = ETH_PAYLOAD_MAX;
>      }
> @@ -1329,6 +1310,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>      /* Assume Ethernet port. No need to set packet_type. */
>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>                                             DP_NETDEV_HEADROOM);
> +
>      retval = (rx->is_tap
>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> @@ -1480,7 +1462,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>      int error = 0;
>      int sock = 0;
>  
> -    if (!is_tap_netdev(netdev_)) {
> +    if (!is_tap_netdev(netdev_) &&
> +        !is_afxdp_netdev(netdev_)) {
>          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>              error = EOPNOTSUPP;
>              goto free_batch;
> @@ -1499,6 +1482,23 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>          }
>  
>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +#if HAVE_AF_XDP
> +    } else if (is_afxdp_netdev(netdev_)) {
> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> +        struct dp_packet *packet;
> +
> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> +
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +            if (packet->source != DPBUF_AFXDP) {

You have to be sure that all packets are from the same umem, not only
the same type.

> +                 /* free one-by-one */
> +                goto free_batch;
> +            }
> +        }
> +        /* free in batch */
> +        free_afxdp_buf_batch(batch);
> +        return 0;
> +#endif
>      } else {
>          error = netdev_linux_tap_batch_send(netdev_, batch);
>      }
> @@ -3323,6 +3323,7 @@ const struct netdev_class netdev_linux_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "system",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_linux_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3333,6 +3334,7 @@ const struct netdev_class netdev_linux_class = {
>  const struct netdev_class netdev_tap_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      .type = "tap",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct_tap,
>      .get_stats = netdev_tap_get_stats,
>      .get_features = netdev_linux_get_features,
> @@ -3343,10 +3345,26 @@ const struct netdev_class netdev_internal_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "internal",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
>      .get_stats = netdev_internal_get_stats,
>      .get_status = netdev_internal_get_status,
>  };
> +
> +#ifdef HAVE_AF_XDP
> +const struct netdev_class netdev_afxdp_class = {
> +    NETDEV_LINUX_CLASS_COMMON,
> +    .type = "afxdp",
> +    .is_pmd = true,
> +    .construct = netdev_linux_construct,
> +    .get_stats = netdev_linux_get_stats,
> +    .get_status = netdev_linux_get_status,
> +    .set_config = netdev_afxdp_set_config,
> +    .get_config = netdev_afxdp_get_config,
> +    .reconfigure = netdev_afxdp_reconfigure,
> +    .get_numa_id = netdev_afxdp_get_numa_id,
> +};
> +#endif
>  
>  
>  #define CODEL_N_QUEUES 0x0000
> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> index 17ca9120168a..570f9134e3d4 100644
> --- a/lib/netdev-linux.h
> +++ b/lib/netdev-linux.h
> @@ -19,6 +19,21 @@
>  
>  #include <stdint.h>
>  #include <stdbool.h>
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "timer.h"
> +#include "ovs-atomic.h"
> +#include "netdev-afxdp.h"
>  
>  /* These functions are Linux specific, so they should be used directly only by
>   * Linux-specific code. */
> @@ -28,6 +43,7 @@ struct netdev;
>  int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
>                                    const char *flag_name, bool enable);
>  int linux_get_ifindex(const char *netdev_name);
> +struct netdev_linux *netdev_linux_cast(const struct netdev *netdev);
>  
>  #define LINUX_FLOW_OFFLOAD_API                          \
>     .flow_flush = netdev_tc_flow_flush,                  \
> @@ -39,4 +55,60 @@ int linux_get_ifindex(const char *netdev_name);
>     .flow_del = netdev_tc_flow_del,                      \
>     .init_flow_api = netdev_tc_init_flow_api
>  
> +struct netdev_linux {
> +    struct netdev up;
> +
> +    /* Protects all members below. */
> +    struct ovs_mutex mutex;
> +
> +    unsigned int cache_valid;
> +
> +    bool miimon;                    /* Link status of last poll. */
> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> +    struct timer miimon_timer;
> +
> +    int netnsid;                    /* Network namespace ID. */
> +    /* The following are figured out "on demand" only.  They are only valid
> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> +    int ifindex;
> +    struct eth_addr etheraddr;
> +    int mtu;
> +    unsigned int ifi_flags;
> +    long long int carrier_resets;
> +    uint32_t kbits_rate;        /* Policing data. */
> +    uint32_t kbits_burst;
> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> +                                   0 or an errno value. */
> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> +                                 * or SIOCSIFMTU.
> +                                 */
> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> +    int netdev_policing_error;  /* Cached error code from set policing. */
> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> +
> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> +
> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> +    struct tc *tc;
> +
> +    /* For devices of class netdev_tap_class only. */
> +    int tap_fd;
> +    bool present;               /* If the device is present in the namespace */
> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> +
> +    /* LAG information. */
> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> +
> +    /* AF_XDP information */
> +#ifdef HAVE_AF_XDP
> +    struct xsk_socket_info *xsk[MAX_XSKQ];
> +    int requested_n_rxq;
> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> +    int xdp_flags, xdp_bind_flags;
> +#endif
> +};
> +

Exposing internal data structures is not a good thing.
You may create lib/netdev-linux-private.h and move the structure with the
cast function there.

>  #endif /* netdev-linux.h */
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e6e8e8..d433818f7064 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class;
>  #endif
>  extern const struct netdev_class netdev_internal_class;
>  extern const struct netdev_class netdev_tap_class;
> -
> +#if HAVE_AF_XDP
> +extern const struct netdev_class netdev_afxdp_class;
> +#endif
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6f0946..e2fae37d5a5e 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -146,6 +146,9 @@ netdev_initialize(void)
>          netdev_register_provider(&netdev_internal_class);
>          netdev_register_provider(&netdev_tap_class);
>          netdev_vport_tunnel_register();
> +#ifdef HAVE_AF_XDP
> +        netdev_register_provider(&netdev_afxdp_class);
> +#endif
>  #endif
>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>          netdev_register_provider(&netdev_tap_class);
> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> new file mode 100644
> index 000000000000..7f20e16364e3
> --- /dev/null
> +++ b/lib/xdpsock.c
> @@ -0,0 +1,236 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#include <config.h>

Space here.

> +#include <ctype.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <syslog.h>
> +#include <time.h>
> +#include <unistd.h>

Space here.

> +#include "openvswitch/vlog.h"
> +#include "async-append.h"
> +#include "coverage.h"
> +#include "dirs.h"
> +#include "ovs-thread.h"
> +#include "sat-math.h"
> +#include "socket-util.h"
> +#include "svec.h"
> +#include "syslog-direct.h"
> +#include "syslog-libc.h"
> +#include "syslog-provider.h"
> +#include "timeval.h"
> +#include "unixctl.h"
> +#include "util.h"
> +#include "ovs-atomic.h"
> +#include "openvswitch/compiler.h"
> +#include "dp-packet.h"

Please, keep them sorted.

> +
> +#include "xdpsock.h"

This should be moved closer to 'config.h'

> +
> +static inline void ovs_spinlock_init(ovs_spinlock_t *sl)

Please, keep the consistency in function definitions.
Function name should start from the new line. Same for other functions.

> +{
> +    sl->locked = 0;

atomic_init(&sl->locked, 0);

> +}
> +
> +static inline void ovs_spin_lock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0, locked = 0;
> +
> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed)) {
> +        locked = 1;
> +        while (locked) {
> +            atomic_read_relaxed(&sl->locked, &locked);
> +        }
> +        exp = 0;
> +    }
> +}
> +
> +static inline void ovs_spin_unlock(ovs_spinlock_t *sl)
> +{
> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> +}
> +
> +static inline int OVS_UNUSED ovs_spin_trylock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0;
> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed);
> +}
> +
> +inline int
> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> +        return -ENOMEM;
> +    }
> +
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(ptr, addrs, n * sizeof(void *));
> +    umemp->index += n;
> +
> +    return 0;
> +}
> +
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_push_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void
> +__umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)

unused?

> +{
> +    umemp->array[umemp->index++] = addr;
> +}
> +
> +void
> +umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)

unused?

> +{
> +
> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> +        /* stack is full */
> +        /* it's possible that one umem gets pushed twice,
> +         * because actions=1,2,3... multiple ports?

In case of multiple output ports, packet will be cloned, so it's
not the case. Most probably, you're pushing buffer from the wrong
umem, i.e. umem of the diferent port/rxq.

> +        */
> +        OVS_NOT_REACHED();
> +    }
> +
> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    __umem_elem_push(umemp, addr);
> +    ovs_spin_unlock(&umemp->mutex);
> +}
> +
> +inline int
> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->index -= n;
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(addrs, ptr, n * sizeof(void *));
> +
> +    return 0;
> +}
> +
> +int
> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +inline void *
> +__umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)

unused?

> +{
> +    return umemp->array[--umemp->index];
> +}
> +
> +void *
> +umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)

unused?

> +{
> +    void *ptr;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ptr = __umem_elem_pop(umemp);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ptr;
> +}
> +
> +void **
> +__umem_pool_alloc(unsigned int size)
> +{
> +    void *bufs;
> +
> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(void *)) == 0);
> +    memset(bufs, 0, size * sizeof(void *));
> +    return (void **)bufs;
> +}
> +
> +unsigned int
> +umem_elem_count(struct umem_pool *mpool)
> +{
> +    return mpool->index;
> +}
> +
> +int
> +umem_pool_init(struct umem_pool *umemp OVS_UNUSED, unsigned int size)
> +{
> +    umemp->array = __umem_pool_alloc(size);
> +    if (!umemp->array) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    umemp->size = size;
> +    umemp->index = 0;
> +    ovs_spinlock_init(&umemp->mutex);
> +    return 0;
> +}
> +
> +void
> +umem_pool_cleanup(struct umem_pool *umemp OVS_UNUSED)
> +{
> +    free(umemp->array);
> +}
> +
> +/* AF_XDP metadata init/destroy */
> +int
> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> +{
> +    void *bufs;
> +
> +    /* TODO: check HAVE_POSIX_MEMALIGN  */
> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> +                              size * sizeof(struct dp_packet_afxdp)) == 0);
> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> +
> +    xp->array = bufs;
> +    xp->size = size;
> +    return 0;
> +}
> +
> +void
> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> +{
> +    free(xp->array);
> +}
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> new file mode 100644
> index 000000000000..52d7faaacf75
> --- /dev/null
> +++ b/lib/xdpsock.h
> @@ -0,0 +1,127 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#ifndef XDPSOCK_H
> +#define XDPSOCK_H 1
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <net/if.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +#include <locale.h>
> +#include <sys/types.h>
> +#include <poll.h>
> +#include <bpf/libbpf.h>
> +
> +#include "ovs-atomic.h"
> +#include "openvswitch/thread.h"
> +#include <bpf/xsk.h>

Same for the headers.

> +
> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> +#define BATCH_SIZE      NETDEV_MAX_BURST
> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> +
> +#define NUM_FRAMES      4096
> +#define PROD_NUM_DESCS  512
> +#define CONS_NUM_DESCS  512
> +
> +#ifdef USE_XSK_DEFAULT
> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> +#endif
> +
> +typedef struct {
> +    atomic_int locked;
> +} ovs_spinlock_t;
> +
> +/* LIFO ptr_array */
> +struct umem_pool {
> +    int index;      /* point to top */
> +    unsigned int size;
> +    ovs_spinlock_t mutex;
> +    void **array;   /* a pointer array */
> +};
> +
> +/* array-based dp_packet_afxdp */
> +struct xpacket_pool {
> +    unsigned int size;
> +    struct dp_packet_afxdp **array;
> +};
> +
> +struct xsk_umem_info {
> +    struct umem_pool mpool;
> +    struct xpacket_pool xpool;
> +    struct xsk_ring_prod fq;
> +    struct xsk_ring_cons cq;
> +    struct xsk_umem *umem;
> +    void *buffer;
> +};
> +
> +struct xsk_socket_info {
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_umem_info *umem;
> +    struct xsk_socket *xsk;
> +    unsigned long rx_npkts;
> +    unsigned long tx_npkts;
> +    unsigned long prev_rx_npkts;
> +    unsigned long prev_tx_npkts;
> +    uint32_t outstanding_tx;
> +};
> +
> +struct umem_elem_head {
> +    unsigned int index;
> +    struct ovs_mutex mutex;
> +    uint32_t n;
> +};

This structure is not used.

> +
> +struct umem_elem {
> +    struct umem_elem *next;
> +};
> +
> +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void *__umem_elem_pop(struct umem_pool *umemp);
> +void *umem_elem_pop(struct umem_pool *umemp);
> +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void **__umem_pool_alloc(unsigned int size);
> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> +void umem_pool_cleanup(struct umem_pool *umemp);
> +unsigned int umem_elem_count(struct umem_pool *mpool);
> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> +
> +#endif
> diff --git a/tests/automake.mk b/tests/automake.mk
> index ea16532dd2a0..715cef9a6b3b 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>  	$(SYSTEM_TESTSUITE_AT) \
>  	$(SYSTEM_KMOD_TESTSUITE_AT) \
>  	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
> +	$(SYSTEM_AFXDP_TESTSUITE_AT) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>  	$(SYSTEM_DPDK_TESTSUITE_AT) \
>  	$(OVSDB_CLUSTER_TESTSUITE_AT) \
>  	$(TESTSUITE) \
>  	$(SYSTEM_KMOD_TESTSUITE) \
>  	$(SYSTEM_USERSPACE_TESTSUITE) \
> +	$(SYSTEM_AFXDP_TESTSUITE) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE) \
>  	$(SYSTEM_DPDK_TESTSUITE) \
>  	$(OVSDB_CLUSTER_TESTSUITE) \
> @@ -158,6 +160,11 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>  	tests/system-userspace-macros.at \
>  	tests/system-userspace-packet-type-aware.at
>  
> +SYSTEM_AFXDP_TESTSUITE_AT = \
> +	tests/system-afxdp-testsuite.at \
> +	tests/system-afxdp-traffic.at \
> +	tests/system-afxdp-macros.at
> +
>  SYSTEM_TESTSUITE_AT = \
>  	tests/system-common-macros.at \
>  	tests/system-ovn.at \
> @@ -182,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>  SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> @@ -315,6 +323,11 @@ check-system-userspace: all
>  	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>  
> +check-afxdp: all
> +	$(MAKE) install
> +	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> +	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> +
>  check-offloads: all
>  	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> @@ -352,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
>  
> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> +	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> +	$(AM_V_at)mv $@.tmp $@
> +
>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> new file mode 100644
> index 000000000000..2c58c2d6554b
> --- /dev/null
> +++ b/tests/system-afxdp-macros.at
> @@ -0,0 +1,153 @@
> +# _ADD_BR([name])
> +#
> +# Expands into the proper ovs-vsctl commands to create a bridge with the
> +# appropriate type and properties
> +m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
> +
> +# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
> +#
> +# Creates a database and starts ovsdb-server, starts ovs-vswitchd
> +# connected to that database, calls ovs-vsctl to create a bridge named
> +# br0 with predictable settings, passing 'vsctl-args' as additional
> +# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
> +# output (e.g. because it includes "create" commands) then 'vsctl-output'
> +# specifies the expected output after filtering through uuidfilt.
> +m4_define([OVS_TRAFFIC_VSWITCHD_START],
> +  [
> +   export OVS_PKGDATADIR=$(`pwd`)
> +   _OVS_VSWITCHD_START([--disable-system])
> +   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
> +])
> +
> +# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
> +#
> +# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
> +# for messages with severity WARN or higher and signaling an error if any
> +# is present.  The optional WHITELIST may contain shell-quoted "sed"
> +# commands to delete any warnings that are actually expected, e.g.:
> +#
> +#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
> +#
> +# 'extra_cmds' are shell commands to be executed afte OVS_VSWITCHD_STOP() is
> +# invoked. They can be used to perform additional cleanups such as name space
> +# removal.
> +m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
> +  [OVS_VSWITCHD_STOP([dnl
> +$1";/netdev_linux.*obtaining netdev stats via vport failed/d
> +/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d
> +/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
> +/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
> +"])
> +   AT_CHECK([:; $2])
> +  ])
> +
> +m4_define([ADD_VETH_AFXDP],
> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> +      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
> +      AT_CHECK([ip link set $1 netns $2])
> +      AT_CHECK([ip link set dev ovs-$1 up])
> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> +      if test -n "$5"; then
> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> +      fi
> +      if test -n "$6"; then
> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> +      fi
> +      on_exit 'ip link del ovs-$1'
> +    ]
> +)
> +
> +# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
> +m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
> +     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
> +    ]
> +)
> +
> +# CONFIGURE_VETH_OFFLOADS([VETH])
> +#
> +# Disable TX offloads for veths.  The userspace datapath uses the AF_PACKET
> +# socket to receive packets for veths.  Unfortunately, the AF_PACKET socket
> +# doesn't play well with offloads:
> +# 1. GSO packets are received without segmentation and therefore discarded.
> +# 2. Packets with offloaded partial checksum are received with the wrong
> +#    checksum, therefore discarded by the receiver.
> +#
> +# By disabling tx offloads in the non-OVS side of the veth peer we make sure
> +# that the AF_PACKET socket will not receive bad packets.
> +#
> +# This is a workaround, and should be removed when offloads are properly
> +# supported in netdev-linux.
> +m4_define([CONFIGURE_VETH_OFFLOADS],
> +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
> +)
> +
> +# CHECK_CONNTRACK()
> +#
> +# Perform requirements checks for running conntrack tests.
> +#
> +m4_define([CHECK_CONNTRACK],
> +    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
> +)
> +
> +# CHECK_CONNTRACK_ALG()
> +#
> +# Perform requirements checks for running conntrack ALG tests. The userspace
> +# supports FTP and TFTP.
> +#
> +m4_define([CHECK_CONNTRACK_ALG])
> +
> +# CHECK_CONNTRACK_FRAG()
> +#
> +# Perform requirements checks for running conntrack fragmentations tests.
> +# The userspace doesn't support fragmentation yet, so skip the tests.
> +m4_define([CHECK_CONNTRACK_FRAG],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_LOCAL_STACK()
> +#
> +# Perform requirements checks for running conntrack tests with local stack.
> +# While the kernel connection tracker automatically passes all the connection
> +# tracking state from an internal port to the OpenvSwitch kernel module, there
> +# is simply no way of doing that with the userspace, so skip the tests.
> +m4_define([CHECK_CONNTRACK_LOCAL_STACK],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CONNTRACK_NAT()
> +#
> +# Perform requirements checks for running conntrack NAT tests. The userspace
> +# datapath supports NAT.
> +#
> +m4_define([CHECK_CONNTRACK_NAT])
> +
> +# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
> +#
> +# Perform requirements checks for running ovs-dpctl flush-conntrack by
> +# conntrack 5-tuple test. The userspace datapath does not support
> +# this feature yet.
> +m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
> +[
> +    AT_SKIP_IF([:])
> +])
> +
> +# CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-set-maxconns or
> +# ovs-dpctl ct-get-maxconns. The userspace datapath does support this feature.
> +m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
> +
> +# CHECK_CT_DPIF_GET_NCONNS()
> +#
> +# Perform requirements checks for running ovs-dpctl ct-get-nconns. The
> +# userspace datapath does support this feature.
> +m4_define([CHECK_CT_DPIF_GET_NCONNS])
> diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
> new file mode 100644
> index 000000000000..538c0d15d556
> --- /dev/null
> +++ b/tests/system-afxdp-testsuite.at
> @@ -0,0 +1,26 @@
> +AT_INIT
> +
> +AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
> +
> +Licensed under the Apache License, Version 2.0 (the "License");
> +you may not use this file except in compliance with the License.
> +You may obtain a copy of the License at:
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> +
> +Unless required by applicable law or agreed to in writing, software
> +distributed under the License is distributed on an "AS IS" BASIS,
> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +See the License for the specific language governing permissions and
> +limitations under the License.])
> +
> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> +
> +m4_include([tests/ovs-macros.at])
> +m4_include([tests/ovsdb-macros.at])
> +m4_include([tests/ofproto-macros.at])
> +m4_include([tests/system-afxdp-macros.at])
> +m4_include([tests/system-common-macros.at])
> +
> +m4_include([tests/system-afxdp-traffic.at])
> +m4_include([tests/system-ovn.at])
> diff --git a/tests/system-afxdp-traffic.at b/tests/system-afxdp-traffic.at
> new file mode 100644
> index 000000000000..26f72acf48ef
> --- /dev/null
> +++ b/tests/system-afxdp-traffic.at
> @@ -0,0 +1,978 @@
> +AT_BANNER([AF_XDP netdev datapath-sanity])
> +
> +AT_SETUP([datapath - ping between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ulimit -l unlimited
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
> +ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
> +
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping6 between two ports on vlan])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
> +ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
> +
> +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> +dnl waiting, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
> +
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan tunnel])
> +OVS_CHECK_VXLAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> +                  [id 0 dstport 4789])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over vxlan6 tunnel])
> +OVS_CHECK_VXLAN_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
> +                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over gre tunnel])
> +OVS_CHECK_GRE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=1 options:erspan_idx=7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=2 options:erspan_dir=1 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> +                [options:key=123 options:erspan_ver=1 options:erspan_idx=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> +                   [10.1.1.1/24], [local fc00:100::1 seq key 123 erspan_ver 1 erspan 7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
> +OVS_CHECK_GRE()
> +OVS_CHECK_ERSPAN()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> +                [options:key=121 options:erspan_ver=2 options:erspan_dir=0 options:erspan_hwid=0x7])
> +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> +                   [10.1.1.1/24],
> +                   [local fc00:100::1 seq key 121 erspan_ver 2 erspan_dir ingress erspan_hwid 0x7])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve tunnel])
> +OVS_CHECK_GENEVE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> +                  [vni 0])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], [OK
> +])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - ping over geneve6 tunnel])
> +OVS_CHECK_GENEVE_UDP6ZEROCSUM()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
> +                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
> +
> +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> +])
> +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> +])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay with different packet sizes
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - clone action])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
> +                    -- set interface ovs-p1 ofport_request=2])
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1 actions=NORMAL
> +priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
> +priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([datapath - basic truncate action])
> +AT_SKIP_IF([test $HAVE_NC = no])
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-ofctl del-flows br0])
> +
> +dnl Create p0 and ovs-p0(1)
> +ADD_NAMESPACES(at_ns0)
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
> +NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
> +
> +dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
> +AT_CHECK([ip link add p1 type veth peer name ovs-p1])
> +on_exit 'ip link del ovs-p1'
> +AT_CHECK([ip link set dev ovs-p1 up])
> +AT_CHECK([ip link set dev p1 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
> +dnl Use p1 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
> +
> +dnl Create p2(5) and ovs-p2(4)
> +AT_CHECK([ip link add p2 type veth peer name ovs-p2])
> +on_exit 'ip link del ovs-p2'
> +AT_CHECK([ip link set dev ovs-p2 up])
> +AT_CHECK([ip link set dev p2 up])
> +AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
> +dnl Use p2 to check the truncated packet
> +AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
> +
> +dnl basic test
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl use this file as payload file for ncat
> +AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
> +on_exit 'rm -f payload200.bin'
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl packet with truncated size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" |  sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=100
> +])
> +dnl packet with original size
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=242
> +])
> +
> +dnl more complicated output actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_DATA([flows.txt], [dnl
> +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
> +])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +dnl SLOW_ACTION: disable kernel datapath truncate support
> +dnl Repeat the test above, but exercise the SLOW_ACTION code path
> +AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
> +
> +dnl SLOW_ACTION test1: check datapatch actions
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
> +AT_CHECK([tail -3 stdout], [0],
> +[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
> +This flow is handled by the userspace slow path because it:
> +  - Uses action(s) not supported by datapath.
> +])
> +
> +dnl SLOW_ACTION test2: check actual packet truncate
> +AT_CHECK([ovs-ofctl del-flows br0])
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> +
> +dnl 100 + 100 + 242 + min(65535,242) = 684
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=684
> +])
> +
> +dnl 242 + 100 + min(242,200) = 542
> +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> +n_bytes=542
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +
> +AT_BANNER([conntrack])
> +
> +AT_SETUP([conntrack - controller])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - force commit])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(force,commit),controller
> +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
> +table=1,in_port=2,ct_state=+trk,udp,action=controller
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> +NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> +])
> +
> +dnl
> +dnl Check that the directionality has been changed by force commit.
> +dnl
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
> +])
> +
> +dnl OK, now send another packet from port 1 and see that it switches again
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +AT_CHECK([ovs-appctl revalidator/purge], [0])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - ct flush by 5-tuple])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,udp,action=ct(commit),2
> +priority=100,in_port=2,udp,action=ct(zone=5,commit),1
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Test UDP from port 1
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [1], [dnl
> +])
> +
> +dnl Test UDP from port 2
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [dnl
> +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +])
> +
> +dnl Test ICMP traffic
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [stdout])
> +AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
> +icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
> +])
> +
> +ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
> +ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [1], [dnl
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv4 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - get_nconns and get/set_maxconns])
> +CHECK_CONNTRACK()
> +CHECK_CT_DPIF_SET_GET_MAXCONNS()
> +CHECK_CT_DPIF_GET_NCONNS()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=100,in_port=1,icmp,action=ct(commit),2
> +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
> +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
> +ovs-vswitchd: datapath not found (Invalid argument)
> +ovs-appctl: ovs-vswitchd: server returned an error
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +1
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +3000000
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
> +setting maxconns successful
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> +0
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> +10
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv6 ping])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> +
> +AT_DATA([flows.txt], [dnl
> +
> +dnl ICMPv6 echo request and reply go to table 1.  The rest of the traffic goes
> +dnl through normal action.
> +table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
> +table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
> +table=0,priority=1,action=normal
> +
> +dnl Allow everything from ns0->ns1. Only allow return traffic from ns1->ns0.
> +table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
> +table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
> +table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
> +table=1,priority=1,action=drop
> +])
> +
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> +
> +dnl The above ping creates state in the connection tracker.  We're not
> +dnl interested in that state.
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> +
> +dnl Pings from ns1->ns0 should fail.
> +NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | FORMAT_PING], [0], [dnl
> +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> +])
> +
> +dnl Pings from ns0->ns1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], [dnl
> +icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
>
Ilya Maximets May 6, 2019, 2:32 p.m. UTC | #2
On 03.05.2019 22:02, William Tu wrote:
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +    int ret;
> +
> +    /* umem memory region */
> +    ret = posix_memalign(&bufs, getpagesize(),
> +                         NUM_FRAMES * FRAME_SIZE);

Please, use 'get_page_size()' from lib/util.h instead of 'getpagesize()'
here and in other places. 'getpagesize()' is not portable.

Best regards, Ilya Maximets.
William Tu May 8, 2019, 4:59 p.m. UTC | #3
Hi Ilya,

Thanks for your review.
I will fix them in my next version.

On Mon, May 6, 2019 at 5:37 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Hi. Thanks for a new version.
>
> Quick review inline.
>
> Best regards, Ilya Maximets.
>
> On 03.05.2019 22:02, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> > built upon the eBPF and XDP technology.  It is aims to have comparable
> > performance to DPDK but cooperate better with existing kernel's networking
> > stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > attached to the netdev, by-passing a couple of Linux kernel's subsystems
> > As a result, AF_XDP socket shows much better performance than AF_PACKET
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst
> >
> > Signed-off-by: William Tu <u9012063@gmail.com>
> >
> > ---
> > v1->v2:
> > - add a list to maintain unused umem elements
> > - remove copy from rx umem to ovs internal buffer
> > - use hugetlb to reduce misses (not much difference)
> > - use pmd mode netdev in OVS (huge performance improve)
> > - remove malloc dp_packet, instead put dp_packet in umem
> >
> > v2->v3:
> > - rebase on the OVS master, 7ab4b0653784
> >   ("configure: Check for more specific function to pull in pthread library.")
> > - remove the dependency on libbpf and dpif-bpf.
> >   instead, use the built-in XDP_ATTACH feature.
> > - data structure optimizations for better performance, see[1]
> > - more test cases support
> > v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> >
> > v3->v4:
> > - Use AF_XDP API provided by libbpf
> > - Remove the dependency on XDP_ATTACH kernel patch set
> > - Add documentation, bpf.rst
> >
> > v4->v5:
> > - rebase to master
> > - remove rfc, squash all into a single patch
> > - add --enable-afxdp, so by default, AF_XDP is not compiled
> > - add options: xdpmode=drv,skb
> > - add multiple queue and multiple PMD support, with options: n_rxq
> > - improve documentation, rename bpf.rst to af_xdp.rst
> >
> > v5->v6
> > - rebase to master, commit 0cdd5b13de91b98
> > - address errors from sparse and clang
> > - pass travis-ci test
> > - address feedback from Ben
> > - fix issues reported by 0-day robot
> > - improved documentation
> >
> > v6-v7
> > - rebase to master, commit abf11558c1515bf3b1
> > - address feedbacks from Ilya, Ben, and Eelco, see:
> >   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> > - add XDP mode change, implement get/set_config, reconfigure
> > - Fix reconfiguration/crash issue caused by libbpf, see patch:
> >   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> > - perf optimization for batching umem_push/pop
> > - perf optimization for batching kick_tx
> > - test build with dpdk
> > - fix/refactor atomic operation
> > - make AF_XDP x86 specific, otherwise fail at build time
> > - lots of code refactoring
> > - add PVP setup in documentation
> > ---
> > ---
> >  Documentation/automake.mk             |   1 +
> >  Documentation/index.rst               |   1 +
> >  Documentation/intro/install/afxdp.rst | 469 ++++++++++++++++
> >  Documentation/intro/install/index.rst |   1 +
> >  acinclude.m4                          |  32 ++
> >  configure.ac                          |   1 +
> >  lib/automake.mk                       |  12 +
> >  lib/dp-packet.c                       |  12 +
> >  lib/dp-packet.h                       |  18 +
> >  lib/dpif-netdev-perf.h                |  14 +
> >  lib/netdev-afxdp.c                    | 698 ++++++++++++++++++++++++
> >  lib/netdev-afxdp.h                    |  51 ++
> >  lib/netdev-linux.c                    | 118 ++--
> >  lib/netdev-linux.h                    |  72 +++
> >  lib/netdev-provider.h                 |   4 +-
> >  lib/netdev.c                          |   3 +
> >  lib/xdpsock.c                         | 236 ++++++++
> >  lib/xdpsock.h                         | 127 +++++
> >  tests/automake.mk                     |  17 +
> >  tests/system-afxdp-macros.at          | 153 ++++++
> >  tests/system-afxdp-testsuite.at       |  26 +
> >  tests/system-afxdp-traffic.at         | 978 ++++++++++++++++++++++++++++++++++
> >  22 files changed, 2993 insertions(+), 51 deletions(-)
> >  create mode 100644 Documentation/intro/install/afxdp.rst
> >  create mode 100644 lib/netdev-afxdp.c
> >  create mode 100644 lib/netdev-afxdp.h
> >  create mode 100644 lib/xdpsock.c
> >  create mode 100644 lib/xdpsock.h
> >  create mode 100644 tests/system-afxdp-macros.at
> >  create mode 100644 tests/system-afxdp-testsuite.at
> >  create mode 100644 tests/system-afxdp-traffic.at
> >
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 082438e09a33..11cc59efc881 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -10,6 +10,7 @@ DOC_SOURCE = \
> >       Documentation/intro/why-ovs.rst \
> >       Documentation/intro/install/index.rst \
> >       Documentation/intro/install/bash-completion.rst \
> > +     Documentation/intro/install/afxdp.rst \
> >       Documentation/intro/install/debian.rst \
> >       Documentation/intro/install/documentation.rst \
> >       Documentation/intro/install/distributions.rst \
> > diff --git a/Documentation/index.rst b/Documentation/index.rst
> > index 46261235c732..aa9e7c49f179 100644
> > --- a/Documentation/index.rst
> > +++ b/Documentation/index.rst
> > @@ -59,6 +59,7 @@ vSwitch? Start here.
> >    :doc:`intro/install/windows` |
> >    :doc:`intro/install/xenserver` |
> >    :doc:`intro/install/dpdk` |
> > +  :doc:`intro/install/afxdp` |
> >    :doc:`Installation FAQs <faq/releases>`
> >
> >  - **Tutorials:** :doc:`tutorials/faucet` |
> > diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> > new file mode 100644
> > index 000000000000..d68a4ac7ff8b
> > --- /dev/null
> > +++ b/Documentation/intro/install/afxdp.rst
> > @@ -0,0 +1,469 @@
> > +..
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +
> > +========================
> > +Open vSwitch with AF_XDP
> > +========================
> > +
> > +This document describes how to build and install Open vSwitch using
> > +AF_XDP netdev.
> > +
> > +.. warning::
> > +  The AF_XDP support of Open vSwitch is considered 'experimental',
> > +  and it is not compiled in by default.
> > +
> > +Introduction
> > +------------
> > +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> > +built upon the eBPF and XDP technology.  It is aims to have comparable
> > +performance to DPDK but cooperate better with existing kernel's networking
> > +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > +attached to the netdev, by-passing a couple of Linux kernel's subsystems.
> > +As a result, AF_XDP socket shows much better performance than AF_PACKET.
> > +For more details about AF_XDP, please see linux kernel's
> > +Documentation/networking/af_xdp.rst
> > +
> > +
> > +AF_XDP Netdev
> > +-------------
> > +OVS has a couple of netdev types, i.e., system, tap, or
> > +internal.  The AF_XDP feature adds a new netdev types called
> > +"afxdp", and implement its configuration, packet reception,
> > +and transmit functions.  Since the AF_XDP socket, xsk,
> > +operates in userspace, once ovs-vswitchd receives packets
> > +from xsk, the proposed architecture re-uses the existing
> > +userspace dpif-netdev datapath.  As a result, most of
> > +the packet processing happens at the userspace instead of
> > +linux kernel.
> > +
> > +::
> > +
> > +              |   +-------------------+
> > +              |   |    ovs-vswitchd   |<-->ovsdb-server
> > +              |   +-------------------+
> > +              |   |      ofproto      |<-->OpenFlow controllers
> > +              |   +--------+-+--------+
> > +              |   | netdev | |ofproto-|
> > +    userspace |   +--------+ |  dpif  |
> > +              |   | afxdp  | +--------+
> > +              |   | netdev | |  dpif  |
> > +              |   +---||---+ +--------+
> > +              |       ||     |  dpif- |
> > +              |       ||     | netdev |
> > +              |_      ||     +--------+
> > +                      ||
> > +               _  +---||-----+--------+
> > +              |   | AF_XDP prog +     |
> > +       kernel |   |   xsk_map         |
> > +              |_  +--------||---------+
> > +                           ||
> > +                        physical
> > +                           NIC
> > +
> > +
> > +Build requirements
> > +------------------
> > +
> > +In addition to the requirements described in :doc:`general`, building Open
> > +vSwitch with AF_XDP will require the following:
> > +
> > +- libbpf from kernel source tree (kernel 5.0.0 or later)
> > +
> > +- Linux kernel XDP support, with the following options (required)
>
> Empty line will be good here.
>
> > +  ``_CONFIG_BPF=y``
> > +
> > +  ``_CONFIG_BPF_SYSCALL=y``
> > +
> > +  ``_CONFIG_XDP_SOCKETS=y``
>
> It also be better to make a dot list instead. Like:
>
> * item1
> * item2
>
> And why these underscores here?

I think it's a mistake, will fix it later.
>
> > +
> > +
> > +- The following optional Kconfig options are also recommended, but not
> > +  required:
> > +
> > +  ``_CONFIG_BPF_JIT=y`` (Performance)
> > +
> > +  ``_CONFIG_HAVE_BPF_JIT=y`` (Performance)
> > +
> > +  ``_CONFIG_XDP_SOCKETS_DIAG=y`` (Debugging)
> > +
Also fix the above.

> > +- If possible, run **./xdpsock -r -N -z -i <your device>** under
> > +  linux/samples/bpf.  This is the OVS indepedent benchmark tools for AF_XDP.
> > +  It makes sure your basic kernel requirements are met for AF_XDP.
> > +
> > +
> > +Installing
> > +----------
> > +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> > +Frist, clone a recent version of Linux bpf-next tree::
> > +
> > +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> > +
> > +Second, go into the Linux source directory and build libbpf in the tools
> > +directory::
> > +
> > +  cd bpf-next/
> > +  cd tools/lib/bpf/
> > +  make && make install
> > +  make install_headers
> > +
> > +.. note::
> > +   Make sure xsk.h and bpf.h are installed in system's library path,
> > +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> > +
> > +Make sure the libbpf.so is installed correctly::
> > +
> > +  ldconfig
> > +  ldconfig -p | grep libbpf
> > +
> > +
> > +Third, ensure the standard OVS requirements are installed and
> > +bootstrap/configure the package::
> > +
> > +  ./boot.sh && ./configure --enable-afxdp
> > +
> > +Finally, build and install OVS::
> > +
> > +  make && make install
> > +
> > +To kick start end-to-end autotesting::
> > +
> > +  uname -a # make sure having 5.0+ kernel
> > +  make check-afxdp
> > +
> > +if a test case fails, check the log at::
> > +
> > +  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
> > +
> > +
> > +Setup AF_XDP netdev
> > +-------------------
> > +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> > +set-up right::
> > +
> > +  ldd vswitchd/ovs-vswitchd
> > +
> > +Open vSwitch should be started using userspace datapath as described
> > +in :doc:`general`::
> > +
> > +  ovs-vswitchd --disable-system
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> > +
> > +.. note::
> > +   OVS AF_XDP netdev is using the userspace datapath, the same datapath
> > +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
> > +   and datapath_type=netdev when adding a new bridge.
> > +
> > +Make sure your device driver support AF_XDP, and to use 1 PMD (on core 4)
> > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Or, use 4 pmds/cores and 4 queues by doing::
> > +
> > +  ethtool -L enp2s0 combined 4
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > +    options:n_rxq=4 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > +
> > +To validate that the bridge has successfully instantiated, you can use the::
> > +
> > +  ovs-vsctl show
> > +
> > +should show something like::
> > +
> > +  Port "ens802f0"
> > +   Interface "ens802f0"
> > +      type: afxdp
> > +      options: {n_rxq="1", xdpmode=drv}
> > +
> > +Otherwise, enable debug by::
> > +
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +
> > +References
> > +----------
> > +Most of the design details are described in the paper presented at
> > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> > +section 4, and slides[2][4].
> > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> > +about AF_XDP current and future work.
> > +
> > +
> > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > +
> > +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > +
> > +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > +
> > +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > +
> > +
> > +Performance Tuning
> > +------------------
> > +The name of the game is to keep your CPU running in userspace, allowing PMD
> > +to keep polling the AF_XDP queues without any interferences from kernel.
> > +
> > +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> > +   running cores, device plug-in slot)
> > +
> > +#. Isolate your CPU by doing isolcpu at grub configure.
> > +
> > +#. IRQ should not set to pmd running core.
> > +
> > +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> > +
> > +Debugging performance issue
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +While running the traffic, use linux perf tool to see where your cpu
> > +spends its cycle::
> > +
> > +  cd bpf-next/tools/perf
> > +  make
> > +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> > +  ./perf report
> > +
> > +Measure your system call rate by doing::
> > +
> > +  pstree -p `pidof ovs-vswitchd`
> > +  strace -c -p <your pmd's PID>
> > +
> > +Or, use OVS pmd tool::
> > +
> > +  ovs-appctl dpif-netdev/pmd-stats-show
> > +
> > +
> > +Example Script
> > +--------------
> > +
> > +Below is a script using namespaces and veth peer::
> > +
> > +  #!/bin/bash
> > +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> > +    --disable-system --detach \
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> > +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> > +    fail-mode=secure datapath_type=netdev
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> > +
> > +  ip netns add at_ns0
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +  ip link add p0 type veth peer name afxdp-p0
> > +  ip link set p0 netns at_ns0
> > +  ip link set dev afxdp-p0 up
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> > +
> > +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> > +  ip addr add "10.1.1.1/24" dev p0
> > +  ip link set dev p0 up
> > +  NS_EXEC_HEREDOC
> > +
> > +  ip netns add at_ns1
> > +  ip link add p1 type veth peer name afxdp-p1
> > +  ip link set p1 netns at_ns1
> > +  ip link set dev afxdp-p1 up
> > +
> > +  ovs-vsctl add-port br0 afxdp-p1 -- \
> > +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> > +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> > +  ip addr add "10.1.1.2/24" dev p1
> > +  ip link set dev p1 up
> > +  NS_EXEC_HEREDOC
> > +
> > +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> > +
> > +
> > +Limitations/Known Issues
> > +------------------------
> > +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
> > +#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A possible
> > +   work-around is to use OpenFlow meter action.
> > +#. AF_XDP device added to bridge, remove, and added again will fail.
> > +#. Most of the tests are done using i40e single port. Multiple ports and
> > +   also ixgbe driver also needs to be tested.
> > +#. No latency test result (TODO items)
> > +
> > +
> > +make check-afxdp
> > +----------------
> > +When executing 'make check-afxdp', OVS creates namespaces, sets up AF_XDP on
> > +veth devices and kicks start the testing.  So far we have the following test
> > +cases::
> > +
> > + AF_XDP netdev datapath-sanity
> > +
> > +  1: datapath - ping between two ports               ok
> > +  2: datapath - ping between two ports on vlan       ok
> > +  3: datapath - ping6 between two ports              ok
> > +  4: datapath - ping6 between two ports on vlan      ok
> > +  5: datapath - ping over vxlan tunnel               ok
> > +  6: datapath - ping over vxlan6 tunnel              ok
> > +  7: datapath - ping over gre tunnel                 ok
> > +  8: datapath - ping over erspan v1 tunnel           ok
> > +  9: datapath - ping over erspan v2 tunnel           ok
> > + 10: datapath - ping over ip6erspan v1 tunnel        ok
> > + 11: datapath - ping over ip6erspan v2 tunnel        ok
> > + 12: datapath - ping over geneve tunnel              ok
> > + 13: datapath - ping over geneve6 tunnel             ok
> > + 14: datapath - clone action                         ok
> > + 15: datapath - basic truncate action                ok
> > +
> > + conntrack
> > +
> > + 16: conntrack - controller                          ok
> > + 17: conntrack - force commit                        ok
> > + 18: conntrack - ct flush by 5-tuple                 ok
> > + 19: conntrack - IPv4 ping                           ok
> > + 20: conntrack - get_nconns and get/set_maxconns     ok
> > + 21: conntrack - IPv6 ping                           ok
> > +
> > + system-ovn
> > +
> > + 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
> > + 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
> > + 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
> > + 25: ovn -- load-balancing                           ok
> > + 26: ovn -- load-balancing - same subnet.            ok
> > + 27: ovn -- load balancing in gateway router         ok
> > + 28: ovn -- multiple gateway routers, load-balancing ok
> > + 29: ovn -- load balancing in router with gateway router port ok
> > + 30: ovn -- DNAT and SNAT on distributed router - N/S ok
> > + 31: ovn -- DNAT and SNAT on distributed router - E/W ok
> > +
> > +PVP using tap device
> > +--------------------
> > +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> > +First, start OVS, then add physical port::
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Start a VM with virtio and tap device::
> > +
> > +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > +    -m 4096 \
> > +    -cpu host,+x2apic -enable-kvm \
> > +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> > +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> > +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> > +    -object memory-backend-file,id=mem,size=4096M,\
> > +      mem-path=/dev/hugepages,share=on \
> > +    -numa node,memdev=mem -mem-prealloc -smp 2
> > +
> > +Create OpenFlow rules::
> > +
> > +  ovs-vsctl add-port br0 tap0
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > +
> > +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> > +
> > +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > +
> > +The performance number I got is around 700Kpps.
> > +This is due to using the kernel's tap interface, which requires copying
> > +packet into kernel from the umem buffer in userspace.
> > +
> > +PVP using vhostuser device
> > +--------------------------
> > +First, build OVS with DPDK and AFXDP::
> > +
> > +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> > +  make -j4 && make install
> > +
> > +Create a vhost-user port from OVS::
> > +
> > +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> > +    other_config:pmd-cpu-mask=0xfff
> > +  ovs-vsctl add-port br0 vhost-user-1 \
> > +    -- set Interface vhost-user-1 type=dpdkvhostuser
> > +
> > +Start VM using vhost-user mode::
> > +
> > +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > +   -m 4096 \
> > +   -cpu host,+x2apic -enable-kvm \
> > +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> > +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> > +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> > +      netdev=mynet1,mq=on,vectors=10 \
> > +   -object memory-backend-file,id=mem,size=4096M,\
> > +      mem-path=/dev/hugepages,share=on \
> > +   -numa node,memdev=mem -mem-prealloc -smp 2
> > +
> > +Setup the OpenFlow ruls::
> > +
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> > +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> > +
> > +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> > +
> > +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> > +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > +
> > +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> > +
> > +PCP container using veth
> > +------------------------
> > +Create namespace and veth peer devices::
> > +
> > +  ip netns add at_ns0
> > +  ip link add p0 type veth peer name afxdp-p0
> > +  ip link set p0 netns at_ns0
> > +  ip link set dev afxdp-p0 up
> > +  ip netns exec at_ns0 ip link set dev p0 up
> > +
> > +Attach the veth port to br0::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 options:n_rxq=1 type="afxdp" options:xdpmode=skb
> > +
> > +Setup the OpenFlow rules::
> > +
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> > +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> > +
> > +In the namespace, run drop or bounce back the packet::
> > +
> > +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> > +
> > +Bug Reporting
> > +-------------
> > +
> > +Please report problems to dev@openvswitch.org.
> > diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> > index 3193c736cf17..c27a9c9d16ff 100644
> > --- a/Documentation/intro/install/index.rst
> > +++ b/Documentation/intro/install/index.rst
> > @@ -45,6 +45,7 @@ Installation from Source
> >     xenserver
> >     userspace
> >     dpdk
> > +   afxdp
> >
> >  Installation from Packages
> >  --------------------------
> > diff --git a/acinclude.m4 b/acinclude.m4
> > index b532a4579266..5782f7e4bc2e 100644
> > --- a/acinclude.m4
> > +++ b/acinclude.m4
> > @@ -221,6 +221,38 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
> >    ])
> >  ])
> >
> > +dnl OVS_CHECK_LINUX_AF_XDP
> > +dnl
> > +dnl Check both Linux kernel AF_XDP and libbpf support
> > +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> > +  AC_ARG_ENABLE([afxdp],
> > +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> > +                [], [enable_afxdp=no])
> > +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> > +  if test "$enable_afxdp" != yes; then
> > +    AC_MSG_RESULT([no])
> > +    AF_XDP_ENABLE=false
> > +  else
> > +    AC_MSG_RESULT([yes])
> > +    AF_XDP_ENABLE=true
> > +
> > +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> > +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> > +
> > +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> > +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> > +
> > +    AC_CHECK_HEADER([bpf/xsk.h], [],
> > +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> > +
> > +    AC_DEFINE([HAVE_AF_XDP], [1],
> > +              [Define to 1 if AF_XDP support is available and enabled.])
> > +    LIBBPF_LDADD=" -lbpf -lelf"
> > +    AC_SUBST([LIBBPF_LDADD])
> > +  fi
> > +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> > +])
> > +
> >  dnl OVS_CHECK_DPDK
> >  dnl
> >  dnl Configure DPDK source tree
> > diff --git a/configure.ac b/configure.ac
> > index 505e3d041e93..29c90b73f836 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
> >  OVS_CHECK_DOT
> >  OVS_CHECK_IF_DL
> >  OVS_CHECK_STRTOK_R
> > +OVS_CHECK_LINUX_AF_XDP
> >  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
> >  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
> >    [], [], [[#include <sys/stat.h>]])
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index cc5dccf39d6b..e3c1d9cbf363 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -14,6 +14,10 @@ if WIN32
> >  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
> >  endif
> >
> > +if HAVE_AF_XDP
> > +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> > +endif
> > +
> >  lib_libopenvswitch_la_LDFLAGS = \
> >          $(OVS_LTINFO) \
> >          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> > @@ -409,6 +413,14 @@ lib_libopenvswitch_la_SOURCES += \
> >       lib/tc.h
> >  endif
> >
> > +if HAVE_AF_XDP
> > +lib_libopenvswitch_la_SOURCES += \
> > +     lib/xdpsock.c \
> > +     lib/xdpsock.h \
> > +     lib/netdev-afxdp.c \
> > +     lib/netdev-afxdp.h
> > +endif
> > +
> >  if DPDK_NETDEV
> >  lib_libopenvswitch_la_SOURCES += \
> >       lib/dpdk.c \
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index 0976a35e758b..c50f88e6e056 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -22,6 +22,9 @@
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "util.h"
> > +#ifdef HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
> >
> >  static void
> >  dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source source)
> > @@ -122,6 +125,11 @@ dp_packet_uninit(struct dp_packet *b)
> >               * created as a dp_packet */
> >              free_dpdk_buf((struct dp_packet*) b);
> >  #endif
> > +        } else if (b->source == DPBUF_AFXDP) {
> > +#ifdef HAVE_AF_XDP
> > +            free_afxdp_buf(b);
> > +#endif
> > +            return;
> >          }
> >      }
> >  }
> > @@ -248,6 +256,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
> >      case DPBUF_STACK:
> >          OVS_NOT_REACHED();
> >
> > +    case DPBUF_AFXDP:
> > +        OVS_NOT_REACHED();
> > +
> >      case DPBUF_STUB:
> >          b->source = DPBUF_MALLOC;
> >          new_base = xmalloc(new_allocated);
> > @@ -433,6 +444,7 @@ dp_packet_steal_data(struct dp_packet *b)
> >  {
> >      void *p;
> >      ovs_assert(b->source != DPBUF_DPDK);
> > +    ovs_assert(b->source != DPBUF_AFXDP);
> >
> >      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
> >          p = dp_packet_data(b);
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index a5e9ade1244a..91dcb886899f 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -25,6 +25,10 @@
> >  #include <rte_mbuf.h>
> >  #endif
> >
> > +#ifdef HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
> > +
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/list.h"
> >  #include "packets.h"
> > @@ -42,6 +46,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
> >      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
> >                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
> >                                  */
> > +    DPBUF_AFXDP,               /* buffer data from XDP frame */
> >  };
> >
> >  #define DP_PACKET_CONTEXT_SIZE 64
> > @@ -89,6 +94,13 @@ struct dp_packet {
> >      };
> >  };
> >
> > +#if HAVE_AF_XDP
> > +struct dp_packet_afxdp {
> > +    struct umem_pool *mpool;
> > +    struct dp_packet packet;
> > +};
> > +#endif
> > +
> >  static inline void *dp_packet_data(const struct dp_packet *);
> >  static inline void dp_packet_set_data(struct dp_packet *, void *);
> >  static inline void *dp_packet_base(const struct dp_packet *);
> > @@ -184,6 +196,12 @@ dp_packet_delete(struct dp_packet *b)
> >              return;
> >          }
> >
> > +#ifdef HAVE_AF_XDP
> > +        if (b->source == DPBUF_AFXDP) {
> > +            free_afxdp_buf((struct dp_packet *)b);
>
> I think that pointer cast is not needed here. BTW, I don't know
> why it exists for dpdk case above.

right, thanks.

>
> > +            return;
> > +        }
> > +#endif
> >          dp_packet_uninit(b);
> >          free(b);
> >      }
> > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > index 859c05613ddf..cc91720fad6e 100644
> > --- a/lib/dpif-netdev-perf.h
> > +++ b/lib/dpif-netdev-perf.h
> > @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
> >  {
> >  #ifdef DPDK_NETDEV
> >      return s->last_tsc = rte_get_tsc_cycles();
> > +#elif HAVE_AF_XDP
> > +    /* This is x86-specific instructions. */
> > +    union {
> > +        uint64_t tsc_64;
> > +        struct {
> > +            uint32_t lo_32;
> > +            uint32_t hi_32;
> > +        };
> > +    } tsc;
> > +    asm volatile("rdtsc" :
> > +             "=a" (tsc.lo_32),
> > +             "=d" (tsc.hi_32));
> > +
> > +    return s->last_tsc = tsc.tsc_64;
> >  #else
> >      return s->last_tsc = 0;
> >  #endif
> > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> > new file mode 100644
> > index 000000000000..48de4eaaeed3
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.c
> > @@ -0,0 +1,698 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#if !defined(__i386__) && !defined(__x86_64__)
> > +#error AF_XDP supported only for Linux on x86 or x86_64
> > +#endif
> > +
> > +#include <config.h>
>
> Some space here.
>
> > +#include "netdev-linux.h"
>
> And here.

OK, will fix it.
>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <sys/types.h>
> > +#include <netinet/in.h>
> > +#include <arpa/inet.h>
> > +#include <inttypes.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/socket.h>
> > +#include <sys/utsname.h>
> > +#include <netpacket/packet.h>
> > +#include <net/if.h>
> > +#include <net/if_arp.h>
> > +#include <net/route.h>
> > +#include <poll.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +
> > +#include "coverage.h"
> > +#include "dp-packet.h"
> > +#include "dpif-netlink.h"
> > +#include "dpif-netdev.h"
> > +#include "openvswitch/dynamic-string.h"
> > +#include "fatal-signal.h"
> > +#include "hash.h"
> > +#include "openvswitch/hmap.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "netlink-notifier.h"
> > +#include "netlink-socket.h"
> > +#include "netlink.h"
> > +#include "netnsid.h"
> > +#include "openvswitch/ofpbuf.h"
> > +#include "openflow/openflow.h"
> > +#include "ovs-atomic.h"
> > +#include "packets.h"
> > +#include "openvswitch/poll-loop.h"
> > +#include "rtnetlink.h"
> > +#include "openvswitch/shash.h"
> > +#include "socket-util.h"
> > +#include "sset.h"
> > +#include "tc.h"
> > +#include "timer.h"
> > +#include "unaligned.h"
> > +#include "openvswitch/vlog.h"
> > +#include "util.h"
> > +#include "netdev-afxdp.h"
>
> Above header should be at the top, near to 'netdev-linux'.
>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <linux/rtnetlink.h>
> > +#include <linux/sockios.h>
> > +#include <linux/if_xdp.h>
> > +#include "xdpsock.h"
>
> All above headers should be within corresponding blocks.
> i.e. system headers along with system headers, ovs headers
> with ovs headers above. In lexicographical order, if possible.
>
Thanks, I will also check other places.

> > +
> > +#ifndef SOL_XDP
> > +#define SOL_XDP 283
> > +#endif
> > +#ifndef AF_XDP
> > +#define AF_XDP 44
> > +#endif
> > +#ifndef PF_XDP
> > +#define PF_XDP AF_XDP
> > +#endif
> > +
> > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > +
> > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> > +#define UMEM2XPKT(base, i) \
> > +    ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> > +    i * sizeof(struct dp_packet_afxdp))
>
> Please, align this line to the first parenthesis.
OK.

>
> > +
> > +static uint32_t prog_id;
> > +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> > +                                             int mode);
> > +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> > +static void xsk_destroy(struct xsk_socket_info *xsk);
> > +
> > +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
> > +                                                int xdpmode)
> > +{
> > +    struct xsk_umem_info *umem;
> > +    int ret;
> > +    int i;
> > +
> > +    umem = xcalloc(1, sizeof(*umem));
> > +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> > +                           NULL);
> > +
> > +    if (ret) {
> > +        VLOG_ERR("xsk umem create failed (%s) mode: %s",
> > +                 ovs_strerror(errno),
> > +                 xdpmode == XDP_COPY ? "SKB": "DRV");
>
> free(umem);

Right, thanks!
>
> > +        return NULL;
> > +    }
> > +
> > +    umem->buffer = buffer;
> > +
> > +    /* set-up umem pool */
> > +    umem_pool_init(&umem->mpool, NUM_FRAMES);
> > +
> > +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > +        struct umem_elem *elem;
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)umem->buffer + i * FRAME_SIZE);
> > +        umem_elem_push(&umem->mpool, elem);
> > +    }
> > +
> > +    /* set-up metadata */
> > +    xpacket_pool_init(&umem->xpool, NUM_FRAMES);
> > +
> > +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> > +              umem->xpool.array,
> > +              (char *)umem->xpool.array +
> > +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> > +
> > +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> > +        xpacket->mpool = &umem->mpool;
> > +
> > +        packet = &xpacket->packet;
> > +        packet->source = DPBUF_AFXDP;
> > +    }
> > +
> > +    return umem;
> > +}
> > +
> > +static struct xsk_socket_info *
> > +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> > +                     uint32_t queue_id, int xdpmode)
> > +{
> > +    struct xsk_socket_config cfg;
> > +    struct xsk_socket_info *xsk;
> > +    char devname[IF_NAMESIZE];
> > +    uint32_t idx = 0;
> > +    int ret;
> > +    int i;
> > +
> > +    xsk = xcalloc(1, sizeof(*xsk));
> > +    xsk->umem = umem;
> > +    cfg.rx_size = CONS_NUM_DESCS;
> > +    cfg.tx_size = PROD_NUM_DESCS;
> > +    cfg.libbpf_flags = 0;
> > +
> > +    if (xdpmode == XDP_ZEROCOPY) {
> > +        cfg.bind_flags = XDP_ZEROCOPY;
> > +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +    } else {
> > +        cfg.bind_flags = XDP_COPY;
> > +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +    }
> > +
> > +    if (if_indextoname(ifindex, devname) == NULL) {
> > +        VLOG_ERR("ifindex %d to devname failed (%s)",
> > +                 ifindex, ovs_strerror(errno));
>
> free(xsk);
>
> > +        return NULL;
> > +    }
> > +
> > +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> > +                             &xsk->rx, &xsk->tx, &cfg);
> > +    if (ret) {
> > +        VLOG_ERR("xsk_socket_create failed (%s) mode: %s qid: %d",
> > +                 ovs_strerror(errno),
> > +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> > +                 queue_id);
>
> free(xsk);

Right, thanks!
>
> > +        return NULL;
> > +    }
> > +
> > +    /* Make sure the built-in AF_XDP program is loaded */
> > +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> > +    if (ret) {
> > +        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
> > +        xsk_socket__delete(xsk->xsk);
>
> xsk_socket__delete(xsk->xsk);
> free(xsk);
>

Right, thanks!

> > +        return NULL;
> > +    }
> > +
> > +    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
> > +
> > +    for (i = 0;
> > +         i < PROD_NUM_DESCS * FRAME_SIZE;
> > +         i += FRAME_SIZE) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        elem = umem_elem_pop(&xsk->umem->mpool);
> > +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> > +
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> > +    }
> > +
> > +    xsk_ring_prod__submit(&xsk->umem->fq,
> > +                          PROD_NUM_DESCS);
> > +    return xsk;
> > +}
> > +
> > +static struct xsk_socket_info *
> > +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> > +{
> > +    struct xsk_socket_info *xsk;
> > +    struct xsk_umem_info *umem;
> > +    void *bufs;
> > +    int ret;
> > +
> > +    /* umem memory region */
> > +    ret = posix_memalign(&bufs, getpagesize(),
> > +                         NUM_FRAMES * FRAME_SIZE);
> > +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> > +    ovs_assert(!ret);
> > +
> > +    /* create AF_XDP socket */
> > +    umem = xsk_configure_umem(bufs,
> > +                              NUM_FRAMES * FRAME_SIZE,
> > +                              xdpmode);
> > +    if (!umem) {
>
> free(bufs);
>
> > +        return NULL;
> > +    }
> > +
> > +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> > +    if (!xsk) {
> > +        /* clean up umem and xpacket pool */
> > +        free(bufs);
> > +        (void)xsk_umem__delete(umem->umem);
>
> 'umem' created on 'bufs', so you need to delete 'umem' before
> freeing the 'bufs'.
>
> > +        umem_pool_cleanup(&xsk->umem->mpool);
> > +        xpacket_pool_cleanup(&xsk->umem->xpool);
>
> There is no xsk here:
>
> umem_pool_cleanup(&umem->mpool);
> xpacket_pool_cleanup(&umem->xpool);
>
> And:
>
> free(umem);

good catch, thanks.
>
> > +    }
> > +    return xsk;
> > +}
> > +
> > +void
> > +xsk_configure_all(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct xsk_socket_info *xsk;
> > +    int i, ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev->name);
> > +
> > +    /* configure each queue */
> > +    for (i = 0; i < netdev->n_rxq; i++) {
> > +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> > +                dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
>
> Please, align to the first parenthesis.

Right, thanks!

>
> > +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> > +        if (!xsk) {
> > +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
>
> Need to destroy sockets that was configured.
>
> > +            return;
>
> We also need to return the error and check the result in caller functions.

OK, will fix it in next version.

>
> > +        }
> > +        dev->xsk[i] = xsk;
> > +    }
> > +}
> > +
> > +static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
> > +{
> > +    struct ds ds = DS_EMPTY_INITIALIZER;
> > +    ds_put_hex_dump(&ds, buf, count, 0, false);
> > +    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
> > +    ds_destroy(&ds);
> > +}
> > +
> > +static void
> > +xsk_destroy(struct xsk_socket_info *xsk)
> > +{
> > +    struct xsk_umem *umem;
> > +
> > +    if (!xsk) {
> > +        return;
> > +    }
> > +
> > +    umem = xsk->umem->umem;
> > +    xsk_socket__delete(xsk->xsk);
> > +    (void)xsk_umem__delete(umem);
> > +
> > +    /* free the packet buffer */
> > +    free(xsk->umem->buffer);
> > +
> > +    /* cleanup umem pool */
> > +    umem_pool_cleanup(&xsk->umem->mpool);
> > +
> > +    /* cleanup metadata pool */
> > +    xpacket_pool_cleanup(&xsk->umem->xpool);
>
> free(xsk->umem);
> free(xsk);

Right, thanks!
>
> > +}
> > +
> > +void
> > +xsk_destroy_all(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int i, ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev->name);
> > +
> > +    for (i = 0; i < MAX_XSKQ; i++) {
> > +        if (dev->xsk[i]) {
> > +            VLOG_INFO("destroy xsk[%d]", i);
> > +            xsk_destroy(dev->xsk[i]);
>
> dev->xsk[i] = NULL;
> To avoid double destroy on multiple reconfigurations.

good point, thanks.
>
> > +        }
> > +    }
> > +    VLOG_INFO("remove xdp program");
> > +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > +}
> > +
> > +static inline void OVS_UNUSED
> > +print_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> > +    struct xdp_statistics stat;
> > +    socklen_t optlen;
> > +
> > +    optlen = sizeof stat;
> > +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> > +                &stat, &optlen) == 0);
>
> Please, align to arguments of 'getsockopt'.

Right, thanks!

>
> > +
> > +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> > +                     stat.rx_dropped,
> > +                     stat.rx_invalid_descs,
> > +                     stat.tx_invalid_descs);
> > +}
> > +
> > +int
> > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                        char **errp OVS_UNUSED)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> > +    const char *xdpmode;
> > +    int new_n_rxq;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > +    if (new_n_rxq > MAX_XSKQ) {
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        return EINVAL;
> > +    }
> > +
> > +    if (new_n_rxq != netdev->n_rxq) {
> > +        dev->requested_n_rxq = new_n_rxq;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +
> > +    xdpmode = smap_get(args, "xdpmode");
> > +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
> > +        dev->requested_xdpmode = XDP_ZEROCOPY;
> > +
> > +        if (dev->xdpmode != dev->requested_xdpmode) {
> > +            VLOG_INFO("AF_XDP device %s in DRV mode", netdev->name);
> > +
> > +            /* From SKB mode to DRV mode */
> > +            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +            dev->xdp_bind_flags = XDP_ZEROCOPY;
> > +            dev->xdpmode = XDP_ZEROCOPY;
>
> Some of these flags are used while device destruction. Also, they are used
> while reporting the current device status. So, we should not update them
> before the actual reconfiguration. This should be done after the xsk_destroy_all().
> Same for the 'else' case and the 'setrlimit'.

yes, I will fix this in next version.

>
> > +            netdev_request_reconfigure(netdev);
> > +        }
> > +    } else {
> > +        dev->requested_xdpmode = XDP_COPY;
> > +        if (dev->xdpmode != dev->requested_xdpmode) {
> > +            VLOG_INFO("AF_XDP device %s in SKB mode", netdev->name);
> > +
> > +            /* From DRV mode to SKB mode */
> > +            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +            dev->xdp_bind_flags = XDP_COPY;
> > +            dev->xdpmode = XDP_COPY;
> > +            netdev_request_reconfigure(netdev);
> > +        }
> > +    }
> > +
> > +    if (dev->xdpmode == XDP_ZEROCOPY) {
> > +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> > +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
> > +                      ovs_strerror(errno));
> > +        }
> > +    }
> > +
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> > +    smap_add_format(args, "xdpmode", "%s",
> > +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_afxdp_reconfigure(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int err = 0;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    if (netdev->n_rxq == dev->requested_n_rxq
> > +        && dev->xdpmode == dev->requested_xdpmode) {
> > +        goto out;
> > +    }
> > +
> > +    xsk_destroy_all(netdev);
> > +
> > +    netdev->n_rxq = dev->requested_n_rxq;
> > +    dev->xdpmode = dev->requested_xdpmode;
> > +
> > +    xsk_configure_all(netdev);
>
> Need to get the actual status of xsk_configure_all() and set the 'err'
> accordingly to avoid using broken device.
>
> Another thought is that destroy/configure of each xsk should be implemented
> as rxq_construct/destruct() callbacks, so the datapath will handle rxqs, i.e.
> open and close them when needed. But this could be implemented later.

OK, I will implement this later.

>
> > +    netdev_change_seq_changed(netdev);
> > +out:
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return err;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> > +{
> > +    /* FIXME: Get netdev's PCIe device ID, then find
> > +     * its NUMA node id.
> > +     */
> > +    VLOG_INFO("FIXME: Device %s always use numa id 0", netdev->name);
>
> s/netdev->name/netdev_get_name(netdev)/g
>
> For all the other places in the code.

Right, thanks!
>
> > +    return 0;
> > +}
> > +
> > +void
> > +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> > +{
> > +    uint32_t curr_prog_id = 0;
> > +    uint32_t flags;
> > +
> > +    /* remove_xdp_program() */
> > +    if (xdpmode == XDP_COPY) {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +    } else {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +    }
> > +
> > +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    }
> > +    if (prog_id == curr_prog_id) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    } else if (!curr_prog_id) {
> > +        VLOG_INFO("couldn't find a prog id on a given interface");
> > +    } else {
> > +        VLOG_INFO("program on interface changed, not removing");
> > +    }
> > +}
> > +
> > +static inline struct dp_packet_afxdp *
> > +dp_packet_cast_afxdp(const struct dp_packet *d OVS_UNUSED)
>
> 'd' is not UNUSED.
Right, thanks!

>
> > +{
> > +    ovs_assert(d->source == DPBUF_AFXDP);
> > +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> > +}
> > +
> > +void
> > +free_afxdp_buf(struct dp_packet *p)
> > +{
> > +    struct dp_packet_afxdp *xpacket;
> > +    unsigned long addr;
> > +
> > +    xpacket = dp_packet_cast_afxdp(p);
> > +    if (xpacket->mpool) {
> > +        void *base = dp_packet_base(p);
> > +
> > +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> > +        umem_elem_push(xpacket->mpool, (void *)addr);
> > +    }
> > +}
> > +
> > +void
> > +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> > +{
> > +        struct dp_packet_afxdp *xpacket = NULL;
> > +        struct dp_packet *packet;
> > +        void *elems[BATCH_SIZE];
> > +        unsigned long addr;
> > +
> > +       /* all packets are AF_XDP, so handles its own delete in batch */
> > +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +            xpacket = dp_packet_cast_afxdp(packet);
> > +            if (xpacket->mpool) {
> > +                void *base = dp_packet_base(packet);
> > +
> > +                addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> > +                elems[i] = (void *)addr;
> > +            }
> > +        }
> > +        umem_elem_push_n(xpacket->mpool, batch->count, elems);
> > +        dp_packet_batch_init(batch);
> > +}
> > +
> > +/* Receive packet from AF_XDP socket */
> > +int
> > +netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > +                     struct dp_packet_batch *batch)
> > +{
> > +    struct umem_elem *elems[BATCH_SIZE];
> > +    uint32_t idx_rx = 0, idx_fq = 0;
> > +    unsigned int rcvd, i;
> > +    int ret = 0;
> > +
> > +    /* See if there is any packet on RX queue,
> > +     * if yes, idx_rx is the index having the packet.
> > +     */
> > +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> > +    if (!rcvd) {
> > +        return 0;
> > +    }
> > +
> > +    /* Form a dp_packet batch from descriptor in RX queue */
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> > +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > +        uint64_t index;
> > +
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        index = addr >> FRAME_SHIFT;
> > +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> > +
> > +        packet = &xpacket->packet;
> > +        xpacket->mpool = &xsk->umem->mpool;
> > +
> > +        /* Initialize the struct dp_packet */
> > +        dp_packet_set_base(packet, pkt);
> > +        dp_packet_set_data(packet, pkt);
> > +        dp_packet_set_size(packet, len);
>
> There must be some more work done. We need to clear all the data
> that left from the previous packets.
>
> You may initialize source and call dp_packet_init_specific() on xpool
> initialization, but base, data, size, packet_type, cutlen and offload
> flags should be initialized for each packet.
>
> You probably may implement your own dp_packet_use_afxdp() for this.

good idea. I will add it to my next version.

>
> > +
> > +        /* Add packet into batch, increase batch->count */
> > +        dp_packet_batch_add(batch, packet);
> > +
> > +        idx_rx++;
> > +    }
> > +
> > +    /* We've consume rcvd packets in RX, now re-fill the
> > +     * same number back to FILL queue.
> > +     */
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t index;
> > +        struct umem_elem *elem;
> > +
> > +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> > +        while (OVS_UNLIKELY(ret == 0)) {
> > +            /* The FILL queue is full, so retry. (or skip)? */
> > +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
> > +        }
> > +
> > +        /* Get one free umem, program it into FILL queue */
> > +        elem = elems[i];
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> > +
> > +        idx_fq++;
> > +    }
> > +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> > +
> > +    /* Release the RX queue */
> > +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > +    xsk->rx_npkts += rcvd;
> > +
> > +#ifdef AFXDP_DEBUG
> > +    print_xsk_stat(xsk);
> > +#endif
> > +    return 0;
> > +}
> > +
> > +static void kick_tx(struct xsk_socket_info *xsk)
> > +{
> > +    int ret;
> > +
> > +    /* This causes system call into kernel, avoid calling
> > +     * this as much as we can.
> > +     */
> > +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +    if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN || errno == EBUSY) {
> > +        return;
>
> This makes no much sense. Did you want to print something on error?

Right, will fix it in next version.

>
> > +    }
> > +}
> > +
> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                              struct dp_packet_batch *batch)
> > +{
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t tx_done, idx_cq = 0;
> > +    struct dp_packet *packet;
> > +    uint32_t idx = 0;
> > +    int j, ret;
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        return -EAGAIN;
> > +    }
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return -EAGAIN;
> > +    }
>
> We should probably make umem_elem_pop_n() first before the
> xsk_ring_prod__reserve(), because we can't undo the xsk_ring_prod__reserve(),
> but we can push umem_elems back in case of xsk_ring_prod__reserve() failure.
>
Right, thanks!
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        if (OVS_UNLIKELY(!elem)) {
> > +            return -EAGAIN;
> > +        }
> > +
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * We can avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    kick_tx(xsk);
> > +retry:
> > +
> > +    /* Process CQ */
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk->outstanding_tx -= tx_done;
> > +        xsk->tx_npkts += tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +    if (OVS_UNLIKELY(ret < 0)) {
> > +        goto out;
> > +    }
> > +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
> > +        /* If there are still a lot not transmitted,
> > +         * try harder.
> > +         */
> > +        goto retry;
> > +    }
> > +out:
> > +    return 0;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..e0a49a89accf
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,51 @@
> > +/*
> > + * Copyright (c) 2018 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used directly
> > + * only by Linux-specific code. */
> > +#define MAX_XSKQ 16
> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +
> > +void xsk_configure_all(struct netdev *netdev);
> > +
> > +void xsk_destroy_all(struct netdev *netdev);
> > +
> > +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > +                         struct dp_packet_batch *batch);
> > +
> > +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                                  struct dp_packet_batch *batch);
> > +
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..a17cf614a00c 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -75,6 +75,7 @@
> >  #include "unaligned.h"
> >  #include "openvswitch/vlog.h"
> >  #include "util.h"
> > +#include "netdev-afxdp.h"
>
> Headers should be added in lexicographical order.

OK will fix it.
>
> >
> >  VLOG_DEFINE_THIS_MODULE(netdev_linux);
> >
> > @@ -487,51 +488,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
> >  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
> >
> > -struct netdev_linux {
> > -    struct netdev up;
> > -
> > -    /* Protects all members below. */
> > -    struct ovs_mutex mutex;
> > -
> > -    unsigned int cache_valid;
> > -
> > -    bool miimon;                    /* Link status of last poll. */
> > -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > -    struct timer miimon_timer;
> > -
> > -    int netnsid;                    /* Network namespace ID. */
> > -    /* The following are figured out "on demand" only.  They are only valid
> > -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > -    int ifindex;
> > -    struct eth_addr etheraddr;
> > -    int mtu;
> > -    unsigned int ifi_flags;
> > -    long long int carrier_resets;
> > -    uint32_t kbits_rate;        /* Policing data. */
> > -    uint32_t kbits_burst;
> > -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > -                                   0 or an errno value. */
> > -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> > -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > -    int netdev_policing_error;  /* Cached error code from set policing. */
> > -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > -
> > -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > -
> > -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > -    struct tc *tc;
> > -
> > -    /* For devices of class netdev_tap_class only. */
> > -    int tap_fd;
> > -    bool present;               /* If the device is present in the namespace */
> > -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > -
> > -    /* LAG information. */
> > -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > -};
> >
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> > @@ -579,13 +535,26 @@ is_netdev_linux_class(const struct netdev_class *netdev_class)
> >      return netdev_class->run == netdev_linux_run;
> >  }
> >
> > +#if HAVE_AF_XDP
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev)
> > +{
> > +    return netdev_get_class(netdev) == &netdev_afxdp_class;
> > +}
> > +#else
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +#endif
> >  static bool
> >  is_tap_netdev(const struct netdev *netdev)
> >  {
> >      return netdev_get_class(netdev) == &netdev_tap_class;
> >  }
> >
> > -static struct netdev_linux *
> > +struct netdev_linux *
> >  netdev_linux_cast(const struct netdev *netdev)
> >  {
> >      ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > @@ -1083,7 +1052,11 @@ netdev_linux_destruct(struct netdev *netdev_)
> >      if (netdev->miimon_interval > 0) {
> >          atomic_count_dec(&miimon_cnt);
> >      }
> > -
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev_)) {
> > +        xsk_destroy_all(netdev_);
> > +    }
> > +#endif
> >      ovs_mutex_destroy(&netdev->mutex);
> >  }
> >
> > @@ -1113,7 +1086,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >      rx->is_tap = is_tap_netdev(netdev_);
> >      if (rx->is_tap) {
> >          rx->fd = netdev->tap_fd;
> > -    } else {
> > +    } else if (!is_afxdp_netdev(netdev_)) {
> >          struct sockaddr_ll sll;
> >          int ifindex, val;
> >          /* Result of tcpdump -dd inbound */
> > @@ -1318,10 +1291,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >  {
> >      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >      struct netdev *netdev = rx->up.netdev;
> > -    struct dp_packet *buffer;
> > +    struct dp_packet *buffer = NULL;
> >      ssize_t retval;
> >      int mtu;
> >
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +        int qid = rxq_->queue_id;
> > +
> > +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> > +    }
> > +#endif
> >      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
> >          mtu = ETH_PAYLOAD_MAX;
> >      }
> > @@ -1329,6 +1310,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >      /* Assume Ethernet port. No need to set packet_type. */
> >      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> >                                             DP_NETDEV_HEADROOM);
> > +
> >      retval = (rx->is_tap
> >                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> >                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > @@ -1480,7 +1462,8 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >      int error = 0;
> >      int sock = 0;
> >
> > -    if (!is_tap_netdev(netdev_)) {
> > +    if (!is_tap_netdev(netdev_) &&
> > +        !is_afxdp_netdev(netdev_)) {
> >          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >              error = EOPNOTSUPP;
> >              goto free_batch;
> > @@ -1499,6 +1482,23 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >          }
> >
> >          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +#if HAVE_AF_XDP
> > +    } else if (is_afxdp_netdev(netdev_)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +        struct dp_packet *packet;
> > +
> > +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> > +
> > +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +            if (packet->source != DPBUF_AFXDP) {
>
> You have to be sure that all packets are from the same umem, not only
> the same type.

Good catch. Will do it next version.

>
> > +                 /* free one-by-one */
> > +                goto free_batch;
> > +            }
> > +        }
> > +        /* free in batch */
> > +        free_afxdp_buf_batch(batch);
> > +        return 0;
> > +#endif
> >      } else {
> >          error = netdev_linux_tap_batch_send(netdev_, batch);
> >      }
> > @@ -3323,6 +3323,7 @@ const struct netdev_class netdev_linux_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "system",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_linux_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3333,6 +3334,7 @@ const struct netdev_class netdev_linux_class = {
> >  const struct netdev_class netdev_tap_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      .type = "tap",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct_tap,
> >      .get_stats = netdev_tap_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3343,10 +3345,26 @@ const struct netdev_class netdev_internal_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "internal",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_internal_get_stats,
> >      .get_status = netdev_internal_get_status,
> >  };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > +    NETDEV_LINUX_CLASS_COMMON,
> > +    .type = "afxdp",
> > +    .is_pmd = true,
> > +    .construct = netdev_linux_construct,
> > +    .get_stats = netdev_linux_get_stats,
> > +    .get_status = netdev_linux_get_status,
> > +    .set_config = netdev_afxdp_set_config,
> > +    .get_config = netdev_afxdp_get_config,
> > +    .reconfigure = netdev_afxdp_reconfigure,
> > +    .get_numa_id = netdev_afxdp_get_numa_id,
> > +};
> > +#endif
> >
> >
> >  #define CODEL_N_QUEUES 0x0000
> > diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> > index 17ca9120168a..570f9134e3d4 100644
> > --- a/lib/netdev-linux.h
> > +++ b/lib/netdev-linux.h
> > @@ -19,6 +19,21 @@
> >
> >  #include <stdint.h>
> >  #include <stdbool.h>
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "timer.h"
> > +#include "ovs-atomic.h"
> > +#include "netdev-afxdp.h"
> >
> >  /* These functions are Linux specific, so they should be used directly only by
> >   * Linux-specific code. */
> > @@ -28,6 +43,7 @@ struct netdev;
> >  int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> >                                    const char *flag_name, bool enable);
> >  int linux_get_ifindex(const char *netdev_name);
> > +struct netdev_linux *netdev_linux_cast(const struct netdev *netdev);
> >
> >  #define LINUX_FLOW_OFFLOAD_API                          \
> >     .flow_flush = netdev_tc_flow_flush,                  \
> > @@ -39,4 +55,60 @@ int linux_get_ifindex(const char *netdev_name);
> >     .flow_del = netdev_tc_flow_del,                      \
> >     .init_flow_api = netdev_tc_init_flow_api
> >
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set policing. */
> > +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info *xsk[MAX_XSKQ];
> > +    int requested_n_rxq;
> > +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > +    int xdp_flags, xdp_bind_flags;
> > +#endif
> > +};
> > +
>
> Exposing internal data structures is not a good thing.
> You may create lib/netdev-linux-private.h and move the structure with the
> cast function there.
>
> >  #endif /* netdev-linux.h */
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index fb0c27e6e8e8..d433818f7064 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class;
> >  #endif
> >  extern const struct netdev_class netdev_internal_class;
> >  extern const struct netdev_class netdev_tap_class;
> > -
> > +#if HAVE_AF_XDP
> > +extern const struct netdev_class netdev_afxdp_class;
> > +#endif
> >  #ifdef  __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 7d7ecf6f0946..e2fae37d5a5e 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -146,6 +146,9 @@ netdev_initialize(void)
> >          netdev_register_provider(&netdev_internal_class);
> >          netdev_register_provider(&netdev_tap_class);
> >          netdev_vport_tunnel_register();
> > +#ifdef HAVE_AF_XDP
> > +        netdev_register_provider(&netdev_afxdp_class);
> > +#endif
> >  #endif
> >  #if defined(__FreeBSD__) || defined(__NetBSD__)
> >          netdev_register_provider(&netdev_tap_class);
> > diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> > new file mode 100644
> > index 000000000000..7f20e16364e3
> > --- /dev/null
> > +++ b/lib/xdpsock.c
> > @@ -0,0 +1,236 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +#include <config.h>
>
> Space here.
>
> > +#include <ctype.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdarg.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <syslog.h>
> > +#include <time.h>
> > +#include <unistd.h>
>
> Space here.
>
> > +#include "openvswitch/vlog.h"
> > +#include "async-append.h"
> > +#include "coverage.h"
> > +#include "dirs.h"
> > +#include "ovs-thread.h"
> > +#include "sat-math.h"
> > +#include "socket-util.h"
> > +#include "svec.h"
> > +#include "syslog-direct.h"
> > +#include "syslog-libc.h"
> > +#include "syslog-provider.h"
> > +#include "timeval.h"
> > +#include "unixctl.h"
> > +#include "util.h"
> > +#include "ovs-atomic.h"
> > +#include "openvswitch/compiler.h"
> > +#include "dp-packet.h"
>
> Please, keep them sorted.
>
> > +
> > +#include "xdpsock.h"
>
> This should be moved closer to 'config.h'
>
Right, thanks!

> > +
> > +static inline void ovs_spinlock_init(ovs_spinlock_t *sl)
>
> Please, keep the consistency in function definitions.
> Function name should start from the new line. Same for other functions.
>
OK, will also check other places,  thanks!

> > +{
> > +    sl->locked = 0;
>
> atomic_init(&sl->locked, 0);

>
> > +}
> > +
> > +static inline void ovs_spin_lock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0, locked = 0;
> > +
> > +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed)) {
> > +        locked = 1;
> > +        while (locked) {
> > +            atomic_read_relaxed(&sl->locked, &locked);
> > +        }
> > +        exp = 0;
> > +    }
> > +}
> > +
> > +static inline void ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
> > +
> > +inline int
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +
> > +    return 0;
> > +}
> > +
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void
> > +__umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)
>
> unused?
Thanks, will check all the places using unused.

>
> > +{
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)
>
> unused?
>
> > +{
> > +
> > +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> > +        /* stack is full */
> > +        /* it's possible that one umem gets pushed twice,
> > +         * because actions=1,2,3... multiple ports?
>
> In case of multiple output ports, packet will be cloned, so it's
> not the case. Most probably, you're pushing buffer from the wrong
> umem, i.e. umem of the diferent port/rxq.
This should be OK.

When packet gets cloned, it has
dp_execute_cb
  dp_packet_batch_clone
    dp_packet_clone
     ....
     dp_packet_use
so this buffer is cloned to DPBUF_MALLOC

We will end up having to copy this buffer at
netdev_linux_afxdp_batch_send() when there are
multiple output ports.


>
> > +        */
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void *
> > +__umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)
>
> unused?
>
> > +{
> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)
>
> unused?
>
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ptr;
> > +}
> > +
> > +void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(void *)) == 0);
> > +    memset(bufs, 0, size * sizeof(void *));
> > +    return (void **)bufs;
> > +}
> > +
> > +unsigned int
> > +umem_elem_count(struct umem_pool *mpool)
> > +{
> > +    return mpool->index;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp OVS_UNUSED, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->mutex);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp OVS_UNUSED)
> > +{
> > +    free(umemp->array);
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    /* TODO: check HAVE_POSIX_MEMALIGN  */
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(struct dp_packet_afxdp)) == 0);
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free(xp->array);
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..52d7faaacf75
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,127 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +#include <errno.h>
> > +#include <getopt.h>
> > +#include <libgen.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/if_ether.h>
> > +#include <net/if.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/mman.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +#include <pthread.h>
> > +#include <locale.h>
> > +#include <sys/types.h>
> > +#include <poll.h>
> > +#include <bpf/libbpf.h>
> > +
> > +#include "ovs-atomic.h"
> > +#include "openvswitch/thread.h"
> > +#include <bpf/xsk.h>
>
> Same for the headers.

sure, will fix it.

>
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define NUM_FRAMES      4096
> > +#define PROD_NUM_DESCS  512
> > +#define CONS_NUM_DESCS  512
> > +
> > +#ifdef USE_XSK_DEFAULT
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#endif
> > +
> > +typedef struct {
> > +    atomic_int locked;
> > +} ovs_spinlock_t;
> > +
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > +    int index;      /* point to top */
> > +    unsigned int size;
> > +    ovs_spinlock_t mutex;
> > +    void **array;   /* a pointer array */
> > +};
> > +
> > +/* array-based dp_packet_afxdp */
> > +struct xpacket_pool {
> > +    unsigned int size;
> > +    struct dp_packet_afxdp **array;
> > +};
> > +
> > +struct xsk_umem_info {
> > +    struct umem_pool mpool;
> > +    struct xpacket_pool xpool;
> > +    struct xsk_ring_prod fq;
> > +    struct xsk_ring_cons cq;
> > +    struct xsk_umem *umem;
> > +    void *buffer;
> > +};
> > +
> > +struct xsk_socket_info {
> > +    struct xsk_ring_cons rx;
> > +    struct xsk_ring_prod tx;
> > +    struct xsk_umem_info *umem;
> > +    struct xsk_socket *xsk;
> > +    unsigned long rx_npkts;
> > +    unsigned long tx_npkts;
> > +    unsigned long prev_rx_npkts;
> > +    unsigned long prev_tx_npkts;
> > +    uint32_t outstanding_tx;
> > +};
> > +
> > +struct umem_elem_head {
> > +    unsigned int index;
> > +    struct ovs_mutex mutex;
> > +    uint32_t n;
> > +};
>
> This structure is not used.

good catch, thanks.

>
> > +
> > +struct umem_elem {
> > +    struct umem_elem *next;
> > +};
> > +
> > +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> > +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void *__umem_elem_pop(struct umem_pool *umemp);
> > +void *umem_elem_pop(struct umem_pool *umemp);
> > +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void **__umem_pool_alloc(unsigned int size);
> > +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > +void umem_pool_cleanup(struct umem_pool *umemp);
> > +unsigned int umem_elem_count(struct umem_pool *mpool);
> > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > +
> > +#endif
> > diff --git a/tests/automake.mk b/tests/automake.mk
> > index ea16532dd2a0..715cef9a6b3b 100644
> > --- a/tests/automake.mk
> > +++ b/tests/automake.mk
> > @@ -4,12 +4,14 @@ EXTRA_DIST += \
> >       $(SYSTEM_TESTSUITE_AT) \
> >       $(SYSTEM_KMOD_TESTSUITE_AT) \
> >       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
> > +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
> >       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
> >       $(SYSTEM_DPDK_TESTSUITE_AT) \
> >       $(OVSDB_CLUSTER_TESTSUITE_AT) \
> >       $(TESTSUITE) \
> >       $(SYSTEM_KMOD_TESTSUITE) \
> >       $(SYSTEM_USERSPACE_TESTSUITE) \
> > +     $(SYSTEM_AFXDP_TESTSUITE) \
> >       $(SYSTEM_OFFLOADS_TESTSUITE) \
> >       $(SYSTEM_DPDK_TESTSUITE) \
> >       $(OVSDB_CLUSTER_TESTSUITE) \
> > @@ -158,6 +160,11 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
> >       tests/system-userspace-macros.at \
> >       tests/system-userspace-packet-type-aware.at
> >
> > +SYSTEM_AFXDP_TESTSUITE_AT = \
> > +     tests/system-afxdp-testsuite.at \
> > +     tests/system-afxdp-traffic.at \
> > +     tests/system-afxdp-macros.at
> > +
> >  SYSTEM_TESTSUITE_AT = \
> >       tests/system-common-macros.at \
> >       tests/system-ovn.at \
> > @@ -182,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
> >  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
> >  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
> >  SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
> > +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
> >  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
> >  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
> >  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> > @@ -315,6 +323,11 @@ check-system-userspace: all
> >       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> >
> > +check-afxdp: all
> > +     $(MAKE) install
> > +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> > +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> > +
> >  check-offloads: all
> >       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> > @@ -352,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
> >       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >       $(AM_V_at)mv $@.tmp $@
> >
> > +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> > +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> > +     $(AM_V_at)mv $@.tmp $@
> > +
> >  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
> >       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >       $(AM_V_at)mv $@.tmp $@
> > diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> > new file mode 100644
> > index 000000000000..2c58c2d6554b
> > --- /dev/null
> > +++ b/tests/system-afxdp-macros.at
> > @@ -0,0 +1,153 @@
> > +# _ADD_BR([name])
> > +#
> > +# Expands into the proper ovs-vsctl commands to create a bridge with the
> > +# appropriate type and properties
> > +m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
> > +
> > +# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
> > +#
> > +# Creates a database and starts ovsdb-server, starts ovs-vswitchd
> > +# connected to that database, calls ovs-vsctl to create a bridge named
> > +# br0 with predictable settings, passing 'vsctl-args' as additional
> > +# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
> > +# output (e.g. because it includes "create" commands) then 'vsctl-output'
> > +# specifies the expected output after filtering through uuidfilt.
> > +m4_define([OVS_TRAFFIC_VSWITCHD_START],
> > +  [
> > +   export OVS_PKGDATADIR=$(`pwd`)
> > +   _OVS_VSWITCHD_START([--disable-system])
> > +   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
> > +])
> > +
> > +# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
> > +#
> > +# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
> > +# for messages with severity WARN or higher and signaling an error if any
> > +# is present.  The optional WHITELIST may contain shell-quoted "sed"
> > +# commands to delete any warnings that are actually expected, e.g.:
> > +#
> > +#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
> > +#
> > +# 'extra_cmds' are shell commands to be executed afte OVS_VSWITCHD_STOP() is
> > +# invoked. They can be used to perform additional cleanups such as name space
> > +# removal.
> > +m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
> > +  [OVS_VSWITCHD_STOP([dnl
> > +$1";/netdev_linux.*obtaining netdev stats via vport failed/d
> > +/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d
> > +/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
> > +/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
> > +"])
> > +   AT_CHECK([:; $2])
> > +  ])
> > +
> > +m4_define([ADD_VETH_AFXDP],
> > +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> > +      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
> > +      AT_CHECK([ip link set $1 netns $2])
> > +      AT_CHECK([ip link set dev ovs-$1 up])
> > +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> > +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> > +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> > +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> > +      if test -n "$5"; then
> > +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> > +      fi
> > +      if test -n "$6"; then
> > +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> > +      fi
> > +      on_exit 'ip link del ovs-$1'
> > +    ]
> > +)
> > +
> > +# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
> > +#
> > +# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
> > +m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
> > +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
> > +     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
> > +     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
> > +    ]
> > +)
> > +
> > +# CONFIGURE_VETH_OFFLOADS([VETH])
> > +#
> > +# Disable TX offloads for veths.  The userspace datapath uses the AF_PACKET
> > +# socket to receive packets for veths.  Unfortunately, the AF_PACKET socket
> > +# doesn't play well with offloads:
> > +# 1. GSO packets are received without segmentation and therefore discarded.
> > +# 2. Packets with offloaded partial checksum are received with the wrong
> > +#    checksum, therefore discarded by the receiver.
> > +#
> > +# By disabling tx offloads in the non-OVS side of the veth peer we make sure
> > +# that the AF_PACKET socket will not receive bad packets.
> > +#
> > +# This is a workaround, and should be removed when offloads are properly
> > +# supported in netdev-linux.
> > +m4_define([CONFIGURE_VETH_OFFLOADS],
> > +    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
> > +)
> > +
> > +# CHECK_CONNTRACK()
> > +#
> > +# Perform requirements checks for running conntrack tests.
> > +#
> > +m4_define([CHECK_CONNTRACK],
> > +    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
> > +)
> > +
> > +# CHECK_CONNTRACK_ALG()
> > +#
> > +# Perform requirements checks for running conntrack ALG tests. The userspace
> > +# supports FTP and TFTP.
> > +#
> > +m4_define([CHECK_CONNTRACK_ALG])
> > +
> > +# CHECK_CONNTRACK_FRAG()
> > +#
> > +# Perform requirements checks for running conntrack fragmentations tests.
> > +# The userspace doesn't support fragmentation yet, so skip the tests.
> > +m4_define([CHECK_CONNTRACK_FRAG],
> > +[
> > +    AT_SKIP_IF([:])
> > +])
> > +
> > +# CHECK_CONNTRACK_LOCAL_STACK()
> > +#
> > +# Perform requirements checks for running conntrack tests with local stack.
> > +# While the kernel connection tracker automatically passes all the connection
> > +# tracking state from an internal port to the OpenvSwitch kernel module, there
> > +# is simply no way of doing that with the userspace, so skip the tests.
> > +m4_define([CHECK_CONNTRACK_LOCAL_STACK],
> > +[
> > +    AT_SKIP_IF([:])
> > +])
> > +
> > +# CHECK_CONNTRACK_NAT()
> > +#
> > +# Perform requirements checks for running conntrack NAT tests. The userspace
> > +# datapath supports NAT.
> > +#
> > +m4_define([CHECK_CONNTRACK_NAT])
> > +
> > +# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
> > +#
> > +# Perform requirements checks for running ovs-dpctl flush-conntrack by
> > +# conntrack 5-tuple test. The userspace datapath does not support
> > +# this feature yet.
> > +m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
> > +[
> > +    AT_SKIP_IF([:])
> > +])
> > +
> > +# CHECK_CT_DPIF_SET_GET_MAXCONNS()
> > +#
> > +# Perform requirements checks for running ovs-dpctl ct-set-maxconns or
> > +# ovs-dpctl ct-get-maxconns. The userspace datapath does support this feature.
> > +m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
> > +
> > +# CHECK_CT_DPIF_GET_NCONNS()
> > +#
> > +# Perform requirements checks for running ovs-dpctl ct-get-nconns. The
> > +# userspace datapath does support this feature.
> > +m4_define([CHECK_CT_DPIF_GET_NCONNS])
> > diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
> > new file mode 100644
> > index 000000000000..538c0d15d556
> > --- /dev/null
> > +++ b/tests/system-afxdp-testsuite.at
> > @@ -0,0 +1,26 @@
> > +AT_INIT
> > +
> > +AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
> > +
> > +Licensed under the Apache License, Version 2.0 (the "License");
> > +you may not use this file except in compliance with the License.
> > +You may obtain a copy of the License at:
> > +
> > +    http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +Unless required by applicable law or agreed to in writing, software
> > +distributed under the License is distributed on an "AS IS" BASIS,
> > +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > +See the License for the specific language governing permissions and
> > +limitations under the License.])
> > +
> > +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> > +
> > +m4_include([tests/ovs-macros.at])
> > +m4_include([tests/ovsdb-macros.at])
> > +m4_include([tests/ofproto-macros.at])
> > +m4_include([tests/system-afxdp-macros.at])
> > +m4_include([tests/system-common-macros.at])
> > +
> > +m4_include([tests/system-afxdp-traffic.at])
> > +m4_include([tests/system-ovn.at])
> > diff --git a/tests/system-afxdp-traffic.at b/tests/system-afxdp-traffic.at
> > new file mode 100644
> > index 000000000000..26f72acf48ef
> > --- /dev/null
> > +++ b/tests/system-afxdp-traffic.at
> > @@ -0,0 +1,978 @@
> > +AT_BANNER([AF_XDP netdev datapath-sanity])
> > +
> > +AT_SETUP([datapath - ping between two ports])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ulimit -l unlimited
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping between two ports on vlan])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
> > +ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
> > +
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping6 between two ports])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> > +
> > +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> > +dnl waiting, we get occasional failures due to the following error:
> > +dnl "connect: Cannot assign requested address"
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> > +
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping6 between two ports on vlan])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> > +
> > +ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
> > +ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
> > +
> > +dnl Linux seems to take a little time to get its IPv6 stack in order. Without
> > +dnl waiting, we get occasional failures due to the following error:
> > +dnl "connect: Cannot assign requested address"
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
> > +
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over vxlan tunnel])
> > +OVS_CHECK_VXLAN()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> > +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
> > +ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> > +                  [id 0 dstport 4789])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> > +])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over vxlan6 tunnel])
> > +OVS_CHECK_VXLAN_UDP6ZEROCSUM()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> > +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
> > +ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
> > +                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> > +])
> > +
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over gre tunnel])
> > +OVS_CHECK_GRE()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> > +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
> > +ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> > +])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over erspan v1 tunnel])
> > +OVS_CHECK_GRE()
> > +OVS_CHECK_ERSPAN()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> > +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=1 options:erspan_idx=7])
> > +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> > +])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over erspan v2 tunnel])
> > +OVS_CHECK_GRE()
> > +OVS_CHECK_ERSPAN()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> > +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=2 options:erspan_dir=1 options:erspan_hwid=0x7])
> > +ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 7])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
> > +])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
> > +OVS_CHECK_GRE()
> > +OVS_CHECK_ERSPAN()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> > +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> > +                [options:key=123 options:erspan_ver=1 options:erspan_idx=0x7])
> > +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> > +                   [10.1.1.1/24], [local fc00:100::1 seq key 123 erspan_ver 1 erspan 7])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> > +])
> > +
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
> > +OVS_CHECK_GRE()
> > +OVS_CHECK_ERSPAN()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
> > +AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
> > +                [options:key=121 options:erspan_ver=2 options:erspan_dir=0 options:erspan_hwid=0x7])
> > +ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
> > +                   [10.1.1.1/24],
> > +                   [local fc00:100::1 seq key 121 erspan_ver 2 erspan_dir ingress erspan_hwid 0x7])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
> > +])
> > +
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over geneve tunnel])
> > +OVS_CHECK_GENEVE()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
> > +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
> > +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> > +                  [vni 0])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], [OK
> > +])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - ping over geneve6 tunnel])
> > +OVS_CHECK_GENEVE_UDP6ZEROCSUM()
> > +
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-underlay])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> > +AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
> > +
> > +ADD_NAMESPACES(at_ns0)
> > +
> > +dnl Set up underlay link from host into the namespace using veth pair.
> > +ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
> > +AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
> > +AT_CHECK([ip link set dev br-underlay up])
> > +
> > +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> > +dnl linux device inside the namespace.
> > +ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
> > +ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
> > +                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
> > +
> > +AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
> > +])
> > +AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
> > +])
> > +
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
> > +
> > +dnl First, check the underlay
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +dnl Okay, now check the overlay with different packet sizes
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - clone action])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
> > +                    -- set interface ovs-p1 ofport_request=2])
> > +
> > +AT_DATA([flows.txt], [dnl
> > +priority=1 actions=NORMAL
> > +priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
> > +priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
> > +])
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +
> > +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
> > +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> > +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> > +icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([datapath - basic truncate action])
> > +AT_SKIP_IF([test $HAVE_NC = no])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +AT_CHECK([ovs-ofctl del-flows br0])
> > +
> > +dnl Create p0 and ovs-p0(1)
> > +ADD_NAMESPACES(at_ns0)
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
> > +NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
> > +
> > +dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
> > +AT_CHECK([ip link add p1 type veth peer name ovs-p1])
> > +on_exit 'ip link del ovs-p1'
> > +AT_CHECK([ip link set dev ovs-p1 up])
> > +AT_CHECK([ip link set dev p1 up])
> > +AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
> > +dnl Use p1 to check the truncated packet
> > +AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
> > +
> > +dnl Create p2(5) and ovs-p2(4)
> > +AT_CHECK([ip link add p2 type veth peer name ovs-p2])
> > +on_exit 'ip link del ovs-p2'
> > +AT_CHECK([ip link set dev ovs-p2 up])
> > +AT_CHECK([ip link set dev p2 up])
> > +AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
> > +dnl Use p2 to check the truncated packet
> > +AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
> > +
> > +dnl basic test
> > +AT_CHECK([ovs-ofctl del-flows br0])
> > +AT_DATA([flows.txt], [dnl
> > +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> > +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> > +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
> > +])
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +
> > +dnl use this file as payload file for ncat
> > +AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
> > +on_exit 'rm -f payload200.bin'
> > +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> > +
> > +dnl packet with truncated size
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" |  sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=100
> > +])
> > +dnl packet with original size
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=242
> > +])
> > +
> > +dnl more complicated output actions
> > +AT_CHECK([ovs-ofctl del-flows br0])
> > +AT_DATA([flows.txt], [dnl
> > +in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
> > +in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
> > +in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
> > +])
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +
> > +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> > +
> > +dnl 100 + 100 + 242 + min(65535,242) = 684
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=684
> > +])
> > +dnl 242 + 100 + min(242,200) = 542
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=542
> > +])
> > +
> > +dnl SLOW_ACTION: disable kernel datapath truncate support
> > +dnl Repeat the test above, but exercise the SLOW_ACTION code path
> > +AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
> > +
> > +dnl SLOW_ACTION test1: check datapatch actions
> > +AT_CHECK([ovs-ofctl del-flows br0])
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +
> > +AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
> > +AT_CHECK([tail -3 stdout], [0],
> > +[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
> > +This flow is handled by the userspace slow path because it:
> > +  - Uses action(s) not supported by datapath.
> > +])
> > +
> > +dnl SLOW_ACTION test2: check actual packet truncate
> > +AT_CHECK([ovs-ofctl del-flows br0])
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
> > +
> > +dnl 100 + 100 + 242 + min(65535,242) = 684
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=684
> > +])
> > +
> > +dnl 242 + 100 + min(242,200) = 542
> > +AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
> > +n_bytes=542
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +
> > +AT_BANNER([conntrack])
> > +
> > +AT_SETUP([conntrack - controller])
> > +CHECK_CONNTRACK()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> > +AT_DATA([flows.txt], [dnl
> > +priority=1,action=drop
> > +priority=10,arp,action=normal
> > +priority=100,in_port=1,udp,action=ct(commit),controller
> > +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> > +priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +AT_CAPTURE_FILE([ofctl_monitor.log])
> > +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> > +
> > +dnl Send an unsolicited reply from port 2. This should be dropped.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> > +
> > +dnl OK, now start a new connection from port 1.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
> > +
> > +dnl Now try a reply from port 2.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
> > +
> > +dnl Check this output. We only see the latter two packets, not the first.
> > +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> > +NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> > +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> > +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 (via action) data_len=42 (unbuffered)
> > +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([conntrack - force commit])
> > +CHECK_CONNTRACK()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +AT_DATA([flows.txt], [dnl
> > +priority=1,action=drop
> > +priority=10,arp,action=normal
> > +priority=100,in_port=1,udp,action=ct(force,commit),controller
> > +priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
> > +priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
> > +table=1,in_port=2,ct_state=+trk,udp,action=controller
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +AT_CAPTURE_FILE([ofctl_monitor.log])
> > +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> > +
> > +dnl Send an unsolicited reply from port 2. This should be dropped.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> > +
> > +dnl OK, now start a new connection from port 1.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> > +
> > +dnl Now try a reply from port 2.
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> > +
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +
> > +dnl Check this output. We only see the latter two packets, not the first.
> > +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> > +NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> > +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> > +NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 (via action) data_len=42 (unbuffered)
> > +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> > +])
> > +
> > +dnl
> > +dnl Check that the directionality has been changed by force commit.
> > +dnl
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [], [dnl
> > +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
> > +])
> > +
> > +dnl OK, now send another packet from port 1 and see that it switches again
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> > +AT_CHECK([ovs-appctl revalidator/purge], [0])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> > +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([conntrack - ct flush by 5-tuple])
> > +CHECK_CONNTRACK()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +AT_DATA([flows.txt], [dnl
> > +priority=1,action=drop
> > +priority=10,arp,action=normal
> > +priority=100,in_port=1,udp,action=ct(commit),2
> > +priority=100,in_port=2,udp,action=ct(zone=5,commit),1
> > +priority=100,in_port=1,icmp,action=ct(commit),2
> > +priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +dnl Test UDP from port 1
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
> > +udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [1], [dnl
> > +])
> > +
> > +dnl Test UDP from port 2
> > +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [dnl
> > +udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> > +])
> > +
> > +dnl Test ICMP traffic
> > +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [stdout])
> > +AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
> > +icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
> > +])
> > +
> > +ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
> > +ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [1], [dnl
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([conntrack - IPv4 ping])
> > +CHECK_CONNTRACK()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> > +AT_DATA([flows.txt], [dnl
> > +priority=1,action=drop
> > +priority=10,arp,action=normal
> > +priority=100,in_port=1,icmp,action=ct(commit),2
> > +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> > +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +dnl Pings from ns0->ns1 should work fine.
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> > +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> > +
> > +dnl Pings from ns1->ns0 should fail.
> > +NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
> > +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([conntrack - get_nconns and get/set_maxconns])
> > +CHECK_CONNTRACK()
> > +CHECK_CT_DPIF_SET_GET_MAXCONNS()
> > +CHECK_CT_DPIF_GET_NCONNS()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
> > +
> > +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> > +AT_DATA([flows.txt], [dnl
> > +priority=1,action=drop
> > +priority=10,arp,action=normal
> > +priority=100,in_port=1,icmp,action=ct(commit),2
> > +priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
> > +priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +dnl Pings from ns0->ns1 should work fine.
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
> > +icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
> > +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> > +ovs-appctl: ovs-vswitchd: server returned an error
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
> > +ovs-vswitchd: maxconns missing or malformed (Invalid argument)
> > +ovs-appctl: ovs-vswitchd: server returned an error
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], [dnl
> > +ovs-vswitchd: datapath not found (Invalid argument)
> > +ovs-appctl: ovs-vswitchd: server returned an error
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
> > +ovs-vswitchd: datapath not found (Invalid argument)
> > +ovs-appctl: ovs-vswitchd: server returned an error
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
> > +ovs-vswitchd: datapath not found (Invalid argument)
> > +ovs-appctl: ovs-vswitchd: server returned an error
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> > +1
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> > +3000000
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
> > +setting maxconns successful
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> > +10
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
> > +0
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
> > +10
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> > +
> > +AT_SETUP([conntrack - IPv6 ping])
> > +CHECK_CONNTRACK()
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +
> > +ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
> > +ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
> > +
> > +AT_DATA([flows.txt], [dnl
> > +
> > +dnl ICMPv6 echo request and reply go to table 1.  The rest of the traffic goes
> > +dnl through normal action.
> > +table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
> > +table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
> > +table=0,priority=1,action=normal
> > +
> > +dnl Allow everything from ns0->ns1. Only allow return traffic from ns1->ns0.
> > +table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
> > +table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
> > +table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
> > +table=1,priority=1,action=drop
> > +])
> > +
> > +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> > +
> > +OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
> > +
> > +dnl The above ping creates state in the connection tracker.  We're not
> > +dnl interested in that state.
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack])
> > +
> > +dnl Pings from ns1->ns0 should fail.
> > +NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | FORMAT_PING], [0], [dnl
> > +7 packets transmitted, 0 received, 100% packet loss, time 0ms
> > +])
> > +
> > +dnl Pings from ns0->ns1 should work fine.
> > +NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | FORMAT_PING], [0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], [dnl
> > +icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
> > +])
> > +
> > +OVS_TRAFFIC_VSWITCHD_STOP
> > +AT_CLEANUP
> >
diff mbox series

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..d68a4ac7ff8b
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,469 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It is aims to have comparable
+performance to DPDK but cooperate better with existing kernel's networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, by-passing a couple of Linux kernel's subsystems.
+As a result, AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see linux kernel's
+Documentation/networking/af_xdp.rst
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, i.e., system, tap, or
+internal.  The AF_XDP feature adds a new netdev types called
+"afxdp", and implement its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the proposed architecture re-uses the existing
+userspace dpif-netdev datapath.  As a result, most of
+the packet processing happens at the userspace instead of
+linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+  ``_CONFIG_BPF=y``
+
+  ``_CONFIG_BPF_SYSCALL=y``
+
+  ``_CONFIG_XDP_SOCKETS=y``
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  ``_CONFIG_BPF_JIT=y`` (Performance)
+
+  ``_CONFIG_HAVE_BPF_JIT=y`` (Performance)
+
+  ``_CONFIG_XDP_SOCKETS_DIAG=y`` (Debugging)
+
+- If possible, run **./xdpsock -r -N -z -i <your device>** under
+  linux/samples/bpf.  This is the OVS indepedent benchmark tools for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+Frist, clone a recent version of Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in system's library path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure having 5.0+ kernel
+  make check-afxdp
+
+if a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/<number>/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure the libbpf and libelf are
+set-up right::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd --disable-system
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+.. note::
+   OVS AF_XDP netdev is using the userspace datapath, the same datapath
+   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
+   and datapath_type=netdev when adding a new bridge.
+
+Make sure your device driver support AF_XDP, and to use 1 PMD (on core 4)
+on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+To validate that the bridge has successfully instantiated, you can use the::
+
+  ovs-vsctl show
+
+should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debug by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
+section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+about AF_XDP current and future work.
+
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing PMD
+to keep polling the AF_XDP queues without any interferences from kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
+   running cores, device plug-in slot)
+
+#. Isolate your CPU by doing isolcpu at grub configure.
+
+#. IRQ should not set to pmd running core.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+Debugging performance issue
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use linux perf tool to see where your cpu
+spends its cycle::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peer::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach \
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's numa ID is always 0, need a way to find numa id from a netdev.
+#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A possible
+   work-around is to use OpenFlow meter action.
+#. AF_XDP device added to bridge, remove, and added again will fail.
+#. Most of the tests are done using i40e single port. Multiple ports and
+   also ixgbe driver also needs to be tested.
+#. No latency test result (TODO items)
+
+
+make check-afxdp
+----------------
+When executing 'make check-afxdp', OVS creates namespaces, sets up AF_XDP on
+veth devices and kicks start the testing.  So far we have the following test
+cases::
+
+ AF_XDP netdev datapath-sanity
+
+  1: datapath - ping between two ports               ok
+  2: datapath - ping between two ports on vlan       ok
+  3: datapath - ping6 between two ports              ok
+  4: datapath - ping6 between two ports on vlan      ok
+  5: datapath - ping over vxlan tunnel               ok
+  6: datapath - ping over vxlan6 tunnel              ok
+  7: datapath - ping over gre tunnel                 ok
+  8: datapath - ping over erspan v1 tunnel           ok
+  9: datapath - ping over erspan v2 tunnel           ok
+ 10: datapath - ping over ip6erspan v1 tunnel        ok
+ 11: datapath - ping over ip6erspan v2 tunnel        ok
+ 12: datapath - ping over geneve tunnel              ok
+ 13: datapath - ping over geneve6 tunnel             ok
+ 14: datapath - clone action                         ok
+ 15: datapath - basic truncate action                ok
+
+ conntrack
+
+ 16: conntrack - controller                          ok
+ 17: conntrack - force commit                        ok
+ 18: conntrack - ct flush by 5-tuple                 ok
+ 19: conntrack - IPv4 ping                           ok
+ 20: conntrack - get_nconns and get/set_maxconns     ok
+ 21: conntrack - IPv6 ping                           ok
+
+ system-ovn
+
+ 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
+ 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
+ 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
+ 25: ovn -- load-balancing                           ok
+ 26: ovn -- load-balancing - same subnet.            ok
+ 27: ovn -- load balancing in gateway router         ok
+ 28: ovn -- multiple gateway routers, load-balancing ok
+ 29: ovn -- load balancing in router with gateway router port ok
+ 30: ovn -- DNAT and SNAT on distributed router - N/S ok
+ 31: ovn -- DNAT and SNAT on distributed router - E/W ok
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as physical nic, and a tap device connected to VM.
+First, start OVS, then add physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+The performance number I got is around 700Kpps.
+This is due to using the kernel's tap interface, which requires copying
+packet into kernel from the umem buffer in userspace.
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Setup the OpenFlow ruls::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1 type="afxdp" options:xdpmode=skb
+
+Setup the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, run drop or bounce back the packet::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index b532a4579266..5782f7e4bc2e 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -221,6 +221,38 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index 505e3d041e93..29c90b73f836 100644
--- a/configure.ac
+++ b/configure.ac
@@ -99,6 +99,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index cc5dccf39d6b..e3c1d9cbf363 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@  if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -409,6 +413,14 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..c50f88e6e056 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -22,6 +22,9 @@ 
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
+#ifdef HAVE_AF_XDP
+#include "netdev-afxdp.h"
+#endif
 
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source source)
@@ -122,6 +125,11 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+#ifdef HAVE_AF_XDP
+            free_afxdp_buf(b);
+#endif
+            return;
         }
     }
 }
@@ -248,6 +256,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +444,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..91dcb886899f 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,10 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#ifdef HAVE_AF_XDP
+#include "netdev-afxdp.h"
+#endif
+
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +46,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* buffer data from XDP frame */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +94,13 @@  struct dp_packet {
     };
 };
 
+#if HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -184,6 +196,12 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+#ifdef HAVE_AF_XDP
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf((struct dp_packet *)b);
+            return;
+        }
+#endif
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..cc91720fad6e 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -198,6 +198,20 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif HAVE_AF_XDP
+    /* This is x86-specific instructions. */
+    union {
+        uint64_t tsc_64;
+        struct {
+            uint32_t lo_32;
+            uint32_t hi_32;
+        };
+    } tsc;
+    asm volatile("rdtsc" :
+             "=a" (tsc.lo_32),
+             "=d" (tsc.hi_32));
+
+    return s->last_tsc = tsc.tsc_64;
 #else
     return s->last_tsc = 0;
 #endif
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..48de4eaaeed3
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,698 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if !defined(__i386__) && !defined(__x86_64__)
+#error AF_XDP supported only for Linux on x86 or x86_64
+#endif
+
+#include <config.h>
+#include "netdev-linux.h"
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <inttypes.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/utsname.h>
+#include <netpacket/packet.h>
+#include <net/if.h>
+#include <net/if_arp.h>
+#include <net/route.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dpif-netlink.h"
+#include "dpif-netdev.h"
+#include "openvswitch/dynamic-string.h"
+#include "fatal-signal.h"
+#include "hash.h"
+#include "openvswitch/hmap.h"
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "netlink-notifier.h"
+#include "netlink-socket.h"
+#include "netlink.h"
+#include "netnsid.h"
+#include "openvswitch/ofpbuf.h"
+#include "openflow/openflow.h"
+#include "ovs-atomic.h"
+#include "packets.h"
+#include "openvswitch/poll-loop.h"
+#include "rtnetlink.h"
+#include "openvswitch/shash.h"
+#include "socket-util.h"
+#include "sset.h"
+#include "tc.h"
+#include "timer.h"
+#include "unaligned.h"
+#include "openvswitch/vlog.h"
+#include "util.h"
+#include "netdev-afxdp.h"
+
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <linux/rtnetlink.h>
+#include <linux/sockios.h>
+#include <linux/if_xdp.h>
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
+#define UMEM2XPKT(base, i) \
+    ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
+    i * sizeof(struct dp_packet_afxdp))
+
+static uint32_t prog_id;
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+
+static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
+                                                int xdpmode)
+{
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xcalloc(1, sizeof(*umem));
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           NULL);
+
+    if (ret) {
+        VLOG_ERR("xsk umem create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV");
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* set-up umem pool */
+    umem_pool_init(&umem->mpool, NUM_FRAMES);
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct umem_elem *elem;
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* set-up metadata */
+    xpacket_pool_init(&umem->xpool, NUM_FRAMES);
+
+    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = UMEM2XPKT(umem->xpool.array, i);
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0;
+    int ret;
+    int i;
+
+    xsk = xcalloc(1, sizeof(*xsk));
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket_create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV",
+                 queue_id);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        return NULL;
+    }
+
+    xsk_ring_prod__reserve(&xsk->umem->fq, PROD_NUM_DESCS, &idx);
+
+    for (i = 0;
+         i < PROD_NUM_DESCS * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+    int ret;
+
+    /* umem memory region */
+    ret = posix_memalign(&bufs, getpagesize(),
+                         NUM_FRAMES * FRAME_SIZE);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+    ovs_assert(!ret);
+
+    /* create AF_XDP socket */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        return NULL;
+    }
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* clean up umem and xpacket pool */
+        free(bufs);
+        (void)xsk_umem__delete(umem->umem);
+        umem_pool_cleanup(&xsk->umem->mpool);
+        xpacket_pool_cleanup(&xsk->umem->xpool);
+    }
+    return xsk;
+}
+
+void
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk;
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev->name);
+
+    /* configure each queue */
+    for (i = 0; i < netdev->n_rxq; i++) {
+        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
+                dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk) {
+            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
+            return;
+        }
+        dev->xsk[i] = xsk;
+    }
+}
+
+static void OVS_UNUSED vlog_hex_dump(const void *buf, size_t count)
+{
+    struct ds ds = DS_EMPTY_INITIALIZER;
+    ds_put_hex_dump(&ds, buf, count, 0, false);
+    VLOG_DBG_RL(&rl, "%s", ds_cstr(&ds));
+    ds_destroy(&ds);
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk)
+{
+    struct xsk_umem *umem;
+
+    if (!xsk) {
+        return;
+    }
+
+    umem = xsk->umem->umem;
+    xsk_socket__delete(xsk->xsk);
+    (void)xsk_umem__delete(umem);
+
+    /* free the packet buffer */
+    free(xsk->umem->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&xsk->umem->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&xsk->umem->xpool);
+}
+
+void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev->name);
+
+    for (i = 0; i < MAX_XSKQ; i++) {
+        if (dev->xsk[i]) {
+            VLOG_INFO("destroy xsk[%d]", i);
+            xsk_destroy(dev->xsk[i]);
+        }
+    }
+    VLOG_INFO("remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static inline void OVS_UNUSED
+print_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+                &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                     stat.rx_dropped,
+                     stat.rx_invalid_descs,
+                     stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    const char *xdpmode;
+    int new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (new_n_rxq != netdev->n_rxq) {
+        dev->requested_n_rxq = new_n_rxq;
+        netdev_request_reconfigure(netdev);
+    }
+
+    xdpmode = smap_get(args, "xdpmode");
+    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
+        dev->requested_xdpmode = XDP_ZEROCOPY;
+
+        if (dev->xdpmode != dev->requested_xdpmode) {
+            VLOG_INFO("AF_XDP device %s in DRV mode", netdev->name);
+
+            /* From SKB mode to DRV mode */
+            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+            dev->xdp_bind_flags = XDP_ZEROCOPY;
+            dev->xdpmode = XDP_ZEROCOPY;
+            netdev_request_reconfigure(netdev);
+        }
+    } else {
+        dev->requested_xdpmode = XDP_COPY;
+        if (dev->xdpmode != dev->requested_xdpmode) {
+            VLOG_INFO("AF_XDP device %s in SKB mode", netdev->name);
+
+            /* From DRV mode to SKB mode */
+            dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+            dev->xdp_bind_flags = XDP_COPY;
+            dev->xdpmode = XDP_COPY;
+            netdev_request_reconfigure(netdev);
+        }
+    }
+
+    if (dev->xdpmode == XDP_ZEROCOPY) {
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+                      ovs_strerror(errno));
+        }
+    }
+
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+
+    netdev->n_rxq = dev->requested_n_rxq;
+    dev->xdpmode = dev->requested_xdpmode;
+
+    xsk_configure_all(netdev);
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always use numa id 0", netdev->name);
+    return 0;
+}
+
+void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t curr_prog_id = 0;
+    uint32_t flags;
+
+    /* remove_xdp_program() */
+    if (xdpmode == XDP_COPY) {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    } else {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    }
+
+    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    }
+    if (prog_id == curr_prog_id) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    } else if (!curr_prog_id) {
+        VLOG_INFO("couldn't find a prog id on a given interface");
+    } else {
+        VLOG_INFO("program on interface changed, not removing");
+    }
+}
+
+static inline struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d OVS_UNUSED)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    unsigned long addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
+void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+        struct dp_packet_afxdp *xpacket = NULL;
+        struct dp_packet *packet;
+        void *elems[BATCH_SIZE];
+        unsigned long addr;
+
+       /* all packets are AF_XDP, so handles its own delete in batch */
+        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+            xpacket = dp_packet_cast_afxdp(packet);
+            if (xpacket->mpool) {
+                void *base = dp_packet_base(packet);
+
+                addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+                elems[i] = (void *)addr;
+            }
+        }
+        umem_elem_push_n(xpacket->mpool, batch->count, elems);
+        dp_packet_batch_init(batch);
+}
+
+/* Receive packet from AF_XDP socket */
+int
+netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
+                     struct dp_packet_batch *batch)
+{
+    struct umem_elem *elems[BATCH_SIZE];
+    uint32_t idx_rx = 0, idx_fq = 0;
+    unsigned int rcvd, i;
+    int ret = 0;
+
+    /* See if there is any packet on RX queue,
+     * if yes, idx_rx is the index having the packet.
+     */
+    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return 0;
+    }
+
+    /* Form a dp_packet batch from descriptor in RX queue */
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
+        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
+        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+        uint64_t index;
+
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        index = addr >> FRAME_SHIFT;
+        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
+
+        packet = &xpacket->packet;
+        xpacket->mpool = &xsk->umem->mpool;
+
+        /* Initialize the struct dp_packet */
+        dp_packet_set_base(packet, pkt);
+        dp_packet_set_data(packet, pkt);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+
+    /* We've consume rcvd packets in RX, now re-fill the
+     * same number back to FILL queue.
+     */
+    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
+    if (OVS_UNLIKELY(ret)) {
+        return -ENOMEM;
+    }
+
+    for (i = 0; i < rcvd; i++) {
+        uint64_t index;
+        struct umem_elem *elem;
+
+        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
+        while (OVS_UNLIKELY(ret == 0)) {
+            /* The FILL queue is full, so retry. (or skip)? */
+            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
+        }
+
+        /* Get one free umem, program it into FILL queue */
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+
+    /* Release the RX queue */
+    xsk_ring_cons__release(&xsk->rx, rcvd);
+    xsk->rx_npkts += rcvd;
+
+#ifdef AFXDP_DEBUG
+    print_xsk_stat(xsk);
+#endif
+    return 0;
+}
+
+static void kick_tx(struct xsk_socket_info *xsk)
+{
+    int ret;
+
+    /* This causes system call into kernel, avoid calling
+     * this as much as we can.
+     */
+    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
+    if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN || errno == EBUSY) {
+        return;
+    }
+}
+
+int
+netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
+                              struct dp_packet_batch *batch)
+{
+    struct umem_elem *elems_pop[BATCH_SIZE];
+    struct umem_elem *elems_push[BATCH_SIZE];
+    uint32_t tx_done, idx_cq = 0;
+    struct dp_packet *packet;
+    uint32_t idx = 0;
+    int j, ret;
+
+    /* Make sure we have enough TX descs */
+    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        return -EAGAIN;
+    }
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        return -EAGAIN;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct umem_elem *elem;
+        uint64_t index;
+
+        elem = elems_pop[i];
+        if (OVS_UNLIKELY(!elem)) {
+            return -EAGAIN;
+        }
+
+        /* Copy the packet to the umem we just pop from umem pool.
+         * We can avoid this copy if the packet and the pop umem
+         * are located in the same umem.
+         */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk->tx, batch->count);
+    xsk->outstanding_tx += batch->count;
+
+    kick_tx(xsk);
+retry:
+
+    /* Process CQ */
+    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
+    if (tx_done > 0) {
+        xsk->outstanding_tx -= tx_done;
+        xsk->tx_npkts += tx_done;
+    }
+
+    /* Recycle back to umem pool */
+    for (j = 0; j < tx_done; j++) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)xsk->umem->buffer + addr);
+        elems_push[j] = elem;
+    }
+    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
+    if (OVS_UNLIKELY(ret < 0)) {
+        goto out;
+    }
+    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
+
+    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
+        /* If there are still a lot not transmitted,
+         * try harder.
+         */
+        goto retry;
+    }
+out:
+    return 0;
+}
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..e0a49a89accf
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,51 @@ 
+/*
+ * Copyright (c) 2018 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+#define MAX_XSKQ 16
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+
+void xsk_configure_all(struct netdev *netdev);
+
+void xsk_destroy_all(struct netdev *netdev);
+
+int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
+                         struct dp_packet_batch *batch);
+
+int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
+                                  struct dp_packet_batch *batch);
+
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+
+void free_afxdp_buf(struct dp_packet *p);
+void free_afxdp_buf_batch(struct dp_packet_batch *batch);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index f75d73fd39f8..a17cf614a00c 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -75,6 +75,7 @@ 
 #include "unaligned.h"
 #include "openvswitch/vlog.h"
 #include "util.h"
+#include "netdev-afxdp.h"
 
 VLOG_DEFINE_THIS_MODULE(netdev_linux);
 
@@ -487,51 +488,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
 
 struct netdev_rxq_linux {
     struct netdev_rxq up;
@@ -579,13 +535,26 @@  is_netdev_linux_class(const struct netdev_class *netdev_class)
     return netdev_class->run == netdev_linux_run;
 }
 
+#if HAVE_AF_XDP
+static bool
+is_afxdp_netdev(const struct netdev *netdev)
+{
+    return netdev_get_class(netdev) == &netdev_afxdp_class;
+}
+#else
+static bool
+is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
+{
+    return false;
+}
+#endif
 static bool
 is_tap_netdev(const struct netdev *netdev)
 {
     return netdev_get_class(netdev) == &netdev_tap_class;
 }
 
-static struct netdev_linux *
+struct netdev_linux *
 netdev_linux_cast(const struct netdev *netdev)
 {
     ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
@@ -1083,7 +1052,11 @@  netdev_linux_destruct(struct netdev *netdev_)
     if (netdev->miimon_interval > 0) {
         atomic_count_dec(&miimon_cnt);
     }
-
+#if HAVE_AF_XDP
+    if (is_afxdp_netdev(netdev_)) {
+        xsk_destroy_all(netdev_);
+    }
+#endif
     ovs_mutex_destroy(&netdev->mutex);
 }
 
@@ -1113,7 +1086,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     rx->is_tap = is_tap_netdev(netdev_);
     if (rx->is_tap) {
         rx->fd = netdev->tap_fd;
-    } else {
+    } else if (!is_afxdp_netdev(netdev_)) {
         struct sockaddr_ll sll;
         int ifindex, val;
         /* Result of tcpdump -dd inbound */
@@ -1318,10 +1291,18 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
 {
     struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
     struct netdev *netdev = rx->up.netdev;
-    struct dp_packet *buffer;
+    struct dp_packet *buffer = NULL;
     ssize_t retval;
     int mtu;
 
+#if HAVE_AF_XDP
+    if (is_afxdp_netdev(netdev)) {
+        struct netdev_linux *dev = netdev_linux_cast(netdev);
+        int qid = rxq_->queue_id;
+
+        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
+    }
+#endif
     if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
         mtu = ETH_PAYLOAD_MAX;
     }
@@ -1329,6 +1310,7 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     /* Assume Ethernet port. No need to set packet_type. */
     buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
                                            DP_NETDEV_HEADROOM);
+
     retval = (rx->is_tap
               ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
               : netdev_linux_rxq_recv_sock(rx->fd, buffer));
@@ -1480,7 +1462,8 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
     int error = 0;
     int sock = 0;
 
-    if (!is_tap_netdev(netdev_)) {
+    if (!is_tap_netdev(netdev_) &&
+        !is_afxdp_netdev(netdev_)) {
         if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
             error = EOPNOTSUPP;
             goto free_batch;
@@ -1499,6 +1482,23 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
         }
 
         error = netdev_linux_sock_batch_send(sock, ifindex, batch);
+#if HAVE_AF_XDP
+    } else if (is_afxdp_netdev(netdev_)) {
+        struct netdev_linux *dev = netdev_linux_cast(netdev_);
+        struct dp_packet *packet;
+
+        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
+
+        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+            if (packet->source != DPBUF_AFXDP) {
+                 /* free one-by-one */
+                goto free_batch;
+            }
+        }
+        /* free in batch */
+        free_afxdp_buf_batch(batch);
+        return 0;
+#endif
     } else {
         error = netdev_linux_tap_batch_send(netdev_, batch);
     }
@@ -3323,6 +3323,7 @@  const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
@@ -3333,6 +3334,7 @@  const struct netdev_class netdev_linux_class = {
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
@@ -3343,10 +3345,26 @@  const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .get_stats = netdev_linux_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
index 17ca9120168a..570f9134e3d4 100644
--- a/lib/netdev-linux.h
+++ b/lib/netdev-linux.h
@@ -19,6 +19,21 @@ 
 
 #include <stdint.h>
 #include <stdbool.h>
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "timer.h"
+#include "ovs-atomic.h"
+#include "netdev-afxdp.h"
 
 /* These functions are Linux specific, so they should be used directly only by
  * Linux-specific code. */
@@ -28,6 +43,7 @@  struct netdev;
 int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
                                   const char *flag_name, bool enable);
 int linux_get_ifindex(const char *netdev_name);
+struct netdev_linux *netdev_linux_cast(const struct netdev *netdev);
 
 #define LINUX_FLOW_OFFLOAD_API                          \
    .flow_flush = netdev_tc_flow_flush,                  \
@@ -39,4 +55,60 @@  int linux_get_ifindex(const char *netdev_name);
    .flow_del = netdev_tc_flow_del,                      \
    .init_flow_api = netdev_tc_init_flow_api
 
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+    /* AF_XDP information */
+#ifdef HAVE_AF_XDP
+    struct xsk_socket_info *xsk[MAX_XSKQ];
+    int requested_n_rxq;
+    int xdpmode, requested_xdpmode; /* detect mode changed */
+    int xdp_flags, xdp_bind_flags;
+#endif
+};
+
 #endif /* netdev-linux.h */
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e6e8e8..d433818f7064 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -902,7 +902,9 @@  extern const struct netdev_class netdev_linux_class;
 #endif
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
-
+#if HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 7d7ecf6f0946..e2fae37d5a5e 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -146,6 +146,9 @@  netdev_initialize(void)
         netdev_register_provider(&netdev_internal_class);
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..7f20e16364e3
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,236 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <syslog.h>
+#include <time.h>
+#include <unistd.h>
+#include "openvswitch/vlog.h"
+#include "async-append.h"
+#include "coverage.h"
+#include "dirs.h"
+#include "ovs-thread.h"
+#include "sat-math.h"
+#include "socket-util.h"
+#include "svec.h"
+#include "syslog-direct.h"
+#include "syslog-libc.h"
+#include "syslog-provider.h"
+#include "timeval.h"
+#include "unixctl.h"
+#include "util.h"
+#include "ovs-atomic.h"
+#include "openvswitch/compiler.h"
+#include "dp-packet.h"
+
+#include "xdpsock.h"
+
+static inline void ovs_spinlock_init(ovs_spinlock_t *sl)
+{
+    sl->locked = 0;
+}
+
+static inline void ovs_spin_lock(ovs_spinlock_t *sl)
+{
+    int exp = 0, locked = 0;
+
+    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed)) {
+        locked = 1;
+        while (locked) {
+            atomic_read_relaxed(&sl->locked, &locked);
+        }
+        exp = 0;
+    }
+}
+
+static inline void ovs_spin_unlock(ovs_spinlock_t *sl)
+{
+    atomic_store_explicit(&sl->locked, 0, memory_order_release);
+}
+
+static inline int OVS_UNUSED ovs_spin_trylock(ovs_spinlock_t *sl)
+{
+    int exp = 0;
+    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed);
+}
+
+inline int
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        return -ENOMEM;
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+
+    return 0;
+}
+
+int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->mutex);
+    ret = __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ret;
+}
+
+inline void
+__umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)
+{
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp OVS_UNUSED, void *addr)
+{
+
+    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
+        /* stack is full */
+        /* it's possible that one umem gets pushed twice,
+         * because actions=1,2,3... multiple ports?
+        */
+        OVS_NOT_REACHED();
+    }
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->mutex);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->mutex);
+}
+
+inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->mutex);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ret;
+}
+
+inline void *
+__umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)
+{
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp OVS_UNUSED)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->mutex);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ptr;
+}
+
+void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+
+    ovs_assert(posix_memalign(&bufs, getpagesize(),
+                              size * sizeof(void *)) == 0);
+    memset(bufs, 0, size * sizeof(void *));
+    return (void **)bufs;
+}
+
+unsigned int
+umem_elem_count(struct umem_pool *mpool)
+{
+    return mpool->index;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp OVS_UNUSED, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->mutex);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp OVS_UNUSED)
+{
+    free(umemp->array);
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+
+    /* TODO: check HAVE_POSIX_MEMALIGN  */
+    ovs_assert(posix_memalign(&bufs, getpagesize(),
+                              size * sizeof(struct dp_packet_afxdp)) == 0);
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free(xp->array);
+}
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..52d7faaacf75
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,127 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+#include <bpf/libbpf.h>
+
+#include "ovs-atomic.h"
+#include "openvswitch/thread.h"
+#include <bpf/xsk.h>
+
+#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
+#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
+#define BATCH_SIZE      NETDEV_MAX_BURST
+#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define NUM_FRAMES      4096
+#define PROD_NUM_DESCS  512
+#define CONS_NUM_DESCS  512
+
+#ifdef USE_XSK_DEFAULT
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+#endif
+
+typedef struct {
+    atomic_int locked;
+} ovs_spinlock_t;
+
+/* LIFO ptr_array */
+struct umem_pool {
+    int index;      /* point to top */
+    unsigned int size;
+    ovs_spinlock_t mutex;
+    void **array;   /* a pointer array */
+};
+
+/* array-based dp_packet_afxdp */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_npkts;
+    unsigned long tx_npkts;
+    unsigned long prev_rx_npkts;
+    unsigned long prev_tx_npkts;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem_head {
+    unsigned int index;
+    struct ovs_mutex mutex;
+    uint32_t n;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void __umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *__umem_elem_pop(struct umem_pool *umemp);
+void *umem_elem_pop(struct umem_pool *umemp);
+int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+void **__umem_pool_alloc(unsigned int size);
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+unsigned int umem_elem_count(struct umem_pool *mpool);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
diff --git a/tests/automake.mk b/tests/automake.mk
index ea16532dd2a0..715cef9a6b3b 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -158,6 +160,11 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-traffic.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -182,6 +189,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -315,6 +323,11 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -352,6 +365,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..2c58c2d6554b
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,153 @@ 
+# _ADD_BR([name])
+#
+# Expands into the proper ovs-vsctl commands to create a bridge with the
+# appropriate type and properties
+m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type=netdev protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
+
+# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
+#
+# Creates a database and starts ovsdb-server, starts ovs-vswitchd
+# connected to that database, calls ovs-vsctl to create a bridge named
+# br0 with predictable settings, passing 'vsctl-args' as additional
+# commands to ovs-vsctl.  If 'vsctl-args' causes ovs-vsctl to provide
+# output (e.g. because it includes "create" commands) then 'vsctl-output'
+# specifies the expected output after filtering through uuidfilt.
+m4_define([OVS_TRAFFIC_VSWITCHD_START],
+  [
+   export OVS_PKGDATADIR=$(`pwd`)
+   _OVS_VSWITCHD_START([--disable-system])
+   AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
+])
+
+# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
+#
+# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
+# for messages with severity WARN or higher and signaling an error if any
+# is present.  The optional WHITELIST may contain shell-quoted "sed"
+# commands to delete any warnings that are actually expected, e.g.:
+#
+#   OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
+#
+# 'extra_cmds' are shell commands to be executed afte OVS_VSWITCHD_STOP() is
+# invoked. They can be used to perform additional cleanups such as name space
+# removal.
+m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
+  [OVS_VSWITCHD_STOP([dnl
+$1";/netdev_linux.*obtaining netdev stats via vport failed/d
+/dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d
+/dpif_netdev(revalidator.*)|ERR|internal error parsing flow key/d
+/dpif(revalidator.*)|WARN|netdev@ovs-netdev: failed to put/d
+"])
+   AT_CHECK([:; $2])
+  ])
+
+m4_define([ADD_VETH_AFXDP],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_AFXDP_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
+
+# CONFIGURE_AFXDP_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
+m4_define([CONFIGURE_AFXDP_VETH_OFFLOADS],
+    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
+     AT_CHECK([ethtool -K $1 rxvlan off], [0], [ignore], [ignore])
+     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
+    ]
+)
+
+# CONFIGURE_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads for veths.  The userspace datapath uses the AF_PACKET
+# socket to receive packets for veths.  Unfortunately, the AF_PACKET socket
+# doesn't play well with offloads:
+# 1. GSO packets are received without segmentation and therefore discarded.
+# 2. Packets with offloaded partial checksum are received with the wrong
+#    checksum, therefore discarded by the receiver.
+#
+# By disabling tx offloads in the non-OVS side of the veth peer we make sure
+# that the AF_PACKET socket will not receive bad packets.
+#
+# This is a workaround, and should be removed when offloads are properly
+# supported in netdev-linux.
+m4_define([CONFIGURE_VETH_OFFLOADS],
+    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
+)
+
+# CHECK_CONNTRACK()
+#
+# Perform requirements checks for running conntrack tests.
+#
+m4_define([CHECK_CONNTRACK],
+    [AT_SKIP_IF([test $HAVE_PYTHON = no])]
+)
+
+# CHECK_CONNTRACK_ALG()
+#
+# Perform requirements checks for running conntrack ALG tests. The userspace
+# supports FTP and TFTP.
+#
+m4_define([CHECK_CONNTRACK_ALG])
+
+# CHECK_CONNTRACK_FRAG()
+#
+# Perform requirements checks for running conntrack fragmentations tests.
+# The userspace doesn't support fragmentation yet, so skip the tests.
+m4_define([CHECK_CONNTRACK_FRAG],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_LOCAL_STACK()
+#
+# Perform requirements checks for running conntrack tests with local stack.
+# While the kernel connection tracker automatically passes all the connection
+# tracking state from an internal port to the OpenvSwitch kernel module, there
+# is simply no way of doing that with the userspace, so skip the tests.
+m4_define([CHECK_CONNTRACK_LOCAL_STACK],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_NAT()
+#
+# Perform requirements checks for running conntrack NAT tests. The userspace
+# datapath supports NAT.
+#
+m4_define([CHECK_CONNTRACK_NAT])
+
+# CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
+#
+# Perform requirements checks for running ovs-dpctl flush-conntrack by
+# conntrack 5-tuple test. The userspace datapath does not support
+# this feature yet.
+m4_define([CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE],
+[
+    AT_SKIP_IF([:])
+])
+
+# CHECK_CT_DPIF_SET_GET_MAXCONNS()
+#
+# Perform requirements checks for running ovs-dpctl ct-set-maxconns or
+# ovs-dpctl ct-get-maxconns. The userspace datapath does support this feature.
+m4_define([CHECK_CT_DPIF_SET_GET_MAXCONNS])
+
+# CHECK_CT_DPIF_GET_NCONNS()
+#
+# Perform requirements checks for running ovs-dpctl ct-get-nconns. The
+# userspace datapath does support this feature.
+m4_define([CHECK_CT_DPIF_GET_NCONNS])
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..538c0d15d556
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+m4_include([tests/system-common-macros.at])
+
+m4_include([tests/system-afxdp-traffic.at])
+m4_include([tests/system-ovn.at])
diff --git a/tests/system-afxdp-traffic.at b/tests/system-afxdp-traffic.at
new file mode 100644
index 000000000000..26f72acf48ef
--- /dev/null
+++ b/tests/system-afxdp-traffic.at
@@ -0,0 +1,978 @@ 
+AT_BANNER([AF_XDP netdev datapath-sanity])
+
+AT_SETUP([datapath - ping between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+ulimit -l unlimited
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
+ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
+ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan tunnel])
+OVS_CHECK_VXLAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [id 0 dstport 4789])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan6 tunnel])
+OVS_CHECK_VXLAN_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
+                   [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over gre tunnel])
+OVS_CHECK_GRE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over erspan v1 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=1 options:erspan_idx=7])
+ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 1 erspan 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over erspan v2 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([erspan], [br0], [at_erspan0], [172.31.1.1], [10.1.1.100/24], [options:key=1 options:erspan_ver=2 options:erspan_dir=1 options:erspan_hwid=0x7])
+ADD_NATIVE_TUNNEL([erspan], [ns_erspan0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [seq key 1 erspan_ver 2 erspan_dir egress erspan_hwid 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.92/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+dnl NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+NS_CHECK_EXEC([at_ns0], [ping -s 1200 -i 0.3 -c 3 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over ip6erspan v1 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
+AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
+                [options:key=123 options:erspan_ver=1 options:erspan_idx=0x7])
+ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
+                   [10.1.1.1/24], [local fc00:100::1 seq key 123 erspan_ver 1 erspan 7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over ip6erspan v2 tunnel])
+OVS_CHECK_GRE()
+OVS_CHECK_ERSPAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00:100::1/96", [], [], nodad)
+AT_CHECK([ip addr add dev br-underlay "fc00:100::100/96" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([ip6erspan], [br0], [at_erspan0], [fc00:100::1], [10.1.1.100/24],
+                [options:key=121 options:erspan_ver=2 options:erspan_dir=0 options:erspan_hwid=0x7])
+ADD_NATIVE_TUNNEL6([ip6erspan], [ns_erspan0], [at_ns0], [fc00:100::100],
+                   [10.1.1.1/24],
+                   [local fc00:100::1 seq key 121 erspan_ver 2 erspan_dir ingress erspan_hwid 0x7])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00:100::1/96 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 2 fc00:100::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:100::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve tunnel])
+OVS_CHECK_GENEVE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [vni 0])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add 172.31.1.100/24 br-underlay], [0], [OK
+])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve6 tunnel])
+OVS_CHECK_GENEVE_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH_AFXDP(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
+                   [vni 0 udp6zerocsumtx udp6zerocsumrx])
+
+AT_CHECK([ovs-appctl ovs/route/add 10.1.1.100/24 br0], [0], [OK
+])
+AT_CHECK([ovs-appctl ovs/route/add fc00::100/64 br-underlay], [0], [OK
+])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - clone action])
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
+                    -- set interface ovs-p1 ofport_request=2])
+
+AT_DATA([flows.txt], [dnl
+priority=1 actions=NORMAL
+priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
+priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - basic truncate action])
+AT_SKIP_IF([test $HAVE_NC = no])
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-ofctl del-flows br0])
+
+dnl Create p0 and ovs-p0(1)
+ADD_NAMESPACES(at_ns0)
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
+NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
+
+dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
+AT_CHECK([ip link add p1 type veth peer name ovs-p1])
+on_exit 'ip link del ovs-p1'
+AT_CHECK([ip link set dev ovs-p1 up])
+AT_CHECK([ip link set dev p1 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
+dnl Use p1 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
+
+dnl Create p2(5) and ovs-p2(4)
+AT_CHECK([ip link add p2 type veth peer name ovs-p2])
+on_exit 'ip link del ovs-p2'
+AT_CHECK([ip link set dev ovs-p2 up])
+AT_CHECK([ip link set dev p2 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
+dnl Use p2 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
+
+dnl basic test
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl use this file as payload file for ncat
+AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
+on_exit 'rm -f payload200.bin'
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl packet with truncated size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" |  sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=100
+])
+dnl packet with original size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=242
+])
+
+dnl more complicated output actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
+
+dnl SLOW_ACTION: disable kernel datapath truncate support
+dnl Repeat the test above, but exercise the SLOW_ACTION code path
+AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
+
+dnl SLOW_ACTION test1: check datapatch actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
+AT_CHECK([tail -3 stdout], [0],
+[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
+This flow is handled by the userspace slow path because it:
+  - Uses action(s) not supported by datapath.
+])
+
+dnl SLOW_ACTION test2: check actual packet truncate
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+
+AT_BANNER([conntrack])
+
+AT_SETUP([conntrack - controller])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(commit),controller
+priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
+priority=100,in_port=2,ct_state=+trk+est,udp,action=controller
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl Send an unsolicited reply from port 2. This should be dropped.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
+
+dnl OK, now start a new connection from port 1.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000'])
+
+dnl Now try a reply from port 2.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000'])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN2 (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
+NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2,ip,in_port=2 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - force commit])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(force,commit),controller
+priority=100,in_port=2,ct_state=-trk,udp,action=ct(table=0)
+priority=100,in_port=2,ct_state=+trk+est,udp,action=ct(force,commit,table=1)
+table=1,in_port=2,ct_state=+trk,udp,action=controller
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl Send an unsolicited reply from port 2. This should be dropped.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+dnl OK, now start a new connection from port 1.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+
+dnl Now try a reply from port 2.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN2 (xid=0x0): cookie=0x0 total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
+NXT_PACKET_IN2 (xid=0x0): table_id=1 cookie=0x0 total_len=42 ct_state=new|trk,ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1,ip,in_port=2 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
+])
+
+dnl
+dnl Check that the directionality has been changed by force commit.
+dnl
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [], [dnl
+udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2)
+])
+
+dnl OK, now send another packet from port 1 and see that it switches again
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
+udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - ct flush by 5-tuple])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,udp,action=ct(commit),2
+priority=100,in_port=2,udp,action=ct(zone=5,commit),1
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,action=ct(zone=5,commit),1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Test UDP from port 1
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000200080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [], [dnl
+udp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),reply=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1)
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack 'ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=17,ct_tp_src=2,ct_tp_dst=1'])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], [1], [dnl
+])
+
+dnl Test UDP from port 2
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=2 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101020a0101010002000100080000 actions=resubmit(,0)"])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [dnl
+udp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=2,dport=1),reply=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=2),zone=5
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 'ct_nw_src=10.1.1.1,ct_nw_dst=10.1.1.2,ct_nw_proto=17,ct_tp_src=1,ct_tp_dst=2'])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+])
+
+dnl Test ICMP traffic
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [0], [stdout])
+AT_CHECK([cat stdout | FORMAT_CT(10.1.1.1)], [0],[dnl
+icmp,orig=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=8,code=0),reply=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=0,code=0),zone=5
+])
+
+ICMP_ID=`cat stdout | cut -d ',' -f4 | cut -d '=' -f2`
+ICMP_TUPLE=ct_nw_src=10.1.1.2,ct_nw_dst=10.1.1.1,ct_nw_proto=1,icmp_id=$ICMP_ID,icmp_type=8,icmp_code=0
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=5 $ICMP_TUPLE])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.2,"], [1], [dnl
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv4 ping])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
+priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+dnl Pings from ns1->ns0 should fail.
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 2 10.1.1.1 | FORMAT_PING], [0], [dnl
+7 packets transmitted, 0 received, 100% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - get_nconns and get/set_maxconns])
+CHECK_CONNTRACK()
+CHECK_CT_DPIF_SET_GET_MAXCONNS()
+CHECK_CT_DPIF_GET_NCONNS()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=100,in_port=1,icmp,action=ct(commit),2
+priority=100,in_port=2,icmp,ct_state=-trk,action=ct(table=0)
+priority=100,in_port=2,icmp,ct_state=+trk+est,action=1
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.2)], [0], [dnl
+icmp,orig=(src=10.1.1.1,dst=10.1.1.2,id=<cleared>,type=8,code=0),reply=(src=10.1.1.2,dst=10.1.1.1,id=<cleared>,type=0,code=0)
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: maxconns missing or malformed (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns a], [2], [], [dnl
+ovs-vswitchd: maxconns missing or malformed (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns one-bad-dp 10], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns one-bad-dp], [2], [], [dnl
+ovs-vswitchd: datapath not found (Invalid argument)
+ovs-appctl: ovs-vswitchd: server returned an error
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
+1
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+3000000
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-set-maxconns 10], [], [dnl
+setting maxconns successful
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+10
+])
+
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-nconns], [], [dnl
+0
+])
+
+AT_CHECK([ovs-appctl dpctl/ct-get-maxconns], [], [dnl
+10
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv6 ping])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH_AFXDP(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH_AFXDP(p1, at_ns1, br0, "fc00::2/96")
+
+AT_DATA([flows.txt], [dnl
+
+dnl ICMPv6 echo request and reply go to table 1.  The rest of the traffic goes
+dnl through normal action.
+table=0,priority=10,icmp6,icmp_type=128,action=goto_table:1
+table=0,priority=10,icmp6,icmp_type=129,action=goto_table:1
+table=0,priority=1,action=normal
+
+dnl Allow everything from ns0->ns1. Only allow return traffic from ns1->ns0.
+table=1,priority=100,in_port=1,icmp6,action=ct(commit),2
+table=1,priority=100,in_port=2,icmp6,ct_state=-trk,action=ct(table=0)
+table=1,priority=100,in_port=2,icmp6,ct_state=+trk+est,action=1
+table=1,priority=1,action=drop
+])
+
+AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
+
+dnl The above ping creates state in the connection tracker.  We're not
+dnl interested in that state.
+AT_CHECK([ovs-appctl dpctl/flush-conntrack])
+
+dnl Pings from ns1->ns0 should fail.
+NS_CHECK_EXEC([at_ns1], [ping6 -q -c 3 -i 0.3 -w 2 fc00::1 | FORMAT_PING], [0], [dnl
+7 packets transmitted, 0 received, 100% packet loss, time 0ms
+])
+
+dnl Pings from ns0->ns1 should work fine.
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(fc00::2)], [0], [dnl
+icmpv6,orig=(src=fc00::1,dst=fc00::2,id=<cleared>,type=128,code=0),reply=(src=fc00::2,dst=fc00::1,id=<cleared>,type=129,code=0)
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP