
[ovs-dev,PATCHv10] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1559070064-7211-1-git-send-email-u9012063@gmail.com
State Superseded

Commit Message

William Tu May 28, 2019, 7:01 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improve)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6-v7
- rebase to master, commit abf11558c1515bf3b1
- address feedbacks from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7-v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak

v8-v9:
- rebase to master 180bbbed3a3867d52
- Address review feedback from Ben, Ilya and Eelco, at:
  https://patchwork.ozlabs.org/patch/1097740/
- == From Ilya ==
- Optimize the reconfiguration logic
- Implement .rxq_recv and .send for afxdp
- Remove system-afxdp-traffic.at, reuse existing code
- Use Ilya's rdtsc code
- remove --disable-system
- == From Eelco ==
- Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
  assertion !fd != !wevent failed
- Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
- Clear xdp program when receive signal, ctrl+c
- Add options to vswitch.xml, set xdpmode default to skb-mode
- No support for ARM and PPC, now x86_64 only
- remove redundant header includes and function/macro definitions
- remove some ifdef HAVE_AF_XDP
- == From others/both about afxdp rx and tx ==
- Several umem push/pop error handling improvement/fixes
- add lock to address concurrent_txq case
- improve error handling
- add stats
- Things that are not done yet
- MTU limitation
- n_txq_desc/n_rxq_desc option.

v9-v10
- remove x86_64 limitation, suggested by Ben and Eelco
- add xmalloc_pagealign, free_pagealign
- minor refactor
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  35 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  14 +
 lib/dp-packet.c                       |  28 ++
 lib/dp-packet.h                       |  18 +-
 lib/dpif-netdev-perf.h                |  28 ++
 lib/netdev-afxdp.c                    | 850 ++++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  74 +++
 lib/netdev-linux-private.h            | 139 ++++++
 lib/netdev-linux.c                    | 121 ++---
 lib/netdev-provider.h                 |   3 +
 lib/netdev.c                          |  11 +
 lib/spinlock.h                        |  70 +++
 lib/util.c                            |  43 ++
 lib/util.h                            |   5 +
 lib/xdpsock.c                         | 179 +++++++
 lib/xdpsock.h                         | 101 ++++
 tests/automake.mk                     |  16 +
 tests/system-afxdp-macros.at          |  20 +
 tests/system-afxdp-testsuite.at       |  26 ++
 vswitchd/vswitch.xml                  |  15 +
 25 files changed, 2150 insertions(+), 83 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/spinlock.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

Comments

Ilya Maximets May 30, 2019, 3:57 p.m. UTC | #1
On 28.05.2019 22:01, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see the Linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improve)
> - remove malloc dp_packet, instead put dp_packet in umem
> 
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> 
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
> 
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
> 
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
> 
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedbacks from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
> 
> v7-v8:
> - Address feedback from Ilya at:
>   https://patchwork.ozlabs.org/patch/1095019/
> - add netdev-linux-private.h
> - fix afxdp reconfigure issue
> - sort include headers
> - remove unnecessary OVS_UNUSED
> - coding style fixes
> - error case handling and memory leak
> 
> v8-v9:
> - rebase to master 180bbbed3a3867d52
> - Address review feedback from Ben, Ilya and Eelco, at:
>   https://patchwork.ozlabs.org/patch/1097740/
> - == From Ilya ==
> - Optimize the reconfiguration logic
> - Implement .rxq_recv and .send for afxdp
> - Remove system-afxdp-traffic.at, reuse existing code
> - Use Ilya's rdtsc code
> - remove --disable-system
> - == From Eelco ==
> - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
>   assertion !fd != !wevent failed
> - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
> - Clear xdp program when receive signal, ctrl+c
> - Add options to vswitch.xml, set xdpmode default to skb-mode
> - No support for ARM and PPC, now x86_64 only
> - remove redundant header includes and function/macro definitions
> - remove some ifdef HAVE_AF_XDP
> - == From others/both about afxdp rx and tx ==
> - Several umem push/pop error handling improvement/fixes
> - add lock to address concurrent_txq case
> - improve error handling
> - add stats
> - Things that are not done yet
> - MTU limitation
> - n_txq_desc/n_rxq_desc option.
> 
> v9-v10
> - remove x86_64 limitation, suggested by Ben and Eelco
> - add xmalloc_pagealign, free_pagealign
> - minor refactor
> ---
>  Documentation/automake.mk             |   1 +
>  Documentation/index.rst               |   1 +
>  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
>  Documentation/intro/install/index.rst |   1 +
>  acinclude.m4                          |  35 ++
>  configure.ac                          |   1 +
>  lib/automake.mk                       |  14 +
>  lib/dp-packet.c                       |  28 ++
>  lib/dp-packet.h                       |  18 +-
>  lib/dpif-netdev-perf.h                |  28 ++
>  lib/netdev-afxdp.c                    | 850 ++++++++++++++++++++++++++++++++++
>  lib/netdev-afxdp.h                    |  74 +++
>  lib/netdev-linux-private.h            | 139 ++++++
>  lib/netdev-linux.c                    | 121 ++---
>  lib/netdev-provider.h                 |   3 +
>  lib/netdev.c                          |  11 +
>  lib/spinlock.h                        |  70 +++
>  lib/util.c                            |  43 ++
>  lib/util.h                            |   5 +
>  lib/xdpsock.c                         | 179 +++++++
>  lib/xdpsock.h                         | 101 ++++
>  tests/automake.mk                     |  16 +
>  tests/system-afxdp-macros.at          |  20 +
>  tests/system-afxdp-testsuite.at       |  26 ++
>  vswitchd/vswitch.xml                  |  15 +
>  25 files changed, 2150 insertions(+), 83 deletions(-)
>  create mode 100644 Documentation/intro/install/afxdp.rst
>  create mode 100644 lib/netdev-afxdp.c
>  create mode 100644 lib/netdev-afxdp.h
>  create mode 100644 lib/netdev-linux-private.h
>  create mode 100644 lib/spinlock.h
>  create mode 100644 lib/xdpsock.c
>  create mode 100644 lib/xdpsock.h
>  create mode 100644 tests/system-afxdp-macros.at
>  create mode 100644 tests/system-afxdp-testsuite.at
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>  
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..a2bff5733d0a
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,433 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It aims to have comparable
> +performance to DPDK but cooperate better with the existing kernel networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see the Linux kernel's
> +Documentation/networking/af_xdp.rst.
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, e.g., system, tap, or
> +dpdk.  The AF_XDP feature adds a new netdev type called
> +"afxdp", and implements its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, called xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the afxdp netdev re-uses the existing userspace
> +dpif-netdev datapath.  As a result, most of the packet processing
> +happens in userspace instead of in the Linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> +  This is an OVS indepedent benchmark tools for AF_XDP.

typo: s/indepedent/independent/

> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
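A quick way to check the three required options above before building could look like the sketch below. The helper name check_kconfig is hypothetical and not part of the patch, and the sketch assumes the kernel config is available as a plain-text file, which depends on the distribution:

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): report which of the required
# AF_XDP Kconfig options are not set to "y" in a plain-text kernel config.
check_kconfig() {
    local cfg="$1" opt missing=0
    for opt in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_XDP_SOCKETS; do
        # Match the whole line so CONFIG_BPF does not match CONFIG_BPF_SYSCALL.
        if ! grep -q "^${opt}=y\$" "$cfg"; then
            echo "missing: ${opt}=y"
            missing=1
        fi
    done
    return "$missing"
}
```

On systems that install the config under /boot (an assumption), it could be called as check_kconfig "/boot/config-$(uname -r)".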
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +Frist, clone a recent version of Linux bpf-next tree::

s/Frist/First/

> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure having 5.0+ kernel
> +  make check-afxdp TESTSUITEFLAGS='1'
> +
> +If a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> +
> +
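The "5.0+ kernel" check in the autotest step could be scripted instead of eyeballing uname output. A minimal sketch, assuming the release string starts with "major.minor"; kernel_at_least is a hypothetical name, not something the patch provides:

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): succeed if kernel release $1
# (e.g. "5.0.0-37-generic") is at least version $2.$3.
kernel_at_least() {
    local rel="$1" want_major="$2" want_minor="$3" major minor
    major="${rel%%.*}"              # text before the first dot
    minor="${rel#*.}"               # text after the first dot
    minor="${minor%%[!0-9]*}"       # keep only the leading digits
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example: kernel_at_least "$(uname -r)" 5 0 && make check-afxdp TESTSUITEFLAGS='1'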
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd ...
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> +
> +.. note::
> +   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
> +
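It may be worth double-checking the masks in the two examples above. 0x10 has only bit 4 set, which matches "core 4". 0x36 has bits 1, 2, 4 and 5 set, while the affinity string "0:1,1:2,2:3,3:4" names cores 1 through 4; a mask covering exactly cores 1-4 would be 0x1e. A small sketch for deriving the mask from a core list (cores_to_mask is a hypothetical helper, not part of the patch):

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): build a pmd-cpu-mask value
# from a list of core IDs by setting one bit per core.
cores_to_mask() {
    local mask=0 core
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '0x%x\n' "$mask"
}
```

For example: ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask="$(cores_to_mask 1 2 3 4)"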
> +To validate that the bridge has been successfully instantiated, you can use::
> +
> +  ovs-vsctl show
> +
> +Should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debugging by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> +section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +about AF_XDP current and future work.
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing PMD
> +to keep polling the AF_XDP queues without any interferences from kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPUs with the isolcpus kernel parameter in GRUB.
> +
> +#. IRQs should not be assigned to the cores running PMD threads.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +
> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use the Linux perf tool to see where your CPU
> +spends its cycles::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's NUMA ID is always 0; need a way to find the NUMA ID from a netdev.
> +#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer. A
> +   possible work-around is to use the OpenFlow meter action.
> +#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
> +#. Most of the tests are done using a single i40e port. Multiple ports and
> +   the ixgbe driver also need to be tested.
> +#. No latency test results (TODO item).
> +
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +The performance number I got is around 1.6Mpps.
> +This is due to using the kernel's tap interface, which requires copying
> +packet into kernel from the umem buffer in userspace.
> +
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> +
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1
> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> +
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>  
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index f8fc5bcd7b4c..b9eacd7c0f3c 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -221,6 +221,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>  
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
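For anyone hitting one of the AC_MSG_ERROR cases above, the four header paths the macro tests can be looked up by hand first. This is only a sketch: it checks that the files exist under an include root and does not try to compile them the way AC_CHECK_HEADER does. check_afxdp_headers is a hypothetical name, not part of the patch:

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): check that the four headers
# tested by OVS_CHECK_LINUX_AF_XDP exist under an include root such as
# /usr/include or /usr/local/include.
check_afxdp_headers() {
    local root="$1" hdr rc=0
    for hdr in bpf/libbpf.h linux/if_xdp.h bpf/xsk.h bpf/libbpf_util.h; do
        if [ ! -f "$root/$hdr" ]; then
            echo "unable to find $hdr under $root"
            rc=1
        fi
    done
    return "$rc"
}
```

For example: check_afxdp_headers /usr/local/include || check_afxdp_headers /usr/include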
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 505e3d041e93..29c90b73f836 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index cc5dccf39d6b..b31e28f6e1f5 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-tc-offloads.c \
>  	lib/netdev-tc-offloads.h \
>  	lib/netlink-conntrack.c \
> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h \
> +	lib/spinlock.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..e6a7947076b4 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -19,6 +19,7 @@
>  #include <string.h>
>  
>  #include "dp-packet.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>  
> +#if HAVE_AF_XDP
> +/* Initialize 'b' as an empty dp_packet that contains
> + * memory starting at AF_XDP umem base.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
>          }
>      }
>  }
> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>  
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>  
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..e3438226e360 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,7 @@
>  #include <rte_mbuf.h>
>  #endif
>  
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>  
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +91,13 @@ struct dp_packet {
>      };
>  };
>  
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>  
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>  
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..a33b9a7353ba 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -21,6 +21,7 @@
>  #include <stddef.h>
>  #include <stdint.h>
>  #include <string.h>
> +#include <time.h>
>  #include <math.h>
>  
>  #ifdef DPDK_NETDEV
> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>      char *log_reason;
>  };
>  
> +#ifdef HAVE_AF_XDP

I'd like to change this to "#ifdef __linux__".
'clock_gettime' is POSIX-compliant, but CLOCK_MONOTONIC_RAW is
Linux-specific.

> +static inline uint64_t
> +rdtsc_syscall(struct pmd_perf_stats *s)
> +{
> +    struct timespec val;
> +    uint64_t v;
> +
> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> +       return s->last_tsc = 0;

Maybe it's better to just return the value and let the caller assign it?
This way you won't need to pass any arguments here.
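
A rough sketch of what I mean, not tested against the tree.
'struct pmd_perf_stats_sketch' and 'cycles_counter_update_sketch' are
stand-ins so that the snippet compiles on its own; only 'rdtsc_syscall'
and 'last_tsc' come from the patch:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* Exposes CLOCK_MONOTONIC_RAW under strict -std. */
#endif
#include <stdint.h>
#include <time.h>

/* Stand-in for the real 'struct pmd_perf_stats' (assumption). */
struct pmd_perf_stats_sketch {
    uint64_t last_tsc;
};

/* Returns the raw monotonic time in nanoseconds, or 0 on failure.
 * Takes no arguments: it no longer touches the stats structure. */
static inline uint64_t
rdtsc_syscall(void)
{
    struct timespec val;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
        return 0;
    }
    return (uint64_t) val.tv_sec * 1000000000ULL + (uint64_t) val.tv_nsec;
}

/* The caller owns the assignment to 'last_tsc'. */
static inline uint64_t
cycles_counter_update_sketch(struct pmd_perf_stats_sketch *s)
{
    return s->last_tsc = rdtsc_syscall();
}
```

The failure path still reports 0, as in the patch, but the helper is
now usable without a stats pointer.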

> +    }
> +
> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> +    v += (uint64_t) val.tv_nsec;
> +
> +    return s->last_tsc = v;
> +}
> +#endif
> +
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>   * These functions are intended to be invoked in the context of pmd threads. */
>  
> @@ -198,6 +217,15 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif defined(HAVE_AF_XDP) && defined(__x86_64__)

And this should be:
#elif !defined(_MSC_VER) && defined(__x86_64__)

Visual Studio doesn't support inline assembly this way.
Everything else is portable as long as we're on x86_64.

> +    /* This is x86-specific instructions. */
> +    uint32_t h, l;
> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> +
> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> +#elif defined(HAVE_AF_XDP)

#elif defined(__linux__)
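
Put together, the guard chain could look roughly like the sketch below.
It is untested in the tree, the DPDK branch is left out, and
'cycles_now_sketch' is a made-up name used only so the snippet stands
alone:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* Exposes CLOCK_MONOTONIC_RAW under strict -std. */
#endif
#include <stdint.h>
#if defined(__linux__)
#include <time.h>
#endif

static inline uint64_t
cycles_now_sketch(void)
{
#if !defined(_MSC_VER) && defined(__x86_64__)
    /* GCC-style inline assembly; not accepted by Visual Studio. */
    uint32_t h, l;

    __asm__ __volatile__("rdtsc" : "=a" (l), "=d" (h));
    return ((uint64_t) h << 32) | l;
#elif defined(__linux__)
    /* Other architectures on Linux fall back to the syscall. */
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        return 0;
    }
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
#else
    /* No cycle source known for this platform. */
    return 0;
#endif
}
```

None of the branches depends on HAVE_AF_XDP, which is the point of the
suggestion.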

> +    /* non-x86_64 architecture uses syscall */
> +    return rdtsc_syscall(s);
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..e20ee31c00f3
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,850 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <stdlib.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#include "dp-packet.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "packets.h"
> +#include "socket-util.h"
> +#include "spinlock.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static uint32_t prog_id;
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +static int xsk_configure_all(struct netdev *netdev);
> +static void xsk_destroy_all(struct netdev *netdev);
> +
> +static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
> +                                                int xdpmode)
> +{
> +    struct xsk_umem_config uconfig OVS_UNUSED;
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof(*umem));

No need to parenthesize the argument of 'sizeof'.

> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +    if (ret) {
> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("umem_pool_init failed");
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("xpacket_pool_init failed");
> +        umem_pool_cleanup(&umem->mpool);
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> +                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> +    }
> +
> +    for (i = 0;
> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS - BATCH_SIZE);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +
> +    /* umem memory region */
> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free_pagealign(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free_pagealign(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +static int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk;
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    /* configure each queue */
> +    for (i = 0; i < netdev->n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            goto err;
> +        }
> +        dev->xsk[i] = xsk;
> +        xsk->rx_dropped = 0;
> +        xsk->tx_dropped = 0;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk)
> +{
> +    struct xsk_umem *umem;
> +
> +    if (!xsk) {
> +        return;
> +    }
> +
> +    umem = xsk->umem->umem;
> +    xsk_socket__delete(xsk->xsk);
> +    if (xsk_umem__delete(umem)) {
> +        VLOG_ERR("xsk_umem__delete failed");
> +    }
> +
> +    /* free the packet buffer */
> +    free_pagealign(xsk->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> +
> +    free(xsk->umem);
> +    free(xsk);
> +}
> +
> +static void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < MAX_XSKQ; i++) {
> +        if (dev->xsk[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsk[i]);
> +            dev->xsk[i] = NULL;
> +            dev->xsk[i]->rx_dropped = 0;
> +            dev->xsk[i]->tx_dropped = 0;

This dereferences a pointer that was just set to NULL. Something is
definitely wrong here.
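
Since the counters live inside the structure that xsk_destroy() frees,
and xsk_configure_all() zeroes them on creation anyway, the two resets
can probably just go away. A stand-alone sketch of the loop, with
'struct xsk_sketch' and free() standing in for the real
'struct xsk_socket_info' and xsk_destroy():

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_XSKQ 4           /* Stand-in for MAX_XSKQ. */

/* Stand-in for 'struct xsk_socket_info' (assumption). */
struct xsk_sketch {
    uint64_t rx_dropped;
    uint64_t tx_dropped;
};

static void
xsk_destroy_all_sketch(struct xsk_sketch *xsk[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (xsk[i]) {
            free(xsk[i]);           /* Stands in for xsk_destroy(). */
            xsk[i] = NULL;          /* Nothing is dereferenced after this. */
        }
    }
}
```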

> +        }
> +    }
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static inline void OVS_UNUSED
> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *str_xdpmode;
> +    int xdpmode, new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> +        return EINVAL;
> +    }
> +
> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> +    if (!strcasecmp(str_xdpmode, "drv")) {
> +        xdpmode = XDP_ZEROCOPY;
> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> +        xdpmode = XDP_COPY;
> +    } else {
> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> +                 netdev_get_name(netdev), str_xdpmode);
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (dev->requested_n_rxq != new_n_rxq
> +        || dev->requested_xdpmode != xdpmode) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        dev->requested_xdpmode = xdpmode;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    netdev->n_rxq = dev->requested_n_rxq;
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +static void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t curr_prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    }
> +    if (prog_id == curr_prog_id) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    } else if (!curr_prog_id) {
> +        VLOG_INFO("couldn't find a prog id on a given interface");
> +    } else {
> +        VLOG_INFO("program on interface changed, not removing");
> +    }
> +}
> +
> +void
> +signal_remove_xdp(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    VLOG_WARN("force remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    unsigned long addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +static void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +    struct dp_packet_afxdp *xpacket = NULL;
> +    struct dp_packet *packet;
> +    void *elems[BATCH_SIZE];
> +    unsigned long addr;
> +
> +   /* all packets are AF_XDP, so handles its own delete in batch */

This comment should be somewhere else.

BTW, shift right by 1 space.

> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (xpacket->mpool) {
> +            void *base = dp_packet_base(packet);
> +
> +            addr = (unsigned long)base & (~FRAME_SHIFT_MASK);

Shouldn't it be uintptr_t? Probably in some other places too.
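
For illustration only, a helper along these lines would keep the
pointer arithmetic in one place. The names and the 2048-byte frame size
are assumptions for this sketch; the real code would use
FRAME_SHIFT_MASK from the patch:

```c
#include <stdint.h>

#define SKETCH_FRAME_SIZE 2048      /* Assumed power-of-two frame size. */
#define SKETCH_FRAME_MASK ((uintptr_t) SKETCH_FRAME_SIZE - 1)

/* Rounds 'p' down to the start of the frame that contains it.
 * Only meaningful if the umem buffer itself is frame-aligned. */
static inline void *
frame_base_sketch(const void *p)
{
    return (void *) ((uintptr_t) p & ~SKETCH_FRAME_MASK);
}
```

'unsigned long' happens to be pointer-sized on the Linux ABIs this code
targets, but 'uintptr_t' states the intent and does not rely on that.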

> +            elems[i] = (void *)addr;
> +        }
> +    }
> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +    dp_packet_batch_init(batch);
> +}
> +
> +int
> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +                      int *qfill)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev *netdev = rx->up.netdev;
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct umem_elem *elems[BATCH_SIZE];
> +    uint32_t idx_rx = 0, idx_fq = 0;
> +    struct xsk_socket_info *xsk;
> +    int qid = rxq_->queue_id;
> +    unsigned int rcvd, i;
> +    int ret = 0;
> +
> +    xsk = dev->xsk[qid];
> +    rx->fd = xsk_socket__fd(xsk->xsk);
> +
> +    /* See if there is any packet on RX queue,
> +     * if yes, idx_rx is the index having the packet.
> +     */
> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;
> +    }
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {

We need to return rx buffers to mpool before releasing.
Otherwise they will be lost.

           for (i = 0; i < rcvd; i++) {
               /* Descriptors are indexed from 'idx_rx', not from 0. */
               uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx,
                                                      idx_rx + i)->addr;

               elems[i] = xsk_umem__get_data(xsk->umem->buffer, addr);
           }
           umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);

Please re-check the above code snippet before using it.

> +        xsk_ring_cons__release(&xsk->rx, rcvd);
> +        xsk->rx_dropped += rcvd;
> +        return ENOMEM;
> +    }
> +
> +    /* Prepare for the FILL queue */
> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> +         * to move received packets from FILL queue to RX queue.
> +         */
> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);

Same here.

> +        xsk_ring_cons__release(&xsk->rx, rcvd);
> +        xsk->rx_dropped += rcvd;
> +        return ENOMEM;
> +    }
> +
> +    /* Setup a dp_packet batch from descriptors in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> +        packet = &xpacket->packet;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> +
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        /* Get one free umem, program it into FILL queue */
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> +
> +    if (qfill) {
> +        /* TODO: return the number of remaining packets in the queue. */
> +        *qfill = 0;
> +    }
> +
> +#ifdef AFXDP_DEBUG
> +    log_xsk_stat(xsk);
> +#endif
> +    return 0;
> +}
> +
> +static inline int
> +kick_tx(struct xsk_socket_info *xsk)
> +{
> +    int ret;
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +static inline bool
> +check_free_batch(struct dp_packet_batch *batch)
> +{
> +    struct umem_pool *first_mpool = NULL;
> +    struct dp_packet_afxdp *xpacket;
> +    struct dp_packet *packet;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (packet->source != DPBUF_AFXDP) {
> +            return false;
> +        }
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (i == 0) {
> +            first_mpool = xpacket->mpool;
> +            continue;
> +        }
> +        if (xpacket->mpool != first_mpool) {
> +            return false;
> +        }
> +    }
> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> +    return true;
> +}
> +
> +static inline void
> +afxdp_complete_tx(struct xsk_socket_info *xsk)
> +{
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t idx_cq = 0;
> +    int tx_done, j, ret;
> +
> +    if (!xsk->outstanding_tx) {
> +        return;
> +    }
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +    }
> +
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +        xsk->outstanding_tx -= tx_done;
> +    }
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +
> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> +}
> +
> +int
> +netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> +                        struct dp_packet_batch *batch,
> +                        bool concurrent_txq)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev_);
> +    struct xsk_socket_info *xsk = dev->xsk[qid];
> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct dp_packet *packet;
> +    bool free_batch = true;
> +    uint32_t idx = 0;
> +    int error = 0;
> +    int ret;
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        ovs_spin_lock(&dev->tx_lock);

Using the same lock for all queues will produce a lot of unnecessary
contention. It's better to allocate an array of locks, one per tx queue.
You may re-allocate it in the reconfigure() implementation.
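
A minimal sketch of the idea, using a C11 atomic_flag as a stand-in for
the spinlock from lib/spinlock.h. All names here are hypothetical and
nothing below is taken from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* One lock per tx queue (stand-in for the OVS spinlock type). */
struct txq_lock_sketch {
    atomic_flag locked;
};

/* Allocates 'n_txq' unlocked locks; would be called from reconfigure(). */
static struct txq_lock_sketch *
txq_locks_alloc_sketch(size_t n_txq)
{
    struct txq_lock_sketch *locks = malloc(n_txq * sizeof *locks);

    if (!locks) {
        return NULL;
    }
    for (size_t i = 0; i < n_txq; i++) {
        atomic_flag_clear(&locks[i].locked);
    }
    return locks;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool
txq_trylock_sketch(struct txq_lock_sketch *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

static void
txq_lock_sketch(struct txq_lock_sketch *l)
{
    while (!txq_trylock_sketch(l)) {
        /* Spin until the current holder releases the lock. */
    }
}

static void
txq_unlock_sketch(struct txq_lock_sketch *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The send path would then take 'locks[qid]' instead of the device-wide
lock, so senders on different queues never wait on each other. Padding
each lock to a cache line might also be worth considering.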

> +    }
> +
> +    /* Process CQ first. */
> +    afxdp_complete_tx(xsk);
> +
> +    free_batch = check_free_batch(batch);
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        xsk->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        xsk->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * TODO: avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);

Are we really able to re-use these buffers? They are already in the tx
ring and will probably be sent on the next kick_tx().

> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +    }
> +
> +out:
> +    if (free_batch) {
> +        free_afxdp_buf_batch(batch);
> +    } else {
> +        dp_packet_delete_batch(batch, true);
> +    }
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        ovs_spin_unlock(&dev->tx_lock);
> +    }
> +    return error;
> +}
> +
> +int
> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> +{
> +   /* Done at reconfigure */
> +   return 0;
> +}
> +
> +void
> +netdev_afxdp_destruct(struct netdev *netdev_)
> +{
> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> +    /* Note: tc is by-passed when using drv-mode, but when using
> +     * skb-mode, we might need to clean up tc. */
> +
> +    xsk_destroy_all(netdev_);
> +    ovs_mutex_destroy(&netdev->mutex);
> +}
> +
> +int
> +netdev_afxdp_get_stats(const struct netdev *netdev_,

You don't need an underscore here.

> +                       struct netdev_stats *stats)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev_);
> +    struct netdev_stats dev_stats;
> +    struct xsk_socket_info *xsk;
> +    int error, i;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    error = get_stats_via_netlink(netdev_, &dev_stats);
> +    if (error) {
> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> +    } else {
> +        /* Use kernel netdev's packet and byte counts */
> +        stats->rx_packets = dev_stats.rx_packets;
> +        stats->rx_bytes = dev_stats.rx_bytes;
> +        stats->tx_packets = dev_stats.tx_packets;
> +        stats->tx_bytes = dev_stats.tx_bytes;
> +
> +        stats->rx_errors           += dev_stats.rx_errors;
> +        stats->tx_errors           += dev_stats.tx_errors;
> +        stats->rx_dropped          += dev_stats.rx_dropped;
> +        stats->tx_dropped          += dev_stats.tx_dropped;
> +        stats->multicast           += dev_stats.multicast;
> +        stats->collisions          += dev_stats.collisions;
> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> +
> +        /* Account the dropped in each xsk */
> +        for (i = 0; i < MAX_XSKQ; i++) {

i < netdev_n_rxq(netdev)

> +            xsk = dev->xsk[i];
> +            if (xsk) {
> +                stats->rx_dropped += xsk->rx_dropped;
> +                stats->tx_dropped += xsk->tx_dropped;
> +            }
> +        }
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +
> +    return error;
> +}
> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> new file mode 100644
> index 000000000000..dd2dc1a2064d
> --- /dev/null
> +++ b/lib/netdev-afxdp.h
> @@ -0,0 +1,74 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_AFXDP_H
> +#define NETDEV_AFXDP_H 1
> +
> +#include <config.h>
> +
> +#ifdef HAVE_AF_XDP
> +
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +/* These functions are Linux AF_XDP specific, so they should be used directly
> + * only by Linux-specific code. */
> +
> +#define MAX_XSKQ 16
> +
> +struct netdev;
> +struct xsk_socket_info;
> +struct xdp_umem;
> +struct dp_packet_batch;
> +struct smap;
> +struct dp_packet;
> +struct netdev_rxq;
> +struct netdev_stats;
> +
> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> +void netdev_afxdp_destruct(struct netdev *netdev_);
> +
> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> +                          struct dp_packet_batch *batch,
> +                          int *qfill);
> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> +                            struct dp_packet_batch *batch,
> +                            bool concurrent_txq);
> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                            char **errp);
> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> +                           struct netdev_stats *stats);
> +
> +void free_afxdp_buf(struct dp_packet *p);
> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> +void signal_remove_xdp(struct netdev *netdev);
> +
> +#else /* !HAVE_AF_XDP */
> +
> +#include "openvswitch/compiler.h"
> +
> +struct dp_packet;
> +
> +static inline void
> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> +{
> +    /* Nothing */
> +}
> +
> +#endif /* HAVE_AF_XDP */
> +#endif /* netdev-afxdp.h */
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> new file mode 100644
> index 000000000000..d43f79e6aa41
> --- /dev/null
> +++ b/lib/netdev-linux-private.h
> @@ -0,0 +1,139 @@
> +/*
> + * Copyright (c) 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_LINUX_PRIVATE_H
> +#define NETDEV_LINUX_PRIVATE_H 1
> +
> +#include <config.h>
> +
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include "netdev-afxdp.h"
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"
> +#include "xdpsock.h"
> +
> +/* These functions are Linux specific, so they should be used directly only by
> + * Linux-specific code. */
> +
> +struct netdev;
> +
> +struct netdev_rxq_linux {
> +    struct netdev_rxq up;
> +    bool is_tap;
> +    int fd;
> +};
> +
> +void netdev_linux_run(const struct netdev_class *);
> +
> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> +                                  const char *flag_name, bool enable);
> +
> +int get_stats_via_netlink(const struct netdev *netdev_,
> +                          struct netdev_stats *stats);
> +
> +struct netdev_linux {
> +    struct netdev up;
> +
> +    /* Protects all members below. */
> +    struct ovs_mutex mutex;
> +
> +    unsigned int cache_valid;
> +
> +    bool miimon;                    /* Link status of last poll. */
> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> +    struct timer miimon_timer;
> +
> +    int netnsid;                    /* Network namespace ID. */
> +    /* The following are figured out "on demand" only.  They are only valid
> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> +    int ifindex;
> +    struct eth_addr etheraddr;
> +    int mtu;
> +    unsigned int ifi_flags;
> +    long long int carrier_resets;
> +    uint32_t kbits_rate;        /* Policing data. */
> +    uint32_t kbits_burst;
> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> +                                   0 or an errno value. */
> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> +                                 * or SIOCSIFMTU.
> +                                 */
> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> +    int netdev_policing_error;  /* Cached error code from set policing. */
> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> +
> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> +
> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> +    struct tc *tc;
> +
> +    /* For devices of class netdev_tap_class only. */
> +    int tap_fd;
> +    bool present;               /* If the device is present in the namespace */
> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> +
> +    /* LAG information. */
> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> +
> +    /* AF_XDP information */
> +#ifdef HAVE_AF_XDP
> +    struct xsk_socket_info *xsk[MAX_XSKQ];

You could allocate this array dynamically, based on n_rxq, while performing
reconfiguration. That would also remove the limit on the number of rxqs.
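
A rough sketch of what I mean. 'struct afxdp_queues' and both helpers are
hypothetical stand-ins for the AF_XDP fields of 'struct netdev_linux', and
the sketch assumes the old sockets were already destroyed before resizing:

```c
#include <assert.h>
#include <stdlib.h>

struct xsk_socket_info;             /* Opaque here; defined by the patch. */

/* Hypothetical stand-in for the AF_XDP part of 'struct netdev_linux'. */
struct afxdp_queues {
    struct xsk_socket_info **xsk;   /* One slot per rx queue. */
    int n_rxq;
};

/* Replaces the slot array with a zeroed one sized for 'n_rxq' queues.
 * Returns 0 on success, -1 if the allocation fails (state is unchanged). */
static int
afxdp_queues_resize(struct afxdp_queues *q, int n_rxq)
{
    struct xsk_socket_info **xsk;

    assert(n_rxq > 0);
    xsk = calloc(n_rxq, sizeof *xsk);
    if (!xsk) {
        return -1;
    }
    free(q->xsk);                   /* Old sockets must be gone already. */
    q->xsk = xsk;
    q->n_rxq = n_rxq;
    return 0;
}

static void
afxdp_queues_free(struct afxdp_queues *q)
{
    free(q->xsk);
    q->xsk = NULL;
    q->n_rxq = 0;
}
```

With this, loops over the sockets would run up to 'n_rxq' and MAX_XSKQ
would no longer be needed.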

> +    int requested_n_rxq;
> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> +    int xdp_flags, xdp_bind_flags;
> +    ovs_spinlock_t tx_lock;

This should also be an array (one lock per tx queue) to avoid unnecessary
contention.
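
Something along these lines, as a sketch only. 'toy_spinlock' and
'toy_txq_locks' are hypothetical names, and the lock is a bare C11 atomic
rather than the spinlock from lib/spinlock.h:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical minimal spinlock, standing in for the one in the patch. */
struct toy_spinlock {
    atomic_int locked;
};

/* Hypothetical per-device state: one lock per tx queue. */
struct toy_txq_locks {
    struct toy_spinlock *locks;
    int n_txq;
};

static int
toy_txq_locks_init(struct toy_txq_locks *t, int n_txq)
{
    assert(n_txq > 0);
    t->locks = malloc(n_txq * sizeof *t->locks);
    if (!t->locks) {
        return -1;
    }
    for (int i = 0; i < n_txq; i++) {
        atomic_init(&t->locks[i].locked, 0);
    }
    t->n_txq = n_txq;
    return 0;
}

/* Returns true if the lock of queue 'qid' was taken. */
static bool
toy_txq_trylock(struct toy_txq_locks *t, int qid)
{
    int expected = 0;

    assert(qid >= 0 && qid < t->n_txq);
    return atomic_compare_exchange_strong_explicit(&t->locks[qid].locked,
                                                   &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static void
toy_txq_lock(struct toy_txq_locks *t, int qid)
{
    while (!toy_txq_trylock(t, qid)) {
        /* Spin: only senders on the same queue ever wait here. */
    }
}

static void
toy_txq_unlock(struct toy_txq_locks *t, int qid)
{
    atomic_store_explicit(&t->locks[qid].locked, 0, memory_order_release);
}

static void
toy_txq_locks_destroy(struct toy_txq_locks *t)
{
    free(t->locks);
    t->locks = NULL;
    t->n_txq = 0;
}
```

Threads sending on different queues then never touch the same lock.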

> +#endif
> +};
> +
> +static bool
> +is_netdev_linux_class(const struct netdev_class *netdev_class)
> +{
> +    return netdev_class->run == netdev_linux_run;
> +}
> +
> +static struct netdev_linux *
> +netdev_linux_cast(const struct netdev *netdev)
> +{
> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> +
> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +}
> +
> +static struct netdev_rxq_linux *
> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> +{
> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> +
> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> +}
> +
> +#endif /* netdev-linux-private.h */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index f75d73fd39f8..2883cf1f2586 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -17,6 +17,7 @@
>  #include <config.h>
>  
>  #include "netdev-linux.h"
> +#include "netdev-linux-private.h"
>  
>  #include <errno.h>
>  #include <fcntl.h>
> @@ -54,6 +55,7 @@
>  #include "fatal-signal.h"
>  #include "hash.h"
>  #include "openvswitch/hmap.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-provider.h"
>  #include "netdev-tc-offloads.h"
>  #include "netdev-vport.h"
> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>  
> -struct netdev_linux {
> -    struct netdev up;
> -
> -    /* Protects all members below. */
> -    struct ovs_mutex mutex;
> -
> -    unsigned int cache_valid;
> -
> -    bool miimon;                    /* Link status of last poll. */
> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> -    struct timer miimon_timer;
> -
> -    int netnsid;                    /* Network namespace ID. */
> -    /* The following are figured out "on demand" only.  They are only valid
> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> -    int ifindex;
> -    struct eth_addr etheraddr;
> -    int mtu;
> -    unsigned int ifi_flags;
> -    long long int carrier_resets;
> -    uint32_t kbits_rate;        /* Policing data. */
> -    uint32_t kbits_burst;
> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> -                                   0 or an errno value. */
> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> -    int netdev_policing_error;  /* Cached error code from set policing. */
> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> -
> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> -
> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> -    struct tc *tc;
> -
> -    /* For devices of class netdev_tap_class only. */
> -    int tap_fd;
> -    bool present;               /* If the device is present in the namespace */
> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> -
> -    /* LAG information. */
> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> -};
> -
> -struct netdev_rxq_linux {
> -    struct netdev_rxq up;
> -    bool is_tap;
> -    int fd;
> -};
>  
>  /* This is set pretty low because we probably won't learn anything from the
>   * additional log messages. */
> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>   * changes in the device miimon status, so we can use atomic_count. */
>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>  
> -static void netdev_linux_run(const struct netdev_class *);
> -
>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>                                     int cmd, const char *cmd_name);
>  static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
>                         struct in_addr addr);
>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
>  static int af_packet_sock(void);
>  static bool netdev_linux_miimon_enabled(void);
>  static void netdev_linux_miimon_run(void);
> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
>  
>  static bool
> -is_netdev_linux_class(const struct netdev_class *netdev_class)
> -{
> -    return netdev_class->run == netdev_linux_run;
> -}
> -
> -static bool
>  is_tap_netdev(const struct netdev *netdev)
>  {
>      return netdev_get_class(netdev) == &netdev_tap_class;
>  }
> -
> -static struct netdev_linux *
> -netdev_linux_cast(const struct netdev *netdev)
> -{
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> -
> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> -}
> -
> -static struct netdev_rxq_linux *
> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> -{
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> -}
>  
>  static int
>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
>      }
>  }
>  
> -static void
> +void
>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
>  {
>      struct nl_sock *sock;
> @@ -3279,9 +3206,7 @@ exit:
>      .run = netdev_linux_run,                                    \
>      .wait = netdev_linux_wait,                                  \
>      .alloc = netdev_linux_alloc,                                \
> -    .destruct = netdev_linux_destruct,                          \
>      .dealloc = netdev_linux_dealloc,                            \
> -    .send = netdev_linux_send,                                  \
>      .send_wait = netdev_linux_send_wait,                        \
>      .set_etheraddr = netdev_linux_set_etheraddr,                \
>      .get_etheraddr = netdev_linux_get_etheraddr,                \
> @@ -3312,10 +3237,8 @@ exit:
>      .arp_lookup = netdev_linux_arp_lookup,                      \
>      .update_flags = netdev_linux_update_flags,                  \
>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
> -    .rxq_construct = netdev_linux_rxq_construct,                \
>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
> -    .rxq_recv = netdev_linux_rxq_recv,                          \
>      .rxq_wait = netdev_linux_rxq_wait,                          \
>      .rxq_drain = netdev_linux_rxq_drain
>  
> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "system",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_linux_get_stats,
>      .get_features = netdev_linux_get_features,
>      .get_status = netdev_linux_get_status,
> -    .get_block_id = netdev_linux_get_block_id
> +    .get_block_id = netdev_linux_get_block_id,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
>  
>  const struct netdev_class netdev_tap_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      .type = "tap",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct_tap,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_tap_get_stats,
>      .get_features = netdev_linux_get_features,
>      .get_status = netdev_linux_get_status,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
>  
>  const struct netdev_class netdev_internal_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "internal",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_internal_get_stats,
>      .get_status = netdev_internal_get_status,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
> +
> +#ifdef HAVE_AF_XDP
> +const struct netdev_class netdev_afxdp_class = {
> +    NETDEV_LINUX_CLASS_COMMON,
> +    .type = "afxdp",
> +    .is_pmd = true,
> +    .construct = netdev_linux_construct,
> +    .destruct = netdev_afxdp_destruct,
> +    .get_stats = netdev_afxdp_get_stats,
> +    .get_status = netdev_linux_get_status,
> +    .set_config = netdev_afxdp_set_config,
> +    .get_config = netdev_afxdp_get_config,
> +    .reconfigure = netdev_afxdp_reconfigure,
> +    .get_numa_id = netdev_afxdp_get_numa_id,
> +    .send = netdev_afxdp_batch_send,
> +    .rxq_construct = netdev_afxdp_rxq_construct,
> +    .rxq_recv = netdev_afxdp_rxq_recv,
> +};
> +#endif
>  
>  
>  #define CODEL_N_QUEUES 0x0000
> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
>      dst->tx_window_errors = src->tx_window_errors;
>  }
>  
> -static int
> +int
>  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
>  {
>      struct ofpbuf request;
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e6e8e8..91e6a9e2bfc0 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
>  extern const struct netdev_class netdev_internal_class;
>  extern const struct netdev_class netdev_tap_class;
>  
> +#ifdef HAVE_AF_XDP
> +extern const struct netdev_class netdev_afxdp_class;
> +#endif
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6f0946..0fac117cc602 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  
>  static void restore_all_flags(void *aux OVS_UNUSED);
>  void update_device_args(struct netdev *, const struct shash *args);
> +#ifdef HAVE_AF_XDP
> +void signal_remove_xdp(struct netdev *netdev);
> +#endif
>  
>  int
>  netdev_n_txq(const struct netdev *netdev)
> @@ -146,6 +149,9 @@ netdev_initialize(void)
>          netdev_register_provider(&netdev_internal_class);
>          netdev_register_provider(&netdev_tap_class);
>          netdev_vport_tunnel_register();
> +#ifdef HAVE_AF_XDP
> +        netdev_register_provider(&netdev_afxdp_class);
> +#endif
>  #endif
>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>          netdev_register_provider(&netdev_tap_class);
> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
>                                                 saved_flags & ~saved_values,
>                                                 &old_flags);
>          }
> +#ifdef HAVE_AF_XDP
> +        if (netdev->netdev_class == &netdev_afxdp_class) {
> +            signal_remove_xdp(netdev);
> +        }
> +#endif
>      }
>  }
>  
> diff --git a/lib/spinlock.h b/lib/spinlock.h
> new file mode 100644
> index 000000000000..17d79f217410
> --- /dev/null
> +++ b/lib/spinlock.h
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#ifndef SPINLOCK_H
> +#define SPINLOCK_H 1
> +
> +#include <config.h>
> +
> +#include <ctype.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +
> +#include "ovs-atomic.h"
> +
> +typedef struct {

It's probably better not to use 'typedef' here. OVS usually doesn't use
typedefs for structures, unions and enums.
For example, we have no typedef for 'struct ovs_mutex'.
So, this should be just 'struct ovs_spinlock'.

We could also add annotations such as OVS_LOCKABLE and the clang
thread-safety annotations: OVS_ACQUIRES, OVS_TRY_LOCK, OVS_RELEASES.
However, this could be done later.
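
For illustration, a sketch of how that could look. To keep it buildable
outside the tree, it uses plain C11 atomics instead of ovs-atomic.h, and
the annotation macros are defined as empty stand-ins here (an assumption;
in OVS they would come from openvswitch/compiler.h):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Empty stand-ins so that the sketch builds on its own. */
#ifndef OVS_LOCKABLE
#define OVS_LOCKABLE
#define OVS_ACQUIRES(...)
#define OVS_RELEASES(...)
#define OVS_TRY_LOCK(...)
#endif

struct OVS_LOCKABLE ovs_spinlock {
    atomic_int locked;
};

/* Annotations go on the prototypes, the definitions follow below. */
static inline void ovs_spin_lock(struct ovs_spinlock *sl) OVS_ACQUIRES(sl);
static inline void ovs_spin_unlock(struct ovs_spinlock *sl) OVS_RELEASES(sl);
static inline bool ovs_spin_trylock(struct ovs_spinlock *sl)
    OVS_TRY_LOCK(1, sl);

static inline void
ovs_spinlock_init(struct ovs_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

static inline void
ovs_spin_lock(struct ovs_spinlock *sl)
{
    int exp = 0;

    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Wait with relaxed loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
        exp = 0;
    }
}

static inline void
ovs_spin_unlock(struct ovs_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}

/* Returns true if the lock was taken. */
static inline bool
ovs_spin_trylock(struct ovs_spinlock *sl)
{
    int exp = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}
```

Users would then declare 'struct ovs_spinlock tx_lock;' instead of
'ovs_spinlock_t tx_lock;'.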

> +    atomic_int locked;
> +} ovs_spinlock_t;
> +
> +static inline void
> +ovs_spinlock_init(ovs_spinlock_t *sl)
> +{
> +    atomic_init(&sl->locked, 0);
> +}
> +
> +static inline void
> +ovs_spin_lock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0, locked = 0;
> +
> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed)) {
> +        locked = 1;
> +        while (locked) {
> +            atomic_read_relaxed(&sl->locked, &locked);
> +        }
> +        exp = 0;
> +    }
> +}
> +
> +static inline void
> +ovs_spin_unlock(ovs_spinlock_t *sl)
> +{
> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> +}
> +
> +static inline int OVS_UNUSED

I'm not sure we need the OVS_UNUSED annotation, since this is in a header now.

> +ovs_spin_trylock(ovs_spinlock_t *sl)
> +{
> +    int exp = 0;
> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed);
> +}
> +#endif
> diff --git a/lib/util.c b/lib/util.c
> index 5679232ffc5f..060b1e287bce 100644
> --- a/lib/util.c
> +++ b/lib/util.c
> @@ -277,6 +277,49 @@ free_cacheline(void *p)
>  #endif
>  }
>  
> +#ifdef HAVE_AF_XDP

I don't think we need the 'ifdef' here.

How about renaming 'xmalloc_cacheline' to 'xmalloc_size_align' and
making it allocate memory that is aligned to a specified size and sits
in dedicated cache lines?

Then implement two functions on top of it:
void *
xmalloc_cacheline(size_t size)
{
    return xmalloc_size_align(size, CACHE_LINE_SIZE);
}
void *
xmalloc_pagealign(size_t size)
{
    return xmalloc_size_align(size, get_page_size());
}
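
A self-contained sketch of that split, covering only the posix_memalign()
case. The 'sketch_' names and the 64-byte cache line are assumptions made
so that it builds outside the tree; in OVS this would use CACHE_LINE_SIZE,
get_page_size() and out_of_memory() instead:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* For posix_memalign() and sysconf(). */
#endif

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define SKETCH_CACHE_LINE_SIZE 64   /* Assumed value for this sketch. */

/* Allocates 'size' bytes aligned to 'alignment', which must be a power of
 * two and at least sizeof(void *).  Aborts on failure, like the x*()
 * allocators do. */
static void *
sketch_xmalloc_size_align(size_t size, size_t alignment)
{
    void *p;

    assert(alignment >= sizeof(void *));
    assert((alignment & (alignment - 1)) == 0);
    if (posix_memalign(&p, alignment, size ? size : 1)) {
        abort();
    }
    return p;
}

static void *
sketch_xmalloc_cacheline(size_t size)
{
    return sketch_xmalloc_size_align(size, SKETCH_CACHE_LINE_SIZE);
}

static void *
sketch_xmalloc_pagealign(size_t size)
{
    return sketch_xmalloc_size_align(size, (size_t) sysconf(_SC_PAGESIZE));
}
```

Memory obtained this way can be released with plain free(), so a single
free function would be enough for both wrappers in this case.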

> +void *
> +xmalloc_pagealign(size_t size)
> +{
> +#ifdef HAVE_POSIX_MEMALIGN
> +    void *p;
> +    int error;
> +
> +    COVERAGE_INC(util_xalloc);
> +    error = posix_memalign(&p, get_page_size(), size ? size : 1);
> +    if (error != 0) {
> +        out_of_memory();
> +    }
> +    return p;
> +#else
> +    /* Similar to xmalloc_cacheline, but replace
> +     * CACHE_LINE_SIZE with get_page_size() */
> +    void *p = xmalloc((get_page_size() - 1)
> +                      + sizeof(void *)
> +                      + ROUND_UP(size, get_page_size()));

I don't think you need to round the size up to the page size here.
Rounding up to CACHE_LINE_SIZE is probably enough.
There is no point in allocating that much extra memory.
The code below should be re-checked too.
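
For comparison, a sketch of a fallback that pays only for the alignment
slack plus one stored pointer. Note that it uses the simpler "skip one
pointer, then align up" layout instead of the 'runt' logic in the patch,
and the 'sketch_' names and the 64-byte cache line are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_CACHE_LINE_SIZE 64   /* Assumed value for this sketch. */

static size_t
sketch_round_up(size_t x, size_t y)
{
    return (x + y - 1) / y * y;
}

/* Returns memory aligned to 'alignment' (a power of two, at least
 * sizeof(void *)) using only malloc().  The payload is rounded up to a
 * cache line, not to 'alignment', so the overhead stays at
 * 'alignment - 1 + sizeof(void *)' bytes. */
static void *
sketch_malloc_aligned(size_t size, size_t alignment)
{
    size_t payload = sketch_round_up(size ? size : 1, SKETCH_CACHE_LINE_SIZE);
    uintptr_t aligned;
    char *raw;

    assert(alignment >= sizeof(void *));
    assert((alignment & (alignment - 1)) == 0);

    raw = malloc(alignment - 1 + sizeof(void *) + payload);
    if (!raw) {
        abort();
    }

    /* Leave room for one pointer, then move up to the next boundary. */
    aligned = ((uintptr_t) raw + sizeof(void *) + alignment - 1)
              & ~(uintptr_t) (alignment - 1);

    /* Remember the original pointer just below the aligned block. */
    ((void **) aligned)[-1] = raw;
    return (void *) aligned;
}

static void
sketch_free_aligned(void *p)
{
    if (p) {
        free(((void **) p)[-1]);
    }
}
```

For a 4096-byte alignment and a 100-byte request, this asks malloc() for
4095 + sizeof(void *) + 128 bytes, where rounding the size up to a page
would add a full 4096 bytes for the payload alone.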

> +    bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *);
> +    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? get_page_size() : 0),
> +                                get_page_size());
> +    void **q = (void **) r - 1;
> +    *q = p;
> +    return r;
> +#endif
> +}
> +
> +void
> +free_pagealign(void *p)
> +{
> +#ifdef HAVE_POSIX_MEMALIGN
> +    free(p);
> +#else
> +    if (p) {
> +        void **q = (void **) p - 1;
> +        free(*q);
> +    }
> +#endif
> +}
> +#endif
> +
>  char *
>  xasprintf(const char *format, ...)
>  {
> diff --git a/lib/util.h b/lib/util.h
> index 53354f1c6f0f..3cd8cf87fba8 100644
> --- a/lib/util.h
> +++ b/lib/util.h
> @@ -163,6 +163,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size);
>  
>  int string_ends_with(const char *str, const char *suffix);
>  
> +#ifdef HAVE_AF_XDP
> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> +void free_pagealign(void *);
> +#endif
> +
>  /* The C standards say that neither the 'dst' nor 'src' argument to
>   * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
>   * the null case. */
> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> new file mode 100644
> index 000000000000..ffdb54dfcd27
> --- /dev/null
> +++ b/lib/xdpsock.c
> @@ -0,0 +1,179 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#include <config.h>
> +
> +#include "xdpsock.h"
> +#include "dp-packet.h"
> +#include "openvswitch/compiler.h"
> +
> +/* Note:
> + * umem_elem_push* shouldn't overflow because we always pop
> + * elem first, then push back to the stack.
> + */
> +static inline void
> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(ptr, addrs, n * sizeof(void *));
> +    umemp->index += n;
> +}
> +
> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    ovs_spin_lock(&umemp->mutex);
> +    __umem_elem_push_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +}
> +
> +static inline void
> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +    if (OVS_UNLIKELY(umemp->index + 1) > umemp->size) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    umemp->array[umemp->index++] = addr;
> +}
> +
> +void
> +umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +
> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    __umem_elem_push(umemp, addr);
> +    ovs_spin_unlock(&umemp->mutex);
> +}
> +
> +static inline int
> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->index -= n;
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(addrs, ptr, n * sizeof(void *));
> +
> +    return 0;
> +}
> +
> +int
> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ret;
> +}
> +
> +static inline void *
> +__umem_elem_pop(struct umem_pool *umemp)
> +{
> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> +        return NULL;
> +    }
> +
> +    return umemp->array[--umemp->index];
> +}
> +
> +void *
> +umem_elem_pop(struct umem_pool *umemp)
> +{
> +    void *ptr;
> +
> +    ovs_spin_lock(&umemp->mutex);
> +    ptr = __umem_elem_pop(umemp);
> +    ovs_spin_unlock(&umemp->mutex);
> +
> +    return ptr;
> +}
> +
> +static void **
> +__umem_pool_alloc(unsigned int size)
> +{
> +    void *bufs;
> +    int ret;
> +
> +    ret = posix_memalign(&bufs, getpagesize(),
> +                         size * sizeof(void *));

Use xmalloc_pagealign() here?

> +    if (ret) {
> +        return NULL;
> +    }
> +
> +    memset(bufs, 0, size * sizeof(void *));
> +    return (void **)bufs;
> +}
> +
> +int
> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> +{
> +    umemp->array = __umem_pool_alloc(size);
> +    if (!umemp->array) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->size = size;
> +    umemp->index = 0;
> +    ovs_spinlock_init(&umemp->mutex);
> +    return 0;
> +}
> +
> +void
> +umem_pool_cleanup(struct umem_pool *umemp)
> +{
> +    free(umemp->array);

Use free_pagealign() here?

> +    umemp->array = NULL;
> +}
> +
> +/* AF_XDP metadata init/destroy */
> +int
> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> +{
> +    void *bufs;
> +    int ret;
> +
> +    ret = posix_memalign(&bufs, getpagesize(),
> +                         size * sizeof(struct dp_packet_afxdp));
> +    if (ret) {
> +        return -ENOMEM;
> +    }
> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> +
> +    xp->array = bufs;
> +    xp->size = size;
> +    return 0;
> +}
> +
> +void
> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> +{
> +    free(xp->array);
> +    xp->array = NULL;
> +}
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> new file mode 100644
> index 000000000000..72578e383812
> --- /dev/null
> +++ b/lib/xdpsock.h
> @@ -0,0 +1,101 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef XDPSOCK_H
> +#define XDPSOCK_H 1
> +
> +#include <config.h>
> +
> +#ifdef HAVE_AF_XDP
> +
> +#include <bpf/xsk.h>
> +#include <errno.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "spinlock.h"
> +
> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> +
> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> +
> +/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
> + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
> + */
> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> +
> +#define BATCH_SIZE      NETDEV_MAX_BURST
> +
> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
> +
> +/* LIFO ptr_array */
> +struct umem_pool {
> +    int index;      /* point to top */
> +    unsigned int size;
> +    ovs_spinlock_t mutex;

It's a bit confusing to name it a 'mutex'. Sounds like it's 'ovs_mutex'.
Probably, it'll be better to name it 'spinlock' or just 'lock'.

> +    void **array;   /* a pointer array, point to umem buf */
> +};
> +
> +/* array-based dp_packet_afxdp */
> +struct xpacket_pool {
> +    unsigned int size;
> +    struct dp_packet_afxdp **array;
> +};
> +
> +struct xsk_umem_info {
> +    struct umem_pool mpool;
> +    struct xpacket_pool xpool;
> +    struct xsk_ring_prod fq;
> +    struct xsk_ring_cons cq;
> +    struct xsk_umem *umem;
> +    void *buffer;
> +};
> +
> +struct xsk_socket_info {
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_umem_info *umem;
> +    struct xsk_socket *xsk;
> +    unsigned long rx_dropped;
> +    unsigned long tx_dropped;
> +    uint32_t outstanding_tx;
> +};
> +
> +struct umem_elem {
> +    struct umem_elem *next;
> +};
> +
> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void *umem_elem_pop(struct umem_pool *umemp);
> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> +void umem_pool_cleanup(struct umem_pool *umemp);
> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> +
> +#endif
> +#endif
> diff --git a/tests/automake.mk b/tests/automake.mk
> index bc906fb79b46..7db64faabc71 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>  	$(SYSTEM_TESTSUITE_AT) \
>  	$(SYSTEM_KMOD_TESTSUITE_AT) \
>  	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
> +	$(SYSTEM_AFXDP_TESTSUITE_AT) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>  	$(SYSTEM_DPDK_TESTSUITE_AT) \
>  	$(OVSDB_CLUSTER_TESTSUITE_AT) \
>  	$(TESTSUITE) \
>  	$(SYSTEM_KMOD_TESTSUITE) \
>  	$(SYSTEM_USERSPACE_TESTSUITE) \
> +	$(SYSTEM_AFXDP_TESTSUITE) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE) \
>  	$(SYSTEM_DPDK_TESTSUITE) \
>  	$(OVSDB_CLUSTER_TESTSUITE) \
> @@ -159,6 +161,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>  	tests/system-userspace-macros.at \
>  	tests/system-userspace-packet-type-aware.at
>  
> +SYSTEM_AFXDP_TESTSUITE_AT = \
> +	tests/system-afxdp-testsuite.at \
> +	tests/system-afxdp-macros.at
> +
>  SYSTEM_TESTSUITE_AT = \
>  	tests/system-common-macros.at \
>  	tests/system-ovn.at \
> @@ -183,6 +189,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>  SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> @@ -316,6 +323,11 @@ check-system-userspace: all
>  	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>  
> +check-afxdp: all
> +	$(MAKE) install
> +	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> +	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> +
>  check-offloads: all
>  	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> @@ -353,6 +365,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
>  
> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> +	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> +	$(AM_V_at)mv $@.tmp $@
> +
>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> new file mode 100644
> index 000000000000..1e6f7a46b4b7
> --- /dev/null
> +++ b/tests/system-afxdp-macros.at
> @@ -0,0 +1,20 @@
> +# Add port to ovs bridge by using afxdp mode.
> +# This will use generic XDP support in the veth driver.
> +m4_define([ADD_VETH],
> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> +      CONFIGURE_VETH_OFFLOADS([$1])
> +      AT_CHECK([ip link set $1 netns $2])
> +      AT_CHECK([ip link set dev ovs-$1 up])
> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> +      if test -n "$5"; then
> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> +      fi
> +      if test -n "$6"; then
> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> +      fi
> +      on_exit 'ip link del ovs-$1'
> +    ]
> +)
> diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
> new file mode 100644
> index 000000000000..9b7a29066614
> --- /dev/null
> +++ b/tests/system-afxdp-testsuite.at
> @@ -0,0 +1,26 @@
> +AT_INIT
> +
> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
> +
> +Licensed under the Apache License, Version 2.0 (the "License");
> +you may not use this file except in compliance with the License.
> +You may obtain a copy of the License at:
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> +
> +Unless required by applicable law or agreed to in writing, software
> +distributed under the License is distributed on an "AS IS" BASIS,
> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +See the License for the specific language governing permissions and
> +limitations under the License.])
> +
> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> +
> +m4_include([tests/ovs-macros.at])
> +m4_include([tests/ovsdb-macros.at])
> +m4_include([tests/ofproto-macros.at])
> +m4_include([tests/system-common-macros.at])
> +m4_include([tests/system-userspace-macros.at])
> +m4_include([tests/system-afxdp-macros.at])
> +
> +m4_include([tests/system-traffic.at])
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 08001dbce3d3..6195a8fd41cf 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3082,6 +3082,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="other_config" key="xdpmode"
> +              type='{"type": "string",
> +                     "enum": ["set", ["skb", "drv"]]}'>
> +        <p>
> +          Specifies the operational mode of the XDP program.
> +          If "drv", the XDP program is loaded into the device driver with
> +          zero-copy RX and TX enabled. This mode requires device driver with
> +          AF_XDP support and has the best performance.
> +          If "skb", the XDP program is using generic XDP mode in kernel with
> +          extra data copying between userspace and kernel. No device driver
> +          support is needed. Note that this is afxdp netdev type only.
> +          Defaults to "skb" mode.
> +        </p>
> +      </column>
> +
>        <column name="options" key="vhost-server-path"
>                type='{"type": "string"}'>
>          <p>
>
William Tu June 2, 2019, 1:43 p.m. UTC | #2
Hi Ilya,

Thanks for your review.

On Thu, May 30, 2019 at 8:57 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 28.05.2019 22:01, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > index 859c05613ddf..a33b9a7353ba 100644
> > --- a/lib/dpif-netdev-perf.h
> > +++ b/lib/dpif-netdev-perf.h
> > @@ -21,6 +21,7 @@
> >  #include <stddef.h>
> >  #include <stdint.h>
> >  #include <string.h>
> > +#include <time.h>
> >  #include <math.h>
> >
> >  #ifdef DPDK_NETDEV
> > @@ -186,6 +187,24 @@ struct pmd_perf_stats {
> >      char *log_reason;
> >  };
> >
> > +#ifdef HAVE_AF_XDP
>
> I'd like to change this to "#ifdef __linux__".
> 'clock_gettime' is posix compliant, but CLOCK_MONOTONIC_RAW is
> Linux specific.

Yes, thanks, will do it.

>
> > +static inline uint64_t
> > +rdtsc_syscall(struct pmd_perf_stats *s)
> > +{
> > +    struct timespec val;
> > +    uint64_t v;
> > +
> > +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> > +       return s->last_tsc = 0;
>
> Maybe it's better to just return the value and allow caller to assign?

Do you mean just:
       return s->last_tsc;

> This way you'll not need to pass any arguments here.

I don't understand, I still need to pass &val, right?
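
One possible reading of the suggestion, sketched below: the helper takes no arguments at all, keeps 'val' as a local, and simply returns the timestamp, while the caller does the assignment to s->last_tsc. The struct and the "_sketch" names are stand-ins for illustration only and are not taken from the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for CLOCK_MONOTONIC_RAW under strict -std= */
#endif
#include <stdint.h>
#include <time.h>

/* Stand-in for struct pmd_perf_stats; only the field used here. */
struct pmd_perf_stats_sketch {
    uint64_t last_tsc;
};

/* No argument needed: 'val' stays local and the value is just returned. */
static inline uint64_t
rdtsc_syscall_sketch(void)
{
    struct timespec val;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
        return 0;
    }
    return (uint64_t) val.tv_sec * 1000000000ULL + (uint64_t) val.tv_nsec;
}

/* The caller owns the assignment to last_tsc. */
static inline uint64_t
cycles_counter_update_sketch(struct pmd_perf_stats_sketch *s)
{
    return s->last_tsc = rdtsc_syscall_sketch();
}
```

With this shape the helper no longer depends on struct pmd_perf_stats, so it could be reused by any caller that needs a monotonic nanosecond counter.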

>
> > +    }
> > +
> > +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> > +    v += (uint64_t) val.tv_nsec;
> > +
> > +    return s->last_tsc = v;
> > +}
> > +#endif
> > +
> >  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> >   * These functions are intended to be invoked in the context of pmd threads. */
> >
> > @@ -198,6 +217,15 @@ cycles_counter_update(struct pmd_perf_stats *s)
> >  {
> >  #ifdef DPDK_NETDEV
> >      return s->last_tsc = rte_get_tsc_cycles();
> > +#elif defined(HAVE_AF_XDP) && defined(__x86_64__)
>
> And this should be:
> #elif !defined(_MSC_VER) && defined(__x86_64__)
>
> Visual Studio doesn't support inline assembly this way.
> Everything else is portable as long as we're on x86_64.

right, thanks!
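
Putting the three guard suggestions together, the selection could look roughly like the sketch below. It is written as a free-standing function (the name read_cycles_sketch is made up here) and leaves out the DPDK branch, so it only illustrates the order of the preprocessor checks, not the final code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for CLOCK_MONOTONIC_RAW under strict -std= */
#endif
#include <stdint.h>
#if defined(__linux__)
#include <time.h>
#endif

static inline uint64_t
read_cycles_sketch(void)
{
#if !defined(_MSC_VER) && defined(__x86_64__)
    /* GCC/Clang style inline assembly; not accepted by Visual Studio. */
    uint32_t h, l;

    __asm__ volatile("rdtsc" : "=a" (l), "=d" (h));
    return ((uint64_t) h << 32) | l;
#elif defined(__linux__)
    /* CLOCK_MONOTONIC_RAW is Linux specific, hence the __linux__ guard. */
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        return 0;
    }
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
#else
    /* No cheap counter available on this platform. */
    return 0;
#endif
}
```

Note that none of the branches depends on HAVE_AF_XDP any more, which is the point of the review comments: the counter is a property of the platform, not of the netdev type.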

>
> > +    /* This is x86-specific instructions. */
> > +    uint32_t h, l;
> > +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> > +
> > +    return s->last_tsc = ((uint64_t) h << 32) | l;
> > +#elif defined(HAVE_AF_XDP)
>
> #elif defined(__linux__)
>
> > +    /* non-x86_64 architecture uses syscall */
> > +    return rdtsc_syscall(s);
> >  #else
> >      return s->last_tsc = 0;
> >  #endif
> > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> > new file mode 100644

<snip>
> > +
> > +static void
> > +xsk_destroy_all(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int i, ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > +
> > +    for (i = 0; i < MAX_XSKQ; i++) {
> > +        if (dev->xsk[i]) {
> > +            VLOG_INFO("destroy xsk[%d]", i);
> > +            xsk_destroy(dev->xsk[i]);
> > +            dev->xsk[i] = NULL;
> > +            dev->xsk[i]->rx_dropped = 0;
> > +            dev->xsk[i]->tx_dropped = 0;
>
> This dereferences a pointer that was just set to NULL. Something is
> definitely wrong here.

oh, thanks a lot.
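
A minimal sketch of one way the loop could look after the fix, assuming xsk_destroy() frees the per-socket state (that function is not shown in this part of the thread, so this is a guess). Under that assumption the two counter resets can simply go away, since the counters live in the structure that was just released. All "_sketch" names are hypothetical stand-ins.

```c
#include <stdlib.h>

#define MAX_XSKQ_SKETCH 16

/* Stand-in for struct xsk_socket_info; only the counters matter here. */
struct xsk_info_sketch {
    unsigned long rx_dropped;
    unsigned long tx_dropped;
};

/* Assumed behaviour of xsk_destroy(): release the socket state. */
static void
xsk_destroy_sketch(struct xsk_info_sketch *xsk)
{
    free(xsk);
}

/* Destroys every configured socket and returns how many were destroyed. */
static int
xsk_destroy_all_sketch(struct xsk_info_sketch *xsks[], int n)
{
    int destroyed = 0;

    for (int i = 0; i < n; i++) {
        if (xsks[i]) {
            xsk_destroy_sketch(xsks[i]);
            /* Clear the slot last; nothing may touch xsks[i]-> after this. */
            xsks[i] = NULL;
            destroyed++;
        }
    }
    return destroyed;
}
```

If the drop counters are meant to survive a reconfigure, they would have to be kept outside the per-socket structure instead.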
>
> > +        }
> > +    }
> > +    VLOG_INFO("remove xdp program");
> > +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > +}
> > +
> > +static inline void OVS_UNUSED
> > +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> > +    struct xdp_statistics stat;
> > +    socklen_t optlen;
> > +
> > +    optlen = sizeof stat;
> > +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> > +               &stat, &optlen) == 0);
> > +
> > +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> > +                stat.rx_dropped,
> > +                stat.rx_invalid_descs,
> > +                stat.tx_invalid_descs);
> > +}
> > +
> > +int
> > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                        char **errp OVS_UNUSED)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    const char *str_xdpmode;
> > +    int xdpmode, new_n_rxq;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > +    if (new_n_rxq > MAX_XSKQ) {
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> > +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> > +        return EINVAL;
> > +    }
> > +
> > +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> > +    if (!strcasecmp(str_xdpmode, "drv")) {
> > +        xdpmode = XDP_ZEROCOPY;
> > +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> > +        xdpmode = XDP_COPY;
> > +    } else {
> > +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> > +                 netdev_get_name(netdev), str_xdpmode);
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        return EINVAL;
> > +    }
> > +
> > +    if (dev->requested_n_rxq != new_n_rxq
> > +        || dev->requested_xdpmode != xdpmode) {
> > +        dev->requested_n_rxq = new_n_rxq;
> > +        dev->requested_xdpmode = xdpmode;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> > +    smap_add_format(args, "xdpmode", "%s",
> > +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_afxdp_reconfigure(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> > +    int err = 0;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    if (netdev->n_rxq == dev->requested_n_rxq
> > +        && dev->xdpmode == dev->requested_xdpmode) {
> > +        goto out;
> > +    }
> > +
> > +    xsk_destroy_all(netdev);
> > +    netdev->n_rxq = dev->requested_n_rxq;
> > +
> > +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> > +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> > +        /* From SKB mode to DRV mode */
> > +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> > +        dev->xdpmode = XDP_ZEROCOPY;
> > +
> > +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> > +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> > +                      ovs_strerror(errno));
> > +        }
> > +    } else {
> > +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> > +        /* From DRV mode to SKB mode */
> > +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +        dev->xdp_bind_flags = XDP_COPY;
> > +        dev->xdpmode = XDP_COPY;
> > +        /* TODO: set rlimit back to previous value
> > +         * when no device is in DRV mode.
> > +         */
> > +    }
> > +
> > +    err = xsk_configure_all(netdev);
> > +    if (err) {
> > +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> > +    }
> > +    netdev_change_seq_changed(netdev);
> > +out:
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return err;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> > +{
> > +    /* FIXME: Get netdev's PCIe device ID, then find
> > +     * its NUMA node id.
> > +     */
> > +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> > +              netdev_get_name(netdev));
> > +    return 0;
> > +}
> > +
> > +static void
> > +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> > +{
> > +    uint32_t curr_prog_id = 0;
> > +    uint32_t flags;
> > +
> > +    /* remove_xdp_program() */
> > +    if (xdpmode == XDP_COPY) {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +    } else {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +    }
> > +
> > +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    }
> > +    if (prog_id == curr_prog_id) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    } else if (!curr_prog_id) {
> > +        VLOG_INFO("couldn't find a prog id on a given interface");
> > +    } else {
> > +        VLOG_INFO("program on interface changed, not removing");
> > +    }
> > +}
> > +
> > +void
> > +signal_remove_xdp(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > +
> > +    VLOG_WARN("force remove xdp program");
> > +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > +}
> > +
> > +static struct dp_packet_afxdp *
> > +dp_packet_cast_afxdp(const struct dp_packet *d)
> > +{
> > +    ovs_assert(d->source == DPBUF_AFXDP);
> > +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> > +}
> > +
> > +void
> > +free_afxdp_buf(struct dp_packet *p)
> > +{
> > +    struct dp_packet_afxdp *xpacket;
> > +    unsigned long addr;
> > +
> > +    xpacket = dp_packet_cast_afxdp(p);
> > +    if (xpacket->mpool) {
> > +        void *base = dp_packet_base(p);
> > +
> > +        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
> > +        umem_elem_push(xpacket->mpool, (void *)addr);
> > +    }
> > +}
> > +
> > +static void
> > +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet_afxdp *xpacket = NULL;
> > +    struct dp_packet *packet;
> > +    void *elems[BATCH_SIZE];
> > +    unsigned long addr;
> > +
> > +   /* all packets are AF_XDP, so handles its own delete in batch */
>
> This comment should be somewhere else.
>
> BTW, shift right by 1 space.
>
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (xpacket->mpool) {
> > +            void *base = dp_packet_base(packet);
> > +
> > +            addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
>
> Shouldn't it be uintptr_t ? Probably in some other places too.

right, I will use uintptr_t here and other places.
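
For illustration, the masking could be wrapped in a small helper along these lines. The shift value 11 (2048-byte frames) and the "_SKETCH" names are assumptions made for the example; the patch takes the real value from XSK_UMEM__DEFAULT_FRAME_SHIFT.

```c
#include <stdint.h>

/* Assumed frame shift for the example: 2048-byte umem frames. */
#define FRAME_SHIFT_SKETCH 11
/* Build the mask in uintptr_t so that '~' never operates on a plain int. */
#define FRAME_MASK_SKETCH  (((uintptr_t) 1 << FRAME_SHIFT_SKETCH) - 1)

/* Rounds a pointer inside a umem frame down to the start of that frame. */
static inline void *
frame_base_sketch(const void *p)
{
    uintptr_t addr = (uintptr_t) p & ~FRAME_MASK_SKETCH;

    return (void *) addr;
}
```

uintptr_t is the integer type intended to hold a converted object pointer, whereas 'unsigned long' is only 32 bits wide on LLP64 platforms such as 64-bit Windows.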

>
> > +            elems[i] = (void *)addr;
> > +        }
> > +    }
> > +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> > +    dp_packet_batch_init(batch);
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> > +                      int *qfill)
> > +{
> > +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> > +    struct netdev *netdev = rx->up.netdev;
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct umem_elem *elems[BATCH_SIZE];
> > +    uint32_t idx_rx = 0, idx_fq = 0;
> > +    struct xsk_socket_info *xsk;
> > +    int qid = rxq_->queue_id;
> > +    unsigned int rcvd, i;
> > +    int ret = 0;
> > +
> > +    xsk = dev->xsk[qid];
> > +    rx->fd = xsk_socket__fd(xsk->xsk);
> > +
> > +    /* See if there is any packet on RX queue,
> > +     * if yes, idx_rx is the index having the packet.
> > +     */
> > +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> > +    if (!rcvd) {
> > +        return 0;
> > +    }
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +    if (OVS_UNLIKELY(ret)) {
>
> We need to return rx buffers to mpool before releasing.
> Otherwise they will be lost.
>
>            for (i = 0; i < rcvd; i++) {
>                uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, i)->addr;
>
>                elems[i] = xsk_umem__get_data(xsk->umem->buffer, addr);
>            }
>            umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>
> Please, re-check above code snippet before using.

good point, thanks
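
While re-checking the snippet: as far as I can tell, xsk_ring_cons__rx_desc() expects the ring index returned by the peek (idx_rx + i), not the bare loop counter 'i'. The sketch below models that with a tiny mock ring so it can be compiled on its own; every "_sketch" type and function is a hypothetical stand-in, not libbpf API, and masking the descriptor address down to the frame start is an assumption about how the pool stores its elements.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE_SKETCH  8u      /* must be a power of two */
#define FRAME_SIZE_SKETCH 2048u   /* assumed umem frame size */

struct rx_desc_sketch {
    uint64_t addr;                /* offset of packet data inside the umem */
    uint32_t len;
};

struct rx_ring_sketch {
    struct rx_desc_sketch descs[RING_SIZE_SKETCH];
    uint32_t prod;                /* free-running producer index */
    uint32_t cons;                /* free-running consumer index */
};

/* LIFO pool of frame pointers; the caller guarantees there is room. */
struct pool_sketch {
    void *slots[RING_SIZE_SKETCH];
    size_t top;
};

/* Mock of a ring peek: reports how many descriptors are ready and where. */
static uint32_t
ring_peek_sketch(const struct rx_ring_sketch *r, uint32_t max, uint32_t *idx)
{
    uint32_t avail = r->prod - r->cons;

    *idx = r->cons;
    return avail < max ? avail : max;
}

/* Drop path: hand every received frame back to the pool, then release. */
static uint32_t
rx_drop_batch_sketch(struct rx_ring_sketch *rx, struct pool_sketch *pool,
                     char *umem_buffer, uint32_t max)
{
    uint32_t idx_rx = 0;
    uint32_t rcvd = ring_peek_sketch(rx, max, &idx_rx);

    for (uint32_t i = 0; i < rcvd; i++) {
        /* Index relative to the peeked position, wrapped to the ring. */
        uint32_t slot = (idx_rx + i) & (RING_SIZE_SKETCH - 1);
        uint64_t addr = rx->descs[slot].addr;

        /* Round down to the frame start before recycling the buffer. */
        addr &= ~(uint64_t) (FRAME_SIZE_SKETCH - 1);
        pool->slots[pool->top++] = umem_buffer + addr;
    }
    rx->cons += rcvd;             /* equivalent of releasing the RX ring */
    return rcvd;
}
```

The same drop path would apply to the FILL-queue-full case further down, in addition to pushing back the elements that were just popped.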

>
> > +        xsk_ring_cons__release(&xsk->rx, rcvd);
> > +        xsk->rx_dropped += rcvd;
> > +        return ENOMEM;
> > +    }
> > +
> > +    /* Prepare for the FILL queue */
> > +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> > +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> > +         * to move received packets from FILL queue to RX queue.
> > +         */
> > +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>
> Same here.
>
> > +        xsk_ring_cons__release(&xsk->rx, rcvd);
> > +        xsk->rx_dropped += rcvd;
> > +        return ENOMEM;
> > +    }
> > +
> > +    /* Setup a dp_packet batch from descriptors in RX queue */
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> > +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > +        uint64_t index;
> > +
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        index = addr >> FRAME_SHIFT;
> > +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> > +        packet = &xpacket->packet;
> > +
> > +        /* Initialize the struct dp_packet */
> > +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> > +        dp_packet_set_size(packet, len);
> > +
> > +        /* Add packet into batch, increase batch->count */
> > +        dp_packet_batch_add(batch, packet);
> > +
> > +        idx_rx++;
> > +    }
> > +    /* Release the RX queue */
> > +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > +
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t index;
> > +        struct umem_elem *elem;
> > +
> > +        /* Get one free umem, program it into FILL queue */
> > +        elem = elems[i];
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> > +
> > +        idx_fq++;
> > +    }
> > +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> > +
> > +    if (qfill) {
> > +        /* TODO: return the number of remaining packets in the queue. */
> > +        *qfill = 0;
> > +    }
> > +
> > +#ifdef AFXDP_DEBUG
> > +    log_xsk_stat(xsk);
> > +#endif
> > +    return 0;
> > +}
> > +
> > +static inline int
> > +kick_tx(struct xsk_socket_info *xsk)
> > +{
> > +    int ret;
> > +
> > +    /* This causes system call into kernel's xsk_sendmsg, and
> > +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> > +     */
> > +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +    if (OVS_UNLIKELY(ret < 0)) {
> > +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> > +            return errno;
> > +        }
> > +    }
> > +    /* no error, or EBUSY or EAGAIN */
> > +    return 0;
> > +}
> > +
> > +static inline bool
> > +check_free_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct umem_pool *first_mpool = NULL;
> > +    struct dp_packet_afxdp *xpacket;
> > +    struct dp_packet *packet;
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        if (packet->source != DPBUF_AFXDP) {
> > +            return false;
> > +        }
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (i == 0) {
> > +            first_mpool = xpacket->mpool;
> > +            continue;
> > +        }
> > +        if (xpacket->mpool != first_mpool) {
> > +            return false;
> > +        }
> > +    }
> > +    /* All packets are DPBUF_AFXDP and from the same mpool */
> > +    return true;
> > +}
> > +
> > +static inline void
> > +afxdp_complete_tx(struct xsk_socket_info *xsk)
> > +{
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t idx_cq = 0;
> > +    int tx_done, j, ret;
> > +
> > +    if (!xsk->outstanding_tx) {
> > +        return;
> > +    }
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +        xsk->outstanding_tx -= tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +}
> > +
> > +int
> > +netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> > +                        struct dp_packet_batch *batch,
> > +                        bool concurrent_txq)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +    struct xsk_socket_info *xsk = dev->xsk[qid];
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct dp_packet *packet;
> > +    bool free_batch = true;
> > +    uint32_t idx = 0;
> > +    int error = 0;
> > +    int ret;
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_lock(&dev->tx_lock);
>
> Using the same lock for all queues will produce a lot of unnecessary
> contention. It's better to allocate an array of locks, one per tx queue.
> You may re-allocate it in the reconfigure() implementation.

Right, will do per tx lock in next version.
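
A rough sketch of what a per-queue lock array might look like. It uses a C11 atomic_flag as a stand-in for ovs_spinlock_t so that it compiles without OVS headers; all names are hypothetical, and the real version would presumably use ovs_spin_init()/ovs_spin_lock() and be sized from the tx queue count in reconfigure().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for ovs_spinlock_t: one flag per tx queue. */
struct txq_lock_sketch {
    atomic_flag held;
};

/* Allocates and initializes one unlocked lock per tx queue. */
static struct txq_lock_sketch *
txq_locks_alloc_sketch(size_t n_txq)
{
    struct txq_lock_sketch *locks;

    if (n_txq == 0) {
        return NULL;
    }
    locks = malloc(n_txq * sizeof *locks);
    if (!locks) {
        return NULL;
    }
    for (size_t i = 0; i < n_txq; i++) {
        atomic_flag_clear(&locks[i].held);
    }
    return locks;
}

/* Returns true if the lock was free and is now held by the caller. */
static inline bool
txq_try_acquire_sketch(struct txq_lock_sketch *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spins until the lock for this queue becomes available. */
static inline void
txq_acquire_sketch(struct txq_lock_sketch *l)
{
    while (!txq_try_acquire_sketch(l)) {
        /* busy-wait; only senders on the same queue contend here */
    }
}

static inline void
txq_release_sketch(struct txq_lock_sketch *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In the send path this would become a lock on the entry for 'qid' instead of the single dev->tx_lock, so senders on different queues no longer serialize against each other. Padding each entry to a cache line might further reduce false sharing, but that is left out here.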

>
> > +    }
> > +
> > +    /* Process CQ first. */
> > +    afxdp_complete_tx(xsk);
> > +
> > +    free_batch = check_free_batch(batch);
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * TODO: avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>
> Are we really able to re-use these buffers? They are already in the tx
> ring and will probably be sent on the next kick_tx().
>
Right, so probably I should just print the warning message and continue.

> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +out:
> > +    if (free_batch) {
> > +        free_afxdp_buf_batch(batch);
> > +    } else {
> > +        dp_packet_delete_batch(batch, true);
> > +    }
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_unlock(&dev->tx_lock);
> > +    }
> > +    return error;
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> > +{
> > +   /* Done at reconfigure */
> > +   return 0;
> > +}
> > +
> > +void
> > +netdev_afxdp_destruct(struct netdev *netdev_)
> > +{
> > +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > +
> > +    /* Note: tc is by-passed when using drv-mode, but when using
> > +     * skb-mode, we might need to clean up tc. */
> > +
> > +    xsk_destroy_all(netdev_);
> > +    ovs_mutex_destroy(&netdev->mutex);
> > +}
> > +
> > +int
> > +netdev_afxdp_get_stats(const struct netdev *netdev_,
>
> You don't need an underscore here.
>
> > +                       struct netdev_stats *stats)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +    struct netdev_stats dev_stats;
> > +    struct xsk_socket_info *xsk;
> > +    int error, i;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    error = get_stats_via_netlink(netdev_, &dev_stats);
> > +    if (error) {
> > +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> > +    } else {
> > +        /* Use kernel netdev's packet and byte counts */
> > +        stats->rx_packets = dev_stats.rx_packets;
> > +        stats->rx_bytes = dev_stats.rx_bytes;
> > +        stats->tx_packets = dev_stats.tx_packets;
> > +        stats->tx_bytes = dev_stats.tx_bytes;
> > +
> > +        stats->rx_errors           += dev_stats.rx_errors;
> > +        stats->tx_errors           += dev_stats.tx_errors;
> > +        stats->rx_dropped          += dev_stats.rx_dropped;
> > +        stats->tx_dropped          += dev_stats.tx_dropped;
> > +        stats->multicast           += dev_stats.multicast;
> > +        stats->collisions          += dev_stats.collisions;
> > +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> > +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> > +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> > +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> > +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> > +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> > +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> > +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> > +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> > +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> > +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> > +
> > +        /* Account the dropped in each xsk */
> > +        for (i = 0; i < MAX_XSKQ; i++) {
>
> i < netdev_n_rxq(netdev)
>
> > +            xsk = dev->xsk[i];
> > +            if (xsk) {
> > +                stats->rx_dropped += xsk->rx_dropped;
> > +                stats->tx_dropped += xsk->tx_dropped;
> > +            }
> > +        }
> > +    }
> > +    ovs_mutex_unlock(&dev->mutex);
> > +
> > +    return error;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..dd2dc1a2064d
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,74 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <config.h>
> > +
> > +#ifdef HAVE_AF_XDP
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used directly
> > + * only by Linux-specific code. */
> > +
> > +#define MAX_XSKQ 16
> > +
> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +struct netdev_rxq;
> > +struct netdev_stats;
> > +
> > +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> > +void netdev_afxdp_destruct(struct netdev *netdev_);
> > +
> > +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> > +                          struct dp_packet_batch *batch,
> > +                          int *qfill);
> > +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> > +                            struct dp_packet_batch *batch,
> > +                            bool concurrent_txq);
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> > +                           struct netdev_stats *stats);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +void signal_remove_xdp(struct netdev *netdev);
> > +
> > +#else /* !HAVE_AF_XDP */
> > +
> > +#include "openvswitch/compiler.h"
> > +
> > +struct dp_packet;
> > +
> > +static inline void
> > +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> > +{
> > +    /* Nothing */
> > +}
> > +
> > +#endif /* HAVE_AF_XDP */
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..d43f79e6aa41
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,139 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-afxdp.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
> > +#include "xdpsock.h"
> > +
> > +/* These functions are Linux specific, so they should be used directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +struct netdev_rxq_linux {
> > +    struct netdev_rxq up;
> > +    bool is_tap;
> > +    int fd;
> > +};
> > +
> > +void netdev_linux_run(const struct netdev_class *);
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> > +                                  const char *flag_name, bool enable);
> > +
> > +int get_stats_via_netlink(const struct netdev *netdev_,
> > +                          struct netdev_stats *stats);
> > +
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set policing. */
> > +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info *xsk[MAX_XSKQ];
>
> You may allocate this array dynamically based on the n_rxq while performing
> reconfiguration. This way you will also have no limit on the number of rxqs.

Makes sense, thanks.
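
A minimal sketch of what sizing the array during reconfiguration could look
like. The names 'afxdp_dev' and 'afxdp_resize_xsk' are hypothetical stand-ins
for the AF_XDP fields of 'struct netdev_linux' and are not part of the patch.
The sketch assumes the old sockets were already destroyed by the caller.

```c
#include <errno.h>
#include <stdlib.h>

struct xsk_socket_info;             /* Opaque here; defined by the patch. */

/* Hypothetical stand-in for the AF_XDP fields of 'struct netdev_linux'. */
struct afxdp_dev {
    struct xsk_socket_info **xsk;   /* One slot per rx queue. */
    int n_xsk;                      /* Number of slots in 'xsk'. */
};

/* Resizes 'dev->xsk' to hold 'n_rxq' slots, all initially NULL.  The caller
 * is assumed to have destroyed the old sockets already.  Returns 0 on
 * success or ENOMEM if the allocation fails. */
static int
afxdp_resize_xsk(struct afxdp_dev *dev, int n_rxq)
{
    struct xsk_socket_info **xsk = NULL;

    if (n_rxq > 0) {
        xsk = calloc(n_rxq, sizeof *xsk);
        if (!xsk) {
            return ENOMEM;
        }
    }

    free(dev->xsk);
    dev->xsk = xsk;
    dev->n_xsk = n_rxq > 0 ? n_rxq : 0;
    return 0;
}
```

With something like this, the stats loop would iterate up to 'dev->n_xsk'
instead of MAX_XSKQ.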

<snip>

> > --- /dev/null
> > +++ b/lib/spinlock.h
> > @@ -0,0 +1,70 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +#ifndef SPINLOCK_H
> > +#define SPINLOCK_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <ctype.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdarg.h>
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +
> > +#include "ovs-atomic.h"
> > +
> > +typedef struct {
>
> It's probably better to not use 'typedef'. OVS doesn't use
> typedefs for structures, unions and enums usually.
> For example we have no typedef for 'struct ovs_mutex'.
> So, this should be just 'struct ovs_spinlock'.
>
>
> We may also add some annotations like OVS_LOCKABLE and clang
> thread safety annotations: OVS_ACQUIRES, OVS_TRY_LOCK, OVS_RELEASES.
> However, this could be done later.
>

OK, will do it.

> > +    atomic_int locked;
> > +} ovs_spinlock_t;
> > +
> > +static inline void
> > +ovs_spinlock_init(ovs_spinlock_t *sl)
> > +{
> > +    atomic_init(&sl->locked, 0);
> > +}
> > +
> > +static inline void
> > +ovs_spin_lock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0, locked = 0;
> > +
> > +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed)) {
> > +        locked = 1;
> > +        while (locked) {
> > +            atomic_read_relaxed(&sl->locked, &locked);
> > +        }
> > +        exp = 0;
> > +    }
> > +}
> > +
> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
>
> Not sure that we need UNUSED annotation since we're in header now.
>
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
> > +#endif
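
A sketch of how the header might look with the comments above applied: a plain
'struct ovs_spinlock' without the typedef, and no OVS_UNUSED on the inline
functions. To keep it standalone it uses C11 <stdatomic.h> rather than
"ovs-atomic.h", so the atomic calls are an assumption and not the OVS wrappers.
The thread-safety annotations are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ovs_spinlock {
    atomic_int locked;      /* 0 when free, 1 when held. */
};

static inline void
ovs_spinlock_init(struct ovs_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

static inline void
ovs_spin_lock(struct ovs_spinlock *sl)
{
    int exp = 0;

    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Spin on a plain load until the lock looks free, then retry. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
        exp = 0;
    }
}

static inline void
ovs_spin_unlock(struct ovs_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool
ovs_spin_trylock(struct ovs_spinlock *sl)
{
    int exp = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}
```

Note that trylock here returns true on success, as in the patch, which is the
opposite of the pthread convention of returning 0 on success.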
> > diff --git a/lib/util.c b/lib/util.c
> > index 5679232ffc5f..060b1e287bce 100644
> > --- a/lib/util.c
> > +++ b/lib/util.c
> > @@ -277,6 +277,49 @@ free_cacheline(void *p)
> >  #endif
> >  }
> >
> > +#ifdef HAVE_AF_XDP
>
> I don't think that we need 'ifdef' here.
>
> How about re-naming 'xmalloc_cacheline' to 'xmalloc_size_align'
> making it allocate memory aligned to a specified size and in
> a dedicated cachelines?
>
> And implement two functions:
> xmalloc_cacheline(size)
> {
>     return xmalloc_size_align(size, CACHE_LINE_SIZE);
> }
> xmalloc_pagealign(size)
> {
>     return xmalloc_size_align(size, get_page_size());
> }
>

I think it's better, will do it.
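
A rough sketch of that split, showing only the portable fallback path (no
posix_memalign). The value of CACHE_LINE_SIZE, the use of sysconf() in place of
get_page_size(), and the 'free_size_align' name are assumptions made to keep
the example standalone. It does not pad the payload to dedicated cache lines.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHE_LINE_SIZE 64      /* Assumed; OVS defines its own value. */

/* Returns 'size' bytes aligned to 'alignment', which must be a power of 2 and
 * at least sizeof(void *).  Aborts on allocation failure, like xmalloc(). */
static void *
xmalloc_size_align(size_t size, size_t alignment)
{
    /* Worst-case padding, plus room to remember the pointer from malloc(). */
    void *p = malloc((alignment - 1) + sizeof(void *) + size);
    uintptr_t aligned;

    if (!p) {
        abort();
    }
    aligned = ((uintptr_t) p + sizeof(void *) + alignment - 1)
              & ~((uintptr_t) alignment - 1);
    ((void **) aligned)[-1] = p;    /* Saved just below the aligned block. */
    return (void *) aligned;
}

static void
free_size_align(void *p)
{
    if (p) {
        free(((void **) p)[-1]);
    }
}

static void *
xmalloc_cacheline(size_t size)
{
    return xmalloc_size_align(size, CACHE_LINE_SIZE);
}

static void *
xmalloc_pagealign(size_t size)
{
    return xmalloc_size_align(size, (size_t) sysconf(_SC_PAGESIZE));
}
```

The overhead in this form is at most (alignment - 1) + sizeof(void *) bytes,
without rounding the payload up to a full page.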

> > +void *
> > +xmalloc_pagealign(size_t size)
> > +{
> > +#ifdef HAVE_POSIX_MEMALIGN
> > +    void *p;
> > +    int error;
> > +
> > +    COVERAGE_INC(util_xalloc);
> > +    error = posix_memalign(&p, get_page_size(), size ? size : 1);
> > +    if (error != 0) {
> > +        out_of_memory();
> > +    }
> > +    return p;
> > +#else
> > +    /* Similar to xmalloc_cacheline, but replace
> > +     * CACHE_LINE_SIZE with get_page_size() */
> > +    void *p = xmalloc((get_page_size() - 1)
> > +                      + sizeof(void *)
> > +                      + ROUND_UP(size, get_page_size()));
>
> I think that you don't need to round up to a page size.
> You need to round up to a CACHE_LINE_SIZE, probably.
> There is no point in allocating that much extra memory.
> Below code should be re-checked too.
>

Right, in the worst case we should waste (page_size() - 1) bytes.

> > +    bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *);
> > +    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? get_page_size() : 0),
> > +                                get_page_size());
> > +    void **q = (void **) r - 1;
> > +    *q = p;
> > +    return r;
> > +#endif
> > +}
> > +
> > +void
> > +free_pagealign(void *p)
> > +{
> > +#ifdef HAVE_POSIX_MEMALIGN
> > +    free(p);
> > +#else
> > +    if (p) {
> > +        void **q = (void **) p - 1;
> > +        free(*q);
> > +    }
> > +#endif
> > +}
> > +#endif
> > +
> >  char *
> >  xasprintf(const char *format, ...)
> >  {
> > diff --git a/lib/util.h b/lib/util.h
> > index 53354f1c6f0f..3cd8cf87fba8 100644
> > --- a/lib/util.h
> > +++ b/lib/util.h
> > @@ -163,6 +163,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size);
> >
> >  int string_ends_with(const char *str, const char *suffix);
> >
> > +#ifdef HAVE_AF_XDP
> > +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> > +void free_pagealign(void *);
> > +#endif
> > +
> >  /* The C standards say that neither the 'dst' nor 'src' argument to
> >   * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
> >   * the null case. */
> > diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> > new file mode 100644
> > index 000000000000..ffdb54dfcd27
> > --- /dev/null
> > +++ b/lib/xdpsock.c
> > @@ -0,0 +1,179 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +#include <config.h>
> > +
> > +#include "xdpsock.h"
> > +#include "dp-packet.h"
> > +#include "openvswitch/compiler.h"
> > +
> > +/* Note:
> > + * umem_elem_push* shouldn't overflow because we always pop
> > + * elem first, then push back to the stack.
> > + */
> > +static inline void
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +}
> > +
> > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +static inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +static inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +static inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> > +        return NULL;
> > +    }
> > +
> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ptr;
> > +}
> > +
> > +static void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +    int ret;
> > +
> > +    ret = posix_memalign(&bufs, getpagesize(),
> > +                         size * sizeof(void *));
>
> xmalloc_pagealign ?
>
> > +    if (ret) {
> > +        return NULL;
> > +    }
> > +
> > +    memset(bufs, 0, size * sizeof(void *));
> > +    return (void **)bufs;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->mutex);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > +    free(umemp->array);
>
> free_pagealign ?
>
> > +    umemp->array = NULL;
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +    int ret;
> > +
> > +    ret = posix_memalign(&bufs, getpagesize(),
> > +                         size * sizeof(struct dp_packet_afxdp));
> > +    if (ret) {
> > +        return -ENOMEM;
> > +    }
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free(xp->array);
> > +    xp->array = NULL;
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..72578e383812
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,101 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <config.h>
> > +
> > +#ifdef HAVE_AF_XDP
> > +
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "spinlock.h"
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +
> > +/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
> > + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
> > + */
> > +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> > +
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
> > +
> > +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> > +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> > +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
> > +
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > +    int index;      /* point to top */
> > +    unsigned int size;
> > +    ovs_spinlock_t mutex;
>
> It's a bit confusing to name it a 'mutex'. Sounds like it's 'ovs_mutex'.
> Probably, it'll be better to name it 'spinlock' or just 'lock'.
>
OK
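
For illustration, a simplified sketch of the LIFO pool with the field renamed
to 'lock'. It is not the patch code: it uses a C11 atomic_flag as a stand-in
for the spinlock, calloc() instead of page-aligned allocation, and the push
returns an error on overflow instead of calling OVS_NOT_REACHED().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* LIFO pointer array. */
struct umem_pool {
    int index;              /* Number of stored elements (top of stack). */
    unsigned int size;      /* Capacity of 'array'. */
    atomic_flag lock;       /* Stand-in for the spinlock. */
    void **array;
};

static int
umem_pool_init(struct umem_pool *umemp, unsigned int size)
{
    umemp->array = calloc(size, sizeof *umemp->array);
    if (!umemp->array) {
        return -1;
    }
    umemp->size = size;
    umemp->index = 0;
    atomic_flag_clear(&umemp->lock);
    return 0;
}

static void
umem_pool_lock(struct umem_pool *umemp)
{
    while (atomic_flag_test_and_set_explicit(&umemp->lock,
                                             memory_order_acquire)) {
        continue;
    }
}

static void
umem_pool_unlock(struct umem_pool *umemp)
{
    atomic_flag_clear_explicit(&umemp->lock, memory_order_release);
}

/* Pushes 'addr'.  Returns 0 on success, -1 if the pool is full. */
static int
umem_elem_push(struct umem_pool *umemp, void *addr)
{
    int ret = -1;

    umem_pool_lock(umemp);
    if ((unsigned int) umemp->index < umemp->size) {
        umemp->array[umemp->index++] = addr;
        ret = 0;
    }
    umem_pool_unlock(umemp);
    return ret;
}

/* Pops the most recently pushed element, or returns NULL if empty. */
static void *
umem_elem_pop(struct umem_pool *umemp)
{
    void *addr = NULL;

    umem_pool_lock(umemp);
    if (umemp->index > 0) {
        addr = umemp->array[--umemp->index];
    }
    umem_pool_unlock(umemp);
    return addr;
}
```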

<snip>

Regards,
William
Ilya Maximets June 3, 2019, 3:26 p.m. UTC | #3
On 02.06.2019 16:43, William Tu wrote:
> Hi Ilya,
> 
> Thanks for your review.
> 
> On Thu, May 30, 2019 at 8:57 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> On 28.05.2019 22:01, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>> index 859c05613ddf..a33b9a7353ba 100644
>>> --- a/lib/dpif-netdev-perf.h
>>> +++ b/lib/dpif-netdev-perf.h
>>> @@ -21,6 +21,7 @@
>>>  #include <stddef.h>
>>>  #include <stdint.h>
>>>  #include <string.h>
>>> +#include <time.h>
>>>  #include <math.h>
>>>
>>>  #ifdef DPDK_NETDEV
>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>>>      char *log_reason;
>>>  };
>>>
>>> +#ifdef HAVE_AF_XDP
>>
>> I'd like to change this to "#ifdef __linux__".
>> 'clock_gettime' is posix compliant, but CLOCK_MONOTONIC_RAW is
>> Linux specific.
> 
> Yes, thanks, will do it.
> 
>>
>>> +static inline uint64_t
>>> +rdtsc_syscall(struct pmd_perf_stats *s)
>>> +{
>>> +    struct timespec val;
>>> +    uint64_t v;
>>> +
>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
>>> +       return s->last_tsc = 0;
>>
>> Maybe it's better to just return the value and allow caller to assign?
> 
> Do you mean just:
>        return s->last_tsc;

Yes.

> 
>> This way you'll not need to pass any arguments here.
> 
> I don't understand, I still need to pass &val, right?

I meant not passing the argument to rdtsc_syscall, i.e. rdtsc_syscall(void).

Best regards, Ilya Maximets.

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..a2bff5733d0a
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,433 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with the existing kernel networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, bypassing several of the Linux kernel's subsystems.
+As a result, an AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see the Linux kernel's
+Documentation/networking/af_xdp.rst.
+
+
+AF_XDP Netdev
+-------------
+OVS has several netdev types, e.g., system, tap, or
+dpdk.  The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of in the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+First, clone a recent version of the Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure the kernel is 5.0 or later
+  make check-afxdp TESTSUITEFLAGS='1'
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a device with 1 queue (queue 0), configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+.. note::
+   pmd-rxq-affinity is optional. If not specified, the system will auto-assign.
+
+To validate that the bridge has been successfully instantiated, run::
+
+  ovs-vsctl show
+
+The output should look something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debug logging with::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at the Linux
+Plumbers Conference 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
+section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to current and future AF_XDP work.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing PMD
+to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
+   running cores, device plug-in slot).
+
+#. Isolate your CPUs by setting isolcpus on the kernel command line in GRUB.
+
+#. IRQs should not be assigned to cores running PMD threads.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+
+Debugging performance issues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use the OVS PMD statistics command::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peers::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. The device's NUMA ID is always 0; we need a way to find it from a netdev.
+#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer. A
+   possible work-around is to use the OpenFlow meter action.
+#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
+#. Most of the tests were done using a single i40e port. Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test results yet (TODO item).
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC and a tap device connected to a VM.
+First, start OVS, then add physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+The measured performance is around 1.6Mpps.
+The limit comes from the kernel's tap interface, which requires copying each
+packet from the umem buffer in userspace into the kernel.
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+Performance: 6.6Mpps for XDP_DROP, 2.3Mpps for XDP_TX
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (Linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+Performance: 800Kpps for XDP_DROP, 700Kpps for XDP_TX
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
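
In the PVP tap example above, pmd-cpu-mask=0x10 and pmd-rxq-affinity="0:4"
refer to the same core: the mask is a bitmap in which bit N selects core N, so
0x10 selects core 4, and "0:4" pins queue 0 to that core. A minimal sketch of
the decoding, assuming a hypothetical helper named mask_to_cores() that is not
part of OVS or of this patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: writes the core IDs selected
 * by 'mask' into 'cores' (at most 'max' entries are stored) and returns the
 * total number of bits set in 'mask'. */
static size_t
mask_to_cores(uint64_t mask, int *cores, size_t max)
{
    size_t n = 0;

    for (int core = 0; core < 64; core++) {
        if (mask & (UINT64_C(1) << core)) {
            if (n < max) {
                cores[n] = core;
            }
            n++;
        }
    }
    return n;
}
```

With this helper, the 0xfff mask used in the vhostuser example selects cores 0
through 11, so a queue could be pinned to any of them.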
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index f8fc5bcd7b4c..b9eacd7c0f3c 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -221,6 +221,41 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index 505e3d041e93..29c90b73f836 100644
--- a/configure.ac
+++ b/configure.ac
@@ -99,6 +99,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index cc5dccf39d6b..b31e28f6e1f5 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@  if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -392,6 +396,7 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-tc-offloads.c \
 	lib/netdev-tc-offloads.h \
 	lib/netlink-conntrack.c \
@@ -409,6 +414,15 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h \
+	lib/spinlock.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..e6a7947076b4 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@ 
 #include <string.h>
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,27 @@  dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#ifdef HAVE_AF_XDP
+/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes
+ * of memory starting at 'base', which points into an AF_XDP umem frame.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+{
+    dp_packet_set_base(b, base);
+    dp_packet_set_data(b, base);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_set_allocated(b, allocated);
+    b->source = DPBUF_AFXDP;
+    dp_packet_reset_offsets(b);
+    pkt_metadata_init(&b->md, 0);
+    dp_packet_reset_cutlen(b);
+    dp_packet_reset_offload(b);
+    b->packet_type = htonl(PT_ETH);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +144,8 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +272,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +460,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..e3438226e360 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* buffer data from XDP frame */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +91,13 @@  struct dp_packet {
     };
 };
 
+#ifdef HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +131,9 @@  static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#ifdef HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +195,11 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
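
The dp_packet_afxdp wrapper added above embeds a dp_packet next to a pointer
to its umem pool, so code that is handed only the dp_packet can step back to
the wrapper and find the pool the frame belongs to. A minimal sketch of that
pattern follows. The struct and function names here are simplified stand-ins,
not the OVS definitions; OVS itself uses its CONTAINER_OF macro for this.

```c
#include <stddef.h>

/* Simplified stand-ins for umem_pool, dp_packet and dp_packet_afxdp. */
struct pool {
    int id;
};

struct packet {
    int source;
};

struct packet_wrapper {
    struct pool *mpool;     /* Pool that owns the frame. */
    struct packet packet;   /* Embedded packet handed to the datapath. */
};

/* Recovers the wrapper from a pointer to its embedded 'packet' member by
 * subtracting the member's offset within the wrapper. */
static struct packet_wrapper *
wrapper_from_packet(struct packet *p)
{
    return (struct packet_wrapper *)
        ((char *) p - offsetof(struct packet_wrapper, packet));
}
```

Keeping the pool pointer outside the embedded packet means the packet struct
itself does not have to grow to carry it.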
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..a33b9a7353ba 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -21,6 +21,7 @@ 
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
+#include <time.h>
 #include <math.h>
 
 #ifdef DPDK_NETDEV
@@ -186,6 +187,24 @@  struct pmd_perf_stats {
     char *log_reason;
 };
 
+#ifdef HAVE_AF_XDP
+static inline uint64_t
+rdtsc_syscall(struct pmd_perf_stats *s)
+{
+    struct timespec val;
+    uint64_t v;
+
+    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
+        return s->last_tsc = 0;
+    }
+
+    v  = (uint64_t) val.tv_sec * 1000000000LL;
+    v += (uint64_t) val.tv_nsec;
+
+    return s->last_tsc = v;
+}
+#endif
+
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
  * These functions are intended to be invoked in the context of pmd threads. */
 
@@ -198,6 +217,15 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif defined(HAVE_AF_XDP) && defined(__x86_64__)
+    /* These are x86-specific instructions. */
+    uint32_t h, l;
+    asm volatile("rdtsc" : "=a" (l), "=d" (h));
+
+    return s->last_tsc = ((uint64_t) h << 32) | l;
+#elif defined(HAVE_AF_XDP)
+    /* non-x86_64 architecture uses syscall */
+    return rdtsc_syscall(s);
 #else
     return s->last_tsc = 0;
 #endif
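
The non-x86_64 fallback above replaces the TSC read with a monotonic clock
reading converted to nanoseconds, so the "cycle" counts on those platforms are
really nanoseconds and each update costs a clock_gettime() call. A standalone
sketch of the same conversion, assuming Linux (for CLOCK_MONOTONIC_RAW) and a
hypothetical helper name monotonic_ns() that does not appear in the patch:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/* Hypothetical helper, for illustration only: returns a monotonic timestamp
 * in nanoseconds, or 0 if the clock cannot be read. */
static uint64_t
monotonic_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        return 0;
    }
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000)
           + (uint64_t) ts.tv_nsec;
}
```

Two successive calls give a non-decreasing pair of values, so their difference
can be used as an elapsed-time measurement in the same way as a TSC delta.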
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..e20ee31c00f3
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,850 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-afxdp.h"
+
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <stdlib.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "dp-packet.h"
+#include "dpif-netdev.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "socket-util.h"
+#include "spinlock.h"
+#include "util.h"
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
+#define UMEM2XPKT(base, i) \
+                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
+                               i * sizeof(struct dp_packet_afxdp))
+
+static uint32_t prog_id;
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+static int xsk_configure_all(struct netdev *netdev);
+static void xsk_destroy_all(struct netdev *netdev);
+
+static struct xsk_umem_info *xsk_configure_umem(void *buffer, uint64_t size,
+                                                int xdpmode)
+{
+    struct xsk_umem_config uconfig OVS_UNUSED;
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xcalloc(1, sizeof(*umem));
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           NULL);
+    if (ret) {
+        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV");
+        free(umem);
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* set-up umem pool */
+    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("umem_pool_init failed");
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct umem_elem *elem;
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* set-up metadata */
+    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("xpacket_pool_init failed");
+        umem_pool_cleanup(&umem->mpool);
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = UMEM2XPKT(umem->xpool.array, i);
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0;
+    int ret;
+    int i;
+
+    xsk = xcalloc(1, sizeof(*xsk));
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        free(xsk);
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV",
+                 queue_id);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
+    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
+                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
+        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
+    }
+
+    for (i = 0;
+         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS - BATCH_SIZE);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+
+    /* umem memory region */
+    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+
+    /* create AF_XDP socket */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        free_pagealign(bufs);
+        return NULL;
+    }
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* clean up umem and xpacket pool */
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free_pagealign(bufs);
+        umem_pool_cleanup(&umem->mpool);
+        xpacket_pool_cleanup(&umem->xpool);
+        free(umem);
+    }
+    return xsk;
+}
+
+static int
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk;
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    /* configure each queue */
+    for (i = 0; i < netdev->n_rxq; i++) {
+        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
+                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk) {
+            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
+            goto err;
+        }
+        dev->xsk[i] = xsk;
+        xsk->rx_dropped = 0;
+        xsk->tx_dropped = 0;
+    }
+
+    return 0;
+
+err:
+    xsk_destroy_all(netdev);
+    return EINVAL;
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk)
+{
+    struct xsk_umem *umem;
+
+    if (!xsk) {
+        return;
+    }
+
+    umem = xsk->umem->umem;
+    xsk_socket__delete(xsk->xsk);
+    if (xsk_umem__delete(umem)) {
+        VLOG_ERR("xsk_umem__delete failed");
+    }
+
+    /* free the packet buffer */
+    free_pagealign(xsk->umem->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&xsk->umem->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&xsk->umem->xpool);
+
+    free(xsk->umem);
+    free(xsk);
+}
+
+static void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    for (i = 0; i < MAX_XSKQ; i++) {
+        if (dev->xsk[i]) {
+            VLOG_INFO("destroy xsk[%d]", i);
+            /* xsk_destroy() frees the socket together with its rx_dropped
+             * and tx_dropped counters, so only the slot needs clearing. */
+            xsk_destroy(dev->xsk[i]);
+            dev->xsk[i] = NULL;
+        }
+    }
+    VLOG_INFO("remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static inline void OVS_UNUSED
+log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+               &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                stat.rx_dropped,
+                stat.rx_invalid_descs,
+                stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    const char *str_xdpmode;
+    int xdpmode, new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
+                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
+        return EINVAL;
+    }
+
+    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
+    if (!strcasecmp(str_xdpmode, "drv")) {
+        xdpmode = XDP_ZEROCOPY;
+    } else if (!strcasecmp(str_xdpmode, "skb")) {
+        xdpmode = XDP_COPY;
+    } else {
+        VLOG_ERR("%s: Incorrect xdpmode (%s).",
+                 netdev_get_name(netdev), str_xdpmode);
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (dev->requested_n_rxq != new_n_rxq
+        || dev->requested_xdpmode != xdpmode) {
+        dev->requested_n_rxq = new_n_rxq;
+        dev->requested_xdpmode = xdpmode;
+        netdev_request_reconfigure(netdev);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+    netdev->n_rxq = dev->requested_n_rxq;
+
+    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
+        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
+        /* From SKB mode to DRV mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+        dev->xdp_bind_flags = XDP_ZEROCOPY;
+        dev->xdpmode = XDP_ZEROCOPY;
+
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
+                      ovs_strerror(errno));
+        }
+    } else {
+        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
+        /* From DRV mode to SKB mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+        dev->xdp_bind_flags = XDP_COPY;
+        dev->xdpmode = XDP_COPY;
+        /* TODO: set rlimit back to previous value
+         * when no device is in DRV mode.
+         */
+    }
+
+    err = xsk_configure_all(netdev);
+    if (err) {
+        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always uses numa id 0",
+              netdev_get_name(netdev));
+    return 0;
+}
+
+static void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t curr_prog_id = 0;
+    uint32_t flags;
+
+    /* remove_xdp_program() */
+    if (xdpmode == XDP_COPY) {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    } else {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    }
+
+    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    }
+    if (prog_id == curr_prog_id) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    } else if (!curr_prog_id) {
+        VLOG_INFO("couldn't find a prog id on a given interface");
+    } else {
+        VLOG_INFO("program on interface changed, not removing");
+    }
+}
+
+void
+signal_remove_xdp(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    VLOG_WARN("force remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    unsigned long addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
+static void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+    struct dp_packet_afxdp *xpacket = NULL;
+    struct dp_packet *packet;
+    void *elems[BATCH_SIZE];
+    unsigned long addr;
+
+    /* All packets are AF_XDP, so handle their deletion as a batch. */
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (xpacket->mpool) {
+            void *base = dp_packet_base(packet);
+
+            addr = (unsigned long)base & (~FRAME_SHIFT_MASK);
+            elems[i] = (void *)addr;
+        }
+    }
+    umem_elem_push_n(xpacket->mpool, batch->count, elems);
+    dp_packet_batch_init(batch);
+}
+
+int
+netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+                      int *qfill)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev *netdev = rx->up.netdev;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct umem_elem *elems[BATCH_SIZE];
+    uint32_t idx_rx = 0, idx_fq = 0;
+    struct xsk_socket_info *xsk;
+    int qid = rxq_->queue_id;
+    unsigned int rcvd, i;
+    int ret = 0;
+
+    xsk = dev->xsk[qid];
+    rx->fd = xsk_socket__fd(xsk->xsk);
+
+    /* Check whether any packets are waiting on the RX queue; if so,
+     * idx_rx is the index of the first descriptor to consume.
+     */
+    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return 0;
+    }
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
+    if (OVS_UNLIKELY(ret)) {
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+        xsk->rx_dropped += rcvd;
+        return ENOMEM;
+    }
+
+    /* Prepare for the FILL queue */
+    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
+        /* The FILL queue is full, don't retry or process rx. Wait for kernel
+         * to move received packets from FILL queue to RX queue.
+         */
+        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+        xsk->rx_dropped += rcvd;
+        return ENOMEM;
+    }
+
+    /* Setup a dp_packet batch from descriptors in RX queue */
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
+        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
+        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+        uint64_t index;
+
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        index = addr >> FRAME_SHIFT;
+        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
+        packet = &xpacket->packet;
+
+        /* Initialize the struct dp_packet */
+        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+    /* Release the RX queue */
+    xsk_ring_cons__release(&xsk->rx, rcvd);
+
+    for (i = 0; i < rcvd; i++) {
+        uint64_t index;
+        struct umem_elem *elem;
+
+        /* Get one free umem, program it into FILL queue */
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+
+    if (qfill) {
+        /* TODO: return the number of remaining packets in the queue. */
+        *qfill = 0;
+    }
+
+#ifdef AFXDP_DEBUG
+    log_xsk_stat(xsk);
+#endif
+    return 0;
+}
+
+static inline int
+kick_tx(struct xsk_socket_info *xsk)
+{
+    int ret;
+
+    /* This causes system call into kernel's xsk_sendmsg, and
+     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
+     */
+    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
+    if (OVS_UNLIKELY(ret < 0)) {
+        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
+            return errno;
+        }
+    }
+    /* no error, or EBUSY or EAGAIN */
+    return 0;
+}
+
+static inline bool
+check_free_batch(struct dp_packet_batch *batch)
+{
+    struct umem_pool *first_mpool = NULL;
+    struct dp_packet_afxdp *xpacket;
+    struct dp_packet *packet;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (packet->source != DPBUF_AFXDP) {
+            return false;
+        }
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (i == 0) {
+            first_mpool = xpacket->mpool;
+            continue;
+        }
+        if (xpacket->mpool != first_mpool) {
+            return false;
+        }
+    }
+    /* All packets are DPBUF_AFXDP and from the same mpool */
+    return true;
+}
+
+static inline void
+afxdp_complete_tx(struct xsk_socket_info *xsk)
+{
+    struct umem_elem *elems_push[BATCH_SIZE];
+    uint32_t idx_cq = 0;
+    int tx_done, j, ret;
+
+    if (!xsk->outstanding_tx) {
+        return;
+    }
+
+    ret = kick_tx(xsk);
+    if (OVS_UNLIKELY(ret)) {
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+    }
+
+    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
+    /* Read the completed addresses before releasing the CQ entries. */
+    for (j = 0; j < tx_done; j++) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)xsk->umem->buffer + addr);
+        elems_push[j] = elem;
+    }
+
+    if (tx_done > 0) {
+        /* Release the CQ entries, then recycle back to umem pool. */
+        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
+        xsk->outstanding_tx -= tx_done;
+        umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
+    }
+}
+
+int
+netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
+                        struct dp_packet_batch *batch,
+                        bool concurrent_txq)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev_);
+    struct xsk_socket_info *xsk = dev->xsk[qid];
+    struct umem_elem *elems_pop[BATCH_SIZE];
+    struct dp_packet *packet;
+    bool free_batch = true;
+    uint32_t idx = 0;
+    int error = 0;
+    int ret;
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        ovs_spin_lock(&dev->tx_lock);
+    }
+
+    /* Process CQ first. */
+    afxdp_complete_tx(xsk);
+
+    free_batch = check_free_batch(batch);
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        xsk->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    /* Make sure we have enough TX descs */
+    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+        xsk->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct umem_elem *elem;
+        uint64_t index;
+
+        elem = elems_pop[i];
+        /* Copy the packet into the umem element just popped from the pool.
+         * TODO: avoid this copy if the packet and the popped element
+         * are located in the same umem.
+         */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk->tx, batch->count);
+    xsk->outstanding_tx += batch->count;
+
+    ret = kick_tx(xsk);
+    if (OVS_UNLIKELY(ret)) {
+        /* Frames stay on the TX ring; afxdp_complete_tx() recycles them. */
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+    }
+
+out:
+    if (free_batch) {
+        free_afxdp_buf_batch(batch);
+    } else {
+        dp_packet_delete_batch(batch, true);
+    }
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        ovs_spin_unlock(&dev->tx_lock);
+    }
+    return error;
+}
+
+int
+netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
+{
+    /* Done at reconfigure */
+    return 0;
+}
+
+void
+netdev_afxdp_destruct(struct netdev *netdev_)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+    /* Note: tc is by-passed when using drv-mode, but when using
+     * skb-mode, we might need to clean up tc. */
+
+    xsk_destroy_all(netdev_);
+    ovs_mutex_destroy(&netdev->mutex);
+}
+
+int
+netdev_afxdp_get_stats(const struct netdev *netdev_,
+                       struct netdev_stats *stats)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev_);
+    struct netdev_stats dev_stats;
+    struct xsk_socket_info *xsk;
+    int error, i;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    error = get_stats_via_netlink(netdev_, &dev_stats);
+    if (error) {
+        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
+    } else {
+        /* Use kernel netdev's packet and byte counts */
+        stats->rx_packets = dev_stats.rx_packets;
+        stats->rx_bytes = dev_stats.rx_bytes;
+        stats->tx_packets = dev_stats.tx_packets;
+        stats->tx_bytes = dev_stats.tx_bytes;
+
+        stats->rx_errors           += dev_stats.rx_errors;
+        stats->tx_errors           += dev_stats.tx_errors;
+        stats->rx_dropped          += dev_stats.rx_dropped;
+        stats->tx_dropped          += dev_stats.tx_dropped;
+        stats->multicast           += dev_stats.multicast;
+        stats->collisions          += dev_stats.collisions;
+        stats->rx_length_errors    += dev_stats.rx_length_errors;
+        stats->rx_over_errors      += dev_stats.rx_over_errors;
+        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
+        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
+        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
+        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
+        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
+        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
+        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
+        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
+        stats->tx_window_errors    += dev_stats.tx_window_errors;
+
+        /* Account for the packets dropped in each xsk. */
+        for (i = 0; i < MAX_XSKQ; i++) {
+            xsk = dev->xsk[i];
+            if (xsk) {
+                stats->rx_dropped += xsk->rx_dropped;
+                stats->tx_dropped += xsk->tx_dropped;
+            }
+        }
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return error;
+}
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..dd2dc1a2064d
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,74 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+
+#define MAX_XSKQ 16
+
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+struct netdev_rxq;
+struct netdev_stats;
+
+int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
+void netdev_afxdp_destruct(struct netdev *netdev_);
+
+int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
+                          struct dp_packet_batch *batch,
+                          int *qfill);
+int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
+                            struct dp_packet_batch *batch,
+                            bool concurrent_txq);
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+int netdev_afxdp_get_stats(const struct netdev *netdev_,
+                           struct netdev_stats *stats);
+
+void free_afxdp_buf(struct dp_packet *p);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+void signal_remove_xdp(struct netdev *netdev);
+
+#else /* !HAVE_AF_XDP */
+
+#include "openvswitch/compiler.h"
+
+struct dp_packet;
+
+static inline void
+free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
+{
+    /* Nothing */
+}
+
+#endif /* HAVE_AF_XDP */
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
new file mode 100644
index 000000000000..d43f79e6aa41
--- /dev/null
+++ b/lib/netdev-linux-private.h
@@ -0,0 +1,139 @@ 
+/*
+ * Copyright (c) 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_LINUX_PRIVATE_H
+#define NETDEV_LINUX_PRIVATE_H 1
+
+#include <config.h>
+
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "netdev-afxdp.h"
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
+#include "xdpsock.h"
+
+/* These functions are Linux specific, so they should be used directly only by
+ * Linux-specific code. */
+
+struct netdev;
+
+struct netdev_rxq_linux {
+    struct netdev_rxq up;
+    bool is_tap;
+    int fd;
+};
+
+void netdev_linux_run(const struct netdev_class *);
+
+int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
+                                  const char *flag_name, bool enable);
+
+int get_stats_via_netlink(const struct netdev *netdev_,
+                          struct netdev_stats *stats);
+
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+    /* AF_XDP information */
+#ifdef HAVE_AF_XDP
+    struct xsk_socket_info *xsk[MAX_XSKQ];
+    int requested_n_rxq;
+    int xdpmode, requested_xdpmode; /* detect mode changed */
+    int xdp_flags, xdp_bind_flags;
+    ovs_spinlock_t tx_lock;
+#endif
+};
+
+static bool
+is_netdev_linux_class(const struct netdev_class *netdev_class)
+{
+    return netdev_class->run == netdev_linux_run;
+}
+
+static struct netdev_linux *
+netdev_linux_cast(const struct netdev *netdev)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
+
+    return CONTAINER_OF(netdev, struct netdev_linux, up);
+}
+
+static struct netdev_rxq_linux *
+netdev_rxq_linux_cast(const struct netdev_rxq *rx)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
+
+    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
+}
+
+#endif /* netdev-linux-private.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index f75d73fd39f8..2883cf1f2586 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -17,6 +17,7 @@ 
 #include <config.h>
 
 #include "netdev-linux.h"
+#include "netdev-linux-private.h"
 
 #include <errno.h>
 #include <fcntl.h>
@@ -54,6 +55,7 @@ 
 #include "fatal-signal.h"
 #include "hash.h"
 #include "openvswitch/hmap.h"
+#include "netdev-afxdp.h"
 #include "netdev-provider.h"
 #include "netdev-tc-offloads.h"
 #include "netdev-vport.h"
@@ -487,57 +489,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
-
-struct netdev_rxq_linux {
-    struct netdev_rxq up;
-    bool is_tap;
-    int fd;
-};
 
 /* This is set pretty low because we probably won't learn anything from the
  * additional log messages. */
@@ -551,8 +502,6 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
-static void netdev_linux_run(const struct netdev_class *);
-
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -566,7 +515,6 @@  static int do_set_addr(struct netdev *netdev,
                        struct in_addr addr);
 static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
 static int set_etheraddr(const char *netdev_name, const struct eth_addr);
-static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
 static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
@@ -574,31 +522,10 @@  static void netdev_linux_miimon_wait(void);
 static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
 
 static bool
-is_netdev_linux_class(const struct netdev_class *netdev_class)
-{
-    return netdev_class->run == netdev_linux_run;
-}
-
-static bool
 is_tap_netdev(const struct netdev *netdev)
 {
     return netdev_get_class(netdev) == &netdev_tap_class;
 }
-
-static struct netdev_linux *
-netdev_linux_cast(const struct netdev *netdev)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
-
-    return CONTAINER_OF(netdev, struct netdev_linux, up);
-}
-
-static struct netdev_rxq_linux *
-netdev_rxq_linux_cast(const struct netdev_rxq *rx)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
-    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
-}
 
 static int
 netdev_linux_netnsid_update__(struct netdev_linux *netdev)
@@ -774,7 +701,7 @@  netdev_linux_update_lag(struct rtnetlink_change *change)
     }
 }
 
-static void
+void
 netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
 {
     struct nl_sock *sock;
@@ -3279,9 +3206,7 @@  exit:
     .run = netdev_linux_run,                                    \
     .wait = netdev_linux_wait,                                  \
     .alloc = netdev_linux_alloc,                                \
-    .destruct = netdev_linux_destruct,                          \
     .dealloc = netdev_linux_dealloc,                            \
-    .send = netdev_linux_send,                                  \
     .send_wait = netdev_linux_send_wait,                        \
     .set_etheraddr = netdev_linux_set_etheraddr,                \
     .get_etheraddr = netdev_linux_get_etheraddr,                \
@@ -3312,10 +3237,8 @@  exit:
     .arp_lookup = netdev_linux_arp_lookup,                      \
     .update_flags = netdev_linux_update_flags,                  \
     .rxq_alloc = netdev_linux_rxq_alloc,                        \
-    .rxq_construct = netdev_linux_rxq_construct,                \
     .rxq_destruct = netdev_linux_rxq_destruct,                  \
     .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
-    .rxq_recv = netdev_linux_rxq_recv,                          \
     .rxq_wait = netdev_linux_rxq_wait,                          \
     .rxq_drain = netdev_linux_rxq_drain
 
@@ -3323,30 +3246,64 @@  const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
-    .get_block_id = netdev_linux_get_block_id
+    .get_block_id = netdev_linux_get_block_id,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .destruct = netdev_afxdp_destruct,
+    .get_stats = netdev_afxdp_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+    .send = netdev_afxdp_batch_send,
+    .rxq_construct = netdev_afxdp_rxq_construct,
+    .rxq_recv = netdev_afxdp_rxq_recv,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
@@ -5918,7 +5875,7 @@  netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
     dst->tx_window_errors = src->tx_window_errors;
 }
 
-static int
+int
 get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
 {
     struct ofpbuf request;
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e6e8e8..91e6a9e2bfc0 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -903,6 +903,9 @@  extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
 
+#ifdef HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 7d7ecf6f0946..0fac117cc602 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -104,6 +104,9 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);
+#ifdef HAVE_AF_XDP
+void signal_remove_xdp(struct netdev *netdev);
+#endif
 
 int
 netdev_n_txq(const struct netdev *netdev)
@@ -146,6 +149,9 @@  netdev_initialize(void)
         netdev_register_provider(&netdev_internal_class);
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
@@ -2007,6 +2013,11 @@  restore_all_flags(void *aux OVS_UNUSED)
                                                saved_flags & ~saved_values,
                                                &old_flags);
         }
+#ifdef HAVE_AF_XDP
+        if (netdev->netdev_class == &netdev_afxdp_class) {
+            signal_remove_xdp(netdev);
+        }
+#endif
     }
 }
 
diff --git a/lib/spinlock.h b/lib/spinlock.h
new file mode 100644
index 000000000000..17d79f217410
--- /dev/null
+++ b/lib/spinlock.h
@@ -0,0 +1,70 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef SPINLOCK_H
+#define SPINLOCK_H 1
+
+#include <config.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "ovs-atomic.h"
+
+typedef struct {
+    atomic_int locked;
+} ovs_spinlock_t;
+
+static inline void
+ovs_spinlock_init(ovs_spinlock_t *sl)
+{
+    atomic_init(&sl->locked, 0);
+}
+
+static inline void
+ovs_spin_lock(ovs_spinlock_t *sl)
+{
+    int exp = 0, locked = 0;
+
+    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed)) {
+        locked = 1;
+        while (locked) {
+            atomic_read_relaxed(&sl->locked, &locked);
+        }
+        exp = 0;
+    }
+}
+
+static inline void
+ovs_spin_unlock(ovs_spinlock_t *sl)
+{
+    atomic_store_explicit(&sl->locked, 0, memory_order_release);
+}
+
+static inline int OVS_UNUSED
+ovs_spin_trylock(ovs_spinlock_t *sl)
+{
+    int exp = 0;
+    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed);
+}
+#endif
diff --git a/lib/util.c b/lib/util.c
index 5679232ffc5f..060b1e287bce 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -277,6 +277,49 @@  free_cacheline(void *p)
 #endif
 }
 
+#ifdef HAVE_AF_XDP
+void *
+xmalloc_pagealign(size_t size)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    void *p;
+    int error;
+
+    COVERAGE_INC(util_xalloc);
+    error = posix_memalign(&p, get_page_size(), size ? size : 1);
+    if (error != 0) {
+        out_of_memory();
+    }
+    return p;
+#else
+    /* Similar to xmalloc_cacheline, but replace
+     * CACHE_LINE_SIZE with get_page_size() */
+    void *p = xmalloc((get_page_size() - 1)
+                      + sizeof(void *)
+                      + ROUND_UP(size, get_page_size()));
+    bool runt = PAD_SIZE((uintptr_t) p, get_page_size()) < sizeof(void *);
+    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? get_page_size() : 0),
+                                get_page_size());
+    void **q = (void **) r - 1;
+    *q = p;
+    return r;
+#endif
+}
+
+void
+free_pagealign(void *p)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    free(p);
+#else
+    if (p) {
+        void **q = (void **) p - 1;
+        free(*q);
+    }
+#endif
+}
+#endif
+
 char *
 xasprintf(const char *format, ...)
 {
diff --git a/lib/util.h b/lib/util.h
index 53354f1c6f0f..3cd8cf87fba8 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -163,6 +163,11 @@  void ovs_strzcpy(char *dst, const char *src, size_t size);
 
 int string_ends_with(const char *str, const char *suffix);
 
+#ifdef HAVE_AF_XDP
+void *xmalloc_pagealign(size_t) MALLOC_LIKE;
+void free_pagealign(void *);
+#endif
+
 /* The C standards say that neither the 'dst' nor 'src' argument to
  * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
  * the null case. */
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..ffdb54dfcd27
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,179 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+
+#include "xdpsock.h"
+#include "dp-packet.h"
+#include "openvswitch/compiler.h"
+
+/* Note:
+ * umem_elem_push* shouldn't overflow because we always pop
+ * elem first, then push back to the stack.
+ */
+static inline void
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+}
+
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    ovs_spin_lock(&umemp->mutex);
+    __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+}
+
+static inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->mutex);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->mutex);
+}
+
+static inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->mutex);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ret;
+}
+
+static inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
+        return NULL;
+    }
+
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->mutex);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->mutex);
+
+    return ptr;
+}
+
+static void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+    int ret;
+
+    ret = posix_memalign(&bufs, getpagesize(),
+                         size * sizeof(void *));
+    if (ret) {
+        return NULL;
+    }
+
+    memset(bufs, 0, size * sizeof(void *));
+    return (void **)bufs;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        return -ENOMEM;
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->mutex);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free(umemp->array);
+    umemp->array = NULL;
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+    int ret;
+
+    ret = posix_memalign(&bufs, getpagesize(),
+                         size * sizeof(struct dp_packet_afxdp));
+    if (ret) {
+        return -ENOMEM;
+    }
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free(xp->array);
+    xp->array = NULL;
+}
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..72578e383812
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,101 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdio.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "spinlock.h"
+
+#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
+#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* In the worst case, all four rings (TX, CQ, RX and FILL) are full.
+ * Sizing NUM_FRAMES for that case ensures that umem_elem_pop() always
+ * succeeds. */
+#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+
+#define BATCH_SIZE      NETDEV_MAX_BURST
+
+BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
+BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
+BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
+
+/* LIFO ptr_array. */
+struct umem_pool {
+    int index;      /* Top of the stack: the next free slot in 'array'. */
+    unsigned int size;
+    ovs_spinlock_t mutex;
+    void **array;   /* Array of pointers to umem buffers. */
+};
+
+/* Array-based dp_packet_afxdp. */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_dropped;
+    unsigned long tx_dropped;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *umem_elem_pop(struct umem_pool *umemp);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
+#endif
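
The sizing and addressing macros in this header can be checked with a little
arithmetic. The sketch below is an illustration under stated assumptions, not
part of the patch: the EX_* values (2048 descriptors per ring, 4096-byte
frames, shift of 12) are assumed to match the libbpf defaults the header pulls
in from <bpf/xsk.h>, and the helpers frame_index(), frame_offset() and
frame_base() are hypothetical names showing one plausible use of FRAME_SHIFT
and FRAME_SHIFT_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the header takes the real ones from <bpf/xsk.h>. */
#define EX_FRAME_SHIFT      12
#define EX_FRAME_SIZE       (1u << EX_FRAME_SHIFT)
#define EX_FRAME_SHIFT_MASK (EX_FRAME_SIZE - 1)
#define EX_PROD_NUM_DESCS   2048
#define EX_CONS_NUM_DESCS   2048

/* Two producer rings (TX, FILL) plus two consumer rings (RX, CQ), all full. */
#define EX_NUM_FRAMES       (2 * (EX_PROD_NUM_DESCS + EX_CONS_NUM_DESCS))

/* Index of the frame that contains umem-relative address 'addr'. */
static inline uint64_t
frame_index(uint64_t addr)
{
    return addr >> EX_FRAME_SHIFT;
}

/* Byte offset of 'addr' within its frame. */
static inline uint64_t
frame_offset(uint64_t addr)
{
    return addr & EX_FRAME_SHIFT_MASK;
}

/* Umem-relative address of the first byte of the frame containing 'addr'. */
static inline uint64_t
frame_base(uint64_t addr)
{
    return addr & ~(uint64_t) EX_FRAME_SHIFT_MASK;
}
```

With these assumed values EX_NUM_FRAMES is 8192, a power of two as the
BUILD_ASSERT_DECL requires, and address 3 * 4096 + 256 maps to frame index 3,
offset 256 and base 12288.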
diff --git a/tests/automake.mk b/tests/automake.mk
index bc906fb79b46..7db64faabc71 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -159,6 +161,10 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -183,6 +189,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -316,6 +323,11 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -353,6 +365,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..1e6f7a46b4b7
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,20 @@ 
+# Add a veth port to an OVS bridge using the afxdp netdev type.
+# This uses the generic XDP support in the veth driver.
+m4_define([ADD_VETH],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..9b7a29066614
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-common-macros.at])
+m4_include([tests/system-userspace-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+
+m4_include([tests/system-traffic.at])
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 08001dbce3d3..6195a8fd41cf 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3082,6 +3082,21 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="xdpmode"
+              type='{"type": "string",
+                     "enum": ["set", ["skb", "drv"]]}'>
+        <p>
+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled. This mode requires a device driver
+          with AF_XDP support and gives the best performance.
+          If "skb", the XDP program uses the kernel's generic XDP mode, with
+          extra data copying between userspace and kernel. No device driver
+          support is needed. This option applies only to the afxdp netdev
+          type. Defaults to "skb" mode.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
         <p>