[ovs-dev,PATCHv11] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1559767671-6175-1-git-send-email-u9012063@gmail.com
State Changes Requested

Commit Message

William Tu June 5, 2019, 8:47 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improve)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6-v7
- rebase to master, commit abf11558c1515bf3b1
- address feedback from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7-v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak

v8-v9:
- rebase to master 180bbbed3a3867d52
- Address review feedback from Ben, Ilya and Eelco, at:
  https://patchwork.ozlabs.org/patch/1097740/
- == From Ilya ==
- Optimize the reconfiguration logic
- Implement .rxq_recv and .send for afxdp
- Remove system-afxdp-traffic.at, reuse existing code
- Use Ilya's rdtsc code
- remove --disable-system
- == From Eelco ==
- Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
  assertion !fd != !wevent failed
- Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
- Clear xdp program when receive signal, ctrl+c
- Add options to vswitch.xml, set xdpmode default to skb-mode
- No support for ARM and PPC, now x86_64 only
- remove redundant header includes and function/macro definitions
- remove some ifdef HAVE_AF_XDP
- == From others/both about afxdp rx and tx ==
- Several umem push/pop error handling improvement/fixes
- add lock to address concurrent_txq case
- improve error handling
- add stats
- Things that are not done yet
- MTU limitation
- n_txq_desc/n_rxq_desc option.

v9-v10
- remove x86_64 limitation, suggested by Ben and Eelco
- add xmalloc_pagealign, free_pagealign
- minor refactor

v10-v11
- address feedback from Ilya at
  https://patchwork.ozlabs.org/patch/1106495/
- fix typos, and some refactoring
- refactor existing code and introduce xmalloc pagealign
- fix a couple of error handling cases
- allocate per-txq lock
- dynamic allocate xsk array
- fix cycle_counter_update() for non-x86/non-linux case
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  35 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  14 +
 lib/dp-packet.c                       |  28 ++
 lib/dp-packet.h                       |  18 +-
 lib/dpif-netdev-perf.h                |  26 +
 lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  74 +++
 lib/netdev-linux-private.h            | 139 ++++++
 lib/netdev-linux.c                    | 121 ++---
 lib/netdev-provider.h                 |   3 +
 lib/netdev.c                          |  11 +
 lib/spinlock.h                        |  70 +++
 lib/util.c                            |  92 +++-
 lib/util.h                            |   5 +
 lib/xdpsock.c                         | 170 +++++++
 lib/xdpsock.h                         | 101 ++++
 tests/automake.mk                     |  16 +
 tests/system-afxdp-macros.at          |  20 +
 tests/system-afxdp-testsuite.at       |  26 +
 vswitchd/vswitch.xml                  |  15 +
 25 files changed, 2204 insertions(+), 108 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/spinlock.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

Comments

Eelco Chaudron June 7, 2019, 3:43 p.m. UTC | #1
Hi William,

No review or full test yet, just some observations…

We run OVS as a non-root user, which causes OVS with XDP to fail:

2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 21 created.
2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd threads on numa node 0
2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1 in DRV mode
2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR: setrlimit(RLIMIT_MEMLOCK): Operation not permitted
2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all configure queue 0 mode DRV
2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create failed (Operation not permitted) mode: DRV qid: 0
2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create AF_XDP socket on queue 0
2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set interface eno1 new configuration

However, configuring this after startup works fine, but trying to
restart OVS with this configuration results in a core dump…
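
For reference, the first error in the log can be reproduced outside OVS. Below is a minimal sketch in C, assuming the patch tries to raise RLIMIT_MEMLOCK to unlimited before setting up the umem. The helper name raise_memlock_limit() is made up for illustration and is not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper, not taken from the patch: try to lift the
 * locked-memory limit.  Returns 0 on success or a positive errno value
 * on failure. */
static int
raise_memlock_limit(void)
{
    /* Assumption: both the soft and the hard limit are set to unlimited. */
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r) != 0) {
        int err = errno;

        fprintf(stderr, "setrlimit(RLIMIT_MEMLOCK): %s\n", strerror(err));
        return err;
    }
    return 0;
}
```

A process without CAP_SYS_RESOURCE cannot raise its hard limit, so a non-root caller gets EPERM from this call, which matches the "Operation not permitted" message above.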




On 5 Jun 2019, at 22:47, William Tu wrote:

> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..554964396353
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,433 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It aims to have comparable
> +performance to DPDK but cooperate better with the existing kernel networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see the Linux kernel's
> +Documentation/networking/af_xdp.rst.
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, e.g., system, tap, or
> +dpdk.  The AF_XDP feature adds a new netdev type called
> +"afxdp", and implements its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, called xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the afxdp netdev re-uses the existing userspace
> +dpif-netdev datapath.  As a result, most of the packet processing
> +happens in userspace instead of in the Linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> +  This is an OVS-independent benchmark tool for AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +First, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in system's library path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure having 5.0+ kernel
> +  make check-afxdp TESTSUITEFLAGS='1'
> +
> +If a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
> +set-up right::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd ...
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4)
> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> +
> +.. note::
> +   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
> +
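
One way to double-check that a pmd-cpu-mask covers every core named in pmd-rxq-affinity is to derive the mask from the affinity string. Below is a minimal sketch in C. The helper name affinity_to_cpu_mask() is hypothetical and not part of OVS, and the sketch assumes core IDs below 64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of OVS): given a pmd-rxq-affinity string
 * such as "0:1,1:2" (queue:core pairs), return the CPU mask containing
 * every core it names.  Returns 0 if the string is malformed or names a
 * core above 63. */
static uint64_t
affinity_to_cpu_mask(const char *spec)
{
    uint64_t mask = 0;
    const char *p = spec;

    assert(spec != NULL);
    while (*p != '\0') {
        char *end;
        unsigned long core;

        /* Queue ID: parsed only to validate the "queue:core" syntax. */
        (void) strtoul(p, &end, 10);
        if (end == p || *end != ':') {
            return 0;
        }
        p = end + 1;

        /* Core ID: sets the corresponding bit in the mask. */
        core = strtoul(p, &end, 10);
        if (end == p || core > 63) {
            return 0;
        }
        mask |= UINT64_C(1) << core;

        if (*end == ',') {
            p = end + 1;
        } else if (*end == '\0') {
            p = end;
        } else {
            return 0;
        }
    }
    return mask;
}
```

With this helper, "0:4" gives 0x10 and "0:1,1:2,2:3,3:4" gives 0x1e.
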
> +To validate that the bridge has successfully instantiated, you can use::
> +
> +  ovs-vsctl show
> +
> +It should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debugging by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> +section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +to AF_XDP's current and future work.
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing PMD
> +to keep polling the AF_XDP queues without any interference from the kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPU by setting isolcpus in the GRUB configuration.
> +
> +#. IRQs should not be assigned to cores running PMD threads.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use the Linux perf tool to see where your CPU
> +spends its cycles::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
> +#. No QoS support because AF_XDP netdev bypasses the Linux TC layer. A possible
> +   work-around is to use OpenFlow meter action.
> +#. AF_XDP device added to bridge, removed, and added again will fail.
> +#. Most of the tests are done using i40e single port. Multiple ports and
> +   the ixgbe driver also need to be tested.
> +#. No latency test result (TODO items)
> +
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +The performance number I got is around 1.6Mpps.
> +This is limited by the kernel's tap interface, which requires copying
> +each packet into the kernel from the umem buffer in userspace.
> +
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> +
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1
> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> +
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index cf9cc8b8b0de..721653ab0ec0 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP 
> support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP 
> support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index 2dbe9a9178e3..9e23e1c6958c 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index cc5dccf39d6b..b31e28f6e1f5 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-tc-offloads.c \
>  	lib/netdev-tc-offloads.h \
>  	lib/netlink-conntrack.c \
> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h \
> +	lib/spinlock.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..e6a7947076b4 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -19,6 +19,7 @@
>  #include <string.h>
>
>  #include "dp-packet.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>
> +#if HAVE_AF_XDP
> +/* Initializes 'b' as an empty dp_packet that contains the 'allocated'
> + * bytes of memory starting at 'base', which points into AF_XDP umem.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
>          }
>      }
>  }
> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..e3438226e360 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,7 @@
>  #include <rte_mbuf.h>
>  #endif
>
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +91,13 @@ struct dp_packet {
>      };
>  };
>
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..6b6dfda7db1c 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -21,6 +21,7 @@
>  #include <stddef.h>
>  #include <stdint.h>
>  #include <string.h>
> +#include <time.h>
>  #include <math.h>
>
>  #ifdef DPDK_NETDEV
> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>      char *log_reason;
>  };
>
> +#ifdef __linux__
> +static inline uint64_t
> +rdtsc_syscall(struct pmd_perf_stats *s)
> +{
> +    struct timespec val;
> +    uint64_t v;
> +
> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> +       return s->last_tsc;
> +    }
> +
> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> +    v += (uint64_t) val.tv_nsec;
> +
> +    return s->last_tsc = v;
> +}
> +#endif
> +
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>   * These functions are intended to be invoked in the context of pmd threads. */
>
> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif !defined(_MSC_VER) && defined(__x86_64__)
> +    uint32_t h, l;
> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> +
> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> +#elif defined(__linux__)
> +    return rdtsc_syscall(s);
>  #else
>      return s->last_tsc = 0;
>  #endif
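
For readers following the non-x86 fallback: rdtsc_syscall() above folds a
struct timespec into a single nanosecond count. A minimal sketch of just that
conversion follows. The helper name sketch_timespec_to_ns is hypothetical and
not part of the patch, and the range check is an added assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not in the patch): convert a timespec into a
 * 64-bit nanosecond count, the same arithmetic rdtsc_syscall() uses. */
static uint64_t
sketch_timespec_to_ns(const struct timespec *ts)
{
    /* Assumption: the caller passes a normalized timespec. */
    assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

    return (uint64_t) ts->tv_sec * UINT64_C(1000000000)
           + (uint64_t) ts->tv_nsec;
}
```

Note that such a value counts nanoseconds, not TSC cycles, so numbers taken on
this path are not directly comparable with the rdtsc path.
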
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..a6543e8f5126
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,891 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <stdlib.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#include "dp-packet.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "packets.h"
> +#include "socket-util.h"
> +#include "spinlock.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static uint32_t prog_id;
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +static int xsk_configure_all(struct netdev *netdev);
> +static void xsk_destroy_all(struct netdev *netdev);
> +
> +static struct xsk_umem_info *
> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> +{
> +    struct xsk_umem_config uconfig OVS_UNUSED;
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof *umem);
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +    if (ret) {
> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("umem_pool_init failed");
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("xpacket_pool_init failed");
> +        umem_pool_cleanup(&umem->mpool);
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> +                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> +    }
> +
> +    for (i = 0;
> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS - BATCH_SIZE);
> +    return xsk;
> +}
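
The FILL queue setup above follows a reserve / write slots / submit pattern.
A toy, single-threaded model of that pattern may help when reviewing it. All
names below (toy_ring, toy_reserve, toy_slot, toy_submit, toy_fill) are
hypothetical; this is not libbpf's ring implementation, which also caches
producer/consumer indices and uses memory barriers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u            /* Must be a power of two. */

struct toy_ring {
    uint32_t prod;                  /* Slots published by the producer. */
    uint32_t cons;                  /* Slots consumed by the consumer. */
    uint64_t addr[TOY_RING_SIZE];   /* One umem offset per slot. */
};

/* All-or-nothing reservation: returns 'nb' and stores the first slot
 * index in '*idx', or returns 0 when fewer than 'nb' slots are free. */
static uint32_t
toy_reserve(const struct toy_ring *r, uint32_t nb, uint32_t *idx)
{
    uint32_t free_slots = TOY_RING_SIZE - (r->prod - r->cons);

    if (nb > free_slots) {
        return 0;
    }
    *idx = r->prod;
    return nb;
}

/* Maps a free-running index onto a slot of the ring. */
static uint64_t *
toy_slot(struct toy_ring *r, uint32_t idx)
{
    return &r->addr[idx & (TOY_RING_SIZE - 1)];
}

/* Publishes 'nb' previously reserved and written slots. */
static void
toy_submit(struct toy_ring *r, uint32_t nb)
{
    assert(nb <= TOY_RING_SIZE - (r->prod - r->cons));
    r->prod += nb;
}

/* Fills 'n' slots with consecutive frame offsets; returns 'n' or 0. */
static uint32_t
toy_fill(struct toy_ring *r, uint32_t n, uint64_t frame_size)
{
    uint32_t idx = 0;

    if (!toy_reserve(r, n, &idx)) {
        return 0;
    }
    for (uint32_t i = 0; i < n; i++) {
        *toy_slot(r, idx + i) = (uint64_t) i * frame_size;
    }
    toy_submit(r, n);
    return n;
}
```

In this model a failed reservation changes nothing. That is the property the
retry loop in xsk_configure_socket() appears to depend on.
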
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +
> +    /* umem memory region */
> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free_pagealign(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free_pagealign(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +static int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk;
> +    int i, ifindex, n_rxq;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    n_rxq = netdev_n_rxq(netdev);
> +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +
> +    /* configure each queue */
> +    for (i = 0; i < n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            dev->xsks[i] = NULL;
> +            goto err;
> +        }
> +        dev->xsks[i] = xsk;
> +        xsk->rx_dropped = 0;
> +        xsk->tx_dropped = 0;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk)
> +{
> +    struct xsk_umem *umem;
> +
> +    umem = xsk->umem->umem;
> +    xsk_socket__delete(xsk->xsk);
> +    if (xsk_umem__delete(umem)) {
> +        VLOG_ERR("xsk_umem__delete failed");
> +    }
> +
> +    /* free the packet buffer */
> +    free_pagealign(xsk->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> +
> +    free(xsk->umem);
> +    free(xsk);
> +}
> +
> +static void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> +        if (dev->xsks && dev->xsks[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsks[i]);
> +            dev->xsks[i] = NULL;
> +        }
> +    }
> +
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +
> +    free(dev->xsks);
> +}
> +
> +static inline void OVS_UNUSED
> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *str_xdpmode;
> +    int xdpmode, new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> +        return EINVAL;
> +    }
> +
> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> +    if (!strcasecmp(str_xdpmode, "drv")) {
> +        xdpmode = XDP_ZEROCOPY;
> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> +        xdpmode = XDP_COPY;
> +    } else {
> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> +                 netdev_get_name(netdev), str_xdpmode);
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (dev->requested_n_rxq != new_n_rxq
> +        || dev->requested_xdpmode != xdpmode) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        dev->requested_xdpmode = xdpmode;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
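
The xdpmode handling in netdev_afxdp_set_config() comes down to a
case-insensitive match with "skb" as the default. A standalone sketch of that
decision, assuming a hypothetical enum and function name (the patch itself
maps onto XDP_COPY / XDP_ZEROCOPY and reads the value from an smap):

```c
#include <stddef.h>
#include <strings.h>    /* strcasecmp() (POSIX). */

/* Hypothetical stand-ins for the patch's XDP_COPY / XDP_ZEROCOPY values. */
enum sketch_xdpmode {
    SKETCH_MODE_SKB,
    SKETCH_MODE_DRV,
    SKETCH_MODE_INVALID
};

/* Returns the mode named by 's'.  A missing option selects SKB mode,
 * mirroring smap_get_def(args, "xdpmode", "skb") in the patch. */
static enum sketch_xdpmode
sketch_parse_xdpmode(const char *s)
{
    if (s == NULL || !strcasecmp(s, "skb")) {
        return SKETCH_MODE_SKB;
    }
    if (!strcasecmp(s, "drv")) {
        return SKETCH_MODE_DRV;
    }
    return SKETCH_MODE_INVALID;     /* The patch returns EINVAL here. */
}
```
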
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +static void
> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int n_txqs = netdev_n_rxq(netdev);
> +    int i;
> +
> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> +
> +    for (i = 0; i < n_txqs; i++) {
> +        ovs_spinlock_init(&dev->tx_locks[i]);
> +    }
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    free(dev->tx_locks);
> +
> +    netdev->n_rxq = dev->requested_n_rxq;
> +    netdev_afxdp_alloc_txq(netdev);
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig failed", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always uses numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +static void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t curr_prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    }
> +    if (prog_id == curr_prog_id) {
> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> +    } else if (!curr_prog_id) {
> +        VLOG_INFO("couldn't find a prog id on a given interface");
> +    } else {
> +        VLOG_INFO("program on interface changed, not removing");
> +    }
> +}
> +
> +void
> +signal_remove_xdp(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    VLOG_WARN("force remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
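
dp_packet_cast_afxdp() relies on the dp_packet being embedded in a
dp_packet_afxdp, so the wrapper can be recovered from a pointer to the member.
A minimal sketch of that idiom, using hypothetical stand-in types and a local
macro rather than OVS's CONTAINER_OF:

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct dp_packet / struct dp_packet_afxdp. */
struct sketch_packet {
    int source;
};

struct sketch_packet_afxdp {
    void *mpool;                    /* Owning pool, as in the patch. */
    struct sketch_packet packet;    /* Embedded member. */
};

/* Local equivalent of a container-of macro: step back from the member
 * to the start of the enclosing structure. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

static struct sketch_packet_afxdp *
sketch_cast_afxdp(struct sketch_packet *p)
{
    return SKETCH_CONTAINER_OF(p, struct sketch_packet_afxdp, packet);
}
```

The cast is only valid for packets that really live inside the wrapper, which
is why the patch asserts d->source == DPBUF_AFXDP before using it.
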
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    uintptr_t addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
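
free_afxdp_buf() recovers the start of a umem frame by masking the packet's
base pointer, and the receive path derives the metadata index by shifting the
descriptor address. A small sketch of that arithmetic follows. The constants
are hypothetical (2 KB frames); the real FRAME_SHIFT / FRAME_SHIFT_MASK values
live in lib/xdpsock.h, which is not quoted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values, assuming 2 KB frames. */
#define SKETCH_FRAME_SHIFT 11
#define SKETCH_FRAME_SIZE  ((uintptr_t) 1 << SKETCH_FRAME_SHIFT)
#define SKETCH_FRAME_MASK  (SKETCH_FRAME_SIZE - 1)

/* Rounds an address inside a frame down to the start of that frame. */
static uintptr_t
sketch_frame_base(uintptr_t addr)
{
    return addr & ~SKETCH_FRAME_MASK;
}

/* Frame number for a descriptor address (offset from the umem base). */
static uint64_t
sketch_frame_index(uint64_t desc_addr)
{
    return desc_addr >> SKETCH_FRAME_SHIFT;
}

/* Same arithmetic as the patch's UMEM2DESC(): offset of 'elem' from 'base'. */
static uint64_t
sketch_umem_to_desc(const void *elem, const void *base)
{
    assert((const char *) elem >= (const char *) base);
    return (uint64_t) ((const char *) elem - (const char *) base);
}
```

Masking an absolute pointer only yields the frame start when the umem base is
itself aligned to the frame size. The patch allocates the buffer with
xmalloc_pagealign(), which would satisfy this as long as a frame is no larger
than a page.
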
> +
> +static void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +    struct dp_packet_afxdp *xpacket = NULL;
> +    struct dp_packet *packet;
> +    void *elems[BATCH_SIZE];
> +    uintptr_t addr;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (xpacket->mpool) {
> +            void *base = dp_packet_base(packet);
> +
> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +            elems[i] = (void *)addr;
> +        }
> +    }
> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +    dp_packet_batch_init(batch);
> +}
> +
> +static inline void
> +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
> +{
> +    void *elems[BATCH_SIZE];
> +    int i;
> +
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx + i)->addr;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +
> +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
> +    }
> +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +
> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> +    xsk->rx_dropped += rcvd;
> +}
> +
> +int
> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +                      int *qfill)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev *netdev = rx->up.netdev;
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct umem_elem *elems[BATCH_SIZE];
> +    uint32_t idx_rx = 0, idx_fq = 0;
> +    struct xsk_socket_info *xsk;
> +    int qid = rxq_->queue_id;
> +    unsigned int rcvd, i;
> +    int ret = 0;
> +
> +    xsk = dev->xsks[qid];
> +    if (!xsk) {
> +        return 0;
> +    }
> +
> +    rx->fd = xsk_socket__fd(xsk->xsk);
> +
> +    /* See if there is any packet on RX queue,
> +     * if yes, idx_rx is the index having the packet.
> +     */
> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;
> +    }
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        handle_rx_fail(xsk, rcvd, idx_rx);
> +        return ENOMEM;
> +    }
> +
> +    /* Prepare for the FILL queue */
> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> +         * to move received packets from FILL queue to RX queue.
> +         */
> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> +        handle_rx_fail(xsk, rcvd, idx_rx);
> +        return ENOMEM;
> +    }
> +
> +    /* Setup a dp_packet batch from descriptors in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> +        packet = &xpacket->packet;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> +
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        /* Get one free umem, program it into FILL queue */
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> +
> +    if (qfill) {
> +        /* TODO: return the number of remaining packets in the queue. */
> +        *qfill = 0;
> +    }
> +
> +#ifdef AFXDP_DEBUG
> +    log_xsk_stat(xsk);
> +#endif
> +    return 0;
> +}
> +
> +static inline int
> +kick_tx(struct xsk_socket_info *xsk)
> +{
> +    int ret;
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +static inline bool
> +check_free_batch(struct dp_packet_batch *batch)
> +{
> +    struct umem_pool *first_mpool = NULL;
> +    struct dp_packet_afxdp *xpacket;
> +    struct dp_packet *packet;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (packet->source != DPBUF_AFXDP) {
> +            return false;
> +        }
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (i == 0) {
> +            first_mpool = xpacket->mpool;
> +            continue;
> +        }
> +        if (xpacket->mpool != first_mpool) {
> +            return false;
> +        }
> +    }
> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> +    return true;
> +}
> +
> +static inline void
> +afxdp_complete_tx(struct xsk_socket_info *xsk)
> +{
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    uint32_t idx_cq = 0;
> +    int tx_done, j, ret;
> +
> +    if (!xsk->outstanding_tx) {
> +        return;
> +    }
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +    }
> +
> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> +    /* Recycle back to umem pool before releasing the CQ entries. */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)xsk->umem->buffer + addr);
> +        elems_push[j] = elem;
> +    }
> +
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +        xsk->outstanding_tx -= tx_done;
> +    }
> +
> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> +}
> +
> +int
> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> +                        struct dp_packet_batch *batch,
> +                        bool concurrent_txq)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk = dev->xsks[qid];
> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct dp_packet *packet;
> +    bool free_batch = false;
> +    uint32_t idx = 0;
> +    int error = 0;
> +    int ret;
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        qid = qid % dev->up.n_txq;
> +        ovs_spin_lock(&dev->tx_locks[qid]);
> +    }
> +
> +    if (!xsk) {
> +        goto out;
> +    }
> +
> +    /* Process CQ first. */
> +    afxdp_complete_tx(xsk);
> +
> +    free_batch = check_free_batch(batch);
> +
> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        xsk->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> +        xsk->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem element we just popped from the
> +         * umem pool.
> +         * TODO: avoid this copy if the packet and the popped element
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> +    xsk->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk);
> +    if (OVS_UNLIKELY(ret)) {
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +    }
> +
> +out:
> +    if (free_batch) {
> +        free_afxdp_buf_batch(batch);
> +    } else {
> +        dp_packet_delete_batch(batch, true);
> +    }
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        ovs_spin_unlock(&dev->tx_locks[qid]);
> +    }
> +    return error;
> +}
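
The TX path above turns a umem element pointer into a descriptor 'addr' by
subtracting the umem base, and afxdp_complete_tx() does the reverse for
completed entries. This is not part of the patch, only a minimal sketch of
that round trip. The sketch_* names and SKETCH_FRAME_SIZE are hypothetical
and exist just for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame size, for illustration only. */
#define SKETCH_FRAME_SIZE 2048

/* Converts a pointer inside the umem buffer into the byte offset that a
 * descriptor carries in its 'addr' field. */
static inline uint64_t
sketch_ptr_to_addr(const void *base, const void *elem)
{
    assert((const char *) elem >= (const char *) base);
    return (uint64_t) ((const char *) elem - (const char *) base);
}

/* Converts a descriptor offset back into a pointer inside the umem buffer. */
static inline void *
sketch_addr_to_ptr(void *base, uint64_t addr)
{
    return (char *) base + addr;
}
```

Keeping both directions in small helpers like these would put the casts in
one place instead of repeating them in the TX and completion paths.
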
> +
> +int
> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> +{
> +   /* Done at reconfigure */
> +   return 0;
> +}
> +
> +void
> +netdev_afxdp_destruct(struct netdev *netdev_)
> +{
> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> +    /* Note: tc is by-passed when using drv-mode, but when using
> +     * skb-mode, we might need to clean up tc. */
> +
> +    xsk_destroy_all(netdev_);
> +    ovs_mutex_destroy(&netdev->mutex);
> +}
> +
> +int
> +netdev_afxdp_get_stats(const struct netdev *netdev,
> +                       struct netdev_stats *stats)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct netdev_stats dev_stats;
> +    struct xsk_socket_info *xsk;
> +    int error, i;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    error = get_stats_via_netlink(netdev, &dev_stats);
> +    if (error) {
> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> +    } else {
> +        /* Use kernel netdev's packet and byte counts */
> +        stats->rx_packets = dev_stats.rx_packets;
> +        stats->rx_bytes = dev_stats.rx_bytes;
> +        stats->tx_packets = dev_stats.tx_packets;
> +        stats->tx_bytes = dev_stats.tx_bytes;
> +
> +        stats->rx_errors           += dev_stats.rx_errors;
> +        stats->tx_errors           += dev_stats.tx_errors;
> +        stats->rx_dropped          += dev_stats.rx_dropped;
> +        stats->tx_dropped          += dev_stats.tx_dropped;
> +        stats->multicast           += dev_stats.multicast;
> +        stats->collisions          += dev_stats.collisions;
> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> +
> +        /* Account for the drops in each xsk. */
> +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
> +            xsk = dev->xsks[i];
> +            if (xsk) {
> +                stats->rx_dropped += xsk->rx_dropped;
> +                stats->tx_dropped += xsk->tx_dropped;
> +            }
> +        }
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +
> +    return error;
> +}
> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> new file mode 100644
> index 000000000000..dd2dc1a2064d
> --- /dev/null
> +++ b/lib/netdev-afxdp.h
> @@ -0,0 +1,74 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_AFXDP_H
> +#define NETDEV_AFXDP_H 1
> +
> +#include <config.h>
> +
> +#ifdef HAVE_AF_XDP
> +
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +/* These functions are Linux AF_XDP specific, so they should be used directly
> + * only by Linux-specific code. */
> +
> +#define MAX_XSKQ 16
> +
> +struct netdev;
> +struct xsk_socket_info;
> +struct xdp_umem;
> +struct dp_packet_batch;
> +struct smap;
> +struct dp_packet;
> +struct netdev_rxq;
> +struct netdev_stats;
> +
> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> +void netdev_afxdp_destruct(struct netdev *netdev_);
> +
> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> +                          struct dp_packet_batch *batch,
> +                          int *qfill);
> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> +                            struct dp_packet_batch *batch,
> +                            bool concurrent_txq);
> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                            char **errp);
> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> +                           struct netdev_stats *stats);
> +
> +void free_afxdp_buf(struct dp_packet *p);
> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> +void signal_remove_xdp(struct netdev *netdev);
> +
> +#else /* !HAVE_AF_XDP */
> +
> +#include "openvswitch/compiler.h"
> +
> +struct dp_packet;
> +
> +static inline void
> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> +{
> +    /* Nothing */
> +}
> +
> +#endif /* HAVE_AF_XDP */
> +#endif /* netdev-afxdp.h */
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> new file mode 100644
> index 000000000000..6a0388cf9dc3
> --- /dev/null
> +++ b/lib/netdev-linux-private.h
> @@ -0,0 +1,139 @@
> +/*
> + * Copyright (c) 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_LINUX_PRIVATE_H
> +#define NETDEV_LINUX_PRIVATE_H 1
> +
> +#include <config.h>
> +
> +#include <linux/filter.h>
> +#include <linux/gen_stats.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_tun.h>
> +#include <linux/types.h>
> +#include <linux/ethtool.h>
> +#include <linux/mii.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include "netdev-afxdp.h"
> +#include "netdev-provider.h"
> +#include "netdev-tc-offloads.h"
> +#include "netdev-vport.h"
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "timer.h"
> +#include "xdpsock.h"
> +
> +/* These functions are Linux specific, so they should be used directly only by
> + * Linux-specific code. */
> +
> +struct netdev;
> +
> +struct netdev_rxq_linux {
> +    struct netdev_rxq up;
> +    bool is_tap;
> +    int fd;
> +};
> +
> +void netdev_linux_run(const struct netdev_class *);
> +
> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> +                                  const char *flag_name, bool enable);
> +
> +int get_stats_via_netlink(const struct netdev *netdev_,
> +                          struct netdev_stats *stats);
> +
> +struct netdev_linux {
> +    struct netdev up;
> +
> +    /* Protects all members below. */
> +    struct ovs_mutex mutex;
> +
> +    unsigned int cache_valid;
> +
> +    bool miimon;                    /* Link status of last poll. */
> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> +    struct timer miimon_timer;
> +
> +    int netnsid;                    /* Network namespace ID. */
> +    /* The following are figured out "on demand" only.  They are only valid
> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> +    int ifindex;
> +    struct eth_addr etheraddr;
> +    int mtu;
> +    unsigned int ifi_flags;
> +    long long int carrier_resets;
> +    uint32_t kbits_rate;        /* Policing data. */
> +    uint32_t kbits_burst;
> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> +                                   0 or an errno value. */
> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> +                                 * or SIOCSIFMTU.
> +                                 */
> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> +    int netdev_policing_error;  /* Cached error code from set policing. */
> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> +
> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> +
> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> +    struct tc *tc;
> +
> +    /* For devices of class netdev_tap_class only. */
> +    int tap_fd;
> +    bool present;               /* If the device is present in the namespace */
> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> +
> +    /* LAG information. */
> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> +
> +    /* AF_XDP information */
> +#ifdef HAVE_AF_XDP
> +    struct xsk_socket_info **xsks;
> +    int requested_n_rxq;
> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> +    int xdp_flags, xdp_bind_flags;
> +    struct ovs_spinlock *tx_locks;
> +#endif
> +};
> +
> +static bool
> +is_netdev_linux_class(const struct netdev_class *netdev_class)
> +{
> +    return netdev_class->run == netdev_linux_run;
> +}
> +
> +static struct netdev_linux *
> +netdev_linux_cast(const struct netdev *netdev)
> +{
> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> +
> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> +}
> +
> +static struct netdev_rxq_linux *
> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> +{
> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> +
> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> +}
> +
> +#endif /* netdev-linux-private.h */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index f75d73fd39f8..2883cf1f2586 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -17,6 +17,7 @@
>  #include <config.h>
>
>  #include "netdev-linux.h"
> +#include "netdev-linux-private.h"
>
>  #include <errno.h>
>  #include <fcntl.h>
> @@ -54,6 +55,7 @@
>  #include "fatal-signal.h"
>  #include "hash.h"
>  #include "openvswitch/hmap.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-provider.h"
>  #include "netdev-tc-offloads.h"
>  #include "netdev-vport.h"
> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>  
> -struct netdev_linux {
> -    struct netdev up;
> -
> -    /* Protects all members below. */
> -    struct ovs_mutex mutex;
> -
> -    unsigned int cache_valid;
> -
> -    bool miimon;                    /* Link status of last poll. */
> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> -    struct timer miimon_timer;
> -
> -    int netnsid;                    /* Network namespace ID. */
> -    /* The following are figured out "on demand" only.  They are only valid
> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> -    int ifindex;
> -    struct eth_addr etheraddr;
> -    int mtu;
> -    unsigned int ifi_flags;
> -    long long int carrier_resets;
> -    uint32_t kbits_rate;        /* Policing data. */
> -    uint32_t kbits_burst;
> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> -                                   0 or an errno value. */
> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> -    int netdev_policing_error;  /* Cached error code from set policing. */
> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> -
> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> -
> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> -    struct tc *tc;
> -
> -    /* For devices of class netdev_tap_class only. */
> -    int tap_fd;
> -    bool present;               /* If the device is present in the namespace */
> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> -
> -    /* LAG information. */
> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> -};
> -
> -struct netdev_rxq_linux {
> -    struct netdev_rxq up;
> -    bool is_tap;
> -    int fd;
> -};
>
>  /* This is set pretty low because we probably won't learn anything from the
>   * additional log messages. */
> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>   * changes in the device miimon status, so we can use atomic_count. */
>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>
> -static void netdev_linux_run(const struct netdev_class *);
> -
>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>                                     int cmd, const char *cmd_name);
>  static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
>                         struct in_addr addr);
>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
>  static int af_packet_sock(void);
>  static bool netdev_linux_miimon_enabled(void);
>  static void netdev_linux_miimon_run(void);
> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
>
>  static bool
> -is_netdev_linux_class(const struct netdev_class *netdev_class)
> -{
> -    return netdev_class->run == netdev_linux_run;
> -}
> -
> -static bool
>  is_tap_netdev(const struct netdev *netdev)
>  {
>      return netdev_get_class(netdev) == &netdev_tap_class;
>  }
> -
> -static struct netdev_linux *
> -netdev_linux_cast(const struct netdev *netdev)
> -{
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> -
> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> -}
> -
> -static struct netdev_rxq_linux *
> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> -{
> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> -}
>  
>  static int
>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
>      }
>  }
>
> -static void
> +void
>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
>  {
>      struct nl_sock *sock;
> @@ -3279,9 +3206,7 @@ exit:
>      .run = netdev_linux_run,                                    \
>      .wait = netdev_linux_wait,                                  \
>      .alloc = netdev_linux_alloc,                                \
> -    .destruct = netdev_linux_destruct,                          \
>      .dealloc = netdev_linux_dealloc,                            \
> -    .send = netdev_linux_send,                                  \
>      .send_wait = netdev_linux_send_wait,                        \
>      .set_etheraddr = netdev_linux_set_etheraddr,                \
>      .get_etheraddr = netdev_linux_get_etheraddr,                \
> @@ -3312,10 +3237,8 @@ exit:
>      .arp_lookup = netdev_linux_arp_lookup,                      \
>      .update_flags = netdev_linux_update_flags,                  \
>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
> -    .rxq_construct = netdev_linux_rxq_construct,                \
>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
> -    .rxq_recv = netdev_linux_rxq_recv,                          \
>      .rxq_wait = netdev_linux_rxq_wait,                          \
>      .rxq_drain = netdev_linux_rxq_drain
>
> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "system",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_linux_get_stats,
>      .get_features = netdev_linux_get_features,
>      .get_status = netdev_linux_get_status,
> -    .get_block_id = netdev_linux_get_block_id
> +    .get_block_id = netdev_linux_get_block_id,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
>
>  const struct netdev_class netdev_tap_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      .type = "tap",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct_tap,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_tap_get_stats,
>      .get_features = netdev_linux_get_features,
>      .get_status = netdev_linux_get_status,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
>
>  const struct netdev_class netdev_internal_class = {
>      NETDEV_LINUX_CLASS_COMMON,
>      LINUX_FLOW_OFFLOAD_API,
>      .type = "internal",
> +    .is_pmd = false,
>      .construct = netdev_linux_construct,
> +    .destruct = netdev_linux_destruct,
>      .get_stats = netdev_internal_get_stats,
>      .get_status = netdev_internal_get_status,
> +    .send = netdev_linux_send,
> +    .rxq_construct = netdev_linux_rxq_construct,
> +    .rxq_recv = netdev_linux_rxq_recv,
>  };
> +
> +#ifdef HAVE_AF_XDP
> +const struct netdev_class netdev_afxdp_class = {
> +    NETDEV_LINUX_CLASS_COMMON,
> +    .type = "afxdp",
> +    .is_pmd = true,
> +    .construct = netdev_linux_construct,
> +    .destruct = netdev_afxdp_destruct,
> +    .get_stats = netdev_afxdp_get_stats,
> +    .get_status = netdev_linux_get_status,
> +    .set_config = netdev_afxdp_set_config,
> +    .get_config = netdev_afxdp_get_config,
> +    .reconfigure = netdev_afxdp_reconfigure,
> +    .get_numa_id = netdev_afxdp_get_numa_id,
> +    .send = netdev_afxdp_batch_send,
> +    .rxq_construct = netdev_afxdp_rxq_construct,
> +    .rxq_recv = netdev_afxdp_rxq_recv,
> +};
> +#endif
>  
>
>  #define CODEL_N_QUEUES 0x0000
> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
>      dst->tx_window_errors = src->tx_window_errors;
>  }
>
> -static int
> +int
>  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
>  {
>      struct ofpbuf request;
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e6e8e8..91e6a9e2bfc0 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
>  extern const struct netdev_class netdev_internal_class;
>  extern const struct netdev_class netdev_tap_class;
>
> +#ifdef HAVE_AF_XDP
> +extern const struct netdev_class netdev_afxdp_class;
> +#endif
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6f0946..0fac117cc602 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>
>  static void restore_all_flags(void *aux OVS_UNUSED);
>  void update_device_args(struct netdev *, const struct shash *args);
> +#ifdef HAVE_AF_XDP
> +void signal_remove_xdp(struct netdev *netdev);
> +#endif
>
>  int
>  netdev_n_txq(const struct netdev *netdev)
> @@ -146,6 +149,9 @@ netdev_initialize(void)
>          netdev_register_provider(&netdev_internal_class);
>          netdev_register_provider(&netdev_tap_class);
>          netdev_vport_tunnel_register();
> +#ifdef HAVE_AF_XDP
> +        netdev_register_provider(&netdev_afxdp_class);
> +#endif
>  #endif
>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>          netdev_register_provider(&netdev_tap_class);
> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
>                                                 saved_flags & ~saved_values,
>                                                 &old_flags);
>          }
> +#ifdef HAVE_AF_XDP
> +        if (netdev->netdev_class == &netdev_afxdp_class) {
> +            signal_remove_xdp(netdev);
> +        }
> +#endif
>      }
>  }
>
> diff --git a/lib/spinlock.h b/lib/spinlock.h
> new file mode 100644
> index 000000000000..1ae634f23a6b
> --- /dev/null
> +++ b/lib/spinlock.h
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#ifndef SPINLOCK_H
> +#define SPINLOCK_H 1
> +
> +#include <config.h>
> +
> +#include <ctype.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +
> +#include "ovs-atomic.h"
> +
> +struct ovs_spinlock {
> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
> +};
> +
> +static inline void
> +ovs_spinlock_init(struct ovs_spinlock *sl)
> +{
> +    atomic_init(&sl->locked, 0);
> +}
> +
> +static inline void
> +ovs_spin_lock(struct ovs_spinlock *sl)
> +{
> +    int exp = 0, locked = 0;
> +
> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed)) {
> +        locked = 1;
> +        while (locked) {
> +            atomic_read_relaxed(&sl->locked, &locked);
> +        }
> +        exp = 0;
> +    }
> +}
> +
> +static inline void
> +ovs_spin_unlock(struct ovs_spinlock *sl)
> +{
> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> +}
> +
> +static inline int
> +ovs_spin_trylock(struct ovs_spinlock *sl)
> +{
> +    int exp = 0;
> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> +                memory_order_acquire,
> +                memory_order_relaxed);
> +}
> +#endif
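
For readers following the locking scheme in spinlock.h: it is a
test-and-test-and-set lock. A compare-and-swap with acquire ordering takes
the lock, and on contention the waiter spins on a relaxed load before
retrying. Below is a standalone sketch of the same idea using plain C11
<stdatomic.h> rather than the ovs-atomic wrappers. The sketch_* names are
hypothetical, and the assert in unlock is an addition of mine, not something
the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_spinlock {
    atomic_int locked;          /* 0 = free, 1 = held. */
};

static inline void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool
sketch_spin_trylock(struct sketch_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    while (!sketch_spin_trylock(sl)) {
        /* Spin on a cheap relaxed load until the lock looks free, then
         * retry the compare-and-swap. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    /* Debug check only: unlocking a free lock is a caller bug. */
    assert(atomic_load_explicit(&sl->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Spinning on the load instead of on the compare-and-swap keeps waiters from
repeatedly requesting the cache line in exclusive state while the lock is
held.
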
> diff --git a/lib/util.c b/lib/util.c
> index 7b8ab81f6ee1..5eb20995b370 100644
> --- a/lib/util.c
> +++ b/lib/util.c
> @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
>      return xrealloc(p, *n * s);
>  }
>
> -/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
> - * dedicated cache lines.  That is, the memory block returned will not share a
> - * cache line with other data, avoiding "false sharing".
> +/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
> + * 'alignment' must be a power of two and a multiple of sizeof(void *).
>   *
> - * Use free_cacheline() to free the returned memory block. */
> + * Use free_size_align() to free the returned memory block. */
>  void *
> -xmalloc_cacheline(size_t size)
> +xmalloc_size_align(size_t size, size_t alignment)
>  {
>  #ifdef HAVE_POSIX_MEMALIGN
>      void *p;
>      int error;
>
>      COVERAGE_INC(util_xalloc);
> -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
> +    error = posix_memalign(&p, alignment, size ? size : 1);
>      if (error != 0) {
>          out_of_memory();
>      }
> @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
>  #else
>      /* Allocate room for:
>       *
> -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
> -     *       pointer to be aligned exactly sizeof(void *) bytes before the
> -     *       beginning of a cache line.
> +     *     - Header padding: Up to alignment - 1 bytes, to allow the
> +     *       pointer 'q' to be placed exactly sizeof(void *) bytes before the
> +     *       aligned address 'r'.
>       *
>       *     - Pointer: A pointer to the start of the header padding, to allow us
>       *       to free() the block later.
>       *
>       *     - User data: 'size' bytes.
>       *
> -     *     - Trailer padding: Enough to bring the user data up to a cache line
> +     *     - Trailer padding: Enough to bring the user data up to an alignment
>       *       multiple.
>       *
>       * +---------------+---------+------------------------+---------+
> @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
>       * p               q         r
>       *
>       */
> -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
> -                      + sizeof(void *)
> -                      + ROUND_UP(size, CACHE_LINE_SIZE));
> -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
> -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
> -                                CACHE_LINE_SIZE);
> -    void **q = (void **) r - 1;
> +    void *p, *r, **q;
> +    bool runt;
> +
> +    COVERAGE_INC(util_xalloc);
> +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
> +        ovs_abort(0, "Invalid alignment");
> +    }
> +
> +    p = xmalloc((alignment - 1)
> +                + sizeof(void *)
> +                + ROUND_UP(size, alignment));
> +
> +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
> +    /* When the padding size < sizeof(void *), we don't have enough room for
> +     * pointer 'q'.  As a result, we need to move 'r' to the next alignment
> +     * boundary.  So ROUND_UP when calling xmalloc above, and ROUND_UP again
> +     * when calculating 'r' below.
> +     */
> +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0),
> +                          alignment);
> +    q = (void **) r - 1;
>      *q = p;
> +
>      return r;
>  #endif
>  }
>
> +void
> +free_size_align(void *p)
> +{
> +#ifdef HAVE_POSIX_MEMALIGN
> +    free(p);
> +#else
> +    if (p) {
> +        void **q = (void **) p - 1;
> +        free(*q);
> +    }
> +#endif
> +}
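
The non-posix_memalign branch above is easy to get wrong, so here is a
standalone sketch of the same "stash the original pointer just before the
aligned block" technique. It is a simplified variant, not the patch's code:
instead of the 'runt' check it always reserves sizeof(void *) bytes ahead of
the aligned address, and it returns NULL instead of aborting. The sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', which must be a power of two
 * and a multiple of sizeof(void *).  Returns NULL if malloc() fails. */
static void *
sketch_aligned_malloc(size_t size, size_t alignment)
{
    assert(alignment && !(alignment & (alignment - 1)));
    assert(alignment % sizeof(void *) == 0);

    /* Worst case: alignment - 1 bytes of padding plus the stashed pointer. */
    void *p = malloc(alignment - 1 + sizeof(void *) + size);
    if (!p) {
        return NULL;
    }

    /* First aligned address that leaves room for one pointer before it. */
    uintptr_t start = (uintptr_t) p + sizeof(void *);
    uintptr_t r = (start + alignment - 1) & ~(uintptr_t) (alignment - 1);

    /* Remember the address malloc() returned so free() can be given it. */
    ((void **) r)[-1] = p;
    return (void *) r;
}

static void
sketch_aligned_free(void *r)
{
    if (r) {
        free(((void **) r)[-1]);
    }
}
```

Always reserving the pointer slot can use up to sizeof(void *) more padding
than the 'runt' approach, in exchange for one fewer branch to reason about.
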
> +
> +/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
> + * dedicated cache lines.  That is, the memory block returned will not share a
> + * cache line with other data, avoiding "false sharing".
> + *
> + * Use free_cacheline() to free the returned memory block. */
> +void *
> +xmalloc_cacheline(size_t size)
> +{
> +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
> +}
> +
>  /* Like xmalloc_cacheline() but clears the allocated memory to all zero
>   * bytes. */
>  void *
> @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
>  void
>  free_cacheline(void *p)
>  {
> -#ifdef HAVE_POSIX_MEMALIGN
> -    free(p);
> -#else
> -    if (p) {
> -        void **q = (void **) p - 1;
> -        free(*q);
> -    }
> -#endif
> +    free_size_align(p);
> +}
> +
> +void *
> +xmalloc_pagealign(size_t size)
> +{
> +    return xmalloc_size_align(size, get_page_size());
> +}
> +
> +void
> +free_pagealign(void *p)
> +{
> +    free_size_align(p);
>  }
>
>  char *
> diff --git a/lib/util.h b/lib/util.h
> index c26605abdce3..33665748274c 100644
> --- a/lib/util.h
> +++ b/lib/util.h
> @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size);
>
>  int string_ends_with(const char *str, const char *suffix);
>
> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> +void free_pagealign(void *);
> +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
> +void free_size_align(void *);
> +
>  /* The C standards say that neither the 'dst' nor 'src' argument to
>   * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
>   * the null case. */
> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> new file mode 100644
> index 000000000000..ea39fa557290
> --- /dev/null
> +++ b/lib/xdpsock.c
> @@ -0,0 +1,170 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +#include <config.h>
> +
> +#include "xdpsock.h"
> +#include "dp-packet.h"
> +#include "openvswitch/compiler.h"
> +
> +/* Note:
> + * umem_elem_push* shouldn't overflow because we always pop
> + * elem first, then push back to the stack.
> + */
> +static inline void
> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(ptr, addrs, n * sizeof(void *));
> +    umemp->index += n;
> +}
> +
> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    ovs_spin_lock(&umemp->lock);
> +    __umem_elem_push_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->lock);
> +}
> +
> +static inline void
> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    umemp->array[umemp->index++] = addr;
> +}
> +
> +void
> +umem_elem_push(struct umem_pool *umemp, void *addr)
> +{
> +
> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> +
> +    ovs_spin_lock(&umemp->lock);
> +    __umem_elem_push(umemp, addr);
> +    ovs_spin_unlock(&umemp->lock);
> +}
> +
> +static inline int
> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    void *ptr;
> +
> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->index -= n;
> +    ptr = &umemp->array[umemp->index];
> +    memcpy(addrs, ptr, n * sizeof(void *));
> +
> +    return 0;
> +}
> +
> +int
> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> +{
> +    int ret;
> +
> +    ovs_spin_lock(&umemp->lock);
> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> +    ovs_spin_unlock(&umemp->lock);
> +
> +    return ret;
> +}
> +
> +static inline void *
> +__umem_elem_pop(struct umem_pool *umemp)
> +{
> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> +        return NULL;
> +    }
> +
> +    return umemp->array[--umemp->index];
> +}
> +
> +void *
> +umem_elem_pop(struct umem_pool *umemp)
> +{
> +    void *ptr;
> +
> +    ovs_spin_lock(&umemp->lock);
> +    ptr = __umem_elem_pop(umemp);
> +    ovs_spin_unlock(&umemp->lock);
> +
> +    return ptr;
> +}
> +
> +static void **
> +__umem_pool_alloc(unsigned int size)
> +{
> +    void *bufs;
> +
> +    bufs = xmalloc_pagealign(size * sizeof(void *));
> +    memset(bufs, 0, size * sizeof(void *));
> +
> +    return (void **)bufs;
> +}
> +
> +int
> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> +{
> +    umemp->array = __umem_pool_alloc(size);
> +    if (!umemp->array) {
> +        return -ENOMEM;
> +    }
> +
> +    umemp->size = size;
> +    umemp->index = 0;
> +    ovs_spinlock_init(&umemp->lock);
> +    return 0;
> +}
> +
> +void
> +umem_pool_cleanup(struct umem_pool *umemp)
> +{
> +    free_pagealign(umemp->array);
> +    umemp->array = NULL;
> +}
> +
> +/* AF_XDP metadata init/destroy */
> +int
> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> +{
> +    void *bufs;
> +
> +    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> +
> +    xp->array = bufs;
> +    xp->size = size;
> +
> +    return 0;
> +}
> +
> +void
> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> +{
> +    free_pagealign(xp->array);
> +    xp->array = NULL;
> +}
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> new file mode 100644
> index 000000000000..1a1093381243
> --- /dev/null
> +++ b/lib/xdpsock.h
> @@ -0,0 +1,101 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef XDPSOCK_H
> +#define XDPSOCK_H 1
> +
> +#include <config.h>
> +
> +#ifdef HAVE_AF_XDP
> +
> +#include <bpf/xsk.h>
> +#include <errno.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +
> +#include "openvswitch/thread.h"
> +#include "ovs-atomic.h"
> +#include "spinlock.h"
> +
> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> +
> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> +
> +/* The worst case is that all 4 queues (TX/CQ/RX/FILL) are full.
> + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
> + */
> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> +
> +#define BATCH_SIZE      NETDEV_MAX_BURST
> +
> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
> +
> +/* LIFO ptr_array */
> +struct umem_pool {
> +    int index;      /* point to top */
> +    unsigned int size;
> +    struct ovs_spinlock lock;
> +    void **array;   /* a pointer array, point to umem buf */
> +};
> +
> +/* array-based dp_packet_afxdp */
> +struct xpacket_pool {
> +    unsigned int size;
> +    struct dp_packet_afxdp **array;
> +};
> +
> +struct xsk_umem_info {
> +    struct umem_pool mpool;
> +    struct xpacket_pool xpool;
> +    struct xsk_ring_prod fq;
> +    struct xsk_ring_cons cq;
> +    struct xsk_umem *umem;
> +    void *buffer;
> +};
> +
> +struct xsk_socket_info {
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_umem_info *umem;
> +    struct xsk_socket *xsk;
> +    unsigned long rx_dropped;
> +    unsigned long tx_dropped;
> +    uint32_t outstanding_tx;
> +};
> +
> +struct umem_elem {
> +    struct umem_elem *next;
> +};
> +
> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +void *umem_elem_pop(struct umem_pool *umemp);
> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> +
> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> +void umem_pool_cleanup(struct umem_pool *umemp);
> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> +
> +#endif
> +#endif
> diff --git a/tests/automake.mk b/tests/automake.mk
> index 2956e68b242c..131564bb0bd3 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>  	$(SYSTEM_TESTSUITE_AT) \
>  	$(SYSTEM_KMOD_TESTSUITE_AT) \
>  	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
> +	$(SYSTEM_AFXDP_TESTSUITE_AT) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>  	$(SYSTEM_DPDK_TESTSUITE_AT) \
>  	$(OVSDB_CLUSTER_TESTSUITE_AT) \
>  	$(TESTSUITE) \
>  	$(SYSTEM_KMOD_TESTSUITE) \
>  	$(SYSTEM_USERSPACE_TESTSUITE) \
> +	$(SYSTEM_AFXDP_TESTSUITE) \
>  	$(SYSTEM_OFFLOADS_TESTSUITE) \
>  	$(SYSTEM_DPDK_TESTSUITE) \
>  	$(OVSDB_CLUSTER_TESTSUITE) \
> @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>  	tests/system-userspace-macros.at \
>  	tests/system-userspace-packet-type-aware.at
>
> +SYSTEM_AFXDP_TESTSUITE_AT = \
> +	tests/system-afxdp-testsuite.at \
> +	tests/system-afxdp-macros.at
> +
>  SYSTEM_TESTSUITE_AT = \
>  	tests/system-common-macros.at \
>  	tests/system-ovn.at \
> @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>  SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> @@ -317,6 +324,11 @@ check-system-userspace: all
>  	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>
> +check-afxdp: all
> +	$(MAKE) install
> +	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> +	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> +
>  check-offloads: all
>  	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>  	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
>
> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> +	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> +	$(AM_V_at)mv $@.tmp $@
> +
>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>  	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>  	$(AM_V_at)mv $@.tmp $@
> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> new file mode 100644
> index 000000000000..1e6f7a46b4b7
> --- /dev/null
> +++ b/tests/system-afxdp-macros.at
> @@ -0,0 +1,20 @@
> +# Add port to ovs bridge by using afxdp mode.
> +# This will use generic XDP support in the veth driver.
> +m4_define([ADD_VETH],
> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> +      CONFIGURE_VETH_OFFLOADS([$1])
> +      AT_CHECK([ip link set $1 netns $2])
> +      AT_CHECK([ip link set dev ovs-$1 up])
> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> +      if test -n "$5"; then
> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> +      fi
> +      if test -n "$6"; then
> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> +      fi
> +      on_exit 'ip link del ovs-$1'
> +    ]
> +)
> diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
> new file mode 100644
> index 000000000000..9b7a29066614
> --- /dev/null
> +++ b/tests/system-afxdp-testsuite.at
> @@ -0,0 +1,26 @@
> +AT_INIT
> +
> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
> +
> +Licensed under the Apache License, Version 2.0 (the "License");
> +you may not use this file except in compliance with the License.
> +You may obtain a copy of the License at:
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> +
> +Unless required by applicable law or agreed to in writing, software
> +distributed under the License is distributed on an "AS IS" BASIS,
> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +See the License for the specific language governing permissions and
> +limitations under the License.])
> +
> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> +
> +m4_include([tests/ovs-macros.at])
> +m4_include([tests/ovsdb-macros.at])
> +m4_include([tests/ofproto-macros.at])
> +m4_include([tests/system-common-macros.at])
> +m4_include([tests/system-userspace-macros.at])
> +m4_include([tests/system-afxdp-macros.at])
> +
> +m4_include([tests/system-traffic.at])
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 89c06a1b7877..1e3acbbb8075 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>
> +      <column name="other_config" key="xdpmode"
> +              type='{"type": "string",
> +                     "enum": ["set", ["skb", "drv"]]}'>
> +        <p>
> +          Specifies the operational mode of the XDP program.
> +          If "drv", the XDP program is loaded into the device driver with
> +          zero-copy RX and TX enabled. This mode requires a device driver with
> +          AF_XDP support and has the best performance.
> +          If "skb", the XDP program uses the kernel's generic XDP mode, with
> +          extra data copying between userspace and kernel. No device driver
> +          support is needed. Note that this option applies to the afxdp netdev
> +          type only. Defaults to "skb" mode.
> +        </p>
> +      </column>
> +
>        <column name="options" key="vhost-server-path"
>                type='{"type": "string"}'>
>          <p>
> -- 
> 2.7.4
William Tu June 7, 2019, 9:33 p.m. UTC | #2
Hi Eelco,

Thanks for the testing.

On Fri, Jun 7, 2019 at 8:43 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> Hi William,
>
> No review or full test yet, just some observations…
>
> We run OVS as a non root user, which is causing OVS with XDP to fail:

Right, XDP requires root privileges.
I will add this to the documentation.

>
> 2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev:
> Datapath supports ct_orig_tuple
> 2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev:
> Datapath supports ct_orig_tuple6
> 2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on numa_id:
> 0, core id: 21 created.
> 2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd threads
> on numa node 0
> 2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
> 2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1 in
> DRV mode
> 2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR:
> setrlimit(RLIMIT_MEMLOCK): Operation not permitted

This is because OVS is not running with root privileges, so it cannot lock
the memory that the device driver needs to DMA packet buffers directly into
userspace.
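
To illustrate, here is a rough sketch of the kind of call that fails (not
the exact code in the patch; raise_memlock_limit is a made-up name, and I
am assuming the limit is raised to RLIM_INFINITY the way the kernel's
xdpsock sample does it). Without root (or CAP_SYS_RESOURCE), raising the
hard limit is normally refused with EPERM, which is the "Operation not
permitted" in the log above:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: try to lift RLIMIT_MEMLOCK so that the umem pages
 * can be pinned.  Returns 0 on success or a negative errno value. */
int
raise_memlock_limit(void)
{
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        int err = errno;    /* Save errno before calling anything else. */

        fprintf(stderr, "setrlimit(RLIMIT_MEMLOCK): %s\n", strerror(err));
        return -err;
    }
    return 0;
}
```

Calling raise_memlock_limit() as an unprivileged user whose hard limit is
finite should return -EPERM; as root it should return 0.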

Can you try using root?

Regards,
William

> 2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all
> configure queue 0 mode DRV
> 2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create
> failed (Operation not permitted) mode: DRV qid: 0
> 2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create AF_XDP
> socket on queue 0
> 2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
> 2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1
> reconfig fails
> 2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set interface
> eno1 new configuration
>
> However when configuring this after startup it’s fine, but trying to
> restart OVS with this configuration results in a system core…
>
>
>
>
> On 5 Jun 2019, at 22:47, William Tu wrote:
>
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> > type built upon the eBPF and XDP technology.  It aims to have comparable
> > performance to DPDK but cooperate better with the existing kernel
> > networking stack.  An AF_XDP socket receives and sends packets from an
> > eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
> > kernel's subsystems.  As a result, an AF_XDP socket shows much better
> > performance than AF_PACKET.  For more details about AF_XDP, please see the
> > Linux kernel's Documentation/networking/af_xdp.rst.  Note that by default,
> > this feature is not compiled in.
> >
> > Signed-off-by: William Tu <u9012063@gmail.com>
> > ---
> > v1->v2:
> > - add a list to maintain unused umem elements
> > - remove copy from rx umem to ovs internal buffer
> > - use hugetlb to reduce misses (not much difference)
> > - use pmd mode netdev in OVS (huge performance improvement)
> > - remove malloc dp_packet, instead put dp_packet in umem
> >
> > v2->v3:
> > - rebase on the OVS master, 7ab4b0653784
> >   ("configure: Check for more specific function to pull in pthread library.")
> > - remove the dependency on libbpf and dpif-bpf.
> >   instead, use the built-in XDP_ATTACH feature.
> > - data structure optimizations for better performance, see[1]
> > - more test cases support
> > v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> >
> > v3->v4:
> > - Use AF_XDP API provided by libbpf
> > - Remove the dependency on XDP_ATTACH kernel patch set
> > - Add documentation, bpf.rst
> >
> > v4->v5:
> > - rebase to master
> > - remove rfc, squash all into a single patch
> > - add --enable-afxdp, so by default, AF_XDP is not compiled
> > - add options: xdpmode=drv,skb
> > - add multiple queue and multiple PMD support, with options: n_rxq
> > - improve documentation, rename bpf.rst to af_xdp.rst
> >
> > v5->v6
> > - rebase to master, commit 0cdd5b13de91b98
> > - address errors from sparse and clang
> > - pass travis-ci test
> > - address feedback from Ben
> > - fix issues reported by 0-day robot
> > - improved documentation
> >
> > v6-v7
> > - rebase to master, commit abf11558c1515bf3b1
> > - address feedback from Ilya, Ben, and Eelco, see:
> >   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> > - add XDP mode change, implement get/set_config, reconfigure
> > - Fix reconfiguration/crash issue caused by libbpf, see patch:
> >   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> > - perf optimization for batching umem_push/pop
> > - perf optimization for batching kick_tx
> > - test build with dpdk
> > - fix/refactor atomic operation
> > - make AF_XDP x86 specific, otherwise fail at build time
> > - lots of code refactoring
> > - add PVP setup in documentation
> >
> > v7-v8:
> > - Address feedback from Ilya at:
> >   https://patchwork.ozlabs.org/patch/1095019/
> > - add netdev-linux-private.h
> > - fix afxdp reconfigure issue
> > - sort include headers
> > - remove unnecessary OVS_UNUSED
> > - coding style fixes
> > - error case handling and memory leak
> >
> > v8-v9:
> > - rebase to master 180bbbed3a3867d52
> > - Address review feedback from Ben, Ilya and Eelco, at:
> >   https://patchwork.ozlabs.org/patch/1097740/
> > - == From Ilya ==
> > - Optimize the reconfiguration logic
> > - Implement .rxq_recv and .send for afxdp
> > - Remove system-afxdp-traffic.at, reuse existing code
> > - Use Ilya's rdtsc code
> > - remove --disable-system
> > - == From Eelco ==
> > - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
> >   assertion !fd != !wevent failed
> > - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
> > - Clear xdp program when receive signal, ctrl+c
> > - Add options to vswitch.xml, set xdpmode default to skb-mode
> > - No support for ARM and PPC, now x86_64 only
> > - remove redundant header includes and function/macro definitions
> > - remove some ifdef HAVE_AF_XDP
> > - == From others/both about afxdp rx and tx ==
> > - Several umem push/pop error handling improvement/fixes
> > - add lock to address concurrent_txq case
> > - improve error handling
> > - add stats
> > - Things that are not done yet
> > - MTU limitation
> > - n_txq_desc/n_rxq_desc option.
> >
> > v9-v10
> > - remove x86_64 limitation, suggested by Ben and Eelco
> > - add xmalloc_pagealign, free_pagealign
> > - minor refactor
> >
> > v10-v11
> > - address feedback from Ilya at
> >   https://patchwork.ozlabs.org/patch/1106495/
> > - fix typos, and some refactoring
> > - refactor existing code and introduce xmalloc pagealign
> > - fix a couple of error handling cases
> > - allocate per-txq lock
> > - dynamic allocate xsk array
> > - fix cycle_counter_update() for non-x86/non-linux case
> > ---
> >  Documentation/automake.mk             |   1 +
> >  Documentation/index.rst               |   1 +
> >  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
> >  Documentation/intro/install/index.rst |   1 +
> >  acinclude.m4                          |  35 ++
> >  configure.ac                          |   1 +
> >  lib/automake.mk                       |  14 +
> >  lib/dp-packet.c                       |  28 ++
> >  lib/dp-packet.h                       |  18 +-
> >  lib/dpif-netdev-perf.h                |  26 +
> >  lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
> >  lib/netdev-afxdp.h                    |  74 +++
> >  lib/netdev-linux-private.h            | 139 ++++++
> >  lib/netdev-linux.c                    | 121 ++---
> >  lib/netdev-provider.h                 |   3 +
> >  lib/netdev.c                          |  11 +
> >  lib/spinlock.h                        |  70 +++
> >  lib/util.c                            |  92 +++-
> >  lib/util.h                            |   5 +
> >  lib/xdpsock.c                         | 170 +++++++
> >  lib/xdpsock.h                         | 101 ++++
> >  tests/automake.mk                     |  16 +
> >  tests/system-afxdp-macros.at          |  20 +
> >  tests/system-afxdp-testsuite.at       |  26 +
> >  vswitchd/vswitch.xml                  |  15 +
> >  25 files changed, 2204 insertions(+), 108 deletions(-)
> >  create mode 100644 Documentation/intro/install/afxdp.rst
> >  create mode 100644 lib/netdev-afxdp.c
> >  create mode 100644 lib/netdev-afxdp.h
> >  create mode 100644 lib/netdev-linux-private.h
> >  create mode 100644 lib/spinlock.h
> >  create mode 100644 lib/xdpsock.c
> >  create mode 100644 lib/xdpsock.h
> >  create mode 100644 tests/system-afxdp-macros.at
> >  create mode 100644 tests/system-afxdp-testsuite.at
> >
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 082438e09a33..11cc59efc881 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -10,6 +10,7 @@ DOC_SOURCE = \
> >       Documentation/intro/why-ovs.rst \
> >       Documentation/intro/install/index.rst \
> >       Documentation/intro/install/bash-completion.rst \
> > +     Documentation/intro/install/afxdp.rst \
> >       Documentation/intro/install/debian.rst \
> >       Documentation/intro/install/documentation.rst \
> >       Documentation/intro/install/distributions.rst \
> > diff --git a/Documentation/index.rst b/Documentation/index.rst
> > index 46261235c732..aa9e7c49f179 100644
> > --- a/Documentation/index.rst
> > +++ b/Documentation/index.rst
> > @@ -59,6 +59,7 @@ vSwitch? Start here.
> >    :doc:`intro/install/windows` |
> >    :doc:`intro/install/xenserver` |
> >    :doc:`intro/install/dpdk` |
> > +  :doc:`intro/install/afxdp` |
> >    :doc:`Installation FAQs <faq/releases>`
> >
> >  - **Tutorials:** :doc:`tutorials/faucet` |
> > diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> > new file mode 100644
> > index 000000000000..554964396353
> > --- /dev/null
> > +++ b/Documentation/intro/install/afxdp.rst
> > @@ -0,0 +1,433 @@
> > +..
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +
> > +========================
> > +Open vSwitch with AF_XDP
> > +========================
> > +
> > +This document describes how to build and install Open vSwitch using
> > +AF_XDP netdev.
> > +
> > +.. warning::
> > +  The AF_XDP support of Open vSwitch is considered 'experimental',
> > +  and it is not compiled in by default.
> > +
> > +
> > +Introduction
> > +------------
> > +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> > +built upon the eBPF and XDP technology.  It aims to have comparable
> > +performance to DPDK but cooperate better with the existing kernel networking
> > +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> > +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> > +For more details about AF_XDP, please see the Linux kernel's
> > +Documentation/networking/af_xdp.rst
> > +
> > +
> > +AF_XDP Netdev
> > +-------------
> > +OVS has a couple of netdev types, e.g., system, tap, or
> > +dpdk.  The AF_XDP feature adds a new netdev type called
> > +"afxdp", and implements its configuration, packet reception,
> > +and transmit functions.  Since the AF_XDP socket, called xsk,
> > +operates in userspace, once ovs-vswitchd receives packets
> > +from xsk, the afxdp netdev re-uses the existing userspace
> > +dpif-netdev datapath.  As a result, most of the packet processing
> > +happens in userspace instead of in the Linux kernel.
> > +
> > +::
> > +
> > +              |   +-------------------+
> > +              |   |    ovs-vswitchd   |<-->ovsdb-server
> > +              |   +-------------------+
> > +              |   |      ofproto      |<-->OpenFlow controllers
> > +              |   +--------+-+--------+
> > +              |   | netdev | |ofproto-|
> > +    userspace |   +--------+ |  dpif  |
> > +              |   | afxdp  | +--------+
> > +              |   | netdev | |  dpif  |
> > +              |   +---||---+ +--------+
> > +              |       ||     |  dpif- |
> > +              |       ||     | netdev |
> > +              |_      ||     +--------+
> > +                      ||
> > +               _  +---||-----+--------+
> > +              |   | AF_XDP prog +     |
> > +       kernel |   |   xsk_map         |
> > +              |_  +--------||---------+
> > +                           ||
> > +                        physical
> > +                           NIC
> > +
> > +
> > +Build requirements
> > +------------------
> > +
> > +In addition to the requirements described in :doc:`general`, building Open
> > +vSwitch with AF_XDP will require the following:
> > +
> > +- libbpf from kernel source tree (kernel 5.0.0 or later)
> > +
> > +- Linux kernel XDP support, with the following options (required)
> > +
> > +  * CONFIG_BPF=y
> > +
> > +  * CONFIG_BPF_SYSCALL=y
> > +
> > +  * CONFIG_XDP_SOCKETS=y
> > +
> > +
> > +- The following optional Kconfig options are also recommended, but not
> > +  required:
> > +
> > +  * CONFIG_BPF_JIT=y (Performance)
> > +
> > +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> > +
> > +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> > +
> > +- Once your AF_XDP-enabled kernel is ready, if possible, run
> > +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> > +  This is an OVS-independent benchmark tool for AF_XDP.
> > +  It makes sure your basic kernel requirements are met for AF_XDP.
> > +
> > +
> > +Installing
> > +----------
> > +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> > +First, clone a recent version of Linux bpf-next tree::
> > +
> > +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> > +
> > +Second, go into the Linux source directory and build libbpf in the tools
> > +directory::
> > +
> > +  cd bpf-next/
> > +  cd tools/lib/bpf/
> > +  make && make install
> > +  make install_headers
> > +
> > +.. note::
> > +   Make sure xsk.h and bpf.h are installed in system's library path,
> > +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> > +
> > +Make sure the libbpf.so is installed correctly::
> > +
> > +  ldconfig
> > +  ldconfig -p | grep libbpf
> > +
> > +Third, ensure the standard OVS requirements are installed and
> > +bootstrap/configure the package::
> > +
> > +  ./boot.sh && ./configure --enable-afxdp
> > +
> > +Finally, build and install OVS::
> > +
> > +  make && make install
> > +
> > +To kick start end-to-end autotesting::
> > +
> > +  uname -a # make sure the kernel is 5.0 or later
> > +  make check-afxdp TESTSUITEFLAGS='1'
> > +
> > +If a test case fails, check the log at::
> > +
> > +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> > +
> > +
> > +Setup AF_XDP netdev
> > +-------------------
> > +Before running OVS with AF_XDP, make sure that libbpf and libelf are
> > +set up correctly::
> > +
> > +  ldd vswitchd/ovs-vswitchd
> > +
> > +Open vSwitch should be started using userspace datapath as described
> > +in :doc:`general`::
> > +
> > +  ovs-vswitchd ...
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> > +
> > +Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
> > +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Or, use 4 pmds/cores (cores 1-4) and 4 queues by doing::
> > +
> > +  ethtool -L enp2s0 combined 4
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > +    options:n_rxq=4 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > +
> > +.. note::
> > +   pmd-rxq-affinity is optional. If not specified, the system will auto-assign.
> > +
> > +To validate that the bridge has been successfully instantiated, run::
> > +
> > +  ovs-vsctl show
> > +
> > +It should show something like::
> > +
> > +  Port "ens802f0"
> > +   Interface "ens802f0"
> > +      type: afxdp
> > +      options: {n_rxq="1", xdpmode=drv}
> > +
> > +Otherwise, enable debugging by::
> > +
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +
> > +References
> > +----------
> > +Most of the design details are described in the paper presented at the
> > +Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
> > +vSwitch"[1], section 4, and in the slides[2][4].
> > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> > +to current and future AF_XDP work.
> > +
> > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > +
> > +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > +
> > +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > +
> > +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > +
> > +
> > +Performance Tuning
> > +------------------
> > +The name of the game is to keep your CPU running in userspace, allowing PMD
> > +to keep polling the AF_XDP queues without any interference from the kernel.
> > +
> > +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> > +   running cores, device plug-in slot)
> > +
> > +#. Isolate your CPU by setting isolcpus in the grub configuration.
> > +
> > +#. IRQs should not be assigned to a pmd running core.
> > +
> > +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> > +
> > +
> > +Debugging performance issue
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +While running the traffic, use the Linux perf tool to see where your CPU
> > +spends its cycles::
> > +
> > +  cd bpf-next/tools/perf
> > +  make
> > +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> > +  ./perf report
> > +
> > +Measure your system call rate by doing::
> > +
> > +  pstree -p `pidof ovs-vswitchd`
> > +  strace -c -p <your pmd's PID>
> > +
> > +Or, use OVS pmd tool::
> > +
> > +  ovs-appctl dpif-netdev/pmd-stats-show
> > +
> > +
> > +Example Script
> > +--------------
> > +
> > +Below is a script using namespaces and veth peer::
> > +
> > +  #!/bin/bash
> > +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> > +    --disable-system --detach
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> > +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> > +    fail-mode=secure datapath_type=netdev
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> > +
> > +  ip netns add at_ns0
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +  ip link add p0 type veth peer name afxdp-p0
> > +  ip link set p0 netns at_ns0
> > +  ip link set dev afxdp-p0 up
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> > +
> > +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> > +  ip addr add "10.1.1.1/24" dev p0
> > +  ip link set dev p0 up
> > +  NS_EXEC_HEREDOC
> > +
> > +  ip netns add at_ns1
> > +  ip link add p1 type veth peer name afxdp-p1
> > +  ip link set p1 netns at_ns1
> > +  ip link set dev afxdp-p1 up
> > +
> > +  ovs-vsctl add-port br0 afxdp-p1 -- \
> > +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> > +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> > +  ip addr add "10.1.1.2/24" dev p1
> > +  ip link set dev p1 up
> > +  NS_EXEC_HEREDOC
> > +
> > +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> > +
> > +
> > +Limitations/Known Issues
> > +------------------------
> > +#. Device's NUMA ID is always 0; a way to find the NUMA ID from a
> > +   netdev is needed.
> > +#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer.
> > +   A possible work-around is to use the OpenFlow meter action.
> > +#. Adding an AF_XDP device to a bridge, removing it, and adding it again
> > +   will fail.
> > +#. Most of the tests were done using a single i40e port. Multiple ports
> > +   and the ixgbe driver also need to be tested.
> > +#. No latency test results yet (TODO item).
> > +
> > +
> > +PVP using tap device
> > +--------------------
> > +Assume you have enp2s0 as the physical NIC and a tap device connected
> > +to the VM.
> > +First, start OVS, then add physical port::
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Start a VM with virtio and tap device::
> > +
> > +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > +    -m 4096 \
> > +    -cpu host,+x2apic -enable-kvm \
> > +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> > +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> > +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> > +    -object memory-backend-file,id=mem,size=4096M,\
> > +      mem-path=/dev/hugepages,share=on \
> > +    -numa node,memdev=mem -mem-prealloc -smp 2
> > +
> > +Create OpenFlow rules::
> > +
> > +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > +
> > +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> > +
> > +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > +
> > +The measured performance is around 1.6Mpps.
> > +The rate is limited by the kernel's tap interface, which requires
> > +copying each packet into the kernel from the umem buffer in userspace.
> > +
> > +
> > +PVP using vhostuser device
> > +--------------------------
> > +First, build OVS with DPDK and AFXDP::
> > +
> > +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> > +  make -j4 && make install
> > +
> > +Create a vhost-user port from OVS::
> > +
> > +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> > +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> > +    other_config:pmd-cpu-mask=0xfff
> > +  ovs-vsctl add-port br0 vhost-user-1 \
> > +    -- set Interface vhost-user-1 type=dpdkvhostuser
> > +
> > +Start VM using vhost-user mode::
> > +
> > +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > +   -m 4096 \
> > +   -cpu host,+x2apic -enable-kvm \
> > +   -chardev
> > socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> > +   -netdev
> > type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> > +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> > +      netdev=mynet1,mq=on,vectors=10 \
> > +   -object memory-backend-file,id=mem,size=4096M,\
> > +      mem-path=/dev/hugepages,share=on \
> > +   -numa node,memdev=mem -mem-prealloc -smp 2
> > +
> > +Set up the OpenFlow rules::
> > +
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0,
> > actions=output:vhost-user-1"
> > +  ovs-ofctl add-flow br0 "in_port=vhost-user-1,
> > actions=output:enp2s0"
> > +
> > +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> > +
> > +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> > +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > +
> > +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> > +
> > +
> > +PCP container using veth
> > +------------------------
> > +Create namespace and veth peer devices::
> > +
> > +  ip netns add at_ns0
> > +  ip link add p0 type veth peer name afxdp-p0
> > +  ip link set p0 netns at_ns0
> > +  ip link set dev afxdp-p0 up
> > +  ip netns exec at_ns0 ip link set dev p0 up
> > +
> > +Attach the veth port to br0 (linux kernel mode)::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 options:n_rxq=1
> > +
> > +Or, use AF_XDP with skb mode::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
> > options:xdpmode=skb
> > +
> > +Set up the OpenFlow rules::
> > +
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> > +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> > +
> > +In the namespace, drop or bounce back the packets::
> > +
> > +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> > +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> > +
> > +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> > +
> > +
> > +Bug Reporting
> > +-------------
> > +
> > +Please report problems to dev@openvswitch.org.
> > diff --git a/Documentation/intro/install/index.rst
> > b/Documentation/intro/install/index.rst
> > index 3193c736cf17..c27a9c9d16ff 100644
> > --- a/Documentation/intro/install/index.rst
> > +++ b/Documentation/intro/install/index.rst
> > @@ -45,6 +45,7 @@ Installation from Source
> >     xenserver
> >     userspace
> >     dpdk
> > +   afxdp
> >
> >  Installation from Packages
> >  --------------------------
> > diff --git a/acinclude.m4 b/acinclude.m4
> > index cf9cc8b8b0de..721653ab0ec0 100644
> > --- a/acinclude.m4
> > +++ b/acinclude.m4
> > @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
> >    ])
> >  ])
> >
> > +dnl OVS_CHECK_LINUX_AF_XDP
> > +dnl
> > +dnl Check both Linux kernel AF_XDP and libbpf support
> > +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> > +  AC_ARG_ENABLE([afxdp],
> > +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP
> > support])],
> > +                [], [enable_afxdp=no])
> > +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> > +  if test "$enable_afxdp" != yes; then
> > +    AC_MSG_RESULT([no])
> > +    AF_XDP_ENABLE=false
> > +  else
> > +    AC_MSG_RESULT([yes])
> > +    AF_XDP_ENABLE=true
> > +
> > +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> > +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP
> > support])])
> > +
> > +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> > +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP
> > support])])
> > +
> > +    AC_CHECK_HEADER([bpf/xsk.h], [],
> > +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> > +
> > +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> > +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP
> > support])])
> > +
> > +    AC_DEFINE([HAVE_AF_XDP], [1],
> > +              [Define to 1 if AF_XDP support is available and
> > enabled.])
> > +    LIBBPF_LDADD=" -lbpf -lelf"
> > +    AC_SUBST([LIBBPF_LDADD])
> > +  fi
> > +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> > +])
> > +
> >  dnl OVS_CHECK_DPDK
> >  dnl
> >  dnl Configure DPDK source tree
> > diff --git a/configure.ac b/configure.ac
> > index 2dbe9a9178e3..9e23e1c6958c 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
> >  OVS_CHECK_DOT
> >  OVS_CHECK_IF_DL
> >  OVS_CHECK_STRTOK_R
> > +OVS_CHECK_LINUX_AF_XDP
> >  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
> >  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct
> > stat.st_mtimensec],
> >    [], [], [[#include <sys/stat.h>]])
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index cc5dccf39d6b..b31e28f6e1f5 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -14,6 +14,10 @@ if WIN32
> >  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
> >  endif
> >
> > +if HAVE_AF_XDP
> > +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> > +endif
> > +
> >  lib_libopenvswitch_la_LDFLAGS = \
> >          $(OVS_LTINFO) \
> >          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> > @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
> >       lib/if-notifier.h \
> >       lib/netdev-linux.c \
> >       lib/netdev-linux.h \
> > +     lib/netdev-linux-private.h \
> >       lib/netdev-tc-offloads.c \
> >       lib/netdev-tc-offloads.h \
> >       lib/netlink-conntrack.c \
> > @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
> >       lib/tc.h
> >  endif
> >
> > +if HAVE_AF_XDP
> > +lib_libopenvswitch_la_SOURCES += \
> > +     lib/xdpsock.c \
> > +     lib/xdpsock.h \
> > +     lib/netdev-afxdp.c \
> > +     lib/netdev-afxdp.h \
> > +     lib/spinlock.h
> > +endif
> > +
> >  if DPDK_NETDEV
> >  lib_libopenvswitch_la_SOURCES += \
> >       lib/dpdk.c \
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index 0976a35e758b..e6a7947076b4 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -19,6 +19,7 @@
> >  #include <string.h>
> >
> >  #include "dp-packet.h"
> > +#include "netdev-afxdp.h"
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "util.h"
> > @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base,
> > size_t allocated)
> >      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> >  }
> >
> > +#if HAVE_AF_XDP
> > +/* Initialize 'b' as an empty dp_packet that contains
> > + * memory starting at AF_XDP umem base.
> > + */
> > +void
> > +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t
> > allocated)
> > +{
> > +    dp_packet_set_base(b, base);
> > +    dp_packet_set_data(b, base);
> > +    dp_packet_set_size(b, 0);
> > +
> > +    dp_packet_set_allocated(b, allocated);
> > +    b->source = DPBUF_AFXDP;
> > +    dp_packet_reset_offsets(b);
> > +    pkt_metadata_init(&b->md, 0);
> > +    dp_packet_reset_cutlen(b);
> > +    dp_packet_reset_offload(b);
> > +    b->packet_type = htonl(PT_ETH);
> > +}
> > +#endif
> > +
> >  /* Initializes 'b' as an empty dp_packet that contains the
> > 'allocated' bytes of
> >   * memory starting at 'base'.  'base' should point to a buffer on the
> > stack.
> >   * (Nothing actually relies on 'base' being allocated on the stack.
> > It could
> > @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
> >               * created as a dp_packet */
> >              free_dpdk_buf((struct dp_packet*) b);
> >  #endif
> > +        } else if (b->source == DPBUF_AFXDP) {
> > +            free_afxdp_buf(b);
> >          }
> >      }
> >  }
> > @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t
> > new_headroom, size_t new_tailroom
> >      case DPBUF_STACK:
> >          OVS_NOT_REACHED();
> >
> > +    case DPBUF_AFXDP:
> > +        OVS_NOT_REACHED();
> > +
> >      case DPBUF_STUB:
> >          b->source = DPBUF_MALLOC;
> >          new_base = xmalloc(new_allocated);
> > @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
> >  {
> >      void *p;
> >      ovs_assert(b->source != DPBUF_DPDK);
> > +    ovs_assert(b->source != DPBUF_AFXDP);
> >
> >      if (b->source == DPBUF_MALLOC && dp_packet_data(b) ==
> > dp_packet_base(b)) {
> >          p = dp_packet_data(b);
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index a5e9ade1244a..e3438226e360 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -25,6 +25,7 @@
> >  #include <rte_mbuf.h>
> >  #endif
> >
> > +#include "netdev-afxdp.h"
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/list.h"
> >  #include "packets.h"
> > @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
> >      DPBUF_DPDK,                /* buffer data is from DPDK allocated
> > memory.
> >                                  * ref to dp_packet_init_dpdk() in
> > dp-packet.c.
> >                                  */
> > +    DPBUF_AFXDP,               /* buffer data from XDP frame */
> >  };
> >
> >  #define DP_PACKET_CONTEXT_SIZE 64
> > @@ -89,6 +91,13 @@ struct dp_packet {
> >      };
> >  };
> >
> > +#if HAVE_AF_XDP
> > +struct dp_packet_afxdp {
> > +    struct umem_pool *mpool;
> > +    struct dp_packet packet;
> > +};
> > +#endif
> > +
> >  static inline void *dp_packet_data(const struct dp_packet *);
> >  static inline void dp_packet_set_data(struct dp_packet *, void *);
> >  static inline void *dp_packet_base(const struct dp_packet *);
> > @@ -122,7 +131,9 @@ static inline const void
> > *dp_packet_get_nd_payload(const struct dp_packet *);
> >  void dp_packet_use(struct dp_packet *, void *, size_t);
> >  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
> >  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> > -
> > +#if HAVE_AF_XDP
> > +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> > +#endif
> >  void dp_packet_init_dpdk(struct dp_packet *);
> >
> >  void dp_packet_init(struct dp_packet *, size_t);
> > @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
> >              return;
> >          }
> >
> > +        if (b->source == DPBUF_AFXDP) {
> > +            free_afxdp_buf(b);
> > +            return;
> > +        }
> > +
> >          dp_packet_uninit(b);
> >          free(b);
> >      }
> > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > index 859c05613ddf..6b6dfda7db1c 100644
> > --- a/lib/dpif-netdev-perf.h
> > +++ b/lib/dpif-netdev-perf.h
> > @@ -21,6 +21,7 @@
> >  #include <stddef.h>
> >  #include <stdint.h>
> >  #include <string.h>
> > +#include <time.h>
> >  #include <math.h>
> >
> >  #ifdef DPDK_NETDEV
> > @@ -186,6 +187,24 @@ struct pmd_perf_stats {
> >      char *log_reason;
> >  };
> >
> > +#ifdef __linux__
> > +static inline uint64_t
> > +rdtsc_syscall(struct pmd_perf_stats *s)
> > +{
> > +    struct timespec val;
> > +    uint64_t v;
> > +
> > +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> > +       return s->last_tsc;
> > +    }
> > +
> > +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> > +    v += (uint64_t) val.tv_nsec;
> > +
> > +    return s->last_tsc = v;
> > +}
> > +#endif
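
A minimal standalone sketch of the clock_gettime() fallback quoted above, for
anyone who wants to try it outside the OVS tree. The names `struct ns_counter`
and `mono_ns()` are hypothetical stand-ins for `struct pmd_perf_stats` and
`rdtsc_syscall()`; they are not part of the patch. Note that on this path the
"cycle" counter is really a nanosecond count, not a TSC value.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* For CLOCK_MONOTONIC_RAW on Linux. */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the 'last_tsc' field of struct pmd_perf_stats. */
struct ns_counter {
    uint64_t last_ns;
};

/* Returns a monotonic timestamp in nanoseconds and caches it in 'c'.
 * If clock_gettime() fails, the previously cached value is returned
 * unchanged, mirroring the fallback in the quoted rdtsc_syscall(). */
static uint64_t
mono_ns(struct ns_counter *c)
{
    struct timespec ts;

    assert(c != NULL);
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        return c->last_ns;
    }

    c->last_ns = (uint64_t) ts.tv_sec * UINT64_C(1000000000)
                 + (uint64_t) ts.tv_nsec;
    return c->last_ns;
}
```

Two consecutive calls on the same counter should never go backwards, which is
the property the PMD statistics code relies on when it subtracts timestamps.
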
> > +
> >  /* Support for accurate timing of PMD execution on TSC clock cycle
> > level.
> >   * These functions are intended to be invoked in the context of pmd
> > threads. */
> >
> > @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
> >  {
> >  #ifdef DPDK_NETDEV
> >      return s->last_tsc = rte_get_tsc_cycles();
> > +#elif !defined(_MSC_VER) && defined(__x86_64__)
> > +    uint32_t h, l;
> > +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> > +
> > +    return s->last_tsc = ((uint64_t) h << 32) | l;
> > +#elif defined(__linux__)
> > +    return rdtsc_syscall(s);
> >  #else
> >      return s->last_tsc = 0;
> >  #endif
> > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> > new file mode 100644
> > index 000000000000..a6543e8f5126
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.c
> > @@ -0,0 +1,891 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#include <config.h>
> > +
> > +#include "netdev-linux-private.h"
> > +#include "netdev-linux.h"
> > +#include "netdev-afxdp.h"
> > +
> > +#include <errno.h>
> > +#include <inttypes.h>
> > +#include <linux/rtnetlink.h>
> > +#include <linux/if_xdp.h>
> > +#include <net/if.h>
> > +#include <stdlib.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +
> > +#include "dp-packet.h"
> > +#include "dpif-netdev.h"
> > +#include "openvswitch/dynamic-string.h"
> > +#include "openvswitch/vlog.h"
> > +#include "packets.h"
> > +#include "socket-util.h"
> > +#include "spinlock.h"
> > +#include "util.h"
> > +#include "xdpsock.h"
> > +
> > +#ifndef SOL_XDP
> > +#define SOL_XDP 283
> > +#endif
> > +
> > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > +
> > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
> > *)base))
> > +#define UMEM2XPKT(base, i) \
> > +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base
> > + \
> > +                               i * sizeof(struct dp_packet_afxdp))
> > +
> > +static uint32_t prog_id;
> > +static struct xsk_socket_info *xsk_configure(int ifindex, int
> > xdp_queue_id,
> > +                                             int mode);
> > +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> > +static void xsk_destroy(struct xsk_socket_info *xsk);
> > +static int xsk_configure_all(struct netdev *netdev);
> > +static void xsk_destroy_all(struct netdev *netdev);
> > +
> > +static struct xsk_umem_info *
> > +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> > +{
> > +    struct xsk_umem_config uconfig OVS_UNUSED;
> > +    struct xsk_umem_info *umem;
> > +    int ret;
> > +    int i;
> > +
> > +    umem = xcalloc(1, sizeof *umem);
> > +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq,
> > &umem->cq,
> > +                           NULL);
> > +    if (ret) {
> > +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> > +                 ovs_strerror(errno),
> > +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> > +        free(umem);
> > +        return NULL;
> > +    }
> > +
> > +    umem->buffer = buffer;
> > +
> > +    /* set-up umem pool */
> > +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> > +        VLOG_ERR("umem_pool_init failed");
> > +        if (xsk_umem__delete(umem->umem)) {
> > +            VLOG_ERR("xsk_umem__delete failed");
> > +        }
> > +        free(umem);
> > +        return NULL;
> > +    }
> > +
> > +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > +        struct umem_elem *elem;
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)umem->buffer + i * FRAME_SIZE);
> > +        umem_elem_push(&umem->mpool, elem);
> > +    }
> > +
> > +    /* set-up metadata */
> > +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> > +        VLOG_ERR("xpacket_pool_init failed");
> > +        umem_pool_cleanup(&umem->mpool);
> > +        if (xsk_umem__delete(umem->umem)) {
> > +            VLOG_ERR("xsk_umem__delete failed");
> > +        }
> > +        free(umem);
> > +        return NULL;
> > +    }
> > +
> > +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> > +              umem->xpool.array,
> > +              (char *)umem->xpool.array +
> > +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> > +
> > +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> > +        xpacket->mpool = &umem->mpool;
> > +
> > +        packet = &xpacket->packet;
> > +        packet->source = DPBUF_AFXDP;
> > +    }
> > +
> > +    return umem;
> > +}
> > +
> > +static struct xsk_socket_info *
> > +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> > +                     uint32_t queue_id, int xdpmode)
> > +{
> > +    struct xsk_socket_config cfg;
> > +    struct xsk_socket_info *xsk;
> > +    char devname[IF_NAMESIZE];
> > +    uint32_t idx = 0;
> > +    int ret;
> > +    int i;
> > +
> > +    xsk = xcalloc(1, sizeof(*xsk));
> > +    xsk->umem = umem;
> > +    cfg.rx_size = CONS_NUM_DESCS;
> > +    cfg.tx_size = PROD_NUM_DESCS;
> > +    cfg.libbpf_flags = 0;
> > +
> > +    if (xdpmode == XDP_ZEROCOPY) {
> > +        cfg.bind_flags = XDP_ZEROCOPY;
> > +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
> > XDP_FLAGS_DRV_MODE;
> > +    } else {
> > +        cfg.bind_flags = XDP_COPY;
> > +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
> > XDP_FLAGS_SKB_MODE;
> > +    }
> > +
> > +    if (if_indextoname(ifindex, devname) == NULL) {
> > +        VLOG_ERR("ifindex %d to devname failed (%s)",
> > +                 ifindex, ovs_strerror(errno));
> > +        free(xsk);
> > +        return NULL;
> > +    }
> > +
> > +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id,
> > umem->umem,
> > +                             &xsk->rx, &xsk->tx, &cfg);
> > +    if (ret) {
> > +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> > +                 ovs_strerror(errno),
> > +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> > +                 queue_id);
> > +        free(xsk);
> > +        return NULL;
> > +    }
> > +
> > +    /* Make sure the built-in AF_XDP program is loaded */
> > +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> > +    if (ret) {
> > +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> > +        xsk_socket__delete(xsk->xsk);
> > +        free(xsk);
> > +        return NULL;
> > +    }
> > +
> > +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue
> > */
> > +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> > +                                   PROD_NUM_DESCS - BATCH_SIZE,
> > &idx)) {
> > +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL
> > queue");
> > +    }
> > +
> > +    for (i = 0;
> > +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
> > +         i += FRAME_SIZE) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        elem = umem_elem_pop(&xsk->umem->mpool);
> > +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> > +
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> > +    }
> > +
> > +    xsk_ring_prod__submit(&xsk->umem->fq,
> > +                          PROD_NUM_DESCS - BATCH_SIZE);
> > +    return xsk;
> > +}
> > +
> > +static struct xsk_socket_info *
> > +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> > +{
> > +    struct xsk_socket_info *xsk;
> > +    struct xsk_umem_info *umem;
> > +    void *bufs;
> > +
> > +    /* umem memory region */
> > +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> > +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> > +
> > +    /* create AF_XDP socket */
> > +    umem = xsk_configure_umem(bufs,
> > +                              NUM_FRAMES * FRAME_SIZE,
> > +                              xdpmode);
> > +    if (!umem) {
> > +        free_pagealign(bufs);
> > +        return NULL;
> > +    }
> > +
> > +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> > +    if (!xsk) {
> > +        /* clean up umem and xpacket pool */
> > +        if (xsk_umem__delete(umem->umem)) {
> > +            VLOG_ERR("xsk_umem__delete failed");
> > +        }
> > +        free_pagealign(bufs);
> > +        umem_pool_cleanup(&umem->mpool);
> > +        xpacket_pool_cleanup(&umem->xpool);
> > +        free(umem);
> > +    }
> > +    return xsk;
> > +}
> > +
> > +static int
> > +xsk_configure_all(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct xsk_socket_info *xsk;
> > +    int i, ifindex, n_rxq;
> > +
> > +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > +
> > +    n_rxq = netdev_n_rxq(netdev);
> > +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> > +
> > +    /* configure each queue */
> > +    for (i = 0; i < n_rxq; i++) {
> > +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> > +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> > +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> > +        if (!xsk) {
> > +            VLOG_ERR("failed to create AF_XDP socket on queue %d",
> > i);
> > +            dev->xsks[i] = NULL;
> > +            goto err;
> > +        }
> > +        dev->xsks[i] = xsk;
> > +        xsk->rx_dropped = 0;
> > +        xsk->tx_dropped = 0;
> > +    }
> > +
> > +    return 0;
> > +
> > +err:
> > +    xsk_destroy_all(netdev);
> > +    return EINVAL;
> > +}
> > +
> > +static void
> > +xsk_destroy(struct xsk_socket_info *xsk)
> > +{
> > +    struct xsk_umem *umem;
> > +
> > +    umem = xsk->umem->umem;
> > +    xsk_socket__delete(xsk->xsk);
> > +    if (xsk_umem__delete(umem)) {
> > +        VLOG_ERR("xsk_umem__delete failed");
> > +    }
> > +
> > +    /* free the packet buffer */
> > +    free_pagealign(xsk->umem->buffer);
> > +
> > +    /* cleanup umem pool */
> > +    umem_pool_cleanup(&xsk->umem->mpool);
> > +
> > +    /* cleanup metadata pool */
> > +    xpacket_pool_cleanup(&xsk->umem->xpool);
> > +
> > +    free(xsk->umem);
> > +    free(xsk);
> > +}
> > +
> > +static void
> > +xsk_destroy_all(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int i, ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > +
> > +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> > +        if (dev->xsks && dev->xsks[i]) {
> > +            VLOG_INFO("destroy xsk[%d]", i);
> > +            xsk_destroy(dev->xsks[i]);
> > +            dev->xsks[i] = NULL;
> > +        }
> > +    }
> > +
> > +    VLOG_INFO("remove xdp program");
> > +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > +
> > +    free(dev->xsks);
> > +}
> > +
> > +static inline void OVS_UNUSED
> > +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> > +    struct xdp_statistics stat;
> > +    socklen_t optlen;
> > +
> > +    optlen = sizeof stat;
> > +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP,
> > XDP_STATISTICS,
> > +               &stat, &optlen) == 0);
> > +
> > +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid
> > %llu",
> > +                stat.rx_dropped,
> > +                stat.rx_invalid_descs,
> > +                stat.tx_invalid_descs);
> > +}
> > +
> > +int
> > +netdev_afxdp_set_config(struct netdev *netdev, const struct smap
> > *args,
> > +                        char **errp OVS_UNUSED)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    const char *str_xdpmode;
> > +    int xdpmode, new_n_rxq;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > +    if (new_n_rxq > MAX_XSKQ) {
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> > +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> > +        return EINVAL;
> > +    }
> > +
> > +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> > +    if (!strcasecmp(str_xdpmode, "drv")) {
> > +        xdpmode = XDP_ZEROCOPY;
> > +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> > +        xdpmode = XDP_COPY;
> > +    } else {
> > +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> > +                 netdev_get_name(netdev), str_xdpmode);
> > +        ovs_mutex_unlock(&dev->mutex);
> > +        return EINVAL;
> > +    }
> > +
> > +    if (dev->requested_n_rxq != new_n_rxq
> > +        || dev->requested_xdpmode != xdpmode) {
> > +        dev->requested_n_rxq = new_n_rxq;
> > +        dev->requested_xdpmode = xdpmode;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_config(const struct netdev *netdev, struct smap
> > *args)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> > +    smap_add_format(args, "xdpmode", "%s",
> > +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return 0;
> > +}
> > +
> > +static void
> > +netdev_afxdp_alloc_txq(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int n_txqs = netdev_n_rxq(netdev);
> > +    int i;
> > +
> > +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> > +
> > +    for (i = 0; i < n_txqs; i++) {
> > +        ovs_spinlock_init(&dev->tx_locks[i]);
> > +    }
> > +}
> > +
> > +int
> > +netdev_afxdp_reconfigure(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> > +    int err = 0;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    if (netdev->n_rxq == dev->requested_n_rxq
> > +        && dev->xdpmode == dev->requested_xdpmode) {
> > +        goto out;
> > +    }
> > +
> > +    xsk_destroy_all(netdev);
> > +    free(dev->tx_locks);
> > +
> > +    netdev->n_rxq = dev->requested_n_rxq;
> > +    netdev_afxdp_alloc_txq(netdev);
> > +
> > +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> > +        VLOG_INFO("AF_XDP device %s in DRV mode",
> > netdev_get_name(netdev));
> > +        /* From SKB mode to DRV mode */
> > +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
> > XDP_FLAGS_DRV_MODE;
> > +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> > +        dev->xdpmode = XDP_ZEROCOPY;
> > +
> > +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> > +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> > +                      ovs_strerror(errno));
> > +        }
> > +    } else {
> > +        VLOG_INFO("AF_XDP device %s in SKB mode",
> > netdev_get_name(netdev));
> > +        /* From DRV mode to SKB mode */
> > +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
> > XDP_FLAGS_SKB_MODE;
> > +        dev->xdp_bind_flags = XDP_COPY;
> > +        dev->xdpmode = XDP_COPY;
> > +        /* TODO: set rlimit back to previous value
> > +         * when no device is in DRV mode.
> > +         */
> > +    }
> > +
> > +    err = xsk_configure_all(netdev);
> > +    if (err) {
> > +        VLOG_ERR("AF_XDP device %s reconfig fails",
> > netdev_get_name(netdev));
> > +    }
> > +    netdev_change_seq_changed(netdev);
> > +out:
> > +    ovs_mutex_unlock(&dev->mutex);
> > +    return err;
> > +}
> > +
> > +int
> > +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> > +{
> > +    /* FIXME: Get netdev's PCIe device ID, then find
> > +     * its NUMA node id.
> > +     */
> > +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> > +              netdev_get_name(netdev));
> > +    return 0;
> > +}
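
One possible way to address the FIXME above, sketched under assumptions: on
Linux, PCI-backed network devices usually expose their NUMA node in
`/sys/class/net/<dev>/device/numa_node`. The helper name
`netdev_numa_id_from_sysfs()` is hypothetical and not part of the patch, and
the fallback to node 0 for virtual devices (veth, tap) or unknown affinity is
a design choice of this sketch, not something the patch specifies.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads /sys/class/net/<devname>/device/numa_node.
 * Returns the NUMA node id, or 0 if the file is missing, unreadable, or
 * reports a negative value (the kernel uses -1 for "no affinity"). */
static int
netdev_numa_id_from_sysfs(const char *devname)
{
    char path[256];
    FILE *f;
    int node = 0;
    int n;

    assert(devname != NULL);
    n = snprintf(path, sizeof path,
                 "/sys/class/net/%s/device/numa_node", devname);
    if (n < 0 || (size_t) n >= sizeof path) {
        return 0;               /* Name too long to form a valid path. */
    }

    f = fopen(path, "r");
    if (!f) {
        return 0;               /* Virtual device or no sysfs entry. */
    }
    if (fscanf(f, "%d", &node) != 1 || node < 0) {
        node = 0;
    }
    fclose(f);
    return node;
}
```

A caller could pass `netdev_get_name(netdev)` as `devname`; whether that fits
the locking and naming rules in netdev-linux is left open here.
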
> > +
> > +static void
> > +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> > +{
> > +    uint32_t curr_prog_id = 0;
> > +    uint32_t flags;
> > +
> > +    /* remove_xdp_program() */
> > +    if (xdpmode == XDP_COPY) {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > +    } else {
> > +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > +    }
> > +
> > +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    }
> > +    if (prog_id == curr_prog_id) {
> > +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > +    } else if (!curr_prog_id) {
> > +        VLOG_INFO("couldn't find a prog id on a given interface");
> > +    } else {
> > +        VLOG_INFO("program on interface changed, not removing");
> > +    }
> > +}
> > +
> > +void
> > +signal_remove_xdp(struct netdev *netdev)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    int ifindex;
> > +
> > +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > +
> > +    VLOG_WARN("force remove xdp program");
> > +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > +}
> > +
> > +static struct dp_packet_afxdp *
> > +dp_packet_cast_afxdp(const struct dp_packet *d)
> > +{
> > +    ovs_assert(d->source == DPBUF_AFXDP);
> > +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> > +}
> > +
> > +void
> > +free_afxdp_buf(struct dp_packet *p)
> > +{
> > +    struct dp_packet_afxdp *xpacket;
> > +    uintptr_t addr;
> > +
> > +    xpacket = dp_packet_cast_afxdp(p);
> > +    if (xpacket->mpool) {
> > +        void *base = dp_packet_base(p);
> > +
> > +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > +        umem_elem_push(xpacket->mpool, (void *)addr);
> > +    }
> > +}
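
On the masking used in free_afxdp_buf() above: a minimal sketch of how an
address inside a frame maps back to the start of that frame. It assumes
2 KiB frames (FRAME_SHIFT of 11, which is an assumption here, not taken from
this hunk) and a umem whose start is frame aligned. frame_base() is a
hypothetical helper for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: 2 KiB frames. */
#define FRAME_SHIFT 11
#define FRAME_SHIFT_MASK ((1u << FRAME_SHIFT) - 1)

/* Round an address inside a umem frame down to the start of that frame.
 * 'umem_start' is the address of the first frame in the umem. */
static inline uintptr_t
frame_base(uintptr_t umem_start, uintptr_t addr)
{
    /* Masking only yields a frame start if the umem is frame aligned. */
    assert((umem_start & FRAME_SHIFT_MASK) == 0);
    assert(addr >= umem_start);

    /* Widen the mask before inverting so the high bits are kept. */
    return addr & ~(uintptr_t)FRAME_SHIFT_MASK;
}
```

Any pointer between the frame start and the frame end, for example the
packet base after the headroom, resolves to the same element that was
originally popped from the pool.
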
> > +
> > +static void
> > +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet_afxdp *xpacket = NULL;
> > +    struct dp_packet *packet;
> > +    void *elems[BATCH_SIZE];
> > +    uintptr_t addr;
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (xpacket->mpool) {
> > +            void *base = dp_packet_base(packet);
> > +
> > +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > +            elems[i] = (void *)addr;
> > +        }
> > +    }
> > +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> > +    dp_packet_batch_init(batch);
> > +}
> > +
> > +static inline void
> > +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
> > +{
> > +    void *elems[BATCH_SIZE];
> > +    int i;
> > +
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx + i)->addr;
> > +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > +
> > +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
> > +    }
> > +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +
> > +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > +    xsk->rx_dropped += rcvd;
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> > +                      int *qfill)
> > +{
> > +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> > +    struct netdev *netdev = rx->up.netdev;
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct umem_elem *elems[BATCH_SIZE];
> > +    uint32_t idx_rx = 0, idx_fq = 0;
> > +    struct xsk_socket_info *xsk;
> > +    int qid = rxq_->queue_id;
> > +    unsigned int rcvd, i;
> > +    int ret = 0;
> > +
> > +    xsk = dev->xsks[qid];
> > +    if (!xsk) {
> > +        return 0;
> > +    }
> > +
> > +    rx->fd = xsk_socket__fd(xsk->xsk);
> > +
> > +    /* See if there is any packet on RX queue,
> > +     * if yes, idx_rx is the index having the packet.
> > +     */
> > +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> > +    if (!rcvd) {
> > +        return 0;
> > +    }
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        handle_rx_fail(xsk, rcvd, idx_rx);
> > +        return ENOMEM;
> > +    }
> > +
> > +    /* Prepare for the FILL queue */
> > +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> > +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> > +         * to move received packets from FILL queue to RX queue.
> > +         */
> > +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > +        handle_rx_fail(xsk, rcvd, idx_rx);
> > +        return ENOMEM;
> > +    }
> > +
> > +    /* Setup a dp_packet batch from descriptors in RX queue */
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> > +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > +        uint64_t index;
> > +
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        index = addr >> FRAME_SHIFT;
> > +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> > +        packet = &xpacket->packet;
> > +
> > +        /* Initialize the struct dp_packet */
> > +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> > +        dp_packet_set_size(packet, len);
> > +
> > +        /* Add packet into batch, increase batch->count */
> > +        dp_packet_batch_add(batch, packet);
> > +
> > +        idx_rx++;
> > +    }
> > +    /* Release the RX queue */
> > +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > +
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t index;
> > +        struct umem_elem *elem;
> > +
> > +        /* Get one free umem, program it into FILL queue */
> > +        elem = elems[i];
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> > +
> > +        idx_fq++;
> > +    }
> > +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> > +
> > +    if (qfill) {
> > +        /* TODO: return the number of remaining packets in the queue. */
> > +        *qfill = 0;
> > +    }
> > +
> > +#ifdef AFXDP_DEBUG
> > +    log_xsk_stat(xsk);
> > +#endif
> > +    return 0;
> > +}
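
For readers following the address arithmetic in netdev_afxdp_rxq_recv():
descriptor addresses are byte offsets into the umem, so the frame index is
a shift and the FILL-queue address is a pointer difference. A rough sketch
under the assumption of 2 KiB frames (FRAME_SHIFT of 11); the helper names
are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: 2 KiB frames. */
#define FRAME_SHIFT 11
#define FRAME_SHIFT_MASK ((1u << FRAME_SHIFT) - 1)

/* Frame index for an RX descriptor address (a byte offset into the umem).
 * The offset may point past the frame start, e.g. after the headroom. */
static inline uint64_t
desc_addr_to_index(uint64_t addr)
{
    return addr >> FRAME_SHIFT;
}

/* FILL-queue address for a free element: its byte offset from the start
 * of the umem buffer, which must land on a frame boundary. */
static inline uint64_t
elem_to_fill_addr(const void *umem_buffer, const void *elem)
{
    uint64_t off = (uint64_t)((const char *)elem - (const char *)umem_buffer);

    assert((off & FRAME_SHIFT_MASK) == 0);
    return off;
}
```
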
> > +
> > +static inline int
> > +kick_tx(struct xsk_socket_info *xsk)
> > +{
> > +    int ret;
> > +
> > +    /* This causes system call into kernel's xsk_sendmsg, and
> > +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> > +     */
> > +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +    if (OVS_UNLIKELY(ret < 0)) {
> > +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> > +            return errno;
> > +        }
> > +    }
> > +    /* no error, or EBUSY or EAGAIN */
> > +    return 0;
> > +}
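
The errno handling in kick_tx() amounts to a small classification: ENXIO,
ENOBUFS and EOPNOTSUPP are returned to the caller, everything else
(including EBUSY and EAGAIN) is treated as "try again later". A sketch of
that rule as a standalone predicate; kick_errno_is_reported() is a
hypothetical name used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Returns true for the errno values that kick_tx() passes back to its
 * caller, false for values it swallows as transient. */
static inline bool
kick_errno_is_reported(int err)
{
    assert(err >= 0);   /* errno values are non-negative */

    switch (err) {
    case ENXIO:
    case ENOBUFS:
    case EOPNOTSUPP:
        return true;
    default:
        return false;
    }
}
```
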
> > +
> > +static inline bool
> > +check_free_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct umem_pool *first_mpool = NULL;
> > +    struct dp_packet_afxdp *xpacket;
> > +    struct dp_packet *packet;
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        if (packet->source != DPBUF_AFXDP) {
> > +            return false;
> > +        }
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (i == 0) {
> > +            first_mpool = xpacket->mpool;
> > +            continue;
> > +        }
> > +        if (xpacket->mpool != first_mpool) {
> > +            return false;
> > +        }
> > +    }
> > +    /* All packets are DPBUF_AFXDP and from the same mpool */
> > +    return true;
> > +}
> > +
> > +static inline void
> > +afxdp_complete_tx(struct xsk_socket_info *xsk)
> > +{
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t idx_cq = 0;
> > +    int tx_done, j, ret;
> > +
> > +    if (!xsk->outstanding_tx) {
> > +        return;
> > +    }
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> > +
> > +    /* Recycle back to umem pool.  Read the completion addresses before
> > +     * releasing the CQ entries, since released slots may be reused. */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    if (tx_done > 0) {
> > +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +        xsk->outstanding_tx -= tx_done;
> > +    }
> > +
> > +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > +}
> > +
> > +int
> > +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> > +                        struct dp_packet_batch *batch,
> > +                        bool concurrent_txq)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct xsk_socket_info *xsk = dev->xsks[qid];
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct dp_packet *packet;
> > +    bool free_batch = true;
> > +    uint32_t idx = 0;
> > +    int error = 0;
> > +    int ret;
> > +
> > +    if (!xsk) {
> > +        goto out;
> > +    }
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        qid = qid % dev->up.n_txq;
> > +        ovs_spin_lock(&dev->tx_locks[qid]);
> > +    }
> > +
> > +    /* Process CQ first. */
> > +    afxdp_complete_tx(xsk);
> > +
> > +    free_batch = check_free_batch(batch);
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > +        xsk->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just popped from the umem pool.
> > +         * TODO: avoid this copy if the packet and the popped umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +out:
> > +    if (free_batch) {
> > +        free_afxdp_buf_batch(batch);
> > +    } else {
> > +        dp_packet_delete_batch(batch, true);
> > +    }
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_unlock(&dev->tx_locks[qid]);
> > +    }
> > +    return error;
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> > +{
> > +   /* Done at reconfigure */
> > +   return 0;
> > +}
> > +
> > +void
> > +netdev_afxdp_destruct(struct netdev *netdev_)
> > +{
> > +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > +
> > +    /* Note: tc is by-passed when using drv-mode, but when using
> > +     * skb-mode, we might need to clean up tc. */
> > +
> > +    xsk_destroy_all(netdev_);
> > +    ovs_mutex_destroy(&netdev->mutex);
> > +}
> > +
> > +int
> > +netdev_afxdp_get_stats(const struct netdev *netdev,
> > +                       struct netdev_stats *stats)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct netdev_stats dev_stats;
> > +    struct xsk_socket_info *xsk;
> > +    int error, i;
> > +
> > +    ovs_mutex_lock(&dev->mutex);
> > +
> > +    error = get_stats_via_netlink(netdev, &dev_stats);
> > +    if (error) {
> > +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> > +    } else {
> > +        /* Use kernel netdev's packet and byte counts */
> > +        stats->rx_packets = dev_stats.rx_packets;
> > +        stats->rx_bytes = dev_stats.rx_bytes;
> > +        stats->tx_packets = dev_stats.tx_packets;
> > +        stats->tx_bytes = dev_stats.tx_bytes;
> > +
> > +        stats->rx_errors           += dev_stats.rx_errors;
> > +        stats->tx_errors           += dev_stats.tx_errors;
> > +        stats->rx_dropped          += dev_stats.rx_dropped;
> > +        stats->tx_dropped          += dev_stats.tx_dropped;
> > +        stats->multicast           += dev_stats.multicast;
> > +        stats->collisions          += dev_stats.collisions;
> > +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> > +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> > +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> > +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> > +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> > +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> > +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> > +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> > +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> > +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> > +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> > +
> > +        /* Account for the packets dropped in each xsk */
> > +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
> > +            xsk = dev->xsks[i];
> > +            if (xsk) {
> > +                stats->rx_dropped += xsk->rx_dropped;
> > +                stats->tx_dropped += xsk->tx_dropped;
> > +            }
> > +        }
> > +    }
> > +    ovs_mutex_unlock(&dev->mutex);
> > +
> > +    return error;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..dd2dc1a2064d
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,74 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <config.h>
> > +
> > +#ifdef HAVE_AF_XDP
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used directly
> > + * only by Linux-specific code. */
> > +
> > +#define MAX_XSKQ 16
> > +
> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +struct netdev_rxq;
> > +struct netdev_stats;
> > +
> > +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> > +void netdev_afxdp_destruct(struct netdev *netdev_);
> > +
> > +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> > +                          struct dp_packet_batch *batch,
> > +                          int *qfill);
> > +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> > +                            struct dp_packet_batch *batch,
> > +                            bool concurrent_txq);
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> > +                           struct netdev_stats *stats);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +void signal_remove_xdp(struct netdev *netdev);
> > +
> > +#else /* !HAVE_AF_XDP */
> > +
> > +#include "openvswitch/compiler.h"
> > +
> > +struct dp_packet;
> > +
> > +static inline void
> > +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> > +{
> > +    /* Nothing */
> > +}
> > +
> > +#endif /* HAVE_AF_XDP */
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..6a0388cf9dc3
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,139 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-afxdp.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
> > +#include "xdpsock.h"
> > +
> > +/* These functions are Linux specific, so they should be used directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +struct netdev_rxq_linux {
> > +    struct netdev_rxq up;
> > +    bool is_tap;
> > +    int fd;
> > +};
> > +
> > +void netdev_linux_run(const struct netdev_class *);
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> > +                                  const char *flag_name, bool enable);
> > +
> > +int get_stats_via_netlink(const struct netdev *netdev_,
> > +                          struct netdev_stats *stats);
> > +
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set policing. */
> > +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info **xsks;
> > +    int requested_n_rxq;
> > +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > +    int xdp_flags, xdp_bind_flags;
> > +    struct ovs_spinlock *tx_locks;
> > +#endif
> > +};
> > +
> > +static bool
> > +is_netdev_linux_class(const struct netdev_class *netdev_class)
> > +{
> > +    return netdev_class->run == netdev_linux_run;
> > +}
> > +
> > +static struct netdev_linux *
> > +netdev_linux_cast(const struct netdev *netdev)
> > +{
> > +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > +
> > +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +}
> > +
> > +static struct netdev_rxq_linux *
> > +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> > +{
> > +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> > +
> > +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> > +}
> > +
> > +#endif /* netdev-linux-private.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..2883cf1f2586 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -17,6 +17,7 @@
> >  #include <config.h>
> >
> >  #include "netdev-linux.h"
> > +#include "netdev-linux-private.h"
> >
> >  #include <errno.h>
> >  #include <fcntl.h>
> > @@ -54,6 +55,7 @@
> >  #include "fatal-signal.h"
> >  #include "hash.h"
> >  #include "openvswitch/hmap.h"
> > +#include "netdev-afxdp.h"
> >  #include "netdev-provider.h"
> >  #include "netdev-tc-offloads.h"
> >  #include "netdev-vport.h"
> > @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
> >  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
> >
> > -struct netdev_linux {
> > -    struct netdev up;
> > -
> > -    /* Protects all members below. */
> > -    struct ovs_mutex mutex;
> > -
> > -    unsigned int cache_valid;
> > -
> > -    bool miimon;                    /* Link status of last poll. */
> > -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > -    struct timer miimon_timer;
> > -
> > -    int netnsid;                    /* Network namespace ID. */
> > -    /* The following are figured out "on demand" only.  They are only valid
> > -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > -    int ifindex;
> > -    struct eth_addr etheraddr;
> > -    int mtu;
> > -    unsigned int ifi_flags;
> > -    long long int carrier_resets;
> > -    uint32_t kbits_rate;        /* Policing data. */
> > -    uint32_t kbits_burst;
> > -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > -                                   0 or an errno value. */
> > -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> > -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > -    int netdev_policing_error;  /* Cached error code from set policing. */
> > -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > -
> > -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > -
> > -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > -    struct tc *tc;
> > -
> > -    /* For devices of class netdev_tap_class only. */
> > -    int tap_fd;
> > -    bool present;               /* If the device is present in the namespace */
> > -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > -
> > -    /* LAG information. */
> > -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > -};
> > -
> > -struct netdev_rxq_linux {
> > -    struct netdev_rxq up;
> > -    bool is_tap;
> > -    int fd;
> > -};
> >
> >  /* This is set pretty low because we probably won't learn anything from the
> >   * additional log messages. */
> > @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >   * changes in the device miimon status, so we can use atomic_count. */
> >  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
> >
> > -static void netdev_linux_run(const struct netdev_class *);
> > -
> >  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> >                                     int cmd, const char *cmd_name);
> >  static int get_flags(const struct netdev *, unsigned int *flags);
> > @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
> >                         struct in_addr addr);
> >  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
> >  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
> > -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
> >  static int af_packet_sock(void);
> >  static bool netdev_linux_miimon_enabled(void);
> >  static void netdev_linux_miimon_run(void);
> > @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
> >  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
> >
> >  static bool
> > -is_netdev_linux_class(const struct netdev_class *netdev_class)
> > -{
> > -    return netdev_class->run == netdev_linux_run;
> > -}
> > -
> > -static bool
> >  is_tap_netdev(const struct netdev *netdev)
> >  {
> >      return netdev_get_class(netdev) == &netdev_tap_class;
> >  }
> > -
> > -static struct netdev_linux *
> > -netdev_linux_cast(const struct netdev *netdev)
> > -{
> > -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > -
> > -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > -}
> > -
> > -static struct netdev_rxq_linux *
> > -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> > -{
> > -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> > -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> > -}
> >
> >  static int
> >  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
> > @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
> >      }
> >  }
> >
> > -static void
> > +void
> >  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
> >  {
> >      struct nl_sock *sock;
> > @@ -3279,9 +3206,7 @@ exit:
> >      .run = netdev_linux_run,                                    \
> >      .wait = netdev_linux_wait,                                  \
> >      .alloc = netdev_linux_alloc,                                \
> > -    .destruct = netdev_linux_destruct,                          \
> >      .dealloc = netdev_linux_dealloc,                            \
> > -    .send = netdev_linux_send,                                  \
> >      .send_wait = netdev_linux_send_wait,                        \
> >      .set_etheraddr = netdev_linux_set_etheraddr,                \
> >      .get_etheraddr = netdev_linux_get_etheraddr,                \
> > @@ -3312,10 +3237,8 @@ exit:
> >      .arp_lookup = netdev_linux_arp_lookup,                      \
> >      .update_flags = netdev_linux_update_flags,                  \
> >      .rxq_alloc = netdev_linux_rxq_alloc,                        \
> > -    .rxq_construct = netdev_linux_rxq_construct,                \
> >      .rxq_destruct = netdev_linux_rxq_destruct,                  \
> >      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
> > -    .rxq_recv = netdev_linux_rxq_recv,                          \
> >      .rxq_wait = netdev_linux_rxq_wait,                          \
> >      .rxq_drain = netdev_linux_rxq_drain
> >
> > @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "system",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> > +    .destruct = netdev_linux_destruct,
> >      .get_stats = netdev_linux_get_stats,
> >      .get_features = netdev_linux_get_features,
> >      .get_status = netdev_linux_get_status,
> > -    .get_block_id = netdev_linux_get_block_id
> > +    .get_block_id = netdev_linux_get_block_id,
> > +    .send = netdev_linux_send,
> > +    .rxq_construct = netdev_linux_rxq_construct,
> > +    .rxq_recv = netdev_linux_rxq_recv,
> >  };
> >
> >  const struct netdev_class netdev_tap_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      .type = "tap",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct_tap,
> > +    .destruct = netdev_linux_destruct,
> >      .get_stats = netdev_tap_get_stats,
> >      .get_features = netdev_linux_get_features,
> >      .get_status = netdev_linux_get_status,
> > +    .send = netdev_linux_send,
> > +    .rxq_construct = netdev_linux_rxq_construct,
> > +    .rxq_recv = netdev_linux_rxq_recv,
> >  };
> >
> >  const struct netdev_class netdev_internal_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "internal",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> > +    .destruct = netdev_linux_destruct,
> >      .get_stats = netdev_internal_get_stats,
> >      .get_status = netdev_internal_get_status,
> > +    .send = netdev_linux_send,
> > +    .rxq_construct = netdev_linux_rxq_construct,
> > +    .rxq_recv = netdev_linux_rxq_recv,
> >  };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > +    NETDEV_LINUX_CLASS_COMMON,
> > +    .type = "afxdp",
> > +    .is_pmd = true,
> > +    .construct = netdev_linux_construct,
> > +    .destruct = netdev_afxdp_destruct,
> > +    .get_stats = netdev_afxdp_get_stats,
> > +    .get_status = netdev_linux_get_status,
> > +    .set_config = netdev_afxdp_set_config,
> > +    .get_config = netdev_afxdp_get_config,
> > +    .reconfigure = netdev_afxdp_reconfigure,
> > +    .get_numa_id = netdev_afxdp_get_numa_id,
> > +    .send = netdev_afxdp_batch_send,
> > +    .rxq_construct = netdev_afxdp_rxq_construct,
> > +    .rxq_recv = netdev_afxdp_rxq_recv,
> > +};
> > +#endif
> >
> >
> >  #define CODEL_N_QUEUES 0x0000
> > @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
> >      dst->tx_window_errors = src->tx_window_errors;
> >  }
> >
> > -static int
> > +int
> >  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
> >  {
> >      struct ofpbuf request;
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index fb0c27e6e8e8..91e6a9e2bfc0 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
> >  extern const struct netdev_class netdev_internal_class;
> >  extern const struct netdev_class netdev_tap_class;
> >
> > +#ifdef HAVE_AF_XDP
> > +extern const struct netdev_class netdev_afxdp_class;
> > +#endif
> >  #ifdef  __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 7d7ecf6f0946..0fac117cc602 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl =
> > VLOG_RATE_LIMIT_INIT(5, 20);
> >
> >  static void restore_all_flags(void *aux OVS_UNUSED);
> >  void update_device_args(struct netdev *, const struct shash *args);
> > +#ifdef HAVE_AF_XDP
> > +void signal_remove_xdp(struct netdev *netdev);
> > +#endif
> >
> >  int
> >  netdev_n_txq(const struct netdev *netdev)
> > @@ -146,6 +149,9 @@ netdev_initialize(void)
> >          netdev_register_provider(&netdev_internal_class);
> >          netdev_register_provider(&netdev_tap_class);
> >          netdev_vport_tunnel_register();
> > +#ifdef HAVE_AF_XDP
> > +        netdev_register_provider(&netdev_afxdp_class);
> > +#endif
> >  #endif
> >  #if defined(__FreeBSD__) || defined(__NetBSD__)
> >          netdev_register_provider(&netdev_tap_class);
> > @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
> >                                                 saved_flags &
> > ~saved_values,
> >                                                 &old_flags);
> >          }
> > +#ifdef HAVE_AF_XDP
> > +        if (netdev->netdev_class == &netdev_afxdp_class) {
> > +            signal_remove_xdp(netdev);
> > +        }
> > +#endif
> >      }
> >  }
> >
> > diff --git a/lib/spinlock.h b/lib/spinlock.h
> > new file mode 100644
> > index 000000000000..1ae634f23a6b
> > --- /dev/null
> > +++ b/lib/spinlock.h
> > @@ -0,0 +1,70 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +#ifndef SPINLOCK_H
> > +#define SPINLOCK_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <ctype.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdarg.h>
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +
> > +#include "ovs-atomic.h"
> > +
> > +struct ovs_spinlock {
> > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
> > +};
> > +
> > +static inline void
> > +ovs_spinlock_init(struct ovs_spinlock *sl)
> > +{
> > +    atomic_init(&sl->locked, 0);
> > +}
> > +
> > +static inline void
> > +ovs_spin_lock(struct ovs_spinlock *sl)
> > +{
> > +    int exp = 0, locked = 0;
> > +
> > +    while (!atomic_compare_exchange_strong_explicit(&sl->locked,
> > &exp, 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed)) {
> > +        locked = 1;
> > +        while (locked) {
> > +            atomic_read_relaxed(&sl->locked, &locked);
> > +        }
> > +        exp = 0;
> > +    }
> > +}
> > +
> > +static inline void
> > +ovs_spin_unlock(struct ovs_spinlock *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int
> > +ovs_spin_trylock(struct ovs_spinlock *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp,
> > 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
> > +#endif
> > diff --git a/lib/util.c b/lib/util.c
> > index 7b8ab81f6ee1..5eb20995b370 100644
> > --- a/lib/util.c
> > +++ b/lib/util.c
> > @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
> >      return xrealloc(p, *n * s);
> >  }
> >
> > -/* Allocates and returns 'size' bytes of memory aligned to a cache
> > line and in
> > - * dedicated cache lines.  That is, the memory block returned will
> > not share a
> > - * cache line with other data, avoiding "false sharing".
> > +/* Allocates and returns 'size' bytes of memory aligned to
> > 'alignment' bytes.
> > + * 'alignment' must be a power of two and a multiple of sizeof(void
> > *).
> >   *
> > - * Use free_cacheline() to free the returned memory block. */
> > + * Use free_size_align() to free the returned memory block. */
> >  void *
> > -xmalloc_cacheline(size_t size)
> > +xmalloc_size_align(size_t size, size_t alignment)
> >  {
> >  #ifdef HAVE_POSIX_MEMALIGN
> >      void *p;
> >      int error;
> >
> >      COVERAGE_INC(util_xalloc);
> > -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
> > +    error = posix_memalign(&p, alignment, size ? size : 1);
> >      if (error != 0) {
> >          out_of_memory();
> >      }
> > @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
> >  #else
> >      /* Allocate room for:
> >       *
> > -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to
> > allow the
> > -     *       pointer to be aligned exactly sizeof(void *) bytes
> > before the
> > -     *       beginning of a cache line.
> > +     *     - Header padding: Up to alignment - 1 bytes, to allow the
> > +     *       pointer 'q' to be aligned exactly sizeof(void *) bytes
> > before the
> > +     *       beginning of the alignment.
> >       *
> >       *     - Pointer: A pointer to the start of the header padding,
> > to allow us
> >       *       to free() the block later.
> >       *
> >       *     - User data: 'size' bytes.
> >       *
> > -     *     - Trailer padding: Enough to bring the user data up to a
> > cache line
> > +     *     - Trailer padding: Enough to bring the user data up to an
> > alignment
> >       *       multiple.
> >       *
> >       * +---------------+---------+------------------------+---------+
> > @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
> >       * p               q         r
> >       *
> >       */
> > -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
> > -                      + sizeof(void *)
> > -                      + ROUND_UP(size, CACHE_LINE_SIZE));
> > -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) <
> > sizeof(void *);
> > -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ?
> > CACHE_LINE_SIZE : 0),
> > -                                CACHE_LINE_SIZE);
> > -    void **q = (void **) r - 1;
> > +    void *p, *r, **q;
> > +    bool runt;
> > +
> > +    COVERAGE_INC(util_xalloc);
> > +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
> > +        ovs_abort(0, "Invalid alignment");
> > +    }
> > +
> > +    p = xmalloc((alignment - 1)
> > +                + sizeof(void *)
> > +                + ROUND_UP(size, alignment));
> > +
> > +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
> > +    /* When the padding size < sizeof(void*), we don't have enough
> > room for
> > +     * pointer 'q'. As a result, we need to move 'r' to the next
> > alignment.
> > +     * So ROUND_UP when calling xmalloc above, and ROUND_UP again when
> > calculating 'r'
> > +     * below.
> > +     */
> > +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
> > +    q = (void **) r - 1;
> >      *q = p;
> > +
> >      return r;
> >  #endif
> >  }
> >
> > +void
> > +free_size_align(void *p)
> > +{
> > +#ifdef HAVE_POSIX_MEMALIGN
> > +    free(p);
> > +#else
> > +    if (p) {
> > +        void **q = (void **) p - 1;
> > +        free(*q);
> > +    }
> > +#endif
> > +}
> > +
> > +/* Allocates and returns 'size' bytes of memory aligned to a cache
> > line and in
> > + * dedicated cache lines.  That is, the memory block returned will
> > not share a
> > + * cache line with other data, avoiding "false sharing".
> > + *
> > + * Use free_cacheline() to free the returned memory block. */
> > +void *
> > +xmalloc_cacheline(size_t size)
> > +{
> > +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
> > +}
> > +
> >  /* Like xmalloc_cacheline() but clears the allocated memory to all
> > zero
> >   * bytes. */
> >  void *
> > @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
> >  void
> >  free_cacheline(void *p)
> >  {
> > -#ifdef HAVE_POSIX_MEMALIGN
> > -    free(p);
> > -#else
> > -    if (p) {
> > -        void **q = (void **) p - 1;
> > -        free(*q);
> > -    }
> > -#endif
> > +    free_size_align(p);
> > +}
> > +
> > +void *
> > +xmalloc_pagealign(size_t size)
> > +{
> > +    return xmalloc_size_align(size, get_page_size());
> > +}
> > +
> > +void
> > +free_pagealign(void *p)
> > +{
> > +    free_size_align(p);
> >  }
> >
> >  char *
> > diff --git a/lib/util.h b/lib/util.h
> > index c26605abdce3..33665748274c 100644
> > --- a/lib/util.h
> > +++ b/lib/util.h
> > @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src,
> > size_t size);
> >
> >  int string_ends_with(const char *str, const char *suffix);
> >
> > +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> > +void free_pagealign(void *);
> > +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
> > +void free_size_align(void *);
> > +
> >  /* The C standards say that neither the 'dst' nor 'src' argument to
> >   * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
> >   * the null case. */
> > diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> > new file mode 100644
> > index 000000000000..ea39fa557290
> > --- /dev/null
> > +++ b/lib/xdpsock.c
> > @@ -0,0 +1,170 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +#include <config.h>
> > +
> > +#include "xdpsock.h"
> > +#include "dp-packet.h"
> > +#include "openvswitch/compiler.h"
> > +
> > +/* Note:
> > + * umem_elem_push* shouldn't overflow because we always pop
> > + * elem first, then push back to the stack.
> > + */
> > +static inline void
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +}
> > +
> > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    ovs_spin_lock(&umemp->lock);
> > +    __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->lock);
> > +}
> > +
> > +static inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
> > +        OVS_NOT_REACHED();
> > +    }
> > +
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->lock);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->lock);
> > +}
> > +
> > +static inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->lock);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->lock);
> > +
> > +    return ret;
> > +}
> > +
> > +static inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> > +        return NULL;
> > +    }
> > +
> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->lock);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->lock);
> > +
> > +    return ptr;
> > +}
> > +
> > +static void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    bufs = xmalloc_pagealign(size * sizeof(void *));
> > +    memset(bufs, 0, size * sizeof(void *));
> > +
> > +    return (void **)bufs;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->lock);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > +    free_pagealign(umemp->array);
> > +    umemp->array = NULL;
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free_pagealign(xp->array);
> > +    xp->array = NULL;
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..1a1093381243
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,101 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <config.h>
> > +
> > +#ifdef HAVE_AF_XDP
> > +
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "spinlock.h"
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +
> > +/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
> > + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
> > + */
> > +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> > +
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
> > +
> > +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> > +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> > +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS +
> > CONS_NUM_DESCS));
> > +
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > +    int index;      /* point to top */
> > +    unsigned int size;
> > +    struct ovs_spinlock lock;
> > +    void **array;   /* a pointer array, point to umem buf */
> > +};
> > +
> > +/* array-based dp_packet_afxdp */
> > +struct xpacket_pool {
> > +    unsigned int size;
> > +    struct dp_packet_afxdp **array;
> > +};
> > +
> > +struct xsk_umem_info {
> > +    struct umem_pool mpool;
> > +    struct xpacket_pool xpool;
> > +    struct xsk_ring_prod fq;
> > +    struct xsk_ring_cons cq;
> > +    struct xsk_umem *umem;
> > +    void *buffer;
> > +};
> > +
> > +struct xsk_socket_info {
> > +    struct xsk_ring_cons rx;
> > +    struct xsk_ring_prod tx;
> > +    struct xsk_umem_info *umem;
> > +    struct xsk_socket *xsk;
> > +    unsigned long rx_dropped;
> > +    unsigned long tx_dropped;
> > +    uint32_t outstanding_tx;
> > +};
> > +
> > +struct umem_elem {
> > +    struct umem_elem *next;
> > +};
> > +
> > +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void *umem_elem_pop(struct umem_pool *umemp);
> > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > +void umem_pool_cleanup(struct umem_pool *umemp);
> > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > +
> > +#endif
> > +#endif
> > diff --git a/tests/automake.mk b/tests/automake.mk
> > index 2956e68b242c..131564bb0bd3 100644
> > --- a/tests/automake.mk
> > +++ b/tests/automake.mk
> > @@ -4,12 +4,14 @@ EXTRA_DIST += \
> >       $(SYSTEM_TESTSUITE_AT) \
> >       $(SYSTEM_KMOD_TESTSUITE_AT) \
> >       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
> > +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
> >       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
> >       $(SYSTEM_DPDK_TESTSUITE_AT) \
> >       $(OVSDB_CLUSTER_TESTSUITE_AT) \
> >       $(TESTSUITE) \
> >       $(SYSTEM_KMOD_TESTSUITE) \
> >       $(SYSTEM_USERSPACE_TESTSUITE) \
> > +     $(SYSTEM_AFXDP_TESTSUITE) \
> >       $(SYSTEM_OFFLOADS_TESTSUITE) \
> >       $(SYSTEM_DPDK_TESTSUITE) \
> >       $(OVSDB_CLUSTER_TESTSUITE) \
> > @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
> >       tests/system-userspace-macros.at \
> >       tests/system-userspace-packet-type-aware.at
> >
> > +SYSTEM_AFXDP_TESTSUITE_AT = \
> > +     tests/system-afxdp-testsuite.at \
> > +     tests/system-afxdp-macros.at
> > +
> >  SYSTEM_TESTSUITE_AT = \
> >       tests/system-common-macros.at \
> >       tests/system-ovn.at \
> > @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
> >  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
> >  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
> >  SYSTEM_USERSPACE_TESTSUITE =
> > $(srcdir)/tests/system-userspace-testsuite
> > +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
> >  SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
> >  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
> >  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> > @@ -317,6 +324,11 @@ check-system-userspace: all
> >       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests
> > AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@"
> > --recheck)
> >
> > +check-afxdp: all
> > +     $(MAKE) install
> > +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests
> > AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> > +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> > +
> >  check-offloads: all
> >       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests
> > AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@"
> > --recheck)
> > @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4
> > $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
> >       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >       $(AM_V_at)mv $@.tmp $@
> >
> > +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> > $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> > +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> > +     $(AM_V_at)mv $@.tmp $@
> > +
> >  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> > $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
> >       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >       $(AM_V_at)mv $@.tmp $@
> > diff --git a/tests/system-afxdp-macros.at
> > b/tests/system-afxdp-macros.at
> > new file mode 100644
> > index 000000000000..1e6f7a46b4b7
> > --- /dev/null
> > +++ b/tests/system-afxdp-macros.at
> > @@ -0,0 +1,20 @@
> > +# Add port to ovs bridge by using afxdp mode.
> > +# This will use generic XDP support in the veth driver.
> > +m4_define([ADD_VETH],
> > +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return
> > 77])
> > +      CONFIGURE_VETH_OFFLOADS([$1])
> > +      AT_CHECK([ip link set $1 netns $2])
> > +      AT_CHECK([ip link set dev ovs-$1 up])
> > +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> > +                set interface ovs-$1 external-ids:iface-id="$1"
> > type="afxdp"])
> > +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> > +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> > +      if test -n "$5"; then
> > +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> > +      fi
> > +      if test -n "$6"; then
> > +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> > +      fi
> > +      on_exit 'ip link del ovs-$1'
> > +    ]
> > +)
> > diff --git a/tests/system-afxdp-testsuite.at
> > b/tests/system-afxdp-testsuite.at
> > new file mode 100644
> > index 000000000000..9b7a29066614
> > --- /dev/null
> > +++ b/tests/system-afxdp-testsuite.at
> > @@ -0,0 +1,26 @@
> > +AT_INIT
> > +
> > +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
> > +
> > +Licensed under the Apache License, Version 2.0 (the "License");
> > +you may not use this file except in compliance with the License.
> > +You may obtain a copy of the License at:
> > +
> > +    http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +Unless required by applicable law or agreed to in writing, software
> > +distributed under the License is distributed on an "AS IS" BASIS,
> > +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > +See the License for the specific language governing permissions and
> > +limitations under the License.])
> > +
> > +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> > +
> > +m4_include([tests/ovs-macros.at])
> > +m4_include([tests/ovsdb-macros.at])
> > +m4_include([tests/ofproto-macros.at])
> > +m4_include([tests/system-common-macros.at])
> > +m4_include([tests/system-userspace-macros.at])
> > +m4_include([tests/system-afxdp-macros.at])
> > +
> > +m4_include([tests/system-traffic.at])
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 89c06a1b7877..1e3acbbb8075 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> > type=patch options:peer=p1 \
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="xdpmode"
> > +              type='{"type": "string",
> > +                     "enum": ["set", ["skb", "drv"]]}'>
> > +        <p>
> > +          Specifies the operational mode of the XDP program.
> > +          If "drv", the XDP program is loaded into the device driver
> > with
> > +          zero-copy RX and TX enabled. This mode requires device
> > driver with
> > +          AF_XDP support and has the best performance.
> > +          If "skb", the XDP program is using generic XDP mode in
> > kernel with
> > +          extra data copying between userspace and kernel. No device
> > driver
> > +          support is needed. Note that this is afxdp netdev type
> > only.
> > +          Defaults to "skb" mode.
> > +        </p>
> > +      </column>
> > +
> >        <column name="options" key="vhost-server-path"
> >                type='{"type": "string"}'>
> >          <p>
> > --
> > 2.7.4
William Tu June 8, 2019, 4:48 a.m. UTC | #3
> > > +  ethtool -L enp2s0 combined 1
> > > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > > \
> > > +    options:n_rxq=1 options:xdpmode=drv \
> > > +    other_config:pmd-rxq-affinity="0:4"

Another feature I'm thinking about adding is a new option
for loading a custom XDP program.

For example:
ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
    options:n_rxq=1 options:xdpmode=drv \
    options:xdp_prog=/path/to/xdp.o

If users do not specify a path, libbpf's default program is used
(it forwards all packets to userspace).

If users want to use their own XDP object, this option loads the
XDP object file from the given path.
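
The fallback rule above could be sketched roughly as follows. This is only
an illustration of the proposed semantics, not code from the patch: the
names xdp_prog_source, xdp_prog_choice and xdp_prog_choose() are
hypothetical, and treating an empty string like an unset option is an
assumption.

```c
#include <stddef.h>

/* Hypothetical: where the XDP program for a port comes from. */
enum xdp_prog_source {
    XDP_PROG_LIBBPF_DEFAULT,    /* No path given: keep libbpf's default. */
    XDP_PROG_CUSTOM_OBJECT      /* Path given: load the user's object file. */
};

struct xdp_prog_choice {
    enum xdp_prog_source source;
    const char *path;           /* NULL unless source is CUSTOM_OBJECT. */
};

/* 'opt' is the value of the proposed options:xdp_prog key, or NULL if the
 * key is not set.  An empty string is treated as unset (an assumption of
 * this sketch). */
static struct xdp_prog_choice
xdp_prog_choose(const char *opt)
{
    struct xdp_prog_choice c = { XDP_PROG_LIBBPF_DEFAULT, NULL };

    if (opt && opt[0] != '\0') {
        c.source = XDP_PROG_CUSTOM_OBJECT;
        c.path = opt;
    }
    return c;
}
```

The actual loading and attaching of the object file is left out on purpose,
since that part depends on which libbpf calls the final patch ends up using.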

William
Eelco Chaudron June 8, 2019, 8:12 a.m. UTC | #4
Hi William,

This was still a draft email, and was not supposed to go out ;)

My debug and build setup was a bit messed up and I was having problems
running GDB… I was (I’m) planning to continue getting some debug
info on Tuesday after the public holiday here…

But just to give you a heads up, it starts up fine with root access but 
it crashes during a simple Port to Port run with wire-speed traffic. 
Then it will run into a restart/crash loop.

Will try to get you more details next week…

Cheers,

Eelco


On 7 Jun 2019, at 23:33, William Tu wrote:

> Hi Eelco,
>
> Thanks for the testing.
>
> On Fri, Jun 7, 2019 at 8:43 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>> Hi William,
>>
>> No review or full test yet, just some observations…
>>
>> We run OVS as a non root user, which is causing OVS with XDP to fail:
>
> Right, XDP requires using root privilege.
> I will add this in the documentation.

Is this a hard requirement? As I do not remember running OVS as root 
before…

>>
>> 2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev:
>> Datapath supports ct_orig_tuple
>> 2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev:
>> Datapath supports ct_orig_tuple6
>> 2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on 
>> numa_id:
>> 0, core id: 21 created.
>> 2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd 
>> threads
>> on numa node 0
>> 2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
>> 2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1 
>> in
>> DRV mode
>> 2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR:
>> setrlimit(RLIMIT_MEMLOCK): Operation not permitted
>
> This is due to not having root privilege, so not able to lock the 
> memory
> for device driver to directly DMA packet buffer into userspace.
>
> Can you try using root?
>
> Regards,
> William
>
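
For reference, the setrlimit(RLIMIT_MEMLOCK) error in the log can be
reproduced outside OVS with a few lines of C. This is a minimal sketch,
not the code in netdev-afxdp.c; the helper name raise_memlock_limit() is
made up for the example.

```c
#include <errno.h>
#include <sys/resource.h>

/* Try to lift the locked-memory limit, as an AF_XDP user typically does
 * before registering umem.  Returns 0 on success, otherwise the errno
 * value from setrlimit(); EPERM means the process is not allowed to
 * raise its hard limit. */
static int
raise_memlock_limit(void)
{
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r) != 0) {
        return errno;
    }
    return 0;
}
```

On Linux, raising a hard limit needs CAP_SYS_RESOURCE rather than uid 0 as
such, so for this particular call full root may not be strictly required.
The xsk_socket__create failure further down is a separate permission check
that this sketch does not cover.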
>> 2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all
>> configure queue 0 mode DRV
>> 2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create
>> failed (Operation not permitted) mode: DRV qid: 0
>> 2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create 
>> AF_XDP
>> socket on queue 0
>> 2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
>> 2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1
>> reconfig fails
>> 2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set 
>> interface
>> eno1 new configuration
>>
>> However when configuring this after startup it’s fine, but trying
>> to
>> restart OVS with this configuration results in a system core…
>>
>>
>>
>>
>> On 5 Jun 2019, at 22:47, William Tu wrote:
>>
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
>>> socket
>>> type built upon the eBPF and XDP technology.  It aims to have
>>> comparable
>>> performance to DPDK but cooperate better with existing kernel's
>>> networking
>>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP
>>> program
>>> attached to the netdev, by-passing a couple of Linux kernel's
>>> subsystems.
>>> As a result, AF_XDP socket shows much better performance than
>>> AF_PACKET.
>>> For more details about AF_XDP, please see linux kernel's
>>> Documentation/networking/af_xdp.rst. Note that by default, this
>>> feature is
>>> not compiled in.
>>>
>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>> ---
>>> v1->v2:
>>> - add a list to maintain unused umem elements
>>> - remove copy from rx umem to ovs internal buffer
>>> - use hugetlb to reduce misses (not much difference)
>>> - use pmd mode netdev in OVS (huge performance improve)
>>> - remove malloc dp_packet, instead put dp_packet in umem
>>>
>>> v2->v3:
>>> - rebase on the OVS master, 7ab4b0653784
>>>   ("configure: Check for more specific function to pull in pthread
>>> library.")
>>> - remove the dependency on libbpf and dpif-bpf.
>>>   instead, use the built-in XDP_ATTACH feature.
>>> - data structure optimizations for better performance, see[1]
>>> - more test cases support
>>> v3:
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
>>>
>>> v3->v4:
>>> - Use AF_XDP API provided by libbpf
>>> - Remove the dependency on XDP_ATTACH kernel patch set
>>> - Add documentation, bpf.rst
>>>
>>> v4->v5:
>>> - rebase to master
>>> - remove rfc, squash all into a single patch
>>> - add --enable-afxdp, so by default, AF_XDP is not compiled
>>> - add options: xdpmode=drv,skb
>>> - add multiple queue and multiple PMD support, with options: n_rxq
>>> - improve documentation, rename bpf.rst to af_xdp.rst
>>>
>>> v5->v6
>>> - rebase to master, commit 0cdd5b13de91b98
>>> - address errors from sparse and clang
>>> - pass travis-ci test
>>> - address feedback from Ben
>>> - fix issues reported by 0-day robot
>>> - improved documentation
>>>
>>> v6-v7
>>> - rebase to master, commit abf11558c1515bf3b1
>>> - address feedbacks from Ilya, Ben, and Eelco, see:
>>>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
>>> - add XDP mode change, implement get/set_config, reconfigure
>>> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>>>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
>>> - perf optimization for batching umem_push/pop
>>> - perf optimization for batching kick_tx
>>> - test build with dpdk
>>> - fix/refactor atomic operation
>>> - make AF_XDP x86 specific, otherwise fail at build time
>>> - lots of code refactoring
>>> - add PVP setup in documentation
>>>
>>> v7-v8:
>>> - Address feedback from Ilya at:
>>>   https://patchwork.ozlabs.org/patch/1095019/
>>> - add netdev-linux-private.h
>>> - fix afxdp reconfigure issue
>>> - sort include headers
>>> - remove unnecessary OVS_UNUSED
>>> - coding style fixes
>>> - error case handling and memory leak
>>>
>>> v8-v9:
>>> - rebase to master 180bbbed3a3867d52
>>> - Address review feedback from Ben, Ilya and Eelco, at:
>>>   https://patchwork.ozlabs.org/patch/1097740/
>>> - == From Ilya ==
>>> - Optimize the reconfiguration logic
>>> - Implement .rxq_recv and .send for afxdp
>>> - Remove system-afxdp-traffic.at, reuse existing code
>>> - Use Ilya's rdtsc code
>>> - remove --disable-system
>>> - == From Eelco ==
>>> - Fix bug when remove br0,
>>> util(revalidator49)|EMER|lib/poll-loop.c:111:
>>>   assertion !fd != !wevent failed
>>> - Fix bug and use default value from libbpf, ex:
>>> XSK_RING_PROD__DEFAULT...
>>> - Clear xdp program when receive signal, ctrl+c
>>> - Add options to vswitch.xml, set xdpmode default to skb-mode
>>> - No support for ARM and PPC, now x86_64 only
>>> - remove redundant header includes and function/macro definitions
>>> - remove some ifdef HAVE_AF_XDP
>>> - == From others/both about afxdp rx and tx ==
>>> - Several umem push/pop error handling improvement/fixes
>>> - add lock to address concurrent_txq case
>>> - improve error handling
>>> - add stats
>>> - Things that are not done yet
>>> - MTU limitation
>>> - n_txq_desc/n_rxq_desc option.
>>>
>>> v9-v10
>>> - remove x86_64 limitation, suggested by Ben and Eelco
>>> - add xmalloc_pagealign, free_pagealign
>>> - minor refactor
>>>
>>> v10-v11
>>> - address feedback from Ilya at
>>>   https://patchwork.ozlabs.org/patch/1106495/
>>> - fix typos, and some refactoring
>>> - refactor existing code and introduce xmalloc pagealign
>>> - fix a couple of error handling case
>>> - allocate per-txq lock
>>> - dynamic allocate xsk array
>>> - fix cycle_counter_update() for non-x86/non-linux case
>>> ---
>>>  Documentation/automake.mk             |   1 +
>>>  Documentation/index.rst               |   1 +
>>>  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
>>>  Documentation/intro/install/index.rst |   1 +
>>>  acinclude.m4                          |  35 ++
>>>  configure.ac                          |   1 +
>>>  lib/automake.mk                       |  14 +
>>>  lib/dp-packet.c                       |  28 ++
>>>  lib/dp-packet.h                       |  18 +-
>>>  lib/dpif-netdev-perf.h                |  26 +
>>>  lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
>>>  lib/netdev-afxdp.h                    |  74 +++
>>>  lib/netdev-linux-private.h            | 139 ++++++
>>>  lib/netdev-linux.c                    | 121 ++---
>>>  lib/netdev-provider.h                 |   3 +
>>>  lib/netdev.c                          |  11 +
>>>  lib/spinlock.h                        |  70 +++
>>>  lib/util.c                            |  92 +++-
>>>  lib/util.h                            |   5 +
>>>  lib/xdpsock.c                         | 170 +++++++
>>>  lib/xdpsock.h                         | 101 ++++
>>>  tests/automake.mk                     |  16 +
>>>  tests/system-afxdp-macros.at          |  20 +
>>>  tests/system-afxdp-testsuite.at       |  26 +
>>>  vswitchd/vswitch.xml                  |  15 +
>>>  25 files changed, 2204 insertions(+), 108 deletions(-)
>>>  create mode 100644 Documentation/intro/install/afxdp.rst
>>>  create mode 100644 lib/netdev-afxdp.c
>>>  create mode 100644 lib/netdev-afxdp.h
>>>  create mode 100644 lib/netdev-linux-private.h
>>>  create mode 100644 lib/spinlock.h
>>>  create mode 100644 lib/xdpsock.c
>>>  create mode 100644 lib/xdpsock.h
>>>  create mode 100644 tests/system-afxdp-macros.at
>>>  create mode 100644 tests/system-afxdp-testsuite.at
>>>
>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>> index 082438e09a33..11cc59efc881 100644
>>> --- a/Documentation/automake.mk
>>> +++ b/Documentation/automake.mk
>>> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>>>       Documentation/intro/why-ovs.rst \
>>>       Documentation/intro/install/index.rst \
>>>       Documentation/intro/install/bash-completion.rst \
>>> +     Documentation/intro/install/afxdp.rst \
>>>       Documentation/intro/install/debian.rst \
>>>       Documentation/intro/install/documentation.rst \
>>>       Documentation/intro/install/distributions.rst \
>>> diff --git a/Documentation/index.rst b/Documentation/index.rst
>>> index 46261235c732..aa9e7c49f179 100644
>>> --- a/Documentation/index.rst
>>> +++ b/Documentation/index.rst
>>> @@ -59,6 +59,7 @@ vSwitch? Start here.
>>>    :doc:`intro/install/windows` |
>>>    :doc:`intro/install/xenserver` |
>>>    :doc:`intro/install/dpdk` |
>>> +  :doc:`intro/install/afxdp` |
>>>    :doc:`Installation FAQs <faq/releases>`
>>>
>>>  - **Tutorials:** :doc:`tutorials/faucet` |
>>> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
>>> new file mode 100644
>>> index 000000000000..554964396353
>>> --- /dev/null
>>> +++ b/Documentation/intro/install/afxdp.rst
>>> @@ -0,0 +1,433 @@
>>> +..
>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>>> +      not use this file except in compliance with the License. You may obtain
>>> +      a copy of the License at
>>> +
>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> +      Unless required by applicable law or agreed to in writing, software
>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>>> +      License for the specific language governing permissions and limitations
>>> +      under the License.
>>> +
>>> +      Convention for heading levels in Open vSwitch documentation:
>>> +
>>> +      =======  Heading 0 (reserved for the title in a document)
>>> +      -------  Heading 1
>>> +      ~~~~~~~  Heading 2
>>> +      +++++++  Heading 3
>>> +      '''''''  Heading 4
>>> +
>>> +      Avoid deeper levels because they do not render well.
>>> +
>>> +
>>> +========================
>>> +Open vSwitch with AF_XDP
>>> +========================
>>> +
>>> +This document describes how to build and install Open vSwitch using
>>> +AF_XDP netdev.
>>> +
>>> +.. warning::
>>> +  The AF_XDP support of Open vSwitch is considered 'experimental',
>>> +  and it is not compiled in by default.
>>> +
>>> +
>>> +Introduction
>>> +------------
>>> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
>>> +built upon the eBPF and XDP technology.  It aims to have comparable
>>> +performance to DPDK but cooperate better with the existing kernel networking
>>> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
>>> +attached to the netdev, by-passing a couple of the Linux kernel's subsystems.
>>> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
>>> +For more details about AF_XDP, please see the Linux kernel's
>>> +Documentation/networking/af_xdp.rst
>>> +
>>> +
>>> +AF_XDP Netdev
>>> +-------------
>>> +OVS has a couple of netdev types, e.g., system, tap, or
>>> +dpdk.  The AF_XDP feature adds a new netdev type called
>>> +"afxdp", and implements its configuration, packet reception,
>>> +and transmit functions.  Since the AF_XDP socket, called xsk,
>>> +operates in userspace, once ovs-vswitchd receives packets
>>> +from xsk, the afxdp netdev re-uses the existing userspace
>>> +dpif-netdev datapath.  As a result, most of the packet processing
>>> +happens in userspace instead of the Linux kernel.
>>> +
>>> +::
>>> +
>>> +              |   +-------------------+
>>> +              |   |    ovs-vswitchd   |<-->ovsdb-server
>>> +              |   +-------------------+
>>> +              |   |      ofproto      |<-->OpenFlow controllers
>>> +              |   +--------+-+--------+
>>> +              |   | netdev | |ofproto-|
>>> +    userspace |   +--------+ |  dpif  |
>>> +              |   | afxdp  | +--------+
>>> +              |   | netdev | |  dpif  |
>>> +              |   +---||---+ +--------+
>>> +              |       ||     |  dpif- |
>>> +              |       ||     | netdev |
>>> +              |_      ||     +--------+
>>> +                      ||
>>> +               _  +---||-----+--------+
>>> +              |   | AF_XDP prog +     |
>>> +       kernel |   |   xsk_map         |
>>> +              |_  +--------||---------+
>>> +                           ||
>>> +                        physical
>>> +                           NIC
>>> +
>>> +
>>> +Build requirements
>>> +------------------
>>> +
>>> +In addition to the requirements described in :doc:`general`, building Open
>>> +vSwitch with AF_XDP will require the following:
>>> +
>>> +- libbpf from kernel source tree (kernel 5.0.0 or later)
>>> +
>>> +- Linux kernel XDP support, with the following options (required)
>>> +
>>> +  * CONFIG_BPF=y
>>> +
>>> +  * CONFIG_BPF_SYSCALL=y
>>> +
>>> +  * CONFIG_XDP_SOCKETS=y
>>> +
>>> +
>>> +- The following optional Kconfig options are also recommended, but not
>>> +  required:
>>> +
>>> +  * CONFIG_BPF_JIT=y (Performance)
>>> +
>>> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
>>> +
>>> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
>>> +
>>> +- Once your AF_XDP-enabled kernel is ready, if possible, run
>>> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
>>> +  This is an OVS-independent benchmark tool for AF_XDP.
>>> +  It makes sure your basic kernel requirements are met for AF_XDP.
>>> +
>>> +
>>> +Installing
>>> +----------
>>> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
>>> +First, clone a recent version of Linux bpf-next tree::
>>> +
>>> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
>>> +
>>> +Second, go into the Linux source directory and build libbpf in the tools
>>> +directory::
>>> +
>>> +  cd bpf-next/
>>> +  cd tools/lib/bpf/
>>> +  make && make install
>>> +  make install_headers
>>> +
>>> +.. note::
>>> +   Make sure xsk.h and bpf.h are installed in the system's include path,
>>> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
>>> +
>>> +Make sure the libbpf.so is installed correctly::
>>> +
>>> +  ldconfig
>>> +  ldconfig -p | grep libbpf
>>> +
>>> +Third, ensure the standard OVS requirements are installed and
>>> +bootstrap/configure the package::
>>> +
>>> +  ./boot.sh && ./configure --enable-afxdp
>>> +
>>> +Finally, build and install OVS::
>>> +
>>> +  make && make install
>>> +
>>> +To kick start end-to-end autotesting::
>>> +
>>> +  uname -a # make sure having 5.0+ kernel
>>> +  make check-afxdp TESTSUITEFLAGS='1'
>>> +
>>> +If a test case fails, check the log at::
>>> +
>>> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
>>> +
>>> +
>>> +Setup AF_XDP netdev
>>> +-------------------
>>> +Before running OVS with AF_XDP, make sure the libbpf and libelf are
>>> +set-up right::
>>> +
>>> +  ldd vswitchd/ovs-vswitchd
>>> +
>>> +Open vSwitch should be started using userspace datapath as described
>>> +in :doc:`general`::
>>> +
>>> +  ovs-vswitchd ...
>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
>>> +
>>> +Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4)
>>> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
>>> +
>>> +  ethtool -L enp2s0 combined 1
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
>>> +    options:n_rxq=1 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:4"
>>> +
>>> +Or, use 4 pmds/cores and 4 queues by doing::
>>> +
>>> +  ethtool -L enp2s0 combined 4
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
>>> +    options:n_rxq=4 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
>>> +
>>> +.. note::
>>> +   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
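
The two examples above rely on pmd-cpu-mask being a plain bitmask in which
bit N selects core N, while pmd-rxq-affinity takes queue:core pairs. A
minimal C sketch of that mapping follows; cores_to_mask() is a hypothetical
helper written for illustration and is not part of OVS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a pmd-cpu-mask style bitmask: bit N is set when core N is listed.
 * Hypothetical helper for illustration only; it is not an OVS function. */
static uint64_t
cores_to_mask(const unsigned int *cores, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cores[i] < 64);          /* Only 64 cores fit in this sketch. */
        mask |= UINT64_C(1) << cores[i];
    }
    return mask;
}
```

Core 4 alone gives 0x10, and cores 1 through 4 give 0x1e, so every core
named in pmd-rxq-affinity should have its bit set in the mask.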
>>> +
>>> +To validate that the bridge has successfully instantiated, run::
>>> +
>>> +  ovs-vsctl show
>>> +
>>> +The output should look something like::
>>> +
>>> +  Port "ens802f0"
>>> +   Interface "ens802f0"
>>> +      type: afxdp
>>> +      options: {n_rxq="1", xdpmode=drv}
>>> +
>>> +Otherwise, enable debugging by::
>>> +
>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>> +
>>> +
>>> +References
>>> +----------
>>> +Most of the design details are described in the paper presented at
>>> +Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
>>> +section 4, and slides[2][4].
>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
>>> +to current and future AF_XDP work.
>>> +
>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
>>> +
>>> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
>>> +
>>> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
>>> +
>>> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
>>> +
>>> +
>>> +Performance Tuning
>>> +------------------
>>> +The name of the game is to keep your CPU running in userspace, allowing PMD
>>> +to keep polling the AF_XDP queues without any interference from the kernel.
>>> +
>>> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
>>> +   running cores, device plug-in slot)
>>> +
>>> +#. Isolate your CPUs with the isolcpus kernel parameter in the grub config.
>>> +
>>> +#. IRQs should not be assigned to the cores running PMD threads.
>>> +
>>> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
>>> +
>>> +
>>> +Debugging performance issue
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +While running the traffic, use the Linux perf tool to see where your CPU
>>> +spends its cycles::
>>> +
>>> +  cd bpf-next/tools/perf
>>> +  make
>>> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
>>> +  ./perf report
>>> +
>>> +Measure your system call rate by doing::
>>> +
>>> +  pstree -p `pidof ovs-vswitchd`
>>> +  strace -c -p <your pmd's PID>
>>> +
>>> +Or, use OVS pmd tool::
>>> +
>>> +  ovs-appctl dpif-netdev/pmd-stats-show
>>> +
>>> +
>>> +Example Script
>>> +--------------
>>> +
>>> +Below is a script using namespaces and veth peer::
>>> +
>>> +  #!/bin/bash
>>> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
>>> +    --disable-system --detach
>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
>>> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
>>> +    fail-mode=secure datapath_type=netdev
>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
>>> +
>>> +  ip netns add at_ns0
>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>> +
>>> +  ip link add p0 type veth peer name afxdp-p0
>>> +  ip link set p0 netns at_ns0
>>> +  ip link set dev afxdp-p0 up
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
>>> +
>>> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
>>> +  ip addr add "10.1.1.1/24" dev p0
>>> +  ip link set dev p0 up
>>> +  NS_EXEC_HEREDOC
>>> +
>>> +  ip netns add at_ns1
>>> +  ip link add p1 type veth peer name afxdp-p1
>>> +  ip link set p1 netns at_ns1
>>> +  ip link set dev afxdp-p1 up
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p1 -- \
>>> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
>>> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
>>> +  ip addr add "10.1.1.2/24" dev p1
>>> +  ip link set dev p1 up
>>> +  NS_EXEC_HEREDOC
>>> +
>>> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
>>> +
>>> +
>>> +Limitations/Known Issues
>>> +------------------------
>>> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
>>> +#. No QoS support because AF_XDP netdev by-passes the Linux TC layer. A
>>> +   possible work-around is to use OpenFlow meter action.
>>> +#. AF_XDP device added to bridge, removed, and added again will fail.
>>> +#. Most of the tests are done using i40e single port. Multiple ports and
>>> +   the ixgbe driver also need to be tested.
>>> +#. No latency test result (TODO items)
>>> +
>>> +
>>> +PVP using tap device
>>> +--------------------
>>> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
>>> +First, start OVS, then add physical port::
>>> +
>>> +  ethtool -L enp2s0 combined 1
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
>>> +    options:n_rxq=1 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:4"
>>> +
>>> +Start a VM with virtio and tap device::
>>> +
>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
>>> +    -m 4096 \
>>> +    -cpu host,+x2apic -enable-kvm \
>>> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
>>> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
>>> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
>>> +    -object memory-backend-file,id=mem,size=4096M,\
>>> +      mem-path=/dev/hugepages,share=on \
>>> +    -numa node,memdev=mem -mem-prealloc -smp 2
>>> +
>>> +Create OpenFlow rules::
>>> +
>>> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
>>> +  ovs-ofctl del-flows br0
>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
>>> +
>>> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
>>> +
>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
>>> +
>>> +The measured performance is around 1.6Mpps.
>>> +The rate is limited by the kernel's tap interface, which requires copying
>>> +each packet into the kernel from the umem buffer in userspace.
>>> +
>>> +
>>> +PVP using vhostuser device
>>> +--------------------------
>>> +First, build OVS with DPDK and AFXDP::
>>> +
>>> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
>>> +  make -j4 && make install
>>> +
>>> +Create a vhost-user port from OVS::
>>> +
>>> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
>>> +    other_config:pmd-cpu-mask=0xfff
>>> +  ovs-vsctl add-port br0 vhost-user-1 \
>>> +    -- set Interface vhost-user-1 type=dpdkvhostuser
>>> +
>>> +Start VM using vhost-user mode::
>>> +
>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
>>> +   -m 4096 \
>>> +   -cpu host,+x2apic -enable-kvm \
>>> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
>>> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
>>> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
>>> +      netdev=mynet1,mq=on,vectors=10 \
>>> +   -object memory-backend-file,id=mem,size=4096M,\
>>> +      mem-path=/dev/hugepages,share=on \
>>> +   -numa node,memdev=mem -mem-prealloc -smp 2
>>> +
>>> +Setup the OpenFlow rules::
>>> +
>>> +  ovs-ofctl del-flows br0
>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
>>> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
>>> +
>>> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
>>> +
>>> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
>>> +
>>> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
>>> +
>>> +
>>> +PCP container using veth
>>> +------------------------
>>> +Create namespace and veth peer devices::
>>> +
>>> +  ip netns add at_ns0
>>> +  ip link add p0 type veth peer name afxdp-p0
>>> +  ip link set p0 netns at_ns0
>>> +  ip link set dev afxdp-p0 up
>>> +  ip netns exec at_ns0 ip link set dev p0 up
>>> +
>>> +Attach the veth port to br0 (linux kernel mode)::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 options:n_rxq=1
>>> +
>>> +Or, use AF_XDP with skb mode::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
>>> +
>>> +Setup the OpenFlow rules::
>>> +
>>> +  ovs-ofctl del-flows br0
>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
>>> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
>>> +
>>> +In the namespace, drop or bounce back the packets::
>>> +
>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
>>> +
>>> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
>>> +
>>> +
>>> +Bug Reporting
>>> +-------------
>>> +
>>> +Please report problems to dev@openvswitch.org.
>>> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
>>> index 3193c736cf17..c27a9c9d16ff 100644
>>> --- a/Documentation/intro/install/index.rst
>>> +++ b/Documentation/intro/install/index.rst
>>> @@ -45,6 +45,7 @@ Installation from Source
>>>     xenserver
>>>     userspace
>>>     dpdk
>>> +   afxdp
>>>
>>>  Installation from Packages
>>>  --------------------------
>>> diff --git a/acinclude.m4 b/acinclude.m4
>>> index cf9cc8b8b0de..721653ab0ec0 100644
>>> --- a/acinclude.m4
>>> +++ b/acinclude.m4
>>> @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>>>    ])
>>>  ])
>>>
>>> +dnl OVS_CHECK_LINUX_AF_XDP
>>> +dnl
>>> +dnl Check both Linux kernel AF_XDP and libbpf support
>>> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
>>> +  AC_ARG_ENABLE([afxdp],
>>> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
>>> +                [], [enable_afxdp=no])
>>> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
>>> +  if test "$enable_afxdp" != yes; then
>>> +    AC_MSG_RESULT([no])
>>> +    AF_XDP_ENABLE=false
>>> +  else
>>> +    AC_MSG_RESULT([yes])
>>> +    AF_XDP_ENABLE=true
>>> +
>>> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
>>> +
>>> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
>>> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
>>> +
>>> +    AC_CHECK_HEADER([bpf/xsk.h], [],
>>> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
>>> +
>>> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
>>> +
>>> +    AC_DEFINE([HAVE_AF_XDP], [1],
>>> +              [Define to 1 if AF_XDP support is available and enabled.])
>>> +    LIBBPF_LDADD=" -lbpf -lelf"
>>> +    AC_SUBST([LIBBPF_LDADD])
>>> +  fi
>>> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
>>> +])
>>> +
>>>  dnl OVS_CHECK_DPDK
>>>  dnl
>>>  dnl Configure DPDK source tree
>>> diff --git a/configure.ac b/configure.ac
>>> index 2dbe9a9178e3..9e23e1c6958c 100644
>>> --- a/configure.ac
>>> +++ b/configure.ac
>>> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>>>  OVS_CHECK_DOT
>>>  OVS_CHECK_IF_DL
>>>  OVS_CHECK_STRTOK_R
>>> +OVS_CHECK_LINUX_AF_XDP
>>>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>>>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>>>    [], [], [[#include <sys/stat.h>]])
>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>> index cc5dccf39d6b..b31e28f6e1f5 100644
>>> --- a/lib/automake.mk
>>> +++ b/lib/automake.mk
>>> @@ -14,6 +14,10 @@ if WIN32
>>>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>>>  endif
>>>
>>> +if HAVE_AF_XDP
>>> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
>>> +endif
>>> +
>>>  lib_libopenvswitch_la_LDFLAGS = \
>>>          $(OVS_LTINFO) \
>>>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
>>> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>>>       lib/if-notifier.h \
>>>       lib/netdev-linux.c \
>>>       lib/netdev-linux.h \
>>> +     lib/netdev-linux-private.h \
>>>       lib/netdev-tc-offloads.c \
>>>       lib/netdev-tc-offloads.h \
>>>       lib/netlink-conntrack.c \
>>> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
>>>       lib/tc.h
>>>  endif
>>>
>>> +if HAVE_AF_XDP
>>> +lib_libopenvswitch_la_SOURCES += \
>>> +     lib/xdpsock.c \
>>> +     lib/xdpsock.h \
>>> +     lib/netdev-afxdp.c \
>>> +     lib/netdev-afxdp.h \
>>> +     lib/spinlock.h
>>> +endif
>>> +
>>>  if DPDK_NETDEV
>>>  lib_libopenvswitch_la_SOURCES += \
>>>       lib/dpdk.c \
>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>>> index 0976a35e758b..e6a7947076b4 100644
>>> --- a/lib/dp-packet.c
>>> +++ b/lib/dp-packet.c
>>> @@ -19,6 +19,7 @@
>>>  #include <string.h>
>>>
>>>  #include "dp-packet.h"
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-dpdk.h"
>>>  #include "openvswitch/dynamic-string.h"
>>>  #include "util.h"
>>> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>>>  }
>>>
>>> +#if HAVE_AF_XDP
>>> +/* Initializes 'b' as an empty dp_packet that contains 'allocated' bytes of
>>> + * memory starting at the AF_XDP umem 'base'.
>>> + */
>>> +void
>>> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
>>> +{
>>> +    dp_packet_set_base(b, base);
>>> +    dp_packet_set_data(b, base);
>>> +    dp_packet_set_size(b, 0);
>>> +
>>> +    dp_packet_set_allocated(b, allocated);
>>> +    b->source = DPBUF_AFXDP;
>>> +    dp_packet_reset_offsets(b);
>>> +    pkt_metadata_init(&b->md, 0);
>>> +    dp_packet_reset_cutlen(b);
>>> +    dp_packet_reset_offload(b);
>>> +    b->packet_type = htonl(PT_ETH);
>>> +}
>>> +#endif
>>> +
>>>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>>>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>>>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
>>> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>>>               * created as a dp_packet */
>>>              free_dpdk_buf((struct dp_packet*) b);
>>>  #endif
>>> +        } else if (b->source == DPBUF_AFXDP) {
>>> +            free_afxdp_buf(b);
>>>          }
>>>      }
>>>  }
>>> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>>>      case DPBUF_STACK:
>>>          OVS_NOT_REACHED();
>>>
>>> +    case DPBUF_AFXDP:
>>> +        OVS_NOT_REACHED();
>>> +
>>>      case DPBUF_STUB:
>>>          b->source = DPBUF_MALLOC;
>>>          new_base = xmalloc(new_allocated);
>>> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>>>  {
>>>      void *p;
>>>      ovs_assert(b->source != DPBUF_DPDK);
>>> +    ovs_assert(b->source != DPBUF_AFXDP);
>>>
>>>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>>>          p = dp_packet_data(b);
>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>> index a5e9ade1244a..e3438226e360 100644
>>> --- a/lib/dp-packet.h
>>> +++ b/lib/dp-packet.h
>>> @@ -25,6 +25,7 @@
>>>  #include <rte_mbuf.h>
>>>  #endif
>>>
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-dpdk.h"
>>>  #include "openvswitch/list.h"
>>>  #include "packets.h"
>>> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>>>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>>>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>>>                                  */
>>>                                  */
>>> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>>>  };
>>>
>>>  #define DP_PACKET_CONTEXT_SIZE 64
>>> @@ -89,6 +91,13 @@ struct dp_packet {
>>>      };
>>>  };
>>>
>>> +#if HAVE_AF_XDP
>>> +struct dp_packet_afxdp {
>>> +    struct umem_pool *mpool;
>>> +    struct dp_packet packet;
>>> +};
>>> +#endif
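
Because struct dp_packet_afxdp embeds the dp_packet rather than pointing
to it, code that only holds the dp_packet pointer can step back to the
wrapper and reach mpool. The sketch below shows that container-of
arithmetic with simplified stand-in structures; the *_sketch names are
hypothetical, and whether free_afxdp_buf() works exactly this way is an
assumption, since its body is not quoted here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the structures in the patch (hypothetical). */
struct umem_pool_sketch { int unused; };
struct packet_sketch { int source; };

struct packet_afxdp_sketch {
    struct umem_pool_sketch *mpool;   /* Pool the frame returns to. */
    struct packet_sketch packet;      /* Embedded packet handed to callers. */
};

/* Recover the wrapper from a pointer to its embedded 'packet' member. */
static struct packet_afxdp_sketch *
packet_to_wrapper(struct packet_sketch *p)
{
    assert(p != NULL);
    return (struct packet_afxdp_sketch *)
        ((char *) p - offsetof(struct packet_afxdp_sketch, packet));
}
```

This only holds for packets that really live inside such a wrapper,
which is what the DPBUF_AFXDP source tag is there to signal.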
>>> +
>>>  static inline void *dp_packet_data(const struct dp_packet *);
>>>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>>>  static inline void *dp_packet_base(const struct dp_packet *);
>>> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>>>  void dp_packet_use(struct dp_packet *, void *, size_t);
>>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>>>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
>>> -
>>> +#if HAVE_AF_XDP
>>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
>>> +#endif
>>>  void dp_packet_init_dpdk(struct dp_packet *);
>>>
>>>  void dp_packet_init(struct dp_packet *, size_t);
>>> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>>>              return;
>>>          }
>>>
>>> +        if (b->source == DPBUF_AFXDP) {
>>> +            free_afxdp_buf(b);
>>> +            return;
>>> +        }
>>> +
>>>          dp_packet_uninit(b);
>>>          free(b);
>>>      }
>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>> index 859c05613ddf..6b6dfda7db1c 100644
>>> --- a/lib/dpif-netdev-perf.h
>>> +++ b/lib/dpif-netdev-perf.h
>>> @@ -21,6 +21,7 @@
>>>  #include <stddef.h>
>>>  #include <stdint.h>
>>>  #include <string.h>
>>> +#include <time.h>
>>>  #include <math.h>
>>>
>>>  #ifdef DPDK_NETDEV
>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>>>      char *log_reason;
>>>  };
>>>
>>> +#ifdef __linux__
>>> +static inline uint64_t
>>> +rdtsc_syscall(struct pmd_perf_stats *s)
>>> +{
>>> +    struct timespec val;
>>> +    uint64_t v;
>>> +
>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
>>> +        return s->last_tsc;
>>> +    }
>>> +
>>> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
>>> +    v += (uint64_t) val.tv_nsec;
>>> +
>>> +    return s->last_tsc = v;
>>> +}
>>> +#endif
>>> +
>>>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>>>   * These functions are intended to be invoked in the context of pmd threads. */
>>>
>>> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
>>>  {
>>>  #ifdef DPDK_NETDEV
>>>      return s->last_tsc = rte_get_tsc_cycles();
>>> +#elif !defined(_MSC_VER) && defined(__x86_64__)
>>> +    uint32_t h, l;
>>> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
>>> +
>>> +    return s->last_tsc = ((uint64_t) h << 32) | l;
>>> +#elif defined(__linux__)
>>> +    return rdtsc_syscall(s);
>>>  #else
>>>      return s->last_tsc = 0;
>>>  #endif
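
The non-x86 Linux branch above replaces the TSC read with a nanosecond
timestamp, so "cycles" on that path are really nanoseconds. Below is a
standalone sketch of the same fallback; monotonic_ns() is a hypothetical
name, and it uses CLOCK_MONOTONIC instead of the patch's
CLOCK_MONOTONIC_RAW only so that it builds under strict POSIX feature
macros.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Return a nanosecond counter, or 'last' if the clock cannot be read.
 * Hypothetical stand-in for rdtsc_syscall(), using CLOCK_MONOTONIC. */
static uint64_t
monotonic_ns(uint64_t last)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return last;                  /* Keep the previous reading on error. */
    }
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;
}
```

Returning the previous value on failure keeps the counter from jumping
backwards, at the cost of reporting zero elapsed time for that interval.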
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> new file mode 100644
>>> index 000000000000..a6543e8f5126
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -0,0 +1,891 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#include <config.h>
>>> +
>>> +#include "netdev-linux-private.h"
>>> +#include "netdev-linux.h"
>>> +#include "netdev-afxdp.h"
>>> +
>>> +#include <errno.h>
>>> +#include <inttypes.h>
>>> +#include <linux/rtnetlink.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <net/if.h>
>>> +#include <stdlib.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "dp-packet.h"
>>> +#include "dpif-netdev.h"
>>> +#include "openvswitch/dynamic-string.h"
>>> +#include "openvswitch/vlog.h"
>>> +#include "packets.h"
>>> +#include "socket-util.h"
>>> +#include "spinlock.h"
>>> +#include "util.h"
>>> +#include "xdpsock.h"
>>> +
>>> +#ifndef SOL_XDP
>>> +#define SOL_XDP 283
>>> +#endif
>>> +
>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>> +
>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)(elem) - (char *)(base)))
>>> +#define UMEM2XPKT(base, i) \
>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)(base) + \
>>> +                               (i) * sizeof(struct dp_packet_afxdp))
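
UMEM2DESC turns a frame pointer into a byte offset from the umem base,
which is the form that ring descriptors carry. A small sketch of that
round trip is below; SKETCH_FRAME_SIZE and both function names are
hypothetical, and the real FRAME_SIZE from xdpsock.h is not quoted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FRAME_SIZE 2048   /* Hypothetical frame size in bytes. */

/* Address of frame 'i' inside the umem region starting at 'base'. */
static void *
frame_addr(void *base, size_t i)
{
    return (char *) base + i * SKETCH_FRAME_SIZE;
}

/* Byte offset of 'elem' from 'base', the form a ring descriptor carries. */
static uint64_t
addr_to_offset(const void *elem, const void *base)
{
    assert((const char *) elem >= (const char *) base);
    return (uint64_t) ((const char *) elem - (const char *) base);
}
```

Frame 3 of a region therefore maps to offset 3 * SKETCH_FRAME_SIZE, and
frame 0 maps to offset 0.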
>>> +
>>> +static uint32_t prog_id;
>>> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
>>> +                                             int mode);
>>> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
>>> +static void xsk_destroy(struct xsk_socket_info *xsk);
>>> +static int xsk_configure_all(struct netdev *netdev);
>>> +static void xsk_destroy_all(struct netdev *netdev);
>>> +
>>> +static struct xsk_umem_info *
>>> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
>>> +{
>>> +    struct xsk_umem_config uconfig OVS_UNUSED;
>>> +    struct xsk_umem_info *umem;
>>> +    int ret;
>>> +    int i;
>>> +
>>> +    umem = xcalloc(1, sizeof *umem);
>>> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
>>> +                           NULL);
>>> +    if (ret) {
>>> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
>>> +                 ovs_strerror(errno),
>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
>>> +        free(umem);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    umem->buffer = buffer;
>>> +
>>> +    /* set-up umem pool */
>>> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
>>> +        VLOG_ERR("umem_pool_init failed");
>>> +        if (xsk_umem__delete(umem->umem)) {
>>> +            VLOG_ERR("xsk_umem__delete failed");
>>> +        }
>>> +        free(umem);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
>>> +        struct umem_elem *elem;
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)umem->buffer + i * FRAME_SIZE);
>>> +        umem_elem_push(&umem->mpool, elem);
>>> +    }
>>> +
>>> +    /* set-up metadata */
>>> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
>>> +        VLOG_ERR("xpacket_pool_init failed");
>>> +        umem_pool_cleanup(&umem->mpool);
>>> +        if (xsk_umem__delete(umem->umem)) {
>>> +            VLOG_ERR("xsk_umem__delete failed");
>>> +        }
>>> +        free(umem);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
>>> +              umem->xpool.array,
>>> +              (char *)umem->xpool.array +
>>> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
>>> +
>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct dp_packet *packet;
>>> +
>>> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
>>> +        xpacket->mpool = &umem->mpool;
>>> +
>>> +        packet = &xpacket->packet;
>>> +        packet->source = DPBUF_AFXDP;
>>> +    }
>>> +
>>> +    return umem;
>>> +}
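
A note on the init loop in xsk_configure_umem() above: frames are pushed
from index NUM_FRAMES - 1 down to 0. If the pool behaves as a LIFO stack
(an assumption here, since umem_pool_init() and umem_elem_push() live in
xdpsock.c, which is not quoted in this part of the thread), the first pop
returns the lowest-addressed frame. A minimal standalone sketch of that
behaviour; struct frame_stack and its helpers are hypothetical names, and
NUM_FRAMES / FRAME_SIZE use made-up small values rather than the patch's:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the patch defines its own in xdpsock.h. */
#define NUM_FRAMES 4
#define FRAME_SIZE 2048

/* Hypothetical stand-in for the umem pool: a fixed-size LIFO stack. */
struct frame_stack {
    void *slots[NUM_FRAMES];
    int top;                    /* Number of frames currently stored. */
};

static int
frame_stack_push(struct frame_stack *s, void *frame)
{
    if (s->top >= NUM_FRAMES) {
        return -1;              /* Pool already holds every frame. */
    }
    s->slots[s->top++] = frame;
    return 0;
}

static void *
frame_stack_pop(struct frame_stack *s)
{
    return s->top > 0 ? s->slots[--s->top] : NULL;
}

/* Mirrors the quoted init loop: push frames from the highest index down. */
static void
frame_stack_fill(struct frame_stack *s, char *buffer)
{
    s->top = 0;
    for (int i = NUM_FRAMES - 1; i >= 0; i--) {
        int rc = frame_stack_push(s, buffer + (size_t) i * FRAME_SIZE);

        assert(rc == 0);
        (void) rc;
    }
}
```

With this layout, the first two pops after frame_stack_fill() return the
frames at offsets 0 and FRAME_SIZE, in that order.
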
>>> +
>>> +static struct xsk_socket_info *
>>> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
>>> +                     uint32_t queue_id, int xdpmode)
>>> +{
>>> +    struct xsk_socket_config cfg;
>>> +    struct xsk_socket_info *xsk;
>>> +    char devname[IF_NAMESIZE];
>>> +    uint32_t idx = 0;
>>> +    int ret;
>>> +    int i;
>>> +
>>> +    xsk = xcalloc(1, sizeof(*xsk));
>>> +    xsk->umem = umem;
>>> +    cfg.rx_size = CONS_NUM_DESCS;
>>> +    cfg.tx_size = PROD_NUM_DESCS;
>>> +    cfg.libbpf_flags = 0;
>>> +
>>> +    if (xdpmode == XDP_ZEROCOPY) {
>>> +        cfg.bind_flags = XDP_ZEROCOPY;
>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
>>> +    } else {
>>> +        cfg.bind_flags = XDP_COPY;
>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
>>> +    }
>>> +
>>> +    if (if_indextoname(ifindex, devname) == NULL) {
>>> +        VLOG_ERR("ifindex %d to devname failed (%s)",
>>> +                 ifindex, ovs_strerror(errno));
>>> +        free(xsk);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
>>> +                             &xsk->rx, &xsk->tx, &cfg);
>>> +    if (ret) {
>>> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
>>> +                 ovs_strerror(errno),
>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
>>> +                 queue_id);
>>> +        free(xsk);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    /* Make sure the built-in AF_XDP program is loaded */
>>> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
>>> +    if (ret) {
>>> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
>>> +        xsk_socket__delete(xsk->xsk);
>>> +        free(xsk);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
>>> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
>>> +                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
>>> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
>>> +    }
>>> +
>>> +    for (i = 0;
>>> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
>>> +         i += FRAME_SIZE) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        elem = umem_elem_pop(&xsk->umem->mpool);
>>> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
>>> +
>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
>>> +    }
>>> +
>>> +    xsk_ring_prod__submit(&xsk->umem->fq,
>>> +                          PROD_NUM_DESCS - BATCH_SIZE);
>>> +    return xsk;
>>> +}
>>> +
>>> +static struct xsk_socket_info *
>>> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
>>> +{
>>> +    struct xsk_socket_info *xsk;
>>> +    struct xsk_umem_info *umem;
>>> +    void *bufs;
>>> +
>>> +    /* umem memory region */
>>> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
>>> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
>>> +
>>> +    /* create AF_XDP socket */
>>> +    umem = xsk_configure_umem(bufs,
>>> +                              NUM_FRAMES * FRAME_SIZE,
>>> +                              xdpmode);
>>> +    if (!umem) {
>>> +        free_pagealign(bufs);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
>>> +    if (!xsk) {
>>> +        /* clean up umem and xpacket pool */
>>> +        if (xsk_umem__delete(umem->umem)) {
>>> +            VLOG_ERR("xsk_umem__delete failed");
>>> +        }
>>> +        free_pagealign(bufs);
>>> +        umem_pool_cleanup(&umem->mpool);
>>> +        xpacket_pool_cleanup(&umem->xpool);
>>> +        free(umem);
>>> +    }
>>> +    return xsk;
>>> +}
>>> +
>>> +static int
>>> +xsk_configure_all(struct netdev *netdev)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    struct xsk_socket_info *xsk;
>>> +    int i, ifindex, n_rxq;
>>> +
>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>> +
>>> +    n_rxq = netdev_n_rxq(netdev);
>>> +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
>>> +
>>> +    /* configure each queue */
>>> +    for (i = 0; i < n_rxq; i++) {
>>> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
>>> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
>>> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
>>> +        if (!xsk) {
>>> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
>>> +            dev->xsks[i] = NULL;
>>> +            goto err;
>>> +        }
>>> +        dev->xsks[i] = xsk;
>>> +        xsk->rx_dropped = 0;
>>> +        xsk->tx_dropped = 0;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +err:
>>> +    xsk_destroy_all(netdev);
>>> +    return EINVAL;
>>> +}
>>> +
>>> +static void
>>> +xsk_destroy(struct xsk_socket_info *xsk)
>>> +{
>>> +    struct xsk_umem *umem;
>>> +
>>> +    umem = xsk->umem->umem;
>>> +    xsk_socket__delete(xsk->xsk);
>>> +    if (xsk_umem__delete(umem)) {
>>> +        VLOG_ERR("xsk_umem__delete failed");
>>> +    }
>>> +
>>> +    /* free the packet buffer */
>>> +    free_pagealign(xsk->umem->buffer);
>>> +
>>> +    /* cleanup umem pool */
>>> +    umem_pool_cleanup(&xsk->umem->mpool);
>>> +
>>> +    /* cleanup metadata pool */
>>> +    xpacket_pool_cleanup(&xsk->umem->xpool);
>>> +
>>> +    free(xsk->umem);
>>> +    free(xsk);
>>> +}
>>> +
>>> +static void
>>> +xsk_destroy_all(struct netdev *netdev)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    int i, ifindex;
>>> +
>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>> +
>>> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
>>> +        if (dev->xsks && dev->xsks[i]) {
>>> +            VLOG_INFO("destroy xsk[%d]", i);
>>> +            xsk_destroy(dev->xsks[i]);
>>> +            dev->xsks[i] = NULL;
>>> +        }
>>> +    }
>>> +
>>> +    VLOG_INFO("remove xdp program");
>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
>>> +
>>> +    free(dev->xsks);
>>> +}
>>> +
>>> +static inline void OVS_UNUSED
>>> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
>>> +    struct xdp_statistics stat;
>>> +    socklen_t optlen;
>>> +
>>> +    optlen = sizeof stat;
>>> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
>>> +               &stat, &optlen) == 0);
>>> +
>>> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
>>> +                stat.rx_dropped,
>>> +                stat.rx_invalid_descs,
>>> +                stat.tx_invalid_descs);
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>> +                        char **errp OVS_UNUSED)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    const char *str_xdpmode;
>>> +    int xdpmode, new_n_rxq;
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>> +    if (new_n_rxq > MAX_XSKQ) {
>>> +        ovs_mutex_unlock(&dev->mutex);
>>> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
>>> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
>>> +        return EINVAL;
>>> +    }
>>> +
>>> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
>>> +    if (!strcasecmp(str_xdpmode, "drv")) {
>>> +        xdpmode = XDP_ZEROCOPY;
>>> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
>>> +        xdpmode = XDP_COPY;
>>> +    } else {
>>> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
>>> +                 netdev_get_name(netdev), str_xdpmode);
>>> +        ovs_mutex_unlock(&dev->mutex);
>>> +        return EINVAL;
>>> +    }
>>> +
>>> +    if (dev->requested_n_rxq != new_n_rxq
>>> +        || dev->requested_xdpmode != xdpmode) {
>>> +        dev->requested_n_rxq = new_n_rxq;
>>> +        dev->requested_xdpmode = xdpmode;
>>> +        netdev_request_reconfigure(netdev);
>>> +    }
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
>>> +    smap_add_format(args, "xdpmode", "%s",
>>> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +    return 0;
>>> +}
>>> +
>>> +static void
>>> +netdev_afxdp_alloc_txq(struct netdev *netdev)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    int n_txqs = netdev_n_rxq(netdev);
>>> +    int i;
>>> +
>>> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
>>> +
>>> +    for (i = 0; i < n_txqs; i++) {
>>> +        ovs_spinlock_init(&dev->tx_locks[i]);
>>> +    }
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_reconfigure(struct netdev *netdev)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>>> +    int err = 0;
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +
>>> +    if (netdev->n_rxq == dev->requested_n_rxq
>>> +        && dev->xdpmode == dev->requested_xdpmode) {
>>> +        goto out;
>>> +    }
>>> +
>>> +    xsk_destroy_all(netdev);
>>> +    free(dev->tx_locks);
>>> +
>>> +    netdev->n_rxq = dev->requested_n_rxq;
>>> +    netdev_afxdp_alloc_txq(netdev);
>>> +
>>> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
>>> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
>>> +        /* From SKB mode to DRV mode */
>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
>>> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
>>> +        dev->xdpmode = XDP_ZEROCOPY;
>>> +
>>> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
>>> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
>>> +                      ovs_strerror(errno));
>>> +        }
>>> +    } else {
>>> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
>>> +        /* From DRV mode to SKB mode */
>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
>>> +        dev->xdp_bind_flags = XDP_COPY;
>>> +        dev->xdpmode = XDP_COPY;
>>> +        /* TODO: set rlimit back to previous value
>>> +         * when no device is in DRV mode.
>>> +         */
>>> +    }
>>> +
>>> +    err = xsk_configure_all(netdev);
>>> +    if (err) {
>>> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
>>> +    }
>>> +    netdev_change_seq_changed(netdev);
>>> +out:
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +    return err;
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
>>> +{
>>> +    /* FIXME: Get netdev's PCIe device ID, then find
>>> +     * its NUMA node id.
>>> +     */
>>> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
>>> +              netdev_get_name(netdev));
>>> +    return 0;
>>> +}
>>> +
>>> +static void
>>> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
>>> +{
>>> +    uint32_t curr_prog_id = 0;
>>> +    uint32_t flags;
>>> +
>>> +    /* remove_xdp_program() */
>>> +    if (xdpmode == XDP_COPY) {
>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
>>> +    } else {
>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
>>> +    }
>>> +
>>> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
>>> +    }
>>> +    if (prog_id == curr_prog_id) {
>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
>>> +    } else if (!curr_prog_id) {
>>> +        VLOG_INFO("couldn't find a prog id on a given interface");
>>> +    } else {
>>> +        VLOG_INFO("program on interface changed, not removing");
>>> +    }
>>> +}
>>> +
>>> +void
>>> +signal_remove_xdp(struct netdev *netdev)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    int ifindex;
>>> +
>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>> +
>>> +    VLOG_WARN("force remove xdp program");
>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
>>> +}
>>> +
>>> +static struct dp_packet_afxdp *
>>> +dp_packet_cast_afxdp(const struct dp_packet *d)
>>> +{
>>> +    ovs_assert(d->source == DPBUF_AFXDP);
>>> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
>>> +}
>>> +
>>> +void
>>> +free_afxdp_buf(struct dp_packet *p)
>>> +{
>>> +    struct dp_packet_afxdp *xpacket;
>>> +    uintptr_t addr;
>>> +
>>> +    xpacket = dp_packet_cast_afxdp(p);
>>> +    if (xpacket->mpool) {
>>> +        void *base = dp_packet_base(p);
>>> +
>>> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>> +        umem_elem_push(xpacket->mpool, (void *)addr);
>>> +    }
>>> +}
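
A side note on the address arithmetic in free_afxdp_buf() above, which the
rx path also uses: as far as I can tell it relies on each umem frame being
a power-of-two size and on the umem base being frame-aligned, so that
masking a pointer into a frame gives the frame start and shifting a
descriptor offset gives the frame index. A minimal standalone sketch,
assuming FRAME_SIZE is 2048 (hence FRAME_SHIFT 11 and FRAME_SHIFT_MASK
equal to FRAME_SIZE - 1). The real definitions are in xdpsock.h, which is
not quoted here, and frame_base_addr() / frame_index() are hypothetical
helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the patch takes the real ones from xdpsock.h. */
#define FRAME_SHIFT 11
#define FRAME_SIZE (1u << FRAME_SHIFT)
#define FRAME_SHIFT_MASK ((uintptr_t) FRAME_SIZE - 1)

/* Hypothetical helper: round an address inside a frame down to the start
 * of that frame.  Only valid if the umem base is FRAME_SIZE-aligned. */
static uintptr_t
frame_base_addr(uintptr_t addr)
{
    return addr & ~FRAME_SHIFT_MASK;
}

/* Hypothetical helper: map a descriptor offset (bytes from the umem base)
 * to the index of the frame containing it. */
static uint64_t
frame_index(uint64_t desc_addr)
{
    return desc_addr >> FRAME_SHIFT;
}
```

For example, with a umem mapped at 0x10000, a packet pointer 7 bytes into
frame 3 masks back to 0x10000 + 3 * FRAME_SIZE, and a descriptor offset of
2 * FRAME_SIZE + 100 maps to frame index 2. If the base were not aligned
to FRAME_SIZE, the mask would round to the wrong address.
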
>>> +
>>> +static void
>>> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
>>> +{
>>> +    struct dp_packet_afxdp *xpacket = NULL;
>>> +    struct dp_packet *packet;
>>> +    void *elems[BATCH_SIZE];
>>> +    uintptr_t addr;
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        xpacket = dp_packet_cast_afxdp(packet);
>>> +        if (xpacket->mpool) {
>>> +            void *base = dp_packet_base(packet);
>>> +
>>> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>> +            elems[i] = (void *)addr;
>>> +        }
>>> +    }
>>> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
>>> +    dp_packet_batch_init(batch);
>>> +}
>>> +
>>> +static inline void
>>> +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
>>> +{
>>> +    void *elems[BATCH_SIZE];
>>> +    int i;
>>> +
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>> +
>>> +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
>>> +    }
>>> +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>> +
>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>> +    xsk->rx_dropped += rcvd;
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>> +                      int *qfill)
>>> +{
>>> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>> +    struct netdev *netdev = rx->up.netdev;
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    struct umem_elem *elems[BATCH_SIZE];
>>> +    uint32_t idx_rx = 0, idx_fq = 0;
>>> +    struct xsk_socket_info *xsk;
>>> +    int qid = rxq_->queue_id;
>>> +    unsigned int rcvd, i;
>>> +    int ret = 0;
>>> +
>>> +    xsk = dev->xsks[qid];
>>> +    if (!xsk) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    rx->fd = xsk_socket__fd(xsk->xsk);
>>> +
>>> +    /* See if there is any packet on RX queue,
>>> +     * if yes, idx_rx is the index having the packet.
>>> +     */
>>> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
>>> +    if (!rcvd) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
>>> +        return ENOMEM;
>>> +    }
>>> +
>>> +    /* Prepare for the FILL queue */
>>> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
>>> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
>>> +         * to move received packets from FILL queue to RX queue.
>>> +         */
>>> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
>>> +        return ENOMEM;
>>> +    }
>>> +
>>> +    /* Setup a dp_packet batch from descriptors in RX queue */
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>> +        uint64_t index;
>>> +
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct dp_packet *packet;
>>> +
>>> +        index = addr >> FRAME_SHIFT;
>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
>>> +        packet = &xpacket->packet;
>>> +
>>> +        /* Initialize the struct dp_packet */
>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>>> +        dp_packet_set_size(packet, len);
>>> +
>>> +        /* Add packet into batch, increase batch->count */
>>> +        dp_packet_batch_add(batch, packet);
>>> +
>>> +        idx_rx++;
>>> +    }
>>> +    /* Release the RX queue */
>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>> +
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t index;
>>> +        struct umem_elem *elem;
>>> +
>>> +        /* Get one free umem, program it into FILL queue */
>>> +        elem = elems[i];
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
>>> +
>>> +        idx_fq++;
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>>> +
>>> +    if (qfill) {
>>> +        /* TODO: return the number of remaining packets in the queue. */
>>> +        *qfill = 0;
>>> +    }
>>> +
>>> +#ifdef AFXDP_DEBUG
>>> +    log_xsk_stat(xsk);
>>> +#endif
>>> +    return 0;
>>> +}
>>> +
>>> +static inline int
>>> +kick_tx(struct xsk_socket_info *xsk)
>>> +{
>>> +    int ret;
>>> +
>>> +    /* This causes system call into kernel's xsk_sendmsg, and
>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
>>> +     */
>>> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
>>> +    if (OVS_UNLIKELY(ret < 0)) {
>>> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
>>> +            return errno;
>>> +        }
>>> +    }
>>> +    /* no error, or EBUSY or EAGAIN */
>>> +    return 0;
>>> +}
>>> +
>>> +static inline bool
>>> +check_free_batch(struct dp_packet_batch *batch)
>>> +{
>>> +    struct umem_pool *first_mpool = NULL;
>>> +    struct dp_packet_afxdp *xpacket;
>>> +    struct dp_packet *packet;
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        if (packet->source != DPBUF_AFXDP) {
>>> +            return false;
>>> +        }
>>> +        xpacket = dp_packet_cast_afxdp(packet);
>>> +        if (i == 0) {
>>> +            first_mpool = xpacket->mpool;
>>> +            continue;
>>> +        }
>>> +        if (xpacket->mpool != first_mpool) {
>>> +            return false;
>>> +        }
>>> +    }
>>> +    /* All packets are DPBUF_AFXDP and from the same mpool */
>>> +    return true;
>>> +}
>>> +
>>> +static inline void
>>> +afxdp_complete_tx(struct xsk_socket_info *xsk)
>>> +{
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t idx_cq = 0;
>>> +    int tx_done, j, ret;
>>> +
>>> +    if (!xsk->outstanding_tx) {
>>> +        return;
>>> +    }
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +    }
>>> +
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +        xsk->outstanding_tx -= tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
>>> +                        struct dp_packet_batch *batch,
>>> +                        bool concurrent_txq)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    struct xsk_socket_info *xsk = dev->xsks[qid];
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct dp_packet *packet;
>>> +    bool free_batch = true;
>>> +    uint32_t idx = 0;
>>> +    int error = 0;
>>> +    int ret;
>>> +
>>> +    if (!xsk) {
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
>>> +        qid = qid % dev->up.n_txq;
>>> +        ovs_spin_lock(&dev->tx_locks[qid]);
>>> +    }
>>> +
>>> +    /* Process CQ first. */
>>> +    afxdp_complete_tx(xsk);
>>> +
>>> +    free_batch = check_free_batch(batch);
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        xsk->tx_dropped += batch->count;
>>> +        error = ENOMEM;
>>> +        goto out;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        xsk->tx_dropped += batch->count;
>>> +        error = ENOMEM;
>>> +        goto out;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>> +         * TODO: avoid this copy if the packet and the pop umem
>>> +         * are located in the same umem.
>>> +         */
>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
>>> +
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +    }
>>> +
>>> +out:
>>> +    if (free_batch) {
>>> +        free_afxdp_buf_batch(batch);
>>> +    } else {
>>> +        dp_packet_delete_batch(batch, true);
>>> +    }
>>> +
>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
>>> +        ovs_spin_unlock(&dev->tx_locks[qid]);
>>> +    }
>>> +    return error;
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
>>> +{
>>> +   /* Done at reconfigure */
>>> +   return 0;
>>> +}
>>> +
>>> +void
>>> +netdev_afxdp_destruct(struct netdev *netdev_)
>>> +{
>>> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>>> +
>>> +    /* Note: tc is by-passed when using drv-mode, but when using
>>> +     * skb-mode, we might need to clean up tc. */
>>> +
>>> +    xsk_destroy_all(netdev_);
>>> +    ovs_mutex_destroy(&netdev->mutex);
>>> +}
>>> +
>>> +int
>>> +netdev_afxdp_get_stats(const struct netdev *netdev,
>>> +                       struct netdev_stats *stats)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    struct netdev_stats dev_stats;
>>> +    struct xsk_socket_info *xsk;
>>> +    int error, i;
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +
>>> +    error = get_stats_via_netlink(netdev, &dev_stats);
>>> +    if (error) {
>>> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
>>> +    } else {
>>> +        /* Use kernel netdev's packet and byte counts */
>>> +        stats->rx_packets = dev_stats.rx_packets;
>>> +        stats->rx_bytes = dev_stats.rx_bytes;
>>> +        stats->tx_packets = dev_stats.tx_packets;
>>> +        stats->tx_bytes = dev_stats.tx_bytes;
>>> +
>>> +        stats->rx_errors           += dev_stats.rx_errors;
>>> +        stats->tx_errors           += dev_stats.tx_errors;
>>> +        stats->rx_dropped          += dev_stats.rx_dropped;
>>> +        stats->tx_dropped          += dev_stats.tx_dropped;
>>> +        stats->multicast           += dev_stats.multicast;
>>> +        stats->collisions          += dev_stats.collisions;
>>> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
>>> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
>>> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
>>> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
>>> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
>>> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
>>> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
>>> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
>>> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
>>> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
>>> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
>>> +
>>> +        /* Account the dropped in each xsk */
>>> +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
>>> +            xsk = dev->xsks[i];
>>> +            if (xsk) {
>>> +                stats->rx_dropped += xsk->rx_dropped;
>>> +                stats->tx_dropped += xsk->tx_dropped;
>>> +            }
>>> +        }
>>> +    }
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +
>>> +    return error;
>>> +}
>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
>>> new file mode 100644
>>> index 000000000000..dd2dc1a2064d
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.h
>>> @@ -0,0 +1,74 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_AFXDP_H
>>> +#define NETDEV_AFXDP_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +/* These functions are Linux AF_XDP specific, so they should be used directly
>>> + * only by Linux-specific code. */
>>> +
>>> +#define MAX_XSKQ 16
>>> +
>>> +struct netdev;
>>> +struct xsk_socket_info;
>>> +struct xdp_umem;
>>> +struct dp_packet_batch;
>>> +struct smap;
>>> +struct dp_packet;
>>> +struct netdev_rxq;
>>> +struct netdev_stats;
>>> +
>>> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
>>> +void netdev_afxdp_destruct(struct netdev *netdev_);
>>> +
>>> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
>>> +                          struct dp_packet_batch *batch,
>>> +                          int *qfill);
>>> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
>>> +                            struct dp_packet_batch *batch,
>>> +                            bool concurrent_txq);
>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>> +                            char **errp);
>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
>>> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
>>> +                           struct netdev_stats *stats);
>>> +
>>> +void free_afxdp_buf(struct dp_packet *p);
>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
>>> +void signal_remove_xdp(struct netdev *netdev);
>>> +
>>> +#else /* !HAVE_AF_XDP */
>>> +
>>> +#include "openvswitch/compiler.h"
>>> +
>>> +struct dp_packet;
>>> +
>>> +static inline void
>>> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
>>> +{
>>> +    /* Nothing */
>>> +}
>>> +
>>> +#endif /* HAVE_AF_XDP */
>>> +#endif /* netdev-afxdp.h */
>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>> new file mode 100644
>>> index 000000000000..6a0388cf9dc3
>>> --- /dev/null
>>> +++ b/lib/netdev-linux-private.h
>>> @@ -0,0 +1,139 @@
>>> +/*
>>> + * Copyright (c) 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +#include "netdev-afxdp.h"
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>> +#include "xdpsock.h"
>>> +
>>> +/* These functions are Linux specific, so they should be used directly only by
>>> + * Linux-specific code. */
>>> +
>>> +struct netdev;
>>> +
>>> +struct netdev_rxq_linux {
>>> +    struct netdev_rxq up;
>>> +    bool is_tap;
>>> +    int fd;
>>> +};
>>> +
>>> +void netdev_linux_run(const struct netdev_class *);
>>> +
>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
>>> +                                  const char *flag_name, bool enable);
>>> +
>>> +int get_stats_via_netlink(const struct netdev *netdev_,
>>> +                          struct netdev_stats *stats);
>>> +
>>> +struct netdev_linux {
>>> +    struct netdev up;
>>> +
>>> +    /* Protects all members below. */
>>> +    struct ovs_mutex mutex;
>>> +
>>> +    unsigned int cache_valid;
>>> +
>>> +    bool miimon;                    /* Link status of last poll. */
>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> +    struct timer miimon_timer;
>>> +
>>> +    int netnsid;                    /* Network namespace ID. */
>>> +    /* The following are figured out "on demand" only.  They are only valid
>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> +    int ifindex;
>>> +    struct eth_addr etheraddr;
>>> +    int mtu;
>>> +    unsigned int ifi_flags;
>>> +    long long int carrier_resets;
>>> +    uint32_t kbits_rate;        /* Policing data. */
>>> +    uint32_t kbits_burst;
>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> +                                   0 or an errno value. */
>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
>>> +                                 * or SIOCSIFMTU.
>>> +                                 */
>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> +
>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> +
>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> +    struct tc *tc;
>>> +
>>> +    /* For devices of class netdev_tap_class only. */
>>> +    int tap_fd;
>>> +    bool present;               /* If the device is present in the namespace */
>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> +
>>> +    /* LAG information. */
>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> +
>>> +    /* AF_XDP information */
>>> +#ifdef HAVE_AF_XDP
>>> +    struct xsk_socket_info **xsks;
>>> +    int requested_n_rxq;
>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
>>> +    int xdp_flags, xdp_bind_flags;
>>> +    struct ovs_spinlock *tx_locks;
>>> +#endif
>>> +};
>>> +
>>> +static bool
>>> +is_netdev_linux_class(const struct netdev_class *netdev_class)
>>> +{
>>> +    return netdev_class->run == netdev_linux_run;
>>> +}
>>> +
>>> +static struct netdev_linux *
>>> +netdev_linux_cast(const struct netdev *netdev)
>>> +{
>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>> +
>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +}
>>> +
>>> +static struct netdev_rxq_linux *
>>> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
>>> +{
>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
>>> +
>>> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
>>> +}
>>> +
>>> +#endif /* netdev-linux-private.h */
>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>> index f75d73fd39f8..2883cf1f2586 100644
>>> --- a/lib/netdev-linux.c
>>> +++ b/lib/netdev-linux.c
>>> @@ -17,6 +17,7 @@
>>>  #include <config.h>
>>>
>>>  #include "netdev-linux.h"
>>> +#include "netdev-linux-private.h"
>>>
>>>  #include <errno.h>
>>>  #include <fcntl.h>
>>> @@ -54,6 +55,7 @@
>>>  #include "fatal-signal.h"
>>>  #include "hash.h"
>>>  #include "openvswitch/hmap.h"
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-provider.h"
>>>  #include "netdev-tc-offloads.h"
>>>  #include "netdev-vport.h"
>>> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>>>
>>> -struct netdev_linux {
>>> -    struct netdev up;
>>> -
>>> -    /* Protects all members below. */
>>> -    struct ovs_mutex mutex;
>>> -
>>> -    unsigned int cache_valid;
>>> -
>>> -    bool miimon;                    /* Link status of last poll. */
>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>> -    struct timer miimon_timer;
>>> -
>>> -    int netnsid;                    /* Network namespace ID. */
>>> -    /* The following are figured out "on demand" only.  They are only valid
>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>> -    int ifindex;
>>> -    struct eth_addr etheraddr;
>>> -    int mtu;
>>> -    unsigned int ifi_flags;
>>> -    long long int carrier_resets;
>>> -    uint32_t kbits_rate;        /* Policing data. */
>>> -    uint32_t kbits_burst;
>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>> -                                   0 or an errno value. */
>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>> -
>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>> -
>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>> -    struct tc *tc;
>>> -
>>> -    /* For devices of class netdev_tap_class only. */
>>> -    int tap_fd;
>>> -    bool present;               /* If the device is present in the namespace */
>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>> -
>>> -    /* LAG information. */
>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>> -};
>>> -
>>> -struct netdev_rxq_linux {
>>> -    struct netdev_rxq up;
>>> -    bool is_tap;
>>> -    int fd;
>>> -};
>>>
>>>  /* This is set pretty low because we probably won't learn anything from the
>>>   * additional log messages. */
>>> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>   * changes in the device miimon status, so we can use atomic_count. */
>>>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>>>
>>> -static void netdev_linux_run(const struct netdev_class *);
>>> -
>>>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>>>                                     int cmd, const char *cmd_name);
>>>  static int get_flags(const struct netdev *, unsigned int *flags);
>>> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
>>>                         struct in_addr addr);
>>>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
>>>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
>>> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
>>>  static int af_packet_sock(void);
>>>  static bool netdev_linux_miimon_enabled(void);
>>>  static void netdev_linux_miimon_run(void);
>>> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
>>>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
>>>
>>>  static bool
>>> -is_netdev_linux_class(const struct netdev_class *netdev_class)
>>> -{
>>> -    return netdev_class->run == netdev_linux_run;
>>> -}
>>> -
>>> -static bool
>>>  is_tap_netdev(const struct netdev *netdev)
>>>  {
>>>      return netdev_get_class(netdev) == &netdev_tap_class;
>>>  }
>>> -
>>> -static struct netdev_linux *
>>> -netdev_linux_cast(const struct netdev *netdev)
>>> -{
>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>> -
>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> -}
>>> -
>>> -static struct netdev_rxq_linux *
>>> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
>>> -{
>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
>>> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
>>> -}
>>>
>>>  static int
>>>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
>>> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
>>>      }
>>>  }
>>>
>>> -static void
>>> +void
>>>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
>>>  {
>>>      struct nl_sock *sock;
>>> @@ -3279,9 +3206,7 @@ exit:
>>>      .run = netdev_linux_run,                                    \
>>>      .wait = netdev_linux_wait,                                  \
>>>      .alloc = netdev_linux_alloc,                                \
>>> -    .destruct = netdev_linux_destruct,                          \
>>>      .dealloc = netdev_linux_dealloc,                            \
>>> -    .send = netdev_linux_send,                                  \
>>>      .send_wait = netdev_linux_send_wait,                        \
>>>      .set_etheraddr = netdev_linux_set_etheraddr,                \
>>>      .get_etheraddr = netdev_linux_get_etheraddr,                \
>>> @@ -3312,10 +3237,8 @@ exit:
>>>      .arp_lookup = netdev_linux_arp_lookup,                      \
>>>      .update_flags = netdev_linux_update_flags,                  \
>>>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
>>> -    .rxq_construct = netdev_linux_rxq_construct,                \
>>>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
>>>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
>>> -    .rxq_recv = netdev_linux_rxq_recv,                          \
>>>      .rxq_wait = netdev_linux_rxq_wait,                          \
>>>      .rxq_drain = netdev_linux_rxq_drain
>>>
>>> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "system",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>> +    .destruct = netdev_linux_destruct,
>>>      .get_stats = netdev_linux_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>>      .get_status = netdev_linux_get_status,
>>> -    .get_block_id = netdev_linux_get_block_id
>>> +    .get_block_id = netdev_linux_get_block_id,
>>> +    .send = netdev_linux_send,
>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>  };
>>>
>>>  const struct netdev_class netdev_tap_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      .type = "tap",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct_tap,
>>> +    .destruct = netdev_linux_destruct,
>>>      .get_stats = netdev_tap_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>>      .get_status = netdev_linux_get_status,
>>> +    .send = netdev_linux_send,
>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>  };
>>>
>>>  const struct netdev_class netdev_internal_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "internal",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>> +    .destruct = netdev_linux_destruct,
>>>      .get_stats = netdev_internal_get_stats,
>>>      .get_status = netdev_internal_get_status,
>>> +    .send = netdev_linux_send,
>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>  };
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +const struct netdev_class netdev_afxdp_class = {
>>> +    NETDEV_LINUX_CLASS_COMMON,
>>> +    .type = "afxdp",
>>> +    .is_pmd = true,
>>> +    .construct = netdev_linux_construct,
>>> +    .destruct = netdev_afxdp_destruct,
>>> +    .get_stats = netdev_afxdp_get_stats,
>>> +    .get_status = netdev_linux_get_status,
>>> +    .set_config = netdev_afxdp_set_config,
>>> +    .get_config = netdev_afxdp_get_config,
>>> +    .reconfigure = netdev_afxdp_reconfigure,
>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
>>> +    .send = netdev_afxdp_batch_send,
>>> +    .rxq_construct = netdev_afxdp_rxq_construct,
>>> +    .rxq_recv = netdev_afxdp_rxq_recv,
>>> +};
>>> +#endif
>>>
>>>
>>>  #define CODEL_N_QUEUES 0x0000
>>> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
>>>      dst->tx_window_errors = src->tx_window_errors;
>>>  }
>>>
>>> -static int
>>> +int
>>>  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
>>>  {
>>>      struct ofpbuf request;
>>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>>> index fb0c27e6e8e8..91e6a9e2bfc0 100644
>>> --- a/lib/netdev-provider.h
>>> +++ b/lib/netdev-provider.h
>>> @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
>>>  extern const struct netdev_class netdev_internal_class;
>>>  extern const struct netdev_class netdev_tap_class;
>>>
>>> +#ifdef HAVE_AF_XDP
>>> +extern const struct netdev_class netdev_afxdp_class;
>>> +#endif
>>>  #ifdef  __cplusplus
>>>  }
>>>  #endif
>>> diff --git a/lib/netdev.c b/lib/netdev.c
>>> index 7d7ecf6f0946..0fac117cc602 100644
>>> --- a/lib/netdev.c
>>> +++ b/lib/netdev.c
>>> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>
>>>  static void restore_all_flags(void *aux OVS_UNUSED);
>>>  void update_device_args(struct netdev *, const struct shash *args);
>>> +#ifdef HAVE_AF_XDP
>>> +void signal_remove_xdp(struct netdev *netdev);
>>> +#endif
>>>
>>>  int
>>>  netdev_n_txq(const struct netdev *netdev)
>>> @@ -146,6 +149,9 @@ netdev_initialize(void)
>>>          netdev_register_provider(&netdev_internal_class);
>>>          netdev_register_provider(&netdev_tap_class);
>>>          netdev_vport_tunnel_register();
>>> +#ifdef HAVE_AF_XDP
>>> +        netdev_register_provider(&netdev_afxdp_class);
>>> +#endif
>>>  #endif
>>>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>>>          netdev_register_provider(&netdev_tap_class);
>>> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
>>>                                                 saved_flags & ~saved_values,
>>>                                                 &old_flags);
>>>          }
>>> +#ifdef HAVE_AF_XDP
>>> +        if (netdev->netdev_class == &netdev_afxdp_class) {
>>> +            signal_remove_xdp(netdev);
>>> +        }
>>> +#endif
>>>      }
>>>  }
>>>
>>> diff --git a/lib/spinlock.h b/lib/spinlock.h
>>> new file mode 100644
>>> index 000000000000..1ae634f23a6b
>>> --- /dev/null
>>> +++ b/lib/spinlock.h
>>> @@ -0,0 +1,70 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +#ifndef SPINLOCK_H
>>> +#define SPINLOCK_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <ctype.h>
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <stdarg.h>
>>> +#include <stdlib.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "ovs-atomic.h"
>>> +
>>> +struct ovs_spinlock {
>>> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
>>> +};
>>> +
>>> +static inline void
>>> +ovs_spinlock_init(struct ovs_spinlock *sl)
>>> +{
>>> +    atomic_init(&sl->locked, 0);
>>> +}
>>> +
>>> +static inline void
>>> +ovs_spin_lock(struct ovs_spinlock *sl)
>>> +{
>>> +    int exp = 0, locked = 0;
>>> +
>>> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed)) {
>>> +        locked = 1;
>>> +        while (locked) {
>>> +            atomic_read_relaxed(&sl->locked, &locked);
>>> +        }
>>> +        exp = 0;
>>> +    }
>>> +}
>>> +
>>> +static inline void
>>> +ovs_spin_unlock(struct ovs_spinlock *sl)
>>> +{
>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
>>> +}
>>> +
>>> +static inline int
>>> +ovs_spin_trylock(struct ovs_spinlock *sl)
>>> +{
>>> +    int exp = 0;
>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed);
>>> +}
>>> +#endif
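
For readers following the locking scheme: below is a minimal standalone sketch of the same test-and-test-and-set idea, written against plain C11 <stdatomic.h> rather than ovs-atomic.h. The sketch_* names are hypothetical and not part of the patch, and the assert in the unlock path is my addition for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ovs_spinlock, using plain C11 atomics. */
struct sketch_spinlock {
    atomic_int locked;      /* 0 = free, 1 = held. */
};

static void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Tries once to take the lock; returns true if it was acquired. */
static bool
sketch_spin_trylock(struct sketch_spinlock *sl)
{
    int exp = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    while (!sketch_spin_trylock(sl)) {
        /* Spin on a relaxed load until the lock looks free, then retry the
         * compare-and-swap.  This keeps waiters from issuing a CAS on
         * every iteration. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    /* Releasing a lock that is not held would be a caller bug. */
    assert(atomic_load_explicit(&sl->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Note that, like ovs_spin_trylock() in the patch, the trylock here reports success as non-zero, which is the opposite of the pthread_spin_trylock() convention. That might be worth a comment in the header.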
>>> diff --git a/lib/util.c b/lib/util.c
>>> index 7b8ab81f6ee1..5eb20995b370 100644
>>> --- a/lib/util.c
>>> +++ b/lib/util.c
>>> @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
>>>      return xrealloc(p, *n * s);
>>>  }
>>>
>>> -/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
>>> - * dedicated cache lines.  That is, the memory block returned will not share a
>>> - * cache line with other data, avoiding "false sharing".
>>> +/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
>>> + * 'alignment' must be a power of two and a multiple of sizeof(void *).
>>>   *
>>> - * Use free_cacheline() to free the returned memory block. */
>>> + * Use free_size_align() to free the returned memory block. */
>>>  void *
>>> -xmalloc_cacheline(size_t size)
>>> +xmalloc_size_align(size_t size, size_t alignment)
>>>  {
>>>  #ifdef HAVE_POSIX_MEMALIGN
>>>      void *p;
>>>      int error;
>>>
>>>      COVERAGE_INC(util_xalloc);
>>> -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
>>> +    error = posix_memalign(&p, alignment, size ? size : 1);
>>>      if (error != 0) {
>>>          out_of_memory();
>>>      }
>>> @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
>>>  #else
>>>      /* Allocate room for:
>>>       *
>>> -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
>>> -     *       pointer to be aligned exactly sizeof(void *) bytes before the
>>> -     *       beginning of a cache line.
>>> +     *     - Header padding: Up to alignment - 1 bytes, to allow the
>>> +     *       pointer 'q' to be aligned exactly sizeof(void *) bytes before the
>>> +     *       beginning of the alignment.
>>>       *
>>>       *     - Pointer: A pointer to the start of the header padding, to allow us
>>>       *       to free() the block later.
>>>       *
>>>       *     - User data: 'size' bytes.
>>>       *
>>> -     *     - Trailer padding: Enough to bring the user data up to a cache line
>>> +     *     - Trailer padding: Enough to bring the user data up to an alignment
>>>       *       multiple.
>>>       *
>>>       * +---------------+---------+------------------------+---------+
>>> @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
>>>       * p               q         r
>>>       *
>>>       */
>>> -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
>>> -                      + sizeof(void *)
>>> -                      + ROUND_UP(size, CACHE_LINE_SIZE));
>>> -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
>>> -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
>>> -                                CACHE_LINE_SIZE);
>>> -    void **q = (void **) r - 1;
>>> +    void *p, *r, **q;
>>> +    bool runt;
>>> +
>>> +    COVERAGE_INC(util_xalloc);
>>> +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
>>> +        ovs_abort(0, "Invalid alignment");
>>> +    }
>>> +
>>> +    p = xmalloc((alignment - 1)
>>> +                + sizeof(void *)
>>> +                + ROUND_UP(size, alignment));
>>> +
>>> +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
>>> +    /* When the padding size < sizeof(void*), we don't have enough room for
>>> +     * pointer 'q'. As a result, need to move 'r' to the next alignment.
>>> +     * So ROUND_UP when xmalloc above, and ROUND_UP again when calculating 'r'
>>> +     * below.
>>> +     */
>>> +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
>>> +    q = (void **) r - 1;
>>>      *q = p;
>>> +
>>>      return r;
>>>  #endif
>>>  }
>>>
>>> +void
>>> +free_size_align(void *p)
>>> +{
>>> +#ifdef HAVE_POSIX_MEMALIGN
>>> +    free(p);
>>> +#else
>>> +    if (p) {
>>> +        void **q = (void **) p - 1;
>>> +        free(*q);
>>> +    }
>>> +#endif
>>> +}
>>> +
>>> +/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
>>> + * dedicated cache lines.  That is, the memory block returned will not share a
>>> + * cache line with other data, avoiding "false sharing".
>>> + *
>>> + * Use free_cacheline() to free the returned memory block. */
>>> +void *
>>> +xmalloc_cacheline(size_t size)
>>> +{
>>> +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
>>> +}
>>> +
>>>  /* Like xmalloc_cacheline() but clears the allocated memory to all zero
>>>   * bytes. */
>>>  void *
>>> @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
>>>  void
>>>  free_cacheline(void *p)
>>>  {
>>> -#ifdef HAVE_POSIX_MEMALIGN
>>> -    free(p);
>>> -#else
>>> -    if (p) {
>>> -        void **q = (void **) p - 1;
>>> -        free(*q);
>>> -    }
>>> -#endif
>>> +    free_size_align(p);
>>> +}
>>> +
>>> +void *
>>> +xmalloc_pagealign(size_t size)
>>> +{
>>> +    return xmalloc_size_align(size, get_page_size());
>>> +}
>>> +
>>> +void
>>> +free_pagealign(void *p)
>>> +{
>>> +    free_size_align(p);
>>>  }
>>>
>>>  char *
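
To make the non-posix_memalign() path easier to reason about, here is a small standalone sketch of the "stash the original pointer just before the aligned block" technique. It is deliberately a simpler variant than the patch: it always reserves sizeof(void *) ahead of the user block instead of using the 'runt' test, and it returns NULL rather than aborting. The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', or NULL on bad arguments or
 * allocation failure.  'alignment' must be a power of two and a multiple of
 * sizeof(void *), the same precondition the patch checks. */
static void *
sketch_aligned_alloc(size_t size, size_t alignment)
{
    if (alignment == 0 || (alignment & (alignment - 1))
        || alignment % sizeof(void *)) {
        return NULL;
    }

    /* Room for the worst-case padding, the stashed pointer, and the data. */
    void *p = malloc(alignment - 1 + sizeof(void *) + size);
    if (!p) {
        return NULL;
    }

    /* Skip past the slot for the stashed pointer, then round up. */
    uintptr_t start = (uintptr_t) p + sizeof(void *);
    uintptr_t r = (start + alignment - 1) & ~(uintptr_t) (alignment - 1);
    assert((r & (alignment - 1)) == 0);

    /* 'q' in the patch's diagram: the slot immediately before 'r'. */
    ((void **) r)[-1] = p;
    return (void *) r;
}

static void
sketch_aligned_free(void *r)
{
    if (r) {
        /* Recover the pointer malloc() originally returned. */
        free(((void **) r)[-1]);
    }
}
```

The point relevant to the patch is that the amount added before rounding must be large enough to leave room for 'q'; in the 'runt' case that amount is a full 'alignment'.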
>>> diff --git a/lib/util.h b/lib/util.h
>>> index c26605abdce3..33665748274c 100644
>>> --- a/lib/util.h
>>> +++ b/lib/util.h
>>> @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size);
>>>
>>>  int string_ends_with(const char *str, const char *suffix);
>>>
>>> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
>>> +void free_pagealign(void *);
>>> +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
>>> +void free_size_align(void *);
>>> +
>>>  /* The C standards say that neither the 'dst' nor 'src' argument to
>>>   * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
>>>   * the null case. */
>>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
>>> new file mode 100644
>>> index 000000000000..ea39fa557290
>>> --- /dev/null
>>> +++ b/lib/xdpsock.c
>>> @@ -0,0 +1,170 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +#include <config.h>
>>> +
>>> +#include "xdpsock.h"
>>> +#include "dp-packet.h"
>>> +#include "openvswitch/compiler.h"
>>> +
>>> +/* Note:
>>> + * umem_elem_push* shouldn't overflow because we always pop
>>> + * elem first, then push back to the stack.
>>> + */
>>> +static inline void
>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>>> +        OVS_NOT_REACHED();
>>> +    }
>>> +
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(ptr, addrs, n * sizeof(void *));
>>> +    umemp->index += n;
>>> +}
>>> +
>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    ovs_spin_lock(&umemp->lock);
>>> +    __umem_elem_push_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->lock);
>>> +}
>>> +
>>> +static inline void
>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
>>> +        OVS_NOT_REACHED();
>>> +    }
>>> +
>>> +    umemp->array[umemp->index++] = addr;
>>> +}
>>> +
>>> +void
>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +
>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
>>> +
>>> +    ovs_spin_lock(&umemp->lock);
>>> +    __umem_elem_push(umemp, addr);
>>> +    ovs_spin_unlock(&umemp->lock);
>>> +}
>>> +
>>> +static inline int
>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umemp->index -= n;
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(addrs, ptr, n * sizeof(void *));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->lock);
>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->lock);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static inline void *
>>> +__umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return umemp->array[--umemp->index];
>>> +}
>>> +
>>> +void *
>>> +umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    ovs_spin_lock(&umemp->lock);
>>> +    ptr = __umem_elem_pop(umemp);
>>> +    ovs_spin_unlock(&umemp->lock);
>>> +
>>> +    return ptr;
>>> +}
>>> +
>>> +static void **
>>> +__umem_pool_alloc(unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    bufs = xmalloc_pagealign(size * sizeof(void *));
>>> +    memset(bufs, 0, size * sizeof(void *));
>>> +
>>> +    return (void **)bufs;
>>> +}
>>> +
>>> +int
>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
>>> +{
>>> +    umemp->array = __umem_pool_alloc(size);
>>> +    if (!umemp->array) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umemp->size = size;
>>> +    umemp->index = 0;
>>> +    ovs_spinlock_init(&umemp->lock);
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +umem_pool_cleanup(struct umem_pool *umemp)
>>> +{
>>> +    free_pagealign(umemp->array);
>>> +    umemp->array = NULL;
>>> +}
>>> +
>>> +/* AF_XDP metadata init/destroy */
>>> +int
>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
>>> +
>>> +    xp->array = bufs;
>>> +    xp->size = size;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
>>> +{
>>> +    free_pagealign(xp->array);
>>> +    xp->array = NULL;
>>> +}
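
A minimal sketch of the LIFO pointer pool, without the spinlock, in case it helps when checking the bounds logic in the push/pop helpers. The sketch_* names are hypothetical, and assert() stands in for OVS_NOT_REACHED(); this is only an illustration, not a proposed replacement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct umem_pool, minus the lock. */
struct sketch_pool {
    int index;          /* Number of stored elements; top is index - 1. */
    unsigned int size;  /* Capacity of 'array'. */
    void **array;
};

static void
sketch_pool_push(struct sketch_pool *pool, void *addr)
{
    /* The whole comparison is the condition.  Overflow is a caller bug
     * because elements are always popped before being pushed back. */
    assert((unsigned int) pool->index + 1 <= pool->size);
    pool->array[pool->index++] = addr;
}

static void *
sketch_pool_pop(struct sketch_pool *pool)
{
    if (pool->index - 1 < 0) {
        return NULL;    /* Pool is empty. */
    }
    return pool->array[--pool->index];
}
```

With a capacity of two, pushing A then B and popping twice yields B then A, and a third pop returns NULL.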
>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>> new file mode 100644
>>> index 000000000000..1a1093381243
>>> --- /dev/null
>>> +++ b/lib/xdpsock.h
>>> @@ -0,0 +1,101 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "spinlock.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +
>>> +/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
>>> + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
>>> + */
>>> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
>>> +
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>> +
>>> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
>>> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
>>> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
>>> +
>>> +/* LIFO ptr_array */
>>> +struct umem_pool {
>>> +    int index;      /* point to top */
>>> +    unsigned int size;
>>> +    struct ovs_spinlock lock;
>>> +    void **array;   /* a pointer array, point to umem buf */
>>> +};
>>> +
>>> +/* array-based dp_packet_afxdp */
>>> +struct xpacket_pool {
>>> +    unsigned int size;
>>> +    struct dp_packet_afxdp **array;
>>> +};
>>> +
>>> +struct xsk_umem_info {
>>> +    struct umem_pool mpool;
>>> +    struct xpacket_pool xpool;
>>> +    struct xsk_ring_prod fq;
>>> +    struct xsk_ring_cons cq;
>>> +    struct xsk_umem *umem;
>>> +    void *buffer;
>>> +};
>>> +
>>> +struct xsk_socket_info {
>>> +    struct xsk_ring_cons rx;
>>> +    struct xsk_ring_prod tx;
>>> +    struct xsk_umem_info *umem;
>>> +    struct xsk_socket *xsk;
>>> +    unsigned long rx_dropped;
>>> +    unsigned long tx_dropped;
>>> +    uint32_t outstanding_tx;
>>> +};
>>> +
>>> +struct umem_elem {
>>> +    struct umem_elem *next;
>>> +};
>>> +
>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +void *umem_elem_pop(struct umem_pool *umemp);
>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
>>> +void umem_pool_cleanup(struct umem_pool *umemp);
>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
>>> +
>>> +#endif
>>> +#endif
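
A rough worked example of what these macros imply for memory use. The numeric values below are my assumption of the libbpf defaults (2048 descriptors per ring, 4096-byte frames); please check <bpf/xsk.h> on the target system before relying on them. The sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed libbpf defaults; verify against <bpf/xsk.h>. */
#define SKETCH_NUM_DESCS    2048u   /* XSK_RING_{PROD,CONS}__DEFAULT_NUM_DESCS */
#define SKETCH_FRAME_SHIFT  12u     /* XSK_UMEM__DEFAULT_FRAME_SHIFT */
#define SKETCH_FRAME_SIZE   (1u << SKETCH_FRAME_SHIFT)
#define SKETCH_FRAME_MASK   (SKETCH_FRAME_SIZE - 1)

/* Worst case from the comment in the patch: TX, CQ, RX and FILL all full. */
#define SKETCH_NUM_FRAMES   (2u * (SKETCH_NUM_DESCS + SKETCH_NUM_DESCS))

/* Mirrors BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES)). */
static_assert((SKETCH_NUM_FRAMES & (SKETCH_NUM_FRAMES - 1)) == 0,
              "NUM_FRAMES must be a power of two");

/* Total umem size in bytes for one umem region. */
static uint64_t
sketch_umem_bytes(void)
{
    return (uint64_t) SKETCH_NUM_FRAMES * SKETCH_FRAME_SIZE;
}

/* The same alignment check umem_elem_push() asserts on. */
static int
sketch_is_frame_aligned(uint64_t addr)
{
    return (addr & SKETCH_FRAME_MASK) == 0;
}
```

Under those assumed defaults this gives 8192 frames and a 32 MiB umem region.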
>>> diff --git a/tests/automake.mk b/tests/automake.mk
>>> index 2956e68b242c..131564bb0bd3 100644
>>> --- a/tests/automake.mk
>>> +++ b/tests/automake.mk
>>> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>>>       $(SYSTEM_TESTSUITE_AT) \
>>>       $(SYSTEM_KMOD_TESTSUITE_AT) \
>>>       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
>>> +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
>>>       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>>>       $(SYSTEM_DPDK_TESTSUITE_AT) \
>>>       $(OVSDB_CLUSTER_TESTSUITE_AT) \
>>>       $(TESTSUITE) \
>>>       $(SYSTEM_KMOD_TESTSUITE) \
>>>       $(SYSTEM_USERSPACE_TESTSUITE) \
>>> +     $(SYSTEM_AFXDP_TESTSUITE) \
>>>       $(SYSTEM_OFFLOADS_TESTSUITE) \
>>>       $(SYSTEM_DPDK_TESTSUITE) \
>>>       $(OVSDB_CLUSTER_TESTSUITE) \
>>> @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>>>       tests/system-userspace-macros.at \
>>>       tests/system-userspace-packet-type-aware.at
>>>
>>> +SYSTEM_AFXDP_TESTSUITE_AT = \
>>> +     tests/system-afxdp-testsuite.at \
>>> +     tests/system-afxdp-macros.at
>>> +
>>>  SYSTEM_TESTSUITE_AT = \
>>>       tests/system-common-macros.at \
>>>       tests/system-ovn.at \
>>> @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>>>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>>>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>>>  SYSTEM_USERSPACE_TESTSUITE =
>>> $(srcdir)/tests/system-userspace-testsuite
>>> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>>>  SYSTEM_OFFLOADS_TESTSUITE = 
>>> $(srcdir)/tests/system-offloads-testsuite
>>>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>>>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
>>> @@ -317,6 +324,11 @@ check-system-userspace: all
>>>       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests
>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && 
>>> "$$@"
>>> --recheck)
>>>
>>> +check-afxdp: all
>>> +     $(MAKE) install
>>> +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests
>>> AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
>>> +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>> +
>>>  check-offloads: all
>>>       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests
>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && 
>>> "$$@"
>>> --recheck)
>>> @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4
>>> $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>>       $(AM_V_at)mv $@.tmp $@
>>>
>>> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
>>> $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
>>> +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>> +     $(AM_V_at)mv $@.tmp $@
>>> +
>>>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
>>> $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>>       $(AM_V_at)mv $@.tmp $@
>>> diff --git a/tests/system-afxdp-macros.at
>>> b/tests/system-afxdp-macros.at
>>> new file mode 100644
>>> index 000000000000..1e6f7a46b4b7
>>> --- /dev/null
>>> +++ b/tests/system-afxdp-macros.at
>>> @@ -0,0 +1,20 @@
>>> +# Add port to ovs bridge by using afxdp mode.
>>> +# This will use generic XDP support in the veth driver.
>>> +m4_define([ADD_VETH],
>>> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return
>>> 77])
>>> +      CONFIGURE_VETH_OFFLOADS([$1])
>>> +      AT_CHECK([ip link set $1 netns $2])
>>> +      AT_CHECK([ip link set dev ovs-$1 up])
>>> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
>>> +                set interface ovs-$1 external-ids:iface-id="$1"
>>> type="afxdp"])
>>> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
>>> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
>>> +      if test -n "$5"; then
>>> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
>>> +      fi
>>> +      if test -n "$6"; then
>>> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
>>> +      fi
>>> +      on_exit 'ip link del ovs-$1'
>>> +    ]
>>> +)
>>> diff --git a/tests/system-afxdp-testsuite.at
>>> b/tests/system-afxdp-testsuite.at
>>> new file mode 100644
>>> index 000000000000..9b7a29066614
>>> --- /dev/null
>>> +++ b/tests/system-afxdp-testsuite.at
>>> @@ -0,0 +1,26 @@
>>> +AT_INIT
>>> +
>>> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
>>> +
>>> +Licensed under the Apache License, Version 2.0 (the "License");
>>> +you may not use this file except in compliance with the License.
>>> +You may obtain a copy of the License at:
>>> +
>>> +    http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> +Unless required by applicable law or agreed to in writing, software
>>> +distributed under the License is distributed on an "AS IS" BASIS,
>>> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> +See the License for the specific language governing permissions and
>>> +limitations under the License.])
>>> +
>>> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
>>> +
>>> +m4_include([tests/ovs-macros.at])
>>> +m4_include([tests/ovsdb-macros.at])
>>> +m4_include([tests/ofproto-macros.at])
>>> +m4_include([tests/system-common-macros.at])
>>> +m4_include([tests/system-userspace-macros.at])
>>> +m4_include([tests/system-afxdp-macros.at])
>>> +
>>> +m4_include([tests/system-traffic.at])
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 89c06a1b7877..1e3acbbb8075 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
>>> type=patch options:peer=p1 \
>>>          </p>
>>>        </column>
>>>
>>> +      <column name="other_config" key="xdpmode"
>>> +              type='{"type": "string",
>>> +                     "enum": ["set", ["skb", "drv"]]}'>
>>> +        <p>
>>> +          Specifies the operational mode of the XDP program.
>>> +          If "drv", the XDP program is loaded into the device driver with
>>> +          zero-copy RX and TX enabled. This mode requires a device driver
>>> +          with AF_XDP support and has the best performance.
>>> +          If "skb", the XDP program uses generic XDP mode in the kernel,
>>> +          with extra data copying between userspace and kernel. No device
>>> +          driver support is needed. Note that this option applies to the
>>> +          afxdp netdev type only.
>>> +          Defaults to "skb" mode.
>>> +        </p>
>>> +      </column>
>>> +
>>>        <column name="options" key="vhost-server-path"
>>>                type='{"type": "string"}'>
>>>          <p>
>>> --
>>> 2.7.4
Eelco Chaudron June 11, 2019, 6:47 a.m. UTC | #5
On 8 Jun 2019, at 6:48, William Tu wrote:

>>>> +  ethtool -L enp2s0 combined 1
>>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>>> type="afxdp"
>>>> \
>>>> +    options:n_rxq=1 options:xdpmode=drv \
>>>> +    other_config:pmd-rxq-affinity="0:4"
>
> Another feature I'm thinking about adding is a new option
> for loading a custom XDP program.
>
> For example:
> ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
>     options:n_rxq=1 options:xdpmode=drv
>     options:xdp_prog=/path/to/xdp.o
>
> If users do not specify the path, then libbpf's default program is used
> (which forwards all packets to userspace).
>
> If users want to use their own XDP object, then this option can load the
> XDP object file from the path.

This might be useful, especially if you would like to do some
experiments.
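
For reference, the pmd-cpu-mask in the quoted example is a hexadecimal bitmask with one bit per CPU core, so a single PMD on core 4 gives 0x10. A minimal C sketch of that conversion follows; the helper name cores_to_mask is hypothetical and is not part of OVS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an OVS function: builds a pmd-cpu-mask style
 * bitmask in which bit N is set when core N should run a PMD thread. */
static uint64_t
cores_to_mask(const unsigned int *cores, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cores[i] < 64);          /* Sketch handles 64 cores at most. */
        mask |= UINT64_C(1) << cores[i];
    }
    return mask;
}
```

Passing the single core 4 yields 0x10, matching the example; cores 1 through 4 would yield 0x1e.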
Eelco Chaudron June 11, 2019, 1:52 p.m. UTC | #6
Hi William,

Here are some more details, this is a port to port test (same port in as 
out) using the following rule:

   ovs-ofctl add-flow ovs_pvp_br0 "in_port=eno1,action=IN_PORT"

Sent packets at wire speed, and it crashed…

(gdb) bt
#0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
#1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
#2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50, 
n=32, umemp=0x24cc790) at lib/xdpsock.c:32
#3  umem_elem_push_n (umemp=0x24cc790, n=32, 
addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
#4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at 
lib/netdev-afxdp.c:736
#5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0, 
batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at 
lib/netdev-afxdp.c:763
#6  0x0000000000908041 in netdev_send (netdev=<optimized out>, 
qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80, 
concurrent_txq=concurrent_txq@entry=true)
     at lib/netdev.c:800
#7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port 
(pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at 
lib/dpif-netdev.c:4187
#8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets 
(pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at 
lib/dpif-netdev.c:4227
#9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets 
(force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
#10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010, 
rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
#11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5449
#12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at 
lib/ovs-thread.c:352
#13 0x00007fbc6b0a12de in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6

After this crash, systemd restarted OVS, and it crashed again (I guess 
traffic was still flowing for a bit with the NORMAL rule installed):

Program terminated with signal SIGSEGV, Segmentation fault.
#0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0, 
qfill=0x0) at lib/netdev-afxdp.c:583
583	    rx->fd = xsk_socket__fd(xsk->xsk);
[Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]

(gdb) bt
#0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0, 
qfill=0x0) at lib/netdev-afxdp.c:583
#1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>, 
batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at 
lib/netdev.c:710
#2  0x00000000008dd1d3 in dp_netdev_process_rxq_port 
(pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at 
lib/dpif-netdev.c:4257
#3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5449
#4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at 
lib/ovs-thread.c:352
#5  0x00007f47229732de in start_thread () from /lib64/libpthread.so.0
#6  0x00007f4722118a63 in clone () from /lib64/libc.so.6

I did not investigate further, but it should be easy to replicate. This 
is the same setup that worked fine with the v8 patchset for port to 
port.
The next step was to verify that PVP was fixed, but I could not get there…
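
The first backtrace reaches abort() from __umem_elem_push_n at lib/xdpsock.c:32, but the thread does not show that line. One possible reading, and only an assumption, is an overflow check in the LIFO pointer pool declared in the quoted xdpsock.h, hit when more buffers are returned than the pool can hold. A minimal sketch of such a pool, with hypothetical names, no spinlock, and an error return in place of abort():

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the quoted struct umem_pool, minus the lock. */
struct lifo_pool {
    int index;              /* Number of stored pointers; top is index - 1. */
    unsigned int size;      /* Capacity of 'array'. */
    void **array;           /* Pointer array holding free buffer addresses. */
};

static int
lifo_pool_init(struct lifo_pool *p, unsigned int size)
{
    p->array = calloc(size, sizeof *p->array);
    if (!p->array) {
        return -1;
    }
    p->index = 0;
    p->size = size;
    return 0;
}

/* Pushes 'n' pointers; returns -1 instead of overflowing the array. */
static int
lifo_push_n(struct lifo_pool *p, int n, void **addrs)
{
    if (n < 0 || (unsigned int) p->index + n > p->size) {
        return -1;
    }
    memcpy(&p->array[p->index], addrs, (size_t) n * sizeof *addrs);
    p->index += n;
    return 0;
}

/* Pops the 'n' most recently pushed pointers; returns -1 on underflow. */
static int
lifo_pop_n(struct lifo_pool *p, int n, void **addrs)
{
    if (n < 0 || p->index < n) {
        return -1;
    }
    p->index -= n;
    memcpy(addrs, &p->array[p->index], (size_t) n * sizeof *addrs);
    return 0;
}
```

With a check like this, pushing a batch of 32 completed buffers into a pool that is already nearly full would be rejected, which is the kind of condition worth looking for on the TX completion path.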
Cheers,

Eelco

On 8 Jun 2019, at 10:12, Eelco Chaudron wrote:

> Hi William,
>
> This was still a draft email, and was not supposed to go out ;)
>
> My debug and build setup was a bit messed up and was having problems 
> running GDB… I was (I’m) planning to continue getting some debug 
> info on Tuesday after the public holiday here…
>
> But just to give you a heads up, it starts up fine with root access 
> but it crashes during a simple Port to Port run with wire-speed 
> traffic. Then it will run into a restart/crash loop.
>
> Will try to get you more details next week…
>
> Cheers,
>
> Eelco
>
>
> On 7 Jun 2019, at 23:33, William Tu wrote:
>
>> Hi Eelco,
>>
>> Thanks for the testing.
>>
>> On Fri, Jun 7, 2019 at 8:43 AM Eelco Chaudron <echaudro@redhat.com> 
>> wrote:
>>>
>>> Hi William,
>>>
>>> No review or full test yet, just some observations…
>>>
>>> We run OVS as a non root user, which is causing OVS with XDP to 
>>> fail:
>>
>> Right, XDP requires using root privilege.
>> I will add this in the documentation.
>
> Is this a hard requirement? As I do not remember running OVS as root 
> before…
>
>>>
>>> 2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev:
>>> Datapath supports ct_orig_tuple
>>> 2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev:
>>> Datapath supports ct_orig_tuple6
>>> 2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on 
>>> numa_id:
>>> 0, core id: 21 created.
>>> 2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd 
>>> threads
>>> on numa node 0
>>> 2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
>>> 2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1 
>>> in
>>> DRV mode
>>> 2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR:
>>> setrlimit(RLIMIT_MEMLOCK): Operation not permitted
>>
>> This is due to not having root privileges, so OVS is not able to lock the
>> memory that lets the device driver DMA packet buffers directly into
>> userspace.
>>
>> Can you try using root?
>>
>> Regards,
>> William
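
The setrlimit error in the quoted log is what an unprivileged process sees when it tries to raise its hard RLIMIT_MEMLOCK limit, which on Linux needs CAP_SYS_RESOURCE. A minimal sketch of such a call, assuming the limit is raised to RLIM_INFINITY (the thread does not show the actual code, and the helper name is hypothetical):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper, not the patch's code: tries to lift the locked-memory
 * limit and returns 0 on success or a positive errno value on failure. */
static int
raise_memlock_limit(void)
{
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        int error = errno;      /* Save errno before calling fprintf(). */

        fprintf(stderr, "setrlimit(RLIMIT_MEMLOCK): %s\n", strerror(error));
        return error;
    }
    return 0;
}
```

Run as a non-root user whose hard limit is finite, this is expected to return EPERM, which matches the "Operation not permitted" message above.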
>>
>>> 2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all
>>> configure queue 0 mode DRV
>>> 2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create
>>> failed (Operation not permitted) mode: DRV qid: 0
>>> 2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create 
>>> AF_XDP
>>> socket on queue 0
>>> 2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
>>> 2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1
>>> reconfig fails
>>> 2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set 
>>> interface
>>> eno1 new configuration
>>>
>>> However when configuring this after startup it’s fine, but trying 
>>> to
>>> restart OVS with this configuration results in a system core…
>>>
>>>
>>>
>>>
>>> On 5 Jun 2019, at 22:47, William Tu wrote:
>>>
>>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
>>>> socket type built upon the eBPF and XDP technology.  It aims to have
>>>> performance comparable to DPDK but cooperate better with the existing
>>>> kernel networking stack.  An AF_XDP socket receives and sends packets
>>>> from an eBPF/XDP program attached to the netdev, bypassing a couple of
>>>> the Linux kernel's subsystems.  As a result, an AF_XDP socket shows much
>>>> better performance than AF_PACKET.  For more details about AF_XDP,
>>>> please see the Linux kernel's Documentation/networking/af_xdp.rst.
>>>> Note that by default, this feature is not compiled in.
>>>>
>>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>>> ---
>>>> v1->v2:
>>>> - add a list to maintain unused umem elements
>>>> - remove copy from rx umem to ovs internal buffer
>>>> - use hugetlb to reduce misses (not much difference)
>>>> - use pmd mode netdev in OVS (huge performance improve)
>>>> - remove malloc dp_packet, instead put dp_packet in umem
>>>>
>>>> v2->v3:
>>>> - rebase on the OVS master, 7ab4b0653784
>>>>   ("configure: Check for more specific function to pull in pthread
>>>> library.")
>>>> - remove the dependency on libbpf and dpif-bpf.
>>>>   instead, use the built-in XDP_ATTACH feature.
>>>> - data structure optimizations for better performance, see[1]
>>>> - more test cases support
>>>> v3:
>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
>>>>
>>>> v3->v4:
>>>> - Use AF_XDP API provided by libbpf
>>>> - Remove the dependency on XDP_ATTACH kernel patch set
>>>> - Add documentation, bpf.rst
>>>>
>>>> v4->v5:
>>>> - rebase to master
>>>> - remove rfc, squash all into a single patch
>>>> - add --enable-afxdp, so by default, AF_XDP is not compiled
>>>> - add options: xdpmode=drv,skb
>>>> - add multiple queue and multiple PMD support, with options: n_rxq
>>>> - improve documentation, rename bpf.rst to af_xdp.rst
>>>>
>>>> v5->v6
>>>> - rebase to master, commit 0cdd5b13de91b98
>>>> - address errors from sparse and clang
>>>> - pass travis-ci test
>>>> - address feedback from Ben
>>>> - fix issues reported by 0-day robot
>>>> - improved documentation
>>>>
>>>> v6-v7
>>>> - rebase to master, commit abf11558c1515bf3b1
>>>> - address feedbacks from Ilya, Ben, and Eelco, see:
>>>>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
>>>> - add XDP mode change, implement get/set_config, reconfigure
>>>> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>>>>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
>>>> - perf optimization for batching umem_push/pop
>>>> - perf optimization for batching kick_tx
>>>> - test build with dpdk
>>>> - fix/refactor atomic operation
>>>> - make AF_XDP x86 specific, otherwise fail at build time
>>>> - lots of code refactoring
>>>> - add PVP setup in documentation
>>>>
>>>> v7-v8:
>>>> - Address feedback from Ilya at:
>>>>   https://patchwork.ozlabs.org/patch/1095019/
>>>> - add netdev-linux-private.h
>>>> - fix afxdp reconfigure issue
>>>> - sort include headers
>>>> - remove unnecessary OVS_UNUSED
>>>> - coding style fixes
>>>> - error case handling and memory leak
>>>>
>>>> v8-v9:
>>>> - rebase to master 180bbbed3a3867d52
>>>> - Address review feedback from Ben, Ilya and Eelco, at:
>>>>   https://patchwork.ozlabs.org/patch/1097740/
>>>> - == From Ilya ==
>>>> - Optimize the reconfiguration logic
>>>> - Implement .rxq_recv and .send for afxdp
>>>> - Remove system-afxdp-traffic.at, reuse existing code
>>>> - Use Ilya's rdtsc code
>>>> - remove --disable-system
>>>> - == From Eelco ==
>>>> - Fix bug when remove br0,
>>>> util(revalidator49)|EMER|lib/poll-loop.c:111:
>>>>   assertion !fd != !wevent failed
>>>> - Fix bug and use default value from libbpf, ex:
>>>> XSK_RING_PROD__DEFAULT...
>>>> - Clear xdp program when receive signal, ctrl+c
>>>> - Add options to vswitch.xml, set xdpmode default to skb-mode
>>>> - No support for ARM and PPC, now x86_64 only
>>>> - remove redundant header includes and function/macro definitions
>>>> - remove some ifdef HAVE_AF_XDP
>>>> - == From others/both about afxdp rx and tx ==
>>>> - Several umem push/pop error handling improvement/fixes
>>>> - add lock to address concurrent_txq case
>>>> - improve error handling
>>>> - add stats
>>>> - Things that are not done yet
>>>> - MTU limitation
>>>> - n_txq_desc/n_rxq_desc option.
>>>>
>>>> v9-v10
>>>> - remove x86_64 limitation, suggested by Ben and Eelco
>>>> - add xmalloc_pagealign, free_pagealign
>>>> - minor refactor
>>>>
>>>> v10-v11
>>>> - address feedback from Ilya at
>>>>   https://patchwork.ozlabs.org/patch/1106495/
>>>> - fix typos, and some refactoring
>>>> - refactor existing code and introduce xmalloc pagealign
>>>> - fix a couple of error handling case
>>>> - allocate per-txq lock
>>>> - dynamic allocate xsk array
>>>> - fix cycle_counter_update() for non-x86/non-linux case
>>>> ---
>>>>  Documentation/automake.mk             |   1 +
>>>>  Documentation/index.rst               |   1 +
>>>>  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
>>>>  Documentation/intro/install/index.rst |   1 +
>>>>  acinclude.m4                          |  35 ++
>>>>  configure.ac                          |   1 +
>>>>  lib/automake.mk                       |  14 +
>>>>  lib/dp-packet.c                       |  28 ++
>>>>  lib/dp-packet.h                       |  18 +-
>>>>  lib/dpif-netdev-perf.h                |  26 +
>>>>  lib/netdev-afxdp.c                    | 891
>>>> ++++++++++++++++++++++++++++++++++
>>>>  lib/netdev-afxdp.h                    |  74 +++
>>>>  lib/netdev-linux-private.h            | 139 ++++++
>>>>  lib/netdev-linux.c                    | 121 ++---
>>>>  lib/netdev-provider.h                 |   3 +
>>>>  lib/netdev.c                          |  11 +
>>>>  lib/spinlock.h                        |  70 +++
>>>>  lib/util.c                            |  92 +++-
>>>>  lib/util.h                            |   5 +
>>>>  lib/xdpsock.c                         | 170 +++++++
>>>>  lib/xdpsock.h                         | 101 ++++
>>>>  tests/automake.mk                     |  16 +
>>>>  tests/system-afxdp-macros.at          |  20 +
>>>>  tests/system-afxdp-testsuite.at       |  26 +
>>>>  vswitchd/vswitch.xml                  |  15 +
>>>>  25 files changed, 2204 insertions(+), 108 deletions(-)
>>>>  create mode 100644 Documentation/intro/install/afxdp.rst
>>>>  create mode 100644 lib/netdev-afxdp.c
>>>>  create mode 100644 lib/netdev-afxdp.h
>>>>  create mode 100644 lib/netdev-linux-private.h
>>>>  create mode 100644 lib/spinlock.h
>>>>  create mode 100644 lib/xdpsock.c
>>>>  create mode 100644 lib/xdpsock.h
>>>>  create mode 100644 tests/system-afxdp-macros.at
>>>>  create mode 100644 tests/system-afxdp-testsuite.at
>>>>
>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>> index 082438e09a33..11cc59efc881 100644
>>>> --- a/Documentation/automake.mk
>>>> +++ b/Documentation/automake.mk
>>>> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>>>>       Documentation/intro/why-ovs.rst \
>>>>       Documentation/intro/install/index.rst \
>>>>       Documentation/intro/install/bash-completion.rst \
>>>> +     Documentation/intro/install/afxdp.rst \
>>>>       Documentation/intro/install/debian.rst \
>>>>       Documentation/intro/install/documentation.rst \
>>>>       Documentation/intro/install/distributions.rst \
>>>> diff --git a/Documentation/index.rst b/Documentation/index.rst
>>>> index 46261235c732..aa9e7c49f179 100644
>>>> --- a/Documentation/index.rst
>>>> +++ b/Documentation/index.rst
>>>> @@ -59,6 +59,7 @@ vSwitch? Start here.
>>>>    :doc:`intro/install/windows` |
>>>>    :doc:`intro/install/xenserver` |
>>>>    :doc:`intro/install/dpdk` |
>>>> +  :doc:`intro/install/afxdp` |
>>>>    :doc:`Installation FAQs <faq/releases>`
>>>>
>>>>  - **Tutorials:** :doc:`tutorials/faucet` |
>>>> diff --git a/Documentation/intro/install/afxdp.rst
>>>> b/Documentation/intro/install/afxdp.rst
>>>> new file mode 100644
>>>> index 000000000000..554964396353
>>>> --- /dev/null
>>>> +++ b/Documentation/intro/install/afxdp.rst
>>>> @@ -0,0 +1,433 @@
>>>> +..
>>>> +      Licensed under the Apache License, Version 2.0 (the 
>>>> "License");
>>>> you may
>>>> +      not use this file except in compliance with the License. You
>>>> may obtain
>>>> +      a copy of the License at
>>>> +
>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +      Unless required by applicable law or agreed to in writing,
>>>> software
>>>> +      distributed under the License is distributed on an "AS IS"
>>>> BASIS, WITHOUT
>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied. See the
>>>> +      License for the specific language governing permissions and
>>>> limitations
>>>> +      under the License.
>>>> +
>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>> +
>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>> +      -------  Heading 1
>>>> +      ~~~~~~~  Heading 2
>>>> +      +++++++  Heading 3
>>>> +      '''''''  Heading 4
>>>> +
>>>> +      Avoid deeper levels because they do not render well.
>>>> +
>>>> +
>>>> +========================
>>>> +Open vSwitch with AF_XDP
>>>> +========================
>>>> +
>>>> +This document describes how to build and install Open vSwitch 
>>>> using
>>>> +AF_XDP netdev.
>>>> +
>>>> +.. warning::
>>>> +  The AF_XDP support of Open vSwitch is considered 'experimental',
>>>> +  and it is not compiled in by default.
>>>> +
>>>> +
>>>> +Introduction
>>>> +------------
>>>> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket
>>>> +type built upon the eBPF and XDP technology.  It aims to have performance
>>>> +comparable to DPDK but cooperate better with the existing kernel
>>>> +networking stack.  An AF_XDP socket receives and sends packets from an
>>>> +eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
>>>> +kernel's subsystems.  As a result, an AF_XDP socket shows much better
>>>> +performance than AF_PACKET.  For more details about AF_XDP, please see
>>>> +the Linux kernel's Documentation/networking/af_xdp.rst.
>>>> +
>>>> +
>>>> +AF_XDP Netdev
>>>> +-------------
>>>> +OVS has a couple of netdev types, e.g., system, tap, or
>>>> +dpdk.  The AF_XDP feature adds a new netdev type called
>>>> +"afxdp", and implements its configuration, packet reception,
>>>> +and transmit functions.  Since the AF_XDP socket, called xsk,
>>>> +operates in userspace, once ovs-vswitchd receives packets
>>>> +from xsk, the afxdp netdev re-uses the existing userspace
>>>> +dpif-netdev datapath.  As a result, most of the packet processing
>>>> +happens in userspace instead of the Linux kernel.
>>>> +
>>>> +::
>>>> +
>>>> +              |   +-------------------+
>>>> +              |   |    ovs-vswitchd   |<-->ovsdb-server
>>>> +              |   +-------------------+
>>>> +              |   |      ofproto      |<-->OpenFlow controllers
>>>> +              |   +--------+-+--------+
>>>> +              |   | netdev | |ofproto-|
>>>> +    userspace |   +--------+ |  dpif  |
>>>> +              |   | afxdp  | +--------+
>>>> +              |   | netdev | |  dpif  |
>>>> +              |   +---||---+ +--------+
>>>> +              |       ||     |  dpif- |
>>>> +              |       ||     | netdev |
>>>> +              |_      ||     +--------+
>>>> +                      ||
>>>> +               _  +---||-----+--------+
>>>> +              |   | AF_XDP prog +     |
>>>> +       kernel |   |   xsk_map         |
>>>> +              |_  +--------||---------+
>>>> +                           ||
>>>> +                        physical
>>>> +                           NIC
>>>> +
>>>> +
>>>> +Build requirements
>>>> +------------------
>>>> +
>>>> +In addition to the requirements described in :doc:`general`, 
>>>> building
>>>> Open
>>>> +vSwitch with AF_XDP will require the following:
>>>> +
>>>> +- libbpf from kernel source tree (kernel 5.0.0 or later)
>>>> +
>>>> +- Linux kernel XDP support, with the following options (required)
>>>> +
>>>> +  * CONFIG_BPF=y
>>>> +
>>>> +  * CONFIG_BPF_SYSCALL=y
>>>> +
>>>> +  * CONFIG_XDP_SOCKETS=y
>>>> +
>>>> +
>>>> +- The following optional Kconfig options are also recommended, but
>>>> not
>>>> +  required:
>>>> +
>>>> +  * CONFIG_BPF_JIT=y (Performance)
>>>> +
>>>> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
>>>> +
>>>> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
>>>> +
>>>> +- Once your AF_XDP-enabled kernel is ready, if possible, run
>>>> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
>>>> +  This is an OVS-independent benchmark tool for AF_XDP.
>>>> +  It makes sure your basic kernel requirements are met for AF_XDP.
>>>> +
>>>> +
>>>> +Installing
>>>> +----------
>>>> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF
>>>> support.
>>>> +First, clone a recent version of Linux bpf-next tree::
>>>> +
>>>> +  git clone
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
>>>> +
>>>> +Second, go into the Linux source directory and build libbpf in the
>>>> tools
>>>> +directory::
>>>> +
>>>> +  cd bpf-next/
>>>> +  cd tools/lib/bpf/
>>>> +  make && make install
>>>> +  make install_headers
>>>> +
>>>> +.. note::
>>>> +   Make sure xsk.h and bpf.h are installed in system's library 
>>>> path,
>>>> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
>>>> +
>>>> +Make sure the libbpf.so is installed correctly::
>>>> +
>>>> +  ldconfig
>>>> +  ldconfig -p | grep libbpf
>>>> +
>>>> +Third, ensure the standard OVS requirements are installed and
>>>> +bootstrap/configure the package::
>>>> +
>>>> +  ./boot.sh && ./configure --enable-afxdp
>>>> +
>>>> +Finally, build and install OVS::
>>>> +
>>>> +  make && make install
>>>> +
>>>> +To kick start end-to-end autotesting::
>>>> +
>>>> +  uname -a # make sure having 5.0+ kernel
>>>> +  make check-afxdp TESTSUITEFLAGS='1'
>>>> +
>>>> +If a test case fails, check the log at::
>>>> +
>>>> +  cat 
>>>> tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
>>>> +
>>>> +
>>>> +Setup AF_XDP netdev
>>>> +-------------------
>>>> +Before running OVS with AF_XDP, make sure the libbpf and libelf 
>>>> are
>>>> +set-up right::
>>>> +
>>>> +  ldd vswitchd/ovs-vswitchd
>>>> +
>>>> +Open vSwitch should be started using userspace datapath as 
>>>> described
>>>> +in :doc:`general`::
>>>> +
>>>> +  ovs-vswitchd ...
>>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
>>>> +
>>>> +Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4)
>>>> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
>>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
>>>> +
>>>> +  ethtool -L enp2s0 combined 1
>>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>>> type="afxdp"
>>>> \
>>>> +    options:n_rxq=1 options:xdpmode=drv \
>>>> +    other_config:pmd-rxq-affinity="0:4"
>>>> +
>>>> +Or, use 4 pmds/cores and 4 queues by doing::
>>>> +
>>>> +  ethtool -L enp2s0 combined 4
>>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
>>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>>> type="afxdp"
>>>> \
>>>> +    options:n_rxq=4 options:xdpmode=drv \
>>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
>>>> +
>>>> +.. note::
>>>> +   pmd-rxq-affinity is optional. If not specified, system will
>>>> auto-assign.
>>>> +
>>>> +To validate that the bridge has been successfully instantiated, you can
>>>> +use::
>>>> +
>>>> +  ovs-vsctl show
>>>> +
>>>> +Should show something like::
>>>> +
>>>> +  Port "ens802f0"
>>>> +   Interface "ens802f0"
>>>> +      type: afxdp
>>>> +      options: {n_rxq="1", xdpmode=drv}
>>>> +
>>>> +Otherwise, enable debugging by::
>>>> +
>>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>>> +
>>>> +
>>>> +References
>>>> +----------
>>>> +Most of the design details are described in the paper presented at
>>>> +Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
>>>> +vSwitch"[1], section 4, and slides[2][4].
>>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
>>>> +to current and future work on AF_XDP.
>>>> +
>>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
>>>> +
>>>> +[2]
>>>> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
>>>> +
>>>> +[3]
>>>> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
>>>> +
>>>> +[4]
>>>> https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
>>>> +
>>>> +
>>>> +Performance Tuning
>>>> +------------------
>>>> +The name of the game is to keep your CPU running in userspace, allowing
>>>> +the PMD to keep polling the AF_XDP queues without any interference from
>>>> +the kernel.
>>>> +
>>>> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP,
>>>> +   cores running PMD threads, device plug-in slot).
>>>> +
>>>> +#. Isolate your CPUs by setting isolcpus in the GRUB configuration.
>>>> +
>>>> +#. IRQs should not be assigned to cores running PMD threads.
>>>> +
>>>> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
>>>> +
>>>> +
>>>> +Debugging performance issue
>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> +While running the traffic, use the Linux perf tool to see where your CPU
>>>> +spends its cycles::
>>>> +
>>>> +  cd bpf-next/tools/perf
>>>> +  make
>>>> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
>>>> +  ./perf report
>>>> +
>>>> +Measure your system call rate by doing::
>>>> +
>>>> +  pstree -p `pidof ovs-vswitchd`
>>>> +  strace -c -p <your pmd's PID>
>>>> +
>>>> +Or, use OVS pmd tool::
>>>> +
>>>> +  ovs-appctl dpif-netdev/pmd-stats-show
>>>> +
>>>> +
>>>> +Example Script
>>>> +--------------
>>>> +
>>>> +Below is a script using namespaces and veth peer::
>>>> +
>>>> +  #!/bin/bash
>>>> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
>>>> +    --disable-system --detach
>>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
>>>> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
>>>> +    fail-mode=secure datapath_type=netdev
>>>> +
>>>> +  ip netns add at_ns0
>>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>>> +
>>>> +  ip link add p0 type veth peer name afxdp-p0
>>>> +  ip link set p0 netns at_ns0
>>>> +  ip link set dev afxdp-p0 up
>>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>>> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
>>>> +
>>>> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
>>>> +  ip addr add "10.1.1.1/24" dev p0
>>>> +  ip link set dev p0 up
>>>> +  NS_EXEC_HEREDOC
>>>> +
>>>> +  ip netns add at_ns1
>>>> +  ip link add p1 type veth peer name afxdp-p1
>>>> +  ip link set p1 netns at_ns1
>>>> +  ip link set dev afxdp-p1 up
>>>> +
>>>> +  ovs-vsctl add-port br0 afxdp-p1 -- \
>>>> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
>>>> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
>>>> +  ip addr add "10.1.1.2/24" dev p1
>>>> +  ip link set dev p1 up
>>>> +  NS_EXEC_HEREDOC
>>>> +
>>>> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
>>>> +
>>>> +
>>>> +Limitations/Known Issues
>>>> +------------------------
>>>> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
>>>> +#. No QoS support because AF_XDP netdev by-passes the Linux TC layer. A
>>>> +   possible work-around is to use OpenFlow meter action.
>>>> +#. AF_XDP device added to bridge, removed, and added again will fail.
>>>> +#. Most of the tests are done using i40e single port. Multiple ports and
>>>> +   the ixgbe driver also need to be tested.
>>>> +#. No latency test result (TODO items)
>>>> +
>>>> +
>>>> +PVP using tap device
>>>> +--------------------
>>>> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
>>>> +First, start OVS, then add physical port::
>>>> +
>>>> +  ethtool -L enp2s0 combined 1
>>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
>>>> +    options:n_rxq=1 options:xdpmode=drv \
>>>> +    other_config:pmd-rxq-affinity="0:4"
>>>> +
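
A note for readers on the mask values used here: 0x10 has only bit 4 set, which matches the core named in the "0:4" (queue 0 to core 4) affinity string. A minimal sketch of how such a mask can be derived from a list of core IDs follows. The helper name cpu_mask_from_cores is hypothetical and not part of OVS; it is only meant to make the arithmetic concrete.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: build a CPU bitmask with one
 * bit set per listed core ID.  Core IDs of 64 or more do not fit in the
 * 64-bit mask and are ignored in this sketch. */
uint64_t
cpu_mask_from_cores(const unsigned int *cores, int n_cores)
{
    uint64_t mask = 0;

    for (int i = 0; i < n_cores; i++) {
        if (cores[i] < 64) {
            mask |= UINT64_C(1) << cores[i];
        }
    }
    return mask;
}
```

With the single core 4 this gives 0x10; with cores 0 through 11 it gives 0xfff, the value used later in the vhostuser example.
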
>>>> +Start a VM with virtio and tap device::
>>>> +
>>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
>>>> +    -m 4096 \
>>>> +    -cpu host,+x2apic -enable-kvm \
>>>> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
>>>> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
>>>> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
>>>> +    -object memory-backend-file,id=mem,size=4096M,\
>>>> +      mem-path=/dev/hugepages,share=on \
>>>> +    -numa node,memdev=mem -mem-prealloc -smp 2
>>>> +
>>>> +Create OpenFlow rules::
>>>> +
>>>> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
>>>> +  ovs-ofctl del-flows br0
>>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
>>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
>>>> +
>>>> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
>>>> +
>>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
>>>> +
>>>> +The performance number I got is around 1.6Mpps.
>>>> +This is due to using the kernel's tap interface, which requires copying
>>>> +packets into the kernel from the umem buffer in userspace.
>>>> +
>>>> +
>>>> +PVP using vhostuser device
>>>> +--------------------------
>>>> +First, build OVS with DPDK and AFXDP::
>>>> +
>>>> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
>>>> +  make -j4 && make install
>>>> +
>>>> +Create a vhost-user port from OVS::
>>>> +
>>>> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
>>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
>>>> +    other_config:pmd-cpu-mask=0xfff
>>>> +  ovs-vsctl add-port br0 vhost-user-1 \
>>>> +    -- set Interface vhost-user-1 type=dpdkvhostuser
>>>> +
>>>> +Start VM using vhost-user mode::
>>>> +
>>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
>>>> +   -m 4096 \
>>>> +   -cpu host,+x2apic -enable-kvm \
>>>> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
>>>> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
>>>> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
>>>> +      netdev=mynet1,mq=on,vectors=10 \
>>>> +   -object memory-backend-file,id=mem,size=4096M,\
>>>> +      mem-path=/dev/hugepages,share=on \
>>>> +   -numa node,memdev=mem -mem-prealloc -smp 2
>>>> +
>>>> +Set up the OpenFlow rules::
>>>> +
>>>> +  ovs-ofctl del-flows br0
>>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
>>>> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
>>>> +
>>>> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
>>>> +
>>>> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
>>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
>>>> +
>>>> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
>>>> +
>>>> +
>>>> +PCP container using veth
>>>> +------------------------
>>>> +Create namespace and veth peer devices::
>>>> +
>>>> +  ip netns add at_ns0
>>>> +  ip link add p0 type veth peer name afxdp-p0
>>>> +  ip link set p0 netns at_ns0
>>>> +  ip link set dev afxdp-p0 up
>>>> +  ip netns exec at_ns0 ip link set dev p0 up
>>>> +
>>>> +Attach the veth port to br0 (linux kernel mode)::
>>>> +
>>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>>> +    set interface afxdp-p0 options:n_rxq=1
>>>> +
>>>> +Or, use AF_XDP with skb mode::
>>>> +
>>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
>>>> +
>>>> +Setup the OpenFlow rules::
>>>> +
>>>> +  ovs-ofctl del-flows br0
>>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
>>>> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
>>>> +
>>>> +In the namespace, drop or bounce back the packets::
>>>> +
>>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
>>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
>>>> +
>>>> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
>>>> +
>>>> +
>>>> +Bug Reporting
>>>> +-------------
>>>> +
>>>> +Please report problems to dev@openvswitch.org.
>>>> diff --git a/Documentation/intro/install/index.rst
>>>> b/Documentation/intro/install/index.rst
>>>> index 3193c736cf17..c27a9c9d16ff 100644
>>>> --- a/Documentation/intro/install/index.rst
>>>> +++ b/Documentation/intro/install/index.rst
>>>> @@ -45,6 +45,7 @@ Installation from Source
>>>>     xenserver
>>>>     userspace
>>>>     dpdk
>>>> +   afxdp
>>>>
>>>>  Installation from Packages
>>>>  --------------------------
>>>> diff --git a/acinclude.m4 b/acinclude.m4
>>>> index cf9cc8b8b0de..721653ab0ec0 100644
>>>> --- a/acinclude.m4
>>>> +++ b/acinclude.m4
>>>> @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>>>>    ])
>>>>  ])
>>>>
>>>> +dnl OVS_CHECK_LINUX_AF_XDP
>>>> +dnl
>>>> +dnl Check both Linux kernel AF_XDP and libbpf support
>>>> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
>>>> +  AC_ARG_ENABLE([afxdp],
>>>> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP
>>>> support])],
>>>> +                [], [enable_afxdp=no])
>>>> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
>>>> +  if test "$enable_afxdp" != yes; then
>>>> +    AC_MSG_RESULT([no])
>>>> +    AF_XDP_ENABLE=false
>>>> +  else
>>>> +    AC_MSG_RESULT([yes])
>>>> +    AF_XDP_ENABLE=true
>>>> +
>>>> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
>>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
>>>> +
>>>> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
>>>> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
>>>> +
>>>> +    AC_CHECK_HEADER([bpf/xsk.h], [],
>>>> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
>>>> +
>>>> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
>>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
>>>> +
>>>> +    AC_DEFINE([HAVE_AF_XDP], [1],
>>>> +              [Define to 1 if AF_XDP support is available and
>>>> enabled.])
>>>> +    LIBBPF_LDADD=" -lbpf -lelf"
>>>> +    AC_SUBST([LIBBPF_LDADD])
>>>> +  fi
>>>> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
>>>> +])
>>>> +
>>>>  dnl OVS_CHECK_DPDK
>>>>  dnl
>>>>  dnl Configure DPDK source tree
>>>> diff --git a/configure.ac b/configure.ac
>>>> index 2dbe9a9178e3..9e23e1c6958c 100644
>>>> --- a/configure.ac
>>>> +++ b/configure.ac
>>>> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
>>>>  OVS_CHECK_DOT
>>>>  OVS_CHECK_IF_DL
>>>>  OVS_CHECK_STRTOK_R
>>>> +OVS_CHECK_LINUX_AF_XDP
>>>>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>>>>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct
>>>> stat.st_mtimensec],
>>>>    [], [], [[#include <sys/stat.h>]])
>>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>>> index cc5dccf39d6b..b31e28f6e1f5 100644
>>>> --- a/lib/automake.mk
>>>> +++ b/lib/automake.mk
>>>> @@ -14,6 +14,10 @@ if WIN32
>>>>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>>>>  endif
>>>>
>>>> +if HAVE_AF_XDP
>>>> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
>>>> +endif
>>>> +
>>>>  lib_libopenvswitch_la_LDFLAGS = \
>>>>          $(OVS_LTINFO) \
>>>>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym 
>>>> \
>>>> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
>>>>       lib/if-notifier.h \
>>>>       lib/netdev-linux.c \
>>>>       lib/netdev-linux.h \
>>>> +     lib/netdev-linux-private.h \
>>>>       lib/netdev-tc-offloads.c \
>>>>       lib/netdev-tc-offloads.h \
>>>>       lib/netlink-conntrack.c \
>>>> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
>>>>       lib/tc.h
>>>>  endif
>>>>
>>>> +if HAVE_AF_XDP
>>>> +lib_libopenvswitch_la_SOURCES += \
>>>> +     lib/xdpsock.c \
>>>> +     lib/xdpsock.h \
>>>> +     lib/netdev-afxdp.c \
>>>> +     lib/netdev-afxdp.h \
>>>> +     lib/spinlock.h
>>>> +endif
>>>> +
>>>>  if DPDK_NETDEV
>>>>  lib_libopenvswitch_la_SOURCES += \
>>>>       lib/dpdk.c \
>>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>>>> index 0976a35e758b..e6a7947076b4 100644
>>>> --- a/lib/dp-packet.c
>>>> +++ b/lib/dp-packet.c
>>>> @@ -19,6 +19,7 @@
>>>>  #include <string.h>
>>>>
>>>>  #include "dp-packet.h"
>>>> +#include "netdev-afxdp.h"
>>>>  #include "netdev-dpdk.h"
>>>>  #include "openvswitch/dynamic-string.h"
>>>>  #include "util.h"
>>>> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base,
>>>> size_t allocated)
>>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>>>>  }
>>>>
>>>> +#if HAVE_AF_XDP
>>>> +/* Initialize 'b' as an empty dp_packet that contains
>>>> + * memory starting at AF_XDP umem base.
>>>> + */
>>>> +void
>>>> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t
>>>> allocated)
>>>> +{
>>>> +    dp_packet_set_base(b, base);
>>>> +    dp_packet_set_data(b, base);
>>>> +    dp_packet_set_size(b, 0);
>>>> +
>>>> +    dp_packet_set_allocated(b, allocated);
>>>> +    b->source = DPBUF_AFXDP;
>>>> +    dp_packet_reset_offsets(b);
>>>> +    pkt_metadata_init(&b->md, 0);
>>>> +    dp_packet_reset_cutlen(b);
>>>> +    dp_packet_reset_offload(b);
>>>> +    b->packet_type = htonl(PT_ETH);
>>>> +}
>>>> +#endif
>>>> +
>>>>  /* Initializes 'b' as an empty dp_packet that contains the
>>>> 'allocated' bytes of
>>>>   * memory starting at 'base'.  'base' should point to a buffer on 
>>>> the
>>>> stack.
>>>>   * (Nothing actually relies on 'base' being allocated on the 
>>>> stack.
>>>> It could
>>>> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>>>>               * created as a dp_packet */
>>>>              free_dpdk_buf((struct dp_packet*) b);
>>>>  #endif
>>>> +        } else if (b->source == DPBUF_AFXDP) {
>>>> +            free_afxdp_buf(b);
>>>>          }
>>>>      }
>>>>  }
>>>> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t
>>>> new_headroom, size_t new_tailroom
>>>>      case DPBUF_STACK:
>>>>          OVS_NOT_REACHED();
>>>>
>>>> +    case DPBUF_AFXDP:
>>>> +        OVS_NOT_REACHED();
>>>> +
>>>>      case DPBUF_STUB:
>>>>          b->source = DPBUF_MALLOC;
>>>>          new_base = xmalloc(new_allocated);
>>>> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>>>>  {
>>>>      void *p;
>>>>      ovs_assert(b->source != DPBUF_DPDK);
>>>> +    ovs_assert(b->source != DPBUF_AFXDP);
>>>>
>>>>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) ==
>>>> dp_packet_base(b)) {
>>>>          p = dp_packet_data(b);
>>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>>> index a5e9ade1244a..e3438226e360 100644
>>>> --- a/lib/dp-packet.h
>>>> +++ b/lib/dp-packet.h
>>>> @@ -25,6 +25,7 @@
>>>>  #include <rte_mbuf.h>
>>>>  #endif
>>>>
>>>> +#include "netdev-afxdp.h"
>>>>  #include "netdev-dpdk.h"
>>>>  #include "openvswitch/list.h"
>>>>  #include "packets.h"
>>>> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>>>>      DPBUF_DPDK,                /* buffer data is from DPDK 
>>>> allocated
>>>> memory.
>>>>                                  * ref to dp_packet_init_dpdk() in
>>>> dp-packet.c.
>>>>                                  */
>>>> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>>>>  };
>>>>
>>>>  #define DP_PACKET_CONTEXT_SIZE 64
>>>> @@ -89,6 +91,13 @@ struct dp_packet {
>>>>      };
>>>>  };
>>>>
>>>> +#if HAVE_AF_XDP
>>>> +struct dp_packet_afxdp {
>>>> +    struct umem_pool *mpool;
>>>> +    struct dp_packet packet;
>>>> +};
>>>> +#endif
>>>> +
>>>>  static inline void *dp_packet_data(const struct dp_packet *);
>>>>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>>>>  static inline void *dp_packet_base(const struct dp_packet *);
>>>> @@ -122,7 +131,9 @@ static inline const void
>>>> *dp_packet_get_nd_payload(const struct dp_packet *);
>>>>  void dp_packet_use(struct dp_packet *, void *, size_t);
>>>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>>>>  void dp_packet_use_const(struct dp_packet *, const void *, 
>>>> size_t);
>>>> -
>>>> +#if HAVE_AF_XDP
>>>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
>>>> +#endif
>>>>  void dp_packet_init_dpdk(struct dp_packet *);
>>>>
>>>>  void dp_packet_init(struct dp_packet *, size_t);
>>>> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>>>>              return;
>>>>          }
>>>>
>>>> +        if (b->source == DPBUF_AFXDP) {
>>>> +            free_afxdp_buf(b);
>>>> +            return;
>>>> +        }
>>>> +
>>>>          dp_packet_uninit(b);
>>>>          free(b);
>>>>      }
>>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>>> index 859c05613ddf..6b6dfda7db1c 100644
>>>> --- a/lib/dpif-netdev-perf.h
>>>> +++ b/lib/dpif-netdev-perf.h
>>>> @@ -21,6 +21,7 @@
>>>>  #include <stddef.h>
>>>>  #include <stdint.h>
>>>>  #include <string.h>
>>>> +#include <time.h>
>>>>  #include <math.h>
>>>>
>>>>  #ifdef DPDK_NETDEV
>>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>>>>      char *log_reason;
>>>>  };
>>>>
>>>> +#ifdef __linux__
>>>> +static inline uint64_t
>>>> +rdtsc_syscall(struct pmd_perf_stats *s)
>>>> +{
>>>> +    struct timespec val;
>>>> +    uint64_t v;
>>>> +
>>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
>>>> +       return s->last_tsc;
>>>> +    }
>>>> +
>>>> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
>>>> +    v += (uint64_t) val.tv_nsec;
>>>> +
>>>> +    return s->last_tsc = v;
>>>> +}
>>>> +#endif
>>>> +
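
One observation on the fallback above: it yields a nanosecond count from the monotonic clock rather than TSC cycles, so the name may surprise readers. A minimal sketch of the same conversion, under the assumption that a plain CLOCK_MONOTONIC read is acceptable (the patch uses CLOCK_MONOTONIC_RAW), is below. The function names timespec_to_ns and monotonic_ns_or are hypothetical and only illustrate the arithmetic and the "return the previous value on failure" behaviour.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/* Hypothetical helper: fold a timespec into one 64-bit nanosecond count. */
uint64_t
timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t) ts->tv_sec * UINT64_C(1000000000)
           + (uint64_t) ts->tv_nsec;
}

/* Hypothetical helper: read the monotonic clock in nanoseconds.  If the
 * clock cannot be read, hand back 'fallback' (for example the last value
 * the caller observed) instead of reporting an error. */
uint64_t
monotonic_ns_or(uint64_t fallback)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return fallback;
    }
    return timespec_to_ns(&ts);
}
```

Two successive reads should never go backwards, which is the property the cycle accounting relies on.
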
>>>>  /* Support for accurate timing of PMD execution on TSC clock cycle
>>>> level.
>>>>   * These functions are intended to be invoked in the context of 
>>>> pmd
>>>> threads. */
>>>>
>>>> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats 
>>>> *s)
>>>>  {
>>>>  #ifdef DPDK_NETDEV
>>>>      return s->last_tsc = rte_get_tsc_cycles();
>>>> +#elif !defined(_MSC_VER) && defined(__x86_64__)
>>>> +    uint32_t h, l;
>>>> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
>>>> +
>>>> +    return s->last_tsc = ((uint64_t) h << 32) | l;
>>>> +#elif defined(__linux__)
>>>> +    return rdtsc_syscall(s);
>>>>  #else
>>>>      return s->last_tsc = 0;
>>>>  #endif
>>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>>> new file mode 100644
>>>> index 000000000000..a6543e8f5126
>>>> --- /dev/null
>>>> +++ b/lib/netdev-afxdp.c
>>>> @@ -0,0 +1,891 @@
>>>> +/*
>>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the 
>>>> License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing,
>>>> software
>>>> + * distributed under the License is distributed on an "AS IS" 
>>>> BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied.
>>>> + * See the License for the specific language governing permissions
>>>> and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#include "netdev-linux-private.h"
>>>> +#include "netdev-linux.h"
>>>> +#include "netdev-afxdp.h"
>>>> +
>>>> +#include <errno.h>
>>>> +#include <inttypes.h>
>>>> +#include <linux/rtnetlink.h>
>>>> +#include <linux/if_xdp.h>
>>>> +#include <net/if.h>
>>>> +#include <stdlib.h>
>>>> +#include <sys/resource.h>
>>>> +#include <sys/socket.h>
>>>> +#include <sys/types.h>
>>>> +#include <unistd.h>
>>>> +
>>>> +#include "dp-packet.h"
>>>> +#include "dpif-netdev.h"
>>>> +#include "openvswitch/dynamic-string.h"
>>>> +#include "openvswitch/vlog.h"
>>>> +#include "packets.h"
>>>> +#include "socket-util.h"
>>>> +#include "spinlock.h"
>>>> +#include "util.h"
>>>> +#include "xdpsock.h"
>>>> +
>>>> +#ifndef SOL_XDP
>>>> +#define SOL_XDP 283
>>>> +#endif
>>>> +
>>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
>>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>> +
>>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
>>>> *)base))
>>>> +#define UMEM2XPKT(base, i) \
>>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char 
>>>> *)base
>>>> + \
>>>> +                               i * sizeof(struct dp_packet_afxdp))
>>>> +
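
For anyone reading along, the two macros above are plain pointer arithmetic: a descriptor address is the byte offset of a frame from the umem base, and the metadata entry for a frame is found by index. A small sketch of the offset round trip is below. The frame size of 2048 and all function names are assumptions made for this illustration; the real FRAME_SIZE is defined elsewhere in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed frame size for this sketch only; not taken from the patch. */
#define SKETCH_FRAME_SIZE 2048

/* Byte offset of 'elem' from the start of the umem region.  This is the
 * value that would be written into a ring descriptor. */
uint64_t
umem_offset(const void *elem, const void *base)
{
    return (uint64_t) ((const char *) elem - (const char *) base);
}

/* Inverse mapping: recover the frame address from a descriptor offset. */
void *
umem_addr(void *base, uint64_t offset)
{
    return (char *) base + offset;
}

/* Index of the frame that contains 'offset'; usable to look up per-frame
 * metadata stored in a parallel array. */
uint64_t
umem_frame_index(uint64_t offset)
{
    return offset / SKETCH_FRAME_SIZE;
}
```
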
>>>> +static uint32_t prog_id;
>>>> +static struct xsk_socket_info *xsk_configure(int ifindex, int
>>>> xdp_queue_id,
>>>> +                                             int mode);
>>>> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
>>>> +static void xsk_destroy(struct xsk_socket_info *xsk);
>>>> +static int xsk_configure_all(struct netdev *netdev);
>>>> +static void xsk_destroy_all(struct netdev *netdev);
>>>> +
>>>> +static struct xsk_umem_info *
>>>> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
>>>> +{
>>>> +    struct xsk_umem_config uconfig OVS_UNUSED;
>>>> +    struct xsk_umem_info *umem;
>>>> +    int ret;
>>>> +    int i;
>>>> +
>>>> +    umem = xcalloc(1, sizeof *umem);
>>>> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq,
>>>> &umem->cq,
>>>> +                           NULL);
>>>> +    if (ret) {
>>>> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
>>>> +                 ovs_strerror(errno),
>>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
>>>> +        free(umem);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    umem->buffer = buffer;
>>>> +
>>>> +    /* set-up umem pool */
>>>> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
>>>> +        VLOG_ERR("umem_pool_init failed");
>>>> +        if (xsk_umem__delete(umem->umem)) {
>>>> +            VLOG_ERR("xsk_umem__delete failed");
>>>> +        }
>>>> +        free(umem);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
>>>> +        struct umem_elem *elem;
>>>> +
>>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>>> +                            (char *)umem->buffer + i * 
>>>> FRAME_SIZE);
>>>> +        umem_elem_push(&umem->mpool, elem);
>>>> +    }
>>>> +
>>>> +    /* set-up metadata */
>>>> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
>>>> +        VLOG_ERR("xpacket_pool_init failed");
>>>> +        umem_pool_cleanup(&umem->mpool);
>>>> +        if (xsk_umem__delete(umem->umem)) {
>>>> +            VLOG_ERR("xsk_umem__delete failed");
>>>> +        }
>>>> +        free(umem);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
>>>> +              umem->xpool.array,
>>>> +              (char *)umem->xpool.array +
>>>> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
>>>> +
>>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
>>>> +        struct dp_packet_afxdp *xpacket;
>>>> +        struct dp_packet *packet;
>>>> +
>>>> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
>>>> +        xpacket->mpool = &umem->mpool;
>>>> +
>>>> +        packet = &xpacket->packet;
>>>> +        packet->source = DPBUF_AFXDP;
>>>> +    }
>>>> +
>>>> +    return umem;
>>>> +}
>>>> +
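
The descending loop in xsk_configure_umem() pushes frames from the highest index down, so a LIFO pool hands out frame 0 first. A minimal sketch of that behaviour is below, assuming a simple array-backed stack. The struct frame_stack type and its functions are hypothetical stand-ins, not the umem_pool implementation from this patch, and the sketch omits any locking.

```c
#include <stddef.h>

/* Hypothetical fixed-capacity LIFO of free frame pointers. */
struct frame_stack {
    void **slots;      /* caller-provided storage */
    size_t capacity;   /* number of slots available */
    size_t count;      /* number of slots currently in use */
};

/* Returns 0 on success, -1 if the stack is full. */
int
frame_stack_push(struct frame_stack *s, void *frame)
{
    if (s->count == s->capacity) {
        return -1;
    }
    s->slots[s->count++] = frame;
    return 0;
}

/* Returns the most recently pushed frame, or NULL if the stack is empty. */
void *
frame_stack_pop(struct frame_stack *s)
{
    return s->count ? s->slots[--s->count] : NULL;
}

/* Push frames n_frames-1 .. 0 so that frame 0 is the first one popped. */
int
frame_stack_fill(struct frame_stack *s, char *base, size_t n_frames,
                 size_t frame_size)
{
    for (size_t i = n_frames; i > 0; i--) {
        if (frame_stack_push(s, base + (i - 1) * frame_size) < 0) {
            return -1;
        }
    }
    return 0;
}
```
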
>>>> +static struct xsk_socket_info *
>>>> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
>>>> +                     uint32_t queue_id, int xdpmode)
>>>> +{
>>>> +    struct xsk_socket_config cfg;
>>>> +    struct xsk_socket_info *xsk;
>>>> +    char devname[IF_NAMESIZE];
>>>> +    uint32_t idx = 0;
>>>> +    int ret;
>>>> +    int i;
>>>> +
>>>> +    xsk = xcalloc(1, sizeof(*xsk));
>>>> +    xsk->umem = umem;
>>>> +    cfg.rx_size = CONS_NUM_DESCS;
>>>> +    cfg.tx_size = PROD_NUM_DESCS;
>>>> +    cfg.libbpf_flags = 0;
>>>> +
>>>> +    if (xdpmode == XDP_ZEROCOPY) {
>>>> +        cfg.bind_flags = XDP_ZEROCOPY;
>>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
>>>> XDP_FLAGS_DRV_MODE;
>>>> +    } else {
>>>> +        cfg.bind_flags = XDP_COPY;
>>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
>>>> XDP_FLAGS_SKB_MODE;
>>>> +    }
>>>> +
>>>> +    if (if_indextoname(ifindex, devname) == NULL) {
>>>> +        VLOG_ERR("ifindex %"PRIu32" to devname failed (%s)",
>>>> +                 ifindex, ovs_strerror(errno));
>>>> +        free(xsk);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id,
>>>> umem->umem,
>>>> +                             &xsk->rx, &xsk->tx, &cfg);
>>>> +    if (ret) {
>>>> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %"PRIu32,
>>>> +                 ovs_strerror(errno),
>>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
>>>> +                 queue_id);
>>>> +        free(xsk);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    /* Make sure the built-in AF_XDP program is loaded */
>>>> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
>>>> +    if (ret) {
>>>> +        VLOG_ERR("Get XDP prog ID failed (%s)", 
>>>> ovs_strerror(errno));
>>>> +        xsk_socket__delete(xsk->xsk);
>>>> +        free(xsk);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
>>>> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
>>>> +                                   PROD_NUM_DESCS - BATCH_SIZE,
>>>> &idx)) {
>>>> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL
>>>> queue");
>>>> +    }
>>>> +
>>>> +    for (i = 0;
>>>> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
>>>> +         i += FRAME_SIZE) {
>>>> +        struct umem_elem *elem;
>>>> +        uint64_t addr;
>>>> +
>>>> +        elem = umem_elem_pop(&xsk->umem->mpool);
>>>> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
>>>> +
>>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
>>>> +    }
>>>> +
>>>> +    xsk_ring_prod__submit(&xsk->umem->fq,
>>>> +                          PROD_NUM_DESCS - BATCH_SIZE);
>>>> +    return xsk;
>>>> +}
>>>> +
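
A remark on the retry loop around xsk_ring_prod__reserve(): the reservation is all-or-nothing, returning either the requested count or 0, so the loop only ends once enough free slots exist. A minimal model of that semantic is sketched below. The struct prod_ring type and its functions are hypothetical simplifications and not the libbpf ring; the point is only to show why a request larger than the ring can never succeed, which may make a bounded retry worth considering.

```c
#include <stdint.h>

/* Hypothetical model of a single-producer ring.  Counters only ever grow;
 * unsigned wrap-around keeps the subtraction below correct. */
struct prod_ring {
    uint32_t size;      /* total number of slots */
    uint32_t producer;  /* slots reserved by the producer so far */
    uint32_t consumer;  /* slots released by the consumer so far */
};

/* Number of slots the producer may still reserve. */
uint32_t
prod_ring_free(const struct prod_ring *r)
{
    return r->size - (r->producer - r->consumer);
}

/* All-or-nothing reservation: on success returns 'n' and stores the first
 * reserved index in '*idx'; returns 0 and leaves the ring untouched when
 * fewer than 'n' slots are free. */
uint32_t
prod_ring_reserve(struct prod_ring *r, uint32_t n, uint32_t *idx)
{
    if (prod_ring_free(r) < n) {
        return 0;
    }
    *idx = r->producer;
    r->producer += n;
    return n;
}
```
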
>>>> +static struct xsk_socket_info *
>>>> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
>>>> +{
>>>> +    struct xsk_socket_info *xsk;
>>>> +    struct xsk_umem_info *umem;
>>>> +    void *bufs;
>>>> +
>>>> +    /* umem memory region */
>>>> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
>>>> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
>>>> +
>>>> +    /* create AF_XDP socket */
>>>> +    umem = xsk_configure_umem(bufs,
>>>> +                              NUM_FRAMES * FRAME_SIZE,
>>>> +                              xdpmode);
>>>> +    if (!umem) {
>>>> +        free_pagealign(bufs);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, 
>>>> xdpmode);
>>>> +    if (!xsk) {
>>>> +        /* clean up umem and xpacket pool */
>>>> +        if (xsk_umem__delete(umem->umem)) {
>>>> +            VLOG_ERR("xsk_umem__delete failed");
>>>> +        }
>>>> +        free_pagealign(bufs);
>>>> +        umem_pool_cleanup(&umem->mpool);
>>>> +        xpacket_pool_cleanup(&umem->xpool);
>>>> +        free(umem);
>>>> +    }
>>>> +    return xsk;
>>>> +}
>>>> +
>>>> +static int
>>>> +xsk_configure_all(struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    struct xsk_socket_info *xsk;
>>>> +    int i, ifindex, n_rxq;
>>>> +
>>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>>> +
>>>> +    n_rxq = netdev_n_rxq(netdev);
>>>> +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
>>>> +
>>>> +    /* configure each queue */
>>>> +    for (i = 0; i < n_rxq; i++) {
>>>> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
>>>> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
>>>> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
>>>> +        if (!xsk) {
>>>> +            VLOG_ERR("failed to create AF_XDP socket on queue %d",
>>>> i);
>>>> +            dev->xsks[i] = NULL;
>>>> +            goto err;
>>>> +        }
>>>> +        dev->xsks[i] = xsk;
>>>> +        xsk->rx_dropped = 0;
>>>> +        xsk->tx_dropped = 0;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +
>>>> +err:
>>>> +    xsk_destroy_all(netdev);
>>>> +    return EINVAL;
>>>> +}
>>>> +
>>>> +static void
>>>> +xsk_destroy(struct xsk_socket_info *xsk)
>>>> +{
>>>> +    struct xsk_umem *umem;
>>>> +
>>>> +    umem = xsk->umem->umem;
>>>> +    xsk_socket__delete(xsk->xsk);
>>>> +    if (xsk_umem__delete(umem)) {
>>>> +        VLOG_ERR("xsk_umem__delete failed");
>>>> +    }
>>>> +
>>>> +    /* free the packet buffer */
>>>> +    free_pagealign(xsk->umem->buffer);
>>>> +
>>>> +    /* cleanup umem pool */
>>>> +    umem_pool_cleanup(&xsk->umem->mpool);
>>>> +
>>>> +    /* cleanup metadata pool */
>>>> +    xpacket_pool_cleanup(&xsk->umem->xpool);
>>>> +
>>>> +    free(xsk->umem);
>>>> +    free(xsk);
>>>> +}
>>>> +
>>>> +static void
>>>> +xsk_destroy_all(struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    int i, ifindex;
>>>> +
>>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>>> +
>>>> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
>>>> +        if (dev->xsks && dev->xsks[i]) {
>>>> +            VLOG_INFO("destroy xsk[%d]", i);
>>>> +            xsk_destroy(dev->xsks[i]);
>>>> +            dev->xsks[i] = NULL;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    VLOG_INFO("remove xdp program");
>>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
>>>> +
>>>> +    free(dev->xsks);
>>>> +}
>>>> +
>>>> +static inline void OVS_UNUSED
>>>> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
>>>> +    struct xdp_statistics stat;
>>>> +    socklen_t optlen;
>>>> +
>>>> +    optlen = sizeof stat;
>>>> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP,
>>>> XDP_STATISTICS,
>>>> +               &stat, &optlen) == 0);
>>>> +
>>>> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid
>>>> %llu",
>>>> +                stat.rx_dropped,
>>>> +                stat.rx_invalid_descs,
>>>> +                stat.tx_invalid_descs);
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap
>>>> *args,
>>>> +                        char **errp OVS_UNUSED)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    const char *str_xdpmode;
>>>> +    int xdpmode, new_n_rxq;
>>>> +
>>>> +    ovs_mutex_lock(&dev->mutex);
>>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>>> +    if (new_n_rxq > MAX_XSKQ) {
>>>> +        ovs_mutex_unlock(&dev->mutex);
>>>> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
>>>> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
>>>> +        return EINVAL;
>>>> +    }
>>>> +
>>>> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
>>>> +    if (!strcasecmp(str_xdpmode, "drv")) {
>>>> +        xdpmode = XDP_ZEROCOPY;
>>>> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
>>>> +        xdpmode = XDP_COPY;
>>>> +    } else {
>>>> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
>>>> +                 netdev_get_name(netdev), str_xdpmode);
>>>> +        ovs_mutex_unlock(&dev->mutex);
>>>> +        return EINVAL;
>>>> +    }
>>>> +
>>>> +    if (dev->requested_n_rxq != new_n_rxq
>>>> +        || dev->requested_xdpmode != xdpmode) {
>>>> +        dev->requested_n_rxq = new_n_rxq;
>>>> +        dev->requested_xdpmode = xdpmode;
>>>> +        netdev_request_reconfigure(netdev);
>>>> +    }
>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>> +    return 0;
>>>> +}
>>>> +
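
The validation in netdev_afxdp_set_config() reduces to two small rules: xdpmode is matched case-insensitively with "skb" as the default, and n_rxq is raised to at least 1 and rejected above a maximum. A standalone sketch of those rules follows. The enum and function names are hypothetical and exist only for this illustration; the patch itself maps the strings to XDP_COPY and XDP_ZEROCOPY.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical result type for this sketch. */
enum sketch_xdpmode {
    SKETCH_XDPMODE_SKB,
    SKETCH_XDPMODE_DRV,
    SKETCH_XDPMODE_INVALID
};

/* Case-insensitive match of the option value.  A missing value (NULL)
 * falls back to SKB mode, mirroring the "skb" default in the patch. */
enum sketch_xdpmode
sketch_parse_xdpmode(const char *value)
{
    if (value == NULL || !strcasecmp(value, "skb")) {
        return SKETCH_XDPMODE_SKB;
    }
    if (!strcasecmp(value, "drv")) {
        return SKETCH_XDPMODE_DRV;
    }
    return SKETCH_XDPMODE_INVALID;
}

/* Raise requests below 1 to 1; return -1 when the result exceeds
 * 'max_queues', otherwise return the queue count to use. */
int
sketch_check_n_rxq(int requested, int max_queues)
{
    int n = requested < 1 ? 1 : requested;

    return n > max_queues ? -1 : n;
}
```
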
>>>> +int
>>>> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap
>>>> *args)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +
>>>> +    ovs_mutex_lock(&dev->mutex);
>>>> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
>>>> +    smap_add_format(args, "xdpmode", "%s",
>>>> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void
>>>> +netdev_afxdp_alloc_txq(struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    int n_txqs = netdev_n_rxq(netdev);
>>>> +    int i;
>>>> +
>>>> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
>>>> +
>>>> +    for (i = 0; i < n_txqs; i++) {
>>>> +        ovs_spinlock_init(&dev->tx_locks[i]);
>>>> +    }
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_reconfigure(struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>>>> +    int err = 0;
>>>> +
>>>> +    ovs_mutex_lock(&dev->mutex);
>>>> +
>>>> +    if (netdev->n_rxq == dev->requested_n_rxq
>>>> +        && dev->xdpmode == dev->requested_xdpmode) {
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    xsk_destroy_all(netdev);
>>>> +    free(dev->tx_locks);
>>>> +
>>>> +    netdev->n_rxq = dev->requested_n_rxq;
>>>> +    netdev_afxdp_alloc_txq(netdev);
>>>> +
>>>> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
>>>> +        VLOG_INFO("AF_XDP device %s in DRV mode",
>>>> netdev_get_name(netdev));
>>>> +        /* From SKB mode to DRV mode */
>>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
>>>> XDP_FLAGS_DRV_MODE;
>>>> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
>>>> +        dev->xdpmode = XDP_ZEROCOPY;
>>>> +
>>>> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
>>>> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
>>>> +                      ovs_strerror(errno));
>>>> +        }
>>>> +    } else {
>>>> +        VLOG_INFO("AF_XDP device %s in SKB mode",
>>>> netdev_get_name(netdev));
>>>> +        /* From DRV mode to SKB mode */
>>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST |
>>>> XDP_FLAGS_SKB_MODE;
>>>> +        dev->xdp_bind_flags = XDP_COPY;
>>>> +        dev->xdpmode = XDP_COPY;
>>>> +        /* TODO: set rlimit back to previous value
>>>> +         * when no device is in DRV mode.
>>>> +         */
>>>> +    }
>>>> +
>>>> +    err = xsk_configure_all(netdev);
>>>> +    if (err) {
>>>> +        VLOG_ERR("AF_XDP device %s reconfig fails",
>>>> netdev_get_name(netdev));
>>>> +    }
>>>> +    netdev_change_seq_changed(netdev);
>>>> +out:
>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>> +    return err;
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
>>>> +{
>>>> +    /* FIXME: Get netdev's PCIe device ID, then find
>>>> +     * its NUMA node id.
>>>> +     */
>>>> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
>>>> +              netdev_get_name(netdev));
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void
>>>> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
>>>> +{
>>>> +    uint32_t curr_prog_id = 0;
>>>> +    uint32_t flags;
>>>> +
>>>> +    /* remove_xdp_program() */
>>>> +    if (xdpmode == XDP_COPY) {
>>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
>>>> +    } else {
>>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
>>>> +    }
>>>> +
>>>> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
>>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
>>>> +    }
>>>> +    if (prog_id == curr_prog_id) {
>>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
>>>> +    } else if (!curr_prog_id) {
>>>> +        VLOG_INFO("couldn't find a prog id on a given interface");
>>>> +    } else {
>>>> +        VLOG_INFO("program on interface changed, not removing");
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +signal_remove_xdp(struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    int ifindex;
>>>> +
>>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>>> +
>>>> +    VLOG_WARN("force remove xdp program");
>>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
>>>> +}
>>>> +
>>>> +static struct dp_packet_afxdp *
>>>> +dp_packet_cast_afxdp(const struct dp_packet *d)
>>>> +{
>>>> +    ovs_assert(d->source == DPBUF_AFXDP);
>>>> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
>>>> +}
>>>> +
>>>> +void
>>>> +free_afxdp_buf(struct dp_packet *p)
>>>> +{
>>>> +    struct dp_packet_afxdp *xpacket;
>>>> +    uintptr_t addr;
>>>> +
>>>> +    xpacket = dp_packet_cast_afxdp(p);
>>>> +    if (xpacket->mpool) {
>>>> +        void *base = dp_packet_base(p);
>>>> +
>>>> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>>> +        umem_elem_push(xpacket->mpool, (void *)addr);
>>>> +    }
>>>> +}
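
For readers following the address arithmetic in free_afxdp_buf(): a minimal sketch of how masking with the frame mask recovers the start of a umem frame, and how shifting yields the frame index. The SKETCH_* names and the 2048-byte frame size are assumptions for illustration; the patch defines its own FRAME_SHIFT and FRAME_SHIFT_MASK elsewhere, and the values may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only (2 KB frames). */
#define SKETCH_FRAME_SHIFT 11
#define SKETCH_FRAME_SIZE  (1u << SKETCH_FRAME_SHIFT)
#define SKETCH_FRAME_MASK  ((uintptr_t) SKETCH_FRAME_SIZE - 1)

/* Round an address inside a frame down to the start of that frame. */
static inline uintptr_t
sketch_frame_base(uintptr_t addr)
{
    return addr & ~SKETCH_FRAME_MASK;
}

/* Convert a byte offset into the umem into a frame index. */
static inline uint64_t
sketch_frame_index(uint64_t umem_offset)
{
    return umem_offset >> SKETCH_FRAME_SHIFT;
}
```

Note that masking a raw pointer this way only lands on a frame boundary if the umem buffer itself is aligned to the frame size.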
>>>> +
>>>> +static void
>>>> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
>>>> +{
>>>> +    struct dp_packet_afxdp *xpacket = NULL;
>>>> +    struct dp_packet *packet;
>>>> +    void *elems[BATCH_SIZE];
>>>> +    uintptr_t addr;
>>>> +
>>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>>> +        xpacket = dp_packet_cast_afxdp(packet);
>>>> +        if (xpacket->mpool) {
>>>> +            void *base = dp_packet_base(packet);
>>>> +
>>>> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>>> +            elems[i] = (void *)addr;
>>>> +        }
>>>> +    }
>>>> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
>>>> +    dp_packet_batch_init(batch);
>>>> +}
>>>> +
>>>> +static inline void
>>>> +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
>>>> +{
>>>> +    void *elems[BATCH_SIZE];
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < rcvd; i++) {
>>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>>> +
>>>> +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
>>>> +    }
>>>> +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>>> +
>>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>>> +    xsk->rx_dropped += rcvd;
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>> +                      int *qfill)
>>>> +{
>>>> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>>> +    struct netdev *netdev = rx->up.netdev;
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    struct umem_elem *elems[BATCH_SIZE];
>>>> +    uint32_t idx_rx = 0, idx_fq = 0;
>>>> +    struct xsk_socket_info *xsk;
>>>> +    int qid = rxq_->queue_id;
>>>> +    unsigned int rcvd, i;
>>>> +    int ret = 0;
>>>> +
>>>> +    xsk = dev->xsks[qid];
>>>> +    if (!xsk) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    rx->fd = xsk_socket__fd(xsk->xsk);
>>>> +
>>>> +    /* See if there is any packet on RX queue,
>>>> +     * if yes, idx_rx is the index having the packet.
>>>> +     */
>>>> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
>>>> +    if (!rcvd) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>>> +    if (OVS_UNLIKELY(ret)) {
>>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
>>>> +        return ENOMEM;
>>>> +    }
>>>> +
>>>> +    /* Prepare for the FILL queue */
>>>> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
>>>> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
>>>> +         * to move received packets from FILL queue to RX queue.
>>>> +         */
>>>> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
>>>> +        return ENOMEM;
>>>> +    }
>>>> +
>>>> +    /* Setup a dp_packet batch from descriptors in RX queue */
>>>> +    for (i = 0; i < rcvd; i++) {
>>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
>>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>>> +        uint64_t index;
>>>> +
>>>> +        struct dp_packet_afxdp *xpacket;
>>>> +        struct dp_packet *packet;
>>>> +
>>>> +        index = addr >> FRAME_SHIFT;
>>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
>>>> +        packet = &xpacket->packet;
>>>> +
>>>> +        /* Initialize the struct dp_packet */
>>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>>>> +        dp_packet_set_size(packet, len);
>>>> +
>>>> +        /* Add packet into batch, increase batch->count */
>>>> +        dp_packet_batch_add(batch, packet);
>>>> +
>>>> +        idx_rx++;
>>>> +    }
>>>> +    /* Release the RX queue */
>>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>>> +
>>>> +    for (i = 0; i < rcvd; i++) {
>>>> +        uint64_t index;
>>>> +        struct umem_elem *elem;
>>>> +
>>>> +        /* Get one free umem, program it into FILL queue */
>>>> +        elem = elems[i];
>>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
>>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
>>>> +
>>>> +        idx_fq++;
>>>> +    }
>>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>>>> +
>>>> +    if (qfill) {
>>>> +        /* TODO: return the number of remaining packets in the queue. */
>>>> +        *qfill = 0;
>>>> +    }
>>>> +
>>>> +#ifdef AFXDP_DEBUG
>>>> +    log_xsk_stat(xsk);
>>>> +#endif
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static inline int
>>>> +kick_tx(struct xsk_socket_info *xsk)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    /* This causes system call into kernel's xsk_sendmsg, and
>>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
>>>> +     */
>>>> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
>>>> +    if (OVS_UNLIKELY(ret < 0)) {
>>>> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
>>>> +            return errno;
>>>> +        }
>>>> +    }
>>>> +    /* no error, or EBUSY or EAGAIN */
>>>> +    return 0;
>>>> +}
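
The error handling in kick_tx() boils down to a small classification: a few errno values are reported to the caller, everything else (including EBUSY and EAGAIN) is treated as "try again later". A minimal sketch of just that decision, pulled out of the sendto() call so it can be read in isolation; sketch_kick_result() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <errno.h>

/* 'ret' is the return value of the wakeup syscall and 'err' the errno
 * observed right after it.  Returns a positive errno for the failures
 * that kick_tx() reports, and 0 for success or transient conditions. */
static inline int
sketch_kick_result(int ret, int err)
{
    if (ret < 0 && (err == ENXIO || err == ENOBUFS || err == EOPNOTSUPP)) {
        return err;
    }
    return 0;
}
```

Keeping the transient cases at 0 means the send path only logs for conditions that will not clear up by themselves.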
>>>> +
>>>> +static inline bool
>>>> +check_free_batch(struct dp_packet_batch *batch)
>>>> +{
>>>> +    struct umem_pool *first_mpool = NULL;
>>>> +    struct dp_packet_afxdp *xpacket;
>>>> +    struct dp_packet *packet;
>>>> +
>>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>>> +        if (packet->source != DPBUF_AFXDP) {
>>>> +            return false;
>>>> +        }
>>>> +        xpacket = dp_packet_cast_afxdp(packet);
>>>> +        if (i == 0) {
>>>> +            first_mpool = xpacket->mpool;
>>>> +            continue;
>>>> +        }
>>>> +        if (xpacket->mpool != first_mpool) {
>>>> +            return false;
>>>> +        }
>>>> +    }
>>>> +    /* All packets are DPBUF_AFXDP and from the same mpool */
>>>> +    return true;
>>>> +}
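
The condition check_free_batch() tests can be stated compactly: every packet must be an AF_XDP buffer, and all of them must come from the same pool as the first one. A minimal sketch under simplified assumptions; struct sketch_pkt stands in for dp_packet/dp_packet_afxdp and is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a packet: where it came from and which
 * memory pool owns its buffer. */
struct sketch_pkt {
    bool from_afxdp;
    const void *pool;
};

/* True if all 'n' packets are AF_XDP buffers from one pool.  An empty
 * batch yields true, matching the loop in check_free_batch(). */
static bool
sketch_same_pool(const struct sketch_pkt *pkts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!pkts[i].from_afxdp || pkts[i].pool != pkts[0].pool) {
            return false;
        }
    }
    return true;
}
```

Only when this holds can the whole batch be returned to the pool with a single bulk push; otherwise each packet has to be freed individually.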
>>>> +
>>>> +static inline void
>>>> +afxdp_complete_tx(struct xsk_socket_info *xsk)
>>>> +{
>>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>>> +    uint32_t idx_cq = 0;
>>>> +    int tx_done, j, ret;
>>>> +
>>>> +    if (!xsk->outstanding_tx) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    ret = kick_tx(xsk);
>>>> +    if (OVS_UNLIKELY(ret)) {
>>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>>> +                     ovs_strerror(ret));
>>>> +    }
>>>> +
>>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
>>>> +    if (tx_done > 0) {
>>>> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>>> +        xsk->outstanding_tx -= tx_done;
>>>> +    }
>>>> +
>>>> +    /* Recycle back to umem pool */
>>>> +    for (j = 0; j < tx_done; j++) {
>>>> +        struct umem_elem *elem;
>>>> +        uint64_t addr;
>>>> +
>>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>>> +                            (char *)xsk->umem->buffer + addr);
>>>> +        elems_push[j] = elem;
>>>> +    }
>>>> +
>>>> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
>>>> +                        struct dp_packet_batch *batch,
>>>> +                        bool concurrent_txq)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    struct xsk_socket_info *xsk = dev->xsks[qid];
>>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>>> +    struct dp_packet *packet;
>>>> +    bool free_batch = true;
>>>> +    uint32_t idx = 0;
>>>> +    int error = 0;
>>>> +    int ret;
>>>> +
>>>> +    if (!xsk) {
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
>>>> +        qid = qid % dev->up.n_txq;
>>>> +        ovs_spin_lock(&dev->tx_locks[qid]);
>>>> +    }
>>>> +
>>>> +    /* Process CQ first. */
>>>> +    afxdp_complete_tx(xsk);
>>>> +
>>>> +    free_batch = check_free_batch(batch);
>>>> +
>>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>>> +    if (OVS_UNLIKELY(ret)) {
>>>> +        xsk->tx_dropped += batch->count;
>>>> +        error = ENOMEM;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* Make sure we have enough TX descs */
>>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>>> +        xsk->tx_dropped += batch->count;
>>>> +        error = ENOMEM;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>>> +        struct umem_elem *elem;
>>>> +        uint64_t index;
>>>> +
>>>> +        elem = elems_pop[i];
>>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>>> +         * TODO: avoid this copy if the packet and the pop umem
>>>> +         * are located in the same umem.
>>>> +         */
>>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
>>>> +
>>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>>> +            = dp_packet_size(packet);
>>>> +    }
>>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>>> +    xsk->outstanding_tx += batch->count;
>>>> +
>>>> +    ret = kick_tx(xsk);
>>>> +    if (OVS_UNLIKELY(ret)) {
>>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>>> +                     ovs_strerror(ret));
>>>> +    }
>>>> +
>>>> +out:
>>>> +    if (free_batch) {
>>>> +        free_afxdp_buf_batch(batch);
>>>> +    } else {
>>>> +        dp_packet_delete_batch(batch, true);
>>>> +    }
>>>> +
>>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
>>>> +        ovs_spin_unlock(&dev->tx_locks[qid]);
>>>> +    }
>>>> +    return error;
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
>>>> +{
>>>> +   /* Done at reconfigure */
>>>> +   return 0;
>>>> +}
>>>> +
>>>> +void
>>>> +netdev_afxdp_destruct(struct netdev *netdev_)
>>>> +{
>>>> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>>>> +
>>>> +    /* Note: tc is by-passed when using drv-mode, but when using
>>>> +     * skb-mode, we might need to clean up tc. */
>>>> +
>>>> +    xsk_destroy_all(netdev_);
>>>> +    ovs_mutex_destroy(&netdev->mutex);
>>>> +}
>>>> +
>>>> +int
>>>> +netdev_afxdp_get_stats(const struct netdev *netdev,
>>>> +                       struct netdev_stats *stats)
>>>> +{
>>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>>> +    struct netdev_stats dev_stats;
>>>> +    struct xsk_socket_info *xsk;
>>>> +    int error, i;
>>>> +
>>>> +    ovs_mutex_lock(&dev->mutex);
>>>> +
>>>> +    error = get_stats_via_netlink(netdev, &dev_stats);
>>>> +    if (error) {
>>>> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
>>>> +    } else {
>>>> +        /* Use kernel netdev's packet and byte counts */
>>>> +        stats->rx_packets = dev_stats.rx_packets;
>>>> +        stats->rx_bytes = dev_stats.rx_bytes;
>>>> +        stats->tx_packets = dev_stats.tx_packets;
>>>> +        stats->tx_bytes = dev_stats.tx_bytes;
>>>> +
>>>> +        stats->rx_errors           += dev_stats.rx_errors;
>>>> +        stats->tx_errors           += dev_stats.tx_errors;
>>>> +        stats->rx_dropped          += dev_stats.rx_dropped;
>>>> +        stats->tx_dropped          += dev_stats.tx_dropped;
>>>> +        stats->multicast           += dev_stats.multicast;
>>>> +        stats->collisions          += dev_stats.collisions;
>>>> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
>>>> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
>>>> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
>>>> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
>>>> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
>>>> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
>>>> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
>>>> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
>>>> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
>>>> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
>>>> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
>>>> +
>>>> +        /* Account the dropped in each xsk */
>>>> +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
>>>> +            xsk = dev->xsks[i];
>>>> +            if (xsk) {
>>>> +                stats->rx_dropped += xsk->rx_dropped;
>>>> +                stats->tx_dropped += xsk->tx_dropped;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>> +
>>>> +    return error;
>>>> +}
>>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
>>>> new file mode 100644
>>>> index 000000000000..dd2dc1a2064d
>>>> --- /dev/null
>>>> +++ b/lib/netdev-afxdp.h
>>>> @@ -0,0 +1,74 @@
>>>> +/*
>>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing, software
>>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>> + * See the License for the specific language governing permissions and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#ifndef NETDEV_AFXDP_H
>>>> +#define NETDEV_AFXDP_H 1
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#ifdef HAVE_AF_XDP
>>>> +
>>>> +#include <stdint.h>
>>>> +#include <stdbool.h>
>>>> +
>>>> +/* These functions are Linux AF_XDP specific, so they should be used directly
>>>> + * only by Linux-specific code. */
>>>> +
>>>> +#define MAX_XSKQ 16
>>>> +
>>>> +struct netdev;
>>>> +struct xsk_socket_info;
>>>> +struct xdp_umem;
>>>> +struct dp_packet_batch;
>>>> +struct smap;
>>>> +struct dp_packet;
>>>> +struct netdev_rxq;
>>>> +struct netdev_stats;
>>>> +
>>>> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
>>>> +void netdev_afxdp_destruct(struct netdev *netdev_);
>>>> +
>>>> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
>>>> +                          struct dp_packet_batch *batch,
>>>> +                          int *qfill);
>>>> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
>>>> +                            struct dp_packet_batch *batch,
>>>> +                            bool concurrent_txq);
>>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>>> +                            char **errp);
>>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
>>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
>>>> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
>>>> +                           struct netdev_stats *stats);
>>>> +
>>>> +void free_afxdp_buf(struct dp_packet *p);
>>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
>>>> +void signal_remove_xdp(struct netdev *netdev);
>>>> +
>>>> +#else /* !HAVE_AF_XDP */
>>>> +
>>>> +#include "openvswitch/compiler.h"
>>>> +
>>>> +struct dp_packet;
>>>> +
>>>> +static inline void
>>>> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
>>>> +{
>>>> +    /* Nothing */
>>>> +}
>>>> +
>>>> +#endif /* HAVE_AF_XDP */
>>>> +#endif /* netdev-afxdp.h */
>>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>>> new file mode 100644
>>>> index 000000000000..6a0388cf9dc3
>>>> --- /dev/null
>>>> +++ b/lib/netdev-linux-private.h
>>>> @@ -0,0 +1,139 @@
>>>> +/*
>>>> + * Copyright (c) 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing, software
>>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>> + * See the License for the specific language governing permissions and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#include <linux/filter.h>
>>>> +#include <linux/gen_stats.h>
>>>> +#include <linux/if_ether.h>
>>>> +#include <linux/if_tun.h>
>>>> +#include <linux/types.h>
>>>> +#include <linux/ethtool.h>
>>>> +#include <linux/mii.h>
>>>> +#include <stdint.h>
>>>> +#include <stdbool.h>
>>>> +
>>>> +#include "netdev-afxdp.h"
>>>> +#include "netdev-provider.h"
>>>> +#include "netdev-tc-offloads.h"
>>>> +#include "netdev-vport.h"
>>>> +#include "openvswitch/thread.h"
>>>> +#include "ovs-atomic.h"
>>>> +#include "timer.h"
>>>> +#include "xdpsock.h"
>>>> +
>>>> +/* These functions are Linux specific, so they should be used directly only by
>>>> + * Linux-specific code. */
>>>> +
>>>> +struct netdev;
>>>> +
>>>> +struct netdev_rxq_linux {
>>>> +    struct netdev_rxq up;
>>>> +    bool is_tap;
>>>> +    int fd;
>>>> +};
>>>> +
>>>> +void netdev_linux_run(const struct netdev_class *);
>>>> +
>>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
>>>> +                                  const char *flag_name, bool enable);
>>>> +
>>>> +int get_stats_via_netlink(const struct netdev *netdev_,
>>>> +                          struct netdev_stats *stats);
>>>> +
>>>> +struct netdev_linux {
>>>> +    struct netdev up;
>>>> +
>>>> +    /* Protects all members below. */
>>>> +    struct ovs_mutex mutex;
>>>> +
>>>> +    unsigned int cache_valid;
>>>> +
>>>> +    bool miimon;                    /* Link status of last poll. */
>>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>>> +    struct timer miimon_timer;
>>>> +
>>>> +    int netnsid;                    /* Network namespace ID. */
>>>> +    /* The following are figured out "on demand" only.  They are only valid
>>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>>> +    int ifindex;
>>>> +    struct eth_addr etheraddr;
>>>> +    int mtu;
>>>> +    unsigned int ifi_flags;
>>>> +    long long int carrier_resets;
>>>> +    uint32_t kbits_rate;        /* Policing data. */
>>>> +    uint32_t kbits_burst;
>>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>>> +                                   0 or an errno value. */
>>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
>>>> +                                 * or SIOCSIFMTU.
>>>> +                                 */
>>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
>>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>>> +
>>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>>> +
>>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>>> +    struct tc *tc;
>>>> +
>>>> +    /* For devices of class netdev_tap_class only. */
>>>> +    int tap_fd;
>>>> +    bool present;               /* If the device is present in the namespace */
>>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>>> +
>>>> +    /* LAG information. */
>>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>>> +
>>>> +    /* AF_XDP information */
>>>> +#ifdef HAVE_AF_XDP
>>>> +    struct xsk_socket_info **xsks;
>>>> +    int requested_n_rxq;
>>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
>>>> +    int xdp_flags, xdp_bind_flags;
>>>> +    struct ovs_spinlock *tx_locks;
>>>> +#endif
>>>> +};
>>>> +
>>>> +static bool
>>>> +is_netdev_linux_class(const struct netdev_class *netdev_class)
>>>> +{
>>>> +    return netdev_class->run == netdev_linux_run;
>>>> +}
>>>> +
>>>> +static struct netdev_linux *
>>>> +netdev_linux_cast(const struct netdev *netdev)
>>>> +{
>>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>>> +
>>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>>> +}
>>>> +
>>>> +static struct netdev_rxq_linux *
>>>> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
>>>> +{
>>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
>>>> +
>>>> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
>>>> +}
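
Both cast helpers above rely on the "embedded base struct" idiom: the generic struct lives inside the Linux-specific one, and the outer pointer is recovered by subtracting the member offset. A minimal, self-contained sketch of that idiom; the SKETCH_CONTAINER_OF macro and the sketch_* structs are hypothetical stand-ins, and the real OVS CONTAINER_OF macro is defined differently.

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing 'type' from a pointer to its
 * embedded 'member'. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
    ((type *) (void *) ((char *) (ptr) - offsetof(type, member)))

struct sketch_base {
    int id;
};

struct sketch_derived {
    int extra;                  /* Placed first so 'up' has a nonzero offset. */
    struct sketch_base up;      /* Embedded generic part. */
};

/* Downcast from the embedded base to the enclosing derived struct. */
static struct sketch_derived *
sketch_derived_cast(struct sketch_base *base)
{
    return SKETCH_CONTAINER_OF(base, struct sketch_derived, up);
}
```

The cast is only valid when the base pointer really is embedded in the derived type, which is why the helpers in the patch assert on the netdev class before converting.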
>>>> +
>>>> +#endif /* netdev-linux-private.h */
>>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>>> index f75d73fd39f8..2883cf1f2586 100644
>>>> --- a/lib/netdev-linux.c
>>>> +++ b/lib/netdev-linux.c
>>>> @@ -17,6 +17,7 @@
>>>>  #include <config.h>
>>>>
>>>>  #include "netdev-linux.h"
>>>> +#include "netdev-linux-private.h"
>>>>
>>>>  #include <errno.h>
>>>>  #include <fcntl.h>
>>>> @@ -54,6 +55,7 @@
>>>>  #include "fatal-signal.h"
>>>>  #include "hash.h"
>>>>  #include "openvswitch/hmap.h"
>>>> +#include "netdev-afxdp.h"
>>>>  #include "netdev-provider.h"
>>>>  #include "netdev-tc-offloads.h"
>>>>  #include "netdev-vport.h"
>>>> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
>>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
>>>>
>>>> -struct netdev_linux {
>>>> -    struct netdev up;
>>>> -
>>>> -    /* Protects all members below. */
>>>> -    struct ovs_mutex mutex;
>>>> -
>>>> -    unsigned int cache_valid;
>>>> -
>>>> -    bool miimon;                    /* Link status of last poll. */
>>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
>>>> -    struct timer miimon_timer;
>>>> -
>>>> -    int netnsid;                    /* Network namespace ID. */
>>>> -    /* The following are figured out "on demand" only.  They are only valid
>>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
>>>> -    int ifindex;
>>>> -    struct eth_addr etheraddr;
>>>> -    int mtu;
>>>> -    unsigned int ifi_flags;
>>>> -    long long int carrier_resets;
>>>> -    uint32_t kbits_rate;        /* Policing data. */
>>>> -    uint32_t kbits_burst;
>>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
>>>> -                                   0 or an errno value. */
>>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
>>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
>>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
>>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
>>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
>>>> -
>>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
>>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
>>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
>>>> -
>>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
>>>> -    struct tc *tc;
>>>> -
>>>> -    /* For devices of class netdev_tap_class only. */
>>>> -    int tap_fd;
>>>> -    bool present;               /* If the device is present in the namespace */
>>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
>>>> -
>>>> -    /* LAG information. */
>>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
>>>> -};
>>>> -
>>>> -struct netdev_rxq_linux {
>>>> -    struct netdev_rxq up;
>>>> -    bool is_tap;
>>>> -    int fd;
>>>> -};
>>>>
>>>>  /* This is set pretty low because we probably won't learn anything from the
>>>>   * additional log messages. */
>>>> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>>   * changes in the device miimon status, so we can use atomic_count. */
>>>>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>>>>
>>>> -static void netdev_linux_run(const struct netdev_class *);
>>>> -
>>>>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>>>>                                     int cmd, const char *cmd_name);
>>>>  static int get_flags(const struct netdev *, unsigned int *flags);
>>>> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
>>>>                         struct in_addr addr);
>>>>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
>>>>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
>>>> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
>>>>  static int af_packet_sock(void);
>>>>  static bool netdev_linux_miimon_enabled(void);
>>>>  static void netdev_linux_miimon_run(void);
>>>> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
>>>>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
>>>>
>>>>  static bool
>>>> -is_netdev_linux_class(const struct netdev_class *netdev_class)
>>>> -{
>>>> -    return netdev_class->run == netdev_linux_run;
>>>> -}
>>>> -
>>>> -static bool
>>>>  is_tap_netdev(const struct netdev *netdev)
>>>>  {
>>>>      return netdev_get_class(netdev) == &netdev_tap_class;
>>>>  }
>>>> -
>>>> -static struct netdev_linux *
>>>> -netdev_linux_cast(const struct netdev *netdev)
>>>> -{
>>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>>> -
>>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>>> -}
>>>> -
>>>> -static struct netdev_rxq_linux *
>>>> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
>>>> -{
>>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
>>>> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
>>>> -}
>>>>
>>>>  static int
>>>>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
>>>> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
>>>>      }
>>>>  }
>>>>
>>>> -static void
>>>> +void
>>>>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
>>>>  {
>>>>      struct nl_sock *sock;
>>>> @@ -3279,9 +3206,7 @@ exit:
>>>>      .run = netdev_linux_run,                                    \
>>>>      .wait = netdev_linux_wait,                                  \
>>>>      .alloc = netdev_linux_alloc,                                \
>>>> -    .destruct = netdev_linux_destruct,                          \
>>>>      .dealloc = netdev_linux_dealloc,                            \
>>>> -    .send = netdev_linux_send,                                  \
>>>>      .send_wait = netdev_linux_send_wait,                        \
>>>>      .set_etheraddr = netdev_linux_set_etheraddr,                \
>>>>      .get_etheraddr = netdev_linux_get_etheraddr,                \
>>>> @@ -3312,10 +3237,8 @@ exit:
>>>>      .arp_lookup = netdev_linux_arp_lookup,                      \
>>>>      .update_flags = netdev_linux_update_flags,                  \
>>>>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
>>>> -    .rxq_construct = netdev_linux_rxq_construct,                \
>>>>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
>>>>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
>>>> -    .rxq_recv = netdev_linux_rxq_recv,                          \
>>>>      .rxq_wait = netdev_linux_rxq_wait,                          \
>>>>      .rxq_drain = netdev_linux_rxq_drain
>>>>
>>>> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
>>>>      NETDEV_LINUX_CLASS_COMMON,
>>>>      LINUX_FLOW_OFFLOAD_API,
>>>>      .type = "system",
>>>> +    .is_pmd = false,
>>>>      .construct = netdev_linux_construct,
>>>> +    .destruct = netdev_linux_destruct,
>>>>      .get_stats = netdev_linux_get_stats,
>>>>      .get_features = netdev_linux_get_features,
>>>>      .get_status = netdev_linux_get_status,
>>>> -    .get_block_id = netdev_linux_get_block_id
>>>> +    .get_block_id = netdev_linux_get_block_id,
>>>> +    .send = netdev_linux_send,
>>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>>  };
>>>>
>>>>  const struct netdev_class netdev_tap_class = {
>>>>      NETDEV_LINUX_CLASS_COMMON,
>>>>      .type = "tap",
>>>> +    .is_pmd = false,
>>>>      .construct = netdev_linux_construct_tap,
>>>> +    .destruct = netdev_linux_destruct,
>>>>      .get_stats = netdev_tap_get_stats,
>>>>      .get_features = netdev_linux_get_features,
>>>>      .get_status = netdev_linux_get_status,
>>>> +    .send = netdev_linux_send,
>>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>>  };
>>>>
>>>>  const struct netdev_class netdev_internal_class = {
>>>>      NETDEV_LINUX_CLASS_COMMON,
>>>>      LINUX_FLOW_OFFLOAD_API,
>>>>      .type = "internal",
>>>> +    .is_pmd = false,
>>>>      .construct = netdev_linux_construct,
>>>> +    .destruct = netdev_linux_destruct,
>>>>      .get_stats = netdev_internal_get_stats,
>>>>      .get_status = netdev_internal_get_status,
>>>> +    .send = netdev_linux_send,
>>>> +    .rxq_construct = netdev_linux_rxq_construct,
>>>> +    .rxq_recv = netdev_linux_rxq_recv,
>>>>  };
>>>> +
>>>> +#ifdef HAVE_AF_XDP
>>>> +const struct netdev_class netdev_afxdp_class = {
>>>> +    NETDEV_LINUX_CLASS_COMMON,
>>>> +    .type = "afxdp",
>>>> +    .is_pmd = true,
>>>> +    .construct = netdev_linux_construct,
>>>> +    .destruct = netdev_afxdp_destruct,
>>>> +    .get_stats = netdev_afxdp_get_stats,
>>>> +    .get_status = netdev_linux_get_status,
>>>> +    .set_config = netdev_afxdp_set_config,
>>>> +    .get_config = netdev_afxdp_get_config,
>>>> +    .reconfigure = netdev_afxdp_reconfigure,
>>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
>>>> +    .send = netdev_afxdp_batch_send,
>>>> +    .rxq_construct = netdev_afxdp_rxq_construct,
>>>> +    .rxq_recv = netdev_afxdp_rxq_recv,
>>>> +};
>>>> +#endif
>>>>
>>>>
>>>>  #define CODEL_N_QUEUES 0x0000
>>>> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct
>>>> netdev_stats *dst,
>>>>      dst->tx_window_errors = src->tx_window_errors;
>>>>  }
>>>>
>>>> -static int
>>>> +int
>>>>  get_stats_via_netlink(const struct netdev *netdev_, struct
>>>> netdev_stats *stats)
>>>>  {
>>>>      struct ofpbuf request;
>>>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>>>> index fb0c27e6e8e8..91e6a9e2bfc0 100644
>>>> --- a/lib/netdev-provider.h
>>>> +++ b/lib/netdev-provider.h
>>>> @@ -903,6 +903,9 @@ extern const struct netdev_class
>>>> netdev_linux_class;
>>>>  extern const struct netdev_class netdev_internal_class;
>>>>  extern const struct netdev_class netdev_tap_class;
>>>>
>>>> +#ifdef HAVE_AF_XDP
>>>> +extern const struct netdev_class netdev_afxdp_class;
>>>> +#endif
>>>>  #ifdef  __cplusplus
>>>>  }
>>>>  #endif
>>>> diff --git a/lib/netdev.c b/lib/netdev.c
>>>> index 7d7ecf6f0946..0fac117cc602 100644
>>>> --- a/lib/netdev.c
>>>> +++ b/lib/netdev.c
>>>> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>
>>>>  static void restore_all_flags(void *aux OVS_UNUSED);
>>>>  void update_device_args(struct netdev *, const struct shash 
>>>> *args);
>>>> +#ifdef HAVE_AF_XDP
>>>> +void signal_remove_xdp(struct netdev *netdev);
>>>> +#endif
>>>>
>>>>  int
>>>>  netdev_n_txq(const struct netdev *netdev)
>>>> @@ -146,6 +149,9 @@ netdev_initialize(void)
>>>>          netdev_register_provider(&netdev_internal_class);
>>>>          netdev_register_provider(&netdev_tap_class);
>>>>          netdev_vport_tunnel_register();
>>>> +#ifdef HAVE_AF_XDP
>>>> +        netdev_register_provider(&netdev_afxdp_class);
>>>> +#endif
>>>>  #endif
>>>>  #if defined(__FreeBSD__) || defined(__NetBSD__)
>>>>          netdev_register_provider(&netdev_tap_class);
>>>> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
>>>>                                                 saved_flags &
>>>> ~saved_values,
>>>>                                                 &old_flags);
>>>>          }
>>>> +#ifdef HAVE_AF_XDP
>>>> +        if (netdev->netdev_class == &netdev_afxdp_class) {
>>>> +            signal_remove_xdp(netdev);
>>>> +        }
>>>> +#endif
>>>>      }
>>>>  }
>>>>
>>>> diff --git a/lib/spinlock.h b/lib/spinlock.h
>>>> new file mode 100644
>>>> index 000000000000..1ae634f23a6b
>>>> --- /dev/null
>>>> +++ b/lib/spinlock.h
>>>> @@ -0,0 +1,70 @@
>>>> +/*
>>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the 
>>>> License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing,
>>>> software
>>>> + * distributed under the License is distributed on an "AS IS" 
>>>> BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied.
>>>> + * See the License for the specific language governing permissions
>>>> and
>>>> + * limitations under the License.
>>>> + */
>>>> +#ifndef SPINLOCK_H
>>>> +#define SPINLOCK_H 1
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#include <ctype.h>
>>>> +#include <errno.h>
>>>> +#include <fcntl.h>
>>>> +#include <stdarg.h>
>>>> +#include <stdlib.h>
>>>> +#include <unistd.h>
>>>> +
>>>> +#include "ovs-atomic.h"
>>>> +
>>>> +struct ovs_spinlock {
>>>> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
>>>> +};
>>>> +
>>>> +static inline void
>>>> +ovs_spinlock_init(struct ovs_spinlock *sl)
>>>> +{
>>>> +    atomic_init(&sl->locked, 0);
>>>> +}
>>>> +
>>>> +static inline void
>>>> +ovs_spin_lock(struct ovs_spinlock *sl)
>>>> +{
>>>> +    int exp = 0, locked = 0;
>>>> +
>>>> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked,
>>>> &exp, 1,
>>>> +                memory_order_acquire,
>>>> +                memory_order_relaxed)) {
>>>> +        locked = 1;
>>>> +        while (locked) {
>>>> +            atomic_read_relaxed(&sl->locked, &locked);
>>>> +        }
>>>> +        exp = 0;
>>>> +    }
>>>> +}
>>>> +
>>>> +static inline void
>>>> +ovs_spin_unlock(struct ovs_spinlock *sl)
>>>> +{
>>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
>>>> +}
>>>> +
>>>> +static inline int
>>>> +ovs_spin_trylock(struct ovs_spinlock *sl)
>>>> +{
>>>> +    int exp = 0;
>>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, 
>>>> &exp,
>>>> 1,
>>>> +                memory_order_acquire,
>>>> +                memory_order_relaxed);
>>>> +}
>>>> +#endif
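
For readers following the locking scheme in lib/spinlock.h above, here is a minimal standalone sketch of the same test-and-test-and-set pattern, written against plain C11 <stdatomic.h> instead of ovs-atomic.h. The demo_* names are hypothetical and not part of the patch. One detail worth noticing: like the patch's ovs_spin_trylock(), the try-lock here returns nonzero on success, which is the opposite of the pthread_mutex_trylock() convention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ovs_spinlock, using C11 atomics. */
struct demo_spinlock {
    atomic_int locked;          /* 0 = free, 1 = held. */
};

static inline void
demo_spin_init(struct demo_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool
demo_spin_trylock(struct demo_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
demo_spin_lock(struct demo_spinlock *sl)
{
    while (!demo_spin_trylock(sl)) {
        /* Spin on a cheap relaxed load until the lock looks free, and only
         * then retry the compare-and-swap. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
demo_spin_unlock(struct demo_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

The inner relaxed-load loop is the reason for the "test-and-test-and-set" name: waiters mostly read the lock word rather than repeatedly attempting the compare-and-swap.
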
>>>> diff --git a/lib/util.c b/lib/util.c
>>>> index 7b8ab81f6ee1..5eb20995b370 100644
>>>> --- a/lib/util.c
>>>> +++ b/lib/util.c
>>>> @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
>>>>      return xrealloc(p, *n * s);
>>>>  }
>>>>
>>>> -/* Allocates and returns 'size' bytes of memory aligned to a cache
>>>> line and in
>>>> - * dedicated cache lines.  That is, the memory block returned will
>>>> not share a
>>>> - * cache line with other data, avoiding "false sharing".
>>>> +/* Allocates and returns 'size' bytes of memory aligned to
>>>> 'alignment' bytes.
>>>> + * 'alignment' must be a power of two and a multiple of 
>>>> sizeof(void
>>>> *).
>>>>   *
>>>> - * Use free_cacheline() to free the returned memory block. */
>>>> + * Use free_size_align() to free the returned memory block. */
>>>>  void *
>>>> -xmalloc_cacheline(size_t size)
>>>> +xmalloc_size_align(size_t size, size_t alignment)
>>>>  {
>>>>  #ifdef HAVE_POSIX_MEMALIGN
>>>>      void *p;
>>>>      int error;
>>>>
>>>>      COVERAGE_INC(util_xalloc);
>>>> -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
>>>> +    error = posix_memalign(&p, alignment, size ? size : 1);
>>>>      if (error != 0) {
>>>>          out_of_memory();
>>>>      }
>>>> @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
>>>>  #else
>>>>      /* Allocate room for:
>>>>       *
>>>> -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to
>>>> allow the
>>>> -     *       pointer to be aligned exactly sizeof(void *) bytes
>>>> before the
>>>> -     *       beginning of a cache line.
>>>> +     *     - Header padding: Up to alignment - 1 bytes, to allow 
>>>> the
>>>> +     *       pointer 'q' to be aligned exactly sizeof(void *) 
>>>> bytes
>>>> before the
>>>> +     *       beginning of the alignment.
>>>>       *
>>>>       *     - Pointer: A pointer to the start of the header 
>>>> padding,
>>>> to allow us
>>>>       *       to free() the block later.
>>>>       *
>>>>       *     - User data: 'size' bytes.
>>>>       *
>>>> -     *     - Trailer padding: Enough to bring the user data up to 
>>>> a
>>>> cache line
>>>> +     *     - Trailer padding: Enough to bring the user data up to an alignment
>>>>       *       multiple.
>>>>       *
>>>>       * 
>>>> +---------------+---------+------------------------+---------+
>>>> @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
>>>>       * p               q         r
>>>>       *
>>>>       */
>>>> -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
>>>> -                      + sizeof(void *)
>>>> -                      + ROUND_UP(size, CACHE_LINE_SIZE));
>>>> -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) <
>>>> sizeof(void *);
>>>> -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ?
>>>> CACHE_LINE_SIZE : 0),
>>>> -                                CACHE_LINE_SIZE);
>>>> -    void **q = (void **) r - 1;
>>>> +    void *p, *r, **q;
>>>> +    bool runt;
>>>> +
>>>> +    COVERAGE_INC(util_xalloc);
>>>> +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) 
>>>> {
>>>> +        ovs_abort(0, "Invalid alignment");
>>>> +    }
>>>> +
>>>> +    p = xmalloc((alignment - 1)
>>>> +                + sizeof(void *)
>>>> +                + ROUND_UP(size, alignment));
>>>> +
>>>> +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
>>>> +    /* When the padding size < sizeof(void *), we don't have enough room for
>>>> +     * pointer 'q'.  As a result, we need to move 'r' to the next alignment
>>>> +     * boundary.  So ROUND_UP when calling xmalloc() above, and ROUND_UP again
>>>> +     * when calculating 'r' below.
>>>> +     */
>>>> +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
>>>> +    q = (void **) r - 1;
>>>>      *q = p;
>>>> +
>>>>      return r;
>>>>  #endif
>>>>  }
>>>>
>>>> +void
>>>> +free_size_align(void *p)
>>>> +{
>>>> +#ifdef HAVE_POSIX_MEMALIGN
>>>> +    free(p);
>>>> +#else
>>>> +    if (p) {
>>>> +        void **q = (void **) p - 1;
>>>> +        free(*q);
>>>> +    }
>>>> +#endif
>>>> +}
>>>> +
>>>> +/* Allocates and returns 'size' bytes of memory aligned to a cache
>>>> line and in
>>>> + * dedicated cache lines.  That is, the memory block returned will
>>>> not share a
>>>> + * cache line with other data, avoiding "false sharing".
>>>> + *
>>>> + * Use free_cacheline() to free the returned memory block. */
>>>> +void *
>>>> +xmalloc_cacheline(size_t size)
>>>> +{
>>>> +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
>>>> +}
>>>> +
>>>>  /* Like xmalloc_cacheline() but clears the allocated memory to all
>>>> zero
>>>>   * bytes. */
>>>>  void *
>>>> @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
>>>>  void
>>>>  free_cacheline(void *p)
>>>>  {
>>>> -#ifdef HAVE_POSIX_MEMALIGN
>>>> -    free(p);
>>>> -#else
>>>> -    if (p) {
>>>> -        void **q = (void **) p - 1;
>>>> -        free(*q);
>>>> -    }
>>>> -#endif
>>>> +    free_size_align(p);
>>>> +}
>>>> +
>>>> +void *
>>>> +xmalloc_pagealign(size_t size)
>>>> +{
>>>> +    return xmalloc_size_align(size, get_page_size());
>>>> +}
>>>> +
>>>> +void
>>>> +free_pagealign(void *p)
>>>> +{
>>>> +    free_size_align(p);
>>>>  }
>>>>
>>>>  char *
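
The non-posix_memalign() branch of xmalloc_size_align() above is easier to follow in isolation. The sketch below reproduces the p/q/r layout with plain malloc(); the demo_* names are hypothetical, it returns NULL on allocation failure instead of aborting, and it does not guard against size overflow, so treat it as an illustration of the pointer arithmetic only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', which must be a power of two
 * and a multiple of sizeof(void *).  Free with demo_free_align(). */
static void *
demo_malloc_align(size_t size, size_t alignment)
{
    assert(alignment && !(alignment & (alignment - 1)));
    assert(alignment % sizeof(void *) == 0);

    /* Header padding + room for the back-pointer + rounded-up user data. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    void *p = malloc((alignment - 1) + sizeof(void *) + rounded);
    if (!p) {
        return NULL;
    }

    uintptr_t base = (uintptr_t) p;
    /* Distance from 'p' to the next alignment boundary (0 if aligned). */
    size_t pad = (alignment - base % alignment) % alignment;
    /* If the gap cannot hold the back-pointer, skip to the next boundary. */
    bool runt = pad < sizeof(void *);
    uintptr_t start = base + (runt ? alignment : 0);
    uintptr_t r = (start + alignment - 1) / alignment * alignment;

    void **q = (void **) r - 1;     /* Slot just before the user data. */
    *q = p;                         /* Remember what to pass to free(). */
    return (void *) r;
}

static void
demo_free_align(void *r)
{
    if (r) {
        void **q = (void **) r - 1;
        free(*q);
    }
}
```

The "runt" case is the important one: when 'p' is already aligned, or within sizeof(void *) bytes of a boundary, the back-pointer would land before 'p', so 'r' has to advance by a full alignment unit.
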
>>>> diff --git a/lib/util.h b/lib/util.h
>>>> index c26605abdce3..33665748274c 100644
>>>> --- a/lib/util.h
>>>> +++ b/lib/util.h
>>>> @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src,
>>>> size_t size);
>>>>
>>>>  int string_ends_with(const char *str, const char *suffix);
>>>>
>>>> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
>>>> +void free_pagealign(void *);
>>>> +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
>>>> +void free_size_align(void *);
>>>> +
>>>>  /* The C standards say that neither the 'dst' nor 'src' argument 
>>>> to
>>>>   * memcpy() may be null, even if 'n' is zero.  This wrapper 
>>>> tolerates
>>>>   * the null case. */
>>>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
>>>> new file mode 100644
>>>> index 000000000000..ea39fa557290
>>>> --- /dev/null
>>>> +++ b/lib/xdpsock.c
>>>> @@ -0,0 +1,170 @@
>>>> +/*
>>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the 
>>>> License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing,
>>>> software
>>>> + * distributed under the License is distributed on an "AS IS" 
>>>> BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied.
>>>> + * See the License for the specific language governing permissions
>>>> and
>>>> + * limitations under the License.
>>>> + */
>>>> +#include <config.h>
>>>> +
>>>> +#include "xdpsock.h"
>>>> +#include "dp-packet.h"
>>>> +#include "openvswitch/compiler.h"
>>>> +
>>>> +/* Note:
>>>> + * umem_elem_push* shouldn't overflow because we always pop
>>>> + * elem first, then push back to the stack.
>>>> + */
>>>> +static inline void
>>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>>> +{
>>>> +    void *ptr;
>>>> +
>>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>>>> +        OVS_NOT_REACHED();
>>>> +    }
>>>> +
>>>> +    ptr = &umemp->array[umemp->index];
>>>> +    memcpy(ptr, addrs, n * sizeof(void *));
>>>> +    umemp->index += n;
>>>> +}
>>>> +
>>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void 
>>>> **addrs)
>>>> +{
>>>> +    ovs_spin_lock(&umemp->lock);
>>>> +    __umem_elem_push_n(umemp, n, addrs);
>>>> +    ovs_spin_unlock(&umemp->lock);
>>>> +}
>>>> +
>>>> +static inline void
>>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
>>>> +{
>>>> +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
>>>> +        OVS_NOT_REACHED();
>>>> +    }
>>>> +
>>>> +    umemp->array[umemp->index++] = addr;
>>>> +}
>>>> +
>>>> +void
>>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
>>>> +{
>>>> +
>>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
>>>> +
>>>> +    ovs_spin_lock(&umemp->lock);
>>>> +    __umem_elem_push(umemp, addr);
>>>> +    ovs_spin_unlock(&umemp->lock);
>>>> +}
>>>> +
>>>> +static inline int
>>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>>> +{
>>>> +    void *ptr;
>>>> +
>>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
>>>> +        return -ENOMEM;
>>>> +    }
>>>> +
>>>> +    umemp->index -= n;
>>>> +    ptr = &umemp->array[umemp->index];
>>>> +    memcpy(addrs, ptr, n * sizeof(void *));
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ovs_spin_lock(&umemp->lock);
>>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
>>>> +    ovs_spin_unlock(&umemp->lock);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static inline void *
>>>> +__umem_elem_pop(struct umem_pool *umemp)
>>>> +{
>>>> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    return umemp->array[--umemp->index];
>>>> +}
>>>> +
>>>> +void *
>>>> +umem_elem_pop(struct umem_pool *umemp)
>>>> +{
>>>> +    void *ptr;
>>>> +
>>>> +    ovs_spin_lock(&umemp->lock);
>>>> +    ptr = __umem_elem_pop(umemp);
>>>> +    ovs_spin_unlock(&umemp->lock);
>>>> +
>>>> +    return ptr;
>>>> +}
>>>> +
>>>> +static void **
>>>> +__umem_pool_alloc(unsigned int size)
>>>> +{
>>>> +    void *bufs;
>>>> +
>>>> +    bufs = xmalloc_pagealign(size * sizeof(void *));
>>>> +    memset(bufs, 0, size * sizeof(void *));
>>>> +
>>>> +    return (void **)bufs;
>>>> +}
>>>> +
>>>> +int
>>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
>>>> +{
>>>> +    umemp->array = __umem_pool_alloc(size);
>>>> +    if (!umemp->array) {
>>>> +        return -ENOMEM;
>>>> +    }
>>>> +
>>>> +    umemp->size = size;
>>>> +    umemp->index = 0;
>>>> +    ovs_spinlock_init(&umemp->lock);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void
>>>> +umem_pool_cleanup(struct umem_pool *umemp)
>>>> +{
>>>> +    free_pagealign(umemp->array);
>>>> +    umemp->array = NULL;
>>>> +}
>>>> +
>>>> +/* AF_XDP metadata init/destroy */
>>>> +int
>>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
>>>> +{
>>>> +    void *bufs;
>>>> +
>>>> +    bufs = xmalloc_pagealign(size * sizeof(struct 
>>>> dp_packet_afxdp));
>>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
>>>> +
>>>> +    xp->array = bufs;
>>>> +    xp->size = size;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void
>>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
>>>> +{
>>>> +    free_pagealign(xp->array);
>>>> +    xp->array = NULL;
>>>> +}
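
The umem pool in lib/xdpsock.c above is a bounded LIFO stack of frame pointers. A minimal sketch of the push_n/pop_n bookkeeping follows, with the spinlock left out. The demo_* names are hypothetical, and unlike the patch (which calls OVS_NOT_REACHED() on overflow) this version returns -ENOSPC so the bounds check can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_pool {
    int index;              /* Number of stored pointers; top is index - 1. */
    unsigned int size;      /* Capacity in pointers. */
    void **array;
};

static int
demo_pool_init(struct demo_pool *pool, unsigned int size)
{
    pool->array = calloc(size, sizeof *pool->array);
    if (!pool->array) {
        return -ENOMEM;
    }
    pool->size = size;
    pool->index = 0;
    return 0;
}

static void
demo_pool_cleanup(struct demo_pool *pool)
{
    free(pool->array);
    pool->array = NULL;
}

/* Pushes 'n' pointers; fails without side effects if they do not fit. */
static int
demo_pool_push_n(struct demo_pool *pool, int n, void **addrs)
{
    if (n < 0 || (unsigned int) pool->index + (unsigned int) n > pool->size) {
        return -ENOSPC;
    }
    memcpy(&pool->array[pool->index], addrs, n * sizeof(void *));
    pool->index += n;
    return 0;
}

/* Pops the 'n' most recently pushed pointers, in array order. */
static int
demo_pool_pop_n(struct demo_pool *pool, int n, void **addrs)
{
    if (n < 0 || pool->index - n < 0) {
        return -ENOMEM;
    }
    pool->index -= n;
    memcpy(addrs, &pool->array[pool->index], n * sizeof(void *));
    return 0;
}
```

Pushing three pointers into a four-slot pool succeeds, a further push of two is rejected, and popping two returns the second and third pointers that were pushed.
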
>>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>>> new file mode 100644
>>>> index 000000000000..1a1093381243
>>>> --- /dev/null
>>>> +++ b/lib/xdpsock.h
>>>> @@ -0,0 +1,101 @@
>>>> +/*
>>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the 
>>>> License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing,
>>>> software
>>>> + * distributed under the License is distributed on an "AS IS" 
>>>> BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied.
>>>> + * See the License for the specific language governing permissions
>>>> and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#ifndef XDPSOCK_H
>>>> +#define XDPSOCK_H 1
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#ifdef HAVE_AF_XDP
>>>> +
>>>> +#include <bpf/xsk.h>
>>>> +#include <errno.h>
>>>> +#include <stdbool.h>
>>>> +#include <stdio.h>
>>>> +
>>>> +#include "openvswitch/thread.h"
>>>> +#include "ovs-atomic.h"
>>>> +#include "spinlock.h"
>>>> +
>>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>>> +
>>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>>> +
>>>> +/* The worst case is that all 4 queues (TX/CQ/RX/FILL) are full.
>>>> + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
>>>> + */
>>>> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
>>>> +
>>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>>> +
>>>> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
>>>> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
>>>> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS +
>>>> CONS_NUM_DESCS));
>>>> +
>>>> +/* LIFO ptr_array */
>>>> +struct umem_pool {
>>>> +    int index;      /* point to top */
>>>> +    unsigned int size;
>>>> +    struct ovs_spinlock lock;
>>>> +    void **array;   /* a pointer array, point to umem buf */
>>>> +};
>>>> +
>>>> +/* array-based dp_packet_afxdp */
>>>> +struct xpacket_pool {
>>>> +    unsigned int size;
>>>> +    struct dp_packet_afxdp **array;
>>>> +};
>>>> +
>>>> +struct xsk_umem_info {
>>>> +    struct umem_pool mpool;
>>>> +    struct xpacket_pool xpool;
>>>> +    struct xsk_ring_prod fq;
>>>> +    struct xsk_ring_cons cq;
>>>> +    struct xsk_umem *umem;
>>>> +    void *buffer;
>>>> +};
>>>> +
>>>> +struct xsk_socket_info {
>>>> +    struct xsk_ring_cons rx;
>>>> +    struct xsk_ring_prod tx;
>>>> +    struct xsk_umem_info *umem;
>>>> +    struct xsk_socket *xsk;
>>>> +    unsigned long rx_dropped;
>>>> +    unsigned long tx_dropped;
>>>> +    uint32_t outstanding_tx;
>>>> +};
>>>> +
>>>> +struct umem_elem {
>>>> +    struct umem_elem *next;
>>>> +};
>>>> +
>>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
>>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void 
>>>> **addrs);
>>>> +
>>>> +void *umem_elem_pop(struct umem_pool *umemp);
>>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
>>>> +
>>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
>>>> +void umem_pool_cleanup(struct umem_pool *umemp);
>>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
>>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
>>>> +
>>>> +#endif
>>>> +#endif
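
To make the NUM_FRAMES sizing in lib/xdpsock.h above concrete, here is the arithmetic worked out as a small sketch. It assumes the libbpf defaults of 2048 descriptors per ring and 4096-byte frames; those values are an assumption about the libbpf headers in use and are not stated in the patch, and the DEMO_* names are hypothetical.

```c
#include <assert.h>

enum {
    /* Assumed libbpf defaults; check <bpf/xsk.h> on the target system. */
    DEMO_PROD_NUM_DESCS = 2048,
    DEMO_CONS_NUM_DESCS = 2048,
    DEMO_FRAME_SIZE = 4096,

    /* Four rings (TX, CQ, RX, FILL), each possibly full at once. */
    DEMO_NUM_FRAMES = 2 * (DEMO_PROD_NUM_DESCS + DEMO_CONS_NUM_DESCS),
};

/* Mirrors the BUILD_ASSERT_DECL checks in the patch. */
_Static_assert(DEMO_PROD_NUM_DESCS == DEMO_CONS_NUM_DESCS,
               "producer and consumer rings must match");
_Static_assert((DEMO_NUM_FRAMES & (DEMO_NUM_FRAMES - 1)) == 0,
               "frame count must be a power of two");

/* Total umem size in bytes under the assumptions above. */
static inline unsigned long
demo_umem_bytes(void)
{
    return (unsigned long) DEMO_NUM_FRAMES * DEMO_FRAME_SIZE;
}
```

Under these assumptions the pool holds 8192 frames, or 32 MiB of umem per socket.
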
>>>> diff --git a/tests/automake.mk b/tests/automake.mk
>>>> index 2956e68b242c..131564bb0bd3 100644
>>>> --- a/tests/automake.mk
>>>> +++ b/tests/automake.mk
>>>> @@ -4,12 +4,14 @@ EXTRA_DIST += \
>>>>       $(SYSTEM_TESTSUITE_AT) \
>>>>       $(SYSTEM_KMOD_TESTSUITE_AT) \
>>>>       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
>>>> +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
>>>>       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
>>>>       $(SYSTEM_DPDK_TESTSUITE_AT) \
>>>>       $(OVSDB_CLUSTER_TESTSUITE_AT) \
>>>>       $(TESTSUITE) \
>>>>       $(SYSTEM_KMOD_TESTSUITE) \
>>>>       $(SYSTEM_USERSPACE_TESTSUITE) \
>>>> +     $(SYSTEM_AFXDP_TESTSUITE) \
>>>>       $(SYSTEM_OFFLOADS_TESTSUITE) \
>>>>       $(SYSTEM_DPDK_TESTSUITE) \
>>>>       $(OVSDB_CLUSTER_TESTSUITE) \
>>>> @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>>>>       tests/system-userspace-macros.at \
>>>>       tests/system-userspace-packet-type-aware.at
>>>>
>>>> +SYSTEM_AFXDP_TESTSUITE_AT = \
>>>> +     tests/system-afxdp-testsuite.at \
>>>> +     tests/system-afxdp-macros.at
>>>> +
>>>>  SYSTEM_TESTSUITE_AT = \
>>>>       tests/system-common-macros.at \
>>>>       tests/system-ovn.at \
>>>> @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
>>>>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
>>>>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
>>>>  SYSTEM_USERSPACE_TESTSUITE =
>>>> $(srcdir)/tests/system-userspace-testsuite
>>>> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
>>>>  SYSTEM_OFFLOADS_TESTSUITE = 
>>>> $(srcdir)/tests/system-offloads-testsuite
>>>>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
>>>>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
>>>> @@ -317,6 +324,11 @@ check-system-userspace: all
>>>>       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests
>>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && 
>>>> "$$@"
>>>> --recheck)
>>>>
>>>> +check-afxdp: all
>>>> +     $(MAKE) install
>>>> +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests
>>>> AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
>>>> +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>>> +
>>>>  check-offloads: all
>>>>       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests
>>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
>>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && 
>>>> "$$@"
>>>> --recheck)
>>>> @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4
>>>> $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
>>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>>>       $(AM_V_at)mv $@.tmp $@
>>>>
>>>> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
>>>> $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
>>>> +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>>> +     $(AM_V_at)mv $@.tmp $@
>>>> +
>>>>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
>>>> $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
>>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
>>>>       $(AM_V_at)mv $@.tmp $@
>>>> diff --git a/tests/system-afxdp-macros.at
>>>> b/tests/system-afxdp-macros.at
>>>> new file mode 100644
>>>> index 000000000000..1e6f7a46b4b7
>>>> --- /dev/null
>>>> +++ b/tests/system-afxdp-macros.at
>>>> @@ -0,0 +1,20 @@
>>>> +# Add port to ovs bridge by using afxdp mode.
>>>> +# This will use generic XDP support in the veth driver.
>>>> +m4_define([ADD_VETH],
>>>> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || 
>>>> return
>>>> 77])
>>>> +      CONFIGURE_VETH_OFFLOADS([$1])
>>>> +      AT_CHECK([ip link set $1 netns $2])
>>>> +      AT_CHECK([ip link set dev ovs-$1 up])
>>>> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
>>>> +                set interface ovs-$1 external-ids:iface-id="$1"
>>>> type="afxdp"])
>>>> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
>>>> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
>>>> +      if test -n "$5"; then
>>>> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
>>>> +      fi
>>>> +      if test -n "$6"; then
>>>> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
>>>> +      fi
>>>> +      on_exit 'ip link del ovs-$1'
>>>> +    ]
>>>> +)
>>>> diff --git a/tests/system-afxdp-testsuite.at
>>>> b/tests/system-afxdp-testsuite.at
>>>> new file mode 100644
>>>> index 000000000000..9b7a29066614
>>>> --- /dev/null
>>>> +++ b/tests/system-afxdp-testsuite.at
>>>> @@ -0,0 +1,26 @@
>>>> +AT_INIT
>>>> +
>>>> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
>>>> +
>>>> +Licensed under the Apache License, Version 2.0 (the "License");
>>>> +you may not use this file except in compliance with the License.
>>>> +You may obtain a copy of the License at:
>>>> +
>>>> +    http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +Unless required by applicable law or agreed to in writing, 
>>>> software
>>>> +distributed under the License is distributed on an "AS IS" BASIS,
>>>> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied.
>>>> +See the License for the specific language governing permissions 
>>>> and
>>>> +limitations under the License.])
>>>> +
>>>> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
>>>> +
>>>> +m4_include([tests/ovs-macros.at])
>>>> +m4_include([tests/ovsdb-macros.at])
>>>> +m4_include([tests/ofproto-macros.at])
>>>> +m4_include([tests/system-common-macros.at])
>>>> +m4_include([tests/system-userspace-macros.at])
>>>> +m4_include([tests/system-afxdp-macros.at])
>>>> +
>>>> +m4_include([tests/system-traffic.at])
>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>> index 89c06a1b7877..1e3acbbb8075 100644
>>>> --- a/vswitchd/vswitch.xml
>>>> +++ b/vswitchd/vswitch.xml
>>>> @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface 
>>>> p0
>>>> type=patch options:peer=p1 \
>>>>          </p>
>>>>        </column>
>>>>
>>>> +      <column name="other_config" key="xdpmode"
>>>> +              type='{"type": "string",
>>>> +                     "enum": ["set", ["skb", "drv"]]}'>
>>>> +        <p>
>>>> +          Specifies the operational mode of the XDP program.
>>>> +          If "drv", the XDP program is loaded into the device driver with
>>>> +          zero-copy RX and TX enabled. This mode requires a device driver
>>>> +          with AF_XDP support and has the best performance.
>>>> +          If "skb", the XDP program uses generic XDP mode in the kernel,
>>>> +          with extra data copying between userspace and kernel. No device
>>>> +          driver support is needed. Note that this option applies to the
>>>> +          afxdp netdev type only.
>>>> +          Defaults to "skb" mode.
>>>> +        </p>
>>>> +      </column>
>>>> +
>>>>        <column name="options" key="vhost-server-path"
>>>>                type='{"type": "string"}'>
>>>>          <p>
>>>> --
>>>> 2.7.4
Ben Pfaff June 11, 2019, 2:53 p.m. UTC | #7
On Tue, Jun 11, 2019 at 08:47:19AM +0200, Eelco Chaudron wrote:
> 
> 
> On 8 Jun 2019, at 6:48, William Tu wrote:
> 
> > > > > +  ethtool -L enp2s0 combined 1
> > > > > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > > > > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0
> > > > > type="afxdp"
> > > > > \
> > > > > +    options:n_rxq=1 options:xdpmode=drv \
> > > > > +    other_config:pmd-rxq-affinity="0:4"
> > 
> > Another feature I'm thinking about adding is a new option
> > for loading a custom XDP program.
> > 
> > For example:
> > ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> >     options:n_rxq=1 options:xdpmode=drv \
> >     options:xdp_prog=/path/to/xdp.o
> > 
> > If users do not specify the path, then libbpf's default program is used
> > (which forwards all packets to userspace).
> > 
> > If users want to use their own xdp object, then this option can load the
> > xdp object file from the path.
> 
> This might be useful, especially if you would like to do some experiments.

There could be a security risk here depending on how we sanitize the
path.
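
As a rough illustration of the kind of check this implies, and not a proposal for the actual option handling: one common approach is to canonicalize the user-supplied path first (for example with realpath(3)) and then accept it only if it lies inside a directory the administrator trusts. The sketch below covers only the second step. The function name and the example directory are hypothetical, and it assumes its input is already an absolute, canonical path with no "." or ".." components.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if 'resolved' names something strictly inside 'dir'.
 * Assumes 'resolved' is already canonical (e.g. output of realpath()). */
static bool
demo_path_is_under(const char *resolved, const char *dir)
{
    size_t n = strlen(dir);

    /* Ignore trailing slashes on the directory, but keep a lone "/". */
    while (n > 1 && dir[n - 1] == '/') {
        n--;
    }

    if (n == 1 && dir[0] == '/') {
        /* Everything absolute except "/" itself is under the root. */
        return resolved[0] == '/' && resolved[1] != '\0';
    }

    if (strncmp(resolved, dir, n) != 0) {
        return false;
    }

    /* The next character must be a separator, so that "/a/ovs-evil/x"
     * is not accepted for the directory "/a/ovs". */
    return resolved[n] == '/' && resolved[n + 1] != '\0';
}
```

A plain string-prefix comparison without the separator check would accept sibling directories that merely share a name prefix, which is the usual pitfall with this kind of test.
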
William Tu June 11, 2019, 3:02 p.m. UTC | #8
Hi Eelco,

Thanks for the trace.

On Tue, Jun 11, 2019 at 6:52 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> Hi William,
>
> Here are some more details, this is a port to port test (same port in as
> out) using the following rule:
>
>    ovs-ofctl add-flow ovs_pvp_br0 "in_port=eno1,action=IN_PORT"
>
> Sent packets at wire speed, and it crashed…
>
> (gdb) bt
> #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
> #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
> #3  umem_elem_push_n (umemp=0x24cc790, n=32,
> addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
> #4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at
> lib/netdev-afxdp.c:736
> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0,
> batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at
> lib/netdev-afxdp.c:763
> #6  0x0000000000908041 in netdev_send (netdev=<optimized out>,
> qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80,
> concurrent_txq=concurrent_txq@entry=true)
>      at lib/netdev.c:800
> #7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port
> (pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at
> lib/dpif-netdev.c:4187
> #8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets
> (pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at
> lib/dpif-netdev.c:4227
> #9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets
> (force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010,
> rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
> #11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5449
> #12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:352
> #13 0x00007fbc6b0a12de in start_thread () from /lib64/libpthread.so.0
> #14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6
>
> After this crash, systemd restarted OVS, and it crashed again (guess
> traffic was still flowing for a bit with the NORMAL rule installed):
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> qfill=0x0) at lib/netdev-afxdp.c:583
> 583         rx->fd = xsk_socket__fd(xsk->xsk);
> [Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]
>
> (gdb) bt
> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> qfill=0x0) at lib/netdev-afxdp.c:583
> #1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>,
> batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at
> lib/netdev.c:710
> #2  0x00000000008dd1d3 in dp_netdev_process_rxq_port
> (pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at
> lib/dpif-netdev.c:4257
> #3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5449
> #4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:352
> #5  0x00007f47229732de in start_thread () from /lib64/libpthread.so.0
> #6  0x00007f4722118a63 in clone () from /lib64/libc.so.6
>
> I did not further investigate, but it should be easy to replicate. This
> is the same setup that worked fine with the v8 patchset for port to
> port.
> Next step was to verify PVP was fixed, but could not get there…
> Cheers,

I'm not able to reproduce it on my testbed using i40e, so I will try
using ixgbe today.

btw, if you try skb-mode, does the crash still show up?
skb-mode is much slower, though, so it might not trigger the issue.
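
If it helps, switching an existing port over should look roughly like
this. It is only a sketch: it assumes the port is eno1, as in your
earlier log, and that it is already attached with type=afxdp, so adjust
the name for your setup:

```shell
# Assumption: eno1 is the afxdp port under test; substitute your own.
# Switch the port from driver mode to skb mode.
ovs-vsctl set interface eno1 options:xdpmode=skb

# Read back the options column to confirm the change took effect.
ovs-vsctl get interface eno1 options
```

The option name and values (xdpmode=drv or skb) are the ones described
in the afxdp.rst part of the patch.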

Regards,
William

>
> Eelco
>
> On 8 Jun 2019, at 10:12, Eelco Chaudron wrote:
>
> > Hi William,
> >
> > This was still a draft email, and was not supposed to go out ;)
> >
> > My debug and build setup was a bit messed up and was having problems
> > running GDB… I was (I’m) planning to continue getting some debug
> > info on Tuesday after the public holiday here…
> >
> > But just to give you a heads up, it starts up fine with root access
> > but it crashes during a simple Port to Port run with wire-speed
> > traffic. Then it will run into a restart/crash loop.
> >
> > Will try to get you more details next week…
> >
> > Cheers,
> >
> > Eelco
> >
> >
> > On 7 Jun 2019, at 23:33, William Tu wrote:
> >
> >> Hi Eelco,
> >>
> >> Thanks for the testing.
> >>
> >> On Fri, Jun 7, 2019 at 8:43 AM Eelco Chaudron <echaudro@redhat.com>
> >> wrote:
> >>>
> >>> Hi William,
> >>>
> >>> No review or full test yet, just some observations…
> >>>
> >>> We run OVS as a non root user, which is causing OVS with XDP to
> >>> fail:
> >>
> >> Right, XDP requires using root privilege.
> >> I will add this in the documentation.
> >
> > Is this a hard requirement? As I do not remember running OVS as root
> > before…
> >
> >>>
> >>> 2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev:
> >>> Datapath supports ct_orig_tuple
> >>> 2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev:
> >>> Datapath supports ct_orig_tuple6
> >>> 2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on
> >>> numa_id:
> >>> 0, core id: 21 created.
> >>> 2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd
> >>> threads
> >>> on numa node 0
> >>> 2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
> >>> 2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1
> >>> in
> >>> DRV mode
> >>> 2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR:
> >>> setrlimit(RLIMIT_MEMLOCK): Operation not permitted
> >>
> >> This is due to not having root privilege, so not able to lock the
> >> memory
> >> for device driver to directly DMA packet buffer into userspace.
> >>
> >> Can you try using root?
> >>
> >> Regards,
> >> William
> >>
> >>> 2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all
> >>> configure queue 0 mode DRV
> >>> 2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create
> >>> failed (Operation not permitted) mode: DRV qid: 0
> >>> 2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create
> >>> AF_XDP
> >>> socket on queue 0
> >>> 2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
> >>> 2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1
> >>> reconfig fails
> >>> 2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set
> >>> interface
> >>> eno1 new configuration
> >>>
> >>> However when configuring this after startup it’s fine, but trying
> >>> to
> >>> restart OVS with this configuration results in a system core…
> >>>
> >>>
> >>>
> >>>
> >>> On 5 Jun 2019, at 22:47, William Tu wrote:
> >>>
> >>>> The patch introduces experimental AF_XDP support for OVS netdev.
> >>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
> >>>> socket
> >>>> type built upon the eBPF and XDP technology.  It aims to have
> >>>> comparable
> >>>> performance to DPDK but cooperate better with existing kernel's
> >>>> networking
> >>>> stack.  An AF_XDP socket receives and sends packets from an
> >>>> eBPF/XDP
> >>>> program
> >>>> attached to the netdev, by-passing a couple of Linux kernel's
> >>>> subsystems.
> >>>> As a result, AF_XDP socket shows much better performance than
> >>>> AF_PACKET.
> >>>> For more details about AF_XDP, please see linux kernel's
> >>>> Documentation/networking/af_xdp.rst. Note that by default, this
> >>>> feature is
> >>>> not compiled in.
> >>>>
> >>>> Signed-off-by: William Tu <u9012063@gmail.com>
> >>>> ---
> >>>> v1->v2:
> >>>> - add a list to maintain unused umem elements
> >>>> - remove copy from rx umem to ovs internal buffer
> >>>> - use hugetlb to reduce misses (not much difference)
> >>>> - use pmd mode netdev in OVS (huge performance improve)
> >>>> - remove malloc dp_packet, instead put dp_packet in umem
> >>>>
> >>>> v2->v3:
> >>>> - rebase on the OVS master, 7ab4b0653784
> >>>>   ("configure: Check for more specific function to pull in pthread
> >>>> library.")
> >>>> - remove the dependency on libbpf and dpif-bpf.
> >>>>   instead, use the built-in XDP_ATTACH feature.
> >>>> - data structure optimizations for better performance, see[1]
> >>>> - more test cases support
> >>>> v3:
> >>>> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> >>>>
> >>>> v3->v4:
> >>>> - Use AF_XDP API provided by libbpf
> >>>> - Remove the dependency on XDP_ATTACH kernel patch set
> >>>> - Add documentation, bpf.rst
> >>>>
> >>>> v4->v5:
> >>>> - rebase to master
> >>>> - remove rfc, squash all into a single patch
> >>>> - add --enable-afxdp, so by default, AF_XDP is not compiled
> >>>> - add options: xdpmode=drv,skb
> >>>> - add multiple queue and multiple PMD support, with options: n_rxq
> >>>> - improve documentation, rename bpf.rst to af_xdp.rst
> >>>>
> >>>> v5->v6
> >>>> - rebase to master, commit 0cdd5b13de91b98
> >>>> - address errors from sparse and clang
> >>>> - pass travis-ci test
> >>>> - address feedback from Ben
> >>>> - fix issues reported by 0-day robot
> >>>> - improved documentation
> >>>>
> >>>> v6-v7
> >>>> - rebase to master, commit abf11558c1515bf3b1
> >>>> - address feedback from Ilya, Ben, and Eelco, see:
> >>>>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> >>>> - add XDP mode change, implement get/set_config, reconfigure
> >>>> - Fix reconfiguration/crash issue caused by libbpf, see patch:
> >>>>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> >>>> - perf optimization for batching umem_push/pop
> >>>> - perf optimization for batching kick_tx
> >>>> - test build with dpdk
> >>>> - fix/refactor atomic operation
> >>>> - make AF_XDP x86 specific, otherwise fail at build time
> >>>> - lots of code refactoring
> >>>> - add PVP setup in documentation
> >>>>
> >>>> v7-v8:
> >>>> - Address feedback from Ilya at:
> >>>>   https://patchwork.ozlabs.org/patch/1095019/
> >>>> - add netdev-linux-private.h
> >>>> - fix afxdp reconfigure issue
> >>>> - sort include headers
> >>>> - remove unnecessary OVS_UNUSED
> >>>> - coding style fixes
> >>>> - error case handling and memory leak
> >>>>
> >>>> v8-v9:
> >>>> - rebase to master 180bbbed3a3867d52
> >>>> - Address review feedback from Ben, Ilya and Eelco, at:
> >>>>   https://patchwork.ozlabs.org/patch/1097740/
> >>>> - == From Ilya ==
> >>>> - Optimize the reconfiguration logic
> >>>> - Implement .rxq_recv and .send for afxdp
> >>>> - Remove system-afxdp-traffic.at, reuse existing code
> >>>> - Use Ilya's rdtsc code
> >>>> - remove --disable-system
> >>>> - == From Eelco ==
> >>>> - Fix bug when remove br0,
> >>>> util(revalidator49)|EMER|lib/poll-loop.c:111:
> >>>>   assertion !fd != !wevent failed
> >>>> - Fix bug and use default value from libbpf, ex:
> >>>> XSK_RING_PROD__DEFAULT...
> >>>> - Clear xdp program when receive signal, ctrl+c
> >>>> - Add options to vswitch.xml, set xdpmode default to skb-mode
> >>>> - No support for ARM and PPC, now x86_64 only
> >>>> - remove redundant header includes and function/macro definitions
> >>>> - remove some ifdef HAVE_AF_XDP
> >>>> - == From others/both about afxdp rx and tx ==
> >>>> - Several umem push/pop error handling improvement/fixes
> >>>> - add lock to address concurrent_txq case
> >>>> - improve error handling
> >>>> - add stats
> >>>> - Things that are not done yet
> >>>> - MTU limitation
> >>>> - n_txq_desc/n_rxq_desc option.
> >>>>
> >>>> v9-v10
> >>>> - remove x86_64 limitation, suggested by Ben and Eelco
> >>>> - add xmalloc_pagealign, free_pagealign
> >>>> - minor refactor
> >>>>
> >>>> v10-v11
> >>>> - address feedback from Ilya at
> >>>>   https://patchwork.ozlabs.org/patch/1106495/
> >>>> - fix typos, and some refactoring
> >>>> - refactor existing code and introduce xmalloc pagealign
> >>>> - fix a couple of error handling case
> >>>> - allocate per-txq lock
> >>>> - dynamic allocate xsk array
> >>>> - fix cycle_counter_update() for non-x86/non-linux case
> >>>> ---
> >>>>  Documentation/automake.mk             |   1 +
> >>>>  Documentation/index.rst               |   1 +
> >>>>  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
> >>>>  Documentation/intro/install/index.rst |   1 +
> >>>>  acinclude.m4                          |  35 ++
> >>>>  configure.ac                          |   1 +
> >>>>  lib/automake.mk                       |  14 +
> >>>>  lib/dp-packet.c                       |  28 ++
> >>>>  lib/dp-packet.h                       |  18 +-
> >>>>  lib/dpif-netdev-perf.h                |  26 +
> >>>>  lib/netdev-afxdp.c                    | 891
> >>>> ++++++++++++++++++++++++++++++++++
> >>>>  lib/netdev-afxdp.h                    |  74 +++
> >>>>  lib/netdev-linux-private.h            | 139 ++++++
> >>>>  lib/netdev-linux.c                    | 121 ++---
> >>>>  lib/netdev-provider.h                 |   3 +
> >>>>  lib/netdev.c                          |  11 +
> >>>>  lib/spinlock.h                        |  70 +++
> >>>>  lib/util.c                            |  92 +++-
> >>>>  lib/util.h                            |   5 +
> >>>>  lib/xdpsock.c                         | 170 +++++++
> >>>>  lib/xdpsock.h                         | 101 ++++
> >>>>  tests/automake.mk                     |  16 +
> >>>>  tests/system-afxdp-macros.at          |  20 +
> >>>>  tests/system-afxdp-testsuite.at       |  26 +
> >>>>  vswitchd/vswitch.xml                  |  15 +
> >>>>  25 files changed, 2204 insertions(+), 108 deletions(-)
> >>>>  create mode 100644 Documentation/intro/install/afxdp.rst
> >>>>  create mode 100644 lib/netdev-afxdp.c
> >>>>  create mode 100644 lib/netdev-afxdp.h
> >>>>  create mode 100644 lib/netdev-linux-private.h
> >>>>  create mode 100644 lib/spinlock.h
> >>>>  create mode 100644 lib/xdpsock.c
> >>>>  create mode 100644 lib/xdpsock.h
> >>>>  create mode 100644 tests/system-afxdp-macros.at
> >>>>  create mode 100644 tests/system-afxdp-testsuite.at
> >>>>
> >>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> >>>> index 082438e09a33..11cc59efc881 100644
> >>>> --- a/Documentation/automake.mk
> >>>> +++ b/Documentation/automake.mk
> >>>> @@ -10,6 +10,7 @@ DOC_SOURCE = \
> >>>>       Documentation/intro/why-ovs.rst \
> >>>>       Documentation/intro/install/index.rst \
> >>>>       Documentation/intro/install/bash-completion.rst \
> >>>> +     Documentation/intro/install/afxdp.rst \
> >>>>       Documentation/intro/install/debian.rst \
> >>>>       Documentation/intro/install/documentation.rst \
> >>>>       Documentation/intro/install/distributions.rst \
> >>>> diff --git a/Documentation/index.rst b/Documentation/index.rst
> >>>> index 46261235c732..aa9e7c49f179 100644
> >>>> --- a/Documentation/index.rst
> >>>> +++ b/Documentation/index.rst
> >>>> @@ -59,6 +59,7 @@ vSwitch? Start here.
> >>>>    :doc:`intro/install/windows` |
> >>>>    :doc:`intro/install/xenserver` |
> >>>>    :doc:`intro/install/dpdk` |
> >>>> +  :doc:`intro/install/afxdp` |
> >>>>    :doc:`Installation FAQs <faq/releases>`
> >>>>
> >>>>  - **Tutorials:** :doc:`tutorials/faucet` |
> >>>> diff --git a/Documentation/intro/install/afxdp.rst
> >>>> b/Documentation/intro/install/afxdp.rst
> >>>> new file mode 100644
> >>>> index 000000000000..554964396353
> >>>> --- /dev/null
> >>>> +++ b/Documentation/intro/install/afxdp.rst
> >>>> @@ -0,0 +1,433 @@
> >>>> +..
> >>>> +      Licensed under the Apache License, Version 2.0 (the
> >>>> "License");
> >>>> you may
> >>>> +      not use this file except in compliance with the License. You
> >>>> may obtain
> >>>> +      a copy of the License at
> >>>> +
> >>>> +          http://www.apache.org/licenses/LICENSE-2.0
> >>>> +
> >>>> +      Unless required by applicable law or agreed to in writing,
> >>>> software
> >>>> +      distributed under the License is distributed on an "AS IS"
> >>>> BASIS, WITHOUT
> >>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> >>>> implied. See the
> >>>> +      License for the specific language governing permissions and
> >>>> limitations
> >>>> +      under the License.
> >>>> +
> >>>> +      Convention for heading levels in Open vSwitch documentation:
> >>>> +
> >>>> +      =======  Heading 0 (reserved for the title in a document)
> >>>> +      -------  Heading 1
> >>>> +      ~~~~~~~  Heading 2
> >>>> +      +++++++  Heading 3
> >>>> +      '''''''  Heading 4
> >>>> +
> >>>> +      Avoid deeper levels because they do not render well.
> >>>> +
> >>>> +
> >>>> +========================
> >>>> +Open vSwitch with AF_XDP
> >>>> +========================
> >>>> +
> >>>> +This document describes how to build and install Open vSwitch
> >>>> using
> >>>> +AF_XDP netdev.
> >>>> +
> >>>> +.. warning::
> >>>> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> >>>> +  and it is not compiled in by default.
> >>>> +
> >>>> +
> >>>> +Introduction
> >>>> +------------
> >>>> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux
> >>>> socket type
> >>>> +built upon the eBPF and XDP technology.  It aims to have
> >>>> comparable
> >>>> +performance to DPDK but cooperate better with existing kernel's
> >>>> networking
> >>>> +stack.  An AF_XDP socket receives and sends packets from an
> >>>> eBPF/XDP
> >>>> program
> >>>> +attached to the netdev, by-passing a couple of Linux kernel's
> >>>> subsystems.
> >>>> +As a result, AF_XDP socket shows much better performance than
> >>>> AF_PACKET.
> >>>> +For more details about AF_XDP, please see linux kernel's
> >>>> +Documentation/networking/af_xdp.rst
> >>>> +
> >>>> +
> >>>> +AF_XDP Netdev
> >>>> +-------------
> >>>> +OVS has a couple of netdev types, i.e., system, tap, or
> >>>> +dpdk.  The AF_XDP feature adds a new netdev type called
> >>>> +"afxdp", and implements its configuration, packet reception,
> >>>> +and transmit functions.  Since the AF_XDP socket, called xsk,
> >>>> +operates in userspace, once ovs-vswitchd receives packets
> >>>> +from xsk, the afxdp netdev re-uses the existing userspace
> >>>> +dpif-netdev datapath.  As a result, most of the packet processing
> >>>> +happens in userspace instead of in the Linux kernel.
> >>>> +
> >>>> +::
> >>>> +
> >>>> +              |   +-------------------+
> >>>> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> >>>> +              |   +-------------------+
> >>>> +              |   |      ofproto      |<-->OpenFlow controllers
> >>>> +              |   +--------+-+--------+
> >>>> +              |   | netdev | |ofproto-|
> >>>> +    userspace |   +--------+ |  dpif  |
> >>>> +              |   | afxdp  | +--------+
> >>>> +              |   | netdev | |  dpif  |
> >>>> +              |   +---||---+ +--------+
> >>>> +              |       ||     |  dpif- |
> >>>> +              |       ||     | netdev |
> >>>> +              |_      ||     +--------+
> >>>> +                      ||
> >>>> +               _  +---||-----+--------+
> >>>> +              |   | AF_XDP prog +     |
> >>>> +       kernel |   |   xsk_map         |
> >>>> +              |_  +--------||---------+
> >>>> +                           ||
> >>>> +                        physical
> >>>> +                           NIC
> >>>> +
> >>>> +
> >>>> +Build requirements
> >>>> +------------------
> >>>> +
> >>>> +In addition to the requirements described in :doc:`general`,
> >>>> building
> >>>> Open
> >>>> +vSwitch with AF_XDP will require the following:
> >>>> +
> >>>> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> >>>> +
> >>>> +- Linux kernel XDP support, with the following options (required)
> >>>> +
> >>>> +  * CONFIG_BPF=y
> >>>> +
> >>>> +  * CONFIG_BPF_SYSCALL=y
> >>>> +
> >>>> +  * CONFIG_XDP_SOCKETS=y
> >>>> +
> >>>> +
> >>>> +- The following optional Kconfig options are also recommended, but
> >>>> not
> >>>> +  required:
> >>>> +
> >>>> +  * CONFIG_BPF_JIT=y (Performance)
> >>>> +
> >>>> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> >>>> +
> >>>> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> >>>> +
> >>>> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> >>>> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> >>>> +  This is an OVS-independent benchmark tool for AF_XDP.
> >>>> +  It makes sure your basic kernel requirements are met for AF_XDP.
> >>>> +
> >>>> +
> >>>> +Installing
> >>>> +----------
> >>>> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF
> >>>> support.
> >>>> +First, clone a recent version of Linux bpf-next tree::
> >>>> +
> >>>> +  git clone
> >>>> git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> >>>> +
> >>>> +Second, go into the Linux source directory and build libbpf in the
> >>>> tools
> >>>> +directory::
> >>>> +
> >>>> +  cd bpf-next/
> >>>> +  cd tools/lib/bpf/
> >>>> +  make && make install
> >>>> +  make install_headers
> >>>> +
> >>>> +.. note::
> >>>> +   Make sure xsk.h and bpf.h are installed in the system's include
> >>>> path,
> >>>> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> >>>> +
> >>>> +Make sure the libbpf.so is installed correctly::
> >>>> +
> >>>> +  ldconfig
> >>>> +  ldconfig -p | grep libbpf
> >>>> +
> >>>> +Third, ensure the standard OVS requirements are installed and
> >>>> +bootstrap/configure the package::
> >>>> +
> >>>> +  ./boot.sh && ./configure --enable-afxdp
> >>>> +
> >>>> +Finally, build and install OVS::
> >>>> +
> >>>> +  make && make install
> >>>> +
> >>>> +To kick start end-to-end autotesting::
> >>>> +
> >>>> +  uname -a # make sure having 5.0+ kernel
> >>>> +  make check-afxdp TESTSUITEFLAGS='1'
> >>>> +
> >>>> +If a test case fails, check the log at::
> >>>> +
> >>>> +  cat
> >>>> tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> >>>> +
> >>>> +
> >>>> +Setup AF_XDP netdev
> >>>> +-------------------
> >>>> +Before running OVS with AF_XDP, make sure the libbpf and libelf
> >>>> are
> >>>> +set-up right::
> >>>> +
> >>>> +  ldd vswitchd/ovs-vswitchd
> >>>> +
> >>>> +Open vSwitch should be started using userspace datapath as
> >>>> described
> >>>> +in :doc:`general`::
> >>>> +
> >>>> +  ovs-vswitchd ...
> >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> >>>> +
> >>>> +Make sure your device driver supports AF_XDP, and to use 1 PMD (on
> >>>> core 4)
> >>>> +on 1 queue (queue 0) device, configure these options:
> >>>> **pmd-cpu-mask,
> >>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
> >>>> "skb"::
> >>>> +
> >>>> +  ethtool -L enp2s0 combined 1
> >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0
> >>>> type="afxdp"
> >>>> \
> >>>> +    options:n_rxq=1 options:xdpmode=drv \
> >>>> +    other_config:pmd-rxq-affinity="0:4"
> >>>> +
> >>>> +Or, use 4 pmds/cores and 4 queues by doing::
> >>>> +
> >>>> +  ethtool -L enp2s0 combined 4
> >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0
> >>>> type="afxdp"
> >>>> \
> >>>> +    options:n_rxq=4 options:xdpmode=drv \
> >>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> >>>> +
> >>>> +.. note::
> >>>> +   pmd-rxq-affinity is optional. If not specified, system will
> >>>> auto-assign.
> >>>> +
> >>>> +To validate that the bridge has been successfully instantiated,
> >>>> run::
> >>>> +
> >>>> +  ovs-vsctl show
> >>>> +
> >>>> +Should show something like::
> >>>> +
> >>>> +  Port "ens802f0"
> >>>> +   Interface "ens802f0"
> >>>> +      type: afxdp
> >>>> +      options: {n_rxq="1", xdpmode=drv}
> >>>> +
> >>>> +Otherwise, enable debugging by::
> >>>> +
> >>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
> >>>> +
> >>>> +
> >>>> +References
> >>>> +----------
> >>>> +Most of the design details are described in the paper presented at
> >>>> +Linux Plumbers 2018, "Bringing the Power of eBPF to Open
> >>>> vSwitch"[1],
> >>>> +section 4, and slides[2][4].
> >>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
> >>>> introduction
> >>>> +about AF_XDP current and future work.
> >>>> +
> >>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> >>>> +
> >>>> +[2]
> >>>> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> >>>> +
> >>>> +[3]
> >>>> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> >>>> +
> >>>> +[4]
> >>>> https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> >>>> +
> >>>> +
> >>>> +Performance Tuning
> >>>> +------------------
> >>>> +The name of the game is to keep your CPU running in userspace,
> >>>> allowing PMD
> >>>> +to keep polling the AF_XDP queues without any interferences from
> >>>> kernel.
> >>>> +
> >>>> +#. Make sure everything is in the same NUMA node (memory used by
> >>>> AF_XDP, pmd
> >>>> +   running cores, device plug-in slot)
> >>>> +
> >>>> +#. Isolate your CPU by setting isolcpus in the grub configuration.
> >>>> +
> >>>> +#. IRQs should not be assigned to a pmd running core.
> >>>> +
> >>>> +#. The Spectre and Meltdown fixes increase the overhead of system
> >>>> calls.
> >>>> +
> >>>> +
> >>>> +Debugging performance issue
> >>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> +While running the traffic, use linux perf tool to see where your
> >>>> cpu
> >>>> +spends its cycles::
> >>>> +
> >>>> +  cd bpf-next/tools/perf
> >>>> +  make
> >>>> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> >>>> +  ./perf report
> >>>> +
> >>>> +Measure your system call rate by doing::
> >>>> +
> >>>> +  pstree -p `pidof ovs-vswitchd`
> >>>> +  strace -c -p <your pmd's PID>
> >>>> +
> >>>> +Or, use OVS pmd tool::
> >>>> +
> >>>> +  ovs-appctl dpif-netdev/pmd-stats-show
> >>>> +
> >>>> +
> >>>> +Example Script
> >>>> +--------------
> >>>> +
> >>>> +Below is a script using namespaces and veth peer::
> >>>> +
> >>>> +  #!/bin/bash
> >>>> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif
> >>>> -vunixctl
> >>>> \
> >>>> +    --disable-system --detach
> >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> >>>> +
> >>>> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14
> >>>> \
> >>>> +    fail-mode=secure datapath_type=netdev
> >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> >>>> +
> >>>> +  ip netns add at_ns0
> >>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
> >>>> +
> >>>> +  ip link add p0 type veth peer name afxdp-p0
> >>>> +  ip link set p0 netns at_ns0
> >>>> +  ip link set dev afxdp-p0 up
> >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> >>>> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> >>>> +
> >>>> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> >>>> +  ip addr add "10.1.1.1/24" dev p0
> >>>> +  ip link set dev p0 up
> >>>> +  NS_EXEC_HEREDOC
> >>>> +
> >>>> +  ip netns add at_ns1
> >>>> +  ip link add p1 type veth peer name afxdp-p1
> >>>> +  ip link set p1 netns at_ns1
> >>>> +  ip link set dev afxdp-p1 up
> >>>> +
> >>>> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> >>>> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> >>>> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> >>>> +  ip addr add "10.1.1.2/24" dev p1
> >>>> +  ip link set dev p1 up
> >>>> +  NS_EXEC_HEREDOC
> >>>> +
> >>>> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> >>>> +
> >>>> +
> >>>> +Limitations/Known Issues
> >>>> +------------------------
> >>>> +#. Device's numa ID is always 0, need a way to find numa id from a
> >>>> netdev.
> >>>> +#. No QoS support because AF_XDP netdev by-passes the Linux TC
> >>>> layer. A
> >>>> possible
> >>>> +   work-around is to use OpenFlow meter action.
> >>>> +#. Adding an AF_XDP device to a bridge, removing it, and adding it again will
> >>>> fail.
> >>>> +#. Most of the tests are done using i40e single port. Multiple
> >>>> ports
> >>>> and
> >>>> +   the ixgbe driver also need to be tested.
> >>>> +#. No latency test result (TODO items)
> >>>> +
> >>>> +
> >>>> +PVP using tap device
> >>>> +--------------------
> >>>> +Assume you have enp2s0 as physical nic, and a tap device connected
> >>>> to
> >>>> VM.
> >>>> +First, start OVS, then add physical port::
> >>>> +
> >>>> +  ethtool -L enp2s0 combined 1
> >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0
> >>>> type="afxdp"
> >>>> \
> >>>> +    options:n_rxq=1 options:xdpmode=drv \
> >>>> +    other_config:pmd-rxq-affinity="0:4"
> >>>> +
> >>>> +Start a VM with virtio and tap device::
> >>>> +
> >>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> >>>> +    -m 4096 \
> >>>> +    -cpu host,+x2apic -enable-kvm \
> >>>> +    -device
> >>>> virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> >>>> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> >>>> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> >>>> +    -object memory-backend-file,id=mem,size=4096M,\
> >>>> +      mem-path=/dev/hugepages,share=on \
> >>>> +    -numa node,memdev=mem -mem-prealloc -smp 2
> >>>> +
> >>>> +Create OpenFlow rules::
> >>>> +
> >>>> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
> >>>> +  ovs-ofctl del-flows br0
> >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> >>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> >>>> +
> >>>> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> >>>> +
> >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> >>>> +
> >>>> +The performance number I got is around 1.6Mpps.
> >>>> +This is due to using the kernel's tap interface, which requires
> >>>> copying
> >>>> +packet into kernel from the umem buffer in userspace.
> >>>> +
> >>>> +
> >>>> +PVP using vhostuser device
> >>>> +--------------------------
> >>>> +First, build OVS with DPDK and AFXDP::
> >>>> +
> >>>> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> >>>> +  make -j4 && make install
> >>>> +
> >>>> +Create a vhost-user port from OVS::
> >>>> +
> >>>> +  ovs-vsctl --no-wait set Open_vSwitch .
> >>>> other_config:dpdk-init=true
> >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> >>>> +    other_config:pmd-cpu-mask=0xfff
> >>>> +  ovs-vsctl add-port br0 vhost-user-1 \
> >>>> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> >>>> +
> >>>> +Start VM using vhost-user mode::
> >>>> +
> >>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> >>>> +   -m 4096 \
> >>>> +   -cpu host,+x2apic -enable-kvm \
> >>>> +   -chardev
> >>>> socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> >>>> +   -netdev
> >>>> type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> >>>> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> >>>> +      netdev=mynet1,mq=on,vectors=10 \
> >>>> +   -object memory-backend-file,id=mem,size=4096M,\
> >>>> +      mem-path=/dev/hugepages,share=on \
> >>>> +   -numa node,memdev=mem -mem-prealloc -smp 2
> >>>> +
> >>>> +Setup the OpenFlow rules::
> >>>> +
> >>>> +  ovs-ofctl del-flows br0
> >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0,
> >>>> actions=output:vhost-user-1"
> >>>> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1,
> >>>> actions=output:enp2s0"
> >>>> +
> >>>> +Inside the VM, use xdp_rxq_info to drop or bounce back the
> >>>> traffic::
> >>>> +
> >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> >>>> +
> >>>> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> >>>> +
> >>>> +
> >>>> +PCP container using veth
> >>>> +------------------------
> >>>> +Create namespace and veth peer devices::
> >>>> +
> >>>> +  ip netns add at_ns0
> >>>> +  ip link add p0 type veth peer name afxdp-p0
> >>>> +  ip link set p0 netns at_ns0
> >>>> +  ip link set dev afxdp-p0 up
> >>>> +  ip netns exec at_ns0 ip link set dev p0 up
> >>>> +
> >>>> +Attach the veth port to br0 (linux kernel mode)::
> >>>> +
> >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> >>>> +    set interface afxdp-p0 options:n_rxq=1
> >>>> +
> >>>> +Or, use AF_XDP with skb mode::
> >>>> +
> >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> >>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
> >>>> options:xdpmode=skb
> >>>> +
> >>>> +Setup the OpenFlow rules::
> >>>> +
> >>>> +  ovs-ofctl del-flows br0
> >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> >>>> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> >>>> +
> >>>> +In the namespace, drop or bounce back the packet::
> >>>> +
> >>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> >>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> >>>> +
> >>>> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> >>>> +
> >>>> +
> >>>> +Bug Reporting
> >>>> +-------------
> >>>> +
> >>>> +Please report problems to dev@openvswitch.org.
> >>>> diff --git a/Documentation/intro/install/index.rst
> >>>> b/Documentation/intro/install/index.rst
> >>>> index 3193c736cf17..c27a9c9d16ff 100644
> >>>> --- a/Documentation/intro/install/index.rst
> >>>> +++ b/Documentation/intro/install/index.rst
> >>>> @@ -45,6 +45,7 @@ Installation from Source
> >>>>     xenserver
> >>>>     userspace
> >>>>     dpdk
> >>>> +   afxdp
> >>>>
> >>>>  Installation from Packages
> >>>>  --------------------------
> >>>> diff --git a/acinclude.m4 b/acinclude.m4
> >>>> index cf9cc8b8b0de..721653ab0ec0 100644
> >>>> --- a/acinclude.m4
> >>>> +++ b/acinclude.m4
> >>>> @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
> >>>>    ])
> >>>>  ])
> >>>>
> >>>> +dnl OVS_CHECK_LINUX_AF_XDP
> >>>> +dnl
> >>>> +dnl Check both Linux kernel AF_XDP and libbpf support
> >>>> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> >>>> +  AC_ARG_ENABLE([afxdp],
> >>>> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP
> >>>> support])],
> >>>> +                [], [enable_afxdp=no])
> >>>> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> >>>> +  if test "$enable_afxdp" != yes; then
> >>>> +    AC_MSG_RESULT([no])
> >>>> +    AF_XDP_ENABLE=false
> >>>> +  else
> >>>> +    AC_MSG_RESULT([yes])
> >>>> +    AF_XDP_ENABLE=true
> >>>> +
> >>>> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> >>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP
> >>>> support])])
> >>>> +
> >>>> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> >>>> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP
> >>>> support])])
> >>>> +
> >>>> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> >>>> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP
> >>>> support])])
> >>>> +
> >>>> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> >>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP
> >>>> support])])
> >>>> +
> >>>> +    AC_DEFINE([HAVE_AF_XDP], [1],
> >>>> +              [Define to 1 if AF_XDP support is available and
> >>>> enabled.])
> >>>> +    LIBBPF_LDADD=" -lbpf -lelf"
> >>>> +    AC_SUBST([LIBBPF_LDADD])
> >>>> +  fi
> >>>> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> >>>> +])
> >>>> +
> >>>>  dnl OVS_CHECK_DPDK
> >>>>  dnl
> >>>>  dnl Configure DPDK source tree
> >>>> diff --git a/configure.ac b/configure.ac
> >>>> index 2dbe9a9178e3..9e23e1c6958c 100644
> >>>> --- a/configure.ac
> >>>> +++ b/configure.ac
> >>>> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
> >>>>  OVS_CHECK_DOT
> >>>>  OVS_CHECK_IF_DL
> >>>>  OVS_CHECK_STRTOK_R
> >>>> +OVS_CHECK_LINUX_AF_XDP
> >>>>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
> >>>>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
> >>>>    [], [], [[#include <sys/stat.h>]])
> >>>> diff --git a/lib/automake.mk b/lib/automake.mk
> >>>> index cc5dccf39d6b..b31e28f6e1f5 100644
> >>>> --- a/lib/automake.mk
> >>>> +++ b/lib/automake.mk
> >>>> @@ -14,6 +14,10 @@ if WIN32
> >>>>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
> >>>>  endif
> >>>>
> >>>> +if HAVE_AF_XDP
> >>>> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> >>>> +endif
> >>>> +
> >>>>  lib_libopenvswitch_la_LDFLAGS = \
> >>>>          $(OVS_LTINFO) \
> >>>>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> >>>> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
> >>>>       lib/if-notifier.h \
> >>>>       lib/netdev-linux.c \
> >>>>       lib/netdev-linux.h \
> >>>> +     lib/netdev-linux-private.h \
> >>>>       lib/netdev-tc-offloads.c \
> >>>>       lib/netdev-tc-offloads.h \
> >>>>       lib/netlink-conntrack.c \
> >>>> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
> >>>>       lib/tc.h
> >>>>  endif
> >>>>
> >>>> +if HAVE_AF_XDP
> >>>> +lib_libopenvswitch_la_SOURCES += \
> >>>> +     lib/xdpsock.c \
> >>>> +     lib/xdpsock.h \
> >>>> +     lib/netdev-afxdp.c \
> >>>> +     lib/netdev-afxdp.h \
> >>>> +     lib/spinlock.h
> >>>> +endif
> >>>> +
> >>>>  if DPDK_NETDEV
> >>>>  lib_libopenvswitch_la_SOURCES += \
> >>>>       lib/dpdk.c \
> >>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> >>>> index 0976a35e758b..e6a7947076b4 100644
> >>>> --- a/lib/dp-packet.c
> >>>> +++ b/lib/dp-packet.c
> >>>> @@ -19,6 +19,7 @@
> >>>>  #include <string.h>
> >>>>
> >>>>  #include "dp-packet.h"
> >>>> +#include "netdev-afxdp.h"
> >>>>  #include "netdev-dpdk.h"
> >>>>  #include "openvswitch/dynamic-string.h"
> >>>>  #include "util.h"
> >>>> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
> >>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> >>>>  }
> >>>>
> >>>> +#if HAVE_AF_XDP
> >>>> +/* Initialize 'b' as an empty dp_packet that contains
> >>>> + * memory starting at AF_XDP umem base.
> >>>> + */
> >>>> +void
> >>>> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> >>>> +{
> >>>> +    dp_packet_set_base(b, base);
> >>>> +    dp_packet_set_data(b, base);
> >>>> +    dp_packet_set_size(b, 0);
> >>>> +
> >>>> +    dp_packet_set_allocated(b, allocated);
> >>>> +    b->source = DPBUF_AFXDP;
> >>>> +    dp_packet_reset_offsets(b);
> >>>> +    pkt_metadata_init(&b->md, 0);
> >>>> +    dp_packet_reset_cutlen(b);
> >>>> +    dp_packet_reset_offload(b);
> >>>> +    b->packet_type = htonl(PT_ETH);
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
> >>>>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
> >>>>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> >>>> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
> >>>>               * created as a dp_packet */
> >>>>              free_dpdk_buf((struct dp_packet*) b);
> >>>>  #endif
> >>>> +        } else if (b->source == DPBUF_AFXDP) {
> >>>> +            free_afxdp_buf(b);
> >>>>          }
> >>>>      }
> >>>>  }
> >>>> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
> >>>>      case DPBUF_STACK:
> >>>>          OVS_NOT_REACHED();
> >>>>
> >>>> +    case DPBUF_AFXDP:
> >>>> +        OVS_NOT_REACHED();
> >>>> +
> >>>>      case DPBUF_STUB:
> >>>>          b->source = DPBUF_MALLOC;
> >>>>          new_base = xmalloc(new_allocated);
> >>>> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
> >>>>  {
> >>>>      void *p;
> >>>>      ovs_assert(b->source != DPBUF_DPDK);
> >>>> +    ovs_assert(b->source != DPBUF_AFXDP);
> >>>>
> >>>>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
> >>>>          p = dp_packet_data(b);
> >>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> >>>> index a5e9ade1244a..e3438226e360 100644
> >>>> --- a/lib/dp-packet.h
> >>>> +++ b/lib/dp-packet.h
> >>>> @@ -25,6 +25,7 @@
> >>>>  #include <rte_mbuf.h>
> >>>>  #endif
> >>>>
> >>>> +#include "netdev-afxdp.h"
> >>>>  #include "netdev-dpdk.h"
> >>>>  #include "openvswitch/list.h"
> >>>>  #include "packets.h"
> >>>> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
> >>>>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
> >>>>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
> >>>>                                  */
> >>>> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
> >>>>  };
> >>>>
> >>>>  #define DP_PACKET_CONTEXT_SIZE 64
> >>>> @@ -89,6 +91,13 @@ struct dp_packet {
> >>>>      };
> >>>>  };
> >>>>
> >>>> +#if HAVE_AF_XDP
> >>>> +struct dp_packet_afxdp {
> >>>> +    struct umem_pool *mpool;
> >>>> +    struct dp_packet packet;
> >>>> +};
> >>>> +#endif
> >>>> +
> >>>>  static inline void *dp_packet_data(const struct dp_packet *);
> >>>>  static inline void dp_packet_set_data(struct dp_packet *, void *);
> >>>>  static inline void *dp_packet_base(const struct dp_packet *);
> >>>> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
> >>>>  void dp_packet_use(struct dp_packet *, void *, size_t);
> >>>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
> >>>>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> >>>> -
> >>>> +#if HAVE_AF_XDP
> >>>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> >>>> +#endif
> >>>>  void dp_packet_init_dpdk(struct dp_packet *);
> >>>>
> >>>>  void dp_packet_init(struct dp_packet *, size_t);
> >>>> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
> >>>>              return;
> >>>>          }
> >>>>
> >>>> +        if (b->source == DPBUF_AFXDP) {
> >>>> +            free_afxdp_buf(b);
> >>>> +            return;
> >>>> +        }
> >>>> +
> >>>>          dp_packet_uninit(b);
> >>>>          free(b);
> >>>>      }
> >>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> >>>> index 859c05613ddf..6b6dfda7db1c 100644
> >>>> --- a/lib/dpif-netdev-perf.h
> >>>> +++ b/lib/dpif-netdev-perf.h
> >>>> @@ -21,6 +21,7 @@
> >>>>  #include <stddef.h>
> >>>>  #include <stdint.h>
> >>>>  #include <string.h>
> >>>> +#include <time.h>
> >>>>  #include <math.h>
> >>>>
> >>>>  #ifdef DPDK_NETDEV
> >>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
> >>>>      char *log_reason;
> >>>>  };
> >>>>
> >>>> +#ifdef __linux__
> >>>> +static inline uint64_t
> >>>> +rdtsc_syscall(struct pmd_perf_stats *s)
> >>>> +{
> >>>> +    struct timespec val;
> >>>> +    uint64_t v;
> >>>> +
> >>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> >>>> +       return s->last_tsc;
> >>>> +    }
> >>>> +
> >>>> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> >>>> +    v += (uint64_t) val.tv_nsec;
> >>>> +
> >>>> +    return s->last_tsc = v;
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> >>>>   * These functions are intended to be invoked in the context of pmd threads. */
> >>>>
> >>>> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
> >>>>  {
> >>>>  #ifdef DPDK_NETDEV
> >>>>      return s->last_tsc = rte_get_tsc_cycles();
> >>>> +#elif !defined(_MSC_VER) && defined(__x86_64__)
> >>>> +    uint32_t h, l;
> >>>> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> >>>> +
> >>>> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> >>>> +#elif defined(__linux__)
> >>>> +    return rdtsc_syscall(s);
> >>>>  #else
> >>>>      return s->last_tsc = 0;
> >>>>  #endif
> >>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> >>>> new file mode 100644
> >>>> index 000000000000..a6543e8f5126
> >>>> --- /dev/null
> >>>> +++ b/lib/netdev-afxdp.c
> >>>> @@ -0,0 +1,891 @@
> >>>> +/*
> >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing, software
> >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> >>>> + * See the License for the specific language governing permissions and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +
> >>>> +#include <config.h>
> >>>> +
> >>>> +#include "netdev-linux-private.h"
> >>>> +#include "netdev-linux.h"
> >>>> +#include "netdev-afxdp.h"
> >>>> +
> >>>> +#include <errno.h>
> >>>> +#include <inttypes.h>
> >>>> +#include <linux/rtnetlink.h>
> >>>> +#include <linux/if_xdp.h>
> >>>> +#include <net/if.h>
> >>>> +#include <stdlib.h>
> >>>> +#include <sys/resource.h>
> >>>> +#include <sys/socket.h>
> >>>> +#include <sys/types.h>
> >>>> +#include <unistd.h>
> >>>> +
> >>>> +#include "dp-packet.h"
> >>>> +#include "dpif-netdev.h"
> >>>> +#include "openvswitch/dynamic-string.h"
> >>>> +#include "openvswitch/vlog.h"
> >>>> +#include "packets.h"
> >>>> +#include "socket-util.h"
> >>>> +#include "spinlock.h"
> >>>> +#include "util.h"
> >>>> +#include "xdpsock.h"
> >>>> +
> >>>> +#ifndef SOL_XDP
> >>>> +#define SOL_XDP 283
> >>>> +#endif
> >>>> +
> >>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> >>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >>>> +
> >>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> >>>> +#define UMEM2XPKT(base, i) \
> >>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> >>>> +                               i * sizeof(struct dp_packet_afxdp))
> >>>> +
> >>>> +static uint32_t prog_id;
> >>>> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> >>>> +                                             int mode);
> >>>> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> >>>> +static void xsk_destroy(struct xsk_socket_info *xsk);
> >>>> +static int xsk_configure_all(struct netdev *netdev);
> >>>> +static void xsk_destroy_all(struct netdev *netdev);
> >>>> +
> >>>> +static struct xsk_umem_info *
> >>>> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> >>>> +{
> >>>> +    struct xsk_umem_config uconfig OVS_UNUSED;
> >>>> +    struct xsk_umem_info *umem;
> >>>> +    int ret;
> >>>> +    int i;
> >>>> +
> >>>> +    umem = xcalloc(1, sizeof *umem);
> >>>> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> >>>> +                           NULL);
> >>>> +    if (ret) {
> >>>> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> >>>> +                 ovs_strerror(errno),
> >>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> >>>> +        free(umem);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    umem->buffer = buffer;
> >>>> +
> >>>> +    /* set-up umem pool */
> >>>> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> >>>> +        VLOG_ERR("umem_pool_init failed");
> >>>> +        if (xsk_umem__delete(umem->umem)) {
> >>>> +            VLOG_ERR("xsk_umem__delete failed");
> >>>> +        }
> >>>> +        free(umem);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> >>>> +        struct umem_elem *elem;
> >>>> +
> >>>> +        elem = ALIGNED_CAST(struct umem_elem *,
> >>>> +                            (char *)umem->buffer + i * FRAME_SIZE);
> >>>> +        umem_elem_push(&umem->mpool, elem);
> >>>> +    }
> >>>> +
> >>>> +    /* set-up metadata */
> >>>> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> >>>> +        VLOG_ERR("xpacket_pool_init failed");
> >>>> +        umem_pool_cleanup(&umem->mpool);
> >>>> +        if (xsk_umem__delete(umem->umem)) {
> >>>> +            VLOG_ERR("xsk_umem__delete failed");
> >>>> +        }
> >>>> +        free(umem);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> >>>> +              umem->xpool.array,
> >>>> +              (char *)umem->xpool.array +
> >>>> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> >>>> +
> >>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> >>>> +        struct dp_packet_afxdp *xpacket;
> >>>> +        struct dp_packet *packet;
> >>>> +
> >>>> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> >>>> +        xpacket->mpool = &umem->mpool;
> >>>> +
> >>>> +        packet = &xpacket->packet;
> >>>> +        packet->source = DPBUF_AFXDP;
> >>>> +    }
> >>>> +
> >>>> +    return umem;
> >>>> +}
> >>>> +
> >>>> +static struct xsk_socket_info *
> >>>> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> >>>> +                     uint32_t queue_id, int xdpmode)
> >>>> +{
> >>>> +    struct xsk_socket_config cfg;
> >>>> +    struct xsk_socket_info *xsk;
> >>>> +    char devname[IF_NAMESIZE];
> >>>> +    uint32_t idx = 0;
> >>>> +    int ret;
> >>>> +    int i;
> >>>> +
> >>>> +    xsk = xcalloc(1, sizeof(*xsk));
> >>>> +    xsk->umem = umem;
> >>>> +    cfg.rx_size = CONS_NUM_DESCS;
> >>>> +    cfg.tx_size = PROD_NUM_DESCS;
> >>>> +    cfg.libbpf_flags = 0;
> >>>> +
> >>>> +    if (xdpmode == XDP_ZEROCOPY) {
> >>>> +        cfg.bind_flags = XDP_ZEROCOPY;
> >>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> >>>> +    } else {
> >>>> +        cfg.bind_flags = XDP_COPY;
> >>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> >>>> +    }
> >>>> +
> >>>> +    if (if_indextoname(ifindex, devname) == NULL) {
> >>>> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> >>>> +                 ifindex, ovs_strerror(errno));
> >>>> +        free(xsk);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> >>>> +                             &xsk->rx, &xsk->tx, &cfg);
> >>>> +    if (ret) {
> >>>> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> >>>> +                 ovs_strerror(errno),
> >>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> >>>> +                 queue_id);
> >>>> +        free(xsk);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    /* Make sure the built-in AF_XDP program is loaded */
> >>>> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> >>>> +    if (ret) {
> >>>> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> >>>> +        xsk_socket__delete(xsk->xsk);
> >>>> +        free(xsk);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
> >>>> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> >>>> +                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
> >>>> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> >>>> +    }
> >>>> +
> >>>> +    for (i = 0;
> >>>> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
> >>>> +         i += FRAME_SIZE) {
> >>>> +        struct umem_elem *elem;
> >>>> +        uint64_t addr;
> >>>> +
> >>>> +        elem = umem_elem_pop(&xsk->umem->mpool);
> >>>> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> >>>> +
> >>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> >>>> +    }
> >>>> +
> >>>> +    xsk_ring_prod__submit(&xsk->umem->fq,
> >>>> +                          PROD_NUM_DESCS - BATCH_SIZE);
> >>>> +    return xsk;
> >>>> +}
> >>>> +
> >>>> +static struct xsk_socket_info *
> >>>> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> >>>> +{
> >>>> +    struct xsk_socket_info *xsk;
> >>>> +    struct xsk_umem_info *umem;
> >>>> +    void *bufs;
> >>>> +
> >>>> +    /* umem memory region */
> >>>> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> >>>> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> >>>> +
> >>>> +    /* create AF_XDP socket */
> >>>> +    umem = xsk_configure_umem(bufs,
> >>>> +                              NUM_FRAMES * FRAME_SIZE,
> >>>> +                              xdpmode);
> >>>> +    if (!umem) {
> >>>> +        free_pagealign(bufs);
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> >>>> +    if (!xsk) {
> >>>> +        /* clean up umem and xpacket pool */
> >>>> +        if (xsk_umem__delete(umem->umem)) {
> >>>> +            VLOG_ERR("xsk_umem__delete failed");
> >>>> +        }
> >>>> +        free_pagealign(bufs);
> >>>> +        umem_pool_cleanup(&umem->mpool);
> >>>> +        xpacket_pool_cleanup(&umem->xpool);
> >>>> +        free(umem);
> >>>> +    }
> >>>> +    return xsk;
> >>>> +}
> >>>> +
> >>>> +static int
> >>>> +xsk_configure_all(struct netdev *netdev)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    struct xsk_socket_info *xsk;
> >>>> +    int i, ifindex, n_rxq;
> >>>> +
> >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> >>>> +
> >>>> +    n_rxq = netdev_n_rxq(netdev);
> >>>> +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> >>>> +
> >>>> +    /* configure each queue */
> >>>> +    for (i = 0; i < n_rxq; i++) {
> >>>> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> >>>> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> >>>> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> >>>> +        if (!xsk) {
> >>>> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> >>>> +            dev->xsks[i] = NULL;
> >>>> +            goto err;
> >>>> +        }
> >>>> +        dev->xsks[i] = xsk;
> >>>> +        xsk->rx_dropped = 0;
> >>>> +        xsk->tx_dropped = 0;
> >>>> +    }
> >>>> +
> >>>> +    return 0;
> >>>> +
> >>>> +err:
> >>>> +    xsk_destroy_all(netdev);
> >>>> +    return EINVAL;
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +xsk_destroy(struct xsk_socket_info *xsk)
> >>>> +{
> >>>> +    struct xsk_umem *umem;
> >>>> +
> >>>> +    umem = xsk->umem->umem;
> >>>> +    xsk_socket__delete(xsk->xsk);
> >>>> +    if (xsk_umem__delete(umem)) {
> >>>> +        VLOG_ERR("xsk_umem__delete failed");
> >>>> +    }
> >>>> +
> >>>> +    /* free the packet buffer */
> >>>> +    free_pagealign(xsk->umem->buffer);
> >>>> +
> >>>> +    /* cleanup umem pool */
> >>>> +    umem_pool_cleanup(&xsk->umem->mpool);
> >>>> +
> >>>> +    /* cleanup metadata pool */
> >>>> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> >>>> +
> >>>> +    free(xsk->umem);
> >>>> +    free(xsk);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +xsk_destroy_all(struct netdev *netdev)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    int i, ifindex;
> >>>> +
> >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> >>>> +
> >>>> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> >>>> +        if (dev->xsks && dev->xsks[i]) {
> >>>> +            VLOG_INFO("destroy xsk[%d]", i);
> >>>> +            xsk_destroy(dev->xsks[i]);
> >>>> +            dev->xsks[i] = NULL;
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    VLOG_INFO("remove xdp program");
> >>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> >>>> +
> >>>> +    free(dev->xsks);
> >>>> +}
> >>>> +
> >>>> +static inline void OVS_UNUSED
> >>>> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> >>>> +    struct xdp_statistics stat;
> >>>> +    socklen_t optlen;
> >>>> +
> >>>> +    optlen = sizeof stat;
> >>>> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> >>>> +               &stat, &optlen) == 0);
> >>>> +
> >>>> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> >>>> +                stat.rx_dropped,
> >>>> +                stat.rx_invalid_descs,
> >>>> +                stat.tx_invalid_descs);
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> >>>> +                        char **errp OVS_UNUSED)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    const char *str_xdpmode;
> >>>> +    int xdpmode, new_n_rxq;
> >>>> +
> >>>> +    ovs_mutex_lock(&dev->mutex);
> >>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> >>>> +    if (new_n_rxq > MAX_XSKQ) {
> >>>> +        ovs_mutex_unlock(&dev->mutex);
> >>>> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> >>>> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> >>>> +        return EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> >>>> +    if (!strcasecmp(str_xdpmode, "drv")) {
> >>>> +        xdpmode = XDP_ZEROCOPY;
> >>>> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> >>>> +        xdpmode = XDP_COPY;
> >>>> +    } else {
> >>>> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> >>>> +                 netdev_get_name(netdev), str_xdpmode);
> >>>> +        ovs_mutex_unlock(&dev->mutex);
> >>>> +        return EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    if (dev->requested_n_rxq != new_n_rxq
> >>>> +        || dev->requested_xdpmode != xdpmode) {
> >>>> +        dev->requested_n_rxq = new_n_rxq;
> >>>> +        dev->requested_xdpmode = xdpmode;
> >>>> +        netdev_request_reconfigure(netdev);
> >>>> +    }
> >>>> +    ovs_mutex_unlock(&dev->mutex);
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +
> >>>> +    ovs_mutex_lock(&dev->mutex);
> >>>> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> >>>> +    smap_add_format(args, "xdpmode", "%s",
> >>>> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> >>>> +    ovs_mutex_unlock(&dev->mutex);
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    int n_txqs = netdev_n_rxq(netdev);
> >>>> +    int i;
> >>>> +
> >>>> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> >>>> +
> >>>> +    for (i = 0; i < n_txqs; i++) {
> >>>> +        ovs_spinlock_init(&dev->tx_locks[i]);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_reconfigure(struct netdev *netdev)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >>>> +    int err = 0;
> >>>> +
> >>>> +    ovs_mutex_lock(&dev->mutex);
> >>>> +
> >>>> +    if (netdev->n_rxq == dev->requested_n_rxq
> >>>> +        && dev->xdpmode == dev->requested_xdpmode) {
> >>>> +        goto out;
> >>>> +    }
> >>>> +
> >>>> +    xsk_destroy_all(netdev);
> >>>> +    free(dev->tx_locks);
> >>>> +
> >>>> +    netdev->n_rxq = dev->requested_n_rxq;
> >>>> +    netdev_afxdp_alloc_txq(netdev);
> >>>> +
> >>>> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> >>>> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> >>>> +        /* From SKB mode to DRV mode */
> >>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> >>>> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> >>>> +        dev->xdpmode = XDP_ZEROCOPY;
> >>>> +
> >>>> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> >>>> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> >>>> +                      ovs_strerror(errno));
> >>>> +        }
> >>>> +    } else {
> >>>> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> >>>> +        /* From DRV mode to SKB mode */
> >>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> >>>> +        dev->xdp_bind_flags = XDP_COPY;
> >>>> +        dev->xdpmode = XDP_COPY;
> >>>> +        /* TODO: set rlimit back to previous value
> >>>> +         * when no device is in DRV mode.
> >>>> +         */
> >>>> +    }
> >>>> +
> >>>> +    err = xsk_configure_all(netdev);
> >>>> +    if (err) {
> >>>> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> >>>> +    }
> >>>> +    netdev_change_seq_changed(netdev);
> >>>> +out:
> >>>> +    ovs_mutex_unlock(&dev->mutex);
> >>>> +    return err;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> >>>> +{
> >>>> +    /* FIXME: Get netdev's PCIe device ID, then find
> >>>> +     * its NUMA node id.
> >>>> +     */
> >>>> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> >>>> +              netdev_get_name(netdev));
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> >>>> +{
> >>>> +    uint32_t curr_prog_id = 0;
> >>>> +    uint32_t flags;
> >>>> +
> >>>> +    /* remove_xdp_program() */
> >>>> +    if (xdpmode == XDP_COPY) {
> >>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> >>>> +    } else {
> >>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> >>>> +    }
> >>>> +
> >>>> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> >>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> >>>> +    }
> >>>> +    if (prog_id == curr_prog_id) {
> >>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> >>>> +    } else if (!curr_prog_id) {
> >>>> +        VLOG_INFO("couldn't find a prog id on a given interface");
> >>>> +    } else {
> >>>> +        VLOG_INFO("program on interface changed, not removing");
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +signal_remove_xdp(struct netdev *netdev)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    int ifindex;
> >>>> +
> >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> >>>> +
> >>>> +    VLOG_WARN("force remove xdp program");
> >>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> >>>> +}
> >>>> +
> >>>> +static struct dp_packet_afxdp *
> >>>> +dp_packet_cast_afxdp(const struct dp_packet *d)
> >>>> +{
> >>>> +    ovs_assert(d->source == DPBUF_AFXDP);
> >>>> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +free_afxdp_buf(struct dp_packet *p)
> >>>> +{
> >>>> +    struct dp_packet_afxdp *xpacket;
> >>>> +    uintptr_t addr;
> >>>> +
> >>>> +    xpacket = dp_packet_cast_afxdp(p);
> >>>> +    if (xpacket->mpool) {
> >>>> +        void *base = dp_packet_base(p);
> >>>> +
> >>>> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> >>>> +        umem_elem_push(xpacket->mpool, (void *)addr);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> >>>> +{
> >>>> +    struct dp_packet_afxdp *xpacket = NULL;
> >>>> +    struct dp_packet *packet;
> >>>> +    void *elems[BATCH_SIZE];
> >>>> +    uintptr_t addr;
> >>>> +
> >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> >>>> +        xpacket = dp_packet_cast_afxdp(packet);
> >>>> +        if (xpacket->mpool) {
> >>>> +            void *base = dp_packet_base(packet);
> >>>> +
> >>>> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> >>>> +            elems[i] = (void *)addr;
> >>>> +        }
> >>>> +    }
> >>>> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> >>>> +    dp_packet_batch_init(batch);
> >>>> +}
> >>>> +
> >>>> +static inline void
> >>>> +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
> >>>> +{
> >>>> +    void *elems[BATCH_SIZE];
> >>>> +    int i;
> >>>> +
> >>>> +    for (i = 0; i < rcvd; i++) {
> >>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> >>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> >>>> +
> >>>> +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
> >>>> +    }
> >>>> +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> >>>> +
> >>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> >>>> +    xsk->rx_dropped += rcvd;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>>> +                      int *qfill)
> >>>> +{
> >>>> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >>>> +    struct netdev *netdev = rx->up.netdev;
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    struct umem_elem *elems[BATCH_SIZE];
> >>>> +    uint32_t idx_rx = 0, idx_fq = 0;
> >>>> +    struct xsk_socket_info *xsk;
> >>>> +    int qid = rxq_->queue_id;
> >>>> +    unsigned int rcvd, i;
> >>>> +    int ret = 0;
> >>>> +
> >>>> +    xsk = dev->xsks[qid];
> >>>> +    if (!xsk) {
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    rx->fd = xsk_socket__fd(xsk->xsk);
> >>>> +
> >>>> +    /* See if there is any packet on RX queue,
> >>>> +     * if yes, idx_rx is the index having the packet.
> >>>> +     */
> >>>> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> >>>> +    if (!rcvd) {
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> >>>> +    if (OVS_UNLIKELY(ret)) {
> >>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
> >>>> +        return ENOMEM;
> >>>> +    }
> >>>> +
> >>>> +    /* Prepare for the FILL queue */
> >>>> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> >>>> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> >>>> +         * to move received packets from FILL queue to RX queue.
> >>>> +         */
> >>>> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> >>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
> >>>> +        return ENOMEM;
> >>>> +    }
> >>>> +
> >>>> +    /* Setup a dp_packet batch from descriptors in RX queue */
> >>>> +    for (i = 0; i < rcvd; i++) {
> >>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> >>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> >>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> >>>> +        uint64_t index;
> >>>> +
> >>>> +        struct dp_packet_afxdp *xpacket;
> >>>> +        struct dp_packet *packet;
> >>>> +
> >>>> +        index = addr >> FRAME_SHIFT;
> >>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> >>>> +        packet = &xpacket->packet;
> >>>> +
> >>>> +        /* Initialize the struct dp_packet */
> >>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> >>>> +        dp_packet_set_size(packet, len);
> >>>> +
> >>>> +        /* Add packet into batch, increase batch->count */
> >>>> +        dp_packet_batch_add(batch, packet);
> >>>> +
> >>>> +        idx_rx++;
> >>>> +    }
> >>>> +    /* Release the RX queue */
> >>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> >>>> +
> >>>> +    for (i = 0; i < rcvd; i++) {
> >>>> +        uint64_t index;
> >>>> +        struct umem_elem *elem;
> >>>> +
> >>>> +        /* Get one free umem, program it into FILL queue */
> >>>> +        elem = elems[i];
> >>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> >>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> >>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> >>>> +
> >>>> +        idx_fq++;
> >>>> +    }
> >>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> >>>> +
> >>>> +    if (qfill) {
> >>>> +        /* TODO: return the number of remaining packets in the queue. */
> >>>> +        *qfill = 0;
> >>>> +    }
> >>>> +
> >>>> +#ifdef AFXDP_DEBUG
> >>>> +    log_xsk_stat(xsk);
> >>>> +#endif
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static inline int
> >>>> +kick_tx(struct xsk_socket_info *xsk)
> >>>> +{
> >>>> +    int ret;
> >>>> +
> >>>> +    /* This causes system call into kernel's xsk_sendmsg, and
> >>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> >>>> +     */
> >>>> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> >>>> +    if (OVS_UNLIKELY(ret < 0)) {
> >>>> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> >>>> +            return errno;
> >>>> +        }
> >>>> +    }
> >>>> +    /* no error, or EBUSY or EAGAIN */
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static inline bool
> >>>> +check_free_batch(struct dp_packet_batch *batch)
> >>>> +{
> >>>> +    struct umem_pool *first_mpool = NULL;
> >>>> +    struct dp_packet_afxdp *xpacket;
> >>>> +    struct dp_packet *packet;
> >>>> +
> >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> >>>> +        if (packet->source != DPBUF_AFXDP) {
> >>>> +            return false;
> >>>> +        }
> >>>> +        xpacket = dp_packet_cast_afxdp(packet);
> >>>> +        if (i == 0) {
> >>>> +            first_mpool = xpacket->mpool;
> >>>> +            continue;
> >>>> +        }
> >>>> +        if (xpacket->mpool != first_mpool) {
> >>>> +            return false;
> >>>> +        }
> >>>> +    }
> >>>> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> >>>> +    return true;
> >>>> +}
> >>>> +
> >>>> +static inline void
> >>>> +afxdp_complete_tx(struct xsk_socket_info *xsk)
> >>>> +{
> >>>> +    struct umem_elem *elems_push[BATCH_SIZE];
> >>>> +    uint32_t idx_cq = 0;
> >>>> +    int tx_done, j, ret;
> >>>> +
> >>>> +    if (!xsk->outstanding_tx) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    ret = kick_tx(xsk);
> >>>> +    if (OVS_UNLIKELY(ret)) {
> >>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> >>>> +                     ovs_strerror(ret));
> >>>> +    }
> >>>> +
> >>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> >>>> +    if (tx_done > 0) {
> >>>> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> >>>> +        xsk->outstanding_tx -= tx_done;
> >>>> +    }
> >>>> +
> >>>> +    /* Recycle back to umem pool */
> >>>> +    for (j = 0; j < tx_done; j++) {
> >>>> +        struct umem_elem *elem;
> >>>> +        uint64_t addr;
> >>>> +
> >>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> >>>> +        elem = ALIGNED_CAST(struct umem_elem *,
> >>>> +                            (char *)xsk->umem->buffer + addr);
> >>>> +        elems_push[j] = elem;
> >>>> +    }
> >>>> +
> >>>> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> >>>> +                        struct dp_packet_batch *batch,
> >>>> +                        bool concurrent_txq)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    struct xsk_socket_info *xsk = dev->xsks[qid];
> >>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
> >>>> +    struct dp_packet *packet;
> >>>> +    bool free_batch = true;
> >>>> +    uint32_t idx = 0;
> >>>> +    int error = 0;
> >>>> +    int ret;
> >>>> +
> >>>> +    if (!xsk) {
> >>>> +        goto out;
> >>>> +    }
> >>>> +
> >>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
> >>>> +        qid = qid % dev->up.n_txq;
> >>>> +        ovs_spin_lock(&dev->tx_locks[qid]);
> >>>> +    }
> >>>> +
> >>>> +    /* Process CQ first. */
> >>>> +    afxdp_complete_tx(xsk);
> >>>> +
> >>>> +    free_batch = check_free_batch(batch);
> >>>> +
> >>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> >>>> +    if (OVS_UNLIKELY(ret)) {
> >>>> +        xsk->tx_dropped += batch->count;
> >>>> +        error = ENOMEM;
> >>>> +        goto out;
> >>>> +    }
> >>>> +
> >>>> +    /* Make sure we have enough TX descs */
> >>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> >>>> +    if (OVS_UNLIKELY(ret == 0)) {
> >>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> >>>> +        xsk->tx_dropped += batch->count;
> >>>> +        error = ENOMEM;
> >>>> +        goto out;
> >>>> +    }
> >>>> +
> >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> >>>> +        struct umem_elem *elem;
> >>>> +        uint64_t index;
> >>>> +
> >>>> +        elem = elems_pop[i];
> >>>> +        /* Copy the packet to the umem we just pop from umem pool.
> >>>> +         * TODO: avoid this copy if the packet and the pop umem
> >>>> +         * are located in the same umem.
> >>>> +         */
> >>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> >>>> +
> >>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> >>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> >>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> >>>> +            = dp_packet_size(packet);
> >>>> +    }
> >>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> >>>> +    xsk->outstanding_tx += batch->count;
> >>>> +
> >>>> +    ret = kick_tx(xsk);
> >>>> +    if (OVS_UNLIKELY(ret)) {
> >>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> >>>> +                     ovs_strerror(ret));
> >>>> +    }
> >>>> +
> >>>> +out:
> >>>> +    if (free_batch) {
> >>>> +        free_afxdp_buf_batch(batch);
> >>>> +    } else {
> >>>> +        dp_packet_delete_batch(batch, true);
> >>>> +    }
> >>>> +
> >>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
> >>>> +        ovs_spin_unlock(&dev->tx_locks[qid]);
> >>>> +    }
> >>>> +    return error;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> >>>> +{
> >>>> +   /* Done at reconfigure */
> >>>> +   return 0;
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +netdev_afxdp_destruct(struct netdev *netdev_)
> >>>> +{
> >>>> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> >>>> +
> >>>> +    /* Note: tc is by-passed when using drv-mode, but when using
> >>>> +     * skb-mode, we might need to clean up tc. */
> >>>> +
> >>>> +    xsk_destroy_all(netdev_);
> >>>> +    ovs_mutex_destroy(&netdev->mutex);
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +netdev_afxdp_get_stats(const struct netdev *netdev,
> >>>> +                       struct netdev_stats *stats)
> >>>> +{
> >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> >>>> +    struct netdev_stats dev_stats;
> >>>> +    struct xsk_socket_info *xsk;
> >>>> +    int error, i;
> >>>> +
> >>>> +    ovs_mutex_lock(&dev->mutex);
> >>>> +
> >>>> +    error = get_stats_via_netlink(netdev, &dev_stats);
> >>>> +    if (error) {
> >>>> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> >>>> +    } else {
> >>>> +        /* Use kernel netdev's packet and byte counts */
> >>>> +        stats->rx_packets = dev_stats.rx_packets;
> >>>> +        stats->rx_bytes = dev_stats.rx_bytes;
> >>>> +        stats->tx_packets = dev_stats.tx_packets;
> >>>> +        stats->tx_bytes = dev_stats.tx_bytes;
> >>>> +
> >>>> +        stats->rx_errors           += dev_stats.rx_errors;
> >>>> +        stats->tx_errors           += dev_stats.tx_errors;
> >>>> +        stats->rx_dropped          += dev_stats.rx_dropped;
> >>>> +        stats->tx_dropped          += dev_stats.tx_dropped;
> >>>> +        stats->multicast           += dev_stats.multicast;
> >>>> +        stats->collisions          += dev_stats.collisions;
> >>>> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> >>>> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> >>>> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> >>>> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> >>>> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> >>>> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> >>>> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> >>>> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> >>>> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> >>>> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> >>>> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> >>>> +
> >>>> +        /* Account the dropped in each xsk */
> >>>> +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
> >>>> +            xsk = dev->xsks[i];
> >>>> +            if (xsk) {
> >>>> +                stats->rx_dropped += xsk->rx_dropped;
> >>>> +                stats->tx_dropped += xsk->tx_dropped;
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +    ovs_mutex_unlock(&dev->mutex);
> >>>> +
> >>>> +    return error;
> >>>> +}
> >>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> >>>> new file mode 100644
> >>>> index 000000000000..dd2dc1a2064d
> >>>> --- /dev/null
> >>>> +++ b/lib/netdev-afxdp.h
> >>>> @@ -0,0 +1,74 @@
> >>>> +/*
> >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing, software
> >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> >>>> + * See the License for the specific language governing permissions and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +
> >>>> +#ifndef NETDEV_AFXDP_H
> >>>> +#define NETDEV_AFXDP_H 1
> >>>> +
> >>>> +#include <config.h>
> >>>> +
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +
> >>>> +#include <stdint.h>
> >>>> +#include <stdbool.h>
> >>>> +
> >>>> +/* These functions are Linux AF_XDP specific, so they should be used directly
> >>>> + * only by Linux-specific code. */
> >>>> +
> >>>> +#define MAX_XSKQ 16
> >>>> +
> >>>> +struct netdev;
> >>>> +struct xsk_socket_info;
> >>>> +struct xdp_umem;
> >>>> +struct dp_packet_batch;
> >>>> +struct smap;
> >>>> +struct dp_packet;
> >>>> +struct netdev_rxq;
> >>>> +struct netdev_stats;
> >>>> +
> >>>> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> >>>> +void netdev_afxdp_destruct(struct netdev *netdev_);
> >>>> +
> >>>> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> >>>> +                          struct dp_packet_batch *batch,
> >>>> +                          int *qfill);
> >>>> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> >>>> +                            struct dp_packet_batch *batch,
> >>>> +                            bool concurrent_txq);
> >>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> >>>> +                            char **errp);
> >>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> >>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> >>>> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> >>>> +                           struct netdev_stats *stats);
> >>>> +
> >>>> +void free_afxdp_buf(struct dp_packet *p);
> >>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> >>>> +void signal_remove_xdp(struct netdev *netdev);
> >>>> +
> >>>> +#else /* !HAVE_AF_XDP */
> >>>> +
> >>>> +#include "openvswitch/compiler.h"
> >>>> +
> >>>> +struct dp_packet;
> >>>> +
> >>>> +static inline void
> >>>> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> >>>> +{
> >>>> +    /* Nothing */
> >>>> +}
> >>>> +
> >>>> +#endif /* HAVE_AF_XDP */
> >>>> +#endif /* netdev-afxdp.h */
> >>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> >>>> new file mode 100644
> >>>> index 000000000000..6a0388cf9dc3
> >>>> --- /dev/null
> >>>> +++ b/lib/netdev-linux-private.h
> >>>> @@ -0,0 +1,139 @@
> >>>> +/*
> >>>> + * Copyright (c) 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing, software
> >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> >>>> + * See the License for the specific language governing permissions and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +
> >>>> +#ifndef NETDEV_LINUX_PRIVATE_H
> >>>> +#define NETDEV_LINUX_PRIVATE_H 1
> >>>> +
> >>>> +#include <config.h>
> >>>> +
> >>>> +#include <linux/filter.h>
> >>>> +#include <linux/gen_stats.h>
> >>>> +#include <linux/if_ether.h>
> >>>> +#include <linux/if_tun.h>
> >>>> +#include <linux/types.h>
> >>>> +#include <linux/ethtool.h>
> >>>> +#include <linux/mii.h>
> >>>> +#include <stdint.h>
> >>>> +#include <stdbool.h>
> >>>> +
> >>>> +#include "netdev-afxdp.h"
> >>>> +#include "netdev-provider.h"
> >>>> +#include "netdev-tc-offloads.h"
> >>>> +#include "netdev-vport.h"
> >>>> +#include "openvswitch/thread.h"
> >>>> +#include "ovs-atomic.h"
> >>>> +#include "timer.h"
> >>>> +#include "xdpsock.h"
> >>>> +
> >>>> +/* These functions are Linux specific, so they should be used directly only by
> >>>> + * Linux-specific code. */
> >>>> +
> >>>> +struct netdev;
> >>>> +
> >>>> +struct netdev_rxq_linux {
> >>>> +    struct netdev_rxq up;
> >>>> +    bool is_tap;
> >>>> +    int fd;
> >>>> +};
> >>>> +
> >>>> +void netdev_linux_run(const struct netdev_class *);
> >>>> +
> >>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> >>>> +                                  const char *flag_name, bool enable);
> >>>> +
> >>>> +int get_stats_via_netlink(const struct netdev *netdev_,
> >>>> +                          struct netdev_stats *stats);
> >>>> +
> >>>> +struct netdev_linux {
> >>>> +    struct netdev up;
> >>>> +
> >>>> +    /* Protects all members below. */
> >>>> +    struct ovs_mutex mutex;
> >>>> +
> >>>> +    unsigned int cache_valid;
> >>>> +
> >>>> +    bool miimon;                    /* Link status of last poll. */
> >>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> >>>> +    struct timer miimon_timer;
> >>>> +
> >>>> +    int netnsid;                    /* Network namespace ID. */
> >>>> +    /* The following are figured out "on demand" only.  They are only valid
> >>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> >>>> +    int ifindex;
> >>>> +    struct eth_addr etheraddr;
> >>>> +    int mtu;
> >>>> +    unsigned int ifi_flags;
> >>>> +    long long int carrier_resets;
> >>>> +    uint32_t kbits_rate;        /* Policing data. */
> >>>> +    uint32_t kbits_burst;
> >>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> >>>> +                                   0 or an errno value. */
> >>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> >>>> +                                 * or SIOCSIFMTU.
> >>>> +                                 */
> >>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> >>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
> >>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> >>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> >>>> +
> >>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> >>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> >>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> >>>> +
> >>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> >>>> +    struct tc *tc;
> >>>> +
> >>>> +    /* For devices of class netdev_tap_class only. */
> >>>> +    int tap_fd;
> >>>> +    bool present;               /* If the device is present in the namespace */
> >>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> >>>> +
> >>>> +    /* LAG information. */
> >>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> >>>> +
> >>>> +    /* AF_XDP information */
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +    struct xsk_socket_info **xsks;
> >>>> +    int requested_n_rxq;
> >>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> >>>> +    int xdp_flags, xdp_bind_flags;
> >>>> +    struct ovs_spinlock *tx_locks;
> >>>> +#endif
> >>>> +};
> >>>> +
> >>>> +static bool
> >>>> +is_netdev_linux_class(const struct netdev_class *netdev_class)
> >>>> +{
> >>>> +    return netdev_class->run == netdev_linux_run;
> >>>> +}
> >>>> +
> >>>> +static struct netdev_linux *
> >>>> +netdev_linux_cast(const struct netdev *netdev)
> >>>> +{
> >>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> >>>> +
> >>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> >>>> +}
> >>>> +
> >>>> +static struct netdev_rxq_linux *
> >>>> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> >>>> +{
> >>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> >>>> +
> >>>> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> >>>> +}
> >>>> +
> >>>> +#endif /* netdev-linux-private.h */
> >>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> >>>> index f75d73fd39f8..2883cf1f2586 100644
> >>>> --- a/lib/netdev-linux.c
> >>>> +++ b/lib/netdev-linux.c
> >>>> @@ -17,6 +17,7 @@
> >>>>  #include <config.h>
> >>>>
> >>>>  #include "netdev-linux.h"
> >>>> +#include "netdev-linux-private.h"
> >>>>
> >>>>  #include <errno.h>
> >>>>  #include <fcntl.h>
> >>>> @@ -54,6 +55,7 @@
> >>>>  #include "fatal-signal.h"
> >>>>  #include "hash.h"
> >>>>  #include "openvswitch/hmap.h"
> >>>> +#include "netdev-afxdp.h"
> >>>>  #include "netdev-provider.h"
> >>>>  #include "netdev-tc-offloads.h"
> >>>>  #include "netdev-vport.h"
> >>>> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
> >>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
> >>>>
> >>>> -struct netdev_linux {
> >>>> -    struct netdev up;
> >>>> -
> >>>> -    /* Protects all members below. */
> >>>> -    struct ovs_mutex mutex;
> >>>> -
> >>>> -    unsigned int cache_valid;
> >>>> -
> >>>> -    bool miimon;                    /* Link status of last poll. */
> >>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> >>>> -    struct timer miimon_timer;
> >>>> -
> >>>> -    int netnsid;                    /* Network namespace ID. */
> >>>> -    /* The following are figured out "on demand" only.  They are only valid
> >>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> >>>> -    int ifindex;
> >>>> -    struct eth_addr etheraddr;
> >>>> -    int mtu;
> >>>> -    unsigned int ifi_flags;
> >>>> -    long long int carrier_resets;
> >>>> -    uint32_t kbits_rate;        /* Policing data. */
> >>>> -    uint32_t kbits_burst;
> >>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> >>>> -                                   0 or an errno value. */
> >>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> >>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> >>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
> >>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> >>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> >>>> -
> >>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> >>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> >>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> >>>> -
> >>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> >>>> -    struct tc *tc;
> >>>> -
> >>>> -    /* For devices of class netdev_tap_class only. */
> >>>> -    int tap_fd;
> >>>> -    bool present;               /* If the device is present in the namespace */
> >>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> >>>> -
> >>>> -    /* LAG information. */
> >>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> >>>> -};
> >>>> -
> >>>> -struct netdev_rxq_linux {
> >>>> -    struct netdev_rxq up;
> >>>> -    bool is_tap;
> >>>> -    int fd;
> >>>> -};
> >>>>
> >>>>  /* This is set pretty low because we probably won't learn anything from the
> >>>>   * additional log messages. */
> >>>> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >>>>   * changes in the device miimon status, so we can use atomic_count. */
> >>>>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
> >>>>
> >>>> -static void netdev_linux_run(const struct netdev_class *);
> >>>> -
> >>>>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> >>>>                                     int cmd, const char *cmd_name);
> >>>>  static int get_flags(const struct netdev *, unsigned int *flags);
> >>>> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
> >>>>                         struct in_addr addr);
> >>>>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
> >>>>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
> >>>> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
> >>>>  static int af_packet_sock(void);
> >>>>  static bool netdev_linux_miimon_enabled(void);
> >>>>  static void netdev_linux_miimon_run(void);
> >>>> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
> >>>>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
> >>>>
> >>>>  static bool
> >>>> -is_netdev_linux_class(const struct netdev_class *netdev_class)
> >>>> -{
> >>>> -    return netdev_class->run == netdev_linux_run;
> >>>> -}
> >>>> -
> >>>> -static bool
> >>>>  is_tap_netdev(const struct netdev *netdev)
> >>>>  {
> >>>>      return netdev_get_class(netdev) == &netdev_tap_class;
> >>>>  }
> >>>> -
> >>>> -static struct netdev_linux *
> >>>> -netdev_linux_cast(const struct netdev *netdev)
> >>>> -{
> >>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> >>>> -
> >>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> >>>> -}
> >>>> -
> >>>> -static struct netdev_rxq_linux *
> >>>> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> >>>> -{
> >>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> >>>> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> >>>> -}
> >>>>
> >>>>  static int
> >>>>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
> >>>> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
> >>>>      }
> >>>>  }
> >>>>
> >>>> -static void
> >>>> +void
> >>>>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
> >>>>  {
> >>>>      struct nl_sock *sock;
> >>>> @@ -3279,9 +3206,7 @@ exit:
> >>>>      .run = netdev_linux_run,                                    \
> >>>>      .wait = netdev_linux_wait,                                  \
> >>>>      .alloc = netdev_linux_alloc,                                \
> >>>> -    .destruct = netdev_linux_destruct,                          \
> >>>>      .dealloc = netdev_linux_dealloc,                            \
> >>>> -    .send = netdev_linux_send,                                  \
> >>>>      .send_wait = netdev_linux_send_wait,                        \
> >>>>      .set_etheraddr = netdev_linux_set_etheraddr,                \
> >>>>      .get_etheraddr = netdev_linux_get_etheraddr,                \
> >>>> @@ -3312,10 +3237,8 @@ exit:
> >>>>      .arp_lookup = netdev_linux_arp_lookup,                      \
> >>>>      .update_flags = netdev_linux_update_flags,                  \
> >>>>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
> >>>> -    .rxq_construct = netdev_linux_rxq_construct,                \
> >>>>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
> >>>>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
> >>>> -    .rxq_recv = netdev_linux_rxq_recv,                          \
> >>>>      .rxq_wait = netdev_linux_rxq_wait,                          \
> >>>>      .rxq_drain = netdev_linux_rxq_drain
> >>>>
> >>>> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
> >>>>      NETDEV_LINUX_CLASS_COMMON,
> >>>>      LINUX_FLOW_OFFLOAD_API,
> >>>>      .type = "system",
> >>>> +    .is_pmd = false,
> >>>>      .construct = netdev_linux_construct,
> >>>> +    .destruct = netdev_linux_destruct,
> >>>>      .get_stats = netdev_linux_get_stats,
> >>>>      .get_features = netdev_linux_get_features,
> >>>>      .get_status = netdev_linux_get_status,
> >>>> -    .get_block_id = netdev_linux_get_block_id
> >>>> +    .get_block_id = netdev_linux_get_block_id,
> >>>> +    .send = netdev_linux_send,
> >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> >>>>  };
> >>>>
> >>>>  const struct netdev_class netdev_tap_class = {
> >>>>      NETDEV_LINUX_CLASS_COMMON,
> >>>>      .type = "tap",
> >>>> +    .is_pmd = false,
> >>>>      .construct = netdev_linux_construct_tap,
> >>>> +    .destruct = netdev_linux_destruct,
> >>>>      .get_stats = netdev_tap_get_stats,
> >>>>      .get_features = netdev_linux_get_features,
> >>>>      .get_status = netdev_linux_get_status,
> >>>> +    .send = netdev_linux_send,
> >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> >>>>  };
> >>>>
> >>>>  const struct netdev_class netdev_internal_class = {
> >>>>      NETDEV_LINUX_CLASS_COMMON,
> >>>>      LINUX_FLOW_OFFLOAD_API,
> >>>>      .type = "internal",
> >>>> +    .is_pmd = false,
> >>>>      .construct = netdev_linux_construct,
> >>>> +    .destruct = netdev_linux_destruct,
> >>>>      .get_stats = netdev_internal_get_stats,
> >>>>      .get_status = netdev_internal_get_status,
> >>>> +    .send = netdev_linux_send,
> >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> >>>>  };
> >>>> +
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +const struct netdev_class netdev_afxdp_class = {
> >>>> +    NETDEV_LINUX_CLASS_COMMON,
> >>>> +    .type = "afxdp",
> >>>> +    .is_pmd = true,
> >>>> +    .construct = netdev_linux_construct,
> >>>> +    .destruct = netdev_afxdp_destruct,
> >>>> +    .get_stats = netdev_afxdp_get_stats,
> >>>> +    .get_status = netdev_linux_get_status,
> >>>> +    .set_config = netdev_afxdp_set_config,
> >>>> +    .get_config = netdev_afxdp_get_config,
> >>>> +    .reconfigure = netdev_afxdp_reconfigure,
> >>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
> >>>> +    .send = netdev_afxdp_batch_send,
> >>>> +    .rxq_construct = netdev_afxdp_rxq_construct,
> >>>> +    .rxq_recv = netdev_afxdp_rxq_recv,
> >>>> +};
> >>>> +#endif
> >>>>
> >>>>
> >>>>  #define CODEL_N_QUEUES 0x0000
> >>>> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
> >>>>      dst->tx_window_errors = src->tx_window_errors;
> >>>>  }
> >>>>
> >>>> -static int
> >>>> +int
> >>>>  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
> >>>>  {
> >>>>      struct ofpbuf request;
> >>>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> >>>> index fb0c27e6e8e8..91e6a9e2bfc0 100644
> >>>> --- a/lib/netdev-provider.h
> >>>> +++ b/lib/netdev-provider.h
> >>>> @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
> >>>>  extern const struct netdev_class netdev_internal_class;
> >>>>  extern const struct netdev_class netdev_tap_class;
> >>>>
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +extern const struct netdev_class netdev_afxdp_class;
> >>>> +#endif
> >>>>  #ifdef  __cplusplus
> >>>>  }
> >>>>  #endif
> >>>> diff --git a/lib/netdev.c b/lib/netdev.c
> >>>> index 7d7ecf6f0946..0fac117cc602 100644
> >>>> --- a/lib/netdev.c
> >>>> +++ b/lib/netdev.c
> >>>> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >>>>
> >>>>  static void restore_all_flags(void *aux OVS_UNUSED);
> >>>>  void update_device_args(struct netdev *, const struct shash *args);
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +void signal_remove_xdp(struct netdev *netdev);
> >>>> +#endif
> >>>>
> >>>>  int
> >>>>  netdev_n_txq(const struct netdev *netdev)
> >>>> @@ -146,6 +149,9 @@ netdev_initialize(void)
> >>>>          netdev_register_provider(&netdev_internal_class);
> >>>>          netdev_register_provider(&netdev_tap_class);
> >>>>          netdev_vport_tunnel_register();
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +        netdev_register_provider(&netdev_afxdp_class);
> >>>> +#endif
> >>>>  #endif
> >>>>  #if defined(__FreeBSD__) || defined(__NetBSD__)
> >>>>          netdev_register_provider(&netdev_tap_class);
> >>>> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
> >>>>                                                 saved_flags & ~saved_values,
> >>>>                                                 &old_flags);
> >>>>          }
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +        if (netdev->netdev_class == &netdev_afxdp_class) {
> >>>> +            signal_remove_xdp(netdev);
> >>>> +        }
> >>>> +#endif
> >>>>      }
> >>>>  }
> >>>>
> >>>> diff --git a/lib/spinlock.h b/lib/spinlock.h
> >>>> new file mode 100644
> >>>> index 000000000000..1ae634f23a6b
> >>>> --- /dev/null
> >>>> +++ b/lib/spinlock.h
> >>>> @@ -0,0 +1,70 @@
> >>>> +/*
> >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing, software
> >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> >>>> + * See the License for the specific language governing permissions and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +#ifndef SPINLOCK_H
> >>>> +#define SPINLOCK_H 1
> >>>> +
> >>>> +#include <config.h>
> >>>> +
> >>>> +#include <ctype.h>
> >>>> +#include <errno.h>
> >>>> +#include <fcntl.h>
> >>>> +#include <stdarg.h>
> >>>> +#include <stdlib.h>
> >>>> +#include <unistd.h>
> >>>> +
> >>>> +#include "ovs-atomic.h"
> >>>> +
> >>>> +struct ovs_spinlock {
> >>>> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
> >>>> +};
> >>>> +
> >>>> +static inline void
> >>>> +ovs_spinlock_init(struct ovs_spinlock *sl)
> >>>> +{
> >>>> +    atomic_init(&sl->locked, 0);
> >>>> +}
> >>>> +
> >>>> +static inline void
> >>>> +ovs_spin_lock(struct ovs_spinlock *sl)
> >>>> +{
> >>>> +    int exp = 0, locked = 0;
> >>>> +
> >>>> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> >>>> +                memory_order_acquire,
> >>>> +                memory_order_relaxed)) {
> >>>> +        locked = 1;
> >>>> +        while (locked) {
> >>>> +            atomic_read_relaxed(&sl->locked, &locked);
> >>>> +        }
> >>>> +        exp = 0;
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static inline void
> >>>> +ovs_spin_unlock(struct ovs_spinlock *sl)
> >>>> +{
> >>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> >>>> +}
> >>>> +
> >>>> +static inline int
> >>>> +ovs_spin_trylock(struct ovs_spinlock *sl)
> >>>> +{
> >>>> +    int exp = 0;
> >>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
> >>>> +                memory_order_acquire,
> >>>> +                memory_order_relaxed);
> >>>> +}
> >>>> +#endif
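
For readers following the locking scheme: the spinlock above is a
test-and-test-and-set lock (try a compare-and-swap, and on failure spin on a
relaxed load until the lock looks free). Below is a minimal standalone sketch
of the same idea in plain C11 <stdatomic.h>. The demo_* names are hypothetical
and are not part of the patch, which uses the OVS atomic wrappers instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ovs_spinlock. */
struct demo_spinlock {
    atomic_int locked;      /* 0 = free, 1 = held */
};

static inline void
demo_spin_init(struct demo_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool
demo_spin_trylock(struct demo_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
demo_spin_lock(struct demo_spinlock *sl)
{
    while (!demo_spin_trylock(sl)) {
        /* Spin on a cheap relaxed load until the lock looks free, then
         * retry the compare-and-swap. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
demo_spin_unlock(struct demo_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

Note that, like ovs_spin_trylock() in the patch, the trylock here returns
nonzero on success, which is the opposite of the pthread_mutex_trylock()
convention. Callers need to be aware of that.
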
> >>>> diff --git a/lib/util.c b/lib/util.c
> >>>> index 7b8ab81f6ee1..5eb20995b370 100644
> >>>> --- a/lib/util.c
> >>>> +++ b/lib/util.c
> >>>> @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
> >>>>      return xrealloc(p, *n * s);
> >>>>  }
> >>>>
> >>>> -/* Allocates and returns 'size' bytes of memory aligned to a cache
> >>>> line and in
> >>>> - * dedicated cache lines.  That is, the memory block returned will
> >>>> not share a
> >>>> - * cache line with other data, avoiding "false sharing".
> >>>> +/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
> >>>> + * 'alignment' must be a power of two and a multiple of sizeof(void *).
> >>>>   *
> >>>> - * Use free_cacheline() to free the returned memory block. */
> >>>> + * Use free_size_align() to free the returned memory block. */
> >>>>  void *
> >>>> -xmalloc_cacheline(size_t size)
> >>>> +xmalloc_size_align(size_t size, size_t alignment)
> >>>>  {
> >>>>  #ifdef HAVE_POSIX_MEMALIGN
> >>>>      void *p;
> >>>>      int error;
> >>>>
> >>>>      COVERAGE_INC(util_xalloc);
> >>>> -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
> >>>> +    error = posix_memalign(&p, alignment, size ? size : 1);
> >>>>      if (error != 0) {
> >>>>          out_of_memory();
> >>>>      }
> >>>> @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
> >>>>  #else
> >>>>      /* Allocate room for:
> >>>>       *
> >>>> -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to
> >>>> allow the
> >>>> -     *       pointer to be aligned exactly sizeof(void *) bytes
> >>>> before the
> >>>> -     *       beginning of a cache line.
> >>>> +     *     - Header padding: Up to alignment - 1 bytes, to allow the
> >>>> +     *       pointer 'q' to be aligned exactly sizeof(void *) bytes before the
> >>>> +     *       beginning of the aligned block.
> >>>>       *
> >>>>       *     - Pointer: A pointer to the start of the header
> >>>> padding,
> >>>> to allow us
> >>>>       *       to free() the block later.
> >>>>       *
> >>>>       *     - User data: 'size' bytes.
> >>>>       *
> >>>> -     *     - Trailer padding: Enough to bring the user data up to
> >>>> a
> >>>> cache line
> >>>> +     *     - Trailer padding: Enough to bring the user data up to an alignment
> >>>>       *       multiple.
> >>>>       *
> >>>>       *
> >>>> +---------------+---------+------------------------+---------+
> >>>> @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
> >>>>       * p               q         r
> >>>>       *
> >>>>       */
> >>>> -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
> >>>> -                      + sizeof(void *)
> >>>> -                      + ROUND_UP(size, CACHE_LINE_SIZE));
> >>>> -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) <
> >>>> sizeof(void *);
> >>>> -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ?
> >>>> CACHE_LINE_SIZE : 0),
> >>>> -                                CACHE_LINE_SIZE);
> >>>> -    void **q = (void **) r - 1;
> >>>> +    void *p, *r, **q;
> >>>> +    bool runt;
> >>>> +
> >>>> +    COVERAGE_INC(util_xalloc);
> >>>> +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
> >>>> +        ovs_abort(0, "Invalid alignment");
> >>>> +    }
> >>>> +
> >>>> +    p = xmalloc((alignment - 1)
> >>>> +                + sizeof(void *)
> >>>> +                + ROUND_UP(size, alignment));
> >>>> +
> >>>> +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
> >>>> +    /* When the padding size < sizeof(void*), we don't have enough room for
> >>>> +     * pointer 'q'. As a result, we need to move 'r' to the next alignment.
> >>>> +     * So ROUND_UP when calling xmalloc above, and ROUND_UP again when
> >>>> +     * calculating 'r' below.
> >>>> +     */
> >>>> +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0),
> >>>> +                          alignment);
> >>>> +    q = (void **) r - 1;
> >>>>      *q = p;
> >>>> +
> >>>>      return r;
> >>>>  #endif
> >>>>  }
> >>>>
> >>>> +void
> >>>> +free_size_align(void *p)
> >>>> +{
> >>>> +#ifdef HAVE_POSIX_MEMALIGN
> >>>> +    free(p);
> >>>> +#else
> >>>> +    if (p) {
> >>>> +        void **q = (void **) p - 1;
> >>>> +        free(*q);
> >>>> +    }
> >>>> +#endif
> >>>> +}
> >>>> +
> >>>> +/* Allocates and returns 'size' bytes of memory aligned to a cache
> >>>> line and in
> >>>> + * dedicated cache lines.  That is, the memory block returned will
> >>>> not share a
> >>>> + * cache line with other data, avoiding "false sharing".
> >>>> + *
> >>>> + * Use free_cacheline() to free the returned memory block. */
> >>>> +void *
> >>>> +xmalloc_cacheline(size_t size)
> >>>> +{
> >>>> +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
> >>>> +}
> >>>> +
> >>>>  /* Like xmalloc_cacheline() but clears the allocated memory to all
> >>>> zero
> >>>>   * bytes. */
> >>>>  void *
> >>>> @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
> >>>>  void
> >>>>  free_cacheline(void *p)
> >>>>  {
> >>>> -#ifdef HAVE_POSIX_MEMALIGN
> >>>> -    free(p);
> >>>> -#else
> >>>> -    if (p) {
> >>>> -        void **q = (void **) p - 1;
> >>>> -        free(*q);
> >>>> -    }
> >>>> -#endif
> >>>> +    free_size_align(p);
> >>>> +}
> >>>> +
> >>>> +void *
> >>>> +xmalloc_pagealign(size_t size)
> >>>> +{
> >>>> +    return xmalloc_size_align(size, get_page_size());
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +free_pagealign(void *p)
> >>>> +{
> >>>> +    free_size_align(p);
> >>>>  }
> >>>>
> >>>>  char *
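
The fallback branch (the one without posix_memalign) is easier to follow with
the arithmetic spelled out. Here is a rough standalone sketch, assuming plain
malloc() and hypothetical demo_* names. It returns NULL on bad input, whereas
the patch aborts, and it is only meant to illustrate the p/q/r layout from the
comment above, not to replace the code in util.c.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', or NULL on failure.
 * 'alignment' must be a power of two and a multiple of sizeof(void *). */
static void *
demo_malloc_aligned(size_t size, size_t alignment)
{
    if (alignment == 0 || (alignment & (alignment - 1)) != 0
        || alignment % sizeof(void *) != 0) {
        return NULL;
    }

    /* Header padding + room for the back pointer + rounded-up user data. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    char *p = malloc((alignment - 1) + sizeof(void *) + rounded);
    if (!p) {
        return NULL;
    }

    /* Distance from 'p' to the next alignment boundary (0 if aligned). */
    uintptr_t base = (uintptr_t) p;
    size_t pad = (size_t) (-base & (alignment - 1));

    /* "Runt" case: no room for the back pointer before the first boundary,
     * so skip ahead one full alignment unit. */
    size_t offset = pad < sizeof(void *) ? pad + alignment : pad;

    char *r = p + offset;
    ((void **) r)[-1] = p;      /* 'q': remember what to pass to free(). */
    return r;
}

static void
demo_free_aligned(void *r)
{
    if (r) {
        free(((void **) r)[-1]);
    }
}
```

The important detail is the runt case: the offset added before rounding has to
be a full 'alignment', otherwise 'q' may end up in front of 'p'.
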
> >>>> diff --git a/lib/util.h b/lib/util.h
> >>>> index c26605abdce3..33665748274c 100644
> >>>> --- a/lib/util.h
> >>>> +++ b/lib/util.h
> >>>> @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src,
> >>>> size_t size);
> >>>>
> >>>>  int string_ends_with(const char *str, const char *suffix);
> >>>>
> >>>> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> >>>> +void free_pagealign(void *);
> >>>> +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
> >>>> +void free_size_align(void *);
> >>>> +
> >>>>  /* The C standards say that neither the 'dst' nor 'src' argument
> >>>> to
> >>>>   * memcpy() may be null, even if 'n' is zero.  This wrapper
> >>>> tolerates
> >>>>   * the null case. */
> >>>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> >>>> new file mode 100644
> >>>> index 000000000000..ea39fa557290
> >>>> --- /dev/null
> >>>> +++ b/lib/xdpsock.c
> >>>> @@ -0,0 +1,170 @@
> >>>> +/*
> >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the
> >>>> License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing,
> >>>> software
> >>>> + * distributed under the License is distributed on an "AS IS"
> >>>> BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> >>>> implied.
> >>>> + * See the License for the specific language governing permissions
> >>>> and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +#include <config.h>
> >>>> +
> >>>> +#include "xdpsock.h"
> >>>> +#include "dp-packet.h"
> >>>> +#include "openvswitch/compiler.h"
> >>>> +
> >>>> +/* Note:
> >>>> + * umem_elem_push* shouldn't overflow because we always pop
> >>>> + * elem first, then push back to the stack.
> >>>> + */
> >>>> +static inline void
> >>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> >>>> +{
> >>>> +    void *ptr;
> >>>> +
> >>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> >>>> +        OVS_NOT_REACHED();
> >>>> +    }
> >>>> +
> >>>> +    ptr = &umemp->array[umemp->index];
> >>>> +    memcpy(ptr, addrs, n * sizeof(void *));
> >>>> +    umemp->index += n;
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> >>>> +{
> >>>> +    ovs_spin_lock(&umemp->lock);
> >>>> +    __umem_elem_push_n(umemp, n, addrs);
> >>>> +    ovs_spin_unlock(&umemp->lock);
> >>>> +}
> >>>> +
> >>>> +static inline void
> >>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> >>>> +{
> >>>> +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
> >>>> +        OVS_NOT_REACHED();
> >>>> +    }
> >>>> +
> >>>> +    umemp->array[umemp->index++] = addr;
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
> >>>> +{
> >>>> +
> >>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> >>>> +
> >>>> +    ovs_spin_lock(&umemp->lock);
> >>>> +    __umem_elem_push(umemp, addr);
> >>>> +    ovs_spin_unlock(&umemp->lock);
> >>>> +}
> >>>> +
> >>>> +static inline int
> >>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> >>>> +{
> >>>> +    void *ptr;
> >>>> +
> >>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> >>>> +        return -ENOMEM;
> >>>> +    }
> >>>> +
> >>>> +    umemp->index -= n;
> >>>> +    ptr = &umemp->array[umemp->index];
> >>>> +    memcpy(addrs, ptr, n * sizeof(void *));
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> >>>> +{
> >>>> +    int ret;
> >>>> +
> >>>> +    ovs_spin_lock(&umemp->lock);
> >>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> >>>> +    ovs_spin_unlock(&umemp->lock);
> >>>> +
> >>>> +    return ret;
> >>>> +}
> >>>> +
> >>>> +static inline void *
> >>>> +__umem_elem_pop(struct umem_pool *umemp)
> >>>> +{
> >>>> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    return umemp->array[--umemp->index];
> >>>> +}
> >>>> +
> >>>> +void *
> >>>> +umem_elem_pop(struct umem_pool *umemp)
> >>>> +{
> >>>> +    void *ptr;
> >>>> +
> >>>> +    ovs_spin_lock(&umemp->lock);
> >>>> +    ptr = __umem_elem_pop(umemp);
> >>>> +    ovs_spin_unlock(&umemp->lock);
> >>>> +
> >>>> +    return ptr;
> >>>> +}
> >>>> +
> >>>> +static void **
> >>>> +__umem_pool_alloc(unsigned int size)
> >>>> +{
> >>>> +    void *bufs;
> >>>> +
> >>>> +    bufs = xmalloc_pagealign(size * sizeof(void *));
> >>>> +    memset(bufs, 0, size * sizeof(void *));
> >>>> +
> >>>> +    return (void **)bufs;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> >>>> +{
> >>>> +    umemp->array = __umem_pool_alloc(size);
> >>>> +    if (!umemp->array) {
> >>>> +        return -ENOMEM;
> >>>> +    }
> >>>> +
> >>>> +    umemp->size = size;
> >>>> +    umemp->index = 0;
> >>>> +    ovs_spinlock_init(&umemp->lock);
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +umem_pool_cleanup(struct umem_pool *umemp)
> >>>> +{
> >>>> +    free_pagealign(umemp->array);
> >>>> +    umemp->array = NULL;
> >>>> +}
> >>>> +
> >>>> +/* AF_XDP metadata init/destroy */
> >>>> +int
> >>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> >>>> +{
> >>>> +    void *bufs;
> >>>> +
> >>>> +    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
> >>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> >>>> +
> >>>> +    xp->array = bufs;
> >>>> +    xp->size = size;
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +void
> >>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> >>>> +{
> >>>> +    free_pagealign(xp->array);
> >>>> +    xp->array = NULL;
> >>>> +}
> >>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> >>>> new file mode 100644
> >>>> index 000000000000..1a1093381243
> >>>> --- /dev/null
> >>>> +++ b/lib/xdpsock.h
> >>>> @@ -0,0 +1,101 @@
> >>>> +/*
> >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> + *
> >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> >>>> + * you may not use this file except in compliance with the
> >>>> License.
> >>>> + * You may obtain a copy of the License at:
> >>>> + *
> >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> >>>> + *
> >>>> + * Unless required by applicable law or agreed to in writing,
> >>>> software
> >>>> + * distributed under the License is distributed on an "AS IS"
> >>>> BASIS,
> >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> >>>> implied.
> >>>> + * See the License for the specific language governing permissions
> >>>> and
> >>>> + * limitations under the License.
> >>>> + */
> >>>> +
> >>>> +#ifndef XDPSOCK_H
> >>>> +#define XDPSOCK_H 1
> >>>> +
> >>>> +#include <config.h>
> >>>> +
> >>>> +#ifdef HAVE_AF_XDP
> >>>> +
> >>>> +#include <bpf/xsk.h>
> >>>> +#include <errno.h>
> >>>> +#include <stdbool.h>
> >>>> +#include <stdio.h>
> >>>> +
> >>>> +#include "openvswitch/thread.h"
> >>>> +#include "ovs-atomic.h"
> >>>> +#include "spinlock.h"
> >>>> +
> >>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> >>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> >>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> >>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> >>>> +
> >>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> >>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> >>>> +
> >>>> +/* The worst case is when all 4 queues (TX/CQ/RX/FILL) are full.
> >>>> + * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
> >>>> + */
> >>>> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> >>>> +
> >>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
> >>>> +
> >>>> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> >>>> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> >>>> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
> >>>> +
> >>>> +/* LIFO ptr_array */
> >>>> +struct umem_pool {
> >>>> +    int index;      /* point to top */
> >>>> +    unsigned int size;
> >>>> +    struct ovs_spinlock lock;
> >>>> +    void **array;   /* a pointer array, point to umem buf */
> >>>> +};
> >>>> +
> >>>> +/* array-based dp_packet_afxdp */
> >>>> +struct xpacket_pool {
> >>>> +    unsigned int size;
> >>>> +    struct dp_packet_afxdp **array;
> >>>> +};
> >>>> +
> >>>> +struct xsk_umem_info {
> >>>> +    struct umem_pool mpool;
> >>>> +    struct xpacket_pool xpool;
> >>>> +    struct xsk_ring_prod fq;
> >>>> +    struct xsk_ring_cons cq;
> >>>> +    struct xsk_umem *umem;
> >>>> +    void *buffer;
> >>>> +};
> >>>> +
> >>>> +struct xsk_socket_info {
> >>>> +    struct xsk_ring_cons rx;
> >>>> +    struct xsk_ring_prod tx;
> >>>> +    struct xsk_umem_info *umem;
> >>>> +    struct xsk_socket *xsk;
> >>>> +    unsigned long rx_dropped;
> >>>> +    unsigned long tx_dropped;
> >>>> +    uint32_t outstanding_tx;
> >>>> +};
> >>>> +
> >>>> +struct umem_elem {
> >>>> +    struct umem_elem *next;
> >>>> +};
> >>>> +
> >>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> >>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void
> >>>> **addrs);
> >>>> +
> >>>> +void *umem_elem_pop(struct umem_pool *umemp);
> >>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> >>>> +
> >>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> >>>> +void umem_pool_cleanup(struct umem_pool *umemp);
> >>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> >>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> >>>> +
> >>>> +#endif
> >>>> +#endif
> >>>> diff --git a/tests/automake.mk b/tests/automake.mk
> >>>> index 2956e68b242c..131564bb0bd3 100644
> >>>> --- a/tests/automake.mk
> >>>> +++ b/tests/automake.mk
> >>>> @@ -4,12 +4,14 @@ EXTRA_DIST += \
> >>>>       $(SYSTEM_TESTSUITE_AT) \
> >>>>       $(SYSTEM_KMOD_TESTSUITE_AT) \
> >>>>       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
> >>>> +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
> >>>>       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
> >>>>       $(SYSTEM_DPDK_TESTSUITE_AT) \
> >>>>       $(OVSDB_CLUSTER_TESTSUITE_AT) \
> >>>>       $(TESTSUITE) \
> >>>>       $(SYSTEM_KMOD_TESTSUITE) \
> >>>>       $(SYSTEM_USERSPACE_TESTSUITE) \
> >>>> +     $(SYSTEM_AFXDP_TESTSUITE) \
> >>>>       $(SYSTEM_OFFLOADS_TESTSUITE) \
> >>>>       $(SYSTEM_DPDK_TESTSUITE) \
> >>>>       $(OVSDB_CLUSTER_TESTSUITE) \
> >>>> @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
> >>>>       tests/system-userspace-macros.at \
> >>>>       tests/system-userspace-packet-type-aware.at
> >>>>
> >>>> +SYSTEM_AFXDP_TESTSUITE_AT = \
> >>>> +     tests/system-afxdp-testsuite.at \
> >>>> +     tests/system-afxdp-macros.at
> >>>> +
> >>>>  SYSTEM_TESTSUITE_AT = \
> >>>>       tests/system-common-macros.at \
> >>>>       tests/system-ovn.at \
> >>>> @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
> >>>>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
> >>>>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
> >>>>  SYSTEM_USERSPACE_TESTSUITE =
> >>>> $(srcdir)/tests/system-userspace-testsuite
> >>>> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
> >>>>  SYSTEM_OFFLOADS_TESTSUITE =
> >>>> $(srcdir)/tests/system-offloads-testsuite
> >>>>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
> >>>>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> >>>> @@ -317,6 +324,11 @@ check-system-userspace: all
> >>>>       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests
> >>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes &&
> >>>> "$$@"
> >>>> --recheck)
> >>>>
> >>>> +check-afxdp: all
> >>>> +     $(MAKE) install
> >>>> +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> >>>> +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> >>>> +
> >>>>  check-offloads: all
> >>>>       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests
> >>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> >>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes &&
> >>>> "$$@"
> >>>> --recheck)
> >>>> @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4
> >>>> $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
> >>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >>>>       $(AM_V_at)mv $@.tmp $@
> >>>>
> >>>> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> >>>> $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> >>>> +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >>>> +     $(AM_V_at)mv $@.tmp $@
> >>>> +
> >>>>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> >>>> $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
> >>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> >>>>       $(AM_V_at)mv $@.tmp $@
> >>>> diff --git a/tests/system-afxdp-macros.at
> >>>> b/tests/system-afxdp-macros.at
> >>>> new file mode 100644
> >>>> index 000000000000..1e6f7a46b4b7
> >>>> --- /dev/null
> >>>> +++ b/tests/system-afxdp-macros.at
> >>>> @@ -0,0 +1,20 @@
> >>>> +# Add port to ovs bridge by using afxdp mode.
> >>>> +# This will use generic XDP support in the veth driver.
> >>>> +m4_define([ADD_VETH],
> >>>> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
> >>>> +      CONFIGURE_VETH_OFFLOADS([$1])
> >>>> +      AT_CHECK([ip link set $1 netns $2])
> >>>> +      AT_CHECK([ip link set dev ovs-$1 up])
> >>>> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> >>>> +                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
> >>>> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> >>>> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> >>>> +      if test -n "$5"; then
> >>>> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> >>>> +      fi
> >>>> +      if test -n "$6"; then
> >>>> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> >>>> +      fi
> >>>> +      on_exit 'ip link del ovs-$1'
> >>>> +    ]
> >>>> +)
> >>>> diff --git a/tests/system-afxdp-testsuite.at
> >>>> b/tests/system-afxdp-testsuite.at
> >>>> new file mode 100644
> >>>> index 000000000000..9b7a29066614
> >>>> --- /dev/null
> >>>> +++ b/tests/system-afxdp-testsuite.at
> >>>> @@ -0,0 +1,26 @@
> >>>> +AT_INIT
> >>>> +
> >>>> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
> >>>> +
> >>>> +Licensed under the Apache License, Version 2.0 (the "License");
> >>>> +you may not use this file except in compliance with the License.
> >>>> +You may obtain a copy of the License at:
> >>>> +
> >>>> +    http://www.apache.org/licenses/LICENSE-2.0
> >>>> +
> >>>> +Unless required by applicable law or agreed to in writing,
> >>>> software
> >>>> +distributed under the License is distributed on an "AS IS" BASIS,
> >>>> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> >>>> implied.
> >>>> +See the License for the specific language governing permissions
> >>>> and
> >>>> +limitations under the License.])
> >>>> +
> >>>> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> >>>> +
> >>>> +m4_include([tests/ovs-macros.at])
> >>>> +m4_include([tests/ovsdb-macros.at])
> >>>> +m4_include([tests/ofproto-macros.at])
> >>>> +m4_include([tests/system-common-macros.at])
> >>>> +m4_include([tests/system-userspace-macros.at])
> >>>> +m4_include([tests/system-afxdp-macros.at])
> >>>> +
> >>>> +m4_include([tests/system-traffic.at])
> >>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> >>>> index 89c06a1b7877..1e3acbbb8075 100644
> >>>> --- a/vswitchd/vswitch.xml
> >>>> +++ b/vswitchd/vswitch.xml
> >>>> @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface
> >>>> p0
> >>>> type=patch options:peer=p1 \
> >>>>          </p>
> >>>>        </column>
> >>>>
> >>>> +      <column name="other_config" key="xdpmode"
> >>>> +              type='{"type": "string",
> >>>> +                     "enum": ["set", ["skb", "drv"]]}'>
> >>>> +        <p>
> >>>> +          Specifies the operational mode of the XDP program.
> >>>> +          If "drv", the XDP program is loaded into the device driver with
> >>>> +          zero-copy RX and TX enabled. This mode requires a device driver
> >>>> +          with AF_XDP support and has the best performance.
> >>>> +          If "skb", the XDP program uses the kernel's generic XDP mode,
> >>>> +          with extra data copying between userspace and kernel. No device
> >>>> +          driver support is needed. Note that this option applies only to
> >>>> +          the afxdp netdev type. Defaults to "skb" mode.
> >>>> +        </p>
> >>>> +      </column>
> >>>> +
> >>>>        <column name="options" key="vhost-server-path"
> >>>>                type='{"type": "string"}'>
> >>>>          <p>
> >>>> --
> >>>> 2.7.4
William Tu June 11, 2019, 5:47 p.m. UTC | #9
Hi Eelco,

I tested with the ixgbe driver, and it still works OK on my side.
Is the crash due to packet size > MTU?
In my case, I only tested with 64-byte packets.

Thanks
William

On Tue, Jun 11, 2019 at 8:02 AM William Tu <u9012063@gmail.com> wrote:
>
> Hi Eelco,
>
> Thanks for the trace.
>
> On Tue, Jun 11, 2019 at 6:52 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >
> > Hi William,
> >
> > Here are some more details, this is a port to port test (same port in as
> > out) using the following rule:
> >
> >    ovs-ofctl add-flow ovs_pvp_br0 "in_port=eno1,action=IN_PORT"
> >
> > > Sent packets at wire speed, and it crashed…
> >
> > (gdb) bt
> > #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
> > #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
> > #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
> > n=32, umemp=0x24cc790) at lib/xdpsock.c:32
> > #3  umem_elem_push_n (umemp=0x24cc790, n=32,
> > addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
> > #4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at
> > lib/netdev-afxdp.c:736
> > #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0,
> > batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at
> > lib/netdev-afxdp.c:763
> > #6  0x0000000000908041 in netdev_send (netdev=<optimized out>,
> > qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80,
> > concurrent_txq=concurrent_txq@entry=true)
> >      at lib/netdev.c:800
> > #7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port
> > (pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at
> > lib/dpif-netdev.c:4187
> > #8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets
> > (pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at
> > lib/dpif-netdev.c:4227
> > #9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets
> > (force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
> > #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010,
> > rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
> > #11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> > lib/dpif-netdev.c:5449
> > #12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at
> > lib/ovs-thread.c:352
> > #13 0x00007fbc6b0a12de in start_thread () from /lib64/libpthread.so.0
> > #14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6
> >
> > > After this crash, systemd restarted OVS, and it crashed again (I guess
> > > traffic was still flowing for a bit with the NORMAL rule installed):
> >
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> > qfill=0x0) at lib/netdev-afxdp.c:583
> > 583         rx->fd = xsk_socket__fd(xsk->xsk);
> > [Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]
> >
> > (gdb) bt
> > #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> > qfill=0x0) at lib/netdev-afxdp.c:583
> > #1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>,
> > batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at
> > lib/netdev.c:710
> > #2  0x00000000008dd1d3 in dp_netdev_process_rxq_port
> > (pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at
> > lib/dpif-netdev.c:4257
> > #3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> > lib/dpif-netdev.c:5449
> > #4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) at
> > lib/ovs-thread.c:352
> > #5  0x00007f47229732de in start_thread () from /lib64/libpthread.so.0
> > #6  0x00007f4722118a63 in clone () from /lib64/libc.so.6
> >
> > I did not further investigate, but it should be easy to replicate. This
> > is the same setup that worked fine with the v8 patchset for port to
> > port.
> > Next step was to verify PVP was fixed, but could not get there…
> > Cheers,
>
> I'm not able to reproduce it on my testbed using i40e, I will try
> using ixgbe today.
>
> btw, if you try skb-mode, does the crash still show up?
> Although skb-mode is much slower, so it might not trigger the issue.
>
> Regards,
> William
>
> >
> > Eelco
> >
> > On 8 Jun 2019, at 10:12, Eelco Chaudron wrote:
> >
> > > Hi William,
> > >
> > > This was still a draft email, and was not supposed to go out ;)
> > >
> > > My debug and build setup was a bit messed up and was having problems
> > > running GDB… I was (I’m) planning to continue getting some debug
> > > info on Tuesday after the public holiday here…
> > >
> > > But just to give you a heads up, it starts up fine with root access
> > > but it crashes during a simple Port to Port run with wire-speed
> > > traffic. Then it will run into a restart/crash loop.
> > >
> > > Will try to get you more details next week…
> > >
> > > Cheers,
> > >
> > > Eelco
> > >
> > >
> > > On 7 Jun 2019, at 23:33, William Tu wrote:
> > >
> > >> Hi Eelco,
> > >>
> > >> Thanks for the testing.
> > >>
> > >> On Fri, Jun 7, 2019 at 8:43 AM Eelco Chaudron <echaudro@redhat.com>
> > >> wrote:
> > >>>
> > >>> Hi William,
> > >>>
> > >>> No review or full test yet, just some observations…
> > >>>
> > >>> We run OVS as a non root user, which is causing OVS with XDP to
> > >>> fail:
> > >>
> > >> Right, XDP requires using root privilege.
> > >> I will add this in the documentation.
> > >
> > > Is this a hard requirement? As I do not remember running OVS as root
> > > before…
> > >
> > >>>
> > >>> 2019-06-07T09:14:20.628Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev:
> > >>> Datapath supports ct_orig_tuple
> > >>> 2019-06-07T09:14:20.628Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev:
> > >>> Datapath supports ct_orig_tuple6
> > >>> 2019-06-07T09:14:20.664Z|00025|dpif_netdev|INFO|PMD thread on
> > >>> numa_id:
> > >>> 0, core id: 21 created.
> > >>> 2019-06-07T09:14:20.664Z|00026|dpif_netdev|INFO|There are 1 pmd
> > >>> threads
> > >>> on numa node 0
> > >>> 2019-06-07T09:14:20.664Z|00027|netdev_afxdp|INFO|remove xdp program
> > >>> 2019-06-07T09:14:20.664Z|00028|netdev_afxdp|INFO|AF_XDP device eno1
> > >>> in
> > >>> DRV mode
> > >>> 2019-06-07T09:14:20.664Z|00029|netdev_afxdp|ERR|ERROR:
> > >>> setrlimit(RLIMIT_MEMLOCK): Operation not permitted
> > >>
> > >> This is due to not having root privilege, so OVS is not able to lock
> > >> the memory that the device driver uses to DMA packet buffers directly
> > >> into userspace.
> > >>
> > >> Can you try using root?
> > >>
> > >> Regards,
> > >> William
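
To make the failing step concrete: the "Operation not permitted" in the log
above comes from setrlimit(RLIMIT_MEMLOCK). Raising the hard limit requires
CAP_SYS_RESOURCE, so an unprivileged process gets EPERM. The sketch below
assumes the request is for an unlimited limit (RLIM_INFINITY). The function
name is hypothetical, and this is not the code from netdev-afxdp.c.

```c
#include <errno.h>
#include <sys/resource.h>

/* Tries to lift the locked-memory limit for this process.
 * Returns 0 on success, or the errno value from setrlimit() (typically
 * EPERM when the process lacks the privilege to raise the hard limit). */
static int
demo_raise_memlock_limit(void)
{
    struct rlimit unlimited = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &unlimited) != 0) {
        return errno;
    }
    return 0;
}
```

Whether full root is strictly needed, or a narrower set of capabilities would
do, is not settled in this thread.
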
> > >>
> > >>> 2019-06-07T09:14:20.664Z|00030|netdev_afxdp|INFO|xsk_configure_all
> > >>> configure queue 0 mode DRV
> > >>> 2019-06-07T09:14:20.672Z|00031|netdev_afxdp|ERR|xsk_socket__create
> > >>> failed (Operation not permitted) mode: DRV qid: 0
> > >>> 2019-06-07T09:14:20.686Z|00032|netdev_afxdp|ERR|failed to create
> > >>> AF_XDP
> > >>> socket on queue 0
> > >>> 2019-06-07T09:14:20.686Z|00033|netdev_afxdp|INFO|remove xdp program
> > >>> 2019-06-07T09:14:20.687Z|00034|netdev_afxdp|ERR|AF_XDP device eno1
> > >>> reconfig fails
> > >>> 2019-06-07T09:14:20.687Z|00035|dpif_netdev|ERR|Failed to set
> > >>> interface
> > >>> eno1 new configuration
> > >>>
> > >>> However when configuring this after startup it’s fine, but trying
> > >>> to
> > >>> restart OVS with this configuration results in a system core…
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 5 Jun 2019, at 22:47, William Tu wrote:
> > >>>
> > >>>> The patch introduces experimental AF_XDP support for OVS netdev.
> > >>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
> > >>>> socket type built upon the eBPF and XDP technology.  It aims to have
> > >>>> performance comparable to DPDK while cooperating better with the
> > >>>> existing kernel networking stack.  An AF_XDP socket receives and sends
> > >>>> packets from an eBPF/XDP program attached to the netdev, bypassing a
> > >>>> couple of the Linux kernel's subsystems.  As a result, an AF_XDP socket
> > >>>> shows much better performance than AF_PACKET.  For more details about
> > >>>> AF_XDP, please see the Linux kernel's
> > >>>> Documentation/networking/af_xdp.rst.  Note that by default, this
> > >>>> feature is not compiled in.
> > >>>>
> > >>>> Signed-off-by: William Tu <u9012063@gmail.com>
> > >>>> ---
> > >>>> v1->v2:
> > >>>> - add a list to maintain unused umem elements
> > >>>> - remove copy from rx umem to ovs internal buffer
> > >>>> - use hugetlb to reduce misses (not much difference)
> > >>>> - use pmd mode netdev in OVS (huge performance improvement)
> > >>>> - remove malloc dp_packet, instead put dp_packet in umem
> > >>>>
> > >>>> v2->v3:
> > >>>> - rebase on the OVS master, 7ab4b0653784
> > >>>>   ("configure: Check for more specific function to pull in pthread library.")
> > >>>> - remove the dependency on libbpf and dpif-bpf.
> > >>>>   instead, use the built-in XDP_ATTACH feature.
> > >>>> - data structure optimizations for better performance, see[1]
> > >>>> - more test cases support
> > >>>> v3:
> > >>>> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> > >>>>
> > >>>> v3->v4:
> > >>>> - Use AF_XDP API provided by libbpf
> > >>>> - Remove the dependency on XDP_ATTACH kernel patch set
> > >>>> - Add documentation, bpf.rst
> > >>>>
> > >>>> v4->v5:
> > >>>> - rebase to master
> > >>>> - remove rfc, squash all into a single patch
> > >>>> - add --enable-afxdp, so by default, AF_XDP is not compiled
> > >>>> - add options: xdpmode=drv,skb
> > >>>> - add multiple queue and multiple PMD support, with options: n_rxq
> > >>>> - improve documentation, rename bpf.rst to af_xdp.rst
> > >>>>
> > >>>> v5->v6
> > >>>> - rebase to master, commit 0cdd5b13de91b98
> > >>>> - address errors from sparse and clang
> > >>>> - pass travis-ci test
> > >>>> - address feedback from Ben
> > >>>> - fix issues reported by 0-day robot
> > >>>> - improved documentation
> > >>>>
> > >>>> v6-v7
> > >>>> - rebase to master, commit abf11558c1515bf3b1
> > >>>> - address feedback from Ilya, Ben, and Eelco, see:
> > >>>>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> > >>>> - add XDP mode change, implement get/set_config, reconfigure
> > >>>> - Fix reconfiguration/crash issue caused by libbpf, see patch:
> > >>>>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> > >>>> - perf optimization for batching umem_push/pop
> > >>>> - perf optimization for batching kick_tx
> > >>>> - test build with dpdk
> > >>>> - fix/refactor atomic operation
> > >>>> - make AF_XDP x86 specific, otherwise fail at build time
> > >>>> - lots of code refactoring
> > >>>> - add PVP setup in documentation
> > >>>>
> > >>>> v7-v8:
> > >>>> - Address feedback from Ilya at:
> > >>>>   https://patchwork.ozlabs.org/patch/1095019/
> > >>>> - add netdev-linux-private.h
> > >>>> - fix afxdp reconfigure issue
> > >>>> - sort include headers
> > >>>> - remove unnecessary OVS_UNUSED
> > >>>> - coding style fixes
> > >>>> - error case handling and memory leak
> > >>>>
> > >>>> v8-v9:
> > >>>> - rebase to master 180bbbed3a3867d52
> > >>>> - Address review feedback from Ben, Ilya and Eelco, at:
> > >>>>   https://patchwork.ozlabs.org/patch/1097740/
> > >>>> - == From Ilya ==
> > >>>> - Optimize the reconfiguration logic
> > >>>> - Implement .rxq_recv and .send for afxdp
> > >>>> - Remove system-afxdp-traffic.at, reuse existing code
> > >>>> - Use Ilya's rdtsc code
> > >>>> - remove --disable-system
> > >>>> - == From Eelco ==
> > >>>> - Fix bug when removing br0:
> > >>>>   util(revalidator49)|EMER|lib/poll-loop.c:111: assertion !fd != !wevent failed
> > >>>> - Fix bug and use default values from libbpf, e.g. XSK_RING_PROD__DEFAULT...
> > >>>> - Clear xdp program when receiving a signal, e.g. ctrl+c
> > >>>> - Add options to vswitch.xml, set xdpmode default to skb-mode
> > >>>> - No support for ARM and PPC, now x86_64 only
> > >>>> - remove redundant header includes and function/macro definitions
> > >>>> - remove some ifdef HAVE_AF_XDP
> > >>>> - == From others/both about afxdp rx and tx ==
> > >>>> - Several umem push/pop error handling improvements/fixes
> > >>>> - add lock to address concurrent_txq case
> > >>>> - improve error handling
> > >>>> - add stats
> > >>>> - Things that are not done yet
> > >>>> - MTU limitation
> > >>>> - n_txq_desc/n_rxq_desc option.
> > >>>>
> > >>>> v9-v10
> > >>>> - remove x86_64 limitation, suggested by Ben and Eelco
> > >>>> - add xmalloc_pagealign, free_pagealign
> > >>>> - minor refactor
> > >>>>
> > >>>> v10-v11
> > >>>> - address feedback from Ilya at
> > >>>>   https://patchwork.ozlabs.org/patch/1106495/
> > >>>> - fix typos, and some refactoring
> > >>>> - refactor existing code and introduce xmalloc pagealign
> > >>>> - fix a couple of error handling cases
> > >>>> - allocate per-txq lock
> > >>>> - dynamically allocate xsk array
> > >>>> - fix cycle_counter_update() for non-x86/non-linux case
> > >>>> ---
> > >>>>  Documentation/automake.mk             |   1 +
> > >>>>  Documentation/index.rst               |   1 +
> > >>>>  Documentation/intro/install/afxdp.rst | 433 +++++++++++++++++
> > >>>>  Documentation/intro/install/index.rst |   1 +
> > >>>>  acinclude.m4                          |  35 ++
> > >>>>  configure.ac                          |   1 +
> > >>>>  lib/automake.mk                       |  14 +
> > >>>>  lib/dp-packet.c                       |  28 ++
> > >>>>  lib/dp-packet.h                       |  18 +-
> > >>>>  lib/dpif-netdev-perf.h                |  26 +
> > >>>>  lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
> > >>>>  lib/netdev-afxdp.h                    |  74 +++
> > >>>>  lib/netdev-linux-private.h            | 139 ++++++
> > >>>>  lib/netdev-linux.c                    | 121 ++---
> > >>>>  lib/netdev-provider.h                 |   3 +
> > >>>>  lib/netdev.c                          |  11 +
> > >>>>  lib/spinlock.h                        |  70 +++
> > >>>>  lib/util.c                            |  92 +++-
> > >>>>  lib/util.h                            |   5 +
> > >>>>  lib/xdpsock.c                         | 170 +++++++
> > >>>>  lib/xdpsock.h                         | 101 ++++
> > >>>>  tests/automake.mk                     |  16 +
> > >>>>  tests/system-afxdp-macros.at          |  20 +
> > >>>>  tests/system-afxdp-testsuite.at       |  26 +
> > >>>>  vswitchd/vswitch.xml                  |  15 +
> > >>>>  25 files changed, 2204 insertions(+), 108 deletions(-)
> > >>>>  create mode 100644 Documentation/intro/install/afxdp.rst
> > >>>>  create mode 100644 lib/netdev-afxdp.c
> > >>>>  create mode 100644 lib/netdev-afxdp.h
> > >>>>  create mode 100644 lib/netdev-linux-private.h
> > >>>>  create mode 100644 lib/spinlock.h
> > >>>>  create mode 100644 lib/xdpsock.c
> > >>>>  create mode 100644 lib/xdpsock.h
> > >>>>  create mode 100644 tests/system-afxdp-macros.at
> > >>>>  create mode 100644 tests/system-afxdp-testsuite.at
> > >>>>
> > >>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > >>>> index 082438e09a33..11cc59efc881 100644
> > >>>> --- a/Documentation/automake.mk
> > >>>> +++ b/Documentation/automake.mk
> > >>>> @@ -10,6 +10,7 @@ DOC_SOURCE = \
> > >>>>       Documentation/intro/why-ovs.rst \
> > >>>>       Documentation/intro/install/index.rst \
> > >>>>       Documentation/intro/install/bash-completion.rst \
> > >>>> +     Documentation/intro/install/afxdp.rst \
> > >>>>       Documentation/intro/install/debian.rst \
> > >>>>       Documentation/intro/install/documentation.rst \
> > >>>>       Documentation/intro/install/distributions.rst \
> > >>>> diff --git a/Documentation/index.rst b/Documentation/index.rst
> > >>>> index 46261235c732..aa9e7c49f179 100644
> > >>>> --- a/Documentation/index.rst
> > >>>> +++ b/Documentation/index.rst
> > >>>> @@ -59,6 +59,7 @@ vSwitch? Start here.
> > >>>>    :doc:`intro/install/windows` |
> > >>>>    :doc:`intro/install/xenserver` |
> > >>>>    :doc:`intro/install/dpdk` |
> > >>>> +  :doc:`intro/install/afxdp` |
> > >>>>    :doc:`Installation FAQs <faq/releases>`
> > >>>>
> > >>>>  - **Tutorials:** :doc:`tutorials/faucet` |
> > >>>> diff --git a/Documentation/intro/install/afxdp.rst
> > >>>> b/Documentation/intro/install/afxdp.rst
> > >>>> new file mode 100644
> > >>>> index 000000000000..554964396353
> > >>>> --- /dev/null
> > >>>> +++ b/Documentation/intro/install/afxdp.rst
> > >>>> @@ -0,0 +1,433 @@
> > >>>> +..
> > >>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > >>>> +      not use this file except in compliance with the License. You may obtain
> > >>>> +      a copy of the License at
> > >>>> +
> > >>>> +          http://www.apache.org/licenses/LICENSE-2.0
> > >>>> +
> > >>>> +      Unless required by applicable law or agreed to in writing, software
> > >>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > >>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > >>>> +      License for the specific language governing permissions and limitations
> > >>>> +      under the License.
> > >>>> +
> > >>>> +      Convention for heading levels in Open vSwitch documentation:
> > >>>> +
> > >>>> +      =======  Heading 0 (reserved for the title in a document)
> > >>>> +      -------  Heading 1
> > >>>> +      ~~~~~~~  Heading 2
> > >>>> +      +++++++  Heading 3
> > >>>> +      '''''''  Heading 4
> > >>>> +
> > >>>> +      Avoid deeper levels because they do not render well.
> > >>>> +
> > >>>> +
> > >>>> +========================
> > >>>> +Open vSwitch with AF_XDP
> > >>>> +========================
> > >>>> +
> > >>>> +This document describes how to build and install Open vSwitch using
> > >>>> +the AF_XDP netdev.
> > >>>> +
> > >>>> +.. warning::
> > >>>> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> > >>>> +  and it is not compiled in by default.
> > >>>> +
> > >>>> +
> > >>>> +Introduction
> > >>>> +------------
> > >>>> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> > >>>> +built upon the eBPF and XDP technology.  It aims to have performance
> > >>>> +comparable to DPDK while cooperating better with the existing kernel
> > >>>> +networking stack.  An AF_XDP socket receives and sends packets from an
> > >>>> +eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
> > >>>> +kernel's subsystems.  As a result, an AF_XDP socket shows much better
> > >>>> +performance than AF_PACKET.  For more details about AF_XDP, please see the
> > >>>> +Linux kernel's Documentation/networking/af_xdp.rst.
> > >>>> +
> > >>>> +
> > >>>> +AF_XDP Netdev
> > >>>> +-------------
> > >>>> +OVS has several netdev types, e.g., system, tap, or
> > >>>> +dpdk.  The AF_XDP feature adds a new netdev type called
> > >>>> +"afxdp" and implements its configuration, packet reception,
> > >>>> +and transmit functions.  Since the AF_XDP socket, called xsk,
> > >>>> +operates in userspace, once ovs-vswitchd receives packets
> > >>>> +from xsk, the afxdp netdev re-uses the existing userspace
> > >>>> +dpif-netdev datapath.  As a result, most of the packet processing
> > >>>> +happens in userspace instead of in the Linux kernel.
> > >>>> +
> > >>>> +::
> > >>>> +
> > >>>> +              |   +-------------------+
> > >>>> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> > >>>> +              |   +-------------------+
> > >>>> +              |   |      ofproto      |<-->OpenFlow controllers
> > >>>> +              |   +--------+-+--------+
> > >>>> +              |   | netdev | |ofproto-|
> > >>>> +    userspace |   +--------+ |  dpif  |
> > >>>> +              |   | afxdp  | +--------+
> > >>>> +              |   | netdev | |  dpif  |
> > >>>> +              |   +---||---+ +--------+
> > >>>> +              |       ||     |  dpif- |
> > >>>> +              |       ||     | netdev |
> > >>>> +              |_      ||     +--------+
> > >>>> +                      ||
> > >>>> +               _  +---||-----+--------+
> > >>>> +              |   | AF_XDP prog +     |
> > >>>> +       kernel |   |   xsk_map         |
> > >>>> +              |_  +--------||---------+
> > >>>> +                           ||
> > >>>> +                        physical
> > >>>> +                           NIC
> > >>>> +
> > >>>> +
> > >>>> +Build requirements
> > >>>> +------------------
> > >>>> +
> > >>>> +In addition to the requirements described in :doc:`general`, building Open
> > >>>> +vSwitch with AF_XDP will require the following:
> > >>>> +
> > >>>> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> > >>>> +
> > >>>> +- Linux kernel XDP support, with the following options (required)
> > >>>> +
> > >>>> +  * CONFIG_BPF=y
> > >>>> +
> > >>>> +  * CONFIG_BPF_SYSCALL=y
> > >>>> +
> > >>>> +  * CONFIG_XDP_SOCKETS=y
> > >>>> +
> > >>>> +
> > >>>> +- The following optional Kconfig options are also recommended, but not
> > >>>> +  required:
> > >>>> +
> > >>>> +  * CONFIG_BPF_JIT=y (Performance)
> > >>>> +
> > >>>> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> > >>>> +
> > >>>> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> > >>>> +
> > >>>> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> > >>>> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> > >>>> +  This is an OVS-independent benchmark tool for AF_XDP.
> > >>>> +  It makes sure your basic kernel requirements are met for AF_XDP.
> > >>>> +
> > >>>> +
> > >>>> +Installing
> > >>>> +----------
> > >>>> +For OVS to use the AF_XDP netdev, it has to be configured with LIBBPF support.
> > >>>> +First, clone a recent version of the Linux bpf-next tree::
> > >>>> +
> > >>>> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> > >>>> +
> > >>>> +Second, go into the Linux source directory and build libbpf in the tools
> > >>>> +directory::
> > >>>> +
> > >>>> +  cd bpf-next/
> > >>>> +  cd tools/lib/bpf/
> > >>>> +  make && make install
> > >>>> +  make install_headers
> > >>>> +
> > >>>> +.. note::
> > >>>> +   Make sure xsk.h and bpf.h are installed in the system's include path,
> > >>>> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> > >>>> +
> > >>>> +Make sure the libbpf.so is installed correctly::
> > >>>> +
> > >>>> +  ldconfig
> > >>>> +  ldconfig -p | grep libbpf
> > >>>> +
> > >>>> +Third, ensure the standard OVS requirements are installed and
> > >>>> +bootstrap/configure the package::
> > >>>> +
> > >>>> +  ./boot.sh && ./configure --enable-afxdp
> > >>>> +
> > >>>> +Finally, build and install OVS::
> > >>>> +
> > >>>> +  make && make install
> > >>>> +
> > >>>> +To kick start end-to-end autotesting::
> > >>>> +
> > >>>> +  uname -a # make sure the kernel is 5.0 or later
> > >>>> +  make check-afxdp TESTSUITEFLAGS='1'
> > >>>> +
> > >>>> +If a test case fails, check the log at::
> > >>>> +
> > >>>> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> > >>>> +
> > >>>> +
> > >>>> +Setup AF_XDP netdev
> > >>>> +-------------------
> > >>>> +Before running OVS with AF_XDP, make sure libbpf and libelf are
> > >>>> +set up correctly::
> > >>>> +
> > >>>> +  ldd vswitchd/ovs-vswitchd
> > >>>> +
> > >>>> +Open vSwitch should be started using the userspace datapath as described
> > >>>> +in :doc:`general`::
> > >>>> +
> > >>>> +  ovs-vswitchd ...
> > >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> > >>>> +
> > >>>> +Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4)
> > >>>> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > >>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> > >>>> +
> > >>>> +  ethtool -L enp2s0 combined 1
> > >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > >>>> +    options:n_rxq=1 options:xdpmode=drv \
> > >>>> +    other_config:pmd-rxq-affinity="0:4"
> > >>>> +
> > >>>> +Or, use 4 pmds/cores (cores 1-4) and 4 queues by doing::
> > >>>> +
> > >>>> +  ethtool -L enp2s0 combined 4
> > >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
> > >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > >>>> +    options:n_rxq=4 options:xdpmode=drv \
> > >>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > >>>> +
> > >>>> +.. note::
> > >>>> +   pmd-rxq-affinity is optional. If not specified, the system assigns queues to PMD threads automatically.
> > >>>> +
> > >>>> +To validate that the bridge has been instantiated successfully, run::
> > >>>> +
> > >>>> +  ovs-vsctl show
> > >>>> +
> > >>>> +The output should show something like::
> > >>>> +
> > >>>> +  Port "ens802f0"
> > >>>> +   Interface "ens802f0"
> > >>>> +      type: afxdp
> > >>>> +      options: {n_rxq="1", xdpmode=drv}
> > >>>> +
> > >>>> +Otherwise, enable debugging by::
> > >>>> +
> > >>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
> > >>>> +
> > >>>> +
> > >>>> +References
> > >>>> +----------
> > >>>> +Most of the design details are described in the paper presented at
> > >>>> +Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> > >>>> +section 4, and slides[2][4].
> > >>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> > >>>> +to current and future AF_XDP work.
> > >>>> +
> > >>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > >>>> +
> > >>>> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > >>>> +
> > >>>> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > >>>> +
> > >>>> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > >>>> +
> > >>>> +
> > >>>> +Performance Tuning
> > >>>> +------------------
> > >>>> +The name of the game is to keep your CPU running in userspace, allowing PMD
> > >>>> +to keep polling the AF_XDP queues without any interference from the kernel.
> > >>>> +
> > >>>> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> > >>>> +   running cores, device plug-in slot)
> > >>>> +
> > >>>> +#. Isolate your CPU by setting isolcpus in the GRUB configuration.
> > >>>> +
> > >>>> +#. IRQs should not be assigned to a core running a pmd thread.
> > >>>> +
> > >>>> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> > >>>> +
> > >>>> +
> > >>>> +Debugging performance issue
> > >>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >>>> +While running the traffic, use the Linux perf tool to see where your CPU
> > >>>> +spends its cycles::
> > >>>> +
> > >>>> +  cd bpf-next/tools/perf
> > >>>> +  make
> > >>>> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> > >>>> +  ./perf report
> > >>>> +
> > >>>> +Measure your system call rate by doing::
> > >>>> +
> > >>>> +  pstree -p `pidof ovs-vswitchd`
> > >>>> +  strace -c -p <your pmd's PID>
> > >>>> +
> > >>>> +Or, use OVS pmd tool::
> > >>>> +
> > >>>> +  ovs-appctl dpif-netdev/pmd-stats-show
> > >>>> +
> > >>>> +
> > >>>> +Example Script
> > >>>> +--------------
> > >>>> +
> > >>>> +Below is a script using namespaces and veth peer::
> > >>>> +
> > >>>> +  #!/bin/bash
> > >>>> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> > >>>> +    --disable-system --detach
> > >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> > >>>> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> > >>>> +    fail-mode=secure datapath_type=netdev
> > >>>> +  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
> > >>>> +
> > >>>> +  ip netns add at_ns0
> > >>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
> > >>>> +
> > >>>> +  ip link add p0 type veth peer name afxdp-p0
> > >>>> +  ip link set p0 netns at_ns0
> > >>>> +  ip link set dev afxdp-p0 up
> > >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > >>>> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> > >>>> +
> > >>>> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> > >>>> +  ip addr add "10.1.1.1/24" dev p0
> > >>>> +  ip link set dev p0 up
> > >>>> +  NS_EXEC_HEREDOC
> > >>>> +
> > >>>> +  ip netns add at_ns1
> > >>>> +  ip link add p1 type veth peer name afxdp-p1
> > >>>> +  ip link set p1 netns at_ns1
> > >>>> +  ip link set dev afxdp-p1 up
> > >>>> +
> > >>>> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> > >>>> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> > >>>> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> > >>>> +  ip addr add "10.1.1.2/24" dev p1
> > >>>> +  ip link set dev p1 up
> > >>>> +  NS_EXEC_HEREDOC
> > >>>> +
> > >>>> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> > >>>> +
> > >>>> +
> > >>>> +Limitations/Known Issues
> > >>>> +------------------------
> > >>>> +#. The device's NUMA ID is always 0; a way to find the NUMA ID from a netdev is needed.
> > >>>> +#. No QoS support, because the AF_XDP netdev bypasses the Linux TC layer. A possible
> > >>>> +   workaround is to use the OpenFlow meter action.
> > >>>> +#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
> > >>>> +#. Most of the tests were done using a single i40e port. Multiple ports and
> > >>>> +   the ixgbe driver also need to be tested.
> > >>>> +#. No latency test results yet (TODO item).
> > >>>> +
> > >>>> +
> > >>>> +PVP using tap device
> > >>>> +--------------------
> > >>>> +Assume you have enp2s0 as the physical NIC, and a tap device connected to the VM.
> > >>>> +First, start OVS, then add the physical port::
> > >>>> +
> > >>>> +  ethtool -L enp2s0 combined 1
> > >>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > >>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> > >>>> +    options:n_rxq=1 options:xdpmode=drv \
> > >>>> +    other_config:pmd-rxq-affinity="0:4"
> > >>>> +
> > >>>> +Start a VM with virtio and tap device::
> > >>>> +
> > >>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > >>>> +    -m 4096 \
> > >>>> +    -cpu host,+x2apic -enable-kvm \
> > >>>> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> > >>>> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> > >>>> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> > >>>> +    -object memory-backend-file,id=mem,size=4096M,\
> > >>>> +      mem-path=/dev/hugepages,share=on \
> > >>>> +    -numa node,memdev=mem -mem-prealloc -smp 2
> > >>>> +
> > >>>> +Create OpenFlow rules::
> > >>>> +
> > >>>> +  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
> > >>>> +  ovs-ofctl del-flows br0
> > >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > >>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > >>>> +
> > >>>> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> > >>>> +
> > >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > >>>> +
> > >>>> +The measured throughput is around 1.6Mpps.
> > >>>> +It is limited by the kernel's tap interface, which requires copying
> > >>>> +each packet from the umem buffer in userspace into the kernel.
> > >>>> +
> > >>>> +
> > >>>> +PVP using vhostuser device
> > >>>> +--------------------------
> > >>>> +First, build OVS with DPDK and AFXDP::
> > >>>> +
> > >>>> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> > >>>> +  make -j4 && make install
> > >>>> +
> > >>>> +Create a vhost-user port from OVS::
> > >>>> +
> > >>>> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> > >>>> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> > >>>> +    other_config:pmd-cpu-mask=0xfff
> > >>>> +  ovs-vsctl add-port br0 vhost-user-1 \
> > >>>> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> > >>>> +
> > >>>> +Start VM using vhost-user mode::
> > >>>> +
> > >>>> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> > >>>> +   -m 4096 \
> > >>>> +   -cpu host,+x2apic -enable-kvm \
> > >>>> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> > >>>> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> > >>>> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> > >>>> +      netdev=mynet1,mq=on,vectors=10 \
> > >>>> +   -object memory-backend-file,id=mem,size=4096M,\
> > >>>> +      mem-path=/dev/hugepages,share=on \
> > >>>> +   -numa node,memdev=mem -mem-prealloc -smp 2
> > >>>> +
> > >>>> +Set up the OpenFlow rules::
> > >>>> +
> > >>>> +  ovs-ofctl del-flows br0
> > >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> > >>>> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> > >>>> +
> > >>>> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> > >>>> +
> > >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> > >>>> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> > >>>> +
> > >>>> +Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
> > >>>> +
> > >>>> +
> > >>>> +PCP container using veth
> > >>>> +------------------------
> > >>>> +Create namespace and veth peer devices::
> > >>>> +
> > >>>> +  ip netns add at_ns0
> > >>>> +  ip link add p0 type veth peer name afxdp-p0
> > >>>> +  ip link set p0 netns at_ns0
> > >>>> +  ip link set dev afxdp-p0 up
> > >>>> +  ip netns exec at_ns0 ip link set dev p0 up
> > >>>> +
> > >>>> +Attach the veth port to br0 (linux kernel mode)::
> > >>>> +
> > >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > >>>> +    set interface afxdp-p0 options:n_rxq=1
> > >>>> +
> > >>>> +Or, use AF_XDP with skb mode::
> > >>>> +
> > >>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > >>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> > >>>> +
> > >>>> +Set up the OpenFlow rules::
> > >>>> +
> > >>>> +  ovs-ofctl del-flows br0
> > >>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> > >>>> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> > >>>> +
> > >>>> +In the namespace, drop or bounce back the packets::
> > >>>> +
> > >>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> > >>>> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> > >>>> +
> > >>>> +Performance: for RX_DROP: 800Kpps, TX: 700Kpps
> > >>>> +
> > >>>> +
> > >>>> +Bug Reporting
> > >>>> +-------------
> > >>>> +
> > >>>> +Please report problems to dev@openvswitch.org.
> > >>>> diff --git a/Documentation/intro/install/index.rst
> > >>>> b/Documentation/intro/install/index.rst
> > >>>> index 3193c736cf17..c27a9c9d16ff 100644
> > >>>> --- a/Documentation/intro/install/index.rst
> > >>>> +++ b/Documentation/intro/install/index.rst
> > >>>> @@ -45,6 +45,7 @@ Installation from Source
> > >>>>     xenserver
> > >>>>     userspace
> > >>>>     dpdk
> > >>>> +   afxdp
> > >>>>
> > >>>>  Installation from Packages
> > >>>>  --------------------------
> > >>>> diff --git a/acinclude.m4 b/acinclude.m4
> > >>>> index cf9cc8b8b0de..721653ab0ec0 100644
> > >>>> --- a/acinclude.m4
> > >>>> +++ b/acinclude.m4
> > >>>> @@ -236,6 +236,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
> > >>>>    ])
> > >>>>  ])
> > >>>>
> > >>>> +dnl OVS_CHECK_LINUX_AF_XDP
> > >>>> +dnl
> > >>>> +dnl Check both Linux kernel AF_XDP and libbpf support
> > >>>> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> > >>>> +  AC_ARG_ENABLE([afxdp],
> > >>>> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
> > >>>> +                [], [enable_afxdp=no])
> > >>>> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> > >>>> +  if test "$enable_afxdp" != yes; then
> > >>>> +    AC_MSG_RESULT([no])
> > >>>> +    AF_XDP_ENABLE=false
> > >>>> +  else
> > >>>> +    AC_MSG_RESULT([yes])
> > >>>> +    AF_XDP_ENABLE=true
> > >>>> +
> > >>>> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> > >>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> > >>>> +
> > >>>> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> > >>>> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> > >>>> +
> > >>>> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> > >>>> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> > >>>> +
> > >>>> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> > >>>> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
> > >>>> +
> > >>>> +    AC_DEFINE([HAVE_AF_XDP], [1],
> > >>>> +              [Define to 1 if AF_XDP support is available and enabled.])
> > >>>> +    LIBBPF_LDADD=" -lbpf -lelf"
> > >>>> +    AC_SUBST([LIBBPF_LDADD])
> > >>>> +  fi
> > >>>> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> > >>>> +])
> > >>>> +
> > >>>>  dnl OVS_CHECK_DPDK
> > >>>>  dnl
> > >>>>  dnl Configure DPDK source tree
> > >>>> diff --git a/configure.ac b/configure.ac
> > >>>> index 2dbe9a9178e3..9e23e1c6958c 100644
> > >>>> --- a/configure.ac
> > >>>> +++ b/configure.ac
> > >>>> @@ -99,6 +99,7 @@ OVS_CHECK_SPHINX
> > >>>>  OVS_CHECK_DOT
> > >>>>  OVS_CHECK_IF_DL
> > >>>>  OVS_CHECK_STRTOK_R
> > >>>> +OVS_CHECK_LINUX_AF_XDP
> > >>>>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
> > >>>>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
> > >>>>    [], [], [[#include <sys/stat.h>]])
> > >>>> diff --git a/lib/automake.mk b/lib/automake.mk
> > >>>> index cc5dccf39d6b..b31e28f6e1f5 100644
> > >>>> --- a/lib/automake.mk
> > >>>> +++ b/lib/automake.mk
> > >>>> @@ -14,6 +14,10 @@ if WIN32
> > >>>>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
> > >>>>  endif
> > >>>>
> > >>>> +if HAVE_AF_XDP
> > >>>> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> > >>>> +endif
> > >>>> +
> > >>>>  lib_libopenvswitch_la_LDFLAGS = \
> > >>>>          $(OVS_LTINFO) \
> > >>>>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> > >>>> @@ -392,6 +396,7 @@ lib_libopenvswitch_la_SOURCES += \
> > >>>>       lib/if-notifier.h \
> > >>>>       lib/netdev-linux.c \
> > >>>>       lib/netdev-linux.h \
> > >>>> +     lib/netdev-linux-private.h \
> > >>>>       lib/netdev-tc-offloads.c \
> > >>>>       lib/netdev-tc-offloads.h \
> > >>>>       lib/netlink-conntrack.c \
> > >>>> @@ -409,6 +414,15 @@ lib_libopenvswitch_la_SOURCES += \
> > >>>>       lib/tc.h
> > >>>>  endif
> > >>>>
> > >>>> +if HAVE_AF_XDP
> > >>>> +lib_libopenvswitch_la_SOURCES += \
> > >>>> +     lib/xdpsock.c \
> > >>>> +     lib/xdpsock.h \
> > >>>> +     lib/netdev-afxdp.c \
> > >>>> +     lib/netdev-afxdp.h \
> > >>>> +     lib/spinlock.h
> > >>>> +endif
> > >>>> +
> > >>>>  if DPDK_NETDEV
> > >>>>  lib_libopenvswitch_la_SOURCES += \
> > >>>>       lib/dpdk.c \
> > >>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > >>>> index 0976a35e758b..e6a7947076b4 100644
> > >>>> --- a/lib/dp-packet.c
> > >>>> +++ b/lib/dp-packet.c
> > >>>> @@ -19,6 +19,7 @@
> > >>>>  #include <string.h>
> > >>>>
> > >>>>  #include "dp-packet.h"
> > >>>> +#include "netdev-afxdp.h"
> > >>>>  #include "netdev-dpdk.h"
> > >>>>  #include "openvswitch/dynamic-string.h"
> > >>>>  #include "util.h"
> > >>>> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
> > >>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> > >>>>  }
> > >>>>
> > >>>> +#if HAVE_AF_XDP
> > >>>> +/* Initialize 'b' as an empty dp_packet that contains
> > >>>> + * memory starting at AF_XDP umem base.
> > >>>> + */
> > >>>> +void
> > >>>> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> > >>>> +{
> > >>>> +    dp_packet_set_base(b, base);
> > >>>> +    dp_packet_set_data(b, base);
> > >>>> +    dp_packet_set_size(b, 0);
> > >>>> +
> > >>>> +    dp_packet_set_allocated(b, allocated);
> > >>>> +    b->source = DPBUF_AFXDP;
> > >>>> +    dp_packet_reset_offsets(b);
> > >>>> +    pkt_metadata_init(&b->md, 0);
> > >>>> +    dp_packet_reset_cutlen(b);
> > >>>> +    dp_packet_reset_offload(b);
> > >>>> +    b->packet_type = htonl(PT_ETH);
> > >>>> +}
> > >>>> +#endif
> > >>>> +
> > >>>>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
> > >>>>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
> > >>>>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> > >>>> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
> > >>>>               * created as a dp_packet */
> > >>>>              free_dpdk_buf((struct dp_packet*) b);
> > >>>>  #endif
> > >>>> +        } else if (b->source == DPBUF_AFXDP) {
> > >>>> +            free_afxdp_buf(b);
> > >>>>          }
> > >>>>      }
> > >>>>  }
> > >>>> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
> > >>>>      case DPBUF_STACK:
> > >>>>          OVS_NOT_REACHED();
> > >>>>
> > >>>> +    case DPBUF_AFXDP:
> > >>>> +        OVS_NOT_REACHED();
> > >>>> +
> > >>>>      case DPBUF_STUB:
> > >>>>          b->source = DPBUF_MALLOC;
> > >>>>          new_base = xmalloc(new_allocated);
> > >>>> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
> > >>>>  {
> > >>>>      void *p;
> > >>>>      ovs_assert(b->source != DPBUF_DPDK);
> > >>>> +    ovs_assert(b->source != DPBUF_AFXDP);
> > >>>>
> > >>>>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
> > >>>>          p = dp_packet_data(b);
> > >>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > >>>> index a5e9ade1244a..e3438226e360 100644
> > >>>> --- a/lib/dp-packet.h
> > >>>> +++ b/lib/dp-packet.h
> > >>>> @@ -25,6 +25,7 @@
> > >>>>  #include <rte_mbuf.h>
> > >>>>  #endif
> > >>>>
> > >>>> +#include "netdev-afxdp.h"
> > >>>>  #include "netdev-dpdk.h"
> > >>>>  #include "openvswitch/list.h"
> > >>>>  #include "packets.h"
> > >>>> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
> > >>>>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
> > >>>>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
> > >>>>                                  */
> > >>>> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
> > >>>>  };
> > >>>>
> > >>>>  #define DP_PACKET_CONTEXT_SIZE 64
> > >>>> @@ -89,6 +91,13 @@ struct dp_packet {
> > >>>>      };
> > >>>>  };
> > >>>>
> > >>>> +#if HAVE_AF_XDP
> > >>>> +struct dp_packet_afxdp {
> > >>>> +    struct umem_pool *mpool;
> > >>>> +    struct dp_packet packet;
> > >>>> +};
> > >>>> +#endif
> > >>>> +
> > >>>>  static inline void *dp_packet_data(const struct dp_packet *);
> > >>>>  static inline void dp_packet_set_data(struct dp_packet *, void *);
> > >>>>  static inline void *dp_packet_base(const struct dp_packet *);
> > >>>> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
> > >>>>  void dp_packet_use(struct dp_packet *, void *, size_t);
> > >>>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
> > >>>>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> > >>>> -
> > >>>> +#if HAVE_AF_XDP
> > >>>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> > >>>> +#endif
> > >>>>  void dp_packet_init_dpdk(struct dp_packet *);
> > >>>>
> > >>>>  void dp_packet_init(struct dp_packet *, size_t);
> > >>>> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
> > >>>>              return;
> > >>>>          }
> > >>>>
> > >>>> +        if (b->source == DPBUF_AFXDP) {
> > >>>> +            free_afxdp_buf(b);
> > >>>> +            return;
> > >>>> +        }
> > >>>> +
> > >>>>          dp_packet_uninit(b);
> > >>>>          free(b);
> > >>>>      }
> > >>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > >>>> index 859c05613ddf..6b6dfda7db1c 100644
> > >>>> --- a/lib/dpif-netdev-perf.h
> > >>>> +++ b/lib/dpif-netdev-perf.h
> > >>>> @@ -21,6 +21,7 @@
> > >>>>  #include <stddef.h>
> > >>>>  #include <stdint.h>
> > >>>>  #include <string.h>
> > >>>> +#include <time.h>
> > >>>>  #include <math.h>
> > >>>>
> > >>>>  #ifdef DPDK_NETDEV
> > >>>> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
> > >>>>      char *log_reason;
> > >>>>  };
> > >>>>
> > >>>> +#ifdef __linux__
> > >>>> +static inline uint64_t
> > >>>> +rdtsc_syscall(struct pmd_perf_stats *s)
> > >>>> +{
> > >>>> +    struct timespec val;
> > >>>> +    uint64_t v;
> > >>>> +
> > >>>> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> > >>>> +       return s->last_tsc;
> > >>>> +    }
> > >>>> +
> > >>>> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> > >>>> +    v += (uint64_t) val.tv_nsec;
> > >>>> +
> > >>>> +    return s->last_tsc = v;
> > >>>> +}
> > >>>> +#endif
> > >>>> +
> > >>>>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> > >>>>   * These functions are intended to be invoked in the context of pmd threads. */
> > >>>>
> > >>>> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
> > >>>>  {
> > >>>>  #ifdef DPDK_NETDEV
> > >>>>      return s->last_tsc = rte_get_tsc_cycles();
> > >>>> +#elif !defined(_MSC_VER) && defined(__x86_64__)
> > >>>> +    uint32_t h, l;
> > >>>> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> > >>>> +
> > >>>> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> > >>>> +#elif defined(__linux__)
> > >>>> +    return rdtsc_syscall(s);
> > >>>>  #else
> > >>>>      return s->last_tsc = 0;
> > >>>>  #endif
> > >>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> > >>>> new file mode 100644
> > >>>> index 000000000000..a6543e8f5126
> > >>>> --- /dev/null
> > >>>> +++ b/lib/netdev-afxdp.c
> > >>>> @@ -0,0 +1,891 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing, software
> > >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > >>>> + * See the License for the specific language governing permissions and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#include "netdev-linux-private.h"
> > >>>> +#include "netdev-linux.h"
> > >>>> +#include "netdev-afxdp.h"
> > >>>> +
> > >>>> +#include <errno.h>
> > >>>> +#include <inttypes.h>
> > >>>> +#include <linux/rtnetlink.h>
> > >>>> +#include <linux/if_xdp.h>
> > >>>> +#include <net/if.h>
> > >>>> +#include <stdlib.h>
> > >>>> +#include <sys/resource.h>
> > >>>> +#include <sys/socket.h>
> > >>>> +#include <sys/types.h>
> > >>>> +#include <unistd.h>
> > >>>> +
> > >>>> +#include "dp-packet.h"
> > >>>> +#include "dpif-netdev.h"
> > >>>> +#include "openvswitch/dynamic-string.h"
> > >>>> +#include "openvswitch/vlog.h"
> > >>>> +#include "packets.h"
> > >>>> +#include "socket-util.h"
> > >>>> +#include "spinlock.h"
> > >>>> +#include "util.h"
> > >>>> +#include "xdpsock.h"
> > >>>> +
> > >>>> +#ifndef SOL_XDP
> > >>>> +#define SOL_XDP 283
> > >>>> +#endif
> > >>>> +
> > >>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > >>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > >>>> +
> > >>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> > >>>> +#define UMEM2XPKT(base, i) \
> > >>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> > >>>> +                               i * sizeof(struct dp_packet_afxdp))
> > >>>> +
> > >>>> +static uint32_t prog_id;
> > >>>> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> > >>>> +                                             int mode);
> > >>>> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> > >>>> +static void xsk_destroy(struct xsk_socket_info *xsk);
> > >>>> +static int xsk_configure_all(struct netdev *netdev);
> > >>>> +static void xsk_destroy_all(struct netdev *netdev);
> > >>>> +
> > >>>> +static struct xsk_umem_info *
> > >>>> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> > >>>> +{
> > >>>> +    struct xsk_umem_config uconfig OVS_UNUSED;
> > >>>> +    struct xsk_umem_info *umem;
> > >>>> +    int ret;
> > >>>> +    int i;
> > >>>> +
> > >>>> +    umem = xcalloc(1, sizeof *umem);
> > >>>> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> > >>>> +                           NULL);
> > >>>> +    if (ret) {
> > >>>> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> > >>>> +                 ovs_strerror(errno),
> > >>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> > >>>> +        free(umem);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    umem->buffer = buffer;
> > >>>> +
> > >>>> +    /* set-up umem pool */
> > >>>> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> > >>>> +        VLOG_ERR("umem_pool_init failed");
> > >>>> +        if (xsk_umem__delete(umem->umem)) {
> > >>>> +            VLOG_ERR("xsk_umem__delete failed");
> > >>>> +        }
> > >>>> +        free(umem);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > >>>> +        struct umem_elem *elem;
> > >>>> +
> > >>>> +        elem = ALIGNED_CAST(struct umem_elem *,
> > >>>> +                            (char *)umem->buffer + i * FRAME_SIZE);
> > >>>> +        umem_elem_push(&umem->mpool, elem);
> > >>>> +    }
> > >>>> +
> > >>>> +    /* set-up metadata */
> > >>>> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> > >>>> +        VLOG_ERR("xpacket_pool_init failed");
> > >>>> +        umem_pool_cleanup(&umem->mpool);
> > >>>> +        if (xsk_umem__delete(umem->umem)) {
> > >>>> +            VLOG_ERR("xsk_umem__delete failed");
> > >>>> +        }
> > >>>> +        free(umem);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> > >>>> +              umem->xpool.array,
> > >>>> +              (char *)umem->xpool.array +
> > >>>> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> > >>>> +
> > >>>> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> > >>>> +        struct dp_packet_afxdp *xpacket;
> > >>>> +        struct dp_packet *packet;
> > >>>> +
> > >>>> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> > >>>> +        xpacket->mpool = &umem->mpool;
> > >>>> +
> > >>>> +        packet = &xpacket->packet;
> > >>>> +        packet->source = DPBUF_AFXDP;
> > >>>> +    }
> > >>>> +
> > >>>> +    return umem;
> > >>>> +}
> > >>>> +
> > >>>> +static struct xsk_socket_info *
> > >>>> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> > >>>> +                     uint32_t queue_id, int xdpmode)
> > >>>> +{
> > >>>> +    struct xsk_socket_config cfg;
> > >>>> +    struct xsk_socket_info *xsk;
> > >>>> +    char devname[IF_NAMESIZE];
> > >>>> +    uint32_t idx = 0;
> > >>>> +    int ret;
> > >>>> +    int i;
> > >>>> +
> > >>>> +    xsk = xcalloc(1, sizeof(*xsk));
> > >>>> +    xsk->umem = umem;
> > >>>> +    cfg.rx_size = CONS_NUM_DESCS;
> > >>>> +    cfg.tx_size = PROD_NUM_DESCS;
> > >>>> +    cfg.libbpf_flags = 0;
> > >>>> +
> > >>>> +    if (xdpmode == XDP_ZEROCOPY) {
> > >>>> +        cfg.bind_flags = XDP_ZEROCOPY;
> > >>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > >>>> +    } else {
> > >>>> +        cfg.bind_flags = XDP_COPY;
> > >>>> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > >>>> +    }
> > >>>> +
> > >>>> +    if (if_indextoname(ifindex, devname) == NULL) {
> > >>>> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> > >>>> +                 ifindex, ovs_strerror(errno));
> > >>>> +        free(xsk);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> > >>>> +                             &xsk->rx, &xsk->tx, &cfg);
> > >>>> +    if (ret) {
> > >>>> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> > >>>> +                 ovs_strerror(errno),
> > >>>> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> > >>>> +                 queue_id);
> > >>>> +        free(xsk);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Make sure the built-in AF_XDP program is loaded */
> > >>>> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> > >>>> +    if (ret) {
> > >>>> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> > >>>> +        xsk_socket__delete(xsk->xsk);
> > >>>> +        free(xsk);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
> > >>>> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> > >>>> +                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
> > >>>> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> > >>>> +    }
> > >>>> +
> > >>>> +    for (i = 0;
> > >>>> +         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
> > >>>> +         i += FRAME_SIZE) {
> > >>>> +        struct umem_elem *elem;
> > >>>> +        uint64_t addr;
> > >>>> +
> > >>>> +        elem = umem_elem_pop(&xsk->umem->mpool);
> > >>>> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> > >>>> +
> > >>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> > >>>> +    }
> > >>>> +
> > >>>> +    xsk_ring_prod__submit(&xsk->umem->fq,
> > >>>> +                          PROD_NUM_DESCS - BATCH_SIZE);
> > >>>> +    return xsk;
> > >>>> +}
> > >>>> +
> > >>>> +static struct xsk_socket_info *
> > >>>> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> > >>>> +{
> > >>>> +    struct xsk_socket_info *xsk;
> > >>>> +    struct xsk_umem_info *umem;
> > >>>> +    void *bufs;
> > >>>> +
> > >>>> +    /* umem memory region */
> > >>>> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> > >>>> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> > >>>> +
> > >>>> +    /* create AF_XDP socket */
> > >>>> +    umem = xsk_configure_umem(bufs,
> > >>>> +                              NUM_FRAMES * FRAME_SIZE,
> > >>>> +                              xdpmode);
> > >>>> +    if (!umem) {
> > >>>> +        free_pagealign(bufs);
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> > >>>> +    if (!xsk) {
> > >>>> +        /* clean up umem and xpacket pool */
> > >>>> +        if (xsk_umem__delete(umem->umem)) {
> > >>>> +            VLOG_ERR("xsk_umem__delete failed");
> > >>>> +        }
> > >>>> +        free_pagealign(bufs);
> > >>>> +        umem_pool_cleanup(&umem->mpool);
> > >>>> +        xpacket_pool_cleanup(&umem->xpool);
> > >>>> +        free(umem);
> > >>>> +    }
> > >>>> +    return xsk;
> > >>>> +}
> > >>>> +
> > >>>> +static int
> > >>>> +xsk_configure_all(struct netdev *netdev)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    struct xsk_socket_info *xsk;
> > >>>> +    int i, ifindex, n_rxq;
> > >>>> +
> > >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > >>>> +
> > >>>> +    n_rxq = netdev_n_rxq(netdev);
> > >>>> +    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> > >>>> +
> > >>>> +    /* configure each queue */
> > >>>> +    for (i = 0; i < n_rxq; i++) {
> > >>>> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> > >>>> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> > >>>> +        xsk = xsk_configure(ifindex, i, dev->xdpmode);
> > >>>> +        if (!xsk) {
> > >>>> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> > >>>> +            dev->xsks[i] = NULL;
> > >>>> +            goto err;
> > >>>> +        }
> > >>>> +        dev->xsks[i] = xsk;
> > >>>> +        xsk->rx_dropped = 0;
> > >>>> +        xsk->tx_dropped = 0;
> > >>>> +    }
> > >>>> +
> > >>>> +    return 0;
> > >>>> +
> > >>>> +err:
> > >>>> +    xsk_destroy_all(netdev);
> > >>>> +    return EINVAL;
> > >>>> +}
> > >>>> +
> > >>>> +static void
> > >>>> +xsk_destroy(struct xsk_socket_info *xsk)
> > >>>> +{
> > >>>> +    struct xsk_umem *umem;
> > >>>> +
> > >>>> +    umem = xsk->umem->umem;
> > >>>> +    xsk_socket__delete(xsk->xsk);
> > >>>> +    if (xsk_umem__delete(umem)) {
> > >>>> +        VLOG_ERR("xsk_umem__delete failed");
> > >>>> +    }
> > >>>> +
> > >>>> +    /* free the packet buffer */
> > >>>> +    free_pagealign(xsk->umem->buffer);
> > >>>> +
> > >>>> +    /* cleanup umem pool */
> > >>>> +    umem_pool_cleanup(&xsk->umem->mpool);
> > >>>> +
> > >>>> +    /* cleanup metadata pool */
> > >>>> +    xpacket_pool_cleanup(&xsk->umem->xpool);
> > >>>> +
> > >>>> +    free(xsk->umem);
> > >>>> +    free(xsk);
> > >>>> +}
> > >>>> +
> > >>>> +static void
> > >>>> +xsk_destroy_all(struct netdev *netdev)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    int i, ifindex;
> > >>>> +
> > >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > >>>> +
> > >>>> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> > >>>> +        if (dev->xsks && dev->xsks[i]) {
> > >>>> +            VLOG_INFO("destroy xsk[%d]", i);
> > >>>> +            xsk_destroy(dev->xsks[i]);
> > >>>> +            dev->xsks[i] = NULL;
> > >>>> +        }
> > >>>> +    }
> > >>>> +
> > >>>> +    VLOG_INFO("remove xdp program");
> > >>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > >>>> +
> > >>>> +    free(dev->xsks);
> > >>>> +}
> > >>>> +
> > >>>> +static inline void OVS_UNUSED
> > >>>> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> > >>>> +    struct xdp_statistics stat;
> > >>>> +    socklen_t optlen;
> > >>>> +
> > >>>> +    optlen = sizeof stat;
> > >>>> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> > >>>> +               &stat, &optlen) == 0);
> > >>>> +
> > >>>> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> > >>>> +                stat.rx_dropped,
> > >>>> +                stat.rx_invalid_descs,
> > >>>> +                stat.tx_invalid_descs);
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > >>>> +                        char **errp OVS_UNUSED)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    const char *str_xdpmode;
> > >>>> +    int xdpmode, new_n_rxq;
> > >>>> +
> > >>>> +    ovs_mutex_lock(&dev->mutex);
> > >>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > >>>> +    if (new_n_rxq > MAX_XSKQ) {
> > >>>> +        ovs_mutex_unlock(&dev->mutex);
> > >>>> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> > >>>> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> > >>>> +        return EINVAL;
> > >>>> +    }
> > >>>> +
> > >>>> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> > >>>> +    if (!strcasecmp(str_xdpmode, "drv")) {
> > >>>> +        xdpmode = XDP_ZEROCOPY;
> > >>>> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> > >>>> +        xdpmode = XDP_COPY;
> > >>>> +    } else {
> > >>>> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> > >>>> +                 netdev_get_name(netdev), str_xdpmode);
> > >>>> +        ovs_mutex_unlock(&dev->mutex);
> > >>>> +        return EINVAL;
> > >>>> +    }
> > >>>> +
> > >>>> +    if (dev->requested_n_rxq != new_n_rxq
> > >>>> +        || dev->requested_xdpmode != xdpmode) {
> > >>>> +        dev->requested_n_rxq = new_n_rxq;
> > >>>> +        dev->requested_xdpmode = xdpmode;
> > >>>> +        netdev_request_reconfigure(netdev);
> > >>>> +    }
> > >>>> +    ovs_mutex_unlock(&dev->mutex);
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +
> > >>>> +    ovs_mutex_lock(&dev->mutex);
> > >>>> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> > >>>> +    smap_add_format(args, "xdpmode", "%s",
> > >>>> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> > >>>> +    ovs_mutex_unlock(&dev->mutex);
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static void
> > >>>> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    int n_txqs = netdev_n_rxq(netdev);
> > >>>> +    int i;
> > >>>> +
> > >>>> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> > >>>> +
> > >>>> +    for (i = 0; i < n_txqs; i++) {
> > >>>> +        ovs_spinlock_init(&dev->tx_locks[i]);
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_reconfigure(struct netdev *netdev)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> > >>>> +    int err = 0;
> > >>>> +
> > >>>> +    ovs_mutex_lock(&dev->mutex);
> > >>>> +
> > >>>> +    if (netdev->n_rxq == dev->requested_n_rxq
> > >>>> +        && dev->xdpmode == dev->requested_xdpmode) {
> > >>>> +        goto out;
> > >>>> +    }
> > >>>> +
> > >>>> +    xsk_destroy_all(netdev);
> > >>>> +    free(dev->tx_locks);
> > >>>> +
> > >>>> +    netdev->n_rxq = dev->requested_n_rxq;
> > >>>> +    netdev_afxdp_alloc_txq(netdev);
> > >>>> +
> > >>>> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> > >>>> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> > >>>> +        /* From SKB mode to DRV mode */
> > >>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > >>>> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> > >>>> +        dev->xdpmode = XDP_ZEROCOPY;
> > >>>> +
> > >>>> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> > >>>> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> > >>>> +                      ovs_strerror(errno));
> > >>>> +        }
> > >>>> +    } else {
> > >>>> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> > >>>> +        /* From DRV mode to SKB mode */
> > >>>> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > >>>> +        dev->xdp_bind_flags = XDP_COPY;
> > >>>> +        dev->xdpmode = XDP_COPY;
> > >>>> +        /* TODO: set rlimit back to previous value
> > >>>> +         * when no device is in DRV mode.
> > >>>> +         */
> > >>>> +    }
> > >>>> +
> > >>>> +    err = xsk_configure_all(netdev);
> > >>>> +    if (err) {
> > >>>> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> > >>>> +    }
> > >>>> +    netdev_change_seq_changed(netdev);
> > >>>> +out:
> > >>>> +    ovs_mutex_unlock(&dev->mutex);
> > >>>> +    return err;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> > >>>> +{
> > >>>> +    /* FIXME: Get netdev's PCIe device ID, then find
> > >>>> +     * its NUMA node id.
> > >>>> +     */
> > >>>> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> > >>>> +              netdev_get_name(netdev));
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static void
> > >>>> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> > >>>> +{
> > >>>> +    uint32_t curr_prog_id = 0;
> > >>>> +    uint32_t flags;
> > >>>> +
> > >>>> +    /* remove_xdp_program() */
> > >>>> +    if (xdpmode == XDP_COPY) {
> > >>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> > >>>> +    } else {
> > >>>> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> > >>>> +    }
> > >>>> +
> > >>>> +    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
> > >>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > >>>> +    }
> > >>>> +    if (prog_id == curr_prog_id) {
> > >>>> +        bpf_set_link_xdp_fd(ifindex, -1, flags);
> > >>>> +    } else if (!curr_prog_id) {
> > >>>> +        VLOG_INFO("couldn't find a prog id on a given interface");
> > >>>> +    } else {
> > >>>> +        VLOG_INFO("program on interface changed, not removing");
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +signal_remove_xdp(struct netdev *netdev)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    int ifindex;
> > >>>> +
> > >>>> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> > >>>> +
> > >>>> +    VLOG_WARN("force remove xdp program");
> > >>>> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> > >>>> +}
> > >>>> +
> > >>>> +static struct dp_packet_afxdp *
> > >>>> +dp_packet_cast_afxdp(const struct dp_packet *d)
> > >>>> +{
> > >>>> +    ovs_assert(d->source == DPBUF_AFXDP);
> > >>>> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +free_afxdp_buf(struct dp_packet *p)
> > >>>> +{
> > >>>> +    struct dp_packet_afxdp *xpacket;
> > >>>> +    uintptr_t addr;
> > >>>> +
> > >>>> +    xpacket = dp_packet_cast_afxdp(p);
> > >>>> +    if (xpacket->mpool) {
> > >>>> +        void *base = dp_packet_base(p);
> > >>>> +
> > >>>> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > >>>> +        umem_elem_push(xpacket->mpool, (void *)addr);
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +static void
> > >>>> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> > >>>> +{
> > >>>> +    struct dp_packet_afxdp *xpacket = NULL;
> > >>>> +    struct dp_packet *packet;
> > >>>> +    void *elems[BATCH_SIZE];
> > >>>> +    uintptr_t addr;
> > >>>> +
> > >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > >>>> +        xpacket = dp_packet_cast_afxdp(packet);
> > >>>> +        if (xpacket->mpool) {
> > >>>> +            void *base = dp_packet_base(packet);
> > >>>> +
> > >>>> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > >>>> +            elems[i] = (void *)addr;
> > >>>> +        }
> > >>>> +    }
> > >>>> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> > >>>> +    dp_packet_batch_init(batch);
> > >>>> +}
> > >>>> +
> > >>>> +static inline void
> > >>>> +handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
> > >>>> +{
> > >>>> +    void *elems[BATCH_SIZE];
> > >>>> +    int i;
> > >>>> +
> > >>>> +    for (i = 0; i < rcvd; i++) {
> > >>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > >>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > >>>> +
> > >>>> +        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
> > >>>> +    }
> > >>>> +    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > >>>> +
> > >>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > >>>> +    xsk->rx_dropped += rcvd;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> > >>>> +                      int *qfill)
> > >>>> +{
> > >>>> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> > >>>> +    struct netdev *netdev = rx->up.netdev;
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    struct umem_elem *elems[BATCH_SIZE];
> > >>>> +    uint32_t idx_rx = 0, idx_fq = 0;
> > >>>> +    struct xsk_socket_info *xsk;
> > >>>> +    int qid = rxq_->queue_id;
> > >>>> +    unsigned int rcvd, i;
> > >>>> +    int ret = 0;
> > >>>> +
> > >>>> +    xsk = dev->xsks[qid];
> > >>>> +    if (!xsk) {
> > >>>> +        return 0;
> > >>>> +    }
> > >>>> +
> > >>>> +    rx->fd = xsk_socket__fd(xsk->xsk);
> > >>>> +
> > >>>> +    /* See if there is any packet on RX queue,
> > >>>> +     * if yes, idx_rx is the index having the packet.
> > >>>> +     */
> > >>>> +    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> > >>>> +    if (!rcvd) {
> > >>>> +        return 0;
> > >>>> +    }
> > >>>> +
> > >>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > >>>> +    if (OVS_UNLIKELY(ret)) {
> > >>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
> > >>>> +        return ENOMEM;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Prepare for the FILL queue */
> > >>>> +    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
> > >>>> +        /* The FILL queue is full, don't retry or process rx. Wait for kernel
> > >>>> +         * to move received packets from FILL queue to RX queue.
> > >>>> +         */
> > >>>> +        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
> > >>>> +        handle_rx_fail(xsk, rcvd, idx_rx);
> > >>>> +        return ENOMEM;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Setup a dp_packet batch from descriptors in RX queue */
> > >>>> +    for (i = 0; i < rcvd; i++) {
> > >>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> > >>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
> > >>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> > >>>> +        uint64_t index;
> > >>>> +
> > >>>> +        struct dp_packet_afxdp *xpacket;
> > >>>> +        struct dp_packet *packet;
> > >>>> +
> > >>>> +        index = addr >> FRAME_SHIFT;
> > >>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
> > >>>> +        packet = &xpacket->packet;
> > >>>> +
> > >>>> +        /* Initialize the struct dp_packet */
> > >>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> > >>>> +        dp_packet_set_size(packet, len);
> > >>>> +
> > >>>> +        /* Add packet into batch, increase batch->count */
> > >>>> +        dp_packet_batch_add(batch, packet);
> > >>>> +
> > >>>> +        idx_rx++;
> > >>>> +    }
> > >>>> +    /* Release the RX queue */
> > >>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
> > >>>> +
> > >>>> +    for (i = 0; i < rcvd; i++) {
> > >>>> +        uint64_t index;
> > >>>> +        struct umem_elem *elem;
> > >>>> +
> > >>>> +        /* Get one free umem, program it into FILL queue */
> > >>>> +        elem = elems[i];
> > >>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > >>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > >>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
> > >>>> +
> > >>>> +        idx_fq++;
> > >>>> +    }
> > >>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> > >>>> +
> > >>>> +    if (qfill) {
> > >>>> +        /* TODO: return the number of remaining packets in the queue. */
> > >>>> +        *qfill = 0;
> > >>>> +    }
> > >>>> +
> > >>>> +#ifdef AFXDP_DEBUG
> > >>>> +    log_xsk_stat(xsk);
> > >>>> +#endif
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static inline int
> > >>>> +kick_tx(struct xsk_socket_info *xsk)
> > >>>> +{
> > >>>> +    int ret;
> > >>>> +
> > >>>> +    /* This causes system call into kernel's xsk_sendmsg, and
> > >>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> > >>>> +     */
> > >>>> +    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > >>>> +    if (OVS_UNLIKELY(ret < 0)) {
> > >>>> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> > >>>> +            return errno;
> > >>>> +        }
> > >>>> +    }
> > >>>> +    /* no error, or EBUSY or EAGAIN */
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static inline bool
> > >>>> +check_free_batch(struct dp_packet_batch *batch)
> > >>>> +{
> > >>>> +    struct umem_pool *first_mpool = NULL;
> > >>>> +    struct dp_packet_afxdp *xpacket;
> > >>>> +    struct dp_packet *packet;
> > >>>> +
> > >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > >>>> +        if (packet->source != DPBUF_AFXDP) {
> > >>>> +            return false;
> > >>>> +        }
> > >>>> +        xpacket = dp_packet_cast_afxdp(packet);
> > >>>> +        if (i == 0) {
> > >>>> +            first_mpool = xpacket->mpool;
> > >>>> +            continue;
> > >>>> +        }
> > >>>> +        if (xpacket->mpool != first_mpool) {
> > >>>> +            return false;
> > >>>> +        }
> > >>>> +    }
> > >>>> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> > >>>> +    return true;
> > >>>> +}
> > >>>> +
> > >>>> +static inline void
> > >>>> +afxdp_complete_tx(struct xsk_socket_info *xsk)
> > >>>> +{
> > >>>> +    struct umem_elem *elems_push[BATCH_SIZE];
> > >>>> +    uint32_t idx_cq = 0;
> > >>>> +    int tx_done, j, ret;
> > >>>> +
> > >>>> +    if (!xsk->outstanding_tx) {
> > >>>> +        return;
> > >>>> +    }
> > >>>> +
> > >>>> +    ret = kick_tx(xsk);
> > >>>> +    if (OVS_UNLIKELY(ret)) {
> > >>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > >>>> +                     ovs_strerror(ret));
> > >>>> +    }
> > >>>> +
> > >>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
> > >>>> +    if (tx_done > 0) {
> > >>>> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > >>>> +        xsk->outstanding_tx -= tx_done;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Recycle back to umem pool */
> > >>>> +    for (j = 0; j < tx_done; j++) {
> > >>>> +        struct umem_elem *elem;
> > >>>> +        uint64_t addr;
> > >>>> +
> > >>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > >>>> +        elem = ALIGNED_CAST(struct umem_elem *,
> > >>>> +                            (char *)xsk->umem->buffer + addr);
> > >>>> +        elems_push[j] = elem;
> > >>>> +    }
> > >>>> +
> > >>>> +    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> > >>>> +                        struct dp_packet_batch *batch,
> > >>>> +                        bool concurrent_txq)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    struct xsk_socket_info *xsk = dev->xsks[qid];
> > >>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
> > >>>> +    struct dp_packet *packet;
> > >>>> +    bool free_batch = true;
> > >>>> +    uint32_t idx = 0;
> > >>>> +    int error = 0;
> > >>>> +    int ret;
> > >>>> +
> > >>>> +    if (!xsk) {
> > >>>> +        goto out;
> > >>>> +    }
> > >>>> +
> > >>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
> > >>>> +        qid = qid % dev->up.n_txq;
> > >>>> +        ovs_spin_lock(&dev->tx_locks[qid]);
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Process CQ first. */
> > >>>> +    afxdp_complete_tx(xsk);
> > >>>> +
> > >>>> +    free_batch = check_free_batch(batch);
> > >>>> +
> > >>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > >>>> +    if (OVS_UNLIKELY(ret)) {
> > >>>> +        xsk->tx_dropped += batch->count;
> > >>>> +        error = ENOMEM;
> > >>>> +        goto out;
> > >>>> +    }
> > >>>> +
> > >>>> +    /* Make sure we have enough TX descs */
> > >>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > >>>> +    if (OVS_UNLIKELY(ret == 0)) {
> > >>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
> > >>>> +        xsk->tx_dropped += batch->count;
> > >>>> +        error = ENOMEM;
> > >>>> +        goto out;
> > >>>> +    }
> > >>>> +
> > >>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > >>>> +        struct umem_elem *elem;
> > >>>> +        uint64_t index;
> > >>>> +
> > >>>> +        elem = elems_pop[i];
> > >>>> +        /* Copy the packet to the umem we just pop from umem pool.
> > >>>> +         * TODO: avoid this copy if the packet and the pop umem
> > >>>> +         * are located in the same umem.
> > >>>> +         */
> > >>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > >>>> +
> > >>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > >>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > >>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > >>>> +            = dp_packet_size(packet);
> > >>>> +    }
> > >>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > >>>> +    xsk->outstanding_tx += batch->count;
> > >>>> +
> > >>>> +    ret = kick_tx(xsk);
> > >>>> +    if (OVS_UNLIKELY(ret)) {
> > >>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > >>>> +                     ovs_strerror(ret));
> > >>>> +    }
> > >>>> +
> > >>>> +out:
> > >>>> +    if (free_batch) {
> > >>>> +        free_afxdp_buf_batch(batch);
> > >>>> +    } else {
> > >>>> +        dp_packet_delete_batch(batch, true);
> > >>>> +    }
> > >>>> +
> > >>>> +    if (OVS_UNLIKELY(concurrent_txq)) {
> > >>>> +        ovs_spin_unlock(&dev->tx_locks[qid]);
> > >>>> +    }
> > >>>> +    return error;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
> > >>>> +{
> > >>>> +   /* Done at reconfigure */
> > >>>> +   return 0;
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +netdev_afxdp_destruct(struct netdev *netdev_)
> > >>>> +{
> > >>>> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > >>>> +
> > >>>> +    /* Note: tc is by-passed when using drv-mode, but when using
> > >>>> +     * skb-mode, we might need to clean up tc. */
> > >>>> +
> > >>>> +    xsk_destroy_all(netdev_);
> > >>>> +    ovs_mutex_destroy(&netdev->mutex);
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +netdev_afxdp_get_stats(const struct netdev *netdev,
> > >>>> +                       struct netdev_stats *stats)
> > >>>> +{
> > >>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > >>>> +    struct netdev_stats dev_stats;
> > >>>> +    struct xsk_socket_info *xsk;
> > >>>> +    int error, i;
> > >>>> +
> > >>>> +    ovs_mutex_lock(&dev->mutex);
> > >>>> +
> > >>>> +    error = get_stats_via_netlink(netdev, &dev_stats);
> > >>>> +    if (error) {
> > >>>> +        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
> > >>>> +    } else {
> > >>>> +        /* Use kernel netdev's packet and byte counts */
> > >>>> +        stats->rx_packets = dev_stats.rx_packets;
> > >>>> +        stats->rx_bytes = dev_stats.rx_bytes;
> > >>>> +        stats->tx_packets = dev_stats.tx_packets;
> > >>>> +        stats->tx_bytes = dev_stats.tx_bytes;
> > >>>> +
> > >>>> +        stats->rx_errors           += dev_stats.rx_errors;
> > >>>> +        stats->tx_errors           += dev_stats.tx_errors;
> > >>>> +        stats->rx_dropped          += dev_stats.rx_dropped;
> > >>>> +        stats->tx_dropped          += dev_stats.tx_dropped;
> > >>>> +        stats->multicast           += dev_stats.multicast;
> > >>>> +        stats->collisions          += dev_stats.collisions;
> > >>>> +        stats->rx_length_errors    += dev_stats.rx_length_errors;
> > >>>> +        stats->rx_over_errors      += dev_stats.rx_over_errors;
> > >>>> +        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
> > >>>> +        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
> > >>>> +        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
> > >>>> +        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
> > >>>> +        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
> > >>>> +        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
> > >>>> +        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
> > >>>> +        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
> > >>>> +        stats->tx_window_errors    += dev_stats.tx_window_errors;
> > >>>> +
> > >>>> +        /* Account the dropped in each xsk */
> > >>>> +        for (i = 0; i < netdev_n_rxq(netdev); i++) {
> > >>>> +            xsk = dev->xsks[i];
> > >>>> +            if (xsk) {
> > >>>> +                stats->rx_dropped += xsk->rx_dropped;
> > >>>> +                stats->tx_dropped += xsk->tx_dropped;
> > >>>> +            }
> > >>>> +        }
> > >>>> +    }
> > >>>> +    ovs_mutex_unlock(&dev->mutex);
> > >>>> +
> > >>>> +    return error;
> > >>>> +}
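
For readers following the umem bookkeeping in the RX and FILL paths quoted above, here is a minimal sketch of the address arithmetic. It is not part of the patch: the SKETCH_* names and the 2 KB frame size are assumptions made for illustration, and the real FRAME_SHIFT, FRAME_SIZE and FRAME_SHIFT_MASK are defined elsewhere in the series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the patch defines its own
 * FRAME_SHIFT / FRAME_SIZE elsewhere. */
#define SKETCH_FRAME_SHIFT 11
#define SKETCH_FRAME_SIZE (1u << SKETCH_FRAME_SHIFT)
#define SKETCH_FRAME_MASK ((uint64_t) SKETCH_FRAME_SIZE - 1)

/* RX side: a descriptor address is a byte offset into the umem, so
 * shifting it right by the frame shift selects the per-frame slot. */
static inline uint64_t
sketch_addr_to_index(uint64_t addr)
{
    return addr >> SKETCH_FRAME_SHIFT;
}

/* FILL side: a free element pointer is turned back into a byte offset
 * from the start of the umem buffer; it must be frame aligned. */
static inline uint64_t
sketch_elem_to_addr(const void *umem_base, const void *elem)
{
    uint64_t off = (uint64_t) ((const char *) elem
                               - (const char *) umem_base);

    assert((off & SKETCH_FRAME_MASK) == 0);
    return off;
}
```

The shift works on the RX side even when the descriptor address includes headroom inside the frame, because the low bits are discarded; the FILL side asserts alignment because it hands whole frames back to the kernel.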
> > >>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > >>>> new file mode 100644
> > >>>> index 000000000000..dd2dc1a2064d
> > >>>> --- /dev/null
> > >>>> +++ b/lib/netdev-afxdp.h
> > >>>> @@ -0,0 +1,74 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing, software
> > >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > >>>> + * See the License for the specific language governing permissions and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +
> > >>>> +#ifndef NETDEV_AFXDP_H
> > >>>> +#define NETDEV_AFXDP_H 1
> > >>>> +
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +
> > >>>> +#include <stdint.h>
> > >>>> +#include <stdbool.h>
> > >>>> +
> > >>>> +/* These functions are Linux AF_XDP specific, so they should be used directly
> > >>>> + * only by Linux-specific code. */
> > >>>> +
> > >>>> +#define MAX_XSKQ 16
> > >>>> +
> > >>>> +struct netdev;
> > >>>> +struct xsk_socket_info;
> > >>>> +struct xdp_umem;
> > >>>> +struct dp_packet_batch;
> > >>>> +struct smap;
> > >>>> +struct dp_packet;
> > >>>> +struct netdev_rxq;
> > >>>> +struct netdev_stats;
> > >>>> +
> > >>>> +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
> > >>>> +void netdev_afxdp_destruct(struct netdev *netdev_);
> > >>>> +
> > >>>> +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
> > >>>> +                          struct dp_packet_batch *batch,
> > >>>> +                          int *qfill);
> > >>>> +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
> > >>>> +                            struct dp_packet_batch *batch,
> > >>>> +                            bool concurrent_txq);
> > >>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> > >>>> +                            char **errp);
> > >>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
> > >>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > >>>> +int netdev_afxdp_get_stats(const struct netdev *netdev_,
> > >>>> +                           struct netdev_stats *stats);
> > >>>> +
> > >>>> +void free_afxdp_buf(struct dp_packet *p);
> > >>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > >>>> +void signal_remove_xdp(struct netdev *netdev);
> > >>>> +
> > >>>> +#else /* !HAVE_AF_XDP */
> > >>>> +
> > >>>> +#include "openvswitch/compiler.h"
> > >>>> +
> > >>>> +struct dp_packet;
> > >>>> +
> > >>>> +static inline void
> > >>>> +free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
> > >>>> +{
> > >>>> +    /* Nothing */
> > >>>> +}
> > >>>> +
> > >>>> +#endif /* HAVE_AF_XDP */
> > >>>> +#endif /* netdev-afxdp.h */
> > >>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > >>>> new file mode 100644
> > >>>> index 000000000000..6a0388cf9dc3
> > >>>> --- /dev/null
> > >>>> +++ b/lib/netdev-linux-private.h
> > >>>> @@ -0,0 +1,139 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing, software
> > >>>> + * distributed under the License is distributed on an "AS IS" BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > >>>> + * See the License for the specific language governing permissions and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +
> > >>>> +#ifndef NETDEV_LINUX_PRIVATE_H
> > >>>> +#define NETDEV_LINUX_PRIVATE_H 1
> > >>>> +
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#include <linux/filter.h>
> > >>>> +#include <linux/gen_stats.h>
> > >>>> +#include <linux/if_ether.h>
> > >>>> +#include <linux/if_tun.h>
> > >>>> +#include <linux/types.h>
> > >>>> +#include <linux/ethtool.h>
> > >>>> +#include <linux/mii.h>
> > >>>> +#include <stdint.h>
> > >>>> +#include <stdbool.h>
> > >>>> +
> > >>>> +#include "netdev-afxdp.h"
> > >>>> +#include "netdev-provider.h"
> > >>>> +#include "netdev-tc-offloads.h"
> > >>>> +#include "netdev-vport.h"
> > >>>> +#include "openvswitch/thread.h"
> > >>>> +#include "ovs-atomic.h"
> > >>>> +#include "timer.h"
> > >>>> +#include "xdpsock.h"
> > >>>> +
> > >>>> +/* These functions are Linux specific, so they should be used directly only by
> > >>>> + * Linux-specific code. */
> > >>>> +
> > >>>> +struct netdev;
> > >>>> +
> > >>>> +struct netdev_rxq_linux {
> > >>>> +    struct netdev_rxq up;
> > >>>> +    bool is_tap;
> > >>>> +    int fd;
> > >>>> +};
> > >>>> +
> > >>>> +void netdev_linux_run(const struct netdev_class *);
> > >>>> +
> > >>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
> > >>>> +                                  const char *flag_name, bool enable);
> > >>>> +
> > >>>> +int get_stats_via_netlink(const struct netdev *netdev_,
> > >>>> +                          struct netdev_stats *stats);
> > >>>> +
> > >>>> +struct netdev_linux {
> > >>>> +    struct netdev up;
> > >>>> +
> > >>>> +    /* Protects all members below. */
> > >>>> +    struct ovs_mutex mutex;
> > >>>> +
> > >>>> +    unsigned int cache_valid;
> > >>>> +
> > >>>> +    bool miimon;                    /* Link status of last poll. */
> > >>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > >>>> +    struct timer miimon_timer;
> > >>>> +
> > >>>> +    int netnsid;                    /* Network namespace ID. */
> > >>>> +    /* The following are figured out "on demand" only.  They are only valid
> > >>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > >>>> +    int ifindex;
> > >>>> +    struct eth_addr etheraddr;
> > >>>> +    int mtu;
> > >>>> +    unsigned int ifi_flags;
> > >>>> +    long long int carrier_resets;
> > >>>> +    uint32_t kbits_rate;        /* Policing data. */
> > >>>> +    uint32_t kbits_burst;
> > >>>> +    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > >>>> +                                   0 or an errno value. */
> > >>>> +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > >>>> +                                 * or SIOCSIFMTU.
> > >>>> +                                 */
> > >>>> +    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > >>>> +    int netdev_policing_error;  /* Cached error code from set policing. */
> > >>>> +    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > >>>> +    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > >>>> +
> > >>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > >>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > >>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > >>>> +
> > >>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > >>>> +    struct tc *tc;
> > >>>> +
> > >>>> +    /* For devices of class netdev_tap_class only. */
> > >>>> +    int tap_fd;
> > >>>> +    bool present;               /* If the device is present in the namespace */
> > >>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > >>>> +
> > >>>> +    /* LAG information. */
> > >>>> +    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > >>>> +
> > >>>> +    /* AF_XDP information */
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +    struct xsk_socket_info **xsks;
> > >>>> +    int requested_n_rxq;
> > >>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > >>>> +    int xdp_flags, xdp_bind_flags;
> > >>>> +    struct ovs_spinlock *tx_locks;
> > >>>> +#endif
> > >>>> +};
> > >>>> +
> > >>>> +static bool
> > >>>> +is_netdev_linux_class(const struct netdev_class *netdev_class)
> > >>>> +{
> > >>>> +    return netdev_class->run == netdev_linux_run;
> > >>>> +}
> > >>>> +
> > >>>> +static struct netdev_linux *
> > >>>> +netdev_linux_cast(const struct netdev *netdev)
> > >>>> +{
> > >>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > >>>> +
> > >>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > >>>> +}
> > >>>> +
> > >>>> +static struct netdev_rxq_linux *
> > >>>> +netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> > >>>> +{
> > >>>> +    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> > >>>> +
> > >>>> +    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> > >>>> +}
> > >>>> +
> > >>>> +#endif /* netdev-linux-private.h */
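
The cast helpers moved into this header rely on the container-of idiom. A minimal, self-contained sketch of that idiom follows; the sketch_* types and the SKETCH_CONTAINER_OF macro are hypothetical stand-ins, not the OVS definitions (the OVS CONTAINER_OF macro lives in its own utility header and may differ in detail).

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct netdev / struct netdev_linux. */
struct sketch_base {
    const char *name;
};

struct sketch_derived {
    int before;               /* Deliberately placed ahead of 'up'. */
    struct sketch_base up;    /* Embedded base object. */
    int ifindex;
};

/* Simplified container-of: subtract the member's offset from the
 * member's address to recover the enclosing structure. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
    ((type *) (void *) ((char *) (ptr) - offsetof(type, member)))

static inline struct sketch_derived *
sketch_derived_cast(const struct sketch_base *base)
{
    return SKETCH_CONTAINER_OF(base, struct sketch_derived, up);
}
```

The arithmetic is only valid when the base pointer really is embedded in the derived type, which is why the quoted helpers assert on the netdev class before casting.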
> > >>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > >>>> index f75d73fd39f8..2883cf1f2586 100644
> > >>>> --- a/lib/netdev-linux.c
> > >>>> +++ b/lib/netdev-linux.c
> > >>>> @@ -17,6 +17,7 @@
> > >>>>  #include <config.h>
> > >>>>
> > >>>>  #include "netdev-linux.h"
> > >>>> +#include "netdev-linux-private.h"
> > >>>>
> > >>>>  #include <errno.h>
> > >>>>  #include <fcntl.h>
> > >>>> @@ -54,6 +55,7 @@
> > >>>>  #include "fatal-signal.h"
> > >>>>  #include "hash.h"
> > >>>>  #include "openvswitch/hmap.h"
> > >>>> +#include "netdev-afxdp.h"
> > >>>>  #include "netdev-provider.h"
> > >>>>  #include "netdev-tc-offloads.h"
> > >>>>  #include "netdev-vport.h"
> > >>>> @@ -487,57 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> > >>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
> > >>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
> > >>>>
> > >>>> -struct netdev_linux {
> > >>>> -    struct netdev up;
> > >>>> -
> > >>>> -    /* Protects all members below. */
> > >>>> -    struct ovs_mutex mutex;
> > >>>> -
> > >>>> -    unsigned int cache_valid;
> > >>>> -
> > >>>> -    bool miimon;                    /* Link status of last poll. */
> > >>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
> > >>>> -    struct timer miimon_timer;
> > >>>> -
> > >>>> -    int netnsid;                    /* Network namespace ID. */
> > >>>> -    /* The following are figured out "on demand" only.  They are only valid
> > >>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > >>>> -    int ifindex;
> > >>>> -    struct eth_addr etheraddr;
> > >>>> -    int mtu;
> > >>>> -    unsigned int ifi_flags;
> > >>>> -    long long int carrier_resets;
> > >>>> -    uint32_t kbits_rate;        /* Policing data. */
> > >>>> -    uint32_t kbits_burst;
> > >>>> -    int vport_stats_error;      /* Cached error code from vport_get_stats().
> > >>>> -                                   0 or an errno value. */
> > >>>> -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
> > >>>> -    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
> > >>>> -    int netdev_policing_error;  /* Cached error code from set policing. */
> > >>>> -    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
> > >>>> -    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
> > >>>> -
> > >>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > >>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > >>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > >>>> -
> > >>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
> > >>>> -    struct tc *tc;
> > >>>> -
> > >>>> -    /* For devices of class netdev_tap_class only. */
> > >>>> -    int tap_fd;
> > >>>> -    bool present;               /* If the device is present in the namespace */
> > >>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> > >>>> -
> > >>>> -    /* LAG information. */
> > >>>> -    bool is_lag_master;         /* True if the netdev is a LAG master. */
> > >>>> -};
> > >>>> -
> > >>>> -struct netdev_rxq_linux {
> > >>>> -    struct netdev_rxq up;
> > >>>> -    bool is_tap;
> > >>>> -    int fd;
> > >>>> -};
> > >>>>
> > >>>>  /* This is set pretty low because we probably won't learn anything from the
> > >>>>   * additional log messages. */
> > >>>> @@ -551,8 +502,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > >>>>   * changes in the device miimon status, so we can use atomic_count. */
> > >>>>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
> > >>>>
> > >>>> -static void netdev_linux_run(const struct netdev_class *);
> > >>>> -
> > >>>>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> > >>>>                                     int cmd, const char *cmd_name);
> > >>>>  static int get_flags(const struct netdev *, unsigned int *flags);
> > >>>> @@ -566,7 +515,6 @@ static int do_set_addr(struct netdev *netdev,
> > >>>>                         struct in_addr addr);
> > >>>>  static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
> > >>>>  static int set_etheraddr(const char *netdev_name, const struct eth_addr);
> > >>>> -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
> > >>>>  static int af_packet_sock(void);
> > >>>>  static bool netdev_linux_miimon_enabled(void);
> > >>>>  static void netdev_linux_miimon_run(void);
> > >>>> @@ -574,31 +522,10 @@ static void netdev_linux_miimon_wait(void);
> > >>>>  static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
> > >>>>
> > >>>>  static bool
> > >>>> -is_netdev_linux_class(const struct netdev_class *netdev_class)
> > >>>> -{
> > >>>> -    return netdev_class->run == netdev_linux_run;
> > >>>> -}
> > >>>> -
> > >>>> -static bool
> > >>>>  is_tap_netdev(const struct netdev *netdev)
> > >>>>  {
> > >>>>      return netdev_get_class(netdev) == &netdev_tap_class;
> > >>>>  }
> > >>>> -
> > >>>> -static struct netdev_linux *
> > >>>> -netdev_linux_cast(const struct netdev *netdev)
> > >>>> -{
> > >>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > >>>> -
> > >>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > >>>> -}
> > >>>> -
> > >>>> -static struct netdev_rxq_linux *
> > >>>> -netdev_rxq_linux_cast(const struct netdev_rxq *rx)
> > >>>> -{
> > >>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
> > >>>> -    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
> > >>>> -}
> > >>>>
> > >>>>  static int
> > >>>>  netdev_linux_netnsid_update__(struct netdev_linux *netdev)
> > >>>> @@ -774,7 +701,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change)
> > >>>>      }
> > >>>>  }
> > >>>>
> > >>>> -static void
> > >>>> +void
> > >>>>  netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
> > >>>>  {
> > >>>>      struct nl_sock *sock;
> > >>>> @@ -3279,9 +3206,7 @@ exit:
> > >>>>      .run = netdev_linux_run,                                    \
> > >>>>      .wait = netdev_linux_wait,                                  \
> > >>>>      .alloc = netdev_linux_alloc,                                \
> > >>>> -    .destruct = netdev_linux_destruct,                          \
> > >>>>      .dealloc = netdev_linux_dealloc,                            \
> > >>>> -    .send = netdev_linux_send,                                  \
> > >>>>      .send_wait = netdev_linux_send_wait,                        \
> > >>>>      .set_etheraddr = netdev_linux_set_etheraddr,                \
> > >>>>      .get_etheraddr = netdev_linux_get_etheraddr,                \
> > >>>> @@ -3312,10 +3237,8 @@ exit:
> > >>>>      .arp_lookup = netdev_linux_arp_lookup,                      \
> > >>>>      .update_flags = netdev_linux_update_flags,                  \
> > >>>>      .rxq_alloc = netdev_linux_rxq_alloc,                        \
> > >>>> -    .rxq_construct = netdev_linux_rxq_construct,                \
> > >>>>      .rxq_destruct = netdev_linux_rxq_destruct,                  \
> > >>>>      .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
> > >>>> -    .rxq_recv = netdev_linux_rxq_recv,                          \
> > >>>>      .rxq_wait = netdev_linux_rxq_wait,                          \
> > >>>>      .rxq_drain = netdev_linux_rxq_drain
> > >>>>
> > >>>> @@ -3323,30 +3246,64 @@ const struct netdev_class netdev_linux_class = {
> > >>>>      NETDEV_LINUX_CLASS_COMMON,
> > >>>>      LINUX_FLOW_OFFLOAD_API,
> > >>>>      .type = "system",
> > >>>> +    .is_pmd = false,
> > >>>>      .construct = netdev_linux_construct,
> > >>>> +    .destruct = netdev_linux_destruct,
> > >>>>      .get_stats = netdev_linux_get_stats,
> > >>>>      .get_features = netdev_linux_get_features,
> > >>>>      .get_status = netdev_linux_get_status,
> > >>>> -    .get_block_id = netdev_linux_get_block_id
> > >>>> +    .get_block_id = netdev_linux_get_block_id,
> > >>>> +    .send = netdev_linux_send,
> > >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> > >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> > >>>>  };
> > >>>>
> > >>>>  const struct netdev_class netdev_tap_class = {
> > >>>>      NETDEV_LINUX_CLASS_COMMON,
> > >>>>      .type = "tap",
> > >>>> +    .is_pmd = false,
> > >>>>      .construct = netdev_linux_construct_tap,
> > >>>> +    .destruct = netdev_linux_destruct,
> > >>>>      .get_stats = netdev_tap_get_stats,
> > >>>>      .get_features = netdev_linux_get_features,
> > >>>>      .get_status = netdev_linux_get_status,
> > >>>> +    .send = netdev_linux_send,
> > >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> > >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> > >>>>  };
> > >>>>
> > >>>>  const struct netdev_class netdev_internal_class = {
> > >>>>      NETDEV_LINUX_CLASS_COMMON,
> > >>>>      LINUX_FLOW_OFFLOAD_API,
> > >>>>      .type = "internal",
> > >>>> +    .is_pmd = false,
> > >>>>      .construct = netdev_linux_construct,
> > >>>> +    .destruct = netdev_linux_destruct,
> > >>>>      .get_stats = netdev_internal_get_stats,
> > >>>>      .get_status = netdev_internal_get_status,
> > >>>> +    .send = netdev_linux_send,
> > >>>> +    .rxq_construct = netdev_linux_rxq_construct,
> > >>>> +    .rxq_recv = netdev_linux_rxq_recv,
> > >>>>  };
> > >>>> +
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +const struct netdev_class netdev_afxdp_class = {
> > >>>> +    NETDEV_LINUX_CLASS_COMMON,
> > >>>> +    .type = "afxdp",
> > >>>> +    .is_pmd = true,
> > >>>> +    .construct = netdev_linux_construct,
> > >>>> +    .destruct = netdev_afxdp_destruct,
> > >>>> +    .get_stats = netdev_afxdp_get_stats,
> > >>>> +    .get_status = netdev_linux_get_status,
> > >>>> +    .set_config = netdev_afxdp_set_config,
> > >>>> +    .get_config = netdev_afxdp_get_config,
> > >>>> +    .reconfigure = netdev_afxdp_reconfigure,
> > >>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
> > >>>> +    .send = netdev_afxdp_batch_send,
> > >>>> +    .rxq_construct = netdev_afxdp_rxq_construct,
> > >>>> +    .rxq_recv = netdev_afxdp_rxq_recv,
> > >>>> +};
> > >>>> +#endif
> > >>>>
> > >>>>
> > >>>>  #define CODEL_N_QUEUES 0x0000
> > >>>> @@ -5918,7 +5875,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
> > >>>>      dst->tx_window_errors = src->tx_window_errors;
> > >>>>  }
> > >>>>
> > >>>> -static int
> > >>>> +int
> > >>>>  get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
> > >>>>  {
> > >>>>      struct ofpbuf request;
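
The netdev-linux.c hunks above move .destruct, .send, .rxq_construct and .rxq_recv out of the shared NETDEV_LINUX_CLASS_COMMON macro so that each class names its own. A reduced sketch of that pattern is below; every sketch_* name is hypothetical, the struct is a tiny stand-in for struct netdev_class, and the return values are arbitrary markers used only to tell the callbacks apart.

```c
#include <stdbool.h>

/* Hypothetical, much reduced stand-in for struct netdev_class. */
struct sketch_class {
    const char *type;
    bool is_pmd;
    int (*wait)(void);
    int (*send)(int n_pkts);
};

static int sketch_wait(void) { return 0; }
static int sketch_kernel_send(int n_pkts) { return n_pkts; }
static int sketch_afxdp_send(int n_pkts) { return 2 * n_pkts; }

/* Only callbacks shared by every class stay in the common macro. */
#define SKETCH_CLASS_COMMON \
    .wait = sketch_wait

static const struct sketch_class sketch_system_class = {
    SKETCH_CLASS_COMMON,
    .type = "system",
    .is_pmd = false,
    .send = sketch_kernel_send,   /* Class-specific callback. */
};

static const struct sketch_class sketch_afxdp_class = {
    SKETCH_CLASS_COMMON,
    .type = "afxdp",
    .is_pmd = true,
    .send = sketch_afxdp_send,    /* Class-specific callback. */
};
```

Keeping per-class callbacks out of the macro avoids initializing the same member twice in one initializer, which C permits (the last initializer wins) but which compilers commonly warn about.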
> > >>>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > >>>> index fb0c27e6e8e8..91e6a9e2bfc0 100644
> > >>>> --- a/lib/netdev-provider.h
> > >>>> +++ b/lib/netdev-provider.h
> > >>>> @@ -903,6 +903,9 @@ extern const struct netdev_class netdev_linux_class;
> > >>>>  extern const struct netdev_class netdev_internal_class;
> > >>>>  extern const struct netdev_class netdev_tap_class;
> > >>>>
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +extern const struct netdev_class netdev_afxdp_class;
> > >>>> +#endif
> > >>>>  #ifdef  __cplusplus
> > >>>>  }
> > >>>>  #endif
> > >>>> diff --git a/lib/netdev.c b/lib/netdev.c
> > >>>> index 7d7ecf6f0946..0fac117cc602 100644
> > >>>> --- a/lib/netdev.c
> > >>>> +++ b/lib/netdev.c
> > >>>> @@ -104,6 +104,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > >>>>
> > >>>>  static void restore_all_flags(void *aux OVS_UNUSED);
> > >>>>  void update_device_args(struct netdev *, const struct shash *args);
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +void signal_remove_xdp(struct netdev *netdev);
> > >>>> +#endif
> > >>>>
> > >>>>  int
> > >>>>  netdev_n_txq(const struct netdev *netdev)
> > >>>> @@ -146,6 +149,9 @@ netdev_initialize(void)
> > >>>>          netdev_register_provider(&netdev_internal_class);
> > >>>>          netdev_register_provider(&netdev_tap_class);
> > >>>>          netdev_vport_tunnel_register();
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +        netdev_register_provider(&netdev_afxdp_class);
> > >>>> +#endif
> > >>>>  #endif
> > >>>>  #if defined(__FreeBSD__) || defined(__NetBSD__)
> > >>>>          netdev_register_provider(&netdev_tap_class);
> > >>>> @@ -2007,6 +2013,11 @@ restore_all_flags(void *aux OVS_UNUSED)
> > >>>>                                                 saved_flags &
> > >>>> ~saved_values,
> > >>>>                                                 &old_flags);
> > >>>>          }
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +        if (netdev->netdev_class == &netdev_afxdp_class) {
> > >>>> +            signal_remove_xdp(netdev);
> > >>>> +        }
> > >>>> +#endif
> > >>>>      }
> > >>>>  }
> > >>>>
> > >>>> diff --git a/lib/spinlock.h b/lib/spinlock.h
> > >>>> new file mode 100644
> > >>>> index 000000000000..1ae634f23a6b
> > >>>> --- /dev/null
> > >>>> +++ b/lib/spinlock.h
> > >>>> @@ -0,0 +1,70 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the
> > >>>> License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing,
> > >>>> software
> > >>>> + * distributed under the License is distributed on an "AS IS"
> > >>>> BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > >>>> implied.
> > >>>> + * See the License for the specific language governing permissions
> > >>>> and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +#ifndef SPINLOCK_H
> > >>>> +#define SPINLOCK_H 1
> > >>>> +
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#include <ctype.h>
> > >>>> +#include <errno.h>
> > >>>> +#include <fcntl.h>
> > >>>> +#include <stdarg.h>
> > >>>> +#include <stdlib.h>
> > >>>> +#include <unistd.h>
> > >>>> +
> > >>>> +#include "ovs-atomic.h"
> > >>>> +
> > >>>> +struct ovs_spinlock {
> > >>>> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
> > >>>> +};
> > >>>> +
> > >>>> +static inline void
> > >>>> +ovs_spinlock_init(struct ovs_spinlock *sl)
> > >>>> +{
> > >>>> +    atomic_init(&sl->locked, 0);
> > >>>> +}
> > >>>> +
> > >>>> +static inline void
> > >>>> +ovs_spin_lock(struct ovs_spinlock *sl)
> > >>>> +{
> > >>>> +    int exp = 0, locked = 0;
> > >>>> +
> > >>>> +    while (!atomic_compare_exchange_strong_explicit(&sl->locked,
> > >>>> &exp, 1,
> > >>>> +                memory_order_acquire,
> > >>>> +                memory_order_relaxed)) {
> > >>>> +        locked = 1;
> > >>>> +        while (locked) {
> > >>>> +            atomic_read_relaxed(&sl->locked, &locked);
> > >>>> +        }
> > >>>> +        exp = 0;
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +static inline void
> > >>>> +ovs_spin_unlock(struct ovs_spinlock *sl)
> > >>>> +{
> > >>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > >>>> +}
> > >>>> +
> > >>>> +static inline int
> > >>>> +ovs_spin_trylock(struct ovs_spinlock *sl)
> > >>>> +{
> > >>>> +    int exp = 0;
> > >>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked,
> > >>>> &exp,
> > >>>> 1,
> > >>>> +                memory_order_acquire,
> > >>>> +                memory_order_relaxed);
> > >>>> +}
> > >>>> +#endif
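
The lock in lib/spinlock.h above is a test-and-test-and-set loop: try a
compare-and-swap with acquire ordering, and while it fails, spin on relaxed
loads until the lock looks free. A minimal sketch of the same idea, assuming
plain C11 <stdatomic.h> instead of OVS's ovs-atomic.h wrappers and leaving out
the cache-line alignment; the demo_* names are hypothetical and not part of
the patch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ovs_spinlock. */
struct demo_spinlock {
    atomic_int locked;   /* 0 = free, 1 = held */
};

static inline void
demo_spinlock_init(struct demo_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

static inline void
demo_spin_lock(struct demo_spinlock *sl)
{
    int exp = 0;

    /* Try to swing 0 -> 1 with acquire ordering. */
    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed)) {
        /* Lock is held: spin on relaxed loads until it looks free. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
        /* The failed CAS stored the observed value in 'exp'; reset it. */
        exp = 0;
    }
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool
demo_spin_trylock(struct demo_spinlock *sl)
{
    int exp = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
demo_spin_unlock(struct demo_spinlock *sl)
{
    /* Debug check only: unlocking a free lock is a caller bug. */
    assert(atomic_load_explicit(&sl->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

As in the patch, the trylock variant returns nonzero on success, which is the
opposite of the pthread_mutex_trylock() convention.
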
> > >>>> diff --git a/lib/util.c b/lib/util.c
> > >>>> index 7b8ab81f6ee1..5eb20995b370 100644
> > >>>> --- a/lib/util.c
> > >>>> +++ b/lib/util.c
> > >>>> @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s)
> > >>>>      return xrealloc(p, *n * s);
> > >>>>  }
> > >>>>
> > >>>> -/* Allocates and returns 'size' bytes of memory aligned to a cache
> > >>>> line and in
> > >>>> - * dedicated cache lines.  That is, the memory block returned will
> > >>>> not share a
> > >>>> - * cache line with other data, avoiding "false sharing".
> > >>>> +/* Allocates and returns 'size' bytes of memory aligned to
> > >>>> 'alignment' bytes.
> > >>>> + * 'alignment' must be a power of two and a multiple of
> > >>>> sizeof(void
> > >>>> *).
> > >>>>   *
> > >>>> - * Use free_cacheline() to free the returned memory block. */
> > >>>> + * Use free_size_align() to free the returned memory block. */
> > >>>>  void *
> > >>>> -xmalloc_cacheline(size_t size)
> > >>>> +xmalloc_size_align(size_t size, size_t alignment)
> > >>>>  {
> > >>>>  #ifdef HAVE_POSIX_MEMALIGN
> > >>>>      void *p;
> > >>>>      int error;
> > >>>>
> > >>>>      COVERAGE_INC(util_xalloc);
> > >>>> -    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
> > >>>> +    error = posix_memalign(&p, alignment, size ? size : 1);
> > >>>>      if (error != 0) {
> > >>>>          out_of_memory();
> > >>>>      }
> > >>>> @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size)
> > >>>>  #else
> > >>>>      /* Allocate room for:
> > >>>>       *
> > >>>> -     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to
> > >>>> allow the
> > >>>> -     *       pointer to be aligned exactly sizeof(void *) bytes
> > >>>> before the
> > >>>> -     *       beginning of a cache line.
> > >>>> +     *     - Header padding: Up to alignment - 1 bytes, to allow
> > >>>> the
> > >>>> +     *       pointer 'q' to be aligned exactly sizeof(void *)
> > >>>> bytes
> > >>>> before the
> > >>>> +     *       beginning of the alignment.
> > >>>>       *
> > >>>>       *     - Pointer: A pointer to the start of the header
> > >>>> padding,
> > >>>> to allow us
> > >>>>       *       to free() the block later.
> > >>>>       *
> > >>>>       *     - User data: 'size' bytes.
> > >>>>       *
> > >>>> -     *     - Trailer padding: Enough to bring the user data up to
> > >>>> a
> > >>>> cache line
> > >>>> +     *     - Trailer padding: Enough to bring the user data up to an alignment
> > >>>>       *       multiple.
> > >>>>       *
> > >>>>       *
> > >>>> +---------------+---------+------------------------+---------+
> > >>>> @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size)
> > >>>>       * p               q         r
> > >>>>       *
> > >>>>       */
> > >>>> -    void *p = xmalloc((CACHE_LINE_SIZE - 1)
> > >>>> -                      + sizeof(void *)
> > >>>> -                      + ROUND_UP(size, CACHE_LINE_SIZE));
> > >>>> -    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) <
> > >>>> sizeof(void *);
> > >>>> -    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ?
> > >>>> CACHE_LINE_SIZE : 0),
> > >>>> -                                CACHE_LINE_SIZE);
> > >>>> -    void **q = (void **) r - 1;
> > >>>> +    void *p, *r, **q;
> > >>>> +    bool runt;
> > >>>> +
> > >>>> +    COVERAGE_INC(util_xalloc);
> > >>>> +    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0))
> > >>>> {
> > >>>> +        ovs_abort(0, "Invalid alignment");
> > >>>> +    }
> > >>>> +
> > >>>> +    p = xmalloc((alignment - 1)
> > >>>> +                + sizeof(void *)
> > >>>> +                + ROUND_UP(size, alignment));
> > >>>> +
> > >>>> +    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
> > >>>> +    /* When the padding size < sizeof(void *), we don't have enough room for
> > >>>> +     * pointer 'q'.  As a result, we need to move 'r' to the next alignment
> > >>>> +     * boundary.  So ROUND_UP when calling xmalloc() above, and ROUND_UP again
> > >>>> +     * when calculating 'r' below.
> > >>>> +     */
> > >>>> +    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
> > >>>> +    q = (void **) r - 1;
> > >>>>      *q = p;
> > >>>> +
> > >>>>      return r;
> > >>>>  #endif
> > >>>>  }
> > >>>>
> > >>>> +void
> > >>>> +free_size_align(void *p)
> > >>>> +{
> > >>>> +#ifdef HAVE_POSIX_MEMALIGN
> > >>>> +    free(p);
> > >>>> +#else
> > >>>> +    if (p) {
> > >>>> +        void **q = (void **) p - 1;
> > >>>> +        free(*q);
> > >>>> +    }
> > >>>> +#endif
> > >>>> +}
> > >>>> +
> > >>>> +/* Allocates and returns 'size' bytes of memory aligned to a cache
> > >>>> line and in
> > >>>> + * dedicated cache lines.  That is, the memory block returned will
> > >>>> not share a
> > >>>> + * cache line with other data, avoiding "false sharing".
> > >>>> + *
> > >>>> + * Use free_cacheline() to free the returned memory block. */
> > >>>> +void *
> > >>>> +xmalloc_cacheline(size_t size)
> > >>>> +{
> > >>>> +    return xmalloc_size_align(size, CACHE_LINE_SIZE);
> > >>>> +}
> > >>>> +
> > >>>>  /* Like xmalloc_cacheline() but clears the allocated memory to all
> > >>>> zero
> > >>>>   * bytes. */
> > >>>>  void *
> > >>>> @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size)
> > >>>>  void
> > >>>>  free_cacheline(void *p)
> > >>>>  {
> > >>>> -#ifdef HAVE_POSIX_MEMALIGN
> > >>>> -    free(p);
> > >>>> -#else
> > >>>> -    if (p) {
> > >>>> -        void **q = (void **) p - 1;
> > >>>> -        free(*q);
> > >>>> -    }
> > >>>> -#endif
> > >>>> +    free_size_align(p);
> > >>>> +}
> > >>>> +
> > >>>> +void *
> > >>>> +xmalloc_pagealign(size_t size)
> > >>>> +{
> > >>>> +    return xmalloc_size_align(size, get_page_size());
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +free_pagealign(void *p)
> > >>>> +{
> > >>>> +    free_size_align(p);
> > >>>>  }
> > >>>>
> > >>>>  char *
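
The pointer arithmetic in the non-posix_memalign branch of
xmalloc_size_align() can be tried in isolation. The following is only a
sketch under stated assumptions: demo_round_up() stands in for OVS's ROUND_UP
and PAD_SIZE macros, plain malloc() replaces xmalloc(), and errors return NULL
instead of aborting. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounds 'x' up to a multiple of 'align', which must be a power of two. */
static uintptr_t
demo_round_up(uintptr_t x, uintptr_t align)
{
    return (x + align - 1) & ~(align - 1);
}

static void *
demo_malloc_aligned(size_t size, size_t alignment)
{
    /* 'alignment' must be a power of two and a multiple of sizeof(void *). */
    if (alignment == 0 || (alignment & (alignment - 1))
        || alignment % sizeof(void *)) {
        return NULL;
    }

    /* Header padding + back-pointer + user data rounded up to 'alignment'. */
    void *p = malloc((alignment - 1) + sizeof(void *)
                     + demo_round_up(size, alignment));
    if (!p) {
        return NULL;
    }

    /* Bytes from 'p' to the next aligned address (0 if already aligned). */
    uintptr_t pad = demo_round_up((uintptr_t) p, alignment) - (uintptr_t) p;

    /* 'runt': the gap is too small to hold the back-pointer 'q', so the
     * user pointer 'r' moves one full alignment step further on. */
    int runt = pad < sizeof(void *);
    void *r = (void *) demo_round_up((uintptr_t) p + (runt ? alignment : 0),
                                     alignment);
    void **q = (void **) r - 1;

    *q = p;     /* Remember the real start of the block for free(). */
    return r;
}

static void
demo_free_aligned(void *r)
{
    if (r) {
        void **q = (void **) r - 1;
        free(*q);
    }
}
```

The back-pointer stored just below 'r' is what lets the free routine recover
the address originally returned by malloc().
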
> > >>>> diff --git a/lib/util.h b/lib/util.h
> > >>>> index c26605abdce3..33665748274c 100644
> > >>>> --- a/lib/util.h
> > >>>> +++ b/lib/util.h
> > >>>> @@ -166,6 +166,11 @@ void ovs_strzcpy(char *dst, const char *src,
> > >>>> size_t size);
> > >>>>
> > >>>>  int string_ends_with(const char *str, const char *suffix);
> > >>>>
> > >>>> +void *xmalloc_pagealign(size_t) MALLOC_LIKE;
> > >>>> +void free_pagealign(void *);
> > >>>> +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
> > >>>> +void free_size_align(void *);
> > >>>> +
> > >>>>  /* The C standards say that neither the 'dst' nor 'src' argument
> > >>>> to
> > >>>>   * memcpy() may be null, even if 'n' is zero.  This wrapper
> > >>>> tolerates
> > >>>>   * the null case. */
> > >>>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c
> > >>>> new file mode 100644
> > >>>> index 000000000000..ea39fa557290
> > >>>> --- /dev/null
> > >>>> +++ b/lib/xdpsock.c
> > >>>> @@ -0,0 +1,170 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the
> > >>>> License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing,
> > >>>> software
> > >>>> + * distributed under the License is distributed on an "AS IS"
> > >>>> BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > >>>> implied.
> > >>>> + * See the License for the specific language governing permissions
> > >>>> and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#include "xdpsock.h"
> > >>>> +#include "dp-packet.h"
> > >>>> +#include "openvswitch/compiler.h"
> > >>>> +
> > >>>> +/* Note:
> > >>>> + * umem_elem_push* shouldn't overflow because we always pop
> > >>>> + * elem first, then push back to the stack.
> > >>>> + */
> > >>>> +static inline void
> > >>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > >>>> +{
> > >>>> +    void *ptr;
> > >>>> +
> > >>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
> > >>>> +        OVS_NOT_REACHED();
> > >>>> +    }
> > >>>> +
> > >>>> +    ptr = &umemp->array[umemp->index];
> > >>>> +    memcpy(ptr, addrs, n * sizeof(void *));
> > >>>> +    umemp->index += n;
> > >>>> +}
> > >>>> +
> > >>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void
> > >>>> **addrs)
> > >>>> +{
> > >>>> +    ovs_spin_lock(&umemp->lock);
> > >>>> +    __umem_elem_push_n(umemp, n, addrs);
> > >>>> +    ovs_spin_unlock(&umemp->lock);
> > >>>> +}
> > >>>> +
> > >>>> +static inline void
> > >>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > >>>> +{
> > >>>> +    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
> > >>>> +        OVS_NOT_REACHED();
> > >>>> +    }
> > >>>> +
> > >>>> +    umemp->array[umemp->index++] = addr;
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
> > >>>> +{
> > >>>> +
> > >>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > >>>> +
> > >>>> +    ovs_spin_lock(&umemp->lock);
> > >>>> +    __umem_elem_push(umemp, addr);
> > >>>> +    ovs_spin_unlock(&umemp->lock);
> > >>>> +}
> > >>>> +
> > >>>> +static inline int
> > >>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > >>>> +{
> > >>>> +    void *ptr;
> > >>>> +
> > >>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > >>>> +        return -ENOMEM;
> > >>>> +    }
> > >>>> +
> > >>>> +    umemp->index -= n;
> > >>>> +    ptr = &umemp->array[umemp->index];
> > >>>> +    memcpy(addrs, ptr, n * sizeof(void *));
> > >>>> +
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > >>>> +{
> > >>>> +    int ret;
> > >>>> +
> > >>>> +    ovs_spin_lock(&umemp->lock);
> > >>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > >>>> +    ovs_spin_unlock(&umemp->lock);
> > >>>> +
> > >>>> +    return ret;
> > >>>> +}
> > >>>> +
> > >>>> +static inline void *
> > >>>> +__umem_elem_pop(struct umem_pool *umemp)
> > >>>> +{
> > >>>> +    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
> > >>>> +        return NULL;
> > >>>> +    }
> > >>>> +
> > >>>> +    return umemp->array[--umemp->index];
> > >>>> +}
> > >>>> +
> > >>>> +void *
> > >>>> +umem_elem_pop(struct umem_pool *umemp)
> > >>>> +{
> > >>>> +    void *ptr;
> > >>>> +
> > >>>> +    ovs_spin_lock(&umemp->lock);
> > >>>> +    ptr = __umem_elem_pop(umemp);
> > >>>> +    ovs_spin_unlock(&umemp->lock);
> > >>>> +
> > >>>> +    return ptr;
> > >>>> +}
> > >>>> +
> > >>>> +static void **
> > >>>> +__umem_pool_alloc(unsigned int size)
> > >>>> +{
> > >>>> +    void *bufs;
> > >>>> +
> > >>>> +    bufs = xmalloc_pagealign(size * sizeof(void *));
> > >>>> +    memset(bufs, 0, size * sizeof(void *));
> > >>>> +
> > >>>> +    return (void **)bufs;
> > >>>> +}
> > >>>> +
> > >>>> +int
> > >>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > >>>> +{
> > >>>> +    umemp->array = __umem_pool_alloc(size);
> > >>>> +    if (!umemp->array) {
> > >>>> +        return -ENOMEM;
> > >>>> +    }
> > >>>> +
> > >>>> +    umemp->size = size;
> > >>>> +    umemp->index = 0;
> > >>>> +    ovs_spinlock_init(&umemp->lock);
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +umem_pool_cleanup(struct umem_pool *umemp)
> > >>>> +{
> > >>>> +    free_pagealign(umemp->array);
> > >>>> +    umemp->array = NULL;
> > >>>> +}
> > >>>> +
> > >>>> +/* AF_XDP metadata init/destroy */
> > >>>> +int
> > >>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > >>>> +{
> > >>>> +    void *bufs;
> > >>>> +
> > >>>> +    bufs = xmalloc_pagealign(size * sizeof(struct
> > >>>> dp_packet_afxdp));
> > >>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > >>>> +
> > >>>> +    xp->array = bufs;
> > >>>> +    xp->size = size;
> > >>>> +
> > >>>> +    return 0;
> > >>>> +}
> > >>>> +
> > >>>> +void
> > >>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > >>>> +{
> > >>>> +    free_pagealign(xp->array);
> > >>>> +    xp->array = NULL;
> > >>>> +}
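
The note at the top of lib/xdpsock.c relies on an invariant: every element
pushed must have been popped earlier, so the stack index can never exceed the
pool size. A minimal single-threaded sketch of that invariant follows,
assuming a tiny fixed-size pool and returning -ENOSPC where the patch calls
OVS_NOT_REACHED(). The demo_* names and the pool size are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_POOL_SIZE 4   /* Illustrative; the patch uses NUM_FRAMES. */

/* Hypothetical fixed-size LIFO of frame pointers (no locking). */
struct demo_pool {
    int index;                      /* Number of free elements; stack top. */
    void *array[DEMO_POOL_SIZE];
};

/* Pushes 'n' pointers; returns -ENOSPC where the patch would abort. */
static int
demo_push_n(struct demo_pool *pool, int n, void **addrs)
{
    if (pool->index + n > DEMO_POOL_SIZE) {
        return -ENOSPC;
    }
    memcpy(&pool->array[pool->index], addrs, n * sizeof(void *));
    pool->index += n;
    return 0;
}

/* Pops 'n' pointers; returns -ENOMEM if fewer than 'n' are available. */
static int
demo_pop_n(struct demo_pool *pool, int n, void **addrs)
{
    if (pool->index - n < 0) {
        return -ENOMEM;
    }
    pool->index -= n;
    memcpy(addrs, &pool->array[pool->index], n * sizeof(void *));
    return 0;
}
```

Pushing back more elements than were popped, for example returning the same
batch twice, is the kind of caller error that trips the overflow check.
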
> > >>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > >>>> new file mode 100644
> > >>>> index 000000000000..1a1093381243
> > >>>> --- /dev/null
> > >>>> +++ b/lib/xdpsock.h
> > >>>> @@ -0,0 +1,101 @@
> > >>>> +/*
> > >>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> + *
> > >>>> + * Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> + * you may not use this file except in compliance with the
> > >>>> License.
> > >>>> + * You may obtain a copy of the License at:
> > >>>> + *
> > >>>> + *     http://www.apache.org/licenses/LICENSE-2.0
> > >>>> + *
> > >>>> + * Unless required by applicable law or agreed to in writing,
> > >>>> software
> > >>>> + * distributed under the License is distributed on an "AS IS"
> > >>>> BASIS,
> > >>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > >>>> implied.
> > >>>> + * See the License for the specific language governing permissions
> > >>>> and
> > >>>> + * limitations under the License.
> > >>>> + */
> > >>>> +
> > >>>> +#ifndef XDPSOCK_H
> > >>>> +#define XDPSOCK_H 1
> > >>>> +
> > >>>> +#include <config.h>
> > >>>> +
> > >>>> +#ifdef HAVE_AF_XDP
> > >>>> +
> > >>>> +#include <bpf/xsk.h>
> > >>>> +#include <errno.h>
> > >>>> +#include <stdbool.h>
> > >>>> +#include <stdio.h>
> > >>>> +
> > >>>> +#include "openvswitch/thread.h"
> > >>>> +#include "ovs-atomic.h"
> > >>>> +#include "spinlock.h"
> > >>>> +
> > >>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > >>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > >>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > >>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > >>>> +
> > >>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > >>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > >>>> +
> > >>>> +/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
> > >>>> + * Setting NUM_FRAMES to this makes sure umem_elem_pop() always succeeds.
> > >>>> + */
> > >>>> +#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
> > >>>> +
> > >>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
> > >>>> +
> > >>>> +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
> > >>>> +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
> > >>>> +BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS +
> > >>>> CONS_NUM_DESCS));
> > >>>> +
> > >>>> +/* LIFO ptr_array */
> > >>>> +struct umem_pool {
> > >>>> +    int index;      /* point to top */
> > >>>> +    unsigned int size;
> > >>>> +    struct ovs_spinlock lock;
> > >>>> +    void **array;   /* a pointer array, point to umem buf */
> > >>>> +};
> > >>>> +
> > >>>> +/* array-based dp_packet_afxdp */
> > >>>> +struct xpacket_pool {
> > >>>> +    unsigned int size;
> > >>>> +    struct dp_packet_afxdp **array;
> > >>>> +};
> > >>>> +
> > >>>> +struct xsk_umem_info {
> > >>>> +    struct umem_pool mpool;
> > >>>> +    struct xpacket_pool xpool;
> > >>>> +    struct xsk_ring_prod fq;
> > >>>> +    struct xsk_ring_cons cq;
> > >>>> +    struct xsk_umem *umem;
> > >>>> +    void *buffer;
> > >>>> +};
> > >>>> +
> > >>>> +struct xsk_socket_info {
> > >>>> +    struct xsk_ring_cons rx;
> > >>>> +    struct xsk_ring_prod tx;
> > >>>> +    struct xsk_umem_info *umem;
> > >>>> +    struct xsk_socket *xsk;
> > >>>> +    unsigned long rx_dropped;
> > >>>> +    unsigned long tx_dropped;
> > >>>> +    uint32_t outstanding_tx;
> > >>>> +};
> > >>>> +
> > >>>> +struct umem_elem {
> > >>>> +    struct umem_elem *next;
> > >>>> +};
> > >>>> +
> > >>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > >>>> +void umem_elem_push_n(struct umem_pool *umemp, int n, void
> > >>>> **addrs);
> > >>>> +
> > >>>> +void *umem_elem_pop(struct umem_pool *umemp);
> > >>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > >>>> +
> > >>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > >>>> +void umem_pool_cleanup(struct umem_pool *umemp);
> > >>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > >>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > >>>> +
> > >>>> +#endif
> > >>>> +#endif
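
The NUM_FRAMES sizing in lib/xdpsock.h can be worked through with concrete
numbers. The sketch below assumes the libbpf defaults are 2048 descriptors
per ring and a frame shift of 12 (4096-byte frames); those values are
assumptions about <bpf/xsk.h>, not something stated in the patch, and the
DEMO_*/demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values of the libbpf defaults that xdpsock.h aliases. */
#define DEMO_PROD_NUM_DESCS   2048
#define DEMO_CONS_NUM_DESCS   2048
#define DEMO_FRAME_SHIFT      12
#define DEMO_FRAME_SIZE       (1 << DEMO_FRAME_SHIFT)
#define DEMO_FRAME_SHIFT_MASK ((1 << DEMO_FRAME_SHIFT) - 1)

/* One frame per descriptor in each of the four rings: TX, CQ, RX, FILL. */
#define DEMO_NUM_FRAMES (2 * (DEMO_PROD_NUM_DESCS + DEMO_CONS_NUM_DESCS))

/* Returns nonzero if 'addr' sits on a frame boundary, which is the
 * condition umem_elem_push() asserts on its argument. */
static int
demo_is_frame_aligned(uint64_t addr)
{
    return (addr & DEMO_FRAME_SHIFT_MASK) == 0;
}

/* Total umem size in bytes under the assumed defaults. */
static uint64_t
demo_umem_bytes(void)
{
    return (uint64_t) DEMO_NUM_FRAMES * DEMO_FRAME_SIZE;
}
```

Under these assumptions the pool holds 8192 frames, or 32 MiB of umem per
socket.
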
> > >>>> diff --git a/tests/automake.mk b/tests/automake.mk
> > >>>> index 2956e68b242c..131564bb0bd3 100644
> > >>>> --- a/tests/automake.mk
> > >>>> +++ b/tests/automake.mk
> > >>>> @@ -4,12 +4,14 @@ EXTRA_DIST += \
> > >>>>       $(SYSTEM_TESTSUITE_AT) \
> > >>>>       $(SYSTEM_KMOD_TESTSUITE_AT) \
> > >>>>       $(SYSTEM_USERSPACE_TESTSUITE_AT) \
> > >>>> +     $(SYSTEM_AFXDP_TESTSUITE_AT) \
> > >>>>       $(SYSTEM_OFFLOADS_TESTSUITE_AT) \
> > >>>>       $(SYSTEM_DPDK_TESTSUITE_AT) \
> > >>>>       $(OVSDB_CLUSTER_TESTSUITE_AT) \
> > >>>>       $(TESTSUITE) \
> > >>>>       $(SYSTEM_KMOD_TESTSUITE) \
> > >>>>       $(SYSTEM_USERSPACE_TESTSUITE) \
> > >>>> +     $(SYSTEM_AFXDP_TESTSUITE) \
> > >>>>       $(SYSTEM_OFFLOADS_TESTSUITE) \
> > >>>>       $(SYSTEM_DPDK_TESTSUITE) \
> > >>>>       $(OVSDB_CLUSTER_TESTSUITE) \
> > >>>> @@ -160,6 +162,10 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
> > >>>>       tests/system-userspace-macros.at \
> > >>>>       tests/system-userspace-packet-type-aware.at
> > >>>>
> > >>>> +SYSTEM_AFXDP_TESTSUITE_AT = \
> > >>>> +     tests/system-afxdp-testsuite.at \
> > >>>> +     tests/system-afxdp-macros.at
> > >>>> +
> > >>>>  SYSTEM_TESTSUITE_AT = \
> > >>>>       tests/system-common-macros.at \
> > >>>>       tests/system-ovn.at \
> > >>>> @@ -184,6 +190,7 @@ TESTSUITE = $(srcdir)/tests/testsuite
> > >>>>  TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
> > >>>>  SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
> > >>>>  SYSTEM_USERSPACE_TESTSUITE =
> > >>>> $(srcdir)/tests/system-userspace-testsuite
> > >>>> +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
> > >>>>  SYSTEM_OFFLOADS_TESTSUITE =
> > >>>> $(srcdir)/tests/system-offloads-testsuite
> > >>>>  SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
> > >>>>  OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
> > >>>> @@ -317,6 +324,11 @@ check-system-userspace: all
> > >>>>       set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests
> > >>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> > >>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes &&
> > >>>> "$$@"
> > >>>> --recheck)
> > >>>>
> > >>>> +check-afxdp: all
> > >>>> +     $(MAKE) install
> > >>>> +     set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests
> > >>>> AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
> > >>>> +     "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
> > >>>> +
> > >>>>  check-offloads: all
> > >>>>       set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests
> > >>>> AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
> > >>>>       "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes &&
> > >>>> "$$@"
> > >>>> --recheck)
> > >>>> @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4
> > >>>> $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
> > >>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> > >>>>       $(AM_V_at)mv $@.tmp $@
> > >>>>
> > >>>> +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> > >>>> $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
> > >>>> +     $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> > >>>> +     $(AM_V_at)mv $@.tmp $@
> > >>>> +
> > >>>>  $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT)
> > >>>> $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
> > >>>>       $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
> > >>>>       $(AM_V_at)mv $@.tmp $@
> > >>>> diff --git a/tests/system-afxdp-macros.at
> > >>>> b/tests/system-afxdp-macros.at
> > >>>> new file mode 100644
> > >>>> index 000000000000..1e6f7a46b4b7
> > >>>> --- /dev/null
> > >>>> +++ b/tests/system-afxdp-macros.at
> > >>>> @@ -0,0 +1,20 @@
> > >>>> +# Add port to ovs bridge by using afxdp mode.
> > >>>> +# This will use generic XDP support in the veth driver.
> > >>>> +m4_define([ADD_VETH],
> > >>>> +    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 ||
> > >>>> return
> > >>>> 77])
> > >>>> +      CONFIGURE_VETH_OFFLOADS([$1])
> > >>>> +      AT_CHECK([ip link set $1 netns $2])
> > >>>> +      AT_CHECK([ip link set dev ovs-$1 up])
> > >>>> +      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
> > >>>> +                set interface ovs-$1 external-ids:iface-id="$1"
> > >>>> type="afxdp"])
> > >>>> +      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
> > >>>> +      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
> > >>>> +      if test -n "$5"; then
> > >>>> +        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
> > >>>> +      fi
> > >>>> +      if test -n "$6"; then
> > >>>> +        NS_CHECK_EXEC([$2], [ip route add default via $6])
> > >>>> +      fi
> > >>>> +      on_exit 'ip link del ovs-$1'
> > >>>> +    ]
> > >>>> +)
> > >>>> diff --git a/tests/system-afxdp-testsuite.at
> > >>>> b/tests/system-afxdp-testsuite.at
> > >>>> new file mode 100644
> > >>>> index 000000000000..9b7a29066614
> > >>>> --- /dev/null
> > >>>> +++ b/tests/system-afxdp-testsuite.at
> > >>>> @@ -0,0 +1,26 @@
> > >>>> +AT_INIT
> > >>>> +
> > >>>> +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
> > >>>> +
> > >>>> +Licensed under the Apache License, Version 2.0 (the "License");
> > >>>> +you may not use this file except in compliance with the License.
> > >>>> +You may obtain a copy of the License at:
> > >>>> +
> > >>>> +    http://www.apache.org/licenses/LICENSE-2.0
> > >>>> +
> > >>>> +Unless required by applicable law or agreed to in writing,
> > >>>> software
> > >>>> +distributed under the License is distributed on an "AS IS" BASIS,
> > >>>> +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > >>>> implied.
> > >>>> +See the License for the specific language governing permissions
> > >>>> and
> > >>>> +limitations under the License.])
> > >>>> +
> > >>>> +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
> > >>>> +
> > >>>> +m4_include([tests/ovs-macros.at])
> > >>>> +m4_include([tests/ovsdb-macros.at])
> > >>>> +m4_include([tests/ofproto-macros.at])
> > >>>> +m4_include([tests/system-common-macros.at])
> > >>>> +m4_include([tests/system-userspace-macros.at])
> > >>>> +m4_include([tests/system-afxdp-macros.at])
> > >>>> +
> > >>>> +m4_include([tests/system-traffic.at])
> > >>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > >>>> index 89c06a1b7877..1e3acbbb8075 100644
> > >>>> --- a/vswitchd/vswitch.xml
> > >>>> +++ b/vswitchd/vswitch.xml
> > >>>> @@ -3101,6 +3101,21 @@ ovs-vsctl add-port br0 p0 -- set Interface
> > >>>> p0
> > >>>> type=patch options:peer=p1 \
> > >>>>          </p>
> > >>>>        </column>
> > >>>>
> > >>>> +      <column name="other_config" key="xdpmode"
> > >>>> +              type='{"type": "string",
> > >>>> +                     "enum": ["set", ["skb", "drv"]]}'>
> > >>>> +        <p>
> > >>>> +          Specifies the operational mode of the XDP program.
> > >>>> +          If "drv", the XDP program is loaded into the device driver with
> > >>>> +          zero-copy RX and TX enabled. This mode requires a device driver
> > >>>> +          with AF_XDP support and has the best performance.
> > >>>> +          If "skb", the XDP program uses the kernel's generic XDP mode, with
> > >>>> +          extra data copying between userspace and kernel. No device driver
> > >>>> +          support is needed. Note that this option applies only to the afxdp
> > >>>> +          netdev type. Defaults to "skb" mode.
> > >>>> +        </p>
> > >>>> +      </column>
> > >>>> +
> > >>>>        <column name="options" key="vhost-server-path"
> > >>>>                type='{"type": "string"}'>
> > >>>>          <p>
> > >>>> --
> > >>>> 2.7.4
Eelco Chaudron June 12, 2019, 8:56 a.m. UTC | #10
Hi William,

I’m using 64-byte packets generated by a Xena, sent at 10G wire speed.
My ingress and egress ports are the same ixgbe NIC.

I tried skb-mode, and as you mention it’s much slower (4x) and does 
not trigger the issue…

I also tried the v8 patch again on this setup and it’s NOT crashing in 
driver mode. So it’s caused by something introduced/changed after 
rev8.

Cheers,

Eelco

On 11 Jun 2019, at 19:47, William Tu wrote:

> Hi Eelco,
>
> I tested using ixgbe driver but still works ok.
> Is the crash due to packet size > mtu?
> In my case, I only tested 64B packets size.
>
> Thanks
> William
>
> On Tue, Jun 11, 2019 at 8:02 AM William Tu <u9012063@gmail.com> wrote:
>>
>> Hi Eelco,
>>
>> Thanks for the trace.
>>
>> On Tue, Jun 11, 2019 at 6:52 AM Eelco Chaudron <echaudro@redhat.com> 
>> wrote:
>>>
>>> Hi William,
>>>
>>> Here are some more details, this is a port to port test (same port 
>>> in as
>>> out) using the following rule:
>>>
>>>    ovs-ofctl add-flow ovs_pvp_br0 "in_port=eno1,action=IN_PORT"
>>>
>>> Sent packets wire speed, and crash…
>>>
>>> (gdb) bt
>>> #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
>>> #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
>>> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
>>> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
>>> #3  umem_elem_push_n (umemp=0x24cc790, n=32,
>>> addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
>>> #4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at
>>> lib/netdev-afxdp.c:736
>>> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0,
>>> batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at
>>> lib/netdev-afxdp.c:763
>>> #6  0x0000000000908041 in netdev_send (netdev=<optimized out>,
>>> qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80,
>>> concurrent_txq=concurrent_txq@entry=true)
>>>      at lib/netdev.c:800
>>> #7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port
>>> (pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at
>>> lib/dpif-netdev.c:4187
>>> #8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets
>>> (pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at
>>> lib/dpif-netdev.c:4227
>>> #9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets
>>> (force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
>>> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010,
>>> rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
>>> #11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
>>> lib/dpif-netdev.c:5449
>>> #12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) 
>>> at
>>> lib/ovs-thread.c:352
>>> #13 0x00007fbc6b0a12de in start_thread () from 
>>> /lib64/libpthread.so.0
>>> #14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6
>>>
>>> After this crash, systemd restart OVS, and it crashed again (guess
>>> traffic was still flowing for a bit with the NORMAL rule installed):
>>>
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
>>> qfill=0x0) at lib/netdev-afxdp.c:583
>>> 583         rx->fd = xsk_socket__fd(xsk->xsk);
>>> [Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]
>>>
>>> (gdb) bt
>>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
>>> qfill=0x0) at lib/netdev-afxdp.c:583
>>> #1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>,
>>> batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at
>>> lib/netdev.c:710
>>> #2  0x00000000008dd1d3 in dp_netdev_process_rxq_port
>>> (pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at
>>> lib/dpif-netdev.c:4257
>>> #3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
>>> lib/dpif-netdev.c:5449
>>> #4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>) 
>>> at
>>> lib/ovs-thread.c:352
>>> #5  0x00007f47229732de in start_thread () from 
>>> /lib64/libpthread.so.0
>>> #6  0x00007f4722118a63 in clone () from /lib64/libc.so.6
>>>
>>> I did not further investigate, but it should be easy to replicate. 
>>> This
>>> is the same setup that worked fine with the v8 patchset for port to
>>> port.
>>> Next step was to verify PVP was fixed, but could not get there…
>>> Cheers,
>>
>> I'm not able to reproduce it on my testbed using i40e, I will try
>> using ixgbe today.
>>
>> btw, if you try skb-mode, does the crash still show up?
>> Although skb-mode is much slower, so it might not trigger the issue.
>>
>> Regards,
>> William
>>
>>>
<SNIP>
William Tu June 12, 2019, 11:09 p.m. UTC | #11
On Wed, Jun 12, 2019 at 1:56 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> Hi William,
>
> I’m using 64 bytes generated by a Xena sending 10G wire speed.
> My ingress and egress port are the same ixgbe nic.
>
> I tried skb-mode, and as you mention it’s much slower (4x) and does
> not trigger the issue…
>
> I also tried the v8 patch again on this setup and it’s NOT crashing in
> driver mode. So it’s caused by something introduced/changed after
> rev8.

Thanks a lot!
I can reproduce the issue now.
The issue happens when I have two af_xdp ports running in drv-mode.
Using one port as rx and the other as tx, I see the crash after a
couple of seconds.

Let me compare with v8 to see what changed.

William
>
> Cheers,
>
> Eelco
>
> On 11 Jun 2019, at 19:47, William Tu wrote:
>
> > Hi Eelco,
> >
> > I tested using ixgbe driver but still works ok.
> > Is the crash due to packet size > mtu?
> > In my case, I only tested 64B packets size.
> >
> > Thanks
> > William
> >
> > On Tue, Jun 11, 2019 at 8:02 AM William Tu <u9012063@gmail.com> wrote:
> >>
> >> Hi Eelco,
> >>
> >> Thanks for the trace.
> >>
> >> On Tue, Jun 11, 2019 at 6:52 AM Eelco Chaudron <echaudro@redhat.com>
> >> wrote:
> >>>
> >>> Hi William,
> >>>
> >>> Here are some more details, this is a port to port test (same port
> >>> in as
> >>> out) using the following rule:
> >>>
> >>>    ovs-ofctl add-flow ovs_pvp_br0 "in_port=eno1,action=IN_PORT"
> >>>
> >>> Sent packets wire speed, and crash…
> >>>
> >>> (gdb) bt
> >>> #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
> >>> #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
> >>> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
> >>> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
> >>> #3  umem_elem_push_n (umemp=0x24cc790, n=32,
> >>> addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
> >>> #4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at
> >>> lib/netdev-afxdp.c:736
> >>> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0,
> >>> batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at
> >>> lib/netdev-afxdp.c:763
> >>> #6  0x0000000000908041 in netdev_send (netdev=<optimized out>,
> >>> qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80,
> >>> concurrent_txq=concurrent_txq@entry=true)
> >>>      at lib/netdev.c:800
> >>> #7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port
> >>> (pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at
> >>> lib/dpif-netdev.c:4187
> >>> #8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets
> >>> (pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at
> >>> lib/dpif-netdev.c:4227
> >>> #9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets
> >>> (force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
> >>> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010,
> >>> rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
> >>> #11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> >>> lib/dpif-netdev.c:5449
> >>> #12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>)
> >>> at
> >>> lib/ovs-thread.c:352
> >>> #13 0x00007fbc6b0a12de in start_thread () from
> >>> /lib64/libpthread.so.0
> >>> #14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6
> >>>
> >>> After this crash, systemd restart OVS, and it crashed again (guess
> >>> traffic was still flowing for a bit with the NORMAL rule installed):
> >>>
> >>> Program terminated with signal SIGSEGV, Segmentation fault.
> >>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> >>> qfill=0x0) at lib/netdev-afxdp.c:583
> >>> 583         rx->fd = xsk_socket__fd(xsk->xsk);
> >>> [Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]
> >>>
> >>> (gdb) bt
> >>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> >>> qfill=0x0) at lib/netdev-afxdp.c:583
> >>> #1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>,
> >>> batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at
> >>> lib/netdev.c:710
> >>> #2  0x00000000008dd1d3 in dp_netdev_process_rxq_port
> >>> (pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at
> >>> lib/dpif-netdev.c:4257
> >>> #3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> >>> lib/dpif-netdev.c:5449
> >>> #4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>)
> >>> at
> >>> lib/ovs-thread.c:352
> >>> #5  0x00007f47229732de in start_thread () from
> >>> /lib64/libpthread.so.0
> >>> #6  0x00007f4722118a63 in clone () from /lib64/libc.so.6
> >>>
> >>> I did not further investigate, but it should be easy to replicate.
> >>> This
> >>> is the same setup that worked fine with the v8 patchset for port to
> >>> port.
> >>> Next step was to verify PVP was fixed, but could not get there…
> >>> Cheers,
> >>
> >> I'm not able to reproduce it on my testbed using i40e, I will try
> >> using ixgbe today.
> >>
> >> btw, if you try skb-mode, does the crash still show up?
> >> Although skb-mode is much slower, so it might not trigger the issue.
> >>
> >> Regards,
> >> William
> >>
> >>>
> <SNIP>
William Tu June 13, 2019, 12:37 a.m. UTC | #12
Hi Eelco,

> > >>> #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
> > >>> #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
> > >>> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
> > >>> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
> > >>> #3  umem_elem_push_n (umemp=0x24cc790, n=32,

I've found that it's due to freeing the afxdp buffer twice.
free_afxdp_buf() should be called once per dp_packet, but somehow
it gets called twice.
Applying this on v11 fixes the issue:
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -145,7 +145,7 @@ dp_packet_uninit(struct dp_packet *b)
             free_dpdk_buf((struct dp_packet*) b);
 #endif
         } else if (b->source == DPBUF_AFXDP) {
-            free_afxdp_buf(b);
+            ;
         }
     }
 }

I will work on the next version.
Thank you
William
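
To illustrate the double-free hypothesis above, here is a minimal sketch in C of a fixed-capacity buffer pool in which returning the same buffer twice shows up as an overflow. The names toy_pool, toy_pool_push and toy_pool_pop are hypothetical and are not the OVS umem pool; the actual check at lib/xdpsock.c:32 is not shown in this thread, so this only models the general idea, assuming the pool has exactly one slot per buffer.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAPACITY 4u   /* Illustrative: one slot per existing buffer. */

/* Hypothetical LIFO pool of free buffers (not the OVS implementation). */
struct toy_pool {
    void *slots[POOL_CAPACITY];
    unsigned int count;        /* Number of free buffers currently stored. */
};

/* Return a buffer to the pool.  Returns 0 on success, or -1 if the pool is
 * already full.  With exactly POOL_CAPACITY buffers in existence, a full
 * pool on push means some buffer has been returned more than once. */
static int toy_pool_push(struct toy_pool *p, void *buf)
{
    assert(p != NULL);
    if (p->count == POOL_CAPACITY) {
        return -1;
    }
    p->slots[p->count++] = buf;
    return 0;
}

/* Take a buffer from the pool, or NULL if the pool is empty. */
static void *toy_pool_pop(struct toy_pool *p)
{
    assert(p != NULL);
    return p->count ? p->slots[--p->count] : NULL;
}
```

In this model, popping one buffer from a full pool and pushing it back twice makes the second push fail, which is the kind of condition a real pool might turn into an abort.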

<snip>
> > >>> addrs=addrs@entry=0x7fbc40f2eea0) at lib/xdpsock.c:43
> > >>> #4  0x00000000009b4f51 in afxdp_complete_tx (xsk=0x24c86f0) at
> > >>> lib/netdev-afxdp.c:736
> > >>> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0,
> > >>> batch=0x7fbc24004e80, concurrent_txq=<optimized out>) at
> > >>> lib/netdev-afxdp.c:763
> > >>> #6  0x0000000000908041 in netdev_send (netdev=<optimized out>,
> > >>> qid=qid@entry=0, batch=batch@entry=0x7fbc24004e80,
> > >>> concurrent_txq=concurrent_txq@entry=true)
> > >>>      at lib/netdev.c:800
> > >>> #7  0x00000000008d4c34 in dp_netdev_pmd_flush_output_on_port
> > >>> (pmd=pmd@entry=0x7fbc40f32010, p=p@entry=0x7fbc24004e50) at
> > >>> lib/dpif-netdev.c:4187
> > >>> #8  0x00000000008d4f4f in dp_netdev_pmd_flush_output_packets
> > >>> (pmd=pmd@entry=0x7fbc40f32010, force=force@entry=false) at
> > >>> lib/dpif-netdev.c:4227
> > >>> #9  0x00000000008dd2e7 in dp_netdev_pmd_flush_output_packets
> > >>> (force=false, pmd=0x7fbc40f32010) at lib/dpif-netdev.c:4282
> > >>> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fbc40f32010,
> > >>> rxq=0x24ce650, port_no=1) at lib/dpif-netdev.c:4282
> > >>> #11 0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> > >>> lib/dpif-netdev.c:5449
> > >>> #12 0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>)
> > >>> at
> > >>> lib/ovs-thread.c:352
> > >>> #13 0x00007fbc6b0a12de in start_thread () from
> > >>> /lib64/libpthread.so.0
> > >>> #14 0x00007fbc6a846a63 in clone () from /lib64/libc.so.6
> > >>>
> > >>> After this crash, systemd restart OVS, and it crashed again (guess
> > >>> traffic was still flowing for a bit with the NORMAL rule installed):
> > >>>
> > >>> Program terminated with signal SIGSEGV, Segmentation fault.
> > >>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> > >>> qfill=0x0) at lib/netdev-afxdp.c:583
> > >>> 583         rx->fd = xsk_socket__fd(xsk->xsk);
> > >>> [Current thread is 1 (Thread 0x7f46f8ff9700 (LWP 28171))]
> > >>>
> > >>> (gdb) bt
> > >>> #0  netdev_afxdp_rxq_recv (rxq_=0x2d5a860, batch=0x7f46f8ff70d0,
> > >>> qfill=0x0) at lib/netdev-afxdp.c:583
> > >>> #1  0x0000000000907f31 in netdev_rxq_recv (rx=<optimized out>,
> > >>> batch=batch@entry=0x7f46f8ff70d0, qfill=<optimized out>) at
> > >>> lib/netdev.c:710
> > >>> #2  0x00000000008dd1d3 in dp_netdev_process_rxq_port
> > >>> (pmd=pmd@entry=0x2d8f0c0, rxq=0x2d8c090, port_no=2) at
> > >>> lib/dpif-netdev.c:4257
> > >>> #3  0x00000000008dd64d in pmd_thread_main (f_=<optimized out>) at
> > >>> lib/dpif-netdev.c:5449
> > >>> #4  0x000000000095e95d in ovsthread_wrapper (aux_=<optimized out>)
> > >>> at
> > >>> lib/ovs-thread.c:352
> > >>> #5  0x00007f47229732de in start_thread () from
> > >>> /lib64/libpthread.so.0
> > >>> #6  0x00007f4722118a63 in clone () from /lib64/libc.so.6
> > >>>
> > >>> I did not further investigate, but it should be easy to replicate.
> > >>> This
> > >>> is the same setup that worked fine with the v8 patchset for port to
> > >>> port.
> > >>> Next step was to verify PVP was fixed, but could not get there…
> > >>> Cheers,
> > >>
> > >> I'm not able to reproduce it on my testbed using i40e, I will try
> > >> using ixgbe today.
> > >>
> > >> btw, if you try skb-mode, does the crash still show up?
> > >> Although skb-mode is much slower, so it might not trigger the issue.
> > >>
> > >> Regards,
> > >> William
> > >>
> > >>>
> > <SNIP>
Eelco Chaudron June 13, 2019, 7:17 a.m. UTC | #13
On 13 Jun 2019, at 2:37, William Tu wrote:

> Hi Eelco,
>
>>>>>> #0  0x00007fbc6a78193f in raise () from /lib64/libc.so.6
>>>>>> #1  0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
>>>>>> #2  0x00000000004ed1a1 in __umem_elem_push_n 
>>>>>> (addrs=0x7fbc40f2ec50,
>>>>>> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
>>>>>> #3  umem_elem_push_n (umemp=0x24cc790, n=32,
>
> I've found that it's due to free the afxdp twice.
> The free_afxdp_buf() should be called once per dp_packet, somehow
> it gets called twice.
> Applying this on v11 fixes the issue
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -145,7 +145,7 @@ dp_packet_uninit(struct dp_packet *b)
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
>          } else if (b->source == DPBUF_AFXDP) {
> -            free_afxdp_buf(b);
> +            ;
>          }
>      }
>  }
>

It’s still crashing for me, even with the change. I’m using a single 
nic, same port in as out:

```
@@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
               * created as a dp_packet */
              free_dpdk_buf((struct dp_packet*) b);
  #endif
+        } else if (b->source == DPBUF_AFXDP) {
+//            free_afxdp_buf(b);
          }
      }
  }
```

```
Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock 
-vconsole:emer -vsyslog:err -vfi'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f62bdf33700 (LWP 24737))]
Missing separate debuginfos, use: dnf debuginfo-install 
elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
#1  0x00007f62ef6fbc95 in abort () from /lib64/libc.so.6
#2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7f62bdf30c50, 
n=32, umemp=0x2378660) at lib/xdpsock.c:32
#3  umem_elem_push_n (umemp=0x2378660, n=32, 
addrs=addrs@entry=0x7f62bdf30ea0) at lib/xdpsock.c:43
#4  0x00000000009b4f41 in afxdp_complete_tx (xsk=0x2378250) at 
lib/netdev-afxdp.c:736
#5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0, 
batch=0x7f62a4004e80, concurrent_txq=<optimized out>) at 
lib/netdev-afxdp.c:763
#6  0x0000000000908031 in netdev_send (netdev=<optimized out>, 
qid=qid@entry=0, batch=batch@entry=0x7f62a4004e80, 
concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:800
#7  0x00000000008d4c24 in dp_netdev_pmd_flush_output_on_port 
(pmd=pmd@entry=0x7f62bdf34010, p=p@entry=0x7f62a4004e50) at 
lib/dpif-netdev.c:4187
#8  0x00000000008d4f3f in dp_netdev_pmd_flush_output_packets 
(pmd=pmd@entry=0x7f62bdf34010, force=force@entry=false) at 
lib/dpif-netdev.c:4227
#9  0x00000000008dd2d7 in dp_netdev_pmd_flush_output_packets 
(force=false, pmd=0x7f62bdf34010) at lib/dpif-netdev.c:4282
#10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f62bdf34010, 
rxq=0x237e650, port_no=1) at lib/dpif-netdev.c:4282
#11 0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5449
#12 0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at 
lib/ovs-thread.c:352
#13 0x00007f62f00312de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f62ef7d6a63 in clone () from /lib64/libc.so.6
```


```
Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock 
-vconsole:emer -vsyslog:err -vfi'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, 
qfill=0x0) at lib/netdev-afxdp.c:583
583	    rx->fd = xsk_socket__fd(xsk->xsk);
[Current thread is 1 (Thread 0x7f399f7fe700 (LWP 24597))]
Missing separate debuginfos, use: dnf debuginfo-install 
elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, 
qfill=0x0) at lib/netdev-afxdp.c:583
#1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>, 
batch=batch@entry=0x7f399f7fc0d0, qfill=<optimized out>) at 
lib/netdev.c:710
#2  0x00000000008dd1c3 in dp_netdev_process_rxq_port 
(pmd=pmd@entry=0x2bf80c0, rxq=0x2bd63e0, port_no=2) at 
lib/dpif-netdev.c:4257
#3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5449
#4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at 
lib/ovs-thread.c:352
#5  0x00007f39d86de2de in start_thread () from /lib64/libpthread.so.0
#6  0x00007f39d7e83a63 in clone () from /lib64/libc.so.6
(gdb) p xsk->xsk
Cannot access memory at address 0x316f6ebd
(gdb) p xsk
$1 = (struct xsk_socket_info *) 0x316f6e65
```
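
A side observation on the value printed above: 0x316f6e65, read as little-endian bytes, spells the ASCII string "eno1", which is the port name used in the flow rule earlier in this thread. That may hint that the xsk pointer was overwritten with string data, though nothing in the thread confirms this. A minimal sketch in C for decoding such a value (value_to_ascii is a hypothetical helper, not part of OVS):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode the four bytes of 'value' in little-endian memory order into a
 * NUL-terminated string.  Non-printable bytes are shown as '.'. */
static void value_to_ascii(uint32_t value, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (value >> (8 * i)) & 0xff;

        out[i] = (c >= 0x20 && c < 0x7f) ? (char) c : '.';
    }
    out[4] = '\0';
}
```

Calling value_to_ascii(0x316f6e65u, text) fills text with "eno1".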

> I will work on next version
> Thank you
> William


<SNIP>
William Tu June 15, 2019, 1:33 p.m. UTC | #14
Hi Eelco,

I think it's either a bug in the kernel or my misunderstanding of how to
process the xsk cq ring. I posted the issue here:
https://marc.info/?l=xdp-newbies&m=156055471727857&w=2

Applying this to v11 fixes my crash.
Do you mind testing it again on your system?
Thanks for your time on the trial and error.

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index a6543e8f5126..800c047b71f9 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -705,6 +705,7 @@ afxdp_complete_tx(struct xsk_socket_info *xsk)
     struct umem_elem *elems_push[BATCH_SIZE];
     uint32_t idx_cq = 0;
     int tx_done, j, ret;
+    int tx = 0;

     if (!xsk->outstanding_tx) {
         return;
@@ -717,23 +718,29 @@ afxdp_complete_tx(struct xsk_socket_info *xsk)
     }

     tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
-    if (tx_done > 0) {
-        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
-        xsk->outstanding_tx -= tx_done;
-    }

     /* Recycle back to umem pool */
     for (j = 0; j < tx_done; j++) {
         struct umem_elem *elem;
-        uint64_t addr;
+        uint64_t *addr;

-        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+        addr = xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+        if (*addr == 0) {
+            continue;
+        }
         elem = ALIGNED_CAST(struct umem_elem *,
-                            (char *)xsk->umem->buffer + addr);
+                            (char *)xsk->umem->buffer + *addr);
         elems_push[j] = elem;
+        *addr = 0;
+        tx++;
     }

-    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
+    umem_elem_push_n(&xsk->umem->mpool, tx, (void **)elems_push);
+
+    if (tx_done > 0) {
+        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
+        xsk->outstanding_tx -= tx_done;
+    }
 }

 int
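
One thing the diff above changes is the ordering: the completion entries are read first and released to the producer afterwards. The sketch below is a toy single-producer/single-consumer ring in C that follows the same read-then-release order. It is not the libbpf API; toy_cq, toy_cq_complete and toy_cq_drain are hypothetical names, and whether the early release was the actual cause of the crash is not established in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_SIZE 8u                 /* Illustrative; must be a power of two. */
#define CQ_MASK (CQ_SIZE - 1)

/* Toy completion ring: the simulated kernel produces, user space consumes. */
struct toy_cq {
    uint64_t addrs[CQ_SIZE];
    uint32_t producer;             /* Advanced by the simulated kernel. */
    uint32_t consumer;             /* Advanced by user space on release. */
};

/* Kernel side: report one completed transmit.  Returns 1 on success, or 0
 * if user space has not yet released any slot to write into. */
static int toy_cq_complete(struct toy_cq *cq, uint64_t addr)
{
    if (cq->producer - cq->consumer == CQ_SIZE) {
        return 0;
    }
    cq->addrs[cq->producer++ & CQ_MASK] = addr;
    return 1;
}

/* User side: copy out up to 'max' completed addresses, then release them. */
static uint32_t toy_cq_drain(struct toy_cq *cq, uint64_t *out, uint32_t max)
{
    uint32_t avail = cq->producer - cq->consumer;
    uint32_t n = avail < max ? avail : max;
    uint32_t idx = cq->consumer;

    for (uint32_t i = 0; i < n; i++) {
        out[i] = cq->addrs[idx++ & CQ_MASK];
    }
    /* Release only after every slot has been read.  Once released, the
     * producer is free to overwrite those slots. */
    cq->consumer += n;
    return n;
}
```

In this model a slot can only be rewritten after toy_cq_drain() has copied it out, so the consumer never reads an entry the producer has already reused.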

On Thu, Jun 13, 2019 at 12:17 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> On 13 Jun 2019, at 2:37, William Tu wrote:
>
> Hi Eelco,
>
> #0 0x00007fbc6a78193f in raise () from /lib64/libc.so.6
> #1 0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
> #2 0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
> #3 umem_elem_push_n (umemp=0x24cc790, n=32,
>
> I've found that it's due to free the afxdp twice.
> The free_afxdp_buf() should be called once per dp_packet, somehow
> it gets called twice.
> Applying this on v11 fixes the issue
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -145,7 +145,7 @@ dp_packet_uninit(struct dp_packet *b)
> free_dpdk_buf((struct dp_packet*) b);
> #endif
> } else if (b->source == DPBUF_AFXDP) {
> - free_afxdp_buf(b);
> + ;
> }
> }
> }
>
> It’s still crashing for me, even with the change. I’m using a single nic, same port in as out:
>
> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +//            free_afxdp_buf(b);
>          }
>      }
>  }
>
> Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfi'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
> [Current thread is 1 (Thread 0x7f62bdf33700 (LWP 24737))]
> Missing separate debuginfos, use: dnf debuginfo-install elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 zlib-1.2.11-10.el8.x86_64
> (gdb) bt
> #0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
> #1  0x00007f62ef6fbc95 in abort () from /lib64/libc.so.6
> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7f62bdf30c50, n=32, umemp=0x2378660) at lib/xdpsock.c:32
> #3  umem_elem_push_n (umemp=0x2378660, n=32, addrs=addrs@entry=0x7f62bdf30ea0) at lib/xdpsock.c:43
> #4  0x00000000009b4f41 in afxdp_complete_tx (xsk=0x2378250) at lib/netdev-afxdp.c:736
> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0, batch=0x7f62a4004e80, concurrent_txq=<optimized out>) at lib/netdev-afxdp.c:763
> #6  0x0000000000908031 in netdev_send (netdev=<optimized out>, qid=qid@entry=0, batch=batch@entry=0x7f62a4004e80, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:800
> #7  0x00000000008d4c24 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x7f62bdf34010, p=p@entry=0x7f62a4004e50) at lib/dpif-netdev.c:4187
> #8  0x00000000008d4f3f in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x7f62bdf34010, force=force@entry=false) at lib/dpif-netdev.c:4227
> #9  0x00000000008dd2d7 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x7f62bdf34010) at lib/dpif-netdev.c:4282
> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f62bdf34010, rxq=0x237e650, port_no=1) at lib/dpif-netdev.c:4282
> #11 0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:5449
> #12 0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:352
> #13 0x00007f62f00312de in start_thread () from /lib64/libpthread.so.0
> #14 0x00007f62ef7d6a63 in clone () from /lib64/libc.so.6
>
> Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfi'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, qfill=0x0) at lib/netdev-afxdp.c:583
> 583        rx->fd = xsk_socket__fd(xsk->xsk);
> [Current thread is 1 (Thread 0x7f399f7fe700 (LWP 24597))]
> Missing separate debuginfos, use: dnf debuginfo-install elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 zlib-1.2.11-10.el8.x86_64
> (gdb) bt
> #0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, qfill=0x0) at lib/netdev-afxdp.c:583
> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>, batch=batch@entry=0x7f399f7fc0d0, qfill=<optimized out>) at lib/netdev.c:710
> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port (pmd=pmd@entry=0x2bf80c0, rxq=0x2bd63e0, port_no=2) at lib/dpif-netdev.c:4257
> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:5449
> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:352
> #5  0x00007f39d86de2de in start_thread () from /lib64/libpthread.so.0
> #6  0x00007f39d7e83a63 in clone () from /lib64/libc.so.6
> (gdb) p xsk->xsk
> Cannot access memory at address 0x316f6ebd
> (gdb) p xsk
> $1 = (struct xsk_socket_info *) 0x316f6e65
>
> I will work on next version
> Thank you
> William
>
> <SNIP>
Eelco Chaudron June 17, 2019, 10:12 a.m. UTC | #15
Hi William,

See below parts of an offline email discussion I had with Magnus earlier,
and some research I did afterwards, which explains that by design you
might not get all the descriptors ready.
Hope this helps you change your design…

In addition, the point-to-point test is working with your change;
however, the PVP test is still failing due to buffer starvation (see my
comments on patch v8 for a possible cause).

Also, on OVS restart the system crashes in the following part:

#0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0, 
qfill=0x0) at lib/netdev-afxdp.c:583
#1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>, 
batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at 
lib/netdev.c:710
#2  0x00000000008dd1c3 in dp_netdev_process_rxq_port 
(pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at 
lib/dpif-netdev.c:4257
#3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at 
lib/dpif-netdev.c:5449
#4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at 
lib/ovs-thread.c:352
#5  0x00007fe1633872de in start_thread () from /lib64/libpthread.so.0
#6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6


Cheers,

Eelco


>> On 2019-05-22 15:20, Eelco Chaudron wrote:
>> Hi Magnus, at all,
>>
>> I was working on the AF_XDP tutorial example and even with a single 
>> RX_BATCH_SIZE I was not getting woken up every packet, so I decided 
>> to get to the bottom of this :)
>>
>> Looking at the kernel RX size, this is what happens:
>>
>>   xsk_generic_rcv();
>>      xskq_peek_addr(xs->umem->fq, &addr);
>>      ...
>>      xskq_discard_addr(xs->umem->fq);
>>
>>
>> Look at:
>>
>> 148  static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 
>> *addr)
>> 149  {
>> 150      if (q->cons_tail == q->cons_head) {
>> 151          smp_mb(); /* D, matches A */
>> 152          WRITE_ONCE(q->ring->consumer, q->cons_tail);
>> 153          q->cons_head = q->cons_tail + xskq_nb_avail(q, 
>> RX_BATCH_SIZE);
>> 154
>> 155          /* Order consumer and data */
>> 156          smp_rmb();
>> 157      }
>> 158
>> 159      return xskq_validate_addr(q, addr);
>> 160  }
>>
>> So ring->consumer gets updated here if we need some additional 
>> buffers (we take 16).
>>
>> Once we are done with processing a single packet we do not update the 
>> ring->consumer, but only the tail:
>>
>> 162  static inline void xskq_discard_addr(struct xsk_queue *q)
>> 163  {
>> 164      q->cons_tail++;
>> 165  }
>>
>> Which means we free the consumers slots every 16th packet…
>>
>> Now looking at the xdpsock_user.c code:
>>
>> 503  static void rx_drop(struct xsk_socket_info *xsk)
>> 504  {
>> 505      unsigned int rcvd, i;
>> 506      u32 idx_rx = 0, idx_fq = 0;
>> 507      int ret;
>> 508
>> 509      rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
>> 510      if (!rcvd)
>> 511          return;
>> 512
>> 513      ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
>> 514      while (ret != rcvd) {
>> 515          if (ret < 0)
>> 516              exit_with_error(-ret);
>> 517          ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, 
>> &idx_fq);
>>
>> If we “sent a single ping only”, only one packet gets received, 
>> but because the consumer is not updated by the kernel every received 
>> packet, we end up waiting for the above line 152 in xskq_peek_addr().
>>
>> 518      }
>> 519
>> 520      for (i = 0; i < rcvd; i++) {
>> 521          u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, 
>> idx_rx)->addr;
>> 522          u32 len = xsk_ring_cons__rx_desc(&xsk->rx, 
>> idx_rx++)->len;
>> 523          char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>> 524
>> 525          hex_dump(pkt, len, addr);
>> 526          *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = 
>> addr;
>> 527      }
>> 528
>> 529      xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>> 530      xsk_ring_cons__release(&xsk->rx, rcvd);
>> 531      xsk->rx_npkts += rcvd;
>> 532  }
>>
>> Assuming updating the consumer only every RX_BATCH_SIZE was a design 
>> decision, this means the example app needs some fixing.
>>
>>
>> What do you guys think?
>>
>>
> Ok, picking up this thread!
>>
> Yes, the kernel only updates the *consumer* pointer of the fill ring
> every RX_BATCH_SIZE, but the Rx ring *producer* pointer is bumped
> every time.
>
> This means that if you rely on (the naive :-)) code in the sample
> application, you can end up in a situation where you can receive from
> the Rx ring, but not post to the fill ring.
>
> So, the reason for the 16-packet hiccup is as follows:
>
> 1. Userland: The fill ring is completely filled.
> 2. Kernel: One packet is received, one entry is picked from the fill
>    ring, but the consumer pointer is not bumped, and the packet is
>    placed on the Rx ring.
> 3. Userland: One packet is picked from the Rx ring.
> 4. Userland: Tries to put an entry on the fill ring. The fill ring is
>    full, so userland spins.
> 5. Kernel: When 16 packets have been picked from the fill ring, the
>    consumer pointer is released.
> 6. Userland: Exits the while loop.
>
> We could make the sample more robust, but then again, it's a sample! :-)
>>
>> Just wanted to get to the bottom of this, as Magnus was not able to 
>> reproduce it in his setup.
>> Guess his network is noisier and get more than a single packet in…
>>
>> I’m working on an AF_XDP example for the xdp-tutorial project and 
>> will fix it there. It’s just that people seem to blindly copy 
>> examples…
>>
>> Also, it might be useful to get an API to know how many slots are
>> available, like xsk_ring_prod__free(struct xsk_ring_prod *prod);
>> this way we could see how many descriptors we can add to top off the
>> queue. There is xsk_prod_nb_free() in the file, but it does not seem
>> like an official API (or are we ok to use it)?
>>
>> Let me know what you think, and I can send a patch for 
>> xsk_ring_prod__free().
>
>
> I'd say use it; It's part of the xsk.h file, but inlined, so it's not
> versioned...
>
> Cheers,
> Björn
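
The batching behaviour described in the discussion above can be sketched with a toy model in C. This is not kernel or libbpf code: toy_fill_ring and its helpers are hypothetical, RING_SIZE is an arbitrary illustrative value, and the model only assumes what the quoted text states, namely that the kernel publishes the fill ring consumer pointer once per batch of RX_BATCH_SIZE (16) entries. It shows why user space sees no free slot until a new batch starts, and how posting only as many buffers as are currently free avoids spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE     64u   /* Illustrative ring size. */
#define RX_BATCH_SIZE 16u   /* Batch size discussed in the thread. */

/* Toy fill ring: user space produces buffers, the kernel consumes them. */
struct toy_fill_ring {
    uint32_t producer;      /* Written by user space. */
    uint32_t consumer;      /* Published by the kernel once per batch. */
    uint32_t kernel_tail;   /* Kernel-private: entries actually used. */
    uint32_t kernel_head;   /* Kernel-private: end of the current batch. */
};

/* Free slots as user space sees them, based on the published consumer. */
static uint32_t toy_free_slots(const struct toy_fill_ring *r)
{
    return RING_SIZE - (r->producer - r->consumer);
}

/* User space: post up to 'n' buffers without spinning.  Returns how many
 * were posted, which may be fewer than requested. */
static uint32_t toy_post(struct toy_fill_ring *r, uint32_t n)
{
    uint32_t free_slots = toy_free_slots(r);
    uint32_t posted = n < free_slots ? n : free_slots;

    r->producer += posted;
    return posted;
}

/* Kernel: take one buffer for a received packet.  The consumer pointer is
 * published only when the previous batch is exhausted. */
static bool toy_kernel_take(struct toy_fill_ring *r)
{
    if (r->kernel_tail == r->kernel_head) {
        uint32_t avail;

        r->consumer = r->kernel_tail;
        avail = r->producer - r->kernel_tail;
        if (avail > RX_BATCH_SIZE) {
            avail = RX_BATCH_SIZE;
        }
        r->kernel_head = r->kernel_tail + avail;
        if (avail == 0) {
            return false;
        }
    }
    r->kernel_tail++;
    return true;
}
```

In this model, after the ring is filled and 16 packets are received, toy_free_slots() still reports 0; only when the kernel starts its next batch do 16 slots become visible at once. A receive loop that calls toy_post() with whatever it has, and keeps the remainder for later, never has to wait on the consumer pointer.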



On 15 Jun 2019, at 15:33, William Tu wrote:

> Hi Eelco,
>
> I think it's either a bug in kernel or my misunderstanding about how 
> to
> process the xsk cq ring. I posted the issue here
> https://marc.info/?l=xdp-newbies&m=156055471727857&w=2
>
> And apply this to v11 fix my crash.
> Do you mind testing it again on your system?
> Thanks for your time for trial and error
>
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index a6543e8f5126..800c047b71f9 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -705,6 +705,7 @@ afxdp_complete_tx(struct xsk_socket_info *xsk)
>      struct umem_elem *elems_push[BATCH_SIZE];
>      uint32_t idx_cq = 0;
>      int tx_done, j, ret;
> +    int tx = 0;
>
>      if (!xsk->outstanding_tx) {
>          return;
> @@ -717,23 +718,29 @@ afxdp_complete_tx(struct xsk_socket_info *xsk)
>      }
>
>      tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, 
> &idx_cq);
> -    if (tx_done > 0) {
> -        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> -        xsk->outstanding_tx -= tx_done;
> -    }
>
>      /* Recycle back to umem pool */
>      for (j = 0; j < tx_done; j++) {
>          struct umem_elem *elem;
> -        uint64_t addr;
> +        uint64_t *addr;
>
> -        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +        addr = xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> +        if (*addr == 0) {
> +            continue;
> +        }
>          elem = ALIGNED_CAST(struct umem_elem *,
> -                            (char *)xsk->umem->buffer + addr);
> +                            (char *)xsk->umem->buffer + *addr);
>          elems_push[j] = elem;
> +        *addr = 0;
> +        tx++;
>      }
>
> -    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void 
> **)elems_push);
> +    umem_elem_push_n(&xsk->umem->mpool, tx, (void **)elems_push);
> +
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> +        xsk->outstanding_tx -= tx_done;
> +    }
>  }
>
>  int
>
> On Thu, Jun 13, 2019 at 12:17 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>> On 13 Jun 2019, at 2:37, William Tu wrote:
>>
>> Hi Eelco,
>>
>> #0 0x00007fbc6a78193f in raise () from /lib64/libc.so.6
>> #1 0x00007fbc6a76bc95 in abort () from /lib64/libc.so.6
>> #2 0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7fbc40f2ec50,
>> n=32, umemp=0x24cc790) at lib/xdpsock.c:32
>> #3 umem_elem_push_n (umemp=0x24cc790, n=32,
>>
>> I've found that it's due to free the afxdp twice.
>> The free_afxdp_buf() should be called once per dp_packet, somehow
>> it gets called twice.
>> Applying this on v11 fixes the issue
>> --- a/lib/dp-packet.c
>> +++ b/lib/dp-packet.c
>> @@ -145,7 +145,7 @@ dp_packet_uninit(struct dp_packet *b)
>> free_dpdk_buf((struct dp_packet*) b);
>> #endif
>> } else if (b->source == DPBUF_AFXDP) {
>> - free_afxdp_buf(b);
>> + ;
>> }
>> }
>> }
>>
>> It’s still crashing for me, even with the change. I’m using a 
>> single nic, same port in as out:
>>
>> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>>               * created as a dp_packet */
>>              free_dpdk_buf((struct dp_packet*) b);
>>  #endif
>> +        } else if (b->source == DPBUF_AFXDP) {
>> +//            free_afxdp_buf(b);
>>          }
>>      }
>>  }
>>
>> Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock 
>> -vconsole:emer -vsyslog:err -vfi'.
>> Program terminated with signal SIGABRT, Aborted.
>> #0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
>> [Current thread is 1 (Thread 0x7f62bdf33700 (LWP 24737))]
>> Missing separate debuginfos, use: dnf debuginfo-install 
>> elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
>> libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
>> numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
>> zlib-1.2.11-10.el8.x86_64
>> (gdb) bt
>> #0  0x00007f62ef71193f in raise () from /lib64/libc.so.6
>> #1  0x00007f62ef6fbc95 in abort () from /lib64/libc.so.6
>> #2  0x00000000004ed1a1 in __umem_elem_push_n (addrs=0x7f62bdf30c50, 
>> n=32, umemp=0x2378660) at lib/xdpsock.c:32
>> #3  umem_elem_push_n (umemp=0x2378660, n=32, 
>> addrs=addrs@entry=0x7f62bdf30ea0) at lib/xdpsock.c:43
>> #4  0x00000000009b4f41 in afxdp_complete_tx (xsk=0x2378250) at 
>> lib/netdev-afxdp.c:736
>> #5  netdev_afxdp_batch_send (netdev=<optimized out>, qid=0, 
>> batch=0x7f62a4004e80, concurrent_txq=<optimized out>) at 
>> lib/netdev-afxdp.c:763
>> #6  0x0000000000908031 in netdev_send (netdev=<optimized out>, 
>> qid=qid@entry=0, batch=batch@entry=0x7f62a4004e80, 
>> concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:800
>> #7  0x00000000008d4c24 in dp_netdev_pmd_flush_output_on_port 
>> (pmd=pmd@entry=0x7f62bdf34010, p=p@entry=0x7f62a4004e50) at 
>> lib/dpif-netdev.c:4187
>> #8  0x00000000008d4f3f in dp_netdev_pmd_flush_output_packets 
>> (pmd=pmd@entry=0x7f62bdf34010, force=force@entry=false) at 
>> lib/dpif-netdev.c:4227
>> #9  0x00000000008dd2d7 in dp_netdev_pmd_flush_output_packets 
>> (force=false, pmd=0x7f62bdf34010) at lib/dpif-netdev.c:4282
>> #10 dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f62bdf34010, 
>> rxq=0x237e650, port_no=1) at lib/dpif-netdev.c:4282
>> #11 0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at 
>> lib/dpif-netdev.c:5449
>> #12 0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at 
>> lib/ovs-thread.c:352
>> #13 0x00007f62f00312de in start_thread () from /lib64/libpthread.so.0
>> #14 0x00007f62ef7d6a63 in clone () from /lib64/libc.so.6
>>
>> Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock 
>> -vconsole:emer -vsyslog:err -vfi'.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, 
>> qfill=0x0) at lib/netdev-afxdp.c:583
>> 583        rx->fd = xsk_socket__fd(xsk->xsk);
>> [Current thread is 1 (Thread 0x7f399f7fe700 (LWP 24597))]
>> Missing separate debuginfos, use: dnf debuginfo-install 
>> elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
>> libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
>> numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
>> zlib-1.2.11-10.el8.x86_64
>> (gdb) bt
>> #0  netdev_afxdp_rxq_recv (rxq_=0x2bd75d0, batch=0x7f399f7fc0d0, 
>> qfill=0x0) at lib/netdev-afxdp.c:583
>> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>, 
>> batch=batch@entry=0x7f399f7fc0d0, qfill=<optimized out>) at 
>> lib/netdev.c:710
>> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port 
>> (pmd=pmd@entry=0x2bf80c0, rxq=0x2bd63e0, port_no=2) at 
>> lib/dpif-netdev.c:4257
>> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at 
>> lib/dpif-netdev.c:5449
>> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at 
>> lib/ovs-thread.c:352
>> #5  0x00007f39d86de2de in start_thread () from /lib64/libpthread.so.0
>> #6  0x00007f39d7e83a63 in clone () from /lib64/libc.so.6
>> (gdb) p xsk->xsk
>> Cannot access memory at address 0x316f6ebd
>> (gdb) p xsk
>> $1 = (struct xsk_socket_info *) 0x316f6e65
>>
>> I will work on next version
>> Thank you
>> William
>>
>> <SNIP>
William Tu June 17, 2019, 6:23 p.m. UTC | #16
Hi Eelco,

On Mon, Jun 17, 2019 at 3:12 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> Hi William,
>
> See below parts of an offline email discussion I had with Magnus before,
> and some research I did in the end, which explains that by design you
> might not get all the descriptors ready.

I think these are different issues. The behavior you described is a hiccup
while waiting for 16 rx packets to be queued. Here, in afxdp_complete_tx,
xsk_ring_cons__peek returns descs that have already been released, causing
OVS to push extra elems and thus crash.

> Hope this helps change your design…
>
> In addition, the Point to Point test is working with you change,
> however, the PVP test is still failing due to buffer starvation (see my
> comments in Patchv8 for a possible cause).
>
Thanks, I'm looking back at v8:
https://patchwork.ozlabs.org/patch/1097740/
Hopefully the next version will fix this issue.

> Also on OVS restart system crashes in the following part:
>
> #0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0,
> qfill=0x0) at lib/netdev-afxdp.c:583
> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>,
> batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at
> lib/netdev.c:710
> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port
> (pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at
> lib/dpif-netdev.c:4257
> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5449
> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:352
> #5  0x00007fe1633872de in start_thread () from /lib64/libpthread.so.0
> #6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6
>
How do you restart the system? I have two afxdp ports:
        Port "eth3"
            Interface "eth3"
                type: afxdp
                options: {n_rxq="1", xdpmode=drv}
        Port "eth5"
            Interface "eth5"
                type: afxdp
                options: {n_rxq="1", xdpmode=drv}

I tested using
# ovs-vsctl del-port eth3
# ovs-vsctl del-port eth5
# ovs-vsctl del-br br0
# ovs-appctl -t ovs-vswitchd exit
Looks ok.

<snip>

> > This means, that if you rely on (the naive :-)) code in the sample
> > application, you can endup in a situation where you can receive from
> > the
> > Rx ring, but not post to the fill ring.
> >
> > So, the reason for the 16 packet hickup is as following:
> >
> > 1. Userland: The fill ring is completely filled.
> > 2. Kernel: One packet is received, one entry picked from the fill
> > ring,
> >    but the consumer pointer is not bumped, and packet is placed on the
> >    Rx ring.
> > 3. Userland: One packet is picked from the Rx ring.
> > 4. Userland: Tries to put an entry on fill ring. The fill ring is
> > full,
> >    so userland spins.
> > 5. Kernel: When 16 packets has been picked from the fill ring the
> >    consumer ptr is released.
> > 6. Userland: Exists the while loop.

Based on the above, there is no starvation problem here if there are more
than 16 packets, correct? And at step 4, we can skip spinning and try to
process more of the rx ring.

For the next version, I will first check the fill ring using
xsk_prod_nb_free(), to avoid step 4.
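
A minimal sketch of that idea, using a self-contained model of a producer
ring rather than libbpf itself. The model_ring type and the
model_kernel_consume(), model_prod_nb_free() and refill_nonblocking() names
are hypothetical; the real code would ask xsk_prod_nb_free() about the umem
fill ring. The model assumes the kernel only publishes its consumer index
every 16 entries, as in the steps quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u       /* Hypothetical fill ring size (power of two). */
#define KERNEL_BATCH 16u    /* Kernel publishes its consumer index lazily. */

/* Simplified model of a producer ring; this is not the libbpf structure. */
struct model_ring {
    uint32_t prod;          /* Entries produced by userland so far. */
    uint32_t cons_private;  /* Entries really consumed by the kernel. */
    uint32_t cons_visible;  /* Consumer index that userland can see. */
};

/* Kernel side: takes one fill entry, but only publishes the consumer
 * index once KERNEL_BATCH entries have been taken. */
static void
model_kernel_consume(struct model_ring *r)
{
    assert(r->cons_private != r->prod);   /* Ring must not be empty. */
    r->cons_private++;
    if (r->cons_private - r->cons_visible >= KERNEL_BATCH) {
        r->cons_visible = r->cons_private;
    }
}

/* Userland side: free slots as seen through the visible consumer index. */
static uint32_t
model_prod_nb_free(const struct model_ring *r)
{
    return RING_SIZE - (r->prod - r->cons_visible);
}

/* Puts up to 'n' buffers on the ring without spinning.  Returns how many
 * were queued; the caller keeps the rest and retries on a later pass. */
static uint32_t
refill_nonblocking(struct model_ring *r, uint32_t n)
{
    uint32_t n_free = model_prod_nb_free(r);
    uint32_t n_put = n < n_free ? n : n_free;

    r->prod += n_put;
    return n_put;
}
```

With a full ring and a single received packet, refill_nonblocking() returns
0 instead of spinning, so the rx path can go back to polling the rx ring.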

Thanks
William
William Tu June 17, 2019, 8:32 p.m. UTC | #17
On Mon, Jun 17, 2019 at 11:23 AM William Tu <u9012063@gmail.com> wrote:
>
> Hi Eelco,
>
> On Mon, Jun 17, 2019 at 3:12 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >
> > Hi William,
> >
> > See below parts of an offline email discussion I had with Magnus before,
> > and some research I did in the end, which explains that by design you
> > might not get all the descriptors ready.
>
> I think it's different issues. The behavior you described is a hickup waiting
> for queuing 16 rx packets. Here, at the afxdp_complete_tx, the
> xsk_ring_cons__peek
> returns descs that already been released, causing ovs push more elems and thus
> crash.
>
> > Hope this helps change your design…
> >
> > In addition, the Point to Point test is working with you change,
> > however, the PVP test is still failing due to buffer starvation (see my
> > comments in Patchv8 for a possible cause).
> >
> Thanks, looking back v8
> https://patchwork.ozlabs.org/patch/1097740/
> Hopefully next version will fix this issue.
>
> > Also on OVS restart system crashes in the following part:
> >
> > #0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0,
> > qfill=0x0) at lib/netdev-afxdp.c:583
> > #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>,
> > batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at
> > lib/netdev.c:710
> > #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port
> > (pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at
> > lib/dpif-netdev.c:4257
> > #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at
> > lib/dpif-netdev.c:5449
> > #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at
> > lib/ovs-thread.c:352
> > #5  0x00007fe1633872de in start_thread () from /lib64/libpthread.so.0
> > #6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6
> >
> How do you restart the system? So I have two afxdp port
>         Port "eth3"
>             Interface "eth3"
>                 type: afxdp
>                 options: {n_rxq="1", xdpmode=drv}
>         Port "eth5"
>             Interface "eth5"
>                 type: afxdp
>                 options: {n_rxq="1", xdpmode=drv}
>
> I tested using
> # ovs-vsctl del-port eth3
> # ovs-vsctl del-port eth5
> # ovs-vsctl del-br br0
> # ovs-appctl -t ovs-vswitchd exit
> Looks ok.
>
> <snip>
>
> > > This means, that if you rely on (the naive :-)) code in the sample
> > > application, you can endup in a situation where you can receive from
> > > the
> > > Rx ring, but not post to the fill ring.
> > >
> > > So, the reason for the 16 packet hickup is as following:
> > >
> > > 1. Userland: The fill ring is completely filled.
> > > 2. Kernel: One packet is received, one entry picked from the fill
> > > ring,
> > >    but the consumer pointer is not bumped, and packet is placed on the
> > >    Rx ring.
> > > 3. Userland: One packet is picked from the Rx ring.
> > > 4. Userland: Tries to put an entry on fill ring. The fill ring is
> > > full,
> > >    so userland spins.
> > > 5. Kernel: When 16 packets has been picked from the fill ring the
> > >    consumer ptr is released.
> > > 6. Userland: Exists the while loop.
>
> Based on the above, there is no starvation problem here if there are more
> than 16 packets, correct? And at step 4, we can skip spinning and try to
> process more rx ring.
>
> For next version, I will first check the fill ring by using xsk_prod_nb_free(),
> to avoid the step 4.
>
> Thanks
> William

Hi Eelco,

I have some fixes with commit "prepare for v12" at
https://github.com/williamtu/ovs-ebpf/commits/afxdp-v11

I tested PVP and it works ok (using tap and also veth namespaces).
Can you give it a try?

Thanks a lot
William
Eelco Chaudron June 18, 2019, 9:45 a.m. UTC | #18
On 17 Jun 2019, at 20:23, William Tu wrote:

> Hi Eelco,
>
> On Mon, Jun 17, 2019 at 3:12 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>> Hi William,
>>
>> See below parts of an offline email discussion I had with Magnus 
>> before,
>> and some research I did in the end, which explains that by design you
>> might not get all the descriptors ready.
>
> I think it's different issues. The behavior you described is a hickup 
> waiting
> for queuing 16 rx packets. Here, at the afxdp_complete_tx, the
> xsk_ring_cons__peek
> returns descs that already been released, causing ovs push more elems 
> and thus
> crash.

You are right, I did not read it thoroughly… Looks like a bug to me; 
after __release() I would assume __peek() will not return the same 
elements.
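
To spell out the contract I would expect, here is a small self-contained
model of a consumer ring. It is only a sketch: the model_cons_ring type and
the model_complete(), model_peek() and model_release() names are
hypothetical and are not the libbpf implementation.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_SIZE 8u   /* Hypothetical completion ring size (power of two). */

/* Simplified model of a consumer ring; this is not the libbpf structure. */
struct model_cons_ring {
    uint64_t addrs[CQ_SIZE];
    uint32_t prod;   /* Advanced by the kernel model. */
    uint32_t cons;   /* Advanced by userland on release. */
};

/* Kernel side: completes one tx buffer.  Returns 0 if the ring is full. */
static int
model_complete(struct model_cons_ring *r, uint64_t addr)
{
    if (r->prod - r->cons == CQ_SIZE) {
        return 0;
    }
    r->addrs[r->prod++ & (CQ_SIZE - 1)] = addr;
    return 1;
}

/* Returns how many entries (at most 'max') are readable and stores the
 * index of the first one in '*idx'.  Nothing is consumed here. */
static uint32_t
model_peek(const struct model_cons_ring *r, uint32_t max, uint32_t *idx)
{
    uint32_t avail = r->prod - r->cons;

    *idx = r->cons;
    return avail < max ? avail : max;
}

/* Hands 'n' peeked entries back; later peeks must not see them again. */
static void
model_release(struct model_cons_ring *r, uint32_t n)
{
    assert(n <= r->prod - r->cons);   /* Never release more than peeked. */
    r->cons += n;
}
```

In this model, a peek that follows a release of everything available
returns 0 until the kernel side completes another buffer.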

>
>> Hope this helps change your design…
>>
>> In addition, the Point to Point test is working with you change,
>> however, the PVP test is still failing due to buffer starvation (see 
>> my
>> comments in Patchv8 for a possible cause).
>>
> Thanks, looking back v8
> https://patchwork.ozlabs.org/patch/1097740/
> Hopefully next version will fix this issue.
>
>> Also on OVS restart system crashes in the following part:
>>
>> #0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0,
>> qfill=0x0) at lib/netdev-afxdp.c:583
>> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>,
>> batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at
>> lib/netdev.c:710
>> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port
>> (pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at
>> lib/dpif-netdev.c:4257
>> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at
>> lib/dpif-netdev.c:5449
>> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at
>> lib/ovs-thread.c:352
>> #5  0x00007fe1633872de in start_thread () from /lib64/libpthread.so.0
>> #6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6
>>
> How do you restart the system? So I have two afxdp port
>         Port "eth3"
>             Interface "eth3"
>                 type: afxdp
>                 options: {n_rxq="1", xdpmode=drv}
>         Port "eth5"
>             Interface "eth5"
>                 type: afxdp
>                 options: {n_rxq="1", xdpmode=drv}
>
> I tested using
> # ovs-vsctl del-port eth3
> # ovs-vsctl del-port eth5
> # ovs-vsctl del-br br0
> # ovs-appctl -t ovs-vswitchd exit
> Looks ok.

I’m using a RHEL7 instance and use systemd to restart openvswitch 
with “systemctl restart openvswitch”.
It uses ovs-ctl to start/stop; see here for some details:

https://github.com/openvswitch/ovs/blob/master/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in


> <snip>
>
>>> This means, that if you rely on (the naive :-)) code in the sample
>>> application, you can endup in a situation where you can receive from
>>> the
>>> Rx ring, but not post to the fill ring.
>>>
>>> So, the reason for the 16 packet hickup is as following:
>>>
>>> 1. Userland: The fill ring is completely filled.
>>> 2. Kernel: One packet is received, one entry picked from the fill
>>> ring,
>>>    but the consumer pointer is not bumped, and packet is placed on 
>>> the
>>>    Rx ring.
>>> 3. Userland: One packet is picked from the Rx ring.
>>> 4. Userland: Tries to put an entry on fill ring. The fill ring is
>>> full,
>>>    so userland spins.
>>> 5. Kernel: When 16 packets has been picked from the fill ring the
>>>    consumer ptr is released.
>>> 6. Userland: Exists the while loop.
>
> Based on the above, there is no starvation problem here if there are 
> more
> than 16 packets, correct? And at step 4, we can skip spinning and try 
> to
> process more rx ring.
>
> For next version, I will first check the fill ring by using 
> xsk_prod_nb_free(),
> to avoid the step 4.

Yes, a __free() check here will avoid this problem. I was running a 
single-ping-only test and it would spin forever…

> Thanks
> William
Eelco Chaudron June 18, 2019, 9:45 a.m. UTC | #19
On 17 Jun 2019, at 22:32, William Tu wrote:

> On Mon, Jun 17, 2019 at 11:23 AM William Tu <u9012063@gmail.com> 
> wrote:
>>
>> Hi Eelco,
>>
>> On Mon, Jun 17, 2019 at 3:12 AM Eelco Chaudron <echaudro@redhat.com> 
>> wrote:
>>>
>>> Hi William,
>>>
>>> See below parts of an offline email discussion I had with Magnus 
>>> before,
>>> and some research I did in the end, which explains that by design 
>>> you
>>> might not get all the descriptors ready.
>>
>> I think it's different issues. The behavior you described is a hickup 
>> waiting
>> for queuing 16 rx packets. Here, at the afxdp_complete_tx, the
>> xsk_ring_cons__peek
>> returns descs that already been released, causing ovs push more elems 
>> and thus
>> crash.
>>
>>> Hope this helps change your design…
>>>
>>> In addition, the Point to Point test is working with you change,
>>> however, the PVP test is still failing due to buffer starvation (see 
>>> my
>>> comments in Patchv8 for a possible cause).
>>>
>> Thanks, looking back v8
>> https://patchwork.ozlabs.org/patch/1097740/
>> Hopefully next version will fix this issue.
>>
>>> Also on OVS restart system crashes in the following part:
>>>
>>> #0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0,
>>> qfill=0x0) at lib/netdev-afxdp.c:583
>>> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>,
>>> batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at
>>> lib/netdev.c:710
>>> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port
>>> (pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at
>>> lib/dpif-netdev.c:4257
>>> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at
>>> lib/dpif-netdev.c:5449
>>> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) 
>>> at
>>> lib/ovs-thread.c:352
>>> #5  0x00007fe1633872de in start_thread () from 
>>> /lib64/libpthread.so.0
>>> #6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6
>>>
>> How do you restart the system? So I have two afxdp port
>>         Port "eth3"
>>             Interface "eth3"
>>                 type: afxdp
>>                 options: {n_rxq="1", xdpmode=drv}
>>         Port "eth5"
>>             Interface "eth5"
>>                 type: afxdp
>>                 options: {n_rxq="1", xdpmode=drv}
>>
>> I tested using
>> # ovs-vsctl del-port eth3
>> # ovs-vsctl del-port eth5
>> # ovs-vsctl del-br br0
>> # ovs-appctl -t ovs-vswitchd exit
>> Looks ok.
>>
>> <snip>
>>
>>>> This means, that if you rely on (the naive :-)) code in the sample
>>>> application, you can endup in a situation where you can receive 
>>>> from
>>>> the
>>>> Rx ring, but not post to the fill ring.
>>>>
>>>> So, the reason for the 16 packet hickup is as following:
>>>>
>>>> 1. Userland: The fill ring is completely filled.
>>>> 2. Kernel: One packet is received, one entry picked from the fill
>>>> ring,
>>>>    but the consumer pointer is not bumped, and packet is placed on 
>>>> the
>>>>    Rx ring.
>>>> 3. Userland: One packet is picked from the Rx ring.
>>>> 4. Userland: Tries to put an entry on fill ring. The fill ring is
>>>> full,
>>>>    so userland spins.
>>>> 5. Kernel: When 16 packets has been picked from the fill ring the
>>>>    consumer ptr is released.
>>>> 6. Userland: Exists the while loop.
>>
>> Based on the above, there is no starvation problem here if there are 
>> more
>> than 16 packets, correct? And at step 4, we can skip spinning and try 
>> to
>> process more rx ring.
>>
>> For next version, I will first check the fill ring by using 
>> xsk_prod_nb_free(),
>> to avoid the step 4.
>>
>> Thanks
>> William
>
> Hi Eelco,
>
> I have some fixes with commit "prepare for v12" at
> https://github.com/williamtu/ovs-ebpf/commits/afxdp-v11
>
> I tested PVP and it works ok (using tap and also veth namespaces)
> Can you give it a try?

The PVP test seems to work fine; however, after a while it stops 
forwarding:

$ ovs-ofctl dump-flows ovs_pvp_br0
  cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020, 
in_port=eno1 actions=output:tapVM
  cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252, 
in_port=tapVM actions=output:eno1

Results:

"Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
"Physical to Virtual to Physical test, L3 flows[port redirect]"
,Packet size
Number of flows,64,256,1024
10,13448,131687,0
100,596,0,0
1000,596,0,0

This is rather low compared to the kernel (results below); note that the 
above is using a single queue:

"Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
"Physical to Virtual to Physical test, L3 flows[port redirect]"
,Packet size
Number of flows,64,256,1024
10,502411,451579,421558
100,525439,440637,422051
1000,463875,419996,402010

However, I cannot restart OVS (see my other email for how I restart). 
Even if I clear the XDP programs before a restart, it fails and dumps core.
The only way to recover is to reboot the box and start from scratch:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
462		bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id, &qid, 0);
[Current thread is 1 (Thread 0x7f4559f1c000 (LWP 4898))]
Missing separate debuginfos, use: dnf debuginfo-install 
elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
#1  0x00007f455919b278 in xsk_socket__delete (xsk=0x21) at xsk.c:711
#2  0x00000000009b3af1 in xsk_destroy (xsk_info=<optimized out>) at 
lib/netdev-afxdp.c:313
#3  xsk_destroy_all (netdev=0x1df49a0) at lib/netdev-afxdp.c:313
#4  0x00000000009b4fe9 in netdev_afxdp_destruct (netdev_=0x1df49a0) at 
lib/netdev-afxdp.c:845
#5  0x0000000000906e53 in netdev_unref (dev=0x1df49a0) at 
lib/netdev.c:573
#6  0x00000000008739b1 in iface_do_create (errp=0x7ffe4fc5b588, 
netdevp=0x7ffe4fc5b580, ofp_portp=0x7ffe4fc5b578, iface_cfg=0x1cde5d0, 
br=0x1ce1690) at vswitchd/bridge.c:1825
#7  iface_create (port_cfg=0x1cb3690, iface_cfg=0x1cde5d0, br=0x1ce1690) 
at vswitchd/bridge.c:1848
#8  bridge_add_ports__ (br=br@entry=0x1ce1690, 
wanted_ports=wanted_ports@entry=0x1ce1770, 
with_requested_port=with_requested_port@entry=false) at 
vswitchd/bridge.c:936
#9  0x0000000000875ef7 in bridge_add_ports (wanted_ports=0x1ce1770, 
br=0x1ce1690) at vswitchd/bridge.c:952
#10 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x1cb4b90) at 
vswitchd/bridge.c:666
#11 0x0000000000879521 in bridge_run () at vswitchd/bridge.c:3043
#12 0x00000000004ef545 in main (argc=<optimized out>, argv=<optimized 
out>) at vswitchd/ovs-vswitchd.c:127

Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00051|netdev_afxdp|ERR|xsk_socket__create failed 
(Device or resource busy) mode: SKB qid: 0
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00052|netdev_afxdp|ERR|failed to create AF_XDP 
socket on queue 0
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00055|netdev_afxdp|ERR|AF_XDP device tapVM 
reconfig fails
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00056|dpif_netdev|ERR|Failed to set interface 
tapVM new configuration
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00062|netdev_afxdp|ERR|xsk_socket__create failed 
(Device or resource busy) mode: DRV qid: 0
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00063|netdev_afxdp|ERR|failed to create AF_XDP 
socket on queue 0
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00066|netdev_afxdp|ERR|AF_XDP device eno1 
reconfig fails
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
ovs-vswitchd[5861]: ovs|00067|dpif_netdev|ERR|Failed to set interface 
eno1 new configuration
Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com kernel: 
ovs-vswitchd[5861]: segfault at 123 ip 00000000009b3afd sp 
00007ffff954a770 error 4 in ovs-vswitchd[400000+899000]

> Thanks a lot
> William
Ilya Maximets June 18, 2019, 10:17 a.m. UTC | #20
On 18.06.2019 12:45, Eelco Chaudron wrote:
> 
> 
> On 17 Jun 2019, at 22:32, William Tu wrote:
> 
>> On Mon, Jun 17, 2019 at 11:23 AM William Tu <u9012063@gmail.com> wrote:
>>>
>>> Hi Eelco,
>>>
>>> On Mon, Jun 17, 2019 at 3:12 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>>>
>>>> Hi William,
>>>>
>>>> See below parts of an offline email discussion I had with Magnus before,
>>>> and some research I did in the end, which explains that by design you
>>>> might not get all the descriptors ready.
>>>
>>> I think it's different issues. The behavior you described is a hickup waiting
>>> for queuing 16 rx packets. Here, at the afxdp_complete_tx, the
>>> xsk_ring_cons__peek
>>> returns descs that already been released, causing ovs push more elems and thus
>>> crash.
>>>
>>>> Hope this helps change your design…
>>>>
>>>> In addition, the Point to Point test is working with you change,
>>>> however, the PVP test is still failing due to buffer starvation (see my
>>>> comments in Patchv8 for a possible cause).
>>>>
>>> Thanks, looking back v8
>>> https://patchwork.ozlabs.org/patch/1097740/
>>> Hopefully next version will fix this issue.
>>>
>>>> Also on OVS restart system crashes in the following part:
>>>>
>>>> #0  netdev_afxdp_rxq_recv (rxq_=0x173c080, batch=0x7fe1397f80d0,
>>>> qfill=0x0) at lib/netdev-afxdp.c:583
>>>> #1  0x0000000000907f21 in netdev_rxq_recv (rx=<optimized out>,
>>>> batch=batch@entry=0x7fe1397f80d0, qfill=<optimized out>) at
>>>> lib/netdev.c:710
>>>> #2  0x00000000008dd1c3 in dp_netdev_process_rxq_port
>>>> (pmd=pmd@entry=0x175d990, rxq=0x175a460, port_no=2) at
>>>> lib/dpif-netdev.c:4257
>>>> #3  0x00000000008dd63d in pmd_thread_main (f_=<optimized out>) at
>>>> lib/dpif-netdev.c:5449
>>>> #4  0x000000000095e94d in ovsthread_wrapper (aux_=<optimized out>) at
>>>> lib/ovs-thread.c:352
>>>> #5  0x00007fe1633872de in start_thread () from /lib64/libpthread.so.0
>>>> #6  0x00007fe162b2ca63 in clone () from /lib64/libc.so.6
>>>>
>>> How do you restart the system? So I have two afxdp port
>>>         Port "eth3"
>>>             Interface "eth3"
>>>                 type: afxdp
>>>                 options: {n_rxq="1", xdpmode=drv}
>>>         Port "eth5"
>>>             Interface "eth5"
>>>                 type: afxdp
>>>                 options: {n_rxq="1", xdpmode=drv}
>>>
>>> I tested using
>>> # ovs-vsctl del-port eth3
>>> # ovs-vsctl del-port eth5
>>> # ovs-vsctl del-br br0
>>> # ovs-appctl -t ovs-vswitchd exit
>>> Looks ok.
>>>
>>> <snip>
>>>
>>>>> This means, that if you rely on (the naive :-)) code in the sample
>>>>> application, you can endup in a situation where you can receive from
>>>>> the
>>>>> Rx ring, but not post to the fill ring.
>>>>>
>>>>> So, the reason for the 16 packet hickup is as following:
>>>>>
>>>>> 1. Userland: The fill ring is completely filled.
>>>>> 2. Kernel: One packet is received, one entry picked from the fill
>>>>> ring,
>>>>>    but the consumer pointer is not bumped, and packet is placed on the
>>>>>    Rx ring.
>>>>> 3. Userland: One packet is picked from the Rx ring.
>>>>> 4. Userland: Tries to put an entry on fill ring. The fill ring is
>>>>> full,
>>>>>    so userland spins.
>>>>> 5. Kernel: When 16 packets has been picked from the fill ring the
>>>>>    consumer ptr is released.
>>>>> 6. Userland: Exists the while loop.
>>>
>>> Based on the above, there is no starvation problem here if there are more
>>> than 16 packets, correct? And at step 4, we can skip spinning and try to
>>> process more rx ring.
>>>
>>> For next version, I will first check the fill ring by using xsk_prod_nb_free(),
>>> to avoid the step 4.
>>>
>>> Thanks
>>> William
>>
>> Hi Eelco,
>>
>> I have some fixes with commit "prepare for v12" at
>> https://github.com/williamtu/ovs-ebpf/commits/afxdp-v11
>>
>> I tested PVP and it works ok (using tap and also veth namespaces)
>> Can you give it a try?
> 
> The PVP test seems to work fine however after a while it stops forwarding:
> 
> $ ovs-ofctl dump-flows ovs_pvp_br0
>  cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020, in_port=eno1 actions=output:tapVM
>  cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252, in_port=tapVM actions=output:eno1
> 
> Results:
> 
> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
> "Physical to Virtual to Physical test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,256,1024
> 10,13448,131687,0
> 100,596,0,0
> 1000,596,0,0
> 
> Rather low compared to the kernel, note the above is using a single queue:
> 
> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
> "Physical to Virtual to Physical test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,256,1024
> 10,502411,451579,421558
> 100,525439,440637,422051
> 1000,463875,419996,402010
> 
> However I can not restart OVS (see other email on how I restart), even if I clear the XDP programs before a restart it fails, and cores.
> The only way to recover is to reboot the box and start from scratch:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
> 462        bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id, &qid, 0);
> [Current thread is 1 (Thread 0x7f4559f1c000 (LWP 4898))]
> Missing separate debuginfos, use: dnf debuginfo-install elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 zlib-1.2.11-10.el8.x86_64
> (gdb) bt
> #0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
> #1  0x00007f455919b278 in xsk_socket__delete (xsk=0x21) at xsk.c:711
> #2  0x00000000009b3af1 in xsk_destroy (xsk_info=<optimized out>) at lib/netdev-afxdp.c:313
> #3  xsk_destroy_all (netdev=0x1df49a0) at lib/netdev-afxdp.c:313
> #4  0x00000000009b4fe9 in netdev_afxdp_destruct (netdev_=0x1df49a0) at lib/netdev-afxdp.c:845
> #5  0x0000000000906e53 in netdev_unref (dev=0x1df49a0) at lib/netdev.c:573
> #6  0x00000000008739b1 in iface_do_create (errp=0x7ffe4fc5b588, netdevp=0x7ffe4fc5b580, ofp_portp=0x7ffe4fc5b578, iface_cfg=0x1cde5d0, br=0x1ce1690) at vswitchd/bridge.c:1825
> #7  iface_create (port_cfg=0x1cb3690, iface_cfg=0x1cde5d0, br=0x1ce1690) at vswitchd/bridge.c:1848
> #8  bridge_add_ports__ (br=br@entry=0x1ce1690, wanted_ports=wanted_ports@entry=0x1ce1770, with_requested_port=with_requested_port@entry=false) at vswitchd/bridge.c:936
> #9  0x0000000000875ef7 in bridge_add_ports (wanted_ports=0x1ce1770, br=0x1ce1690) at vswitchd/bridge.c:952
> #10 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x1cb4b90) at vswitchd/bridge.c:666
> #11 0x0000000000879521 in bridge_run () at vswitchd/bridge.c:3043
> #12 0x00000000004ef545 in main (argc=<optimized out>, argv=<optimized out>) at vswitchd/ovs-vswitchd.c:127
> 
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00051|netdev_afxdp|ERR|xsk_socket__create failed (Device or resource busy) mode: SKB qid: 0
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00052|netdev_afxdp|ERR|failed to create AF_XDP socket on queue 0
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00055|netdev_afxdp|ERR|AF_XDP device tapVM reconfig fails
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00056|dpif_netdev|ERR|Failed to set interface tapVM new configuration
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00062|netdev_afxdp|ERR|xsk_socket__create failed (Device or resource busy) mode: DRV qid: 0
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00063|netdev_afxdp|ERR|failed to create AF_XDP socket on queue 0
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00066|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com ovs-vswitchd[5861]: ovs|00067|dpif_netdev|ERR|Failed to set interface eno1 new configuration
> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com kernel: ovs-vswitchd[5861]: segfault at 123 ip 00000000009b3afd sp 00007ffff954a770 error 4 in ovs-vswitchd[400000+899000]
> 

I guess this crash is caused by trying to destroy an unallocated queue.

The following change could help:
---
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index a6543e8f5..6e1431dce 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -249,7 +249,7 @@ xsk_configure_all(struct netdev *netdev)
     ifindex = linux_get_ifindex(netdev_get_name(netdev));
 
     n_rxq = netdev_n_rxq(netdev);
-    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
+    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
 
     /* configure each queue */
     for (i = 0; i < n_rxq; i++) {
---

This should prevent OVS from crashing; however, I don't know why socket
creation fails in your case.
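
To illustrate why zeroing the array matters, here is a minimal standalone
sketch. It assumes the teardown path skips NULL entries; struct sock_info,
configure_all() and destroy_all() are hypothetical stand-ins and not the
functions in lib/netdev-afxdp.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xsk_socket_info. */
struct sock_info {
    int qid;
};

static int n_destroyed;   /* Counts sockets that were really torn down. */

/* Creates per-queue sockets for queues 0..n-1, stopping at 'fail_at' to
 * mimic a socket creation failure (pass fail_at >= n for no failure).
 * calloc() zeroes the array, so slots that were never reached are NULL
 * rather than leftover heap garbage. */
static struct sock_info **
configure_all(size_t n, size_t fail_at)
{
    struct sock_info **socks;

    assert(n > 0);
    socks = calloc(n, sizeof *socks);
    if (!socks) {
        return NULL;
    }
    for (size_t i = 0; i < n && i < fail_at; i++) {
        socks[i] = malloc(sizeof *socks[i]);
        if (!socks[i]) {
            break;
        }
        socks[i]->qid = (int) i;
    }
    return socks;
}

/* Tears down only the slots that were actually set up. */
static void
destroy_all(struct sock_info **socks, size_t n)
{
    if (!socks) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        if (socks[i]) {
            free(socks[i]);
            socks[i] = NULL;
            n_destroyed++;
        }
    }
    free(socks);
}
```

With an uninitialized array, the NULL check in destroy_all() could pass for
a garbage pointer such as the xsk=0x21 seen in the backtrace.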

Best regards, Ilya Maximets.
Eelco Chaudron June 18, 2019, 11:37 a.m. UTC | #21
On 18 Jun 2019, at 12:17, Ilya Maximets wrote:

> On 18.06.2019 12:45, Eelco Chaudron wrote:
>>
>>
>> On 17 Jun 2019, at 22:32, William Tu wrote:
>>

<SNIP>

>> 1000,463875,419996,402010
>>
>> However I can not restart OVS (see other email on how I restart), 
>> even if I clear the XDP programs before a restart it fails, and 
>> cores.
>> The only way to recover is to reboot the box and start from scratch:
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
>> 462        bpf_map_update_elem(xsk->qidconf_map_fd, 
>> &xsk->queue_id, &qid, 0);
>> [Current thread is 1 (Thread 0x7f4559f1c000 (LWP 4898))]
>> Missing separate debuginfos, use: dnf debuginfo-install 
>> elfutils-libelf-0.174-6.el8.x86_64 glibc-2.28-42.el8_0.1.x86_64 
>> libatomic-8.2.1-3.5.el8.x86_64 libcap-ng-0.7.9-4.el8.x86_64 
>> numactl-libs-2.0.12-2.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 
>> zlib-1.2.11-10.el8.x86_64
>> (gdb) bt
>> #0  0x00007f455919a9b5 in xsk_clear_bpf_maps (xsk=0x21) at xsk.c:462
>> #1  0x00007f455919b278 in xsk_socket__delete (xsk=0x21) at xsk.c:711
>> #2  0x00000000009b3af1 in xsk_destroy (xsk_info=<optimized out>) at 
>> lib/netdev-afxdp.c:313
>> #3  xsk_destroy_all (netdev=0x1df49a0) at lib/netdev-afxdp.c:313
>> #4  0x00000000009b4fe9 in netdev_afxdp_destruct (netdev_=0x1df49a0) 
>> at lib/netdev-afxdp.c:845
>> #5  0x0000000000906e53 in netdev_unref (dev=0x1df49a0) at 
>> lib/netdev.c:573
>> #6  0x00000000008739b1 in iface_do_create (errp=0x7ffe4fc5b588, 
>> netdevp=0x7ffe4fc5b580, ofp_portp=0x7ffe4fc5b578, 
>> iface_cfg=0x1cde5d0, br=0x1ce1690) at vswitchd/bridge.c:1825
>> #7  iface_create (port_cfg=0x1cb3690, iface_cfg=0x1cde5d0, 
>> br=0x1ce1690) at vswitchd/bridge.c:1848
>> #8  bridge_add_ports__ (br=br@entry=0x1ce1690, 
>> wanted_ports=wanted_ports@entry=0x1ce1770, 
>> with_requested_port=with_requested_port@entry=false) at 
>> vswitchd/bridge.c:936
>> #9  0x0000000000875ef7 in bridge_add_ports (wanted_ports=0x1ce1770, 
>> br=0x1ce1690) at vswitchd/bridge.c:952
>> #10 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x1cb4b90) at 
>> vswitchd/bridge.c:666
>> #11 0x0000000000879521 in bridge_run () at vswitchd/bridge.c:3043
>> #12 0x00000000004ef545 in main (argc=<optimized out>, argv=<optimized 
>> out>) at vswitchd/ovs-vswitchd.c:127
>>
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00051|netdev_afxdp|ERR|xsk_socket__create 
>> failed (Device or resource busy) mode: SKB qid: 0
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00052|netdev_afxdp|ERR|failed to create 
>> AF_XDP socket on queue 0
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00055|netdev_afxdp|ERR|AF_XDP device tapVM 
>> reconfig fails
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00056|dpif_netdev|ERR|Failed to set interface 
>> tapVM new configuration
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00062|netdev_afxdp|ERR|xsk_socket__create 
>> failed (Device or resource busy) mode: DRV qid: 0
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00063|netdev_afxdp|ERR|failed to create 
>> AF_XDP socket on queue 0
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00066|netdev_afxdp|ERR|AF_XDP device eno1 
>> reconfig fails
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com 
>> ovs-vswitchd[5861]: ovs|00067|dpif_netdev|ERR|Failed to set interface 
>> eno1 new configuration
>> Jun 18 03:52:06 wsfd-netdev76.ntdv.lab.eng.bos.redhat.com kernel: 
>> ovs-vswitchd[5861]: segfault at 123 ip 00000000009b3afd sp 
>> 00007ffff954a770 error 4 in ovs-vswitchd[400000+899000]
>>
>
> I guess, this crash caused by trying to destroy unallocated queue.
>
> Following change could help:
> ---
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index a6543e8f5..6e1431dce 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -249,7 +249,7 @@ xsk_configure_all(struct netdev *netdev)
>      ifindex = linux_get_ifindex(netdev_get_name(netdev));
>
>      n_rxq = netdev_n_rxq(netdev);
> -    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
>
>      /* configure each queue */
>      for (i = 0; i < n_rxq; i++) {
> ---
>
> This should prevent OVS from crash, however, I don't know why socket
> creation fails in your case.

I know William is preparing a v12, so will do some more testing once 
it’s out, and update my kernel to the latest bpf-next just to make 
sure I’m not running into a known/fixed issue.

//Eelco
William Tu June 18, 2019, 7:32 p.m. UTC | #22
>
> I’m using an RHEL7 instance and use systemd to restart openvswitch
> with “systemctl restart openvswitch”.
> It uses ovs-ctl to start/stop, see here for some details:
>
> https://github.com/openvswitch/ovs/blob/master/rhel/usr_lib_systemd_system_ovs-vswitchd.service.in
>
>
Thanks Eelco,

I can reproduce this issue now. If I start OVS with an afxdp port and then run
# kill `pidof ovs-vswitchd`
this triggers the call to signal_remove_xdp(). Checking afterwards with
# ip -s link show <afxdp port>
shows that the XDP program is still attached, due to a bug
in xsk_remove_xdp_program (where I save prog_id as static).
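
A minimal sketch of how a single static id can go wrong with more than one
port. This is only an assumption about the failure mode, since the actual
code is not shown in this thread; saved_prog_id, struct port and the helper
names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one file-scope id shared by every port. */
static unsigned int saved_prog_id;

static void
remember_shared(unsigned int id)
{
    saved_prog_id = id;         /* The last port configured wins. */
}

/* Removal proceeds only if the attached id matches the saved one. */
static bool
can_remove_shared(unsigned int attached_id)
{
    return attached_id == saved_prog_id;
}

/* Alternative: keep the id in per-port state. */
struct port {
    unsigned int prog_id;
};

static void
remember_per_port(struct port *p, unsigned int id)
{
    p->prog_id = id;
}

static bool
can_remove_per_port(const struct port *p, unsigned int attached_id)
{
    return attached_id == p->prog_id;
}
```

With two ports whose programs have ids 11 and 22, the shared variable only
remembers 22, so the check for the first port fails and its program stays
attached. The per-port variant matches both.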

I will fix it in next version!
Thanks
William
William Tu June 18, 2019, 10:28 p.m. UTC | #23
>
> I guess, this crash caused by trying to destroy unallocated queue.
>
> Following change could help:
> ---
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index a6543e8f5..6e1431dce 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -249,7 +249,7 @@ xsk_configure_all(struct netdev *netdev)
>      ifindex = linux_get_ifindex(netdev_get_name(netdev));
>
>      n_rxq = netdev_n_rxq(netdev);
> -    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
>
>      /* configure each queue */
>      for (i = 0; i < n_rxq; i++) {
> ---
>
> This should prevent OVS from crash, however, I don't know why socket
> creation fails in your case.
>
Hi Ilya,

Thanks, I will add this into my next version.

@Eelco
When using ovs-ctl restart, it still fails, but this is because the userspace
datapath needs an extra "--cleanup" argument on exit, something like
# ovs-appctl -t ovs-vswitchd exit --cleanup

We should add it to stop_daemon() in utilities/ovs-lib.in

Regards,
William
William Tu June 18, 2019, 10:31 p.m. UTC | #24
> The PVP test seems to work fine however after a while it stops
> forwarding:
>
> $ ovs-ofctl dump-flows ovs_pvp_br0
>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
> in_port=eno1 actions=output:tapVM
>   cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252,
> in_port=tapVM actions=output:eno1
>
> Results:
>
> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
> "Physical to Virtual to Physical test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,256,1024
> 10,13448,131687,0
> 100,596,0,0
> 1000,596,0,0
>
I'm not able to reproduce this issue...
Hopefully with the next version we can find out more about the root cause.
Eelco Chaudron June 19, 2019, 6:35 a.m. UTC | #25
On 19 Jun 2019, at 0:28, William Tu wrote:

>>
>> I guess, this crash caused by trying to destroy unallocated queue.
>>
>> Following change could help:
>> ---
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index a6543e8f5..6e1431dce 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -249,7 +249,7 @@ xsk_configure_all(struct netdev *netdev)
>>      ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>
>>      n_rxq = netdev_n_rxq(netdev);
>> -    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
>> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
>>
>>      /* configure each queue */
>>      for (i = 0; i < n_rxq; i++) {
>> ---
>>
>> This should prevent OVS from crash, however, I don't know why socket
>> creation fails in your case.
>>
> Hi Ilya,
>
> Thanks, I will add this into my next version.
>
> @Eelco
> When using ovs-ctl restart, it still fails, but this is because the
> userspace datapath needs an extra "--cleanup" argument on exit,
> something like
> # ovs-appctl -t ovs-vswitchd exit --cleanup
>
> We should add it to stop_daemon() in utilities/ovs-lib.in

It’s not there for a reason, i.e. they do not want interfaces to be 
removed on OVS restart.

The only thing OVS should do is not to crash ;) Or maybe try to remove 
the program when it fails to add it, or an option to force program load 
as in the XDP examples.
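
For reference, a small sketch of what "force" means in terms of attach flags.
The XDP_FLAGS_* values mirror linux/if_link.h and are defined locally so the
sketch builds without kernel headers; xdp_attach_flags() is a hypothetical
helper, not something in the patch. With XDP_FLAGS_UPDATE_IF_NOEXIST set, the
kernel refuses to replace a program that is already attached, which would be
consistent with the "Device or resource busy" errors above, although that
cause is not confirmed here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in linux/if_link.h, repeated here so that the sketch
 * compiles on its own. */
#ifndef XDP_FLAGS_UPDATE_IF_NOEXIST
#define XDP_FLAGS_UPDATE_IF_NOEXIST (1U << 0)
#endif
#ifndef XDP_FLAGS_SKB_MODE
#define XDP_FLAGS_SKB_MODE (1U << 1)
#endif
#ifndef XDP_FLAGS_DRV_MODE
#define XDP_FLAGS_DRV_MODE (1U << 2)
#endif

/* Hypothetical helper: builds the flags for attaching an XDP program.
 * Without 'force', attaching fails if a program is already present;
 * with 'force', an existing program is replaced. */
static uint32_t
xdp_attach_flags(bool drv_mode, bool force)
{
    uint32_t flags = drv_mode ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;

    if (!force) {
        flags |= XDP_FLAGS_UPDATE_IF_NOEXIST;
    }
    return flags;
}
```

The resulting value is what would be passed as the flags when loading the
program; whether forcing is the right default for OVS is a separate question.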
Eelco Chaudron June 19, 2019, 6:36 a.m. UTC | #26
On 19 Jun 2019, at 0:31, William Tu wrote:

>> The PVP test seems to work fine however after a while it stops
>> forwarding:
>>
>> $ ovs-ofctl dump-flows ovs_pvp_br0
>>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
>> in_port=eno1 actions=output:tapVM
>>   cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252,
>> in_port=tapVM actions=output:eno1
>>
>> Results:
>>
>> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
>> "Physical to Virtual to Physical test, L3 flows[port redirect]"
>> ,Packet size
>> Number of flows,64,256,1024
>> 10,13448,131687,0
>> 100,596,0,0
>> 1000,596,0,0
>>
> I'm not able to reproduce this issue...
> Hopefully next version we can find out more about the root cause.

It has been there since day one… I’ll try to schedule some time next 
week to test v12 of your patch.
Ilya Maximets June 19, 2019, 6:50 a.m. UTC | #27
On 19.06.2019 9:35, Eelco Chaudron wrote:
> 
> 
> On 19 Jun 2019, at 0:28, William Tu wrote:
> 
>>>
>>> I guess, this crash caused by trying to destroy unallocated queue.
>>>
>>> Following change could help:
>>> ---
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> index a6543e8f5..6e1431dce 100644
>>> --- a/lib/netdev-afxdp.c
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -249,7 +249,7 @@ xsk_configure_all(struct netdev *netdev)
>>>      ifindex = linux_get_ifindex(netdev_get_name(netdev));
>>>
>>>      n_rxq = netdev_n_rxq(netdev);
>>> -    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
>>> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
>>>
>>>      /* configure each queue */
>>>      for (i = 0; i < n_rxq; i++) {
>>> ---
>>>
>>> This should prevent OVS from crash, however, I don't know why socket
>>> creation fails in your case.
>>>
>> Hi Ilya,
>>
>> Thanks, I will add this into my next version.
>>
>> @Eelco
>> When using ovs-ctl restart, it still fails, but this is because the userspace
>> datapath needs an extra "--cleanup" argument on exit, something like
>> # ovs-appctl -t ovs-vswitchd exit --cleanup
>>
>> We should add it to stop_daemon() in utilities/ovs-lib.in
> 
> It’s not there for a reason, i.e. they do not want interfaces to be removed on OVS restart.

ovs-ctl is optimized for the kernel datapath, and yes, it shouldn't clean up
on restart if the kernel datapath is there. But I don't see any reason to stop
the daemon without cleaning up the userspace datapath. I'm working on a patch
to clean up dpif-netdev regardless of the passed 'cleanup' option.

> 
> The only thing OVS should do is not to crash ;) Or maybe try to remove the program when it fails to add it, or an option to force program load as in the XDP examples.
William Tu June 20, 2019, 2:15 a.m. UTC | #28
> The PVP test seems to work fine however after a while it stops
> forwarding:
>
> $ ovs-ofctl dump-flows ovs_pvp_br0
>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
> in_port=eno1 actions=output:tapVM
>   cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252,
> in_port=tapVM actions=output:eno1
>
> Results:
>
> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
> "Physical to Virtual to Physical test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,256,1024
> 10,13448,131687,0
> 100,596,0,0
> 1000,596,0,0
>

Hi Eelco,

What traffic generator are you using?
Maybe I can try to reproduce it on my testbed.

Thanks
William
Eelco Chaudron June 20, 2019, 8:26 a.m. UTC | #29
On 20 Jun 2019, at 4:15, William Tu wrote:

>> The PVP test seems to work fine however after a while it stops
>> forwarding:
>>
>> $ ovs-ofctl dump-flows ovs_pvp_br0
>>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
>> in_port=eno1 actions=output:tapVM
>>   cookie=0x0, duration=8.504s, table=0, n_packets=1, n_bytes=252,
>> in_port=tapVM actions=output:eno1
>>
>> Results:
>>
>> "Physical port, ""eno1"", speed 10 Gbit/s, traffic rate 100%"
>> "Physical to Virtual to Physical test, L3 flows[port redirect]"
>> ,Packet size
>> Number of flows,64,256,1024
>> 10,13448,131687,0
>> 100,596,0,0
>> 1000,596,0,0
>>
>
> Hi Eelco,
>
> What traffic generator are you using?
> Maybe I can try to reproduce it on my testbed.

I’m using a Xena tester, with the following PVP script:

https://github.com/chaudron/ovs_perf

The config I use is:

~/ovs_perf/ovs_performance.py -d -l testrun_log.txt \
    --tester-type=xena --tester-address 10.19.188.64 --tester-interface 6,1 \
    --ovs-address $DUT \
    --ovs-user root --ovs-password $DUT_PW \
    --dut-vm-address 192.168.122.164 \
    --dut-vm-user root --dut-vm-password root \
    --virtual-interface tapVM \
    --dut-vm-nic-pci=0000:00:02.0 \
    --dut-vm-nic-queues=1 \
    --physical-interface eno1 \
    --physical-speed=10 \
    --stream-list=10,100,1000 \
    --packet-list=64,256,1024 \
    --warm-up --warm-up-no-fail --warm-up-time=5 \
    --skip-pv-test \
    --flow-rule-type=port \
    --testpmd-startup-delay=8 \
    --no-bridge-config
William Tu June 20, 2019, 9:58 p.m. UTC | #30
On Thu, Jun 20, 2019 at 1:26 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 20 Jun 2019, at 4:15, William Tu wrote:
>
> >> The PVP test seems to work fine however after a while it stops
> >> forwarding:
> >>
> >> $ ovs-ofctl dump-flows ovs_pvp_br0
> >>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
> >> in_port=eno1 actions=output:tapVM

So this means that no packets are received on the physical NIC.
I can reproduce a similar issue using trex imix traffic.

I think it's an issue in the ixgbe driver. If you keep sending
traffic, then it's ok.
But if you run af_xdp, stop, and run it again, the ixgbe driver reports:

# dmesg
[272206.755287] ixgbe 0000:0d:00.0 eth3: initiating reset to clear Tx
work after link loss
[272206.820872] ixgbe 0000:0d:00.0 eth3: Reset adapter
[272207.484969] ixgbe 0000:0d:00.1 eth5: initiating reset to clear Tx
work after link loss
[272207.588880] ixgbe 0000:0d:00.1 eth5: Reset adapter

# ip link show // the device is DOWN
94: eth3: <NO-CARRIER,BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 xdp
qdisc mq state DOWN mode DEFAULT group default qlen 1000

So at the traffic generator side, trex detects link down and stops
sending traffic.
When I restart OVS, which clears the AF_XDP/XDP program, the link comes up
again and trex resumes sending traffic.

Regards,
William
Eelco Chaudron June 21, 2019, 8:38 a.m. UTC | #31
On 20 Jun 2019, at 23:58, William Tu wrote:

> On Thu, Jun 20, 2019 at 1:26 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>>
>>
>> On 20 Jun 2019, at 4:15, William Tu wrote:
>>
>>>> The PVP test seems to work fine however after a while it stops
>>>> forwarding:
>>>>
>>>> $ ovs-ofctl dump-flows ovs_pvp_br0
>>>>   cookie=0x0, duration=8.510s, table=0, n_packets=1, n_bytes=1020,
>>>> in_port=eno1 actions=output:tapVM
>
> So this means for the physical nic, there is no packet received.
> I can reproduce similar issue using trex imix traffic.
>
> I think it's an issue in the ixgbe driver. If you keep sending
> traffic, then it's ok.
> But if you run af_xdp, stop, and run it again. The ixgbe driver report
>
> # dmesg
> [272206.755287] ixgbe 0000:0d:00.0 eth3: initiating reset to clear Tx
> work after link loss
> [272206.820872] ixgbe 0000:0d:00.0 eth3: Reset adapter
> [272207.484969] ixgbe 0000:0d:00.1 eth5: initiating reset to clear Tx
> work after link loss
> [272207.588880] ixgbe 0000:0d:00.1 eth5: Reset adapter
>
> # ip link show // the device is DOWN
> 94: eth3: <NO-CARRIER,BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 xdp
> qdisc mq state DOWN mode DEFAULT group default qlen 1000
>
> So at the traffic generator side, trex detects link down and stops
> sending traffic.
> When restart OVS which clears the AF_XDP/XDP program, the link is up 
> again,
> and the trex resumes sending traffic.

The PVP script will not restart OVS, or reapply the config, or bring 
down the link between the tests.

Just to be sure, I re-ran it, and checked the logging, but nothing of 
the above.
In addition, the same test without the VM does not show the problem.

I think it’s the packets being forwarded to the VM; as it takes
time before they are released, this is causing some lockup. I’ll try to
figure out more details next week when testing your v13.

Cheers,

Eelco
William Tu June 22, 2019, 5:59 a.m. UTC | #32
> The PVP script will not restart OVS, or reapply the config, or bring
> down the link between the tests.
>
> Just to be sure, I re-ran it, and checked the logging, but nothing of
> the above.

That's too bad, thanks for re-running it.

> In addition, the same test without the VM does not show the problem.
>
> I think it’s the packets being forwarded to the VM; as it takes
> time before they are released, this is causing some lockup. I’ll try to
> figure out more details next week when testing your v13.

Thank you
William
Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..554964396353
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,433 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with the existing kernel networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
+As a result, an AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see the Linux kernel's
+Documentation/networking/af_xdp.rst.
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+dpdk.  The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It verifies that your kernel meets the basic requirements for AF_XDP.
+
+
+Installing
+----------
+For OVS to use the AF_XDP netdev, it has to be configured with libbpf support.
+First, clone a recent version of the Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure you are running a 5.0+ kernel
+  make check-afxdp TESTSUITEFLAGS='1'
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a device with 1 queue (queue 0), configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+.. note::
+   pmd-rxq-affinity is optional. If not specified, the system will auto-assign.
+
+To validate that the bridge has been instantiated successfully, run::
+
+  ovs-vsctl show
+
+The output should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debugging by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
+vSwitch"[1], section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to current and future work on AF_XDP.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing PMD
+to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
+   running cores, device plug-in slot)
+
+#. Isolate your CPUs with the isolcpus kernel parameter in the GRUB config.
+
+#. IRQs should not be assigned to cores running PMD threads.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+
+Debugging performance issues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peers::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's NUMA ID is always 0; need a way to find the NUMA ID from a netdev.
+#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer. A
+   possible work-around is to use the OpenFlow meter action.
+#. An AF_XDP device that is added to a bridge, removed, and added again fails.
+#. Most of the tests were done using a single i40e port. Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test results yet (TODO item).
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC, and a tap device connected to a VM.
+First, start OVS, then add physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Add the tap device to the bridge and create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0 -- set interface tap0 type="afxdp"
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+The measured performance is around 1.6Mpps.
+The rate is limited by the kernel's tap interface, which requires copying
+packets into the kernel from the umem buffer in userspace.
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+Performance: for RX_DROP: 6.6Mpps, TX: 2.3Mpps
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+Performance: for RX_DROP: 800Kpps, TX: 700Kpps
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index cf9cc8b8b0de..721653ab0ec0 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -236,6 +236,41 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index 2dbe9a9178e3..9e23e1c6958c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -99,6 +99,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index cc5dccf39d6b..b31e28f6e1f5 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@  if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -392,6 +396,7 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-tc-offloads.c \
 	lib/netdev-tc-offloads.h \
 	lib/netlink-conntrack.c \
@@ -409,6 +414,15 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h \
+	lib/spinlock.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..e6a7947076b4 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@ 
 #include <string.h>
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,27 @@  dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes
+ * of memory starting at 'base', which points into the AF_XDP umem.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+{
+    dp_packet_set_base(b, base);
+    dp_packet_set_data(b, base);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_set_allocated(b, allocated);
+    b->source = DPBUF_AFXDP;
+    dp_packet_reset_offsets(b);
+    pkt_metadata_init(&b->md, 0);
+    dp_packet_reset_cutlen(b);
+    dp_packet_reset_offload(b);
+    b->packet_type = htonl(PT_ETH);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +144,8 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +272,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +460,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..e3438226e360 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* buffer data is from an AF_XDP umem frame. */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +91,13 @@  struct dp_packet {
     };
 };
 
+#if HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +131,9 @@  static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#if HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +195,11 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
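Note on the wrapper struct above: embedding the dp_packet inside
dp_packet_afxdp lets the owning umem pool be recovered from a bare packet
pointer by subtracting the member offset. A minimal standalone sketch of that
pattern follows. The names struct pool, struct pkt, struct pkt_afxdp and
pkt_cast_afxdp() are hypothetical stand-ins for illustration, not the OVS
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct umem_pool and struct dp_packet. */
struct pool { int id; };
struct pkt { int size; };

/* Same layout idea as struct dp_packet_afxdp: pool pointer first,
 * then the embedded packet. */
struct pkt_afxdp {
    struct pool *mpool;
    struct pkt packet;
};

/* Recovers the enclosing wrapper from a pointer to its 'packet' member.
 * The caller must pass a packet that really is embedded in a wrapper. */
static struct pkt_afxdp *
pkt_cast_afxdp(struct pkt *p)
{
    assert(p != NULL);
    return (struct pkt_afxdp *) ((char *) p
                                 - offsetof(struct pkt_afxdp, packet));
}
```

This only works when the packet is known to live inside a wrapper, which is
presumably why the patch checks the packet's source before casting.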
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..6b6dfda7db1c 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -21,6 +21,7 @@ 
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
+#include <time.h>
 #include <math.h>
 
 #ifdef DPDK_NETDEV
@@ -186,6 +187,24 @@  struct pmd_perf_stats {
     char *log_reason;
 };
 
+#ifdef __linux__
+static inline uint64_t
+rdtsc_syscall(struct pmd_perf_stats *s)
+{
+    struct timespec val;
+    uint64_t v;
+
+    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
+       return s->last_tsc;
+    }
+
+    v  = (uint64_t) val.tv_sec * 1000000000LL;
+    v += (uint64_t) val.tv_nsec;
+
+    return s->last_tsc = v;
+}
+#endif
+
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
  * These functions are intended to be invoked in the context of pmd threads. */
 
@@ -198,6 +217,13 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif !defined(_MSC_VER) && defined(__x86_64__)
+    uint32_t h, l;
+    asm volatile("rdtsc" : "=a" (l), "=d" (h));
+
+    return s->last_tsc = ((uint64_t) h << 32) | l;
+#elif defined(__linux__)
+    return rdtsc_syscall(s);
 #else
     return s->last_tsc = 0;
 #endif
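Note on the Linux fallback above: when no cycle counter is available, the
counter is replaced by a nanosecond timestamp, and the previous value is
reused if the clock read fails. A minimal standalone sketch of that idea
follows. struct ts_state and ts_update() are hypothetical names, and the
sketch uses CLOCK_MONOTONIC instead of the patch's CLOCK_MONOTONIC_RAW so
that it builds on any POSIX system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the 'last_tsc' field of struct pmd_perf_stats. */
struct ts_state {
    uint64_t last;
};

/* Returns a monotonic timestamp in nanoseconds and caches it in 's'.
 * If the clock cannot be read, the previously cached value is returned. */
static uint64_t
ts_update(struct ts_state *s)
{
    struct timespec val;

    assert(s != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &val) != 0) {
        return s->last;
    }
    s->last = (uint64_t) val.tv_sec * 1000000000ULL
              + (uint64_t) val.tv_nsec;
    return s->last;
}
```

The value is in nanoseconds rather than CPU cycles, so anything that converts
"cycles" to time needs to account for that on this code path.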
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..a6543e8f5126
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,891 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-afxdp.h"
+
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <stdlib.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "dp-packet.h"
+#include "dpif-netdev.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "socket-util.h"
+#include "spinlock.h"
+#include "util.h"
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
+#define UMEM2XPKT(base, i) \
+                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
+                               i * sizeof(struct dp_packet_afxdp))
+
+static uint32_t prog_id;
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+static int xsk_configure_all(struct netdev *netdev);
+static void xsk_destroy_all(struct netdev *netdev);
+
+static struct xsk_umem_info *
+xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
+{
+    struct xsk_umem_config uconfig OVS_UNUSED;
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xcalloc(1, sizeof *umem);
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           NULL);
+    if (ret) {
+        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV");
+        free(umem);
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* set-up umem pool */
+    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("umem_pool_init failed");
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct umem_elem *elem;
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* set-up metadata */
+    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("xpacket_pool_init failed");
+        umem_pool_cleanup(&umem->mpool);
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = UMEM2XPKT(umem->xpool.array, i);
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0;
+    int ret;
+    int i;
+
+    xsk = xcalloc(1, sizeof(*xsk));
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        free(xsk);
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV",
+                 queue_id);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Populate (PROD_NUM_DESCS - BATCH_SIZE) elems to the FILL queue */
+    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
+                                   PROD_NUM_DESCS - BATCH_SIZE, &idx)) {
+        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
+    }
+
+    for (i = 0;
+         i < (PROD_NUM_DESCS - BATCH_SIZE) * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS - BATCH_SIZE);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+
+    /* umem memory region */
+    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+
+    /* create AF_XDP socket */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        free_pagealign(bufs);
+        return NULL;
+    }
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* clean up umem and xpacket pool */
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free_pagealign(bufs);
+        umem_pool_cleanup(&umem->mpool);
+        xpacket_pool_cleanup(&umem->xpool);
+        free(umem);
+    }
+    return xsk;
+}
+
+static int
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk;
+    int i, ifindex, n_rxq;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    n_rxq = netdev_n_rxq(netdev);
+    dev->xsks = xmalloc(n_rxq * sizeof(struct xsk_socket_info *));
+
+    /* configure each queue */
+    for (i = 0; i < n_rxq; i++) {
+        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
+                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk) {
+            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
+            dev->xsks[i] = NULL;
+            goto err;
+        }
+        dev->xsks[i] = xsk;
+        xsk->rx_dropped = 0;
+        xsk->tx_dropped = 0;
+    }
+
+    return 0;
+
+err:
+    xsk_destroy_all(netdev);
+    return EINVAL;
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk)
+{
+    struct xsk_umem *umem;
+
+    umem = xsk->umem->umem;
+    xsk_socket__delete(xsk->xsk);
+    if (xsk_umem__delete(umem)) {
+        VLOG_ERR("xsk_umem__delete failed");
+    }
+
+    /* free the packet buffer */
+    free_pagealign(xsk->umem->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&xsk->umem->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&xsk->umem->xpool);
+
+    free(xsk->umem);
+    free(xsk);
+}
+
+static void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    for (i = 0; i < netdev_n_rxq(netdev); i++) {
+        if (dev->xsks && dev->xsks[i]) {
+            VLOG_INFO("destroy xsk[%d]", i);
+            xsk_destroy(dev->xsks[i]);
+            dev->xsks[i] = NULL;
+        }
+    }
+
+    VLOG_INFO("remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+
+    free(dev->xsks);
+}
+
+static inline void OVS_UNUSED
+log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+               &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                stat.rx_dropped,
+                stat.rx_invalid_descs,
+                stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    const char *str_xdpmode;
+    int xdpmode, new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
+                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
+        return EINVAL;
+    }
+
+    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
+    if (!strcasecmp(str_xdpmode, "drv")) {
+        xdpmode = XDP_ZEROCOPY;
+    } else if (!strcasecmp(str_xdpmode, "skb")) {
+        xdpmode = XDP_COPY;
+    } else {
+        VLOG_ERR("%s: Incorrect xdpmode (%s).",
+                 netdev_get_name(netdev), str_xdpmode);
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (dev->requested_n_rxq != new_n_rxq
+        || dev->requested_xdpmode != xdpmode) {
+        dev->requested_n_rxq = new_n_rxq;
+        dev->requested_xdpmode = xdpmode;
+        netdev_request_reconfigure(netdev);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+static void
+netdev_afxdp_alloc_txq(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int n_txqs = netdev_n_rxq(netdev);
+    int i;
+
+    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
+
+    for (i = 0; i < n_txqs; i++) {
+        ovs_spinlock_init(&dev->tx_locks[i]);
+    }
+}
+
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+    free(dev->tx_locks);
+
+    netdev->n_rxq = dev->requested_n_rxq;
+    netdev_afxdp_alloc_txq(netdev);
+
+    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
+        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
+        /* From SKB mode to DRV mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+        dev->xdp_bind_flags = XDP_ZEROCOPY;
+        dev->xdpmode = XDP_ZEROCOPY;
+
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
+                      ovs_strerror(errno));
+        }
+    } else {
+        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
+        /* From DRV mode to SKB mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+        dev->xdp_bind_flags = XDP_COPY;
+        dev->xdpmode = XDP_COPY;
+        /* TODO: set rlimit back to previous value
+         * when no device is in DRV mode.
+         */
+    }
+
+    err = xsk_configure_all(netdev);
+    if (err) {
+        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always uses NUMA id 0",
+              netdev_get_name(netdev));
+    return 0;
+}
+
+static void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t curr_prog_id = 0;
+    uint32_t flags;
+
+    /* remove_xdp_program() */
+    if (xdpmode == XDP_COPY) {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    } else {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    }
+
+    if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, flags)) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    }
+    if (prog_id == curr_prog_id) {
+        bpf_set_link_xdp_fd(ifindex, -1, flags);
+    } else if (!curr_prog_id) {
+        VLOG_INFO("couldn't find a prog id on a given interface");
+    } else {
+        VLOG_INFO("program on interface changed, not removing");
+    }
+}
+
+void
+signal_remove_xdp(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    VLOG_WARN("force remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    uintptr_t addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
+static void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+    struct dp_packet_afxdp *xpacket = NULL;
+    struct dp_packet *packet;
+    void *elems[BATCH_SIZE];
+    uintptr_t addr;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (xpacket->mpool) {
+            void *base = dp_packet_base(packet);
+
+            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+            elems[i] = (void *)addr;
+        }
+    }
+    umem_elem_push_n(xpacket->mpool, batch->count, elems);
+    dp_packet_batch_init(batch);
+}
+
+static inline void
+handle_rx_fail(struct xsk_socket_info *xsk, int rcvd, int idx_rx)
+{
+    void *elems[BATCH_SIZE];
+    int i;
+
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx + i)->addr;
+        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+
+        elems[i] = (void *)((uintptr_t)pkt & (~FRAME_SHIFT_MASK));
+    }
+    umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
+
+    xsk_ring_cons__release(&xsk->rx, rcvd);
+    xsk->rx_dropped += rcvd;
+}
+
+int
+netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+                      int *qfill)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev *netdev = rx->up.netdev;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct umem_elem *elems[BATCH_SIZE];
+    uint32_t idx_rx = 0, idx_fq = 0;
+    struct xsk_socket_info *xsk;
+    int qid = rxq_->queue_id;
+    unsigned int rcvd, i;
+    int ret = 0;
+
+    xsk = dev->xsks[qid];
+    if (!xsk) {
+        return 0;
+    }
+
+    rx->fd = xsk_socket__fd(xsk->xsk);
+
+    /* Check whether any packets are waiting on the RX queue.
+     * If so, idx_rx is the index of the first received descriptor.
+     */
+    rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return 0;
+    }
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
+    if (OVS_UNLIKELY(ret)) {
+        handle_rx_fail(xsk, rcvd, idx_rx);
+        return ENOMEM;
+    }
+
+    /* Prepare for the FILL queue */
+    if (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
+        /* The FILL queue is full, don't retry or process rx. Wait for kernel
+         * to move received packets from FILL queue to RX queue.
+         */
+        umem_elem_push_n(&xsk->umem->mpool, rcvd, (void **)elems);
+        handle_rx_fail(xsk, rcvd, idx_rx);
+        return ENOMEM;
+    }
+
+    /* Setup a dp_packet batch from descriptors in RX queue */
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
+        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
+        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+        uint64_t index;
+
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        index = addr >> FRAME_SHIFT;
+        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
+        packet = &xpacket->packet;
+
+        /* Initialize the struct dp_packet */
+        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+    /* Release the RX queue */
+    xsk_ring_cons__release(&xsk->rx, rcvd);
+
+    for (i = 0; i < rcvd; i++) {
+        uint64_t index;
+        struct umem_elem *elem;
+
+        /* Get one free umem, program it into FILL queue */
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+
+    if (qfill) {
+        /* TODO: return the number of remaining packets in the queue. */
+        *qfill = 0;
+    }
+
+#ifdef AFXDP_DEBUG
+    log_xsk_stat(xsk);
+#endif
+    return 0;
+}
+
+static inline int
+kick_tx(struct xsk_socket_info *xsk)
+{
+    int ret;
+
+    /* This causes system call into kernel's xsk_sendmsg, and
+     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
+     */
+    ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
+    if (OVS_UNLIKELY(ret < 0)) {
+        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
+            return errno;
+        }
+    }
+    /* no error, or EBUSY or EAGAIN */
+    return 0;
+}
+
+static inline bool
+check_free_batch(struct dp_packet_batch *batch)
+{
+    struct umem_pool *first_mpool = NULL;
+    struct dp_packet_afxdp *xpacket;
+    struct dp_packet *packet;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (packet->source != DPBUF_AFXDP) {
+            return false;
+        }
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (i == 0) {
+            first_mpool = xpacket->mpool;
+            continue;
+        }
+        if (xpacket->mpool != first_mpool) {
+            return false;
+        }
+    }
+    /* All packets are DPBUF_AFXDP and from the same mpool */
+    return true;
+}
+
+static inline void
+afxdp_complete_tx(struct xsk_socket_info *xsk)
+{
+    struct umem_elem *elems_push[BATCH_SIZE];
+    uint32_t idx_cq = 0;
+    int tx_done, j, ret;
+
+    if (!xsk->outstanding_tx) {
+        return;
+    }
+
+    ret = kick_tx(xsk);
+    if (OVS_UNLIKELY(ret)) {
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+    }
+
+    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
+    if (tx_done > 0) {
+        xsk_ring_cons__release(&xsk->umem->cq, tx_done);
+        xsk->outstanding_tx -= tx_done;
+    }
+
+    /* Recycle back to umem pool */
+    for (j = 0; j < tx_done; j++) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)xsk->umem->buffer + addr);
+        elems_push[j] = elem;
+    }
+
+    umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
+}
+
+int
+netdev_afxdp_batch_send(struct netdev *netdev, int qid,
+                        struct dp_packet_batch *batch,
+                        bool concurrent_txq)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk = dev->xsks[qid];
+    struct umem_elem *elems_pop[BATCH_SIZE];
+    struct dp_packet *packet;
+    bool free_batch = false;
+    uint32_t idx = 0;
+    int error = 0;
+    int ret;
+
+    if (!xsk) {
+        goto out;
+    }
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        qid = qid % dev->up.n_txq;
+        ovs_spin_lock(&dev->tx_locks[qid]);
+    }
+
+    /* Process CQ first. */
+    afxdp_complete_tx(xsk);
+
+    free_batch = check_free_batch(batch);
+
+    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        xsk->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    /* Make sure we have enough TX descs */
+    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
+        xsk->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct umem_elem *elem;
+        uint64_t index;
+
+        elem = elems_pop[i];
+        /* Copy the packet to the umem element just popped from the umem pool.
+         * TODO: avoid this copy if the packet and the popped element
+         * are located in the same umem.
+         */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk->tx, batch->count);
+    xsk->outstanding_tx += batch->count;
+
+    ret = kick_tx(xsk);
+    if (OVS_UNLIKELY(ret)) {
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+    }
+
+out:
+    if (free_batch) {
+        free_afxdp_buf_batch(batch);
+    } else {
+        dp_packet_delete_batch(batch, true);
+    }
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        ovs_spin_unlock(&dev->tx_locks[qid]);
+    }
+    return error;
+}
+
+int
+netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
+{
+   /* Done at reconfigure */
+   return 0;
+}
+
+void
+netdev_afxdp_destruct(struct netdev *netdev_)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+    /* Note: tc is by-passed when using drv-mode, but when using
+     * skb-mode, we might need to clean up tc. */
+
+    xsk_destroy_all(netdev_);
+    ovs_mutex_destroy(&netdev->mutex);
+}
+
+int
+netdev_afxdp_get_stats(const struct netdev *netdev,
+                       struct netdev_stats *stats)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct netdev_stats dev_stats;
+    struct xsk_socket_info *xsk;
+    int error, i;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    error = get_stats_via_netlink(netdev, &dev_stats);
+    if (error) {
+        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
+    } else {
+        /* Use kernel netdev's packet and byte counts */
+        stats->rx_packets = dev_stats.rx_packets;
+        stats->rx_bytes = dev_stats.rx_bytes;
+        stats->tx_packets = dev_stats.tx_packets;
+        stats->tx_bytes = dev_stats.tx_bytes;
+
+        stats->rx_errors           += dev_stats.rx_errors;
+        stats->tx_errors           += dev_stats.tx_errors;
+        stats->rx_dropped          += dev_stats.rx_dropped;
+        stats->tx_dropped          += dev_stats.tx_dropped;
+        stats->multicast           += dev_stats.multicast;
+        stats->collisions          += dev_stats.collisions;
+        stats->rx_length_errors    += dev_stats.rx_length_errors;
+        stats->rx_over_errors      += dev_stats.rx_over_errors;
+        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
+        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
+        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
+        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
+        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
+        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
+        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
+        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
+        stats->tx_window_errors    += dev_stats.tx_window_errors;
+
+        /* Account for the packets dropped by each xsk. */
+        for (i = 0; i < netdev_n_rxq(netdev); i++) {
+            xsk = dev->xsks[i];
+            if (xsk) {
+                stats->rx_dropped += xsk->rx_dropped;
+                stats->tx_dropped += xsk->tx_dropped;
+            }
+        }
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return error;
+}
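For reference, the value stored in each TX descriptor's 'addr' field in the send path above is the byte offset of the umem element from the start of the umem buffer, not a pointer. The following is a minimal standalone sketch of that arithmetic, assuming 4096-byte frames. SKETCH_FRAME_SIZE, struct sketch_tx_desc and sketch_fill_desc() are hypothetical names used only for illustration; they are not libbpf or OVS API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FRAME_SIZE 4096u   /* Assumed umem frame size. */

/* Hypothetical stand-in for a TX ring descriptor. */
struct sketch_tx_desc {
    uint64_t addr;   /* Offset of the frame within the umem buffer. */
    uint32_t len;    /* Packet length in bytes. */
};

/* Copies 'len' bytes of 'pkt' into the frame at 'elem' and fills 'desc'
 * with the frame's offset relative to 'umem_base'. */
static inline void
sketch_fill_desc(struct sketch_tx_desc *desc, void *umem_base, void *elem,
                 const void *pkt, uint32_t len)
{
    assert(len <= SKETCH_FRAME_SIZE);
    memcpy(elem, pkt, len);
    desc->addr = (uint64_t) ((char *) elem - (char *) umem_base);
    desc->len = len;
}
```

Because the descriptor carries an offset, the same value identifies the frame regardless of where the umem region is mapped.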
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..dd2dc1a2064d
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,74 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+
+#define MAX_XSKQ 16
+
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+struct netdev_rxq;
+struct netdev_stats;
+
+int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
+void netdev_afxdp_destruct(struct netdev *netdev_);
+
+int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
+                          struct dp_packet_batch *batch,
+                          int *qfill);
+int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
+                            struct dp_packet_batch *batch,
+                            bool concurrent_txq);
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+int netdev_afxdp_get_stats(const struct netdev *netdev_,
+                           struct netdev_stats *stats);
+
+void free_afxdp_buf(struct dp_packet *p);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+void signal_remove_xdp(struct netdev *netdev);
+
+#else /* !HAVE_AF_XDP */
+
+#include "openvswitch/compiler.h"
+
+struct dp_packet;
+
+static inline void
+free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
+{
+    /* Nothing */
+}
+
+#endif /* HAVE_AF_XDP */
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
new file mode 100644
index 000000000000..6a0388cf9dc3
--- /dev/null
+++ b/lib/netdev-linux-private.h
@@ -0,0 +1,139 @@ 
+/*
+ * Copyright (c) 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_LINUX_PRIVATE_H
+#define NETDEV_LINUX_PRIVATE_H 1
+
+#include <config.h>
+
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "netdev-afxdp.h"
+#include "netdev-provider.h"
+#include "netdev-tc-offloads.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
+#include "xdpsock.h"
+
+/* These functions are Linux specific, so they should be used directly only by
+ * Linux-specific code. */
+
+struct netdev;
+
+struct netdev_rxq_linux {
+    struct netdev_rxq up;
+    bool is_tap;
+    int fd;
+};
+
+void netdev_linux_run(const struct netdev_class *);
+
+int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
+                                  const char *flag_name, bool enable);
+
+int get_stats_via_netlink(const struct netdev *netdev_,
+                          struct netdev_stats *stats);
+
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+    /* AF_XDP information */
+#ifdef HAVE_AF_XDP
+    struct xsk_socket_info **xsks;
+    int requested_n_rxq;
+    int xdpmode, requested_xdpmode; /* To detect mode changes. */
+    int xdp_flags, xdp_bind_flags;
+    struct ovs_spinlock *tx_locks;
+#endif
+};
+
+static bool
+is_netdev_linux_class(const struct netdev_class *netdev_class)
+{
+    return netdev_class->run == netdev_linux_run;
+}
+
+static struct netdev_linux *
+netdev_linux_cast(const struct netdev *netdev)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
+
+    return CONTAINER_OF(netdev, struct netdev_linux, up);
+}
+
+static struct netdev_rxq_linux *
+netdev_rxq_linux_cast(const struct netdev_rxq *rx)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
+
+    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
+}
+
+#endif /* netdev-linux-private.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index f75d73fd39f8..2883cf1f2586 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -17,6 +17,7 @@ 
 #include <config.h>
 
 #include "netdev-linux.h"
+#include "netdev-linux-private.h"
 
 #include <errno.h>
 #include <fcntl.h>
@@ -54,6 +55,7 @@ 
 #include "fatal-signal.h"
 #include "hash.h"
 #include "openvswitch/hmap.h"
+#include "netdev-afxdp.h"
 #include "netdev-provider.h"
 #include "netdev-tc-offloads.h"
 #include "netdev-vport.h"
@@ -487,57 +489,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
-
-struct netdev_rxq_linux {
-    struct netdev_rxq up;
-    bool is_tap;
-    int fd;
-};
 
 /* This is set pretty low because we probably won't learn anything from the
  * additional log messages. */
@@ -551,8 +502,6 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
-static void netdev_linux_run(const struct netdev_class *);
-
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -566,7 +515,6 @@  static int do_set_addr(struct netdev *netdev,
                        struct in_addr addr);
 static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
 static int set_etheraddr(const char *netdev_name, const struct eth_addr);
-static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
 static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
@@ -574,31 +522,10 @@  static void netdev_linux_miimon_wait(void);
 static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
 
 static bool
-is_netdev_linux_class(const struct netdev_class *netdev_class)
-{
-    return netdev_class->run == netdev_linux_run;
-}
-
-static bool
 is_tap_netdev(const struct netdev *netdev)
 {
     return netdev_get_class(netdev) == &netdev_tap_class;
 }
-
-static struct netdev_linux *
-netdev_linux_cast(const struct netdev *netdev)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
-
-    return CONTAINER_OF(netdev, struct netdev_linux, up);
-}
-
-static struct netdev_rxq_linux *
-netdev_rxq_linux_cast(const struct netdev_rxq *rx)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
-    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
-}
 
 static int
 netdev_linux_netnsid_update__(struct netdev_linux *netdev)
@@ -774,7 +701,7 @@  netdev_linux_update_lag(struct rtnetlink_change *change)
     }
 }
 
-static void
+void
 netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
 {
     struct nl_sock *sock;
@@ -3279,9 +3206,7 @@  exit:
     .run = netdev_linux_run,                                    \
     .wait = netdev_linux_wait,                                  \
     .alloc = netdev_linux_alloc,                                \
-    .destruct = netdev_linux_destruct,                          \
     .dealloc = netdev_linux_dealloc,                            \
-    .send = netdev_linux_send,                                  \
     .send_wait = netdev_linux_send_wait,                        \
     .set_etheraddr = netdev_linux_set_etheraddr,                \
     .get_etheraddr = netdev_linux_get_etheraddr,                \
@@ -3312,10 +3237,8 @@  exit:
     .arp_lookup = netdev_linux_arp_lookup,                      \
     .update_flags = netdev_linux_update_flags,                  \
     .rxq_alloc = netdev_linux_rxq_alloc,                        \
-    .rxq_construct = netdev_linux_rxq_construct,                \
     .rxq_destruct = netdev_linux_rxq_destruct,                  \
     .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
-    .rxq_recv = netdev_linux_rxq_recv,                          \
     .rxq_wait = netdev_linux_rxq_wait,                          \
     .rxq_drain = netdev_linux_rxq_drain
 
@@ -3323,30 +3246,64 @@  const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
-    .get_block_id = netdev_linux_get_block_id
+    .get_block_id = netdev_linux_get_block_id,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     LINUX_FLOW_OFFLOAD_API,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .destruct = netdev_afxdp_destruct,
+    .get_stats = netdev_afxdp_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+    .send = netdev_afxdp_batch_send,
+    .rxq_construct = netdev_afxdp_rxq_construct,
+    .rxq_recv = netdev_afxdp_rxq_recv,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
@@ -5918,7 +5875,7 @@  netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
     dst->tx_window_errors = src->tx_window_errors;
 }
 
-static int
+int
 get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
 {
     struct ofpbuf request;
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e6e8e8..91e6a9e2bfc0 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -903,6 +903,9 @@  extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
 
+#ifdef HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 7d7ecf6f0946..0fac117cc602 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -104,6 +104,9 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);
+#ifdef HAVE_AF_XDP
+void signal_remove_xdp(struct netdev *netdev);
+#endif
 
 int
 netdev_n_txq(const struct netdev *netdev)
@@ -146,6 +149,9 @@  netdev_initialize(void)
         netdev_register_provider(&netdev_internal_class);
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
@@ -2007,6 +2013,11 @@  restore_all_flags(void *aux OVS_UNUSED)
                                                saved_flags & ~saved_values,
                                                &old_flags);
         }
+#ifdef HAVE_AF_XDP
+        if (netdev->netdev_class == &netdev_afxdp_class) {
+            signal_remove_xdp(netdev);
+        }
+#endif
     }
 }
 
diff --git a/lib/spinlock.h b/lib/spinlock.h
new file mode 100644
index 000000000000..1ae634f23a6b
--- /dev/null
+++ b/lib/spinlock.h
@@ -0,0 +1,70 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef SPINLOCK_H
+#define SPINLOCK_H 1
+
+#include <config.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "ovs-atomic.h"
+
+struct ovs_spinlock {
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
+};
+
+static inline void
+ovs_spinlock_init(struct ovs_spinlock *sl)
+{
+    atomic_init(&sl->locked, 0);
+}
+
+static inline void
+ovs_spin_lock(struct ovs_spinlock *sl)
+{
+    int exp = 0, locked = 0;
+
+    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed)) {
+        locked = 1;
+        while (locked) {
+            atomic_read_relaxed(&sl->locked, &locked);
+        }
+        exp = 0;
+    }
+}
+
+static inline void
+ovs_spin_unlock(struct ovs_spinlock *sl)
+{
+    atomic_store_explicit(&sl->locked, 0, memory_order_release);
+}
+
+static inline int
+ovs_spin_trylock(struct ovs_spinlock *sl)
+{
+    int exp = 0;
+    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed);
+}
+#endif
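The lock above is a test-and-test-and-set spinlock: a compare-and-swap attempts to take the lock, and on failure the waiter spins on a relaxed load until the lock looks free before retrying. The same idea can be sketched with plain C11 <stdatomic.h> instead of the ovs-atomic wrappers. The sketch_* names are hypothetical, and this is an illustration under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_spinlock {
    atomic_int locked;   /* 0 = free, 1 = held. */
};

static inline void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool
sketch_spin_trylock(struct sketch_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    while (!sketch_spin_trylock(sl)) {
        /* Spin on a plain load until the lock looks free, then retry the
         * compare-and-swap.  This avoids hammering the cache line with
         * failed read-modify-write operations. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    /* Sanity check: only a held lock may be released. */
    assert(atomic_load_explicit(&sl->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

As in the patch, trylock here reports success with a nonzero value, which is the opposite of pthread_spin_trylock(), where 0 means success. Callers that mix the two conventions should be careful.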
diff --git a/lib/util.c b/lib/util.c
index 7b8ab81f6ee1..5eb20995b370 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -214,20 +214,19 @@  x2nrealloc(void *p, size_t *n, size_t s)
     return xrealloc(p, *n * s);
 }
 
-/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
- * dedicated cache lines.  That is, the memory block returned will not share a
- * cache line with other data, avoiding "false sharing".
+/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
+ * 'alignment' must be a power of two and a multiple of sizeof(void *).
  *
- * Use free_cacheline() to free the returned memory block. */
+ * Use free_size_align() to free the returned memory block. */
 void *
-xmalloc_cacheline(size_t size)
+xmalloc_size_align(size_t size, size_t alignment)
 {
 #ifdef HAVE_POSIX_MEMALIGN
     void *p;
     int error;
 
     COVERAGE_INC(util_xalloc);
-    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
+    error = posix_memalign(&p, alignment, size ? size : 1);
     if (error != 0) {
         out_of_memory();
     }
@@ -235,16 +234,16 @@  xmalloc_cacheline(size_t size)
 #else
     /* Allocate room for:
      *
-     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
-     *       pointer to be aligned exactly sizeof(void *) bytes before the
-     *       beginning of a cache line.
+     *     - Header padding: Up to 'alignment' - 1 bytes, to allow the
+     *       pointer 'q' to be placed exactly sizeof(void *) bytes before
+     *       an 'alignment'-byte boundary.
      *
      *     - Pointer: A pointer to the start of the header padding, to allow us
      *       to free() the block later.
      *
      *     - User data: 'size' bytes.
      *
-     *     - Trailer padding: Enough to bring the user data up to a cache line
+     *     - Trailer padding: Enough to bring the user data up to an alignment
      *       multiple.
      *
      * +---------------+---------+------------------------+---------+
@@ -255,18 +254,56 @@  xmalloc_cacheline(size_t size)
      * p               q         r
      *
      */
-    void *p = xmalloc((CACHE_LINE_SIZE - 1)
-                      + sizeof(void *)
-                      + ROUND_UP(size, CACHE_LINE_SIZE));
-    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
-    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
-                                CACHE_LINE_SIZE);
-    void **q = (void **) r - 1;
+    void *p, *r, **q;
+    bool runt;
+
+    COVERAGE_INC(util_xalloc);
+    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
+        ovs_abort(0, "Invalid alignment");
+    }
+
+    p = xmalloc((alignment - 1)
+                + sizeof(void *)
+                + ROUND_UP(size, alignment));
+
+    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
+    /* When the padding size < sizeof(void *), there is not enough room for
+     * pointer 'q'.  As a result, 'r' must move to the next alignment
+     * boundary.  So ROUND_UP when calling xmalloc() above, and ROUND_UP
+     * again when calculating 'r' below.
+     */
+    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
+    q = (void **) r - 1;
     *q = p;
+
     return r;
 #endif
 }
 
+void
+free_size_align(void *p)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    free(p);
+#else
+    if (p) {
+        void **q = (void **) p - 1;
+        free(*q);
+    }
+#endif
+}
+
+/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
+ * dedicated cache lines.  That is, the memory block returned will not share a
+ * cache line with other data, avoiding "false sharing".
+ *
+ * Use free_cacheline() to free the returned memory block. */
+void *
+xmalloc_cacheline(size_t size)
+{
+    return xmalloc_size_align(size, CACHE_LINE_SIZE);
+}
+
 /* Like xmalloc_cacheline() but clears the allocated memory to all zero
  * bytes. */
 void *
@@ -282,14 +319,19 @@  xzalloc_cacheline(size_t size)
 void
 free_cacheline(void *p)
 {
-#ifdef HAVE_POSIX_MEMALIGN
-    free(p);
-#else
-    if (p) {
-        void **q = (void **) p - 1;
-        free(*q);
-    }
-#endif
+    free_size_align(p);
+}
+
+void *
+xmalloc_pagealign(size_t size)
+{
+    return xmalloc_size_align(size, get_page_size());
+}
+
+void
+free_pagealign(void *p)
+{
+    free_size_align(p);
 }
 
 char *
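The fallback path of xmalloc_size_align() (used when posix_memalign() is unavailable) stores the original malloc() pointer in the word just before the aligned block, so that the free routine can recover it. A minimal standalone sketch of that layout follows, using plain malloc() and open-coded rounding instead of the OVS ROUND_UP/PAD_SIZE macros. sketch_malloc_aligned() and sketch_free_aligned() are hypothetical names, and unlike xmalloc() this sketch returns NULL on allocation failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 'size' bytes aligned to 'alignment', which must be a power of two
 * and a multiple of sizeof(void *).  Free with sketch_free_aligned(). */
static void *
sketch_malloc_aligned(size_t size, size_t alignment)
{
    assert(alignment && !(alignment & (alignment - 1)));
    assert(alignment % sizeof(void *) == 0);

    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    void *p = malloc((alignment - 1) + sizeof(void *) + rounded);
    if (!p) {
        return NULL;
    }

    /* 'pad' is the distance from 'p' to the next boundary (0 if aligned). */
    uintptr_t base = (uintptr_t) p;
    size_t pad = (alignment - (base & (alignment - 1))) & (alignment - 1);

    /* If the padding cannot hold the back-pointer, skip one more boundary. */
    bool runt = pad < sizeof(void *);
    uintptr_t r = (base + (runt ? alignment : 0) + alignment - 1)
                  & ~(uintptr_t) (alignment - 1);

    void **q = (void **) r - 1;   /* Slot just before the user data. */
    *q = p;                       /* Remember what to pass to free(). */
    return (void *) r;
}

static void
sketch_free_aligned(void *r)
{
    if (r) {
        void **q = (void **) r - 1;
        free(*q);
    }
}
```

The 'runt' case is why the allocation reserves 'alignment' - 1 bytes plus a pointer: if 'p' is already aligned, or nearly so, the back-pointer would not fit in front of the first boundary, so the user data starts at the following one.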
diff --git a/lib/util.h b/lib/util.h
index c26605abdce3..33665748274c 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -166,6 +166,11 @@  void ovs_strzcpy(char *dst, const char *src, size_t size);
 
 int string_ends_with(const char *str, const char *suffix);
 
+void *xmalloc_pagealign(size_t) MALLOC_LIKE;
+void free_pagealign(void *);
+void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
+void free_size_align(void *);
+
 /* The C standards say that neither the 'dst' nor 'src' argument to
  * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
  * the null case. */
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..ea39fa557290
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,170 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+
+#include "xdpsock.h"
+#include "dp-packet.h"
+#include "openvswitch/compiler.h"
+
+/* Note:
+ * umem_elem_push* shouldn't overflow because we always pop
+ * elem first, then push back to the stack.
+ */
+static inline void
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+}
+
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->lock);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ret;
+}
+
+static inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
+        return NULL;
+    }
+
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->lock);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ptr;
+}
+
+static void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(void *));
+    memset(bufs, 0, size * sizeof(void *));
+
+    return (void **)bufs;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        return -ENOMEM;
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->lock);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free_pagealign(umemp->array);
+    umemp->array = NULL;
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free_pagealign(xp->array);
+    xp->array = NULL;
+}
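The umem pool above is a LIFO stack of frame pointers: 'index' counts the stored entries, a batched push copies 'n' pointers onto the top, and a batched pop copies the top 'n' entries out or fails without side effects. A single-threaded sketch of the same behavior follows, without the spinlock and with calloc() instead of xmalloc_pagealign(). All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct sketch_pool {
    int index;           /* Number of entries stored; also the next slot. */
    unsigned int size;   /* Capacity of 'array'. */
    void **array;        /* Stack of pointers. */
};

static int
sketch_pool_init(struct sketch_pool *p, unsigned int size)
{
    p->array = calloc(size, sizeof *p->array);
    if (!p->array) {
        return -ENOMEM;
    }
    p->size = size;
    p->index = 0;
    return 0;
}

/* Pushes 'n' pointers.  Overflow is a programming error, as in the patch,
 * because elements are always popped before they are pushed back. */
static void
sketch_pool_push_n(struct sketch_pool *p, int n, void **addrs)
{
    assert(n >= 0 && (unsigned int) (p->index + n) <= p->size);
    memcpy(&p->array[p->index], addrs, n * sizeof *addrs);
    p->index += n;
}

/* Pops the top 'n' pointers into 'addrs'.  Returns 0 on success, or -ENOMEM
 * (leaving the pool untouched) if fewer than 'n' entries are available. */
static int
sketch_pool_pop_n(struct sketch_pool *p, int n, void **addrs)
{
    if (p->index - n < 0) {
        return -ENOMEM;
    }
    p->index -= n;
    memcpy(addrs, &p->array[p->index], n * sizeof *addrs);
    return 0;
}

static void
sketch_pool_destroy(struct sketch_pool *p)
{
    free(p->array);
    p->array = NULL;
}
```

A batched pop is all-or-nothing, which is what lets the send path return ENOMEM for the whole batch instead of handling a partially allocated one.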
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..1a1093381243
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,101 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdio.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "spinlock.h"
+
+#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
+#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* The worst case is that all 4 queues (TX/CQ/RX/FILL) are full.
+ * Setting NUM_FRAMES to this value ensures that umem_pop always succeeds.
+ */
+#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+
+#define BATCH_SIZE      NETDEV_MAX_BURST
+
+BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
+BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
+BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
+
+/* LIFO stack of pointers to free umem buffers. */
+struct umem_pool {
+    int index;      /* Points to the top of the stack. */
+    unsigned int size;
+    struct ovs_spinlock lock;
+    void **array;   /* Array of pointers to umem buffers. */
+};
+
+/* Array-based pool of dp_packet_afxdp. */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_dropped;
+    unsigned long tx_dropped;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *umem_elem_pop(struct umem_pool *umemp);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif /* HAVE_AF_XDP */
+#endif /* XDPSOCK_H */
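The LIFO pool declared in lib/xdpsock.h can be pictured with the minimal sketch below. It is not the patch's implementation: the `demo_*` names are hypothetical, the spinlock is left out, `index` is assumed to count the stored elements, and the 2048-descriptor ring size is an assumed libbpf default rather than a value stated in the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct umem_pool, without the spinlock. */
struct demo_pool {
    int index;          /* Assumption: number of stored elements (stack top). */
    unsigned int size;  /* Capacity of 'array'. */
    void **array;       /* Stack of free buffer addresses. */
};

static int
demo_pool_init(struct demo_pool *p, unsigned int size)
{
    p->array = calloc(size, sizeof *p->array);
    if (!p->array) {
        return -1;
    }
    p->index = 0;
    p->size = size;
    return 0;
}

/* Returns 0 on success, -1 if the pool is already full. */
static int
demo_pool_push(struct demo_pool *p, void *addr)
{
    if ((unsigned int) p->index >= p->size) {
        return -1;
    }
    p->array[p->index++] = addr;
    return 0;
}

/* Returns the most recently pushed address, or NULL if the pool is empty. */
static void *
demo_pool_pop(struct demo_pool *p)
{
    return p->index > 0 ? p->array[--p->index] : NULL;
}

static void
demo_pool_cleanup(struct demo_pool *p)
{
    free(p->array);
    p->array = NULL;
}

/* Exercises the pool; returns 0 if elements come back in LIFO order. */
static int
demo_selfcheck(void)
{
    enum { DEMO_DESCS = 2048,   /* Assumed default ring size. */
           DEMO_FRAMES = 2 * (DEMO_DESCS + DEMO_DESCS) };
    static char frames[3][64];  /* Stand-ins for umem frames. */
    struct demo_pool pool;
    int i, ok;

    if (DEMO_FRAMES & (DEMO_FRAMES - 1)) {
        return -1;              /* NUM_FRAMES must be a power of two. */
    }
    if (demo_pool_init(&pool, 3)) {
        return -1;
    }
    for (i = 0; i < 3; i++) {
        if (demo_pool_push(&pool, frames[i])) {
            demo_pool_cleanup(&pool);
            return -1;
        }
    }
    ok = demo_pool_pop(&pool) == frames[2]
         && demo_pool_pop(&pool) == frames[1]
         && demo_pool_pop(&pool) == frames[0]
         && demo_pool_pop(&pool) == NULL;
    demo_pool_cleanup(&pool);
    return ok ? 0 : -1;
}
```

Under these assumptions the pool covers four full rings, 2 * (2048 + 2048) = 8192 frames, which also satisfies the power-of-two build assertion. A stack, rather than a queue, hands back the most recently freed frame first, which is likely to still be warm in the cache.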
diff --git a/tests/automake.mk b/tests/automake.mk
index 2956e68b242c..131564bb0bd3 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -160,6 +162,10 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -184,6 +190,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -317,6 +324,11 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -354,6 +366,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..1e6f7a46b4b7
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,20 @@ 
+# Add a port to the OVS bridge using the afxdp netdev type.
+# This will use generic XDP support in the veth driver.
+m4_define([ADD_VETH],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..9b7a29066614
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-common-macros.at])
+m4_include([tests/system-userspace-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+
+m4_include([tests/system-traffic.at])
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 89c06a1b7877..1e3acbbb8075 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3101,6 +3101,21 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="xdpmode"
+              type='{"type": "string",
+                     "enum": ["set", ["skb", "drv"]]}'>
+        <p>
+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled. This mode requires a device driver
+          with AF_XDP support and offers the best performance.
+          If "skb", the XDP program uses the kernel's generic XDP mode, with
+          extra data copying between userspace and kernel. No device driver
+          support is needed. This option applies only to the afxdp netdev
+          type. Defaults to "skb" mode.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
         <p>