
[ovs-dev,PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1563480674-21527-1-git-send-email-u9012063@gmail.com
State Accepted
Series [ovs-dev,PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

Commit Message

William Tu July 18, 2019, 8:11 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK while cooperating better with the existing kernel
networking stack.  An AF_XDP socket receives and sends packets from an
eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
kernel's subsystems.  As a result, an AF_XDP socket shows much better
performance than AF_PACKET.  For more details about AF_XDP, please see the
Linux kernel's Documentation/networking/af_xdp.rst.  Note that by default,
this feature is not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
---
v15:
 * address review feedback from Ilya
   https://patchwork.ozlabs.org/patch/1125476/
 * skip TCP related test cases
 * reclaim all CONS_NUM_DESC at complete tx
 * add retries to kick_tx
 * increase memory pool size
 * remove redundant xdp flag and bind flag
 * remove unused rx_dropped var
 * make tx_dropped counter atomic
 * refactor dp_packet_init_afxdp using dp_packet_init__
 * rebase to ovs master, test with latest bpf-next kernel commit b14a260e33ddb4
   Ilya's kernel patches are required
   commit 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
   commit 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")
 Possible issues:
 * still lots of afxdp_cq_skip  (ovs-appctl coverage/show)
    afxdp_cq_skip  44325273.6/sec 34362312.683/sec   572705.2114/sec   total: 2106010377
 * TODO:
   'make check-afxdp' still does not fully pass
   IP fragmentation expiry test not fixed yet; need to implement
   deferred memory free, something like dpdk_mp_sweep.  Currently hitting
   some missing umem descs when reclaiming.
   NSH test case still fails (not due to afxdp)

v16:
  * address feedback from Ilya
  * add deferred memory free
  * add afxdp testsuites files to gitignore

v17:
  * address feedback from Ilya and Ben
  https://patchwork.ozlabs.org/patch/1131547/
  * ovs_spin_lock: add pthread_spin_lock checks, fix typo
  * update NEWS
  * add pthread_spin_lock check at OVS_CHECK_LINUX_AF_XDP
  * fix bug in xmalloc_size_align
  * rename xdpsock to netdev-afxdp-pool
  * remove struct umem_elem, use void *
  * fix style and comments
  * fix afxdp.rst
  * rebase to OVS master, tested on kernel 5.2.0-rc6
 Note: I still leave the last_tsc in pmd_perf_stats the same as v16

v18:
  * address feedback from Ilya and Ben
  https://patchwork.ozlabs.org/patch/1133416/
  * update document about tcp and reconfiguration
  * fix leak in tx spin locks
  * refactor __umem_pool alloc and assert
  * refactor macro and defines used in netdev-afxdp[-pool]
  * refactor the xpool->array to remove using type casting
  * add empty netdev_afxdp_rxq_destruct to avoid closing
    afxdp socket
---
 Documentation/automake.mk             |    1 +
 Documentation/index.rst               |    1 +
 Documentation/intro/install/afxdp.rst |  432 ++++++++++++++
 Documentation/intro/install/index.rst |    1 +
 NEWS                                  |    1 +
 acinclude.m4                          |   35 ++
 configure.ac                          |    1 +
 lib/automake.mk                       |   10 +
 lib/dp-packet.c                       |   23 +
 lib/dp-packet.h                       |   18 +-
 lib/dpif-netdev-perf.h                |   24 +
 lib/netdev-afxdp-pool.c               |  167 ++++++
 lib/netdev-afxdp-pool.h               |   58 ++
 lib/netdev-afxdp.c                    | 1041 +++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |   73 +++
 lib/netdev-linux-private.h            |  132 +++++
 lib/netdev-linux.c                    |  126 ++--
 lib/netdev-provider.h                 |    3 +
 lib/netdev.c                          |   11 +
 lib/util.c                            |   92 ++-
 lib/util.h                            |    5 +
 tests/.gitignore                      |    3 +
 tests/automake.mk                     |   16 +
 tests/system-afxdp-macros.at          |   39 ++
 tests/system-afxdp-testsuite.at       |   26 +
 tests/system-traffic.at               |    2 +
 vswitchd/vswitch.xml                  |   15 +
 27 files changed, 2247 insertions(+), 109 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp-pool.c
 create mode 100644 lib/netdev-afxdp-pool.h
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

Comments

Ilya Maximets July 19, 2019, 2:54 p.m. UTC | #1
On 18.07.2019 23:11, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It is aims to have comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems
> As a result, AF_XDP socket shows much better performance than AF_PACKET
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>


Thanks, William, Eelco and Ben!

I fixed a couple of things and applied to master!

List of changes:
* Dropped config.h from headers.
* Removed double increment of 'util_xalloc' coverage counter in xmalloc_size_align().
* Fixed style of a couple of comments.
* Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style.
  Ex.: __umem_elem_pop_n --> umem_elem_pop_n__


Best regards, Ilya Maximets.
William Tu July 19, 2019, 3:13 p.m. UTC | #2
On Fri, Jul 19, 2019 at 05:54:54PM +0300, Ilya Maximets wrote:
> On 18.07.2019 23:11, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> > type built upon the eBPF and XDP technology.  It is aims to have comparable
> > performance to DPDK but cooperate better with existing kernel's networking
> > stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > attached to the netdev, by-passing a couple of Linux kernel's subsystems
> > As a result, AF_XDP socket shows much better performance than AF_PACKET
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst. Note that by default, this feature is
> > not compiled in.
> > 
> > Signed-off-by: William Tu <u9012063@gmail.com>
> 
> 
> Thanks, William, Eelco and Ben!
> 
> I fixed couple of things and applied to master!
> 
> List of changes:
> * Dropped config.h from headers.
> * Removed double increment of 'util_xalloc' coverage counter in xmalloc_size_align().
> * Fixed style of a couple of comments.
> * Renamed underscored functions from netdev-afxdp-pool.c to be more OVS-style.
>   Ex.: __umem_elem_pop_n --> umem_elem_pop_n__
> 
> 
> Best regards, Ilya Maximets.

Thanks for your final review and fixes.
William
Eelco Chaudron Aug. 8, 2019, 11:42 a.m. UTC | #3
On 19 Jul 2019, at 16:54, Ilya Maximets wrote:

> On 18.07.2019 23:11, William Tu wrote:
>> The patch introduces experimental AF_XDP support for OVS netdev.
>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux 
>> socket
>> type built upon the eBPF and XDP technology.  It is aims to have 
>> comparable
>> performance to DPDK but cooperate better with existing kernel's 
>> networking
>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP 
>> program
>> attached to the netdev, by-passing a couple of Linux kernel's 
>> subsystems
>> As a result, AF_XDP socket shows much better performance than 
>> AF_PACKET
>> For more details about AF_XDP, please see linux kernel's
>> Documentation/networking/af_xdp.rst. Note that by default, this 
>> feature is
>> not compiled in.
>>
>> Signed-off-by: William Tu <u9012063@gmail.com>
>
>
> Thanks, William, Eelco and Ben!
>
> I fixed couple of things and applied to master!

Good to see this got merged into master while I was on PTO. However, when
I got back I decided to test it once more…

When testing PVP I got a couple of packets through, and then it would
stall. I thought it might be my kernel, so I updated to yesterday's
latest, no luck…

I did see a bunch of “eno1: send failed due to exhausted memory
pool.” messages in the log. Putting back patch v14 made my problems
go away…

After some debugging, I noticed the problem was with the “continue”
case in the afxdp_complete_tx() function.
Applying the following patch made it work again:

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index b7cc0d988..9b335ddf0 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)

          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
              xsk_info->outstanding_tx -= tx_to_free;
              tx_to_free = 0;
          }
      }

+    if (tx_to_free) {
+        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
+        xsk_info->outstanding_tx -= tx_to_free;
+    }
+
      if (tx_done > 0) {
          xsk_ring_cons__release(&umem->cq, tx_done);
      } else {
          COVERAGE_INC(afxdp_cq_empty);
      }
  }
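
To make the failure mode concrete, here is a minimal standalone model of the
batching pattern above (this is not the OVS code: push_n(), pool_size and the
addr_valid check are stand-ins for umem_elem_push_n(), the umem mempool and
the skipped-address test). Without the flush after the loop, a partially
filled batch left over at the end of the completion queue is never returned
to the pool, which matches the "exhausted memory pool" symptom:

    /* Standalone sketch, not OVS code. */
    #include <stdio.h>

    #define BATCH_SIZE 32

    static int pool_size;            /* stand-in for the umem mempool   */

    static void push_n(int n)        /* stand-in for umem_elem_push_n() */
    {
        pool_size += n;
    }

    static void complete_tx(int tx_done)
    {
        int tx_to_free = 0;

        for (int j = 0; j < tx_done; j++) {
            int addr_valid = (j % 7 != 0);   /* pretend some addrs are skipped */

            if (addr_valid) {
                tx_to_free++;                /* collect into the current batch */
            }                                /* else: this is an afxdp_cq_skip */

            if (tx_to_free == BATCH_SIZE) {
                push_n(tx_to_free);          /* flush a full batch             */
                tx_to_free = 0;
            }
        }

        if (tx_to_free) {                    /* the fix: flush the tail, or    */
            push_n(tx_to_free);              /* these elements leak            */
        }
    }

    int main(void)
    {
        complete_tx(100);
        printf("elements returned to pool: %d\n", pool_size);
        return 0;
    }

Running this prints 85, i.e. every collected element ends up back in the
pool; dropping the tail flush would strand the last partial batch of 21.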


Which made me wonder why we mark elements as being used. To my
knowledge (and looking at some of the code and examples), after the
xsk_ring_cons__release() function a xsk_ring_cons__peek() should not
return any duplicate slots.

I see a rather high number of afxdp_cq_skip, which to my knowledge
should never happen.

$ ovs-appctl coverage/show  | grep xdp
afxdp_cq_empty             0.0/sec   339.600/sec        5.6606/sec   total: 20378
afxdp_tx_full              0.0/sec    29.967/sec        0.4994/sec   total: 1798
afxdp_cq_skip              0.0/sec 61884770.167/sec  1174238.3644/sec   total: 4227258112


You mentioned you saw this high number in your v15 change notes; did you
do any research into why?

Cheers,

Eelco
Ilya Maximets Aug. 8, 2019, 12:09 p.m. UTC | #4
On 08.08.2019 14:42, Eelco Chaudron wrote:
> 
> 
> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
> 
>> On 18.07.2019 23:11, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
>>> type built upon the eBPF and XDP technology.  It is aims to have comparable
>>> performance to DPDK but cooperate better with existing kernel's networking
>>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
>>> attached to the netdev, by-passing a couple of Linux kernel's subsystems
>>> As a result, AF_XDP socket shows much better performance than AF_PACKET
>>> For more details about AF_XDP, please see linux kernel's
>>> Documentation/networking/af_xdp.rst. Note that by default, this feature is
>>> not compiled in.
>>>
>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>
>>
>> Thanks, William, Eelco and Ben!
>>
>> I fixed couple of things and applied to master!
> 
> Good to see this got merged into master while on PTO. However, when I got back I decided to test it once more…
> 
> When testing PVP I got a couple of packets trough, and then it would stall. I thought it might be my kernel, so updated to yesterdays latest, no luck…
> 
> I did see a bunch of “eno1: send failed due to exhausted memory pool.” messages in the log. Putting back patch v14, made my problems go away…
> 
> After some debugging, I noticed the problem was with the “continue” case in the afxdp_complete_tx() function.
> Applying the following patch made it work again:
> 
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index b7cc0d988..9b335ddf0 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> 
>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>              xsk_info->outstanding_tx -= tx_to_free;
>              tx_to_free = 0;
>          }
>      }
> 
> +    if (tx_to_free) {
> +        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
> +        xsk_info->outstanding_tx -= tx_to_free;
> +    }
> +
>      if (tx_done > 0) {
>          xsk_ring_cons__release(&umem->cq, tx_done);
>      } else {
>          COVERAGE_INC(afxdp_cq_empty);
>      }
>  }

Good catch! Will you submit a patch?
BTW, to reduce the code duplication I'd suggest removing the 'continue',
like this:

    if (*addr != UINT64_MAX) {
        Do work;
    } else {
        COVERAGE_INC(afxdp_cq_skip);
    }

> 
> 
> Which made me wonder why we do mark elements as being used? To my knowledge (and looking at some of the code and examples), after the  xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive any duplicate slots.
> 
> I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen?

I tried to investigate this previously, but didn't find anything suspicious.
So, to my knowledge, this should never happen either.
However, I only looked at the code without actually running it, because I had
no HW available for testing.

While investigating and stress-testing virtual ports I found a few issues with
missing locking inside the kernel, so I don't fully trust the kernel part of
the XDP implementation. I suspect that there are some other bugs in
kernel/libbpf that can only be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I never saw this
coverage counter being non-zero.

> 
> $ ovs-appctl coverage/show  | grep xdp
> afxdp_cq_empty             0.0/sec   339.600/sec        5.6606/sec   total: 20378
> afxdp_tx_full              0.0/sec    29.967/sec        0.4994/sec   total: 1798
> afxdp_cq_skip              0.0/sec 61884770.167/sec  1174238.3644/sec   total: 4227258112
> 
> 
> You mentioned you saw this high number in your v15 change notes, did you do any research on why?
> 
> Cheers,
> 
> Eelco
Eelco Chaudron Aug. 8, 2019, 2:53 p.m. UTC | #5
On 8 Aug 2019, at 14:09, Ilya Maximets wrote:

> On 08.08.2019 14:42, Eelco Chaudron wrote:
>>
>>
>> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
>>
>>> On 18.07.2019 23:11, William Tu wrote:
>>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux 
>>>> socket
>>>> type built upon the eBPF and XDP technology.  It is aims to have 
>>>> comparable
>>>> performance to DPDK but cooperate better with existing kernel's 
>>>> networking
>>>> stack.  An AF_XDP socket receives and sends packets from an 
>>>> eBPF/XDP program
>>>> attached to the netdev, by-passing a couple of Linux kernel's 
>>>> subsystems
>>>> As a result, AF_XDP socket shows much better performance than 
>>>> AF_PACKET
>>>> For more details about AF_XDP, please see linux kernel's
>>>> Documentation/networking/af_xdp.rst. Note that by default, this 
>>>> feature is
>>>> not compiled in.
>>>>
>>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>>
>>>
>>> Thanks, William, Eelco and Ben!
>>>
>>> I fixed couple of things and applied to master!
>>
>> Good to see this got merged into master while on PTO. However, when I 
>> got back I decided to test it once more…
>>
>> When testing PVP I got a couple of packets trough, and then it would 
>> stall. I thought it might be my kernel, so updated to yesterdays 
>> latest, no luck…
>>
>> I did see a bunch of “eno1: send failed due to exhausted memory 
>> pool.” messages in the log. Putting back patch v14, made my 
>> problems go away…
>>
>> After some debugging, I noticed the problem was with the 
>> “continue” case in the afxdp_complete_tx() function.
>> Applying the following patch made it work again:
>>
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index b7cc0d988..9b335ddf0 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info 
>> *xsk_info)
>>
>>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>>              umem_elem_push_n(&umem->mpool, tx_to_free, 
>> elems_push);
>>              xsk_info->outstanding_tx -= tx_to_free;
>>              tx_to_free = 0;
>>          }
>>      }
>>
>> +    if (tx_to_free) {
>> +        umem_elem_push_n(&umem->mpool, tx_to_free, 
>> elems_push);
>> +        xsk_info->outstanding_tx -= tx_to_free;
>> +    }
>> +
>>      if (tx_done > 0) {
>>          xsk_ring_cons__release(&umem->cq, tx_done);
>>      } else {
>>          COVERAGE_INC(afxdp_cq_empty);
>>      }
>>  }
>
> Good catch! Will you submit a patch?
> BTW, to reduce the code duplication I'd suggest to remove the 
> 'continue'
> like this:
>
>     if (*addr != UINT64_MAX) {
>         Do work;
>     } else {
>         COVERAGE_INC(afxdp_cq_skip);
>     }

Done, patch has been sent out…

>>
>>
>> Which made me wonder why we do mark elements as being used? To my 
>> knowledge (and looking at some of the code and examples), after the  
>> xsk_ring_cons__release() function a xsk_ring_cons__peek() should not 
>> receive any duplicate slots.
>>
>> I see a rather high number of afxdp_cq_skip, which should to my 
>> knowledge never happen?
>
> I tried to investigate this previously, but didn't find anything 
> suspicious.
> So, for my knowledge, this should never happen too.
> However, I only looked at the code without actually running, because I 
> had no
> HW available for testing.
>
> While investigation and stress-testing virtual ports I found few 
> issues with
> missing locking inside the kernel, so there is no trust for kernel 
> part of XDP
> implementation from my side. I'm suspecting that there are some other 
> bugs in
> kernel/libbpf that only could be reproduced with driver mode.
>
> This never happens for virtual ports with SKB mode, so I never saw 
> this coverage
> counter being non-zero.

Did some quick debugging, as something else has come up that needs my
attention :)

But once I’m in a faulty state and send a single packet, causing
afxdp_complete_tx() to be called, it tells me 2048 descriptors are
ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there
might be some ring management bug. Maybe consumer and producer are equal,
meaning 0 buffers, but it returns the max? I did not look at the kernel
code, so this is just a wild guess :)

(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080}

>>
>> $ ovs-appctl coverage/show  | grep xdp
>> afxdp_cq_empty             0.0/sec   
>> 339.600/sec        5.6606/sec   total: 20378
>> afxdp_tx_full              0.0/sec    
>> 29.967/sec        0.4994/sec   total: 1798
>> afxdp_cq_skip              0.0/sec 61884770.167/sec  
>> 1174238.3644/sec   total: 4227258112
>>
>>
>> You mentioned you saw this high number in your v15 change notes, did 
>> you do any research on why?
>>
>> Cheers,
>>
>> Eelco
Ilya Maximets Aug. 8, 2019, 3:38 p.m. UTC | #6
On 08.08.2019 17:53, Eelco Chaudron wrote:
> 
> 
> On 8 Aug 2019, at 14:09, Ilya Maximets wrote:
> 
>> On 08.08.2019 14:42, Eelco Chaudron wrote:
>>>
>>>
>>> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
>>>
>>>> On 18.07.2019 23:11, William Tu wrote:
>>>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
>>>>> type built upon the eBPF and XDP technology.  It is aims to have comparable
>>>>> performance to DPDK but cooperate better with existing kernel's networking
>>>>> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
>>>>> attached to the netdev, by-passing a couple of Linux kernel's subsystems
>>>>> As a result, AF_XDP socket shows much better performance than AF_PACKET
>>>>> For more details about AF_XDP, please see linux kernel's
>>>>> Documentation/networking/af_xdp.rst. Note that by default, this feature is
>>>>> not compiled in.
>>>>>
>>>>> Signed-off-by: William Tu <u9012063@gmail.com>
>>>>
>>>>
>>>> Thanks, William, Eelco and Ben!
>>>>
>>>> I fixed couple of things and applied to master!
>>>
>>> Good to see this got merged into master while on PTO. However, when I got back I decided to test it once more…
>>>
>>> When testing PVP I got a couple of packets trough, and then it would stall. I thought it might be my kernel, so updated to yesterdays latest, no luck…
>>>
>>> I did see a bunch of “eno1: send failed due to exhausted memory pool.” messages in the log. Putting back patch v14, made my problems go away…
>>>
>>> After some debugging, I noticed the problem was with the “continue” case in the afxdp_complete_tx() function.
>>> Applying the following patch made it work again:
>>>
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> index b7cc0d988..9b335ddf0 100644
>>> --- a/lib/netdev-afxdp.c
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
>>>
>>>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>>>              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>>>              xsk_info->outstanding_tx -= tx_to_free;
>>>              tx_to_free = 0;
>>>          }
>>>      }
>>>
>>> +    if (tx_to_free) {
>>> +        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>>> +        xsk_info->outstanding_tx -= tx_to_free;
>>> +    }
>>> +
>>>      if (tx_done > 0) {
>>>          xsk_ring_cons__release(&umem->cq, tx_done);
>>>      } else {
>>>          COVERAGE_INC(afxdp_cq_empty);
>>>      }
>>>  }
>>
>> Good catch! Will you submit a patch?
>> BTW, to reduce the code duplication I'd suggest to remove the 'continue'
>> like this:
>>
>>     if (*addr != UINT64_MAX) {
>>         Do work;
>>     } else {
>>         COVERAGE_INC(afxdp_cq_skip);
>>     }
> 
> Done, patch has been sent out…
> 
>>>
>>>
>>> Which made me wonder why we do mark elements as being used? To my knowledge (and looking at some of the code and examples), after the  xsk_ring_cons__release() function a xsk_ring_cons__peek() should not receive any duplicate slots.
>>>
>>> I see a rather high number of afxdp_cq_skip, which should to my knowledge never happen?
>>
>> I tried to investigate this previously, but didn't find anything suspicious.
>> So, for my knowledge, this should never happen too.
>> However, I only looked at the code without actually running, because I had no
>> HW available for testing.
>>
>> While investigation and stress-testing virtual ports I found few issues with
>> missing locking inside the kernel, so there is no trust for kernel part of XDP
>> implementation from my side. I'm suspecting that there are some other bugs in
>> kernel/libbpf that only could be reproduced with driver mode.
>>
>> This never happens for virtual ports with SKB mode, so I never saw this coverage
>> counter being non-zero.
> 
> Did some quick debugging, as something else has come up that needs my attention :)
> 
> But once I’m in a faulty state and sent a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that there might be some ring management bug. Maybe consumer and receiver are equal meaning 0 buffers, but it returns max? I did not look at the kernel code, so this is just a wild guess :)
> 
> (gdb) p tx_done
> $3 = 2048
> 
> (gdb) p umem->cq
> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080}

Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between cached_prod
and cached_cons, but here the difference between these values is far too
large:

3830466864 - 3578066899 = 252399965

Since this value is greater than the requested count, it returns the
requested number (2048).

So, the ring is broken. At least its 'cached' part is broken. It would be
good to look at the *consumer and *producer values to verify the state of
the actual ring.
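
As a rough illustration of that arithmetic (a standalone sketch, assuming
only that peek() is bounded by cached_prod - cached_cons as described above;
the variable names are local to this snippet, not libbpf's):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cached_prod = 3830466864u;   /* from the gdb dump above     */
        uint32_t cached_cons = 3578066899u;
        uint32_t requested   = 2048;          /* number requested from peek() */

        uint32_t entries  = cached_prod - cached_cons;
        uint32_t returned = entries > requested ? requested : entries;

        printf("entries=%u -> peek returns %u\n", entries, returned);
        return 0;
    }

This prints "entries=252399965 -> peek returns 2048", matching the bogus
tx_done of 2048 observed above.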

> 
>>>
>>> $ ovs-appctl coverage/show  | grep xdp
>>> afxdp_cq_empty             0.0/sec   339.600/sec        5.6606/sec   total: 20378
>>> afxdp_tx_full              0.0/sec    29.967/sec        0.4994/sec   total: 1798
>>> afxdp_cq_skip              0.0/sec 61884770.167/sec  1174238.3644/sec   total: 4227258112
>>>
>>>
>>> You mentioned you saw this high number in your v15 change notes, did you do any research on why?
>>>
>>> Cheers,
>>>
>>> Eelco
> 
>
Eelco Chaudron Aug. 14, 2019, 12:09 p.m. UTC | #7
On 8 Aug 2019, at 17:38, Ilya Maximets wrote:

<SNIP>

>>>> I see a rather high number of afxdp_cq_skip, which should to my 
>>>> knowledge never happen?
>>>
>>> I tried to investigate this previously, but didn't find anything 
>>> suspicious.
>>> So, for my knowledge, this should never happen too.
>>> However, I only looked at the code without actually running, because 
>>> I had no
>>> HW available for testing.
>>>
>>> While investigation and stress-testing virtual ports I found few 
>>> issues with
>>> missing locking inside the kernel, so there is no trust for kernel 
>>> part of XDP
>>> implementation from my side. I'm suspecting that there are some 
>>> other bugs in
>>> kernel/libbpf that only could be reproduced with driver mode.
>>>
>>> This never happens for virtual ports with SKB mode, so I never saw 
>>> this coverage
>>> counter being non-zero.
>>
>> Did some quick debugging, as something else has come up that needs my 
>> attention :)
>>
>> But once I’m in a faulty state and sent a single packet, causing 
>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are 
>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that 
>> there might be some ring management bug. Maybe consumer and receiver 
>> are equal meaning 0 buffers, but it returns max? I did not look at 
>> the kernel code, so this is just a wild guess :)
>>
>> (gdb) p tx_done
>> $3 = 2048
>>
>> (gdb) p umem->cq
>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 
>> 2047, size = 2048, producer = 0x7f08486b8000, consumer = 
>> 0x7f08486b8040, ring = 0x7f08486b8080}
>
> Thanks for debugging!
>
> xsk_ring_cons__peek() just returns the difference between cached_prod
> and cached_cons, but these values are too different:
>
> 3830466864 - 3578066899 = 252399965
>
> Since this value > requested, it returns requested number (2048).
>
> So, the ring is broken. At least broken its 'cached' part. It'll be 
> good
> to look at *consumer and *producer values to verify the state of the
> actual ring.
>

I’ll try to find some more time next week to debug further.

William, I noticed your email in xdp-newbies where you mention this
problem of getting the wrong pointers. Did you ever follow up, or do any
further troubleshooting on the above?

>>
>>>>
>>>> $ ovs-appctl coverage/show  | grep xdp
>>>> afxdp_cq_empty             0.0/sec   
>>>> 339.600/sec        5.6606/sec   total: 20378
>>>> afxdp_tx_full              0.0/sec    
>>>> 29.967/sec        0.4994/sec   total: 1798
>>>> afxdp_cq_skip              0.0/sec 61884770.167/sec  
>>>> 1174238.3644/sec   total: 4227258112
>>>>
>>>>
>>>> You mentioned you saw this high number in your v15 change notes, 
>>>> did you do any research on why?
>>>>
>>>> Cheers,
>>>>
>>>> Eelco
>>
>>
William Tu Aug. 14, 2019, 2:58 p.m. UTC | #8
On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>
> <SNIP>
>
> >>>> I see a rather high number of afxdp_cq_skip, which should to my
> >>>> knowledge never happen?
> >>>
> >>> I tried to investigate this previously, but didn't find anything
> >>> suspicious.
> >>> So, for my knowledge, this should never happen too.
> >>> However, I only looked at the code without actually running, because
> >>> I had no
> >>> HW available for testing.
> >>>
> >>> While investigation and stress-testing virtual ports I found few
> >>> issues with
> >>> missing locking inside the kernel, so there is no trust for kernel
> >>> part of XDP
> >>> implementation from my side. I'm suspecting that there are some
> >>> other bugs in
> >>> kernel/libbpf that only could be reproduced with driver mode.
> >>>
> >>> This never happens for virtual ports with SKB mode, so I never saw
> >>> this coverage
> >>> counter being non-zero.
> >>
> >> Did some quick debugging, as something else has come up that needs my
> >> attention :)
> >>
> >> But once I’m in a faulty state and sent a single packet, causing
> >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> >> there might be some ring management bug. Maybe consumer and receiver
> >> are equal meaning 0 buffers, but it returns max? I did not look at
> >> the kernel code, so this is just a wild guess :)
> >>
> >> (gdb) p tx_done
> >> $3 = 2048
> >>
> >> (gdb) p umem->cq
> >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> >> 0x7f08486b8040, ring = 0x7f08486b8080}
> >
> > Thanks for debugging!
> >
> > xsk_ring_cons__peek() just returns the difference between cached_prod
> > and cached_cons, but these values are too different:
> >
> > 3830466864 - 3578066899 = 252399965
> >
> > Since this value > requested, it returns requested number (2048).
> >
> > So, the ring is broken. At least broken its 'cached' part. It'll be
> > good
> > to look at *consumer and *producer values to verify the state of the
> > actual ring.
> >
>
> I’ll try to find some more time next week to debug further.
>
> William I noticed your email in xdp-newbies where you mention this
> problem of getting the wrong pointers. Did you ever follow up, or did
> further trouble shooting on the above?

Yes, I posted here
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from the kernel. But it turned out that my reproduction
code was not correct, so it was not able to show the case we hit here in OVS.

Then I ported more of the OVS code logic into xdpsock, but the problem
did not show up. As a result, I worked around it by marking the addr with
"*addr == UINT64_MAX".

I will debug again this week once I get my testbed back.

William
William Tu Aug. 14, 2019, 4:16 p.m. UTC | #9
On Wed, Aug 14, 2019 at 7:58 AM William Tu <u9012063@gmail.com> wrote:
>
> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >
> >
> >
> > On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
> >
> > <SNIP>
> >
> > >>>> I see a rather high number of afxdp_cq_skip, which should to my
> > >>>> knowledge never happen?
> > >>>
> > >>> I tried to investigate this previously, but didn't find anything
> > >>> suspicious.
> > >>> So, for my knowledge, this should never happen too.
> > >>> However, I only looked at the code without actually running, because
> > >>> I had no
> > >>> HW available for testing.
> > >>>
> > >>> While investigation and stress-testing virtual ports I found few
> > >>> issues with
> > >>> missing locking inside the kernel, so there is no trust for kernel
> > >>> part of XDP
> > >>> implementation from my side. I'm suspecting that there are some
> > >>> other bugs in
> > >>> kernel/libbpf that only could be reproduced with driver mode.
> > >>>
> > >>> This never happens for virtual ports with SKB mode, so I never saw
> > >>> this coverage
> > >>> counter being non-zero.
> > >>
> > >> Did some quick debugging, as something else has come up that needs my
> > >> attention :)
> > >>
> > >> But once I’m in a faulty state and sent a single packet, causing
> > >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> > >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> > >> there might be some ring management bug. Maybe consumer and receiver
> > >> are equal meaning 0 buffers, but it returns max? I did not look at
> > >> the kernel code, so this is just a wild guess :)
> > >>
> > >> (gdb) p tx_done
> > >> $3 = 2048
> > >>
> > >> (gdb) p umem->cq
> > >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> > >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> > >> 0x7f08486b8040, ring = 0x7f08486b8080}
> > >
> > > Thanks for debugging!
> > >
> > > xsk_ring_cons__peek() just returns the difference between cached_prod
> > > and cached_cons, but these values are too different:
> > >
> > > 3830466864 - 3578066899 = 252399965
> > >
> > > Since this value > requested, it returns requested number (2048).
> > >
> > > So, the ring is broken. At least broken its 'cached' part. It'll be
> > > good
> > > to look at *consumer and *producer values to verify the state of the
> > > actual ring.
> > >
> >
> > I’ll try to find some more time next week to debug further.
> >
> > William I noticed your email in xdp-newbies where you mention this
> > problem of getting the wrong pointers. Did you ever follow up, or did
> > further trouble shooting on the above?
>
> Yes, I posted here
> https://www.spinics.net/lists/xdp-newbies/msg00956.html
> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>
> At that time I was thinking about reproducing the problem using the
> xdpsock sample code from kernel. But turned out that my reproduction
> code is not correct, so not able to show the case we hit here in OVS.
>
> Then I put more similar code logic from OVS to xdpsock, but the problem
> does not show up. As a result, I worked around it by marking addr as
> "*addr == UINT64_MAX".
>
> I will debug again this week once I get my testbed back.
>
Just to refresh my memory. The original issue is that
when calling:
tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
xsk_ring_cons__release(&umem->cq, tx_done);

I expect there to be 'tx_done' elems on the CQ to recycle back to the memory
pool. However, when I inspect these elems, I find some elems that have
'already' been reported complete the last time I called xsk_ring_cons__peek.
In other words, some elems show up on the CQ twice, and this causes an
overflow of the mempool.

Thus, we mark the elems on the CQ as UINT64_MAX to indicate that we have
already seen them.
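
A standalone sketch of that marking idea (this is not the netdev-afxdp.c
code: cq_ring, recycled and skipped are stand-ins for the completion ring,
the mempool push and the afxdp_cq_skip counter):

    #include <stdint.h>
    #include <stdio.h>

    #define CQ_SIZE 8

    static uint64_t cq_ring[CQ_SIZE];     /* stand-in for the completion ring */
    static unsigned int recycled, skipped;

    static void complete_tx(unsigned int idx, unsigned int tx_done)
    {
        for (unsigned int i = 0; i < tx_done; i++) {
            uint64_t *addr = &cq_ring[(idx + i) % CQ_SIZE];

            if (*addr != UINT64_MAX) {
                recycled++;               /* recycle the buffer to the pool   */
                *addr = UINT64_MAX;       /* remember that we have seen it    */
            } else {
                skipped++;                /* duplicate completion: cq_skip    */
            }
        }
    }

    int main(void)
    {
        for (int i = 0; i < CQ_SIZE; i++) {
            cq_ring[i] = 0x1000 + i * 0x800;    /* fake umem addresses      */
        }
        complete_tx(0, 4);
        complete_tx(2, 4);                      /* overlaps the first range */
        printf("recycled=%u skipped=%u\n", recycled, skipped);
        return 0;
    }

The second, overlapping peek recycles only the two addresses not seen before
and counts the two duplicates; in OVS those duplicates show up as the
afxdp_cq_skip coverage counter.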

William
Ilya Maximets Aug. 20, 2019, 10:10 a.m. UTC | #10
On 14.08.2019 19:16, William Tu wrote:
> On Wed, Aug 14, 2019 at 7:58 AM William Tu <u9012063@gmail.com> wrote:
>>
>> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>>
>>>
>>>
>>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>>>
>>> <SNIP>
>>>
>>>>>>> I see a rather high number of afxdp_cq_skip, which should to my
>>>>>>> knowledge never happen?
>>>>>>
>>>>>> I tried to investigate this previously, but didn't find anything
>>>>>> suspicious.
>>>>>> So, for my knowledge, this should never happen too.
>>>>>> However, I only looked at the code without actually running, because
>>>>>> I had no
>>>>>> HW available for testing.
>>>>>>
>>>>>> While investigation and stress-testing virtual ports I found few
>>>>>> issues with
>>>>>> missing locking inside the kernel, so there is no trust for kernel
>>>>>> part of XDP
>>>>>> implementation from my side. I'm suspecting that there are some
>>>>>> other bugs in
>>>>>> kernel/libbpf that only could be reproduced with driver mode.
>>>>>>
>>>>>> This never happens for virtual ports with SKB mode, so I never saw
>>>>>> this coverage
>>>>>> counter being non-zero.
>>>>>
>>>>> Did some quick debugging, as something else has come up that needs my
>>>>> attention :)
>>>>>
>>>>> But once I’m in a faulty state and sent a single packet, causing
>>>>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
>>>>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
>>>>> there might be some ring management bug. Maybe consumer and receiver
>>>>> are equal meaning 0 buffers, but it returns max? I did not look at
>>>>> the kernel code, so this is just a wild guess :)
>>>>>
>>>>> (gdb) p tx_done
>>>>> $3 = 2048
>>>>>
>>>>> (gdb) p umem->cq
>>>>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
>>>>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>>>>> 0x7f08486b8040, ring = 0x7f08486b8080}
>>>>
>>>> Thanks for debugging!
>>>>
>>>> xsk_ring_cons__peek() just returns the difference between cached_prod
>>>> and cached_cons, but these values are too different:
>>>>
>>>> 3830466864 - 3578066899 = 252399965
>>>>
>>>> Since this value > requested, it returns requested number (2048).
>>>>
>>>> So, the ring is broken. At least broken its 'cached' part. It'll be
>>>> good
>>>> to look at *consumer and *producer values to verify the state of the
>>>> actual ring.
>>>>
>>>
>>> I’ll try to find some more time next week to debug further.
>>>
>>> William I noticed your email in xdp-newbies where you mention this
>>> problem of getting the wrong pointers. Did you ever follow up, or did
>>> further trouble shooting on the above?
>>
>> Yes, I posted here
>> https://www.spinics.net/lists/xdp-newbies/msg00956.html
>> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>>
>> At that time I was thinking about reproducing the problem using the
>> xdpsock sample code from kernel. But turned out that my reproduction
>> code is not correct, so not able to show the case we hit here in OVS.
>>
>> Then I put more similar code logic from OVS to xdpsock, but the problem
>> does not show up. As a result, I worked around it by marking addr as
>> "*addr == UINT64_MAX".
>>
>> I will debug again this week once I get my testbed back.
>>
> Just to refresh my memory. The original issue is that
> when calling:
> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
> xsk_ring_cons__release(&umem->cq, tx_done);
> 
> I expect there are 'tx_done' elems on the CQ to re-cycle back to memory pool.
> However, when I inspect these elems, I found some elems that 'already' been
> reported complete last time I call xsk_ring_cons__peek. In other word, some
> elems show up at CQ twice. And this cause overflow of the mempool.
> 
> Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
> seen this elem.

William, Eelco, which HW NIC are you using? Which kernel driver?

Best regards, Ilya Maximets.
Eelco Chaudron Aug. 20, 2019, 11:19 a.m. UTC | #11
On 20 Aug 2019, at 12:10, Ilya Maximets wrote:

> On 14.08.2019 19:16, William Tu wrote:
>> On Wed, Aug 14, 2019 at 7:58 AM William Tu <u9012063@gmail.com> 
>> wrote:
>>>
>>> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echaudro@redhat.com> 
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>>>>
>>>> <SNIP>
>>>>
>>>>>>>> I see a rather high number of afxdp_cq_skip, which should to my
>>>>>>>> knowledge never happen?
>>>>>>>
>>>>>>> I tried to investigate this previously, but didn't find anything
>>>>>>> suspicious.
>>>>>>> So, for my knowledge, this should never happen too.
>>>>>>> However, I only looked at the code without actually running, 
>>>>>>> because
>>>>>>> I had no
>>>>>>> HW available for testing.
>>>>>>>
>>>>>>> While investigation and stress-testing virtual ports I found few
>>>>>>> issues with
>>>>>>> missing locking inside the kernel, so there is no trust for 
>>>>>>> kernel
>>>>>>> part of XDP
>>>>>>> implementation from my side. I'm suspecting that there are some
>>>>>>> other bugs in
>>>>>>> kernel/libbpf that only could be reproduced with driver mode.
>>>>>>>
>>>>>>> This never happens for virtual ports with SKB mode, so I never 
>>>>>>> saw
>>>>>>> this coverage
>>>>>>> counter being non-zero.
>>>>>>
>>>>>> Did some quick debugging, as something else has come up that 
>>>>>> needs my
>>>>>> attention :)
>>>>>>
>>>>>> But once I’m in a faulty state and sent a single packet, 
>>>>>> causing
>>>>>> afxdp_complete_tx() to be called, it tells me 2048 descriptors 
>>>>>> are
>>>>>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
>>>>>> there might be some ring management bug. Maybe consumer and 
>>>>>> receiver
>>>>>> are equal meaning 0 buffers, but it returns max? I did not look 
>>>>>> at
>>>>>> the kernel code, so this is just a wild guess :)
>>>>>>
>>>>>> (gdb) p tx_done
>>>>>> $3 = 2048
>>>>>>
>>>>>> (gdb) p umem->cq
>>>>>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
>>>>>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>>>>>> 0x7f08486b8040, ring = 0x7f08486b8080}
>>>>>
>>>>> Thanks for debugging!
>>>>>
>>>>> xsk_ring_cons__peek() just returns the difference between 
>>>>> cached_prod
>>>>> and cached_cons, but these values are too different:
>>>>>
>>>>> 3830466864 - 3578066899 = 252399965
>>>>>
>>>>> Since this value > requested, it returns requested number (2048).
>>>>>
>>>>> So, the ring is broken. At least broken its 'cached' part. It'll 
>>>>> be
>>>>> good
>>>>> to look at *consumer and *producer values to verify the state of 
>>>>> the
>>>>> actual ring.
>>>>>
>>>>
>>>> I’ll try to find some more time next week to debug further.
>>>>
>>>> William I noticed your email in xdp-newbies where you mention this
>>>> problem of getting the wrong pointers. Did you ever follow up, or 
>>>> did
>>>> further trouble shooting on the above?
>>>
>>> Yes, I posted here
>>> https://www.spinics.net/lists/xdp-newbies/msg00956.html
>>> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>>>
>>> At that time I was thinking about reproducing the problem using the
>>> xdpsock sample code from kernel. But turned out that my reproduction
>>> code is not correct, so not able to show the case we hit here in 
>>> OVS.
>>>
>>> Then I put more similar code logic from OVS to xdpsock, but the 
>>> problem
>>> does not show up. As a result, I worked around it by marking addr as
>>> "*addr == UINT64_MAX".
>>>
>>> I will debug again this week once I get my testbed back.
>>>
>> Just to refresh my memory. The original issue is that
>> when calling:
>> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
>> xsk_ring_cons__release(&umem->cq, tx_done);
>>
>> I expect there are 'tx_done' elems on the CQ to re-cycle back to 
>> memory pool.
>> However, when I inspect these elems, I found some elems that 
>> 'already' been
>> reported complete last time I call xsk_ring_cons__peek. In other 
>> word, some
>> elems show up at CQ twice. And this cause overflow of the mempool.
>>
>> Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
>> seen this elem.
>
> William, Eelco, which HW NIC you're using? Which kernel driver?

I’m using the NICs below with the latest bpf-next driver:

01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)

//Eelco
Ilya Maximets Aug. 20, 2019, 3:20 p.m. UTC | #12
On 20.08.2019 14:19, Eelco Chaudron wrote:
> 
> 
> On 20 Aug 2019, at 12:10, Ilya Maximets wrote:
> 
>> On 14.08.2019 19:16, William Tu wrote:
>>> On Wed, Aug 14, 2019 at 7:58 AM William Tu <u9012063@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>>>>>
>>>>> <SNIP>
>>>>>
>>>>>>>>> I see a rather high number of afxdp_cq_skip, which should to my
>>>>>>>>> knowledge never happen?
>>>>>>>>
>>>>>>>> I tried to investigate this previously, but didn't find anything
>>>>>>>> suspicious.
>>>>>>>> So, for my knowledge, this should never happen too.
>>>>>>>> However, I only looked at the code without actually running, because
>>>>>>>> I had no
>>>>>>>> HW available for testing.
>>>>>>>>
>>>>>>>> While investigation and stress-testing virtual ports I found few
>>>>>>>> issues with
>>>>>>>> missing locking inside the kernel, so there is no trust for kernel
>>>>>>>> part of XDP
>>>>>>>> implementation from my side. I'm suspecting that there are some
>>>>>>>> other bugs in
>>>>>>>> kernel/libbpf that only could be reproduced with driver mode.
>>>>>>>>
>>>>>>>> This never happens for virtual ports with SKB mode, so I never saw
>>>>>>>> this coverage
>>>>>>>> counter being non-zero.
>>>>>>>
>>>>>>> Did some quick debugging, as something else has come up that needs my
>>>>>>> attention :)
>>>>>>>
>>>>>>> But once I’m in a faulty state and sent a single packet, causing
>>>>>>> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
>>>>>>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
>>>>>>> there might be some ring management bug. Maybe consumer and receiver
>>>>>>> are equal meaning 0 buffers, but it returns max? I did not look at
>>>>>>> the kernel code, so this is just a wild guess :)
>>>>>>>
>>>>>>> (gdb) p tx_done
>>>>>>> $3 = 2048
>>>>>>>
>>>>>>> (gdb) p umem->cq
>>>>>>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
>>>>>>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>>>>>>> 0x7f08486b8040, ring = 0x7f08486b8080}
>>>>>>
>>>>>> Thanks for debugging!
>>>>>>
>>>>>> xsk_ring_cons__peek() just returns the difference between cached_prod
>>>>>> and cached_cons, but these values are too different:
>>>>>>
>>>>>> 3830466864 - 3578066899 = 252399965
>>>>>>
>>>>>> Since this value > requested, it returns requested number (2048).
>>>>>>
>>>>>> So, the ring is broken. At least broken its 'cached' part. It'll be
>>>>>> good
>>>>>> to look at *consumer and *producer values to verify the state of the
>>>>>> actual ring.
>>>>>>
>>>>>
>>>>> I’ll try to find some more time next week to debug further.
>>>>>
>>>>> William I noticed your email in xdp-newbies where you mention this
>>>>> problem of getting the wrong pointers. Did you ever follow up, or did
>>>>> further trouble shooting on the above?
>>>>
>>>> Yes, I posted here
>>>> https://www.spinics.net/lists/xdp-newbies/msg00956.html
>>>> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>>>>
>>>> At that time I was thinking about reproducing the problem using the
>>>> xdpsock sample code from kernel. But turned out that my reproduction
>>>> code is not correct, so not able to show the case we hit here in OVS.
>>>>
>>>> Then I put more similar code logic from OVS to xdpsock, but the problem
>>>> does not show up. As a result, I worked around it by marking addr as
>>>> "*addr == UINT64_MAX".
>>>>
>>>> I will debug again this week once I get my testbed back.
>>>>
>>> Just to refresh my memory. The original issue is that
>>> when calling:
>>> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
>>> xsk_ring_cons__release(&umem->cq, tx_done);
>>>
>>> I expect there are 'tx_done' elems on the CQ to re-cycle back to memory pool.
>>> However, when I inspect these elems, I found some elems that 'already' been
>>> reported complete last time I call xsk_ring_cons__peek. In other word, some
>>> elems show up at CQ twice. And this cause overflow of the mempool.
>>>
>>> Thus, mark the elems on CQ as UINT64_MAX to indicate that we already
>>> seen this elem.
>>
>> William, Eelco, which HW NIC you're using? Which kernel driver?
> 
> I’m using the below on the latest bpf-next driver:
> 
> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Thanks for the information.
I found one suspicious place inside the ixgbe driver that could break
the completion queue ring and prepared a patch:
    https://patchwork.ozlabs.org/patch/1150244/

It'll be good if you can test it.

Best regards, Ilya Maximets.
Eelco Chaudron Aug. 21, 2019, 9:31 a.m. UTC | #13
On 20 Aug 2019, at 17:20, Ilya Maximets wrote:

> On 20.08.2019 14:19, Eelco Chaudron wrote:
>>
>>
>> On 20 Aug 2019, at 12:10, Ilya Maximets wrote:
>>
>>> On 14.08.2019 19:16, William Tu wrote:
>>>> On Wed, Aug 14, 2019 at 7:58 AM William Tu <u9012063@gmail.com> 
>>>> wrote:
>>>>>
>>>>> On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron 
>>>>> <echaudro@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>>>>>>
>>>>>> <SNIP>
>>>>>>
>>>>>>>>>> I see a rather high number of afxdp_cq_skip, which should to 
>>>>>>>>>> my
>>>>>>>>>> knowledge never happen?
>>>>>>>>>
>>>>>>>>> I tried to investigate this previously, but didn't find 
>>>>>>>>> anything
>>>>>>>>> suspicious.
>>>>>>>>> So, for my knowledge, this should never happen too.
>>>>>>>>> However, I only looked at the code without actually running, 
>>>>>>>>> because
>>>>>>>>> I had no
>>>>>>>>> HW available for testing.
>>>>>>>>>
>>>>>>>>> While investigation and stress-testing virtual ports I found 
>>>>>>>>> few
>>>>>>>>> issues with
>>>>>>>>> missing locking inside the kernel, so there is no trust for 
>>>>>>>>> kernel
>>>>>>>>> part of XDP
>>>>>>>>> implementation from my side. I'm suspecting that there are 
>>>>>>>>> some
>>>>>>>>> other bugs in
>>>>>>>>> kernel/libbpf that only could be reproduced with driver mode.
>>>>>>>>>
>>>>>>>>> This never happens for virtual ports with SKB mode, so I never 
>>>>>>>>> saw
>>>>>>>>> this coverage
>>>>>>>>> counter being non-zero.
>>>>>>>>
>>>>>>>> Did some quick debugging, as something else has come up that 
>>>>>>>> needs my
>>>>>>>> attention :)
>>>>>>>>
>>>>>>>> But once I’m in a faulty state and sent a single packet, 
>>>>>>>> causing
>>>>>>>> afxdp_complete_tx() to be called, it tells me 2048 descriptors 
>>>>>>>> are
>>>>>>>> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess 
>>>>>>>> that
>>>>>>>> there might be some ring management bug. Maybe consumer and 
>>>>>>>> receiver
>>>>>>>> are equal meaning 0 buffers, but it returns max? I did not look 
>>>>>>>> at
>>>>>>>> the kernel code, so this is just a wild guess :)
>>>>>>>>
>>>>>>>> (gdb) p tx_done
>>>>>>>> $3 = 2048
>>>>>>>>
>>>>>>>> (gdb) p umem->cq
>>>>>>>> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask 
>>>>>>>> =
>>>>>>>> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
>>>>>>>> 0x7f08486b8040, ring = 0x7f08486b8080}
>>>>>>>
>>>>>>> Thanks for debugging!
>>>>>>>
>>>>>>> xsk_ring_cons__peek() just returns the difference between 
>>>>>>> cached_prod
>>>>>>> and cached_cons, but these values are too different:
>>>>>>>
>>>>>>> 3830466864 - 3578066899 = 252399965
>>>>>>>
>>>>>>> Since this value > requested, it returns requested number 
>>>>>>> (2048).
>>>>>>>
>>>>>>> So, the ring is broken. At least broken its 'cached' part. It'll 
>>>>>>> be
>>>>>>> good
>>>>>>> to look at *consumer and *producer values to verify the state of 
>>>>>>> the
>>>>>>> actual ring.
>>>>>>>
>>>>>>
>>>>>> I’ll try to find some more time next week to debug further.
>>>>>>
>>>>>> William I noticed your email in xdp-newbies where you mention 
>>>>>> this
>>>>>> problem of getting the wrong pointers. Did you ever follow up, or 
>>>>>> did
>>>>>> further trouble shooting on the above?
>>>>>
>>>>> Yes, I posted here
>>>>> https://www.spinics.net/lists/xdp-newbies/msg00956.html
>>>>> "Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"
>>>>>
>>>>> At that time I was thinking about reproducing the problem using 
>>>>> the
>>>>> xdpsock sample code from kernel. But turned out that my 
>>>>> reproduction
>>>>> code is not correct, so not able to show the case we hit here in 
>>>>> OVS.
>>>>>
>>>>> Then I put more similar code logic from OVS to xdpsock, but the 
>>>>> problem
>>>>> does not show up. As a result, I worked around it by marking addr 
>>>>> as
>>>>> "*addr == UINT64_MAX".
>>>>>
>>>>> I will debug again this week once I get my testbed back.
>>>>>
>>>> Just to refresh my memory. The original issue is that
>>>> when calling:
>>>> tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
>>>> xsk_ring_cons__release(&umem->cq, tx_done);
>>>>
>>>> I expect there are 'tx_done' elems on the CQ to re-cycle back to 
>>>> memory pool.
>>>> However, when I inspect these elems, I found some elems that 
>>>> 'already' been
>>>> reported complete last time I call xsk_ring_cons__peek. In other 
>>>> word, some
>>>> elems show up at CQ twice. And this cause overflow of the mempool.
>>>>
>>>> Thus, mark the elems on CQ as UINT64_MAX to indicate that we 
>>>> already
>>>> seen this elem.
>>>
>>> William, Eelco, which HW NIC you're using? Which kernel driver?
>>
>> I’m using the below on the latest bpf-next driver:
>>
>> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
>> SFI/SFP+ Network Connection (rev 01)
>> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
>> SFI/SFP+ Network Connection (rev 01)
>
> Thanks for information.
> I found one suspicious place inside the ixgbe driver that could break
> the completion queue ring and prepared a patch:
>     https://patchwork.ozlabs.org/patch/1150244/
>
> It'll be good if you can test it.

Hi Ilya, I was doing some testing of my own, and also concluded it was
in the driver's completion ring. I noticed that after sending 512 packets
the driver's TX counters kept increasing, which looks related to your fix.

Will try it out and send the results to the upstream mailing list…

Thanks,

Eelco
William Tu Aug. 23, 2019, 4:08 p.m. UTC | #14
On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> >>> William, Eelco, which HW NIC you're using? Which kernel driver?
> >>
> >> I’m using the below on the latest bpf-next driver:
> >>
> >> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> SFI/SFP+ Network Connection (rev 01)
> >> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> SFI/SFP+ Network Connection (rev 01)
> >
> > Thanks for information.
> > I found one suspicious place inside the ixgbe driver that could break
> > the completion queue ring and prepared a patch:
> >     https://patchwork.ozlabs.org/patch/1150244/
> >
> > It'll be good if you can test it.
>
> Hi Ilya, I was doping some testing of my own, and also concluded it was
> in the drivers' completion ring. I noticed after sending 512 packets the
> drivers TX counters kept increasing, which looks related to your fix.
>
> Will try it out, and sent results to the upstream mailing list…
>
> Thanks,
>
> Eelco

Hi,

I'm comparing the performance of netdev-afxdp.c on current master with
DPDK's AF_XDP implementation in the OVS dpdk-latest branch.
I'm using ixgbe and doing physical port to physical port forwarding, sending
64-byte packets, with the OpenFlow rule:
  ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"

In short:
A. OVS netdev-afxdp: 6.1 Mpps
B. OVS-DPDK AF_XDP pmd: 8 Mpps
So I started thinking about how to optimize lib/netdev-afxdp.c. Any comments
are welcome! Below is the analysis:

A. OVS netdev-afxdp Physical to physical 6.1Mpps
# pstree -p 702
ovs-vswitchd(702)-+-{ct_clean1}(706)
                  |-{handler4}(712)
                  |-{ipf_clean2}(707)
                  |-{pmd6}(790)
                  |-{pmd7}(791)
                  |-{pmd8}(792)
                  |-{pmd9}(793)
                  |-{revalidator5}(713)
                  `-{urcu3}(708)

# ovs-appctl  dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 6:
  packets received: 92290351
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 92290319
  smc hits: 0
  megaflow hits: 31
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 0
  avg. packets per output batch: 31.88
  idle cycles: 20835727677 (34.86%)           --> pretty high!?
  processing cycles: 38932097052 (65.14%)
  avg cycles per packet: 647.61 (59767824729/92290351)
  avg processing cycles per packet: 421.84 (38932097052/92290351)

# ./perf record -t 790 sleep 10
  13.80%  pmd6 ovs-vswitchd        [.] miniflow_extract
  13.58%  pmd6 ovs-vswitchd        [.] __netdev_afxdp_batch_send
   9.64%  pmd6   ovs-vswitchd        [.] dp_netdev_input__
   9.07%  pmd6   ovs-vswitchd        [.] dp_packet_init__
   8.91%  pmd6   ovs-vswitchd        [.] netdev_afxdp_rxq_recv
   7.40%  pmd6   ovs-vswitchd        [.] miniflow_hash_5tuple
   5.32%  pmd6   libc-2.23.so        [.] __memcpy_avx_unaligned
   4.60%  pmd6   [kernel.vmlinux]    [k] do_syscall_64
   3.72%  pmd6   ovs-vswitchd        [.] dp_packet_use_afxdp    --> maybe optimize?
   2.74%  pmd6   libpthread-2.23.so  [.] __pthread_enable_asynccancel
   2.43%  pmd6   [kernel.vmlinux]    [k] fput_many
   2.18%  pmd6   libc-2.23.so        [.] __memcmp_sse4_1
   2.06%  pmd6   [kernel.vmlinux]    [k] entry_SYSCALL_64
   1.79%  pmd6   [kernel.vmlinux]    [k] syscall_return_via_sysret
   1.71%  pmd6   ovs-vswitchd        [.] dp_execute_cb
   1.03%  pmd6   ovs-vswitchd        [.] non_atomic_ullong_add
   0.86%  pmd6   ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port

B. OVS-DPDK afxdp using dpdk-latest 8Mpps
ovs-vswitchd(19462)-+-{ct_clean3}(19470)
                    |-{dpdk_watchdog1}(19468)
                    |-{eal-intr-thread}(19466)
                    |-{handler16}(19501)
                    |-{handler17}(19505)
                    |-{handler18}(19506)
                    |-{handler19}(19507)
                    |-{handler20}(19508)
                    |-{handler22}(19502)
                    |-{handler24}(19504)
                    |-{handler26}(19503)
                    |-{ipf_clean4}(19471)
                    |-{pmd27}(19536)
                    |-{revalidator21}(19509)
                    |-{revalidator23}(19511)
                    |-{revalidator25}(19510)
                    |-{rte_mp_handle}(19467)
                    `-{urcu2}(19469)

# ovs-appctl  dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 11:
  packets received: 1813689117
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 1813689053
  smc hits: 0
  megaflow hits: 63
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 0
  avg. packets per output batch: 31.85
  idle cycles: 13848892341 (2.50%)
  processing cycles: 541064826249 (97.50%)
  avg cycles per packet: 305.96 (554913718590/1813689117)
  avg processing cycles per packet: 298.32 (541064826249/1813689117)

#  ./perf record -t 19536 sleep 10
  24.84%  pmd27 ovs-vswitchd        [.] eth_af_xdp_rx
  16.27%  pmd27 ovs-vswitchd        [.] eth_af_xdp_tx
  13.20%  pmd27 ovs-vswitchd        [.] dp_netdev_input__
  12.54%  pmd27 ovs-vswitchd        [.] pull_umem_cq
  10.85%  pmd27 ovs-vswitchd        [.] miniflow_extract
   5.67%  pmd27   ovs-vswitchd        [.] miniflow_hash_5tuple
   3.41%  pmd27   libc-2.23.so        [.] __memcmp_sse4_1
   2.14%  pmd27   ovs-vswitchd        [.] netdev_dpdk_rxq_recv
   2.13%  pmd27   ovs-vswitchd        [.] dp_execute_cb
   1.50%  pmd27   ovs-vswitchd        [.] non_atomic_ullong_add
   1.49%  pmd27   ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
   1.05%  pmd27   ovs-vswitchd        [.] netdev_dpdk_filter_packet_len
   0.79%  pmd27   ovs-vswitchd        [.] pmd_perf_end_iteration
   0.74%  pmd27   ovs-vswitchd        [.] dp_netdev_process_rxq_port
   0.47%  pmd27   ovs-vswitchd        [.] memcmp@plt
   0.42%  pmd27   ovs-vswitchd        [.] netdev_dpdk_eth_send
Ilya Maximets Aug. 23, 2019, 4:59 p.m. UTC | #15
On 23.08.2019 19:08, William Tu wrote:
> On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>
>>
>>
>>>>> William, Eelco, which HW NIC you're using? Which kernel driver?
>>>>
>>>> I’m using the below on the latest bpf-next driver:
>>>>
>>>> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
>>>> SFI/SFP+ Network Connection (rev 01)
>>>> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
>>>> SFI/SFP+ Network Connection (rev 01)
>>>
>>> Thanks for information.
>>> I found one suspicious place inside the ixgbe driver that could break
>>> the completion queue ring and prepared a patch:
>>>     https://protect2.fireeye.com/url?k=ac2418ed930ec67f.ac2593a2-94283087c2dd9833&u=https://patchwork.ozlabs.org/patch/1150244/
>>>
>>> It'll be good if you can test it.
>>
>> Hi Ilya, I was doping some testing of my own, and also concluded it was
>> in the drivers' completion ring. I noticed after sending 512 packets the
>> drivers TX counters kept increasing, which looks related to your fix.
>>
>> Will try it out, and sent results to the upstream mailing list…
>>
>> Thanks,
>>
>> Eelco
> 
> Hi,
> 
> I'm comparing the performance of netdev-afxdp.c on current master and
> the DPDK's AF_XDP implementation in OVS dpdk-latest branch.
> I'm using ixgbe and doing physical port to physical port forwarding, sending
> 64 byte packets, with OpenFlow rule:
>   ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"
> 
> In short
> A. OVS's netdev-afxdp: 6.1Mpps
> B. OVS-DPDK  AF_XDP pmd: 8Mpps
> So I start to think about how to optimize lib/netdev-afxdp.c. Any comments are
> welcomed! Below is the analysis:

One major difference is that the DPDK implementation supports XDP_USE_NEED_WAKEUP,
and it will be in use if you're building the kernel from the latest bpf-next tree.
This allows significantly decreasing the number of syscalls.
According to the perf stats below, the OVS implementation, unlike the DPDK one,
wastes ~11% of its time inside the kernel, and this could be fixed by the
need_wakeup feature.

BTW, there are a lot of pmd threads in case A, but only one in case B.
Was the test setup really equal?

Best regards, Ilya Maximets.

> 
> A. OVS netdev-afxdp Physical to physical 6.1Mpps
> # pstree -p 702
> ovs-vswitchd(702)-+-{ct_clean1}(706)
>                   |-{handler4}(712)
>                   |-{ipf_clean2}(707)
>                   |-{pmd6}(790)
>                   |-{pmd7}(791)
>                   |-{pmd8}(792)
>                   |-{pmd9}(793)
>                 |-{revalidator5}(713)
>                   `-{urcu3}(708)
> 
> # ovs-appctl  dpif-netdev/pmd-stats-show
> pmd thread numa_id 0 core_id 6:
>   packets received: 92290351
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 92290319
>   smc hits: 0
>   megaflow hits: 31
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 0
>   avg. packets per output batch: 31.88
>   idle cycles: 20835727677 (34.86%)           --> pretty high!?
>   processing cycles: 38932097052 (65.14%)
>   avg cycles per packet: 647.61 (59767824729/92290351)
>   avg processing cycles per packet: 421.84 (38932097052/92290351)
> 
> # ./perf record -t 790 sleep 10
>   13.80%  pmd6 ovs-vswitchd        [.] miniflow_extract
>   13.58%  pmd6 ovs-vswitchd        [.] __netdev_afxdp_batch_send
>    9.64%  pmd6   ovs-vswitchd        [.] dp_netdev_input__
>    9.07%  pmd6   ovs-vswitchd        [.] dp_packet_init__
>    8.91%  pmd6   ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>    7.40%  pmd6   ovs-vswitchd        [.] miniflow_hash_5tuple
>    5.32%  pmd6   libc-2.23.so        [.] __memcpy_avx_unaligned
>    4.60%  pmd6   [kernel.vmlinux]    [k] do_syscall_64
>    3.72%  pmd6   ovs-vswitchd        [.] dp_packet_use_afxdp    -->
> maybe optimize?
>    2.74%  pmd6   libpthread-2.23.so  [.] __pthread_enable_asynccancel
>    2.43%  pmd6   [kernel.vmlinux]    [k] fput_many
>    2.18%  pmd6   libc-2.23.so        [.] __memcmp_sse4_1
>    2.06%  pmd6   [kernel.vmlinux]    [k] entry_SYSCALL_64
>    1.79%  pmd6   [kernel.vmlinux]    [k] syscall_return_via_sysret
>    1.71%  pmd6   ovs-vswitchd        [.] dp_execute_cb
>    1.03%  pmd6   ovs-vswitchd        [.] non_atomic_ullong_add
>    0.86%  pmd6   ovs-vswitchd        [.]dp_netdev_pmd_flush_output_on_port
> 
> B. OVS-DPDK afxdp using dpdk-latest 8Mpps
> ovs-vswitchd(19462)-+-{ct_clean3}(19470)
>                     |-{dpdk_watchdog1}(19468)
>                     |-{eal-intr-thread}(19466)
>                     |-{handler16}(19501)
>                     |-{handler17}(19505)
>                     |-{handler18}(19506)
>                     |-{handler19}(19507)
>                     |-{handler20}(19508)
>                     |-{handler22}(19502)
>                     |-{handler24}(19504)
>                     |-{handler26}(19503)
>                     |-{ipf_clean4}(19471)
>                     |-{pmd27}(19536)
>                     |-{revalidator21}(19509)
>                     |-{revalidator23}(19511)
>                     |-{revalidator25}(19510)
>                     |-{rte_mp_handle}(19467)
>                     `-{urcu2}(19469)
> 
> # ovs-appctl  dpif-netdev/pmd-stats-show
> pmd thread numa_id 0 core_id 11:
>   packets received: 1813689117
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 1813689053
>   smc hits: 0
>   megaflow hits: 63
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 0
>   avg. packets per output batch: 31.85
>   idle cycles: 13848892341 (2.50%)
>   processing cycles: 541064826249 (97.50%)
>   avg cycles per packet: 305.96 (554913718590/1813689117)
>   avg processing cycles per packet: 298.32 (541064826249/1813689117)
> 
> #  ./perf record -t 19536 sleep 10
>   24.84%  pmd27 ovs-vswitchd        [.] eth_af_xdp_rx
>   16.27%  pmd27 ovs-vswitchd        [.] eth_af_xdp_tx
>   13.20%  pmd27 ovs-vswitchd        [.] dp_netdev_input__
>   12.54%  pmd27 ovs-vswitchd        [.] pull_umem_cq
>   10.85%  pmd27 ovs-vswitchd        [.] miniflow_extract
>    5.67%  pmd27   ovs-vswitchd        [.] miniflow_hash_5tuple
>    3.41%  pmd27   libc-2.23.so        [.] __memcmp_sse4_1
>    2.14%  pmd27   ovs-vswitchd        [.] netdev_dpdk_rxq_recv
>    2.13%  pmd27   ovs-vswitchd        [.] dp_execute_cb
>    1.50%  pmd27   ovs-vswitchd        [.] non_atomic_ullong_add
>    1.49%  pmd27   ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
>    1.05%  pmd27   ovs-vswitchd        [.] netdev_dpdk_filter_packet_len
>    0.79%  pmd27   ovs-vswitchd        [.] pmd_perf_end_iteration
>    0.74%  pmd27   ovs-vswitchd        [.] dp_netdev_process_rxq_port
>    0.47%  pmd27   ovs-vswitchd        [.] memcmp@plt
>    0.42%  pmd27   ovs-vswitchd        [.] netdev_dpdk_eth_send
> 
>
William Tu Aug. 23, 2019, 5:08 p.m. UTC | #16
On Fri, Aug 23, 2019 at 9:59 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 23.08.2019 19:08, William Tu wrote:
> > On Wed, Aug 21, 2019 at 2:31 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >>
> >>
> >>
> >>>>> William, Eelco, which HW NIC you're using? Which kernel driver?
> >>>>
> >>>> I’m using the below on the latest bpf-next driver:
> >>>>
> >>>> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >>>> SFI/SFP+ Network Connection (rev 01)
> >>>> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >>>> SFI/SFP+ Network Connection (rev 01)
> >>>
> >>> Thanks for information.
> >>> I found one suspicious place inside the ixgbe driver that could break
> >>> the completion queue ring and prepared a patch:
> >>>     https://protect2.fireeye.com/url?k=ac2418ed930ec67f.ac2593a2-94283087c2dd9833&u=https://patchwork.ozlabs.org/patch/1150244/
> >>>
> >>> It'll be good if you can test it.
> >>
> >> Hi Ilya, I was doping some testing of my own, and also concluded it was
> >> in the drivers' completion ring. I noticed after sending 512 packets the
> >> drivers TX counters kept increasing, which looks related to your fix.
> >>
> >> Will try it out, and sent results to the upstream mailing list…
> >>
> >> Thanks,
> >>
> >> Eelco
> >
> > Hi,
> >
> > I'm comparing the performance of netdev-afxdp.c on current master and
> > the DPDK's AF_XDP implementation in OVS dpdk-latest branch.
> > I'm using ixgbe and doing physical port to physical port forwarding, sending
> > 64 byte packets, with OpenFlow rule:
> >   ovs-ofctl  add-flow br0  "in_port=eth2, actions=output:eth3"
> >
> > In short
> > A. OVS's netdev-afxdp: 6.1Mpps
> > B. OVS-DPDK  AF_XDP pmd: 8Mpps
> > So I start to think about how to optimize lib/netdev-afxdp.c. Any comments are
> > welcomed! Below is the analysis:
>
> One major difference is that DPDK implementation supports XDP_USE_NEED_WAKEUP
> and it will be in use if you're building kernel from latest bpf-next tree.
> This allowes to significantly decrease number of syscalls.
> According to below perf stats, OVS implementation unlike dpdk one wastes ~11%
> of time inside the kernel and this could be fixed by need_wakeup feature.

Cool, thank you.
I will look at how to use XDP_USE_NEED_WAKEUP
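
For reference, a minimal sketch of how XDP_USE_NEED_WAKEUP is typically wired up
with libbpf's xsk helpers: request the flag at bind time, then only enter the
kernel when the fill or tx ring actually has the need-wakeup bit set.  This
assumes a kernel and libbpf new enough to define XDP_USE_NEED_WAKEUP and
xsk_ring_prod__needs_wakeup(); the my_* names are placeholders, not
netdev-afxdp.c code:

/* Sketch only, not OVS code: the usual need_wakeup pattern. */
#include <poll.h>
#include <sys/socket.h>
#include <linux/if_link.h>
#include <linux/if_xdp.h>
#include <bpf/xsk.h>

/* Request the need_wakeup flag when binding the socket. */
static void
my_socket_config(struct xsk_socket_config *cfg)
{
    cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
    cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
    cfg->libbpf_flags = 0;
    cfg->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
    cfg->bind_flags = XDP_ZEROCOPY | XDP_USE_NEED_WAKEUP;
}

/* After refilling the fill ring: poll() only if the kernel asked for it. */
static void
my_kick_fill(struct xsk_socket *xsk, struct xsk_ring_prod *fq)
{
    if (xsk_ring_prod__needs_wakeup(fq)) {
        struct pollfd pfd = { .fd = xsk_socket__fd(xsk), .events = POLLIN };

        poll(&pfd, 1, 0);
    }
}

/* After submitting tx descriptors: sendto() only when the flag is set,
 * instead of issuing a syscall for every batch. */
static void
my_kick_tx(struct xsk_socket *xsk, struct xsk_ring_prod *tx)
{
    if (xsk_ring_prod__needs_wakeup(tx)) {
        sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
    }
}

Presumably this would map onto the existing TX-kick and fill-queue paths, gated
on a configure-time check so that older kernels still build.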

>
> BTW, there are a lot of pmd threads in case A, but only one in case B.
> Was the test setup really equal?

Yes, they should be equal.
In case A I accidentally added a pmd-cpu-mask=0xf0,
so it uses more CPUs, but I always enable only one queue, and pmd-stats-show
shows the other PMDs are doing nothing. Will fix it next time.

Regards,
William
diff mbox series

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 8472921746ba..2a3214a3cc7f 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 331353fd337a..bace34dbf91b 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..820e9d993d8f
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,432 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but to cooperate better with the existing kernel's
+networking stack.  An AF_XDP socket receives and sends packets from an
+eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
+kernel's subsystems.  As a result, an AF_XDP socket shows much better
+performance than AF_PACKET.  For more details about AF_XDP, please see the
+Linux kernel's Documentation/networking/af_xdp.rst.
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+dpdk.  The AF_XDP feature adds a new netdev type called
+"afxdp" and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from the xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of in the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements for AF_XDP are met.
+
+
+Installing
+----------
+For OVS to use the AF_XDP netdev, it has to be configured with libbpf support.
+First, clone a recent version of Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure having 5.0+ kernel
+  make check-afxdp TESTSUITEFLAGS='1'
+
+.. note::
+   Not all test cases pass at this time.  Currently all TCP-related
+   tests, e.g. those using wget or http, are skipped due to XDP limitations
+   on veth.  The cvlan test is also skipped.
+
+If a test case fails, check the log at::
+
+  cat \
+  tests/system-afxdp-testsuite.dir/<test num>/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure that libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using the userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**.  The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+.. note::
+   pmd-rxq-affinity is optional.  If not specified, the system will auto-assign.
+
+To validate that the bridge has been successfully instantiated, run::
+
+  ovs-vsctl show
+
+This should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debugging by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
+vSwitch"[1], section 4, and in the slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to AF_XDP's current and future work.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing the
+PMD to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is on the same NUMA node (memory used by AF_XDP, PMD
+   cores, device plug-in slot).
+
+#. Isolate your CPU cores by setting isolcpus in the grub configuration.
+
+#. IRQs should not be pinned to the PMD cores.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+
+Debugging performance issue
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peer::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. The device's NUMA ID is always 0; we need a way to find the NUMA ID from a
+   netdev.
+#. No QoS support, because the AF_XDP netdev bypasses the Linux TC layer.  A
+   possible work-around is to use the OpenFlow meter action.
+#. Most of the tests are done using a single i40e port.  Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test results yet (TODO item).
+#. Due to limitations of the current upstream kernel, TCP and various offloads
+   (vlan, cvlan) do not work over virtual interfaces (i.e. a veth pair).
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC and a tap device connected to the
+VM.  First, start OVS, then add the physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0 -- set interface tap0
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AF_XDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/NEWS b/NEWS
index 806e3c84c992..ca9ac76a1039 100644
--- a/NEWS
+++ b/NEWS
@@ -62,6 +62,7 @@  Post-v2.11.0
    - 'ovs-dpctl dump-flows' is no longer suitable for dumping offloaded flows.
      'ovs-appctl dpctl/dump-flows' should be used instead.
    - Add L2 GRE tunnel over IPv6 support.
+   - Add Linux AF_XDP support through a new experimental netdev type, "afxdp".
 
 
 v2.11.0 - 19 Feb 2019
diff --git a/acinclude.m4 b/acinclude.m4
index b8c9d6c06fba..9e1569b07c73 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -238,6 +238,41 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_FUNCS([pthread_spin_lock], [],
+      [AC_MSG_ERROR([unable to find pthread_spin_lock for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index dd2a674af0c9..c33935499e1c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -98,6 +98,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index 1b89cac8c3a2..0139658651f9 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -9,6 +9,7 @@  lib_LTLIBRARIES += lib/libopenvswitch.la
 
 lib_libopenvswitch_la_LIBADD = $(SSL_LIBS)
 lib_libopenvswitch_la_LIBADD += $(CAPNG_LDADD)
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
 
 if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
@@ -394,6 +395,7 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-offload-tc.c \
 	lib/netlink-conntrack.c \
 	lib/netlink-conntrack.h \
@@ -410,6 +412,14 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/netdev-afxdp-pool.c \
+	lib/netdev-afxdp-pool.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..62d7faa4c59a 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@ 
 #include <string.h>
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,22 @@  dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initialize 'b' as an empty dp_packet that contains
+ * memory starting at AF_XDP umem base.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *data, size_t allocated,
+                    size_t headroom)
+{
+    dp_packet_set_base(b, (char *)data - headroom);
+    dp_packet_set_data(b, data);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_init__(b, allocated, DPBUF_AFXDP);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +139,8 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +267,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +455,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..14f0897fa637 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* Buffer data from XDP frame. */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +91,13 @@  struct dp_packet {
     };
 };
 
+#if HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +131,9 @@  static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#if HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +195,11 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..244813ffe168 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -21,6 +21,7 @@ 
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
+#include <time.h>
 #include <math.h>
 
 #ifdef DPDK_NETDEV
@@ -186,6 +187,22 @@  struct pmd_perf_stats {
     char *log_reason;
 };
 
+#ifdef __linux__
+static inline uint64_t
+rdtsc_syscall(struct pmd_perf_stats *s)
+{
+    struct timespec val;
+    uint64_t v;
+
+    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
+       return s->last_tsc;
+    }
+
+    v  = val.tv_sec * UINT64_C(1000000000) + val.tv_nsec;
+    return s->last_tsc = v;
+}
+#endif
+
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
  * These functions are intended to be invoked in the context of pmd threads. */
 
@@ -198,6 +215,13 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif !defined(_MSC_VER) && defined(__x86_64__)
+    uint32_t h, l;
+    asm volatile("rdtsc" : "=a" (l), "=d" (h));
+
+    return s->last_tsc = ((uint64_t) h << 32) | l;
+#elif defined(__linux__)
+    return rdtsc_syscall(s);
 #else
     return s->last_tsc = 0;
 #endif
diff --git a/lib/netdev-afxdp-pool.c b/lib/netdev-afxdp-pool.c
new file mode 100644
index 000000000000..6d29da4d707d
--- /dev/null
+++ b/lib/netdev-afxdp-pool.c
@@ -0,0 +1,167 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+
+#include "dp-packet.h"
+#include "netdev-afxdp-pool.h"
+#include "openvswitch/util.h"
+
+/* Note:
+ * umem_elem_push* shouldn't overflow because we always pop
+ * an element first, then push it back onto the stack.
+ */
+static inline void
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    ovs_assert(umemp->index + n <= umemp->size);
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+}
+
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    ovs_assert(umemp->index + 1 <= umemp->size);
+
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->lock);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ret;
+}
+
+static inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
+        return NULL;
+    }
+
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->lock);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ptr;
+}
+
+static void **
+__umem_pool_alloc(unsigned int size)
+{
+    void **bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof *bufs);
+    memset(bufs, 0, size * sizeof *bufs);
+
+    return bufs;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        return -ENOMEM;
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spin_init(&umemp->lock);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free_pagealign(umemp->array);
+    umemp->array = NULL;
+    ovs_spin_destroy(&umemp->lock);
+}
+
+unsigned int
+umem_pool_count(struct umem_pool *umemp)
+{
+    return umemp->index;
+}
+
+/* AF_XDP metadata init/destroy. */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    xp->array = xmalloc_pagealign(size * sizeof *xp->array);
+    xp->size = size;
+
+    memset(xp->array, 0, size * sizeof *xp->array);
+
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free_pagealign(xp->array);
+    xp->array = NULL;
+}
diff --git a/lib/netdev-afxdp-pool.h b/lib/netdev-afxdp-pool.h
new file mode 100644
index 000000000000..a8c7e2b8cc9c
--- /dev/null
+++ b/lib/netdev-afxdp-pool.h
@@ -0,0 +1,58 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <stdbool.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+
+/* LIFO ptr_array. */
+struct umem_pool {
+    int index;      /* Point to top. */
+    unsigned int size;
+    struct ovs_spin lock;
+    void **array;   /* A pointer array pointing to umem buf. */
+};
+
+/* Array-based dp_packet_afxdp. */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp *array;
+};
+
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *umem_elem_pop(struct umem_pool *umemp);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+unsigned int umem_pool_count(struct umem_pool *umemp);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
+#endif
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..e5eaf978e5b1
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,1041 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-afxdp.h"
+#include "netdev-afxdp-pool.h"
+
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <stdlib.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dpif-netdev.h"
+#include "fatal-signal.h"
+#include "openvswitch/compiler.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/list.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "socket-util.h"
+#include "util.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+COVERAGE_DEFINE(afxdp_cq_empty);
+COVERAGE_DEFINE(afxdp_fq_full);
+COVERAGE_DEFINE(afxdp_tx_full);
+COVERAGE_DEFINE(afxdp_cq_skip);
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define MAX_XSKQ            16
+#define FRAME_HEADROOM      XDP_PACKET_HEADROOM
+#define OVS_XDP_HEADROOM    128
+#define FRAME_SIZE          XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT         XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS      XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS      XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* The worst case is all 4 queues (TX/CQ/RX/FILL) being full plus some packets
+ * still being processed in threads.  The number of packets currently being
+ * processed in OVS is hard to estimate because it depends on the number of
+ * ports.  Setting NUM_FRAMES to twice the total of the ring sizes should be
+ * enough for most corner cases.
+ */
+#define NUM_FRAMES          (4 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+#define BATCH_SIZE          NETDEV_MAX_BURST
+
+BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
+BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
+
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+static int xsk_configure_all(struct netdev *netdev);
+static void xsk_destroy_all(struct netdev *netdev);
+
+struct unused_pool {
+    struct xsk_umem_info *umem_info;
+    int lost_in_rings; /* Number of packets left in tx, rx, cq and fq. */
+    struct ovs_list list_node;
+};
+
+static struct ovs_mutex unused_pools_mutex = OVS_MUTEX_INITIALIZER;
+static struct ovs_list unused_pools OVS_GUARDED_BY(unused_pools_mutex) =
+    OVS_LIST_INITIALIZER(&unused_pools);
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    uint32_t outstanding_tx; /* Number of descriptors filled in tx and cq. */
+    uint32_t available_rx;   /* Number of descriptors filled in rx and fq. */
+    atomic_uint64_t tx_dropped;
+};
+
+static void
+netdev_afxdp_cleanup_unused_pool(struct unused_pool *pool)
+{
+    /* free the packet buffer */
+    free_pagealign(pool->umem_info->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&pool->umem_info->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&pool->umem_info->xpool);
+
+    free(pool->umem_info);
+}
+
+static void
+netdev_afxdp_sweep_unused_pools(void *aux OVS_UNUSED)
+{
+    struct unused_pool *pool, *next;
+    unsigned int count;
+
+    ovs_mutex_lock(&unused_pools_mutex);
+    LIST_FOR_EACH_SAFE (pool, next, list_node, &unused_pools) {
+
+        count = umem_pool_count(&pool->umem_info->mpool);
+        ovs_assert(count + pool->lost_in_rings <= NUM_FRAMES);
+
+        if (count + pool->lost_in_rings == NUM_FRAMES) {
+            /* OVS doesn't use this memory pool anymore.  Kernel doesn't
+             * use it since closing the xdp socket.  So, it's safe to free
+             * the pool now. */
+            VLOG_DBG("Freeing umem pool at 0x%"PRIxPTR,
+                     (uintptr_t) pool->umem_info);
+            ovs_list_remove(&pool->list_node);
+            netdev_afxdp_cleanup_unused_pool(pool);
+            free(pool);
+        }
+    }
+    ovs_mutex_unlock(&unused_pools_mutex);
+}
+
+static struct xsk_umem_info *
+xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
+{
+    struct xsk_umem_config uconfig;
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xzalloc(sizeof *umem);
+
+    uconfig.fill_size = PROD_NUM_DESCS;
+    uconfig.comp_size = CONS_NUM_DESCS;
+    uconfig.frame_size = FRAME_SIZE;
+    uconfig.frame_headroom = OVS_XDP_HEADROOM;
+
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           &uconfig);
+    if (ret) {
+        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV");
+        free(umem);
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* Set-up umem pool. */
+    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("umem_pool_init failed");
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        void *elem;
+
+        elem = ALIGNED_CAST(void *, (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* Set-up metadata. */
+    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("xpacket_pool_init failed");
+        umem_pool_cleanup(&umem->mpool);
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    VLOG_DBG("%s: xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = &umem->xpool.array[i];
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0, prog_id;
+    int ret;
+    int i;
+
+    xsk = xzalloc(sizeof *xsk);
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        free(xsk);
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB": "DRV",
+                 queue_id);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded. */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        free(xsk);
+        return NULL;
+    }
+
+    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
+                                   PROD_NUM_DESCS, &idx)) {
+        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
+    }
+
+    for (i = 0;
+         i < PROD_NUM_DESCS * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        void *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+
+    netdev_afxdp_sweep_unused_pools(NULL);
+
+    /* Umem memory region. */
+    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+
+    /* Create AF_XDP socket. */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        free_pagealign(bufs);
+        return NULL;
+    }
+
+    VLOG_DBG("Allocated umem pool at 0x%"PRIxPTR, (uintptr_t) umem);
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* Clean up umem and xpacket pool. */
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed.");
+        }
+        free_pagealign(bufs);
+        umem_pool_cleanup(&umem->mpool);
+        xpacket_pool_cleanup(&umem->xpool);
+        free(umem);
+    }
+    return xsk;
+}
+
+static int
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    int i, ifindex, n_rxq, n_txq;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    ovs_assert(dev->xsks == NULL);
+    ovs_assert(dev->tx_locks == NULL);
+
+    n_rxq = netdev_n_rxq(netdev);
+    dev->xsks = xcalloc(n_rxq, sizeof *dev->xsks);
+
+    /* Configure each queue. */
+    for (i = 0; i < n_rxq; i++) {
+        VLOG_INFO("%s: configure queue %d mode %s", __func__, i,
+                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk_info = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk_info) {
+            VLOG_ERR("Failed to create AF_XDP socket on queue %d.", i);
+            dev->xsks[i] = NULL;
+            goto err;
+        }
+        dev->xsks[i] = xsk_info;
+        atomic_init(&xsk_info->tx_dropped, 0);
+        xsk_info->outstanding_tx = 0;
+        xsk_info->available_rx = PROD_NUM_DESCS;
+    }
+
+    n_txq = netdev_n_txq(netdev);
+    dev->tx_locks = xcalloc(n_txq, sizeof *dev->tx_locks);
+
+    for (i = 0; i < n_txq; i++) {
+        ovs_spin_init(&dev->tx_locks[i]);
+    }
+
+    return 0;
+
+err:
+    xsk_destroy_all(netdev);
+    return EINVAL;
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk_info)
+{
+    struct xsk_umem *umem;
+    struct unused_pool *pool;
+
+    xsk_socket__delete(xsk_info->xsk);
+    xsk_info->xsk = NULL;
+
+    umem = xsk_info->umem->umem;
+    if (xsk_umem__delete(umem)) {
+        VLOG_ERR("xsk_umem__delete failed.");
+    }
+
+    pool = xzalloc(sizeof *pool);
+    pool->umem_info = xsk_info->umem;
+    pool->lost_in_rings = xsk_info->outstanding_tx + xsk_info->available_rx;
+
+    ovs_mutex_lock(&unused_pools_mutex);
+    ovs_list_push_back(&unused_pools, &pool->list_node);
+    ovs_mutex_unlock(&unused_pools_mutex);
+
+    free(xsk_info);
+
+    netdev_afxdp_sweep_unused_pools(NULL);
+}
+
+static void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    if (dev->xsks) {
+        for (i = 0; i < netdev_n_rxq(netdev); i++) {
+            if (dev->xsks[i]) {
+                xsk_destroy(dev->xsks[i]);
+                dev->xsks[i] = NULL;
+                VLOG_INFO("Destroyed xsk[%d].", i);
+            }
+        }
+
+        free(dev->xsks);
+        dev->xsks = NULL;
+    }
+
+    VLOG_INFO("%s: Removing xdp program.", netdev_get_name(netdev));
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+
+    if (dev->tx_locks) {
+        for (i = 0; i < netdev_n_txq(netdev); i++) {
+            ovs_spin_destroy(&dev->tx_locks[i]);
+        }
+        free(dev->tx_locks);
+        dev->tx_locks = NULL;
+    }
+}
+
+static inline void OVS_UNUSED
+log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+               &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                stat.rx_dropped,
+                stat.rx_invalid_descs,
+                stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    const char *str_xdpmode;
+    int xdpmode, new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
+                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
+        return EINVAL;
+    }
+
+    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
+    if (!strcasecmp(str_xdpmode, "drv")) {
+        xdpmode = XDP_ZEROCOPY;
+    } else if (!strcasecmp(str_xdpmode, "skb")) {
+        xdpmode = XDP_COPY;
+    } else {
+        VLOG_ERR("%s: Incorrect xdpmode (%s).",
+                 netdev_get_name(netdev), str_xdpmode);
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (dev->requested_n_rxq != new_n_rxq
+        || dev->requested_xdpmode != xdpmode) {
+        dev->requested_n_rxq = new_n_rxq;
+        dev->requested_xdpmode = xdpmode;
+        netdev_request_reconfigure(netdev);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdpmode == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
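+/* Applies the requested number of queues and XDP mode: destroys the existing
+ * sockets, raises RLIMIT_MEMLOCK when zero-copy (DRV) mode is requested, and
+ * re-creates one AF_XDP socket per queue. */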
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+
+    netdev->n_rxq = dev->requested_n_rxq;
+    netdev->n_txq = netdev->n_rxq;
+
+    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
+        dev->xdpmode = XDP_ZEROCOPY;
+        VLOG_INFO("AF_XDP device %s in DRV mode.", netdev_get_name(netdev));
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
+                      ovs_strerror(errno));
+        }
+    } else {
+        dev->xdpmode = XDP_COPY;
+        VLOG_INFO("AF_XDP device %s in SKB mode.", netdev_get_name(netdev));
+        /* TODO: Set the rlimit back to its previous value
+         * when no device is in DRV mode. */
+    }
+
+    err = xsk_configure_all(netdev);
+    if (err) {
+        VLOG_ERR("AF_XDP device %s reconfig failed.", netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always use numa id 0.",
+              netdev_get_name(netdev));
+    return 0;
+}
+
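+/* Detaches the XDP program from the interface with the given 'ifindex' by
+ * loading fd -1 with flags matching 'xdpmode'. */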
+static void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t flags;
+
+    flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+    if (xdpmode == XDP_COPY) {
+        flags |= XDP_FLAGS_SKB_MODE;
+    } else if (xdpmode == XDP_ZEROCOPY) {
+        flags |= XDP_FLAGS_DRV_MODE;
+    }
+
+    bpf_set_link_xdp_fd(ifindex, -1, flags);
+}
+
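+/* Forcibly removes the XDP program from 'netdev'.  Called from
+ * restore_all_flags() in lib/netdev.c so that the program is not left
+ * attached to the device when OVS exits. */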
+void
+signal_remove_xdp(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    VLOG_WARN("Force removing xdp program.");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
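+/* Refills the fill queue with BATCH_SIZE frames taken from the umem memory
+ * pool, when there is room for them, so that the kernel always has buffers
+ * available for receiving packets. */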
+static inline void
+prepare_fill_queue(struct xsk_socket_info *xsk_info)
+{
+    struct xsk_umem_info *umem;
+    void *elems[BATCH_SIZE];
+    unsigned int idx_fq;
+    int i, ret;
+
+    umem = xsk_info->umem;
+
+    if (xsk_prod_nb_free(&umem->fq, BATCH_SIZE) < BATCH_SIZE) {
+        return;
+    }
+
+    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, elems);
+    if (OVS_UNLIKELY(ret)) {
+        return;
+    }
+
+    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
+        umem_elem_push_n(&umem->mpool, BATCH_SIZE, elems);
+        COVERAGE_INC(afxdp_fq_full);
+        return;
+    }
+
+    for (i = 0; i < BATCH_SIZE; i++) {
+        uint64_t index;
+        void *elem;
+
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
+    xsk_info->available_rx += BATCH_SIZE;
+}
+
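+/* Receives up to BATCH_SIZE packets from the RX ring of this queue into
+ * 'batch'.  Each dp_packet points directly into a umem frame, so no copy is
+ * made on the receive path; the frame is returned to the umem pool when the
+ * packet is freed. */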
+int
+netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+                      int *qfill)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev *netdev = rx->up.netdev;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    struct xsk_umem_info *umem;
+    uint32_t idx_rx = 0;
+    int qid = rxq_->queue_id;
+    unsigned int rcvd, i;
+
+    xsk_info = dev->xsks[qid];
+    if (!xsk_info || !xsk_info->xsk) {
+        return EAGAIN;
+    }
+
+    prepare_fill_queue(xsk_info);
+
+    umem = xsk_info->umem;
+    rx->fd = xsk_socket__fd(xsk_info->xsk);
+
+    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return EAGAIN;
+    }
+
+    /* Setup a dp_packet batch from descriptors in RX queue. */
+    for (i = 0; i < rcvd; i++) {
+        struct dp_packet_afxdp *xpacket;
+        const struct xdp_desc *desc;
+        struct dp_packet *packet;
+        uint64_t addr, index;
+        uint32_t len;
+        char *pkt;
+
+        desc = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx);
+        addr = desc->addr;
+        len = desc->len;
+
+        pkt = xsk_umem__get_data(umem->buffer, addr);
+        index = addr >> FRAME_SHIFT;
+        xpacket = &umem->xpool.array[index];
+        packet = &xpacket->packet;
+
+        /* Initialize the struct dp_packet. */
+        dp_packet_use_afxdp(packet, pkt,
+                            FRAME_SIZE - FRAME_HEADROOM,
+                            OVS_XDP_HEADROOM);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count. */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+    /* Release the RX queue. */
+    xsk_ring_cons__release(&xsk_info->rx, rcvd);
+    xsk_info->available_rx -= rcvd;
+
+    if (qfill) {
+        /* TODO: return the number of remaining packets in the queue. */
+        *qfill = 0;
+    }
+
+#ifdef AFXDP_DEBUG
+    log_xsk_stat(xsk_info);
+#endif
+    return 0;
+}
+
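+/* Wakes the kernel up to transmit the descriptors that have been placed on
+ * the TX ring. */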
+static inline int
+kick_tx(struct xsk_socket_info *xsk_info, int xdpmode)
+{
+    int ret, retries;
+    static const int KERNEL_TX_BATCH_SIZE = 16;
+
+    /* In SKB_MODE packet transmission is synchronous, and the kernel xmits
+     * only TX_BATCH_SIZE(16) packets for a single sendmsg syscall.
+     * So, we have to kick the kernel (n_packets / 16) times to be sure that
+     * all packets are transmitted. */
+    retries = (xdpmode == XDP_COPY)
+              ? xsk_info->outstanding_tx / KERNEL_TX_BATCH_SIZE
+              : 0;
+kick_retry:
+    /* This causes a system call into the kernel's xsk_sendmsg(), which
+     * invokes xsk_generic_xmit() (skb mode) or xsk_async_xmit()
+     * (driver mode). */
+    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
+                                NULL, 0);
+    if (ret < 0) {
+        if (retries-- && errno == EAGAIN) {
+            goto kick_retry;
+        }
+        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
+            return errno;
+        }
+    }
+    /* No error, or EBUSY, or too many retries on EAGAIN. */
+    return 0;
+}
+
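+/* Returns the umem frame backing the DPBUF_AFXDP packet 'p' to its memory
+ * pool. */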
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    uintptr_t addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
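+/* Returns the umem frames of all packets in 'batch' to their memory pool in
+ * a single call.  All packets must come from the same pool (see
+ * check_free_batch()). */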
+static void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+    struct dp_packet_afxdp *xpacket = NULL;
+    struct dp_packet *packet;
+    void *elems[BATCH_SIZE];
+    uintptr_t addr;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        void *base;
+
+        xpacket = dp_packet_cast_afxdp(packet);
+        base = dp_packet_base(packet);
+        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+        elems[i] = (void *)addr;
+    }
+    umem_elem_push_n(xpacket->mpool, batch->count, elems);
+    dp_packet_batch_init(batch);
+}
+
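+/* Returns true if every packet in 'batch' is a DPBUF_AFXDP packet from the
+ * same umem pool, in which case the whole batch can be freed with
+ * free_afxdp_buf_batch(). */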
+static inline bool
+check_free_batch(struct dp_packet_batch *batch)
+{
+    struct umem_pool *first_mpool = NULL;
+    struct dp_packet_afxdp *xpacket;
+    struct dp_packet *packet;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (packet->source != DPBUF_AFXDP) {
+            return false;
+        }
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (i == 0) {
+            first_mpool = xpacket->mpool;
+            continue;
+        }
+        if (xpacket->mpool != first_mpool) {
+            return false;
+        }
+    }
+    /* All packets are DPBUF_AFXDP and from the same mpool. */
+    return true;
+}
+
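+/* Reclaims transmitted frames from the completion queue and pushes them back
+ * to the umem memory pool.  Frames already reclaimed by a previous call are
+ * marked with UINT64_MAX and skipped. */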
+static inline void
+afxdp_complete_tx(struct xsk_socket_info *xsk_info)
+{
+    void *elems_push[BATCH_SIZE];
+    struct xsk_umem_info *umem;
+    uint32_t idx_cq = 0;
+    int tx_to_free = 0;
+    int tx_done, j;
+
+    umem = xsk_info->umem;
+    tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
+
+    /* Recycle back to umem pool. */
+    for (j = 0; j < tx_done; j++) {
+        uint64_t *addr;
+        void *elem;
+
+        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
+        if (*addr == UINT64_MAX) {
+            /* The elem has been pushed already. */
+            COVERAGE_INC(afxdp_cq_skip);
+            continue;
+        }
+        elem = ALIGNED_CAST(void *, (char *)umem->buffer + *addr);
+        elems_push[tx_to_free] = elem;
+        *addr = UINT64_MAX; /* Mark as pushed. */
+        tx_to_free++;
+
+        if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
+            umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
+            xsk_info->outstanding_tx -= tx_to_free;
+            tx_to_free = 0;
+        }
+    }
+
+    if (tx_done > 0) {
+        xsk_ring_cons__release(&umem->cq, tx_done);
+    } else {
+        COVERAGE_INC(afxdp_cq_empty);
+    }
+}
+
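+/* Sends 'batch' on TX queue 'qid': each packet is copied into a umem frame
+ * popped from the memory pool, the descriptors are submitted to the TX ring,
+ * and the kernel is kicked to transmit them. */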
+static inline int
+__netdev_afxdp_batch_send(struct netdev *netdev, int qid,
+                          struct dp_packet_batch *batch)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    void *elems_pop[BATCH_SIZE];
+    struct xsk_umem_info *umem;
+    struct dp_packet *packet;
+    bool free_batch = false;
+    unsigned long orig;
+    uint32_t idx = 0;
+    int error = 0;
+    int ret;
+
+    xsk_info = dev->xsks[qid];
+    if (!xsk_info || !xsk_info->xsk) {
+        goto out;
+    }
+
+    afxdp_complete_tx(xsk_info);
+
+    free_batch = check_free_batch(batch);
+
+    umem = xsk_info->umem;
+    ret = umem_elem_pop_n(&umem->mpool, batch->count, elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        atomic_add_relaxed(&xsk_info->tx_dropped, batch->count, &orig);
+        VLOG_WARN_RL(&rl, "%s: send failed due to exhausted memory pool.",
+                     netdev_get_name(netdev));
+        error = ENOMEM;
+        goto out;
+    }
+
+    /* Make sure we have enough TX descs. */
+    ret = xsk_ring_prod__reserve(&xsk_info->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        umem_elem_push_n(&umem->mpool, batch->count, elems_pop);
+        atomic_add_relaxed(&xsk_info->tx_dropped, batch->count, &orig);
+        COVERAGE_INC(afxdp_tx_full);
+        afxdp_complete_tx(xsk_info);
+        kick_tx(xsk_info, dev->xdpmode);
+        error = ENOMEM;
+        goto out;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        uint64_t index;
+        void *elem;
+
+        elem = elems_pop[i];
+        /* Copy the packet into the umem frame we just popped from the
+         * umem pool.
+         * TODO: avoid this copy if the packet and the popped frame
+         * are located in the same umem. */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk_info->tx, batch->count);
+    xsk_info->outstanding_tx += batch->count;
+
+    ret = kick_tx(xsk_info, dev->xdpmode);
+    if (OVS_UNLIKELY(ret)) {
+        VLOG_WARN_RL(&rl, "%s: error sending AF_XDP packet: %s.",
+                     netdev_get_name(netdev), ovs_strerror(ret));
+    }
+
+out:
+    if (free_batch) {
+        free_afxdp_buf_batch(batch);
+    } else {
+        dp_packet_delete_batch(batch, true);
+    }
+
+    return error;
+}
+
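+/* Transmit wrapper that serializes access with a per-queue spin lock when
+ * the TX queue may be used by multiple threads concurrently. */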
+int
+netdev_afxdp_batch_send(struct netdev *netdev, int qid,
+                        struct dp_packet_batch *batch,
+                        bool concurrent_txq)
+{
+    struct netdev_linux *dev;
+    int ret;
+
+    if (concurrent_txq) {
+        dev = netdev_linux_cast(netdev);
+        qid = qid % netdev_n_txq(netdev);
+
+        ovs_spin_lock(&dev->tx_locks[qid]);
+        ret = __netdev_afxdp_batch_send(netdev, qid, batch);
+        ovs_spin_unlock(&dev->tx_locks[qid]);
+    } else {
+        ret = __netdev_afxdp_batch_send(netdev, qid, batch);
+    }
+
+    return ret;
+}
+
+int
+netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
+{
+    /* Done at reconfigure. */
+    return 0;
+}
+
+void
+netdev_afxdp_rxq_destruct(struct netdev_rxq *rxq_ OVS_UNUSED)
+{
+    /* Nothing. */
+}
+
+void
+netdev_afxdp_destruct(struct netdev *netdev)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    if (ovsthread_once_start(&once)) {
+        fatal_signal_add_hook(netdev_afxdp_sweep_unused_pools,
+                              NULL, NULL, true);
+        ovsthread_once_done(&once);
+    }
+
+    /* Note: tc is bypassed when using drv mode, but when using
+     * skb mode, we might need to clean up tc. */
+
+    xsk_destroy_all(netdev);
+    ovs_mutex_destroy(&dev->mutex);
+}
+
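+/* Reports the kernel netdev statistics obtained via netlink, plus the
+ * per-socket 'tx_dropped' counters accumulated in userspace. */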
+int
+netdev_afxdp_get_stats(const struct netdev *netdev,
+                       struct netdev_stats *stats)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    struct netdev_stats dev_stats;
+    int error, i;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    error = get_stats_via_netlink(netdev, &dev_stats);
+    if (error) {
+        VLOG_WARN_RL(&rl, "%s: Error getting AF_XDP statistics.",
+                     netdev_get_name(netdev));
+    } else {
+        /* Use kernel netdev's packet and byte counts. */
+        stats->rx_packets = dev_stats.rx_packets;
+        stats->rx_bytes = dev_stats.rx_bytes;
+        stats->tx_packets = dev_stats.tx_packets;
+        stats->tx_bytes = dev_stats.tx_bytes;
+
+        stats->rx_errors           += dev_stats.rx_errors;
+        stats->tx_errors           += dev_stats.tx_errors;
+        stats->rx_dropped          += dev_stats.rx_dropped;
+        stats->tx_dropped          += dev_stats.tx_dropped;
+        stats->multicast           += dev_stats.multicast;
+        stats->collisions          += dev_stats.collisions;
+        stats->rx_length_errors    += dev_stats.rx_length_errors;
+        stats->rx_over_errors      += dev_stats.rx_over_errors;
+        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
+        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
+        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
+        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
+        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
+        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
+        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
+        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
+        stats->tx_window_errors    += dev_stats.tx_window_errors;
+
+        /* Account for the packets dropped by each xsk. */
+        for (i = 0; i < netdev_n_rxq(netdev); i++) {
+            xsk_info = dev->xsks[i];
+            if (xsk_info) {
+                uint64_t tx_dropped;
+
+                atomic_read_relaxed(&xsk_info->tx_dropped, &tx_dropped);
+                stats->tx_dropped += tx_dropped;
+            }
+        }
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return error;
+}
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..9d506dcfd1bd
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,73 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+struct netdev_rxq;
+struct netdev_stats;
+
+int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
+void netdev_afxdp_rxq_destruct(struct netdev_rxq *rxq_);
+void netdev_afxdp_destruct(struct netdev *netdev_);
+
+int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
+                          struct dp_packet_batch *batch,
+                          int *qfill);
+int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
+                            struct dp_packet_batch *batch,
+                            bool concurrent_txq);
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+int netdev_afxdp_get_stats(const struct netdev *netdev_,
+                           struct netdev_stats *stats);
+
+void free_afxdp_buf(struct dp_packet *p);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+void signal_remove_xdp(struct netdev *netdev);
+
+#else /* !HAVE_AF_XDP */
+
+#include "openvswitch/compiler.h"
+
+struct dp_packet;
+
+static inline void
+free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
+{
+    /* Nothing. */
+}
+
+#endif /* HAVE_AF_XDP */
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
new file mode 100644
index 000000000000..ebd7d3128d19
--- /dev/null
+++ b/lib/netdev-linux-private.h
@@ -0,0 +1,132 @@ 
+/*
+ * Copyright (c) 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_LINUX_PRIVATE_H
+#define NETDEV_LINUX_PRIVATE_H 1
+
+#include <config.h>
+
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "netdev-afxdp.h"
+#include "netdev-afxdp-pool.h"
+#include "netdev-provider.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
+
+struct netdev;
+
+struct netdev_rxq_linux {
+    struct netdev_rxq up;
+    bool is_tap;
+    int fd;
+};
+
+void netdev_linux_run(const struct netdev_class *);
+
+int get_stats_via_netlink(const struct netdev *netdev_,
+                          struct netdev_stats *stats);
+
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+#ifdef HAVE_AF_XDP
+    /* AF_XDP information. */
+    struct xsk_socket_info **xsks;
+    int requested_n_rxq;
+    int xdpmode;                /* AF_XDP running mode: driver or skb. */
+    int requested_xdpmode;
+    struct ovs_spin *tx_locks;  /* spin lock array for TX queues. */
+#endif
+};
+
+static bool
+is_netdev_linux_class(const struct netdev_class *netdev_class)
+{
+    return netdev_class->run == netdev_linux_run;
+}
+
+static struct netdev_linux *
+netdev_linux_cast(const struct netdev *netdev)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
+
+    return CONTAINER_OF(netdev, struct netdev_linux, up);
+}
+
+static struct netdev_rxq_linux *
+netdev_rxq_linux_cast(const struct netdev_rxq *rx)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
+
+    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
+}
+
+#endif /* netdev-linux-private.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index e4ea94cf9243..877049508597 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -17,6 +17,7 @@ 
 #include <config.h>
 
 #include "netdev-linux.h"
+#include "netdev-linux-private.h"
 
 #include <errno.h>
 #include <fcntl.h>
@@ -54,6 +55,7 @@ 
 #include "fatal-signal.h"
 #include "hash.h"
 #include "openvswitch/hmap.h"
+#include "netdev-afxdp.h"
 #include "netdev-provider.h"
 #include "netdev-vport.h"
 #include "netlink-notifier.h"
@@ -486,57 +488,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
-
-struct netdev_rxq_linux {
-    struct netdev_rxq up;
-    bool is_tap;
-    int fd;
-};
 
 /* This is set pretty low because we probably won't learn anything from the
  * additional log messages. */
@@ -550,8 +501,6 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
-static void netdev_linux_run(const struct netdev_class *);
-
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -565,7 +514,6 @@  static int do_set_addr(struct netdev *netdev,
                        struct in_addr addr);
 static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
 static int set_etheraddr(const char *netdev_name, const struct eth_addr);
-static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
 static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
@@ -573,31 +521,10 @@  static void netdev_linux_miimon_wait(void);
 static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
 
 static bool
-is_netdev_linux_class(const struct netdev_class *netdev_class)
-{
-    return netdev_class->run == netdev_linux_run;
-}
-
-static bool
 is_tap_netdev(const struct netdev *netdev)
 {
     return netdev_get_class(netdev) == &netdev_tap_class;
 }
-
-static struct netdev_linux *
-netdev_linux_cast(const struct netdev *netdev)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
-
-    return CONTAINER_OF(netdev, struct netdev_linux, up);
-}
-
-static struct netdev_rxq_linux *
-netdev_rxq_linux_cast(const struct netdev_rxq *rx)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
-    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
-}
 
 static int
 netdev_linux_netnsid_update__(struct netdev_linux *netdev)
@@ -773,7 +700,7 @@  netdev_linux_update_lag(struct rtnetlink_change *change)
     }
 }
 
-static void
+void
 netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
 {
     struct nl_sock *sock;
@@ -3278,9 +3205,7 @@  exit:
     .run = netdev_linux_run,                                    \
     .wait = netdev_linux_wait,                                  \
     .alloc = netdev_linux_alloc,                                \
-    .destruct = netdev_linux_destruct,                          \
     .dealloc = netdev_linux_dealloc,                            \
-    .send = netdev_linux_send,                                  \
     .send_wait = netdev_linux_send_wait,                        \
     .set_etheraddr = netdev_linux_set_etheraddr,                \
     .get_etheraddr = netdev_linux_get_etheraddr,                \
@@ -3311,39 +3236,74 @@  exit:
     .arp_lookup = netdev_linux_arp_lookup,                      \
     .update_flags = netdev_linux_update_flags,                  \
     .rxq_alloc = netdev_linux_rxq_alloc,                        \
-    .rxq_construct = netdev_linux_rxq_construct,                \
-    .rxq_destruct = netdev_linux_rxq_destruct,                  \
     .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
-    .rxq_recv = netdev_linux_rxq_recv,                          \
     .rxq_wait = netdev_linux_rxq_wait,                          \
     .rxq_drain = netdev_linux_rxq_drain
 
 const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
-    .get_block_id = netdev_linux_get_block_id
+    .get_block_id = netdev_linux_get_block_id,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_destruct = netdev_linux_rxq_destruct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_destruct = netdev_linux_rxq_destruct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_destruct = netdev_linux_rxq_destruct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .destruct = netdev_afxdp_destruct,
+    .get_stats = netdev_afxdp_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+    .send = netdev_afxdp_batch_send,
+    .rxq_construct = netdev_afxdp_rxq_construct,
+    .rxq_destruct = netdev_afxdp_rxq_destruct,
+    .rxq_recv = netdev_afxdp_rxq_recv,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
@@ -5915,7 +5875,7 @@  netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
     dst->tx_window_errors = src->tx_window_errors;
 }
 
-static int
+int
 get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
 {
     struct ofpbuf request;
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 2a545c986b4b..1e5a40c898fc 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -832,6 +832,9 @@  extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
 
+#ifdef HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 6b34dec9c970..b1976d365428 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -103,6 +103,9 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);
+#ifdef HAVE_AF_XDP
+void signal_remove_xdp(struct netdev *netdev);
+#endif
 
 int
 netdev_n_txq(const struct netdev *netdev)
@@ -147,6 +150,9 @@  netdev_initialize(void)
         netdev_vport_tunnel_register();
 
         netdev_register_flow_api_provider(&netdev_offload_tc);
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
@@ -2021,6 +2027,11 @@  restore_all_flags(void *aux OVS_UNUSED)
                                                saved_flags & ~saved_values,
                                                &old_flags);
         }
+#ifdef HAVE_AF_XDP
+        if (netdev->netdev_class == &netdev_afxdp_class) {
+            signal_remove_xdp(netdev);
+        }
+#endif
     }
 }
 
diff --git a/lib/util.c b/lib/util.c
index 7b8ab81f6ee1..9e8814256665 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -214,20 +214,19 @@  x2nrealloc(void *p, size_t *n, size_t s)
     return xrealloc(p, *n * s);
 }
 
-/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
- * dedicated cache lines.  That is, the memory block returned will not share a
- * cache line with other data, avoiding "false sharing".
+/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
+ * 'alignment' must be a power of two and a multiple of sizeof(void *).
  *
- * Use free_cacheline() to free the returned memory block. */
+ * Use free_size_align() to free the returned memory block. */
 void *
-xmalloc_cacheline(size_t size)
+xmalloc_size_align(size_t size, size_t alignment)
 {
 #ifdef HAVE_POSIX_MEMALIGN
     void *p;
     int error;
 
     COVERAGE_INC(util_xalloc);
-    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
+    error = posix_memalign(&p, alignment, size ? size : 1);
     if (error != 0) {
         out_of_memory();
     }
@@ -235,16 +234,16 @@  xmalloc_cacheline(size_t size)
 #else
     /* Allocate room for:
      *
-     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
-     *       pointer to be aligned exactly sizeof(void *) bytes before the
-     *       beginning of a cache line.
+     *     - Header padding: Up to alignment - 1 bytes, to allow the
+     *       pointer 'q' to be placed exactly sizeof(void *) bytes before
+     *       an alignment boundary.
      *
      *     - Pointer: A pointer to the start of the header padding, to allow us
      *       to free() the block later.
      *
      *     - User data: 'size' bytes.
      *
-     *     - Trailer padding: Enough to bring the user data up to a cache line
+     *     - Trailer padding: Enough to bring the user data up to an alignment
      *       multiple.
      *
      * +---------------+---------+------------------------+---------+
@@ -255,18 +254,56 @@  xmalloc_cacheline(size_t size)
      * p               q         r
      *
      */
-    void *p = xmalloc((CACHE_LINE_SIZE - 1)
-                      + sizeof(void *)
-                      + ROUND_UP(size, CACHE_LINE_SIZE));
-    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
-    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
-                                CACHE_LINE_SIZE);
-    void **q = (void **) r - 1;
+    void *p, *r, **q;
+    bool runt;
+
+    COVERAGE_INC(util_xalloc);
+    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
+        ovs_abort(0, "Invalid alignment");
+    }
+
+    p = xmalloc((alignment - 1)
+                + sizeof(void *)
+                + ROUND_UP(size, alignment));
+
+    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
+    /* When the padding size is smaller than sizeof(void *), there is not
+     * enough room for the pointer 'q'.  As a result, 'r' must be moved to
+     * the next alignment boundary, hence the ROUND_UP in the xmalloc()
+     * above and the ROUND_UP again when calculating 'r' below. */
+    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
+    q = (void **) r - 1;
     *q = p;
+
     return r;
 #endif
 }
 
+void
+free_size_align(void *p)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    free(p);
+#else
+    if (p) {
+        void **q = (void **) p - 1;
+        free(*q);
+    }
+#endif
+}
+
+/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
+ * dedicated cache lines.  That is, the memory block returned will not share a
+ * cache line with other data, avoiding "false sharing".
+ *
+ * Use free_cacheline() to free the returned memory block. */
+void *
+xmalloc_cacheline(size_t size)
+{
+    return xmalloc_size_align(size, CACHE_LINE_SIZE);
+}
+
 /* Like xmalloc_cacheline() but clears the allocated memory to all zero
  * bytes. */
 void *
@@ -282,14 +319,19 @@  xzalloc_cacheline(size_t size)
 void
 free_cacheline(void *p)
 {
-#ifdef HAVE_POSIX_MEMALIGN
-    free(p);
-#else
-    if (p) {
-        void **q = (void **) p - 1;
-        free(*q);
-    }
-#endif
+    free_size_align(p);
+}
+
+void *
+xmalloc_pagealign(size_t size)
+{
+    return xmalloc_size_align(size, get_page_size());
+}
+
+void
+free_pagealign(void *p)
+{
+    free_size_align(p);
 }
 
 char *
diff --git a/lib/util.h b/lib/util.h
index 095ede20f07f..7ad8758fe637 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -169,6 +169,11 @@  void ovs_strzcpy(char *dst, const char *src, size_t size);
 
 int string_ends_with(const char *str, const char *suffix);
 
+void *xmalloc_pagealign(size_t) MALLOC_LIKE;
+void free_pagealign(void *);
+void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
+void free_size_align(void *);
+
 /* The C standards say that neither the 'dst' nor 'src' argument to
  * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
  * the null case. */
diff --git a/tests/.gitignore b/tests/.gitignore
index 9b07508bd056..c5abb32d025a 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -13,6 +13,9 @@ 
 /ovsdb-cluster-testsuite.dir/
 /ovsdb-cluster-testsuite.log
 /pki/
+/system-afxdp-testsuite
+/system-afxdp-testsuite.dir/
+/system-afxdp-testsuite.log
 /system-dpdk-testsuite
 /system-dpdk-testsuite.dir/
 /system-dpdk-testsuite.log
diff --git a/tests/automake.mk b/tests/automake.mk
index 2956e68b242c..d6ab51732908 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -160,6 +162,11 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-userspace-macros.at \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -184,6 +191,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -317,6 +325,10 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -354,6 +366,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..f0683c0a901b
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,39 @@ 
+# Add a port to the OVS bridge using the afxdp netdev type.
+# This relies on generic XDP support in the veth driver.
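+# A typical invocation from system-traffic.at looks like:
+#   ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")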
+m4_define([ADD_VETH],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
+
+m4_define([OVS_CHECK_8021AD],
+    [AT_SKIP_IF([:])])
+
+# CONFIGURE_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads and VLAN offloads for veths used in AF_XDP.
+m4_define([CONFIGURE_VETH_OFFLOADS],
+    [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])
+     AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
+    ]
+)
+
+# OVS_START_L7([namespace], [protocol])
+#
+# AF_XDP doesn't work with TCP over virtual interfaces for now.
+#
+m4_define([OVS_START_L7],
+   [AT_SKIP_IF([:])])
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..9b7a29066614
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-common-macros.at])
+m4_include([tests/system-userspace-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+
+m4_include([tests/system-traffic.at])
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 8ea450887076..4bd91a03946e 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -71,6 +71,7 @@  AT_CLEANUP
 
 AT_SETUP([datapath - ping between two ports on cvlan])
 OVS_TRAFFIC_VSWITCHD_START()
+OVS_CHECK_8021AD()
 
 AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
 
@@ -161,6 +162,7 @@  AT_CLEANUP
 
 AT_SETUP([datapath - ping6 between two ports on cvlan])
 OVS_TRAFFIC_VSWITCHD_START()
+OVS_CHECK_8021AD()
 
 AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 6d99f7c270cd..027aee2f523b 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3107,6 +3107,21 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="xdpmode"
+              type='{"type": "string",
+                     "enum": ["set", ["skb", "drv"]]}'>
+        <p>
+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled.  This mode requires a device driver
+          with AF_XDP support and offers the best performance.
+          If "skb", the XDP program uses the generic XDP mode in the kernel,
+          with extra data copying between userspace and the kernel.  No
+          device driver support is needed.  Note that this option applies
+          to the afxdp netdev type only.  Defaults to "skb" mode.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
         <p>