[ovs-dev,RFCv4,0/4] AF_XDP netdev support for OVS
Message ID 1554158812-44622-1-git-send-email-u9012063@gmail.com
William Tu April 1, 2019, 10:46 p.m. UTC
The patch series introduces AF_XDP support for OVS netdev.
AF_XDP is a new address family working together with eBPF.
In short, a socket with AF_XDP family can receive and send
packets from an eBPF/XDP program attached to the netdev.
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst

OVS has several netdev types, e.g., system, tap, or
internal.  The patch first adds a new netdev type called
"afxdp", and implements its configuration, packet reception,
and transmit functions.  Since the AF_XDP socket, xsk,
operates in userspace, once ovs-vswitchd receives packets
from xsk, the proposed architecture re-uses the existing
userspace dpif-netdev datapath.  As a result, most of
the packet processing happens in userspace instead of
in the Linux kernel.

Architecture
============
               _
              |   +-------------------+
              |   |    ovs-vswitchd   |<-->ovsdb-server
              |   +-------------------+
              |   |      ofproto      |<-->OpenFlow controllers
              |   +--------+-+--------+ 
              |   | netdev | |ofproto-|
    userspace |   +--------+ |  dpif  |
              |   | netdev | +--------+
              |   |provider| |  dpif  |
              |   +---||---+ +--------+
              |       ||     |  dpif- |
              |       ||     | netdev |
              |_      ||     +--------+  
                      ||         
               _  +---||-----+--------+
              |   | af_xdp prog +     |
       kernel |   |   xsk_map         |
              |_  +--------||---------+
                           ||
                        physical
                           NIC

To get started, create an OVS userspace bridge using dpif-netdev
by setting the datapath_type to netdev:
  # ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev

Then attach a Linux netdev with type afxdp:
  # ovs-vsctl add-port br0 afxdp-p0 -- \
      set interface afxdp-p0 type="afxdp"

Performance
===========
For this version, v4, I mainly focused on getting the features right with
the libbpf AF_XDP API, using the AF_XDP SKB mode, which is the slower set-up.
In the next version I plan to measure the performance and add optimizations.

Documentation
=============
Most of the design details are described in the paper presented at
Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
vSwitch"[1], section 4, and in the slides[2].
This patch uses a not-yet upstreamed feature called XDP_ATTACH[3],
described in section 3.1, which is a built-in XDP program for AF_XDP.
This greatly simplifies the management of XDP/eBPF programs.

[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf

For the installation and configuration guide, see
  Documentation/intro/install/bpf.rst

Test Cases
==========
Test cases are created using namespaces and veth peers, with an AF_XDP
socket attached to the veth (thus the SKB_MODE).  Running
"make check-afxdp" produces the following:

AF_XDP netdev datapath-sanity

  1: datapath - ping between two ports               ok
  2: datapath - ping between two ports on vlan       ok
  3: datapath - ping6 between two ports              ok
  4: datapath - ping6 between two ports on vlan      ok
  5: datapath - ping over vxlan tunnel               ok
  6: datapath - ping over vxlan6 tunnel              ok
  7: datapath - ping over gre tunnel                 ok
  8: datapath - ping over erspan v1 tunnel           ok
  9: datapath - ping over erspan v2 tunnel           ok
 10: datapath - ping over ip6erspan v1 tunnel        ok
 11: datapath - ping over ip6erspan v2 tunnel        ok
 12: datapath - ping over geneve tunnel              ok
 13: datapath - ping over geneve6 tunnel             ok
 14: datapath - clone action                         ok
 15: datapath - basic truncate action                ok

conntrack

 16: conntrack - controller                          ok
 17: conntrack - force commit                        ok
 18: conntrack - ct flush by 5-tuple                 ok
 19: conntrack - IPv4 ping                           ok
 20: conntrack - get_nconns and get/set_maxconns     ok
 21: conntrack - IPv6 ping                           ok

system-ovn

 22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
 23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
 24: ovn -- multiple gateway routers, SNAT and DNAT  ok
 25: ovn -- load-balancing                           ok
 26: ovn -- load-balancing - same subnet.            ok
 27: ovn -- load balancing in gateway router         ok
 28: ovn -- multiple gateway routers, load-balancing ok
 29: ovn -- load balancing in router with gateway router port ok
 30: ovn -- DNAT and SNAT on distributed router - N/S ok
 31: ovn -- DNAT and SNAT on distributed router - E/W ok

---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improvement)
- remove malloc dp_packet, instead put dp_packet in umem
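
The first v1->v2 item, a list of unused umem elements, can be pictured
with the sketch below.  It is only an illustration under assumptions, not
the code in this series: the names umem_pool, umem_pool_get and
umem_pool_put, the pool size and the frame size are all hypothetical.
The idea is a LIFO of frame offsets, so that the most recently returned
frame is handed out again first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UMEM_NUM_FRAMES 8       /* Hypothetical number of frames. */
#define UMEM_FRAME_SIZE 2048    /* Hypothetical frame size in bytes. */

/* Hypothetical LIFO pool holding the offsets of unused umem frames. */
struct umem_pool {
    uint64_t free_addrs[UMEM_NUM_FRAMES];
    size_t n_free;
};

/* Marks every frame as unused. */
void
umem_pool_init(struct umem_pool *pool)
{
    for (size_t i = 0; i < UMEM_NUM_FRAMES; i++) {
        pool->free_addrs[i] = (uint64_t) i * UMEM_FRAME_SIZE;
    }
    pool->n_free = UMEM_NUM_FRAMES;
}

/* Takes one unused frame.  Returns 0 and stores its offset in '*addr',
 * or returns -1 if no frame is left. */
int
umem_pool_get(struct umem_pool *pool, uint64_t *addr)
{
    if (pool->n_free == 0) {
        return -1;
    }
    *addr = pool->free_addrs[--pool->n_free];
    return 0;
}

/* Returns a frame to the pool once the packet in it has been consumed. */
void
umem_pool_put(struct umem_pool *pool, uint64_t addr)
{
    assert(pool->n_free < UMEM_NUM_FRAMES);
    pool->free_addrs[pool->n_free++] = addr;
}
```

A receive path would call umem_pool_get() to refill the fill queue and
umem_pool_put() when a frame is no longer referenced.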

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- support more test cases
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

William Tu (4):
  Add libbpf build support.
  netdev-afxdp: add new netdev type for AF_XDP
  tests: add AF_XDP netdev test cases.
  afxdp netdev: add documentation and configuration.

 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/bpf.rst   | 182 +++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  20 +
 configure.ac                          |   1 +
 lib/automake.mk                       |   7 +-
 lib/dp-packet.c                       |  12 +
 lib/dp-packet.h                       |  32 +-
 lib/dpif-netdev.c                     |   2 +-
 lib/netdev-afxdp.c                    | 491 +++++++++++++++++
 lib/netdev-afxdp.h                    |  39 ++
 lib/netdev-linux.c                    |  78 ++-
 lib/netdev-provider.h                 |   1 +
 lib/netdev.c                          |   1 +
 lib/xdpsock.c                         | 179 +++++++
 lib/xdpsock.h                         | 129 +++++
 tests/automake.mk                     |  17 +
 tests/system-afxdp-macros.at          | 153 ++++++
 tests/system-afxdp-testsuite.at       |  26 +
 tests/system-afxdp-traffic.at         | 978 ++++++++++++++++++++++++++++++++++
 21 files changed, 2345 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/intro/install/bpf.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
 create mode 100644 tests/system-afxdp-traffic.at

Comments

Ben Pfaff April 16, 2019, 7:55 p.m. UTC | #1
On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
> The patch series introduces AF_XDP support for OVS netdev.
> AF_XDP is a new address family working together with eBPF.
> In short, a socket with AF_XDP family can receive and send
> packets from an eBPF/XDP program attached to the netdev.
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst

I'm glad to see some more revisions of this series!

AF_XDP is a faster way to access the existing kernel devices.  If we
take that point of view, then it would be ideal if AF_XDP were
automatically used when it was available, instead of adding a new
network device type.  Is there a reason that this point of view is
wrong?  That is, when AF_XDP is available, is there a reason not to use
it?

You said that your goal for the next version is to improve performance
and add optimizations.  Do you think that is important before we merge
the series?  We can continue to improve performance after it is merged.

If we set performance aside, do you have a reason to want to wait to
merge this?  (I wasn't able to easily apply this series to current
master, so it'll need at least a rebase before we apply it.  And I have
only skimmed it, not fully reviewed it.)

It might make sense to squash all of these into a single patch.  I am
not sure that they are really distinct conceptually.
Eelco Chaudron April 17, 2019, 8:09 a.m. UTC | #2
On 16 Apr 2019, at 21:55, Ben Pfaff wrote:

> On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
>> The patch series introduces AF_XDP support for OVS netdev.
>> AF_XDP is a new address family working together with eBPF.
>> In short, a socket with AF_XDP family can receive and send
>> packets from an eBPF/XDP program attached to the netdev.
>> For more details about AF_XDP, please see linux kernel's
>> Documentation/networking/af_xdp.rst
>
> I'm glad to see some more revisions of this series!

I’m planning on reviewing and testing this patch. I’ll try to start 
this week, or else when I get back from PTO.

> AF_XDP is a faster way to access the existing kernel devices.  If we
> take that point of view, then it would be ideal if AF_XDP were
> automatically used when it was available, instead of adding a new
> network device type.  Is there a reason that this point of view is
> wrong?  That is, when AF_XDP is available, is there a reason not to 
> use
> it?

This needs support by all the ingress and egress ports in the system, 
and currently, there is no API to check this.

There are also features like traffic shaping that will not work. Maybe 
it will be worth adding the table for AF_XDP in 
http://docs.openvswitch.org/en/latest/faq/releases/

> You said that your goal for the next version is to improve performance
> and add optimizations.  Do you think that is important before we merge
> the series?  We can continue to improve performance after it is 
> merged.

The previous patch was rather unstable and I could not get it running 
with the PVP test without crashing. I think this patchset should get 
some proper testing and reviews by others. Especially for all the 
features being marked as supported in the above-mentioned table.

> If we set performance aside, do you have a reason to want to wait to
> merge this?  (I wasn't able to easily apply this series to current
> master, so it'll need at least a rebase before we apply it.  And I 
> have
> only skimmed it, not fully reviewed it.)

Other than the items above, do we really need another datapath? With 
this, we use two or more cores for processing packets. If we poll two 
physical ports it could be 300%, which is a typical use case with 
bonding. What about multiple queue support, does it work? Both in kernel 
and DPDK mode we use multiple queues to distribute the load, with this 
scenario does it double the number of CPUs used? Can we use the poll() 
mode as explained here, 
https://linuxplumbersconf.org/event/2/contributions/99/, and how will it 
work with multiple queues/pmd threads? What about any latency tests, is 
it worse or better than kernel/dpdk? Also with the AF_XDP datapath, 
there is no way to leverage hardware offload, like DPDK and TC. And then 
there is the part that it only works on the most recent kernels.
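
For readers unfamiliar with the poll() question raised above, the
difference from a busy-polling pmd thread is that the thread sleeps in
the kernel until the descriptor becomes readable.  The sketch below shows
only the generic POSIX pattern; wait_readable and recv_when_ready are
hypothetical helper names, and nothing here is taken from the patch or
from the xsk API.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Sleeps until 'fd' is readable or 'timeout_ms' expires.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    assert(fd >= 0);
    ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0) {
        return -1;
    }
    if (ret == 0) {
        return 0;
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Waits for data and then reads up to 'len' bytes into 'buf'.
 * Returns the number of bytes read, 0 on timeout, -1 on error. */
ssize_t
recv_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    int ready = wait_readable(fd, timeout_ms);

    if (ready <= 0) {
        return ready;
    }
    return read(fd, buf, len);
}
```

The trade-off is the usual one: the CPU is released while idle, at the
cost of a wakeup latency on each burst of packets.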

To me looking at this I would say it’s far from being ready to be 
merged into OVS. However, if others decide to go ahead I think it should 
be disabled, not compiled in by default.

> It might make sense to squash all of these into a single patch.  I am
> not sure that they are really distinct conceptually.
Eelco Chaudron April 17, 2019, 10:16 a.m. UTC | #3
On 17 Apr 2019, at 10:09, Eelco Chaudron wrote:

> On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
>
>> On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
>>> The patch series introduces AF_XDP support for OVS netdev.
>>> AF_XDP is a new address family working together with eBPF.
>>> In short, a socket with AF_XDP family can receive and send
>>> packets from an eBPF/XDP program attached to the netdev.
>>> For more details about AF_XDP, please see linux kernel's
>>> Documentation/networking/af_xdp.rst
>>
>> I'm glad to see some more revisions of this series!
>
> I’m planning on reviewing and testing this patch, I’ll try to 
> start it this week, or else when I get back from PTO.
>
>> AF_XDP is a faster way to access the existing kernel devices.  If we
>> take that point of view, then it would be ideal if AF_XDP were
>> automatically used when it was available, instead of adding a new
>> network device type.  Is there a reason that this point of view is
>> wrong?  That is, when AF_XDP is available, is there a reason not to 
>> use
>> it?
>
> This needs support by all the ingress and egress ports in the system, 
> and currently, there is no API to check this.
>
> There are also features like traffic shaping that will not work. Maybe 
> it will be worth adding the table for AF_XDP in 
> http://docs.openvswitch.org/en/latest/faq/releases/
>
>> You said that your goal for the next version is to improve 
>> performance
>> and add optimizations.  Do you think that is important before we 
>> merge
>> the series?  We can continue to improve performance after it is 
>> merged.
>
> The previous patch was rather unstable and I could not get it running 
> with the PVP test without crashing. I think this patchset should get 
> some proper testing and reviews by others. Especially for all the 
> features being marked as supported in the above-mentioned table.
>
>> If we set performance aside, do you have a reason to want to wait to
>> merge this?  (I wasn't able to easily apply this series to current
>> master, so it'll need at least a rebase before we apply it.  And I 
>> have
>> only skimmed it, not fully reviewed it.)
>
> Other than the items above, do we really need another datapath? With 
> this, we use two or more cores for processing packets. If we poll two 
> physical ports it could be 300%, which is a typical use case with 
> bonding. What about multiple queue support, does it work? Both in 
> kernel and DPDK mode we use multiple queues to distribute the load, 
> with this scenario does it double the number of CPUs used? Can we use 
> the poll() mode as explained here, 
> https://linuxplumbersconf.org/event/2/contributions/99/, and how will 
> it work with multiple queues/pmd threads? What about any latency 
> tests, is it worse or better than kernel/dpdk? Also with the AF_XDP 
> datapath, there is no way to leverage hardware offload, like DPDK and TC. 
> And then there is the part that it only works on the most recent 
> kernels.

One other thing that popped up in my head is how (will) it work together 
with DPDK enabled on the same system?

> To me looking at this I would say it’s far from being ready to be 
> merged into OVS. However, if others decide to go ahead I think it 
> should be disabled, not compiled in by default.
>
>> It might make sense to squash all of these into a single patch.  I am
>> not sure that they are really distinct conceptually.
Eelco Chaudron April 17, 2019, 12:01 p.m. UTC | #4
Hi William,

I think you applied the following patch to get it to compile? Or did you 
copy in the kernel headers?

https://www.spinics.net/lists/netdev/msg563507.html

//Eelco

Eelco Chaudron April 17, 2019, 2:26 p.m. UTC | #5
On 17 Apr 2019, at 14:01, Eelco Chaudron wrote:

> Hi William,
>
> I think you applied the following patch to get it to compile? Or did 
> you copy in the kernel headers?
>
> https://www.spinics.net/lists/netdev/msg563507.html

I noticed you duplicated the macros, which resulted in all kinds of 
compile errors. So I removed them, applied the two patches above, which 
would get me to the next step.

I’m building it with DPDK enabled and it was causing all kinds of 
duplicate definition errors as the kernel and DPDK re-use some structure 
names.

To get it all compiled and working I had to make the following changes:

$ git diff
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index b3bf2f044..47fb3342a 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -295,7 +295,7 @@ netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
      uint32_t idx_rx = 0, idx_fq = 0;
      int ret = 0;

-    unsigned int non_afxdp;
+    unsigned int non_afxdp = 0;

      /* See if there is any packet on RX queue,
       * if yes, idx_rx is the index having the packet.
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 47153dc60..77f2150ab 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -24,7 +24,7 @@
  #include <unistd.h>
  #include <linux/virtio_net.h>
  #include <sys/socket.h>
-#include <linux/if.h>
+//#include <linux/if.h>

  #include <rte_bus_pci.h>
  #include <rte_config.h>
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
index 8df8fa451..a2ed1a136 100644
--- a/lib/xdpsock.h
+++ b/lib/xdpsock.h
@@ -28,7 +28,7 @@
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
-#include <net/ethernet.h>
+//#include <net/ethernet.h>
  #include <sys/resource.h>
  #include <sys/socket.h>
  #include <sys/mman.h>
@@ -43,14 +43,6 @@
  #include "ovs-atomic.h"
  #include "openvswitch/thread.h"

-/* bpf/xsk.h uses the following macros not defined in OVS,
- * so re-define them before include.
- */
-#define unlikely OVS_UNLIKELY
-#define likely OVS_LIKELY
-#define barrier() __asm__ __volatile__("": : :"memory")
-#define smp_rmb() barrier()
-#define smp_wmb() barrier()
  #include <bpf/xsk.h>

In addition, you need to run “make install_headers” from the kernel's 
libbpf directory and copy libbpf_util.h manually.
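
As an aside on the macro clash removed in the xdpsock.h hunk above: one
possible alternative to deleting the definitions is to guard each one, so
that it is only defined when no earlier header has provided it.  This is
a sketch under assumptions, not what the series or the diff above does.
It uses the GCC/Clang builtin directly instead of OVS_LIKELY so that it
compiles on its own, and ring_entries_available is a hypothetical helper
that exists only to exercise the macros.

```c
#include <assert.h>

/* Define each helper only if no earlier header has already done so. */
#ifndef likely
#define likely(x) __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
#ifndef barrier
#define barrier() __asm__ __volatile__("" : : : "memory")
#endif

/* Hypothetical helper: number of filled entries in a ring, given free
 * running producer and consumer counters that may wrap around. */
unsigned int
ring_entries_available(unsigned int producer, unsigned int consumer,
                       unsigned int ring_size)
{
    unsigned int entries;

    barrier();                      /* Compiler barrier only. */
    entries = producer - consumer;  /* Unsigned wraparound is intended. */
    assert(entries <= ring_size);
    if (unlikely(entries == 0)) {
        return 0;
    }
    return entries;
}
```

Note that such guards only help when the other definitions come first; a
header included later that defines the same names unconditionally would
still trigger a redefinition diagnostic.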

I was able to run a simple test, with traffic entering and leaving on the 
same physical port, without crashing, but the numbers seem low:

$ ovs-ofctl dump-flows ovs_pvp_br0
  cookie=0x0, duration=210.344s, table=0, n_packets=1784692, 
n_bytes=2694884920, in_port=eno1 actions=IN_PORT

"Physical loopback test, L3 flows[port redirect]"
,Packet size
Number of flows,64,128,256,512,768,1024,1514
100,77574,77329,76605,76417,75539,75252,74617

The above is using two cores, but with a single DPDK core I get the 
following (on the same machine):

"Physical loopback test, L3 flows[port redirect]"
,Packet size
Number of flows,64,128,256,512,768,1024,1514
100,9527075,8445852,4528935,2349597,1586276,1197304,814854

For the kernel datapath the numbers are:

"Physical loopback test, L3 flows[port redirect]"
,Packet size
Number of flows,64,128,256,512,768,1024,1514
100,4862995,5521870,4528872,2349596,1586277,1197305,814854

But keep in mind that the kernel datapath uses roughly 
550/610/520/380/180/140/110% of the CPU for the respective packet sizes.
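
To put the three result tables side by side, the two small helpers below
compute a throughput ratio and a rate normalized to one fully used core.
This is only a sketch for doing the arithmetic: the function names are
hypothetical, and it assumes the table values are packets per second,
which the output above does not state.

```c
#include <assert.h>

/* Ratio between two throughput figures, e.g. DPDK over AF_XDP. */
double
throughput_ratio(double rate_a, double rate_b)
{
    assert(rate_b > 0.0);
    return rate_a / rate_b;
}

/* Throughput normalized to one fully used core.  'cpu_percent' is the
 * total CPU usage, e.g. 550 when five and a half cores are busy. */
double
per_core_rate(double rate, double cpu_percent)
{
    assert(cpu_percent > 0.0);
    return rate * 100.0 / cpu_percent;
}
```

For the 64-byte column, throughput_ratio(9527075, 77574) compares the
single-core DPDK result with the AF_XDP result, a gap of roughly two
orders of magnitude, and per_core_rate(4862995, 550) normalizes the
kernel datapath result by its CPU usage.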

> On 2 Apr 2019, at 0:46, William Tu wrote:
>
>> The patch series introduces AF_XDP support for OVS netdev.
>> AF_XDP is a new address family working together with eBPF.
>> In short, a socket with AF_XDP family can receive and send
>> packets from an eBPF/XDP program attached to the netdev.
>> For more details about AF_XDP, please see linux kernel's
>> Documentation/networking/af_xdp.rst
>>
>> OVS has several netdev types, e.g., system, tap, or
>> internal.  The patch first adds a new netdev type called
>> "afxdp", and implements its configuration, packet reception,
>> and transmit functions.  Since the AF_XDP socket, xsk,
>> operates in userspace, once ovs-vswitchd receives packets
>> from xsk, the proposed architecture re-uses the existing
>> userspace dpif-netdev datapath.  As a result, most of
>> the packet processing happens in userspace instead of
>> in the Linux kernel.
>>
>> Architecture
>> ============
>>                _
>>               |   +-------------------+
>>               |   |    ovs-vswitchd   |<-->ovsdb-server
>>               |   +-------------------+
>>               |   |      ofproto      |<-->OpenFlow controllers
>>               |   +--------+-+--------+
>>               |   | netdev | |ofproto-|
>>     userspace |   +--------+ |  dpif  |
>>               |   | netdev | +--------+
>>               |   |provider| |  dpif  |
>>               |   +---||---+ +--------+
>>               |       ||     |  dpif- |
>>               |       ||     | netdev |
>>               |_      ||     +--------+
>>                       ||
>>                _  +---||-----+--------+
>>               |   | af_xdp prog +     |
>>        kernel |   |   xsk_map         |
>>               |_  +--------||---------+
>>                            ||
>>                         physical
>>                            NIC
>>
>> To get started, create an OVS userspace bridge using dpif-netdev
>> by setting the datapath_type to netdev:
>>   # ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
>>
>> Then attach a Linux netdev with type afxdp:
>>   # ovs-vsctl add-port br0 afxdp-p0 -- \
>>       set interface afxdp-p0 type="afxdp"
>>
>> Performance
>> ===========
>> For this version, v4, I mainly focus on making the features right 
>> with
>> libbpf AF_XDP API and use the AF_XDP SKB mode, which is the slower 
>> set-up.
>> My next version is to measure the performance and add optimizations.
>>
>> Documentation
>> =============
>> Most of the design details are described in the paper presented at
>> Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
>> vSwitch"[1], section 4, and in the slides[2].
>> This patch uses a not-yet upstreamed feature called XDP_ATTACH[3],
>> described in section 3.1, which is a built-in XDP program for AF_XDP.
>> This greatly simplifies the management of XDP/eBPF programs.
>>
>> [1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
>> [2] 
>> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
>> [3] 
>> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
>>
>> For installation and configuration guide, see
>>   # Documentation/intro/install/bpf.rst
>>
>> Test Cases
>> ==========
>> Test cases are created using namespaces and veth peer, with AF_XDP 
>> socket
>> attached to the veth (thus the SKB_MODE).  By issuing "make 
>> check-afxdp",
>> the patch shows the following:
>>
>> AF_XDP netdev datapath-sanity
>>
>>   1: datapath - ping between two ports               ok
>>   2: datapath - ping between two ports on vlan       ok
>>   3: datapath - ping6 between two ports              ok
>>   4: datapath - ping6 between two ports on vlan      ok
>>   5: datapath - ping over vxlan tunnel               ok
>>   6: datapath - ping over vxlan6 tunnel              ok
>>   7: datapath - ping over gre tunnel                 ok
>>   8: datapath - ping over erspan v1 tunnel           ok
>>   9: datapath - ping over erspan v2 tunnel           ok
>>  10: datapath - ping over ip6erspan v1 tunnel        ok
>>  11: datapath - ping over ip6erspan v2 tunnel        ok
>>  12: datapath - ping over geneve tunnel              ok
>>  13: datapath - ping over geneve6 tunnel             ok
>>  14: datapath - clone action                         ok
>>  15: datapath - basic truncate action                ok
>>
>> conntrack
>>
>>  16: conntrack - controller                          ok
>>  17: conntrack - force commit                        ok
>>  18: conntrack - ct flush by 5-tuple                 ok
>>  19: conntrack - IPv4 ping                           ok
>>  20: conntrack - get_nconns and get/set_maxconns     ok
>>  21: conntrack - IPv6 ping                           ok
>>
>> system-ovn
>>
>>  22: ovn -- 2 LRs connected via LS, gateway router, SNAT and DNAT ok
>>  23: ovn -- 2 LRs connected via LS, gateway router, easy SNAT ok
>>  24: ovn -- multiple gateway routers, SNAT and DNAT  ok
>>  25: ovn -- load-balancing                           ok
>>  26: ovn -- load-balancing - same subnet.            ok
>>  27: ovn -- load balancing in gateway router         ok
>>  28: ovn -- multiple gateway routers, load-balancing ok
>>  29: ovn -- load balancing in router with gateway router port ok
>>  30: ovn -- DNAT and SNAT on distributed router - N/S ok
>>  31: ovn -- DNAT and SNAT on distributed router - E/W ok
>>
>> ---
>> v1->v2:
>> - add a list to maintain unused umem elements
>> - remove copy from rx umem to ovs internal buffer
>> - use hugetlb to reduce misses (not much difference)
>> - use pmd mode netdev in OVS (huge performance improve)
>> - remove malloc dp_packet, instead put dp_packet in umem
>>
>> v2->v3:
>> - rebase on the OVS master, 7ab4b0653784
>>   ("configure: Check for more specific function to pull in pthread 
>> library.")
>> - remove the dependency on libbpf and dpif-bpf.
>>   instead, use the built-in XDP_ATTACH feature.
>> - data structure optimizations for better performance, see[1]
>> - more test cases support
>> v3: 
>> https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
>>
>> v3->v4:
>> - Use AF_XDP API provided by libbpf
>> - Remove the dependency on XDP_ATTACH kernel patch set
>> - Add documentation, bpf.rst
>>
>> William Tu (4):
>>   Add libbpf build support.
>>   netdev-afxdp: add new netdev type for AF_XDP
>>   tests: add AF_XDP netdev test cases.
>>   afxdp netdev: add documentation and configuration.
>>
>>  Documentation/automake.mk             |   1 +
>>  Documentation/index.rst               |   1 +
>>  Documentation/intro/install/bpf.rst   | 182 +++++++
>>  Documentation/intro/install/index.rst |   1 +
>>  acinclude.m4                          |  20 +
>>  configure.ac                          |   1 +
>>  lib/automake.mk                       |   7 +-
>>  lib/dp-packet.c                       |  12 +
>>  lib/dp-packet.h                       |  32 +-
>>  lib/dpif-netdev.c                     |   2 +-
>>  lib/netdev-afxdp.c                    | 491 +++++++++++++++++
>>  lib/netdev-afxdp.h                    |  39 ++
>>  lib/netdev-linux.c                    |  78 ++-
>>  lib/netdev-provider.h                 |   1 +
>>  lib/netdev.c                          |   1 +
>>  lib/xdpsock.c                         | 179 +++++++
>>  lib/xdpsock.h                         | 129 +++++
>>  tests/automake.mk                     |  17 +
>>  tests/system-afxdp-macros.at          | 153 ++++++
>>  tests/system-afxdp-testsuite.at       |  26 +
>>  tests/system-afxdp-traffic.at         | 978 
>> ++++++++++++++++++++++++++++++++++
>>  21 files changed, 2345 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/intro/install/bpf.rst
>>  create mode 100644 lib/netdev-afxdp.c
>>  create mode 100644 lib/netdev-afxdp.h
>>  create mode 100644 lib/xdpsock.c
>>  create mode 100644 lib/xdpsock.h
>>  create mode 100644 tests/system-afxdp-macros.at
>>  create mode 100644 tests/system-afxdp-testsuite.at
>>  create mode 100644 tests/system-afxdp-traffic.at
>>
>> -- 
>> 2.7.4
Ben Pfaff April 17, 2019, 4:47 p.m. UTC | #6
On Wed, Apr 17, 2019 at 10:09:53AM +0200, Eelco Chaudron wrote:
> On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
> > AF_XDP is a faster way to access the existing kernel devices.  If we
> > take that point of view, then it would be ideal if AF_XDP were
> > automatically used when it was available, instead of adding a new
> > network device type.  Is there a reason that this point of view is
> > wrong?  That is, when AF_XDP is available, is there a reason not to use
> > it?
> 
> This needs support by all the ingress and egress ports in the system, and
> currently, there is no API to check this.

Do you mean for performance or for some other reason?  I would suspect
that, if AF_XDP was not available, then everything would still work OK
via AF_PACKET, just slower.

> There are also features like traffic shaping that will not work. Maybe it
> will be worth adding the table for AF_XDP in
> http://docs.openvswitch.org/en/latest/faq/releases/

AF_XDP is comparable to DPDK/userspace, not to the Linux kernel
datapath.

The table currently conflates the userspace datapath with the DPDK
network device.  I believe that the only entry there that depends on the
DPDK network device is the one for policing.  It could be replaced by a
[*] with a note like this:

        YES - for DPDK network devices.
        NO - for system or AF_XDP network devices.

> > You said that your goal for the next version is to improve performance
> > and add optimizations.  Do you think that is important before we merge
> > the series?  We can continue to improve performance after it is merged.
> 
> The previous patch was rather unstable and I could not get it running with
> the PVP test without crashing. I think this patchset should get some proper
> testing and reviews by others. Especially for all the features being marked
> as supported in the above-mentioned table.

If it's unstable, we should fix that before adding it in.

However, the bar is lower for new features that don't break existing
features, especially optional ones and ones that can easily be
removed if they don't work out in the end.  DPDK support was considered
"experimental" for a long time; it's possible that AF_XDP would be in
the same boat for a while.

> > If we set performance aside, do you have a reason to want to wait to
> > merge this?  (I wasn't able to easily apply this series to current
> > master, so it'll need at least a rebase before we apply it.  And I have
> > only skimmed it, not fully reviewed it.)
> 
> Other than the items above, do we really need another datapath? 

It's less than a new datapath.  It's a new network device
implementation.

> With this, we use two or more cores for processing packets. If we poll
> two physical ports it could be 300%, which is a typical use case with
> bonding. What about multiple queue support, does it work? Both in
> kernel and DPDK mode we use multiple queues to distribute the load,
> with this scenario does it double the number of CPUs used? Can we use
> the poll() mode as explained here,
> https://linuxplumbersconf.org/event/2/contributions/99/, and how will
> it work with multiple queues/pmd threads? What about any latency
> tests, is it worse or better than kernel/dpdk? Also with the AF_XDP
> datapath, there is no way to leverage hardware offload, like DPDK and
> TC. And then there is the part that it only works on the most recent
> kernels.

These are good questions.  William will have some of the answers.

> To me looking at this I would say it’s far from being ready to be merged
> into OVS. However, if others decide to go ahead I think it should be
> disabled, not compiled in by default.

Yes, that seems reasonable to me.
Ben Pfaff April 17, 2019, 4:48 p.m. UTC | #7
On Wed, Apr 17, 2019 at 12:16:59PM +0200, Eelco Chaudron wrote:
> One other thing that popped up in my head is how (will) it work together
> with DPDK enabled on the same system?

Why not?
William Tu April 17, 2019, 4:53 p.m. UTC | #8
Thanks for the feedback.

On Tue, Apr 16, 2019 at 12:55 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
> > The patch series introduces AF_XDP support for OVS netdev.
> > AF_XDP is a new address family working together with eBPF.
> > In short, a socket with AF_XDP family can receive and send
> > packets from an eBPF/XDP program attached to the netdev.
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst
>
> I'm glad to see some more revisions of this series!
>
> AF_XDP is a faster way to access the existing kernel devices.  If we
> take that point of view, then it would be ideal if AF_XDP were
> automatically used when it was available, instead of adding a new
> network device type.  Is there a reason that this point of view is
> wrong?  That is, when AF_XDP is available, is there a reason not to use
> it?

I think we should use it if it is available. However, currently only the
ixgbe and i40e drivers support AF_XDP mode. I think more vendors are
working on this feature, though.

>
> You said that your goal for the next version is to improve performance
> and add optimizations.  Do you think that is important before we merge
> the series?  We can continue to improve performance after it is merged.
>
> If we set performance aside, do you have a reason to want to wait to
> merge this?  (I wasn't able to easily apply this series to current
> master, so it'll need at least a rebase before we apply it.  And I have
> only skimmed it, not fully reviewed it.)

OK, thanks.
I have been working on measuring the performance and adding some
optimizations. I will consider submitting another version.

>
> It might make sense to squash all of these into a single patch.  I am
> not sure that they are really distinct conceptually.
William Tu April 17, 2019, 5:09 p.m. UTC | #9
On Wed, Apr 17, 2019 at 1:09 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
>
> > On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
> >> The patch series introduces AF_XDP support for OVS netdev.
> >> AF_XDP is a new address family working together with eBPF.
> >> In short, a socket with AF_XDP family can receive and send
> >> packets from an eBPF/XDP program attached to the netdev.
> >> For more details about AF_XDP, please see linux kernel's
> >> Documentation/networking/af_xdp.rst
> >
> > I'm glad to see some more revisions of this series!
>
> I’m planning on reviewing and testing this patch, I’ll try to start
> it this week, or else when I get back from PTO.
>
> > AF_XDP is a faster way to access the existing kernel devices.  If we
> > take that point of view, then it would be ideal if AF_XDP were
> > automatically used when it was available, instead of adding a new
> > network device type.  Is there a reason that this point of view is
> > wrong?  That is, when AF_XDP is available, is there a reason not to
> > use
> > it?
>
> This needs support by all the ingress and egress ports in the system,
> and currently, there is no API to check this.

Not necessarily all ports.
On an OVS switch, some ports can support AF_XDP while
others are of other types, e.g., DPDK vhost or tap.

>
> There are also features like traffic shaping that will not work. Maybe
> it will be worth adding the table for AF_XDP in
> http://docs.openvswitch.org/en/latest/faq/releases/

Right, when using AF_XDP, we don't have QoS support.
If people want to do rate limiting on an AF_XDP port, another
way is to use OpenFlow meter actions.

>
> > You said that your goal for the next version is to improve performance
> > and add optimizations.  Do you think that is important before we merge
> > the series?  We can continue to improve performance after it is
> > merged.
>
> The previous patch was rather unstable and I could not get it running
> with the PVP test without crashing. I think this patchset should get
> some proper testing and reviews by others. Especially for all the
> features being marked as supported in the above-mentioned table.
>

Yes, Tim has been helping a lot to test this and I have a couple of
new fixes. I will incorporate them into the next version.

> > If we set performance aside, do you have a reason to want to wait to
> > merge this?  (I wasn't able to easily apply this series to current
> > master, so it'll need at least a rebase before we apply it.  And I
> > have
> > only skimmed it, not fully reviewed it.)
>
> Other than the items above, do we really need another datapath? With

This uses the same datapath as OVS-DPDK, the userspace datapath.
So we don't introduce another datapath; we introduce a new netdev type.

> this, we use two or more cores for processing packets. If we poll two
> physical ports it could be 300%, which is a typical use case with
> bonding. What about multiple queue support, does it work? Both in kernel

Right, this patchset only allows 1 pmd and 1 queue.
I'm adding multiqueue support.

> and DPDK mode we use multiple queues to distribute the load, with this
> scenario does it double the number of CPUs used? Can we use the poll()
> mode as explained here,
> https://linuxplumbersconf.org/event/2/contributions/99/, and how will it
> work with multiple queues/pmd threads? What about any latency tests, is
> it worse or better than kernel/dpdk? Also with the AF_XDP datapath,
> there is no way to leverage hardware offload, like DPDK and TC. And then
> there is the part that it only works on the most recent kernels.

You have lots of good points here.
My experiments show that it's slower than DPDK, but much faster than
the kernel datapath.

>
> To me looking at this I would say it’s far from being ready to be
> merged into OVS. However, if others decide to go ahead I think it should
> be disabled, not compiled in by default.
>
I agree. This should be an experimental feature, and we're adding
something like
# ./configure --enable-afxdp
so it is not compiled in by default.

Thanks
William
William Tu April 17, 2019, 5:16 p.m. UTC | #10
Hi Eelco,
Thanks for trying this patchset!

On Wed, Apr 17, 2019 at 7:26 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 17 Apr 2019, at 14:01, Eelco Chaudron wrote:
>
> > Hi William,
> >
> > I think you applied the following patch to get it to compile? Or did
> > you copy in the kernel headers?
> >
> > https://www.spinics.net/lists/netdev/msg563507.html
>
Right. I applied the patch to get it to compile.
I should document the installation steps better in the next version.

> I noticed you duplicated the macros, which resulted in all kinds of
> compile errors. So I removed them, applied the two patches above, which
> would get me to the next step.
>
> I’m building it with DPDK enabled and it was causing all kinds of
> duplicate definition errors as the kernel and DPDK re-use some structure
> names.

Sorry about that. I will fix it in the next version.

>
> To get it all compiled and working I had to make the following changes:
>
> $ git diff
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index b3bf2f044..47fb3342a 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -295,7 +295,7 @@ netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
>       uint32_t idx_rx = 0, idx_fq = 0;
>       int ret = 0;
>
> -    unsigned int non_afxdp;
> +    unsigned int non_afxdp = 0;
>
>       /* See if there is any packet on RX queue,
>        * if yes, idx_rx is the index having the packet.
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 47153dc60..77f2150ab 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -24,7 +24,7 @@
>   #include <unistd.h>
>   #include <linux/virtio_net.h>
>   #include <sys/socket.h>
> -#include <linux/if.h>
> +//#include <linux/if.h>
>
>   #include <rte_bus_pci.h>
>   #include <rte_config.h>
> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> index 8df8fa451..a2ed1a136 100644
> --- a/lib/xdpsock.h
> +++ b/lib/xdpsock.h
> @@ -28,7 +28,7 @@
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
> -#include <net/ethernet.h>
> +//#include <net/ethernet.h>
>   #include <sys/resource.h>
>   #include <sys/socket.h>
>   #include <sys/mman.h>
> @@ -43,14 +43,6 @@
>   #include "ovs-atomic.h"
>   #include "openvswitch/thread.h"
>
> -/* bpf/xsk.h uses the following macros not defined in OVS,
> - * so re-define them before include.
> - */
> -#define unlikely OVS_UNLIKELY
> -#define likely OVS_LIKELY
> -#define barrier() __asm__ __volatile__("": : :"memory")
> -#define smp_rmb() barrier()
> -#define smp_wmb() barrier()
>   #include <bpf/xsk.h>
>
> In addition you need to do “make install_headers” from kernel libbpf
> and copy the libbpf_util.h manually.
>
> I was able to do a simple physical port in same physical port out test
> without crashing, but the numbers seem low:

This is probably due to some log printing.
I have a couple of optimizations and will share them soon.

Regards,
William
>
> $ ovs-ofctl dump-flows ovs_pvp_br0
>   cookie=0x0, duration=210.344s, table=0, n_packets=1784692,
> n_bytes=2694884920, in_port=eno1 actions=IN_PORT
>
> "Physical loopback test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,128,256,512,768,1024,1514
> 100,77574,77329,76605,76417,75539,75252,74617
>
> The above is using two cores, but with a single DPDK core I get the
> following (on the same machine):
>
> "Physical loopback test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,128,256,512,768,1024,1514
> 100,9527075,8445852,4528935,2349597,1586276,1197304,814854
>
> For the kernel datapath the numbers are:
>
> "Physical loopback test, L3 flows[port redirect]"
> ,Packet size
> Number of flows,64,128,256,512,768,1024,1514
> 100,4862995,5521870,4528872,2349596,1586277,1197305,814854
>
> But keep in mind it uses roughly 550/610/520/380/180/140/110% of the CPU
> for the respective packet size.
William Tu April 17, 2019, 5:36 p.m. UTC | #11
On Wed, Apr 17, 2019 at 9:47 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Wed, Apr 17, 2019 at 10:09:53AM +0200, Eelco Chaudron wrote:
> > On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
> > > AF_XDP is a faster way to access the existing kernel devices.  If we
> > > take that point of view, then it would be ideal if AF_XDP were
> > > automatically used when it was available, instead of adding a new
> > > network device type.  Is there a reason that this point of view is
> > > wrong?  That is, when AF_XDP is available, is there a reason not to use
> > > it?
> >
> > This needs support by all the ingress and egress ports in the system, and
> > currently, there is no API to check this.
>
> Do you mean for performance or for some other reason?  I would suspect
> that, if AF_XDP was not available, then everything would still work OK
> via AF_PACKET, just slower.
>
> > There are also features like traffic shaping that will not work. Maybe it
> > will be worth adding the table for AF_XDP in
> > http://docs.openvswitch.org/en/latest/faq/releases/
>
> AF_XDP is comparable to DPDK/userspace, not to the Linux kernel
> datapath.
>
> The table currently conflates the userspace datapath with the DPDK
> network device.  I believe that the only entry there that depends on the
> DPDK network device is the one for policing.  It could be replaced by a
> [*] with a note like this:
>
>         YES - for DPDK network devices.
>         NO - for system or AF_XDP network devices.
>
> > > You said that your goal for the next version is to improve performance
> > > and add optimizations.  Do you think that is important before we merge
> > > the series?  We can continue to improve performance after it is merged.
> >
> > The previous patch was rather unstable and I could not get it running with
> > the PVP test without crashing. I think this patchset should get some proper
> > testing and reviews by others. Especially for all the features being marked
> > as supported in the above-mentioned table.
>
> If it's unstable, we should fix that before adding it in.

Agree.
My first goal is to make sure people can at least run
$ make check-afxdp
This uses a virtual device, veth in XDP skb-mode, to run the various
OVS test cases. The performance will be bad, but it verifies
correctness.

Regards,
William

>
> However, the bar is lower for new features that don't break existing
> features, especially optional ones and ones that can be easily be
> removed if they don't work out in the end.  DPDK support was considered
> "experimental" for a long time, it's possible that AF_XDP would be in
> the same boat for a while.
>
> > > If we set performance aside, do you have a reason to want to wait to
> > > merge this?  (I wasn't able to easily apply this series to current
> > > master, so it'll need at least a rebase before we apply it.  And I have
> > > only skimmed it, not fully reviewed it.)
> >
> > Other than the items above, do we really need another datapath?
>
> It's less than a new datapath.  It's a new network device
> implementation.
>
> > With this, we use two or more cores for processing packets. If we poll
> > two physical ports it could be 300%, which is a typical use case with
> > bonding. What about multiple queue support, does it work? Both in
> > kernel and DPDK mode we use multiple queues to distribute the load,
> > with this scenario does it double the number of CPUs used? Can we use
> > the poll() mode as explained here,
> > https://linuxplumbersconf.org/event/2/contributions/99/, and how will
> > it work with multiple queues/pmd threads? What about any latency
> > tests, is it worse or better than kernel/dpdk? Also with the AF_XDP
> > datapath, there is no way to leverage hardware offload, like DPDK and
> > TC. And then there is the part that it only works on the most recent
> > kernels.
>
> These are good questions.  William will have some of the answers.
>
> > To me looking at this I would say it’s far from being ready to be merged
> > into OVS. However, if others decide to go ahead I think it should be
> > disabled, not compiled in by default.
>
> Yes, that seems reasonable to me.
William Tu April 17, 2019, 5:39 p.m. UTC | #12
On Wed, Apr 17, 2019 at 9:48 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Wed, Apr 17, 2019 at 12:16:59PM +0200, Eelco Chaudron wrote:
> > One other thing that popped up in my head is how (will) it work together
> > with DPDK enabled on the same system?
>
> Why not?

It works OK with OVS-DPDK.
For example, I can create a br0, attach an af_xdp port and also attach
a dpdk port to it. (I tested using dpdk vhost port, not physical one).

The performance is lower than using two dpdk ports due to some packet
copying from one to another.

Regards,
William
Eelco Chaudron April 18, 2019, 8:20 a.m. UTC | #13
On 17 Apr 2019, at 19:39, William Tu wrote:

> On Wed, Apr 17, 2019 at 9:48 AM Ben Pfaff <blp@ovn.org> wrote:
>>
>> On Wed, Apr 17, 2019 at 12:16:59PM +0200, Eelco Chaudron wrote:
>>> One other thing that popped up in my head is how (will) it work 
>>> together
>>> with DPDK enabled on the same system?
>>
>> Why not?

My view is that if it’s not tested, it’s not working…

>
> It works OK with OVS-DPDK.
> For example, I can create a br0, attach an af_xdp port and also attach
> a dpdk port to it. (I tested using dpdk vhost port, not physical one).
>
> The performance is lower than using two dpdk ports due to some packet
> copying from one to another.

This is because to send to the DPDK ports you use the shared queue, which
might block on a mutex. Sending from a DPDK port to XDP might be worse,
as the PMD might stall due to the syscall required.


I’ll try to do some more tests on this combination once I return from 
PTO.
Eelco Chaudron April 18, 2019, 8:25 a.m. UTC | #14
On 17 Apr 2019, at 19:16, William Tu wrote:

> Hi Eelco,
> Thanks for trying this patchset!
<SNIP>

>> In addition you need to do “make install_headers” from kernel 
>> libbpf
>> and copy the libbpf_util.h manually.
>>
>> I was able to do a simple physical port in same physical port out 
>> test
>> without crashing, but the numbers seem low:
>
> This probably is due to some log printing.
> I have a couple of optimizations, will share it soon.

I do not see any additional logging, and even with all logging disabled, 
I get the same numbers.
Will continue my testing once back from PTO. If you want me to test 
something without sending a new patch set, just let me know what to 
change.

>>
>> $ ovs-ofctl dump-flows ovs_pvp_br0
>>   cookie=0x0, duration=210.344s, table=0, n_packets=1784692,
>> n_bytes=2694884920, in_port=eno1 actions=IN_PORT
>>
>> "Physical loopback test, L3 flows[port redirect]"
>> ,Packet size
>> Number of flows,64,128,256,512,768,1024,1514
>> 100,77574,77329,76605,76417,75539,75252,74617
>>
>> The above is using two cores, but with a single DPDK core I get the
>> following (on the same machine):
>>
>> "Physical loopback test, L3 flows[port redirect]"
>> ,Packet size
>> Number of flows,64,128,256,512,768,1024,1514
>> 100,9527075,8445852,4528935,2349597,1586276,1197304,814854
>>
>> For the kernel datapath the numbers are:
>>
>> "Physical loopback test, L3 flows[port redirect]"
>> ,Packet size
>> Number of flows,64,128,256,512,768,1024,1514
>> 100,4862995,5521870,4528872,2349596,1586277,1197305,814854
>>
>> But keep in mind it uses roughly 550/610/520/380/180/140/110% of the 
>> CPU
>> for the respective packet size.
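
To put the three 64-byte results quoted above side by side, here is a small
sketch that computes the ratios. The helper name ratio is only for
illustration, and the numbers are copied from the 64-byte column of the
three tables:

```shell
# ratio A B: print A / B with one decimal place
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

afxdp=77574      # AF_XDP netdev, two cores
dpdk=9527075     # DPDK, single core
kernel=4862995   # kernel datapath

echo "DPDK / AF_XDP:   $(ratio "$dpdk" "$afxdp")"
echo "kernel / AF_XDP: $(ratio "$kernel" "$afxdp")"
```

With these inputs the ratios come out at about 122.8 and 62.7.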

<SNIP>
Eelco Chaudron April 18, 2019, 8:39 a.m. UTC | #15
On 17 Apr 2019, at 18:47, Ben Pfaff wrote:

> On Wed, Apr 17, 2019 at 10:09:53AM +0200, Eelco Chaudron wrote:
>> On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
>>> AF_XDP is a faster way to access the existing kernel devices.  If we
>>> take that point of view, then it would be ideal if AF_XDP were
>>> automatically used when it was available, instead of adding a new
>>> network device type.  Is there a reason that this point of view is
>>> wrong?  That is, when AF_XDP is available, is there a reason not to 
>>> use
>>> it?
>>
>> This needs support by all the ingress and egress ports in the system, 
>> and
>> currently, there is no API to check this.
>
> Do you mean for performance or for some other reason?  I would suspect
> that, if AF_XDP was not available, then everything would still work OK
> via AF_PACKET, just slower.

Yes, it will become slower and people will not understand why. For
example, it's easy to combine kernel, XDP and DPDK ports, but receiving
on one and transmitting on another becomes slow. It could even impact the
DPDK/XDP performance, as syscalls now need to be executed, stalling the
PMD loop and causing packets to be dropped.

Maybe we should add something about this in the documentation, for 
example, that the kernel receive loop is done in the main thread, 
XDP/DPDK in dedicated PMD threads, etc. etc.
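
Until the documentation covers this, the existing dpif-netdev commands give
a rough picture of where each port is polled. A sketch, assuming the AF_XDP
ports show up in the same output as DPDK ports (something this thread does
not confirm):

```shell
# Show which PMD thread (core) polls each receive queue
ovs-appctl dpif-netdev/pmd-rxq-show
# Show per-thread packet and cycle statistics
ovs-appctl dpif-netdev/pmd-stats-show
```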

>> There are also features like traffic shaping that will not work. 
>> Maybe it
>> will be worth adding the table for AF_XDP in
>> http://docs.openvswitch.org/en/latest/faq/releases/
>
> AF_XDP is comparable to DPDK/userspace, not to the Linux kernel
> datapath.
>
> The table currently conflates the userspace datapath with the DPDK
> network device.  I believe that the only entry there that depends on 
> the
> DPDK network device is the one for policing.  It could be replaced by 
> a
> [*] with a note like this:
>
>         YES - for DPDK network devices.
>         NO - for system or AF_XDP network devices.

This would work; I just want to make sure it’s tested rather than assuming
it will work, as there might be corner cases.

>>> You said that your goal for the next version is to improve 
>>> performance
>>> and add optimizations.  Do you think that is important before we 
>>> merge
>>> the series?  We can continue to improve performance after it is 
>>> merged.
>>
>> The previous patch was rather unstable and I could not get it running 
>> with
>> the PVP test without crashing. I think this patchset should get some 
>> proper
>> testing and reviews by others. Especially for all the features being 
>> marked
>> as supported in the above-mentioned table.
>
> If it's unstable, we should fix that before adding it in.
>
> However, the bar is lower for new features that don't break existing
> features, especially optional ones and ones that can easily be
> removed if they don't work out in the end.  DPDK support was 
> considered
> "experimental" for a long time, it's possible that AF_XDP would be in
> the same boat for a while.

That’s fine, as long as there are some serious reviews of it. I’ll work
on it once I return from PTO, but I guess more reviewers would be welcome.

>>> If we set performance aside, do you have a reason to want to wait to
>>> merge this?  (I wasn't able to easily apply this series to current
>>> master, so it'll need at least a rebase before we apply it.  And I 
>>> have
>>> only skimmed it, not fully reviewed it.)
>>
>> Other than the items above, do we really need another datapath?
>
> It's less than a new datapath.  It's a new network device
> implementation.

Sorry, yes, in OVS terminology it is…

>> With this, we use two or more cores for processing packets. If we 
>> poll
>> two physical ports it could be 300%, which is a typical use case with
>> bonding. What about multiple queue support, does it work? Both in
>> kernel and DPDK mode we use multiple queues to distribute the load,
>> with this scenario does it double the number of CPUs used? Can we use
>> the poll() mode as explained here,
>> https://linuxplumbersconf.org/event/2/contributions/99/, and how will
>> it work with multiple queues/pmd threads? What about any latency
>> tests, is it worse or better than kernel/dpdk? Also with the AF_XDP
>> datapath, there is no way to leverage hardware offload, like DPDK and
>> TC. And then there is the part that it only works on the most recent
>> kernels.
>
> These are good questions.  William will have some of the answers.
>
>> To me looking at this I would say it’s far from being ready to be 
>> merged
>> into OVS. However, if others decide to go ahead I think it should be
>> disabled, not compiled in by default.
>
> Yes, that seems reasonable to me.
Eelco Chaudron April 18, 2019, 8:59 a.m. UTC | #16
On 17 Apr 2019, at 19:09, William Tu wrote:

> On Wed, Apr 17, 2019 at 1:09 AM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>>
>>
>> On 16 Apr 2019, at 21:55, Ben Pfaff wrote:
>>
>>> On Mon, Apr 01, 2019 at 03:46:48PM -0700, William Tu wrote:
>>>> The patch series introduces AF_XDP support for OVS netdev.
>>>> AF_XDP is a new address family working together with eBPF.
>>>> In short, a socket with AF_XDP family can receive and send
>>>> packets from an eBPF/XDP program attached to the netdev.
>>>> For more details about AF_XDP, please see linux kernel's
>>>> Documentation/networking/af_xdp.rst
>>>
>>> I'm glad to see some more revisions of this series!
>>
>> I’m planning on reviewing and testing this patch, I’ll try to 
>> start
>> it this week, or else when I get back from PTO.
>>
>>> AF_XDP is a faster way to access the existing kernel devices.  If we
>>> take that point of view, then it would be ideal if AF_XDP were
>>> automatically used when it was available, instead of adding a new
>>> network device type.  Is there a reason that this point of view is
>>> wrong?  That is, when AF_XDP is available, is there a reason not to
>>> use
>>> it?
>>
>> This needs support by all the ingress and egress ports in the system,
>> and currently, there is no API to check this.
>
> Not necessarily all ports.
> On an OVS switch, you can have some ports supporting AF_XDP,
> and some ports are other types, ex: DPDK vhost, or tap.

But I’m wondering how you would deal with ports not supporting this at
driver level. Will you fall back to skb style, and will you report this
(as it’s interesting to know from a performance perspective)?
Guess I just need to look at your code :)

>>
>> There are also features like traffic shaping that will not work. 
>> Maybe
>> it will be worth adding the table for AF_XDP in
>> http://docs.openvswitch.org/en/latest/faq/releases/
>
> Right, when using AF_XDP, we don't have QoS support.
> If people want to do rate limiting on a AF_XDP port, another
> way is to use OpenFlow meter actions.

That for me was the only thing that stood out, but I just want to make
sure no other things were abstracted in the DPDK APIs…

I guess you could use the DPDK meters framework to get the same support as
DPDK; the only catch is that DPDK needs to be enabled as well.
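
For reference, the OpenFlow meter route mentioned above could look roughly
like this. The meter ID, the rate, and the port name enp2s0 are made-up
values, and how well meters behave on an afxdp port is untested here:

```shell
# Meter 1: drop traffic above 1000 packets per second
ovs-ofctl -O OpenFlow13 add-meter br0 'meter=1 pktps bands=type=drop rate=1000'
# Send traffic arriving on the AF_XDP port through the meter, then forward normally
ovs-ofctl -O OpenFlow13 add-flow br0 'in_port=enp2s0 actions=meter:1,NORMAL'
```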

>>
>>> You said that your goal for the next version is to improve 
>>> performance
>>> and add optimizations.  Do you think that is important before we 
>>> merge
>>> the series?  We can continue to improve performance after it is
>>> merged.
>>
>> The previous patch was rather unstable and I could not get it running
>> with the PVP test without crashing. I think this patchset should get
>> some proper testing and reviews by others. Especially for all the
>> features being marked as supported in the above-mentioned table.
>>
>
> Yes, Tim has been helping a lot to test this and I have a couple of
> new fixes. I will incorporate into next version.

Cool, I’ll talk to Tim offline, in addition, copy me on the next patch 
and I’ll check it out.
Do you have a time frame, so I can do the review based on that revision?

>>> If we set performance aside, do you have a reason to want to wait to
>>> merge this?  (I wasn't able to easily apply this series to current
>>> master, so it'll need at least a rebase before we apply it.  And I
>>> have
>>> only skimmed it, not fully reviewed it.)
>>
>> Other than the items above, do we really need another datapath? With
>
> This is using the same datapath, the userspace datapath, as OVS-DPDK.
> So we don't introduce another datapath, we introduce a new netdev 
> type.

My fault, I was not referring to the OVS data path definition ;)

>> this, we use two or more cores for processing packets. If we poll two
>> physical ports it could be 300%, which is a typical use case with
>> bonding. What about multiple queue support, does it work? Both in 
>> kernel
>
> Yes, this patchset only allows 1 pmd and 1 queue.
> I'm adding the multiqueue support.

We need some alignment here on how we add threads for XDP PMDs vs DPDK PMDs.
If there are not enough cores for both, the system will not start
(EMERGENCY exit). Users might also want to control which cores run
DPDK and which run XDP.

>> and DPDK mode we use multiple queues to distribute the load, with 
>> this
>> scenario does it double the number of CPUs used? Can we use the 
>> poll()
>> mode as explained here,
>> https://linuxplumbersconf.org/event/2/contributions/99/, and how will 
>> it
>> work with multiple queues/pmd threads? What about any latency tests, 
>> is
>> it worse or better than kernel/dpdk? Also with the AF_XDP datapath,
>> there is no way to leverage hardware offload, like DPDK and TC. And then
>> there is the part that it only works on the most recent kernels.
>
> You have lots of good points here.
> My experiments show that it's slower than DPDK, but much faster than
> kernel.

Looking forward to your improvement patch, as for me it’s about 10x slower
than the kernel with a single queue (see other email).

>>
>> To me looking at this I would say it’s far from being ready to be
>> merged into OVS. However, if others decide to go ahead I think it 
>> should
>> be disabled, not compiled in by default.
>>
> I agree. This should be experimental feature and we're adding s.t like
> #./configure --enable-afxdp
> so not compiled in by default
>
> Thanks
> William
Eelco Chaudron April 18, 2019, 9:08 a.m. UTC | #17
On 2 Apr 2019, at 0:46, William Tu wrote:

> The patch series introduces AF_XDP support for OVS netdev.
> AF_XDP is a new address family working together with eBPF.
> In short, a socket with AF_XDP family can receive and send
> packets from an eBPF/XDP program attached to the netdev.
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst
>
> OVS has a couple of netdev types, i.e., system, tap, or
> internal.  The patch first adds a new netdev types called
> "afxdp", and implement its configuration, packet reception,
> and transmit functions.  Since the AF_XDP socket, xsk,
> operates in userspace, once ovs-vswitchd receives packets
> from xsk, the proposed architecture re-uses the existing
> userspace dpif-netdev datapath.  As a result, most of
> the packet processing happens at the userspace instead of
> linux kernel.

One other issue I found is that if an XDP program is already attached,
due to a crash or a previously loaded one, adding the port will hang.
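
As a workaround until the port setup handles this, the leftover program can
be detached by hand with iproute2 before adding the port. A sketch, with
enp2s0 as a placeholder device name:

```shell
# Look for an attached program ("xdp" or "prog/xdp id ..." in the output)
ip link show dev enp2s0
# Detach the XDP program from the device
ip link set dev enp2s0 xdp off
# If it was loaded in generic (skb) mode, the mode-specific form can be used
ip link set dev enp2s0 xdpgeneric off
```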

<SNIP>
William Tu April 18, 2019, 10:11 p.m. UTC | #18
Hi Eelco,
Thanks for your feedback!

>
> > Not necessarily all ports.
> > On an OVS switch, you can have some ports supporting AF_XDP,
> > and some ports are other types, ex: DPDK vhost, or tap.
>
> But I’m wondering how would you deal with ports not supporting this at
> driver level?
> Will you fall back to skb style, will you report this (as it’s
> interesting to know from a performance level).
> Guess I just need to look at your code :)
>

I'm adding an option for this when adding the port,
something like options:xdpmode=drv or options:xdpmode=skb.

I put the patch here:
https://github.com/williamtu/ovs-ebpf/commit/ef2bfe15db55ecd629cdb75cbc90c7be613745e3


>
> >>
> >> There are also features like traffic shaping that will not work.
> >> Maybe
> >> it will be worth adding the table for AF_XDP in
> >> http://docs.openvswitch.org/en/latest/faq/releases/
> >
> > Right, when using AF_XDP, we don't have QoS support.
> > If people want to do rate limiting on a AF_XDP port, another
> > way is to use OpenFlow meter actions.
>
> That for me was the only thing that stood out, but just want to make
> sure no other things were abstracted in the DPDK APIs…
>
> Guess you could use the DPDK meters framework to support the same as
> DPDK, the only thing is that you need enablement of DPDK also.
>
>

Right. We can try
./configure --with-dpdk --with-afxdp


> >>
> >>> You said that your goal for the next version is to improve
> >>> performance
> >>> and add optimizations.  Do you think that is important before we
> >>> merge
> >>> the series?  We can continue to improve performance after it is
> >>> merged.
> >>
> >> The previous patch was rather unstable and I could not get it running
> >> with the PVP test without crashing. I think this patchset should get
> >> some proper testing and reviews by others. Especially for all the
> >> features being marked as supported in the above-mentioned table.
> >>
> >
> > Yes, Tim has been helping a lot to test this and I have a couple of
> > new fixes. I will incorporate into next version.
>
> Cool, I’ll talk to Tim offline, in addition, copy me on the next patch
> and I’ll check it out.
> Do you have a time frame, so I can do the review based on that revision?
>

OK, I plan to incorporate your and Tim's feedback and resubmit the next
version next Monday (4/22).


> >>> If we set performance aside, do you have a reason to want to wait to
> >>> merge this?  (I wasn't able to easily apply this series to current
> >>> master, so it'll need at least a rebase before we apply it.  And I
> >>> have
> >>> only skimmed it, not fully reviewed it.)
> >>
> >
> > Yes, this patchset only allows 1 pmd and 1 queue.
> > I'm adding the multiqueue support.
>
> We need some alignment here on how we add threads for PMDs XDP vs DPDK.
> If there are not enough cores for both the system will not start
> (EMERGENCY exit). And user also might want to control which cores run
> DPDK and which XDP.
>

Yes, my plan is to use the same command-line interface as OVS-DPDK:
pmd-cpu-mask and pmd-rxq-affinity.

For example, with 4 pmds:
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
// AF_XDP uses 2
# ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
options:n_rxq=2 options:xdpmode=drv other_config:pmd-rxq-affinity="0:1,1:2"
// another DPDK device can use another 2 pmds
# ovs-vsctl add-port br0 vhost-user-1 -- set interface vhost-user-1 \
type="dpdkvhostuser" other_config:pmd-rxq-affinity="0:3,1:4"
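
Since pmd-cpu-mask is a hex bitmask (bit N selects core N) and
pmd-rxq-affinity takes queue:core pairs, it helps to decode the mask when
picking the affinity cores. A small sketch; mask_to_cores is a hypothetical
helper, not an OVS tool:

```shell
# mask_to_cores MASK: print the core IDs whose bits are set in MASK
mask_to_cores() {
    mask=$(( $1 ))      # shell arithmetic accepts 0x-prefixed hex
    core=0
    cores=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            cores="${cores:+$cores }$core"
        fi
        mask=$(( mask >> 1 ))
        core=$(( core + 1 ))
    done
    echo "$cores"
}

mask_to_cores 0x36   # prints: 1 2 4 5
mask_to_cores 0x1e   # prints: 1 2 3 4
```

A mask of 0x36 selects cores 1, 2, 4 and 5, while cores 1 through 4
correspond to 0x1e, so the core IDs given in pmd-rxq-affinity are worth
checking against the mask.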

>
> >> and DPDK mode we use multiple queues to distribute the load, with
> >> this
> >> scenario does it double the number of CPUs used? Can we use the
> >> poll()
> >> mode as explained here,
> >> https://linuxplumbersconf.org/event/2/contributions/99/, and how will
> >> it
> >> work with multiple queues/pmd threads? What about any latency tests,
> >> is
> >> it worse or better than kernel/dpdk? Also with the AF_XDP datapath,
> >> there is no way to leverage hardware offload, like DPDK and TC. And then
> >> there is the part that it only works on the most recent kernels.
> >
> > You have lots of good points here.
> > My experiments show that it's slower than DPDK, but much faster than
> > kernel.
>
> Looking forward to your improvement patch, as for me it’s about 10x slower
> than the kernel with a single queue (see other email).
>

Thanks
Regards,
William


>
> >>
> >> To me looking at this I would say it’s far from being ready to be
> >> merged into OVS. However, if others decide to go ahead I think it
> >> should
> >> be disabled, not compiled in by default.
> >>
> > I agree. This should be experimental feature and we're adding s.t like
> > #./configure --enable-afxdp
> > so not compiled in by default
> >
> > Thanks
> > William
>
Eelco Chaudron April 19, 2019, 9:54 a.m. UTC | #19
On 19 Apr 2019, at 0:11, William Tu wrote:

> Hi Eelco,
> Thanks for your feedback!
>
>>
>>> Not necessarily all ports.
>>> On an OVS switch, you can have some ports supporting AF_XDP,
>>> and some ports are other types, ex: DPDK vhost, or tap.
>>
>> But I’m wondering how would you deal with ports not supporting this 
>> at
>> driver level?
>> Will you fall back to skb style, will you report this (as it’s
>> interesting to know from a performance level).
>> Guess I just need to look at your code :)
>>
>
> I'm adding an option for this when adding the port,
> something like options:xdpmode=drv or options:xdpmode=skb.
>
> I put the patch here:
> https://github.com/williamtu/ovs-ebpf/commit/ef2bfe15db55ecd629cdb75cbc90c7be613745e3

Nice! Will review your next patch in detail!
>>
>>>>
>>>> There are also features like traffic shaping that will not work.
>>>> Maybe
>>>> it will be worth adding the table for AF_XDP in
>>>> http://docs.openvswitch.org/en/latest/faq/releases/
>>>
>>> Right, when using AF_XDP, we don't have QoS support.
>>> If people want to do rate limiting on a AF_XDP port, another
>>> way is to use OpenFlow meter actions.
>>
>> That for me was the only thing that stood out, but just want to make
>> sure no other things were abstracted in the DPDK APIs…
>>
>> Guess you could use the DPDK meters framework to support the same as
>> DPDK, the only thing is that you need enablement of DPDK also.
>>
> Right. We can try
> ./configure --with-dpdk --with-afxdp

Yes, this way policing is supported; if compiled without DPDK, it’s not.
I guess we need to give some thought to how to warn about this, etc.

>>>>
>>>>> You said that your goal for the next version is to improve
>>>>> performance
>>>>> and add optimizations.  Do you think that is important before we
>>>>> merge
>>>>> the series?  We can continue to improve performance after it is
>>>>> merged.
>>>>
>>>> The previous patch was rather unstable and I could not get it 
>>>> running
>>>> with the PVP test without crashing. I think this patchset should 
>>>> get
>>>> some proper testing and reviews by others. Especially for all the
>>>> features being marked as supported in the above-mentioned table.
>>>>
>>>
>>> Yes, Tim has been helping a lot to test this and I have a couple of
>>> new fixes. I will incorporate into next version.
>>
>> Cool, I’ll talk to Tim offline, in addition, copy me on the next 
>> patch
>> and I’ll check it out.
>> Do you have a time frame, so I can do the review based on that 
>> revision?
>>
> OK, I plan to incorporate your and Tim's feedback and resubmit the next
> version next Monday (4/22).

I’m back from PTO on the 30th, so take whatever time you need…

>>>>> If we set performance aside, do you have a reason to want to wait 
>>>>> to
>>>>> merge this?  (I wasn't able to easily apply this series to current
>>>>> master, so it'll need at least a rebase before we apply it.  And I
>>>>> have
>>>>> only skimmed it, not fully reviewed it.)
>>>>
>>>
>>> Yes, this patchset only allows 1 pmd and 1 queue.
>>> I'm adding the multiqueue support.
>>
>> We need some alignment here on how we add threads for PMDs XDP vs 
>> DPDK.
>> If there are not enough cores for both the system will not start
>> (EMERGENCY exit). And user also might want to control which cores run
>> DPDK and which XDP.
>>
>
> Yes, my plan is to use the same commandline interface as OVS-DPDK
> The pmd-cpu-mask and pmd-rxq-affinity
>
> For example, with 4 pmds:
> # ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> // AF_XDP uses 2
> # ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> options:n_rxq=2 options:xdpmode=drv other_config:pmd-rxq-affinity="0:1,1:2"
> // another DPDK device can use another 2 pmds
> # ovs-vsctl add-port br0 vhost-user-1 -- set interface vhost-user-1 \
> type="dpdkvhostuser" other_config:pmd-rxq-affinity="0:3,1:4"

In real life, people might not always use pmd-rxq-affinity, especially 
with people now looking into dynamic re-assignment based on traffic 
patterns.

However, the real problem I was referring to is how to assign specific
cores to DPDK PMDs vs AFXDP PMDs.

First, if you enable DPDK and AFXDP and only have a single core, OVS
crashes (forced exit). I think it should just warn in the log and continue.
Secondly, there is no control over which core is used by which type. If
you have two hyperthreading pairs, you might want to use one sibling set
for AFXDP and one for DPDK. I am also not talking about NUMA awareness yet,
which I think needs to be taken care of as well.
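
Pending real support for this, picking the cores by hand means looking up
the sibling and NUMA layout in sysfs and turning the chosen core IDs into a
pmd-cpu-mask. A sketch under those assumptions: cores_to_mask is a
hypothetical helper, enp2s0 and the core numbers are placeholders, and this
only builds the shared mask. It does not solve the per-type (DPDK vs AFXDP)
assignment raised above.

```shell
# cores_to_mask CORE...: build a pmd-cpu-mask style hex bitmask from core IDs
cores_to_mask() {
    mask=0
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '0x%x\n' "$mask"
}

# Hyperthread siblings of core 1 and the NUMA node of the NIC can be read
# from sysfs on the target host (left commented out, as they are host-specific):
#   cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
#   cat /sys/class/net/enp2s0/device/numa_node

cores_to_mask 1 2    # prints 0x6
cores_to_mask 1 5    # prints 0x22
```

The resulting value can then be passed as other_config:pmd-cpu-mask, as in
the quoted example.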

>>
>>>> and DPDK mode we use multiple queues to distribute the load, with
>>>> this
>>>> scenario does it double the number of CPUs used? Can we use the
>>>> poll()
>>>> mode as explained here,
>>>> https://linuxplumbersconf.org/event/2/contributions/99/, and how 
>>>> will
>>>> it
>>>> work with multiple queues/pmd threads? What about any latency 
>>>> tests,
>>>> is
>>>> it worse or better than kernel/dpdk? Also with the AF_XDP datapath,
>>>> there is no way to leverage hardware offload, like DPDK and TC. And
>>>> then
>>>> there is the part that it only works on the most recent kernels.
>>>
>>> You have lots of good points here.
>>> My experiments show that it's slower than DPDK, but much faster than
>>> kernel.
>>
>> Looking forward to your improvement patch, as for me it’s about 10x
>> slower than the kernel with a single queue (see other email).
>>
>
> Thanks
> Regards,
> William
>
>
>>
>>>>
>>>> To me looking at this I would say it’s far from being ready to be
>>>> merged into OVS. However, if others decide to go ahead I think it
>>>> should
>>>> be disabled, not compiled in by default.
>>>>
>>> I agree. This should be experimental feature and we're adding s.t 
>>> like
>>> #./configure --enable-afxdp
>>> so not compiled in by default
>>>
>>> Thanks
>>> William
>>