[ovs-dev,v4,0/5] XDP offload using flow API provider

Message ID: 20200731025514.1669061-1-toshiaki.makita1@gmail.com

Toshiaki Makita July 31, 2020, 2:55 a.m. UTC
This patch set adds an XDP-based flow cache using the OVS netdev-offload
flow API provider.  When XDP offload is enabled on an OVS device, packets
are first processed in the XDP flow cache (with parsing and table lookup
implemented in eBPF), and on a hit the actions are also executed in the
XDP context, which has minimal overhead.

This provider is built on top of William's recently posted patch for
loading custom XDP programs.  When a custom XDP program is loaded, the
provider detects whether the program supports the classifier, and if so
it starts offloading flows to the XDP program.

The patches are derived from xdp_flow[1], a similar mechanism implemented
in the kernel.


* Motivation

While the userspace datapath using netdev-afxdp or netdev-dpdk shows good
performance, there are use cases where packets are better processed in the
kernel, for example TCP/IP connections or container-to-container
connections.  The current solution is to use a tap device or af_packet,
with extra kernel-to/from-userspace overhead.  With XDP, a better solution
is to steer packets early in the XDP program and decide whether to send
them to the userspace datapath or keep them in the kernel.

One problem with the current netdev-afxdp is that it forwards all packets
to userspace.  The first patch from William (netdev-afxdp: Enable loading
XDP program.) only provides the interface to load an XDP program; however,
users usually don't know how to write their own XDP program.

XDP also supports HW-offload, so it may be possible to offload flows to
HW through this provider in the future, although not currently.
The reason is that map-in-map is required for our program to support a
classifier with subtables in XDP, but map-in-map is not offloadable.
If map-in-map becomes offloadable, HW-offload of our program may also
become possible.
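
To illustrate why, here is a minimal sketch in C of how a classifier with
subtables can be laid out in an XDP program.  This is not the actual
layout used by bpf/flowtable_afxdp.c; the struct names, sizes and map
sizes below are hypothetical placeholders.  It only shows the outer
array-of-maps (the map-in-map) that HW offload cannot currently express:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical key/action layouts; the real program packs a miniflow. */
struct xdp_flow_key { __u8 data[64]; };
struct xdp_flow_actions { __u8 data[64]; };

/* Inner map template: one hash table per classifier subtable. */
struct subtable_map {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, struct xdp_flow_key);
    __type(value, struct xdp_flow_actions);
} subtable0 SEC(".maps");

/* Outer array-of-maps indexed by subtable id.  This map-in-map is the
 * part that HW offload cannot handle today. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 16);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
    __array(values, struct subtable_map);
} flow_table SEC(".maps") = {
    .values = { [0] = &subtable0 },
};

/* Try each subtable until a flow entry matches.  Per-subtable mask
 * application is omitted for brevity.  Bounded loops like this need a
 * 5.3+ kernel verifier. */
static __always_inline struct xdp_flow_actions *
lookup_flow(const struct xdp_flow_key *masked_key)
{
    for (__u32 i = 0; i < 16; i++) {
        void *subtable = bpf_map_lookup_elem(&flow_table, &i);

        if (!subtable) {
            continue;
        }

        struct xdp_flow_actions *acts = bpf_map_lookup_elem(subtable,
                                                            masked_key);
        if (acts) {
            return acts;
        }
    }
    return NULL;
}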


* How to use

1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
   kernel >= 5.3.

2. Build with --enable-afxdp --enable-xdp-offload.
--enable-xdp-offload will generate the XDP program "bpf/flowtable_afxdp.o".
Note that the BPF object is not installed anywhere by "make install" at
this point.
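
A typical build sequence from a source checkout would then be (a sketch;
other configure options are omitted and depend on your environment):
$ ./boot.sh
$ ./configure --enable-afxdp --enable-xdp-offload
$ make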

3. Load the custom XDP program, e.g.:
$ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
  options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
$ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
  options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"

4. Enable XDP_REDIRECT
If you use veth devices, make sure to load some (possibly dummy) XDP
program on the peer of each veth device (i.e. the other end of the veth
pair, not the port added to the bridge in step 3).  This patch set
includes a program which does nothing but return XDP_PASS.  You can
attach it to the veth peer like this:
$ ip link set <peer of veth1> xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp

Some HW NIC drivers require as many queues as there are cores on the
system.  Tweak the number of queues using "ethtool -L".
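For example, on a 16-core machine (device name and queue count are only
illustrative):
$ ethtool -L eth0 combined 16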

5. Enable hw-offload
$ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
This starts offloading flows to the XDP program.

You should be able to see some maps installed, including "debug_stats".
$ bpftool map

If packets are successfully redirected by the XDP program,
debug_stats[2] will be incremented.
$ bpftool map dump id <ID of debug_stats>

Currently only a very limited set of keys and output actions is supported.
For example, NORMAL action entries and IP-based matching work with the
current key support.  VLAN actions used by port tags/trunks are also
supported.
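
For example, a single NORMAL flow in the OpenFlow table (the bridge name
below is illustrative; an OVS bridge has this flow by default) is enough
to exercise the offload path for plain IP traffic:
$ ovs-ofctl add-flow ovsbr0 actions=NORMAL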


* Performance

Tested 2 cases: 1) i40e to veth, 2) i40e to i40e.
Test 1 measured the drop rate at the veth interface, with a redirect action
from the physical interface (i40e 25G NIC, XXV710) to veth.  The CPU is a
Xeon Silver 4114 (2.20 GHz).
                                                               XDP_DROP
                    +------+                      +-------+    +-------+
 pktgen -- wire --> | eth0 | -- NORMAL ACTION --> | veth0 |----| veth2 |
                    +------+                      +-------+    +-------+

Test 2 uses i40e instead of veth, and measures the TX packet rate at the
output device.

Single-flow performance test results:

1) i40e-veth

  a) no-zerocopy in i40e

    - xdp   3.7 Mpps
    - afxdp 980 kpps

  b) zerocopy in i40e (veth does not have zc)

    - xdp   1.9 Mpps
    - afxdp 980 Kpps

2) i40e-i40e

  a) no-zerocopy

    - xdp   3.5 Mpps
    - afxdp 1.5 Mpps

  b) zerocopy

    - xdp   2.0 Mpps
    - afxdp 4.4 Mpps

** xdp is better when zc is disabled. The reason for the poor zc
   performance is that XDP_REDIRECT to another device requires packet
   memory allocation and a memcpy into an xdp_frame when zc is enabled.

** afxdp with zc is better than xdp without zc, but afxdp uses 2 cores in
   this case: one for the pmd and one for softirq. When the pmd and
   softirq ran on the same core, performance was extremely poor because
   the pmd consumes the whole CPU. I also tested afxdp-nonpmd to run
   softirq and userspace processing on the same core, but the result was
   lower than (pmd result) / 2.
   With nonpmd, xdp performance was the same as xdp with pmd, which means
   xdp only uses one core (for softirq). Even with pmd, we need only one
   pmd for xdp, even when we want to use more cores for multi-flow.


This patch set is based on top of commit e8bf77748 ("odp-util: Fix clearing
match mask if set action is partially unnecessary.").

To make review easier, I left the pre-squashed commits from v3 here:
https://github.com/tmakita/ovs/compare/xdp_offload_v3...tmakita:xdp_offload_v4_history?expand=1

[1] https://lwn.net/Articles/802653/

v4:
- Fix checkpatch errors.
- Fix duplicate flow api register.
- Don't call unnecessary flow api init callbacks when default flow api
  provider can be used.
- Fix typo in comments.
- Improve bpf Makefile.am to support automatic dependencies.
- Add a dummy XDP program for veth peers.
- Rename netdev_info to netdev_xdp_info.
- Use id-pool for free subtable entry management and devmap indexes.
- Rename --enable-bpf to --enable-xdp-offload.
- Compile xdp flow api provider only with --enable-xdp-offload.
- Tested again and updated performance numbers in the cover letter (got
  slightly better numbers).

v3:
- Use ".ovs_meta" section to inform vswitchd of metadata like supported
  keys.
- Rewrite action loop logic in bpf to support multiple actions.
- Add missing linux/types.h in acinclude.m4, as per William Tu.
- Fix infinite reconfiguration loop when xsks_map is missing.
- Add vlan-related actions in bpf program.
- Fix CI build error.
- Fix inability to delete subtable entries.

v2:
- Add uninit callback of netdev-offload-xdp.
- Introduce "offload-driver" other_config to specify offload driver.
- Add --enable-bpf (HAVE_BPF) config option to build bpf programs.
- Workaround incorrect UINTPTR_MAX in x64 clang bpf build.
- Fix boot.sh autoconf warning.


Toshiaki Makita (4):
  netdev-offload: Add "offload-driver" other_config to specify offload
    driver
  netdev-offload: Add xdp flow api provider
  bpf: Add reference XDP program implementation for netdev-offload-xdp
  bpf: Add dummy program for veth devices

William Tu (1):
  netdev-afxdp: Enable loading XDP program.

 .travis.yml                           |    2 +-
 Documentation/intro/install/afxdp.rst |   59 ++
 Makefile.am                           |    9 +-
 NEWS                                  |    2 +
 acinclude.m4                          |   60 ++
 bpf/.gitignore                        |    4 +
 bpf/Makefile.am                       |   83 ++
 bpf/bpf_compiler.h                    |   25 +
 bpf/bpf_miniflow.h                    |  179 ++++
 bpf/bpf_netlink.h                     |   63 ++
 bpf/bpf_workaround.h                  |   28 +
 bpf/flowtable_afxdp.c                 |  585 ++++++++++++
 bpf/xdp_noop.c                        |   31 +
 configure.ac                          |    2 +
 lib/automake.mk                       |    8 +
 lib/bpf-util.c                        |   38 +
 lib/bpf-util.h                        |   22 +
 lib/netdev-afxdp.c                    |  373 +++++++-
 lib/netdev-afxdp.h                    |    3 +
 lib/netdev-linux-private.h            |    5 +
 lib/netdev-offload-provider.h         |    8 +-
 lib/netdev-offload-xdp.c              | 1213 +++++++++++++++++++++++++
 lib/netdev-offload-xdp.h              |   49 +
 lib/netdev-offload.c                  |   42 +
 24 files changed, 2881 insertions(+), 12 deletions(-)
 create mode 100644 bpf/.gitignore
 create mode 100644 bpf/Makefile.am
 create mode 100644 bpf/bpf_compiler.h
 create mode 100644 bpf/bpf_miniflow.h
 create mode 100644 bpf/bpf_netlink.h
 create mode 100644 bpf/bpf_workaround.h
 create mode 100644 bpf/flowtable_afxdp.c
 create mode 100644 bpf/xdp_noop.c
 create mode 100644 lib/bpf-util.c
 create mode 100644 lib/bpf-util.h
 create mode 100644 lib/netdev-offload-xdp.c
 create mode 100644 lib/netdev-offload-xdp.h

Comments

Toshiaki Makita Aug. 15, 2020, 1:54 a.m. UTC | #1
Ping.
Any feedback is welcome.

Thanks,
Toshiaki Makita

Toshiaki Makita Oct. 30, 2020, 5:19 a.m. UTC | #2
Hi all,

It's been about 3 months since I submitted this patch set.
Could someone review it?
Or should I resubmit the patch set on top of the current master?

Thanks,
Toshiaki Makita

Numan Siddique Nov. 2, 2020, 9:37 a.m. UTC | #3
On Fri, Oct 30, 2020 at 10:49 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> Hi all,
>
> It's about 3 months since I submitted this patch set.
> Could someone review this?
> Or should I resubmit the patch set on the top of current master?

Since the patches don't apply cleanly, I think you can rebase and
repost them and/or provide the
ovs commit id on top of which these patches apply cleanly.

Thanks
Numan


Toshiaki Makita Nov. 2, 2020, 4:36 p.m. UTC | #4
On 2020/11/02 18:37, Numan Siddique wrote:
> On Fri, Oct 30, 2020 at 10:49 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> Hi all,
>>
>> It's about 3 months since I submitted this patch set.
>> Could someone review this?
>> Or should I resubmit the patch set on the top of current master?
> 
> Since the patches don't apply cleanly, I think you can rebase and
> repost them and/or provide the
> ovs commit id on top of which these patches apply cleanly.

Hi Numan,

Thank you for the advice!

The patches are based on top of commit e8bf77748 ("odp-util: Fix clearing
match mask if set action is partially unnecessary.").
Actually I provided this information at the bottom of the cover letter.

Also you can pull the changes from
https://github.com/tmakita/ovs.git (xdp_offload_v4 branch).

Thanks,
Toshiaki Makita

William Tu Feb. 4, 2021, 5:36 p.m. UTC | #5
Hi Toshiaki,

Thanks for the patch. I've been testing it for a couple of days.
I liked it a lot! The compile and build process works without any issues.

On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> This patch adds an XDP-based flow cache using the OVS netdev-offload
> flow API provider.  When an OVS device with XDP offload enabled,
> packets first are processed in the XDP flow cache (with parse, and
> table lookup implemented in eBPF) and if hits, the action processing
> are also done in the context of XDP, which has the minimum overhead.
>
> This provider is based on top of William's recently posted patch for
> custom XDP load.  When a custom XDP is loaded, the provider detects if
> the program supports classifier, and if supported it starts offloading
> flows to the XDP program.
>
> The patches are derived from xdp_flow[1], which is a mechanism similar to
> this but implemented in kernel.
>
>
> * Motivation
>
> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
> performance, there are use cases where packets better to be processed in
> kernel, for example, TCP/IP connections, or container to container
> connections.  Current solution is to use tap device or af_packet with
> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
> is to steer packets earlier in the XDP program, and decides to send to
> userspace datapath or stay in kernel.
>
> One problem with current netdev-afxdp is that it forwards all packets to
> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
> program.) only provides the interface to load XDP program, howerver users
> usually don't know how to write their own XDP program.
>
> XDP also supports HW-offload so it may be possible to offload flows to
> HW through this provider in the future, although not currently.
> The reason is that map-in-map is required for our program to support
> classifier with subtables in XDP, but map-in-map is not offloadable.
> If map-in-map becomes offloadable, HW-offload of our program may also
> be possible.

I think it's too far away for XDP to be offloaded into HW and meet OVS's
feature requirements.
There is a research prototype here, FYI.
https://www.usenix.org/conference/osdi20/presentation/brunella

>
>
> * How to use
>
> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>    kernel >= 5.3.
>
> 2. make with --enable-afxdp --enable-xdp-offload
> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that

typo: I think you mean --enable-xdp-offload

> the BPF object will not be installed anywhere by "make install" at this point.
>
> 3. Load custom XDP program
> E.g.
> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>   options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>   options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>
> 4. Enable XDP_REDIRECT
> If you use veth devices, make sure to load some (possibly dummy) programs
> on the peers of veth devices. This patch set includes a program which
> does nothing but returns XDP_PASS. You can use it for the veth peer like
> this:
> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp

I'd suggest not using "veth1" as an example, because in (3) above, people
might think "veth1" is already attached to ovsbr0.
IIUC, here your "veth1" should be the device at the peer inside
another namespace.

>
> Some HW NIC drivers require as many queues as cores on its system. Tweak
> queues using "ethtool -L".
>
> 5. Enable hw-offload
> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> This will starts offloading flows to the XDP program.
>
> You should be able to see some maps installed, including "debug_stats".
> $ bpftool map
>
> If packets are successfully redirected by the XDP program,
> debug_stats[2] will be counted.
> $ bpftool map dump id <ID of debug_stats>
>
> Currently only very limited keys and output actions are supported.
> For example NORMAL action entry and IP based matching work with current
> key support. VLAN actions used by port tag/trunks are also supported.
>

I don't know if this is too much to ask for.
I wonder if you, or we working together, could add at least tunnel
support, e.g. VXLAN?
The current version is a good prototype for people to test an L2/L3
XDP offload switch,
but without a good use case, it's hard to attract more people to
contribute or use it.

From a maintenance point of view, can we add a test case to avoid
regressions? For example, something like "make check-afxdp".
We can have "make check-xdp-offload". I can also help add it.

>
> * Performance
>
> Tested 2 cases. 1) i40e to veth, 2) i40e to i40e.
> Test 1 Measured drop rate at veth interface with redirect action from
> physical interface (i40e 25G NIC, XXV 710) to veth. The CPU is Xeon
> Silver 4114 (2.20 GHz).
>                                                                XDP_DROP
>                     +------+                      +-------+    +-------+
>  pktgen -- wire --> | eth0 | -- NORMAL ACTION --> | veth0 |----| veth2 |
>                     +------+                      +-------+    +-------+
>
> Test 2 uses i40e instead of veth, and measured tx packet rate at output
> device.
>

Thanks for the performance results. I tested using two 25Gb ConnectX-6
cards with mlx5 and got pretty similar performance results to yours.
But in general, I'm more worried about whether a feature can be
implemented in XDP than about its performance.

> Single-flow performance test results:
>
> 1) i40e-veth
>
>   a) no-zerocopy in i40e
>
>     - xdp   3.7 Mpps
>     - afxdp 980 kpps
>
>   b) zerocopy in i40e (veth does not have zc)
>
>     - xdp   1.9 Mpps
>     - afxdp 980 Kpps
>
> 2) i40e-i40e
>
>   a) no-zerocopy
>
>     - xdp   3.5 Mpps
>     - afxdp 1.5 Mpps
>
>   b) zerocopy
>
>     - xdp   2.0 Mpps
>     - afxdp 4.4 Mpps
>
> ** xdp is better when zc is disabled. The reason of poor performance on zc
>    is that xdp_frame requires packet memory allocation and memcpy on
>    XDP_REDIRECT to other devices iff zc is enabled.
>
> ** afxdp with zc is better than xdp without zc, but afxdp is using 2 cores
>    in this case, one is pmd and the other is softirq. When pmd and softirq
>    were running on the same core, the performance was extremely poor as
>    pmd consumes cpus. I also tested afxdp-nonpmd to run softirq and
>    userspace processing on the same core, but the result was lower than
>    (pmd results) / 2.
>    With nonpmd, xdp performance was the same as xdp with pmd. This means
>    xdp only uses one core (for softirq only). Even with pmd, we need only
>    one pmd for xdp even when we want to use more cores for multi-flow.
>

Since XDP is evolving, it would be helpful to point out some future work and
current limitations. As far as I know:
1) No broadcast/multicast support in XDP. Patch below:
https://lwn.net/Articles/817582/
2) No large packet support (ex: TSO)
3) No HW checksum offload
Will continue reviewing the following individual patches.
Regards,
William
Toshiaki Makita Feb. 9, 2021, 9:39 a.m. UTC | #6
On 2021/02/05 2:36, William Tu wrote:
> Hi Toshiaki,
> 
> Thanks for the patch. I've been testing it for a couple days.
> I liked it a lot! The compile and build process all work without any issues.

Hi, thank you for reviewing!
Sorry for taking time to reply. It took time to remember every detail of the patch set...

> On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> This patch adds an XDP-based flow cache using the OVS netdev-offload
>> flow API provider.  When an OVS device with XDP offload enabled,
>> packets first are processed in the XDP flow cache (with parse, and
>> table lookup implemented in eBPF) and if hits, the action processing
>> are also done in the context of XDP, which has the minimum overhead.
>>
>> This provider is based on top of William's recently posted patch for
>> custom XDP load.  When a custom XDP is loaded, the provider detects if
>> the program supports classifier, and if supported it starts offloading
>> flows to the XDP program.
>>
>> The patches are derived from xdp_flow[1], which is a mechanism similar to
>> this but implemented in kernel.
>>
>>
>> * Motivation
>>
>> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
>> performance, there are use cases where packets better to be processed in
>> kernel, for example, TCP/IP connections, or container to container
>> connections.  Current solution is to use tap device or af_packet with
>> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
>> is to steer packets earlier in the XDP program, and decides to send to
>> userspace datapath or stay in kernel.
>>
>> One problem with current netdev-afxdp is that it forwards all packets to
>> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
>> program.) only provides the interface to load XDP program, howerver users
>> usually don't know how to write their own XDP program.
>>
>> XDP also supports HW-offload so it may be possible to offload flows to
>> HW through this provider in the future, although not currently.
>> The reason is that map-in-map is required for our program to support
>> classifier with subtables in XDP, but map-in-map is not offloadable.
>> If map-in-map becomes offloadable, HW-offload of our program may also
>> be possible.
> 
> I think it's too far away for XDP to be offloaded into HW and meet OVS's
> feature requirements.

I don't know of blockers other than map-in-map, but there are probably more.
If you can provide explicit blockers, I can add them to the cover letter.
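
For reference, the map-in-map construct I mean is roughly the following.
This is a hand-written sketch with illustrative sizes and a toy key, not
the actual definitions from this series, and it assumes a libbpf new
enough for BTF-defined map-in-map:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One hash map per subtable (masked flow key -> action index), stored in
 * an outer array-of-maps so subtables can be added at runtime.  This
 * two-level construct is the part that is not offloadable today. */

struct flow_key {                     /* toy key, far smaller than the real one */
    __u32 in_port;
    __u32 ipv4_src;
    __u32 ipv4_dst;
    __u8  ip_proto;
    __u8  pad[3];
};

struct subtable {                     /* template describing each inner map */
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct flow_key);
    __type(value, __u32);             /* e.g. index into an actions map */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 16);          /* illustrative subtable limit */
    __uint(key_size, sizeof(__u32));
    __array(values, struct subtable);
} flow_table SEC(".maps");

static __always_inline __u32 *lookup(struct flow_key *masked_key, __u32 idx)
{
    /* Two-level lookup: the outer array gives the subtable, the inner
     * hash map gives the flow entry. */
    void *subtable = bpf_map_lookup_elem(&flow_table, &idx);

    return subtable ? bpf_map_lookup_elem(subtable, masked_key) : 0;
}

SEC("xdp")
int xdp_classify(struct xdp_md *ctx)
{
    struct flow_key key = { .in_port = ctx->ingress_ifindex };
    __u32 i;

    /* The real program masks the key per subtable; this only shows the
     * lookup shape. */
    for (i = 0; i < 16; i++) {
        if (lookup(&key, i))
            return XDP_PASS;          /* actions would be executed here */
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";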

> There is a research prototype here, FYI.
> https://www.usenix.org/conference/osdi20/presentation/brunella

This is a presentation about FPGA, not HW offload to SmartNIC, right?

>>
>>
>> * How to use
>>
>> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>>     kernel >= 5.3.
>>
>> 2. make with --enable-afxdp --enable-xdp-offload
>> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
> 
> typo: I think you mean --enable-xdp-offload

Thanks.

> 
>> the BPF object will not be installed anywhere by "make install" at this point.
>>
>> 3. Load custom XDP program
>> E.g.
>> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>>    options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>>    options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>
>> 4. Enable XDP_REDIRECT
>> If you use veth devices, make sure to load some (possibly dummy) programs
>> on the peers of veth devices. This patch set includes a program which
>> does nothing but returns XDP_PASS. You can use it for the veth peer like
>> this:
>> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp
> 
> I'd suggest not using "veth1" as an example, because in (3) above, people
> might think "veth1" is already attached to ovsbr0.
> IIUC, here your "veth1" should be the device at the peer inside
> another namespace.

Sure, will rename it.

>>
>> Some HW NIC drivers require as many queues as cores on its system. Tweak
>> queues using "ethtool -L".
>>
>> 5. Enable hw-offload
>> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
>> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>> This will starts offloading flows to the XDP program.
>>
>> You should be able to see some maps installed, including "debug_stats".
>> $ bpftool map
>>
>> If packets are successfully redirected by the XDP program,
>> debug_stats[2] will be counted.
>> $ bpftool map dump id <ID of debug_stats>
>>
>> Currently only very limited keys and output actions are supported.
>> For example NORMAL action entry and IP based matching work with current
>> key support. VLAN actions used by port tag/trunks are also supported.
>>
> 
> I don't know if this is too much to ask for.
> I wonder if you, or we can work together, to add at least a tunnel
> support, ex: vxlan?
> The current version is a good prototype for people to test an L2/L3
> XDP offload switch,
> but without a good use case, it's hard to attract more people to
> contribute or use it.

I think we have discussed this before.
Vxlan or other tunneling is indeed important, but that's not straightforward.
Push is easy, but pop is not. Pop requires two rules and recirculation.
Recirculation is highly likely to hit the eBPF 1M insn limit in the verifier.
One possible solution is to combine two rules into one and insert it instead,
but I have not verified whether this can work or how to implement it.
Can we leave this for now and address it in follow-up patches?

>  From a maintenance point of view, can we add a test case to avoid regression?
> For example, something like "make check-afxdp".
> We can have "make check-xdp-offload". I can also help adding it.

OK, let me think about what kind of test we can do...

>>
>> * Performance
>>
>> Tested 2 cases. 1) i40e to veth, 2) i40e to i40e.
>> Test 1 Measured drop rate at veth interface with redirect action from
>> physical interface (i40e 25G NIC, XXV 710) to veth. The CPU is Xeon
>> Silver 4114 (2.20 GHz).
>>                                                                 XDP_DROP
>>                      +------+                      +-------+    +-------+
>>   pktgen -- wire --> | eth0 | -- NORMAL ACTION --> | veth0 |----| veth2 |
>>                      +------+                      +-------+    +-------+
>>
>> Test 2 uses i40e instead of veth, and measured tx packet rate at output
>> device.
>>
> 
> Thanks for the performance results. I tested using two 25Gb ConnectX-6
> cards, using mlx5. I got pretty similar performance results as yours.
> But in general, I'm more worried about whether a feature can be implemented
> in XDP than its performance.
> 
>> Single-flow performance test results:
>>
>> 1) i40e-veth
>>
>>    a) no-zerocopy in i40e
>>
>>      - xdp   3.7 Mpps
>>      - afxdp 980 kpps
>>
>>    b) zerocopy in i40e (veth does not have zc)
>>
>>      - xdp   1.9 Mpps
>>      - afxdp 980 Kpps
>>
>> 2) i40e-i40e
>>
>>    a) no-zerocopy
>>
>>      - xdp   3.5 Mpps
>>      - afxdp 1.5 Mpps
>>
>>    b) zerocopy
>>
>>      - xdp   2.0 Mpps
>>      - afxdp 4.4 Mpps
>>
>> ** xdp is better when zc is disabled. The reason of poor performance on zc
>>     is that xdp_frame requires packet memory allocation and memcpy on
>>     XDP_REDIRECT to other devices iff zc is enabled.
>>
>> ** afxdp with zc is better than xdp without zc, but afxdp is using 2 cores
>>     in this case, one is pmd and the other is softirq. When pmd and softirq
>>     were running on the same core, the performance was extremely poor as
>>     pmd consumes cpus. I also tested afxdp-nonpmd to run softirq and
>>     userspace processing on the same core, but the result was lower than
>>     (pmd results) / 2.
>>     With nonpmd, xdp performance was the same as xdp with pmd. This means
>>     xdp only uses one core (for softirq only). Even with pmd, we need only
>>     one pmd for xdp even when we want to use more cores for multi-flow.
>>
> 
> Since XDP is evolving, it would be helpful to point out some future work and
> current limitations. As far as I know
> 1) No broadcast/multicast support in XDP. Patch below:
> https://lwn.net/Articles/817582/
> 2) No large packet support (ex: TSO)
> 3) No HW checksum offload

Will add them, thanks!

Toshiaki Makita
William Tu Feb. 16, 2021, 1:49 a.m. UTC | #7
On Tue, Feb 9, 2021 at 1:39 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2021/02/05 2:36, William Tu wrote:
> > Hi Toshiaki,
> >
> > Thanks for the patch. I've been testing it for a couple days.
> > I liked it a lot! The compile and build process all work without any issues.
>
> Hi, thank you for reviewing!
> Sorry for taking time to reply. It took time to remember every detail of the patch set...
>
> > On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
> > <toshiaki.makita1@gmail.com> wrote:
> >>
> >> This patch adds an XDP-based flow cache using the OVS netdev-offload
> >> flow API provider.  When an OVS device with XDP offload enabled,
> >> packets first are processed in the XDP flow cache (with parse, and
> >> table lookup implemented in eBPF) and if hits, the action processing
> >> are also done in the context of XDP, which has the minimum overhead.
> >>
> >> This provider is based on top of William's recently posted patch for
> >> custom XDP load.  When a custom XDP is loaded, the provider detects if
> >> the program supports classifier, and if supported it starts offloading
> >> flows to the XDP program.
> >>
> >> The patches are derived from xdp_flow[1], which is a mechanism similar to
> >> this but implemented in kernel.
> >>
> >>
> >> * Motivation
> >>
> >> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
> >> performance, there are use cases where packets better to be processed in
> >> kernel, for example, TCP/IP connections, or container to container
> >> connections.  Current solution is to use tap device or af_packet with
> >> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
> >> is to steer packets earlier in the XDP program, and decides to send to
> >> userspace datapath or stay in kernel.
> >>
> >> One problem with current netdev-afxdp is that it forwards all packets to
> >> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
> >> program.) only provides the interface to load XDP program, howerver users
> >> usually don't know how to write their own XDP program.
> >>
> >> XDP also supports HW-offload so it may be possible to offload flows to
> >> HW through this provider in the future, although not currently.
> >> The reason is that map-in-map is required for our program to support
> >> classifier with subtables in XDP, but map-in-map is not offloadable.
> >> If map-in-map becomes offloadable, HW-offload of our program may also
> >> be possible.
> >
> > I think it's too far away for XDP to be offloaded into HW and meet OVS's
> > feature requirements.
>
> I don't know blockers other than map-in-map, but probably there are more.
> If you can provide explicit blockers I can add them in the cover letter.

It's hard to list them when we don't have a full OVS datapath
implemented in XDP.
Here are a couple of things I can imagine. How would HW-offloaded XDP support:
- AF_XDP sockets? The XSK map contains XSK socket fds; how would an
  offloaded program exchange those fds with the host kernel?
  (see the sketch after this list)
- redirects to another netdev?
- helper functions such as adjust_head for pushing the outer header?
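
To make the AF_XDP point concrete, the usual (non-offloaded) pattern looks
like the sketch below (illustrative code, not taken from this series).
User space creates the AF_XDP sockets and inserts them into an XSKMAP, and
the XDP program redirects by rx queue index; it's unclear to me how a
HW-offloaded program would reference those host-side sockets at all.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);            /* one slot per rx queue, illustrative */
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_upcall(struct xdp_md *ctx)
{
    __u32 qid = ctx->rx_queue_index;

    /* Redirect to the AF_XDP socket bound to this rx queue; with flags
     * == 0 this returns XDP_ABORTED if no socket is registered. */
    return bpf_redirect_map(&xsks_map, qid, 0);
}

char _license[] SEC("license") = "GPL";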

>
> > There is a research prototype here, FYI.
> > https://www.usenix.org/conference/osdi20/presentation/brunella
>
> This is a presentation about FPGA, not HW offload to SmartNIC, right?
>
Yes, that's for offloading to FPGA.

> >>
> >>
> >> * How to use
> >>
> >> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
> >>     kernel >= 5.3.
> >>
> >> 2. make with --enable-afxdp --enable-xdp-offload
> >> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
> >
> > typo: I think you mean --enable-xdp-offload
>
> Thanks.
>
> >
> >> the BPF object will not be installed anywhere by "make install" at this point.
> >>
> >> 3. Load custom XDP program
> >> E.g.
> >> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
> >>    options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> >> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
> >>    options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
> >>
> >> 4. Enable XDP_REDIRECT
> >> If you use veth devices, make sure to load some (possibly dummy) programs
> >> on the peers of veth devices. This patch set includes a program which
> >> does nothing but returns XDP_PASS. You can use it for the veth peer like
> >> this:
> >> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp
> >
> > I'd suggest not using "veth1" as an example, because in (3) above, people
> > might think "veth1" is already attached to ovsbr0.
> > IIUC, here your "veth1" should be the device at the peer inside
> > another namespace.
>
> Sure, will rename it.
>
> >>
> >> Some HW NIC drivers require as many queues as cores on its system. Tweak
> >> queues using "ethtool -L".
> >>
> >> 5. Enable hw-offload
> >> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
> >> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> >> This will starts offloading flows to the XDP program.
> >>
> >> You should be able to see some maps installed, including "debug_stats".
> >> $ bpftool map
> >>
> >> If packets are successfully redirected by the XDP program,
> >> debug_stats[2] will be counted.
> >> $ bpftool map dump id <ID of debug_stats>
> >>
> >> Currently only very limited keys and output actions are supported.
> >> For example NORMAL action entry and IP based matching work with current
> >> key support. VLAN actions used by port tag/trunks are also supported.
> >>
> >
> > I don't know if this is too much to ask for.
> > I wonder if you, or we can work together, to add at least a tunnel
> > support, ex: vxlan?
> > The current version is a good prototype for people to test an L2/L3
> > XDP offload switch,
> > but without a good use case, it's hard to attract more people to
> > contribute or use it.
>
> I think we have discussed this before.
> Vxlan or other tunneling is indeed important, but that's not straightforward.
> Push is easy, but pop is not. Pop requires two rules and recirculation.
> Recirculation is highly likely to cause eBPF 1M insn limit error.

Recirculation is pretty important. For example, connection tracking also
relies on recirc. Can we break it into multiple programs and use tail calls?
For the recirc action, can we tail call the main eBPF program and let the
packet go through parse/megaflow lookup/action again?

> One possible solution is to combine two rules into one and insert it instead,
I think it's hard to combine two rules into one,
because after recirc the packet might match a variety of flows, and you
basically need to take a cross product of all the matching flows.

> but I have not verified whether this can work or how to implement it.
> Can we leave this and make following patches later?
I'm not sure how others think, but I believe it's better for the first
design to include at least simple vxlan or recirc support so people
are more willing to try it and give feedback.
I also expect there will be big changes and new issues later on when
we add tunnel support, e.g. the flow key structure will grow
and we will need to break the code into smaller bpf programs.

Thanks
William
Toshiaki Makita Feb. 22, 2021, 2:44 p.m. UTC | #8
On 2021/02/16 10:49, William Tu wrote:
> On Tue, Feb 9, 2021 at 1:39 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2021/02/05 2:36, William Tu wrote:
>>> Hi Toshiaki,
>>>
>>> Thanks for the patch. I've been testing it for a couple days.
>>> I liked it a lot! The compile and build process all work without any issues.
>>
>> Hi, thank you for reviewing!
>> Sorry for taking time to reply. It took time to remember every detail of the patch set...
>>
>>> On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
>>> <toshiaki.makita1@gmail.com> wrote:
>>>>
>>>> This patch adds an XDP-based flow cache using the OVS netdev-offload
>>>> flow API provider.  When an OVS device with XDP offload enabled,
>>>> packets first are processed in the XDP flow cache (with parse, and
>>>> table lookup implemented in eBPF) and if hits, the action processing
>>>> are also done in the context of XDP, which has the minimum overhead.
>>>>
>>>> This provider is based on top of William's recently posted patch for
>>>> custom XDP load.  When a custom XDP is loaded, the provider detects if
>>>> the program supports classifier, and if supported it starts offloading
>>>> flows to the XDP program.
>>>>
>>>> The patches are derived from xdp_flow[1], which is a mechanism similar to
>>>> this but implemented in kernel.
>>>>
>>>>
>>>> * Motivation
>>>>
>>>> While userspace datapath using netdev-afxdp or netdev-dpdk shows good
>>>> performance, there are use cases where packets better to be processed in
>>>> kernel, for example, TCP/IP connections, or container to container
>>>> connections.  Current solution is to use tap device or af_packet with
>>>> extra kernel-to/from-userspace overhead.  But with XDP, a better solution
>>>> is to steer packets earlier in the XDP program, and decides to send to
>>>> userspace datapath or stay in kernel.
>>>>
>>>> One problem with current netdev-afxdp is that it forwards all packets to
>>>> userspace, The first patch from William (netdev-afxdp: Enable loading XDP
>>>> program.) only provides the interface to load XDP program, howerver users
>>>> usually don't know how to write their own XDP program.
>>>>
>>>> XDP also supports HW-offload so it may be possible to offload flows to
>>>> HW through this provider in the future, although not currently.
>>>> The reason is that map-in-map is required for our program to support
>>>> classifier with subtables in XDP, but map-in-map is not offloadable.
>>>> If map-in-map becomes offloadable, HW-offload of our program may also
>>>> be possible.
>>>
>>> I think it's too far away for XDP to be offloaded into HW and meet OVS's
>>> feature requirements.
>>
>> I don't know blockers other than map-in-map, but probably there are more.
>> If you can provide explicit blockers I can add them in the cover letter.
> 
> It's hard to list them when we don't have a full OVS datapath
> implemented in XDP.
> Here are a couple things I can imagine. How does HW offloaded support:
> - AF_XDP socket. The XSK map contains XSK fd, how does it exchange
>    the fd to host kernel?
> - how does offloaded XDP redirect to another netdev
> - Helper functions such as adjust_head for pushing the outer header.

Thanks, this is helpful. Will add them.

>>> There is a research prototype here, FYI.
>>> https://www.usenix.org/conference/osdi20/presentation/brunella
>>
>> This is a presentation about FPGA, not HW offload to SmartNIC, right?
>>
> Yes, that's for offloading to FPGA.
> 
>>>>
>>>>
>>>> * How to use
>>>>
>>>> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>>>>      kernel >= 5.3.
>>>>
>>>> 2. make with --enable-afxdp --enable-xdp-offload
>>>> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
>>>
>>> typo: I think you mean --enable-xdp-offload
>>
>> Thanks.
>>
>>>
>>>> the BPF object will not be installed anywhere by "make install" at this point.
>>>>
>>>> 3. Load custom XDP program
>>>> E.g.
>>>> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>>
>>>> 4. Enable XDP_REDIRECT
>>>> If you use veth devices, make sure to load some (possibly dummy) programs
>>>> on the peers of veth devices. This patch set includes a program which
>>>> does nothing but returns XDP_PASS. You can use it for the veth peer like
>>>> this:
>>>> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp
>>>
>>> I'd suggest not using "veth1" as an example, because in (3) above, people
>>> might think "veth1" is already attached to ovsbr0.
>>> IIUC, here your "veth1" should be the device at the peer inside
>>> another namespace.
>>
>> Sure, will rename it.
>>
>>>>
>>>> Some HW NIC drivers require as many queues as cores on its system. Tweak
>>>> queues using "ethtool -L".
>>>>
>>>> 5. Enable hw-offload
>>>> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
>>>> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>>>> This will starts offloading flows to the XDP program.
>>>>
>>>> You should be able to see some maps installed, including "debug_stats".
>>>> $ bpftool map
>>>>
>>>> If packets are successfully redirected by the XDP program,
>>>> debug_stats[2] will be counted.
>>>> $ bpftool map dump id <ID of debug_stats>
>>>>
>>>> Currently only very limited keys and output actions are supported.
>>>> For example NORMAL action entry and IP based matching work with current
>>>> key support. VLAN actions used by port tag/trunks are also supported.
>>>>
>>>
>>> I don't know if this is too much to ask for.
>>> I wonder if you, or we can work together, to add at least a tunnel
>>> support, ex: vxlan?
>>> The current version is a good prototype for people to test an L2/L3
>>> XDP offload switch,
>>> but without a good use case, it's hard to attract more people to
>>> contribute or use it.
>>
>> I think we have discussed this before.
>> Vxlan or other tunneling is indeed important, but that's not straightforward.
>> Push is easy, but pop is not. Pop requires two rules and recirculation.
>> Recirculation is highly likely to cause eBPF 1M insn limit error.
> 
> Recirculation is pretty important. For example connection tracking also
> relies on recirc. Can we break into multiple program and tail call?
> For recirc action, can we tail call the main ebpf program, and let the
> packet goes through parse/megaflow lookup/action?

OK, will try using tail calls.
This will require vswitchd to load another bpf program for the tail calls.
I guess such a program can be specified in the main bpf program's metadata.
I'll check if it works.
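
Something like the following is what I have in mind, as a hand-written
sketch with illustrative names (not code from this patch set): the action
stage tail-calls back into the main classifier through a prog array that
vswitchd populates, so each pass is verified as its own program with its
own 1M-insn budget.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} recirc_progs SEC(".maps");            /* slot 0: main classifier, set by vswitchd */

enum { RECIRC_MAIN = 0 };

SEC("xdp")
int xdp_act_recirc(struct xdp_md *ctx)
{
    /* Tunnel pop would go here, e.g. bpf_xdp_adjust_head(), fixing up
     * offsets and recording tunnel metadata for the next lookup. */

    bpf_tail_call(ctx, &recirc_progs, RECIRC_MAIN);

    /* Only reached if the tail call fails (empty slot or nesting limit). */
    return XDP_DROP;
}

char _license[] SEC("license") = "GPL";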

>> One possible solution is to combine two rules into one and insert it instead,
> I think it's hard to combine two rules into one.
> Because after recirc, the flow might match a variety of flows, and you
> basically need to do a across product of all matching flows.
> 
>> but I have not verified whether this can work or how to implement it.
>> Can we leave this and make following patches later?
> I'm not sure how others think. But I think it's better that in the first
> design, there is at least a simple vxlan or recirc support so people
> are more willing to try it and give feedback.
> I also expect there will be huge change and new issues when
> later on, we add tunnel support. Ex: the flow key structure will increase
> and we need to break into smaller bpf programs.

Yes, we should determine how to avoid insn inflation anyway.
Will check the verifier logic again...

Thanks,
Toshiaki Makita
William Tu Feb. 23, 2021, 6:37 p.m. UTC | #9
> >>> I don't know if this is too much to ask for.
> >>> I wonder if you, or we can work together, to add at least a tunnel
> >>> support, ex: vxlan?
> >>> The current version is a good prototype for people to test an L2/L3
> >>> XDP offload switch,
> >>> but without a good use case, it's hard to attract more people to
> >>> contribute or use it.
> >>
> >> I think we have discussed this before.
> >> Vxlan or other tunneling is indeed important, but that's not straightforward.
> >> Push is easy, but pop is not. Pop requires two rules and recirculation.
> >> Recirculation is highly likely to cause eBPF 1M insn limit error.
> >
> > Recirculation is pretty important. For example connection tracking also
> > relies on recirc. Can we break into multiple program and tail call?
> > For recirc action, can we tail call the main ebpf program, and let the
> > packet goes through parse/megaflow lookup/action?
>
> OK, will try using tail calls.
> This will require vswitchd to load another bpf program for tail calls.
> I guess such a program can be specified in main bpf program meta data.
> I'll check if it works.
>

Do you know whether using bpf function calls helps with the
1M insn limit or the stack limitation?
https://lwn.net/Articles/741773/

IIUC, if we have a flow that requires executing multiple actions,
using bpf function calls can save stack space, but the total
instruction limit is still 1M.
When using tail calls, we have more work to do to break the code into
individual eBPF programs, but each program gets its own 1M insn
and stack budget.

William
Toshiaki Makita Feb. 25, 2021, 3:26 p.m. UTC | #10
On 2021/02/24 3:37, William Tu wrote:
>>>>> I don't know if this is too much to ask for.
>>>>> I wonder if you, or we can work together, to add at least a tunnel
>>>>> support, ex: vxlan?
>>>>> The current version is a good prototype for people to test an L2/L3
>>>>> XDP offload switch,
>>>>> but without a good use case, it's hard to attract more people to
>>>>> contribute or use it.
>>>>
>>>> I think we have discussed this before.
>>>> Vxlan or other tunneling is indeed important, but that's not straightforward.
>>>> Push is easy, but pop is not. Pop requires two rules and recirculation.
>>>> Recirculation is highly likely to cause eBPF 1M insn limit error.
>>>
>>> Recirculation is pretty important. For example connection tracking also
>>> relies on recirc. Can we break into multiple program and tail call?
>>> For recirc action, can we tail call the main ebpf program, and let the
>>> packet goes through parse/megaflow lookup/action?
>>
>> OK, will try using tail calls.
>> This will require vswitchd to load another bpf program for tail calls.
>> I guess such a program can be specified in main bpf program meta data.
>> I'll check if it works.
>>
> 
> Do you know whether using bpf function call helps solving the
> 1M insn limit or stack limitation?
> https://lwn.net/Articles/741773/

Looking at check_func_call() in verifier.c, it seems not.

> IIUC, If we have a flow that requires executing multiple actions,
> using bpf function call can save the stack size, but the total
> instruction limit is still 1M.
> When using tail call, we have more work to do to break into
> individual ebpf program, but each program can have 1M insn
> and stack size.

Yes, I think tail calls do help.
But on second thought, I think I should first identify the verifier logic
that causes so many instructions to be processed. Then we can determine whether
we have another option to mitigate it.
E.g. the problem happens around loops; would loop unrolling help mitigate it or not?

Toshiaki Makita
William Tu Feb. 26, 2021, 1:13 a.m. UTC | #11
On Thu, Feb 25, 2021 at 7:26 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2021/02/24 3:37, William Tu wrote:
> >>>>> I don't know if this is too much to ask for.
> >>>>> I wonder if you, or we can work together, to add at least a tunnel
> >>>>> support, ex: vxlan?
> >>>>> The current version is a good prototype for people to test an L2/L3
> >>>>> XDP offload switch,
> >>>>> but without a good use case, it's hard to attract more people to
> >>>>> contribute or use it.
> >>>>
> >>>> I think we have discussed this before.
> >>>> Vxlan or other tunneling is indeed important, but that's not straightforward.
> >>>> Push is easy, but pop is not. Pop requires two rules and recirculation.
> >>>> Recirculation is highly likely to cause eBPF 1M insn limit error.
> >>>
> >>> Recirculation is pretty important. For example connection tracking also
> >>> relies on recirc. Can we break into multiple program and tail call?
> >>> For recirc action, can we tail call the main ebpf program, and let the
> >>> packet goes through parse/megaflow lookup/action?
> >>
> >> OK, will try using tail calls.
> >> This will require vswitchd to load another bpf program for tail calls.
> >> I guess such a program can be specified in main bpf program meta data.
> >> I'll check if it works.
> >>
> >
> > Do you know whether using bpf function call helps solving the
> > 1M insn limit or stack limitation?
> > https://lwn.net/Articles/741773/
>
> Looking at check_func_call() in verifier.c, it seems not.
>
> > IIUC, If we have a flow that requires executing multiple actions,
> > using bpf function call can save the stack size, but the total
> > instruction limit is still 1M.
> > When using tail call, we have more work to do to break into
> > individual ebpf program, but each program can have 1M insn
> > and stack size.
>
> Yes I think tail calls do help.
> But on the second thought, I think I should identify the verifier logic
> which causes consumption of that many instructions. Then we can determine if
> we have another option to mitigate that.
> E.g. The problem happens around loops. Loop unrolling may help mitigating it or not?
>
I thought the problem was loop unrolling?
Because of loop unrolling, the number of instructions increases a lot
and might exceed the 1M insn limit.
William
Toshiaki Makita Feb. 27, 2021, 3:04 p.m. UTC | #12
On 2021/02/26 10:13, William Tu wrote:
> On Thu, Feb 25, 2021 at 7:26 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2021/02/24 3:37, William Tu wrote:
>>>>>>> I don't know if this is too much to ask for.
>>>>>>> I wonder if you, or we can work together, to add at least a tunnel
>>>>>>> support, ex: vxlan?
>>>>>>> The current version is a good prototype for people to test an L2/L3
>>>>>>> XDP offload switch,
>>>>>>> but without a good use case, it's hard to attract more people to
>>>>>>> contribute or use it.
>>>>>>
>>>>>> I think we have discussed this before.
>>>>>> Vxlan or other tunneling is indeed important, but that's not straightforward.
>>>>>> Push is easy, but pop is not. Pop requires two rules and recirculation.
>>>>>> Recirculation is highly likely to cause eBPF 1M insn limit error.
>>>>>
>>>>> Recirculation is pretty important. For example connection tracking also
>>>>> relies on recirc. Can we break into multiple program and tail call?
>>>>> For recirc action, can we tail call the main ebpf program, and let the
>>>>> packet goes through parse/megaflow lookup/action?
>>>>
>>>> OK, will try using tail calls.
>>>> This will require vswitchd to load another bpf program for tail calls.
>>>> I guess such a program can be specified in main bpf program meta data.
>>>> I'll check if it works.
>>>>
>>>
>>> Do you know whether using bpf function call helps solving the
>>> 1M insn limit or stack limitation?
>>> https://lwn.net/Articles/741773/
>>
>> Looking at check_func_call() in verifier.c, it seems not.
>>
>>> IIUC, If we have a flow that requires executing multiple actions,
>>> using bpf function call can save the stack size, but the total
>>> instruction limit is still 1M.
>>> When using tail call, we have more work to do to break into
>>> individual ebpf program, but each program can have 1M insn
>>> and stack size.
>>
>> Yes I think tail calls do help.
>> But on the second thought, I think I should identify the verifier logic
>> which causes consumption of that many instructions. Then we can determine if
>> we have another option to mitigate that.
>> E.g. The problem happens around loops. Loop unrolling may help mitigating it or not?
>>
> I thought the problem is loop unrolling?

Not really.

> because of loop unrolling, the number of instructions increases a lot,
> and might cause over 1M insn.

The problem is not the program size but the number of insns processed in the verifier.

AFAICS through trial and error, it looks like adding branches outside loops does not 
bloat insns processed in the verifier, but branches inside loops do.

I think the point is state pruning[1].
In fact the logic of state pruning looks more complicated inside loops[2].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f1bca824dabb
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2589726d12a1
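
To illustrate, here is a toy program, not code from the series: the loop
body branches on packet data, so each conditional inside the loop forks
the verifier state, and the "processed insns" count grows with
iterations x branches unless state pruning merges those states.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define ITERS 64

SEC("xdp")
int xdp_loop_demo(struct xdp_md *ctx)
{
    __u8 *data = (__u8 *)(long)ctx->data;
    __u8 *data_end = (__u8 *)(long)ctx->data_end;
    __u32 acc = 0;
    int i;

#pragma unroll
    for (i = 0; i < ITERS; i++) {
        if (data + i + 1 > data_end)    /* bounds check: a branch inside the loop */
            break;
        if (data[i] & 1)                /* data-dependent branch inside the loop */
            acc += data[i];
    }

    return (acc & 1) ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";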

Toshiaki Makita