
[ovs-dev] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Message ID 1614882425-52800-1-git-send-email-u9012063@gmail.com
State RFC
Series [ovs-dev] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Commit Message

William Tu March 4, 2021, 6:27 p.m. UTC
One big problem of netdev-afxdp is that there is no metadata support
from the hardware at all.  For example, OVS netdev-afxdp has to compute
the RX hash or TCP checksum in software, resulting in high performance
overhead.

A generic metadata type for XDP frames using BTF has been proposed[1], and
there are sample implementations[2][3].  This patch experiments with enabling
XDP metadata, also called HW hints, and shows the potential performance
improvement.  The patch uses only the rxhash value provided by the HW,
avoiding the hash calculation in lib/dpif-netdev.c:
    if (!dp_packet_rss_valid(execute->packet)) {
        dp_packet_set_rss_hash(execute->packet,
                               flow_hash_5tuple(execute->flow, 0));
    }

Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
cycles per packet' drops from 402 to 272.  More details are below.

Reference:
----------
[1] https://www.kernel.org/doc/html/latest/bpf/btf.html
[2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
[3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4

Testbed:
--------
Two Xeon E5-2620 v3 2.4GHz connected back-to-back using Mellanox
ConnectX-6Dx 25GbE. Before starting OVS, enable the MD by:
$ bpftool net xdp show
xdp:
enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
$ bpftool net xdp set dev enp2s0f0np0 md_btf on
$ bpftool net xdp
xdp:
enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)

Limitations/TODO:
-----------------
1. Only AF_XDP native mode is supported, not zero-copy mode.
2. Only three fields (vlan, hash, and flow_mark) are supported, and only the
   receive side supports XDP metadata.
3. The control plane (how to enable and probe the metadata structure) is not
   upstream yet.

OVS rxdrop without HW hints:
---------------------------
Drop rate: 4.8Mpps

pmd thread numa_id 0 core_id 3:
  packets received: 196592006
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 196592006
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 0.00
  idle cycles: 56009063835 (41.43%)
  processing cycles: 79164971931 (58.57%)
  avg cycles per packet: 687.59 (135174035766/196592006)
  avg processing cycles per packet: 402.69 (79164971931/196592006)

pmd thread numa_id 0 core_id 3:
  Iterations:           339607649  (0.23 us/it)
  - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
  - idle iterations:    330697002  ( 40.3 % of used cycles)
  - busy iterations:      8910647  ( 59.7 % of used cycles)
  Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
  Datapath passes:      285140031  (1.00 passes/pkt)
  - EMC hits:           285139999  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
  - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:                   0

Perf report:
  17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
  14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
  14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
  10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
  10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
   7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
   5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
   3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
   2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
   2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
   2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
   2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
   1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd@plt

OVS rxdrop with HW hints:
-------------------------
rxdrop rate: 4.73Mpps

pmd thread numa_id 0 core_id 7:
  packets received: 13686880
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 13686880
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 0.00
  idle cycles: 3182105544 (46.02%)
  processing cycles: 3732023844 (53.98%)
  avg cycles per packet: 505.16 (6914129388/13686880)
  avg processing cycles per packet: 272.67 (3732023844/13686880)

pmd thread numa_id 0 core_id 7:

  Iterations:           392909539  (0.18 us/it)
  - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
  - idle iterations:    382539861  ( 46.0 % of used cycles)
  - busy iterations:     10369678  ( 54.0 % of used cycles)
  Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
  Datapath passes:      331829656  (1.00 passes/pkt)
  - EMC hits:           331829656  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
  - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:                   0

Perf record/report:
  22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
  10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
   7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
   7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
   6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
   6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
   5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
   5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
   3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
   3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
   2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
   2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
   2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
   1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
   1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
   1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration

Signed-off-by: William Tu <u9012063@gmail.com>
---
 lib/netdev-afxdp.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Jesper Dangaard Brouer March 5, 2021, 10:13 a.m. UTC | #1
Bjørn and Magnus, please take a look at my questions inlined below.

On Thu,  4 Mar 2021 10:27:05 -0800
William Tu <u9012063@gmail.com> wrote:

> One big problem of netdev-afxdp is that there is no metadata support
> from the hardware at all.  For example, OVS netdev-afxdp has to do rxhash,
> or TCP checksum in software, resulting in high performance overhead.
> 
> A generic meta data type for XDP frame using BTF is proposed[1] and
> there is sample implementation[2][3].  This patch experiments enabling 
> the XDP metadata, or called HW hints, and shows the potential performance
> improvement.  The patch uses only the rxhash value provided from HW,
> so avoiding at the calculation of hash at lib/dpif-netdev.c:
>     if (!dp_packet_rss_valid(execute->packet)) {
>         dp_packet_set_rss_hash(execute->packet,
>                                flow_hash_5tuple(execute->flow, 0));
>     }
> 
> Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
> cycles per packet' drops from 402 to 272.  More details below
> 
> Reference:
> ----------
> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
> 
> Testbed:
> --------
> Two Xeon E5-2620 v3 2.4GHz connected back-to-back using Mellanox
> ConnectX-6Dx 25GbE. Before starting OVS, enable the MD by:
> $ bpftool net xdp show
> xdp:
> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> $ bpftool net xdp
> xdp:
> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
> 
> Limitations/TODO:
> -----------------
> 1. Support only AF_XDP native mode, not zero-copy mode.
> 2. Currently only three fields: vlan, hash, and flow_mark, and only receive
>    side supports XDP metadata.
> 3. Control plane, how to enable and probe the structure, not upstream yet.
> 
> OVS rxdrop without HW hints:
> ---------------------------
> Drop rate: 4.8Mpps
> 
> pmd thread numa_id 0 core_id 3:
>   packets received: 196592006
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 196592006
>   smc hits: 0
>   megaflow hits: 0
>   avg. subtable lookups per megaflow hit: 0.00
>   miss with success upcall: 0
>   miss with failed upcall: 0
>   avg. packets per output batch: 0.00
>   idle cycles: 56009063835 (41.43%)
>   processing cycles: 79164971931 (58.57%)
>   avg cycles per packet: 687.59 (135174035766/196592006)
>   avg processing cycles per packet: 402.69 (79164971931/196592006)
> 
> pmd thread numa_id 0 core_id 3:
>   Iterations:           339607649  (0.23 us/it)
>   - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
>   - idle iterations:    330697002  ( 40.3 % of used cycles)
>   - busy iterations:      8910647  ( 59.7 % of used cycles)
>   Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
>   Datapath passes:      285140031  (1.00 passes/pkt)
>   - EMC hits:           285139999  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>   - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:                   0
> 
> Perf report:
>   17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>   14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>   14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
>   10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
>   10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
>    7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
>    5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
>    3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
>    2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
>    2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
>    2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
>    2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
>    1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd@plt
> 
> OVS rxdrop with HW hints:
> -------------------------
> rxdrop rate: 4.73Mpps
> 
> pmd thread numa_id 0 core_id 7:
>   packets received: 13686880
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 13686880
>   smc hits: 0
>   megaflow hits: 0
>   avg. subtable lookups per megaflow hit: 0.00
>   miss with success upcall: 0
>   miss with failed upcall: 0
>   avg. packets per output batch: 0.00
>   idle cycles: 3182105544 (46.02%)
>   processing cycles: 3732023844 (53.98%)
>   avg cycles per packet: 505.16 (6914129388/13686880)
>   avg processing cycles per packet: 272.67 (3732023844/13686880)
> 
> pmd thread numa_id 0 core_id 7:
> 
>   Iterations:           392909539  (0.18 us/it)
>   - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
>   - idle iterations:    382539861  ( 46.0 % of used cycles)
>   - busy iterations:     10369678  ( 54.0 % of used cycles)
>   Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
>   Datapath passes:      331829656  (1.00 passes/pkt)
>   - EMC hits:           331829656  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>   - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:                   0
> 
> Perf record/report:
>   22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>   10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
>    7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
>    7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
>    6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>    6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
>    5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>    5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
>    3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
>    3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
>    2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
>    2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
>    2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
>    1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
>    1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
>    1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  lib/netdev-afxdp.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index 482400d8d135..49881a8cc0cb 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
>      );
>  };
>  
> +/* FIXME:
> + * This should be done dynamically by query the device's
> + * XDP metadata structure. Ex:
> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
> + */
> +struct xdp_md_desc {

Adding:
       uint32_t btf_id;

Would be valuable IMHO.

> +    uint32_t flow_mark;
> +    uint32_t hash32;
> +    uint16_t vlan;
> +};
> +
>  #ifdef HAVE_XDP_NEED_WAKEUP
>  static inline void
>  xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          struct dp_packet_afxdp *xpacket;
>          const struct xdp_desc *desc;
>          struct dp_packet *packet;
> +        struct xdp_md_desc *md;
>          uint64_t addr, index;
>          uint32_t len;
>          char *pkt;
> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          len = desc->len;
>  
>          pkt = xsk_umem__get_data(umem->buffer, addr);
> +        md = pkt - sizeof *md;

We also want this sizeof-offset to be dynamic.
And I hope that AF_XDP could provide this info, i.e. the size of the
data_meta area.  I need some help from Magnus and Bjørn here.

Looking at the kernel-side code of AF_XDP/XSK, I don't see this being
transferred.  Is this true? Can we add the info?



>          index = addr >> FRAME_SHIFT;
>          xpacket = &umem->xpool.array[index];
>          packet = &xpacket->packet;
> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>                              OVS_XDP_HEADROOM);
>          dp_packet_set_size(packet, len);
>  
> +        /* FIXME: This should be done by detecting whether
> +         * XDP MD is enabled or not. Ex:
> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> +         */
> +        dp_packet_set_rss_hash(packet, md->hash32);

The general idea to make this code more dynamic is to let AF_XDP
provide the meta-data-len and then use the (proposed) btf_id to know
what "callback" C-code to call.

Using a btf_id makes this independent of the NIC driver.  And we can
tell the difference between two BPF structs that contain the same info
but at different offsets. (I'll leave out the details, but I can
explain more if there is interest).

The btf_id also helps when the NIC reconfigures the metadata contents and
there are packets in-flight.


> +
>          /* Add packet into batch, increase batch->count. */
>          dp_packet_batch_add(batch, packet);
>
Björn Töpel March 5, 2021, 1:56 p.m. UTC | #2
On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
> 
> Bjørn and Magnus please take a look at my questions inlined below.
> 
> On Thu,  4 Mar 2021 10:27:05 -0800
> William Tu <u9012063@gmail.com> wrote:
> 
>> One big problem of netdev-afxdp is that there is no metadata support
>> from the hardware at all.  For example, OVS netdev-afxdp has to do rxhash,
>> or TCP checksum in software, resulting in high performance overhead.
>>
>> A generic meta data type for XDP frame using BTF is proposed[1] and
>> there is sample implementation[2][3].  This patch experiments enabling
>> the XDP metadata, or called HW hints, and shows the potential performance
>> improvement.  The patch uses only the rxhash value provided from HW,
>> so avoiding at the calculation of hash at lib/dpif-netdev.c:
>>      if (!dp_packet_rss_valid(execute->packet)) {
>>          dp_packet_set_rss_hash(execute->packet,
>>                                 flow_hash_5tuple(execute->flow, 0));
>>      }
>>
>> Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
>> cycles per packet' drops from 402 to 272.  More details below
>>
>> Reference:
>> ----------
>> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
>> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
>>
>> Testbed:
>> --------
>> Two Xeon E5-2620 v3 2.4GHz connected back-to-back using Mellanox
>> ConnectX-6Dx 25GbE. Before starting OVS, enable the MD by:
>> $ bpftool net xdp show
>> xdp:
>> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
>> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
>> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>> $ bpftool net xdp
>> xdp:
>> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
>>
>> Limitations/TODO:
>> -----------------
>> 1. Support only AF_XDP native mode, not zero-copy mode.
>> 2. Currently only three fields: vlan, hash, and flow_mark, and only receive
>>     side supports XDP metadata.
>> 3. Control plane, how to enable and probe the structure, not upstream yet.
>>
>> OVS rxdrop without HW hints:
>> ---------------------------
>> Drop rate: 4.8Mpps
>>
>> pmd thread numa_id 0 core_id 3:
>>    packets received: 196592006
>>    packet recirculations: 0
>>    avg. datapath passes per packet: 1.00
>>    emc hits: 196592006
>>    smc hits: 0
>>    megaflow hits: 0
>>    avg. subtable lookups per megaflow hit: 0.00
>>    miss with success upcall: 0
>>    miss with failed upcall: 0
>>    avg. packets per output batch: 0.00
>>    idle cycles: 56009063835 (41.43%)
>>    processing cycles: 79164971931 (58.57%)
>>    avg cycles per packet: 687.59 (135174035766/196592006)
>>    avg processing cycles per packet: 402.69 (79164971931/196592006)
>>
>> pmd thread numa_id 0 core_id 3:
>>    Iterations:           339607649  (0.23 us/it)
>>    - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
>>    - idle iterations:    330697002  ( 40.3 % of used cycles)
>>    - busy iterations:      8910647  ( 59.7 % of used cycles)
>>    Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
>>    Datapath passes:      285140031  (1.00 passes/pkt)
>>    - EMC hits:           285139999  (100.0 %)
>>    - SMC hits:                   0  (  0.0 %)
>>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>>    - Lost upcalls:               0  (  0.0 %)
>>    Tx packets:                   0
>>
>> Perf report:
>>    17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>>    14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>    14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
>>    10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
>>    10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
>>     5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
>>     3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
>>     2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
>>     2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
>>     2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
>>     2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
>>     1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd@plt
>>
>> OVS rxdrop with HW hints:
>> -------------------------
>> rxdrop rate: 4.73Mpps
>>
>> pmd thread numa_id 0 core_id 7:
>>    packets received: 13686880
>>    packet recirculations: 0
>>    avg. datapath passes per packet: 1.00
>>    emc hits: 13686880
>>    smc hits: 0
>>    megaflow hits: 0
>>    avg. subtable lookups per megaflow hit: 0.00
>>    miss with success upcall: 0
>>    miss with failed upcall: 0
>>    avg. packets per output batch: 0.00
>>    idle cycles: 3182105544 (46.02%)
>>    processing cycles: 3732023844 (53.98%)
>>    avg cycles per packet: 505.16 (6914129388/13686880)
>>    avg processing cycles per packet: 272.67 (3732023844/13686880)
>>
>> pmd thread numa_id 0 core_id 7:
>>
>>    Iterations:           392909539  (0.18 us/it)
>>    - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
>>    - idle iterations:    382539861  ( 46.0 % of used cycles)
>>    - busy iterations:     10369678  ( 54.0 % of used cycles)
>>    Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
>>    Datapath passes:      331829656  (1.00 passes/pkt)
>>    - EMC hits:           331829656  (100.0 %)
>>    - SMC hits:                   0  (  0.0 %)
>>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>>    - Lost upcalls:               0  (  0.0 %)
>>    Tx packets:                   0
>>
>> Perf record/report:
>>    22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>>    10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
>>     7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
>>     7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
>>     6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>     6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
>>     5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
>>     3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
>>     3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
>>     2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
>>     2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
>>     2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
>>     1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
>>     1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
>>     1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
>>
>> Signed-off-by: William Tu <u9012063@gmail.com>
>> ---
>>   lib/netdev-afxdp.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index 482400d8d135..49881a8cc0cb 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
>>       );
>>   };
>>   
>> +/* FIXME:
>> + * This should be done dynamically by query the device's
>> + * XDP metadata structure. Ex:
>> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
>> + */
>> +struct xdp_md_desc {
> 
> Adding:
>         uint32_t btf_id;
> 
> Would be valuable IMHO.
> 
>> +    uint32_t flow_mark;
>> +    uint32_t hash32;
>> +    uint16_t vlan;
>> +};
>> +
>>   #ifdef HAVE_XDP_NEED_WAKEUP
>>   static inline void
>>   xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
>> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>           struct dp_packet_afxdp *xpacket;
>>           const struct xdp_desc *desc;
>>           struct dp_packet *packet;
>> +        struct xdp_md_desc *md;
>>           uint64_t addr, index;
>>           uint32_t len;
>>           char *pkt;
>> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>           len = desc->len;
>>   
>>           pkt = xsk_umem__get_data(umem->buffer, addr);
>> +        md = pkt - sizeof *md;
> 
> We also want this sizeof-offset to be dynamic.
> And I hope that AF_XDP could provide this info, on the size of
> data_meta area.  I need some help from Magnus and Bjørn here?
> 
> Look at the kernel-side code of AF_XDP/XSK I don't see this being
> transferred? Help is this true? Can we add the info?
>

XDP sockets make sure that the meta data is propagated to userland.
However, the meta data *size* is not passed via the AF_XDP descriptors.

This was by design; the metadata is a contract between the XDP program
and the XDP socket, and in the future maybe the driver as well.

So, we're simply stating that "your AF_XDP application should know if
there's meta data or not, and how it's structured".

I guess we could explore starting to use the options bits of struct
xdp_desc for this, but only if we're *really* seeing a need for it.

If the XDP program would like to pass its length, it can do that via
the metadata area as well.

struct meta {
   int len;
   ...
   ...
};


> 
> 
>>           index = addr >> FRAME_SHIFT;
>>           xpacket = &umem->xpool.array[index];
>>           packet = &xpacket->packet;
>> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>                               OVS_XDP_HEADROOM);
>>           dp_packet_set_size(packet, len);
>>   
>> +        /* FIXME: This should be done by detecting whether
>> +         * XDP MD is enabled or not. Ex:
>> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>> +         */
>> +        dp_packet_set_rss_hash(packet, md->hash32);
> 
> The general idea to make this code more dynamic is to let AF_XDP
> provide the meta-data-len and then use the (proposed) btf_id to know
> what "callback" C-code to call.
>

Hmm, again, here I might have a bit of a different view. Personally, I
think it's better that all of the NIC/XDP/socket negotiation is done in
the control/setup phase, keeping the fast path lean. Instead of:

recv(void *data)
{
    len = get_meta_data_len(data);
    if (len) {
       // check this, check that, if-else-if-else
    }
    ...
}

You'd like to use the length of the metadata to express whether it's
available or not. I'm stating that we don't need that for AF_XDP. In XDP,
if we're accessing the field, we need to validate the access due to the
verifier, but that's not to know whether it's actually there. When we
attach a certain XDP program to a device, we can query what offloads it
supports, and the application knows that the metadata is structured in a
certain way.

Some applications have simple metadata:
struct tstamp { u64 rx; };

That application will unconditionally look at
((struct tstamp *)(pkt - sizeof(struct tstamp)))->rx;.

Some might be more flexible and have:
struct metadata { char storage[64]; int btf_id; };

If you need an "availability check", then you can solve that.

int xdp_prog(struct xdp_md *ctx)
{
    struct metadata *m = (void *)(long)ctx->data_meta;
    void *data = (void *)(unsigned long)ctx->data;
    int ret;

    if ((void *)(m + 1) > data) {
        ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m));
        if (ret < 0)
            return XDP_ABORTED;
        data = (void *)(unsigned long)ctx->data;
        m = (void *)(long)ctx->data_meta;
        if ((void *)(m + 1) > data)
            return XDP_ABORTED;
        m->btf_id = -1; // not valid
    }
    // ... bpf_redirect_map() to socket
}

Would you agree? Or maybe expand a bit on why you need the length.



> Using a btf_id make this independent of the NIC driver.  And we can
> tell the difference between two BPF structs that contain the same info
> but at different offsets. (I'll leave out the details, but I can
> explain more if there is interest).
>

Details, please! :-D Maybe I'm just not following your ideas!


Cheers!
Björn

> The btf_id also helps when NIC reconfigure the meta-data contents, and
> there are packets in-flight.
>
> 
>> +
>>           /* Add packet into batch, increase batch->count. */
>>           dp_packet_batch_add(batch, packet);
>>   
> 
> 
>
Jesper Dangaard Brouer March 5, 2021, 3:07 p.m. UTC | #3
On Fri, 5 Mar 2021 14:56:02 +0100
Björn Töpel <bjorn.topel@intel.com> wrote:

> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
> > 
> > Bjørn and Magnus please take a look at my questions inlined below.
> > 
> > On Thu,  4 Mar 2021 10:27:05 -0800
> > William Tu <u9012063@gmail.com> wrote:
> >   
> >> One big problem of netdev-afxdp is that there is no metadata support
> >> from the hardware at all.  For example, OVS netdev-afxdp has to do rxhash,
> >> or TCP checksum in software, resulting in high performance overhead.
> >>
> >> A generic meta data type for XDP frame using BTF is proposed[1] and
> >> there is sample implementation[2][3].  This patch experiments enabling
> >> the XDP metadata, or called HW hints, and shows the potential performance
> >> improvement.  The patch uses only the rxhash value provided from HW,
> >> so avoiding at the calculation of hash at lib/dpif-netdev.c:
> >>      if (!dp_packet_rss_valid(execute->packet)) {
> >>          dp_packet_set_rss_hash(execute->packet,
> >>                                 flow_hash_5tuple(execute->flow, 0));
> >>      }
> >>
> >> Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
> >> cycles per packet' drops from 402 to 272.  More details below
> >>
> >> Reference:
> >> ----------
> >> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
> >> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
> >> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
> >>
> >> Testbed:
> >> --------
> >> Two Xeon E5-2620 v3 2.4GHz connected back-to-back using Mellanox
> >> ConnectX-6Dx 25GbE. Before starting OVS, enable the MD by:
> >> $ bpftool net xdp show
> >> xdp:
> >> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
> >> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
> >> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> >> $ bpftool net xdp
> >> xdp:
> >> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
> >>
> >> Limitations/TODO:
> >> -----------------
> >> 1. Support only AF_XDP native mode, not zero-copy mode.
> >> 2. Currently only three fields: vlan, hash, and flow_mark, and only receive
> >>     side supports XDP metadata.
> >> 3. Control plane, how to enable and probe the structure, not upstream yet.
> >>
> >> OVS rxdrop without HW hints:
> >> ---------------------------
> >> Drop rate: 4.8Mpps
> >>
> >> pmd thread numa_id 0 core_id 3:
> >>    packets received: 196592006
> >>    packet recirculations: 0
> >>    avg. datapath passes per packet: 1.00
> >>    emc hits: 196592006
> >>    smc hits: 0
> >>    megaflow hits: 0
> >>    avg. subtable lookups per megaflow hit: 0.00
> >>    miss with success upcall: 0
> >>    miss with failed upcall: 0
> >>    avg. packets per output batch: 0.00
> >>    idle cycles: 56009063835 (41.43%)
> >>    processing cycles: 79164971931 (58.57%)
> >>    avg cycles per packet: 687.59 (135174035766/196592006)
> >>    avg processing cycles per packet: 402.69 (79164971931/196592006)
> >>
> >> pmd thread numa_id 0 core_id 3:
> >>    Iterations:           339607649  (0.23 us/it)
> >>    - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
> >>    - idle iterations:    330697002  ( 40.3 % of used cycles)
> >>    - busy iterations:      8910647  ( 59.7 % of used cycles)
> >>    Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
> >>    Datapath passes:      285140031  (1.00 passes/pkt)
> >>    - EMC hits:           285139999  (100.0 %)
> >>    - SMC hits:                   0  (  0.0 %)
> >>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
> >>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
> >>    - Lost upcalls:               0  (  0.0 %)
> >>    Tx packets:                   0
> >>
> >> Perf report:
> >>    17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
> >>    14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
> >>    14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
> >>    10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
> >>    10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
> >>     7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
> >>     5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
> >>     3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
> >>     2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
> >>     2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
> >>     2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
> >>     2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
> >>     1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd@plt
> >>
> >> OVS rxdrop with HW hints:
> >> -------------------------
> >> rxdrop rate: 4.73Mpps
> >>
> >> pmd thread numa_id 0 core_id 7:
> >>    packets received: 13686880
> >>    packet recirculations: 0
> >>    avg. datapath passes per packet: 1.00
> >>    emc hits: 13686880
> >>    smc hits: 0
> >>    megaflow hits: 0
> >>    avg. subtable lookups per megaflow hit: 0.00
> >>    miss with success upcall: 0
> >>    miss with failed upcall: 0
> >>    avg. packets per output batch: 0.00
> >>    idle cycles: 3182105544 (46.02%)
> >>    processing cycles: 3732023844 (53.98%)
> >>    avg cycles per packet: 505.16 (6914129388/13686880)
> >>    avg processing cycles per packet: 272.67 (3732023844/13686880)
> >>
> >> pmd thread numa_id 0 core_id 7:
> >>
> >>    Iterations:           392909539  (0.18 us/it)
> >>    - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
> >>    - idle iterations:    382539861  ( 46.0 % of used cycles)
> >>    - busy iterations:     10369678  ( 54.0 % of used cycles)
> >>    Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
> >>    Datapath passes:      331829656  (1.00 passes/pkt)
> >>    - EMC hits:           331829656  (100.0 %)
> >>    - SMC hits:                   0  (  0.0 %)
> >>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
> >>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
> >>    - Lost upcalls:               0  (  0.0 %)
> >>    Tx packets:                   0
> >>
> >> Perf record/report:
> >>    22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
> >>    10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
> >>     7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
> >>     7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
> >>     6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
> >>     6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
> >>     5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
> >>     5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
> >>     3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
> >>     3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
> >>     2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
> >>     2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
> >>     2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
> >>     1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
> >>     1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
> >>     1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
> >>
> >> Signed-off-by: William Tu <u9012063@gmail.com>
> >> ---
> >>   lib/netdev-afxdp.c | 19 +++++++++++++++++++
> >>   1 file changed, 19 insertions(+)
> >>
> >> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> >> index 482400d8d135..49881a8cc0cb 100644
> >> --- a/lib/netdev-afxdp.c
> >> +++ b/lib/netdev-afxdp.c
> >> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
> >>       );
> >>   };
> >>   
> >> +/* FIXME:
> >> + * This should be done dynamically by query the device's
> >> + * XDP metadata structure. Ex:
> >> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
> >> + */
> >> +struct xdp_md_desc {  
> > 
> > Adding:
> >         uint32_t btf_id;
> > 
> > Would be valuable IMHO.
> >   
> >> +    uint32_t flow_mark;
> >> +    uint32_t hash32;
> >> +    uint16_t vlan;
> >> +};
> >> +
> >>   #ifdef HAVE_XDP_NEED_WAKEUP
> >>   static inline void
> >>   xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
> >> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>           struct dp_packet_afxdp *xpacket;
> >>           const struct xdp_desc *desc;
> >>           struct dp_packet *packet;
> >> +        struct xdp_md_desc *md;
> >>           uint64_t addr, index;
> >>           uint32_t len;
> >>           char *pkt;
> >> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>           len = desc->len;
> >>   
> >>           pkt = xsk_umem__get_data(umem->buffer, addr);
> >> +        md = pkt - sizeof *md;  
> > 
> > We also want this sizeof-offset to be dynamic.
> > And I hope that AF_XDP could provide this info, on the size of
> > data_meta area.  I need some help from Magnus and Bjørn here?
> > 
> > Look at the kernel-side code of AF_XDP/XSK I don't see this being
> > transferred? Help is this true? Can we add the info?
> >  
> 
> XDP sockets make sure that the meta data is propagated to userland.
> However, the meta data *size* is not passed via the AF_XDP descriptors.
> 
> This was per-design; The metadata is a contract between the XDP program
> and the XDP socket, and in the future maybe the driver as well.
> 
> So, we're simply stating that "your AF_XDP application should know if
> there's meta data or not, and how it's structured".
> 
> I guess we could explore starting to use the options bits of struct
> xdp_desc for this, but only if we're *really* seeing a need for it.
> 
> If the XDP program would like to pass it's length, it can do that via
> the metadata area as well.
> 
> struct meta {
>    int len;

I'm not interested in having the len as part of the metadata.  The btf_id
will basically/indirectly give us the length.

>    ...
>    ...
> };
> 
> 
> > 
> >   
> >>           index = addr >> FRAME_SHIFT;
> >>           xpacket = &umem->xpool.array[index];
> >>           packet = &xpacket->packet;
> >> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>                               OVS_XDP_HEADROOM);
> >>           dp_packet_set_size(packet, len);
> >>   
> >> +        /* FIXME: This should be done by detecting whether
> >> +         * XDP MD is enabled or not. Ex:
> >> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> >> +         */
> >> +        dp_packet_set_rss_hash(packet, md->hash32);  
> > 
> > The general idea to make this code more dynamic is to let AF_XDP
> > provide the meta-data-len and then use the (proposed) btf_id to know
> > what "callback" C-code to call.
> >  
> 
> Hmm, again, here I might have a bit of a different view. Personally I 
> think it's better that all the "NIC/XDP/socket" does the negotiation in 
> the control/setup phase, and keep the fast-path lean. Instead of:
> 
> recv(void *data)
> {
>     len = get_meta_data_len(data);
>     if (len) {
>        // check this, check that, if-else-if-else
>     }
>     ...
> }
> 
> You'd like to use the length of metadata to express if it's available or 
> not. I'm stating that we don't need that for AF_XDP. In XDP if we're 
> accessing the field, we need to validate access due to the verifier. But 
> it's not to know if it's actually there. When we attach a certain XDP to 
> a device, we can query what offload it supports, and the the application 
> knows that meta data is structured in a certain way.
> 
> Some application has simple metadata:
> struct tstamp { u64 rx; };
> 
> That application will unconditionally look at (struct tstamp 
> *)(pkt-sizeof(struct tstamp)->rx;.

I imagine that NIC hardware will/can provide packets with different
metadata.  Your example with timestamps is a good one, as some
drivers/HW only support timestamping PTP packets.  Thus, I don't want
to provide metadata info on tstamp if the HW didn't provide it (and I also
don't want to spend CPU cycles on clearing the u64 field with zero).
Another example is VLAN IDs, which are sometimes provided by hardware.

Thus, the metadata can change on a per-packet basis.

> Some might be more flexible and have:
> struct metadata { char storage[64]; int btf_id; };

I really like the idea of placing the btf_id as the last entry; that
way we know its location.  I can get the btf_id via a negative offset of
sizeof(int) from the start of the packet data.
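
As a rough sketch of that lookup on the OVS side, reusing the 'pkt' pointer
from the receive loop in the patch and assuming the layout above (metadata
area ending right where the packet data starts, 32-bit btf_id; both are
assumptions, not an upstream ABI):

uint32_t btf_id;

/* Read the btf_id stored as the last metadata field, immediately
 * before the packet data that 'pkt' points to. */
memcpy(&btf_id, pkt - sizeof btf_id, sizeof btf_id);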

>
> If you need an "availability check", then you can solve that.
> 
> int xdp_prog(struct xdp_md *ctx)
> {
> 	struct metadata *m = (void *)(long)ctx->data_meta;
> 	void *data = (void *)(unsigned long)ctx->data;
> 
>          if ((m + 1) > data) {
> 		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*));
> 		if (ret < 0)
> 			return XDP_ABORTED;
> 		data = (void *)(unsigned long)ctx->data;
> 		m = (void *)(long)ctx->data_meta;
> 		if ((m + 1) > data)
> 			return XDP_ABORTED;
> 		m->btf_id = -1; // not valid
> 	}
> 	// ... bpf_redirect_map() to socket
> }
> 
> Would you agree? Or maybe expand a bit why you need the length.

With the btf_id as the last entry, I basically don't need the length.

The dispatcher function will, based on the btf_id, know the size of the
struct.  The C-code will basically type-cast the struct at the pointer
offset and get access to the metadata at the known offsets.
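
As a rough sketch of what such a per-btf_id dispatcher could look like in
netdev-afxdp.c (the btf_id values and struct layouts below are made-up
placeholders for illustration, not a defined kernel ABI):

#include <stdint.h>
#include <string.h>
#include "dp-packet.h"

/* Hypothetical layouts negotiated at setup time; the trailing btf_id
 * follows the "btf_id as last entry" idea discussed above. */
struct md_hints_a { uint32_t flow_mark; uint32_t hash32; uint32_t btf_id; };
struct md_hints_b { uint32_t hash32; uint32_t flow_mark; uint32_t btf_id; };

static void
xdp_md_dispatch(struct dp_packet *packet, const char *pkt)
{
    uint32_t btf_id;

    /* The btf_id is stored as the last metadata field, right before
     * the packet data. */
    memcpy(&btf_id, pkt - sizeof btf_id, sizeof btf_id);

    switch (btf_id) {
    case 1: {   /* placeholder id for struct md_hints_a */
        const struct md_hints_a *md = (const void *) (pkt - sizeof *md);
        dp_packet_set_rss_hash(packet, md->hash32);
        break;
    }
    case 2: {   /* placeholder id for struct md_hints_b */
        const struct md_hints_b *md = (const void *) (pkt - sizeof *md);
        dp_packet_set_rss_hash(packet, md->hash32);
        break;
    }
    default:
        /* Unknown or no metadata: leave the hash unset so that
         * dpif-netdev falls back to the software 5-tuple hash. */
        break;
    }
}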

> 
> > Using a btf_id make this independent of the NIC driver.  

E.g. we don't need to dereference the NIC driver struct to find what
metadata this driver has.  Instead, this is given directly by the
btf_id, and two drivers can use the same btf_id.


> > And we can
> > tell the difference between two BPF structs that contain the same info
> > but at different offsets. (I'll leave out the details, but I can
> > explain more if there is interest).
> >  

BTF sort of makes the struct names identifiers and an API, and gives *BPF*
programs the ability to adjust offsets when loading the BPF program into
the kernel.

Driver-1 choose:

 struct xdp_md_desc {  
     uint32_t flow_mark;
     uint32_t hash32;
 } __attribute__((preserve_access_index));

Driver-2 choose:

 struct xdp_md_desc {  
     uint32_t hash32;
     uint32_t flow_mark;
 } __attribute__((preserve_access_index));

The BPF code can access the members, and the kernel will remap the struct
accesses, internally creating two different BPF programs with adjusted
offsets in the byte-code.

How do we handle this in the OVS userspace C-code?

Maybe/hopefully you have a better idea than mine, which is simply to
compile two C programs (or a program for each known BTF-id) and dispatch
based on the BTF-id.


> Details, please! :-D Maybe I'm just not following your ideas!

I hope the details above helped.
 
 
> > The btf_id also helps when NIC reconfigure the meta-data contents, and
> > there are packets in-flight.

This is another reason why the C-code needs to handle metadata that can
change at runtime, as I don't want to freeze config changes to the NIC.
Björn Töpel March 5, 2021, 3:34 p.m. UTC | #4
On 2021-03-05 16:07, Jesper Dangaard Brouer wrote:
> On Fri, 5 Mar 2021 14:56:02 +0100
> Björn Töpel <bjorn.topel@intel.com> wrote:
> 
>> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:


[...]

>>> We also want this sizeof-offset to be dynamic.
>>> And I hope that AF_XDP could provide this info, on the size of
>>> data_meta area.  I need some help from Magnus and Bjørn here?
>>>
>>> Look at the kernel-side code of AF_XDP/XSK I don't see this being
>>> transferred? Help is this true? Can we add the info?
>>>   
>>
>> XDP sockets make sure that the meta data is propagated to userland.
>> However, the meta data *size* is not passed via the AF_XDP descriptors.
>>
>> This was per-design; The metadata is a contract between the XDP program
>> and the XDP socket, and in the future maybe the driver as well.
>>
>> So, we're simply stating that "your AF_XDP application should know if
>> there's meta data or not, and how it's structured".
>>
>> I guess we could explore starting to use the options bits of struct
>> xdp_desc for this, but only if we're *really* seeing a need for it.
>>
>> If the XDP program would like to pass it's length, it can do that via
>> the metadata area as well.
>>
>> struct meta {
>>     int len;
> 
> I'm not interested in having the len part of metadata.   The btf_id
> will basically/indirectly give us the length.
>

Ok!


>>     ...
>>     ...
>> };
>>
>>
>>>
>>>    
>>>>            index = addr >> FRAME_SHIFT;
>>>>            xpacket = &umem->xpool.array[index];
>>>>            packet = &xpacket->packet;
>>>> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>>                                OVS_XDP_HEADROOM);
>>>>            dp_packet_set_size(packet, len);
>>>>    
>>>> +        /* FIXME: This should be done by detecting whether
>>>> +         * XDP MD is enabled or not. Ex:
>>>> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>>>> +         */
>>>> +        dp_packet_set_rss_hash(packet, md->hash32);
>>>
>>> The general idea to make this code more dynamic is to let AF_XDP
>>> provide the meta-data-len and then use the (proposed) btf_id to know
>>> what "callback" C-code to call.
>>>   
>>
>> Hmm, again, here I might have a bit of a different view. Personally I
>> think it's better that all the "NIC/XDP/socket" does the negotiation in
>> the control/setup phase, and keep the fast-path lean. Instead of:
>>
>> recv(void *data)
>> {
>>      len = get_meta_data_len(data);
>>      if (len) {
>>         // check this, check that, if-else-if-else
>>      }
>>      ...
>> }
>>
>> You'd like to use the length of metadata to express if it's available or
>> not. I'm stating that we don't need that for AF_XDP. In XDP if we're
>> accessing the field, we need to validate access due to the verifier. But
>> it's not to know if it's actually there. When we attach a certain XDP to
>> a device, we can query what offload it supports, and the the application
>> knows that meta data is structured in a certain way.
>>
>> Some application has simple metadata:
>> struct tstamp { u64 rx; };
>>
>> That application will unconditionally look at (struct tstamp
>> *)(pkt-sizeof(struct tstamp)->rx;.
> 
> I imagine that NIC hardware will/can provide packets with different
> metadata.  Your example with timestamps are a good example, as some
> drivers/HW only support timestamping PTP packets.  Thus, I don't want
> to provide metadata info on tstamp if HW didn't provide it (and I also
> don't want to spend CPU-cycles on clearing the u64 field with zero).
> Another example is vlan IDs is sometimes provided by hardware.
> 
> Thus, on a per packet basis the metadata can change.
>

Yes, I agree. For a certain set of packets, there can be different
metadata per packet that, as you pointed out, can be dispatched to
different handlers.

>> Some might be more flexible and have:
>> struct metadata { char storage[64]; int btf_id; };
> 
> I really like the idea of placing the btf_id as the last entry, that
> way we know its location.  I can get the btf_id via packet minus-offset
> sizeof(int).
> 
>>
>> If you need an "availability check", then you can solve that.
>>
>> int xdp_prog(struct xdp_md *ctx)
>> {
>> 	struct metadata *m = (void *)(long)ctx->data_meta;
>> 	void *data = (void *)(unsigned long)ctx->data;
>>
>>           if ((m + 1) > data) {
>> 		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*));
>> 		if (ret < 0)
>> 			return XDP_ABORTED;
>> 		data = (void *)(unsigned long)ctx->data;
>> 		m = (void *)(long)ctx->data_meta;
>> 		if ((m + 1) > data)
>> 			return XDP_ABORTED;
>> 		m->btf_id = -1; // not valid
>> 	}
>> 	// ... bpf_redirect_map() to socket
>> }
>>
>> Would you agree? Or maybe expand a bit why you need the length.
> 
> With the btf_id as the last entry, I basically don't need the length.
>

Ok! Good! :-)

> The dispatcher function, will based on the btf_id know the size of the
> struct.  The C-code will basically type-cast the struct at the pointer
> offset and get access to the metadata at the known offsets.
>

Makes sense, and in line with my view!


>>
>>> Using a btf_id make this independent of the NIC driver.
> 
> E.g. we don't need to dereference the NIC driver struct to find what
> metadata this driver have.  Instead this is given directly by the
> btf_id, and two drivers can use the same btf_id.
>

Yeah!


> 
>>> And we can
>>> tell the difference between two BPF structs that contain the same info
>>> but at different offsets. (I'll leave out the details, but I can
>>> explain more if there is interest).
>>>   
> 
> BTF sort-of make the struct names identifiers and API.  And gives *BPF*
> programs the ability to adjust offsets, when loading the BPF-prog into
> the kernel.
> 
> Driver-1 choose:
> 
>   struct xdp_md_desc {
>       uint32_t flow_mark;
>       uint32_t hash32;
>   } __attribute__((preserve_access_index));
> 
> Driver-2 choose:
> 
>   struct xdp_md_desc {
>       uint32_t hash32;
>       uint32_t flow_mark;
>   } __attribute__((preserve_access_index));
> 
> The BPF code can access the members, and kernel will remap-struct
> access, and internally in the kernel create two different BPF-programs
> with adjusted offsets in the byte-code.
> 
> How do we handle this in the OVS userspace C-code?
>

Longer-term, I think extending the linkers to support struct member
relocations would be a good thing, i.e. simply supporting
__attribute__((preserve_access_index)) in userland. That would enable a
very powerful userland/kernel interface, like what we're discussing now.

But I wonder if it's too big of an initial undertaking to do that as a
first step? Maybe non-relocatable structs could be a start.

I definitely agree that the goal should be relocatable structs!

> Maybe/hopefully you have a better idea than mine, which is simply to
> compile two C-programs (or program for-each know BTF-id) and dispatch
> based on the BTF-id.
>

Not really. :-( A solution w/o __attribute__((preserve_access_index))
would mean that the application would need per-NIC structures, or a
run-time/recompile component.

I'm excited to see that William and you are experimenting in this space.
We need more "pasta on the wall" to see what sticks^Wworks.

> 
>> Details, please! :-D Maybe I'm just not following your ideas!
> 
> I hope the details above helped.
>

Yes. Thank you for the write-up!


Björn

>   
>>> The btf_id also helps when NIC reconfigure the meta-data contents, and
>>> there are packets in-flight.
> 
> This is another reason why C-code need to handle meta-data can change
> runtime, as I don't like to freeze config changes to NIC.
>
William Tu March 5, 2021, 8:07 p.m. UTC | #5
Thanks Björn and Jesper for the feedback.

On Fri, Mar 5, 2021 at 7:34 AM Björn Töpel <bjorn.topel@intel.com> wrote:
>
> On 2021-03-05 16:07, Jesper Dangaard Brouer wrote:
> > On Fri, 5 Mar 2021 14:56:02 +0100
> > Björn Töpel <bjorn.topel@intel.com> wrote:
> >
> >> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
>
>
> [...]
> >>
> >> You'd like to use the length of metadata to express if it's available or
> >> not. I'm stating that we don't need that for AF_XDP. In XDP if we're
> >> accessing the field, we need to validate access due to the verifier. But
> >> it's not to know if it's actually there. When we attach a certain XDP to
> >> a device, we can query what offload it supports, and the the application
> >> knows that meta data is structured in a certain way.
> >>
> >> Some application has simple metadata:
> >> struct tstamp { u64 rx; };
> >>
> >> That application will unconditionally look at (struct tstamp
> >> *)(pkt-sizeof(struct tstamp)->rx;.
> >
> > I imagine that NIC hardware will/can provide packets with different
> > metadata.  Your example with timestamps are a good example, as some
> > drivers/HW only support timestamping PTP packets.  Thus, I don't want
> > to provide metadata info on tstamp if HW didn't provide it (and I also
> > don't want to spend CPU-cycles on clearing the u64 field with zero).
> > Another example is vlan IDs is sometimes provided by hardware.
> >
> > Thus, on a per packet basis the metadata can change.
> >
Is this future work?

Looking at Saeed's patch, I thought we query the netdev through the
BTF netlink API and get a BTF ID and a C-struct per netdev.
This would be done in the OVS control plane: when a user attaches
a device to its bridge, we query its metadata structure.

>
> Yes, I agree. For a certain set of packets, there can be different
> metadata per packet that, as you pointed out, can be dispatched do
> different handlers.
>
> >> Some might be more flexible and have:
> >> struct metadata { char storage[64]; int btf_id; };
> >
> > I really like the idea of placing the btf_id as the last entry, that
> > way we know its location.  I can get the btf_id via packet minus-offset
> > sizeof(int).
> >
[...]

> >
> > BTF sort-of make the struct names identifiers and API.  And gives *BPF*
> > programs the ability to adjust offsets, when loading the BPF-prog into
> > the kernel.
> >
> > Driver-1 choose:
> >
> >   struct xdp_md_desc {
> >       uint32_t flow_mark;
> >       uint32_t hash32;
> >   } __attribute__((preserve_access_index));
> >
> > Driver-2 choose:
> >
> >   struct xdp_md_desc {
> >       uint32_t hash32;
> >       uint32_t flow_mark;
> >   } __attribute__((preserve_access_index));
> >
> > The BPF code can access the members, and kernel will remap-struct
> > access, and internally in the kernel create two different BPF-programs
> > with adjusted offsets in the byte-code.
> >
> > How do we handle this in the OVS userspace C-code?
> >

BTW, do we assume all vendors use the same field name
for the same purpose? For example, when OVS queries Intel or mlx, if
rxhash is available, they should both use the name "hash32".
Otherwise, I don't know how to associate it with the OVS code logic.

If that's the case, I can do the following (in userspace C code, not a BPF
program); a rough sketch follows the list:
1) When attaching a device, query its MD struct and the MD's size
   using the BTF info.
2) From the BTF info returned, check whether the string "hash32" exists.
   If yes, calculate its offset from the beginning of the MD.
3) OVS can set it by calling "dp_packet_set_rss_hash(packet, md[rxhash_offset]);"
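
As a rough userspace sketch of steps 1) and 2), assuming libbpf is available,
that the md_btf_id reported by the driver can be passed to btf__get_from_id(),
and that it contains a struct named "xdp_md_desc" (assumptions, since the
control-plane API is not upstream yet):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <bpf/btf.h>
#include <linux/btf.h>

/* Return 0 and set *offsetp to the byte offset of "hash32" inside the
 * device's metadata struct, or a negative error code on failure. */
static int
xdp_md_find_hash32_offset(uint32_t md_btf_id, size_t *offsetp)
{
    const struct btf_type *t;
    const struct btf_member *m;
    struct btf *btf = NULL;
    int err, type_id, i;

    err = btf__get_from_id(md_btf_id, &btf);
    if (err || !btf) {
        return err ? err : -EINVAL;
    }

    type_id = btf__find_by_name_kind(btf, "xdp_md_desc", BTF_KIND_STRUCT);
    if (type_id < 0) {
        err = type_id;
        goto out;
    }

    t = btf__type_by_id(btf, type_id);
    m = btf_members(t);
    err = -ENOENT;
    for (i = 0; i < btf_vlen(t); i++, m++) {
        if (!strcmp(btf__name_by_offset(btf, m->name_off), "hash32")) {
            *offsetp = btf_member_bit_offset(t, i) / 8;  /* bits -> bytes */
            err = 0;
            break;
        }
    }

out:
    btf__free(btf);
    return err;
}

Step 3) would then read the hash through that offset, relative to the start
of the metadata area in front of the packet data, and pass it to
dp_packet_set_rss_hash().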

>
> Longer-term, I think extending the linkers to support struct member
> relocations would be a good thing! I.e. simply support
> __attribute__((preserve_access_index)) in userland. That would enable
> very powerful userland/kernel interface, like what we're discussing now.
>
> But I wonder if it's too big of an initial undertaking to do that as a
> first step? Maybe non-relocatable struct could be a start.
>
Having relocatable structs would be great, so I could just use 'md->hash32'
instead of an offset like above.

> I definitely agree that the goal should be relocatable structs!
>
> > Maybe/hopefully you have a better idea than mine, which is simply to
> > compile two C-programs (or program for-each know BTF-id) and dispatch
> > based on the BTF-id.
> >

Thanks
William
Jesper Dangaard Brouer March 8, 2021, 2 p.m. UTC | #6
On Fri, 5 Mar 2021 12:07:28 -0800
William Tu <u9012063@gmail.com> wrote:

> Thanks Björn and Jesper for feedbacks.
> 
> On Fri, Mar 5, 2021 at 7:34 AM Björn Töpel <bjorn.topel@intel.com> wrote:
> >
> > On 2021-03-05 16:07, Jesper Dangaard Brouer wrote:  
> > > On Fri, 5 Mar 2021 14:56:02 +0100
> > > Björn Töpel <bjorn.topel@intel.com> wrote:
> > >  
> > >> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:  
> >
> >
> > [...]  
> > >>
> > >> You'd like to use the length of metadata to express if it's available or
> > >> not. I'm stating that we don't need that for AF_XDP. In XDP if we're
> > >> accessing the field, we need to validate access due to the verifier. But
> > >> it's not to know if it's actually there. When we attach a certain XDP to
> > >> a device, we can query what offload it supports, and then the application
> > >> knows that meta data is structured in a certain way.
> > >>
> > >> Some application has simple metadata:
> > >> struct tstamp { u64 rx; };
> > >>
> > >> That application will unconditionally look at ((struct tstamp
> > >> *)(pkt - sizeof(struct tstamp)))->rx;.
> > >
> > > I imagine that NIC hardware will/can provide packets with different
> > > metadata.  Your example with timestamps are a good example, as some
> > > drivers/HW only support timestamping PTP packets.  Thus, I don't want
> > > to provide metadata info on tstamp if HW didn't provide it (and I also
> > > don't want to spend CPU-cycles on clearing the u64 field with zero).
> > > Another example is vlan IDs, which are sometimes provided by hardware.
> > >
> > > Thus, on a per packet basis the metadata can change.
> > >  
> Is this future work?

Yes. I will propose this, and I will also be working on the code myself.
(P.S. I'll be on vacation for the rest of this week.)

It is great that we have AF_XDP+OVS as one use-case/consumer.

My use-case is CPUMAP and veth, which get redirected an xdp_frame and
create SKBs based on this info.  I will code this on top of Saeed's
patchset.  With xdp_frame I could also store the btf_id in the top of
the frame, but that would not make it available to AF_XDP frames.


> Looking at Saeed's patch, I thought we query the netdev through
> BTF netlink API, and we get a BTF ID and C-struct per-netdev.
> And this will be done in the OVS control plane: when a user attaches
> a device to its bridge, we query its metadata structure.

Correct.  Remember that Saeed's patches[1] have not been accepted
upstream.  Thus, those[1] are also a proposal (but a really good
starting point).

Even if a netdev only has one current BTF-metadata-hint, we want
the ability to (runtime) change the BTF-metadata-hint to something
else.  That is the whole point of making these
metadata-HW-offload-hints more flexible.
Thus, in OVS you will need some way to detect when the NIC changes the
BTF-metadata-hint to contain new info.  My *proposal* is to include the
btf_id as part of the struct.  (Feel free to slap me around, I'm sure
Alexei will on this idea.)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
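
A rough sketch of how that per-packet detection could look on the RX side,
assuming the convention discussed above (btf_id stored as the last 32-bit
word of the metadata area, which ends right before the packet data); the
cached fields are hypothetical per-netdev state:

    #include <stdint.h>
    #include <string.h>

    static void
    refresh_md_layout_if_needed(const char *pkt, uint32_t *cached_btf_id,
                                int *rxhash_offset)
    {
        uint32_t btf_id;

        /* The btf_id sits immediately before the packet data. */
        memcpy(&btf_id, pkt - sizeof btf_id, sizeof btf_id);

        if (btf_id != *cached_btf_id) {
            /* The NIC switched to a different metadata layout: remember
             * the new id and invalidate the cached member offsets until
             * the BTF for this id has been re-queried. */
            *cached_btf_id = btf_id;
            *rxhash_offset = -1;
        }
    }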


> > Yes, I agree. For a certain set of packets, there can be different
> > metadata per packet that, as you pointed out, can be dispatched to
> > different handlers.
> >  
> > >> Some might be more flexible and have:
> > >> struct metadata { char storage[64]; int btf_id; };  
> > >
> > > I really like the idea of placing the btf_id as the last entry, that
> > > way we know its location.  I can get the btf_id via packet minus-offset
> > > sizeof(int).
> > >  
> [...]
> 
> > >
> > > BTF sort-of make the struct names identifiers and API.  And gives *BPF*
> > > programs the ability to adjust offsets, when loading the BPF-prog into
> > > the kernel.
> > >
> > > Driver-1 choose:
> > >
> > >   struct xdp_md_desc {
> > >       uint32_t flow_mark;
> > >       uint32_t hash32;
> > >   } __attribute__((preserve_access_index));
> > >
> > > Driver-2 choose:
> > >
> > >   struct xdp_md_desc {
> > >       uint32_t hash32;
> > >       uint32_t flow_mark;
> > >   } __attribute__((preserve_access_index));
> > >
> > > The BPF code can access the members, and kernel will remap-struct
> > > access, and internally in the kernel create two different BPF-programs
> > > with adjusted offsets in the byte-code.
> > >
> > > How do we handle this in the OVS userspace C-code?
> > >  
> 
> BTW, do we assume all vendors use the same field name
> for the same purpose? For example:
> when OVS queries Intel or mlx, if rxhash is available, they should
> use the same name, "hash32".
> Otherwise, I don't know how to associate it with the OVS code logic.

Yes, AFAIK the struct member field names, defined in BTF, will become
API-like and will have meaning.  The offsets will be dynamic, and the
size/type can be validated via the BTF info.


> If that's the case, I can do (in userspace C code, not a BPF program):
> 1) When attaching a device, query its MD struct and the MD's size
>     using BTF info.
> 2) From the BTF info returned, check if the string "hash32" exists;
>     if yes, then calculate its offset from the beginning of the MD.
> 3) OVS can set it by calling "dp_packet_set_rss_hash(packet, md[rxhash_offset]);"

That sounds good.

The 'rxhash_offset' can be dynamically updated if the btf_id changes.

> >
> > Longer-term, I think extending the linkers to support struct member
> > relocations would be a good thing! I.e. simply support
> > __attribute__((preserve_access_index)) in userland. That would enable
> > very powerful userland/kernel interface, like what we're discussing now.
> >
> > But I wonder if it's too big of an initial undertaking to do that as a
> > first step? Maybe non-relocatable struct could be a start.
> >  
> Having relocatable structs would be great, so I can just use 'md->hash32'
> instead of an offset like above.
> 
> > I definitely agree that the goal should be relocatable structs!
> >  
> > > Maybe/hopefully you have a better idea than mine, which is simply to
> > > compile two C-programs (or program for-each know BTF-id) and dispatch
> > > based on the BTF-id.
> > >  
> 
> Thanks
> William
diff mbox series

Patch

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 482400d8d135..49881a8cc0cb 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -169,6 +169,17 @@  struct netdev_afxdp_tx_lock {
     );
 };
 
+/* FIXME:
+ * This should be done dynamically by querying the device's
+ * XDP metadata structure. Ex:
+ *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
+ */
+struct xdp_md_desc {
+    uint32_t flow_mark;
+    uint32_t hash32;
+    uint16_t vlan;
+};
+
 #ifdef HAVE_XDP_NEED_WAKEUP
 static inline void
 xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
@@ -849,6 +860,7 @@  netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         struct dp_packet_afxdp *xpacket;
         const struct xdp_desc *desc;
         struct dp_packet *packet;
+        struct xdp_md_desc *md;
         uint64_t addr, index;
         uint32_t len;
         char *pkt;
@@ -858,6 +870,7 @@  netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         len = desc->len;
 
         pkt = xsk_umem__get_data(umem->buffer, addr);
+        md = (struct xdp_md_desc *) (pkt - sizeof *md);
         index = addr >> FRAME_SHIFT;
         xpacket = &umem->xpool.array[index];
         packet = &xpacket->packet;
@@ -868,6 +881,12 @@  netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
                             OVS_XDP_HEADROOM);
         dp_packet_set_size(packet, len);
 
+        /* FIXME: This should be done by detecting whether
+         * XDP MD is enabled or not. Ex:
+         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
+         */
+        dp_packet_set_rss_hash(packet, md->hash32);
+
         /* Add packet into batch, increase batch->count. */
         dp_packet_batch_add(batch, packet);
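
Following up on the FIXMEs above, a minimal sketch of what the receive path
could look like once the metadata layout is discovered at attach time
instead of being hardcoded; 'md_size' and 'rxhash_offset' are hypothetical
per-netdev values filled in by the control plane from the device's BTF, and
apply_xdp_md_rxhash() is a made-up helper name:

    #include <stdint.h>
    #include <string.h>
    #include "dp-packet.h"

    /* Apply the HW-provided rxhash, if the device exposes one, instead of
     * reading through a hardcoded struct xdp_md_desc. */
    static inline void
    apply_xdp_md_rxhash(struct dp_packet *packet, const char *pkt,
                        int md_size, int rxhash_offset)
    {
        uint32_t hash;

        if (rxhash_offset < 0) {
            return;   /* No "hash32" member advertised: keep SW hashing. */
        }

        /* The metadata area ends where the packet data begins. */
        memcpy(&hash, pkt - md_size + rxhash_offset, sizeof hash);
        dp_packet_set_rss_hash(packet, hash);
    }

In the receive loop above, this would replace the hardcoded
dp_packet_set_rss_hash(packet, md->hash32) call.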