
[RFC,bpf-next,0/6] XDP RX device meta data acceleration (WIP)

Message ID 20180627024615.17856-1-saeedm@mellanox.com

Message

Saeed Mahameed June 27, 2018, 2:46 a.m. UTC
Hello,

Although it is far from complete, this series provides the means and
infrastructure to enable device drivers to share packet meta data and
accelerations with XDP programs, such as hash, stripped VLAN, checksum,
flow mark, packet header types, etc.

The series is still a work in progress; I am sending it now as an RFC
to get early feedback and to discuss the design, structures and UAPI.

For now the general idea is to help XDP programs accelerate by providing
them with meta data that is already available from the HW or the device
driver, saving CPU cycles on packet header/data processing and decision
making. Aside from that, we want to avoid a fixed-size meta data
structure that wastes a lot of buffer space on stuff the XDP program
might not need, like the current "elephant" SKB fields, kidding :) ..

So my idea here is to provide a dynamic mechanism that allows XDP
programs, at xdp load time, to request only the specific meta data they
need; the device driver or netdevice will then provide it in a
predefined order in the xdp_buff/xdp_md data meta section.

And here is how it is done and what I would like to discuss:

1. The first patch adds the infrastructure to request predefined meta data
flags at xdp load time, indicating that the XDP program is going to need them.

1.1) In this patch I am using the current u32 IFLA_XDP_FLAGS.
TODO: this needs to be improved in order to allow more meta data flags,
maybe with a new dedicated flags attribute?
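
For reference, here is a rough sketch of how the request bits could be
carved out of the existing 32-bit flags word; the bit positions and the
XDP_FLAGS_META_ALL mask are illustrative only, not the exact UAPI from
the patches:

/* Illustrative sketch only: bit positions are made up; the actual
 * patches may place these differently within IFLA_XDP_FLAGS.
 */
#define XDP_FLAGS_META_HASH	(1U << 16)
#define XDP_FLAGS_META_MARK	(1U << 17)
#define XDP_FLAGS_META_VLAN	(1U << 18)
#define XDP_FLAGS_META_CSUM	(1U << 19)
#define XDP_FLAGS_META_ALL	(XDP_FLAGS_META_HASH | \
				 XDP_FLAGS_META_MARK | \
				 XDP_FLAGS_META_VLAN | \
				 XDP_FLAGS_META_CSUM)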

1.2) Device drivers that want to support xdp meta data should implement
the new XDP command XDP_QUERY_META_FLAGS and report the meta data flags
they can support.

1.3) The kernel will cross-check the requested flags against the device's
supported flags and will fail the xdp load in case of a discrepancy.

Question: do we want this, or is it better to return the actually
supported flags to the program and let it decide?
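
For illustration, a minimal sketch of the driver side, assuming the new
command is wired through the existing ndo_bpf callback and that struct
netdev_bpf grows a reply field (meta_flags is an illustrative name, not
the exact field from the patches):

static int example_ndo_bpf(struct net_device *dev, struct netdev_bpf *xdp)
{
	switch (xdp->command) {
	case XDP_QUERY_META_FLAGS:
		/* Report which meta data items this device can deliver;
		 * the core cross-checks these against the flags the XDP
		 * program requested at load time.
		 */
		xdp->meta_flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
		return 0;
	default:
		/* XDP_SETUP_PROG and friends handled as today */
		return -EINVAL;
	}
}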


2. This is the interesting part: the 2nd patch adds the meta data
set/get infrastructure that allows device drivers to populate meta data
into the xdp_buff data meta area in a well-defined, structured format,
and lets the XDP program read it according to that predefined
format/structure. The idea here is that the XDP program and the device
driver share a static offsets array that defines where each meta data
item sits inside the xdp_buff data meta area; this layout is decided
once at xdp load time.

Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and only for
requested meta data:
Example: for an XDP program that requested the following flags:
flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

the offsets array will be:

xdp_md_info_arr mdi = {
        [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
        [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
        [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
        [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
}

For this example: the hash fields will always appear first, followed by
the vlan, for every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, which was built via xdp_md_info_build on xdp
setup; the xdp_md_info_arr tells the driver the offset of each meta data
item. The user space XDP program will use a similar xdp_md_info_arr to
statically know the offset of each meta data item.

*Future meta data will be added to the end of the array with higher
flag values.
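
To make the layout rules above concrete, here is a minimal sketch of
what xdp_md_info_build could look like; the per-type size table and the
flag-to-index mapping are assumptions for this sketch:

/* Sketch: assumes the XDP_DATA_META_* indices follow the same order as
 * the XDP_FLAGS_META_* request bits, starting at XDP_FLAGS_META_HASH.
 */
static const __u16 xdp_md_sizes[XDP_DATA_META_MAX] = {
	[XDP_DATA_META_HASH] = sizeof(struct xdp_md_hash),
	[XDP_DATA_META_MARK] = sizeof(__u32),	/* illustrative sizes */
	[XDP_DATA_META_VLAN] = sizeof(__u16),
	[XDP_DATA_META_CSUM] = sizeof(__u32),
};

static void xdp_md_info_build(xdp_md_info_arr mdi, __u32 meta_flags)
{
	__u16 offset = 0;
	int i;

	/* Requested meta data items are packed back to back in their
	 * fixed order, so the resulting offsets are always ascending.
	 */
	for (i = 0; i < XDP_DATA_META_MAX; i++) {
		if (!(meta_flags & (XDP_FLAGS_META_HASH << i)))
			continue;
		mdi[i].present = 1;
		mdi[i].offset = offset;
		offset += xdp_md_sizes[i];
	}
}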

This patch also provides helper functions for device drivers to store
meta data into the xdp_buff, and helper functions for XDP programs to
fetch specific meta data from the xdp_md buffer.
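
As an example of the program side, here is a rough sketch of fetching
the hash from data_meta; the exact layout of struct xdp_md_hash is an
assumption, and the statically built mdi[] mirrors the kernel-built
array shown above (definitions assumed to come from the new uapi bits
in this series, SEC() as in the existing samples):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct xdp_md_hash {		/* layout is an assumption for this sketch */
	__u32 hash;
	__u32 type;
};

static const xdp_md_info_arr mdi = {
	[XDP_DATA_META_HASH] = { .offset = 0, .present = 1 },
	[XDP_DATA_META_VLAN] = { .offset = sizeof(struct xdp_md_hash), .present = 1 },
};

SEC("xdp")
int xdp_read_hash(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct xdp_md_hash *hash;

	hash = data_meta + mdi[XDP_DATA_META_HASH].offset;
	if ((void *)(hash + 1) > data)	/* bounds check for the verifier */
		return XDP_PASS;	/* meta data not present */

	/* hash->hash and hash->type are usable here without touching
	 * the packet payload at all.
	 */
	return XDP_PASS;
}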

Question: currently the XDP program is responsible for building its own
static meta data offsets array "xdp_md_info_arr", while the kernel
builds one for the netdevice. If we choose this direction, should we
somehow share the kernel-built xdp_md_info_arr with the XDP program?

3. The last mlx5e patch is the actual showcase of how a device driver
adds support for xdp meta data; it is pretty simple and straightforward!
The first two mlx5e patches are just small refactorings to make the
xdp_md_info_arr and packet completion information available in the RX
xdp handlers' data path.

4. The last patch adds a small example to demonstrate how an XDP program
can request meta data at xdp load time and how it can read it on the
critical path. Of course more examples and some performance numbers are
needed. Exciting use cases include:
  - using the flow mark to allow fast white/black-list lookups.
  - using the flow mark to accelerate DDoS prevention.
  - generic data path: use the meta data from the xdp_buff to build SKBs!
    (Jesper's idea)
  - using the hash type to know the packet headers and encapsulation
    without touching the packet data at all.
  - using the packet hash to do RPS- and XPS-like cpu redirecting
    (see the sketch below).
  - and many more.
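
For the RPS/XPS-like case above, a minimal self-contained sketch,
assuming the hash was the only meta data requested (so it sits at
offset 0 in data_meta) and that cpu_map is populated from user space as
in the xdp_redirect_cpu sample:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") cpu_map = {
	.type		= BPF_MAP_TYPE_CPUMAP,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u32),
	.max_entries	= 64,
};

SEC("xdp")
int xdp_hash_cpu_redirect(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	__u32 *hash = data_meta;	/* hash requested alone => offset 0 */
	__u32 cpu;

	if ((void *)(hash + 1) > data)	/* meta data missing */
		return XDP_PASS;

	/* Spread flows across CPUs using the HW RSS hash, without
	 * parsing any packet headers.
	 */
	cpu = *hash % 4;		/* 4 = entries populated by user space */
	return bpf_redirect_map(&cpu_map, cpu, 0);
}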


More ideas:

From Jesper: 
=========================
Give XDP richer information about the hash.
By reducing the 'ht' value to the PKT_HASH_TYPE's we are losing information.

I happen to know your hardware well; CQE_RSS_HTYPE_IP tells us whether this
is IPv4 or IPv6, and CQE_RSS_HTYPE_L4 tells us whether this is TCP, UDP or
IPSEC. (And you have another bit telling whether this is IPv6 with extension
headers.)

If we don't want to invent our own xdp_hash_types, we can simply adopt
the RSS Hashing Types defined by Microsoft:
 https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

An interesting part of the RSS standard is that the hash type can help
identify whether this is a fragment. (XDP could use this info to avoid
touching the payload and react, e.g. drop fragments, or redirect all
fragments to another CPU, or skip parsing in XDP and defer to the
network stack via XDP_PASS.)

By using the RSS standard, we do e.g. lose the ability to say this is
IPSEC traffic, even though your HW supports this.

I have tried to implement my own (non-dynamic) XDP RX-types UAPI here:
 https://marc.info/?l=linux-netdev&m=149512213531769
 https://marc.info/?l=linux-netdev&m=149512213631774
=========================
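
To make Jesper's fragment point above concrete, here is a tiny sketch;
the XDP_HASH_TYPE_* bits below are placeholders for illustration, not a
proposed UAPI:

/* Placeholder type bits, for illustration only. The idea: an IP-only
 * hash on traffic that would normally be L4-hashed hints at a fragment,
 * so the program can react without ever touching the payload.
 */
#define XDP_HASH_TYPE_L3	(1U << 0)	/* hashed over IP addresses only */
#define XDP_HASH_TYPE_L4	(1U << 1)	/* hashed over the full 4-tuple */

static __always_inline int act_on_hash_type(__u32 type)
{
	if (type & XDP_HASH_TYPE_L4)
		return XDP_PASS;	/* full 4-tuple hash: normal flow handling */
	if (type & XDP_HASH_TYPE_L3)
		return XDP_DROP;	/* likely a fragment: e.g. drop or re-steer */
	return XDP_PASS;		/* unhashed/unknown: defer to the stack */
}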

Thanks,
Saeed.

Saeed Mahameed (6):
  net: xdp: Add support for meta data flags requests
  net: xdp: RX meta data infrastructure
  net/mlx5e: Store xdp flags and meta data info
  net/mlx5e: Pass CQE to RX handlers
  net/mlx5e: Add XDP RX meta data support
  samples/bpf: Add meta data hash example to xdp_redirect_cpu

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
 include/linux/netdevice.h                     |  10 ++
 include/net/xdp.h                             |   6 +
 include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
 include/uapi/linux/if_link.h                  |  20 ++-
 net/core/dev.c                                |  41 +++++++
 samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
 samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
 10 files changed, 354 insertions(+), 35 deletions(-)

Comments

Parikh, Neerav June 27, 2018, 4:42 p.m. UTC | #1
Thanks Saeed for starting this thread  :)
My comments inline. 

> -----Original Message-----
> From: Saeed Mahameed [mailto:saeedm@dev.mellanox.co.il]
> Sent: Tuesday, June 26, 2018 7:46 PM
> To: Jesper Dangaard Brouer <brouer@redhat.com>; Alexei Starovoitov
> <alexei.starovoitov@gmail.com>; Daniel Borkmann
> <borkmann@iogearbox.net>
> Cc: Parikh, Neerav <neerav.parikh@intel.com>; pjwaskiewicz@gmail.com;
> ttoukan.linux@gmail.com; Tariq Toukan <tariqt@mellanox.com>; Duyck,
> Alexander H <alexander.h.duyck@intel.com>; Waskiewicz Jr, Peter
> <peter.waskiewicz.jr@intel.com>; Opher Reviv <opher@mellanox.com>; Rony
> Efraim <ronye@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>
> Subject: [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
> 
> Hello,
> 
> Although it is far from complete, this series provides the means and
> infrastructure to enable device drivers to share packet meta data and
> accelerations with XDP programs, such as hash, stripped VLAN, checksum,
> flow mark, packet header types, etc.
> 
> The series is still a work in progress; I am sending it now as an RFC
> to get early feedback and to discuss the design, structures and UAPI.
> 
> For now the general idea is to help XDP programs accelerate by providing
> them with meta data that is already available from the HW or the device
> driver, saving CPU cycles on packet header/data processing and decision
> making. Aside from that, we want to avoid a fixed-size meta data
> structure that wastes a lot of buffer space on stuff the XDP program
> might not need, like the current "elephant" SKB fields, kidding :) ..
> 
> So my idea here is to provide a dynamic mechanism that allows XDP
> programs, at xdp load time, to request only the specific meta data they
> need; the device driver or netdevice will then provide it in a
> predefined order in the xdp_buff/xdp_md data meta section.
>
> And here is how it is done and what I would like to discuss:
> 
> 1. The first patch adds the infrastructure to request predefined meta data
> flags at xdp load time, indicating that the XDP program is going to need them.
> 
> 1.1) In this patch I am using the current u32 IFLA_XDP_FLAGS.
> TODO: this needs to be improved in order to allow more meta data flags,
> maybe with a new dedicated flags attribute?
> 
> 1.2) Device drivers that want to support xdp meta data should implement
> the new XDP command XDP_QUERY_META_FLAGS and report the meta data flags
> they can support.
> 
> 1.3) The kernel will cross-check the requested flags against the device's
> supported flags and will fail the xdp load in case of a discrepancy.
> 
> Question: do we want this, or is it better to return the actually
> supported flags to the program and let it decide?
> 
>
The work we are doing in this direction does not assume any specific flags;
instead, the XDP program requests the specific "meta data" it needs, and
if the driver (or HW) can provide it, the program is allowed to load.
If we put the flags and capabilities in the kernel, then it will depend on
the control program that loads the XDP program to pass on that information.
If the XDP program has that built in via, say, an ELF "section" (similar to
maps), then the program can be loaded independently and knows what kind of
meta data it wants and receives.
If the meta data is not supported by the device (driver or software mode),
then that would perhaps fail the program load.
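
For illustration, a rough sketch of what such an in-program declaration
could look like; the section name and struct are made up here, analogous
to how SEC("maps") works today:

/* Illustrative only: a dedicated ELF section the loader could read to
 * learn what meta data the program needs, instead of passing flags via
 * the netlink load request.
 */
struct xdp_md_request {
	__u32 flags;
};

struct xdp_md_request md_req SEC("xdp_metadata") = {
	.flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN,
};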
 
> 2. This is the interesting part: the 2nd patch adds the meta data
> set/get infrastructure that allows device drivers to populate meta data
> into the xdp_buff data meta area in a well-defined, structured format,
> and lets the XDP program read it according to that predefined
> format/structure. The idea here is that the XDP program and the device
> driver share a static offsets array that defines where each meta data
> item sits inside the xdp_buff data meta area; this layout is decided
> once at xdp load time.
> 
> Enter struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>        __u16 present:1;
>        __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
We were trying not to define a generic structure if possible, and were
working towards a model similar to the current usage, where the XDP program
produces meta data that is consumed by the eBPF program on the TC classifier
without hard-coding any structure. It's purely a producer-consumer model,
with the kernel helping transfer that data from one end to the other instead
of providing a structure.
So, the NIC produces the meta data requested by the XDP program, versus
producing some meta data that then gets translated in drivers into a
generic structure.


> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> Example: for an XDP program that requested the following flags:
> flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> the offsets array will be:
> 
> xdp_md_info_arr mdi = {
>         [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
>         [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
>         [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
>         [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> For this example: the hash fields will always appear first, followed by
> the vlan, for every xdp_md.
> 
> Once requested to provide xdp meta data, the device driver will use a
> pre-built xdp_md_info_arr, which was built via xdp_md_info_build on xdp
> setup; the xdp_md_info_arr tells the driver the offset of each meta data
> item. The user space XDP program will use a similar xdp_md_info_arr to
> statically know the offset of each meta data item.
> 
> *Future meta data will be added to the end of the array with higher
> flag values.
> 
> This patch also provides helper functions for device drivers to store
> meta data into the xdp_buff, and helper functions for XDP programs to
> fetch specific meta data from the xdp_md buffer.
> 
> Question: currently the XDP program is responsible for building its own
> static meta data offsets array "xdp_md_info_arr", while the kernel
> builds one for the netdevice. If we choose this direction, should we
> somehow share the kernel-built xdp_md_info_arr with the XDP program?
> 
> 3. The last mlx5e patch is the actual showcase of how a device driver
> adds support for xdp meta data; it is pretty simple and straightforward!
> The first two mlx5e patches are just small refactorings to make the
> xdp_md_info_arr and packet completion information available in the RX
> xdp handlers' data path.
> 
> 4. The last patch adds a small example to demonstrate how an XDP program
> can request meta data at xdp load time and how it can read it on the
> critical path. Of course more examples and some performance numbers are
> needed. Exciting use cases include:
>   - using the flow mark to allow fast white/black-list lookups.
>   - using the flow mark to accelerate DDoS prevention.
>   - generic data path: use the meta data from the xdp_buff to build SKBs!
>     (Jesper's idea)
>   - using the hash type to know the packet headers and encapsulation
>     without touching the packet data at all.
>   - using the packet hash to do RPS- and XPS-like cpu redirecting.
>   - and many more.
> 
> 
> More ideas:
> 
> From Jesper:
> =========================
> Give XDP richer information about the hash.
> By reducing the 'ht' value to the PKT_HASH_TYPE's we are losing information.
> 
> I happen to know your hardware well; CQE_RSS_HTYPE_IP tells us whether this
> is IPv4 or IPv6, and CQE_RSS_HTYPE_L4 tells us whether this is TCP, UDP or
> IPSEC. (And you have another bit telling whether this is IPv6 with extension
> headers.)
> 
Yes, commodity NICs have a lot of rich information about packet parsing
and protocols. It would be worth exposing this to XDP programs so that
they can take advantage of as much of the work already done by the NIC
as possible, versus redoing all of that in the XDP program.
So, basically, that will allow XDP programs to put more focus on the
"business logic" of whatever they decide to do with the packet.


> If we don't want to invent our own xdp_hash_types, we can simply adopt
> the RSS Hashing Types defined by Microsoft:
>  https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
> 
> An interesting part of the RSS standard is that the hash type can help
> identify whether this is a fragment. (XDP could use this info to avoid
> touching the payload and react, e.g. drop fragments, or redirect all
> fragments to another CPU, or skip parsing in XDP and defer to the
> network stack via XDP_PASS.)
> 
Yes, it would also be interesting if an XDP program could monitor any
packet errors that were captured by the NIC. While traditionally the
driver may decide to drop/consume such packets, it would be a good
use-case for debugging.

> By using the RSS standard, we do e.g. lose the ability to say this is
> IPSEC traffic, even though your HW supports this.
> 
> I have tried to implement my own (non-dynamic) XDP RX-types UAPI here:
>  https://marc.info/?l=linux-netdev&m=149512213531769
>  https://marc.info/?l=linux-netdev&m=149512213631774
> =========================
> 
> Thanks,
> Saeed.
> 
> Saeed Mahameed (6):
>   net: xdp: Add support for meta data flags requests
>   net: xdp: RX meta data infrastructure
>   net/mlx5e: Store xdp flags and meta data info
>   net/mlx5e: Pass CQE to RX handlers
>   net/mlx5e: Add XDP RX meta data support
>   samples/bpf: Add meta data hash example to xdp_redirect_cpu
> 
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
>  include/linux/netdevice.h                     |  10 ++
>  include/net/xdp.h                             |   6 +
>  include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
>  include/uapi/linux/if_link.h                  |  20 ++-
>  net/core/dev.c                                |  41 +++++++
>  samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
>  10 files changed, 354 insertions(+), 35 deletions(-)
> 
> --
> 2.17.0