
[bpf-next,v4,00/17] AF_XDP infrastructure improvements and mlx5e support

Message ID 20190612155605.22450-1-maximmi@mellanox.com
Series AF_XDP infrastructure improvements and mlx5e support

Message

Maxim Mikityanskiy June 12, 2019, 3:56 p.m. UTC
This series contains improvements to the AF_XDP kernel infrastructure
and AF_XDP support in mlx5e. The infrastructure improvements are
required for mlx5e, but some of them also benefit all drivers, and some
can be useful for other drivers that want to implement AF_XDP.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled, single stream:

txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps

The results with retpoline enabled, single stream:

txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps

v2 changes:

Added patches for mlx5e and addressed the comments for v1. Rebased for
bpf-next.

v3 changes:

Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed
Björn's comments for coding style. Fixed a bug in error handling flow in
mlx5e_open_xsk.

v4 changes:

The UAPI is not changed; XSK RX queues are exposed to the kernel. The
lower half of the available RX queues are regular queues, and the upper
half are XSK RX queues. The patch "xsk: Extend channels to support
combined XSK/non-XSK traffic" was dropped, and the final patch was
reworked accordingly.
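
For illustration only (not part of the series): a minimal libbpf sketch
of binding a zero-copy socket under this queue layout. The helper name
bind_zc_socket and the n_channels/channel parameters are made up for
the example, and UMEM creation is omitted.

    #include <linux/if_xdp.h>
    #include <bpf/xsk.h>

    /* Bind a zero-copy socket to the XSK RX queue paired with regular
     * queue `channel` on an interface with `n_channels` combined
     * channels (regular queues 0..n_channels-1, XSK queues
     * n_channels..2*n_channels-1).
     */
    static int bind_zc_socket(struct xsk_umem *umem, const char *ifname,
                              unsigned int n_channels, unsigned int channel,
                              struct xsk_socket **xsk,
                              struct xsk_ring_cons *rx,
                              struct xsk_ring_prod *tx)
    {
            struct xsk_socket_config cfg = {
                    .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                    .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
                    .bind_flags = XDP_ZEROCOPY, /* don't fall back to copy mode */
            };
            __u32 qid = n_channels + channel; /* upper-half queue id */

            return xsk_socket__create(xsk, ifname, qid, umem, rx, tx, &cfg);
    }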

Added "net/mlx5e: Attach/detach XDP program safely", as the changes
introduced in the XSK patch base on the stuff from this one.

Added "libbpf: Support drivers with non-combined channels", which aligns
the condition in libbpf with the condition in the kernel.
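
As a rough illustration of the idea (a sketch, not the literal patch):
the queue count is derived from the largest of the rx/tx/combined
channel maxima reported by ETHTOOL_GCHANNELS, rather than from
max_combined alone, so drivers that report separate rx/tx channels are
handled too. The function name get_max_queues is hypothetical here.

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    static int get_max_queues(const char *ifname)
    {
            struct ethtool_channels channels = { .cmd = ETHTOOL_GCHANNELS };
            struct ifreq ifr = {};
            int fd, ret = 1;

            fd = socket(AF_INET, SOCK_DGRAM, 0);
            if (fd < 0)
                    return -errno;

            ifr.ifr_data = (void *)&channels;
            strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

            /* If the device reports no channel info, assume one queue. */
            if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
                    ret = channels.max_rx > channels.max_tx ?
                          channels.max_rx : channels.max_tx;
                    if ((int)channels.max_combined > ret)
                            ret = channels.max_combined;
                    if (ret < 1)
                            ret = 1;
            }

            close(fd);
            return ret;
    }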

Rebased over the newer bpf-next.

Maxim Mikityanskiy (17):
  net/mlx5e: Attach/detach XDP program safely
  xsk: Add API to check for available entries in FQ
  xsk: Add getsockopt XDP_OPTIONS
  libbpf: Support getsockopt XDP_OPTIONS
  xsk: Change the default frame size to 4096 and allow controlling it
  xsk: Return the whole xdp_desc from xsk_umem_consume_tx
  libbpf: Support drivers with non-combined channels
  net/mlx5e: Replace deprecated PCI_DMA_TODEVICE
  net/mlx5e: Calculate linear RX frag size considering XSK
  net/mlx5e: Allow ICO SQ to be used by multiple RQs
  net/mlx5e: Refactor struct mlx5e_xdp_info
  net/mlx5e: Share the XDP SQ for XDP_TX between RQs
  net/mlx5e: XDP_TX from UMEM support
  net/mlx5e: Consider XSK in XDP MTU limit calculation
  net/mlx5e: Encapsulate open/close queues into a function
  net/mlx5e: Move queue param structs to en/params.h
  net/mlx5e: Add XSK zero-copy support

 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  12 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  15 +-
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 155 +++-
 .../ethernet/mellanox/mlx5/core/en/params.c   | 108 ++-
 .../ethernet/mellanox/mlx5/core/en/params.h   | 118 ++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 231 ++++--
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  36 +-
 .../mellanox/mlx5/core/en/xsk/Makefile        |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 192 +++++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  27 +
 .../mellanox/mlx5/core/en/xsk/setup.c         | 223 ++++++
 .../mellanox/mlx5/core/en/xsk/setup.h         |  25 +
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 111 +++
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.h   |  15 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c | 267 +++++++
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |  31 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  29 +-
 .../mellanox/mlx5/core/en_fs_ethtool.c        |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 726 ++++++++++++------
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  12 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 104 ++-
 .../ethernet/mellanox/mlx5/core/en_stats.c    | 115 ++-
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  30 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  42 +-
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  14 +-
 drivers/net/ethernet/mellanox/mlx5/core/wq.h  |   5 -
 include/net/xdp_sock.h                        |  27 +-
 include/uapi/linux/if_xdp.h                   |   8 +
 net/xdp/xsk.c                                 |  36 +-
 net/xdp/xsk_queue.h                           |  14 +
 samples/bpf/xdpsock_user.c                    |  44 +-
 tools/include/uapi/linux/if_xdp.h             |   8 +
 tools/lib/bpf/xsk.c                           |  18 +-
 tools/lib/bpf/xsk.h                           |   2 +-
 35 files changed, 2337 insertions(+), 484 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/Makefile
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h

Comments

Jonathan Lemon June 12, 2019, 7:10 p.m. UTC | #1
On 12 Jun 2019, at 8:56, Maxim Mikityanskiy wrote:

> This series contains improvements to the AF_XDP kernel infrastructure
> and AF_XDP support in mlx5e. The infrastructure improvements are
> required for mlx5e, but also some of them benefit to all drivers, and
> some can be useful for other drivers that want to implement AF_XDP.
>
> The performance testing was performed on a machine with the following
> configuration:
>
> - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
> - Mellanox ConnectX-5 Ex with 100 Gbit/s link
>
> The results with retpoline disabled, single stream:
>
> txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
> rxdrop: 12.2 Mpps
> l2fwd: 9.4 Mpps
>
> The results with retpoline enabled, single stream:
>
> txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
> rxdrop: 9.9 Mpps
> l2fwd: 6.8 Mpps
>
> v2 changes:
>
> Added patches for mlx5e and addressed the comments for v1. Rebased for
> bpf-next.
>
> v3 changes:
>
> Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed
> Björn's comments for coding style. Fixed a bug in error handling flow in
> mlx5e_open_xsk.
>
> v4 changes:
>
> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
> half of the available amount of RX queues are regular queues, and the
> upper half are XSK RX queues. The patch "xsk: Extend channels to support
> combined XSK/non-XSK traffic" was dropped. The final patch was reworked
> accordingly.
>
> Added "net/mlx5e: Attach/detach XDP program safely", as the changes
> introduced in the XSK patch base on the stuff from this one.
>
> Added "libbpf: Support drivers with non-combined channels", which aligns
> the condition in libbpf with the condition in the kernel.
>
> Rebased over the newer bpf-next.

Very nice change for the RX queues!
For the series:

Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Jakub Kicinski June 12, 2019, 8:48 p.m. UTC | #2
On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
> half of the available amount of RX queues are regular queues, and the
> upper half are XSK RX queues. 

If I have 32 queues enabled on the NIC and I install an AF_XDP socket
on queue 10, does the NIC now have 64 RQs, but only the first 32 are in
the normal RSS map?

> The patch "xsk: Extend channels to support combined XSK/non-XSK
> traffic" was dropped. The final patch was reworked accordingly.

The final patch has 2k LoC, which is kind of hard to digest.  You could
also post the clean-up patches separately; there's no need for a large
series here.
Björn Töpel June 13, 2019, 12:58 p.m. UTC | #3
On Wed, 12 Jun 2019 at 22:49, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
> > UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
> > half of the available amount of RX queues are regular queues, and the
> > upper half are XSK RX queues.
>
> If I have 32 queues enabled on the NIC and I install AF_XDP socket on
> queue 10, does the NIC now have 64 RQs, but only first 32 are in the
> normal RSS map?
>

An additional question, related to Jakub's: say that I'd like to hijack
all 32 Rx queues of the NIC. I create 32 AF_XDP sockets and attach them
in zero-copy mode to the device. What's the result?

> > The patch "xsk: Extend channels to support combined XSK/non-XSK
> > traffic" was dropped. The final patch was reworked accordingly.
>
> The final patches has 2k LoC, kind of hard to digest.  You can also
> post the clean up patches separately, no need for large series here.
Maxim Mikityanskiy June 13, 2019, 2:01 p.m. UTC | #4
On 2019-06-12 23:48, Jakub Kicinski wrote:
> On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
>> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
>> half of the available amount of RX queues are regular queues, and the
>> upper half are XSK RX queues.
> 
> If I have 32 queues enabled on the NIC

Let's say we have 32 combined channels. In this case RX queues 0..31 are 
regular ones, and 32..63 are XSK-ZC-enabled.

> and I install AF_XDP socket on
> queue 10

It'll trigger the compatibility mode of AF_XDP (without zero copy). You 
should use queue 42, which is in the 32..63 set.

> , does the NIC now have 64 RQs, but only first 32 are in the
> normal RSS map?

Only the regular 0..31 RX queues are part of RSS.

> 
>> The patch "xsk: Extend channels to support combined XSK/non-XSK
>> traffic" was dropped. The final patch was reworked accordingly.
> 
> The final patches has 2k LoC, kind of hard to digest.  You can also
> post the clean up patches separately, no need for large series here.
> 

I used to have the final patch split into three (add XSK stubs, add RX 
support, add TX support), but I prefer not to have that separation, 
because it doesn't look right to add empty stub functions with /* TODO: 
implement */ comments in one patch and then add the implementations 
immediately in the next one.

Thanks for reviewing!
Max
Maxim Mikityanskiy June 13, 2019, 2:01 p.m. UTC | #5
On 2019-06-13 15:58, Björn Töpel wrote:
> On Wed, 12 Jun 2019 at 22:49, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>>
>> On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
>>> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
>>> half of the available amount of RX queues are regular queues, and the
>>> upper half are XSK RX queues.
>>
>> If I have 32 queues enabled on the NIC and I install AF_XDP socket on
>> queue 10, does the NIC now have 64 RQs, but only first 32 are in the
>> normal RSS map?
>>
> 
> Additional, related, question to Jakub's: Say that I'd like to hijack
> all 32 Rx queues of the NIC. I create 32 AF_XDP socket and attach them
> in zero-copy mode to the device. What's the result?

There are 32 regular RX queues (0..31) and 32 XSK RX queues (32..63). If 
you want 32 zero-copy AF_XDP sockets, you can attach them to queues 
32..63, and the regular traffic won't be affected at all.
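
For illustration, a sketch of what that setup could look like with
libbpf. setup_umem() is a hypothetical helper that mmap()s a buffer and
calls xsk_umem__create() for it, and traffic still has to be steered to
the XSK queues (e.g. with ethtool flow steering), since the regular RSS
set is untouched.

    #include <bpf/xsk.h>
    #include <linux/if_xdp.h>

    #define NUM_CHANNELS 32 /* combined channels on the NIC in this example */

    struct zc_sock {
            struct xsk_umem *umem;
            struct xsk_socket *xsk;
            struct xsk_ring_prod fq, tx;
            struct xsk_ring_cons cq, rx;
    };

    /* Hypothetical helper: allocate a UMEM and its fill/completion rings. */
    int setup_umem(struct xsk_umem **umem, struct xsk_ring_prod *fq,
                   struct xsk_ring_cons *cq);

    static int open_all_zc_sockets(const char *ifname, struct zc_sock *s)
    {
            struct xsk_socket_config cfg = {
                    .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                    .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
                    .bind_flags = XDP_ZEROCOPY,
            };
            int i, err;

            for (i = 0; i < NUM_CHANNELS; i++) {
                    err = setup_umem(&s[i].umem, &s[i].fq, &s[i].cq);
                    if (err)
                            return err;
                    /* XSK RX queue NUM_CHANNELS + i pairs with channel i. */
                    err = xsk_socket__create(&s[i].xsk, ifname,
                                             NUM_CHANNELS + i, s[i].umem,
                                             &s[i].rx, &s[i].tx, &cfg);
                    if (err)
                            return err;
            }
            return 0;
    }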

>>> The patch "xsk: Extend channels to support combined XSK/non-XSK
>>> traffic" was dropped. The final patch was reworked accordingly.
>>
>> The final patches has 2k LoC, kind of hard to digest.  You can also
>> post the clean up patches separately, no need for large series here.
Björn Töpel June 13, 2019, 2:11 p.m. UTC | #6
On 2019-06-13 16:01, Maxim Mikityanskiy wrote:
> On 2019-06-13 15:58, Björn Töpel wrote:
>> On Wed, 12 Jun 2019 at 22:49, Jakub Kicinski
>> <jakub.kicinski@netronome.com> wrote:
>>>
>>> On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
>>>> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
>>>> half of the available amount of RX queues are regular queues, and the
>>>> upper half are XSK RX queues.
>>>
>>> If I have 32 queues enabled on the NIC and I install AF_XDP socket on
>>> queue 10, does the NIC now have 64 RQs, but only first 32 are in the
>>> normal RSS map?
>>>
>>
>> Additional, related, question to Jakub's: Say that I'd like to hijack
>> all 32 Rx queues of the NIC. I create 32 AF_XDP socket and attach them
>> in zero-copy mode to the device. What's the result?
> 
> There are 32 regular RX queues (0..31) and 32 XSK RX queues (32..63). If
> you want 32 zero-copy AF_XDP sockets, you can attach them to queues
> 32..63, and the regular traffic won't be affected at all.
> 
Thanks for getting back! More questions!

Ok, so I cannot (with zero-copy) get the regular traffic into AF_XDP
sockets?

How do the qids map? Can I only bind a zero-copy socket to qids 32..63
in the example above?

Say that I have a copy-mode AF_XDP socket bound to queue 2. In this
case I will receive the regular traffic from queue 2. If I enable
zero-copy for the same queue, will this receive AF_XDP-specific traffic
from queue 2+32, or will it return an error and require an explicit
bind to one of the queues 32..63?


Thanks,
Björn
Björn Töpel June 13, 2019, 2:53 p.m. UTC | #7
On Thu, 13 Jun 2019 at 16:11, Björn Töpel <bjorn.topel@intel.com> wrote:
>
>
> On 2019-06-13 16:01, Maxim Mikityanskiy wrote:
> > On 2019-06-13 15:58, Björn Töpel wrote:
> >> On Wed, 12 Jun 2019 at 22:49, Jakub Kicinski
> >> <jakub.kicinski@netronome.com> wrote:
> >>>
> >>> On Wed, 12 Jun 2019 15:56:33 +0000, Maxim Mikityanskiy wrote:
> >>>> UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
> >>>> half of the available amount of RX queues are regular queues, and the
> >>>> upper half are XSK RX queues.
> >>>
> >>> If I have 32 queues enabled on the NIC and I install AF_XDP socket on
> >>> queue 10, does the NIC now have 64 RQs, but only first 32 are in the
> >>> normal RSS map?
> >>>
> >>
> >> Additional, related, question to Jakub's: Say that I'd like to hijack
> >> all 32 Rx queues of the NIC. I create 32 AF_XDP socket and attach them
> >> in zero-copy mode to the device. What's the result?
> >
> > There are 32 regular RX queues (0..31) and 32 XSK RX queues (32..63). If
> > you want 32 zero-copy AF_XDP sockets, you can attach them to queues
> > 32..63, and the regular traffic won't be affected at all.
> >
> Thanks for getting back! More questions!
>
> Ok, so I cannot (with zero-copy) get the regular traffic into AF_XDP
> sockets?
>
> How does qids map? Can I only bind a zero-copy socket to qid 32..63 in
> the example above?
>
> Say that I have a a copy-mode AF_XDP socket bound to queue 2. In this
> case I will receive the regular traffic from queue 2. Enabling zero-copy
> for the same queue, will this give an error, or receive AF_XDP specific
> traffic from queue 2+32? Or return an error, and require an explicit
> bind to any of the queues 32..63?
>
>

Let me expand a bit on why I'm asking these qid questions.

It's unfortunate that vendors have different views/mappings of "qids".
For Intel, we allow binding a zero-copy socket to any Rx qid. For
Mellanox, only a certain set of qids is allowed for zero-copy sockets.

This highlights a need for a better abstraction for queues than "some
queue id from ethtool". This will take some time, and I think that we
have to accept for now that we'll have different behavior/mapping for
zero-copy sockets on different NICs.

Let's address this need for a better queue abstraction, but that
shouldn't block this series IMO. Other than patch:

"[PATCH bpf-next v4 07/17] libbpf: Support drivers with non-combined channels"

which I'd like to see a bit more discussion on, I'm OK with this
series. I haven't been able to test it (no hardware "hint, hint"), but
I know Jonathan has been running it.

Thanks for working on this, Max!

Björn