[bpf-next,00/14] xsk: support shared umems between devices and queues

Message ID 1593692353-15102-1-git-send-email-magnus.karlsson@intel.com

Message

Magnus Karlsson July 2, 2020, 12:18 p.m. UTC
This patch set adds support for sharing a umem between AF_XDP sockets
bound to different queue ids on the same device or even between
devices. It has already been possible to do this by registering the
umem multiple times, but this wastes a lot of memory. Just imagine
having 10 threads, each with 10 open sockets sharing a single umem:
you would have to register the umem 100 times, consuming large
quantities of memory.

Instead, we extend the existing XDP_SHARED_UMEM flag to also work when
sharing a umem between different queue ids as well as devices. If you
would like to share a umem between two sockets, just create the first
one as you would normally. For the second socket, do not register the
same umem again with the XDP_UMEM_REG setsockopt. Instead, attach one
new fill ring and one new completion ring to this second socket and
then bind it with the XDP_SHARED_UMEM flag, supplying the file
descriptor of the first socket in the sxdp_shared_umem_fd field to
signify that you would like to share the umem of that first socket.
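
For illustration, here is a minimal sketch of what that second socket
looks like at the raw socket level. The ring size, ifindex and
queue_id values are placeholders, the rx/tx rings and the mmap calls
for all rings are omitted, and there is no error handling:

#include <linux/if_xdp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

static int xsk_bind_shared(int first_fd, int ifindex, unsigned int queue_id)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);
        int ring_size = 2048;
        struct sockaddr_xdp sxdp;

        /* No XDP_UMEM_REG on this socket; only attach new fill and
         * completion rings, one pair per unique (device, queue_id).
         * RX/TX ring setup and the mmap of all rings are omitted.
         */
        setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING,
                   &ring_size, sizeof(ring_size));
        setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                   &ring_size, sizeof(ring_size));

        memset(&sxdp, 0, sizeof(sxdp));
        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = ifindex;         /* may differ from first socket */
        sxdp.sxdp_queue_id = queue_id;       /* may differ from first socket */
        sxdp.sxdp_flags = XDP_SHARED_UMEM;
        sxdp.sxdp_shared_umem_fd = first_fd; /* socket that did XDP_UMEM_REG */

        return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) ? -1 : fd;
}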

One important thing to note in this example is that there needs to be
one fill ring and one completion ring per unique device and queue id
bound to. This is so that the single-producer and single-consumer
semantics of the rings can be upheld. To recap: if you bind multiple
sockets to the same device and queue id (already supported without
this patch set), you only need one pair of fill and completion rings.
If you bind multiple sockets to multiple different queues or devices,
you need one fill and completion ring pair per unique (device,
queue_id) tuple.

The implementation is based around extending the buffer pool in the
core xsk code. This is a structure that exists on a per unique device
and queue id basis. So, a number of entities that can now be shared
are moved from the umem to the buffer pool. Information about DMA
mappings is also moved out of the buffer pool, but as these mappings
are per device and independent of the queue id, they now hang off the
netdev. In summary, after this patch set there is one xdp_sock struct
per socket created. This points to an xsk_buff_pool, of which there is
one per unique device and queue id. The buffer pool points to a DMA
mapping structure, of which there is one per device that a umem has
been bound to. And finally, the buffer pool also points to an xdp_umem
struct, of which there is only one per umem registration.

Before:

XSK -> UMEM -> POOL

Now:

XSK -> POOL -> DMA
            \
             > UMEM
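
Expressed as simplified structs, the relationships are roughly the
following. This is illustrative only: the real definitions live under
include/net/ and net/xdp/ and carry many more fields, and the name
xsk_dma_map for the per-device DMA mapping structure is just a
placeholder here.

struct net_device;

struct xdp_umem {                       /* one per XDP_UMEM_REG */
        void *addrs;                    /* the registered user memory */
        /* ... */
};

struct xsk_dma_map {                    /* one per (device, umem); hangs off the netdev */
        struct net_device *dev;
        unsigned long long *dma_pages;  /* dma_addr_t array in the kernel */
};

struct xsk_buff_pool {                  /* one per unique (device, queue_id) */
        struct xdp_umem *umem;
        struct xsk_dma_map *dma_map;
        /* fill ring, completion ring, dev, queue_id, ... */
};

struct xdp_sock {                       /* one per AF_XDP socket */
        struct xsk_buff_pool *pool;
        /* ... */
};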

Patches 1-8 only rearrange internal structures to support the buffer
pool carrying this new information, while patch 9 improves performance
now that the internal structures have been rearranged quite a bit.
Finally, patches 10-14 introduce the new functionality together with
libbpf support, samples, and documentation.

Libbpf has also been extended to support sharing of umems between
sockets bound to different devices and queue ids by introducing a new
function called xsk_socket__create_shared(). The difference between
this and the existing xsk_socket__create() is that the former takes
references to a fill ring and a completion ring, as these need to be
created per unique device and queue id. This new function needs to be
used for the second and following sockets that bind to the same umem.
The first socket can be created with either function, as the
application will already have called xsk_umem__create() for it.
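
As a rough usage sketch for one of those subsequent sockets, assuming
the xsk_socket__create_shared() signature proposed in the libbpf patch
(error handling omitted; the ring structures must outlive the call):

#include <bpf/xsk.h>    /* tools/lib/bpf/xsk.h in this series */

static int create_second_socket(struct xsk_umem *umem, const char *ifname,
                                unsigned int queue_id,
                                struct xsk_ring_cons *rx,
                                struct xsk_ring_prod *tx,
                                struct xsk_ring_prod *fill,
                                struct xsk_ring_cons *comp,
                                struct xsk_socket **xsk)
{
        /* fill and comp are new rings for this (netdev, queue_id) pair;
         * the umem itself was registered once via xsk_umem__create()
         * when the first socket was set up. NULL config uses defaults.
         */
        return xsk_socket__create_shared(xsk, ifname, queue_id, umem,
                                         rx, tx, fill, comp, NULL);
}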

There is also a new sample xsk_fwd that demonstrates this new
interface and capability.

Note to Maxim at Mellanox. I do not have a mlx5 card, so I have not
been able to test the changes to your driver. It compiles, but that is
all I can say, so it would be great if you could test it. Also, I did
change the name of many functions and variables from umem to pool as a
buffer pool is passed down to the driver in this patch set instead of
the umem. I did not change the name of the files umem.c and
umem.h. Please go through the changes and change things to your
liking.

Performance for the non-shared umem case is unchanged for the xdpsock
sample application with this patch set. For workloads that share a
umem, this patch set can give rise to added performance benefits due
to the decrease in memory usage.

This patch set has been applied against commit 91f77560e473 ("Merge branch 'test_progs-improvements'")

Structure of the patch set:

Patch 1: Pass the buffer pool to the driver instead of the umem. This
         is because the driver needs one buffer pool per napi context
         when we later introduce sharing of the umem between queue ids
         and devices.
Patch 2: Rename the xsk driver interfaces so they have better names
         after the move to the buffer pool
Patch 3: There is one buffer pool per device and queue, while there is
         only one umem per registration. The buffer pool needs to be
         created and destroyed independently of the umem.
Patch 4: Move fill and completion rings to the buffer pool as there will
         be one set of these per device and queue
Patch 5: Move queue_id, dev and need_wakeup to the buffer pool as
         well, since these will now be per buffer pool when the umem
         can be shared between devices and queues
Patch 6: Move xsk_tx_list and its lock to buffer pool
Patch 7: Move the creation/deletion of addrs from buffer pool to umem
Patch 8: Enable sharing of DMA mappings when multiple queues of the
         same device are bound
Patch 9: Rearrange internal structs for better performance as these
         have been substantially scrambled by the previous patches
Patch 10: Add shared umem support between queue ids
Patch 11: Add shared umem support between devices
Patch 12: Add support for this in libbpf
Patch 13: Add a new sample that demonstrates this new feature by
          forwarding packets between different netdevs and queues
Patch 14: Add documentation

Thanks: Magnus

Cristian Dumitrescu (1):
  samples/bpf: add new sample xsk_fwd.c

Magnus Karlsson (13):
  xsk: i40e: ice: ixgbe: mlx5: pass buffer pool to driver instead of
    umem
  xsk: i40e: ice: ixgbe: mlx5: rename xsk zero-copy driver interfaces
  xsk: create and free context independently from umem
  xsk: move fill and completion rings to buffer pool
  xsk: move queue_id, dev and need_wakeup to context
  xsk: move xsk_tx_list and its lock to buffer pool
  xsk: move addrs from buffer pool to umem
  xsk: net: enable sharing of dma mappings
  xsk: rearrange internal structs for better performance
  xsk: add shared umem support between queue ids
  xsk: add shared umem support between devices
  libbpf: support shared umems between queues and devices
  xsk: documentation for XDP_SHARED_UMEM between queues and netdevs

 Documentation/networking/af_xdp.rst                |   68 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c     |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c        |   29 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   10 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h        |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c         |   79 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.h         |    4 +-
 drivers/net/ethernet/intel/ice/ice.h               |   18 +-
 drivers/net/ethernet/intel/ice/ice_base.c          |   16 +-
 drivers/net/ethernet/intel/ice/ice_lib.c           |    2 +-
 drivers/net/ethernet/intel/ice/ice_main.c          |   10 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c          |    8 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h          |    2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c           |  142 +--
 drivers/net/ethernet/intel/ice/ice_xsk.h           |    7 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h           |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   34 +-
 .../net/ethernet/intel/ixgbe/ixgbe_txrx_common.h   |    7 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c       |   61 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   19 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |    5 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   10 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c |   12 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.h |    2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c    |   12 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.h    |    6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c  |  108 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h  |   14 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   46 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   16 +-
 include/linux/netdevice.h                          |   13 +-
 include/net/xdp_sock.h                             |   28 +-
 include/net/xdp_sock_drv.h                         |  115 ++-
 include/net/xsk_buff_pool.h                        |   47 +-
 net/core/dev.c                                     |    3 +
 net/ethtool/channels.c                             |    2 +-
 net/ethtool/ioctl.c                                |    2 +-
 net/xdp/xdp_umem.c                                 |  221 +---
 net/xdp/xdp_umem.h                                 |    6 -
 net/xdp/xsk.c                                      |  213 ++--
 net/xdp/xsk.h                                      |    3 +
 net/xdp/xsk_buff_pool.c                            |  314 +++++-
 net/xdp/xsk_diag.c                                 |   14 +-
 net/xdp/xsk_queue.h                                |   12 +-
 samples/bpf/Makefile                               |    3 +
 samples/bpf/xsk_fwd.c                              | 1075 ++++++++++++++++++++
 tools/lib/bpf/libbpf.map                           |    1 +
 tools/lib/bpf/xsk.c                                |  376 ++++---
 tools/lib/bpf/xsk.h                                |    9 +
 49 files changed, 2327 insertions(+), 883 deletions(-)
 create mode 100644 samples/bpf/xsk_fwd.c

--
2.7.4

Comments

Daniel Borkmann July 6, 2020, 6:39 p.m. UTC | #1
On 7/2/20 2:18 PM, Magnus Karlsson wrote:
> This patch set adds support to share a umem between AF_XDP sockets
> bound to different queue ids on the same device or even between
> devices. It has already been possible to do this by registering the
> umem multiple times, but this wastes a lot of memory. Just imagine
> having 10 threads each having 10 sockets open sharing a single
> umem. This means that you would have to register the umem 100 times
> consuming large quantities of memory.

[...]

> Note to Maxim at Mellanox. I do not have a mlx5 card, so I have not
> been able to test the changes to your driver. It compiles, but that is
> all I can say, so it would be great if you could test it. Also, I did
> change the name of many functions and variables from umem to pool as a
> buffer pool is passed down to the driver in this patch set instead of
> the umem. I did not change the name of the files umem.c and
> umem.h. Please go through the changes and change things to your
> liking.

Bjorn / Maxim, this is waiting on review (& mlx5 testing) from you, ptal.

Thanks,
Daniel
Maxim Mikityanskiy July 7, 2020, 10:37 a.m. UTC | #2
On 2020-07-06 21:39, Daniel Borkmann wrote:
> On 7/2/20 2:18 PM, Magnus Karlsson wrote:
>> This patch set adds support to share a umem between AF_XDP sockets
>> bound to different queue ids on the same device or even between
>> devices. It has already been possible to do this by registering the
>> umem multiple times, but this wastes a lot of memory. Just imagine
>> having 10 threads each having 10 sockets open sharing a single
>> umem. This means that you would have to register the umem 100 times
>> consuming large quantities of memory.

Sounds like this series has some great stuff!

> [...]
> 
>> Note to Maxim at Mellanox. I do not have a mlx5 card, so I have not
>> been able to test the changes to your driver. It compiles, but that is
>> all I can say, so it would be great if you could test it. Also, I did
>> change the name of many functions and variables from umem to pool as a
>> buffer pool is passed down to the driver in this patch set instead of
>> the umem. I did not change the name of the files umem.c and
>> umem.h. Please go through the changes and change things to your
>> liking.
> 
> Bjorn / Maxim, this is waiting on review (& mlx5 testing) from you, ptal.

Sure, I'll take a look and do the mlx5 testing (I only noticed this 
series yesterday).

> Thanks,
> Daniel
Maxim Mikityanskiy July 8, 2020, 3 p.m. UTC | #3
On 2020-07-02 15:18, Magnus Karlsson wrote:
> This patch set adds support to share a umem between AF_XDP sockets
> bound to different queue ids on the same device or even between
> devices. It has already been possible to do this by registering the
> umem multiple times, but this wastes a lot of memory. Just imagine
> having 10 threads each having 10 sockets open sharing a single
> umem. This means that you would have to register the umem 100 times
> consuming large quantities of memory.

Just to clarify: the main memory savings are achieved, because we don't 
need to store an array of pages in struct xdp_umem multiple times, right?

I guess there is one more drawback of sharing a UMEM the old way 
(register it multiple times): it would map (DMA) the same pages multiple 
times.

> Instead, we extend the existing XDP_SHARED_UMEM flag to also work when
> sharing a umem between different queue ids as well as devices. If you
> would like to share umem between two sockets, just create the first
> one as would do normally. For the second socket you would not register
> the same umem using the XDP_UMEM_REG setsockopt. Instead attach one
> new fill ring and one new completion ring to this second socket and
> then use the XDP_SHARED_UMEM bind flag supplying the file descriptor of
> the first socket in the sxdp_shared_umem_fd field to signify that it
> is the umem of the first socket you would like to share.
> 
> One important thing to note in this example, is that there needs to be
> one fill ring and one completion ring per unique device and queue id
> bound to. This so that the single-producer and single-consumer semantics
> of the rings can be upheld. To recap, if you bind multiple sockets to
> the same device and queue id (already supported without this patch
> set), you only need one pair of fill and completion rings. If you bind
> multiple sockets to multiple different queues or devices, you need one
> fill and completion ring pair per unique device,queue_id tuple.
> 
> The implementation is based around extending the buffer pool in the
> core xsk code. This is a structure that exists on a per unique device
> and queue id basis. So, a number of entities that can now be shared
> are moved from the umem to the buffer pool. Information about DMA
> mappings are also moved from the buffer pool, but as these are per
> device independent of the queue id, they are now hanging off the
> netdev.

Basically, you want to map a pair of (netdev, UMEM) to DMA info. The 
current implementation of xp_find_dma_map stores a list of UMEMs in the 
netdev and goes over that list to find the corresponding DMA info. It 
would be more effective to do it vice-versa, i.e. to store the list of 
netdevs inside of a UMEM, because you normally have fewer netdevs in the 
system than sockets, and you'll have fewer list items to traverse. Of 
course, it has no effect on the data path, but it will improve the time 
to open a socket (i.e. connection rate).

> In summary after this patch set, there is one xdp_sock struct
> per socket created. This points to an xsk_buff_pool for which there is
> one per unique device and queue id. The buffer pool points to a DMA
> mapping structure for which there is one per device that a umem has
> been bound to. And finally, the buffer pool also points to a xdp_umem
> struct, for which there is only one per umem registration.
> 
> Before:
> 
> XSK -> UMEM -> POOL
> 
> Now:
> 
> XSK -> POOL -> DMA
>              \
> 	     > UMEM
> 
> Patches 1-8 only rearrange internal structures to support the buffer
> pool carrying this new information, while patch 9 improves performance
> as we now have rearrange the internal structures quite a bit. Finally,
> patches 10-14 introduce the new functionality together with libbpf
> support, samples, and documentation.
> 
> Libbpf has also been extended to support sharing of umems between
> sockets bound to different devices and queue ids by introducing a new
> function called xsk_socket__create_shared(). The difference between
> this and the existing xsk_socket__create() is that the former takes a
> reference to a fill ring and a completion ring as these need to be
> created. This new function needs to be used for the second and
> following sockets that binds to the same umem. The first one can be
> created by either function as it will also have called
> xsk_umem__create().
> 
> There is also a new sample xsk_fwd that demonstrates this new
> interface and capability.
> 
> Note to Maxim at Mellanox. I do not have a mlx5 card, so I have not
> been able to test the changes to your driver. It compiles, but that is
> all I can say, so it would be great if you could test it. Also, I did
> change the name of many functions and variables from umem to pool as a
> buffer pool is passed down to the driver in this patch set instead of
> the umem. I did not change the name of the files umem.c and
> umem.h. Please go through the changes and change things to your
> liking.

I looked through the mlx5 patches, and I see the changes are minor, and 
most importantly, the functionality is not broken (tested with xdpsock). 
I would still like to make some cosmetic amendments - I'll send you an 
updated patch.

> Performance for the non-shared umem case is unchanged for the xdpsock
> sample application with this patch set.

I also tested it on mlx5 (ConnectX-5 Ex), and the performance hasn't 
been hurt.

> For workloads that share a
> umem, this patch set can give rise to added performance benefits due
> to the decrease in memory usage.
> 
> This patch has been applied against commit 91f77560e473 ("Merge branch 'test_progs-improvements'")
> 
> Structure of the patch set:
> 
> Patch 1: Pass the buffer pool to the driver instead of the umem. This
>           because the driver needs one buffer pool per napi context
>           when we later introduce sharing of the umem between queue ids
>           and devices.
> Patch 2: Rename the xsk driver interface so they have better names
>           after the move to the buffer pool
> Patch 3: There is one buffer pool per device and queue, while there is
>           only one umem per registration. The buffer pool needs to be
>           created and destroyed independently of the umem.
> Patch 4: Move fill and completion rings to the buffer pool as there will
>           be one set of these per device and queue
> Patch 5: Move queue_id, dev and need_wakeup to buffer pool again as these
>           will now be per buffer pool as the umem can be shared between
>           devices and queues
> Patch 6: Move xsk_tx_list and its lock to buffer pool
> Patch 7: Move the creation/deletion of addrs from buffer pool to umem
> Patch 8: Enable sharing of DMA mappings when multiple queues of the
>           same device are bound
> Patch 9: Rearrange internal structs for better performance as these
>           have been substantially scrambled by the previous patches
> Patch 10: Add shared umem support between queue ids
> Patch 11: Add shared umem support between devices
> Patch 12: Add support for this in libbpf
> Patch 13: Add a new sample that demonstrates this new feature by
>            forwarding packets between different netdevs and queues
> Patch 14: Add documentation
> 
> Thanks: Magnus
> 
> Cristian Dumitrescu (1):
>    samples/bpf: add new sample xsk_fwd.c
> 
> Magnus Karlsson (13):
>    xsk: i40e: ice: ixgbe: mlx5: pass buffer pool to driver instead of
>      umem
>    xsk: i40e: ice: ixgbe: mlx5: rename xsk zero-copy driver interfaces
>    xsk: create and free context independently from umem
>    xsk: move fill and completion rings to buffer pool
>    xsk: move queue_id, dev and need_wakeup to context
>    xsk: move xsk_tx_list and its lock to buffer pool
>    xsk: move addrs from buffer pool to umem
>    xsk: net: enable sharing of dma mappings
>    xsk: rearrange internal structs for better performance
>    xsk: add shared umem support between queue ids
>    xsk: add shared umem support between devices
>    libbpf: support shared umems between queues and devices
>    xsk: documentation for XDP_SHARED_UMEM between queues and netdevs
> 
>   Documentation/networking/af_xdp.rst                |   68 +-
>   drivers/net/ethernet/intel/i40e/i40e_ethtool.c     |    2 +-
>   drivers/net/ethernet/intel/i40e/i40e_main.c        |   29 +-
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   10 +-
>   drivers/net/ethernet/intel/i40e/i40e_txrx.h        |    2 +-
>   drivers/net/ethernet/intel/i40e/i40e_xsk.c         |   79 +-
>   drivers/net/ethernet/intel/i40e/i40e_xsk.h         |    4 +-
>   drivers/net/ethernet/intel/ice/ice.h               |   18 +-
>   drivers/net/ethernet/intel/ice/ice_base.c          |   16 +-
>   drivers/net/ethernet/intel/ice/ice_lib.c           |    2 +-
>   drivers/net/ethernet/intel/ice/ice_main.c          |   10 +-
>   drivers/net/ethernet/intel/ice/ice_txrx.c          |    8 +-
>   drivers/net/ethernet/intel/ice/ice_txrx.h          |    2 +-
>   drivers/net/ethernet/intel/ice/ice_xsk.c           |  142 +--
>   drivers/net/ethernet/intel/ice/ice_xsk.h           |    7 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe.h           |    2 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   34 +-
>   .../net/ethernet/intel/ixgbe/ixgbe_txrx_common.h   |    7 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c       |   61 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en.h       |   19 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |    5 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   10 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c |   12 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.h |    2 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c    |   12 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.h    |    6 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c  |  108 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h  |   14 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   46 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   16 +-
>   include/linux/netdevice.h                          |   13 +-
>   include/net/xdp_sock.h                             |   28 +-
>   include/net/xdp_sock_drv.h                         |  115 ++-
>   include/net/xsk_buff_pool.h                        |   47 +-
>   net/core/dev.c                                     |    3 +
>   net/ethtool/channels.c                             |    2 +-
>   net/ethtool/ioctl.c                                |    2 +-
>   net/xdp/xdp_umem.c                                 |  221 +---
>   net/xdp/xdp_umem.h                                 |    6 -
>   net/xdp/xsk.c                                      |  213 ++--
>   net/xdp/xsk.h                                      |    3 +
>   net/xdp/xsk_buff_pool.c                            |  314 +++++-
>   net/xdp/xsk_diag.c                                 |   14 +-
>   net/xdp/xsk_queue.h                                |   12 +-
>   samples/bpf/Makefile                               |    3 +
>   samples/bpf/xsk_fwd.c                              | 1075 ++++++++++++++++++++
>   tools/lib/bpf/libbpf.map                           |    1 +
>   tools/lib/bpf/xsk.c                                |  376 ++++---
>   tools/lib/bpf/xsk.h                                |    9 +
>   49 files changed, 2327 insertions(+), 883 deletions(-)
>   create mode 100644 samples/bpf/xsk_fwd.c
> 
> --
> 2.7.4
>
Magnus Karlsson July 9, 2020, 6:54 a.m. UTC | #4
On Wed, Jul 8, 2020 at 5:02 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> On 2020-07-02 15:18, Magnus Karlsson wrote:
> > This patch set adds support to share a umem between AF_XDP sockets
> > bound to different queue ids on the same device or even between
> > devices. It has already been possible to do this by registering the
> > umem multiple times, but this wastes a lot of memory. Just imagine
> > having 10 threads each having 10 sockets open sharing a single
> > umem. This means that you would have to register the umem 100 times
> > consuming large quantities of memory.
>
> Just to clarify: the main memory savings are achieved, because we don't
> need to store an array of pages in struct xdp_umem multiple times, right?
>
> I guess there is one more drawback of sharing a UMEM the old way
> (register it multiple times): it would map (DMA) the same pages multiple
> times.

Both are correct. The main saving is from only having to DMA map the
device once, your second comment.

> > Instead, we extend the existing XDP_SHARED_UMEM flag to also work when
> > sharing a umem between different queue ids as well as devices. If you
> > would like to share umem between two sockets, just create the first
> > one as would do normally. For the second socket you would not register
> > the same umem using the XDP_UMEM_REG setsockopt. Instead attach one
> > new fill ring and one new completion ring to this second socket and
> > then use the XDP_SHARED_UMEM bind flag supplying the file descriptor of
> > the first socket in the sxdp_shared_umem_fd field to signify that it
> > is the umem of the first socket you would like to share.
> >
> > One important thing to note in this example, is that there needs to be
> > one fill ring and one completion ring per unique device and queue id
> > bound to. This so that the single-producer and single-consumer semantics
> > of the rings can be upheld. To recap, if you bind multiple sockets to
> > the same device and queue id (already supported without this patch
> > set), you only need one pair of fill and completion rings. If you bind
> > multiple sockets to multiple different queues or devices, you need one
> > fill and completion ring pair per unique device,queue_id tuple.
> >
> > The implementation is based around extending the buffer pool in the
> > core xsk code. This is a structure that exists on a per unique device
> > and queue id basis. So, a number of entities that can now be shared
> > are moved from the umem to the buffer pool. Information about DMA
> > mappings are also moved from the buffer pool, but as these are per
> > device independent of the queue id, they are now hanging off the
> > netdev.
>
> Basically, you want to map a pair of (netdev, UMEM) to DMA info. The
> current implementation of xp_find_dma_map stores a list of UMEMs in the
> netdev and goes over that list to find the corresponding DMA info. It
> would be more effective to do it vice-versa, i.e. to store the list of
> netdevs inside of a UMEM, because you normally have fewer netdevs in the
> system than sockets, and you'll have fewer list items to traverse. Of
> course, it has no effect on the data path, but it will improve the time
> to open a socket (i.e. connection rate).

Good idea. Will fix.

> > In summary after this patch set, there is one xdp_sock struct
> > per socket created. This points to an xsk_buff_pool for which there is
> > one per unique device and queue id. The buffer pool points to a DMA
> > mapping structure for which there is one per device that a umem has
> > been bound to. And finally, the buffer pool also points to a xdp_umem
> > struct, for which there is only one per umem registration.
> >
> > Before:
> >
> > XSK -> UMEM -> POOL
> >
> > Now:
> >
> > XSK -> POOL -> DMA
> >              \
> >            > UMEM
> >
> > Patches 1-8 only rearrange internal structures to support the buffer
> > pool carrying this new information, while patch 9 improves performance
> > as we now have rearrange the internal structures quite a bit. Finally,
> > patches 10-14 introduce the new functionality together with libbpf
> > support, samples, and documentation.
> >
> > Libbpf has also been extended to support sharing of umems between
> > sockets bound to different devices and queue ids by introducing a new
> > function called xsk_socket__create_shared(). The difference between
> > this and the existing xsk_socket__create() is that the former takes a
> > reference to a fill ring and a completion ring as these need to be
> > created. This new function needs to be used for the second and
> > following sockets that binds to the same umem. The first one can be
> > created by either function as it will also have called
> > xsk_umem__create().
> >
> > There is also a new sample xsk_fwd that demonstrates this new
> > interface and capability.
> >
> > Note to Maxim at Mellanox. I do not have a mlx5 card, so I have not
> > been able to test the changes to your driver. It compiles, but that is
> > all I can say, so it would be great if you could test it. Also, I did
> > change the name of many functions and variables from umem to pool as a
> > buffer pool is passed down to the driver in this patch set instead of
> > the umem. I did not change the name of the files umem.c and
> > umem.h. Please go through the changes and change things to your
> > liking.
>
> I looked through the mlx5 patches, and I see the changes are minor, and
> most importantly, the functionality is not broken (tested with xdpsock).
> I would still like to make some cosmetic amendments - I'll send you an
> updated patch.

Appreciated. Thanks.

> > Performance for the non-shared umem case is unchanged for the xdpsock
> > sample application with this patch set.
>
> I also tested it on mlx5 (ConnectX-5 Ex), and the performance hasn't
> been hurt.

Good to hear. I might include another patch in the v2 that improves
performance by 3% for the l2fwd sample app on my system. It is in
common code, so it should benefit everyone. Though, it is dependent on
a new DMA interface patch set from Christoph making it from bpf to
bpf-next. If it makes it over in time, I will include it. Otherwise,
it will be submitted later.

/Magnus

> > For workloads that share a
> > umem, this patch set can give rise to added performance benefits due
> > to the decrease in memory usage.
> >
> > This patch has been applied against commit 91f77560e473 ("Merge branch 'test_progs-improvements'")
> >
> > Structure of the patch set:
> >
> > Patch 1: Pass the buffer pool to the driver instead of the umem. This
> >           because the driver needs one buffer pool per napi context
> >           when we later introduce sharing of the umem between queue ids
> >           and devices.
> > Patch 2: Rename the xsk driver interface so they have better names
> >           after the move to the buffer pool
> > Patch 3: There is one buffer pool per device and queue, while there is
> >           only one umem per registration. The buffer pool needs to be
> >           created and destroyed independently of the umem.
> > Patch 4: Move fill and completion rings to the buffer pool as there will
> >           be one set of these per device and queue
> > Patch 5: Move queue_id, dev and need_wakeup to buffer pool again as these
> >           will now be per buffer pool as the umem can be shared between
> >           devices and queues
> > Patch 6: Move xsk_tx_list and its lock to buffer pool
> > Patch 7: Move the creation/deletion of addrs from buffer pool to umem
> > Patch 8: Enable sharing of DMA mappings when multiple queues of the
> >           same device are bound
> > Patch 9: Rearrange internal structs for better performance as these
> >           have been substantially scrambled by the previous patches
> > Patch 10: Add shared umem support between queue ids
> > Patch 11: Add shared umem support between devices
> > Patch 12: Add support for this in libbpf
> > Patch 13: Add a new sample that demonstrates this new feature by
> >            forwarding packets between different netdevs and queues
> > Patch 14: Add documentation
> >
> > Thanks: Magnus
> >
> > Cristian Dumitrescu (1):
> >    samples/bpf: add new sample xsk_fwd.c
> >
> > Magnus Karlsson (13):
> >    xsk: i40e: ice: ixgbe: mlx5: pass buffer pool to driver instead of
> >      umem
> >    xsk: i40e: ice: ixgbe: mlx5: rename xsk zero-copy driver interfaces
> >    xsk: create and free context independently from umem
> >    xsk: move fill and completion rings to buffer pool
> >    xsk: move queue_id, dev and need_wakeup to context
> >    xsk: move xsk_tx_list and its lock to buffer pool
> >    xsk: move addrs from buffer pool to umem
> >    xsk: net: enable sharing of dma mappings
> >    xsk: rearrange internal structs for better performance
> >    xsk: add shared umem support between queue ids
> >    xsk: add shared umem support between devices
> >    libbpf: support shared umems between queues and devices
> >    xsk: documentation for XDP_SHARED_UMEM between queues and netdevs
> >
> >   Documentation/networking/af_xdp.rst                |   68 +-
> >   drivers/net/ethernet/intel/i40e/i40e_ethtool.c     |    2 +-
> >   drivers/net/ethernet/intel/i40e/i40e_main.c        |   29 +-
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   10 +-
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.h        |    2 +-
> >   drivers/net/ethernet/intel/i40e/i40e_xsk.c         |   79 +-
> >   drivers/net/ethernet/intel/i40e/i40e_xsk.h         |    4 +-
> >   drivers/net/ethernet/intel/ice/ice.h               |   18 +-
> >   drivers/net/ethernet/intel/ice/ice_base.c          |   16 +-
> >   drivers/net/ethernet/intel/ice/ice_lib.c           |    2 +-
> >   drivers/net/ethernet/intel/ice/ice_main.c          |   10 +-
> >   drivers/net/ethernet/intel/ice/ice_txrx.c          |    8 +-
> >   drivers/net/ethernet/intel/ice/ice_txrx.h          |    2 +-
> >   drivers/net/ethernet/intel/ice/ice_xsk.c           |  142 +--
> >   drivers/net/ethernet/intel/ice/ice_xsk.h           |    7 +-
> >   drivers/net/ethernet/intel/ixgbe/ixgbe.h           |    2 +-
> >   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   34 +-
> >   .../net/ethernet/intel/ixgbe/ixgbe_txrx_common.h   |    7 +-
> >   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c       |   61 +-
> >   drivers/net/ethernet/mellanox/mlx5/core/en.h       |   19 +-
> >   drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |    5 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   10 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c |   12 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.h |    2 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c    |   12 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.h    |    6 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c  |  108 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h  |   14 +-
> >   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   46 +-
> >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   16 +-
> >   include/linux/netdevice.h                          |   13 +-
> >   include/net/xdp_sock.h                             |   28 +-
> >   include/net/xdp_sock_drv.h                         |  115 ++-
> >   include/net/xsk_buff_pool.h                        |   47 +-
> >   net/core/dev.c                                     |    3 +
> >   net/ethtool/channels.c                             |    2 +-
> >   net/ethtool/ioctl.c                                |    2 +-
> >   net/xdp/xdp_umem.c                                 |  221 +---
> >   net/xdp/xdp_umem.h                                 |    6 -
> >   net/xdp/xsk.c                                      |  213 ++--
> >   net/xdp/xsk.h                                      |    3 +
> >   net/xdp/xsk_buff_pool.c                            |  314 +++++-
> >   net/xdp/xsk_diag.c                                 |   14 +-
> >   net/xdp/xsk_queue.h                                |   12 +-
> >   samples/bpf/Makefile                               |    3 +
> >   samples/bpf/xsk_fwd.c                              | 1075 ++++++++++++++++++++
> >   tools/lib/bpf/libbpf.map                           |    1 +
> >   tools/lib/bpf/xsk.c                                |  376 ++++---
> >   tools/lib/bpf/xsk.h                                |    9 +
> >   49 files changed, 2327 insertions(+), 883 deletions(-)
> >   create mode 100644 samples/bpf/xsk_fwd.c
> >
> > --
> > 2.7.4
> >
>