[bpf-next,00/11] AF_XDP: introducing zero-copy support

Message ID 20180604120601.18123-1-bjorn.topel@gmail.com

Message

Björn Töpel June 4, 2018, 12:05 p.m. UTC
From: Björn Töpel <bjorn.topel@intel.com>

This patch series introduces zero-copy (ZC) support for
AF_XDP. Programs using AF_XDP sockets will now receive RX packets
without any copies and can also transmit packets without incurring any
copies. No modifications to the application are needed, but the NIC
driver needs to be modified to support ZC. If ZC is not supported by
the driver, the fallback modes introduced in the original AF_XDP patch
set will be used. Using ZC in our microbenchmarks results in
significantly improved performance, as can be seen in the performance
section later in this cover letter.
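
As a minimal sketch of how this looks from the application side (the
flag names follow the uapi added in this series; UMEM registration and
the fill/RX/TX ring setup from the AF_XDP patch set are omitted for
brevity and must happen before bind() in a real application), a socket
can force zero-copy or the copy path at bind time via sxdp_flags; with
no flag set, the intent is that the kernel picks ZC when the driver
supports it and otherwise falls back:

#include <linux/if_xdp.h>       /* struct sockaddr_xdp, XDP_* bind flags */
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif

int main(void)
{
        struct sockaddr_xdp sxdp = {};
        int fd;

        fd = socket(AF_XDP, SOCK_RAW, 0);
        if (fd < 0) {
                perror("socket(AF_XDP)");
                return 1;
        }

        /* UMEM registration and ring setup would go here. */

        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = if_nametoindex("p3p2");
        sxdp.sxdp_queue_id = 16;        /* the queue we steer traffic to */
        sxdp.sxdp_flags = XDP_ZEROCOPY; /* fail instead of falling back */
        /* sxdp.sxdp_flags = XDP_COPY;     force the copy path instead */

        if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0) {
                perror("bind"); /* e.g. no ZC in the driver, or no UMEM */
                close(fd);
                return 1;
        }

        close(fd);
        return 0;
}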

Note that for an untrusted application, HW packet steering to a
specific queue pair (the one associated with the application) is a
requirement when using ZC, as the application would otherwise be able
to see other user space processes' packets. If the HW cannot support
the required packet steering, you need to use the XDP_SKB mode or the
XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
patch set can be used to do load balancing in that case.
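
To illustrate that load-balancing fallback, here is a sketch in the
style of the xdpsock sample's kernel program (not one of the patches
in this series; the map sizes, names and section name are made up): an
XDP program that spreads the packets of a queue round robin over a
number of AF_XDP sockets that user space has inserted into an XSKMAP:

#define KBUILD_MODNAME "xsk_lb"
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define NUM_SOCKS 4     /* must be a power of two for the mask below */

struct bpf_map_def SEC("maps") xsks_map = {
        .type = BPF_MAP_TYPE_XSKMAP,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = NUM_SOCKS,
};

struct bpf_map_def SEC("maps") rr_map = {
        .type = BPF_MAP_TYPE_PERCPU_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(unsigned int),
        .max_entries = 1,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
        int key = 0;
        unsigned int *rr;

        rr = bpf_map_lookup_elem(&rr_map, &key);
        if (!rr)
                return XDP_ABORTED;

        /* Round robin over the sockets user space put in the XSKMAP. */
        *rr = (*rr + 1) & (NUM_SOCKS - 1);

        return bpf_redirect_map(&xsks_map, *rr, 0);
}

char _license[] SEC("license") = "GPL";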

For benchmarking, you can use the xdpsock application from the AF_XDP
patch set without any modifications. Say that you would like your UDP
traffic from port 4242 to end up in queue 16, which we will enable
AF_XDP on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode with zero-copy can then be
done using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

We have run some benchmarks on a dual socket system with two Broadwell
E5-2660 CPUs @ 2.0 GHz with hyperthreading turned off. Each socket has
14 cores, which gives a total of 28, but only two cores are used in
these experiments: one for TX/RX and one for the user space
application. The memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is
8192 MB, and with 8 of those DIMMs in the system we have 64 GB of
total memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0.
The NIC is an Intel I40E 40 Gbit/s adapter using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are
without retpoline so that we can compare against previous numbers.

AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
set are also reported for ease of reference. The numbers within
parentheses are from the RFC V1 ZC patch set.
Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
rxdrop       2.9*       9.6*       21.1(21.5)
txpush       2.6*       -          22.0(21.6)
l2fwd        1.9*       2.5*       15.3(15.0)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
rxdrop       2.1*       3.3*       3.3(3.3)
l2fwd        1.4*       1.8*       3.1(3.1)

* From AF_XDP V3 patch set and cover letter.

So why do we not get higher values for RX, similar to the 34 Mpps we
had in AF_PACKET V4? We ran an experiment in which the rxdrop
benchmark used neither the xdp_do_redirect/flush infrastructure nor
an XDP program (all traffic on a queue goes to one socket); instead,
the driver acts directly on the AF_XDP socket. With this we got 36.9
Mpps, a significant improvement without any change to the uapi. So
not forcing users to have an XDP program if they do not need one
might be a good idea. This measurement is actually higher than what
we got with AF_PACKET V4.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3M       0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3M        0

The structure of the patch set is as follows:

Patches 1-3: Plumbing for AF_XDP ZC support
Patches 4-5: AF_XDP ZC for RX
Patches 6-7: AF_XDP ZC for TX
Patches 8-10: ZC support for i40e.
Patch 11: Use the bind flags in the sample application to force the
          TX skb path when -S is provided on the command line (see
          the sketch below).
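
A rough sketch of the idea in patch 11 (illustrative only, not the
actual diff; the opt_* variable names mirror xdpsock_user.c but are
assumptions here): when -S selects XDP_FLAGS_SKB_MODE, the sample also
sets the XDP_COPY bind flag so that TX goes through the skb path:

#include <linux/if_link.h>      /* XDP_FLAGS_SKB_MODE */
#include <linux/if_xdp.h>       /* struct sockaddr_xdp, XDP_COPY */
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif

static int xsk_bind(int sfd, int opt_ifindex, int opt_queue,
                    unsigned int opt_xdp_flags)
{
        struct sockaddr_xdp sxdp = {};

        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = opt_ifindex;
        sxdp.sxdp_queue_id = opt_queue;

        if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
                sxdp.sxdp_flags |= XDP_COPY;    /* skb path for TX too */

        return bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}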

This patch set is based on the new uapi introduced in "AF_XDP: bug
fixes and descriptor changes". You need to apply that patch set
first, before applying this one.

We based this patch set on bpf-next commit bd3a08aaa9a3 ("bpf:
flowlabel in bpf_fib_lookup should be flowinfo").

Comments:

* Implementing dynamic creation and deletion of queues in the i40e
  driver would facilitate the coexistence of xdp_redirect and af_xdp.

Thanks: Björn and Magnus

Björn Töpel (8):
  xsk: moved struct xdp_umem definition
  xsk: introduce xdp_umem_page
  net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM
  xdp: add MEM_TYPE_ZERO_COPY
  xsk: add zero-copy support for Rx
  i40e: added queue pair disable/enable functions
  i40e: implement AF_XDP zero-copy support for Rx
  samples/bpf: xdpsock: use skb Tx path for XDP_SKB

Magnus Karlsson (3):
  net: added netdevice operation for Tx
  xsk: wire upp Tx zero-copy functions
  i40e: implement AF_XDP zero-copy support for Tx

 drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h      |  23 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 287 +++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 256 ++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 151 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 677 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  19 +
 include/linux/netdevice.h                   |  10 +
 include/net/xdp.h                           |  10 +
 include/net/xdp_sock.h                      |  77 +++-
 include/uapi/linux/if_xdp.h                 |   4 +-
 net/core/xdp.c                              |  19 +-
 net/xdp/xdp_umem.c                          | 118 ++++-
 net/xdp/xdp_umem.h                          |  32 +-
 net/xdp/xsk.c                               | 166 +++++--
 net/xdp/xsk_queue.h                         |  35 +-
 samples/bpf/xdpsock_user.c                  |   5 +
 17 files changed, 1657 insertions(+), 235 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

Comments

Alexei Starovoitov June 4, 2018, 4:38 p.m. UTC | #1
On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> [...]
> 
> The structure of the patch set is as follows:
> 
> Patches 1-3: Plumbing for AF_XDP ZC support
> Patches 4-5: AF_XDP ZC for RX
> Patches 6-7: AF_XDP ZC for TX

Acked-by: Alexei Starovoitov <ast@kernel.org>
for above patches

> Patch 8-10: ZC support for i40e.

these also look good to me.
would be great if i40e experts take a look at them asap.

If there are no major objections we'd like to merge all of it
for this merge window.

Thanks!
Kirsher, Jeffrey T June 4, 2018, 8:29 p.m. UTC | #2
On Mon, 2018-06-04 at 09:38 -0700, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> > 
> > [...]
> > 
> > The structure of the patch set is as follows:
> > 
> > Patches 1-3: Plumbing for AF_XDP ZC support
> > Patches 4-5: AF_XDP ZC for RX
> > Patches 6-7: AF_XDP ZC for TX
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> for above patches
> 
> > Patch 8-10: ZC support for i40e.
> 
> these also look good to me.
> would be great if i40e experts take a look at them asap.
> 
> If there are no major objections we'd like to merge all of it
> for this merge window.

We would like a bit more time to review and test the changes. I
understand your eagerness to get this into 4.18, but this change is
large enough that a 24-48 hour review time is not prudent, IMHO.

Alex has also requested more time so that he can review the changes
as well. I will go ahead and put the entire series in my tree so that
our validation team can start to "kick the tires".
Michael S. Tsirkin Nov. 14, 2018, 8:10 a.m. UTC | #3
So as I mentioned during the presentation for the af_xdp zero copy, I
think it's pretty important to be able to close the device and get
back the affected memory. One way would be to unmap the DMA memory
from userspace and map in some other memory. It's tricky since you
need to also replace the mapping to the backing file, which could be
hugetlbfs, tmpfs, just a file ...

HTH,