[bpf-next,v3,00/15] Introducing AF_XDP support

Message ID 20180502110136.3738-1-bjorn.topel@gmail.com

Björn Töpel May 2, 2018, 11:01 a.m. UTC
From: Björn Töpel <bjorn.topel@intel.com>

This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another frame and reused right away. This again avoids copying data.
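
As a rough sketch of the per-socket setup just described (assuming, as
in the xdpsock sample, that the setsockopt value is simply the number
of descriptors; error handling trimmed):

#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44       /* address family value added by this series */
#endif
#ifndef SOL_XDP
#define SOL_XDP 283     /* socket option level added by this series */
#endif

/* Create an XSK and size its RX and TX descriptor rings. */
static int xsk_create(void)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);
        int ndescs = 1024;

        if (fd < 0)
                return -1;
        if (setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs)) ||
            setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs)))
                return -1;
        return fd;
}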

This new dedicated packet buffer area is called a UMEM. It consists of
a number of equally sized frames, and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
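
A minimal sketch of the UMEM setup, to make the above concrete (frame
and queue sizes are arbitrary here, and the field names should be
checked against if_xdp.h in the patches rather than taken as
authoritative):

#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#define NUM_FRAMES 2048
#define FRAME_SIZE 2048

/* Register a user-allocated area as the UMEM and size its FILL and
 * COMPLETION rings on the same socket fd. */
static void *umem_setup(int fd)
{
        struct xdp_umem_reg reg = {};
        int ndescs = 1024;
        void *buf;

        if (posix_memalign(&buf, getpagesize(), NUM_FRAMES * FRAME_SIZE))
                return NULL;

        reg.addr = (unsigned long)buf;
        reg.len = NUM_FRAMES * FRAME_SIZE;
        reg.frame_size = FRAME_SIZE;
        reg.frame_headroom = 0;

        if (setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg)) ||
            setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ndescs,
                       sizeof(ndescs)) ||
            setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ndescs,
                       sizeof(ndescs)))
                return NULL;
        return buf;
}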

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.
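
In code, the bind step might look as follows (a minimal sketch; see the
bind patch and the xdpsock sample for the authoritative sockaddr_xdp
layout):

#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

/* Bind the XSK to a device and queue id; traffic only starts to flow
 * once this call has completed. */
static int xsk_bind(int fd, const char *ifname, unsigned int queue_id)
{
        struct sockaddr_xdp sxdp = {};

        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = if_nametoindex(ifname);
        sxdp.sxdp_queue_id = queue_id;

        return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}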

A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
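
For the sharing case, the second socket's bind might look like this
sketch (the XDP_SHARED_UMEM flag and the sxdp_shared_umem_fd field come
from the bind patch; note that no XDP_UMEM_REG and no FILL/COMPLETION
rings are set up on this socket):

/* Bind a second XSK that shares the UMEM owned by umem_owner_fd. */
static int xsk_bind_shared(int fd, int umem_owner_fd, const char *ifname,
                           unsigned int queue_id)
{
        struct sockaddr_xdp sxdp = {};

        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = if_nametoindex(ifname);
        sxdp.sxdp_queue_id = queue_id;
        sxdp.sxdp_flags = XDP_SHARED_UMEM;
        sxdp.sxdp_shared_umem_fd = umem_owner_fd;

        return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}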

How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
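
A simplified version of what the included xdpsock sample does for this
(map size and section names are arbitrary here) could look like:

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_map = {
        .type = BPF_MAP_TYPE_XSKMAP,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = 4,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
        /* Redirect to the XSK placed at this queue's index in the map;
         * if no XSK is bound there, the packet is dropped. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";

User space places the XSK in the map with a regular
bpf_map_update_elem(map_fd, &queue_id, &xsk_fd, 0) call.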

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or if XDP_SKB is explicitly
chosen when loading the XDP program, XDP_SKB mode is employed. It uses
SKBs together with the generic XDP support and copies out the data to
user space; this is a fallback mode that works for any network device.
If, on the other hand, the driver has support for XDP, it will be used
by the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

There is a xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual-socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB, and with 8
of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is an
Intel I40E 40 Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by commercial packet generator HW
outputting packets at the full 40 Gbit/s line rate. The results are
without retpoline so that we can compare against previous numbers. With
retpoline, the AF_XDP numbers drop by between 10 and 15 percent.

AF_XDP performance 64 byte packets. Results from V2 in parentheses.
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.9(3.0)   9.6(9.5)  
txpush       2.6(2.5)   NA*
l2fwd        1.9(1.9)   2.5(2.5) (TX using XDP_SKB in both cases)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.1(2.2)   3.3(3.3)  
l2fwd        1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)

* NA since we have no support for TX using the XDP_DRV infrastructure
  in this patch set. This is for a future patch set since it involves
  changes to the XDP NDOs. Some of this has been upstreamed by Jesper
  Dangaard Brouer.

XDP performance on our system as a baseline:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3(32.9)M  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3(3.3)M    0

Changes from V2:

* Fixed a race in the XSKMAP found by Will. The code has been
  completely rearchitected and is now simpler, faster, and hopefully
  also not racy. Please review and check if it holds.

If you would like to diff V2 against V3, you can find them here:
https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next

The structure of the patch set is as follows:

Patches 1-3: Basic socket and umem plumbing 
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application

We based this patch set on bpf-next commit a3fe1f6f2ada ("tools:
bpftool: change time format for program 'loaded at:' information")

To do for this patch set:

* Syzkaller torture session being worked on

Post-series plan:

* Optimize performance

* Kernel selftest

* Loadable kernel module support for AF_XDP would be nice. It is
  unclear how to achieve this, though, since our XDP code depends on
  net/core.

* Support for AF_XDP sockets without an XDP program loaded. In this
  case all the traffic on a queue should go up to the user space socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments,
  starting with TX then adding RX. 

Thanks: Björn and Magnus

Björn Töpel (7):
  net: initial AF_XDP skeleton
  xsk: add user memory registration support sockopt
  xsk: add Rx queue setup and mmap support
  xsk: add Rx receive functions and poll support
  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  xsk: wire up XDP_DRV side of AF_XDP
  xsk: wire up XDP_SKB side of AF_XDP

Magnus Karlsson (8):
  xsk: add umem fill queue support and mmap
  xsk: add support for bind for Rx
  xsk: add umem completion queue support and mmap
  xsk: add Tx queue setup and mmap support
  dev: packet: make packet_direct_xmit a common function
  xsk: support for Tx
  xsk: statistics support
  samples/bpf: sample application and documentation for AF_XDP sockets

 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 MAINTAINERS                         |   8 +
 include/linux/bpf.h                 |  25 +
 include/linux/bpf_types.h           |   3 +
 include/linux/filter.h              |   2 +-
 include/linux/netdevice.h           |   1 +
 include/linux/socket.h              |   5 +-
 include/net/xdp.h                   |   1 +
 include/net/xdp_sock.h              |  66 +++
 include/uapi/linux/bpf.h            |   1 +
 include/uapi/linux/if_xdp.h         |  87 ++++
 kernel/bpf/Makefile                 |   3 +
 kernel/bpf/verifier.c               |   8 +-
 kernel/bpf/xskmap.c                 | 239 +++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/dev.c                      |  73 ++-
 net/core/filter.c                   |  40 +-
 net/core/sock.c                     |  12 +-
 net/core/xdp.c                      |  15 +-
 net/packet/af_packet.c              |  42 +-
 net/xdp/Kconfig                     |   7 +
 net/xdp/Makefile                    |   2 +
 net/xdp/xdp_umem.c                  | 260 ++++++++++
 net/xdp/xdp_umem.h                  |  67 +++
 net/xdp/xdp_umem_props.h            |  23 +
 net/xdp/xsk.c                       | 656 +++++++++++++++++++++++++
 net/xdp/xsk_queue.c                 |  73 +++
 net/xdp/xsk_queue.h                 | 247 ++++++++++
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 36 files changed, 3221 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

Comments

Willem de Bruijn May 3, 2018, 1:55 p.m. UTC | #1
On Wed, May 2, 2018 at 1:01 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing

Great patchset, thanks.

> and, in upcoming
> patch sets, zero-copy semantics.

And looking forward to this!

> Thanks: Björn and Magnus
>
> Björn Töpel (7):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (8):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   dev: packet: make packet_direct_xmit a common function
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application and documentation for AF_XDP sockets

For the series

Acked-by: Willem de Bruijn <willemb@google.com>
David Miller May 3, 2018, 3:07 p.m. UTC | #2
From: Björn Töpel <bjorn.topel@gmail.com>
Date: Wed,  2 May 2018 13:01:21 +0200

> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this patch set, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.
 ...

Looks great.

Acked-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann May 3, 2018, 10:49 p.m. UTC | #3
On 05/02/2018 01:01 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this patch set, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.

+1, would be great to see it land this cycle. Saw few minor nits here
and there but nothing to hold it up, for the series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks everyone!
Alexei Starovoitov May 3, 2018, 11:38 p.m. UTC | #4
On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
> On 05/02/2018 01:01 PM, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> > 
> > This patch set introduces a new address family called AF_XDP that is
> > optimized for high performance packet processing and, in upcoming
> > patch sets, zero-copy semantics. In this patch set, we have removed
> > all zero-copy related code in order to make it smaller, simpler and
> > hopefully more review friendly. This patch set only supports copy-mode
> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> > work has already been accepted. We will publish our zero-copy support
> > for RX and TX on top of his patch sets at a later point in time.
> 
> +1, would be great to see it land this cycle. Saw few minor nits here
> and there but nothing to hold it up, for the series:
> 
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> Thanks everyone!

Great stuff!

Applied to bpf-next, with one condition.
Upcoming zero-copy patches for both RX and TX need to be posted
and reviewed within this release window.
If netdev community as a whole won't be able to agree on the zero-copy
bits we'd need to revert this feature before the next merge window.

Few other minor nits:
patch 3:
+struct xdp_ring {
+       __u32 producer __attribute__((aligned(64)));
+       __u32 consumer __attribute__((aligned(64)));
+};
It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.

patch 5:
+struct sockaddr_xdp {
+       __u16 sxdp_family;
+       __u32 sxdp_ifindex;
Not great to have a hole in uapi struct. Please fix it in the follow up.

patch 7:
Has a lot of synchronize_net(). I think the update/delete side
can be improved to avoid them. Otherwise users may unknowingly DoS.

As the next steps I suggest to prioritize the highest to ship
zero-copy rx/tx patches and to add selftests.

Thanks!
Magnus Karlsson May 4, 2018, 11:22 a.m. UTC | #5
On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > This patch set introduces a new address family called AF_XDP that is
>> > optimized for high performance packet processing and, in upcoming
>> > patch sets, zero-copy semantics. In this patch set, we have removed
>> > all zero-copy related code in order to make it smaller, simpler and
>> > hopefully more review friendly. This patch set only supports copy-mode
>> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> > work has already been accepted. We will publish our zero-copy support
>> > for RX and TX on top of his patch sets at a later point in time.
>>
>> +1, would be great to see it land this cycle. Saw few minor nits here
>> and there but nothing to hold it up, for the series:
>>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> Thanks everyone!
>
> Great stuff!
>
> Applied to bpf-next, with one condition.
> Upcoming zero-copy patches for both RX and TX need to be posted
> and reviewed within this release window.
> If netdev community as a whole won't be able to agree on the zero-copy
> bits we'd need to revert this feature before the next merge window.

Thanks everyone for reviewing this. Highly appreciated.

Just so we understand the purpose correctly:

1: Do you want to see the ZC patches in order to verify that the user
space API holds? If so, we can produce an additional RFC  patch set
using a big chunk of code that we had in RFC V1. We are not proud of
this code since it is clunky, but it hopefully proves the point with
the uapi being the same.

2: And/Or are you worried about us all (the netdev community) not
agreeing on a way to implement ZC internally in the drivers and the
XDP infrastructure? This is not going to be possible to finish during
this cycle since we do not like the implementation we had in RFC V1.
Too intrusive and now we also have nicer abstractions from Jesper that
we can use and extend to provide a (hopefully) much cleaner and less
intrusive solution.

Just so that we focus on the right proof points.

> Few other minor nits:
> patch 3:
> +struct xdp_ring {
> +       __u32 producer __attribute__((aligned(64)));
> +       __u32 consumer __attribute__((aligned(64)));
> +};
> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.

Agreed.

> patch 5:
> +struct sockaddr_xdp {
> +       __u16 sxdp_family;
> +       __u32 sxdp_ifindex;
> Not great to have a hole in uapi struct. Please fix it in the follow up.

You are correct. Will fix.

> patch 7:
> Has a lot of synchronize_net(). I think udpate/delete side
> can be improved to avoid them. Otherwise users may unknowingly DoS.

OK. Could you please elaborate on what kind of DoS attacks can be
performed with this, so we can come up with the right solution here?

Thanks: Magnus

> As the next steps I suggest to prioritize the highest to ship
> zero-copy rx/tx patches and to add selftests.
>
> Thanks!
>
Alexei Starovoitov May 5, 2018, 12:34 a.m. UTC | #6
On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >
> >> > This patch set introduces a new address family called AF_XDP that is
> >> > optimized for high performance packet processing and, in upcoming
> >> > patch sets, zero-copy semantics. In this patch set, we have removed
> >> > all zero-copy related code in order to make it smaller, simpler and
> >> > hopefully more review friendly. This patch set only supports copy-mode
> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> >> > work has already been accepted. We will publish our zero-copy support
> >> > for RX and TX on top of his patch sets at a later point in time.
> >>
> >> +1, would be great to see it land this cycle. Saw few minor nits here
> >> and there but nothing to hold it up, for the series:
> >>
> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>
> >> Thanks everyone!
> >
> > Great stuff!
> >
> > Applied to bpf-next, with one condition.
> > Upcoming zero-copy patches for both RX and TX need to be posted
> > and reviewed within this release window.
> > If netdev community as a whole won't be able to agree on the zero-copy
> > bits we'd need to revert this feature before the next merge window.
> 
> Thanks everyone for reviewing this. Highly appreciated.
> 
> Just so we understand the purpose correctly:
> 
> 1: Do you want to see the ZC patches in order to verify that the user
> space API holds? If so, we can produce an additional RFC  patch set
> using a big chunk of code that we had in RFC V1. We are not proud of
> this code since it is clunky, but it hopefully proves the point with
> the uapi being the same.
> 
> 2: And/Or are you worried about us all (the netdev community) not
> agreeing on a way to implement ZC internally in the drivers and the
> XDP infrastructure? This is not going to be possible to finish during
> this cycle since we do not like the implementation we had in RFC V1.
> Too intrusive and now we also have nicer abstractions from Jesper that
> we can use and extend to provide a (hopefully) much cleaner and less
> intrusive solution.

short answer: both.

Cleanliness and performance of the ZC code are not as important as
getting the API right. The main concern is that during the ZC review
process we will find out that the existing API has issues, so we have
to do this exercise before the merge window.
And RFC won't fly. Send the patches for real. They have to go
through the proper code review. The hackers of netdev community
can accept a partial, or a bit unclean, or slightly inefficient
implementation, since it can be and will be improved later,
but API we cannot change once it goes into official release.

Here is an example of an API concern:
this patch set added the shared umem concept. It sounds good in theory,
but will it perform well with ZC? Earlier RFCs didn't have that
feature. If it won't perform well then it shouldn't be in the tree.
The key reason to let AF_XDP into the tree is its performance promise.
If it doesn't perform we should rip it out and redesign.
Magnus Karlsson May 7, 2018, 9:13 a.m. UTC | #7
On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
>> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >
>> >> > This patch set introduces a new address family called AF_XDP that is
>> >> > optimized for high performance packet processing and, in upcoming
>> >> > patch sets, zero-copy semantics. In this patch set, we have removed
>> >> > all zero-copy related code in order to make it smaller, simpler and
>> >> > hopefully more review friendly. This patch set only supports copy-mode
>> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> >> > work has already been accepted. We will publish our zero-copy support
>> >> > for RX and TX on top of his patch sets at a later point in time.
>> >>
>> >> +1, would be great to see it land this cycle. Saw few minor nits here
>> >> and there but nothing to hold it up, for the series:
>> >>
>> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> >>
>> >> Thanks everyone!
>> >
>> > Great stuff!
>> >
>> > Applied to bpf-next, with one condition.
>> > Upcoming zero-copy patches for both RX and TX need to be posted
>> > and reviewed within this release window.
>> > If netdev community as a whole won't be able to agree on the zero-copy
>> > bits we'd need to revert this feature before the next merge window.
>>
>> Thanks everyone for reviewing this. Highly appreciated.
>>
>> Just so we understand the purpose correctly:
>>
>> 1: Do you want to see the ZC patches in order to verify that the user
>> space API holds? If so, we can produce an additional RFC  patch set
>> using a big chunk of code that we had in RFC V1. We are not proud of
>> this code since it is clunky, but it hopefully proves the point with
>> the uapi being the same.
>>
>> 2: And/Or are you worried about us all (the netdev community) not
>> agreeing on a way to implement ZC internally in the drivers and the
>> XDP infrastructure? This is not going to be possible to finish during
>> this cycle since we do not like the implementation we had in RFC V1.
>> Too intrusive and now we also have nicer abstractions from Jesper that
>> we can use and extend to provide a (hopefully) much cleaner and less
>> intrusive solution.
>
> short answer: both.
>
> Cleanliness and performance of the ZC code is not as important as
> getting API right. The main concern that during ZC review process
> we will find out that existing API has issues, so we have to
> do this exercise before the merge window.
> And RFC won't fly. Send the patches for real. They have to go
> through the proper code review. The hackers of netdev community
> can accept a partial, or a bit unclean, or slightly inefficient
> implementation, since it can be and will be improved later,
> but API we cannot change once it goes into official release.
>
> Here is the example of API concern:
> this patch set added shared umem concept. It sounds good in theory,
> but will it perform well with ZC ? Earlier RFCs didn't have that
> feature. If it won't perform well than it shouldn't be in the tree.
> The key reason to let AF_XDP into the tree is its performance promise.
> If it doesn't perform we should rip it out and redesign.

That is a fair point. We will try to produce patch sets for zero-copy
RX and TX using the latest interfaces within this merge window. Just
note that we will focus on this for the next week(s) instead of the
review items that you and Daniel Borkmann submitted. If we get those
patch sets out in time and we agree that they are a possible way
forward, then we will produce patches with your fixes. They were mainly
small items, so it should be quick.

/Magnus
Jesper Dangaard Brouer May 7, 2018, 1:09 p.m. UTC | #8
On Mon, 7 May 2018 11:13:58 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:  
> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:  
> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:  
> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:  
> >> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >> >
> >> >> > This patch set introduces a new address family called AF_XDP that is
> >> >> > optimized for high performance packet processing and, in upcoming
> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed
> >> >> > all zero-copy related code in order to make it smaller, simpler and
> >> >> > hopefully more review friendly. This patch set only supports copy-mode
> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> >> >> > work has already been accepted. We will publish our zero-copy support
> >> >> > for RX and TX on top of his patch sets at a later point in time.  
> >> >>
> >> >> +1, would be great to see it land this cycle. Saw few minor nits here
> >> >> and there but nothing to hold it up, for the series:
> >> >>
> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >> >>
> >> >> Thanks everyone!  
> >> >
> >> > Great stuff!
> >> >
> >> > Applied to bpf-next, with one condition.
> >> > Upcoming zero-copy patches for both RX and TX need to be posted
> >> > and reviewed within this release window.
> >> > If netdev community as a whole won't be able to agree on the zero-copy
> >> > bits we'd need to revert this feature before the next merge window.  
> >>
> >> Thanks everyone for reviewing this. Highly appreciated.
> >>
> >> Just so we understand the purpose correctly:
> >>
> >> 1: Do you want to see the ZC patches in order to verify that the user
> >> space API holds? If so, we can produce an additional RFC  patch set
> >> using a big chunk of code that we had in RFC V1. We are not proud of
> >> this code since it is clunky, but it hopefully proves the point with
> >> the uapi being the same.
> >>
> >> 2: And/Or are you worried about us all (the netdev community) not
> >> agreeing on a way to implement ZC internally in the drivers and the
> >> XDP infrastructure? This is not going to be possible to finish during
> >> this cycle since we do not like the implementation we had in RFC V1.
> >> Too intrusive and now we also have nicer abstractions from Jesper that
> >> we can use and extend to provide a (hopefully) much cleaner and less
> >> intrusive solution.  
> >
> > short answer: both.
> >
> > Cleanliness and performance of the ZC code is not as important as
> > getting API right. The main concern that during ZC review process
> > we will find out that existing API has issues, so we have to
> > do this exercise before the merge window.
> > And RFC won't fly. Send the patches for real. They have to go
> > through the proper code review. The hackers of netdev community
> > can accept a partial, or a bit unclean, or slightly inefficient
> > implementation, since it can be and will be improved later,
> > but API we cannot change once it goes into official release.
> >
> > Here is the example of API concern:
> > this patch set added shared umem concept. It sounds good in theory,
> > but will it perform well with ZC ? Earlier RFCs didn't have that
> > feature. If it won't perform well than it shouldn't be in the tree.
> > The key reason to let AF_XDP into the tree is its performance promise.
> > If it doesn't perform we should rip it out and redesign.  
> 
> That is a fair point. We will try to produce patch sets for zero-copy
> RX and TX using the latest interfaces within this merge window. Just
> note that we will focus on this for the next week(s) instead of the
> review items that you and Daniel Borkmann submitted. If we get those
> patch sets out in time and we agree that they are a possible way
> forward, then we produce patches with your fixes. It was mainly small
> items, so should be quick.

I would like to see that you create a new xdp_mem_type for this new
zero-copy type. This will allow other XDP redirect methods/types (e.g.
devmap and cpumap) to react appropriately when receiving a zero-copy
frame.

For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call
without (first) copying (into a newly allocated page).  By arguing that
if an xsk-userspace app modifies a frame it's not allowed to, then it is
simply a bug in the program. (Note, this would also allow using
ndo_xdp_xmit call for TX from xsk-userspace).

For cpumap, it is hard to avoid a copy, but I'm hoping we could delay
the copy (and alloc of mem dest area) until on the remote CPU.  This is
already the principle of cpumap; of moving the allocation of the SKB to
the remote CPU.

For ZC to interact with the XDP redirect-core and return API, the
zero-copy memory type/allocator needs to provide an area for the
xdp_frame data to be stored in (as we cannot allow using the
top-of-frame like the non-zero-copy variants), and extend xdp_frame
with a ZC umem-id.
I imagine we can avoid any dynamic allocations, as we upfront (at bind
and XDP_UMEM_REG time) know the number of frames.  (e.g. pre-alloc in
xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func).
Björn Töpel May 7, 2018, 7:47 p.m. UTC | #9
2018-05-07 15:09 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Mon, 7 May 2018 11:13:58 +0200
> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
>> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
>> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
>> >> <alexei.starovoitov@gmail.com> wrote:
>> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> >> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >> >
>> >> >> > This patch set introduces a new address family called AF_XDP that is
>> >> >> > optimized for high performance packet processing and, in upcoming
>> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed
>> >> >> > all zero-copy related code in order to make it smaller, simpler and
>> >> >> > hopefully more review friendly. This patch set only supports copy-mode
>> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> >> >> > work has already been accepted. We will publish our zero-copy support
>> >> >> > for RX and TX on top of his patch sets at a later point in time.
>> >> >>
>> >> >> +1, would be great to see it land this cycle. Saw few minor nits here
>> >> >> and there but nothing to hold it up, for the series:
>> >> >>
>> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> >> >>
>> >> >> Thanks everyone!
>> >> >
>> >> > Great stuff!
>> >> >
>> >> > Applied to bpf-next, with one condition.
>> >> > Upcoming zero-copy patches for both RX and TX need to be posted
>> >> > and reviewed within this release window.
>> >> > If netdev community as a whole won't be able to agree on the zero-copy
>> >> > bits we'd need to revert this feature before the next merge window.
>> >>
>> >> Thanks everyone for reviewing this. Highly appreciated.
>> >>
>> >> Just so we understand the purpose correctly:
>> >>
>> >> 1: Do you want to see the ZC patches in order to verify that the user
>> >> space API holds? If so, we can produce an additional RFC  patch set
>> >> using a big chunk of code that we had in RFC V1. We are not proud of
>> >> this code since it is clunky, but it hopefully proves the point with
>> >> the uapi being the same.
>> >>
>> >> 2: And/Or are you worried about us all (the netdev community) not
>> >> agreeing on a way to implement ZC internally in the drivers and the
>> >> XDP infrastructure? This is not going to be possible to finish during
>> >> this cycle since we do not like the implementation we had in RFC V1.
>> >> Too intrusive and now we also have nicer abstractions from Jesper that
>> >> we can use and extend to provide a (hopefully) much cleaner and less
>> >> intrusive solution.
>> >
>> > short answer: both.
>> >
>> > Cleanliness and performance of the ZC code is not as important as
>> > getting API right. The main concern that during ZC review process
>> > we will find out that existing API has issues, so we have to
>> > do this exercise before the merge window.
>> > And RFC won't fly. Send the patches for real. They have to go
>> > through the proper code review. The hackers of netdev community
>> > can accept a partial, or a bit unclean, or slightly inefficient
>> > implementation, since it can be and will be improved later,
>> > but API we cannot change once it goes into official release.
>> >
>> > Here is the example of API concern:
>> > this patch set added shared umem concept. It sounds good in theory,
>> > but will it perform well with ZC ? Earlier RFCs didn't have that
>> > feature. If it won't perform well than it shouldn't be in the tree.
>> > The key reason to let AF_XDP into the tree is its performance promise.
>> > If it doesn't perform we should rip it out and redesign.
>>
>> That is a fair point. We will try to produce patch sets for zero-copy
>> RX and TX using the latest interfaces within this merge window. Just
>> note that we will focus on this for the next week(s) instead of the
>> review items that you and Daniel Borkmann submitted. If we get those
>> patch sets out in time and we agree that they are a possible way
>> forward, then we produce patches with your fixes. It was mainly small
>> items, so should be quick.
>
> I would like to see that you create a new xdp_mem_type for this new
> zero-copy type. This will allow other XDP redirect methods/types (e.g.
> devmap and cpumap) to react appropriately when receiving a zero-copy
> frame.
>

Yes, that's the plan!

> For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call
> without (first) copying (into a newly allocated page).  By arguing that
> if an xsk-userspace app modify a frame it's not allowed to, then it is
> simply a bug in the program. (Note, this would also allow using
> ndo_xdp_xmit call for TX from xsk-userspace).
>

Makes sense. I think the ZC rationale for Rx can indeed be extended
for devmap redirects -- i.e. no frame cloning is required.

> For cpumap, it is hard to avoid a copy, but I'm hoping we could delay
> the copy (and alloc of mem dest area) until on the remote CPU.  This is
> already the principle of cpumap; of moving the allocation of the SKB to
> the remote CPU.
>

I think for most AF_XDP applications that would like to pass frames to
the kernel, the cpumap would be preferred instead of XDP_PASS (moving
the stack execution to another off-AF_XDP-thread).

> For ZC to interact with XDP redirect-core and return API, the zero-copy
> memory type/allocator, need to provide an area for the xdp_frame data
> to be stored in (as we cannot allow using top-of-frame like
> non-zero-copy variants), and extend xdp_frame with an ZC umem-id.
> I imagine we can avoid any dynamic allocations, as we upfront (at bind
> and XDP_UMEM_REG time) know the number of frames.  (e.g. pre-alloc in
> xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func).
>

Yeah, we can allocate a kernel-side-only xdp_frame for each umem frame.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
Björn Töpel May 17, 2018, 6:46 a.m. UTC | #10
2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > This patch set introduces a new address family called AF_XDP that is
>> > optimized for high performance packet processing and, in upcoming
>> > patch sets, zero-copy semantics. In this patch set, we have removed
>> > all zero-copy related code in order to make it smaller, simpler and
>> > hopefully more review friendly. This patch set only supports copy-mode
>> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> > work has already been accepted. We will publish our zero-copy support
>> > for RX and TX on top of his patch sets at a later point in time.
>>
>> +1, would be great to see it land this cycle. Saw few minor nits here
>> and there but nothing to hold it up, for the series:
>>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> Thanks everyone!
>
> Great stuff!
>
> Applied to bpf-next, with one condition.
> Upcoming zero-copy patches for both RX and TX need to be posted
> and reviewed within this release window.
> If netdev community as a whole won't be able to agree on the zero-copy
> bits we'd need to revert this feature before the next merge window.
>
> Few other minor nits:
> patch 3:
> +struct xdp_ring {
> +       __u32 producer __attribute__((aligned(64)));
> +       __u32 consumer __attribute__((aligned(64)));
> +};
> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>

Hmm, I need some guidance on what a sane uapi variant would be. We
can't have the uapi depend on the kernel build. ARM64, e.g., can have
both 64B and 128B according to the specs. Contemporary IA processors
have 64B.

The simplest, and maybe most future-proof, would be 128B aligned for
all. Another is having 128B for ARM and 64B for all IA. A third option
is having a hand-shaking API (I think virtio has that) for determine
the cache line size, but I'd rather not go down that route.

Thoughts/ideas on what a uapi ____cacheline_aligned_in_smp version
would look like?

> patch 5:
> +struct sockaddr_xdp {
> +       __u16 sxdp_family;
> +       __u32 sxdp_ifindex;
> Not great to have a hole in uapi struct. Please fix it in the follow up.
>
> patch 7:
> Has a lot of synchronize_net(). I think udpate/delete side
> can be improved to avoid them. Otherwise users may unknowingly DoS.
>
> As the next steps I suggest to prioritize the highest to ship
> zero-copy rx/tx patches and to add selftests.
>
> Thanks!
>
Alexei Starovoitov May 18, 2018, 3:38 a.m. UTC | #11
On 5/16/18 11:46 PM, Björn Töpel wrote:
> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> This patch set introduces a new address family called AF_XDP that is
>>>> optimized for high performance packet processing and, in upcoming
>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>> all zero-copy related code in order to make it smaller, simpler and
>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>> work has already been accepted. We will publish our zero-copy support
>>>> for RX and TX on top of his patch sets at a later point in time.
>>>
>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>> and there but nothing to hold it up, for the series:
>>>
>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>
>>> Thanks everyone!
>>
>> Great stuff!
>>
>> Applied to bpf-next, with one condition.
>> Upcoming zero-copy patches for both RX and TX need to be posted
>> and reviewed within this release window.
>> If netdev community as a whole won't be able to agree on the zero-copy
>> bits we'd need to revert this feature before the next merge window.
>>
>> Few other minor nits:
>> patch 3:
>> +struct xdp_ring {
>> +       __u32 producer __attribute__((aligned(64)));
>> +       __u32 consumer __attribute__((aligned(64)));
>> +};
>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>
>
> Hmm, I need some guidance on what a sane uapi variant would be. We
> can't have the uapi depend on the kernel build. ARM64, e.g., can have
> both 64B and 128B according to the specs. Contemporary IA processors
> have 64B.
>
> The simplest, and maybe most future-proof, would be 128B aligned for
> all. Another is having 128B for ARM and 64B for all IA. A third option
> is having a hand-shaking API (I think virtio has that) for determine
> the cache line size, but I'd rather not go down that route.
>
> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
> would look like?

I suspect i40e+arm combination wasn't tested anyway.
The api may have endianness issues too on something like sparc.
I think the way to be backwards compatible in this area
is to make the api usable on x86 only by adding
to include/uapi/linux/if_xdp.h
#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#else
#error "AF_XDP support is not yet available for this architecture"
#endif
and doing:
     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));

And progressively add to this for arm64 and few other archs.
Eventually removing #error and adding some generic define
that's good enough for long tail of architectures that
we really cannot test.
Daniel Borkmann May 18, 2018, 1:43 p.m. UTC | #12
On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
> On 5/16/18 11:46 PM, Björn Töpel wrote:
>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>
>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>> optimized for high performance packet processing and, in upcoming
>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>> work has already been accepted. We will publish our zero-copy support
>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>
>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>> and there but nothing to hold it up, for the series:
>>>>
>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>
>>>> Thanks everyone!
>>>
>>> Great stuff!
>>>
>>> Applied to bpf-next, with one condition.
>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>> and reviewed within this release window.
>>> If netdev community as a whole won't be able to agree on the zero-copy
>>> bits we'd need to revert this feature before the next merge window.
>>>
>>> Few other minor nits:
>>> patch 3:
>>> +struct xdp_ring {
>>> +       __u32 producer __attribute__((aligned(64)));
>>> +       __u32 consumer __attribute__((aligned(64)));
>>> +};
>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>
>> Hmm, I need some guidance on what a sane uapi variant would be. We
>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>> both 64B and 128B according to the specs. Contemporary IA processors
>> have 64B.
>>
>> The simplest, and maybe most future-proof, would be 128B aligned for
>> all. Another is having 128B for ARM and 64B for all IA. A third option
>> is having a hand-shaking API (I think virtio has that) for determine
>> the cache line size, but I'd rather not go down that route.
>>
>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>> would look like?
> 
> I suspect i40e+arm combination wasn't tested anyway.
> The api may have endianness issues too on something like sparc.
> I think the way to be backwards compatible in this area
> is to make the api usable on x86 only by adding
> to include/uapi/linux/if_xdp.h
> #if defined(__x86_64__)
> #define AF_XDP_CACHE_BYTES 64
> #else
> #error "AF_XDP support is not yet available for this architecture"
> #endif
> and doing:
>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
> 
> And progressively add to this for arm64 and few other archs.
> Eventually removing #error and adding some generic define
> that's good enough for long tail of architectures that
> we really cannot test.

I have been looking into this a bit yesterday as well, and it's a bit
of a mess what uapi headers do in this regard (though there are just a
handful of such headers). Some of the kernel uapi headers generally
hard-code 64 bytes regardless of the underlying arch. In general, the
kernel does expose the cache line size to user space via sysfs
(coherency_line_size). Here's what perf does to retrieve it:

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
#define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
#else
static void cache_line_size(int *cacheline_sizep)
{
        if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
                pr_debug("cannot determine cache line size");
}
#endif

The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE also seems
to be available only for x86, arm64, s390 and ppc on a cursory glance
at the glibc code. In the x86 case it retrieves the info from the cpuid
insn. In order to use it generically in combination with the header,
you'd need some probe which would then set this as a define before
including the header.

Then projects like urcu do ...

#define ____cacheline_internodealigned_in_smp \
        __attribute__((__aligned__(CAA_CACHE_LINE_SIZE)))

... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128),
ppc (== 256) and sparc64 (== 256) with a generic fallback to 64.

Hmm, perhaps a combination of the two would make sense, where a known
cacheline size can still be supplied externally and we only use the
fallback otherwise. Like:

#ifndef XDP_CACHE_BYTES
# if defined(__x86_64__)
#  define XDP_CACHE_BYTES	64
# else
#  error "Please define XDP_CACHE_BYTES for this architecture!"
# endif
#endif

Too bad there's no asm uapi header at least for the archs where it's fixed
anyway such that not every project out there has to redefine all of it from
scratch and we could just include it (and the generic-asm one would throw
a compile error if it's not externally defined or such).

Cheers,
Daniel
Björn Töpel May 18, 2018, 3:18 p.m. UTC | #13
2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>>
>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>
>>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>>> and there but nothing to hold it up, for the series:
>>>>>
>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>
>>>>> Thanks everyone!
>>>>
>>>> Great stuff!
>>>>
>>>> Applied to bpf-next, with one condition.
>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>> and reviewed within this release window.
>>>> If netdev community as a whole won't be able to agree on the zero-copy
>>>> bits we'd need to revert this feature before the next merge window.
>>>>
>>>> Few other minor nits:
>>>> patch 3:
>>>> +struct xdp_ring {
>>>> +       __u32 producer __attribute__((aligned(64)));
>>>> +       __u32 consumer __attribute__((aligned(64)));
>>>> +};
>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>>
>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>> both 64B and 128B according to the specs. Contemporary IA processors
>>> have 64B.
>>>
>>> The simplest, and maybe most future-proof, would be 128B aligned for
>>> all. Another is having 128B for ARM and 64B for all IA. A third option
>>> is having a hand-shaking API (I think virtio has that) for determine
>>> the cache line size, but I'd rather not go down that route.
>>>
>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>>> would look like?
>>
>> I suspect i40e+arm combination wasn't tested anyway.
>> The api may have endianness issues too on something like sparc.
>> I think the way to be backwards compatible in this area
>> is to make the api usable on x86 only by adding
>> to include/uapi/linux/if_xdp.h
>> #if defined(__x86_64__)
>> #define AF_XDP_CACHE_BYTES 64
>> #else
>> #error "AF_XDP support is not yet available for this architecture"
>> #endif
>> and doing:
>>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>
>> And progressively add to this for arm64 and few other archs.
>> Eventually removing #error and adding some generic define
>> that's good enough for long tail of architectures that
>> we really cannot test.
>
> Been looking into this yesterday as well a bit, and it's a bit of a mess what
> uapi headers do on this regard (though there are just a handful of such headers).
> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the
> underlying arch. In general, the kernel does expose it to user space via sysfs
> (coherency_line_size). Here's what perf does to retrieve it:
>
> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
> #else
> static void cache_line_size(int *cacheline_sizep)
> {
>         if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
>                 pr_debug("cannot determine cache line size");
> }
> #endif
>
> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only
> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code.
> In the x86 case it retrieves the info from cpuid insn. In order to generically
> use it in combination with the header you'd have some probe which would then
> set this as a define before including the header.
>

But as a uapi we cannot depend on the L1 cache line size of whatever
is currently running on the system, right? So, one option is "one cache
line size for all flavors of an arch", e.g. for ARMv8 that would be
128, even though there are 64B flavors out there.

Another way would be to remove the ring structure completely, and
leave that to the user-space application to figure out. So there's a
runtime interface (getsockopt) to probe the offsets the head and tail
pointer is after/before the mmap call, and only expose the descriptor
format explicitly in if_xdp.h. Don't know if that is too unorthodox or
not...
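
To make that concrete, a rough sketch of what the probe could look like
from user space. The option name, option number, struct and field names
below are made up for illustration only; they are not part of this patch
set:

#include <sys/socket.h>
#include <linux/types.h>
#include <stdio.h>

#ifndef SOL_XDP
#define SOL_XDP 283             /* socket level added by this series */
#endif
#define XDP_RING_OFFSETS 100    /* made-up option number, placeholder only */

/* hypothetical layout returned by a new getsockopt, one per ring */
struct xdp_ring_offsets {
        __u64 producer;         /* byte offset of the producer index */
        __u64 consumer;         /* byte offset of the consumer index */
        __u64 desc;             /* byte offset of the descriptor array */
};

static int probe_ring_offsets(int xsk_fd, struct xdp_ring_offsets *off)
{
        socklen_t optlen = sizeof(*off);

        /* ask the kernel where the indices live in the mmapped area */
        if (getsockopt(xsk_fd, SOL_XDP, XDP_RING_OFFSETS, off, &optlen) < 0) {
                perror("getsockopt");
                return -1;
        }
        return 0;
}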

> Then projects like urcu, they do ...
>
> #define ____cacheline_internodealigned_in_smp \
>         __attribute__((__aligned__(CAA_CACHE_LINE_SIZE)))
>
> ... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128),
> ppc (== 256) and sparc64 (== 256) with a generic fallback to 64.
>
> Hmm, perhaps a combination of the two would make sense where in case of known
> cacheline size it can still be used and we only have the fallback in such way.
> Like:
>
> #ifndef XDP_CACHE_BYTES
> # if defined(__x86_64__)
> #  define XDP_CACHE_BYTES       64
> # else
> #  error "Please define XDP_CACHE_BYTES for this architecture!"
> # endif
> #endif
>
> Too bad there's no asm uapi header at least for the archs where it's fixed
> anyway such that not every project out there has to redefine all of it from
> scratch and we could just include it (and the generic-asm one would throw
> a compile error if it's not externally defined or such).
>

I started out by adding a cache.h under arch/XXX/include/uapi, but then
realized that this didn't really work. :-)

> Cheers,
> Daniel
Daniel Borkmann May 18, 2018, 4:17 p.m. UTC | #14
On 05/18/2018 05:18 PM, Björn Töpel wrote:
> 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
>> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>>>
>>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>>
>>>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>>>> and there but nothing to hold it up, for the series:
>>>>>>
>>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>
>>>>>> Thanks everyone!
>>>>>
>>>>> Great stuff!
>>>>>
>>>>> Applied to bpf-next, with one condition.
>>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>>> and reviewed within this release window.
>>>>> If netdev community as a whole won't be able to agree on the zero-copy
>>>>> bits we'd need to revert this feature before the next merge window.
>>>>>
>>>>> Few other minor nits:
>>>>> patch 3:
>>>>> +struct xdp_ring {
>>>>> +       __u32 producer __attribute__((aligned(64)));
>>>>> +       __u32 consumer __attribute__((aligned(64)));
>>>>> +};
>>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>>>
>>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>>> both 64B and 128B according to the specs. Contemporary IA processors
>>>> have 64B.
>>>>
>>>> The simplest, and maybe most future-proof, would be 128B aligned for
>>>> all. Another is having 128B for ARM and 64B for all IA. A third option
>>>> is having a hand-shaking API (I think virtio has that) for determine
>>>> the cache line size, but I'd rather not go down that route.
>>>>
>>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>>>> would look like?
>>>
>>> I suspect i40e+arm combination wasn't tested anyway.
>>> The api may have endianness issues too on something like sparc.
>>> I think the way to be backwards compatible in this area
>>> is to make the api usable on x86 only by adding
>>> to include/uapi/linux/if_xdp.h
>>> #if defined(__x86_64__)
>>> #define AF_XDP_CACHE_BYTES 64
>>> #else
>>> #error "AF_XDP support is not yet available for this architecture"
>>> #endif
>>> and doing:
>>>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>
>>> And progressively add to this for arm64 and few other archs.
>>> Eventually removing #error and adding some generic define
>>> that's good enough for long tail of architectures that
>>> we really cannot test.
>>
>> Been looking into this yesterday as well a bit, and it's a bit of a mess what
>> uapi headers do on this regard (though there are just a handful of such headers).
>> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the
>> underlying arch. In general, the kernel does expose it to user space via sysfs
>> (coherency_line_size). Here's what perf does to retrieve it:
>>
>> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
>> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
>> #else
>> static void cache_line_size(int *cacheline_sizep)
>> {
>>         if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
>>                 pr_debug("cannot determine cache line size");
>> }
>> #endif
>>
>> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only
>> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code.
>> In the x86 case it retrieves the info from cpuid insn. In order to generically
>> use it in combination with the header you'd have some probe which would then
>> set this as a define before including the header.
> 
> But as a uapi we cannot depend on the L1 cache line size of whatever the
> system happens to be running on, right? So one option is a single cache
> line size for all flavors of an arch, e.g. for ARMv8 that would be 128B,
> even though there are 64B flavors out there.
>
> Another way would be to remove the ring structure from the uapi
> completely, and leave the layout to the user-space application to figure
> out. There would then be a runtime interface (getsockopt) to probe the
> offsets of the head and tail pointers, before/after the mmap call, and
> only the descriptor format would be exposed explicitly in if_xdp.h.
> Don't know if that is too unorthodox or not...

Good point, I think that may not be too unreasonable a thing to do, imho.
At least it doesn't get us into the issue discussed here in the first
place, and it would work more seamlessly for other archs than an ugly
build error or a single per-arch 'catch-all' define would.
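
(The nice part being that alignment then stays a kernel-internal detail:
something along the lines of the below, where ____cacheline_aligned_in_smp
resolves against whatever the kernel was built for and user space never
sees the struct at all. Just a sketch, the struct name is made up.)

/* kernel-internal ring head, never exported through if_xdp.h; the
 * descriptor array that follows it can be aligned the same way */
struct xsk_ring {
        u32 producer ____cacheline_aligned_in_smp;
        u32 consumer ____cacheline_aligned_in_smp;
};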

Thanks,
Daniel
Björn Töpel May 18, 2018, 4:32 p.m. UTC | #15
2018-05-18 18:17 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
> On 05/18/2018 05:18 PM, Björn Töpel wrote:
>> 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
>>> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>>>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>>>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>>>>
>>>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>>>
>>>>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>>>>> and there but nothing to hold it up, for the series:
>>>>>>>
>>>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>>
>>>>>>> Thanks everyone!
>>>>>>
>>>>>> Great stuff!
>>>>>>
>>>>>> Applied to bpf-next, with one condition.
>>>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>>>> and reviewed within this release window.
>>>>>> If netdev community as a whole won't be able to agree on the zero-copy
>>>>>> bits we'd need to revert this feature before the next merge window.
>>>>>>
>>>>>> Few other minor nits:
>>>>>> patch 3:
>>>>>> +struct xdp_ring {
>>>>>> +       __u32 producer __attribute__((aligned(64)));
>>>>>> +       __u32 consumer __attribute__((aligned(64)));
>>>>>> +};
>>>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>>>>
>>>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>>>> both 64B and 128B according to the specs. Contemporary IA processors
>>>>> have 64B.
>>>>>
>>>>> The simplest, and maybe most future-proof, would be 128B aligned for
>>>>> all. Another is having 128B for ARM and 64B for all IA. A third option
>>>>> is having a hand-shaking API (I think virtio has that) for determine
>>>>> the cache line size, but I'd rather not go down that route.
>>>>>
>>>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>>>>> would look like?
>>>>
>>>> I suspect i40e+arm combination wasn't tested anyway.
>>>> The api may have endianness issues too on something like sparc.
>>>> I think the way to be backwards compatible in this area
>>>> is to make the api usable on x86 only by adding
>>>> to include/uapi/linux/if_xdp.h
>>>> #if defined(__x86_64__)
>>>> #define AF_XDP_CACHE_BYTES 64
>>>> #else
>>>> #error "AF_XDP support is not yet available for this architecture"
>>>> #endif
>>>> and doing:
>>>>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>>
>>>> And progressively add to this for arm64 and few other archs.
>>>> Eventually removing #error and adding some generic define
>>>> that's good enough for long tail of architectures that
>>>> we really cannot test.
>>>
>>> Been looking into this yesterday as well a bit, and it's a bit of a mess what
>>> uapi headers do on this regard (though there are just a handful of such headers).
>>> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the
>>> underlying arch. In general, the kernel does expose it to user space via sysfs
>>> (coherency_line_size). Here's what perf does to retrieve it:
>>>
>>> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
>>> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
>>> #else
>>> static void cache_line_size(int *cacheline_sizep)
>>> {
>>>         if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
>>>                 pr_debug("cannot determine cache line size");
>>> }
>>> #endif
>>>
>>> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only
>>> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code.
>>> In the x86 case it retrieves the info from cpuid insn. In order to generically
>>> use it in combination with the header you'd have some probe which would then
>>> set this as a define before including the header.
>>
>> But as a uapi we cannot depend on the L1 cache line size of whatever the
>> system happens to be running on, right? So one option is a single cache
>> line size for all flavors of an arch, e.g. for ARMv8 that would be 128B,
>> even though there are 64B flavors out there.
>>
>> Another way would be to remove the ring structure from the uapi
>> completely, and leave the layout to the user-space application to figure
>> out. There would then be a runtime interface (getsockopt) to probe the
>> offsets of the head and tail pointers, before/after the mmap call, and
>> only the descriptor format would be exposed explicitly in if_xdp.h.
>> Don't know if that is too unorthodox or not...
>
> Good point, I think that may not be too unreasonable a thing to do, imho.
> At least it doesn't get us into the issue discussed here in the first
> place, and it would work more seamlessly for other archs than an ugly
> build error or a single per-arch 'catch-all' define would.
>

I'll try that route (runtime-based offset to prod/cons), and see where
it ends up!
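
Roughly what I have in mind on the application side, reusing the
hypothetical offsets struct from the sketch earlier in the thread (again
purely illustrative; rx_ring_size, off and the option are not real yet,
and this assumes <sys/mman.h> plus the series' if_xdp.h):

/* map the RX ring as today, using the series' XDP_PGOFF_RX_RING offset */
void *rx_map = mmap(NULL, rx_ring_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, xsk_fd, XDP_PGOFF_RX_RING);

/* locate producer/consumer and the descriptor array through the probed
 * offsets instead of a compile-time aligned uapi struct */
__u32 *producer = (__u32 *)((char *)rx_map + off.producer);
__u32 *consumer = (__u32 *)((char *)rx_map + off.consumer);
struct xdp_desc *descs = (struct xdp_desc *)((char *)rx_map + off.desc);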


Enjoy the weekend,
Björn

> Thanks,
> Daniel