Message ID | 20180502110136.3738-1-bjorn.topel@gmail.com |
---|---|
Headers | show |
Series | Introducing AF_XDP support | expand |
On Wed, May 2, 2018 at 1:01 PM, Björn Töpel <bjorn.topel@gmail.com> wrote: > From: Björn Töpel <bjorn.topel@intel.com> > > This patch set introduces a new address family called AF_XDP that is > optimized for high performance packet processing Great patchset, thanks. > and, in upcoming > patch sets, zero-copy semantics. And looking forward to this! > Thanks: Björn and Magnus > > Björn Töpel (7): > net: initial AF_XDP skeleton > xsk: add user memory registration support sockopt > xsk: add Rx queue setup and mmap support > xsk: add Rx receive functions and poll support > bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP > xsk: wire up XDP_DRV side of AF_XDP > xsk: wire up XDP_SKB side of AF_XDP > > Magnus Karlsson (8): > xsk: add umem fill queue support and mmap > xsk: add support for bind for Rx > xsk: add umem completion queue support and mmap > xsk: add Tx queue setup and mmap support > dev: packet: make packet_direct_xmit a common function > xsk: support for Tx > xsk: statistics support > samples/bpf: sample application and documentation for AF_XDP sockets For the series Acked-by: Willem de Bruijn <willemb@google.com>
From: Björn Töpel <bjorn.topel@gmail.com> Date: Wed, 2 May 2018 13:01:21 +0200 > This patch set introduces a new address family called AF_XDP that is > optimized for high performance packet processing and, in upcoming > patch sets, zero-copy semantics. In this patch set, we have removed > all zero-copy related code in order to make it smaller, simpler and > hopefully more review friendly. This patch set only supports copy-mode > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode > for RX using the XDP_DRV path. Zero-copy support requires XDP and > driver changes that Jesper Dangaard Brouer is working on. Some of his > work has already been accepted. We will publish our zero-copy support > for RX and TX on top of his patch sets at a later point in time. ... Looks great. Acked-by: David S. Miller <davem@davemloft.net>
On 05/02/2018 01:01 PM, Björn Töpel wrote: > From: Björn Töpel <bjorn.topel@intel.com> > > This patch set introduces a new address family called AF_XDP that is > optimized for high performance packet processing and, in upcoming > patch sets, zero-copy semantics. In this patch set, we have removed > all zero-copy related code in order to make it smaller, simpler and > hopefully more review friendly. This patch set only supports copy-mode > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode > for RX using the XDP_DRV path. Zero-copy support requires XDP and > driver changes that Jesper Dangaard Brouer is working on. Some of his > work has already been accepted. We will publish our zero-copy support > for RX and TX on top of his patch sets at a later point in time. +1, would be great to see it land this cycle. Saw few minor nits here and there but nothing to hold it up, for the series: Acked-by: Daniel Borkmann <daniel@iogearbox.net> Thanks everyone!
On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: > On 05/02/2018 01:01 PM, Björn Töpel wrote: > > From: Björn Töpel <bjorn.topel@intel.com> > > > > This patch set introduces a new address family called AF_XDP that is > > optimized for high performance packet processing and, in upcoming > > patch sets, zero-copy semantics. In this patch set, we have removed > > all zero-copy related code in order to make it smaller, simpler and > > hopefully more review friendly. This patch set only supports copy-mode > > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode > > for RX using the XDP_DRV path. Zero-copy support requires XDP and > > driver changes that Jesper Dangaard Brouer is working on. Some of his > > work has already been accepted. We will publish our zero-copy support > > for RX and TX on top of his patch sets at a later point in time. > > +1, would be great to see it land this cycle. Saw few minor nits here > and there but nothing to hold it up, for the series: > > Acked-by: Daniel Borkmann <daniel@iogearbox.net> > > Thanks everyone! Great stuff! Applied to bpf-next, with one condition. Upcoming zero-copy patches for both RX and TX need to be posted and reviewed within this release window. If netdev community as a whole won't be able to agree on the zero-copy bits we'd need to revert this feature before the next merge window. Few other minor nits: patch 3: +struct xdp_ring { + __u32 producer __attribute__((aligned(64))); + __u32 consumer __attribute__((aligned(64))); +}; It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. patch 5: +struct sockaddr_xdp { + __u16 sxdp_family; + __u32 sxdp_ifindex; Not great to have a hole in uapi struct. Please fix it in the follow up. patch 7: Has a lot of synchronize_net(). I think udpate/delete side can be improved to avoid them. Otherwise users may unknowingly DoS. As the next steps I suggest to prioritize the highest to ship zero-copy rx/tx patches and to add selftests. Thanks!
On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >> On 05/02/2018 01:01 PM, Björn Töpel wrote: >> > From: Björn Töpel <bjorn.topel@intel.com> >> > >> > This patch set introduces a new address family called AF_XDP that is >> > optimized for high performance packet processing and, in upcoming >> > patch sets, zero-copy semantics. In this patch set, we have removed >> > all zero-copy related code in order to make it smaller, simpler and >> > hopefully more review friendly. This patch set only supports copy-mode >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and >> > driver changes that Jesper Dangaard Brouer is working on. Some of his >> > work has already been accepted. We will publish our zero-copy support >> > for RX and TX on top of his patch sets at a later point in time. >> >> +1, would be great to see it land this cycle. Saw few minor nits here >> and there but nothing to hold it up, for the series: >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >> >> Thanks everyone! > > Great stuff! > > Applied to bpf-next, with one condition. > Upcoming zero-copy patches for both RX and TX need to be posted > and reviewed within this release window. > If netdev community as a whole won't be able to agree on the zero-copy > bits we'd need to revert this feature before the next merge window. Thanks everyone for reviewing this. Highly appreciated. Just so we understand the purpose correctly: 1: Do you want to see the ZC patches in order to verify that the user space API holds? If so, we can produce an additional RFC patch set using a big chunk of code that we had in RFC V1. We are not proud of this code since it is clunky, but it hopefully proves the point with the uapi being the same. 2: And/Or are you worried about us all (the netdev community) not agreeing on a way to implement ZC internally in the drivers and the XDP infrastructure? This is not going to be possible to finish during this cycle since we do not like the implementation we had in RFC V1. Too intrusive and now we also have nicer abstractions from Jesper that we can use and extend to provide a (hopefully) much cleaner and less intrusive solution. Just so that we focus on the right proof points. > Few other minor nits: > patch 3: > +struct xdp_ring { > + __u32 producer __attribute__((aligned(64))); > + __u32 consumer __attribute__((aligned(64))); > +}; > It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. Agreed. > patch 5: > +struct sockaddr_xdp { > + __u16 sxdp_family; > + __u32 sxdp_ifindex; > Not great to have a hole in uapi struct. Please fix it in the follow up. You are correct. Will fix. > patch 7: > Has a lot of synchronize_net(). I think udpate/delete side > can be improved to avoid them. Otherwise users may unknowingly DoS. OK. Could you please elaborate on what kind of DoS attacks can be performed with this, so we can come up with the right solution here? Thanks: Magnus > As the next steps I suggest to prioritize the highest to ship > zero-copy rx/tx patches and to add selftests. > > Thanks! >
On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote: > On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: > >> On 05/02/2018 01:01 PM, Björn Töpel wrote: > >> > From: Björn Töpel <bjorn.topel@intel.com> > >> > > >> > This patch set introduces a new address family called AF_XDP that is > >> > optimized for high performance packet processing and, in upcoming > >> > patch sets, zero-copy semantics. In this patch set, we have removed > >> > all zero-copy related code in order to make it smaller, simpler and > >> > hopefully more review friendly. This patch set only supports copy-mode > >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode > >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and > >> > driver changes that Jesper Dangaard Brouer is working on. Some of his > >> > work has already been accepted. We will publish our zero-copy support > >> > for RX and TX on top of his patch sets at a later point in time. > >> > >> +1, would be great to see it land this cycle. Saw few minor nits here > >> and there but nothing to hold it up, for the series: > >> > >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> > >> > >> Thanks everyone! > > > > Great stuff! > > > > Applied to bpf-next, with one condition. > > Upcoming zero-copy patches for both RX and TX need to be posted > > and reviewed within this release window. > > If netdev community as a whole won't be able to agree on the zero-copy > > bits we'd need to revert this feature before the next merge window. > > Thanks everyone for reviewing this. Highly appreciated. > > Just so we understand the purpose correctly: > > 1: Do you want to see the ZC patches in order to verify that the user > space API holds? If so, we can produce an additional RFC patch set > using a big chunk of code that we had in RFC V1. We are not proud of > this code since it is clunky, but it hopefully proves the point with > the uapi being the same. > > 2: And/Or are you worried about us all (the netdev community) not > agreeing on a way to implement ZC internally in the drivers and the > XDP infrastructure? This is not going to be possible to finish during > this cycle since we do not like the implementation we had in RFC V1. > Too intrusive and now we also have nicer abstractions from Jesper that > we can use and extend to provide a (hopefully) much cleaner and less > intrusive solution. short answer: both. Cleanliness and performance of the ZC code is not as important as getting API right. The main concern that during ZC review process we will find out that existing API has issues, so we have to do this exercise before the merge window. And RFC won't fly. Send the patches for real. They have to go through the proper code review. The hackers of netdev community can accept a partial, or a bit unclean, or slightly inefficient implementation, since it can be and will be improved later, but API we cannot change once it goes into official release. Here is the example of API concern: this patch set added shared umem concept. It sounds good in theory, but will it perform well with ZC ? Earlier RFCs didn't have that feature. If it won't perform well than it shouldn't be in the tree. The key reason to let AF_XDP into the tree is its performance promise. If it doesn't perform we should rip it out and redesign.
On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote: >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote: >> >> > From: Björn Töpel <bjorn.topel@intel.com> >> >> > >> >> > This patch set introduces a new address family called AF_XDP that is >> >> > optimized for high performance packet processing and, in upcoming >> >> > patch sets, zero-copy semantics. In this patch set, we have removed >> >> > all zero-copy related code in order to make it smaller, simpler and >> >> > hopefully more review friendly. This patch set only supports copy-mode >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his >> >> > work has already been accepted. We will publish our zero-copy support >> >> > for RX and TX on top of his patch sets at a later point in time. >> >> >> >> +1, would be great to see it land this cycle. Saw few minor nits here >> >> and there but nothing to hold it up, for the series: >> >> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >> >> >> >> Thanks everyone! >> > >> > Great stuff! >> > >> > Applied to bpf-next, with one condition. >> > Upcoming zero-copy patches for both RX and TX need to be posted >> > and reviewed within this release window. >> > If netdev community as a whole won't be able to agree on the zero-copy >> > bits we'd need to revert this feature before the next merge window. >> >> Thanks everyone for reviewing this. Highly appreciated. >> >> Just so we understand the purpose correctly: >> >> 1: Do you want to see the ZC patches in order to verify that the user >> space API holds? If so, we can produce an additional RFC patch set >> using a big chunk of code that we had in RFC V1. We are not proud of >> this code since it is clunky, but it hopefully proves the point with >> the uapi being the same. >> >> 2: And/Or are you worried about us all (the netdev community) not >> agreeing on a way to implement ZC internally in the drivers and the >> XDP infrastructure? This is not going to be possible to finish during >> this cycle since we do not like the implementation we had in RFC V1. >> Too intrusive and now we also have nicer abstractions from Jesper that >> we can use and extend to provide a (hopefully) much cleaner and less >> intrusive solution. > > short answer: both. > > Cleanliness and performance of the ZC code is not as important as > getting API right. The main concern that during ZC review process > we will find out that existing API has issues, so we have to > do this exercise before the merge window. > And RFC won't fly. Send the patches for real. They have to go > through the proper code review. The hackers of netdev community > can accept a partial, or a bit unclean, or slightly inefficient > implementation, since it can be and will be improved later, > but API we cannot change once it goes into official release. > > Here is the example of API concern: > this patch set added shared umem concept. It sounds good in theory, > but will it perform well with ZC ? Earlier RFCs didn't have that > feature. If it won't perform well than it shouldn't be in the tree. > The key reason to let AF_XDP into the tree is its performance promise. > If it doesn't perform we should rip it out and redesign. That is a fair point. We will try to produce patch sets for zero-copy RX and TX using the latest interfaces within this merge window. Just note that we will focus on this for the next week(s) instead of the review items that you and Daniel Borkmann submitted. If we get those patch sets out in time and we agree that they are a possible way forward, then we produce patches with your fixes. It was mainly small items, so should be quick. /Magnus
On Mon, 7 May 2018 11:13:58 +0200 Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote: > >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov > >> <alexei.starovoitov@gmail.com> wrote: > >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: > >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote: > >> >> > From: Björn Töpel <bjorn.topel@intel.com> > >> >> > > >> >> > This patch set introduces a new address family called AF_XDP that is > >> >> > optimized for high performance packet processing and, in upcoming > >> >> > patch sets, zero-copy semantics. In this patch set, we have removed > >> >> > all zero-copy related code in order to make it smaller, simpler and > >> >> > hopefully more review friendly. This patch set only supports copy-mode > >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode > >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and > >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his > >> >> > work has already been accepted. We will publish our zero-copy support > >> >> > for RX and TX on top of his patch sets at a later point in time. > >> >> > >> >> +1, would be great to see it land this cycle. Saw few minor nits here > >> >> and there but nothing to hold it up, for the series: > >> >> > >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> > >> >> > >> >> Thanks everyone! > >> > > >> > Great stuff! > >> > > >> > Applied to bpf-next, with one condition. > >> > Upcoming zero-copy patches for both RX and TX need to be posted > >> > and reviewed within this release window. > >> > If netdev community as a whole won't be able to agree on the zero-copy > >> > bits we'd need to revert this feature before the next merge window. > >> > >> Thanks everyone for reviewing this. Highly appreciated. > >> > >> Just so we understand the purpose correctly: > >> > >> 1: Do you want to see the ZC patches in order to verify that the user > >> space API holds? If so, we can produce an additional RFC patch set > >> using a big chunk of code that we had in RFC V1. We are not proud of > >> this code since it is clunky, but it hopefully proves the point with > >> the uapi being the same. > >> > >> 2: And/Or are you worried about us all (the netdev community) not > >> agreeing on a way to implement ZC internally in the drivers and the > >> XDP infrastructure? This is not going to be possible to finish during > >> this cycle since we do not like the implementation we had in RFC V1. > >> Too intrusive and now we also have nicer abstractions from Jesper that > >> we can use and extend to provide a (hopefully) much cleaner and less > >> intrusive solution. > > > > short answer: both. > > > > Cleanliness and performance of the ZC code is not as important as > > getting API right. The main concern that during ZC review process > > we will find out that existing API has issues, so we have to > > do this exercise before the merge window. > > And RFC won't fly. Send the patches for real. They have to go > > through the proper code review. The hackers of netdev community > > can accept a partial, or a bit unclean, or slightly inefficient > > implementation, since it can be and will be improved later, > > but API we cannot change once it goes into official release. > > > > Here is the example of API concern: > > this patch set added shared umem concept. It sounds good in theory, > > but will it perform well with ZC ? Earlier RFCs didn't have that > > feature. If it won't perform well than it shouldn't be in the tree. > > The key reason to let AF_XDP into the tree is its performance promise. > > If it doesn't perform we should rip it out and redesign. > > That is a fair point. We will try to produce patch sets for zero-copy > RX and TX using the latest interfaces within this merge window. Just > note that we will focus on this for the next week(s) instead of the > review items that you and Daniel Borkmann submitted. If we get those > patch sets out in time and we agree that they are a possible way > forward, then we produce patches with your fixes. It was mainly small > items, so should be quick. I would like to see that you create a new xdp_mem_type for this new zero-copy type. This will allow other XDP redirect methods/types (e.g. devmap and cpumap) to react appropriately when receiving a zero-copy frame. For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call without (first) copying (into a newly allocated page). By arguing that if an xsk-userspace app modify a frame it's not allowed to, then it is simply a bug in the program. (Note, this would also allow using ndo_xdp_xmit call for TX from xsk-userspace). For cpumap, it is hard to avoid a copy, but I'm hoping we could delay the copy (and alloc of mem dest area) until on the remote CPU. This is already the principle of cpumap; of moving the allocation of the SKB to the remote CPU. For ZC to interact with XDP redirect-core and return API, the zero-copy memory type/allocator, need to provide an area for the xdp_frame data to be stored in (as we cannot allow using top-of-frame like non-zero-copy variants), and extend xdp_frame with an ZC umem-id. I imagine we can avoid any dynamic allocations, as we upfront (at bind and XDP_UMEM_REG time) know the number of frames. (e.g. pre-alloc in xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func).
2018-05-07 15:09 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>: > On Mon, 7 May 2018 11:13:58 +0200 > Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > >> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote: >> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov >> >> <alexei.starovoitov@gmail.com> wrote: >> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote: >> >> >> > From: Björn Töpel <bjorn.topel@intel.com> >> >> >> > >> >> >> > This patch set introduces a new address family called AF_XDP that is >> >> >> > optimized for high performance packet processing and, in upcoming >> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed >> >> >> > all zero-copy related code in order to make it smaller, simpler and >> >> >> > hopefully more review friendly. This patch set only supports copy-mode >> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and >> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his >> >> >> > work has already been accepted. We will publish our zero-copy support >> >> >> > for RX and TX on top of his patch sets at a later point in time. >> >> >> >> >> >> +1, would be great to see it land this cycle. Saw few minor nits here >> >> >> and there but nothing to hold it up, for the series: >> >> >> >> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >> >> >> >> >> >> Thanks everyone! >> >> > >> >> > Great stuff! >> >> > >> >> > Applied to bpf-next, with one condition. >> >> > Upcoming zero-copy patches for both RX and TX need to be posted >> >> > and reviewed within this release window. >> >> > If netdev community as a whole won't be able to agree on the zero-copy >> >> > bits we'd need to revert this feature before the next merge window. >> >> >> >> Thanks everyone for reviewing this. Highly appreciated. >> >> >> >> Just so we understand the purpose correctly: >> >> >> >> 1: Do you want to see the ZC patches in order to verify that the user >> >> space API holds? If so, we can produce an additional RFC patch set >> >> using a big chunk of code that we had in RFC V1. We are not proud of >> >> this code since it is clunky, but it hopefully proves the point with >> >> the uapi being the same. >> >> >> >> 2: And/Or are you worried about us all (the netdev community) not >> >> agreeing on a way to implement ZC internally in the drivers and the >> >> XDP infrastructure? This is not going to be possible to finish during >> >> this cycle since we do not like the implementation we had in RFC V1. >> >> Too intrusive and now we also have nicer abstractions from Jesper that >> >> we can use and extend to provide a (hopefully) much cleaner and less >> >> intrusive solution. >> > >> > short answer: both. >> > >> > Cleanliness and performance of the ZC code is not as important as >> > getting API right. The main concern that during ZC review process >> > we will find out that existing API has issues, so we have to >> > do this exercise before the merge window. >> > And RFC won't fly. Send the patches for real. They have to go >> > through the proper code review. The hackers of netdev community >> > can accept a partial, or a bit unclean, or slightly inefficient >> > implementation, since it can be and will be improved later, >> > but API we cannot change once it goes into official release. >> > >> > Here is the example of API concern: >> > this patch set added shared umem concept. It sounds good in theory, >> > but will it perform well with ZC ? Earlier RFCs didn't have that >> > feature. If it won't perform well than it shouldn't be in the tree. >> > The key reason to let AF_XDP into the tree is its performance promise. >> > If it doesn't perform we should rip it out and redesign. >> >> That is a fair point. We will try to produce patch sets for zero-copy >> RX and TX using the latest interfaces within this merge window. Just >> note that we will focus on this for the next week(s) instead of the >> review items that you and Daniel Borkmann submitted. If we get those >> patch sets out in time and we agree that they are a possible way >> forward, then we produce patches with your fixes. It was mainly small >> items, so should be quick. > > I would like to see that you create a new xdp_mem_type for this new > zero-copy type. This will allow other XDP redirect methods/types (e.g. > devmap and cpumap) to react appropriately when receiving a zero-copy > frame. > Yes, that's the plan! > For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call > without (first) copying (into a newly allocated page). By arguing that > if an xsk-userspace app modify a frame it's not allowed to, then it is > simply a bug in the program. (Note, this would also allow using > ndo_xdp_xmit call for TX from xsk-userspace). > Makes sense. I think the ZC rational for Rx can indeed be extended for devmap redirects -- i.e. no frame cloning is required. > For cpumap, it is hard to avoid a copy, but I'm hoping we could delay > the copy (and alloc of mem dest area) until on the remote CPU. This is > already the principle of cpumap; of moving the allocation of the SKB to > the remote CPU. > I think for most AF_XDP applications that would like to pass frames to the kernel, the cpumap would be preferred instead of XDP_PASS (moving the stack execution to another off-AF_XDP-thread). > For ZC to interact with XDP redirect-core and return API, the zero-copy > memory type/allocator, need to provide an area for the xdp_frame data > to be stored in (as we cannot allow using top-of-frame like > non-zero-copy variants), and extend xdp_frame with an ZC umem-id. > I imagine we can avoid any dynamic allocations, as we upfront (at bind > and XDP_UMEM_REG time) know the number of frames. (e.g. pre-alloc in > xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func). > Yeah, we can allocate a kernel-side-only xdp_frame for each umem frame. > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >> On 05/02/2018 01:01 PM, Björn Töpel wrote: >> > From: Björn Töpel <bjorn.topel@intel.com> >> > >> > This patch set introduces a new address family called AF_XDP that is >> > optimized for high performance packet processing and, in upcoming >> > patch sets, zero-copy semantics. In this patch set, we have removed >> > all zero-copy related code in order to make it smaller, simpler and >> > hopefully more review friendly. This patch set only supports copy-mode >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and >> > driver changes that Jesper Dangaard Brouer is working on. Some of his >> > work has already been accepted. We will publish our zero-copy support >> > for RX and TX on top of his patch sets at a later point in time. >> >> +1, would be great to see it land this cycle. Saw few minor nits here >> and there but nothing to hold it up, for the series: >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >> >> Thanks everyone! > > Great stuff! > > Applied to bpf-next, with one condition. > Upcoming zero-copy patches for both RX and TX need to be posted > and reviewed within this release window. > If netdev community as a whole won't be able to agree on the zero-copy > bits we'd need to revert this feature before the next merge window. > > Few other minor nits: > patch 3: > +struct xdp_ring { > + __u32 producer __attribute__((aligned(64))); > + __u32 consumer __attribute__((aligned(64))); > +}; > It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. > Hmm, I need some guidance on what a sane uapi variant would be. We can't have the uapi depend on the kernel build. ARM64, e.g., can have both 64B and 128B according to the specs. Contemporary IA processors have 64B. The simplest, and maybe most future-proof, would be 128B aligned for all. Another is having 128B for ARM and 64B for all IA. A third option is having a hand-shaking API (I think virtio has that) for determine the cache line size, but I'd rather not go down that route. Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version would look like? > patch 5: > +struct sockaddr_xdp { > + __u16 sxdp_family; > + __u32 sxdp_ifindex; > Not great to have a hole in uapi struct. Please fix it in the follow up. > > patch 7: > Has a lot of synchronize_net(). I think udpate/delete side > can be improved to avoid them. Otherwise users may unknowingly DoS. > > As the next steps I suggest to prioritize the highest to ship > zero-copy rx/tx patches and to add selftests. > > Thanks! >
On 5/16/18 11:46 PM, Björn Töpel wrote: > 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: >> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >>> On 05/02/2018 01:01 PM, Björn Töpel wrote: >>>> From: Björn Töpel <bjorn.topel@intel.com> >>>> >>>> This patch set introduces a new address family called AF_XDP that is >>>> optimized for high performance packet processing and, in upcoming >>>> patch sets, zero-copy semantics. In this patch set, we have removed >>>> all zero-copy related code in order to make it smaller, simpler and >>>> hopefully more review friendly. This patch set only supports copy-mode >>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and >>>> driver changes that Jesper Dangaard Brouer is working on. Some of his >>>> work has already been accepted. We will publish our zero-copy support >>>> for RX and TX on top of his patch sets at a later point in time. >>> >>> +1, would be great to see it land this cycle. Saw few minor nits here >>> and there but nothing to hold it up, for the series: >>> >>> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >>> >>> Thanks everyone! >> >> Great stuff! >> >> Applied to bpf-next, with one condition. >> Upcoming zero-copy patches for both RX and TX need to be posted >> and reviewed within this release window. >> If netdev community as a whole won't be able to agree on the zero-copy >> bits we'd need to revert this feature before the next merge window. >> >> Few other minor nits: >> patch 3: >> +struct xdp_ring { >> + __u32 producer __attribute__((aligned(64))); >> + __u32 consumer __attribute__((aligned(64))); >> +}; >> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. >> > > Hmm, I need some guidance on what a sane uapi variant would be. We > can't have the uapi depend on the kernel build. ARM64, e.g., can have > both 64B and 128B according to the specs. Contemporary IA processors > have 64B. > > The simplest, and maybe most future-proof, would be 128B aligned for > all. Another is having 128B for ARM and 64B for all IA. A third option > is having a hand-shaking API (I think virtio has that) for determine > the cache line size, but I'd rather not go down that route. > > Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version > would look like? I suspect i40e+arm combination wasn't tested anyway. The api may have endianness issues too on something like sparc. I think the way to be backwards compatible in this area is to make the api usable on x86 only by adding to include/uapi/linux/if_xdp.h #if defined(__x86_64__) #define AF_XDP_CACHE_BYTES 64 #else #error "AF_XDP support is not yet available for this architecture" #endif and doing: __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES))); __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES))); And progressively add to this for arm64 and few other archs. Eventually removing #error and adding some generic define that's good enough for long tail of architectures that we really cannot test.
On 05/18/2018 05:38 AM, Alexei Starovoitov wrote: > On 5/16/18 11:46 PM, Björn Töpel wrote: >> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: >>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >>>> On 05/02/2018 01:01 PM, Björn Töpel wrote: >>>>> From: Björn Töpel <bjorn.topel@intel.com> >>>>> >>>>> This patch set introduces a new address family called AF_XDP that is >>>>> optimized for high performance packet processing and, in upcoming >>>>> patch sets, zero-copy semantics. In this patch set, we have removed >>>>> all zero-copy related code in order to make it smaller, simpler and >>>>> hopefully more review friendly. This patch set only supports copy-mode >>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and >>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his >>>>> work has already been accepted. We will publish our zero-copy support >>>>> for RX and TX on top of his patch sets at a later point in time. >>>> >>>> +1, would be great to see it land this cycle. Saw few minor nits here >>>> and there but nothing to hold it up, for the series: >>>> >>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >>>> >>>> Thanks everyone! >>> >>> Great stuff! >>> >>> Applied to bpf-next, with one condition. >>> Upcoming zero-copy patches for both RX and TX need to be posted >>> and reviewed within this release window. >>> If netdev community as a whole won't be able to agree on the zero-copy >>> bits we'd need to revert this feature before the next merge window. >>> >>> Few other minor nits: >>> patch 3: >>> +struct xdp_ring { >>> + __u32 producer __attribute__((aligned(64))); >>> + __u32 consumer __attribute__((aligned(64))); >>> +}; >>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. >> >> Hmm, I need some guidance on what a sane uapi variant would be. We >> can't have the uapi depend on the kernel build. ARM64, e.g., can have >> both 64B and 128B according to the specs. Contemporary IA processors >> have 64B. >> >> The simplest, and maybe most future-proof, would be 128B aligned for >> all. Another is having 128B for ARM and 64B for all IA. A third option >> is having a hand-shaking API (I think virtio has that) for determine >> the cache line size, but I'd rather not go down that route. >> >> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version >> would look like? > > I suspect i40e+arm combination wasn't tested anyway. > The api may have endianness issues too on something like sparc. > I think the way to be backwards compatible in this area > is to make the api usable on x86 only by adding > to include/uapi/linux/if_xdp.h > #if defined(__x86_64__) > #define AF_XDP_CACHE_BYTES 64 > #else > #error "AF_XDP support is not yet available for this architecture" > #endif > and doing: > __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES))); > __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES))); > > And progressively add to this for arm64 and few other archs. > Eventually removing #error and adding some generic define > that's good enough for long tail of architectures that > we really cannot test. Been looking into this yesterday as well a bit, and it's a bit of a mess what uapi headers do on this regard (though there are just a handful of such headers). Some of the kernel uapi headers hard-code generally 64 bytes regardless of the underlying arch. In general, the kernel does expose it to user space via sysfs (coherency_line_size). Here's what perf does to retrieve it: #ifdef _SC_LEVEL1_DCACHE_LINESIZE #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE) #else static void cache_line_size(int *cacheline_sizep) { if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep)) pr_debug("cannot determine cache line size"); } #endif The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only available for x86, arm64, s390 and ppc on a cursory glance in the glibc code. In the x86 case it retrieves the info from cpuid insn. In order to generically use it in combination with the header you'd have some probe which would then set this as a define before including the header. Then projects like urcu, they do ... #define ____cacheline_internodealigned_in_smp \ __attribute__((__aligned__(CAA_CACHE_LINE_SIZE))) ... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128), ppc (== 256) and sparc64 (== 256) with a generic fallback to 64. Hmm, perhaps a combination of the two would make sense where in case of known cacheline size it can still be used and we only have the fallback in such way. Like: #ifndef XDP_CACHE_BYTES # if defined(__x86_64__) # define XDP_CACHE_BYTES 64 # else # error "Please define XDP_CACHE_BYTES for this architecture!" # endif #endif Too bad there's no asm uapi header at least for the archs where it's fixed anyway such that not every project out there has to redefine all of it from scratch and we could just include it (and the generic-asm one would throw a compile error if it's not externally defined or such). Cheers, Daniel
2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>: > On 05/18/2018 05:38 AM, Alexei Starovoitov wrote: >> On 5/16/18 11:46 PM, Björn Töpel wrote: >>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: >>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote: >>>>>> From: Björn Töpel <bjorn.topel@intel.com> >>>>>> >>>>>> This patch set introduces a new address family called AF_XDP that is >>>>>> optimized for high performance packet processing and, in upcoming >>>>>> patch sets, zero-copy semantics. In this patch set, we have removed >>>>>> all zero-copy related code in order to make it smaller, simpler and >>>>>> hopefully more review friendly. This patch set only supports copy-mode >>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and >>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his >>>>>> work has already been accepted. We will publish our zero-copy support >>>>>> for RX and TX on top of his patch sets at a later point in time. >>>>> >>>>> +1, would be great to see it land this cycle. Saw few minor nits here >>>>> and there but nothing to hold it up, for the series: >>>>> >>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >>>>> >>>>> Thanks everyone! >>>> >>>> Great stuff! >>>> >>>> Applied to bpf-next, with one condition. >>>> Upcoming zero-copy patches for both RX and TX need to be posted >>>> and reviewed within this release window. >>>> If netdev community as a whole won't be able to agree on the zero-copy >>>> bits we'd need to revert this feature before the next merge window. >>>> >>>> Few other minor nits: >>>> patch 3: >>>> +struct xdp_ring { >>>> + __u32 producer __attribute__((aligned(64))); >>>> + __u32 consumer __attribute__((aligned(64))); >>>> +}; >>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. >>> >>> Hmm, I need some guidance on what a sane uapi variant would be. We >>> can't have the uapi depend on the kernel build. ARM64, e.g., can have >>> both 64B and 128B according to the specs. Contemporary IA processors >>> have 64B. >>> >>> The simplest, and maybe most future-proof, would be 128B aligned for >>> all. Another is having 128B for ARM and 64B for all IA. A third option >>> is having a hand-shaking API (I think virtio has that) for determine >>> the cache line size, but I'd rather not go down that route. >>> >>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version >>> would look like? >> >> I suspect i40e+arm combination wasn't tested anyway. >> The api may have endianness issues too on something like sparc. >> I think the way to be backwards compatible in this area >> is to make the api usable on x86 only by adding >> to include/uapi/linux/if_xdp.h >> #if defined(__x86_64__) >> #define AF_XDP_CACHE_BYTES 64 >> #else >> #error "AF_XDP support is not yet available for this architecture" >> #endif >> and doing: >> __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >> __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >> >> And progressively add to this for arm64 and few other archs. >> Eventually removing #error and adding some generic define >> that's good enough for long tail of architectures that >> we really cannot test. > > Been looking into this yesterday as well a bit, and it's a bit of a mess what > uapi headers do on this regard (though there are just a handful of such headers). > Some of the kernel uapi headers hard-code generally 64 bytes regardless of the > underlying arch. In general, the kernel does expose it to user space via sysfs > (coherency_line_size). Here's what perf does to retrieve it: > > #ifdef _SC_LEVEL1_DCACHE_LINESIZE > #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE) > #else > static void cache_line_size(int *cacheline_sizep) > { > if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep)) > pr_debug("cannot determine cache line size"); > } > #endif > > The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only > available for x86, arm64, s390 and ppc on a cursory glance in the glibc code. > In the x86 case it retrieves the info from cpuid insn. In order to generically > use it in combination with the header you'd have some probe which would then > set this as a define before including the header. > But as a uapi we cannot depend on the L1 cache line size for what's currently running on the system, right? So, either a "one cache line size for all flavors of an arch", e.g. for ARMv8 that would be 128, even though there can be 64 flavors out there. Another way would be to remove the ring structure completely, and leave that to the user-space application to figure out. So there's a runtime interface (getsockopt) to probe the offsets the head and tail pointer is after/before the mmap call, and only expose the descriptor format explicitly in if_xdp.h. Don't know if that is too unorthodox or not... > Then projects like urcu, they do ... > > #define ____cacheline_internodealigned_in_smp \ > __attribute__((__aligned__(CAA_CACHE_LINE_SIZE))) > > ... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128), > ppc (== 256) and sparc64 (== 256) with a generic fallback to 64. > > Hmm, perhaps a combination of the two would make sense where in case of known > cacheline size it can still be used and we only have the fallback in such way. > Like: > > #ifndef XDP_CACHE_BYTES > # if defined(__x86_64__) > # define XDP_CACHE_BYTES 64 > # else > # error "Please define XDP_CACHE_BYTES for this architecture!" > # endif > #endif > > Too bad there's no asm uapi header at least for the archs where it's fixed > anyway such that not every project out there has to redefine all of it from > scratch and we could just include it (and the generic-asm one would throw > a compile error if it's not externally defined or such). > I started out with adding a cache.h to the arch/XXX/include/uapi, but then realized that this didn't really work. :-) > Cheers, > Daniel
On 05/18/2018 05:18 PM, Björn Töpel wrote: > 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>: >> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote: >>> On 5/16/18 11:46 PM, Björn Töpel wrote: >>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: >>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote: >>>>>>> From: Björn Töpel <bjorn.topel@intel.com> >>>>>>> >>>>>>> This patch set introduces a new address family called AF_XDP that is >>>>>>> optimized for high performance packet processing and, in upcoming >>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed >>>>>>> all zero-copy related code in order to make it smaller, simpler and >>>>>>> hopefully more review friendly. This patch set only supports copy-mode >>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and >>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his >>>>>>> work has already been accepted. We will publish our zero-copy support >>>>>>> for RX and TX on top of his patch sets at a later point in time. >>>>>> >>>>>> +1, would be great to see it land this cycle. Saw few minor nits here >>>>>> and there but nothing to hold it up, for the series: >>>>>> >>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >>>>>> >>>>>> Thanks everyone! >>>>> >>>>> Great stuff! >>>>> >>>>> Applied to bpf-next, with one condition. >>>>> Upcoming zero-copy patches for both RX and TX need to be posted >>>>> and reviewed within this release window. >>>>> If netdev community as a whole won't be able to agree on the zero-copy >>>>> bits we'd need to revert this feature before the next merge window. >>>>> >>>>> Few other minor nits: >>>>> patch 3: >>>>> +struct xdp_ring { >>>>> + __u32 producer __attribute__((aligned(64))); >>>>> + __u32 consumer __attribute__((aligned(64))); >>>>> +}; >>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. >>>> >>>> Hmm, I need some guidance on what a sane uapi variant would be. We >>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have >>>> both 64B and 128B according to the specs. Contemporary IA processors >>>> have 64B. >>>> >>>> The simplest, and maybe most future-proof, would be 128B aligned for >>>> all. Another is having 128B for ARM and 64B for all IA. A third option >>>> is having a hand-shaking API (I think virtio has that) for determine >>>> the cache line size, but I'd rather not go down that route. >>>> >>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version >>>> would look like? >>> >>> I suspect i40e+arm combination wasn't tested anyway. >>> The api may have endianness issues too on something like sparc. >>> I think the way to be backwards compatible in this area >>> is to make the api usable on x86 only by adding >>> to include/uapi/linux/if_xdp.h >>> #if defined(__x86_64__) >>> #define AF_XDP_CACHE_BYTES 64 >>> #else >>> #error "AF_XDP support is not yet available for this architecture" >>> #endif >>> and doing: >>> __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >>> __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >>> >>> And progressively add to this for arm64 and few other archs. >>> Eventually removing #error and adding some generic define >>> that's good enough for long tail of architectures that >>> we really cannot test. >> >> Been looking into this yesterday as well a bit, and it's a bit of a mess what >> uapi headers do on this regard (though there are just a handful of such headers). >> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the >> underlying arch. In general, the kernel does expose it to user space via sysfs >> (coherency_line_size). Here's what perf does to retrieve it: >> >> #ifdef _SC_LEVEL1_DCACHE_LINESIZE >> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE) >> #else >> static void cache_line_size(int *cacheline_sizep) >> { >> if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep)) >> pr_debug("cannot determine cache line size"); >> } >> #endif >> >> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only >> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code. >> In the x86 case it retrieves the info from cpuid insn. In order to generically >> use it in combination with the header you'd have some probe which would then >> set this as a define before including the header. > > But as a uapi we cannot depend on the L1 cache line size for what's > currently running on the system, right? So, either a "one cache line > size for all flavors of an arch", e.g. for ARMv8 that would be 128, > even though there can be 64 flavors out there. > > Another way would be to remove the ring structure completely, and > leave that to the user-space application to figure out. So there's a > runtime interface (getsockopt) to probe the offsets the head and tail > pointer is after/before the mmap call, and only expose the descriptor > format explicitly in if_xdp.h. Don't know if that is too unorthodox or > not... Good point, I think that may not be too unreasonable thing to do, imho, at least this doesn't get us into the issue discussed here in the first place, and it would work for other archs more seamlessly rather than ugly build error or single per arch 'catch-all' define. Thanks, Daniel
2018-05-18 18:17 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>: > On 05/18/2018 05:18 PM, Björn Töpel wrote: >> 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>: >>> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote: >>>> On 5/16/18 11:46 PM, Björn Töpel wrote: >>>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>: >>>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote: >>>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote: >>>>>>>> From: Björn Töpel <bjorn.topel@intel.com> >>>>>>>> >>>>>>>> This patch set introduces a new address family called AF_XDP that is >>>>>>>> optimized for high performance packet processing and, in upcoming >>>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed >>>>>>>> all zero-copy related code in order to make it smaller, simpler and >>>>>>>> hopefully more review friendly. This patch set only supports copy-mode >>>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode >>>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and >>>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his >>>>>>>> work has already been accepted. We will publish our zero-copy support >>>>>>>> for RX and TX on top of his patch sets at a later point in time. >>>>>>> >>>>>>> +1, would be great to see it land this cycle. Saw few minor nits here >>>>>>> and there but nothing to hold it up, for the series: >>>>>>> >>>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net> >>>>>>> >>>>>>> Thanks everyone! >>>>>> >>>>>> Great stuff! >>>>>> >>>>>> Applied to bpf-next, with one condition. >>>>>> Upcoming zero-copy patches for both RX and TX need to be posted >>>>>> and reviewed within this release window. >>>>>> If netdev community as a whole won't be able to agree on the zero-copy >>>>>> bits we'd need to revert this feature before the next merge window. >>>>>> >>>>>> Few other minor nits: >>>>>> patch 3: >>>>>> +struct xdp_ring { >>>>>> + __u32 producer __attribute__((aligned(64))); >>>>>> + __u32 consumer __attribute__((aligned(64))); >>>>>> +}; >>>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers. >>>>> >>>>> Hmm, I need some guidance on what a sane uapi variant would be. We >>>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have >>>>> both 64B and 128B according to the specs. Contemporary IA processors >>>>> have 64B. >>>>> >>>>> The simplest, and maybe most future-proof, would be 128B aligned for >>>>> all. Another is having 128B for ARM and 64B for all IA. A third option >>>>> is having a hand-shaking API (I think virtio has that) for determine >>>>> the cache line size, but I'd rather not go down that route. >>>>> >>>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version >>>>> would look like? >>>> >>>> I suspect i40e+arm combination wasn't tested anyway. >>>> The api may have endianness issues too on something like sparc. >>>> I think the way to be backwards compatible in this area >>>> is to make the api usable on x86 only by adding >>>> to include/uapi/linux/if_xdp.h >>>> #if defined(__x86_64__) >>>> #define AF_XDP_CACHE_BYTES 64 >>>> #else >>>> #error "AF_XDP support is not yet available for this architecture" >>>> #endif >>>> and doing: >>>> __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >>>> __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES))); >>>> >>>> And progressively add to this for arm64 and few other archs. >>>> Eventually removing #error and adding some generic define >>>> that's good enough for long tail of architectures that >>>> we really cannot test. >>> >>> Been looking into this yesterday as well a bit, and it's a bit of a mess what >>> uapi headers do on this regard (though there are just a handful of such headers). >>> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the >>> underlying arch. In general, the kernel does expose it to user space via sysfs >>> (coherency_line_size). Here's what perf does to retrieve it: >>> >>> #ifdef _SC_LEVEL1_DCACHE_LINESIZE >>> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE) >>> #else >>> static void cache_line_size(int *cacheline_sizep) >>> { >>> if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep)) >>> pr_debug("cannot determine cache line size"); >>> } >>> #endif >>> >>> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only >>> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code. >>> In the x86 case it retrieves the info from cpuid insn. In order to generically >>> use it in combination with the header you'd have some probe which would then >>> set this as a define before including the header. >> >> But as a uapi we cannot depend on the L1 cache line size for what's >> currently running on the system, right? So, either a "one cache line >> size for all flavors of an arch", e.g. for ARMv8 that would be 128, >> even though there can be 64 flavors out there. >> >> Another way would be to remove the ring structure completely, and >> leave that to the user-space application to figure out. So there's a >> runtime interface (getsockopt) to probe the offsets the head and tail >> pointer is after/before the mmap call, and only expose the descriptor >> format explicitly in if_xdp.h. Don't know if that is too unorthodox or >> not... > > Good point, I think that may not be too unreasonable thing to do, imho, > at least this doesn't get us into the issue discussed here in the first > place, and it would work for other archs more seamlessly rather than ugly > build error or single per arch 'catch-all' define. > I'll try that route (runtime-based offset to prod/cons), and see where it ends up! Enjoy the weekend, Björn > Thanks, > Daniel
From: Björn Töpel <bjorn.topel@intel.com> This patch set introduces a new address family called AF_XDP that is optimized for high performance packet processing and, in upcoming patch sets, zero-copy semantics. In this patch set, we have removed all zero-copy related code in order to make it smaller, simpler and hopefully more review friendly. This patch set only supports copy-mode for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX using the XDP_DRV path. Zero-copy support requires XDP and driver changes that Jesper Dangaard Brouer is working on. Some of his work has already been accepted. We will publish our zero-copy support for RX and TX on top of his patch sets at a later point in time. An AF_XDP socket (XSK) is created with the normal socket() syscall. Associated with each XSK are two queues: the RX queue and the TX queue. A socket can receive packets on the RX queue and it can send packets on the TX queue. These queues are registered and sized with the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory to have at least one of these queues for each socket. In contrast to AF_PACKET V2/V3 these descriptor queues are separated from packet buffers. An RX or TX descriptor points to a data buffer in a memory area called a UMEM. RX and TX can share the same UMEM so that a packet does not have to be copied between RX and TX. Moreover, if a packet needs to be kept for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This again avoids copying data. This new dedicated packet buffer area is call a UMEM. It consists of a number of equally size frames and each frame has a unique frame id. A descriptor in one of the queues references a frame by referencing its frame id. The user space allocates memory for this UMEM using whatever means it feels is most appropriate (malloc, mmap, huge pages, etc). This memory area is then registered with the kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue and the COMPLETION queue. The fill queue is used by the application to send down frame ids for the kernel to fill in with RX packet data. References to these frames will then appear in the RX queue of the XSK once they have been received. The completion queue, on the other hand, contains frame ids that the kernel has transmitted completely and can now be used again by user space, for either TX or RX. Thus, the frame ids appearing in the completion queue are ids that were previously transmitted using the TX queue. In summary, the RX and FILL queues are used for the RX path and the TX and COMPLETION queues are used for the TX path. The socket is then finally bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is completed that traffic starts to flow. Note that in this patch set, all packet data is copied out to user-space. A new feature in this patch set is that the UMEM can be shared between processes, if desired. If a process wants to do this, it simply skips the registration of the UMEM and its corresponding two queues, sets a flag in the bind call and submits the XSK of the process it would like to share UMEM with as well as its own newly created XSK socket. The new process will then receive frame id references in its own RX queue that point to this shared UMEM. Note that since the queue structures are single-consumer / single-producer (for performance reasons), the new process has to create its own socket with associated RX and TX queues, since it cannot share this with the other process. This is also the reason that there is only one set of FILL and COMPLETION queues per UMEM. It is the responsibility of a single process to handle the UMEM. If multiple-producer / multiple-consumer queues are implemented in the future, this requirement could be relaxed. How is then packets distributed between these two XSK? We have introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The user-space application can place an XSK at an arbitrary place in this map. The XDP program can then redirect a packet to a specific index in this map and at this point XDP validates that the XSK in that map was indeed bound to that device and queue number. If not, the packet is dropped. If the map is empty at that index, the packet is also dropped. This also means that it is currently mandatory to have an XDP program loaded (and one XSK in the XSKMAP) to be able to get any traffic to user space through the XSK. AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the driver does not have support for XDP, or XDP_SKB is explicitly chosen when loading the XDP program, XDP_SKB mode is employed that uses SKBs together with the generic XDP support and copies out the data to user space. A fallback mode that works for any network device. On the other hand, if the driver has support for XDP, it will be used by the AF_XDP code to provide better performance, but there is still a copy of the data into user space. There is a xdpsock benchmarking/test application included that demonstrates how to use AF_XDP sockets with both private and shared UMEMs. Say that you would like your UDP traffic from port 4242 to end up in queue 16, that we will enable AF_XDP on. Here, we use ethtool for this: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ action 16 Running the rxdrop benchmark in XDP_DRV mode can then be done using: samples/bpf/xdpsock -i p3p2 -q 16 -r -N For XDP_SKB mode, use the switch "-S" instead of "-N" and all options can be displayed with "-h", as usual. We have run some benchmarks on a dual socket system with two Broadwell E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 cores which gives a total of 28, but only two cores are used in these experiments. One for TR/RX and one for the user space application. The memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is 8192MB and with 8 of those DIMMs in the system we have 64 GB of total memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is Intel I40E 40Gbit/s using the i40e driver. Below are the results in Mpps of the I40E NIC benchmark runs for 64 and 1500 byte packets, generated by a commercial packet generator HW outputing packets at full 40 Gbit/s line rate. The results are without retpoline so that we can compare against previous numbers. With retpoline, the AF_XDP numbers drop with between 10 - 15 percent. AF_XDP performance 64 byte packets. Results from V2 in parenthesis. Benchmark XDP_SKB XDP_DRV rxdrop 2.9(3.0) 9.6(9.5) txpush 2.6(2.5) NA* l2fwd 1.9(1.9) 2.5(2.5) (TX using XDP_SKB in both cases) AF_XDP performance 1500 byte packets: Benchmark XDP_SKB XDP_DRV rxdrop 2.1(2.2) 3.3(3.3) l2fwd 1.4(1.4) 1.8(1.8) (TX using XDP_SKB in both cases) * NA since we have no support for TX using the XDP_DRV infrastructure in this patch set. This is for a future patch set since it involves changes to the XDP NDOs. Some of this has been upstreamed by Jesper Dangaard Brouer. XDP performance on our system as a base line: 64 byte packets: XDP stats CPU pps issue-pps XDP-RX CPU 16 32.3(32.9)M 0 1500 byte packets: XDP stats CPU pps issue-pps XDP-RX CPU 16 3.3(3.3)M 0 Changes from V2: * Fixed a race in XSKMAP map found by Will. The code has been completely rearchitected and is now simpler, faster, and hopefully also not racy. Please review and check if it holds. If you would like to diff V2 against V3, you can find them here: https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next The structure of the patch set is as follows: Patches 1-3: Basic socket and umem plumbing Patches 4-9: RX support together with the new XSKMAP Patches 10-13: TX support Patch 14: Statistics support with getsockopt() Patch 15: Sample application We based this patch set on bpf-next commit a3fe1f6f2ada ("tools: bpftool: change time format for program 'loaded at:' information") To do for this patch set: * Syzkaller torture session being worked on Post-series plan: * Optimize performance * Kernel selftest * Kernel load module support of AF_XDP would be nice. Unclear how to achieve this though since our XDP code depends on net/core. * Support for AF_XDP sockets without an XPD program loaded. In this case all the traffic on a queue should go up to the user space socket. * Daniel Borkmann's suggestion for a "copy to XDP socket, and return XDP_PASS" for a tcpdump-like functionality. * And of course getting to zero-copy support in small increments, starting with TX then adding RX. Thanks: Björn and Magnus Björn Töpel (7): net: initial AF_XDP skeleton xsk: add user memory registration support sockopt xsk: add Rx queue setup and mmap support xsk: add Rx receive functions and poll support bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP xsk: wire up XDP_DRV side of AF_XDP xsk: wire up XDP_SKB side of AF_XDP Magnus Karlsson (8): xsk: add umem fill queue support and mmap xsk: add support for bind for Rx xsk: add umem completion queue support and mmap xsk: add Tx queue setup and mmap support dev: packet: make packet_direct_xmit a common function xsk: support for Tx xsk: statistics support samples/bpf: sample application and documentation for AF_XDP sockets Documentation/networking/af_xdp.rst | 297 +++++++++++ Documentation/networking/index.rst | 1 + MAINTAINERS | 8 + include/linux/bpf.h | 25 + include/linux/bpf_types.h | 3 + include/linux/filter.h | 2 +- include/linux/netdevice.h | 1 + include/linux/socket.h | 5 +- include/net/xdp.h | 1 + include/net/xdp_sock.h | 66 +++ include/uapi/linux/bpf.h | 1 + include/uapi/linux/if_xdp.h | 87 ++++ kernel/bpf/Makefile | 3 + kernel/bpf/verifier.c | 8 +- kernel/bpf/xskmap.c | 239 +++++++++ net/Kconfig | 1 + net/Makefile | 1 + net/core/dev.c | 73 ++- net/core/filter.c | 40 +- net/core/sock.c | 12 +- net/core/xdp.c | 15 +- net/packet/af_packet.c | 42 +- net/xdp/Kconfig | 7 + net/xdp/Makefile | 2 + net/xdp/xdp_umem.c | 260 ++++++++++ net/xdp/xdp_umem.h | 67 +++ net/xdp/xdp_umem_props.h | 23 + net/xdp/xsk.c | 656 +++++++++++++++++++++++++ net/xdp/xsk_queue.c | 73 +++ net/xdp/xsk_queue.h | 247 ++++++++++ samples/bpf/Makefile | 4 + samples/bpf/xdpsock.h | 11 + samples/bpf/xdpsock_kern.c | 56 +++ samples/bpf/xdpsock_user.c | 948 ++++++++++++++++++++++++++++++++++++ security/selinux/hooks.c | 4 +- security/selinux/include/classmap.h | 4 +- 36 files changed, 3221 insertions(+), 72 deletions(-) create mode 100644 Documentation/networking/af_xdp.rst create mode 100644 include/net/xdp_sock.h create mode 100644 include/uapi/linux/if_xdp.h create mode 100644 kernel/bpf/xskmap.c create mode 100644 net/xdp/Kconfig create mode 100644 net/xdp/Makefile create mode 100644 net/xdp/xdp_umem.c create mode 100644 net/xdp/xdp_umem.h create mode 100644 net/xdp/xdp_umem_props.h create mode 100644 net/xdp/xsk.c create mode 100644 net/xdp/xsk_queue.c create mode 100644 net/xdp/xsk_queue.h create mode 100644 samples/bpf/xdpsock.h create mode 100644 samples/bpf/xdpsock_kern.c create mode 100644 samples/bpf/xdpsock_user.c