Message ID:   458f9c13ab58abb1a15627906d03c33c42b02a7c.1506297988.git.daniel@iogearbox.net
State:        Accepted, archived
Delegated to: David Miller
Series:       BPF metadata for direct access
On Mon, Sep 25, 2017 at 02:25:51AM +0200, Daniel Borkmann wrote:
> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
>
> xdp->data_meta is optional and programs are not required to use it. The
> rationale is that when we process the packet in XDP (e.g. as DoS filter),
> we can push further meta data along with it for the XDP_PASS case, and
> give the guarantee that a clsact ingress BPF program on the same device
> can pick this up for further post-processing. Since we work with skb
> there, we can also set skb->mark, skb->priority or other skb meta data
> out of BPF, thus having this scratch space generic and programmable
> allows for more flexibility than defining a direct 1:1 transfer of
> potentially new XDP members into skb (it's also more efficient as we
> don't need to initialize/handle each of such new members). The facility
> also works together with GRO aggregation. The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large.
> Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.
>
> The verifier treats xdp->data_meta/xdp->data the same way as we treat
> xdp->data/xdp->data_end pointer comparisons. The requirement for doing
> the compare against xdp->data is that it hasn't been modified from its
> original address we got from ctx access. It may have a range marking
> already from prior successful xdp->data/xdp->data_end pointer comparisons
> though.

First, thanks for this detailed description. It was helpful to read
along with the patches.

My only concern about this area being generic is that you are now in a
state where any bpf program must know about all the bpf programs in the
receive pipeline before it can properly parse what is stored in the
meta-data and add it to an skb (or perform any other action). Especially
if each program adds its own meta-data along the way.

Maybe this isn't a big concern based on the number of users of this
today, but it just starts to seem like a concern as there are these
hints being passed between layers that are challenging to track due to a
lack of a standard format for passing data between them.

The main reason I bring this up is that Michael and I had discussed and
designed a way for drivers to communicate between each other that rx
resources could be freed after a tx completion on an XDP_REDIRECT
action. Much like this code, it involved adding a new element to
struct xdp_md that could point to the important information. Now that
there is a generic way to handle this, it would seem nice to be able to
leverage it, but I'm not sure how reliable this meta-data area would be
without the ability to mark it in some manner.

For additional background, the minimum amount of data needed in the case
Michael and I were discussing was really 2 words. One to serve as a
pointer to an rx_ring structure and one to have a counter to the rx
producer entry. This data could be accessed by the driver processing the
tx completions and callback to the driver that received the frame off
the wire to perform any needed processing. (For those curious this would
also require a new callback/netdev op to act on this data stored in the
XDP buffer.)

IIUC, I could use this meta_data area to store this information, but
would it also be useful to create some type field/marker that could also
be stored in the meta_data to indicate what type of information is
there? I hate to propose such a thing as it may add unneeded complexity,
but I just wanted to make sure to say something before it was too late
as there may be more users of this right away. :-)

> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   1 +
>  .../net/ethernet/netronome/nfp/nfp_net_common.c    |   1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c         |   1 +
>  drivers/net/tun.c                                  |   1 +
>  drivers/net/virtio_net.c                           |   2 +
>  include/linux/bpf.h                                |   1 +
>  include/linux/filter.h                             |  21 +++-
>  include/linux/skbuff.h                             |  68 +++++++++++-
>  include/uapi/linux/bpf.h                           |  13 ++-
>  kernel/bpf/verifier.c                              | 114 ++++++++++++++----
>  net/bpf/test_run.c                                 |   1 +
>  net/core/dev.c                                     |  31 +++++-
>  net/core/filter.c                                  |  77 +++++++++++++-
>  net/core/skbuff.c                                  |   2 +
>  19 files changed, 297 insertions(+), 42 deletions(-)

[...]
On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
[...]
> First, thanks for this detailed description. It was helpful to read
> along with the patches.
>
> My only concern about this area being generic is that you are now in a
> state where any bpf program must know about all the bpf programs in the
> receive pipeline before it can properly parse what is stored in the
> meta-data and add it to an skb (or perform any other action).
> Especially if each program adds it's own meta-data along the way.
>
> Maybe this isn't a big concern based on the number of users of this
> today, but it just starts to seem like a concern as there are these
> hints being passed between layers that are challenging to track due to a
> lack of a standard format for passing data between.

Btw, we do have similar kind of programmable scratch buffer also today
wrt skb cb[] that you can program from tc side, the perf ring buffer,
which doesn't have any fixed layout for the slots, or a per-cpu map
where you can transfer data between tail calls for example, then tail
calls themselves that need to coordinate, or simply mangling of packets
itself if you will, but more below to your use case ...

> The main reason I bring this up is that Michael and I had discussed and
> designed a way for drivers to communicate between each other that rx
> resources could be freed after a tx completion on an XDP_REDIRECT
> action. Much like this code, it involved adding an new element to
> struct xdp_md that could point to the important information. Now that
> there is a generic way to handle this, it would seem nice to be able to
> leverage it, but I'm not sure how reliable this meta-data area would be
> without the ability to mark it in some manner.
>
> For additional background, the minimum amount of data needed in the case
> Michael and I were discussing was really 2 words. One to serve as a
> pointer to an rx_ring structure and one to have a counter to the rx
> producer entry. This data could be acessed by the driver processing the
> tx completions and callback to the driver that received the frame off
> the wire to perform any needed processing. (For those curious this
> would also require a new callback/netdev op to act on this data stored
> in the XDP buffer.)

What you describe above doesn't seem to be fitting to the use-case of
this set, meaning the area here is fully programmable out of the BPF
program, the infrastructure you're describing is some sort of means of
communication between drivers for the XDP_REDIRECT, and should be
outside of the control of the BPF program to mangle.

You could probably reuse the base infra here and make a part of that
inaccessible for the program with some sort of a fixed layout, but I
haven't seen your code yet to be able to fully judge. Intention here
is to allow for programmability within the BPF prog in a generic way,
such that based on the use-case it can be populated in specific ways
and propagated to the skb w/o having to define a fixed layout and
bloat xdp_buff all the way to an skb while still retaining all the
flexibility.

Thanks,
Daniel
On 09/25/2017 11:50 AM, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
>> First, thanks for this detailed description. It was helpful to read
>> along with the patches.
>>
>> My only concern about this area being generic is that you are now in a
>> state where any bpf program must know about all the bpf programs in the
>> receive pipeline before it can properly parse what is stored in the
>> meta-data and add it to an skb (or perform any other action).
>> Especially if each program adds it's own meta-data along the way.
>>
>> Maybe this isn't a big concern based on the number of users of this
>> today, but it just starts to seem like a concern as there are these
>> hints being passed between layers that are challenging to track due to a
>> lack of a standard format for passing data between.
>
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
>
>> The main reason I bring this up is that Michael and I had discussed and
>> designed a way for drivers to communicate between each other that rx
>> resources could be freed after a tx completion on an XDP_REDIRECT
>> action. Much like this code, it involved adding an new element to
>> struct xdp_md that could point to the important information. Now that
>> there is a generic way to handle this, it would seem nice to be able to
>> leverage it, but I'm not sure how reliable this meta-data area would be
>> without the ability to mark it in some manner.
>>
>> For additional background, the minimum amount of data needed in the case
>> Michael and I were discussing was really 2 words. One to serve as a
>> pointer to an rx_ring structure and one to have a counter to the rx
>> producer entry. This data could be acessed by the driver processing the
>> tx completions and callback to the driver that received the frame off
>> the wire to perform any needed processing. (For those curious this
>> would also require a new callback/netdev op to act on this data stored
>> in the XDP buffer.)
>
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.
>
> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.
>
> Thanks,
> Daniel

Hi Andy,

I'm guessing this data needs to be passed from the input dev to the
output dev based on your description. If the driver data is pushed after
the BPF program is run but before the xdp_do_flush_map call, no other
BPF programs can be run on that xdp_buff. It should be safe at this
point to use the metadata region directly from the driver. We would just
need to add a few helpers for the drivers to use for this maybe,
xdp_metadata_write_drv, xdp_metadata_read_drv.

I think this would work for your use case? The data structure would have
to be agreed upon by all the drivers but would not be UAPI because it
would only be exposed in the driver. So we would be free to
change/update it as needed.

Thanks,
John
On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
> > First, thanks for this detailed description. It was helpful to read
> > along with the patches.
> >
> > My only concern about this area being generic is that you are now in a
> > state where any bpf program must know about all the bpf programs in the
> > receive pipeline before it can properly parse what is stored in the
> > meta-data and add it to an skb (or perform any other action).
> > Especially if each program adds it's own meta-data along the way.
> >
> > Maybe this isn't a big concern based on the number of users of this
> > today, but it just starts to seem like a concern as there are these
> > hints being passed between layers that are challenging to track due to a
> > lack of a standard format for passing data between.
>
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
>
> > The main reason I bring this up is that Michael and I had discussed and
> > designed a way for drivers to communicate between each other that rx
> > resources could be freed after a tx completion on an XDP_REDIRECT
> > action. Much like this code, it involved adding an new element to
> > struct xdp_md that could point to the important information. Now that
> > there is a generic way to handle this, it would seem nice to be able to
> > leverage it, but I'm not sure how reliable this meta-data area would be
> > without the ability to mark it in some manner.
> >
> > For additional background, the minimum amount of data needed in the case
> > Michael and I were discussing was really 2 words. One to serve as a
> > pointer to an rx_ring structure and one to have a counter to the rx
> > producer entry. This data could be acessed by the driver processing the
> > tx completions and callback to the driver that received the frame off
> > the wire to perform any needed processing. (For those curious this
> > would also require a new callback/netdev op to act on this data stored
> > in the XDP buffer.)
>
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.

OK, I understand that perspective. I think saying this is really meant
as a BPF<->BPF communication channel for now is fine.

> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.

Some level of reuse might be proper, but I'd rather it be explicit for
my use since it's not exclusively something that will need to be used by
a BPF prog, but rather the driver. I'll produce some patches this week
for reference.
On Mon, 25 Sep 2017 02:25:51 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
>
> [...] The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.

So, xdp->data_meta is placed just before the packet xdp->data starts.

I'm currently implementing a cpumap type, that transfers raw XDP frames
to another CPU, and the SKB is allocated on the remote CPU. (It actually
works extremely well.)

For transferring the info I need, I'm currently using
xdp->data_hard_start (the top/start of the xdp page). Which should be
compatible with your approach, right?
The info I need:

 struct xdp_pkt {
	void *data;
	u16 len;
	u16 headroom;
	struct net_device *dev_rx;
 };

When I enqueue the xdp packet I do the following:

 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
		     struct net_device *dev_rx)
 {
	struct xdp_pkt *xdp_pkt;
	int headroom;

	/* Convert xdp_buff to xdp_pkt */
	headroom = xdp->data - xdp->data_hard_start;
	if (headroom < sizeof(*xdp_pkt))
		return -EOVERFLOW;
	xdp_pkt = xdp->data_hard_start;
	xdp_pkt->data = xdp->data;
	xdp_pkt->len  = xdp->data_end - xdp->data;
	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);

	/* Info needed when constructing SKB on remote CPU */
	xdp_pkt->dev_rx = dev_rx;

	bq_enqueue(rcpu, xdp_pkt);
	return 0;
 }

On the remote CPU dequeueing the packet, I'm doing the following. As you
can see I'm still lacking some meta-data, that would be nice to also
transfer. Could I use your infrastructure for that?

 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
					  struct xdp_pkt *xdp_pkt)
 {
	unsigned int truesize;
	void *pkt_data_start;
	struct sk_buff *skb;

	/* TODO: rcpu could provide truesize, it's static per RX-ring */
	truesize = 2048;

	// pkt_data_start = xdp_pkt + sizeof(*xdp_pkt);
	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;

	/* Need to adjust "truesize" for skb_shared_info to get proper
	 * placed, to take into account that xdp_pkt is using part of
	 * headroom */
	skb = build_skb(pkt_data_start, truesize - sizeof(*xdp_pkt));
	if (!skb)
		return NULL;

	skb_reserve(skb, xdp_pkt->headroom);
	__skb_put(skb, xdp_pkt->len);

	// skb_record_rx_queue(skb, rx_ring->queue_index);
	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);

	// How much does csum matter?
	// skb->ip_summed = CHECKSUM_UNNECESSARY; // Try to fake it...

	// Does setting skb_set_hash() matter?
	// __skb_set_hash(skb, 42, true, false); // Say it is software
	// __skb_set_hash(skb, 42, false, true); // Say it is hardware

	// Do we lack setting rx_queue... it doesn't seem to matter
	// skb_record_rx_queue(skb, 0);

	return skb;
 }

(I'll send out some patches soonish, hopefully tomorrow... to show in
more details what I'm doing)
On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
[...]
> I'm currently implementing a cpumap type, that transfers raw XDP frames
> to another CPU, and the SKB is allocated on the remote CPU. (It
> actually works extremely well).

Meaning you let all the XDP_PASS packets get processed on a different
CPU, so you can reserve the whole CPU just for prefiltering, right? Do
you have some numbers to share at this point, just curious when you
mention it works extremely well.

> For transferring info I need, I'm currently using xdp->data_hard_start
> (the top/start of the xdp page). Which should be compatible with your
> approach, right?

Should be possible, yes. More below.

> The info I need:
>
> struct xdp_pkt {
> 	void *data;
> 	u16 len;
> 	u16 headroom;
> 	struct net_device *dev_rx;
> };
>
> When I enqueue the xdp packet I do the following:
>
> int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
> 		    struct net_device *dev_rx)
> {
> 	struct xdp_pkt *xdp_pkt;
> 	int headroom;
>
> 	/* Convert xdp_buff to xdp_pkt */
> 	headroom = xdp->data - xdp->data_hard_start;
> 	if (headroom < sizeof(*xdp_pkt))
> 		return -EOVERFLOW;
> 	xdp_pkt = xdp->data_hard_start;
> 	xdp_pkt->data = xdp->data;
> 	xdp_pkt->len = xdp->data_end - xdp->data;
> 	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
>
> 	/* Info needed when constructing SKB on remote CPU */
> 	xdp_pkt->dev_rx = dev_rx;
>
> 	bq_enqueue(rcpu, xdp_pkt);
> 	return 0;
> }
>
> On the remote CPU dequeueing the packet, I'm doing the following. As
> you can see I'm still lacking some meta-data, that would be nice to
> also transfer. Could I use your infrastructure for that?

There could be multiple options to use it, in case you have a helper
where you look up the CPU in the map and would also store the meta data,
you could use a per-CPU scratch buffer similarly as we do with struct
redirect_info, and move that later e.g. after program return into
xdp->data_hard_start pointer. You could also reserve that upfront
potentially, so it's hidden from the beginning from the program unless
you want the program itself to fill it out (modulo the pointers).

Not all drivers currently leave room though, I've also seen where
xdp->data_hard_start points directly to xdp->data, so there's 0 headroom
available to use; in such case it could either be treated as a hint and
for those drivers they just pass the skb up on the current CPU, or you
would need some other means to move the meta data to the remote CPU, or
potentially just use tail room.

Thanks,
Daniel
On Tue, 26 Sep 2017 21:58:53 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> [...]
> > I'm currently implementing a cpumap type, that transfers raw XDP frames
> > to another CPU, and the SKB is allocated on the remote CPU. (It
> > actually works extremely well).
>
> Meaning you let all the XDP_PASS packets get processed on a
> different CPU, so you can reserve the whole CPU just for
> prefiltering, right?

Yes, exactly. Except I use the XDP_REDIRECT action to steer packets.
The trick is using the map-flush point, to transfer packets in bulk to
the remote CPU (single call IPC is too slow), but at the same time
flush single packets if NAPI didn't see a bulk.

> Do you have some numbers to share at this point, just curious when
> you mention it works extremely well.

Sure... I've done a lot of benchmarking on this patchset ;-)
I have a benchmark program called xdp_redirect_cpu [1][2], that collects
stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c

Here I'm installing a DDoS program that drops UDP port 9 (pktgen
packets) on RX CPU=0. I'm forcing my netperf to hit the same CPU, that
the 11.9Mpps DDoS attack is hitting.

 Running XDP/eBPF prog_num:4
 XDP-cpumap      CPU:to  pps         drop-pps    extra-info
 XDP-RX          0       12,030,471  11,966,982  0
 XDP-RX          total   12,030,471  11,966,982
 cpumap-enqueue  0:2     63,488      0           0
 cpumap-enqueue  sum:2   63,488      0           0
 cpumap_kthread  2       63,488      0           3 time_exceed
 cpumap_kthread  total   63,488      0           0
 redirect_err    total   0           0

 $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024
 Local /Remote
 Socket Size   Request  Resp.   Elapsed  Trans.
 Send   Recv   Size     Size    Time     Rate
 bytes  Bytes  bytes    bytes   secs.    per sec

 16384  87380  1024     1024    10.00    12735.97
 16384  87380

The netperf TCP_CRR performance is the same, without XDP loaded.

Another test I've previously shown (and optimized) in commit
c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate
limited"): my system can handle approx 2.7Mpps for UdpNoPorts, before
the network stack chokes. Thus it is interesting to see, when I get UDP
traffic that hits the same CPU, if I can simply round-robin distribute
it to other CPUs. This evaluates whether the cross-CPU transfer
mechanism is fast enough.

I do have to increase the ixgbe RX-ring size, else the ixgbe recycle
scheme breaks down, and we stall on the page spin_lock (as Tariq has
demonstrated before).

 # ethtool -G ixgbe1 rx 1024 tx 1024

Start RR program and add some CPUs:

 # ./xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 1 --cpu 2 --cpu 3 --cpu 4

 Running XDP/eBPF prog_num:2
 XDP-cpumap      CPU:to  pps         drop-pps  extra-info
 XDP-RX          0       11,006,992  0         0
 XDP-RX          total   11,006,992  0
 cpumap-enqueue  0:1     2,751,744   0         0
 cpumap-enqueue  sum:1   2,751,744   0         0
 cpumap-enqueue  0:2     2,751,748   0         0
 cpumap-enqueue  sum:2   2,751,748   0         0
 cpumap-enqueue  0:3     2,751,744   35        0
 cpumap-enqueue  sum:3   2,751,744   35        0
 cpumap-enqueue  0:4     2,751,748   0         0
 cpumap-enqueue  sum:4   2,751,748   0         0
 cpumap_kthread  1       2,751,745   0         156 time_exceed
 cpumap_kthread  2       2,751,749   0         142 time_exceed
 cpumap_kthread  3       2,751,713   0         131 time_exceed
 cpumap_kthread  4       2,751,749   0         128 time_exceed
 cpumap_kthread  total   11,006,957  0         0
 redirect_err    total   0           0

 $ nstat > /dev/null && sleep 1 && nstat | grep UdpNoPorts
 UdpNoPorts                      11042282           0.0

The nstat shows that the Linux network stack is actually now processing,
SKB alloc + free, 11Mpps. The generator was sending with 14Mpps, thus
the XDP-RX program is actually a bottleneck here. And I do see some
drops on the HW level. Thus, 1 CPU was not 100% fast enough.
Thus, let's allocate two CPUs for XDP-RX:

 Running XDP/eBPF prog_num:2
 XDP-cpumap      CPU:to  pps         drop-pps  extra-info
 XDP-RX          0       6,352,578   0         0
 XDP-RX          1       6,352,711   0         0
 XDP-RX          total   12,705,289  0
 cpumap-enqueue  0:2     1,588,156   1,351     0
 cpumap-enqueue  1:2     1,588,174   1,330     0
 cpumap-enqueue  sum:2   3,176,331   2,682     0
 cpumap-enqueue  0:3     1,588,157   994       0
 cpumap-enqueue  1:3     1,588,170   912       0
 cpumap-enqueue  sum:3   3,176,327   1,907     0
 cpumap-enqueue  0:4     1,588,157   529       0
 cpumap-enqueue  1:4     1,588,167   514       0
 cpumap-enqueue  sum:4   3,176,324   1,044     0
 cpumap-enqueue  0:5     1,588,159   625       0
 cpumap-enqueue  1:5     1,588,166   614       0
 cpumap-enqueue  sum:5   3,176,326   1,240     0
 cpumap_kthread  2       3,173,642   0         11257 time_exceed
 cpumap_kthread  3       3,174,423   0         9779 time_exceed
 cpumap_kthread  4       3,175,283   0         3938 time_exceed
 cpumap_kthread  5       3,175,083   0         3120 time_exceed
 cpumap_kthread  total   12,698,432  0         0 (null)
 redirect_err    total   0           0

Below, I'm using ./pktgen_sample04_many_flows.sh, and my generator
machine cannot generate more than 12,682,445 tx_packets/sec. nstat
says: UdpNoPorts 12,698,001 pps. The XDP-RX CPUs actually have 30% idle
CPU cycles, as they "only" handle 6.3Mpps each ;-)

Perf top on a CPU(3) that has to alloc and free SKBs etc.:

 # Overhead  CPU  Symbol
 # ........  ...  .......................................
    15.51%   003  [k] fib_table_lookup
     8.91%   003  [k] cpu_map_kthread_run
     8.04%   003  [k] build_skb
     7.88%   003  [k] page_frag_free
     5.13%   003  [k] kmem_cache_alloc
     4.76%   003  [k] ip_route_input_rcu
     4.59%   003  [k] kmem_cache_free
     4.02%   003  [k] __udp4_lib_rcv
     3.20%   003  [k] fib_validate_source
     3.02%   003  [k] __netif_receive_skb_core
     3.02%   003  [k] udp_v4_early_demux
     2.90%   003  [k] ip_rcv
     2.80%   003  [k] ip_rcv_finish
     2.26%   003  [k] eth_type_trans
     2.23%   003  [k] __build_skb
     2.00%   003  [k] icmp_send
     1.84%   003  [k] __rcu_read_unlock
     1.30%   003  [k] ip_local_deliver_finish
     1.26%   003  [k] netif_receive_skb_internal
     1.17%   003  [k] ip_route_input_noref
     1.11%   003  [k] make_kuid
     1.09%   003  [k] __udp4_lib_lookup
     1.07%   003  [k] skb_release_head_state
     1.04%   003  [k] __rcu_read_lock
     0.95%   003  [k] kfree_skb
     0.89%   003  [k] __local_bh_enable_ip
     0.88%   003  [k] skb_release_data
     0.71%   003  [k] ip_local_deliver
     0.58%   003  [k] netif_receive_skb

 cmdline: perf report --sort cpu,symbol --kallsyms=/proc/kallsyms \
          --no-children -C3 -g none --stdio
On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> On Tue, 26 Sep 2017 21:58:53 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
>> [...]
>>> I'm currently implementing a cpumap type, that transfers raw XDP frames
>>> to another CPU, and the SKB is allocated on the remote CPU. (It
>>> actually works extremely well).
>>
>> Meaning you let all the XDP_PASS packets get processed on a
>> different CPU, so you can reserve the whole CPU just for
>> prefiltering, right?
>
> Yes, exactly. Except I use the XDP_REDIRECT action to steer packets.
> The trick is using the map-flush point, to transfer packets in bulk to
> the remote CPU (single call IPC is too slow), but at the same time
> flush single packets if NAPI didn't see a bulk.
>
>> Do you have some numbers to share at this point, just curious when
>> you mention it works extremely well.
>
> Sure... I've done a lot of benchmarking on this patchset ;-)
> I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
>
> Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> packets) on RX CPU=0. I'm forcing my netperf to hit the same CPU, that
> the 11.9Mpps DDoS attack is hitting.
>
> Running XDP/eBPF prog_num:4
> XDP-cpumap      CPU:to  pps         drop-pps    extra-info
> XDP-RX          0       12,030,471  11,966,982  0
> XDP-RX          total   12,030,471  11,966,982
> cpumap-enqueue  0:2     63,488      0           0
> cpumap-enqueue  sum:2   63,488      0           0
> cpumap_kthread  2       63,488      0           3 time_exceed
> cpumap_kthread  total   63,488      0           0
> redirect_err    total   0           0
>
> $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  1024     1024    10.00    12735.97
> 16384  87380
>
> The netperf TCP_CRR performance is the same, without XDP loaded.

Just curious, could you also try this with RPS enabled (or does this
have RPS enabled)? RPS should effectively do the same thing but higher
in the stack. I'm curious what the delta would be. Might be another
interesting case and fairly easy to set up if you already have the
above scripts.

Thanks,
John

[...]
On Wed, 27 Sep 2017 06:35:40 -0700 John Fastabend <john.fastabend@gmail.com> wrote: > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote: > > On Tue, 26 Sep 2017 21:58:53 +0200 > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote: > >> [...] > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames > >>> to another CPU, and the SKB is allocated on the remote CPU. (It > >>> actually works extremely well). > >> > >> Meaning you let all the XDP_PASS packets get processed on a > >> different CPU, so you can reserve the whole CPU just for > >> prefiltering, right? > > > > Yes, exactly. Except I use the XDP_REDIRECT action to steer packets. > > The trick is using the map-flush point, to transfer packets in bulk to > > the remote CPU (single call IPC is too slow), but at the same time > > flush single packets if NAPI didn't see a bulk. > > > >> Do you have some numbers to share at this point, just curious when > >> you mention it works extremely well. > > > > Sure... I've done a lot of benchmarking on this patchset ;-) > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns) > > > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c > > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen > > packets) on RX CPU=0. I'm forcing my netperf to hit the same CPU, that > > the 11.9Mpps DDoS attack is hitting. 
> > Running XDP/eBPF prog_num:4
> > XDP-cpumap      CPU:to  pps         drop-pps    extra-info
> > XDP-RX          0       12,030,471  11,966,982  0
> > XDP-RX          total   12,030,471  11,966,982
> > cpumap-enqueue  0:2     63,488      0           0
> > cpumap-enqueue  sum:2   63,488      0           0
> > cpumap_kthread  2       63,488      0           3 time_exceed
> > cpumap_kthread  total   63,488      0           0
> > redirect_err    total   0           0
> >
> > $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 16384  87380  1024     1024    10.00    12735.97
> > 16384  87380
> >
> > The netperf TCP_CRR performance is the same, without XDP loaded.
>
> Just curious, could you also try this with RPS enabled (or does this have
> RPS enabled)? RPS should effectively do the same thing but higher in the
> stack. I'm curious what the delta would be. Might be another interesting
> case and fairly easy to set up if you already have the above scripts.

Yes, I'm essentially competing with RPS, thus such a comparison is very
relevant...

This is only a 6-CPU system. I allocate 2 CPUs to RPS receive and let
the other 4 CPUs process packets.

Summary of RPS (Receive Packet Steering) performance:
 * End result is 6.3 Mpps max performance
 * netperf TCP_CRR is 1 trans/sec.
 * Each RX-RPS CPU stalls at ~3.2Mpps.

The full test report, with setup, below:

The mask needed::

 perl -e 'printf "%b\n",0x3C'
 111100

RPS setup::

 sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'

 for N in $(seq 0 5) ; do \
   sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
   sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
   grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
 done

Reduce RX queues to two::

 ethtool -L ixgbe1 combined 2

Align IRQs to CPU numbers::

 $ ~/setup01.sh
 Not root, running with sudo
  --- Disable Ethernet flow-control ---
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
  --- Align IRQs ---
 /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
 /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
 /proc/irq/56/ixgbe1/../smp_affinity_list:0-5

 $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
 /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
 /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c

Generator is sending: 12,715,782 tx_packets /sec

 ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
   -d 172.16.0.2 -t8

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    6346544            0.0
 IpInDelivers                    6346544            0.0
 IpOutRequests                   1020               0.0
 IcmpOutMsgs                     1020               0.0
 IcmpOutDestUnreachs             1020               0.0
 IcmpMsgOutType3                 1020               0.0
 UdpNoPorts                      6346898            0.0
 IpExtInOctets                   291964714          0.0
 IpExtOutOctets                  73440              0.0
 IpExtInNoECTPkts                6347063            0.0

 $ mpstat -P ALL -u -I SCPU -I SUM

 Average:  CPU   %usr  %nice  %sys  %irq  %soft  %idle
 Average:  all   0.00   0.00  0.00  0.42  72.97  26.61
 Average:    0   0.00   0.00  0.00  0.17  99.83   0.00
 Average:    1   0.00   0.00  0.00  0.17  99.83   0.00
 Average:    2   0.00   0.00  0.00  0.67  60.37  38.96
 Average:    3   0.00   0.00  0.00  0.67  58.70  40.64
 Average:    4   0.00   0.00  0.00  0.67  59.53  39.80
 Average:    5   0.00   0.00  0.00  0.67  58.93  40.40

 Average:  CPU     intr/s
 Average:  all  152067.22
 Average:    0   50064.73
 Average:    1   50089.35
 Average:    2   45095.17
 Average:    3   44875.04
 Average:    4   44906.32
 Average:    5   45152.08

 Average:  CPU  TIMER/s  NET_TX/s  NET_RX/s  TASKLET/s  SCHED/s   RCU/s
 Average:    0   609.48      0.17  49431.28       0.00     2.66   21.13
 Average:    1   567.55      0.00  49498.00       0.00     2.66   21.13
 Average:    2   998.34      0.00  43941.60       4.16    82.86   68.22
 Average:    3   540.60      0.17  44140.27       0.00    85.52  108.49
 Average:    4   537.27      0.00  44219.63       0.00    84.53   64.89
 Average:    5   530.78      0.17  44445.59       0.00    85.02   90.52

From mpstat it looks like the RX-RPS CPUs are the bottleneck.

Show adapter(s) (ixgbe1) statistics (ONLY those that changed!)
 Ethtool(ixgbe1) stat:   11109531 (   11,109,531) <= fdir_miss /sec
 Ethtool(ixgbe1) stat:  380632356 (  380,632,356) <= rx_bytes /sec
 Ethtool(ixgbe1) stat:  812792611 (  812,792,611) <= rx_bytes_nic /sec
 Ethtool(ixgbe1) stat:    1753550 (    1,753,550) <= rx_missed_errors /sec
 Ethtool(ixgbe1) stat:    4602487 (    4,602,487) <= rx_no_dma_resources /sec
 Ethtool(ixgbe1) stat:    6343873 (    6,343,873) <= rx_packets /sec
 Ethtool(ixgbe1) stat:   10946441 (   10,946,441) <= rx_pkts_nic /sec
 Ethtool(ixgbe1) stat:  190287853 (  190,287,853) <= rx_queue_0_bytes /sec
 Ethtool(ixgbe1) stat:    3171464 (    3,171,464) <= rx_queue_0_packets /sec
 Ethtool(ixgbe1) stat:  190344503 (  190,344,503) <= rx_queue_1_bytes /sec
 Ethtool(ixgbe1) stat:    3172408 (    3,172,408) <= rx_queue_1_packets /sec

Notice, each RX-CPU can only process 3.1Mpps.

RPS RX-CPU(0):

 # Overhead  CPU  Symbol
 # ........  ...  .......................................
 #
    11.72%  000  [k] ixgbe_poll
    11.29%  000  [k] _raw_spin_lock
    10.35%  000  [k] dev_gro_receive
     8.36%  000  [k] __build_skb
     7.35%  000  [k] __skb_get_hash
     6.22%  000  [k] enqueue_to_backlog
     5.89%  000  [k] __skb_flow_dissect
     4.43%  000  [k] inet_gro_receive
     4.19%  000  [k] ___slab_alloc
     3.90%  000  [k] queued_spin_lock_slowpath
     3.85%  000  [k] kmem_cache_alloc
     3.06%  000  [k] build_skb
     2.66%  000  [k] get_rps_cpu
     2.57%  000  [k] napi_gro_receive
     2.34%  000  [k] eth_type_trans
     1.81%  000  [k] __cmpxchg_double_slab.isra.61
     1.47%  000  [k] ixgbe_alloc_rx_buffers
     1.43%  000  [k] get_partial_node.isra.81
     0.84%  000  [k] swiotlb_sync_single
     0.74%  000  [k] udp4_gro_receive
     0.73%  000  [k] netif_receive_skb_internal
     0.72%  000  [k] udp_gro_receive
     0.63%  000  [k] skb_gro_reset_offset
     0.49%  000  [k] __skb_flow_get_ports
     0.48%  000  [k] llist_add_batch
     0.36%  000  [k] swiotlb_sync_single_for_cpu
     0.34%  000  [k] __slab_alloc

Remote RPS-CPU(3) getting packets::

 # Overhead  CPU  Symbol
 # ........  ...  ..............................................
 #
    33.02%  003  [k] poll_idle
    10.99%  003  [k] __netif_receive_skb_core
    10.45%  003  [k] page_frag_free
     8.49%  003  [k] ip_rcv
     4.19%  003  [k] fib_table_lookup
     2.84%  003  [k] __udp4_lib_rcv
     2.81%  003  [k] __slab_free
     2.23%  003  [k] __udp4_lib_lookup
     2.09%  003  [k] ip_route_input_rcu
     2.07%  003  [k] kmem_cache_free
     2.06%  003  [k] udp_v4_early_demux
     1.73%  003  [k] ip_rcv_finish
     1.44%  003  [k] process_backlog
     1.32%  003  [k] icmp_send
     1.30%  003  [k] cmpxchg_double_slab.isra.73
     0.95%  003  [k] intel_idle
     0.88%  003  [k] _raw_spin_lock
     0.84%  003  [k] fib_validate_source
     0.79%  003  [k] ip_local_deliver_finish
     0.67%  003  [k] ip_local_deliver
     0.56%  003  [k] skb_release_data
     0.53%  003  [k] unfreeze_partials.isra.80
     0.51%  003  [k] skb_release_head_state
     0.44%  003  [k] kfree_skb
     0.44%  003  [k] queued_spin_lock_slowpath
     0.44%  003  [k] __cmpxchg_double_slab.isra.61

 $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -T5,5 -- -r 1024,1024
 MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0
 AF_INET to 172.16.0.2 () port 0 AF_INET : histogram : demo : cpu bind
 Local /Remote
 Socket Size   Request  Resp.   Elapsed  Trans.
 Send   Recv   Size     Size    Time     Rate
 bytes  Bytes  bytes    bytes   secs.    per sec

 16384  87380  1024     1024    10.00    1.10
 16384  87380
On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote: > On Wed, 27 Sep 2017 06:35:40 -0700 > John Fastabend <john.fastabend@gmail.com> wrote: > > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote: > > > On Tue, 26 Sep 2017 21:58:53 +0200 > > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote: > > >> [...] > > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames > > >>> to another CPU, and the SKB is allocated on the remote CPU. (It > > >>> actually works extremely well). > > >> > > >> Meaning you let all the XDP_PASS packets get processed on a > > >> different CPU, so you can reserve the whole CPU just for > > >> prefiltering, right? > > > > > > Yes, exactly. Except I use the XDP_REDIRECT action to steer packets. > > > The trick is using the map-flush point, to transfer packets in bulk to > > > the remote CPU (single call IPC is too slow), but at the same time > > > flush single packets if NAPI didn't see a bulk. > > > > > >> Do you have some numbers to share at this point, just curious when > > >> you mention it works extremely well. > > > > > > Sure... I've done a lot of benchmarking on this patchset ;-) > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect > > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have > > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns) > > > > > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c > > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c > > > > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen > > > packets) on RX CPU=0. I'm forcing my netperf to hit the same CPU, that > > > the 11.9Mpps DDoS attack is hitting. 
> > > > > > Running XDP/eBPF prog_num:4 > > > XDP-cpumap CPU:to pps drop-pps extra-info > > > XDP-RX 0 12,030,471 11,966,982 0 > > > XDP-RX total 12,030,471 11,966,982 > > > cpumap-enqueue 0:2 63,488 0 0 > > > cpumap-enqueue sum:2 63,488 0 0 > > > cpumap_kthread 2 63,488 0 3 time_exceed > > > cpumap_kthread total 63,488 0 0 > > > redirect_err total 0 0 > > > > > > $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024 > > > Local /Remote > > > Socket Size Request Resp. Elapsed Trans. > > > Send Recv Size Size Time Rate > > > bytes Bytes bytes bytes secs. per sec > > > > > > 16384 87380 1024 1024 10.00 12735.97 > > > 16384 87380 > > > > > > The netperf TCP_CRR performance is the same, without XDP loaded. > > > > > > > Just curious could you also try this with RPS enabled (or does this have > > RPS enabled). RPS should effectively do the same thing but higher in the > > stack. I'm curious what the delta would be. Might be another interesting > > case and fairly easy to setup if you already have the above scripts. > > Yes, I'm essentially competing with RSP, thus such a comparison is very > relevant... > > This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let > other 4 CPUS process packet. > > Summary of RPS (Receive Packet Steering) performance: > * End result is 6.3 Mpps max performance > * netperf TCP_CRR is 1 trans/sec. > * Each RX-RPS CPU stall at ~3.2Mpps. > > The full test report below with setup: > > The mask needed:: > > perl -e 'printf "%b\n",0x3C' > 111100 > > RPS setup:: > > sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries' > > for N in $(seq 0 5) ; do \ > sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \ > sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \ > grep -H . 
/sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \ > done > > Reduce RX queues to two :: > > ethtool -L ixgbe1 combined 2 > > IRQ align to CPU numbers:: > > $ ~/setup01.sh > Not root, running with sudo > --- Disable Ethernet flow-control --- > rx unmodified, ignoring > tx unmodified, ignoring > no pause parameters changed, aborting > rx unmodified, ignoring > tx unmodified, ignoring > no pause parameters changed, aborting > --- Align IRQs --- > /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0 > /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1 > /proc/irq/56/ixgbe1/../smp_affinity_list:0-5 > > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c > > Generator is sending: 12,715,782 tx_packets /sec > > ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \ > -d 172.16.0.2 -t8 > > $ nstat > /dev/null && sleep 1 && nstat > #kernel > IpInReceives 6346544 0.0 > IpInDelivers 6346544 0.0 > IpOutRequests 1020 0.0 > IcmpOutMsgs 1020 0.0 > IcmpOutDestUnreachs 1020 0.0 > IcmpMsgOutType3 1020 0.0 > UdpNoPorts 6346898 0.0 > IpExtInOctets 291964714 0.0 > IpExtOutOctets 73440 0.0 > IpExtInNoECTPkts 6347063 0.0 > > $ mpstat -P ALL -u -I SCPU -I SUM > > Average: CPU %usr %nice %sys %irq %soft %idle > Average: all 0.00 0.00 0.00 0.42 72.97 26.61 > Average: 0 0.00 0.00 0.00 0.17 99.83 0.00 > Average: 1 0.00 0.00 0.00 0.17 99.83 0.00 > Average: 2 0.00 0.00 0.00 0.67 60.37 38.96 > Average: 3 0.00 0.00 0.00 0.67 58.70 40.64 > Average: 4 0.00 0.00 0.00 0.67 59.53 39.80 > Average: 5 0.00 0.00 0.00 0.67 58.93 40.40 > > Average: CPU intr/s > Average: all 152067.22 > Average: 0 50064.73 > Average: 1 50089.35 > Average: 2 45095.17 > Average: 3 44875.04 > Average: 4 44906.32 > Average: 5 45152.08 > > Average: CPU TIMER/s NET_TX/s NET_RX/s TASKLET/s SCHED/s RCU/s > Average: 0 609.48 0.17 49431.28 0.00 2.66 21.13 > Average: 1 567.55 0.00 49498.00 0.00 2.66 21.13 > Average: 2 998.34 0.00 
43941.60 4.16 82.86 68.22 > Average: 3 540.60 0.17 44140.27 0.00 85.52 108.49 > Average: 4 537.27 0.00 44219.63 0.00 84.53 64.89 > Average: 5 530.78 0.17 44445.59 0.00 85.02 90.52 > > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck. > > Show adapter(s) (ixgbe1) statistics (ONLY that changed!) > Ethtool(ixgbe1) stat: 11109531 ( 11,109,531) <= fdir_miss /sec > Ethtool(ixgbe1) stat: 380632356 ( 380,632,356) <= rx_bytes /sec > Ethtool(ixgbe1) stat: 812792611 ( 812,792,611) <= rx_bytes_nic /sec > Ethtool(ixgbe1) stat: 1753550 ( 1,753,550) <= rx_missed_errors /sec > Ethtool(ixgbe1) stat: 4602487 ( 4,602,487) <= rx_no_dma_resources /sec > Ethtool(ixgbe1) stat: 6343873 ( 6,343,873) <= rx_packets /sec > Ethtool(ixgbe1) stat: 10946441 ( 10,946,441) <= rx_pkts_nic /sec > Ethtool(ixgbe1) stat: 190287853 ( 190,287,853) <= rx_queue_0_bytes /sec > Ethtool(ixgbe1) stat: 3171464 ( 3,171,464) <= rx_queue_0_packets /sec > Ethtool(ixgbe1) stat: 190344503 ( 190,344,503) <= rx_queue_1_bytes /sec > Ethtool(ixgbe1) stat: 3172408 ( 3,172,408) <= rx_queue_1_packets /sec > > Notice, each RX-CPU can only process 3.1Mpps. > > RPS RX-CPU(0): > > # Overhead CPU Symbol > # ........ ... ....................................... 
> #
>    11.72%  000  [k] ixgbe_poll
>    11.29%  000  [k] _raw_spin_lock
>    10.35%  000  [k] dev_gro_receive
>     8.36%  000  [k] __build_skb
>     7.35%  000  [k] __skb_get_hash
>     6.22%  000  [k] enqueue_to_backlog
>     5.89%  000  [k] __skb_flow_dissect
>     4.43%  000  [k] inet_gro_receive
>     4.19%  000  [k] ___slab_alloc
>     3.90%  000  [k] queued_spin_lock_slowpath
>     3.85%  000  [k] kmem_cache_alloc
>     3.06%  000  [k] build_skb
>     2.66%  000  [k] get_rps_cpu
>     2.57%  000  [k] napi_gro_receive
>     2.34%  000  [k] eth_type_trans
>     1.81%  000  [k] __cmpxchg_double_slab.isra.61
>     1.47%  000  [k] ixgbe_alloc_rx_buffers
>     1.43%  000  [k] get_partial_node.isra.81
>     0.84%  000  [k] swiotlb_sync_single
>     0.74%  000  [k] udp4_gro_receive
>     0.73%  000  [k] netif_receive_skb_internal
>     0.72%  000  [k] udp_gro_receive
>     0.63%  000  [k] skb_gro_reset_offset
>     0.49%  000  [k] __skb_flow_get_ports
>     0.48%  000  [k] llist_add_batch
>     0.36%  000  [k] swiotlb_sync_single_for_cpu
>     0.34%  000  [k] __slab_alloc
>
> Remote RPS-CPU(3) getting packets::
>
> # Overhead  CPU  Symbol
> # ........  ...  ..............................................
> #
>    33.02%  003  [k] poll_idle
>    10.99%  003  [k] __netif_receive_skb_core
>    10.45%  003  [k] page_frag_free
>     8.49%  003  [k] ip_rcv
>     4.19%  003  [k] fib_table_lookup
>     2.84%  003  [k] __udp4_lib_rcv
>     2.81%  003  [k] __slab_free
>     2.23%  003  [k] __udp4_lib_lookup
>     2.09%  003  [k] ip_route_input_rcu
>     2.07%  003  [k] kmem_cache_free
>     2.06%  003  [k] udp_v4_early_demux
>     1.73%  003  [k] ip_rcv_finish

Very interesting data. So the above perf report compares to this one
from xdp-redirect-cpu: perf top on the CPU(3) that has to alloc and
free SKBs etc.

 # Overhead  CPU  Symbol
 # ........  ...  .......................................
 #
    15.51%  003  [k] fib_table_lookup
     8.91%  003  [k] cpu_map_kthread_run
     8.04%  003  [k] build_skb
     7.88%  003  [k] page_frag_free
     5.13%  003  [k] kmem_cache_alloc
     4.76%  003  [k] ip_route_input_rcu
     4.59%  003  [k] kmem_cache_free
     4.02%  003  [k] __udp4_lib_rcv
     3.20%  003  [k] fib_validate_source
     3.02%  003  [k] __netif_receive_skb_core
     3.02%  003  [k] udp_v4_early_demux
     2.90%  003  [k] ip_rcv
     2.80%  003  [k] ip_rcv_finish

right? And in the RPS case the consumer CPU is 33% idle, whereas with
redirect-cpu you can load it up all the way.

Am I interpreting all this correctly: with RPS, CPU 0 cannot distribute
the packets to the other CPUs fast enough, and that is the bottleneck,
whereas with redirect-cpu you do the early packet distribution before
the skb alloc? So, in other words, with redirect-cpu all consumer CPUs
do the skb alloc, while with RPS CPU 0 allocates the skbs for all of
them, and that is where the 6M->12M performance gain comes from?
On 9/26/17 10:21 AM, Andy Gospodarek wrote: > On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >> [...] >>> First, thanks for this detailed description. It was helpful to read >>> along with the patches. >>> >>> My only concern about this area being generic is that you are now in a >>> state where any bpf program must know about all the bpf programs in the >>> receive pipeline before it can properly parse what is stored in the >>> meta-data and add it to an skb (or perform any other action). >>> Especially if each program adds it's own meta-data along the way. >>> >>> Maybe this isn't a big concern based on the number of users of this >>> today, but it just starts to seem like a concern as there are these >>> hints being passed between layers that are challenging to track due to a >>> lack of a standard format for passing data between. >> >> Btw, we do have similar kind of programmable scratch buffer also today >> wrt skb cb[] that you can program from tc side, the perf ring buffer, >> which doesn't have any fixed layout for the slots, or a per-cpu map >> where you can transfer data between tail calls for example, then tail >> calls themselves that need to coordinate, or simply mangling of packets >> itself if you will, but more below to your use case ... >> >>> The main reason I bring this up is that Michael and I had discussed and >>> designed a way for drivers to communicate between each other that rx >>> resources could be freed after a tx completion on an XDP_REDIRECT >>> action. Much like this code, it involved adding an new element to >>> struct xdp_md that could point to the important information. Now that >>> there is a generic way to handle this, it would seem nice to be able to >>> leverage it, but I'm not sure how reliable this meta-data area would be >>> without the ability to mark it in some manner. 
>>> >>> For additional background, the minimum amount of data needed in the case >>> Michael and I were discussing was really 2 words. One to serve as a >>> pointer to an rx_ring structure and one to have a counter to the rx >>> producer entry. This data could be acessed by the driver processing the >>> tx completions and callback to the driver that received the frame off the wire >>> to perform any needed processing. (For those curious this would also require a >>> new callback/netdev op to act on this data stored in the XDP buffer.) >> >> What you describe above doesn't seem to be fitting to the use-case of >> this set, meaning the area here is fully programmable out of the BPF >> program, the infrastructure you're describing is some sort of means of >> communication between drivers for the XDP_REDIRECT, and should be >> outside of the control of the BPF program to mangle. > > OK, I understand that perspective. I think saying this is really meant > as a BPF<->BPF communication channel for now is fine. > >> You could probably reuse the base infra here and make a part of that >> inaccessible for the program with some sort of a fixed layout, but I >> haven't seen your code yet to be able to fully judge. Intention here >> is to allow for programmability within the BPF prog in a generic way, >> such that based on the use-case it can be populated in specific ways >> and propagated to the skb w/o having to define a fixed layout and >> bloat xdp_buff all the way to an skb while still retaining all the >> flexibility. > > Some level of reuse might be proper, but I'd rather it be explicit for > my use since it's not exclusively something that will need to be used by > a BPF prog, but rather the driver. I'll produce some patches this week > for reference. Sorry for chiming in late, I've been offline. We're looking to add some functionality from driver to XDP inside this xdp_buff->data_meta region. 
We want to assign it to an opaque
structure that would be specific per driver (think of a flex descriptor
coming out of the hardware). We'd like to pass these offloaded
computations into XDP programs to help accelerate them, such as packet
type, where headers are located, etc. It's similar to Jesper's RFC
patches back in May when passing through the mlx Rx descriptor to XDP.

This is actually what a few of us are planning to present at NetDev 2.2
in November. If you're hoping to restrict this headroom in the xdp_buff
to an exclusive use case with XDP_REDIRECT, then I'd like to discuss
that further.

-PJ
On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter <peter.waskiewicz.jr@intel.com> wrote: > On 9/26/17 10:21 AM, Andy Gospodarek wrote: >> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >>> [...] >>>> First, thanks for this detailed description. It was helpful to read >>>> along with the patches. >>>> >>>> My only concern about this area being generic is that you are now in a >>>> state where any bpf program must know about all the bpf programs in the >>>> receive pipeline before it can properly parse what is stored in the >>>> meta-data and add it to an skb (or perform any other action). >>>> Especially if each program adds it's own meta-data along the way. >>>> >>>> Maybe this isn't a big concern based on the number of users of this >>>> today, but it just starts to seem like a concern as there are these >>>> hints being passed between layers that are challenging to track due to a >>>> lack of a standard format for passing data between. >>> >>> Btw, we do have similar kind of programmable scratch buffer also today >>> wrt skb cb[] that you can program from tc side, the perf ring buffer, >>> which doesn't have any fixed layout for the slots, or a per-cpu map >>> where you can transfer data between tail calls for example, then tail >>> calls themselves that need to coordinate, or simply mangling of packets >>> itself if you will, but more below to your use case ... >>> >>>> The main reason I bring this up is that Michael and I had discussed and >>>> designed a way for drivers to communicate between each other that rx >>>> resources could be freed after a tx completion on an XDP_REDIRECT >>>> action. Much like this code, it involved adding an new element to >>>> struct xdp_md that could point to the important information. 
Now that >>>> there is a generic way to handle this, it would seem nice to be able to >>>> leverage it, but I'm not sure how reliable this meta-data area would be >>>> without the ability to mark it in some manner. >>>> >>>> For additional background, the minimum amount of data needed in the case >>>> Michael and I were discussing was really 2 words. One to serve as a >>>> pointer to an rx_ring structure and one to have a counter to the rx >>>> producer entry. This data could be acessed by the driver processing the >>>> tx completions and callback to the driver that received the frame off the wire >>>> to perform any needed processing. (For those curious this would also require a >>>> new callback/netdev op to act on this data stored in the XDP buffer.) >>> >>> What you describe above doesn't seem to be fitting to the use-case of >>> this set, meaning the area here is fully programmable out of the BPF >>> program, the infrastructure you're describing is some sort of means of >>> communication between drivers for the XDP_REDIRECT, and should be >>> outside of the control of the BPF program to mangle. >> >> OK, I understand that perspective. I think saying this is really meant >> as a BPF<->BPF communication channel for now is fine. >> >>> You could probably reuse the base infra here and make a part of that >>> inaccessible for the program with some sort of a fixed layout, but I >>> haven't seen your code yet to be able to fully judge. Intention here >>> is to allow for programmability within the BPF prog in a generic way, >>> such that based on the use-case it can be populated in specific ways >>> and propagated to the skb w/o having to define a fixed layout and >>> bloat xdp_buff all the way to an skb while still retaining all the >>> flexibility. >> >> Some level of reuse might be proper, but I'd rather it be explicit for >> my use since it's not exclusively something that will need to be used by >> a BPF prog, but rather the driver. 
>> I'll produce some patches this week
>> for reference.
>
> Sorry for chiming in late, I've been offline.
>
> We're looking to add some functionality from driver to XDP inside this
> xdp_buff->data_meta region. We want to assign it to an opaque
> structure that would be specific per driver (think of a flex descriptor
> coming out of the hardware). We'd like to pass these offloaded
> computations into XDP programs to help accelerate them, such as packet
> type, where headers are located, etc. It's similar to Jesper's RFC
> patches back in May when passing through the mlx Rx descriptor to XDP.
>
> This is actually what a few of us are planning to present at NetDev 2.2
> in November. If you're hoping to restrict this headroom in the xdp_buff
> to an exclusive use case with XDP_REDIRECT, then I'd like to discuss
> that further.

No sweat, PJ, thanks for replying. I saw the notes for your accepted
session and I'm looking forward to it.

John's suggestion earlier in the thread was actually similar to the
conclusion I reached when thinking about Daniel's patch a bit more.
(I like John's better, though, as it doesn't get constrained by UAPI.)
Since redirect actions happen at a point where no other programs will
run on the buffer, that space can be used for this redirect data and
there are no conflicts.

It sounds like the idea behind your proposal includes populating some
data into the buffer before the XDP program is executed so that it can
be used by the program. Would this data be useful later in the driver
or stack, or are you just hoping to accelerate processing of frames in
the BPF program?

If the headroom needed for redirect info was only added after it was
clear the redirect action was needed, would this conflict with the
information you are trying to provide? I had planned to add this just
after the XDP_REDIRECT action was selected, or at the end of the
driver's ndo_xdp_xmit function -- it seems like it would not conflict.

(There's also Jesper's series from today -- I've seen it but have not
had time to fully grok all of those changes.)

Thoughts?
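For concreteness, the program-visible side of the data_meta area under discussion looks roughly like the following XDP sketch. The bpf_xdp_adjust_meta() helper and the xdp_md data_meta field are from Daniel's series; the include paths follow current libbpf conventions, and the 4-byte value written is purely illustrative:

```c
/* Sketch: reserve 4 bytes of meta data in front of the packet and
 * store a value there for a later consumer (e.g. a clsact program). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_store_meta(struct xdp_md *ctx)
{
    __u32 *meta;
    void *data;

    /* A negative offset grows the meta area towards data_hard_start;
     * the helper fails if there is not enough headroom or the size
     * is not a multiple of 4 (max 32 bytes). */
    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
        return XDP_ABORTED;

    meta = (void *)(long)ctx->data_meta;
    data = (void *)(long)ctx->data;
    if ((void *)(meta + 1) > data)   /* bounds check for the verifier */
        return XDP_ABORTED;

    *meta = 0xcafe;                  /* e.g. a classification result */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

On XDP_PASS, the skb built from this buffer carries the meta area along, so a clsact ingress program on the same device can read it back.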
On 9/28/17 12:59 PM, Andy Gospodarek wrote: > On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter > <peter.waskiewicz.jr@intel.com> wrote: >> On 9/26/17 10:21 AM, Andy Gospodarek wrote: >>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >>>> [...] >>>>> First, thanks for this detailed description. It was helpful to read >>>>> along with the patches. >>>>> >>>>> My only concern about this area being generic is that you are now in a >>>>> state where any bpf program must know about all the bpf programs in the >>>>> receive pipeline before it can properly parse what is stored in the >>>>> meta-data and add it to an skb (or perform any other action). >>>>> Especially if each program adds it's own meta-data along the way. >>>>> >>>>> Maybe this isn't a big concern based on the number of users of this >>>>> today, but it just starts to seem like a concern as there are these >>>>> hints being passed between layers that are challenging to track due to a >>>>> lack of a standard format for passing data between. >>>> >>>> Btw, we do have similar kind of programmable scratch buffer also today >>>> wrt skb cb[] that you can program from tc side, the perf ring buffer, >>>> which doesn't have any fixed layout for the slots, or a per-cpu map >>>> where you can transfer data between tail calls for example, then tail >>>> calls themselves that need to coordinate, or simply mangling of packets >>>> itself if you will, but more below to your use case ... >>>> >>>>> The main reason I bring this up is that Michael and I had discussed and >>>>> designed a way for drivers to communicate between each other that rx >>>>> resources could be freed after a tx completion on an XDP_REDIRECT >>>>> action. Much like this code, it involved adding an new element to >>>>> struct xdp_md that could point to the important information. 
Now that >>>>> there is a generic way to handle this, it would seem nice to be able to >>>>> leverage it, but I'm not sure how reliable this meta-data area would be >>>>> without the ability to mark it in some manner. >>>>> >>>>> For additional background, the minimum amount of data needed in the case >>>>> Michael and I were discussing was really 2 words. One to serve as a >>>>> pointer to an rx_ring structure and one to have a counter to the rx >>>>> producer entry. This data could be acessed by the driver processing the >>>>> tx completions and callback to the driver that received the frame off the wire >>>>> to perform any needed processing. (For those curious this would also require a >>>>> new callback/netdev op to act on this data stored in the XDP buffer.) >>>> >>>> What you describe above doesn't seem to be fitting to the use-case of >>>> this set, meaning the area here is fully programmable out of the BPF >>>> program, the infrastructure you're describing is some sort of means of >>>> communication between drivers for the XDP_REDIRECT, and should be >>>> outside of the control of the BPF program to mangle. >>> >>> OK, I understand that perspective. I think saying this is really meant >>> as a BPF<->BPF communication channel for now is fine. >>> >>>> You could probably reuse the base infra here and make a part of that >>>> inaccessible for the program with some sort of a fixed layout, but I >>>> haven't seen your code yet to be able to fully judge. Intention here >>>> is to allow for programmability within the BPF prog in a generic way, >>>> such that based on the use-case it can be populated in specific ways >>>> and propagated to the skb w/o having to define a fixed layout and >>>> bloat xdp_buff all the way to an skb while still retaining all the >>>> flexibility. 
>>> >>> Some level of reuse might be proper, but I'd rather it be explicit for >>> my use since it's not exclusively something that will need to be used by >>> a BPF prog, but rather the driver. I'll produce some patches this week >>> for reference. >> >> Sorry for chiming in late, I've been offline. >> >> We're looking to add some functionality from driver to XDP inside this >> xdp_buff->data_meta region. We want to assign it to an opaque >> structure, that would be specific per driver (think of a flex descriptor >> coming out of the hardware). We'd like to pass these offloaded >> computations into XDP programs to help accelerate them, such as packet >> type, where headers are located, etc. It's similar to Jesper's RFC >> patches back in May when passing through the mlx Rx descriptor to XDP. >> >> This is actually what a few of us are planning to present at NetDev 2.2 >> in November. If you're hoping to restrict this headroom in the xdp_buff >> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss >> that further. >> > > No sweat, PJ, thanks for replying. I saw the notes for your accepted > session and I'm looking forward to it. > > John's suggestion earlier in the thread was actually similar to the > conclusion I reached when thinking about Daniel's patch a bit more. > (I like John's better though as it doesn't get constrained by UAPI.) > Since redirect actions happen at a point where no other programs will > run on the buffer, that space can be used for this redirect data and > there are no conflicts. Ah, yes, John and I spoke about this at Plumber's and this is basically what we came to as well. A set of helpers that won't have to be in UAPI, but they will be potentially vendor-specific to extract the meta-data hints out for the XDP program to use. > It sounds like the idea behind your proposal includes populating some > data into the buffer before the XDP program is executed so that it can > be used by the program. 
Would this data be useful later in the driver > or stack or are you just hoping to accelerate processing of frames in > the BPF program? Right now we're thinking it would only be useful for XDP programs to execute things quicker, i.e. not have to compute things that are already computed by the hardware (rxhash, ptype, header locations, etc.). I don't have any plans to pass this data off elsewhere in the stack or back to the driver at this point. > If the headroom needed for redirect info was only added after it was > clear the redirect action was needed, would this conflict with the > information you are trying to provide? I had planned to add this just > after the XDP_REDIRECT action was selected or at the end of the > driver's ndo_xdp_xmit function -- it seems like it would not conflict. I'm pretty sure I misunderstood what you were going after with XDP_REDIRECT reserving the headroom. Our use case (patches coming in a few weeks) will populate the headroom coming out of the driver to XDP, and then once the XDP program extracts whatever hints it wants via helpers, I fully expect that area in the headroom to get stomped by something else. If we want to send any of that hint data up farther, we'll already have it extracted via the helpers, and the eBPF program can happily assign it to wherever in the outbound metadata area. > (There's also Jesper's series from today -- I've seen it but have not > had time to fully grok all of those changes.) I'm also working through my inbox to get to that series. I have some email to catch up on... Thanks Andy, -PJ
[...] > I'm pretty sure I misunderstood what you were going after with > XDP_REDIRECT reserving the headroom. Our use case (patches coming in a > few weeks) will populate the headroom coming out of the driver to XDP, > and then once the XDP program extracts whatever hints it wants via > helpers, I fully expect that area in the headroom to get stomped by > something else. If we want to send any of that hint data up farther, > we'll already have it extracted via the helpers, and the eBPF program > can happily assign it to wherever in the outbound metadata area. In case it's not obvious, with the latest xdp metadata patches the outbound metadata can then be pushed into skb fields via a tc_cls program if needed. .John > >> (There's also Jesper's series from today -- I've seen it but have not >> had time to fully grok all of those changes.) > > I'm also working through my inbox to get to that series. I have some > email to catch up on... > > Thanks Andy, > -PJ >
On 09/28/2017 10:52 PM, Waskiewicz Jr, Peter wrote: > On 9/28/17 12:59 PM, Andy Gospodarek wrote: >> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter >> <peter.waskiewicz.jr@intel.com> wrote: >>> On 9/26/17 10:21 AM, Andy Gospodarek wrote: >>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >>>>> [...] >>>>>> First, thanks for this detailed description. It was helpful to read >>>>>> along with the patches. >>>>>> >>>>>> My only concern about this area being generic is that you are now in a >>>>>> state where any bpf program must know about all the bpf programs in the >>>>>> receive pipeline before it can properly parse what is stored in the >>>>>> meta-data and add it to an skb (or perform any other action). >>>>>> Especially if each program adds it's own meta-data along the way. >>>>>> >>>>>> Maybe this isn't a big concern based on the number of users of this >>>>>> today, but it just starts to seem like a concern as there are these >>>>>> hints being passed between layers that are challenging to track due to a >>>>>> lack of a standard format for passing data between. >>>>> >>>>> Btw, we do have similar kind of programmable scratch buffer also today >>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer, >>>>> which doesn't have any fixed layout for the slots, or a per-cpu map >>>>> where you can transfer data between tail calls for example, then tail >>>>> calls themselves that need to coordinate, or simply mangling of packets >>>>> itself if you will, but more below to your use case ... >>>>> >>>>>> The main reason I bring this up is that Michael and I had discussed and >>>>>> designed a way for drivers to communicate between each other that rx >>>>>> resources could be freed after a tx completion on an XDP_REDIRECT >>>>>> action. Much like this code, it involved adding an new element to >>>>>> struct xdp_md that could point to the important information. 
Now that >>>>>> there is a generic way to handle this, it would seem nice to be able to >>>>>> leverage it, but I'm not sure how reliable this meta-data area would be >>>>>> without the ability to mark it in some manner. >>>>>> >>>>>> For additional background, the minimum amount of data needed in the case >>>>>> Michael and I were discussing was really 2 words. One to serve as a >>>>>> pointer to an rx_ring structure and one to have a counter to the rx >>>>>> producer entry. This data could be acessed by the driver processing the >>>>>> tx completions and callback to the driver that received the frame off the wire >>>>>> to perform any needed processing. (For those curious this would also require a >>>>>> new callback/netdev op to act on this data stored in the XDP buffer.) >>>>> >>>>> What you describe above doesn't seem to be fitting to the use-case of >>>>> this set, meaning the area here is fully programmable out of the BPF >>>>> program, the infrastructure you're describing is some sort of means of >>>>> communication between drivers for the XDP_REDIRECT, and should be >>>>> outside of the control of the BPF program to mangle. >>>> >>>> OK, I understand that perspective. I think saying this is really meant >>>> as a BPF<->BPF communication channel for now is fine. >>>> >>>>> You could probably reuse the base infra here and make a part of that >>>>> inaccessible for the program with some sort of a fixed layout, but I >>>>> haven't seen your code yet to be able to fully judge. Intention here >>>>> is to allow for programmability within the BPF prog in a generic way, >>>>> such that based on the use-case it can be populated in specific ways >>>>> and propagated to the skb w/o having to define a fixed layout and >>>>> bloat xdp_buff all the way to an skb while still retaining all the >>>>> flexibility. 
>>>> >>>> Some level of reuse might be proper, but I'd rather it be explicit for >>>> my use since it's not exclusively something that will need to be used by >>>> a BPF prog, but rather the driver. I'll produce some patches this week >>>> for reference. >>> >>> Sorry for chiming in late, I've been offline. >>> >>> We're looking to add some functionality from driver to XDP inside this >>> xdp_buff->data_meta region. We want to assign it to an opaque >>> structure, that would be specific per driver (think of a flex descriptor >>> coming out of the hardware). We'd like to pass these offloaded >>> computations into XDP programs to help accelerate them, such as packet >>> type, where headers are located, etc. It's similar to Jesper's RFC >>> patches back in May when passing through the mlx Rx descriptor to XDP. >>> >>> This is actually what a few of us are planning to present at NetDev 2.2 >>> in November. If you're hoping to restrict this headroom in the xdp_buff >>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss >>> that further. >> >> No sweat, PJ, thanks for replying. I saw the notes for your accepted >> session and I'm looking forward to it. >> >> John's suggestion earlier in the thread was actually similar to the >> conclusion I reached when thinking about Daniel's patch a bit more. >> (I like John's better though as it doesn't get constrained by UAPI.) >> Since redirect actions happen at a point where no other programs will >> run on the buffer, that space can be used for this redirect data and >> there are no conflicts. Yep fully agree, it's not read anywhere else anymore or could go up the stack where we'd read it out again, so that's the best solution for your use-case moving forward, Andy. I do like that we don't expose to uapi. > Ah, yes, John and I spoke about this at Plumber's and this is basically > what we came to as well. 
A set of helpers that won't have to be in > UAPI, but they will be potentially vendor-specific to extract the > meta-data hints out for the XDP program to use. > >> It sounds like the idea behind your proposal includes populating some >> data into the buffer before the XDP program is executed so that it can >> be used by the program. Would this data be useful later in the driver >> or stack or are you just hoping to accelerate processing of frames in >> the BPF program? > > Right now we're thinking it would only be useful for XDP programs to > execute things quicker, i.e. not have to compute things that are already > computed by the hardware (rxhash, ptype, header locations, etc.). I > don't have any plans to pass this data off elsewhere in the stack or > back to the driver at this point. > >> If the headroom needed for redirect info was only added after it was >> clear the redirect action was needed, would this conflict with the >> information you are trying to provide? I had planned to add this just >> after the action was XDP_REDIRECT was selected or at the end of the >> driver's ndo_xdp_xmit function -- it seems like it would not conflict. > > I'm pretty sure I misunderstood what you were going after with > XDP_REDIRECT reserving the headroom. Our use case (patches coming in a > few weeks) will populate the headroom coming out of the driver to XDP, > and then once the XDP program extracts whatever hints it wants via > helpers, I fully expect that area in the headroom to get stomped by > something else. If we want to send any of that hint data up farther, > we'll already have it extracted via the helpers, and the eBPF program > can happily assign it to wherever in the outbound metadata area. Sure, these two are compatible with each other; in your case it's populated before the prog is called, and the prog can use it while processing, in Andy's case it's populated after the prog was called when we need to redirect, so both fine. Thanks, Daniel
On 9/28/17 2:23 PM, John Fastabend wrote: > [...] > >> I'm pretty sure I misunderstood what you were going after with >> XDP_REDIRECT reserving the headroom. Our use case (patches coming in a >> few weeks) will populate the headroom coming out of the driver to XDP, >> and then once the XDP program extracts whatever hints it wants via >> helpers, I fully expect that area in the headroom to get stomped by >> something else. If we want to send any of that hint data up farther, >> we'll already have it extracted via the helpers, and the eBPF program >> can happily assign it to wherever in the outbound metadata area. > > In case its not obvious with the latest xdp metadata patches the outbound > metadata can then be pushed into skb fields via a tc_cls program if needed. Yes, that was what I was alluding to with "can happily assign it to wherever." The patches we're working on are driver->XDP, then anything else using the latest meta-data patches would be XDP->anywhere else. So I don't think we're going to step on any toes. Thanks John, -PJ
On Wed, 27 Sep 2017 10:32:36 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote: > > On Wed, 27 Sep 2017 06:35:40 -0700 > > John Fastabend <john.fastabend@gmail.com> wrote: > > > > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote: > > > > On Tue, 26 Sep 2017 21:58:53 +0200 > > > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > > > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote: > > > >> [...] > > > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames > > > >>> to another CPU, and the SKB is allocated on the remote CPU. (It > > > >>> actually works extremely well). > > > >> > > > >> Meaning you let all the XDP_PASS packets get processed on a > > > >> different CPU, so you can reserve the whole CPU just for > > > >> prefiltering, right? > > > > > > > > Yes, exactly. Except I use the XDP_REDIRECT action to steer packets. > > > > The trick is using the map-flush point, to transfer packets in bulk to > > > > the remote CPU (single call IPC is too slow), but at the same time > > > > flush single packets if NAPI didn't see a bulk. > > > > > > > >> Do you have some numbers to share at this point, just curious when > > > >> you mention it works extremely well. > > > > > > > > Sure... I've done a lot of benchmarking on this patchset ;-) > > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect > > > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have > > > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns) > > > > > > > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c > > > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c > > > > > > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen > > > > packets) on RX CPU=0. 
I'm forcing my netperf to hit the same CPU, that > > > > the 11.9Mpps DDoS attack is hitting. > > > > > > > > Running XDP/eBPF prog_num:4 > > > > XDP-cpumap CPU:to pps drop-pps extra-info > > > > XDP-RX 0 12,030,471 11,966,982 0 > > > > XDP-RX total 12,030,471 11,966,982 > > > > cpumap-enqueue 0:2 63,488 0 0 > > > > cpumap-enqueue sum:2 63,488 0 0 > > > > cpumap_kthread 2 63,488 0 3 time_exceed > > > > cpumap_kthread total 63,488 0 0 > > > > redirect_err total 0 0 > > > > > > > > $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024 > > > > Local /Remote > > > > Socket Size Request Resp. Elapsed Trans. > > > > Send Recv Size Size Time Rate > > > > bytes Bytes bytes bytes secs. per sec > > > > > > > > 16384 87380 1024 1024 10.00 12735.97 > > > > 16384 87380 > > > > > > > > The netperf TCP_CRR performance is the same, without XDP loaded. > > > > > > > > > > Just curious could you also try this with RPS enabled (or does this have > > > RPS enabled). RPS should effectively do the same thing but higher in the > > > stack. I'm curious what the delta would be. Might be another interesting > > > case and fairly easy to setup if you already have the above scripts. > > > > Yes, I'm essentially competing with RSP, thus such a comparison is very > > relevant... > > > > This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let > > other 4 CPUS process packet. > > > > Summary of RPS (Receive Packet Steering) performance: > > * End result is 6.3 Mpps max performance > > * netperf TCP_CRR is 1 trans/sec. > > * Each RX-RPS CPU stall at ~3.2Mpps. > > > > The full test report below with setup: > > > > The mask needed:: > > > > perl -e 'printf "%b\n",0x3C' > > 111100 > > > > RPS setup:: > > > > sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries' > > > > for N in $(seq 0 5) ; do \ > > sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \ > > sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \ > > grep -H . 
/sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \ > > done > > > > Reduce RX queues to two :: > > > > ethtool -L ixgbe1 combined 2 > > > > IRQ align to CPU numbers:: > > > > $ ~/setup01.sh > > Not root, running with sudo > > --- Disable Ethernet flow-control --- > > rx unmodified, ignoring > > tx unmodified, ignoring > > no pause parameters changed, aborting > > rx unmodified, ignoring > > tx unmodified, ignoring > > no pause parameters changed, aborting > > --- Align IRQs --- > > /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0 > > /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1 > > /proc/irq/56/ixgbe1/../smp_affinity_list:0-5 > > > > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus > > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c > > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c > > > > Generator is sending: 12,715,782 tx_packets /sec > > > > ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \ > > -d 172.16.0.2 -t8 > > > > $ nstat > /dev/null && sleep 1 && nstat > > #kernel > > IpInReceives 6346544 0.0 > > IpInDelivers 6346544 0.0 > > IpOutRequests 1020 0.0 > > IcmpOutMsgs 1020 0.0 > > IcmpOutDestUnreachs 1020 0.0 > > IcmpMsgOutType3 1020 0.0 > > UdpNoPorts 6346898 0.0 > > IpExtInOctets 291964714 0.0 > > IpExtOutOctets 73440 0.0 > > IpExtInNoECTPkts 6347063 0.0 > > > > $ mpstat -P ALL -u -I SCPU -I SUM > > > > Average: CPU %usr %nice %sys %irq %soft %idle > > Average: all 0.00 0.00 0.00 0.42 72.97 26.61 > > Average: 0 0.00 0.00 0.00 0.17 99.83 0.00 > > Average: 1 0.00 0.00 0.00 0.17 99.83 0.00 > > Average: 2 0.00 0.00 0.00 0.67 60.37 38.96 > > Average: 3 0.00 0.00 0.00 0.67 58.70 40.64 > > Average: 4 0.00 0.00 0.00 0.67 59.53 39.80 > > Average: 5 0.00 0.00 0.00 0.67 58.93 40.40 > > > > Average: CPU intr/s > > Average: all 152067.22 > > Average: 0 50064.73 > > Average: 1 50089.35 > > Average: 2 45095.17 > > Average: 3 44875.04 > > Average: 4 44906.32 > > Average: 5 45152.08 > > > > Average: CPU TIMER/s NET_TX/s NET_RX/s TASKLET/s SCHED/s 
RCU/s > > Average: 0 609.48 0.17 49431.28 0.00 2.66 21.13 > > Average: 1 567.55 0.00 49498.00 0.00 2.66 21.13 > > Average: 2 998.34 0.00 43941.60 4.16 82.86 68.22 > > Average: 3 540.60 0.17 44140.27 0.00 85.52 108.49 > > Average: 4 537.27 0.00 44219.63 0.00 84.53 64.89 > > Average: 5 530.78 0.17 44445.59 0.00 85.02 90.52 > > > > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck. > > > > Show adapter(s) (ixgbe1) statistics (ONLY that changed!) > > Ethtool(ixgbe1) stat: 11109531 ( 11,109,531) <= fdir_miss /sec > > Ethtool(ixgbe1) stat: 380632356 ( 380,632,356) <= rx_bytes /sec > > Ethtool(ixgbe1) stat: 812792611 ( 812,792,611) <= rx_bytes_nic /sec > > Ethtool(ixgbe1) stat: 1753550 ( 1,753,550) <= rx_missed_errors /sec > > Ethtool(ixgbe1) stat: 4602487 ( 4,602,487) <= rx_no_dma_resources /sec > > Ethtool(ixgbe1) stat: 6343873 ( 6,343,873) <= rx_packets /sec > > Ethtool(ixgbe1) stat: 10946441 ( 10,946,441) <= rx_pkts_nic /sec > > Ethtool(ixgbe1) stat: 190287853 ( 190,287,853) <= rx_queue_0_bytes /sec > > Ethtool(ixgbe1) stat: 3171464 ( 3,171,464) <= rx_queue_0_packets /sec > > Ethtool(ixgbe1) stat: 190344503 ( 190,344,503) <= rx_queue_1_bytes /sec > > Ethtool(ixgbe1) stat: 3172408 ( 3,172,408) <= rx_queue_1_packets /sec > > > > Notice, each RX-CPU can only process 3.1Mpps. > > > > RPS RX-CPU(0): > > > > # Overhead CPU Symbol > > # ........ ... ....................................... 
> > # > > 11.72% 000 [k] ixgbe_poll > > 11.29% 000 [k] _raw_spin_lock > > 10.35% 000 [k] dev_gro_receive > > 8.36% 000 [k] __build_skb > > 7.35% 000 [k] __skb_get_hash > > 6.22% 000 [k] enqueue_to_backlog > > 5.89% 000 [k] __skb_flow_dissect > > 4.43% 000 [k] inet_gro_receive > > 4.19% 000 [k] ___slab_alloc > > 3.90% 000 [k] queued_spin_lock_slowpath > > 3.85% 000 [k] kmem_cache_alloc > > 3.06% 000 [k] build_skb > > 2.66% 000 [k] get_rps_cpu > > 2.57% 000 [k] napi_gro_receive > > 2.34% 000 [k] eth_type_trans > > 1.81% 000 [k] __cmpxchg_double_slab.isra.61 > > 1.47% 000 [k] ixgbe_alloc_rx_buffers > > 1.43% 000 [k] get_partial_node.isra.81 > > 0.84% 000 [k] swiotlb_sync_single > > 0.74% 000 [k] udp4_gro_receive > > 0.73% 000 [k] netif_receive_skb_internal > > 0.72% 000 [k] udp_gro_receive > > 0.63% 000 [k] skb_gro_reset_offset > > 0.49% 000 [k] __skb_flow_get_ports > > 0.48% 000 [k] llist_add_batch > > 0.36% 000 [k] swiotlb_sync_single_for_cpu > > 0.34% 000 [k] __slab_alloc > > > > > > Remote RPS-CPU(3) getting packets:: > > > > # Overhead CPU Symbol > > # ........ ... .............................................. > > # > > 33.02% 003 [k] poll_idle > > 10.99% 003 [k] __netif_receive_skb_core > > 10.45% 003 [k] page_frag_free > > 8.49% 003 [k] ip_rcv > > 4.19% 003 [k] fib_table_lookup > > 2.84% 003 [k] __udp4_lib_rcv > > 2.81% 003 [k] __slab_free Notice slow-path of SLUB > > 2.23% 003 [k] __udp4_lib_lookup > > 2.09% 003 [k] ip_route_input_rcu > > 2.07% 003 [k] kmem_cache_free > > 2.06% 003 [k] udp_v4_early_demux > > 1.73% 003 [k] ip_rcv_finish > > Very interesting data. You removed some of the more interesting part of the perf-report, that showed us hitting more of the SLUB slowpath for SKBs. The slowpath consist of many separate function calls, thus it doesn't bubble to the top (the FlameGraph tool shows them easier). > So above perf report compares to xdp-redirect-cpu this one: > Perf top on a CPU(3) that have to alloc and free SKBs etc. 
> > # Overhead CPU Symbol > # ........ ... ....................................... > # > 15.51% 003 [k] fib_table_lookup > 8.91% 003 [k] cpu_map_kthread_run > 8.04% 003 [k] build_skb > 7.88% 003 [k] page_frag_free > 5.13% 003 [k] kmem_cache_alloc > 4.76% 003 [k] ip_route_input_rcu > 4.59% 003 [k] kmem_cache_free > 4.02% 003 [k] __udp4_lib_rcv > 3.20% 003 [k] fib_validate_source > 3.02% 003 [k] __netif_receive_skb_core > 3.02% 003 [k] udp_v4_early_demux > 2.90% 003 [k] ip_rcv > 2.80% 003 [k] ip_rcv_finish > > right? > and in RPS case the consumer cpu is 33% idle whereas in redirect-cpu > you can load it up all the way. > Am I interpreting all this correctly that with RPS cpu0 cannot > distribute the packets to other cpus fast enough and that's > a bottleneck? Yes, exactly. The work needed on the RPS cpu0 is simply too much. > whereas in redirect-cpu you're doing early packet distribution > before skb alloc? Yes, the main point is reducing the CPU cycles spent on the packet by doing early packet distribution. > So in other words with redirect-cpu all consumer cpus are doing > skb alloc and in RPS cpu0 is allocating skbs for all ? Yes. > and that's where 6M->12M performance gain comes from? Yes, basically. There are many small things that help this along, like the cpumap case always hitting the SLUB fastpath. Another big thing is bulking. It is sort of hidden, but the XDP_REDIRECT flush mechanism is implementing the RX bulking (which I've been "screaming" about for the last couple of years! ;-))
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index d8f0c83..06ce63c 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -94,6 +94,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons, xdp.data_hard_start = *data_ptr - offset; xdp.data = *data_ptr; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = *data_ptr + *len; orig_data = xdp.data; mapping = rx_buf->mapping - bp->rx_dma_offset; diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 49b80da..d68478a 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -523,6 +523,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog, xdp.data_hard_start = page_address(page); xdp.data = (void *)cpu_addr; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + len; orig_data = xdp.data; diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 1519dfb..f426762 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2107,6 +2107,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget) if (!skb) { xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset; + xdp_set_data_meta_invalid(&xdp); xdp.data_hard_start = xdp.data - i40e_rx_offset(rx_ring); xdp.data_end = xdp.data + size; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index d962368..04bb03b 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2326,6 +2326,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, if (!skb) { xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset; + 
xdp_set_data_meta_invalid(&xdp); xdp.data_hard_start = xdp.data - ixgbe_rx_offset(rx_ring); xdp.data_end = xdp.data + size; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index b97a55c8..8f9cb8a 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -762,6 +762,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud xdp.data_hard_start = va - frags[0].page_offset; xdp.data = va; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + length; orig_data = xdp.data; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index f1dd638..30b3f3f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -794,6 +794,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, return false; xdp.data = va + *rx_headroom; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len; xdp.data_hard_start = va; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 1c0187f..e3a38be 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1583,6 +1583,7 @@ static int nfp_net_run_xdp(struct bpf_prog *prog, void *data, void *hard_start, xdp.data_hard_start = hard_start; xdp.data = data + *off; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = data + *off + *len; orig_data = xdp.data; diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c index 6fc854b..48ec4c5 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_fp.c +++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c @@ -1004,6 +1004,7 @@ static bool qede_rx_xdp(struct qede_dev *edev, xdp.data_hard_start = page_address(bd->data); xdp.data = xdp.data_hard_start + *data_offset; + 
xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len; /* Queues always have a full reset currently, so for the time diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 3c9985f..1757fd7 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1314,6 +1314,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun, xdp.data_hard_start = buf; xdp.data = buf + pad; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + len; orig_data = xdp.data; act = bpf_prog_run_xdp(xdp_prog, &xdp); diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index dd14a45..fc059f1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -554,6 +554,7 @@ static struct sk_buff *receive_small(struct net_device *dev, xdp.data_hard_start = buf + VIRTNET_RX_PAD + vi->hdr_len; xdp.data = xdp.data_hard_start + xdp_headroom; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + len; orig_data = xdp.data; act = bpf_prog_run_xdp(xdp_prog, &xdp); @@ -686,6 +687,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, data = page_address(xdp_page) + offset; xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len; xdp.data = data + vi->hdr_len; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + (len - vi->hdr_len); act = bpf_prog_run_xdp(xdp_prog, &xdp); diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 8390859..2b672c5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -137,6 +137,7 @@ enum bpf_reg_type { PTR_TO_MAP_VALUE, /* reg points to map element value */ PTR_TO_MAP_VALUE_OR_NULL,/* points to map elem value or NULL */ PTR_TO_STACK, /* reg == frame_pointer + offset */ + PTR_TO_PACKET_META, /* skb->data - meta_len */ PTR_TO_PACKET, /* reg points to skb->data */ PTR_TO_PACKET_END, /* skb->data + headlen */ }; diff --git a/include/linux/filter.h b/include/linux/filter.h index 052bab3..911d454 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -487,12 +487,14 @@ struct 
sk_filter { struct bpf_skb_data_end { struct qdisc_skb_cb qdisc_cb; + void *data_meta; void *data_end; }; struct xdp_buff { void *data; void *data_end; + void *data_meta; void *data_hard_start; }; @@ -507,7 +509,8 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb) struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; BUILD_BUG_ON(sizeof(*cb) > FIELD_SIZEOF(struct sk_buff, cb)); - cb->data_end = skb->data + skb_headlen(skb); + cb->data_meta = skb->data - skb_metadata_len(skb); + cb->data_end = skb->data + skb_headlen(skb); } static inline u8 *bpf_skb_cb(struct sk_buff *skb) @@ -728,8 +731,22 @@ int xdp_do_redirect(struct net_device *dev, struct bpf_prog *prog); void xdp_do_flush_map(void); +/* Drivers not supporting XDP metadata can use this helper, which + * rejects any room expansion for metadata as a result. + */ +static __always_inline void +xdp_set_data_meta_invalid(struct xdp_buff *xdp) +{ + xdp->data_meta = xdp->data + 1; +} + +static __always_inline bool +xdp_data_meta_unsupported(const struct xdp_buff *xdp) +{ + return unlikely(xdp->data_meta > xdp->data); +} + void bpf_warn_invalid_xdp_action(u32 act); -void bpf_warn_invalid_xdp_redirect(u32 ifindex); struct sock *do_sk_redirect_map(void); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f9db553..19e64bf 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -489,8 +489,9 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, * the end of the header data, ie. at skb->end. */ struct skb_shared_info { - unsigned short _unused; - unsigned char nr_frags; + __u8 __unused; + __u8 meta_len; + __u8 nr_frags; __u8 tx_flags; unsigned short gso_size; /* Warning: this field is not always filled in (UFO)! 
*/ @@ -3400,6 +3401,69 @@ static inline ktime_t net_invalid_timestamp(void) return 0; } +static inline u8 skb_metadata_len(const struct sk_buff *skb) +{ + return skb_shinfo(skb)->meta_len; +} + +static inline void *skb_metadata_end(const struct sk_buff *skb) +{ + return skb_mac_header(skb); +} + +static inline bool __skb_metadata_differs(const struct sk_buff *skb_a, + const struct sk_buff *skb_b, + u8 meta_len) +{ + const void *a = skb_metadata_end(skb_a); + const void *b = skb_metadata_end(skb_b); + /* Using more efficient variant than plain call to memcmp(). */ +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64 + u64 diffs = 0; + + switch (meta_len) { +#define __it(x, op) (x -= sizeof(u##op)) +#define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op)) + case 32: diffs |= __it_diff(a, b, 64); + case 24: diffs |= __it_diff(a, b, 64); + case 16: diffs |= __it_diff(a, b, 64); + case 8: diffs |= __it_diff(a, b, 64); + break; + case 28: diffs |= __it_diff(a, b, 64); + case 20: diffs |= __it_diff(a, b, 64); + case 12: diffs |= __it_diff(a, b, 64); + case 4: diffs |= __it_diff(a, b, 32); + break; + } + return diffs; +#else + return memcmp(a - meta_len, b - meta_len, meta_len); +#endif +} + +static inline bool skb_metadata_differs(const struct sk_buff *skb_a, + const struct sk_buff *skb_b) +{ + u8 len_a = skb_metadata_len(skb_a); + u8 len_b = skb_metadata_len(skb_b); + + if (!(len_a | len_b)) + return false; + + return len_a != len_b ? 
+ true : __skb_metadata_differs(skb_a, skb_b, len_a); +} + +static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len) +{ + skb_shinfo(skb)->meta_len = meta_len; +} + +static inline void skb_metadata_clear(struct sk_buff *skb) +{ + skb_metadata_set(skb, 0); +} + struct sk_buff *skb_clone_sk(struct sk_buff *skb); #ifdef CONFIG_NETWORK_PHY_TIMESTAMPING diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 43ab5c4..e43491a 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -582,6 +582,12 @@ enum bpf_attach_type { * @map: pointer to sockmap to update * @key: key to insert/update sock in map * @flags: same flags as map update elem + * + * int bpf_xdp_adjust_meta(xdp_md, delta) + * Adjust the xdp_md.data_meta by delta + * @xdp_md: pointer to xdp_md + * @delta: A positive/negative integer to be added to xdp_md.data_meta + * Return: 0 on success or negative on error */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -638,6 +644,7 @@ enum bpf_attach_type { FN(redirect_map), \ FN(sk_redirect_map), \ FN(sock_map_update), \ + FN(xdp_adjust_meta), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -715,7 +722,7 @@ struct __sk_buff { __u32 data_end; __u32 napi_id; - /* accessed by BPF_PROG_TYPE_sk_skb types */ + /* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */ __u32 family; __u32 remote_ip4; /* Stored in network byte order */ __u32 local_ip4; /* Stored in network byte order */ @@ -723,6 +730,9 @@ struct __sk_buff { __u32 local_ip6[4]; /* Stored in network byte order */ __u32 remote_port; /* Stored in network byte order */ __u32 local_port; /* stored in host byte order */ + /* ... here. 
*/ + + __u32 data_meta; }; struct bpf_tunnel_key { @@ -783,6 +793,7 @@ enum xdp_action { struct xdp_md { __u32 data; __u32 data_end; + __u32 data_meta; }; enum sk_action { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b914fbe..f849eca 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -177,6 +177,12 @@ static __printf(1, 2) void verbose(const char *fmt, ...) va_end(args); } +static bool type_is_pkt_pointer(enum bpf_reg_type type) +{ + return type == PTR_TO_PACKET || + type == PTR_TO_PACKET_META; +} + /* string representation of 'enum bpf_reg_type' */ static const char * const reg_type_str[] = { [NOT_INIT] = "?", @@ -187,6 +193,7 @@ static __printf(1, 2) void verbose(const char *fmt, ...) [PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null", [PTR_TO_STACK] = "fp", [PTR_TO_PACKET] = "pkt", + [PTR_TO_PACKET_META] = "pkt_meta", [PTR_TO_PACKET_END] = "pkt_end", }; @@ -226,7 +233,7 @@ static void print_verifier_state(struct bpf_verifier_state *state) verbose("(id=%d", reg->id); if (t != SCALAR_VALUE) verbose(",off=%d", reg->off); - if (t == PTR_TO_PACKET) + if (type_is_pkt_pointer(t)) verbose(",r=%d", reg->range); else if (t == CONST_PTR_TO_MAP || t == PTR_TO_MAP_VALUE || @@ -519,6 +526,31 @@ static void mark_reg_known_zero(struct bpf_reg_state *regs, u32 regno) __mark_reg_known_zero(regs + regno); } +static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg) +{ + return type_is_pkt_pointer(reg->type); +} + +static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg) +{ + return reg_is_pkt_pointer(reg) || + reg->type == PTR_TO_PACKET_END; +} + +/* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */ +static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg, + enum bpf_reg_type which) +{ + /* The register can already have a range from prior markings. + * This is fine as long as it hasn't been advanced from its + * origin. 
+ */ + return reg->type == which && + reg->id == 0 && + reg->off == 0 && + tnum_equals_const(reg->var_off, 0); +} + /* Attempts to improve min/max values based on var_off information */ static void __update_reg_bounds(struct bpf_reg_state *reg) { @@ -702,6 +734,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type) case PTR_TO_STACK: case PTR_TO_CTX: case PTR_TO_PACKET: + case PTR_TO_PACKET_META: case PTR_TO_PACKET_END: case CONST_PTR_TO_MAP: return true; @@ -1047,7 +1080,10 @@ static int check_ptr_alignment(struct bpf_verifier_env *env, switch (reg->type) { case PTR_TO_PACKET: - /* special case, because of NET_IP_ALIGN */ + case PTR_TO_PACKET_META: + /* Special case, because of NET_IP_ALIGN. Given metadata sits + * right in front, treat it the very same way. + */ return check_pkt_ptr_alignment(reg, off, size, strict); case PTR_TO_MAP_VALUE: pointer_desc = "value "; @@ -1124,8 +1160,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn err = check_ctx_access(env, insn_idx, off, size, t, ®_type); if (!err && t == BPF_READ && value_regno >= 0) { /* ctx access returns either a scalar, or a - * PTR_TO_PACKET[_END]. In the latter case, we know - * the offset is zero. + * PTR_TO_PACKET[_META,_END]. In the latter + * case, we know the offset is zero. 
*/ if (reg_type == SCALAR_VALUE) mark_reg_unknown(state->regs, value_regno); @@ -1170,7 +1206,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn } else { err = check_stack_read(state, off, size, value_regno); } - } else if (reg->type == PTR_TO_PACKET) { + } else if (reg_is_pkt_pointer(reg)) { if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) { verbose("cannot write into packet\n"); return -EACCES; @@ -1310,6 +1346,7 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, switch (reg->type) { case PTR_TO_PACKET: + case PTR_TO_PACKET_META: return check_packet_access(env, regno, reg->off, access_size); case PTR_TO_MAP_VALUE: return check_map_access(env, regno, reg->off, access_size); @@ -1342,7 +1379,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, return 0; } - if (type == PTR_TO_PACKET && + if (type_is_pkt_pointer(type) && !may_access_direct_pkt_data(env, meta, BPF_READ)) { verbose("helper access to the packet is not allowed\n"); return -EACCES; @@ -1351,7 +1388,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, if (arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE) { expected_type = PTR_TO_STACK; - if (type != PTR_TO_PACKET && type != expected_type) + if (!type_is_pkt_pointer(type) && + type != expected_type) goto err_type; } else if (arg_type == ARG_CONST_SIZE || arg_type == ARG_CONST_SIZE_OR_ZERO) { @@ -1375,7 +1413,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, */ if (register_is_null(*reg)) /* final test in check_stack_boundary() */; - else if (type != PTR_TO_PACKET && type != PTR_TO_MAP_VALUE && + else if (!type_is_pkt_pointer(type) && + type != PTR_TO_MAP_VALUE && type != expected_type) goto err_type; meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM; @@ -1401,7 +1440,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, verbose("invalid map_ptr to access map->key\n"); return -EACCES; 
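As another aside, the effect of threading an explicit register type through find_good_pkt_pointers() (further down in this patch) can be modeled in a few lines of userspace C. The reduced register state and the function name here are hypothetical; the real state lives in kernel/bpf/verifier.c and carries far more information.

```c
#include <assert.h>

/* Reduced model of the verifier's register state. */
enum reg_type { SCALAR, PKT, PKT_META, PKT_END };

struct reg {
	enum reg_type type;
	int id;
	unsigned short range;
	int off;
};

#define NREGS 4

/* After a successful bounds check (pkt vs. pkt_end, or pkt_meta vs. an
 * unmodified pkt pointer), only registers of the SAME packet-pointer
 * type with a matching id gain the proven range; other packet-pointer
 * types are left untouched. This mirrors the new 'type' parameter.
 */
static void mark_good_pkt_pointers(struct reg *regs, const struct reg *dst,
				   enum reg_type type)
{
	for (int i = 0; i < NREGS; i++)
		if (regs[i].type == type && regs[i].id == dst->id &&
		    regs[i].range < dst->off)
			regs[i].range = (unsigned short)dst->off;
}
```

The type filter is the point of the change: a bounds check proven for a pkt_meta pointer must not widen the range of a plain pkt pointer sharing the same id, and vice versa.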
} - if (type == PTR_TO_PACKET) + if (type_is_pkt_pointer(type)) err = check_packet_access(env, regno, reg->off, meta->map_ptr->key_size); else @@ -1417,7 +1456,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, verbose("invalid map_ptr to access map->value\n"); return -EACCES; } - if (type == PTR_TO_PACKET) + if (type_is_pkt_pointer(type)) err = check_packet_access(env, regno, reg->off, meta->map_ptr->value_size); else @@ -1590,8 +1629,8 @@ static int check_raw_mode(const struct bpf_func_proto *fn) return count > 1 ? -EINVAL : 0; } -/* Packet data might have moved, any old PTR_TO_PACKET[_END] are now invalid, - * so turn them into unknown SCALAR_VALUE. +/* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] + * are now invalid, so turn them into unknown SCALAR_VALUE. */ static void clear_all_pkt_pointers(struct bpf_verifier_env *env) { @@ -1600,18 +1639,15 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env) int i; for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].type == PTR_TO_PACKET || - regs[i].type == PTR_TO_PACKET_END) + if (reg_is_pkt_pointer_any(®s[i])) mark_reg_unknown(regs, i); for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) { if (state->stack_slot_type[i] != STACK_SPILL) continue; reg = &state->spilled_regs[i / BPF_REG_SIZE]; - if (reg->type != PTR_TO_PACKET && - reg->type != PTR_TO_PACKET_END) - continue; - __mark_reg_unknown(reg); + if (reg_is_pkt_pointer_any(reg)) + __mark_reg_unknown(reg); } } @@ -1871,7 +1907,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, } dst_reg->var_off = tnum_add(ptr_reg->var_off, off_reg->var_off); dst_reg->off = ptr_reg->off; - if (ptr_reg->type == PTR_TO_PACKET) { + if (reg_is_pkt_pointer(ptr_reg)) { dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ dst_reg->range = 0; @@ -1931,7 +1967,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, } dst_reg->var_off = tnum_sub(ptr_reg->var_off, off_reg->var_off); 
dst_reg->off = ptr_reg->off; - if (ptr_reg->type == PTR_TO_PACKET) { + if (reg_is_pkt_pointer(ptr_reg)) { dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ if (smin_val < 0) @@ -2421,7 +2457,8 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) } static void find_good_pkt_pointers(struct bpf_verifier_state *state, - struct bpf_reg_state *dst_reg) + struct bpf_reg_state *dst_reg, + enum bpf_reg_type type) { struct bpf_reg_state *regs = state->regs, *reg; int i; @@ -2483,7 +2520,7 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *state, * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16. */ for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].type == PTR_TO_PACKET && regs[i].id == dst_reg->id) + if (regs[i].type == type && regs[i].id == dst_reg->id) /* keep the maximum range already checked */ regs[i].range = max_t(u16, regs[i].range, dst_reg->off); @@ -2491,7 +2528,7 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *state, if (state->stack_slot_type[i] != STACK_SPILL) continue; reg = &state->spilled_regs[i / BPF_REG_SIZE]; - if (reg->type == PTR_TO_PACKET && reg->id == dst_reg->id) + if (reg->type == type && reg->id == dst_reg->id) reg->range = max_t(u16, reg->range, dst_reg->off); } } @@ -2856,19 +2893,39 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGT && dst_reg->type == PTR_TO_PACKET && regs[insn->src_reg].type == PTR_TO_PACKET_END) { - find_good_pkt_pointers(this_branch, dst_reg); + find_good_pkt_pointers(this_branch, dst_reg, PTR_TO_PACKET); } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLT && dst_reg->type == PTR_TO_PACKET && regs[insn->src_reg].type == PTR_TO_PACKET_END) { - find_good_pkt_pointers(other_branch, dst_reg); + find_good_pkt_pointers(other_branch, dst_reg, PTR_TO_PACKET); } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGE && dst_reg->type == 
PTR_TO_PACKET_END && regs[insn->src_reg].type == PTR_TO_PACKET) { - find_good_pkt_pointers(other_branch, ®s[insn->src_reg]); + find_good_pkt_pointers(other_branch, ®s[insn->src_reg], + PTR_TO_PACKET); } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLE && dst_reg->type == PTR_TO_PACKET_END && regs[insn->src_reg].type == PTR_TO_PACKET) { - find_good_pkt_pointers(this_branch, ®s[insn->src_reg]); + find_good_pkt_pointers(this_branch, ®s[insn->src_reg], + PTR_TO_PACKET); + } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGT && + dst_reg->type == PTR_TO_PACKET_META && + reg_is_init_pkt_pointer(®s[insn->src_reg], PTR_TO_PACKET)) { + find_good_pkt_pointers(this_branch, dst_reg, PTR_TO_PACKET_META); + } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLT && + dst_reg->type == PTR_TO_PACKET_META && + reg_is_init_pkt_pointer(®s[insn->src_reg], PTR_TO_PACKET)) { + find_good_pkt_pointers(other_branch, dst_reg, PTR_TO_PACKET_META); + } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGE && + reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) && + regs[insn->src_reg].type == PTR_TO_PACKET_META) { + find_good_pkt_pointers(other_branch, ®s[insn->src_reg], + PTR_TO_PACKET_META); + } else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLE && + reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) && + regs[insn->src_reg].type == PTR_TO_PACKET_META) { + find_good_pkt_pointers(this_branch, ®s[insn->src_reg], + PTR_TO_PACKET_META); } else if (is_pointer_value(env, insn->dst_reg)) { verbose("R%d pointer comparison prohibited\n", insn->dst_reg); return -EACCES; @@ -3298,8 +3355,9 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, return false; /* Check our ids match any regs they're supposed to */ return check_ids(rold->id, rcur->id, idmap); + case PTR_TO_PACKET_META: case PTR_TO_PACKET: - if (rcur->type != PTR_TO_PACKET) + if (rcur->type != rold->type) return false; /* We must have at least as much range as the old ptr * did, so that 
any accesses which were safe before are diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index df67251..a86e668 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -162,6 +162,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, xdp.data_hard_start = data; xdp.data = data + XDP_PACKET_HEADROOM + NET_IP_ALIGN; + xdp.data_meta = xdp.data; xdp.data_end = xdp.data + size; retval = bpf_test_run(prog, &xdp, repeat, &duration); diff --git a/net/core/dev.c b/net/core/dev.c index 97abddd..e350c76 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3864,8 +3864,8 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, static u32 netif_receive_generic_xdp(struct sk_buff *skb, struct bpf_prog *xdp_prog) { + u32 metalen, act = XDP_DROP; struct xdp_buff xdp; - u32 act = XDP_DROP; void *orig_data; int hlen, off; u32 mac_len; @@ -3876,8 +3876,25 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, if (skb_cloned(skb)) return XDP_PASS; - if (skb_linearize(skb)) - goto do_drop; + /* XDP packets must be linear and must have sufficient headroom + * of XDP_PACKET_HEADROOM bytes. This is the guarantee that + * native XDP also provides, so we need to do it here as well. + */ + if (skb_is_nonlinear(skb) || + skb_headroom(skb) < XDP_PACKET_HEADROOM) { + int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb); + int troom = skb->tail + skb->data_len - skb->end; + + /* In case we have to go down the path and also linearize, + * then let's do the pskb_expand_head() work just once here. + */ + if (pskb_expand_head(skb, + hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0, + troom > 0 ? troom + 128 : 0, GFP_ATOMIC)) + goto do_drop; + if (troom > 0 && __skb_linearize(skb)) + goto do_drop; + } /* The XDP program wants to see the packet starting at the MAC * header.
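For illustration (not part of the patch), the headroom sizing in the hunk above reduces to a small piece of arithmetic: the missing amount of headroom is rounded up to NET_SKB_PAD before pskb_expand_head() is asked for it. The constants below are illustrative stand-ins; the real values come from the kernel headers and are config dependent.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel constants. */
#define XDP_PACKET_HEADROOM 256
#define NET_SKB_PAD 64
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Extra headroom the generic XDP path requests from pskb_expand_head():
 * the missing amount rounded up to NET_SKB_PAD, or 0 when the skb
 * already has XDP_PACKET_HEADROOM available.
 */
static int xdp_extra_headroom(int headroom)
{
	int hroom = XDP_PACKET_HEADROOM - headroom;

	return hroom > 0 ? ALIGN_UP(hroom, NET_SKB_PAD) : 0;
}
```

So an skb with 200 bytes of headroom is expanded by 64 bytes (56 rounded up), while one that already has 256 or more is left alone.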
@@ -3885,6 +3902,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, mac_len = skb->data - skb_mac_header(skb); hlen = skb_headlen(skb) + mac_len; xdp.data = skb->data - mac_len; + xdp.data_meta = xdp.data; xdp.data_end = xdp.data + hlen; xdp.data_hard_start = skb->data - skb_headroom(skb); orig_data = xdp.data; @@ -3902,10 +3920,12 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, case XDP_REDIRECT: case XDP_TX: __skb_push(skb, mac_len); - /* fall through */ + break; case XDP_PASS: + metalen = xdp.data - xdp.data_meta; + if (metalen) + skb_metadata_set(skb, metalen); break; - default: bpf_warn_invalid_xdp_action(act); /* fall through */ @@ -4695,6 +4715,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb) diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev; diffs |= p->vlan_tci ^ skb->vlan_tci; diffs |= skb_metadata_dst_cmp(p, skb); + diffs |= skb_metadata_differs(p, skb); if (maclen == ETH_HLEN) diffs |= compare_ether_header(skb_mac_header(p), skb_mac_header(skb)); diff --git a/net/core/filter.c b/net/core/filter.c index c468e7c..9b6e7e8 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2447,14 +2447,26 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) .arg3_type = ARG_ANYTHING, }; +static unsigned long xdp_get_metalen(const struct xdp_buff *xdp) +{ + return xdp_data_meta_unsupported(xdp) ? 
0 : + xdp->data - xdp->data_meta; +} + BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset) { + unsigned long metalen = xdp_get_metalen(xdp); + void *data_start = xdp->data_hard_start + metalen; void *data = xdp->data + offset; - if (unlikely(data < xdp->data_hard_start || + if (unlikely(data < data_start || data > xdp->data_end - ETH_HLEN)) return -EINVAL; + if (metalen) + memmove(xdp->data_meta + offset, + xdp->data_meta, metalen); + xdp->data_meta += offset; xdp->data = data; return 0; @@ -2468,6 +2480,33 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) .arg2_type = ARG_ANYTHING, }; +BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset) +{ + void *meta = xdp->data_meta + offset; + unsigned long metalen = xdp->data - meta; + + if (xdp_data_meta_unsupported(xdp)) + return -ENOTSUPP; + if (unlikely(meta < xdp->data_hard_start || + meta > xdp->data)) + return -EINVAL; + if (unlikely((metalen & (sizeof(__u32) - 1)) || + (metalen > 32))) + return -EACCES; + + xdp->data_meta = meta; + + return 0; +} + +static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = { + .func = bpf_xdp_adjust_meta, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; + static int __bpf_tx_xdp(struct net_device *dev, struct bpf_map *map, struct xdp_buff *xdp, @@ -2692,7 +2731,8 @@ bool bpf_helper_changes_pkt_data(void *func) func == bpf_clone_redirect || func == bpf_l3_csum_replace || func == bpf_l4_csum_replace || - func == bpf_xdp_adjust_head) + func == bpf_xdp_adjust_head || + func == bpf_xdp_adjust_meta) return true; return false; @@ -3288,6 +3328,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff, return &bpf_get_smp_processor_id_proto; case BPF_FUNC_xdp_adjust_head: return &bpf_xdp_adjust_head_proto; + case BPF_FUNC_xdp_adjust_meta: + return &bpf_xdp_adjust_meta_proto; case BPF_FUNC_redirect: return &bpf_xdp_redirect_proto; case 
BPF_FUNC_redirect_map: @@ -3418,6 +3460,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type case bpf_ctx_range_till(struct __sk_buff, remote_ip4, remote_ip4): case bpf_ctx_range_till(struct __sk_buff, local_ip4, local_ip4): case bpf_ctx_range(struct __sk_buff, data): + case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, data_end): if (size != size_default) return false; @@ -3444,6 +3487,7 @@ static bool sk_filter_is_valid_access(int off, int size, switch (off) { case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range(struct __sk_buff, data): + case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, data_end): case bpf_ctx_range_till(struct __sk_buff, family, local_port): return false; @@ -3468,6 +3512,7 @@ static bool lwt_is_valid_access(int off, int size, switch (off) { case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range_till(struct __sk_buff, family, local_port): + case bpf_ctx_range(struct __sk_buff, data_meta): return false; } @@ -3586,6 +3631,9 @@ static bool tc_cls_act_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, data): info->reg_type = PTR_TO_PACKET; break; + case bpf_ctx_range(struct __sk_buff, data_meta): + info->reg_type = PTR_TO_PACKET_META; + break; case bpf_ctx_range(struct __sk_buff, data_end): info->reg_type = PTR_TO_PACKET_END; break; @@ -3619,6 +3667,9 @@ static bool xdp_is_valid_access(int off, int size, case offsetof(struct xdp_md, data): info->reg_type = PTR_TO_PACKET; break; + case offsetof(struct xdp_md, data_meta): + info->reg_type = PTR_TO_PACKET_META; + break; case offsetof(struct xdp_md, data_end): info->reg_type = PTR_TO_PACKET_END; break; @@ -3677,6 +3728,12 @@ static bool sk_skb_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info) { + switch (off) { + case bpf_ctx_range(struct __sk_buff, tc_classid): + case bpf_ctx_range(struct __sk_buff, 
data_meta): + return false; + } + if (type == BPF_WRITE) { switch (off) { case bpf_ctx_range(struct __sk_buff, mark): @@ -3689,8 +3746,6 @@ static bool sk_skb_is_valid_access(int off, int size, } switch (off) { - case bpf_ctx_range(struct __sk_buff, tc_classid): - return false; case bpf_ctx_range(struct __sk_buff, data): info->reg_type = PTR_TO_PACKET; break; @@ -3847,6 +3902,15 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, offsetof(struct sk_buff, data)); break; + case offsetof(struct __sk_buff, data_meta): + off = si->off; + off -= offsetof(struct __sk_buff, data_meta); + off += offsetof(struct sk_buff, cb); + off += offsetof(struct bpf_skb_data_end, data_meta); + *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, + si->src_reg, off); + break; + case offsetof(struct __sk_buff, data_end): off = si->off; off -= offsetof(struct __sk_buff, data_end); @@ -4095,6 +4159,11 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, si->dst_reg, si->src_reg, offsetof(struct xdp_buff, data)); break; + case offsetof(struct xdp_md, data_meta): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_meta), + si->dst_reg, si->src_reg, + offsetof(struct xdp_buff, data_meta)); + break; case offsetof(struct xdp_md, data_end): *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end), si->dst_reg, si->src_reg, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 16982de..681177b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1509,6 +1509,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, skb->nohdr = 0; atomic_set(&skb_shinfo(skb)->dataref, 1); + skb_metadata_clear(skb); + /* It is not generally safe to change skb->truesize. * For the moment, we really care of rx path, or * when skb is orphaned (not attached to a socket).
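Finally, a minimal userspace sketch of the two helpers' pointer arithmetic: bpf_xdp_adjust_meta() only moves xdp->data_meta within the headroom and enforces that the resulting scratch space is a multiple of 4 bytes and at most 32 bytes large, while bpf_xdp_adjust_head() memmove()s any existing metadata along with the new offset so it stays directly prepended to data. struct xdp_buf and both function names below are stand-ins, not kernel code, and ETH_HLEN is hardcoded as 14; the unsupported-driver check (xdp_data_meta_unsupported()) is omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct xdp_buf {		/* userspace stand-in for struct xdp_buff */
	unsigned char *data_hard_start;
	unsigned char *data_meta;
	unsigned char *data;
	unsigned char *data_end;
};

/* Model of bpf_xdp_adjust_meta(): the metadata room must stay within
 * the headroom, be a multiple of 4 bytes and at most 32 bytes large.
 */
static int adjust_meta(struct xdp_buf *xdp, int offset)
{
	unsigned char *meta = xdp->data_meta + offset;
	unsigned long metalen = (unsigned long)(xdp->data - meta);

	if (meta < xdp->data_hard_start || meta > xdp->data)
		return -EINVAL;
	if ((metalen & (sizeof(unsigned int) - 1)) || metalen > 32)
		return -EACCES;
	xdp->data_meta = meta;
	return 0;
}

/* Model of bpf_xdp_adjust_head(): metadata in front of data is
 * memmove()d along with the offset so it stays directly prepended.
 */
static int adjust_head(struct xdp_buf *xdp, int offset)
{
	unsigned long metalen = (unsigned long)(xdp->data - xdp->data_meta);
	unsigned char *data_start = xdp->data_hard_start + metalen;
	unsigned char *data = xdp->data + offset;

	if (data < data_start || data > xdp->data_end - 14 /* ETH_HLEN */)
		return -EINVAL;
	if (metalen)
		memmove(xdp->data_meta + offset, xdp->data_meta, metalen);
	xdp->data_meta += offset;
	xdp->data = data;
	return 0;
}
```

A program would typically call adjust_meta with a negative offset to open the scratch space, write its metadata there, and any later adjust_head call then carries that region along unchanged.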