Message ID: 20200123014210.38412-4-dsahern@kernel.org
State: Changes Requested
Delegated to: BPF Maintainers
Series: Add support for XDP in egress path
David Ahern <dsahern@kernel.org> writes: > From: David Ahern <dahern@digitalocean.com> > > Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached > to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as > the egress counterpart to the existing rtnl_xdp_fill. The expectation > is that going forward egress path will acquire the various levels of > attach - generic, driver and hardware. How would a 'hardware' attach work for this? As I said in my reply to the previous patch, isn't this explicitly for emulating XDP on the other end of a point-to-point link? How would that work with offloaded programs? -Toke
On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: > David Ahern <dsahern@kernel.org> writes: > >> From: David Ahern <dahern@digitalocean.com> >> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as >> the egress counterpart to the existing rtnl_xdp_fill. The expectation >> is that going forward egress path will acquire the various levels of >> attach - generic, driver and hardware. > > How would a 'hardware' attach work for this? As I said in my reply to > the previous patch, isn't this explicitly for emulating XDP on the other > end of a point-to-point link? How would that work with offloaded > programs? > > -Toke > Nothing about this patch set is limited to point-to-point links.
On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: > On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: > > David Ahern <dsahern@kernel.org> writes: > >> From: David Ahern <dahern@digitalocean.com> > >> > >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached > >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as > >> the egress counterpart to the existing rtnl_xdp_fill. The expectation > >> is that going forward egress path will acquire the various levels of > >> attach - generic, driver and hardware. > > > > How would a 'hardware' attach work for this? As I said in my reply to > > the previous patch, isn't this explicitly for emulating XDP on the other > > end of a point-to-point link? How would that work with offloaded > > programs? > > Nothing about this patch set is limited to point-to-point links. I struggle to understand what the expected semantics of this new hook are. Is this going to be run on all frames sent to the device from the stack? All frames from the stack and from XDP_REDIRECT? A little hard to figure out the semantics when we start from a funky device like tun :S
Jakub Kicinski <kuba@kernel.org> writes: > On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: >> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: >> > David Ahern <dsahern@kernel.org> writes: >> >> From: David Ahern <dahern@digitalocean.com> >> >> >> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached >> >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as >> >> the egress counterpart to the existing rtnl_xdp_fill. The expectation >> >> is that going forward egress path will acquire the various levels of >> >> attach - generic, driver and hardware. >> > >> > How would a 'hardware' attach work for this? As I said in my reply to >> > the previous patch, isn't this explicitly for emulating XDP on the other >> > end of a point-to-point link? How would that work with offloaded >> > programs? >> >> Nothing about this patch set is limited to point-to-point links. > > I struggle to understand of what the expected semantics of this new > hook are. Is this going to be run on all frames sent to the device > from the stack? All frames from the stack and from XDP_REDIRECT? > > A little hard to figure out the semantics when we start from a funky > device like tun :S Yes, that is also why I found this a bit weird. We have discussed plans for an XDP TX hook before: https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx That TX hook would run for everything at TX, but it would be a separate program type with its own metadata access. Whereas the idea with this series (it seemed to me) was just to be able to "emulate" running a regular RX-side XDP program on egress for devices where this makes sense. If this series is not meant to implement that "emulation", but rather be usable for all devices, I really think we should go straight for the full TX hook as discussed earlier... -Toke
On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote: > Jakub Kicinski <kuba@kernel.org> writes: > >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: >>>> David Ahern <dsahern@kernel.org> writes: >>>>> From: David Ahern <dahern@digitalocean.com> >>>>> >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation >>>>> is that going forward egress path will acquire the various levels of >>>>> attach - generic, driver and hardware. >>>> >>>> How would a 'hardware' attach work for this? As I said in my reply to >>>> the previous patch, isn't this explicitly for emulating XDP on the other >>>> end of a point-to-point link? How would that work with offloaded >>>> programs? >>> >>> Nothing about this patch set is limited to point-to-point links. >> >> I struggle to understand of what the expected semantics of this new >> hook are. Is this going to be run on all frames sent to the device >> from the stack? All frames from the stack and from XDP_REDIRECT? >> >> A little hard to figure out the semantics when we start from a funky >> device like tun :S > > Yes, that is also why I found this a bit weird. We have discussed plans > for an XDP TX hook before: > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx > > That TX hook would run for everything at TX, but it would be a separate > program type with its own metadata access. Whereas the idea with this > series (seemed to me) to be just to be able to "emulate" run a regular > RX-side XDP program on egress for devices where this makes sense. > > If this series is not meant to implement that "emulation", but rather be > usable for all devices, I really think we should go straight for the > full TX hook as discussed earlier... 
The first patch set from Jason and Prashant started from the perspective of offloading XDP programs for a guest. Independently, I was looking at XDP in the TX path (now referred to as egress to avoid confusion with the XDP_TX return type). Jason and Prashant were touching some of the same code paths in the tun driver that I needed for XDP in the Tx path, so we decided to consolidate and have XDP egress done first and then offload of VMs as a followup. Offload in virtio_net can be done very similar to how it is done in nfp -- the program is passed to the host as a hardware level attach mode, and the driver verifies the program can be offloaded (e.g., does not contain helpers that expose host specific data like the fib lookup helper). At this point, you need to stop thinking solely from the perspective of tun or tap and VM offload; think about this in terms of the ability to run an XDP program on the egress path at an appropriate place in the NIC driver that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has been discussed before as a need (e.g., Toke's reference above), and I am trying to get this initial support done. I very much wanted to avoid copy-paste-modify for the entire XDP API for this. For the most part XDP means ebpf at the NIC driver / hardware level (obviously with the exception of generic mode). The goal is tempered with the need for the verifier to reject rx entries in the xdp_md context. Hence the reason for use of an attach_type - existing infrastructure to test and reject the accesses. That said, Martin's comment throws a wrench in the goal: if the existing code does not enforce expected_attach_type then that option cannot be used, in which case I guess I have to go with a new program type (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), has different return codes, etc.
On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote: > > That said, Martin's comment throws a wrench in the goal: if the existing > code does not enforce expected_attach_type then that option can not be > used in which case I guess I have to go with a new program type > (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), > has different return codes, etc. This is acceptable risk. We did such thing in the past. The chances of user space breakage are extremely low.
On Sat, 25 Jan 2020 18:43:36 -0700 David Ahern <dsahern@gmail.com> wrote: > On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote: > > Jakub Kicinski <kuba@kernel.org> writes: > > > >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: > >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: > >>>> David Ahern <dsahern@kernel.org> writes: > >>>>> From: David Ahern <dahern@digitalocean.com> > >>>>> > >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached > >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as > >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation > >>>>> is that going forward egress path will acquire the various levels of > >>>>> attach - generic, driver and hardware. > >>>> > >>>> How would a 'hardware' attach work for this? As I said in my reply to > >>>> the previous patch, isn't this explicitly for emulating XDP on the other > >>>> end of a point-to-point link? How would that work with offloaded > >>>> programs? > >>> > >>> Nothing about this patch set is limited to point-to-point links. > >> > >> I struggle to understand of what the expected semantics of this new > >> hook are. Is this going to be run on all frames sent to the device > >> from the stack? All frames from the stack and from XDP_REDIRECT? > >> > >> A little hard to figure out the semantics when we start from a funky > >> device like tun :S > > > > Yes, that is also why I found this a bit weird. We have discussed plans > > for an XDP TX hook before: > > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx > > > > That TX hook would run for everything at TX, but it would be a separate > > program type with its own metadata access. Whereas the idea with this > > series (seemed to me) to be just to be able to "emulate" run a regular > > RX-side XDP program on egress for devices where this makes sense. 
> > > > If this series is not meant to implement that "emulation", but rather be > > usable for all devices, I really think we should go straight for the > > full TX hook as discussed earlier... > > > > The first patch set from Jason and Prashant started from the perspective > of offloading XDP programs for a guest. Independently, I was looking at > XDP in the TX path (now referred to as egress to avoid confusion with > the XDP_TX return type). Jason and Prashant were touching some of the > same code paths in the tun driver that I needed for XDP in the Tx path, > so we decided to consolidate and have XDP egress done first and then > offload of VMs as a followup. Offload in virtio_net can be done very > similar to how it is done in nfp -- the program is passed to the host as > a hardware level attach mode, and the driver verifies the program can be > offloaded (e.g., does not contain helpers that expose host specific data > like the fib lookup helper). > > At this point, you need to stop thinking solely from the perspective of > tun or tap and VM offload; think about this from the ability to run an > XDP program on egress path at an appropriate place in the NIC driver > that covers both skbs and xdp_frames (e.g., on a REDIRECT). Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames. > This has > been discussed before as a need (e.g, Toke's reference above), and I am > trying to get this initial support done. > > I very much wanted to avoid copy-paste-modify for the entire XDP API for > this. For the most part XDP means ebpf at the NIC driver / hardware > level (obviously with the exception of generic mode). The goal is > tempered with the need for the verifier to reject rx entries in the > xdp_md context. Hence the reason for use of an attach_type - existing > infrastructure to test and reject the accesses. 
> > That said, Martin's comment throws a wrench in the goal: if the existing > code does not enforce expected_attach_type then that option can not be > used in which case I guess I have to go with a new program type > (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), > has different return codes, etc. Talking about return codes: do the XDP return codes make sense for this EGRESS hook? (if thinking about this being egress on the real NIC). E.g. XDP_REDIRECT would have to be supported, which is interesting, but also has implications (like looping packets). E.g. what is the semantics/action of the XDP_TX return code? E.g. I'm considering adding an XDP_CONGESTED return code that can cause backpressure towards the qdisc layer. Also consider that if this EGRESS hook uses the standard prog type for XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff (and also convert SKBs to xdp_buff). Are we sure that reusing the same bpf prog type is the right choice?
On 1/26/20 5:49 AM, Jesper Dangaard Brouer wrote: >> This has >> been discussed before as a need (e.g, Toke's reference above), and I am >> trying to get this initial support done. >> >> I very much wanted to avoid copy-paste-modify for the entire XDP API for >> this. For the most part XDP means ebpf at the NIC driver / hardware >> level (obviously with the exception of generic mode). The goal is >> tempered with the need for the verifier to reject rx entries in the >> xdp_md context. Hence the reason for use of an attach_type - existing >> infrastructure to test and reject the accesses. >> >> That said, Martin's comment throws a wrench in the goal: if the existing >> code does not enforce expected_attach_type then that option can not be >> used in which case I guess I have to go with a new program type >> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), >> has different return codes, etc. > > Taking about return codes. Does XDP the return codes make sense for > this EGRESS hook? (if thinking about this being egress on the real NIC). > > E.g. XDP_REDIRECT would have to be supported, which is interesting, but > also have implications (like looping packets). > > E.g. what is the semantics/action of XDP_TX return code? This has been discussed. XDP_TX in the EGRESS path could arguably be equal to XDP_PASS. > > E.g. I'm considering adding a XDP_CONGESTED return code that can cause > backpressure towards qdisc layer. > > Also think about that if this EGRESS hook uses standard prog type for > XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff > (and also convert SKBs to xdp_buff). Why? What about the patch set requires that change to be done to have support for EGRESS path? > > Are we sure that reusing the same bpf prog type is the right choice? > Martin's comment about existing checking on the expected attach type is the only reason I have seen so far to not have the same program type. 
Looking at the helpers available to XDP programs, do you believe any of them should not be allowed in EGRESS programs? Do you have any reason to think that existing XDP capabilities should be prohibited or different for EGRESS? As mentioned earlier, the attach type can be used to have the verifier handle small context differences (and restrict helpers if needed).
On Sat, 25 Jan 2020 18:43:36 -0700, David Ahern wrote: > On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote: > > Jakub Kicinski <kuba@kernel.org> writes: > >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: > >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: > >>>> David Ahern <dsahern@kernel.org> writes: > >>>>> From: David Ahern <dahern@digitalocean.com> > >>>>> > >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached > >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as > >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation > >>>>> is that going forward egress path will acquire the various levels of > >>>>> attach - generic, driver and hardware. > >>>> > >>>> How would a 'hardware' attach work for this? As I said in my reply to > >>>> the previous patch, isn't this explicitly for emulating XDP on the other > >>>> end of a point-to-point link? How would that work with offloaded > >>>> programs? > >>> > >>> Nothing about this patch set is limited to point-to-point links. > >> > >> I struggle to understand of what the expected semantics of this new > >> hook are. Is this going to be run on all frames sent to the device > >> from the stack? All frames from the stack and from XDP_REDIRECT? > >> > >> A little hard to figure out the semantics when we start from a funky > >> device like tun :S > > > > Yes, that is also why I found this a bit weird. We have discussed plans > > for an XDP TX hook before: > > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx > > > > That TX hook would run for everything at TX, but it would be a separate > > program type with its own metadata access. Whereas the idea with this > > series (seemed to me) to be just to be able to "emulate" run a regular > > RX-side XDP program on egress for devices where this makes sense. 
> > > > If this series is not meant to implement that "emulation", but rather be > > usable for all devices, I really think we should go straight for the > > full TX hook as discussed earlier... > > The first patch set from Jason and Prashant started from the perspective > of offloading XDP programs for a guest. Independently, I was looking at > XDP in the TX path (now referred to as egress to avoid confusion with > the XDP_TX return type). I looked through the commit message and the cover letter again, and you never explain why you need the egress hook. Could you please clarify your needs? If it's container-related maybe what Daniel talked about at last netconf could be a better solution? I can't quite square the concept of XDP, which started as a close-to-the-metal BPF hook for HW drivers, with this heavily SW-focused addition. > Jason and Prashant were touching some of the > same code paths in the tun driver that I needed for XDP in the Tx path, > so we decided to consolidate and have XDP egress done first and then > offload of VMs as a followup. Offload in virtio_net can be done very > similar to how it is done in nfp -- the program is passed to the host as > a hardware level attach mode, and the driver verifies the program can be > offloaded (e.g., does not contain helpers that expose host specific data > like the fib lookup helper). <rant> I'd ask to please never compare this work to the nfp offload. Netronome was able to open up their NIC down to the instruction set level, with the JIT in tree and rest of the FW open source: https://github.com/Netronome/nic-firmware/ and that work is now used as precedent for something that risks turning the kernel into a damn control plane for proprietary clouds? I can see how they may seem similar in operational terms, but for people who care about open source they couldn't be more different. 
</rant> > At this point, you need to stop thinking solely from the perspective of > tun or tap and VM offload; think about this from the ability to run an > XDP program on egress path at an appropriate place in the NIC driver > that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has > been discussed before as a need (e.g, Toke's reference above), and I am > trying to get this initial support done. A TX hook related to queuing is a very different beast than just an RX hook flipped. The queuing is a problem that indeed needs work, but just adding a mirror RX hook does not solve that, and establishes semantics which may be counter productive. That's why I was asking for clear semantics. > I very much wanted to avoid copy-paste-modify for the entire XDP API for > this. For the most part XDP means ebpf at the NIC driver / hardware > level (obviously with the exception of generic mode). The goal is > tempered with the need for the verifier to reject rx entries in the > xdp_md context. Hence the reason for use of an attach_type - existing > infrastructure to test and reject the accesses. For the offload, host rx queue == dev PCI tx queue and vice versa. So other than the name the rejection makes no sense. Just add a union to xdp_md so both tx and rx names can be used.
On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:
> Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.
Any pointers on what for? Unless we see actual use cases there's
a justifiable concern of the entire thing just being an application of
"We can solve any problem by introducing an extra level of indirection."
On 1/26/20 3:11 PM, Jakub Kicinski wrote: > > I looked through the commit message and the cover letter again, and you > never explain why you need the egress hook. Could you please clarify > your needs? XDP is about efficient network processing - i.e., bypassing the Linux stack when it does not make sense for the person deploying some solution. XDP right now is Rx centric. I want to run an ebpf program in the Tx path of the NIC regardless of how the packet arrived at the device -- as an skb or an xdp_frame. There are options for running programs on skb-based packets (e.g., tc). There are *zero* options for manipulating/controlling/denying xdp_frames - e.g., one REDIRECTED from an ingress device.
On Sun, 26 Jan 2020 21:03:15 -0700, David Ahern wrote: > On 1/26/20 3:11 PM, Jakub Kicinski wrote: > > I looked through the commit message and the cover letter again, and you > > never explain why you need the egress hook. Could you please clarify > > your needs? > > XDP is about efficient network processing - ie., bypassing the Linux > stack when it does not make sense for the person deploying some > solution. XDP right now is Rx centric. Network hardware is also "Rx centric" and somehow it works... > I want to run an ebpf program in the Tx path of the NIC regardless of > how the packet arrived at the device -- as an skb or an xdp_frame. There > are options for running programs on skb-based packets (e.g., tc). There > are *zero* options for manipulating/controlling/denying xdp_frames - > e.g., one REDIRECTED from an ingress device. Okay - so no precise use case. You can run the same program at the end of whatever is doing the redirect (especially with Alexei's work on linking) and from cls_bpf 🤷‍♂️ I'm sure all driver authors can't wait to plough through their TX paths :/
On 1/27/20 7:16 AM, Jakub Kicinski wrote: >> I want to run an ebpf program in the Tx path of the NIC regardless of >> how the packet arrived at the device -- as an skb or an xdp_frame. There >> are options for running programs on skb-based packets (e.g., tc). There >> are *zero* options for manipulating/controlling/denying xdp_frames - >> e.g., one REDIRECTED from an ingress device. > > Okay - so no precise use case. You can run the same program at the For the sake of this discussion, consider a per-VM, per-tap device ebpf program for firewall / ACL / traffic verification (packet manipulation is also a possibility). Small, singly focused ebpf programs - attached at startup, driven by maps, cleaned up when the tap device is destroyed. (Remember: Tx for tap device is ingress to a VM.) Small, singly focused programs only run for traffic to be delivered to the VM. Setup is easy; cleanup automatic. How the traffic got there could vary - from a bridge (L2 forwarding), the host stack (L3 routing), or an XDP program on the host's NICs. It could have arrived at the host with an encap header which is removed and the inner packet forwarded to the VM. > end of whatever is doing the redirect (especially with Alexei's work There are use cases where they may make sense, but this is not one. > on linking) and from cls_bpf 🤷♂️ > cls_bpf is tc based == skb, no? I want to handle any packet, regardless of how it arrived at the device's xmit function.
On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote: > > end of whatever is doing the redirect (especially with Alexei's work > > There are use cases where they may make sense, but this is not one. > > > on linking) and from cls_bpf 🤷‍♂️ > > cls_bpf is tc based == skb, no? I want to handle any packet, regardless > of how it arrived at the device's xmit function. Yes, that's why I said you need the same rules in XDP before REDIRECT and cls_bpf. Sure it's more complex, but (1) it's faster to drop in the ingress prog before going through the entire redirect code and without parsing the packet twice and (2) no extra kernel code necessary. Even the VM "offload" work doesn't need this. Translating an XDP prog into a cls_bpf one should be trivial. Slap on some prologue to linearize the skb, move ctx offsets around, slap on an epilogue to convert exit codes, anything else? I'm wary of partially implemented XDP features; an EGRESS prog does us no good when most drivers haven't yet caught up with the REDIRECTs. And we're adding this before we considered the queuing problem. But if I'm alone in thinking this, and I'm not convincing anyone, we can move on :)
On Sun, 26 Jan 2020 14:17:01 -0800 Jakub Kicinski <kuba@kernel.org> wrote: > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote: > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames. > > Any pointers on what for? Unless we see actual use cases there's > a justifiable concern of the entire thing just being an application of > "We can solve any problem by introducing an extra level of indirection." I have two use-cases: (1) For XDP, easier handling of interface-specific settings on egress, e.g. pushing a VLAN-id, instead of having to figure this out in the RX hook. (I think this is also David Ahern's use-case) (2) I want this egress XDP hook to have the ability to signal backpressure. Today we have BQL in most drivers (which is essential to avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve. For use-case (2), we likely need a BPF-helper calling netif_tx_stop_queue(), or a return code that can stop the queue towards the higher layers.
On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote: > On Sun, 26 Jan 2020 14:17:01 -0800 > Jakub Kicinski <kuba@kernel.org> wrote: > > > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote: > > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames. > > > > Any pointers on what for? Unless we see actual use cases there's > > a justifiable concern of the entire thing just being an application of > > "We can solve any problem by introducing an extra level of indirection." > > I have two use-cases: > > (1) For XDP easier handling of interface specific setting on egress, > e.g. pushing a VLAN-id, instead of having to figure this out in RX hook. > (I think this is also David Ahern's use-case) Is it really useful to have a hook before multi-buffer frames are possible and perhaps TSO? The local TCP performance is going to tank with XDP enabled otherwise. > (2) I want this egress XDP hook to have the ability to signal > backpressure. Today we have BQL in most drivers (which is essential to > avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve. > > For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(), > or a return code that can stop the queue towards the higher layers. Agreed, although for that use case, I'm not sure if non-XDP frames have to pass through the hook. Hard to tell as the current patches don't touch on this use case.
David Ahern <dsahern@gmail.com> writes: > On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote: >> Jakub Kicinski <kuba@kernel.org> writes: >> >>> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote: >>>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote: >>>>> David Ahern <dsahern@kernel.org> writes: >>>>>> From: David Ahern <dahern@digitalocean.com> >>>>>> >>>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached >>>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as >>>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation >>>>>> is that going forward egress path will acquire the various levels of >>>>>> attach - generic, driver and hardware. >>>>> >>>>> How would a 'hardware' attach work for this? As I said in my reply to >>>>> the previous patch, isn't this explicitly for emulating XDP on the other >>>>> end of a point-to-point link? How would that work with offloaded >>>>> programs? >>>> >>>> Nothing about this patch set is limited to point-to-point links. >>> >>> I struggle to understand of what the expected semantics of this new >>> hook are. Is this going to be run on all frames sent to the device >>> from the stack? All frames from the stack and from XDP_REDIRECT? >>> >>> A little hard to figure out the semantics when we start from a funky >>> device like tun :S >> >> Yes, that is also why I found this a bit weird. We have discussed plans >> for an XDP TX hook before: >> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx >> >> That TX hook would run for everything at TX, but it would be a separate >> program type with its own metadata access. Whereas the idea with this >> series (seemed to me) to be just to be able to "emulate" run a regular >> RX-side XDP program on egress for devices where this makes sense. 
>> >> If this series is not meant to implement that "emulation", but rather be >> usable for all devices, I really think we should go straight for the >> full TX hook as discussed earlier... >> > > The first patch set from Jason and Prashant started from the perspective > of offloading XDP programs for a guest. Independently, I was looking at > XDP in the TX path (now referred to as egress to avoid confusion with > the XDP_TX return type). Jason and Prashant were touching some of the > same code paths in the tun driver that I needed for XDP in the Tx path, > so we decided to consolidate and have XDP egress done first and then > offload of VMs as a followup. Offload in virtio_net can be done very > similar to how it is done in nfp -- the program is passed to the host as > a hardware level attach mode, and the driver verifies the program can be > offloaded (e.g., does not contain helpers that expose host specific data > like the fib lookup helper). > > At this point, you need to stop thinking solely from the perspective of > tun or tap and VM offload; think about this from the ability to run an > XDP program on egress path at an appropriate place in the NIC driver > that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has > been discussed before as a need (e.g, Toke's reference above), and I am > trying to get this initial support done. Right, so what we're discussing here *is* the "full" egress hook we've discussed earlier. I thought this was still the other thing, which is why I was confused. > I very much wanted to avoid copy-paste-modify for the entire XDP API for > this. For the most part XDP means ebpf at the NIC driver / hardware > level (obviously with the exception of generic mode). The goal is > tempered with the need for the verifier to reject rx entries in the > xdp_md context. Hence the reason for use of an attach_type - existing > infrastructure to test and reject the accesses. 
> > That said, Martin's comment throws a wrench in the goal: if the existing > code does not enforce expected_attach_type then that option can not be > used in which case I guess I have to go with a new program type > (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), > has different return codes, etc. In any case an egress program will differ in: - The context object (the RX-related fields will be invalid on egress, and we'll probably want to add new TX-related ones, such as HW TX-queue occupancy). - The return code semantics (even if XDP_TX becomes equivalent to XDP_PASS, that is still a semantic difference from the RX side; and it's not necessarily clear whether we'll want to support REDIRECT on the egress side either, is it?) So we'll have to disambiguate between the two different types of programs. Which means that what we're discussing is really whether that disambiguation should be encoded in the program type, or in the attach type. IMO, making it a separate program type is a clearer and more explicit UAPI. The verifier could still treat the two program types as basically equivalent except for those cases where there has to be a difference anyway. So it seems to me that all you are saving by using attach_type instead of program type is the need for a new enum value and a bunch of additions to switch statements? Or am I wildly underestimating the effort to add a new program type? -Toke
Jakub Kicinski <kuba@kernel.org> writes: > On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote: >> On Sun, 26 Jan 2020 14:17:01 -0800 >> Jakub Kicinski <kuba@kernel.org> wrote: >> >> > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote: >> > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames. >> > >> > Any pointers on what for? Unless we see actual use cases there's >> > a justifiable concern of the entire thing just being an application of >> > "We can solve any problem by introducing an extra level of indirection." >> >> I have two use-cases: >> >> (1) For XDP easier handling of interface specific setting on egress, >> e.g. pushing a VLAN-id, instead of having to figure this out in RX hook. >> (I think this is also David Ahern's use-case) > > Is it really useful to have a hook before multi-buffer frames are > possible and perhaps TSO? The local TCP performance is going to tank > with XDP enabled otherwise. For a software router (i.e., something that mostly forwards packets) it can still be useful without multi-buffer. But yeah, that is definitely something we need to solve, regardless of where this goes. >> (2) I want this egress XDP hook to have the ability to signal >> backpressure. Today we have BQL in most drivers (which is essential to >> avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve. >> >> For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(), >> or a return code that can stop the queue towards the higher layers. > > Agreed, although for that use case, I'm not sure if non-XDP frames > have to pass trough the hook. Hard to tell as the current patches > don't touch on this use case. I think it would be weird and surprising if it *doesn't* see packets from the stack. On RX, XDP sees everything; the natural expectation would be that this was also the case on TX, no? -Toke
[ skipping the first part to just comment on the below: ] > I'm weary of partially implemented XDP features, EGRESS prog does us > no good when most drivers didn't yet catch up with the REDIRECTs. I kinda agree with this; but on the other hand, if we have to wait for all drivers to catch up, that would mean we couldn't add *anything* new that requires driver changes, which is not ideal either :/ > And we're adding this before we considered the queuing problem. > > But if I'm alone in thinking this, and I'm not convincing anyone we > can move on :) I do share your concern that this will end up being incompatible with whatever solution we end up with for queueing. However, I don't necessarily think it will: I view the XDP egress hook as something that in any case will run *after* packets are dequeued from whichever intermediate queueing it has been through (if any). I think such a hook is missing in any case; for instance, it's currently impossible to implement something like CoDel (which needs to know how long a packet spent in the queue) in eBPF. -Toke
On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote: > > I'm weary of partially implemented XDP features, EGRESS prog does us > > no good when most drivers didn't yet catch up with the REDIRECTs. > > I kinda agree with this; but on the other hand, if we have to wait for > all drivers to catch up, that would mean we couldn't add *anything* > new that requires driver changes, which is not ideal either :/ If EGRESS is only for XDP frames we could try to hide the handling in the core (with slight changes to XDP_TX handling in the drivers), making drivers smaller and XDP feature velocity higher. I think loading the drivers with complexity is hurting us in so many ways.. > > And we're adding this before we considered the queuing problem. > > > > But if I'm alone in thinking this, and I'm not convincing anyone we > > can move on :) > > I do share your concern that this will end up being incompatible with > whatever solution we end up with for queueing. However, I don't > necessarily think it will: I view the XDP egress hook as something > that in any case will run *after* packets are dequeued from whichever > intermediate queueing it has been through (if any). I think such a > hook is missing in any case; for instance, it's currently impossible > to implement something like CoDel (which needs to know how long a > packet spent in the queue) in eBPF. Possibly 🤔 I don't have a good mental image of how the XDP queuing would work. Maybe once the queuing primitives are defined they can easily be hooked into the Qdisc layer. With Martin's recent work all we need is a fifo that can store skb pointers, really... It'd be good if the BPF queuing could replace TC Qdiscs, rather than layer underneath.
Jakub Kicinski <kuba@kernel.org> writes: > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote: >> > I'm weary of partially implemented XDP features, EGRESS prog does us >> > no good when most drivers didn't yet catch up with the REDIRECTs. >> >> I kinda agree with this; but on the other hand, if we have to wait for >> all drivers to catch up, that would mean we couldn't add *anything* >> new that requires driver changes, which is not ideal either :/ > > If EGRESS is only for XDP frames we could try to hide the handling in > the core (with slight changes to XDP_TX handling in the drivers), > making drivers smaller and XDP feature velocity higher. But if it's only for XDP frames that are REDIRECTed, then one might as well perform whatever action the TX hook was doing before REDIRECTing (as you yourself argued)... :) > I think loading the drivers with complexity is hurting us in so many > ways.. Yeah, but having the low-level details available to the XDP program (such as HW queue occupancy for the egress hook) is one of the benefits of XDP, isn't it? Ultimately, I think Jesper's idea of having drivers operate exclusively on XDP frames and have the skb handling entirely in the core is an intriguing way to resolve this problem. Though this is obviously a long-term thing, and one might reasonably doubt we'll ever get there for existing drivers... >> > And we're adding this before we considered the queuing problem. >> > >> > But if I'm alone in thinking this, and I'm not convincing anyone we >> > can move on :) >> >> I do share your concern that this will end up being incompatible with >> whatever solution we end up with for queueing. However, I don't >> necessarily think it will: I view the XDP egress hook as something >> that in any case will run *after* packets are dequeued from whichever >> intermediate queueing it has been through (if any). 
I think such a >> hook is missing in any case; for instance, it's currently impossible >> to implement something like CoDel (which needs to know how long a >> packet spent in the queue) in eBPF. > > Possibly 🤔 I don't have a good mental image of how the XDP queuing > would work. > > Maybe once the queuing primitives are defined they can easily be > hooked into the Qdisc layer. With Martin's recent work all we need is > a fifo that can store skb pointers, really... > > It'd be good if the BPF queuing could replace TC Qdiscs, rather than > layer underneath. Hmm, hooking into the existing qdisc layer is an interesting idea. Ultimately, I fear it won't be feasible for performance reasons; but it's certainly something to consider. Maybe at least as an option? -Toke
On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote: > Jakub Kicinski <kuba@kernel.org> writes: > > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote: > >> > I'm weary of partially implemented XDP features, EGRESS prog does us > >> > no good when most drivers didn't yet catch up with the REDIRECTs. > >> > >> I kinda agree with this; but on the other hand, if we have to wait for > >> all drivers to catch up, that would mean we couldn't add *anything* > >> new that requires driver changes, which is not ideal either :/ > > > > If EGRESS is only for XDP frames we could try to hide the handling in > > the core (with slight changes to XDP_TX handling in the drivers), > > making drivers smaller and XDP feature velocity higher. > > But if it's only for XDP frames that are REDIRECTed, then one might as > well perform whatever action the TX hook was doing before REDIRECTing > (as you yourself argued)... :) Right, that's why I think the design needs to start from queuing which can't be done today, and has to be done in context of the destination. Solving queuing justifies the added complexity if you will :) > > I think loading the drivers with complexity is hurting us in so many > > ways.. > > Yeah, but having the low-level details available to the XDP program > (such as HW queue occupancy for the egress hook) is one of the benefits > of XDP, isn't it? I think I glossed over the hope for having access to HW queue occupancy - what exactly are you after? I don't think one can get anything beyond a BQL type granularity. Reading over PCIe is out of question, device write back on high granularity would burn through way too much bus throughput. > Ultimately, I think Jesper's idea of having drivers operate exclusively > on XDP frames and have the skb handling entirely in the core is an > intriguing way to resolve this problem. Though this is obviously a > long-term thing, and one might reasonably doubt we'll ever get there for > existing drivers... 
> > >> > And we're adding this before we considered the queuing problem. > >> > > >> > But if I'm alone in thinking this, and I'm not convincing anyone we > >> > can move on :) > >> > >> I do share your concern that this will end up being incompatible with > >> whatever solution we end up with for queueing. However, I don't > >> necessarily think it will: I view the XDP egress hook as something > >> that in any case will run *after* packets are dequeued from whichever > >> intermediate queueing it has been through (if any). I think such a > >> hook is missing in any case; for instance, it's currently impossible > >> to implement something like CoDel (which needs to know how long a > >> packet spent in the queue) in eBPF. > > > > Possibly 🤔 I don't have a good mental image of how the XDP queuing > > would work. > > > > Maybe once the queuing primitives are defined they can easily be > > hooked into the Qdisc layer. With Martin's recent work all we need is > > a fifo that can store skb pointers, really... > > > > It'd be good if the BPF queuing could replace TC Qdiscs, rather than > > layer underneath. > > Hmm, hooking into the existing qdisc layer is an interesting idea. > Ultimately, I fear it won't be feasible for performance reasons; but > it's certainly something to consider. Maybe at least as an option? For forwarding sure, but for locally generated traffic.. 🤷♂️
On 1/28/20 6:57 AM, Jakub Kicinski wrote: > On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote: >>> end of whatever is doing the redirect (especially with Alexei's work >> >> There are use cases where they may make sense, but this is not one. >> >>> on linking) and from cls_bpf 🤷♂️ >> >> cls_bpf is tc based == skb, no? I want to handle any packet, regardless >> of how it arrived at the device's xmit function. > > Yes, that's why I said you need the same rules in XDP before REDIRECT > and cls_bpf. Sure it's more complex, but (1) it's faster to drop in > the ingress prog before going though the entire redirect code and > without parsing the packet twice and (2) no extra kernel code necessary. you are making a lot of assumptions and frankly it's the 'c' word (complex) that I want to avoid. I do not believe in the OVS style of packet processing - one gigantic program with a bunch of logic and data driven flow lookups that affect every packet. I prefer simple, singly focused programs logically concatenated to make decisions. Simpler management, simpler lifecycle. The scope, scale and lifecycle management of VMs/containers is just as important as minimizing the cycles spent per packet. XDP in the Tx path is a missing primitive to make life simple. As an example, the program on the ingress NICs to the host can be nothing more than a demux'er - use the ethernet header and vlan (preferably with vlan offload enabled) to do a <vlan,mac> lookup. No packet parsing at all at this level. The lookup returns an index of where to redirect the packet (e.g., the tap device of a VM). At the same time, packets can hit the "slow path" - processing support from the full stack when it is needed and still packets end up at the tap device. 
*IF* there is an ingress ACL for that tap device managed by the host, it can exist in 1 place - a program and map attached to the tap device limited to that tap device's networking function (VMs can have multiple connections to different networks with different needs at this point), and the program gets run for both XDP fast path and skb slow path.
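The <vlan,mac> demux'er described above can be sketched in plain C (all names here are hypothetical; in the actual XDP program the table would be a BPF hash map, a hit would become a bpf_redirect_map() toward the tap device, and a miss would become XDP_PASS to the slow path):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical demux key: VLAN id plus destination MAC, as described above. */
struct demux_key {
    unsigned short vlan_id;
    unsigned char  mac[6];
};

/* Stand-in for a BPF hash map entry: key -> redirect target ifindex. */
struct demux_entry {
    struct demux_key key;
    int              target_ifindex;  /* e.g., a VM's tap device */
};

static struct demux_entry demux_table[] = {
    { { 100, { 0x02, 0, 0, 0, 0, 0x01 } }, 10 },
    { { 200, { 0x02, 0, 0, 0, 0, 0x02 } }, 11 },
};

/* Returns the target ifindex on a hit, or -1 for the slow path
 * (no packet parsing beyond the Ethernet header and VLAN tag). */
static int demux_lookup(unsigned short vlan_id, const unsigned char *dmac)
{
    for (size_t i = 0; i < sizeof(demux_table) / sizeof(demux_table[0]); i++) {
        struct demux_entry *e = &demux_table[i];
        if (e->key.vlan_id == vlan_id && !memcmp(e->key.mac, dmac, 6))
            return e->target_ifindex;
    }
    return -1;  /* would map to XDP_PASS: BUM traffic goes to the full stack */
}
```

The point of the sketch is how little the ingress program does: one exact-match lookup, no flow table, with everything else (ACLs, etc.) living on the tap device itself.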
On 2/1/20 10:08 AM, Jakub Kicinski wrote: > If EGRESS is only for XDP frames we could try to hide the handling in > the core (with slight changes to XDP_TX handling in the drivers), It is not. I have said multiple times it is to work on ALL packets that hit the xmit function, both skbs and xdp_frames.
On 2/1/20 9:03 AM, Toke Høiland-Jørgensen wrote: > I think it would be weird and surprising if it *doesn't* see packets > from the stack. On RX, XDP sees everything; the natural expectation > would be that this was also the case on TX, no? > Yes. XDP_EGRESS should be symmetrical with XDP on the Rx side.
On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote: > > In any case an egress program will differ in: > > - The context object (the RX-related fields will be invalid on egress, > and we'll probably want to add new TX-related ones, such as HW > TX-queue occupancy). Jakub has suggested that rx_queue_index can be a union with tx_queue_index; the former for the Rx path and the latter for the egress. The rest of the fields in xdp_md are legit for either direction. > > - The return code semantics (even if XDP_TX becomes equivalent to > XDP_PASS, that is still a semantic difference from the RX side; and > it's not necessarily clear whether we'll want to support REDIRECT on > the egress side either, is it?) Why should REDIRECT not be allowed in the egress path? e.g., service chaining or capturing suspicious packets (e.g., encap with a header and redirect somewhere for analysis). > > So we'll have to disambiguate between the two different types of > programs. Which means that what we're discussing is really whether that > disambiguation should be encoded in the program type, or in the attach > type. IMO, making it a separate program type is a clearer and more > explicit UAPI. The verifier could still treat the two program types as > basically equivalent except for those cases where there has to be a > difference anyway. So it seems to me that all you are saving by using > attach_type instead of program type is the need for a new enum value and > a bunch of additions to switch statements? Or am I wildly > underestimating the effort to add a new program type? > IMHO that is duplicating code and APIs for no real reason. XDP refers to fast path processing, the only difference is where the program is attached - Rx side or Tx side.
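For illustration, the rx/tx queue index union Jakub suggested might look like the sketch below (the real struct xdp_md lives in include/uapi/linux/bpf.h; this mirrors its field set at the time but is not the actual patch):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of struct xdp_md with the suggested union. */
struct xdp_md_sketch {
    unsigned int data;
    unsigned int data_end;
    unsigned int data_meta;
    unsigned int ingress_ifindex;     /* rxq->dev->ifindex */
    union {
        unsigned int rx_queue_index;  /* valid when attached on the Rx path */
        unsigned int tx_queue_index;  /* valid when attached on egress */
    };
};
```

Because the union overlays the two fields at the same offset, the struct layout is unchanged and already-compiled Rx programs keep working; only the verifier-level meaning of the field differs per attach direction.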
On 1/25/20 9:54 PM, Alexei Starovoitov wrote: > On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote: >> >> That said, Martin's comment throws a wrench in the goal: if the existing >> code does not enforce expected_attach_type then that option can not be >> used in which case I guess I have to go with a new program type >> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md), >> has different return codes, etc. > > This is acceptable risk. We did such thing in the past. The chances of > user space breakage are extremely low. > Ultimately that is a decision for the maintainers. Code-wise, both iproute2 and libbpf always initialize bpf_attr to 0, and given the many uses of that union it seems odd that someone would initialize one field at a time. Unless someone comes back with a strong 'hell, no' I am planning to send the next RFC version with the current API.
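The reason the breakage risk is low: loaders zero the whole union up front, so a field the kernel only starts enforcing later reads as 0 in old binaries. A toy illustration (fake stand-in type, not the real union bpf_attr from include/uapi/linux/bpf.h):

```c
#include <assert.h>
#include <string.h>

/* Fake stand-in for union bpf_attr; the real one is far larger. */
union fake_bpf_attr {
    struct {
        unsigned int prog_type;
        unsigned int expected_attach_type;
    };
    char pad[64];
};

/* Models an "old" loader written before expected_attach_type existed:
 * it zeroes the union and sets only the fields it knows about... */
static unsigned int load_prog_old_loader(union fake_bpf_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->prog_type = 6;  /* hypothetical prog type value */
    /* ...so the new field arrives at the kernel as 0 (the default). */
    return attr->expected_attach_type;
}
```

If the kernel treats 0 as "legacy attach semantics", existing binaries built this way keep working even after enforcement is added.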
On Sun, 2 Feb 2020 10:45:27 -0700, David Ahern wrote: > On 2/1/20 10:08 AM, Jakub Kicinski wrote: > > If EGRESS is only for XDP frames we could try to hide the handling in > > the core (with slight changes to XDP_TX handling in the drivers), > > It is not. I have said multiple times it is to work on ALL packets that > hit the xmit function, both skbs and xdp_frames. Okay, I should have said "was to be", I guess? If EGRESS _was_to_be_ only for XDP frames we could try to hide the handling in the core (with slight changes to XDP_TX handling in the drivers), making drivers smaller and XDP feature velocity higher. I understand you'd like a hook for all packets, that is clear. I'm just trying to highlight the cost and consequences.
On Sun, 2 Feb 2020 10:43:58 -0700, David Ahern wrote: > you are making a lot of assumptions and frankly it's the 'c' word > (complex) that I want to avoid. I do not believe in the OVS style of > packet processing - one gigantic program with a bunch of logic and data > driven flow lookups that affect every packet. I prefer simple, singly > focused programs logically concatenated to make decisions. Simpler > management, simpler lifecycle. The scope, scale and lifecycle management > of VMs/containers is just as important as minimizing the cycles spent > per packet. XDP in the Tx path is a missing primitive to make life simple. > > As an example, the program on the ingress NICs to the host can be > nothing more than a demux'er - use the ethernet header and vlan > (preferably with vlan offload enabled) to do a <vlan,mac> lookup. No > packet parsing at all at this level. The lookup returns an index of > where to redirect the packet (e.g., the tap device of a VM). At the same > time, packets can hit the "slow path" - processing support from the full > stack when it is needed and still packets end up at the tap device. *IF* > there is an ingress ACL for that tap device managed by the host, it can > exist in 1 place - a program and map attached to the tap device limited > to that tap device's networking function (VMs can have multiple > connections to different networks with different needs at this point), > and the program gets run for both XDP fast path and skb slow path. I think our perspectives will be difficult to reconcile. My reading is that you look at this from the slow software device perspective. Of which there are few (and already loaded with hacks). And you want the control plane to be simple rather than performant. I look at this from the HW driver perspective, of which there are many. Saying that it's fine to load TX paths of all drivers with extra code, and that it's fine to effectively disable SG in the entire system just doesn't compute. 
Is that a fair summary of the arguments?
On 2/2/20 12:31 PM, Jakub Kicinski wrote: > > I think our perspectives will be difficult to reconcile. > > My reading is that you look at this from the slow software device > perspective. Of which there are few (and already loaded with hacks). > And you want the control plane to be simple rather than performant. > > I look at this from the HW driver perspective, of which there are many. > Saying that it's fine to load TX paths of all drivers with extra > code, and that it's fine to effectively disable SG in the entire > system just doesn't compute. > > Is that a fair summary of the arguments? > I do not believe so. My argument is about allowing XDP and the full stack to work synergistically: fast path for known unicast traffic, and slow path for BUM traffic, without duplicating config such as ACLs and without forcing a specific design (such as forcing all programs into the XDP Rx path, which does not work for all traffic patterns). For example, if I am willing to trade a few cycles of overhead (redirect processing) for a simpler overall solution then I should be able to do that. The fast path with XDP is still faster than the full host stack for known unicast traffic.
Jakub Kicinski <kuba@kernel.org> writes: > On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote: >> Jakub Kicinski <kuba@kernel.org> writes: >> > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote: >> >> > I'm weary of partially implemented XDP features, EGRESS prog does us >> >> > no good when most drivers didn't yet catch up with the REDIRECTs. >> >> >> >> I kinda agree with this; but on the other hand, if we have to wait for >> >> all drivers to catch up, that would mean we couldn't add *anything* >> >> new that requires driver changes, which is not ideal either :/ >> > >> > If EGRESS is only for XDP frames we could try to hide the handling in >> > the core (with slight changes to XDP_TX handling in the drivers), >> > making drivers smaller and XDP feature velocity higher. >> >> But if it's only for XDP frames that are REDIRECTed, then one might as >> well perform whatever action the TX hook was doing before REDIRECTing >> (as you yourself argued)... :) > > Right, that's why I think the design needs to start from queuing which > can't be done today, and has to be done in context of the destination. > Solving queuing justifies the added complexity if you will :) Right, that makes sense. Hmm, I wonder if a TX driver hook is enough? I.e., a driver callback in the TX path could also just queue that packet (returning XDP_QUEUED?), without the driver needing any more handling than it will already have? I'm spitballing a little bit here, but it may be quite straight-forward? :) >> > I think loading the drivers with complexity is hurting us in so many >> > ways.. >> >> Yeah, but having the low-level details available to the XDP program >> (such as HW queue occupancy for the egress hook) is one of the benefits >> of XDP, isn't it? > > I think I glossed over the hope for having access to HW queue occupancy > - what exactly are you after? > > I don't think one can get anything beyond a BQL type granularity. 
> Reading over PCIe is out of question, device write back on high > granularity would burn through way too much bus throughput. > >> Ultimately, I think Jesper's idea of having drivers operate exclusively >> on XDP frames and have the skb handling entirely in the core is an >> intriguing way to resolve this problem. Though this is obviously a >> long-term thing, and one might reasonably doubt we'll ever get there for >> existing drivers... >> >> >> > And we're adding this before we considered the queuing problem. >> >> > >> >> > But if I'm alone in thinking this, and I'm not convincing anyone we >> >> > can move on :) >> >> >> >> I do share your concern that this will end up being incompatible with >> >> whatever solution we end up with for queueing. However, I don't >> >> necessarily think it will: I view the XDP egress hook as something >> >> that in any case will run *after* packets are dequeued from whichever >> >> intermediate queueing it has been through (if any). I think such a >> >> hook is missing in any case; for instance, it's currently impossible >> >> to implement something like CoDel (which needs to know how long a >> >> packet spent in the queue) in eBPF. >> > >> > Possibly 🤔 I don't have a good mental image of how the XDP queuing >> > would work. >> > >> > Maybe once the queuing primitives are defined they can easily be >> > hooked into the Qdisc layer. With Martin's recent work all we need is >> > a fifo that can store skb pointers, really... >> > >> > It'd be good if the BPF queuing could replace TC Qdiscs, rather than >> > layer underneath. >> >> Hmm, hooking into the existing qdisc layer is an interesting idea. >> Ultimately, I fear it won't be feasible for performance reasons; but >> it's certainly something to consider. Maybe at least as an option? > > For forwarding sure, but for locally generated traffic.. 🤷♂️ Right, well, for locally generated traffic we already have the qdisc? 
This was kinda the reason why my original thought was to add the queueing for XDP only at REDIRECT time. I.e., we already have the xdp_dev_bulk_queue (which now even lives in struct net_device), so we could "just" extend that and make it into a proper queueing structure, and call it a day? :) But from your comments it sounds like when you're saying "BPF queueing" you mean that the queueing itself should be programmable using BPF? To me, the most urgent thing has been to figure out a way to do *any* kind of queueing with XDP-forwarded frames, so I haven't given much thought to what a "fully programmable" queue would look like... Do you have any thoughts? -Toke
David Ahern <dsahern@gmail.com> writes: > On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote: >> >> In any case an egress program will differ in: >> >> - The context object (the RX-related fields will be invalid on egress, >> and we'll probably want to add new TX-related ones, such as HW >> TX-queue occupancy). > > Jakub has suggested that rx_queue_index can be a union with > tx_queue_index; the former for the Rx path and the latter for the egress. > > The rest of the fields in xdp_md are legit for either direction. Right, okay, ifindex and queue index could make sense in both directions. But it would still mean that we would limit ourselves to only adding new fields that would work on both RX and TX. Maybe that is OK, though. >> - The return code semantics (even if XDP_TX becomes equivalent to >> XDP_PASS, that is still a semantic difference from the RX side; and >> it's not necessarily clear whether we'll want to support REDIRECT on >> the egress side either, is it?) > > Why should REDIRECT not be allowed in the egress path? e.g., service > chaining or capturing suspicious packets (e.g., encap with a header and > redirect somewhere for analysis). Implementation and deployment complexity, mostly (loops!)? I do agree that it would be nice to allow it, I'm just not entirely convinced that it's feasible... >> So we'll have to disambiguate between the two different types of >> programs. Which means that what we're discussing is really whether that >> disambiguation should be encoded in the program type, or in the attach >> type. IMO, making it a separate program type is a clearer and more >> explicit UAPI. The verifier could still treat the two program types as >> basically equivalent except for those cases where there has to be a >> difference anyway. So it seems to me that all you are saving by using >> attach_type instead of program type is the need for a new enum value and >> a bunch of additions to switch statements? 
Or am I wildly >> underestimating the effort to add a new program type? >> > > IMHO that is duplicating code and APIs for no real reason. XDP refers to > fast path processing, the only difference is where the program is > attached - Rx side or Tx side. Sure, if we can guarantee API and semantic equivalence. I.e., if any XDP program really *can* be attached as both an RX and TX program, I have no objections to re-using the program type. But if it turns out that there *is* a difference, we're making implicit subtypes anyway, in which case I think it's better to be explicit about the difference. -Toke
Oops, I see I forgot to reply to this bit: >> Yeah, but having the low-level details available to the XDP program >> (such as HW queue occupancy for the egress hook) is one of the benefits >> of XDP, isn't it? > > I think I glossed over the hope for having access to HW queue occupancy > - what exactly are you after? > > I don't think one can get anything beyond a BQL type granularity. > Reading over PCIe is out of question, device write back on high > granularity would burn through way too much bus throughput. This was Jesper's idea originally, so maybe he can explain better; but as I understood it, he basically wanted to expose the same information that BQL has to eBPF. Making it possible for an eBPF program to either (re-)implement BQL with its own custom policy, or react to HWQ pressure in some other way, such as by load balancing to another interface. -Toke
On Mon, 03 Feb 2020 21:13:24 +0100 Toke Høiland-Jørgensen <toke@redhat.com> wrote: > Oops, I see I forgot to reply to this bit: > > >> Yeah, but having the low-level details available to the XDP program > >> (such as HW queue occupancy for the egress hook) is one of the benefits > >> of XDP, isn't it? > > > > I think I glossed over the hope for having access to HW queue occupancy > > - what exactly are you after? > > > > I don't think one can get anything beyond a BQL type granularity. > > Reading over PCIe is out of question, device write back on high > > granularity would burn through way too much bus throughput. > > This was Jesper's idea originally, so maybe he can explain better; but > as I understood it, he basically wanted to expose the same information > that BQL has to eBPF. Making it possible for an eBPF program to either > (re-)implement BQL with its own custom policy, or react to HWQ pressure > in some other way, such as by load balancing to another interface. Yes, and I also have plans that go beyond BQL. But let me start with explaining the BQL part, and answer Toke's question below. On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote: > [...] Hmm, I wonder if a TX driver hook is enough? Short answer is no, a TX driver hook is not enough. The queue state info the TX driver hook has access to needs to be updated once the hardware has "confirmed" the TX-DMA operation has completed. For BQL/DQL this update happens during TX-DMA completion/cleanup (see the call sites for netdev_tx_completed_queue()). (As Jakub wisely points out, we cannot query the device directly due to performance implications). It doesn't need to be a new BPF hook, just something that updates the queue state info (we could piggyback on the netdev_tx_completed_queue() call or give the TX hook access to dev_queue->dql). Regarding "where is the queue": For me the XDP-TX queue is the NIC hardware queue, which this BPF hook has some visibility into and can do filtering on. 
(Imagine that my TX queue is bandwidth limited, then I can shrink the packet size and still send a "congestion" packet to my receiver). The bigger picture is that I envision the XDP-TX/egress hook can open-up for taking advantage of NIC hardware TX queue features. This also ties into the queue abstraction work by Björn+Magnus. Today NIC hardware can do a million TX-queues, and hardware can also do rate limiting per queue. Thus, I also envision that the XDP-TX/egress hook can choose/change the TX queue the packet is queue/sent on (we can likely just overload the XDP_REDIRECT and have a new bpf map type for this).
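For reference, the BQL/DQL bookkeeping being discussed can be sketched as toy inflight-byte accounting (the kernel's struct dql also adapts the limit dynamically, which is omitted here; names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy byte-queue-limit state, a simplified stand-in for struct dql. */
struct toy_bql {
    unsigned long sent;       /* bytes handed to the NIC for TX-DMA */
    unsigned long completed;  /* bytes confirmed by TX completion/cleanup */
    unsigned long limit;      /* max bytes allowed in flight */
};

static bool toy_bql_stopped(const struct toy_bql *q)
{
    return q->sent - q->completed > q->limit;
}

/* Driver xmit path: analogous to netdev_tx_sent_queue(). */
static void toy_tx_sent(struct toy_bql *q, unsigned long bytes)
{
    q->sent += bytes;
}

/* TX completion/cleanup path: analogous to netdev_tx_completed_queue().
 * This is the point where the queue state an egress hook would read
 * gets refreshed, per the discussion above. */
static void toy_tx_completed(struct toy_bql *q, unsigned long bytes)
{
    q->completed += bytes;
}
```

An egress BPF hook with read access to this state could stop queueing (or shrink/redirect packets) whenever `sent - completed` approaches the limit, without ever touching the device over PCIe.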
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Mon, 03 Feb 2020 21:13:24 +0100
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Oops, I see I forgot to reply to this bit:
>>
>> >> Yeah, but having the low-level details available to the XDP program
>> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> of XDP, isn't it?
>> >
>> > I think I glossed over the hope for having access to HW queue occupancy
>> > - what exactly are you after?
>> >
>> > I don't think one can get anything beyond a BQL type granularity.
>> > Reading over PCIe is out of question, device write back on high
>> > granularity would burn through way too much bus throughput.
>>
>> This was Jesper's idea originally, so maybe he can explain better; but
>> as I understood it, he basically wanted to expose the same information
>> that BQL has to eBPF. Making it possible for an eBPF program to either
>> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> in some other way, such as by load balancing to another interface.
>
> Yes, and I also have plans that goes beyond BQL. But let me start with
> explaining the BQL part, and answer Toke's question below.
>
> On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
>> [...] Hmm, I wonder if a TX driver hook is enough?
>
> Short answer is no, a TX driver hook is not enough. The queue state
> info the TX driver hook have access to, needs to be updated once the
> hardware have "confirmed" the TX-DMA operation have completed. For
> BQL/DQL this update happens during TX-DMA completion/cleanup (code
> see call sites for netdev_tx_completed_queue()). (As Jakub wisely
> point out we cannot query the device directly due to performance
> implications). It doesn't need to be a new BPF hook, just something
> that update the queue state info (we could piggy back on the
> netdev_tx_completed_queue() call or give TX hook access to
> dev_queue->dql).

The question is whether this can't simply be done through bpf helpers?
bpf_get_txq_occupancy(ifindex, txqno)?

> Regarding "where is the queue": For me the XDP-TX queue is the NIC
> hardware queue, that this BPF hook have some visibility into and can do
> filtering on. (Imagine that my TX queue is bandwidth limited, then I
> can shrink the packet size and still send a "congestion" packet to my
> receiver).

I'm not sure the hardware queues will be enough, though. Unless I'm
misunderstanding something, hardware queues are (1) fairly short and (2)
FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
still need a software queueing layer on top of the hardware queue.

If the hardware is EDT-aware this may change, I suppose, but I'm not
sure if we can design the XDP queueing primitives with this assumption? :)

> The bigger picture is that I envision the XDP-TX/egress hook can
> open-up for taking advantage of NIC hardware TX queue features. This
> also ties into the queue abstraction work by Björn+Magnus. Today NIC
> hardware can do a million TX-queues, and hardware can also do rate
> limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> can choose/change the TX queue the packet is queue/sent on (we can
> likely just overload the XDP_REDIRECT and have a new bpf map type for
> this).

Yes, I think I mentioned in another email that putting all the queueing
smarts into the redirect map was also something I'd considered (well, I
do think we've discussed this in the past, so maybe not so surprising if
we're thinking along the same lines) :)

But the implication of this is also that an actual TX hook in the driver
need not necessarily incorporate a lot of new functionality, as it can
control the queueing through a combination of BPF helpers and map
updates?

-Toke
On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > On Mon, 03 Feb 2020 21:13:24 +0100
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> >> Oops, I see I forgot to reply to this bit:
> >>
> >> >> Yeah, but having the low-level details available to the XDP program
> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
> >> >> of XDP, isn't it?
> >> >
> >> > I think I glossed over the hope for having access to HW queue occupancy
> >> > - what exactly are you after?
> >> >
> >> > I don't think one can get anything beyond a BQL type granularity.
> >> > Reading over PCIe is out of question, device write back on high
> >> > granularity would burn through way too much bus throughput.
> >>
> >> This was Jesper's idea originally, so maybe he can explain better; but
> >> as I understood it, he basically wanted to expose the same information
> >> that BQL has to eBPF. Making it possible for an eBPF program to either
> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
> >> in some other way, such as by load balancing to another interface.
> >
> > Yes, and I also have plans that goes beyond BQL. But let me start with
> > explaining the BQL part, and answer Toke's question below.
> >
> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
> >> [...] Hmm, I wonder if a TX driver hook is enough?
> >
> > Short answer is no, a TX driver hook is not enough. The queue state
> > info the TX driver hook have access to, needs to be updated once the
> > hardware have "confirmed" the TX-DMA operation have completed. For
> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
> > see call sites for netdev_tx_completed_queue()). (As Jakub wisely
> > point out we cannot query the device directly due to performance
> > implications). It doesn't need to be a new BPF hook, just something
> > that update the queue state info (we could piggy back on the
> > netdev_tx_completed_queue() call or give TX hook access to
> > dev_queue->dql).

Interesting, that model does make sense to me.

> The question is whether this can't simply be done through bpf helpers?
> bpf_get_txq_occupancy(ifindex, txqno)?

Helper vs dev_queue->dql field access seems like a technicality.
The usual flexibility of implementation vs performance and simplicity
consideration applies.. I guess?

> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
> > hardware queue, that this BPF hook have some visibility into and can do
> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
> > can shrink the packet size and still send a "congestion" packet to my
> > receiver).
>
> I'm not sure the hardware queues will be enough, though. Unless I'm
> misunderstanding something, hardware queues are (1) fairly short and (2)
> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
> still need a software queueing layer on top of the hardware queue.

Jesper makes a very interesting point tho. If all the implementation
wants is FIFO queues which are services in some simple manner (that is
something that can be offloaded) we should support that.

That means REDIRECT can target multiple TX queues, and we need an API
to control the queue allocation..

> If the hardware is EDT-aware this may change, I suppose, but I'm not
> sure if we can design the XDP queueing primitives with this assumption? :)

But I agree with you as well. I think both HW and SW feeding needs to
be supported. The HW implementations are always necessarily behind
ideas people implemented and tested in SW..

> > The bigger picture is that I envision the XDP-TX/egress hook can
> > open-up for taking advantage of NIC hardware TX queue features. This
> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
> > hardware can do a million TX-queues, and hardware can also do rate
> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> > can choose/change the TX queue the packet is queue/sent on (we can
> > likely just overload the XDP_REDIRECT and have a new bpf map type for
> > this).

I wonder what that does to our HW offload model which is based on
TC Qdisc offload today :S Do we use TC API to control configuration
of XDP queues? :S

> Yes, I think I mentioned in another email that putting all the queueing
> smarts into the redirect map was also something I'd considered (well, I
> do think we've discussed this in the past, so maybe not so surprising if
> we're thinking along the same lines) :)
>
> But the implication of this is also that an actual TX hook in the driver
> need not necessarily incorporate a lot of new functionality, as it can
> control the queueing through a combination of BPF helpers and map
> updates?

True, it's the dequeuing that's on the TX side, so we could go as far
as putting all the enqueuing logic in the RX prog..

To answer your question from the other email Toke, my basic model was
kind of similar to TC Qdiscs. XDP redirect selects a device, then that
device has an enqueue and dequeue programs. Enqueue program can be run
in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
some space on the TX descriptor ring. There is a "queue state" but the
FIFOs etc are sort of internal detail that the enqueue and dequeue
programs only share between each other. To be clear this is not a
suggestion of how things should be, it's what sprung to my mind without
thinking.
Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > On Mon, 03 Feb 2020 21:13:24 +0100
>> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> >> Oops, I see I forgot to reply to this bit:
>> >>
>> >> >> Yeah, but having the low-level details available to the XDP program
>> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> >> of XDP, isn't it?
>> >> >
>> >> > I think I glossed over the hope for having access to HW queue occupancy
>> >> > - what exactly are you after?
>> >> >
>> >> > I don't think one can get anything beyond a BQL type granularity.
>> >> > Reading over PCIe is out of question, device write back on high
>> >> > granularity would burn through way too much bus throughput.
>> >>
>> >> This was Jesper's idea originally, so maybe he can explain better; but
>> >> as I understood it, he basically wanted to expose the same information
>> >> that BQL has to eBPF. Making it possible for an eBPF program to either
>> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> >> in some other way, such as by load balancing to another interface.
>> >
>> > Yes, and I also have plans that goes beyond BQL. But let me start with
>> > explaining the BQL part, and answer Toke's question below.
>> >
>> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
>> >> [...] Hmm, I wonder if a TX driver hook is enough?
>> >
>> > Short answer is no, a TX driver hook is not enough. The queue state
>> > info the TX driver hook have access to, needs to be updated once the
>> > hardware have "confirmed" the TX-DMA operation have completed. For
>> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
>> > see call sites for netdev_tx_completed_queue()). (As Jakub wisely
>> > point out we cannot query the device directly due to performance
>> > implications). It doesn't need to be a new BPF hook, just something
>> > that update the queue state info (we could piggy back on the
>> > netdev_tx_completed_queue() call or give TX hook access to
>> > dev_queue->dql).
>
> Interesting, that model does make sense to me.
>
>> The question is whether this can't simply be done through bpf helpers?
>> bpf_get_txq_occupancy(ifindex, txqno)?
>
> Helper vs dev_queue->dql field access seems like a technicality.
> The usual flexibility of implementation vs performance and simplicity
> consideration applies.. I guess?

Yeah, that was my point: We can do whatever makes the most sense for
that; it doesn't necessarily have to be something that's baked into the
TX hook point.

>> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
>> > hardware queue, that this BPF hook have some visibility into and can do
>> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
>> > can shrink the packet size and still send a "congestion" packet to my
>> > receiver).
>>
>> I'm not sure the hardware queues will be enough, though. Unless I'm
>> misunderstanding something, hardware queues are (1) fairly short and (2)
>> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
>> still need a software queueing layer on top of the hardware queue.
>
> Jesper makes a very interesting point tho. If all the implementation
> wants is FIFO queues which are services in some simple manner (that is
> something that can be offloaded) we should support that.
>
> That means REDIRECT can target multiple TX queues, and we need an API
> to control the queue allocation..

Yes. I think that should be part of the "hwq API" that we presented at
LPC last year[0].

>> If the hardware is EDT-aware this may change, I suppose, but I'm not
>> sure if we can design the XDP queueing primitives with this assumption? :)
>
> But I agree with you as well. I think both HW and SW feeding needs to
> be supported. The HW implementations are always necessarily behind
> ideas people implemented and tested in SW..

Exactly. And even though we mostly think about 10-100G network
interfaces, I believe there is also potential for this to be useful at
the low end (think small embedded routers that are very CPU
constrained). In these sorts of environments software queueing is more
feasible.

>> > The bigger picture is that I envision the XDP-TX/egress hook can
>> > open-up for taking advantage of NIC hardware TX queue features. This
>> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
>> > hardware can do a million TX-queues, and hardware can also do rate
>> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
>> > can choose/change the TX queue the packet is queue/sent on (we can
>> > likely just overload the XDP_REDIRECT and have a new bpf map type for
>> > this).
>
> I wonder what that does to our HW offload model which is based on
> TC Qdisc offload today :S Do we use TC API to control configuration
> of XDP queues? :S

My thought was that it could be baked into the same API[0]. I.e., add a
way to say not just "I want 3 hardware queues for this", but also "I
want 3 hardware queues, with fq_codel (or whatever) on top".

>> Yes, I think I mentioned in another email that putting all the queueing
>> smarts into the redirect map was also something I'd considered (well, I
>> do think we've discussed this in the past, so maybe not so surprising if
>> we're thinking along the same lines) :)
>>
>> But the implication of this is also that an actual TX hook in the driver
>> need not necessarily incorporate a lot of new functionality, as it can
>> control the queueing through a combination of BPF helpers and map
>> updates?
>
> True, it's the dequeuing that's on the TX side, so we could go as far
> as putting all the enqueuing logic in the RX prog..
>
> To answer your question from the other email Toke, my basic model was
> kind of similar to TC Qdiscs. XDP redirect selects a device, then that
> device has an enqueue and dequeue programs. Enqueue program can be run
> in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
> some space on the TX descriptor ring. There is a "queue state" but the
> FIFOs etc are sort of internal detail that the enqueue and dequeue
> programs only share between each other. To be clear this is not a
> suggestion of how things should be, it's what sprung to my mind without
> thinking.

Ah, that's interesting. I do agree that it may be better to bake this
into the existing hooks rather than have an additional set of
(en/de)queue hooks.

-Toke

[0] https://linuxplumbersconf.org/event/4/contributions/462/
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@ enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME,	/* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS,	/* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 04cbcc930bc2..bf76dbee9d2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8664,6 +8664,12 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 
+		if (egress && prog->expected_attach_type != BPF_XDP_EGRESS) {
+			NL_SET_ERR_MSG(extack, "XDP program in egress path must use BPF_XDP_EGRESS attach type");
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+
 		if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
 			NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
 			bpf_prog_put(prog);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ed0c069ef187..2179de9350b2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1030,7 +1030,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
-	       + rtnl_xdp_size() /* IFLA_XDP */
+	       + rtnl_xdp_size() * 2 /* IFLA_XDP and IFLA_XDP_EGRESS */
 	       + nla_total_size(4)  /* IFLA_EVENT */
 	       + nla_total_size(4)  /* IFLA_NEW_NETNSID */
 	       + nla_total_size(4)  /* IFLA_NEW_IFINDEX */
@@ -1395,6 +1395,36 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static u32 rtnl_xdp_egress_prog_drv(struct net_device *dev)
+{
+	return __dev_xdp_query(dev, dev->netdev_ops->ndo_bpf,
+			       XDP_QUERY_PROG_EGRESS);
+}
+
+static int rtnl_xdp_egress_report(struct sk_buff *skb, struct net_device *dev,
+				  u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr,
+				  u32 (*get_prog_id)(struct net_device *dev))
+{
+	u32 curr_id;
+	int err;
+
+	curr_id = get_prog_id(dev);
+	if (!curr_id)
+		return 0;
+
+	*prog_id = curr_id;
+	err = nla_put_u32(skb, attr, curr_id);
+	if (err)
+		return err;
+
+	if (*mode != XDP_ATTACHED_NONE)
+		*mode = XDP_ATTACHED_MULTI;
+	else
+		*mode = tgt_mode;
+
+	return 0;
+}
+
 static u32 rtnl_xdp_prog_skb(struct net_device *dev)
 {
 	const struct bpf_prog *generic_xdp_prog;
@@ -1486,6 +1516,41 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 	return err;
 }
 
+static int rtnl_xdp_egress_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	u8 mode = XDP_ATTACHED_NONE;
+	struct nlattr *xdp;
+	u32 prog_id = 0;
+	int err;
+
+	xdp = nla_nest_start_noflag(skb, IFLA_XDP_EGRESS);
+	if (!xdp)
+		return -EMSGSIZE;
+
+	err = rtnl_xdp_egress_report(skb, dev, &prog_id, &mode,
+				     XDP_ATTACHED_DRV, IFLA_XDP_DRV_PROG_ID,
+				     rtnl_xdp_egress_prog_drv);
+	if (err)
+		goto err_cancel;
+
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode);
+	if (err)
+		goto err_cancel;
+
+	if (prog_id && mode != XDP_ATTACHED_MULTI) {
+		err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id);
+		if (err)
+			goto err_cancel;
+	}
+
+	nla_nest_end(skb, xdp);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, xdp);
+	return err;
+}
+
 static u32 rtnl_get_event(unsigned long event)
 {
 	u32 rtnl_event_type = IFLA_EVENT_NONE;
@@ -1743,6 +1808,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	if (rtnl_xdp_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_egress_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1827,6 +1895,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_ALT_IFNAME]	= { .type = NLA_STRING,
 				    .len = ALTIFNAMSIZ - 1 },
 	[IFLA_PERM_ADDRESS]	= { .type = NLA_REJECT },
+	[IFLA_XDP_EGRESS]	= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2808,6 +2877,47 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_XDP_EGRESS]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+		u32 xdp_flags = 0;
+
+		err = nla_parse_nested_deprecated(xdp, IFLA_XDP_MAX,
+						  tb[IFLA_XDP_EGRESS],
+						  ifla_xdp_policy, NULL);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_ATTACHED] || xdp[IFLA_XDP_PROG_ID]) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		if (xdp[IFLA_XDP_FLAGS]) {
+			xdp_flags = nla_get_u32(xdp[IFLA_XDP_FLAGS]);
+			if (xdp_flags & XDP_FLAGS_HW_MODE) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (xdp_flags & ~XDP_FLAGS_MASK) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (hweight32(xdp_flags & XDP_FLAGS_MODES) > 1) {
+				err = -EINVAL;
+				goto errout;
+			}
+		}
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev, extack,
+						nla_get_s32(xdp[IFLA_XDP_FD]),
+						xdp_flags, true);
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if ((status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY)
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@ enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME,	/* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS,	/* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };