diff mbox series

[bpf-next,03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path

Message ID 20200123014210.38412-4-dsahern@kernel.org
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series Add support for XDP in egress path | expand

Commit Message

David Ahern Jan. 23, 2020, 1:42 a.m. UTC
From: David Ahern <dahern@digitalocean.com>

Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
the egress counterpart to the existing rtnl_xdp_fill. The expectation
is that going forward egress path will acquire the various levels of
attach - generic, driver and hardware.

Signed-off-by: David Ahern <dahern@digitalocean.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/uapi/linux/if_link.h       |   1 +
 net/core/dev.c                     |   6 ++
 net/core/rtnetlink.c               | 112 ++++++++++++++++++++++++++++-
 tools/include/uapi/linux/if_link.h |   1 +
 4 files changed, 119 insertions(+), 1 deletion(-)

Comments

Toke Høiland-Jørgensen Jan. 23, 2020, 11:35 a.m. UTC | #1
David Ahern <dsahern@kernel.org> writes:

> From: David Ahern <dahern@digitalocean.com>
>
> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> is that going forward egress path will acquire the various levels of
> attach - generic, driver and hardware.

How would a 'hardware' attach work for this? As I said in my reply to
the previous patch, isn't this explicitly for emulating XDP on the other
end of a point-to-point link? How would that work with offloaded
programs?

-Toke
David Ahern Jan. 23, 2020, 9:33 p.m. UTC | #2
On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@kernel.org> writes:
> 
>> From: David Ahern <dahern@digitalocean.com>
>>
>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>> is that going forward egress path will acquire the various levels of
>> attach - generic, driver and hardware.
> 
> How would a 'hardware' attach work for this? As I said in my reply to
> the previous patch, isn't this explicitly for emulating XDP on the other
> end of a point-to-point link? How would that work with offloaded
> programs?
> 
> -Toke
> 

Nothing about this patch set is limited to point-to-point links.
Jakub Kicinski Jan. 24, 2020, 3:21 p.m. UTC | #3
On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
> > David Ahern <dsahern@kernel.org> writes:
> >> From: David Ahern <dahern@digitalocean.com>
> >>
> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >> is that going forward egress path will acquire the various levels of
> >> attach - generic, driver and hardware.  
> > 
> > How would a 'hardware' attach work for this? As I said in my reply to
> > the previous patch, isn't this explicitly for emulating XDP on the other
> > end of a point-to-point link? How would that work with offloaded
> > programs?
> 
> Nothing about this patch set is limited to point-to-point links.

I struggle to understand of what the expected semantics of this new
hook are. Is this going to be run on all frames sent to the device
from the stack? All frames from the stack and from XDP_REDIRECT?

A little hard to figure out the semantics when we start from a funky
device like tun :S
Toke Høiland-Jørgensen Jan. 24, 2020, 3:36 p.m. UTC | #4
Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>> > David Ahern <dsahern@kernel.org> writes:
>> >> From: David Ahern <dahern@digitalocean.com>
>> >>
>> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>> >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>> >> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>> >> is that going forward egress path will acquire the various levels of
>> >> attach - generic, driver and hardware.  
>> > 
>> > How would a 'hardware' attach work for this? As I said in my reply to
>> > the previous patch, isn't this explicitly for emulating XDP on the other
>> > end of a point-to-point link? How would that work with offloaded
>> > programs?
>> 
>> Nothing about this patch set is limited to point-to-point links.
>
> I struggle to understand of what the expected semantics of this new
> hook are. Is this going to be run on all frames sent to the device
> from the stack? All frames from the stack and from XDP_REDIRECT?
>
> A little hard to figure out the semantics when we start from a funky
> device like tun :S

Yes, that is also why I found this a bit weird. We have discussed plans
for an XDP TX hook before:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx

That TX hook would run for everything at TX, but it would be a separate
program type with its own metadata access. Whereas the idea with this
series (seemed to me) to be just to be able to "emulate" run a regular
RX-side XDP program on egress for devices where this makes sense.

If this series is not meant to implement that "emulation", but rather be
usable for all devices, I really think we should go straight for the
full TX hook as discussed earlier...

-Toke
David Ahern Jan. 26, 2020, 1:43 a.m. UTC | #5
On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> 
>> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>>>> David Ahern <dsahern@kernel.org> writes:
>>>>> From: David Ahern <dahern@digitalocean.com>
>>>>>
>>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>>>>> is that going forward egress path will acquire the various levels of
>>>>> attach - generic, driver and hardware.  
>>>>
>>>> How would a 'hardware' attach work for this? As I said in my reply to
>>>> the previous patch, isn't this explicitly for emulating XDP on the other
>>>> end of a point-to-point link? How would that work with offloaded
>>>> programs?
>>>
>>> Nothing about this patch set is limited to point-to-point links.
>>
>> I struggle to understand of what the expected semantics of this new
>> hook are. Is this going to be run on all frames sent to the device
>> from the stack? All frames from the stack and from XDP_REDIRECT?
>>
>> A little hard to figure out the semantics when we start from a funky
>> device like tun :S
> 
> Yes, that is also why I found this a bit weird. We have discussed plans
> for an XDP TX hook before:
> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> 
> That TX hook would run for everything at TX, but it would be a separate
> program type with its own metadata access. Whereas the idea with this
> series (seemed to me) to be just to be able to "emulate" run a regular
> RX-side XDP program on egress for devices where this makes sense.
> 
> If this series is not meant to implement that "emulation", but rather be
> usable for all devices, I really think we should go straight for the
> full TX hook as discussed earlier...
> 

The first patch set from Jason and Prashant started from the perspective
of offloading XDP programs for a guest. Independently, I was looking at
XDP in the TX path (now referred to as egress to avoid confusion with
the XDP_TX return type). Jason and Prashant were touching some of the
same code paths in the tun driver that I needed for XDP in the Tx path,
so we decided to consolidate and have XDP egress done first and then
offload of VMs as a followup. Offload in virtio_net can be done very
similar to how it is done in nfp -- the program is passed to the host as
a hardware level attach mode, and the driver verifies the program can be
offloaded (e.g., does not contain helpers that expose host specific data
like the fib lookup helper).

At this point, you need to stop thinking solely from the perspective of
tun or tap and VM offload; think about this from the ability to run an
XDP program on egress path at an appropriate place in the NIC driver
that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
been discussed before as a need (e.g, Toke's reference above), and I am
trying to get this initial support done.

I very much wanted to avoid copy-paste-modify for the entire XDP API for
this. For the most part XDP means ebpf at the NIC driver / hardware
level (obviously with the exception of generic mode). The goal is
tempered with the need for the verifier to reject rx entries in the
xdp_md context. Hence the reason for use of an attach_type - existing
infrastructure to test and reject the accesses.

That said, Martin's comment throws a wrench in the goal: if the existing
code does not enforce expected_attach_type then that option can not be
used in which case I guess I have to go with a new program type
(BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
has different return codes, etc.
Alexei Starovoitov Jan. 26, 2020, 4:54 a.m. UTC | #6
On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote:
> 
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

This is acceptable risk. We did such thing in the past. The chances of
user space breakage are extremely low.
Jesper Dangaard Brouer Jan. 26, 2020, 12:49 p.m. UTC | #7
On Sat, 25 Jan 2020 18:43:36 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <kuba@kernel.org> writes:
> >   
> >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:  
> >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:  
> >>>> David Ahern <dsahern@kernel.org> writes:  
> >>>>> From: David Ahern <dahern@digitalocean.com>
> >>>>>
> >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >>>>> is that going forward egress path will acquire the various levels of
> >>>>> attach - generic, driver and hardware.    
> >>>>
> >>>> How would a 'hardware' attach work for this? As I said in my reply to
> >>>> the previous patch, isn't this explicitly for emulating XDP on the other
> >>>> end of a point-to-point link? How would that work with offloaded
> >>>> programs?  
> >>>
> >>> Nothing about this patch set is limited to point-to-point links.  
> >>
> >> I struggle to understand of what the expected semantics of this new
> >> hook are. Is this going to be run on all frames sent to the device
> >> from the stack? All frames from the stack and from XDP_REDIRECT?
> >>
> >> A little hard to figure out the semantics when we start from a funky
> >> device like tun :S  
> > 
> > Yes, that is also why I found this a bit weird. We have discussed plans
> > for an XDP TX hook before:
> > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> > 
> > That TX hook would run for everything at TX, but it would be a separate
> > program type with its own metadata access. Whereas the idea with this
> > series (seemed to me) to be just to be able to "emulate" run a regular
> > RX-side XDP program on egress for devices where this makes sense.
> > 
> > If this series is not meant to implement that "emulation", but rather be
> > usable for all devices, I really think we should go straight for the
> > full TX hook as discussed earlier...
> >   
> 
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).
> 
> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT).

Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.


> This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.
> 
> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.
> 
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

Taking about return codes.  Does XDP the return codes make sense for
this EGRESS hook? (if thinking about this being egress on the real NIC).

E.g. XDP_REDIRECT would have to be supported, which is interesting, but
also have implications (like looping packets).

E.g. what is the semantics/action of XDP_TX return code?

E.g. I'm considering adding a XDP_CONGESTED return code that can cause
backpressure towards qdisc layer.

Also think about that if this EGRESS hook uses standard prog type for
XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff
(and also convert SKBs to xdp_buff).

Are we sure that reusing the same bpf prog type is the right choice?
David Ahern Jan. 26, 2020, 4:38 p.m. UTC | #8
On 1/26/20 5:49 AM, Jesper Dangaard Brouer wrote:
>> This has
>> been discussed before as a need (e.g, Toke's reference above), and I am
>> trying to get this initial support done.
>>
>> I very much wanted to avoid copy-paste-modify for the entire XDP API for
>> this. For the most part XDP means ebpf at the NIC driver / hardware
>> level (obviously with the exception of generic mode). The goal is
>> tempered with the need for the verifier to reject rx entries in the
>> xdp_md context. Hence the reason for use of an attach_type - existing
>> infrastructure to test and reject the accesses.
>>
>> That said, Martin's comment throws a wrench in the goal: if the existing
>> code does not enforce expected_attach_type then that option can not be
>> used in which case I guess I have to go with a new program type
>> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
>> has different return codes, etc.
> 
> Taking about return codes.  Does XDP the return codes make sense for
> this EGRESS hook? (if thinking about this being egress on the real NIC).
> 
> E.g. XDP_REDIRECT would have to be supported, which is interesting, but
> also have implications (like looping packets).
> 
> E.g. what is the semantics/action of XDP_TX return code?

This has been discussed. XDP_TX in the EGRESS path could arguably be
equal to XDP_PASS.

> 
> E.g. I'm considering adding a XDP_CONGESTED return code that can cause
> backpressure towards qdisc layer.
> 
> Also think about that if this EGRESS hook uses standard prog type for
> XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff
> (and also convert SKBs to xdp_buff).

Why? What about the patch set requires that change to be done to have
support for EGRESS path?

> 
> Are we sure that reusing the same bpf prog type is the right choice?
> 

Martin's comment about existing checking on the expected attach type is
the only reason I have seen so far to not have the same program type.

Looking at the helpers for use in XDP programs do you believe any of
those should not be allowed with EGRESS programs? Do you have any reason
to think that existing XDP capabilities should be prohibited or
different for EGRESS? As mentioned earlier the attach type can be used
to have the verifier handle small context differences (and restrict
helpers if needed).
Jakub Kicinski Jan. 26, 2020, 10:11 p.m. UTC | #9
On Sat, 25 Jan 2020 18:43:36 -0700, David Ahern wrote:
> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <kuba@kernel.org> writes:
> >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:  
> >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:  
> >>>> David Ahern <dsahern@kernel.org> writes:  
> >>>>> From: David Ahern <dahern@digitalocean.com>
> >>>>>
> >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >>>>> is that going forward egress path will acquire the various levels of
> >>>>> attach - generic, driver and hardware.    
> >>>>
> >>>> How would a 'hardware' attach work for this? As I said in my reply to
> >>>> the previous patch, isn't this explicitly for emulating XDP on the other
> >>>> end of a point-to-point link? How would that work with offloaded
> >>>> programs?  
> >>>
> >>> Nothing about this patch set is limited to point-to-point links.  
> >>
> >> I struggle to understand of what the expected semantics of this new
> >> hook are. Is this going to be run on all frames sent to the device
> >> from the stack? All frames from the stack and from XDP_REDIRECT?
> >>
> >> A little hard to figure out the semantics when we start from a funky
> >> device like tun :S  
> > 
> > Yes, that is also why I found this a bit weird. We have discussed plans
> > for an XDP TX hook before:
> > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> > 
> > That TX hook would run for everything at TX, but it would be a separate
> > program type with its own metadata access. Whereas the idea with this
> > series (seemed to me) to be just to be able to "emulate" run a regular
> > RX-side XDP program on egress for devices where this makes sense.
> > 
> > If this series is not meant to implement that "emulation", but rather be
> > usable for all devices, I really think we should go straight for the
> > full TX hook as discussed earlier...
> 
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). 

I looked through the commit message and the cover letter again, and you
never explain why you need the egress hook. Could you please clarify
your needs? If it's container-related maybe what Daniel talked about at
last netconf could be a better solution?

I can't quite square the concept of XDP which started as close to the
metal BPF hook for HW drivers, and this heavily SW-focused addition.

> Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).

<rant>

I'd ask to please never compare this work to the nfp offload. Netronome
was able to open up their NIC down to the instruction set level, with
the JIT in tree and rest of the FW open source:

https://github.com/Netronome/nic-firmware/

and that work is now used as precedent for something that risks turning
the kernel into a damn control plane for proprietary clouds?

I can see how they may seem similar in operational terms, but for
people who care about open source they couldn't be more different.

</rant>

> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.

TX hook related to queuing is a very different beast than just a RX
hook flipped. The queuing is a problem that indeed needs work, but just
adding a mirror RX hook does not solve that, and establishes semantics
which may be counter productive. That's why I was asking for clear
semantics.

> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.

For the offload host rx queue == dev PCI tx queue and vice versa.
So other than the name the rejection makes no sense. Just add a union
to xdp_md so both tx and rx names can be used.
Jakub Kicinski Jan. 26, 2020, 10:17 p.m. UTC | #10
On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:
> Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.

Any pointers on what for? Unless we see actual use cases there's
a justifiable concern of the entire thing just being an application of
"We can solve any problem by introducing an extra level of indirection."
David Ahern Jan. 27, 2020, 4:03 a.m. UTC | #11
On 1/26/20 3:11 PM, Jakub Kicinski wrote:
> 
> I looked through the commit message and the cover letter again, and you
> never explain why you need the egress hook. Could you please clarify
> your needs? 

XDP is about efficient network processing - ie., bypassing the Linux
stack when it does not make sense for the person deploying some
solution. XDP right now is Rx centric.

I want to run an ebpf program in the Tx path of the NIC regardless of
how the packet arrived at the device -- as an skb or an xdp_frame. There
are options for running programs on skb-based packets (e.g., tc). There
are *zero* options for manipulating/controlling/denying xdp_frames -
e.g., one REDIRECTED from an ingress device.
Jakub Kicinski Jan. 27, 2020, 2:16 p.m. UTC | #12
On Sun, 26 Jan 2020 21:03:15 -0700, David Ahern wrote:
> On 1/26/20 3:11 PM, Jakub Kicinski wrote:
> > I looked through the commit message and the cover letter again, and you
> > never explain why you need the egress hook. Could you please clarify
> > your needs?   
> 
> XDP is about efficient network processing - ie., bypassing the Linux
> stack when it does not make sense for the person deploying some
> solution. XDP right now is Rx centric.

Network hardware is also "Rx centric" and somehow it works..

> I want to run an ebpf program in the Tx path of the NIC regardless of
> how the packet arrived at the device -- as an skb or an xdp_frame. There
> are options for running programs on skb-based packets (e.g., tc). There
> are *zero* options for manipulating/controlling/denying xdp_frames -
> e.g., one REDIRECTED from an ingress device.

Okay - so no precise use case.  You can run the same program at the 
end of whatever is doing the redirect (especially with Alexei's work 
on linking) and from cls_bpf 🤷‍♂️

I'm sure all driver authors can't wait to plough through their TX paths
:/
David Ahern Jan. 28, 2020, 3:43 a.m. UTC | #13
On 1/27/20 7:16 AM, Jakub Kicinski wrote:
>> I want to run an ebpf program in the Tx path of the NIC regardless of
>> how the packet arrived at the device -- as an skb or an xdp_frame. There
>> are options for running programs on skb-based packets (e.g., tc). There
>> are *zero* options for manipulating/controlling/denying xdp_frames -
>> e.g., one REDIRECTED from an ingress device.
> 
> Okay - so no precise use case.  You can run the same program at the 

For the sake of this discussion, consider a per-VM, per-tap device ebpf
program for firewall / ACL / traffic verification (packet manipulation
is also a possibility). Small, singly focused ebpf programs - attached
at startup, driven by maps, cleaned up when the tap device is destroyed.
(Remember: Tx for tap device is ingress to a VM.)

Small, singly focused programs only run for traffic to be delivered to
the VM. Setup is easy; cleanup automatic. How the traffic got there
could vary - from a bridge (L2 forwarding), the host stack (L3 routing),
or an XDP program on the host's NICs. It could have arrived at the host
with an encap header which is removed and the inner packet forwarded to
the VM.

> end of whatever is doing the redirect (especially with Alexei's work 

There are use cases where they may make sense, but this is not one.

> on linking) and from cls_bpf 🤷‍♂️
> 

cls_bpf is tc based == skb, no? I want to handle any packet, regardless
of how it arrived at the device's xmit function.
Jakub Kicinski Jan. 28, 2020, 1:57 p.m. UTC | #14
On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote:
> > end of whatever is doing the redirect (especially with Alexei's work   
> 
> There are use cases where they may make sense, but this is not one.
> 
> > on linking) and from cls_bpf 🤷‍♂️
> 
> cls_bpf is tc based == skb, no? I want to handle any packet, regardless
> of how it arrived at the device's xmit function.

Yes, that's why I said you need the same rules in XDP before REDIRECT
and cls_bpf. Sure it's more complex, but (1) it's faster to drop in
the ingress prog before going though the entire redirect code and
without parsing the packet twice and (2) no extra kernel code necessary.

Even the VM "offload" work doesn't need this. Translating an XDP prog
into a cls_bpf one should be trivial. Slap on some prologue to linearize
the skb, move ctx offsets around, slap on an epilogue to convert exit
codes, anything else?

I'm weary of partially implemented XDP features, EGRESS prog does us 
no good when most drivers didn't yet catch up with the REDIRECTs. And
we're adding this before we considered the queuing problem.

But if I'm alone in thinking this, and I'm not convincing anyone we can
move on :)
Jesper Dangaard Brouer Jan. 28, 2020, 2:13 p.m. UTC | #15
On Sun, 26 Jan 2020 14:17:01 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:
> > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.  
> 
> Any pointers on what for? Unless we see actual use cases there's
> a justifiable concern of the entire thing just being an application of
> "We can solve any problem by introducing an extra level of indirection."

I have two use-cases:

(1) For XDP easier handling of interface specific setting on egress,
e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
(I think this is also David Ahern's use-case)


(2) I want this egress XDP hook to have the ability to signal
backpressure. Today we have BQL in most drivers (which is essential to
avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.

For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
or a return code that can stop the queue towards the higher layers.
Jakub Kicinski Jan. 30, 2020, 2:45 p.m. UTC | #16
On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 26 Jan 2020 14:17:01 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:  
> > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.    
> > 
> > Any pointers on what for? Unless we see actual use cases there's
> > a justifiable concern of the entire thing just being an application of
> > "We can solve any problem by introducing an extra level of indirection."  
> 
> I have two use-cases:
> 
> (1) For XDP easier handling of interface specific setting on egress,
> e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
> (I think this is also David Ahern's use-case)

Is it really useful to have a hook before multi-buffer frames are
possible and perhaps TSO? The local TCP performance is going to tank
with XDP enabled otherwise.

> (2) I want this egress XDP hook to have the ability to signal
> backpressure. Today we have BQL in most drivers (which is essential to
> avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.
> 
> For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
> or a return code that can stop the queue towards the higher layers.

Agreed, although for that use case, I'm not sure if non-XDP frames 
have to pass trough the hook. Hard to tell as the current patches 
don't touch on this use case.
Toke Høiland-Jørgensen Feb. 1, 2020, 3:59 p.m. UTC | #17
David Ahern <dsahern@gmail.com> writes:

> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> 
>>> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>>>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>>>>> David Ahern <dsahern@kernel.org> writes:
>>>>>> From: David Ahern <dahern@digitalocean.com>
>>>>>>
>>>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>>>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>>>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>>>>>> is that going forward egress path will acquire the various levels of
>>>>>> attach - generic, driver and hardware.  
>>>>>
>>>>> How would a 'hardware' attach work for this? As I said in my reply to
>>>>> the previous patch, isn't this explicitly for emulating XDP on the other
>>>>> end of a point-to-point link? How would that work with offloaded
>>>>> programs?
>>>>
>>>> Nothing about this patch set is limited to point-to-point links.
>>>
>>> I struggle to understand of what the expected semantics of this new
>>> hook are. Is this going to be run on all frames sent to the device
>>> from the stack? All frames from the stack and from XDP_REDIRECT?
>>>
>>> A little hard to figure out the semantics when we start from a funky
>>> device like tun :S
>> 
>> Yes, that is also why I found this a bit weird. We have discussed plans
>> for an XDP TX hook before:
>> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
>> 
>> That TX hook would run for everything at TX, but it would be a separate
>> program type with its own metadata access. Whereas the idea with this
>> series (seemed to me) to be just to be able to "emulate" run a regular
>> RX-side XDP program on egress for devices where this makes sense.
>> 
>> If this series is not meant to implement that "emulation", but rather be
>> usable for all devices, I really think we should go straight for the
>> full TX hook as discussed earlier...
>> 
>
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).
>
> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.

Right, so what we're discussing here *is* the "full" egress hook we've
discussed earlier. I thought this was still the other thing, which is
why I was confused.

> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.
>
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

In any case an egress program will differ in:

- The context object (the RX-related fields will be invalid on egress,
  and we'll probably want to add new TX-related ones, such as HW
  TX-queue occupancy).
  
- The return code semantics (even if XDP_TX becomes equivalent to
  XDP_PASS, that is still a semantic difference from the RX side; and
  it's not necessarily clear whether we'll want to support REDIRECT on
  the egress side either, is it?)

So we'll have to disambiguate between the two different types of
programs. Which means that what we're discussing is really whether that
disambiguation should be encoded in the program type, or in the attach
type. IMO, making it a separate program type is a clearer and more
explicit UAPI. The verifier could still treat the two program types as
basically equivalent except for those cases where there has to be a
difference anyway. So it seems to me that all you are saving by using
attach_type instead of program type is the need for a new enum value and
a bunch of additions to switch statements? Or am I wildly
underestimating the effort to add a new program type?

-Toke
Toke Høiland-Jørgensen Feb. 1, 2020, 4:03 p.m. UTC | #18
Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote:
>> On Sun, 26 Jan 2020 14:17:01 -0800
>> Jakub Kicinski <kuba@kernel.org> wrote:
>> 
>> > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:  
>> > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.    
>> > 
>> > Any pointers on what for? Unless we see actual use cases there's
>> > a justifiable concern of the entire thing just being an application of
>> > "We can solve any problem by introducing an extra level of indirection."  
>> 
>> I have two use-cases:
>> 
>> (1) For XDP easier handling of interface specific setting on egress,
>> e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
>> (I think this is also David Ahern's use-case)
>
> Is it really useful to have a hook before multi-buffer frames are
> possible and perhaps TSO? The local TCP performance is going to tank
> with XDP enabled otherwise.

For a software router (i.e., something that mostly forwards packets) it
can still be useful without multi-buffer. But yeah, that is definitely
something we need to solve, regardless of where this goes.

>> (2) I want this egress XDP hook to have the ability to signal
>> backpressure. Today we have BQL in most drivers (which is essential to
>> avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.
>> 
>> For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
>> or a return code that can stop the queue towards the higher layers.
>
> Agreed, although for that use case, I'm not sure if non-XDP frames 
> have to pass trough the hook. Hard to tell as the current patches 
> don't touch on this use case.

I think it would be weird and surprising if it *doesn't* see packets
from the stack. On RX, XDP sees everything; the natural expectation
would be that this was also the case on TX, no?

-Toke
Toke Høiland-Jørgensen Feb. 1, 2020, 4:24 p.m. UTC | #19
[ skipping the first part to just comment on the below: ]

> I'm weary of partially implemented XDP features, EGRESS prog does us
> no good when most drivers didn't yet catch up with the REDIRECTs.

I kinda agree with this; but on the other hand, if we have to wait for
all drivers to catch up, that would mean we couldn't add *anything* new
that requires driver changes, which is not ideal either :/

> And we're adding this before we considered the queuing problem.
>
> But if I'm alone in thinking this, and I'm not convincing anyone we
> can move on :)

I do share your concern that this will end up being incompatible with
whatever solution we end up with for queueing. However, I don't
necessarily think it will: I view the XDP egress hook as something that
in any case will run *after* packets are dequeued from whichever
intermediate queueing it has been through (if any). I think such a hook
is missing in any case; for instance, it's currently impossible to
implement something like CoDel (which needs to know how long a packet
spent in the queue) in eBPF.

-Toke
Jakub Kicinski Feb. 1, 2020, 5:08 p.m. UTC | #20
On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:
> > I'm weary of partially implemented XDP features, EGRESS prog does us
> > no good when most drivers didn't yet catch up with the REDIRECTs.  
> 
> I kinda agree with this; but on the other hand, if we have to wait for
> all drivers to catch up, that would mean we couldn't add *anything*
> new that requires driver changes, which is not ideal either :/

If EGRESS is only for XDP frames we could try to hide the handling in
the core (with slight changes to XDP_TX handling in the drivers),
making drivers smaller and XDP feature velocity higher.

I think loading the drivers with complexity is hurting us in so many
ways..

> > And we're adding this before we considered the queuing problem.
> >
> > But if I'm alone in thinking this, and I'm not convincing anyone we
> > can move on :)  
> 
> I do share your concern that this will end up being incompatible with
> whatever solution we end up with for queueing. However, I don't
> necessarily think it will: I view the XDP egress hook as something
> that in any case will run *after* packets are dequeued from whichever
> intermediate queueing it has been through (if any). I think such a
> hook is missing in any case; for instance, it's currently impossible
> to implement something like CoDel (which needs to know how long a
> packet spent in the queue) in eBPF.

Possibly 🤔 I don't have a good mental image of how the XDP queuing
would work.

Maybe once the queuing primitives are defined they can easily be
hooked into the Qdisc layer. With Martin's recent work all we need is 
a fifo that can store skb pointers, really...

It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
layer underneath.
Toke Høiland-Jørgensen Feb. 1, 2020, 8:05 p.m. UTC | #21
Jakub Kicinski <kuba@kernel.org> writes:

> On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:
>> > I'm weary of partially implemented XDP features, EGRESS prog does us
>> > no good when most drivers didn't yet catch up with the REDIRECTs.  
>> 
>> I kinda agree with this; but on the other hand, if we have to wait for
>> all drivers to catch up, that would mean we couldn't add *anything*
>> new that requires driver changes, which is not ideal either :/
>
> If EGRESS is only for XDP frames we could try to hide the handling in
> the core (with slight changes to XDP_TX handling in the drivers),
> making drivers smaller and XDP feature velocity higher.

But if it's only for XDP frames that are REDIRECTed, then one might as
well perform whatever action the TX hook was doing before REDIRECTing
(as you yourself argued)... :)

> I think loading the drivers with complexity is hurting us in so many
> ways..

Yeah, but having the low-level details available to the XDP program
(such as HW queue occupancy for the egress hook) is one of the benefits
of XDP, isn't it?

Ultimately, I think Jesper's idea of having drivers operate exclusively
on XDP frames and have the skb handling entirely in the core is an
intriguing way to resolve this problem. Though this is obviously a
long-term thing, and one might reasonably doubt we'll ever get there for
existing drivers...

>> > And we're adding this before we considered the queuing problem.
>> >
>> > But if I'm alone in thinking this, and I'm not convincing anyone we
>> > can move on :)  
>> 
>> I do share your concern that this will end up being incompatible with
>> whatever solution we end up with for queueing. However, I don't
>> necessarily think it will: I view the XDP egress hook as something
>> that in any case will run *after* packets are dequeued from whichever
>> intermediate queueing it has been through (if any). I think such a
>> hook is missing in any case; for instance, it's currently impossible
>> to implement something like CoDel (which needs to know how long a
>> packet spent in the queue) in eBPF.
>
> Possibly 🤔 I don't have a good mental image of how the XDP queuing
> would work.
>
> Maybe once the queuing primitives are defined they can easily be
> hooked into the Qdisc layer. With Martin's recent work all we need is 
> a fifo that can store skb pointers, really...
>
> It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
> layer underneath.

Hmm, hooking into the existing qdisc layer is an interesting idea.
Ultimately, I fear it won't be feasible for performance reasons; but
it's certainly something to consider. Maybe at least as an option?

-Toke
Jakub Kicinski Feb. 2, 2020, 4:15 a.m. UTC | #22
On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:  
> >> > I'm weary of partially implemented XDP features, EGRESS prog does us
> >> > no good when most drivers didn't yet catch up with the REDIRECTs.    
> >> 
> >> I kinda agree with this; but on the other hand, if we have to wait for
> >> all drivers to catch up, that would mean we couldn't add *anything*
> >> new that requires driver changes, which is not ideal either :/  
> >
> > If EGRESS is only for XDP frames we could try to hide the handling in
> > the core (with slight changes to XDP_TX handling in the drivers),
> > making drivers smaller and XDP feature velocity higher.  
> 
> But if it's only for XDP frames that are REDIRECTed, then one might as
> well perform whatever action the TX hook was doing before REDIRECTing
> (as you yourself argued)... :)

Right, that's why I think the design needs to start from queuing which
can't be done today, and has to be done in context of the destination.
Solving queuing justifies the added complexity if you will :)

> > I think loading the drivers with complexity is hurting us in so many
> > ways..  
> 
> Yeah, but having the low-level details available to the XDP program
> (such as HW queue occupancy for the egress hook) is one of the benefits
> of XDP, isn't it?

I think I glossed over the hope for having access to HW queue occupancy
- what exactly are you after? 

I don't think one can get anything beyond a BQL type granularity.
Reading over PCIe is out of question, device write back on high
granularity would burn through way too much bus throughput.

> Ultimately, I think Jesper's idea of having drivers operate exclusively
> on XDP frames and have the skb handling entirely in the core is an
> intriguing way to resolve this problem. Though this is obviously a
> long-term thing, and one might reasonably doubt we'll ever get there for
> existing drivers...
> 
> >> > And we're adding this before we considered the queuing problem.
> >> >
> >> > But if I'm alone in thinking this, and I'm not convincing anyone we
> >> > can move on :)    
> >> 
> >> I do share your concern that this will end up being incompatible with
> >> whatever solution we end up with for queueing. However, I don't
> >> necessarily think it will: I view the XDP egress hook as something
> >> that in any case will run *after* packets are dequeued from whichever
> >> intermediate queueing it has been through (if any). I think such a
> >> hook is missing in any case; for instance, it's currently impossible
> >> to implement something like CoDel (which needs to know how long a
> >> packet spent in the queue) in eBPF.  
> >
> > Possibly 🤔 I don't have a good mental image of how the XDP queuing
> > would work.
> >
> > Maybe once the queuing primitives are defined they can easily be
> > hooked into the Qdisc layer. With Martin's recent work all we need is 
> > a fifo that can store skb pointers, really...
> >
> > It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
> > layer underneath.  
> 
> Hmm, hooking into the existing qdisc layer is an interesting idea.
> Ultimately, I fear it won't be feasible for performance reasons; but
> it's certainly something to consider. Maybe at least as an option?

For forwarding sure, but for locally generated traffic.. 🤷‍♂️
David Ahern Feb. 2, 2020, 5:43 p.m. UTC | #23
On 1/28/20 6:57 AM, Jakub Kicinski wrote:
> On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote:
>>> end of whatever is doing the redirect (especially with Alexei's work   
>>
>> There are use cases where they may make sense, but this is not one.
>>
>>> on linking) and from cls_bpf 🤷‍♂️
>>
>> cls_bpf is tc based == skb, no? I want to handle any packet, regardless
>> of how it arrived at the device's xmit function.
> 
> Yes, that's why I said you need the same rules in XDP before REDIRECT
> and cls_bpf. Sure it's more complex, but (1) it's faster to drop in
> the ingress prog before going though the entire redirect code and
> without parsing the packet twice and (2) no extra kernel code necessary.

you are making a lot of assumptions and frankly it's the 'c' word
(complex) that I want to avoid. I do not believe in the OVS style of
packet processing - one gigantic program with a bunch of logic and data
driven flow lookups that affect every packet. I prefer simple, singly
focused programs logically concatenated to make decisions. Simpler
management, simpler lifecycle. The scope, scale and lifecycle management
of VMs/containers is just as important as minimizing the cycles spent
per packet. XDP in the Tx path is a missing primitive to make life simple.

As an example, the program on the ingress NICs to the host can be
nothing more than a demux'er - use the ethernet header and vlan
(preferably with vlan offload enabled) to do a <vlan,mac> lookup. No
packet parsing at all at this level. The lookup returns an index of
where to redirect the packet (e.g., the tap device of a VM). At the same
time, packets can hit the "slow path" - processing support from the full
stack when it is needed and still packets end up at the tap device. *IF*
there is an ingress ACL for that tap device managed by the host, it can
exist in 1 place - a program and map attached to the tap device limited
to that tap device's networking function (VMs can have multiple
connections to different networks with different needs at this point),
and the program gets run for both XDP fast path and skb slow path.
David Ahern Feb. 2, 2020, 5:45 p.m. UTC | #24
On 2/1/20 10:08 AM, Jakub Kicinski wrote:
> If EGRESS is only for XDP frames we could try to hide the handling in
> the core (with slight changes to XDP_TX handling in the drivers),

It is not. I have said multiple times it is to work on ALL packets that
hit the xmit function, both skbs and xdp_frames.
David Ahern Feb. 2, 2020, 5:48 p.m. UTC | #25
On 2/1/20 9:03 AM, Toke Høiland-Jørgensen wrote:
> I think it would be weird and surprising if it *doesn't* see packets
> from the stack. On RX, XDP sees everything; the natural expectation
> would be that this was also the case on TX, no?
> 

Yes. XDP_EGRESS should be symmetrical with XDP on the Rx side.
David Ahern Feb. 2, 2020, 5:54 p.m. UTC | #26
On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote:
> 
> In any case an egress program will differ in:
> 
> - The context object (the RX-related fields will be invalid on egress,
>   and we'll probably want to add new TX-related ones, such as HW
>   TX-queue occupancy).

Jakub has suggested that rx_queue_index can be a union with
tx_queue_index; the former for the Rx path and the latter for the egress.

The rest of the fields in xdp_md are legit for either direction.

>   
> - The return code semantics (even if XDP_TX becomes equivalent to
>   XDP_PASS, that is still a semantic difference from the RX side; and
>   it's not necessarily clear whether we'll want to support REDIRECT on
>   the egress side either, is it?)

Why should REDIRECT not be allowed in the egress path? e.g., service
chaining or capturing suspicious packets (e.g., encap with a header and
redirect somewhere for analysis).

> 
> So we'll have to disambiguate between the two different types of
> programs. Which means that what we're discussing is really whether that
> disambiguation should be encoded in the program type, or in the attach
> type. IMO, making it a separate program type is a clearer and more
> explicit UAPI. The verifier could still treat the two program types as
> basically equivalent except for those cases where there has to be a
> difference anyway. So it seems to me that all you are saving by using
> attach_type instead of program type is the need for a new enum value and
> a bunch of additions to switch statements? Or am I wildly
> underestimating the effort to add a new program type?
> 

IMHO that is duplicating code and APIs for no real reason. XDP refers to
fast path processing, the only difference is where the program is
attached - Rx side or Tx side.
David Ahern Feb. 2, 2020, 5:59 p.m. UTC | #27
On 1/25/20 9:54 PM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote:
>>
>> That said, Martin's comment throws a wrench in the goal: if the existing
>> code does not enforce expected_attach_type then that option can not be
>> used in which case I guess I have to go with a new program type
>> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
>> has different return codes, etc.
> 
> This is acceptable risk. We did such thing in the past. The chances of
> user space breakage are extremely low.
> 

Ultimately that is a decision for the maintainers. Code wise both
iproute2 and libbpf always initialize bpf_attr to 0 and given the many
uses of that union it seems odd that someone would initialize one field
at a time.

Unless someone comes back with a strong 'hell, no' I am planning to send
the next RFC version with the current API.
Jakub Kicinski Feb. 2, 2020, 7:12 p.m. UTC | #28
On Sun, 2 Feb 2020 10:45:27 -0700, David Ahern wrote:
> On 2/1/20 10:08 AM, Jakub Kicinski wrote:
> > If EGRESS is only for XDP frames we could try to hide the handling in
> > the core (with slight changes to XDP_TX handling in the drivers),  
> 
> It is not. I have said multiple times it is to work on ALL packets that
> hit the xmit function, both skbs and xdp_frames.

Okay, I should have said "was to be", I guess?

 If EGRESS _was_to_be_ only for XDP frames we could try to hide the
 handling in the core (with slight changes to XDP_TX handling in the
 drivers), making drivers smaller and XDP feature velocity higher.

I understand you'd like a hook for all packets, that is clear.
I'm just trying to highlight the cost and consequences.
Jakub Kicinski Feb. 2, 2020, 7:31 p.m. UTC | #29
On Sun, 2 Feb 2020 10:43:58 -0700, David Ahern wrote:
> you are making a lot of assumptions and frankly it's the 'c' word
> (complex) that I want to avoid. I do not believe in the OVS style of
> packet processing - one gigantic program with a bunch of logic and data
> driven flow lookups that affect every packet. I prefer simple, singly
> focused programs logically concatenated to make decisions. Simpler
> management, simpler lifecycle. The scope, scale and lifecycle management
> of VMs/containers is just as important as minimizing the cycles spent
> per packet. XDP in the Tx path is a missing primitive to make life simple.
> 
> As an example, the program on the ingress NICs to the host can be
> nothing more than a demux'er - use the ethernet header and vlan
> (preferably with vlan offload enabled) to do a <vlan,mac> lookup. No
> packet parsing at all at this level. The lookup returns an index of
> where to redirect the packet (e.g., the tap device of a VM). At the same
> time, packets can hit the "slow path" - processing support from the full
> stack when it is needed and still packets end up at the tap device. *IF*
> there is an ingress ACL for that tap device managed by the host, it can
> exist in 1 place - a program and map attached to the tap device limited
> to that tap device's networking function (VMs can have multiple
> connections to different networks with different needs at this point),
> and the program gets run for both XDP fast path and skb slow path.

I think our perspectives will be difficult to reconcile.

My reading is that you look at this from slow software device
perspective. Of which there are few (and already loaded with hacks).
And you want the control plane to be simple rather than performance.

I look at this from HW driver perspective of which there are many.
Saying that it's fine to load TX paths of all drivers with extra 
code, and that it's fine to effectively disable SG in the entire 
system just doesn't compute.

Is that a fair summary of the arguments?
David Ahern Feb. 2, 2020, 9:51 p.m. UTC | #30
On 2/2/20 12:31 PM, Jakub Kicinski wrote:
> 
> I think our perspectives will be difficult to reconcile.
> 
> My reading is that you look at this from slow software device
> perspective. Of which there are few (and already loaded with hacks).
> And you want the control plane to be simple rather than performance.
> 
> I look at this from HW driver perspective of which there are many.
> Saying that it's fine to load TX paths of all drivers with extra 
> code, and that it's fine to effectively disable SG in the entire 
> system just doesn't compute.
> 
> Is that a fair summary of the arguments?
> 

I do not believe so.

My argument is about allowing XDP and full stack to work
synergistically: Fast path for known unicast traffic, and slow path for
BUM traffic without duplicating config such as ACLs and forcing a
specific design (such as forcing all programs in the XDP Rx path which
does not work for all traffic patterns). For example, if I am willing to
trade a few cycles of over head (redirect processing) for a simpler
overall solution then I should be to do that. Fast path with XDP is
still faster than the full host stack for known unicast traffic.
Toke Høiland-Jørgensen Feb. 3, 2020, 7:56 p.m. UTC | #31
Jakub Kicinski <kuba@kernel.org> writes:

> On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:  
>> >> > I'm weary of partially implemented XDP features, EGRESS prog does us
>> >> > no good when most drivers didn't yet catch up with the REDIRECTs.    
>> >> 
>> >> I kinda agree with this; but on the other hand, if we have to wait for
>> >> all drivers to catch up, that would mean we couldn't add *anything*
>> >> new that requires driver changes, which is not ideal either :/  
>> >
>> > If EGRESS is only for XDP frames we could try to hide the handling in
>> > the core (with slight changes to XDP_TX handling in the drivers),
>> > making drivers smaller and XDP feature velocity higher.  
>> 
>> But if it's only for XDP frames that are REDIRECTed, then one might as
>> well perform whatever action the TX hook was doing before REDIRECTing
>> (as you yourself argued)... :)
>
> Right, that's why I think the design needs to start from queuing which
> can't be done today, and has to be done in context of the destination.
> Solving queuing justifies the added complexity if you will :)

Right, that makes sense. Hmm, I wonder if a TX driver hook is enough?
I.e., a driver callback in the TX path could also just queue that packet
(returning XDP_QUEUED?), without the driver needing any more handling
than it will already have? I'm spitballing a little bit here, but it may
be quite straight-forward? :)

>> > I think loading the drivers with complexity is hurting us in so many
>> > ways..  
>> 
>> Yeah, but having the low-level details available to the XDP program
>> (such as HW queue occupancy for the egress hook) is one of the benefits
>> of XDP, isn't it?
>
> I think I glossed over the hope for having access to HW queue occupancy
> - what exactly are you after? 
>
> I don't think one can get anything beyond a BQL type granularity.
> Reading over PCIe is out of question, device write back on high
> granularity would burn through way too much bus throughput.
>
>> Ultimately, I think Jesper's idea of having drivers operate exclusively
>> on XDP frames and have the skb handling entirely in the core is an
>> intriguing way to resolve this problem. Though this is obviously a
>> long-term thing, and one might reasonably doubt we'll ever get there for
>> existing drivers...
>> 
>> >> > And we're adding this before we considered the queuing problem.
>> >> >
>> >> > But if I'm alone in thinking this, and I'm not convincing anyone we
>> >> > can move on :)    
>> >> 
>> >> I do share your concern that this will end up being incompatible with
>> >> whatever solution we end up with for queueing. However, I don't
>> >> necessarily think it will: I view the XDP egress hook as something
>> >> that in any case will run *after* packets are dequeued from whichever
>> >> intermediate queueing it has been through (if any). I think such a
>> >> hook is missing in any case; for instance, it's currently impossible
>> >> to implement something like CoDel (which needs to know how long a
>> >> packet spent in the queue) in eBPF.  
>> >
>> > Possibly 🤔 I don't have a good mental image of how the XDP queuing
>> > would work.
>> >
>> > Maybe once the queuing primitives are defined they can easily be
>> > hooked into the Qdisc layer. With Martin's recent work all we need is 
>> > a fifo that can store skb pointers, really...
>> >
>> > It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
>> > layer underneath.  
>> 
>> Hmm, hooking into the existing qdisc layer is an interesting idea.
>> Ultimately, I fear it won't be feasible for performance reasons; but
>> it's certainly something to consider. Maybe at least as an option?
>
> For forwarding sure, but for locally generated traffic.. 🤷‍♂️

Right, well, for locally generated traffic we already have the qdisc?
This was kinda the reason why my original thought was to add the
queueing for XDP only at REDIRECT time. I.e., we already have the
xdp_dev_bulk_queue (which now even lives in struct net_device), so we
could "just" extend that and make it into a proper queueing structure,
and call it a day? :)

But from your comments it sounds like that when you're saying "BPF
queueing" you mean that the queueing itself should be programmable using
BPF? To me, the most urgent thing has been to figure out a way to do
*any* kind of queueing with XDP-forwarded frames, so haven't given much
thought to what a "fully programmable" queue would look like... Do you
have any thoughts?

-Toke
Toke Høiland-Jørgensen Feb. 3, 2020, 8:09 p.m. UTC | #32
David Ahern <dsahern@gmail.com> writes:

> On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote:
>> 
>> In any case an egress program will differ in:
>> 
>> - The context object (the RX-related fields will be invalid on egress,
>>   and we'll probably want to add new TX-related ones, such as HW
>>   TX-queue occupancy).
>
> Jakub has suggested that rx_queue_index can be a union with
> tx_queue_index; the former for the Rx path and the latter for the egress.
>
> The rest of the fields in xdp_md are legit for either direction.

Right, okay, ifindex and queue index could make sense in both
directions. But it would still mean that we would limit ourselves to
only adding new fields that would work on both RX and TX. Maybe that is
OK, though.

>> - The return code semantics (even if XDP_TX becomes equivalent to
>>   XDP_PASS, that is still a semantic difference from the RX side; and
>>   it's not necessarily clear whether we'll want to support REDIRECT on
>>   the egress side either, is it?)
>
> Why should REDIRECT not be allowed in the egress path? e.g., service
> chaining or capturing suspicious packets (e.g., encap with a header and
> redirect somewhere for analysis).

Implementation and deployment complexity, mostly (loops!)? I do agree
that it would be nice to allow it, I'm just not entirely convinced that
it's feasible...

>> So we'll have to disambiguate between the two different types of
>> programs. Which means that what we're discussing is really whether that
>> disambiguation should be encoded in the program type, or in the attach
>> type. IMO, making it a separate program type is a clearer and more
>> explicit UAPI. The verifier could still treat the two program types as
>> basically equivalent except for those cases where there has to be a
>> difference anyway. So it seems to me that all you are saving by using
>> attach_type instead of program type is the need for a new enum value and
>> a bunch of additions to switch statements? Or am I wildly
>> underestimating the effort to add a new program type?
>> 
>
> IMHO that is duplicating code and APIs for no real reason. XDP refers to
> fast path processing, the only difference is where the program is
> attached - Rx side or Tx side.

Sure, if we can guarantee API and semantic equivalence. I.e., if any XDP
program really *can* be attached as both an RX and TX program, I have no
objections to re-using the program type. But if it turns out that there
*is* a difference, we're making implicit subtypes anyway, in which case
I think it's better to be explicit about the difference.

-Toke
Toke Høiland-Jørgensen Feb. 3, 2020, 8:13 p.m. UTC | #33
Oops, I see I forgot to reply to this bit:

>> Yeah, but having the low-level details available to the XDP program
>> (such as HW queue occupancy for the egress hook) is one of the benefits
>> of XDP, isn't it?
>
> I think I glossed over the hope for having access to HW queue occupancy
> - what exactly are you after? 
>
> I don't think one can get anything beyond a BQL type granularity.
> Reading over PCIe is out of question, device write back on high
> granularity would burn through way too much bus throughput.

This was Jesper's idea originally, so maybe he can explain better; but
as I understood it, he basically wanted to expose the same information
that BQL has to eBPF. Making it possible for an eBPF program to either
(re-)implement BQL with its own custom policy, or react to HWQ pressure
in some other way, such as by load balancing to another interface.

-Toke
Jesper Dangaard Brouer Feb. 3, 2020, 10:15 p.m. UTC | #34
On Mon, 03 Feb 2020 21:13:24 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Oops, I see I forgot to reply to this bit:
> 
> >> Yeah, but having the low-level details available to the XDP program
> >> (such as HW queue occupancy for the egress hook) is one of the benefits
> >> of XDP, isn't it?  
> >
> > I think I glossed over the hope for having access to HW queue occupancy
> > - what exactly are you after? 
> >
> > I don't think one can get anything beyond a BQL type granularity.
> > Reading over PCIe is out of question, device write back on high
> > granularity would burn through way too much bus throughput.  
> 
> This was Jesper's idea originally, so maybe he can explain better; but
> as I understood it, he basically wanted to expose the same information
> that BQL has to eBPF. Making it possible for an eBPF program to either
> (re-)implement BQL with its own custom policy, or react to HWQ pressure
> in some other way, such as by load balancing to another interface.

Yes, and I also have plans that goes beyond BQL. But let me start with
explaining the BQL part, and answer Toke's question below.

On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
> [...] Hmm, I wonder if a TX driver hook is enough?

Short answer is no, a TX driver hook is not enough.  The queue state
info the TX driver hook have access to, needs to be updated once the
hardware have "confirmed" the TX-DMA operation have completed.  For
BQL/DQL this update happens during TX-DMA completion/cleanup (code
see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
point out we cannot query the device directly due to performance
implications).  It doesn't need to be a new BPF hook, just something
that update the queue state info (we could piggy back on the
netdev_tx_completed_queue() call or give TX hook access to
dev_queue->dql).

Regarding "where is the queue": For me the XDP-TX queue is the NIC
hardware queue, that this BPF hook have some visibility into and can do
filtering on. (Imagine that my TX queue is bandwidth limited, then I
can shrink the packet size and still send a "congestion" packet to my
receiver).

The bigger picture is that I envision the XDP-TX/egress hook can
open-up for taking advantage of NIC hardware TX queue features.
This also ties into the queue abstraction work by Björn+Magnus.
Today NIC hardware can do a million TX-queues, and hardware can also do
rate limiting per queue.  Thus, I also envision that the XDP-TX/egress
hook can choose/change the TX queue the packet is queue/sent on (we can
likely just overload the XDP_REDIRECT and have a new bpf map type for
this).
Toke Høiland-Jørgensen Feb. 4, 2020, 11 a.m. UTC | #35
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Mon, 03 Feb 2020 21:13:24 +0100
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Oops, I see I forgot to reply to this bit:
>> 
>> >> Yeah, but having the low-level details available to the XDP program
>> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> of XDP, isn't it?  
>> >
>> > I think I glossed over the hope for having access to HW queue occupancy
>> > - what exactly are you after? 
>> >
>> > I don't think one can get anything beyond a BQL type granularity.
>> > Reading over PCIe is out of question, device write back on high
>> > granularity would burn through way too much bus throughput.  
>> 
>> This was Jesper's idea originally, so maybe he can explain better; but
>> as I understood it, he basically wanted to expose the same information
>> that BQL has to eBPF. Making it possible for an eBPF program to either
>> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> in some other way, such as by load balancing to another interface.
>
> Yes, and I also have plans that goes beyond BQL. But let me start with
> explaining the BQL part, and answer Toke's question below.
>
> On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
>> [...] Hmm, I wonder if a TX driver hook is enough?
>
> Short answer is no, a TX driver hook is not enough.  The queue state
> info the TX driver hook have access to, needs to be updated once the
> hardware have "confirmed" the TX-DMA operation have completed.  For
> BQL/DQL this update happens during TX-DMA completion/cleanup (code
> see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
> point out we cannot query the device directly due to performance
> implications).  It doesn't need to be a new BPF hook, just something
> that update the queue state info (we could piggy back on the
> netdev_tx_completed_queue() call or give TX hook access to
> dev_queue->dql).

The question is whether this can't simply be done through bpf helpers?
bpf_get_txq_occupancy(ifindex, txqno)?

> Regarding "where is the queue": For me the XDP-TX queue is the NIC
> hardware queue, that this BPF hook have some visibility into and can do
> filtering on. (Imagine that my TX queue is bandwidth limited, then I
> can shrink the packet size and still send a "congestion" packet to my
> receiver).

I'm not sure the hardware queues will be enough, though. Unless I'm
misunderstanding something, hardware queues are (1) fairly short and (2)
FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
still need a software queueing layer on top of the hardware queue.

If the hardware is EDT-aware this may change, I suppose, but I'm not
sure if we can design the XDP queueing primitives with this assumption? :)

> The bigger picture is that I envision the XDP-TX/egress hook can
> open-up for taking advantage of NIC hardware TX queue features. This
> also ties into the queue abstraction work by Björn+Magnus. Today NIC
> hardware can do a million TX-queues, and hardware can also do rate
> limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> can choose/change the TX queue the packet is queue/sent on (we can
> likely just overload the XDP_REDIRECT and have a new bpf map type for
> this).

Yes, I think I mentioned in another email that putting all the queueing
smarts into the redirect map was also something I'd considered (well, I
do think we've discussed this in the past, so maybe not so surprising if
we're thinking along the same lines) :)

But the implication of this is also that an actual TX hook in the driver
need not necessarily incorporate a lot of new functionality, as it can
control the queueing through a combination of BPF helpers and map
updates?

-Toke
Jakub Kicinski Feb. 4, 2020, 5:09 p.m. UTC | #36
On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > On Mon, 03 Feb 2020 21:13:24 +0100
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >  
> >> Oops, I see I forgot to reply to this bit:
> >>   
> >> >> Yeah, but having the low-level details available to the XDP program
> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
> >> >> of XDP, isn't it?    
> >> >
> >> > I think I glossed over the hope for having access to HW queue occupancy
> >> > - what exactly are you after? 
> >> >
> >> > I don't think one can get anything beyond a BQL type granularity.
> >> > Reading over PCIe is out of question, device write back on high
> >> > granularity would burn through way too much bus throughput.    
> >> 
> >> This was Jesper's idea originally, so maybe he can explain better; but
> >> as I understood it, he basically wanted to expose the same information
> >> that BQL has to eBPF. Making it possible for an eBPF program to either
> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
> >> in some other way, such as by load balancing to another interface.  
> >
> > Yes, and I also have plans that goes beyond BQL. But let me start with
> > explaining the BQL part, and answer Toke's question below.
> >
> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:  
> >> [...] Hmm, I wonder if a TX driver hook is enough?  
> >
> > Short answer is no, a TX driver hook is not enough.  The queue state
> > info the TX driver hook have access to, needs to be updated once the
> > hardware have "confirmed" the TX-DMA operation have completed.  For
> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
> > see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
> > point out we cannot query the device directly due to performance
> > implications).  It doesn't need to be a new BPF hook, just something
> > that update the queue state info (we could piggy back on the
> > netdev_tx_completed_queue() call or give TX hook access to
> > dev_queue->dql).  

Interesting, that model does make sense to me.

> The question is whether this can't simply be done through bpf helpers?
> bpf_get_txq_occupancy(ifindex, txqno)?

Helper vs dev_queue->dql field access seems like a technicality.
The usual flexibility of implementation vs performance and simplicity
consideration applies.. I guess?

> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
> > hardware queue, that this BPF hook have some visibility into and can do
> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
> > can shrink the packet size and still send a "congestion" packet to my
> > receiver).  
> 
> I'm not sure the hardware queues will be enough, though. Unless I'm
> misunderstanding something, hardware queues are (1) fairly short and (2)
> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
> still need a software queueing layer on top of the hardware queue.

Jesper makes a very interesting point tho. If all the implementation
wants is FIFO queues which are services in some simple manner (that is
something that can be offloaded) we should support that.

That means REDIRECT can target multiple TX queues, and we need an API
to control the queue allocation..

> If the hardware is EDT-aware this may change, I suppose, but I'm not
> sure if we can design the XDP queueing primitives with this assumption? :)

But I agree with you as well. I think both HW and SW feeding needs to
be supported. The HW implementations are always necessarily behind
ideas people implemented and tested in SW..

> > The bigger picture is that I envision the XDP-TX/egress hook can
> > open-up for taking advantage of NIC hardware TX queue features. This
> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
> > hardware can do a million TX-queues, and hardware can also do rate
> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> > can choose/change the TX queue the packet is queue/sent on (we can
> > likely just overload the XDP_REDIRECT and have a new bpf map type for
> > this).  

I wonder what that does to our HW offload model which is based on
TC Qdisc offload today :S Do we use TC API to control configuration 
of XDP queues? :S

> Yes, I think I mentioned in another email that putting all the queueing
> smarts into the redirect map was also something I'd considered (well, I
> do think we've discussed this in the past, so maybe not so surprising if
> we're thinking along the same lines) :)
> 
> But the implication of this is also that an actual TX hook in the driver
> need not necessarily incorporate a lot of new functionality, as it can
> control the queueing through a combination of BPF helpers and map
> updates?

True, it's the dequeuing that's on the TX side, so we could go as far
as putting all the enqueuing logic in the RX prog..

To answer your question from the other email Toke, my basic model was
kind of similar to TC Qdiscs. XDP redirect selects a device, then that
device has an enqueue and dequeue programs. Enqueue program can be run
in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
some space on the TX descriptor ring. There is a "queue state" but the
FIFOs etc are sort of internal detail that the enqueue and dequeue
programs only share between each other. To be clear this is not a
suggestion of how things should be, it's what sprung to my mind without
thinking.
Toke Høiland-Jørgensen Feb. 5, 2020, 3:30 p.m. UTC | #37
Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > On Mon, 03 Feb 2020 21:13:24 +0100
>> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >  
>> >> Oops, I see I forgot to reply to this bit:
>> >>   
>> >> >> Yeah, but having the low-level details available to the XDP program
>> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> >> of XDP, isn't it?    
>> >> >
>> >> > I think I glossed over the hope for having access to HW queue occupancy
>> >> > - what exactly are you after? 
>> >> >
>> >> > I don't think one can get anything beyond a BQL type granularity.
>> >> > Reading over PCIe is out of question, device write back on high
>> >> > granularity would burn through way too much bus throughput.    
>> >> 
>> >> This was Jesper's idea originally, so maybe he can explain better; but
>> >> as I understood it, he basically wanted to expose the same information
>> >> that BQL has to eBPF. Making it possible for an eBPF program to either
>> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> >> in some other way, such as by load balancing to another interface.  
>> >
>> > Yes, and I also have plans that goes beyond BQL. But let me start with
>> > explaining the BQL part, and answer Toke's question below.
>> >
>> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:  
>> >> [...] Hmm, I wonder if a TX driver hook is enough?  
>> >
>> > Short answer is no, a TX driver hook is not enough.  The queue state
>> > info the TX driver hook have access to, needs to be updated once the
>> > hardware have "confirmed" the TX-DMA operation have completed.  For
>> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
>> > see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
>> > point out we cannot query the device directly due to performance
>> > implications).  It doesn't need to be a new BPF hook, just something
>> > that update the queue state info (we could piggy back on the
>> > netdev_tx_completed_queue() call or give TX hook access to
>> > dev_queue->dql).  
>
> Interesting, that model does make sense to me.
>
>> The question is whether this can't simply be done through bpf helpers?
>> bpf_get_txq_occupancy(ifindex, txqno)?
>
> Helper vs dev_queue->dql field access seems like a technicality.
> The usual flexibility of implementation vs performance and simplicity
> consideration applies.. I guess?

Yeah, that was my point: We can do whatever makes the most sense for
that; it doesn't necessarily have to be something that's baked into the
TX hook point.

>> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
>> > hardware queue, that this BPF hook have some visibility into and can do
>> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
>> > can shrink the packet size and still send a "congestion" packet to my
>> > receiver).  
>> 
>> I'm not sure the hardware queues will be enough, though. Unless I'm
>> misunderstanding something, hardware queues are (1) fairly short and (2)
>> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
>> still need a software queueing layer on top of the hardware queue.
>
> Jesper makes a very interesting point tho. If all the implementation
> wants is FIFO queues which are services in some simple manner (that is
> something that can be offloaded) we should support that.
>
> That means REDIRECT can target multiple TX queues, and we need an API
> to control the queue allocation..

Yes. I think that should be part of the "hwq API" that we presented at
LPC last year[0].

>> If the hardware is EDT-aware this may change, I suppose, but I'm not
>> sure if we can design the XDP queueing primitives with this assumption? :)
>
> But I agree with you as well. I think both HW and SW feeding needs to
> be supported. The HW implementations are always necessarily behind
> ideas people implemented and tested in SW..

Exactly. And even though we mostly think about 10-100G network
interfaces, I believe there is also potential for this to be useful at
the low end (think small embedded routers that are very CPU
constrained). In these sorts of environments software queueing is more
feasible.

>> > The bigger picture is that I envision the XDP-TX/egress hook can
>> > open-up for taking advantage of NIC hardware TX queue features. This
>> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
>> > hardware can do a million TX-queues, and hardware can also do rate
>> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
>> > can choose/change the TX queue the packet is queue/sent on (we can
>> > likely just overload the XDP_REDIRECT and have a new bpf map type for
>> > this).  
>
> I wonder what that does to our HW offload model which is based on
> TC Qdisc offload today :S Do we use TC API to control configuration 
> of XDP queues? :S

My thought was that it could be baked into the same API[0]. I.e., add a
way to say not just "I want 3 hardware queues for this", but also "I
want 3 hardware queues, with fq_codel (or whatever) on top".

>> Yes, I think I mentioned in another email that putting all the queueing
>> smarts into the redirect map was also something I'd considered (well, I
>> do think we've discussed this in the past, so maybe not so surprising if
>> we're thinking along the same lines) :)
>> 
>> But the implication of this is also that an actual TX hook in the driver
>> need not necessarily incorporate a lot of new functionality, as it can
>> control the queueing through a combination of BPF helpers and map
>> updates?
>
> True, it's the dequeuing that's on the TX side, so we could go as far
> as putting all the enqueuing logic in the RX prog..
>
> To answer your question from the other email Toke, my basic model was
> kind of similar to TC Qdiscs. XDP redirect selects a device, then that
> device has an enqueue and dequeue programs. Enqueue program can be run
> in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
> some space on the TX descriptor ring. There is a "queue state" but the
> FIFOs etc are sort of internal detail that the enqueue and dequeue
> programs only share between each other. To be clear this is not a
> suggestion of how things should be, it's what sprung to my mind without
> thinking.

Ah, that's interesting. I do agree that it may be better to bake this
into the existing hooks rather than have an additional set of
(en/de)queue hooks.

-Toke

[0] https://linuxplumbersconf.org/event/4/contributions/462/
diff mbox series

Patch

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@  enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME, /* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS, /* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 04cbcc930bc2..bf76dbee9d2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8664,6 +8664,12 @@  int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 
+		if (egress && prog->expected_attach_type != BPF_XDP_EGRESS) {
+			NL_SET_ERR_MSG(extack, "XDP program in egress path must use BPF_XDP_EGRESS attach type");
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+
 		if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
 			NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
 			bpf_prog_put(prog);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ed0c069ef187..2179de9350b2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1030,7 +1030,7 @@  static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
-	       + rtnl_xdp_size() /* IFLA_XDP */
+	       + rtnl_xdp_size() * 2 /* IFLA_XDP and IFLA_XDP_EGRESS */
 	       + nla_total_size(4)  /* IFLA_EVENT */
 	       + nla_total_size(4)  /* IFLA_NEW_NETNSID */
 	       + nla_total_size(4)  /* IFLA_NEW_IFINDEX */
@@ -1395,6 +1395,36 @@  static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static u32 rtnl_xdp_egress_prog_drv(struct net_device *dev)
+{
+	return __dev_xdp_query(dev, dev->netdev_ops->ndo_bpf,
+			       XDP_QUERY_PROG_EGRESS);
+}
+
+static int rtnl_xdp_egress_report(struct sk_buff *skb, struct net_device *dev,
+				  u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr,
+				  u32 (*get_prog_id)(struct net_device *dev))
+{
+	u32 curr_id;
+	int err;
+
+	curr_id = get_prog_id(dev);
+	if (!curr_id)
+		return 0;
+
+	*prog_id = curr_id;
+	err = nla_put_u32(skb, attr, curr_id);
+	if (err)
+		return err;
+
+	if (*mode != XDP_ATTACHED_NONE)
+		*mode = XDP_ATTACHED_MULTI;
+	else
+		*mode = tgt_mode;
+
+	return 0;
+}
+
 static u32 rtnl_xdp_prog_skb(struct net_device *dev)
 {
 	const struct bpf_prog *generic_xdp_prog;
@@ -1486,6 +1516,41 @@  static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 	return err;
 }
 
+static int rtnl_xdp_egress_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	u8 mode = XDP_ATTACHED_NONE;
+	struct nlattr *xdp;
+	u32 prog_id = 0;
+	int err;
+
+	xdp = nla_nest_start_noflag(skb, IFLA_XDP_EGRESS);
+	if (!xdp)
+		return -EMSGSIZE;
+
+	err = rtnl_xdp_egress_report(skb, dev, &prog_id, &mode,
+				     XDP_ATTACHED_DRV, IFLA_XDP_DRV_PROG_ID,
+				     rtnl_xdp_egress_prog_drv);
+	if (err)
+		goto err_cancel;
+
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode);
+	if (err)
+		goto err_cancel;
+
+	if (prog_id && mode != XDP_ATTACHED_MULTI) {
+		err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id);
+		if (err)
+			goto err_cancel;
+	}
+
+	nla_nest_end(skb, xdp);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, xdp);
+	return err;
+}
+
 static u32 rtnl_get_event(unsigned long event)
 {
 	u32 rtnl_event_type = IFLA_EVENT_NONE;
@@ -1743,6 +1808,9 @@  static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	if (rtnl_xdp_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_egress_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1827,6 +1895,7 @@  static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_ALT_IFNAME]	= { .type = NLA_STRING,
 				    .len = ALTIFNAMSIZ - 1 },
 	[IFLA_PERM_ADDRESS]	= { .type = NLA_REJECT },
+	[IFLA_XDP_EGRESS]	= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2808,6 +2877,47 @@  static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_XDP_EGRESS]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+		u32 xdp_flags = 0;
+
+		err = nla_parse_nested_deprecated(xdp, IFLA_XDP_MAX,
+						  tb[IFLA_XDP_EGRESS],
+						  ifla_xdp_policy, NULL);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_ATTACHED] || xdp[IFLA_XDP_PROG_ID]) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		if (xdp[IFLA_XDP_FLAGS]) {
+			xdp_flags = nla_get_u32(xdp[IFLA_XDP_FLAGS]);
+			if (xdp_flags & XDP_FLAGS_HW_MODE) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (xdp_flags & ~XDP_FLAGS_MASK) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (hweight32(xdp_flags & XDP_FLAGS_MODES) > 1) {
+				err = -EINVAL;
+				goto errout;
+			}
+		}
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev, extack,
+						nla_get_s32(xdp[IFLA_XDP_FD]),
+						xdp_flags, true);
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if ((status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY)
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@  enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME, /* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS, /* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };