[bpf-next,0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full

Message ID 20200904135332.60259-1-bjorn.topel@gmail.com

Message

Björn Töpel Sept. 4, 2020, 1:53 p.m. UTC
This series addresses a problem that arises when AF_XDP zero-copy is
enabled, and the kernel softirq Rx processing and userland process are
running on the same core.

In contrast to the two-core case, when the userland process and the Rx
softirq share one core, it is very important that the kernel does not
do unnecessary work, but instead lets the userland process run. This
has not been the case.

For the Intel drivers, when an XDP_REDIRECT to an AF_XDP socket fails
due to a full Rx ring, the NAPI loop simply drops the packet and
continues with the next one. The next XDP_REDIRECT will then fail
again, since userland has not had a chance to empty the full Rx ring.

The fix is to let the NAPI loop exit early if the AF_XDP socket Rx
ring is full.

The outline is as follows: the first patch cleans up the error codes
returned by xdp_do_redirect(), so that a driver can figure out when
the Rx ring is full (ENOBUFS). Patch two adds an extended
xdp_do_redirect() variant that returns what kind of map was used in
the XDP_REDIRECT action. The third patch adds an AF_XDP driver helper
to figure out if the Rx ring is full. Finally, the last three patches
implement the "early exit" support for the Intel drivers.
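
For illustration, a rough sketch of the driver-side pattern (the helper
signatures here are inferred from the patch titles and may not match
the actual patches exactly):

#include <linux/filter.h>
#include <net/xdp_sock_drv.h>

/* Hypothetical use of the new helpers in a driver's Rx/NAPI path;
 * returns true if the NAPI loop should finish early.
 */
static bool napi_rx_handle_redirect(struct net_device *dev,
				    struct xdp_buff *xdp,
				    struct bpf_prog *xdp_prog)
{
	enum bpf_map_type map_type;
	int err;

	err = xdp_do_redirect_ext(dev, xdp, xdp_prog, &map_type);
	if (!err)
		return false;	/* redirected; keep processing the ring */

	/* -ENOBUFS from an XSKMAP redirect means the AF_XDP Rx ring is
	 * full, so stop the NAPI loop and let userland drain the ring.
	 */
	if (xsk_do_redirect_rx_full(err, map_type))
		return true;

	/* Any other error: drop the frame and continue. */
	return false;
}

When this returns true, the driver would presumably return the full
budget from its NAPI poll routine, so the queue is revisited without
rearming the interrupt.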

On my machine the "one core scenario Rx drop" performance went from
~65Kpps to 21Mpps. In other words, from "not usable" to
"usable". YMMV.

I prefer to route this series via bpf-next, since it includes core
changes, not only driver changes.


Have a nice weekend!
Björn

Björn Töpel (6):
  xsk: improve xdp_do_redirect() error codes
  xdp: introduce xdp_do_redirect_ext() function
  xsk: introduce xsk_do_redirect_rx_full() helper
  i40e, xsk: finish napi loop if AF_XDP Rx queue is full
  ice, xsk: finish napi loop if AF_XDP Rx queue is full
  ixgbe, xsk: finish napi loop if AF_XDP Rx queue is full

 drivers/net/ethernet/intel/i40e/i40e_xsk.c   | 23 ++++++++++++++------
 drivers/net/ethernet/intel/ice/ice_xsk.c     | 23 ++++++++++++++------
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 23 ++++++++++++++------
 include/linux/filter.h                       |  2 ++
 include/net/xdp_sock_drv.h                   |  9 ++++++++
 net/core/filter.c                            | 16 ++++++++++++--
 net/xdp/xsk.c                                |  2 +-
 net/xdp/xsk_queue.h                          |  2 +-
 8 files changed, 75 insertions(+), 25 deletions(-)


base-commit: 8eb629585d2231e90112148009e2a11b0979ca38

Comments

Björn Töpel Sept. 4, 2020, 1:59 p.m. UTC | #1
On 2020-09-04 15:53, Björn Töpel wrote:
> This series addresses a problem that arises when AF_XDP zero-copy is 
> enabled, and the kernel softirq Rx processing and userland process
> are running on the same core.
> 
[...]
> 

@Maxim I'm not well versed in Mellanox drivers. Would this be relevant 
to mlx5 as well?


Cheers,
Björn
Jesper Dangaard Brouer Sept. 4, 2020, 2:27 p.m. UTC | #2
On Fri,  4 Sep 2020 15:53:25 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> On my machine the "one core scenario Rx drop" performance went from
> ~65Kpps to 21Mpps. In other words, from "not usable" to
> "usable". YMMV.

We have observed this kind of dropping off an edge before with softirq
(when userspace process runs on same RX-CPU), but I thought that Eric
Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").

I wonder what makes AF_XDP different or if the problem has come back?
Björn Töpel Sept. 4, 2020, 2:32 p.m. UTC | #3
On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
> On Fri,  4 Sep 2020 15:53:25 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
> 
>> On my machine the "one core scenario Rx drop" performance went from
>> ~65Kpps to 21Mpps. In other words, from "not usable" to
>> "usable". YMMV.
> 
> We have observed this kind of dropping off an edge before with softirq
> (when userspace process runs on same RX-CPU), but I thought that Eric
> Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").
> 
> I wonder what makes AF_XDP different or if the problem has come back?
> 

I would say this is not the same issue. The problem is that the softirq 
is busy dropping packets since the AF_XDP Rx ring is full. So, the cycles
*are* split 50/50, which is not what we want in this case. :-)

This issue is more a case of "the Intel AF_XDP ZC drivers do stupid work"
than of fairness. If the Rx ring is full, there is really no use in
letting the NAPI loop continue.

Would you agree, or am I rambling? :-P


Björn
Jakub Kicinski Sept. 4, 2020, 11:58 p.m. UTC | #4
On Fri, 4 Sep 2020 16:32:56 +0200 Björn Töpel wrote:
> On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
> > On Fri,  4 Sep 2020 15:53:25 +0200
> > Björn Töpel <bjorn.topel@gmail.com> wrote:
> >   
> >> On my machine the "one core scenario Rx drop" performance went from
> >> ~65Kpps to 21Mpps. In other words, from "not usable" to
> >> "usable". YMMV.  
> > 
> > We have observed this kind of dropping off an edge before with softirq
> > (when userspace process runs on same RX-CPU), but I thought that Eric
> > Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").
> > 
> > I wonder what makes AF_XDP different or if the problem has come back?
> >   
> 
> I would say this is not the same issue. The problem is that the softirq 
> is busy dropping packets since the AF_XDP Rx ring is full. So, the cycles
> *are* split 50/50, which is not what we want in this case. :-)
> 
> This issue is more a case of "the Intel AF_XDP ZC drivers do stupid work"
> than of fairness. If the Rx ring is full, there is really no use in
> letting the NAPI loop continue.
> 
> Would you agree, or am I rambling? :-P

I wonder if ksoftirqd never kicks in because we are able to discard 
the entire ring before we run out of softirq "slice".


I've been pondering the exact problem you're solving with Maciej
recently. The efficiency of AF_XDP on one core with the NAPI processing.

Your solution (even though it admittedly helps, and is quite simple)
still has the application potentially not able to process packets 
until the queue fills up. This will be bad for latency.

Why don't we move closer to application polling? Never re-arm the NAPI 
after RX, let the application ask for packets, re-arm if 0 polled. 
You'd get max batching, min latency.

Who's the rambling one now? :-D
Björn Töpel Sept. 7, 2020, 1:37 p.m. UTC | #5
On 2020-09-05 01:58, Jakub Kicinski wrote:
 > On Fri, 4 Sep 2020 16:32:56 +0200 Björn Töpel wrote:
 >> On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
 >>> On Fri,  4 Sep 2020 15:53:25 +0200
 >>> Björn Töpel <bjorn.topel@gmail.com> wrote:
 >>>
 >>>> On my machine the "one core scenario Rx drop" performance went from
 >>>> ~65Kpps to 21Mpps. In other words, from "not usable" to
 >>>> "usable". YMMV.
 >>>
 >>> We have observed this kind of dropping off an edge before with softirq
 >>> (when userspace process runs on same RX-CPU), but I thought that Eric
 >>> Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").
 >>>
 >>> I wonder what makes AF_XDP different or if the problem has come back?
 >>>
 >>
 >> I would say this is not the same issue. The problem is that the softirq
 >> is busy dropping packets since the AF_XDP Rx ring is full. So, the cycles
 >> *are* split 50/50, which is not what we want in this case. :-)
 >>
 >> This issue is more a case of "the Intel AF_XDP ZC drivers do stupid work"
 >> than of fairness. If the Rx ring is full, there is really no use in
 >> letting the NAPI loop continue.
 >>
 >> Would you agree, or am I rambling? :-P
 >
 > I wonder if ksoftirqd never kicks in because we are able to discard
 > the entire ring before we run out of softirq "slice".
 >

This is exactly what's happening, so we're entering a "busy poll like"
behavior: syscall, then softirq/NAPI on return from the syscall, then
back to userland.

 >
 > I've been pondering the exact problem you're solving with Maciej
 > recently. The efficiency of AF_XDP on one core with the NAPI processing.
 >
 > Your solution (even though it admittedly helps, and is quite simple)
 > still has the application potentially not able to process packets
 > until the queue fills up. This will be bad for latency.
 >
 > Why don't we move closer to application polling? Never re-arm the NAPI
 > after RX, let the application ask for packets, re-arm if 0 polled.
 > You'd get max batching, min latency.
 >
 > Who's the rambling one now? :-D
 >

:-D No, these are all very good ideas! We've actually experimented
with it with the busy-poll series a while back -- NAPI busy-polling
does exactly "application polling".

However, I wonder if the busy-polling would have better performance
than the scenario above (i.e. when the ksoftirqd never kicks in)?
Executing the NAPI poll *explicitly* in the syscall, or implicitly
from the softirq.

Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.

Or did you have something completely different in mind?
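
For concreteness, a rough sketch of that hook (sk_busy_loop() is the
closest existing helper to the "sk_busy_poll_loop()" above, and the
napi_id plumbing from the Rx queue to the socket does not exist yet,
so treat all of this as an assumption):

#include <net/busy_poll.h>

/* Sketch only: assumes sk->sk_napi_id has been populated from the
 * driver's Rx queue at bind/setup time. The real xsk_sendmsg() body
 * is elided.
 */
static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
{
	struct sock *sk = sock->sk;

	if (sk_can_busy_loop(sk))
		sk_busy_loop(sk, 1);	/* nonblocking NAPI busy-poll */

	/* ... existing Tx descriptor handling ... */
	return 0;
}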

As for this patch set, I think it would make sense to pull it in since
it makes the single-core scenario *much* better, and it is pretty
simple. Then do the application polling as another, potentially,
improvement series.


Thoughts? Thanks a lot for the feedback!
Björn
Jakub Kicinski Sept. 7, 2020, 6:40 p.m. UTC | #6
On Mon, 7 Sep 2020 15:37:40 +0200 Björn Töpel wrote:
>  > I've been pondering the exact problem you're solving with Maciej
>  > recently. The efficiency of AF_XDP on one core with the NAPI processing.
>  >
>  > Your solution (even though it admittedly helps, and is quite simple)
>  > still has the application potentially not able to process packets
>  > until the queue fills up. This will be bad for latency.
>  >
>  > Why don't we move closer to application polling? Never re-arm the NAPI
>  > after RX, let the application ask for packets, re-arm if 0 polled.
>  > You'd get max batching, min latency.
>  >
>  > Who's the rambling one now? :-D
>  >  
> 
> :-D No, these are all very good ideas! We've actually experimented
> with it with the busy-poll series a while back -- NAPI busy-polling
> does exactly "application polling".
> 
> However, I wonder if the busy-polling would have better performance
> than the scenario above (i.e. when the ksoftirqd never kicks in)?
> Executing the NAPI poll *explicitly* in the syscall, or implicitly
> from the softirq.
> 
> Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
> Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
> sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.
> 
> Or did you have something completely different in mind?

My understanding is that busy-polling is allowing the application to pick
up packets from the ring before the IRQ fires.

What we're more concerned about is the IRQ firing in the first place.

 application:   busy    | needs packets | idle
 -----------------------+---------------+----------------------
   standard   |         |   polls NAPI  | keep polling? sleep?
   busy poll  | IRQ on  |    IRQ off    |  IRQ off      IRQ on
 -------------+---------+---------------+----------------------
              |         |   polls once  |
    AF_XDP    | IRQ off |    IRQ off    |  IRQ on


So busy polling is pretty orthogonal. It only applies to the
"application needs packets" time. What we'd need is for the application
to be able to suppress NAPI polls, promising the kernel that it will
busy poll when appropriate.

> As for this patch set, I think it would make sense to pull it in since
> it makes the single-core scenario *much* better, and it is pretty
> simple. Then do the application polling as another, potentially,
> improvement series.

Up to you, it's extra code in the driver so mostly your code to
maintain.

I think that if we implement what I described above - everyone will
use that on a single core setup, so this set would be dead code
(assuming RQ is sized appropriately). But again, your call :)
Björn Töpel Sept. 8, 2020, 6:58 a.m. UTC | #7
On 2020-09-07 20:40, Jakub Kicinski wrote:
> On Mon, 7 Sep 2020 15:37:40 +0200 Björn Töpel wrote:
>>   > I've been pondering the exact problem you're solving with Maciej
>>   > recently. The efficiency of AF_XDP on one core with the NAPI processing.
>>   >
>>   > Your solution (even though it admittedly helps, and is quite simple)
>>   > still has the application potentially not able to process packets
>>   > until the queue fills up. This will be bad for latency.
>>   >
>>   > Why don't we move closer to application polling? Never re-arm the NAPI
>>   > after RX, let the application ask for packets, re-arm if 0 polled.
>>   > You'd get max batching, min latency.
>>   >
>>   > Who's the rambling one now? :-D
>>   >
>>
>> :-D No, these are all very good ideas! We've actually experimented
>> with it with the busy-poll series a while back -- NAPI busy-polling
>> does exactly "application polling".
>>
>> However, I wonder if the busy-polling would have better performance
>> than the scenario above (i.e. when the ksoftirqd never kicks in)?
>> Executing the NAPI poll *explicitly* in the syscall, or implicitly
>> from the softirq.
>>
>> Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
>> Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
>> sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.
>>
>> Or did you have something completely different in mind?
> 
> My understanding is that busy-polling is allowing the application to pick
> up packets from the ring before the IRQ fires.
> 
> What we're more concerned about is the IRQ firing in the first place.
> 
>   application:   busy    | needs packets | idle
>   -----------------------+---------------+----------------------
>     standard   |         |   polls NAPI  | keep polling? sleep?
>     busy poll  | IRQ on  |    IRQ off    |  IRQ off      IRQ on
>   -------------+---------+---------------+----------------------
>                |         |   polls once  |
>      AF_XDP    | IRQ off |    IRQ off    |  IRQ on
> 
> 
> So busy polling is pretty orthogonal. It only applies to the
> "application needs packets" time. What we'd need is for the application
> to be able to suppress NAPI polls, promising the kernel that it will
> busy poll when appropriate.
>

Ah, nice write-up! Thanks! A strict busy-poll mechanism, not the
opportunistic (existing) NAPI busy-poll.

This would be a new kind of mechanism, and a very much welcome one in
AF_XDP-land. More below.

>> As for this patch set, I think it would make sense to pull it in since
>> it makes the single-core scenario *much* better, and it is pretty
>> simple. Then do the application polling as another, potentially,
>> improvement series.
> 
> Up to you, it's extra code in the driver so mostly your code to
> maintain.
> 
> I think that if we implement what I described above - everyone will
> use that on a single core setup, so this set would be dead code
> (assuming RQ is sized appropriately). But again, your call :)
> 

Now, I agree that the busy-poll you describe above would be the best
option, but from my perspective it's a much larger set that involves
experimenting. I will explore that, but I still think this series should
go in sooner to make the single core scenario usable *today*.

Ok, back to the busy-poll ideas. I'll call your idea "strict busy-poll",
i.e. the NAPI loop is *only* driven by userland, and interrupts stay
disabled. "Syscall driven poll-mode driver". :-)

On the driver side (again, only talking Intel here, since that's what I
know the details of), the NAPI context would only cover AF_XDP queues,
so that other queues are not starved.

Any ideas how strict busy-poll would look, API/implementation-wise? An
option only for AF_XDP sockets? Would this make sense to regular
sockets? If so, maybe extend the existing NAPI busy-poll with a "strict"
mode?

I'll start playing around a bit, but again, I think this simple series
should go in just to make AF_XDP single core usable *today*.


Thanks!
Björn
Maxim Mikityanskiy Sept. 8, 2020, 10:32 a.m. UTC | #8
On 2020-09-04 16:59, Björn Töpel wrote:
> On 2020-09-04 15:53, Björn Töpel wrote:
>> This series addresses a problem that arises when AF_XDP zero-copy is 
>> enabled, and the kernel softirq Rx processing and userland process
>> are running on the same core.
>>
> [...]
>>
> 
> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant 
> to mlx5 as well?

Thanks for letting me know about this series! So the basic idea is to 
stop processing hardware completions if the RX ring gets full, because 
the application didn't have a chance to run? Yes, I think it's also
relevant to mlx5, the issue is not driver-specific, and a similar fix is 
applicable. However, it may lead to completion queue overflows - some 
analysis is needed to understand what happens then and how to handle it.

Regarding the feature, I think it should be opt-in (disabled by 
default), because old applications may not wakeup RX after they process 
packets in the RX ring. Is it required to change xdpsock accordingly? 
Also, when need_wakeup is disabled, your driver implementation seems to 
quit NAPI anyway, but it shouldn't happen, because no one will send a 
wakeup.

Waiting until the RX ring fills up, then passing control to the 
application and waiting until the hardware completion queue fills up, 
and so on increases latency - the busy polling approach sounds more 
legit here.

The behavior may be different depending on the driver implementation:

1. If you arm the completion queue and leave interrupts enabled on early 
exit too, the application will soon be interrupted anyway and won't have 
much time to process many packets, leading to app <-> NAPI ping-pong one 
packet at a time, making NAPI inefficient.

2. If you don't arm the completion queue on early exit and wait for the 
explicit wakeup from the application, it will easily overflow the 
hardware completion queue, because we don't have a symmetric mechanism 
to stop the application on imminent hardware queue overflow. It doesn't 
feel correct and may be trickier to handle: if the application is too 
slow, such drops should happen on driver/kernel level, not in hardware.

Which behavior is used in your drivers? Or am I missing some more options?

BTW, it should be better to pass control to the application before the 
first dropped packet, not after it has been dropped.

Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and 
XDP_TX may suffer from such behavior, so it's another point to make a 
knob on the application layer to enable/disable it.

 From the driver API perspective, I would prefer to see a simpler API if 
possible. The current API exposes things that the driver shouldn't know 
(BPF map type), and requires XSK-specific handling. It would be better 
if some specific error code returned from xdp_do_redirect was reserved 
to mean "exit NAPI early if you support it". This way we wouldn't need 
two new helpers, two xdp_do_redirect functions, and this approach would 
be extensible to other non-XSK use cases without further changes in the 
driver, and also the logic to opt-in the feature could be put inside the 
kernel.
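
To illustrate, a sketch of what the driver side could look like under
that alternative, assuming -ENOBUFS were the reserved error code (this
is not what the posted series does):

/* The driver only checks one reserved errno from the existing
 * xdp_do_redirect(); no XSK- or map-type-specific knowledge needed.
 * Returns true if the NAPI loop should finish early.
 */
static bool xdp_redirect_wants_early_exit(struct net_device *dev,
					  struct xdp_buff *xdp,
					  struct bpf_prog *xdp_prog)
{
	int err = xdp_do_redirect(dev, xdp, xdp_prog);

	/* Success, or an ordinary failure where the frame is simply
	 * dropped, both mean "keep processing the ring".
	 */
	return err == -ENOBUFS;
}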

Thanks,
Max

> 
> Cheers,
> Björn
Magnus Karlsson Sept. 8, 2020, 11:37 a.m. UTC | #9
On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 16:59, Björn Töpel wrote:
> > On 2020-09-04 15:53, Björn Töpel wrote:
> >> This series addresses a problem that arises when AF_XDP zero-copy is
> >> enabled, and the kernel softirq Rx processing and userland process
> >> are running on the same core.
> >>
> > [...]
> >>
> >
> > @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
> > to mlx5 as well?
>
> Thanks for letting me know about this series! So the basic idea is to
> stop processing hardware completions if the RX ring gets full, because
> the application didn't have a chance to run? Yes, I think it's also
> relevant to mlx5, the issue is not driver-specific, and a similar fix is
> applicable. However, it may lead to completion queue overflows - some
> analysis is needed to understand what happens then and how to handle it.
>
> Regarding the feature, I think it should be opt-in (disabled by
> default), because old applications may not wakeup RX after they process
> packets in the RX ring.

How about need_wakeup enable/disable at bind time being that opt-in,
instead of a new option? It is off by default, and when it is off, the
driver busy-spins on the Rx ring until it can put an entry there. It
will not yield to the application by returning something less than
budget. Applications need not check the need_wakeup flag. If
need_wakeup is enabled by the user, the contract is that user-space
needs to check the need_wakeup flag and act on it. If it does not,
then that is a programming error and it can be set for any unspecified
reason. No reason to modify the application, if it checks need_wakeup.
But I have not checked whether this patch behaves like that.
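
For reference, the user-space side of that contract (with need_wakeup
enabled) looks roughly like the sketch below, using the libbpf xsk
helpers; the ring/fd variables are placeholders:

#include <bpf/xsk.h>
#include <sys/socket.h>

/* "Check the need_wakeup flag and act on it": the kernel sets the flag
 * when it wants a syscall to drive Rx processing again, e.g. because
 * the driver exited its NAPI loop or ran out of fill-ring entries.
 */
static void kick_rx_if_needed(struct xsk_ring_prod *fill_ring, int xsk_fd)
{
	if (xsk_ring_prod__needs_wakeup(fill_ring))
		recvfrom(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, NULL);
}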

Good points in the rest of the mail, that I think should be addressed.

/Magnus

> Is it required to change xdpsock accordingly?
> Also, when need_wakeup is disabled, your driver implementation seems to
> quit NAPI anyway, but it shouldn't happen, because no one will send a
> wakeup.
>
> Waiting until the RX ring fills up, then passing control to the
> application and waiting until the hardware completion queue fills up,
> and so on increases latency - the busy polling approach sounds more
> legit here.
>
> The behavior may be different depending on the driver implementation:
>
> 1. If you arm the completion queue and leave interrupts enabled on early
> exit too, the application will soon be interrupted anyway and won't have
> much time to process many packets, leading to app <-> NAPI ping-pong one
> packet at a time, making NAPI inefficient.
>
> 2. If you don't arm the completion queue on early exit and wait for the
> explicit wakeup from the application, it will easily overflow the
> hardware completion queue, because we don't have a symmetric mechanism
> to stop the application on imminent hardware queue overflow. It doesn't
> feel correct and may be trickier to handle: if the application is too
> slow, such drops should happen on driver/kernel level, not in hardware.
>
> Which behavior is used in your drivers? Or am I missing some more options?
>
> BTW, it should be better to pass control to the application before the
> first dropped packet, not after it has been dropped.
>
> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
> XDP_TX may suffer from such behavior, so it's another point to make a
> knob on the application layer to enable/disable it.
>
>  From the driver API perspective, I would prefer to see a simpler API if
> possible. The current API exposes things that the driver shouldn't know
> (BPF map type), and requires XSK-specific handling. It would be better
> if some specific error code returned from xdp_do_redirect was reserved
> to mean "exit NAPI early if you support it". This way we wouldn't need
> two new helpers, two xdp_do_redirect functions, and this approach would
> be extensible to other non-XSK use cases without further changes in the
> driver, and also the logic to opt-in the feature could be put inside the
> kernel.
>
> Thanks,
> Max
>
> >
> > Cheers,
> > Björn
>
Björn Töpel Sept. 8, 2020, 12:21 p.m. UTC | #10
On 2020-09-08 13:37, Magnus Karlsson wrote:
> On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>>
>> On 2020-09-04 16:59, Björn Töpel wrote:
>>> On 2020-09-04 15:53, Björn Töpel wrote:
>>>> This series addresses a problem that arises when AF_XDP zero-copy is
>>>> enabled, and the kernel softirq Rx processing and userland process
>>>> are running on the same core.
>>>>
>>> [...]
>>>>
>>>
>>> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
>>> to mlx5 as well?
>>
>> Thanks for letting me know about this series! So the basic idea is to
>> stop processing hardware completions if the RX ring gets full, because
>> the application didn't have a chance to run? Yes, I think it's also
>> relevant to mlx5, the issue is not driver-specific, and a similar fix is
>> applicable. However, it may lead to completion queue overflows - some
>> analysis is needed to understand what happens then and how to handle it.
>>
>> Regarding the feature, I think it should be opt-in (disabled by
>> default), because old applications may not wakeup RX after they process
>> packets in the RX ring.
>

First of all; Max, thanks for some really good input as usual!


> How about need_wakeup enable/disable at bind time being that opt-in,
> instead of a new option? It is off by default, and when it is off, the
> driver busy-spins on the Rx ring until it can put an entry there. It
> will not yield to the application by returning something less than
> budget. Applications need not check the need_wakeup flag. If
> need_wakeup is enabled by the user, the contract is that user-space
> needs to check the need_wakeup flag and act on it. If it does not,
> then that is a programming error and it can be set for any unspecified
> reason. No reason to modify the application, if it checks need_wakeup.
> But I have not checked whether this patch behaves like that.
>

It does not behave exactly like this. If need_wakeup is enabled, the
NAPI loop exits, but if not, the NAPI is rearmed -- so we'll get a less
efficient loop. I'll need to address this.

I'm leaning towards a more explicit opt-in like Max suggested. As Max 
pointed out, AF_XDP/XDP_TX is actually a non-insane(!) way of using 
AF_XDP zero-copy, which will suffer from the early exit.


> Good points in the rest of the mail, that I think should be addressed.
>

Yeah, I agree. I will take a step back and rethink. I'll experiment with
the busy-polling suggested by Jakub, and also an opt-in early exit.

Thanks for all the suggestions folks!


Cheers,
Björn


> /Magnus
> 
>> Is it required to change xdpsock accordingly?
>> Also, when need_wakeup is disabled, your driver implementation seems to
>> quit NAPI anyway, but it shouldn't happen, because no one will send a
>> wakeup.
>>
>> Waiting until the RX ring fills up, then passing control to the
>> application and waiting until the hardware completion queue fills up,
>> and so on increases latency - the busy polling approach sounds more
>> legit here.
>>
>> The behavior may be different depending on the driver implementation:
>>
>> 1. If you arm the completion queue and leave interrupts enabled on early
>> exit too, the application will soon be interrupted anyway and won't have
>> much time to process many packets, leading to app <-> NAPI ping-pong one
>> packet at a time, making NAPI inefficient.
>>
>> 2. If you don't arm the completion queue on early exit and wait for the
>> explicit wakeup from the application, it will easily overflow the
>> hardware completion queue, because we don't have a symmetric mechanism
>> to stop the application on imminent hardware queue overflow. It doesn't
>> feel correct and may be trickier to handle: if the application is too
>> slow, such drops should happen on driver/kernel level, not in hardware.
>>
>> Which behavior is used in your drivers? Or am I missing some more options?
>>
>> BTW, it should be better to pass control to the application before the
>> first dropped packet, not after it has been dropped.
>>
>> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
>> XDP_TX may suffer from such behavior, so it's another point to make a
>> knob on the application layer to enable/disable it.
>>
>>   From the driver API perspective, I would prefer to see a simpler API if
>> possible. The current API exposes things that the driver shouldn't know
>> (BPF map type), and requires XSK-specific handling. It would be better
>> if some specific error code returned from xdp_do_redirect was reserved
>> to mean "exit NAPI early if you support it". This way we wouldn't need
>> two new helpers, two xdp_do_redirect functions, and this approach would
>> be extensible to other non-XSK use cases without further changes in the
>> driver, and also the logic to opt-in the feature could be put inside the
>> kernel.
>>
>> Thanks,
>> Max
>>
>>>
>>> Cheers,
>>> Björn
>>
Jakub Kicinski Sept. 8, 2020, 5:24 p.m. UTC | #11
On Tue, 8 Sep 2020 08:58:30 +0200 Björn Töpel wrote:
> >> As for this patch set, I think it would make sense to pull it in since
> >> it makes the single-core scenario *much* better, and it is pretty
> >> simple. Then do the application polling as another, potentially,
> >> improvement series.  
> > 
> > Up to you, it's extra code in the driver so mostly your code to
> > maintain.
> > 
> > I think that if we implement what I described above - everyone will
> > use that on a single core setup, so this set would be dead code
> > (assuming RQ is sized appropriately). But again, your call :)
> 
> Now, I agree that the busy-poll you describe above would be the best
> option, but from my perspective it's a much larger set that involves
> experimenting. I will explore that, but I still think this series should
> go in sooner to make the single core scenario usable *today*.
> 
> Ok, back to the busy-poll ideas. I'll call your idea "strict busy-poll",
> i.e. the NAPI loop is *only* driven by userland, and interrupts stay
> disabled. "Syscall driven poll-mode driver". :-)
> 
> On the driver side (again, only talking Intel here, since that's what I
> know the details of), the NAPI context would only cover AF_XDP queues,
> so that other queues are not starved.
> 
> Any ideas how strict busy-poll would look, API/implementation-wise? An
> option only for AF_XDP sockets? Would this make sense to regular
> sockets? If so, maybe extend the existing NAPI busy-poll with a "strict"
> mode?

For AF_XDP and other sockets I think it should be quite straightforward.

For AF_XDP just implement current busy poll.

Then for all socket types add a new sockopt which sets a "timeout" on how
long the IRQs can be suppressed for (we don't want an application crash or
hang to knock the system off the network), or which just enables the
feature, with the timeout taken from a sysctl.

Then make sure that at the end of polling the napi doesn't get scheduled,
and set some bit which will prevent napi_schedule_prep() from letting
normal IRQ processing schedule it, too. Set a timer for the timeout
handling to undo all this.

What I haven't figured out in my head is how/if this relates to the
ongoing wq/threaded NAPI polling work 🤔 but that shouldn't stop you.

> I'll start playing around a bit, but again, I think this simple series
> should go in just to make AF_XDP single core usable *today*.

No objection from me.
Björn Töpel Sept. 8, 2020, 6:28 p.m. UTC | #12
On 2020-09-08 19:24, Jakub Kicinski wrote:
>> I'll start playing around a bit, but again, I think this simple series
>> should go in just to make AF_XDP single core usable *today*.
> No objection from me.

Thanks Jakub, but as you (probably) noticed in the other thread Maxim 
had some valid concerns. Let's drop this for now, and I'll get back 
after some experimentation/hacking.


Again, thanks for the ideas! Very much appreciated!
Björn
Jakub Kicinski Sept. 8, 2020, 6:34 p.m. UTC | #13
On Tue, 8 Sep 2020 20:28:14 +0200 Björn Töpel wrote:
> On 2020-09-08 19:24, Jakub Kicinski wrote:
> >> I'll start playing around a bit, but again, I think this simple series
> >> should go in just to make AF_XDP single core usable *today*.
> > No objection from me.  
> 
> Thanks Jakub, but as you (probably) noticed in the other thread Maxim 
> had some valid concerns. Let's drop this for now, and I'll get back 
> after some experimentation/hacking.

Yeah, I sort of assumed you got the wake-up problem down :S

If it gets complicated it may not be worth pursuing this optimization.
Jesper Dangaard Brouer Sept. 9, 2020, 3:37 p.m. UTC | #14
On Tue, 8 Sep 2020 13:32:01 +0300
Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

>  From the driver API perspective, I would prefer to see a simpler API if 
> possible. The current API exposes things that the driver shouldn't know 
> (BPF map type), and requires XSK-specific handling. It would be better 
> if some specific error code returned from xdp_do_redirect was reserved 
> to mean "exit NAPI early if you support it". This way we wouldn't need 
> two new helpers, two xdp_do_redirect functions, and this approach would 
> be extensible to other non-XSK use cases without further changes in the 
> driver, and also the logic to opt-in the feature could be put inside the 
> kernel.

I agree.