Message ID:   1488166294.9415.172.camel@edumazet-glaptop3.roam.corp.google.com
State:        Superseded, archived
Delegated to: David Miller
On Sun, 2017-02-26 at 19:31 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
>
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED.
>
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
>
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h |   29 +++++++----------------
>  net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 51 insertions(+), 22 deletions(-)

I will send a v2 of this patch.

Points trying to grab NAPI_STATE_SCHED not from the device driver IRQ
handler should not set NAPI_STATE_MISSED if they fail, otherwise this
adds extra work for no purpose. One of these points is napi_watchdog(),
which can fire pretty often.
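The MISSED-bit protocol described in the patch can be sketched in
userspace C (illustration only: plain C11 atomics stand in for the
kernel's READ_ONCE()/cmpxchg(), and model_schedule_prep() /
model_complete_done() are made-up names mirroring napi_schedule_prep()
and napi_complete_done()):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for NAPIF_STATE_SCHED / NAPIF_STATE_MISSED. */
#define STATE_SCHED  (1UL << 0)
#define STATE_MISSED (1UL << 1)

/* Mirrors napi_schedule_prep(): try to grab SCHED; if it is already
 * owned by a poller, record the refused attempt in MISSED. */
static bool model_schedule_prep(atomic_ulong *state)
{
	unsigned long val, new;

	do {
		val = atomic_load(state);
		new = val | STATE_SCHED;
		if (val & STATE_SCHED)		/* someone else owns the poll */
			new |= STATE_MISSED;	/* remember we were refused */
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return !(val & STATE_SCHED);		/* true: caller owns the poll */
}

/* Mirrors napi_complete_done(): drop SCHED, unless a miss was recorded,
 * in which case keep SCHED and tell the caller to reschedule the poll. */
static bool model_complete_done(atomic_ulong *state)
{
	unsigned long val, new;

	do {
		val = atomic_load(state);
		new = val & ~(STATE_SCHED | STATE_MISSED);
		if (val & STATE_MISSED)
			new |= STATE_SCHED;	/* keep ownership, poll again */
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return (val & STATE_MISSED);		/* true: a repoll is needed */
}
```

Both bits change in one compare-exchange transaction, so a hard irq
that races with the completion either sees SCHED still set (and leaves
MISSED behind) or sees it cleared (and schedules normally); no edge is
lost either way.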
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 06:21:38 -0800

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
>
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED

Various rules were meant to protect these sequences, and make sure
nothing like this race could happen.

Can you show the specific sequence that fails?

One of the basic protections is that the device IRQ is not re-enabled
until napi_complete_done() is finished; most drivers do something like
this:

	napi_complete_done();
		- clears NAPI_STATE_SCHED
	enable device IRQ

So I don't understand how it is possible that "later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it".

While napi_complete_done() is running, the device's IRQ is still
disabled, so there cannot be an IRQ firing before napi_complete_done()
is finished.
On Mon, 2017-02-27 at 11:19 -0500, David Miller wrote:
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished; most drivers do something like
> this:
>
> 	napi_complete_done();
> 		- clears NAPI_STATE_SCHED
> 	enable device IRQ
>
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.

Any point doing a napi_schedule() not from the device hard irq handler
is subject to the race for NICs using some kind of edge triggered
interrupts.

Since we do not provide an ndo to disable device interrupts, the
following can happen:

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer

                                         Additional 3rd packet is available.

                                         device hard irq

                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned
                                         // by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (the 3rd packet in my example).
At least for mlx4, only the 4th packet will trigger the IRQ again.

In the old days, the race would not happen since napi->poll() was
called in direct response to a prior device IRQ: edge triggered hard
irqs from the device for this queue were already disabled.
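The timeline above can be replayed deterministically with a toy model
(illustration only: struct model_nic, hard_irq() and replay_race() are
invented names, and a real NIC's edge semantics are approximated by a
one-shot irq_armed flag that rearming does not replay):

```c
#include <stdbool.h>

/* A NIC that signals one edge per new packet; a swallowed edge is
 * never replayed by merely re-arming the interrupt. */
struct model_nic {
	int  ring;		/* packets waiting in the RX ring */
	bool sched;		/* NAPI_STATE_SCHED: poll ownership */
	bool irq_armed;
};

static void hard_irq(struct model_nic *n)
{
	if (!n->irq_armed)
		return;
	n->irq_armed = false;	/* edge consumed by the irq controller */
	if (n->sched)
		return;		/* swallowed: poll already owned by thread 1 */
	n->sched = true;	/* would have scheduled napi->poll() */
}

/* Returns the number of packets stranded in the ring at the end. */
static int replay_race(struct model_nic *n)
{
	n->ring = 2;
	n->sched = true;	/* thread 1 (busy poller) grabbed SCHED */
	n->irq_armed = true;

	n->ring -= 2;		/* thread 1 polls the 2 packets */
	n->ring += 1;		/* 3rd packet arrives */
	hard_irq(n);		/* thread 2: edge fires, is swallowed */

	n->sched = false;	/* napi_complete_done(napi, 2) */
	n->irq_armed = true;	/* rearm_irq(): does not replay the edge */

	return n->ring;		/* stuck until a 4th packet fires a new edge */
}
```

With the NAPI_STATE_MISSED bit, the swallowed napi_schedule() would
instead leave a mark that napi_complete_done() turns into a repoll, so
the 3rd packet would be picked up without waiting for another edge.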
On Mon, 2017-02-27 at 08:44 -0800, Eric Dumazet wrote:
> // busy polling or napi_watchdog()
BTW, we also can add to the beginning of busy_poll_stop():

	clear_bit(NAPI_STATE_MISSED, &napi->state);
On Mon, Feb 27, 2017 at 8:19 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 06:21:38 -0800
>
>> A NAPI driver normally arms the IRQ after the napi_complete_done(),
>> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
>> it.
>>
>> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
>> while IRQ are not disabled, we might have later an IRQ firing and
>> finding this bit set, right before napi_complete_done() clears it.
>>
>> This can happen with busy polling users, or if gro_flush_timeout is
>> used. But some other uses of napi_schedule() in drivers can cause this
>> as well.
>>
>> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
>> can set if it could not grab NAPI_STATE_SCHED
>
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished; most drivers do something like
> this:
>
> 	napi_complete_done();
> 		- clears NAPI_STATE_SCHED
> 	enable device IRQ
>
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.

So there are some drivers that will need to have the interrupts enabled
when busy polling, and I assume that can cause this kind of issue.
Specifically, in the case of i40e the part will not flush completed
descriptors until either 4 completed descriptors are ready to be
written back, or an interrupt fires.

Our other drivers have code in them that will force the interrupt to
unmask and fire once every 2 seconds in the unlikely event that an
interrupt was lost, which can occur on some platforms.

- Alex
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 08:44:14 -0800

> Any point doing a napi_schedule() not from device hard irq handler
> is subject to the race for NIC using some kind of edge trigger
> interrupts.
>
> Since we do not provide a ndo to disable device interrupts, the
> following can happen.

Ok, now I understand.

I think even without considering the race you are trying to solve,
this situation is really dangerous.

I am sure that every ->poll() handler out there was written by an
author who completely assumed that if they are executing then the
device's interrupts for that NAPI instance are disabled. And this is
with very few, if any, exceptions.

So if we saw a driver doing something like:

	reg->irq_enable ^= value;

after napi_complete_done(), it would be quite understandable.

We really made a mistake taking the napi_schedule() call out of the
domain of the driver so that it could manage the interrupt state
properly.

I'm not against your missed bit fix as a short-term cure for now, it's
just that somewhere down the road we need to manage the interrupt
properly.
David Miller <davem@davemloft.net> :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 08:44:14 -0800
>
> > Any point doing a napi_schedule() not from device hard irq handler
> > is subject to the race for NIC using some kind of edge trigger
> > interrupts.
> >
> > Since we do not provide a ndo to disable device interrupts, the
> > following can happen.
>
> Ok, now I understand.
>
> I think even without considering the race you are trying to solve,
> this situation is really dangerous.
>
> I am sure that every ->poll() handler out there was written by an
> author who completely assumed that if they are executing then the
> device's interrupts for that NAPI instance are disabled. And this is
> with very few, if any, exceptions.

Shareable pci irq used to remind authors that such an assumption was
not always right. Otoh it was still manageable as long as only level
triggered irqs were involved.
On Wed, 1 Mar 2017 01:22:40 +0100
Francois Romieu <romieu@fr.zoreil.com> wrote:

> David Miller <davem@davemloft.net> :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 27 Feb 2017 08:44:14 -0800
> >
> > > Any point doing a napi_schedule() not from device hard irq handler
> > > is subject to the race for NIC using some kind of edge trigger
> > > interrupts.
> > >
> > > Since we do not provide a ndo to disable device interrupts, the
> > > following can happen.
> >
> > Ok, now I understand.
> >
> > I think even without considering the race you are trying to solve,
> > this situation is really dangerous.
> >
> > I am sure that every ->poll() handler out there was written by an
> > author who completely assumed that if they are executing then the
> > device's interrupts for that NAPI instance are disabled. And this is
> > with very few, if any, exceptions.
>
> Shareable pci irq used to remind author that such an assumption was
> not always right. Otoh it was still manageable as long as only level
> triggered irq were involved.

When I had to deal with that in sky2, the best way was to have a single
NAPI poll handler shared between both ports. Works well and avoids
races in interrupt handling and enabling.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi poll */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..82d868049ba78260a5f28376842657b72bd31994 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable to
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+		if (unlikely(val & NAPIF_STATE_SCHED))
+			new |= NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  *	__napi_schedule_irqoff - schedule for receive
  *	@n: entry to schedule
  *
@@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		if (unlikely(val & NAPIF_STATE_MISSED))
+			new |= NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED))
+		__napi_schedule(n);
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);