[net] net: solve a NAPI race

Message ID 1488166294.9415.172.camel@edumazet-glaptop3.roam.corp.google.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 27, 2017, 3:31 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in the BH handler, I was confident it was not
a sender issue or a drop in the network.

This would happen with a very low probability, but it was hurting RPC
workloads.

A NAPI driver normally arms the IRQ after napi_complete_done(), once
NAPI_STATE_SCHED has been cleared, so that the hard irq handler can
grab it again.
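
For reference, the usual shape of this pattern looks roughly like the
following minimal sketch (the mydrv_* names are invented placeholders,
not taken from any specific driver):

static int mydrv_napi_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_ring *ring = container_of(napi, struct mydrv_ring, napi);
	int work_done;

	/* consume up to @budget packets from the RX ring */
	work_done = mydrv_clean_rx(ring, budget);

	if (work_done < budget) {
		/* ring drained: clear NAPI_STATE_SCHED ... */
		napi_complete_done(napi, work_done);
		/* ... and only then re-arm the device IRQ, so the hard
		 * irq handler may call napi_schedule() again
		 */
		mydrv_enable_irq(ring);
	}
	return work_done;
}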

The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, we might later have
an IRQ firing and finding this bit set, right before
napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   29 +++++++----------------
 net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 22 deletions(-)
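
Condensed from the full diff at the bottom of this page, the heart of
the change is two cmpxchg() retry loops over napi->state (fragments
only; the WARN_ON_ONCE() and poll_list handling are omitted):

	/* in napi_schedule_prep(), inside its cmpxchg() loop:
	 * the caller could not take SCHED, so record a miss
	 */
	new = val | NAPIF_STATE_SCHED;
	if (unlikely(val & NAPIF_STATE_SCHED))
		new |= NAPIF_STATE_MISSED;

	/* in napi_complete_done(), inside its cmpxchg() loop:
	 * clear both bits, but if a miss was recorded while we were
	 * polling, keep SCHED so the napi stays owned
	 */
	new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
	if (unlikely(val & NAPIF_STATE_MISSED))
		new |= NAPIF_STATE_SCHED;

	/* once that loop succeeds, actually reschedule the poll */
	if (unlikely(val & NAPIF_STATE_MISSED))
		__napi_schedule(n);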

Comments

Eric Dumazet Feb. 27, 2017, 2:06 p.m. UTC | #1
On Sun, 2017-02-26 at 19:31 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
> 
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
> 
> This would happen with a very low probability, but hurting RPC
> workloads.
> 
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
> 
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
> 
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h |   29 +++++++----------------
>  net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 51 insertions(+), 22 deletions(-)

I will send a v2 of this patch.

Points that try to grab NAPI_STATE_SCHED from outside the device driver
IRQ handler should not set NAPI_STATE_MISSED if they fail; otherwise
this adds extra work for no purpose.

One of these points is napi_watchdog(), which can fire pretty often.
David Miller Feb. 27, 2017, 4:19 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 06:21:38 -0800

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED

Various rules were meant to protect these sequences, and make sure
nothing like this race could happen.

Can you show the specific sequence that fails?

One of the basic protections is that the device IRQ is not re-enabled
until napi_complete_done() is finished; most drivers do something like
this:

	napi_complete_done();
		- clears NAPI_STATE_SCHED
	enable device IRQ

So I don't understand how it is possible that "later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it".

While napi_complete_done() is running, the device's IRQ is still
disabled, so there cannot be an IRQ firing before napi_complete_done()
is finished.
Eric Dumazet Feb. 27, 2017, 4:44 p.m. UTC | #3
On Mon, 2017-02-27 at 11:19 -0500, David Miller wrote:

> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
> 
> Can you show the specific sequence that fails?
> 
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
> 
> 	napi_complete_done();
> 		- sets NAPI_STATE_SCHED
> 	enable device IRQ
> 
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
> 
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.


Any point doing a napi_schedule() not from the device hard irq handler
is subject to this race on NICs using some kind of edge-triggered
interrupts.

Since we do not provide an ndo to disable device interrupts,
the following can happen.

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is available.
                                          device hard irq

                                          // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();
                                          
napi_complete_done(napi, 2);
rearm_irq();


Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (the 3rd packet in my example).

At least for mlx4, only the 4th packet will trigger the IRQ again.

In the old days, the race would not happen since napi->poll() was
called in direct response to a prior device IRQ: edge-triggered hard
irqs from the device for this queue were already disabled.
Eric Dumazet Feb. 27, 2017, 5:10 p.m. UTC | #4
On Mon, 2017-02-27 at 08:44 -0800, Eric Dumazet wrote:

> // busy polling or napi_watchdog()

BTW, we can also add this to the beginning of busy_poll_stop():

clear_bit(NAPI_STATE_MISSED, &napi->state);
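
i.e. at the top of the function, before it runs its final napi->poll()
round and gives the queue back to interrupt mode (sketch only; the rest
of busy_poll_stop() is unchanged):

	/* a MISSED event recorded while we were busy polling does not
	 * need the extra reschedule in napi_complete_done(): we are
	 * about to hand the queue back to interrupt mode anyway
	 */
	clear_bit(NAPI_STATE_MISSED, &napi->state);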
Alexander H Duyck Feb. 27, 2017, 9 p.m. UTC | #5
On Mon, Feb 27, 2017 at 8:19 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 06:21:38 -0800
>
>> A NAPI driver normally arms the IRQ after the napi_complete_done(),
>> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
>> it.
>>
>> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
>> while IRQ are not disabled, we might have later an IRQ firing and
>> finding this bit set, right before napi_complete_done() clears it.
>>
>> This can happen with busy polling users, or if gro_flush_timeout is
>> used. But some other uses of napi_schedule() in drivers can cause this
>> as well.
>>
>> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
>> can set if it could not grab NAPI_STATE_SCHED
>
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
>
>         napi_complete_done();
>                 - sets NAPI_STATE_SCHED
>         enable device IRQ
>
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.

So there are some drivers that will need to have interrupts enabled
when busy polling, and I assume that can cause this kind of issue.
Specifically, in the case of i40e, the part will not flush completed
descriptors until either 4 completed descriptors are ready to be
written back or an interrupt fires.

Our other drivers have code in them that will force the interrupt to
unmask and fire once every 2 seconds in the unlikely event that an
interrupt was lost, which can occur on some platforms.

- Alex
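
Roughly this kind of safety net, as a sketch (the mydrv_* and MYDRV_*
names are invented stand-ins for a driver's interrupt-mask and
software-trigger registers, not Intel driver code):

/* runs from delayed work roughly every 2 seconds */
static void mydrv_service_task(struct work_struct *work)
{
	struct mydrv_adapter *adapter =
		container_of(work, struct mydrv_adapter, service_task.work);

	/* unmask the queue vectors and request a self-induced interrupt,
	 * so a queue whose interrupt edge was lost gets polled again
	 */
	writel(adapter->vector_mask, adapter->hw_addr + MYDRV_REG_IMS);
	writel(adapter->vector_mask, adapter->hw_addr + MYDRV_REG_ICS);

	schedule_delayed_work(&adapter->service_task, 2 * HZ);
}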
David Miller Feb. 28, 2017, 2:08 a.m. UTC | #6
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 08:44:14 -0800

> Any point doing a napi_schedule() not from device hard irq handler
> is subject to the race for NIC using some kind of edge trigger
> interrupts.
> 
> Since we do not provide a ndo to disable device interrupts, the
> following can happen.

Ok, now I understand.

I think even without considering the race you are trying to solve,
this situation is really dangerous.

I am sure that every ->poll() handler out there was written by an
author who completely assumed that if they are executing then the
device's interrupts for that NAPI instance are disabled.  And this is
with very few, if any, exceptions.

So if we saw a driver doing something like:

	reg->irq_enable ^= value;

after napi_complete_done(), it would be quite understandable.

We really made a mistake taking the napi_schedule() call out of
the domain of the driver so that it could manage the interrupt
state properly.

I'm not against your missed bit fix as a short-term cure for now, it's
just that somewhere down the road we need to manage the interrupt
properly.
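
Purely as an illustration of that direction (nothing like this exists
today; the hook names below are invented):

	/* hypothetical: let the core mask/unmask the device IRQ for a
	 * given napi instance, instead of trusting each driver's timing
	 */
	struct napi_irq_ops {
		void (*irq_disable)(struct napi_struct *napi);
		void (*irq_enable)(struct napi_struct *napi);
	};

	/* busy polling or napi_watchdog() could then mask first: */
	napi->irq_ops->irq_disable(napi);
	napi_schedule(napi);
	/* ... napi->poll() runs with the device IRQ masked ... */
	napi_complete_done(napi, work_done);
	napi->irq_ops->irq_enable(napi);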
Francois Romieu March 1, 2017, 12:22 a.m. UTC | #7
David Miller <davem@davemloft.net> :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 08:44:14 -0800
> 
> > Any point doing a napi_schedule() not from device hard irq handler
> > is subject to the race for NIC using some kind of edge trigger
> > interrupts.
> > 
> > Since we do not provide a ndo to disable device interrupts, the
> > following can happen.
> 
> Ok, now I understand.
> 
> I think even without considering the race you are trying to solve,
> this situation is really dangerous.
> 
> I am sure that every ->poll() handler out there was written by an
> author who completely assumed that if they are executing then the
> device's interrupts for that NAPI instance are disabled.  And this is
> with very few, if any, exceptions.

Shareable PCI irqs used to remind authors that such an assumption was
not always right. OTOH it was still manageable as long as only
level-triggered irqs were involved.
Stephen Hemminger March 1, 2017, 1:04 a.m. UTC | #8
On Wed, 1 Mar 2017 01:22:40 +0100
Francois Romieu <romieu@fr.zoreil.com> wrote:

> David Miller <davem@davemloft.net> :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 27 Feb 2017 08:44:14 -0800
> >   
> > > Any point doing a napi_schedule() not from device hard irq handler
> > > is subject to the race for NIC using some kind of edge trigger
> > > interrupts.
> > > 
> > > Since we do not provide a ndo to disable device interrupts, the
> > > following can happen.  
> > 
> > Ok, now I understand.
> > 
> > I think even without considering the race you are trying to solve,
> > this situation is really dangerous.
> > 
> > I am sure that every ->poll() handler out there was written by an
> > author who completely assumed that if they are executing then the
> > device's interrupts for that NAPI instance are disabled.  And this is
> > with very few, if any, exceptions.  
> 
> Shareable pci irq used to remind author that such an assumption was
> not always right. Otoh it was still manageable as long as level only
> triggered irq were involved.
> 

When I had to deal with that in sky2, the best way was to have a single
NAPI poll handler shared between both ports. Works well and avoids races
in interrupt handling and enabling.
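
A rough sketch of that layout (simplified; the shared_hw structure and
hw_* helpers are placeholders, not the actual sky2 code):

/* one NAPI instance owned by the shared hw block, not one per port */
static int hw_napi_poll(struct napi_struct *napi, int budget)
{
	struct shared_hw *hw = container_of(napi, struct shared_hw, napi);
	int work_done;

	/* drain the single status ring that both ports feed */
	work_done = hw_process_status_ring(hw, budget);

	if (work_done < budget) {
		napi_complete_done(napi, work_done);
		hw_enable_shared_irq(hw);	/* the only re-arm point */
	}
	return work_done;
}

With a single SCHED owner and a single re-arm point, the two ports
cannot race each other over the same interrupt.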

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@  struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi poll */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@  enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@  static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..82d868049ba78260a5f28376842657b72bd31994 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@  void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+		if (unlikely(val & NAPIF_STATE_SCHED))
+			new |= NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
@@ -4897,7 +4923,7 @@  EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@  bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		if (unlikely(val & NAPIF_STATE_MISSED))
+			new |= NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED))
+		__napi_schedule(n);
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);