diff mbox

[2.6.30-rc4] r8169: avoid losing MSI interrupts

Message ID 1243042174.3580.23.camel@obelisk.thedillows.org
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

David Dillow May 23, 2009, 1:29 a.m. UTC
The 8169 chip only generates MSI interrupts when all enabled event
sources are quiescent and one or more sources transition to active. If
not all of the active events are acknowledged, or a new event becomes
active while the existing ones are cleared in the handler, we will not
see a new interrupt.

The current interrupt handler masks off the Rx and Tx events once the
NAPI handler has been scheduled, which opens a race window in which we
can get another Rx or Tx event and never ACK'ing it, stopping all
activity until the link is reset (ifconfig down/up). Fix this by always
ACK'ing all event sources, and loop in the handler until we have all
sources quiescent.

Signed-off-by: David Dillow <dave@thedillows.org>
---
This fixes the lockups I've seen. Both MSI and level-triggered interrupt
configurations survive over an hour of testing when it would lockup in
under 90 seconds before. I am certain of the analysis of the root cause,
but there may be better ways to fix it. There may also be a theoretical
race window between the ending of a NAPI poll cycle and a link change
interrupt coming in, but I'm not sure it would matter. 

Some variant of this should also be applied to the currently running
stable trees, as the problem is long-standing.

 r8169.c |  102 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 57 insertions(+), 45 deletions(-)




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Francois Romieu May 24, 2009, 9:15 p.m. UTC | #1
David Dillow <dave@thedillows.org> :
[...]
> This fixes the lockups I've seen. Both MSI and level-triggered interrupt
> configurations survive over an hour of testing when it would lockup in
> under 90 seconds before. I am certain of the analysis of the root cause,
> but there may be better ways to fix it. There may also be a theoretical
> race window between the ending of a NAPI poll cycle and a link change
> interrupt coming in, but I'm not sure it would matter. 

It makes sense.

If I understand correctly, one should expect to find some pending Tx
event in the ISR of a failed card when reading the registers with
ethtool.

Has someone noticed it ?
David Dillow May 24, 2009, 10:55 p.m. UTC | #2
On Sun, 2009-05-24 at 23:15 +0200, Francois Romieu wrote:
> David Dillow <dave@thedillows.org> :
> [...]
> > This fixes the lockups I've seen. Both MSI and level-triggered interrupt
> > configurations survive over an hour of testing when it would lockup in
> > under 90 seconds before. I am certain of the analysis of the root cause,
> > but there may be better ways to fix it. There may also be a theoretical
> > race window between the ending of a NAPI poll cycle and a link change
> > interrupt coming in, but I'm not sure it would matter. 
> 
> It makes sense.
> 
> If I understand correctly, one should expect to find some pending Tx
> event in the ISR of a failed card when reading the registers with
> ethtool.
> 
> Has someone noticed it ?

Yes, that's part of how I came to this conclusion, I put a debug patch
together that looked at the IRQ status 2 seconds after the last IRQ came
in. Then I waited for the chip to lock and the timer to fire. It showed
0x0085 in the IntrStatus register.

I didn't know I could do that with ethtool, but that would've been a
nice way to go, too. :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 26, 2009, 5:55 a.m. UTC | #3
From: David Dillow <dave@thedillows.org>
Date: Fri, 22 May 2009 21:29:34 -0400

> The 8169 chip only generates MSI interrupts when all enabled event
> sources are quiescent and one or more sources transition to active. If
> not all of the active events are acknowledged, or a new event becomes
> active while the existing ones are cleared in the handler, we will not
> see a new interrupt.
> 
> The current interrupt handler masks off the Rx and Tx events once the
> NAPI handler has been scheduled, which opens a race window in which we
> can get another Rx or Tx event and never ACK'ing it, stopping all
> activity until the link is reset (ifconfig down/up). Fix this by always
> ACK'ing all event sources, and loop in the handler until we have all
> sources quiescent.
> 
> Signed-off-by: David Dillow <dave@thedillows.org>

I've applied this, thanks David.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Buesch May 26, 2009, 6:22 p.m. UTC | #4
On Tuesday 26 May 2009 07:55:03 David Miller wrote:
> From: David Dillow <dave@thedillows.org>
> Date: Fri, 22 May 2009 21:29:34 -0400
> 
> > The 8169 chip only generates MSI interrupts when all enabled event
> > sources are quiescent and one or more sources transition to active. If
> > not all of the active events are acknowledged, or a new event becomes
> > active while the existing ones are cleared in the handler, we will not
> > see a new interrupt.
> > 
> > The current interrupt handler masks off the Rx and Tx events once the
> > NAPI handler has been scheduled, which opens a race window in which we
> > can get another Rx or Tx event and never ACK'ing it, stopping all
> > activity until the link is reset (ifconfig down/up). Fix this by always
> > ACK'ing all event sources, and loop in the handler until we have all
> > sources quiescent.
> > 
> > Signed-off-by: David Dillow <dave@thedillows.org>
> 
> I've applied this, thanks David.
> 
> 

I didn't notice a CC:stable.
I think this should also go to stable.
Does somebody take care?

(wiggle is able to apply the patch to stable without any problems, so it's easy
to do a patch)
David Miller May 26, 2009, 9:52 p.m. UTC | #5
From: Michael Buesch <mb@bu3sch.de>
Date: Tue, 26 May 2009 20:22:23 +0200

> I didn't notice a CC:stable.
> I think this should also go to stable.
> Does somebody take care?

I'll queue it up.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 26, 2009, 10:14 p.m. UTC | #6
From: David Miller <davem@davemloft.net>
Date: Tue, 26 May 2009 14:52:11 -0700 (PDT)

> From: Michael Buesch <mb@bu3sch.de>
> Date: Tue, 26 May 2009 20:22:23 +0200
> 
>> I didn't notice a CC:stable.
>> I think this should also go to stable.
>> Does somebody take care?
> 
> I'll queue it up.

Actually, this patch doesn't even remotely come close to applying
to 2.6.29.4

So if someone wants this in -stable, they need to respin this against
that tree and (if wanted) 2.6.27.x -stable as well.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Riepe May 26, 2009, 10:40 p.m. UTC | #7
David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Tue, 26 May 2009 14:52:11 -0700 (PDT)
> 
> 
>>From: Michael Buesch <mb@bu3sch.de>
>>Date: Tue, 26 May 2009 20:22:23 +0200
>>
>>
>>>I didn't notice a CC:stable.
>>>I think this should also go to stable.
>>>Does somebody take care?
>>
>>I'll queue it up.
> 
> 
> Actually, this patch doesn't even remotely come close to applying
> to 2.6.29.4

Michael Buesch already provided a patch against 2.6.29.4 that we both
tested. I doubt that it will apply to 2.6.27 cleanly, though.
David Miller May 26, 2009, 10:43 p.m. UTC | #8
From: Michael Riepe <michael.riepe@googlemail.com>
Date: Wed, 27 May 2009 00:40:41 +0200

> Michael Buesch already provided a patch against 2.6.29.4 that we both
> tested. I doubt that it will apply to 2.6.27 cleanly, though.

Please forward a copy to me.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 26, 2009, 11:10 p.m. UTC | #9
From: David Miller <davem@davemloft.net>
Date: Tue, 26 May 2009 15:43:40 -0700 (PDT)

> From: Michael Riepe <michael.riepe@googlemail.com>
> Date: Wed, 27 May 2009 00:40:41 +0200
> 
>> Michael Buesch already provided a patch against 2.6.29.4 that we both
>> tested. I doubt that it will apply to 2.6.27 cleanly, though.
> 
> Please forward a copy to me.

Nevermind, I was tired of waiting and fished it out of
patchwork.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Aug. 21, 2009, 8:57 p.m. UTC | #10
David Dillow <dave@thedillows.org> writes:

> The 8169 chip only generates MSI interrupts when all enabled event
> sources are quiescent and one or more sources transition to active. If
> not all of the active events are acknowledged, or a new event becomes
> active while the existing ones are cleared in the handler, we will not
> see a new interrupt.
>
> The current interrupt handler masks off the Rx and Tx events once the
> NAPI handler has been scheduled, which opens a race window in which we
> can get another Rx or Tx event and never ACK'ing it, stopping all
> activity until the link is reset (ifconfig down/up). Fix this by always
> ACK'ing all event sources, and loop in the handler until we have all
> sources quiescent.
>
> Signed-off-by: David Dillow <dave@thedillows.org>
> ---
> This fixes the lockups I've seen. Both MSI and level-triggered interrupt
> configurations survive over an hour of testing when it would lockup in
> under 90 seconds before. I am certain of the analysis of the root cause,
> but there may be better ways to fix it. There may also be a theoretical
> race window between the ending of a NAPI poll cycle and a link change
> interrupt coming in, but I'm not sure it would matter. 
>
> Some variant of this should also be applied to the currently running
> stable trees, as the problem is long-standing.

I have what at first glance looks like a problem caused by this
patch.  For the last month since upgrading one of my machines from
2.6.28 to 2.6.30 it has been becomming inaccessible from the
network and I have a few:

NETDEV WATCHDOG: eth0 (r8169): transmit timed out

in my logs and a lot soft lockups that always have rtl8169_interrupt
as the thing that is running.   I suspect your patch has introduced
a near infinite loop in the interrupt handler and is causing these
soft lockups.

Any ideas?

Eric

BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]
CPU 3:
Pid: 0, comm: swapper Tainted: G        W  2.6.30-170263.2006.Arora.fc11.x86_64 #1 G33M-S2
RIP: 0010:[<ffffffffa01deacd>]  [<ffffffffa01deacd>] rtl8169_interrupt+0x26f/0x2b7 [r8169]
RSP: 0018:ffff880028070cb0  EFLAGS: 00000206
RAX: 0000000000000050 RBX: ffff880028070d10 RCX: ffff88002807b9e0
RDX: ffffc2000065c03e RSI: ffff88012d79a000 RDI: 0000000000000246
RBP: ffffffff8100c9d3 R08: ffff88012fae0000 R09: ffff880028070ec0
R10: 077321422cb06619 R11: 000000003c5efb73 R12: ffff880028070c30
R13: ffff88012d79a000 R14: ffff88012d79a600 R15: 077321422cb06619
FS:  0000000000000000(0000) GS:ffff88002806d000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc10010c000 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>  [<ffffffff81093f0b>] ? handle_IRQ_event+0x6a/0x13f
 [<ffffffff810219fa>] ? apic_write+0x24/0x3a
 [<ffffffff8109607a>] ? handle_edge_irq+0xdb/0x138
 [<ffffffff81012fbd>] ? native_sched_clock+0x2d/0x54
 [<ffffffff8100e996>] ? handle_irq+0x95/0xb7
 [<ffffffff8100df42>] ? do_IRQ+0x6a/0xe9
 [<ffffffff8100c853>] ? ret_from_intr+0x0/0x11
 [<ffffffff8104ba16>] ? __do_softirq+0x5e/0x1b0
 [<ffffffff8100cfcc>] ? call_softirq+0x1c/0x28
 [<ffffffff8100e721>] ? do_softirq+0x51/0xae
 [<ffffffff8104b6d2>] ? irq_exit+0x52/0xa3
 [<ffffffff81020f11>] ? smp_apic_timer_interrupt+0x94/0xb8
 [<ffffffff8100c9d3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff81014096>] ? mwait_idle+0x9b/0xcc
 [<ffffffff81014038>] ? mwait_idle+0x3d/0xcc
 [<ffffffff8100ae08>] ? enter_idle+0x33/0x49
 [<ffffffff8100aece>] ? cpu_idle+0xb0/0xf3
 [<ffffffff8136f30c>] ? start_secondary+0x19c/0x1b7


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Riepe Aug. 21, 2009, 9:22 p.m. UTC | #11
Hi!

Eric W. Biederman wrote:

> I have what at first glance looks like a problem caused by this
> patch.  For the last month since upgrading one of my machines from
> 2.6.28 to 2.6.30 it has been becomming inaccessible from the
> network and I have a few:
> 
> NETDEV WATCHDOG: eth0 (r8169): transmit timed out
>
> in my logs and a lot soft lockups that always have rtl8169_interrupt
> as the thing that is running.   I suspect your patch has introduced
> a near infinite loop in the interrupt handler and is causing these
> soft lockups.

I've been running 2.6.30 for more than two months now and all is fine.
But this might be a different chip.
David Dillow Aug. 21, 2009, 10:59 p.m. UTC | #12
On Fri, 2009-08-21 at 13:57 -0700, Eric W. Biederman wrote:
> David Dillow <dave@thedillows.org> writes:
> I have what at first glance looks like a problem caused by this
> patch.  For the last month since upgrading one of my machines from
> 2.6.28 to 2.6.30 it has been becomming inaccessible from the
> network and I have a few:
> 
> NETDEV WATCHDOG: eth0 (r8169): transmit timed out
> 
> in my logs and a lot soft lockups that always have rtl8169_interrupt
> as the thing that is running.   I suspect your patch has introduced
> a near infinite loop in the interrupt handler and is causing these
> soft lockups.
> 
> Any ideas?

I would be surprised, but I suppose it is not out of the realm of
possibility. Can you send me a full dmesg, please?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 0b6e8c8..bdc8d5a 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -3554,54 +3554,64 @@  static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 	int handled = 0;
 	int status;
 
+	/* loop handling interrupts until we have no new ones or
+	 * we hit a invalid/hotplug case.
+	 */
 	status = RTL_R16(IntrStatus);
+	while (status && status != 0xffff) {
+		handled = 1;
 
-	/* hotplug/major error/no more work/shared irq */
-	if ((status == 0xffff) || !status)
-		goto out;
-
-	handled = 1;
+		/* Handle all of the error cases first. These will reset
+		 * the chip, so just exit the loop.
+		 */
+		if (unlikely(!netif_running(dev))) {
+			rtl8169_asic_down(ioaddr);
+			break;
+		}
 
-	if (unlikely(!netif_running(dev))) {
-		rtl8169_asic_down(ioaddr);
-		goto out;
-	}
+		/* Work around for rx fifo overflow */
+		if (unlikely(status & RxFIFOOver) &&
+	    	(tp->mac_version == RTL_GIGA_MAC_VER_11)) {
+			netif_stop_queue(dev);
+			rtl8169_tx_timeout(dev);
+			break;
+		}
 
-	status &= tp->intr_mask;
-	RTL_W16(IntrStatus,
-		(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		if (unlikely(status & SYSErr)) {
+			rtl8169_pcierr_interrupt(dev);
+			break;
+		}
 
-	if (!(status & tp->intr_event))
-		goto out;
+		if (status & LinkChg)
+			rtl8169_check_link_status(dev, tp, ioaddr);
 
-	/* Work around for rx fifo overflow */
-	if (unlikely(status & RxFIFOOver) &&
-	    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
-		netif_stop_queue(dev);
-		rtl8169_tx_timeout(dev);
-		goto out;
-	}
+		/* We need to see the lastest version of tp->intr_mask to
+		 * avoid ignoring an MSI interrupt and having to wait for
+		 * another event which may never come.
+		 */
+		smp_rmb();
+		if (status & tp->intr_mask & tp->napi_event) {
+			RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
+			tp->intr_mask = ~tp->napi_event;
+
+			if (likely(napi_schedule_prep(&tp->napi)))
+				__napi_schedule(&tp->napi);
+			else if (netif_msg_intr(tp)) {
+				printk(KERN_INFO "%s: interrupt %04x in poll\n",
+			       	dev->name, status);
+			}
+		}
 
-	if (unlikely(status & SYSErr)) {
-		rtl8169_pcierr_interrupt(dev);
-		goto out;
+		/* We only get a new MSI interrupt when all active irq
+		 * sources on the chip have been acknowledged. So, ack
+		 * everything we've seen and check if new sources have become
+		 * active to avoid blocking all interrupts from the chip.
+		 */
+		RTL_W16(IntrStatus,
+			(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		status = RTL_R16(IntrStatus);
 	}
 
-	if (status & LinkChg)
-		rtl8169_check_link_status(dev, tp, ioaddr);
-
-	if (status & tp->napi_event) {
-		RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
-		tp->intr_mask = ~tp->napi_event;
-
-		if (likely(napi_schedule_prep(&tp->napi)))
-			__napi_schedule(&tp->napi);
-		else if (netif_msg_intr(tp)) {
-			printk(KERN_INFO "%s: interrupt %04x in poll\n",
-			       dev->name, status);
-		}
-	}
-out:
 	return IRQ_RETVAL(handled);
 }
 
@@ -3617,13 +3627,15 @@  static int rtl8169_poll(struct napi_struct *napi, int budget)
 
 	if (work_done < budget) {
 		napi_complete(napi);
-		tp->intr_mask = 0xffff;
-		/*
-		 * 20040426: the barrier is not strictly required but the
-		 * behavior of the irq handler could be less predictable
-		 * without it. Btw, the lack of flush for the posted pci
-		 * write is safe - FR
+
+		/* We need for force the visibility of tp->intr_mask
+		 * for other CPUs, as we can loose an MSI interrupt
+		 * and potentially wait for a retransmit timeout if we don't.
+		 * The posted write to IntrMask is safe, as it will
+		 * eventually make it to the chip and we won't loose anything
+		 * until it does.
 		 */
+		tp->intr_mask = 0xffff;
 		smp_wmb();
 		RTL_W16(IntrMask, tp->intr_event);
 	}