
[1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms

Message ID 1442844109.7367.5.camel@infradead.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

David Woodhouse Sept. 21, 2015, 2:01 p.m. UTC
From: David Woodhouse <David.Woodhouse@intel.com>

The TX timeout handling has been observed to trigger RX IRQ storms. And
since cp_interrupt() just keeps saying that it handled the interrupt,
the machine then dies. Fix the return value from cp_interrupt(), and
the offending IRQ gets disabled and the machine survives.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
 drivers/net/ethernet/realtek/8139cp.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

Comments

Francois Romieu Sept. 21, 2015, 8:25 p.m. UTC | #1
David Woodhouse <dwmw2@infradead.org> :
> From: David Woodhouse <David.Woodhouse@intel.com>
> 
> The TX timeout handling has been observed to trigger RX IRQ storms. And
> since cp_interrupt() just keeps saying that it handled the interrupt,
> the machine then dies. Fix the return value from cp_interrupt(), and
> the offending IRQ gets disabled and the machine survives.

I am not fond of the way it dissociates the hardware status word from
the software "handled" variable.

What you are describing - RX IRQ storms - looks like a problem between
the irq and poll handlers. That's where I expect the problem to be solved.
Sprinkling "handled" operations does not make me terribly comfortable,
especially as I'd happily trade the old-style part irq, part napi
processing for a plain napi processing (I can get over it though :o) ).

Once the code is past the "if (!status || (status == 0xFFFF))" test -
or whatever test against some mask - I don't see why the driver would
refuse to take ownership of the irq.
David Woodhouse Sept. 21, 2015, 8:52 p.m. UTC | #2
On Mon, 2015-09-21 at 22:25 +0200, Francois Romieu wrote:
> David Woodhouse <dwmw2@infradead.org> :
> > From: David Woodhouse <David.Woodhouse@intel.com>
> > 
> > The TX timeout handling has been observed to trigger RX IRQ storms. And
> > since cp_interrupt() just keeps saying that it handled the interrupt,
> > the machine then dies. Fix the return value from cp_interrupt(), and
> > the offending IRQ gets disabled and the machine survives.
> 
> I am not fond of the way it dissociates the hardware status word from
> the software "handled" variable.

Oh, I like that part very much :)

The practice of returning a 'handled' status only when you've actually
*done* something you expect to mitigate the interrupt, is a useful way
of also protecting against both hardware misbehaviour and software
bugs.

> What you are describing - RX IRQ storms - looks like a problem between
> the irq and poll handlers. That's where I expect the problem to be solved.

I already fixed that, in the next patch in the series. But the failure
mode *should* have been 'IRQ disabled' and the device continuing to
work via polling. Not a complete death of the machine. That's the
difference this patch makes.

> Sprinkling "handled" operations does not make me terribly comfortable,
> especially as I'd happily trade the old-style part irq, part napi
> processing for a plain napi processing (I can get over it though :o) ).

The existing cp_rx_poll() function mostly runs without taking cp->lock.
But cp_tx() *does* need cp->lock for the tx_head/tx_tail manipulations.
I'm guessing that's why it's still being called directly from the hard
IRQ handler? I suppose we could probably find a way to move that out.

But I was mostly just trying to fix stuff that was actually broken...
:)
David Miller Sept. 22, 2015, 11:45 p.m. UTC | #3
From: David Woodhouse <dwmw2@infradead.org>
Date: Mon, 21 Sep 2015 15:01:49 +0100

> From: David Woodhouse <David.Woodhouse@intel.com>
> 
> The TX timeout handling has been observed to trigger RX IRQ storms. And
> since cp_interrupt() just keeps saying that it handled the interrupt,
> the machine then dies. Fix the return value from cp_interrupt(), and
> the offending IRQ gets disabled and the machine survives.
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

Like Francois, I don't like this.

First of all, there are only 3 bits not handled explicitly by
cp_interrupt().  And for those, if they are set and no other condition
was raised, you should report the event and the status bits set, and
then forcibly clear the interrupt.

And if we are getting Rx* interrupts with napi_schedule_prep()
returning false, that's a serious problem.  It can mean that the TX
timeout handler's resetting of the chip is either miscoded or is
racing with either NAPI polling or this interrupt handler.

And if that's the case your patch is making the chip's IRQ line get
disabled when this race triggers.

This change is even worse, in my opinion, if patch #2 indeed makes
the problem go away.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse Sept. 23, 2015, 8:14 a.m. UTC | #4
On Tue, 2015-09-22 at 16:45 -0700, David Miller wrote:
> And if we are getting Rx* interrupts with napi_schedule_prep()
> returning false, that's a serious problem.  It can mean that the TX
> timeout handler's resetting of the chip is either miscoded or is
> racing with either NAPI polling or this interrupt handler.

Such bugs happen. That's why we have IRQ-storm detection, to help
protect us.

> And if that's the case your patch is making the chip's IRQ line get
> disabled when this race triggers.
> 
> This change is even worse, in my opinion, if patch #2 indeed makes
> the problem go away.

Even worse than what?

When a bug like the one I've fixed in #2 happens, let's look at the
options:

  - With this patch, the IRQ storm is detected. We disable the IRQ 
    line and the driver continues to work, albeit with a higher
    latency.

  - Without this patch, the box just dies completely. First the 
    hardware watchdog was triggering an apparently spontaneous reset. 
    Then when I disabled the hardware watchdog, the machine locked up 
    and didn't even respond to serial sysrq. Eventually I got a
    little bit of sense out of it using the NMI watchdog.

Trust me, the disabled IRQ you get with my patch really *isn't* worse
than what happened without it :)

If cp_interrupt() had already worked the way I suggest, I would
literally have two days of my life back right now. I spent a *lot* of
time working out that it was dying in an interrupt storm. And trying to
find the underlying problem while having to physically visit it under
the stairs and reset it all the time.

In fact, I've been seeing undebuggable spontaneous resets of this
router for a while now. If I'd had an 'IRQ disabled' backtrace and a
smoking gun, I might have been able to fix it a long time ago.

I even discounted the IRQ-storm hypothesis, early on, on the basis that
"Linux handles those quite gracefully now, and disables the interrupt".
Except of course *that* relies on the IRQ handler returning an
appropriate value. So it doesn't work with 8139cp as things stand.

That's why I'm really quite keen on fixing cp_interrupt() to be more
defensive about when it claims to have handled the interrupt.

Surely that's *why* we have the IRQ-storm detection in the first place
— to help us survive precisely this situation. IRQ storms are either
caused by software bugs like the one fixed in patch #2, or hardware
misbehaviour. And if IRQ handlers blindly return 'handled' just because
the hardware actually *did* assert an interrupt, regardless of whether
they've made any *progress*, then we don't get that protection in a
*lot* of the situations where we'd want it.

But sure, for now I'll drop this from the series and I can try to
convince you separately. In fact I think it only affects the last patch
in the series which reduces the banging on TxPoll. And I'm going to
drop that one too for now anyway. I'll repost shortly. And I think I'll
add a new one at the end, enabling TX checksums, SG and TSO by default.
David Miller Sept. 23, 2015, 5:58 p.m. UTC | #5
From: David Woodhouse <dwmw2@infradead.org>
Date: Wed, 23 Sep 2015 09:14:09 +0100

> But sure, for now I'll drop this from the series and I can try to
> convince you separately.

Yes, let's discuss this independently of the nice bug fixing
that's happening in the rest of this series.

Thanks.
David Woodhouse Sept. 23, 2015, 7:45 p.m. UTC | #6
On Wed, 2015-09-23 at 10:58 -0700, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Wed, 23 Sep 2015 09:14:09 +0100
> 
> > But sure, for now I'll drop this from the series and I can try to
> > convince you separately.
> 
> Yes, let's discuss this independently of the nice bug fixing
> that's happening in the rest of this series.

Since you explicitly call them 'bug fixes'... I suppose the last patch
in the latest series (enabling the various offloads by default) isn't a
bug fix and should be held back for net-next.

I think I can live with pushing the rest for the net tree. Just to
recap, the series is:

      8139cp: Do not re-enable RX interrupts in cp_tx_timeout()
      8139cp: Fix tx_queued debug message to print correct slot numbers
      8139cp: Fix TSO/scatter-gather descriptor setup
      8139cp: Reduce duplicate csum/tso code in cp_start_xmit()
      8139cp: Fix DMA unmapping of transmitted buffers
      8139cp: Dump contents of descriptor ring on TX timeout
      8139cp: Enable offload features by default

The penultimate patch is also not strictly a bug fix, but it's only
adding additional diagnostics to a code path which was already broken 
until I fixed it anyway, so I can live with that in net although I'll
also be happy if you want to defer it.

The fourth patch which removes the three duplicate versions of the
csum/tso checks might also be deferrable but really, it's *also*
fixing the failure mode when we see an inappropriate CHECKSUM_PARTIAL
packet, and it's touching a code path which I've *also* fixed in this
patch set. So it might as well stay.

I can shuffle things around if you disagree, but assuming Francois
concurs I'd suggest merging patches 1-6 (or just pulling from
git://git.infradead.org/users/dwmw2/linux-8139cp.git fixes-only)
and then I'll resubmit the last patch some time later, after you
next pull net into net-next.

Otherwise, please let me know if/how you'd like me to reorganise it.
David Miller Sept. 23, 2015, 9:48 p.m. UTC | #7
From: David Woodhouse <dwmw2@infradead.org>
Date: Wed, 23 Sep 2015 20:45:30 +0100

> Since you explicitly call them 'bug fixes'... I suppose the last patch
> in the latest series (enabling the various offloads by default) isn't a
> bug fix and should be held back for net-next.
> 
> I think I can live with pushing the rest for the net tree. Just to
> recap, the series is:

I've applied patches #1-#6 and left #7 sitting around in patchwork
for when I next merge net into net-next, thanks.
David Woodhouse Sept. 23, 2015, 10 p.m. UTC | #8
On Wed, 2015-09-23 at 14:48 -0700, David Miller wrote:
> I've applied patches #1-#6 and left #7 sitting around in patchwork
> for when I next merge net into net-next, thanks.

Great, thanks. I've got one more for the net tree; just masking the
gso_size to the maximum (0xfff) that the hardware can handle, and never
telling the net stack that that's our maximum, doesn't seem like such a
good idea...

I think I really am going to stop looking now :)
Francois Romieu Sept. 23, 2015, 10:44 p.m. UTC | #9
David Woodhouse <dwmw2@infradead.org> :
[...]
> But sure, for now I'll drop this from the series and I can try to
> convince you separately.

You may as well change the IRQ storm avoidance logic so that it does not
require conformant driver code if you do not want to waste more time :o)
Francois Romieu Sept. 23, 2015, 10:44 p.m. UTC | #10
David Woodhouse <dwmw2@infradead.org> :
[...]
> I can shuffle things around if you disagree, but assuming Francois
> concurs I'd suggest merging patches 1-6 (or just pulling from
> git://git.infradead.org/users/dwmw2/linux-8139cp.git fixes-only)
> and then I'll resubmit the last patch some time later, after you
> next pull net into net-next.

No objection but I do not really trust whatever review I do this late.

Did you try #5 with the DMA allocation debug code enabled ?
David Woodhouse Sept. 23, 2015, 11:09 p.m. UTC | #11
On Thu, 2015-09-24 at 00:44 +0200, Francois Romieu wrote:
> David Woodhouse <dwmw2@infradead.org> :
> [...]
> > I can shuffle things around if you disagree, but assuming Francois
> > concurs I'd suggest merging patches 1-6 (or just pulling from
> > git://git.infradead.org/users/dwmw2/linux-8139cp.git fixes-only)
> > and then I'll resubmit the last patch some time later, after you
> > next pull net into net-next.
> 
> No objection but I do not really trust whatever review I do this late.
> 
> Did you try #5 with the DMA allocation debug code enabled ?

Only in QEMU (where there wasn't a problem in the first place). But
yes.
David Woodhouse Sept. 23, 2015, 11:18 p.m. UTC | #12
On Thu, 2015-09-24 at 00:44 +0200, Francois Romieu wrote:
> David Woodhouse <dwmw2@infradead.org> :
> [...]
> > But sure, for now I'll drop this from the series and I can try to
> > convince you separately.
> 
> You may as well change the IRQ storm avoidance logic so that it does not
> require conformant driver code if you do not want to waste more time :o)

That's not entirely trivial — at a certain level we *do* require that
drivers don't lie to the core IRQ logic.

I suppose even if a rogue driver is returning IRQ_HANDLED in an IRQ
storm, when it's *not* actually doing anything or making any progress,
perhaps there's a way we could detect that we're always going straight
back into the next interrupt handler as *soon* as we exit the previous
one. Perhaps there's merit in exploring that option.

I'm not sure that gives you a valid excuse for not wanting to fix the
driver though :)

I might start by "clarifying" the documentation on the IRQ handler's
return value...
David Woodhouse Oct. 28, 2015, 8:47 a.m. UTC | #13
On Wed, 2015-09-23 at 10:58 -0700, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Wed, 23 Sep 2015 09:14:09 +0100
> 
> > But sure, for now I'll drop this from the series and I can try to
> > convince you separately.
> 
> Yes, let's discuss this independently of the nice bug fixing
> that's happening in the rest of this series.
> 
> Thanks.

This is the patch we discussed earlier, and I came away from the
conversation believing that I had achieved my goal of trying to
convince you separately :)

The patch in patchwork at https://patchwork.ozlabs.org/patch/520318/
still applies cleanly; is that sufficient or would you like it
resubmitted (perhaps after -rc1)?

Thanks.

Patch

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index ba3dab7..f1054ad 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -371,7 +371,7 @@  struct cp_private {
 
 
 static void __cp_set_rx_mode (struct net_device *dev);
-static void cp_tx (struct cp_private *cp);
+static int cp_tx (struct cp_private *cp);
 static void cp_clean_rings (struct cp_private *cp);
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void cp_poll_controller(struct net_device *dev);
@@ -587,8 +587,6 @@  static irqreturn_t cp_interrupt (int irq, void *dev_instance)
 	if (!status || (status == 0xFFFF))
 		goto out_unlock;
 
-	handled = 1;
-
 	netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x\n",
 		  status, cpr8(Cmd), cpr16(CpCmd));
 
@@ -596,25 +594,30 @@  static irqreturn_t cp_interrupt (int irq, void *dev_instance)
 
 	/* close possible race's with dev_close */
 	if (unlikely(!netif_running(dev))) {
+		handled = 1;
 		cpw16(IntrMask, 0);
 		goto out_unlock;
 	}
 
-	if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr))
+	if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr)) {
 		if (napi_schedule_prep(&cp->napi)) {
+			handled = 1;
 			cpw16_f(IntrMask, cp_norx_intr_mask);
 			__napi_schedule(&cp->napi);
 		}
-
+	}
 	if (status & (TxOK | TxErr | TxEmpty | SWInt))
-		cp_tx(cp);
-	if (status & LinkChg)
-		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
+		handled |= cp_tx(cp);
 
+	if (status & LinkChg) {
+		handled = 1;
+		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
+	}
 
 	if (status & PciErr) {
 		u16 pci_status;
 
+		handled = 1;
 		pci_read_config_word(cp->pdev, PCI_STATUS, &pci_status);
 		pci_write_config_word(cp->pdev, PCI_STATUS, pci_status);
 		netdev_err(dev, "PCI bus error, status=%04x, PCI status=%04x\n",
@@ -645,11 +648,12 @@  static void cp_poll_controller(struct net_device *dev)
 }
 #endif
 
-static void cp_tx (struct cp_private *cp)
+static int cp_tx (struct cp_private *cp)
 {
 	unsigned tx_head = cp->tx_head;
 	unsigned tx_tail = cp->tx_tail;
 	unsigned bytes_compl = 0, pkts_compl = 0;
+	int handled = 0;
 
 	while (tx_tail != tx_head) {
 		struct cp_desc *txd = cp->tx_ring + tx_tail;
@@ -661,6 +665,7 @@  static void cp_tx (struct cp_private *cp)
 		if (status & DescOwn)
 			break;
 
+		handled = 1;
 		skb = cp->tx_skb[tx_tail];
 		BUG_ON(!skb);
 
@@ -704,6 +709,8 @@  static void cp_tx (struct cp_private *cp)
 	netdev_completed_queue(cp->dev, pkts_compl, bytes_compl);
 	if (TX_BUFFS_AVAIL(cp) > (MAX_SKB_FRAGS + 1))
 		netif_wake_queue(cp->dev);
+
+	return handled;
 }
 
 static inline u32 cp_tx_vlan_tag(struct sk_buff *skb)