r8169: IO_PAGE_FAULT & netdev watchdog

Message ID 20120601125949.GA11973@electric-eye.fr.zoreil.com
State RFC, archived
Delegated to: David Miller
Commit Message

Francois Romieu June 1, 2012, 12:59 p.m. UTC
Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> I'm consistently getting errors when using btlaunchmanycurses (multi-torrent
> downloader) after a few minutes. I usually first notice the network being down
> (no traffic) then find this in syslog (see at bottom).
> 
> Then, I "ifdown eth0;rmmod r8169;modprobe r8169" (which implicitly ifups),
> but network never comes back - at least no traffic can go through - until
> reboot.

Does the same thing happen if you reset and remove the PCI device through
sysfs, then ask the PCI bridge to scan it again?
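
The reset / remove / rescan sequence asked about here can be sketched roughly as follows. This is only an illustration under stated assumptions: the PCI address 05:00.0 is taken from the AMD-Vi line in the quoted syslog, the 0000 PCI domain prefix is assumed, and the helper names (recovery_steps, run) are made up for this sketch. It uses the global /sys/bus/pci/rescan file rather than the rescan file of one specific bridge, and it defaults to a dry run that only prints the equivalent shell commands.

```python
SYSFS_PCI = "/sys/bus/pci"


def recovery_steps(bdf, domain="0000"):
    """Return the ordered (path, value) sysfs writes: reset, remove, rescan."""
    dev = f"{SYSFS_PCI}/devices/{domain}:{bdf}"
    return [
        # The "reset" file only exists if the device supports a reset method.
        (f"{dev}/reset", "1"),
        # Detach the device from the PCI core.
        (f"{dev}/remove", "1"),
        # Rescan all PCI buses so the device is discovered again.
        (f"{SYSFS_PCI}/rescan", "1"),
    ]


def run(steps, dry_run=True):
    """Print the equivalent shell commands, or perform the writes (needs root)."""
    for path, value in steps:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


# Hypothetical usage: 05:00.0 is the device named in the quoted syslog.
run(recovery_steps("05:00.0"))
```

The driver should be unloaded (rmmod r8169) before the writes and reloaded afterwards.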

> www.kerneloops.org being down (apparently for quite some time...), I thought I
> should report here.
> 
> I'm quite sure this problem also occurred on 3.2, but I don't know the exact
> version I was using at that time. I have only had this motherboard for a few
> months, and the previous one didn't have an IOMMU - which in my understanding is
> what causes (well, detects actually) this error.

https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar if not
identical IOMMU messages (this #bz is messy but it may be worth adding
yourself to the Cc: list btw).
AFAIUI the IOMMU complains because the r8169 tried to perform a read access.
The target address matches the start of one of the descriptor rings. However it
happens long after the r8169 initialized the chipset and the driver would
work rather poorly if it could not access its descriptor rings. The r8169
bug is real but the IOMMU message seems rather useless if not bogus.

> May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
> May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
> May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
> May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
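
To compare the faulting address with the descriptor ring addresses, the fields of the AMD-Vi line can be pulled out with a small script. A minimal sketch, assuming the line format is exactly the one quoted above; the regular expression, the parse_event name and the 4 KiB page size are assumptions of this sketch, and it does not attempt to decode the flags field.

```python
import re

# Pattern derived from the single log line quoted above (an assumption:
# other kernels may format the event differently).
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>\w+) "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_event(line):
    """Return the event fields as a dict, or None if the line does not match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return {
        "event": m["event"],
        "device": m["device"],
        "domain": int(m["domain"], 16),
        "address": int(m["address"], 16),
        "flags": int(m["flags"], 16),
    }


line = (
    "May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged "
    "[IO_PAGE_FAULT device=05:00.0 domain=0x0019 "
    "address=0x0000000000003000 flags=0x0050]"
)
event = parse_event(line)
# Print the device, the faulting address, and whether it is page-aligned.
print(event["device"], hex(event["address"]), event["address"] % PAGE_SIZE == 0)
```

A page-aligned address is consistent with the fault hitting the start of a mapped buffer such as a descriptor ring, but it does not prove it on its own.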

You can apply the attached patch but it may not do much for your problem.

The patch below could make a difference though. Does it ?

Comments

Vincent Pelletier June 1, 2012, 7:20 p.m. UTC | #1
Thanks for the quick reply.

On Friday 01 June 2012 14:59:49, you wrote:
> Does the same thing happen if you reset and remove the PCI device through
> sysfs, then ask the PCI bridge to scan it again?

I didn't try it before - but I should have, I know this.
rmmod; reset; modprobe -> doesn't work
rmmod; reset; remove; rescan -> doesn't work either (?!)

> https://bugzilla.kernel.org/show_bug.cgi?id=42899 contains similar if not
> identical IOMMU messages (this #bz is messy but it may be worth adding
> yourself to the Cc: list btw).

I found it shortly after my post (while checking the archives, in case someone
replied without CC :) ). I posted on that bug as I couldn't find a way to just
add myself to the bug's CC list.

> The r8169 bug is real but the IOMMU message seems rather useless if not
> bogus.

Just being curious, feel free to skip over my questions:
If it's bogus, could it be a misinterpretation of its state when the error
occurs (I don't know how the CPU knows a fault happened, I guess some IRQ + some
register containing the error status, address of the error, some process/context
identifier)? Or a hardware bug? Or an MMU misconfiguration for some reason?
If it's not bogus, would it be the sign of a firmware bug (accessing some
unpredictable memory under certain conditions)?

> You can apply the attached patch but it may not do much for your problem.
> The patch below could make a difference though. Does it ?

I'll try either and both. Given the poor result I got from 
reset/remove/rescan, I guess I should reboot between attempts, right ?
Should I prevent original module auto-loading at boot ? Maybe more than just 
r8169 ?

Regards,
Francois Romieu June 1, 2012, 8:13 p.m. UTC | #2
Vincent Pelletier <plr.vincent@gmail.com> :
[...]
> If it's bogus, could it be a misinterpretation of its state when the error
> occurs (I don't know how the CPU knows a fault happened, I guess some IRQ + some
> register containing the error status, address of the error, some process/context
> identifier)?

See "AMD I/O Virtualization Technology (IOMMU) Specification".

> Or a hardware bug? Or an MMU misconfiguration for some reason?

I don't have time to poke deeply enough into the iommu code.

[...]
> If it's not bogus, would it be the sign of a firmware bug (accessing some
> unpredictable memory under certain conditions)?

That's what I thought at first. Or I should have added something to the r8169
driver. However it's quite reproducible, the failing address is one of the
mapped Rx or Tx descriptor ring addresses - don't remember which one, see
the PR at korg - and it does not fit the timing pattern.

[...]
> I'll try either and both. Given the poor result I got from 
> reset/remove/rescan, I guess I should reboot between attempts, right ?

Yes. The inlined patch could help avoid the problem but it is not
supposed to help a failed network adapter recover.

> Should I prevent original module auto-loading at boot ? Maybe more than just 
> r8169 ?

It should not be required. YMMV.
Patch

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index bbacb37..da46588 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3766,6 +3766,7 @@  static void rtl_init_rxcfg(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_22:
 	case RTL_GIGA_MAC_VER_23:
 	case RTL_GIGA_MAC_VER_24:
+	case RTL_GIGA_MAC_VER_34:
 		RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
 		break;
 	default: