Message ID | 20150408215051.GA25326@electric-eye.fr.zoreil.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On 8 Apr 2015, Francois Romieu stated:

> Nix <nix@esperi.org.uk> :
> [...]
>> I am sorry to report that I just had a watchdog-triggered autoreboot
>> during testing of this patch series :( with no log messages of any kind.
>> Looks like the underlying bug is still there, or another bug with the
>> same symptoms (i.e. some way to crash inside the rx handler). I wish I
>> could get some debugging output when this happens!
>
> You may add the patch below on top of the current stack. I don't expect
> much difference. Increasing RX_RING_SIZE could be a different story.

Will try it tomorrow.

> Did you keep netconsole disabled and did you increase via-rhine verbosity
> level ?

netconsole is active. The verbosity level is default because I didn't
notice the driver had one to set! I'll push it up and see what happens.
I don't expect too much: this is, after all, a uniprocessor, and if
you're stuck in an interrupt handler there's not much it can do...

> No shared IRQ ?

None relevant that I can see:

           CPU0
  0:     692019    XT-PIC  timer
  2:          0    XT-PIC  cascade
  4:        314    XT-PIC  serial
  5:     114062    XT-PIC  adsl
  7:     329914    XT-PIC  cs5535-clockevt
  9:     137080    XT-PIC  bdsl
 10:       2006    XT-PIC  voip
 11:    1546540    XT-PIC  gordianet
 12:     724187    XT-PIC  wireless
 14:      12313    XT-PIC  pata_cs5536
 15:     219702    XT-PIC  ehci_hcd:usb1, ohci_hcd:usb2

The only shared interrupt is usb2, and since that's an
internal-to-the-case thing that's not plugged in to anything, in effect
there are none shared at all. (The unplugged interfaces *are* shared --
the last four of the eight interfaces on the box run 10, 7, 10, 7 -- but
since only one of those has anything plugged into it, the net effect is
no sharing.)

My load tests are being performed between the gordianet and wireless
interfaces, IRQs 11 and 12, which are entirely unshared.

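The shared-IRQ check on the /proc/interrupts listing above amounts to asking whether any numbered row names more than one handler. A minimal user-space sketch of that check follows. It assumes the comma-separated handler list seen in the listing, and `irq_handler_count` is a hypothetical helper name, not part of any kernel or tool API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the handlers named on one /proc/interrupts row.
 * Assumption: handlers are separated by commas, as in the listing above.
 * Returns 0 for rows that do not start with an IRQ number (the CPU
 * header, or named rows such as NMI/ERR on other machines).
 */
static int irq_handler_count(const char *line)
{
	const char *p = line;
	int count = 1;

	assert(line != NULL);

	/* Skip the leading padding in front of the IRQ number. */
	while (*p == ' ' || *p == '\t')
		p++;
	if (!isdigit((unsigned char)*p))
		return 0;

	/* Everything after the first colon is counts, chip name, handlers. */
	p = strchr(p, ':');
	if (p == NULL)
		return 0;

	/* Each comma after that point introduces one more handler name. */
	for (p++; *p != '\0'; p++)
		if (*p == ',')
			count++;

	return count;
}
```

Run over the rows quoted above, only the IRQ 15 row (ehci_hcd:usb1, ohci_hcd:usb2) would report more than one handler; the rows for IRQs 11 and 12 report exactly one each.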
>> However, to give some good news, CPU usage is much lower than before the
>> patch: si ~10%, rather than ~80% with spikes of full CPU usage:
>> ksoftirqd's CPU usage is steady at about 3% rather than 40--60% with
>> spikes to 100%, and some of that will be USB interrupts from the
>> continuous USB traffic from my entropy key.
>
> Huuuuh ? Which entropy key ?

It's a USB device that gives you about 4KiB/s over USB:
<http://www.entropykey.co.uk/>. No longer manufactured, alas :( but as
it provides entropy continuously it is a continuous source of
interrupts. (But from the USB port, obviously.)

On 8 Apr 2015, Francois Romieu spake thusly:

> Nix <nix@esperi.org.uk> :
> [...]
>> I am sorry to report that I just had a watchdog-triggered autoreboot
>> during testing of this patch series :( with no log messages of any kind.
>> Looks like the underlying bug is still there, or another bug with the
>> same symptoms (i.e. some way to crash inside the rx handler). I wish I
>> could get some debugging output when this happens!
>
> You may add the patch below on top of the current stack. I don't expect
> much difference. Increasing RX_RING_SIZE could be a different story.

It still crashes with that patch. The lockups are definitely getting
rarer: I have to load the thing for several hours to see a single crash
now (though sometimes I am still (un)lucky and it dies almost at once).

> Did you keep netconsole disabled and did you increase via-rhine verbosity
> level ?

Oops! Netconsole *was* on: I've been using it for so long I'd forgotten
that you pretty much have to whap it with a hammer and turn it off in
.config to turn it off completely, not just stop mentioning it on the
kernel cmdline. It's off now.

The verbosity level is now 16, which should be enough to cover, well,
everything, and indeed I see extra log at initialization time:

[ 0.911369] via_rhine: v1.10-LK1.5.1 2010-10-09 Written by Donald Becker
[ 0.921653] via-rhine 0000:00:06.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 0.936067] via-rhine 0000:00:06.0 eth0: VIA Rhine III (Management Adapter) at 0xe0806000, 00:00:24:cb:c6:a0, IRQ 11
[ 0.949911] via-rhine 0000:00:06.0 eth0: MII PHY found at address 1, status 0x786d advertising 05e1 Link cde1
[ 0.969400] via-rhine 0000:00:07.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 0.983852] via-rhine 0000:00:07.0 eth1: VIA Rhine III (Management Adapter) at 0xe0808100, 00:00:24:cb:c6:a1, IRQ 5
[ 0.997168] via-rhine 0000:00:07.0 eth1: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[ 1.006638] via-rhine 0000:00:08.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.021091] via-rhine 0000:00:08.0 eth2: VIA Rhine III (Management Adapter) at 0xe080a200, 00:00:24:cb:c6:a2, IRQ 9
[ 1.034402] via-rhine 0000:00:08.0 eth2: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[ 1.043872] via-rhine 0000:00:09.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.058294] via-rhine 0000:00:09.0 eth3: VIA Rhine III (Management Adapter) at 0xe080c300, 00:00:24:cb:c6:a3, IRQ 12
[ 1.062104] via-rhine 0000:00:09.0 eth3: MII PHY found at address 1, status 0x786d advertising 05e1 Link 4de1
[ 1.071608] via-rhine 0000:01:00.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.086073] via-rhine 0000:01:00.0 eth4: VIA Rhine III (Management Adapter) at 0xe080e000, 00:00:24:d1:2a:3c, IRQ 10
[ 1.099911] via-rhine 0000:01:00.0 eth4: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[ 1.119402] via-rhine 0000:01:01.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.133915] via-rhine 0000:01:01.0 eth5: VIA Rhine III (Management Adapter) at 0xe0810100, 00:00:24:d1:2a:3d, IRQ 7
[ 1.147234] via-rhine 0000:01:01.0 eth5: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[ 1.156747] via-rhine 0000:01:02.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.171262] via-rhine 0000:01:02.0 eth6: VIA Rhine III (Management Adapter) at 0xe0812200, 00:00:24:d1:2a:3e, IRQ 10
[ 1.185095] via-rhine 0000:01:02.0 eth6: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[ 1.194599] via-rhine 0000:01:03.0 (unnamed net_device) (uninitialized): Reset succeeded
[ 1.209094] via-rhine 0000:01:03.0 eth7: VIA Rhine III (Management Adapter) at 0xe0814300, 00:00:24:d1:2a:3f, IRQ 7
[ 1.212436] via-rhine 0000:01:03.0 eth7: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[...]
[ 17.264820] via-rhine 0000:00:06.0 gordianet: Reset succeeded
[ 17.299978] via-rhine 0000:00:06.0 gordianet: link up, 100Mbps, full-duplex, lpa 0xCDE1
[ 17.347962] via-rhine 0000:00:06.0 gordianet: force_media 0, carrier 1
[ 18.221924] via-rhine 0000:00:09.0 wireless: Reset succeeded
[ 18.256936] via-rhine 0000:00:09.0 wireless: link up, 100Mbps, full-duplex, lpa 0x4DE1
[ 18.304399] via-rhine 0000:00:09.0 wireless: force_media 0, carrier 1
[ 18.360046] via-rhine 0000:01:00.0 voip: Reset succeeded
[ 18.397168] via-rhine 0000:01:00.0 voip: link up, 100Mbps, full-duplex, lpa 0x41E1
[ 18.442578] via-rhine 0000:01:00.0 voip: force_media 0, carrier 1
[ 18.510970] via-rhine 0000:00:07.0 adsl: Reset succeeded
[ 18.546141] via-rhine 0000:00:07.0 adsl: link up, 100Mbps, full-duplex, lpa 0x41E1
[ 18.591511] via-rhine 0000:00:07.0 adsl: force_media 0, carrier 1
[ 18.639051] via-rhine 0000:00:08.0 bdsl: Reset succeeded
[ 18.671983] via-rhine 0000:00:08.0 bdsl: link up, 100Mbps, full-duplex, lpa 0x41E1
[ 18.717363] via-rhine 0000:00:08.0 bdsl: force_media 0, carrier 1

(again, the first two interfaces, gordianet and wireless, are the ones
being stressed by this test.)

Of course now I've done this it's not crashing! Maybe it's
netconsole-related on top of everything else, or I'm just being
unlucky... I'll keep trying.

Nix <nix@esperi.org.uk> :
[...]
> The verbosity level is now 16, which should be enough to cover, well,
> everything, and indeed I see extra log at initialization time:

16 ? _warn, _err and _info ought to be good enough, but you should also
check the message level with ethtool. You don't want debug messages
from the rx poll handler.

[...]
> Of course now I've done this it's not crashing! Maybe it's netconsole-
> related on top of everything else, or I'm just being unlucky... I'll
> keep trying.

Please no netconsole yet.

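To illustrate why a level of 16 is more than is needed: the kernel's `netif_msg_init()` helper turns a numeric debug level N into a bitmap with the low N bits set, one bit per message class. The sketch below mirrors that rule in user space. That via-rhine passes its verbosity setting through this helper is an assumption here, and `msg_enable_from_level` and `msg_class_enabled` are hypothetical names. The class values are copied from the `NETIF_MSG_*` constants in include/linux/netdevice.h.

```c
#include <assert.h>
#include <stdint.h>

/* A few message classes, with the values of the kernel's NETIF_MSG_* bits. */
enum {
	MSG_DRV       = 0x0001,
	MSG_PROBE     = 0x0002,
	MSG_LINK      = 0x0004,
	MSG_INTR      = 0x0200,
	MSG_RX_STATUS = 0x0800,
	MSG_HW        = 0x2000,
};

/*
 * User-space mirror of the level-to-bitmap rule:
 *   out-of-range level -> driver default
 *   level 0            -> no messages
 *   level N            -> low N bits set
 */
static uint32_t msg_enable_from_level(int level, uint32_t default_bits)
{
	if (level < 0 || level >= 32)
		return default_bits;
	if (level == 0)
		return 0;
	return (UINT32_C(1) << level) - 1;
}

/* True if the given single class bit is enabled in the mask. */
static int msg_class_enabled(uint32_t mask, uint32_t class_bit)
{
	assert(class_bit != 0);
	return (mask & class_bit) != 0;
}
```

Under this rule a level of 16 yields 0xffff, which switches on every class including the per-packet rx status one, while a level of 3 yields 0x7 (driver, probe and link messages only).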
On 9 Apr 2015, Francois Romieu told this:

> Nix <nix@esperi.org.uk> :
> [...]
>> Of course now I've done this it's not crashing! Maybe it's netconsole-
>> related on top of everything else, or I'm just being unlucky... I'll
>> keep trying.
>
> Please no netconsole yet.

Still trying. It hasn't crashed since I turned netconsole off despite a
couple of days of loading, so it's quite possible that this crash is
netconsole-related.

(Without the receive buffers rework, the crash is much faster and
happens even without netconsole, so the rework is clearly necessary and
definitely improves things dramatically.)

diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c
index 3e6fdbb..d441d8c 100644
--- a/drivers/net/ethernet/via/via-rhine.c
+++ b/drivers/net/ethernet/via/via-rhine.c
@@ -1849,7 +1849,7 @@ static netdev_tx_t rhine_start_tx(struct sk_buff *skb,
 	netdev_sent_queue(dev, skb->len);
 
 	/* lock eth irq */
-	wmb();
+	dma_wmb();
 	rp->tx_ring[entry].tx_status |= cpu_to_le32(DescOwn);
 	wmb();
 
@@ -1996,6 +1996,7 @@ static inline u16 rhine_get_vlan_tci(struct sk_buff *skb, int data_size)
 static inline void rhine_rx_vlan_tag(struct sk_buff *skb, struct rx_desc *desc,
 				     int data_size)
 {
+	dma_rmb();
 	if (unlikely(desc->desc_length & cpu_to_le32(DescTag))) {
 		u16 vlan_tci;
 
@@ -2161,6 +2162,7 @@ static void rhine_slow_event_task(struct work_struct *work)
 		container_of(work, struct rhine_private, slow_event_task);
 	struct net_device *dev = rp->dev;
 	u32 intr_status;
+	u16 enable_mask;
 
 	mutex_lock(&rp->task_lock);
 
@@ -2176,7 +2178,12 @@ static void rhine_slow_event_task(struct work_struct *work)
 	if (intr_status & IntrPCIErr)
 		netif_warn(rp, hw, dev, "PCI error\n");
 
-	iowrite16(RHINE_EVENT & 0xffff, rp->base + IntrEnable);
+	napi_disable(&rp->napi);
+
+	enable_mask = ioread16(rp->base + IntrEnable) | RHINE_EVENT_SLOW;
+	iowrite16(enable_mask, rp->base + IntrEnable);
+
+	napi_enable(&rp->napi);
 
 out_unlock:
 	mutex_unlock(&rp->task_lock);
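
The first two hunks of the patch above concern the order in which a descriptor's fields and its ownership bit become visible to the other side. As a rough user-space analogy only (not driver code), the same publish/consume pattern can be sketched with C11 fences: a release fence stands in for the write barrier before the ownership bit is set, and an acquire fence for the read barrier before the remaining fields are read. `struct demo_desc`, `DESC_OWN` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the ownership bit, for illustration only. */
#define DESC_OWN UINT32_C(0x80000000)

/* Hypothetical descriptor: payload fields plus a status word with an owner bit. */
struct demo_desc {
	uint32_t addr;
	uint32_t length;
	_Atomic uint32_t status;
};

/* Producer: fill in the fields first, then hand the descriptor over. */
static void demo_desc_publish(struct demo_desc *d, uint32_t addr, uint32_t length)
{
	assert(d != NULL);
	d->addr = addr;
	d->length = length;
	/* Order the field writes before the ownership write. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_or_explicit(&d->status, DESC_OWN, memory_order_relaxed);
}

/* Consumer: check ownership first, and only then read the fields. */
static int demo_desc_consume(struct demo_desc *d, uint32_t *addr, uint32_t *length)
{
	assert(d != NULL);
	if (!(atomic_load_explicit(&d->status, memory_order_relaxed) & DESC_OWN))
		return 0;
	/* Order the ownership read before the field reads. */
	atomic_thread_fence(memory_order_acquire);
	*addr = d->addr;
	*length = d->length;
	return 1;
}
```

In the driver the two sides are the CPU and the NIC rather than two threads, so this only shows the ordering idea, not the actual DMA semantics.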