[RFT,net-next,#2,0/6] via-rhine receive buffers rework

Message ID	20150408215051.GA25326@electric-eye.fr.zoreil.com
State	RFC, archived
Delegated to:	David Miller

Headers

Date: Wed, 8 Apr 2015 23:50:51 +0200
From: Francois Romieu <romieu@fr.zoreil.com>
To: Nix <nix@esperi.org.uk>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	rl@hellgate.ch, Bjarke Istrup Pedersen <gurligebis@gentoo.org>
Subject: Re: [PATCH RFT net-next #2 0/6] via-rhine receive buffers rework
Message-ID: <20150408215051.GA25326@electric-eye.fr.zoreil.com>
References: <cover.1428445141.git.romieu@fr.zoreil.com>
	<87vbh6364u.fsf@spindle.srvr.nix>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87vbh6364u.fsf@spindle.srvr.nix>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Commit Message

Francois Romieu April 8, 2015, 9:50 p.m. UTC

Nix <nix@esperi.org.uk> :
[...]
> I am sorry to report that I just had a watchdog-triggered autoreboot
> during testing of this patch series :( with no log messages of any kind.
> Looks like the underlying bug is still there, or another bug with the
> same symptoms (i.e. some way to crash inside the rx handler). I wish I
> could get some debugging output when this happens!

You may add the patch below on top of the current stack. I don't expect
much difference. Increasing RX_RING_SIZE could be a different story.

Did you keep netconsole disabled and did you increase via-rhine verbosity
level ?

No shared IRQ ?

> However, to give some good news, CPU usage is much lower than before the
> patch: si ~10%, rather than ~80% with spikes of full CPU usage:
> ksoftirqd's CPU usage is steady at about 3% rather than 40--60% with
> spikes to 100%, and some of that will be USB interrupts from the
> continuous USB traffic from my entropy key.

Huuuuh ? Which entropy key ?


Comments

Nix April 8, 2015, 10:43 p.m. UTC | #1

On 8 Apr 2015, Francois Romieu stated:

> Nix <nix@esperi.org.uk> :
> [...]
>> I am sorry to report that I just had a watchdog-triggered autoreboot
>> during testing of this patch series :( with no log messages of any kind.
>> Looks like the underlying bug is still there, or another bug with the
>> same symptoms (i.e. some way to crash inside the rx handler). I wish I
>> could get some debugging output when this happens!
>
> You may add the patch below on top of the current stack. I don't expect
> much difference. Increasing RX_RING_SIZE could be a different story.

Will try it tomorrow.

> Did you keep netconsole disabled and did you increase via-rhine verbosity
> level ?

netconsole is active. The verbosity level is default because I didn't
notice the driver had one to set! I'll push it up and see what happens.
I don't expect too much: this is, after all, a uniprocessor, and if
you're stuck in an interrupt handler there's not much it can do...

> No shared IRQ ?

None relevant that I can see:

           CPU0
  0:     692019    XT-PIC  timer
  2:          0    XT-PIC  cascade
  4:        314    XT-PIC  serial
  5:     114062    XT-PIC  adsl
  7:     329914    XT-PIC  cs5535-clockevt
  9:     137080    XT-PIC  bdsl
 10:       2006    XT-PIC  voip
 11:    1546540    XT-PIC  gordianet
 12:     724187    XT-PIC  wireless
 14:      12313    XT-PIC  pata_cs5536
 15:     219702    XT-PIC  ehci_hcd:usb1, ohci_hcd:usb2

The only shared interrupt is usb2, and since that's an
internal-to-the-case thing that's not plugged in to anything, in effect
there are none shared at all. (The unplugged interfaces *are* shared --
the last four of the eight interfaces on the box run 10, 7, 10, 7 -- but
since only one of those has anything plugged into it, the net effect is
no sharing.)

My load tests are being performed between the gordianet and wireless
interfaces, IRQs 11 and 12, which are entirely unshared.
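As a rough illustration of the check being done by eye above: in /proc/interrupts, handlers that share an IRQ line appear comma-separated after the controller name (as on IRQ 15). The helper below is a hypothetical userspace sketch, not part of any tool mentioned in this thread, and it assumes it is handed a per-IRQ line rather than the CPU header line.

```c
#include <string.h>

/*
 * Count the handlers registered on one per-IRQ line of /proc/interrupts.
 * Handlers sharing an IRQ are listed comma-separated, e.g.
 * "ehci_hcd:usb1, ohci_hcd:usb2". Hypothetical helper: it assumes the
 * line names at least one handler and that names contain no commas.
 */
static int irq_handler_count(const char *line)
{
	int count = 1;
	const char *p;

	if (line[0] == '\0')
		return 0;

	/* Every comma separates two handler names. */
	for (p = strchr(line, ','); p; p = strchr(p + 1, ','))
		count++;

	return count;
}
```

A result greater than 1 means the line is shared; for the table above only IRQ 15 would qualify.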

>> However, to give some good news, CPU usage is much lower than before the
>> patch: si ~10%, rather than ~80% with spikes of full CPU usage:
>> ksoftirqd's CPU usage is steady at about 3% rather than 40--60% with
>> spikes to 100%, and some of that will be USB interrupts from the
>> continuous USB traffic from my entropy key.
>
> Huuuuh ? Which entropy key ?

It's a USB device that gives you about 4KiB/s over USB:
<http://www.entropykey.co.uk/>. No longer manufactured, alas :(
but as it provides entropy continuously it is a continuous source of
interrupts. (But from the USB port, obviously.)

Nix April 9, 2015, 6:08 p.m. UTC | #2

On 8 Apr 2015, Francois Romieu spake thusly:

> Nix <nix@esperi.org.uk> :
> [...]
>> I am sorry to report that I just had a watchdog-triggered autoreboot
>> during testing of this patch series :( with no log messages of any kind.
>> Looks like the underlying bug is still there, or another bug with the
>> same symptoms (i.e. some way to crash inside the rx handler). I wish I
>> could get some debugging output when this happens!
>
> You may add the patch below on top of the current stack. I don't expect
> much difference. Increasing RX_RING_SIZE could be a different story.

It still crashes with that patch. The lockups are definitely getting
rarer: I have to load the thing for several hours to see a single crash
now (though sometimes I am still (un)lucky and it dies almost at once).

> Did you keep netconsole disabled and did you increase via-rhine verbosity
> level ?

Oops! Netconsole *was* on: I've been using it for so long I'd forgotten
that you pretty much have to whap it with a hammer and turn it off in
.config to turn it off completely, not just stop mentioning it on the
kernel cmdline. It's off now.

The verbosity level is now 16, which should be enough to cover, well,
everything, and indeed I see extra log at initialization time:

[    0.911369] via_rhine: v1.10-LK1.5.1 2010-10-09 Written by Donald Becker
[    0.921653] via-rhine 0000:00:06.0 (unnamed net_device) (uninitialized): Reset succeeded
[    0.936067] via-rhine 0000:00:06.0 eth0: VIA Rhine III (Management Adapter) at 0xe0806000, 00:00:24:cb:c6:a0, IRQ 11
[    0.949911] via-rhine 0000:00:06.0 eth0: MII PHY found at address 1, status 0x786d advertising 05e1 Link cde1
[    0.969400] via-rhine 0000:00:07.0 (unnamed net_device) (uninitialized): Reset succeeded
[    0.983852] via-rhine 0000:00:07.0 eth1: VIA Rhine III (Management Adapter) at 0xe0808100, 00:00:24:cb:c6:a1, IRQ 5
[    0.997168] via-rhine 0000:00:07.0 eth1: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[    1.006638] via-rhine 0000:00:08.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.021091] via-rhine 0000:00:08.0 eth2: VIA Rhine III (Management Adapter) at 0xe080a200, 00:00:24:cb:c6:a2, IRQ 9
[    1.034402] via-rhine 0000:00:08.0 eth2: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[    1.043872] via-rhine 0000:00:09.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.058294] via-rhine 0000:00:09.0 eth3: VIA Rhine III (Management Adapter) at 0xe080c300, 00:00:24:cb:c6:a3, IRQ 12
[    1.062104] via-rhine 0000:00:09.0 eth3: MII PHY found at address 1, status 0x786d advertising 05e1 Link 4de1
[    1.071608] via-rhine 0000:01:00.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.086073] via-rhine 0000:01:00.0 eth4: VIA Rhine III (Management Adapter) at 0xe080e000, 00:00:24:d1:2a:3c, IRQ 10
[    1.099911] via-rhine 0000:01:00.0 eth4: MII PHY found at address 1, status 0x786d advertising 05e1 Link 41e1
[    1.119402] via-rhine 0000:01:01.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.133915] via-rhine 0000:01:01.0 eth5: VIA Rhine III (Management Adapter) at 0xe0810100, 00:00:24:d1:2a:3d, IRQ 7
[    1.147234] via-rhine 0000:01:01.0 eth5: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[    1.156747] via-rhine 0000:01:02.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.171262] via-rhine 0000:01:02.0 eth6: VIA Rhine III (Management Adapter) at 0xe0812200, 00:00:24:d1:2a:3e, IRQ 10
[    1.185095] via-rhine 0000:01:02.0 eth6: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[    1.194599] via-rhine 0000:01:03.0 (unnamed net_device) (uninitialized): Reset succeeded
[    1.209094] via-rhine 0000:01:03.0 eth7: VIA Rhine III (Management Adapter) at 0xe0814300, 00:00:24:d1:2a:3f, IRQ 7
[    1.212436] via-rhine 0000:01:03.0 eth7: MII PHY found at address 1, status 0x7849 advertising 05e1 Link 0000
[...]
[   17.264820] via-rhine 0000:00:06.0 gordianet: Reset succeeded
[   17.299978] via-rhine 0000:00:06.0 gordianet: link up, 100Mbps, full-duplex, lpa 0xCDE1
[   17.347962] via-rhine 0000:00:06.0 gordianet: force_media 0, carrier 1
[   18.221924] via-rhine 0000:00:09.0 wireless: Reset succeeded
[   18.256936] via-rhine 0000:00:09.0 wireless: link up, 100Mbps, full-duplex, lpa 0x4DE1
[   18.304399] via-rhine 0000:00:09.0 wireless: force_media 0, carrier 1
[   18.360046] via-rhine 0000:01:00.0 voip: Reset succeeded
[   18.397168] via-rhine 0000:01:00.0 voip: link up, 100Mbps, full-duplex, lpa 0x41E1
[   18.442578] via-rhine 0000:01:00.0 voip: force_media 0, carrier 1
[   18.510970] via-rhine 0000:00:07.0 adsl: Reset succeeded
[   18.546141] via-rhine 0000:00:07.0 adsl: link up, 100Mbps, full-duplex, lpa 0x41E1
[   18.591511] via-rhine 0000:00:07.0 adsl: force_media 0, carrier 1
[   18.639051] via-rhine 0000:00:08.0 bdsl: Reset succeeded
[   18.671983] via-rhine 0000:00:08.0 bdsl: link up, 100Mbps, full-duplex, lpa 0x41E1
[   18.717363] via-rhine 0000:00:08.0 bdsl: force_media 0, carrier 1

(again, the first two interfaces, gordianet and wireless, are the ones
being stressed by this test.)

Of course now I've done this it's not crashing! Maybe it's netconsole-
related on top of everything else, or I'm just being unlucky... I'll
keep trying.

Francois Romieu April 9, 2015, 10:41 p.m. UTC | #3

Nix <nix@esperi.org.uk> :
[...]
> The verbosity level is now 16, which should be enough to cover, well,
> everything, and indeed I see extra log at initialization time:

16 ? _warn, _err and _info ought to be good enough, but you should also
check the message level with ethtool. You don't want debug messages
from the rx poll handler.
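To see why 16 may be more than wanted, here is a userspace sketch of how the kernel's netif_msg_init() helper turns a small integer into a message-level bitmask. It assumes via-rhine passes its debug parameter through that helper, as many Ethernet drivers do; msg_init and the MSG_* names are local stand-ins, with bit values taken from the NETIF_MSG_* definitions in include/linux/netdevice.h.

```c
#include <stdint.h>

/* A few message classes, same bit values as the kernel's NETIF_MSG_*. */
enum {
	MSG_DRV       = 0x0001,
	MSG_PROBE     = 0x0002,
	MSG_LINK      = 0x0004,
	MSG_RX_ERR    = 0x0040,
	MSG_INTR      = 0x0200,
	MSG_RX_STATUS = 0x0800,	/* per-packet rx status messages */
};

/* Userspace copy of the netif_msg_init() logic, for illustration only. */
static uint32_t msg_init(int debug_value, uint32_t default_bits)
{
	/* Out-of-range values fall back to the driver default. */
	if (debug_value < 0 || debug_value >= 32)
		return default_bits;
	if (debug_value == 0)	/* no output */
		return 0;
	/* Otherwise enable the low N message classes. */
	return (1U << debug_value) - 1;
}
```

Under that assumption, a value of 16 yields 0xffff, which includes the rx_status and intr classes that fire in the hot path, while a value of 3 enables only drv, probe and link. The mask actually in effect can be checked and changed at runtime with ethtool (msglvl).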

[...]
> Of course now I've done this it's not crashing! Maybe it's netconsole-
> related on top of everything else, or I'm just being unlucky... I'll
> keep trying.

Please no netconsole yet.

Nix April 13, 2015, 1:16 p.m. UTC | #4

On 9 Apr 2015, Francois Romieu told this:

> Nix <nix@esperi.org.uk> :
> [...]
>> Of course now I've done this it's not crashing! Maybe it's netconsole-
>> related on top of everything else, or I'm just being unlucky... I'll
>> keep trying.
>
> Please no netconsole yet.

Still trying. It hasn't crashed since I turned netconsole off despite a
couple of days of loading, so it's quite possible that this crash is
netconsole-related. (Without the receive buffers rework, the crash is
much faster and happens even without netconsole, so the rework is
clearly necessary and definitely improves things dramatically.)

diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c
index 3e6fdbb..d441d8c 100644
--- a/drivers/net/ethernet/via/via-rhine.c
+++ b/drivers/net/ethernet/via/via-rhine.c
@@ -1849,7 +1849,7 @@  static netdev_tx_t rhine_start_tx(struct sk_buff *skb,
 
 	netdev_sent_queue(dev, skb->len);
 	/* lock eth irq */
-	wmb();
+	dma_wmb();
 	rp->tx_ring[entry].tx_status |= cpu_to_le32(DescOwn);
 	wmb();
 
@@ -1996,6 +1996,7 @@  static inline u16 rhine_get_vlan_tci(struct sk_buff *skb, int data_size)
 static inline void rhine_rx_vlan_tag(struct sk_buff *skb, struct rx_desc *desc,
 				     int data_size)
 {
+	dma_rmb();
 	if (unlikely(desc->desc_length & cpu_to_le32(DescTag))) {
 		u16 vlan_tci;
 
@@ -2161,6 +2162,7 @@  static void rhine_slow_event_task(struct work_struct *work)
 		container_of(work, struct rhine_private, slow_event_task);
 	struct net_device *dev = rp->dev;
 	u32 intr_status;
+	u16 enable_mask;
 
 	mutex_lock(&rp->task_lock);
 
@@ -2176,7 +2178,12 @@  static void rhine_slow_event_task(struct work_struct *work)
 	if (intr_status & IntrPCIErr)
 		netif_warn(rp, hw, dev, "PCI error\n");
 
-	iowrite16(RHINE_EVENT & 0xffff, rp->base + IntrEnable);
+	napi_disable(&rp->napi);
+
+	enable_mask = ioread16(rp->base + IntrEnable) | RHINE_EVENT_SLOW;
+	iowrite16(enable_mask, rp->base + IntrEnable);
+
+	napi_enable(&rp->napi);
 
 out_unlock:
 	mutex_unlock(&rp->task_lock);
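
The barrier changes in the first two hunks concern the descriptor ownership handoff: the descriptor fields must be visible before the DescOwn bit flips, and must be read only after the bit has been observed. The following is a userspace analogy using C11 atomics, not kernel code; struct desc, desc_publish and desc_consume are hypothetical names, and release/acquire ordering stands in for dma_wmb()/dma_rmb().

```c
#include <stdatomic.h>
#include <stdint.h>

#define DESC_OWN 0x80000000u	/* ownership bit, mirroring the driver's DescOwn */

struct desc {
	uint32_t addr;			/* buffer address (plain field) */
	uint32_t length;		/* buffer length (plain field) */
	_Atomic uint32_t status;	/* carries the ownership bit */
};

/* Producer: fill in the descriptor, then hand it over. */
static void desc_publish(struct desc *d, uint32_t addr, uint32_t length)
{
	d->addr = addr;
	d->length = length;
	/*
	 * Release ordering plays the role of dma_wmb(): the stores above
	 * become visible before the ownership bit does.
	 */
	atomic_fetch_or_explicit(&d->status, DESC_OWN, memory_order_release);
}

/* Consumer: returns 1 and copies the fields if the descriptor was handed over. */
static int desc_consume(struct desc *d, uint32_t *addr, uint32_t *length)
{
	/*
	 * Acquire ordering plays the role of dma_rmb(): the fields are read
	 * only after the ownership bit has been observed.
	 */
	if (!(atomic_load_explicit(&d->status, memory_order_acquire) & DESC_OWN))
		return 0;
	*addr = d->addr;
	*length = d->length;
	return 1;
}
```

Without the ordering on either side, the consumer could see the ownership bit set while still reading stale address or length fields, which is the class of problem the barriers in the patch are meant to rule out.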
