diff mbox

Mark IPW2100 as BROKEN: Fatal interrupt. Scheduling firmware restart.

Message ID 20080921172316.GA6306@2ka.mipt.ru
State Not Applicable, archived
Headers show

Commit Message

Evgeniy Polyakov Sept. 21, 2008, 5:23 p.m. UTC
Hi.

Following bug exists in the ipw2100 driver/firmware for years and Intel
folks never responded to zillions bugzilla entries and forum notices in
the internet with some patch or firmware update (although did request
dmesg and debug info, and received them).

ipw2100: Fatal interrupt. Scheduling firmware restart.

I believe it is a firmware bug because after driver is unloaded and
loaded back again wireless adapter usually starts working (for small
amount of time though). My conspiracy feeling can suggest, that it may
be kind of a force to buy a new one, or trivial error in the firmware,
when it writes to the same place in the flash and essentially given cell
became dead or whatever else.

Intel folks, please fix this problem, I see no other way to force you to do
this than to mark ipw2100 driver as broken, since that is what it is.

Bug exists at least in .15 upto .24 kernels, just search above dmesg
line. I cought it with 2.6.24-19-386 ubuntu kernel, 1.3 firmware
version. lspci:

02:04.0 Network controller: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter (rev 04)
	Subsystem: Intel Corporation Samsung X10/P30 integrated WLAN
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (500ns min, 8500ns max), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at 90080000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-

dmesg is pretty usual.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

Comments

Michael Buesch Sept. 21, 2008, 5:36 p.m. UTC | #1
On Sunday 21 September 2008 19:23:17 Evgeniy Polyakov wrote:
> Hi.
> 
> Following bug exists in the ipw2100 driver/firmware for years and Intel
> folks never responded to zillions bugzilla entries and forum notices in
> the internet with some patch or firmware update (although did request
> dmesg and debug info, and received them).
> 
> ipw2100: Fatal interrupt. Scheduling firmware restart.
> 
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though). My conspiracy feeling can suggest, that it may
> be kind of a force to buy a new one, or trivial error in the firmware,
> when it writes to the same place in the flash and essentially given cell
> became dead or whatever else.
> 
> Intel folks, please fix this problem, I see no other way to force you to do
> this than to mark ipw2100 driver as broken, since that is what it is.

You are pretty funny, actually. :)

I think the bug should be fixed, but what makes _you_ think you can _force_
anybody to do so?

> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> 
> diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
> index 9931b5a..c24fc6a 100644
> --- a/drivers/net/wireless/Kconfig
> +++ b/drivers/net/wireless/Kconfig
> @@ -125,7 +125,7 @@ config PCMCIA_RAYCS
>  
>  config IPW2100
>  	tristate "Intel PRO/Wireless 2100 Network Connection"
> -	depends on PCI && WLAN_80211
> +	depends on PCI && WLAN_80211 && BROKEN
>  	select WIRELESS_EXT
>  	select FW_LOADER
>  	select IEEE80211
>
Evgeniy Polyakov Sept. 21, 2008, 5:38 p.m. UTC | #2
On Sun, Sep 21, 2008 at 07:36:00PM +0200, Michael Buesch (mb@bu3sch.de) wrote:
> > Intel folks, please fix this problem, I see no other way to force you to do
> > this than to mark ipw2100 driver as broken, since that is what it is.
> 
> You are pretty funny, actually. :)
> 
> I think the bug should be fixed, but what makes _you_ think you can _force_
> anybody to do so?

Maybe because I bought that adapter and it stopped working and Intel
knows about this bug and does not fix it for years?
Arjan van de Ven Sept. 21, 2008, 6:04 p.m. UTC | #3
On Sun, 21 Sep 2008 21:23:17 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Hi.
> 
> Following bug exists in the ipw2100 driver/firmware for years and
> Intel folks never responded to zillions bugzilla entries and forum
> notices in the internet with some patch or firmware update (although
> did request dmesg and debug info, and received them).
> 
> ipw2100: Fatal interrupt. Scheduling firmware restart.
> 
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though). 
\
> diff --git a/drivers/net/wireless/Kconfig
> b/drivers/net/wireless/Kconfig index 9931b5a..c24fc6a 100644
> --- a/drivers/net/wireless/Kconfig
> +++ b/drivers/net/wireless/Kconfig
> @@ -125,7 +125,7 @@ config PCMCIA_RAYCS
>  
>  config IPW2100
>  	tristate "Intel PRO/Wireless 2100 Network Connection"
> -	depends on PCI && WLAN_80211
> +	depends on PCI && WLAN_80211 && BROKEN
>  	select WIRELESS_EXT
>  	select FW_LOADER
>  	select IEEE80211
> 

so now you go from an occasional burp to having nothing at all.
How about you run with this patch on your own machine only?

or.. since you say a reload of the driver fixes it.. why don't you make
a patch for the driver that does basically the actions of a reload
automatically when the driver detects the issue?
(and stick a WARN_ON in for good measure so that kerneloops.org can
start tracking these burps)
Evgeniy Polyakov Sept. 21, 2008, 6:28 p.m. UTC | #4
On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven (arjan@infradead.org) wrote:
> so now you go from an occasional burp to having nothing at all.
> How about you run with this patch on your own machine only?

And how else user can get attention to the problem which is not fixed by
the vendor? We close our eyes and there is no problem, since we do not
see it. I just brought a lamp: no user can see that essentially driver
is broken so he can run it on his own risk.

> or.. since you say a reload of the driver fixes it.. why don't you make
> a patch for the driver that does basically the actions of a reload
> automatically when the driver detects the issue?
> (and stick a WARN_ON in for good measure so that kerneloops.org can
> start tracking these burps)

It stops after several seconds (or packets?). Sometimes (but rarely)
it works several minutes, sometimes it fires above dmesg line and
continues to work, sometimes it fires it for a while and then stops
writing it, although driver does not send or receive anything (at
least ifconfig counters do not change).

Actually, I do not think it is a driver problem, since what it does is
pretty much straightforward, but if you will tell me how else can we fix
this issue, I will print it and glue near the window so this gotcha
could be used with other problems. Or you can (as everyone else who do)
just said that this is damn wrong and forget about problem for the next
several years.
Arjan van de Ven Sept. 21, 2008, 6:35 p.m. UTC | #5
On Sun, 21 Sep 2008 22:28:38 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven
> (arjan@infradead.org) wrote:
> > so now you go from an occasional burp to having nothing at all.
> > How about you run with this patch on your own machine only?
> 
> And how else user can get attention to the problem which is not fixed
> by the vendor? 

... or the community

> We close our eyes and there is no problem, since we do
> not see it. I just brought a lamp: no user can see that essentially
> driver is broken so he can run it on his own risk.
> 
> > or.. since you say a reload of the driver fixes it.. why don't you
> > make a patch for the driver that does basically the actions of a
> > reload automatically when the driver detects the issue?
> > (and stick a WARN_ON in for good measure so that kerneloops.org can
> > start tracking these burps)
> 
> It stops after several seconds (or packets?). Sometimes (but rarely)
> it works several minutes, sometimes it fires above dmesg line and
> continues to work, sometimes it fires it for a while and then stops
> writing it, although driver does not send or receive anything (at
> least ifconfig counters do not change).

again.. so how about you detect this condition and do, in the driver
code, the equivalent of rmmod/insmod to the hardware. I'm sure people
who have the hardware would appreciate that type of patch a lot more
than the one you sent out.
Wei Weng Sept. 21, 2008, 6:52 p.m. UTC | #6
Arjan van de Ven wrote:
> On Sun, 21 Sep 2008 22:28:38 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
>> On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven
>> (arjan@infradead.org) wrote:
>>> so now you go from an occasional burp to having nothing at all.
>>> How about you run with this patch on your own machine only?
>> And how else user can get attention to the problem which is not fixed
>> by the vendor? 
> 
> ... or the community
> 
>> We close our eyes and there is no problem, since we do
>> not see it. I just brought a lamp: no user can see that essentially
>> driver is broken so he can run it on his own risk.
>>
>>> or.. since you say a reload of the driver fixes it.. why don't you
>>> make a patch for the driver that does basically the actions of a
>>> reload automatically when the driver detects the issue?
>>> (and stick a WARN_ON in for good measure so that kerneloops.org can
>>> start tracking these burps)
>> It stops after several seconds (or packets?). Sometimes (but rarely)
>> it works several minutes, sometimes it fires above dmesg line and
>> continues to work, sometimes it fires it for a while and then stops
>> writing it, although driver does not send or receive anything (at
>> least ifconfig counters do not change).
> 
> again.. so how about you detect this condition and do, in the driver
> code, the equivalent of rmmod/insmod to the hardware. I'm sure people
> who have the hardware would appreciate that type of patch a lot more
> than the one you sent out.
> 

I guess it is your way of "middle finger" to all the IPW2100 customers who try
to use it on a Linux machine.


Thanks
Wei


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Sept. 21, 2008, 7 p.m. UTC | #7
On Sun, Sep 21, 2008 at 11:35:13AM -0700, Arjan van de Ven (arjan@infradead.org) wrote:
> > And how else user can get attention to the problem which is not fixed
> > by the vendor? 
> 
> ... or the community

Which does not have access to the firmware... Which IMO is failing and
not the driver itself.

> > It stops after several seconds (or packets?). Sometimes (but rarely)
> > it works several minutes, sometimes it fires above dmesg line and
> > continues to work, sometimes it fires it for a while and then stops
> > writing it, although driver does not send or receive anything (at
> > least ifconfig counters do not change).
> 
> again.. so how about you detect this condition and do, in the driver
> code, the equivalent of rmmod/insmod to the hardware. I'm sure people
> who have the hardware would appreciate that type of patch a lot more
> than the one you sent out.
 
Reset task does efectively ipw2100_up(), so the difference is power
cycles over the PCI bus and enable/disable/request commands. Like this
stuff:
	/* We disable the RETRY_TIMEOUT register (0x41) to keep
	 * PCI Tx retries from interfering with C3 CPU state */
	pci_read_config_dword(pci_dev, 0x40, &val);
	if ((val & 0x0000ff00) != 0)
		pci_write_config_dword(pci_dev, 0x40, val & 0xffff00ff);

I do remember I had a tibet monk course of decoding ipw2100 PCI
config address space, just need to find my kimono.

Do you want me to implement ipw2100 driver as a big work structure
which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
And that will be the fix suggested by Intel? That would explain a lot.

P.S. And some people tell that asking for bug bisection is a hard
pressure on user. Vendor has to ask him to fix bug himself instead,
and that will be a solution!

Getting the fact, that rmmod/insmod does not always fix the problem (but
most of the time for a short period of time), I again want to point,
that it looks like a firmware problem related to some inner timings. You
ask me to fix the driver and do not even listen to what I said
previously and do not get that into account and analyze.
Johannes Berg Sept. 21, 2008, 7:14 p.m. UTC | #8
On Sun, 2008-09-21 at 23:00 +0400, Evgeniy Polyakov wrote:

> Do you want me to implement ipw2100 driver as a big work structure
> which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
> And that will be the fix suggested by Intel? That would explain a lot.

I think what Arjan is saying is that it would be better to put pressure
on the responsible folks (I don't think Arjan is anywhere near them at
all) if you'd put in a WARN_ON() for this error and that would make the
top entry on kerneloops.org all the time... And additionally put in a
workaround for yourself for now.

And can we keep the flames off this list please? That comment from Wei
Weng was absolutely uncalled for, and inciting a flamewar (as you have
already blogged) was not really productive either.

johannes
Arjan van de Ven Sept. 21, 2008, 7:20 p.m. UTC | #9
On Sun, 21 Sep 2008 14:52:37 -0400
Wei Weng <wweng@acedsl.com> wrote:
> > again.. so how about you detect this condition and do, in the driver
> > code, the equivalent of rmmod/insmod to the hardware. I'm sure
> > people who have the hardware would appreciate that type of patch a
> > lot more than the one you sent out.
> > 
> 
> I guess it is your way of "middle finger" to all the IPW2100
> customers who try to use it on a Linux machine.

if suggesting a workaround is giving the middle finger in your mind, 
then I don't think it's worth my time to discuss this further with you.
Marcel Holtmann Sept. 21, 2008, 7:35 p.m. UTC | #10
Hi Evgeniy,

> Following bug exists in the ipw2100 driver/firmware for years and Intel
> folks never responded to zillions bugzilla entries and forum notices in
> the internet with some patch or firmware update (although did request
> dmesg and debug info, and received them).
> 
> ipw2100: Fatal interrupt. Scheduling firmware restart.
> 
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though). My conspiracy feeling can suggest, that it may
> be kind of a force to buy a new one, or trivial error in the firmware,
> when it writes to the same place in the flash and essentially given cell
> became dead or whatever else.

I don't know if it is for this bug or a different one, but Matthew
Garrett seem to have some pending patches. At least that is what he told
me at PlumbersConf. Lets see if these patches do help. And please follow
up with Arjan's suggestion and put a WARN_ON in the upstream code
instead of waving CONFIG_BROKEN around.

Regards

Marcel


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arjan van de Ven Sept. 21, 2008, 7:43 p.m. UTC | #11
On Sun, 21 Sep 2008 23:38:09 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> 
> As I pointed, I can rewrite the whole driver's initialization process,
> so that it looked like init/wait/exit loop, which can be processed at
> the module load and when fatal interrupt fires. 

yes please do this.

> 
> I do hope this will result in a progress. Arjan, do you aggree to add
> this patch to the current tree?
> 
> diff --git a/drivers/net/wireless/ipw2100.c
> b/drivers/net/wireless/ipw2100.c index 19a401c..9a7b64c 100644
> --- a/drivers/net/wireless/ipw2100.c
> +++ b/drivers/net/wireless/ipw2100.c
> @@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the
> radio (default 0 [radio on])"); 
>  static u32 ipw2100_debug_level = IPW_DL_NONE;
>  
> +static int ipw2100_max_fatal_ints = 10;
> +
>  #ifdef CONFIG_IPW2100_DEBUG
>  #define IPW_DEBUG(level, message...) \
>  do { \
> @@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct
> ipw2100_priv *priv) if (inta & IPW2100_INTA_FATAL_ERROR) {
>  		printk(KERN_WARNING DRV_NAME
>  		       ": Fatal interrupt. Scheduling firmware
> restart.\n");
> +		WARN_ON(1);
> +
> +		BUG_ON(ipw2100_max_fatal_ints-- <= 0);

BUG_ON in interrupt context is just extremely hostile, since it means
the box is dead.

also I would suggest using WARN_ON_ONCE()
Alan Cox Sept. 21, 2008, 7:57 p.m. UTC | #12
> Getting the fact, that rmmod/insmod does not always fix the problem (but
> most of the time for a short period of time), I again want to point,
> that it looks like a firmware problem related to some inner timings. You
> ask me to fix the driver and do not even listen to what I said
> previously and do not get that into account and analyze.

Try putting it into D3 counting to 10 and powering it back up. Thats
about as close as you can get to pulling the plug when it hangs.

Alan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Sept. 21, 2008, 9:10 p.m. UTC | #13
On Sun, Sep 21, 2008 at 08:57:44PM +0100, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> Try putting it into D3 counting to 10 and powering it back up. Thats
> about as close as you can get to pulling the plug when it hangs.

I will experiment with this, thanks Alan.
Unfortunately my machine builds this only
updated driver for about 10 minutes, so
results will appear not too quickly.
I will start tests tomorrow.
Evgeniy Polyakov Sept. 21, 2008, 9:12 p.m. UTC | #14
Hi Marcel.

On Sun, Sep 21, 2008 at 09:35:31PM +0200, Marcel Holtmann (holtmann@linux.intel.com) wrote:
> I don't know if it is for this bug or a different one, but Matthew
> Garrett seem to have some pending patches. At least that is what he told
> me at PlumbersConf. Lets see if these patches do help. And please follow
> up with Arjan's suggestion and put a WARN_ON in the upstream code
> instead of waving CONFIG_BROKEN around.

I expect it is something new, since this bug exists at least from the
1.2 firmware version and .11 kernel. It was also reproduced (long ago
though) on 2.4.
Matthew Garrett Sept. 21, 2008, 10:42 p.m. UTC | #15
Try D3ing the chip in the firmware restart code. Yes, it's retarded.
Matthew Garrett Sept. 21, 2008, 10:45 p.m. UTC | #16
On Sun, Sep 21, 2008 at 09:35:31PM +0200, Marcel Holtmann wrote:

> I don't know if it is for this bug or a different one, but Matthew
> Garrett seem to have some pending patches. At least that is what he told
> me at PlumbersConf. Lets see if these patches do help. And please follow
> up with Arjan's suggestion and put a WARN_ON in the upstream code
> instead of waving CONFIG_BROKEN around.

The fix I had for this was actually for ipw2200, but it ought to be 
applicable for 2100 as well. The ideal fix is probably to ensure that 
ipw*_down D3s the card and *_up D0s it, which brings enhanced runtime 
power saving and also has the nice side effect of actually resetting the 
damned POS in error cases.
Evgeniy Polyakov Sept. 21, 2008, 11:45 p.m. UTC | #17
On Sun, Sep 21, 2008 at 11:42:10PM +0100, Matthew Garrett (mjg59@srcf.ucam.org) wrote:
> Try D3ing the chip in the firmware restart code. Yes, it's retarded.

Thank you, I will start tests tomorrow.
Kenneth R. Crudup Sept. 22, 2008, 4:21 p.m. UTC | #18
I gotta admit, those "firmware restarts" were pretty annoying, and I'd
always wondered why Intel themselves couldn't be bothered to fix 'em.

	-Kenny
Evgeniy Polyakov Sept. 26, 2008, 5:56 a.m. UTC | #19
On Sun, Sep 21, 2008 at 08:57:44PM +0100, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Getting the fact, that rmmod/insmod does not always fix the problem (but
> > most of the time for a short period of time), I again want to point,
> > that it looks like a firmware problem related to some inner timings. You
> > ask me to fix the driver and do not even listen to what I said
> > previously and do not get that into account and analyze.
> 
> Try putting it into D3 counting to 10 and powering it back up. Thats
> about as close as you can get to pulling the plug when it hangs.

I made several experimetns with power states in reset handler,
like put to d3 (hot), disable device, save/resetore states.
Fatal interrupts continue to fire with essentially the same rate.

The same error address does not always contain the same error value, but
frequently it is finit small set.

Here are some data:
[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501,
	inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11
	(level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at
	offset 0x1 (was 0x2900002, writing 0x2900006)

That is quite harmless, since interrupt handler just sees that device is
dissapearing. This brought me to think more about interrupt processing
(irq handler and related tasklet), and I found races between interrupt
tasklet, ipw2100_wx_event_work() handler, reset task and probably
others. Register access in some cases are proteceted by lock (interrupt
handler), and in some cases is not (all others). Although every user
first disables interrupts, but it can be handled right now and scheduled
tasklet already. Also priv->status field is frequently accessed and
modified with and without locks. This may be harmless, but still a red
flag.

Another data about the same failed address:
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000

values are quite limited, but I saw at different address wider set of
different data, but still limited. Addresses and values of the fatal
interrupt do not follow some immediately obvious law, so this looks more
like firmware losts its mind. The reason for this actually may be a
register access races described above. I will continue experiments with
this.
diff mbox

Patch

diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index 9931b5a..c24fc6a 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -125,7 +125,7 @@  config PCMCIA_RAYCS
 
 config IPW2100
 	tristate "Intel PRO/Wireless 2100 Network Connection"
-	depends on PCI && WLAN_80211
+	depends on PCI && WLAN_80211 && BROKEN
 	select WIRELESS_EXT
 	select FW_LOADER
 	select IEEE80211