Sparc release requalification

Message ID 20090906.175259.243140550.davem@davemloft.net
State Not Applicable
Delegated to: David Miller

Commit Message

David Miller Sept. 7, 2009, 12:52 a.m. UTC
From: Sébastien Bernard <sbernard@nerim.net>
Date: Sun, 06 Sep 2009 23:21:42 +0200

> David said he'll look at this bug later. I'll need to remind him.

In Linus's tree is the following fix for this.  I'll submit it
to -stable when I get a chance.

sparc64: Kill spurious NMI watchdog triggers by increasing limit to 30 seconds.

This is a compromise and a temporary workaround for bootup NMI
watchdog triggers some people see with qla2xxx devices present.

This happens when, for example:

CPU 0 is in the driver init and looping submitting mailbox commands to
load the firmware, then waiting for completion.

CPU 1 is receiving the device interrupts.  CPU 1 is where the NMI
watchdog triggers.

CPU 0 is submitting mailbox commands fast enough that by the time CPU
1 returns from the device interrupt handler, a new one is pending.
This sequence runs for more than 5 seconds.

The problematic case is CPU 1's timer interrupt running when the
barrage of device interrupts begins.  Then we have:

	timer interrupt
	return for softirq checking
	pending, thus enable interrupts

		 qla2xxx interrupt
		 return
		 qla2xxx interrupt
		 return
		 ... 5+ seconds pass
		 final qla2xxx interrupt for fw load
		 return

	run timer softirq
	return

At some point in the multi-second qla2xxx interrupt storm we trigger
the NMI watchdog on CPU 1 from the NMI interrupt handler.

The timer softirq, once we get back to running it, is smart enough to
run the timer work enough times to make up for the missed timer
interrupts.
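
To illustrate the catch-up behaviour described above, here is a minimal
user-space sketch. It is not the kernel's timer code: the struct
timer_base_model type and the run_timer_work() helper are hypothetical
names invented for this example, and counter wrap-around is ignored.

```c
#include <assert.h>

/* Hypothetical model: timer work is keyed to a monotonically growing tick count. */
struct timer_base_model {
	unsigned long handled_ticks;	/* last tick whose timer work has run */
};

/*
 * Run the per-tick timer work for every tick between the last one handled
 * and now_ticks.  Returns how many ticks were processed in this call.
 */
static unsigned long run_timer_work(struct timer_base_model *base,
				    unsigned long now_ticks)
{
	unsigned long processed = 0;

	assert(now_ticks >= base->handled_ticks);	/* wrap-around not modelled */

	while (base->handled_ticks < now_ticks) {
		base->handled_ticks++;	/* expired timers for this tick would run here */
		processed++;
	}
	return processed;
}
```

In this model a single late softirq run covers every tick that elapsed
during the interrupt storm, so the timer work itself is not lost.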

However, the NMI watchdogs (both x86 and sparc) use the timer
interrupt count to notice the cpu is wedged.  But in the above
scenario we'll receive only one such timer interrupt even if we last
all the way back to running the timer softirq.
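
The detection logic can be modelled in user space as follows. This is a
sketch under stated assumptions, not the contents of
arch/sparc/kernel/nmi.c: struct wd_cpu, wd_nmi_tick() and wd_storm_trips()
are hypothetical names, and the reset done in the else branch is assumed,
since the patch context only shows where that branch begins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one CPU's watchdog state. */
struct wd_cpu {
	unsigned long last_irq_sum;	/* timer interrupt count seen at the previous NMI */
	unsigned int alert_counter;	/* consecutive NMIs without a new timer interrupt */
};

/*
 * Model of one NMI tick.  Returns true when the watchdog would fire, that
 * is, when the timer interrupt count has been unchanged for exactly
 * limit_secs * nmi_hz consecutive NMIs.
 */
static bool wd_nmi_tick(struct wd_cpu *cpu, unsigned long irq_sum,
			bool touched, unsigned int limit_secs,
			unsigned int nmi_hz)
{
	assert(nmi_hz > 0);

	if (!touched && cpu->last_irq_sum == irq_sum) {
		cpu->alert_counter++;
		return cpu->alert_counter == limit_secs * nmi_hz;
	}

	/* Assumed reset path: progress was seen, so start counting again. */
	cpu->last_irq_sum = irq_sum;
	cpu->alert_counter = 0;
	return false;
}

/*
 * Simulate an interrupt storm of storm_secs seconds during which the timer
 * interrupt count stays frozen.  Returns true if the watchdog fires.
 */
static bool wd_storm_trips(unsigned int storm_secs, unsigned int limit_secs,
			   unsigned int nmi_hz)
{
	struct wd_cpu cpu = { .last_irq_sum = 1, .alert_counter = 0 };
	unsigned int tick;

	for (tick = 0; tick < storm_secs * nmi_hz; tick++) {
		if (wd_nmi_tick(&cpu, 1, false, limit_secs, nmi_hz))
			return true;
	}
	return false;
}
```

In this model a 6 second storm trips a 5 second limit but not a 30 second
one, which is the effect the change below relies on. A storm longer than
30 seconds would still trip it.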

The default watchdog trigger point is only 5 seconds, which is pretty
low (the softwatchdog triggers at 60 seconds).  So increase it to 30
seconds for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/kernel/nmi.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Comments

Sébastien Bernard Sept. 7, 2009, 4:53 p.m. UTC | #1
David Miller wrote:
> [great explanation snipped]
> --- a/arch/sparc/kernel/nmi.c
> +++ b/arch/sparc/kernel/nmi.c
> @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs)
>  	}
>  	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
>  		local_inc(&__get_cpu_var(alert_counter));
> -		if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz)
> +		if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz)
>  			die_nmi("BUG: NMI Watchdog detected LOCKUP",
>  				regs, panic_on_timeout);
>  	} else {
>   

Hmm, I tested today, and no, it does not solve the problem. The kernel
still hangs at the same place.
I'll get the initcall debug output once I have rebuilt a kernel without
the config_prom_console.


    Seb
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
seb@frankengul.org Sept. 7, 2009, 8:05 p.m. UTC | #2
Sébastien Bernard wrote:
> David Miller wrote:
>> [great explanation snipped]
>> --- a/arch/sparc/kernel/nmi.c
>> +++ b/arch/sparc/kernel/nmi.c
>> @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs)
>>  	}
>>  	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
>>  		local_inc(&__get_cpu_var(alert_counter));
>> -		if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz)
>> +		if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz)
>>  			die_nmi("BUG: NMI Watchdog detected LOCKUP",
>>  				regs, panic_on_timeout);
>>  	} else {
>
> Hmm, I tested today, and no, it does not solve the problem. The kernel
> still hangs at the same place.
> I'll get the initcall debug output once I have rebuilt a kernel without
> the config_prom_console.
>
>    Seb
>
Please find included here the logs from the boot session.
David Miller Sept. 7, 2009, 11:50 p.m. UTC | #3
From: Sébastien Bernard <seb@sfrdev.fr>
Date: Mon, 07 Sep 2009 18:53:15 +0200

> David Miller wrote:
>> [great explanation snipped]
>> --- a/arch/sparc/kernel/nmi.c
>> +++ b/arch/sparc/kernel/nmi.c
>> @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs)
>>  	}
>>  	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
>>  		local_inc(&__get_cpu_var(alert_counter));
>> -		if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz)
>> +		if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz)
>>  			die_nmi("BUG: NMI Watchdog detected LOCKUP",
>>  				regs, panic_on_timeout);
>>  	} else {
>
> Hmm, I tested today, and no, it does not solve the problem. The kernel
> still hangs at the same place.
> I'll get the initcall debug output once I have rebuilt a kernel without
> the config_prom_console.

Then what bug are you talking about?

You stated that disabling the NMI watchdog completely solves your
problem, right?  That's why I mentioned the above patch to you.
Sébastien Bernard Sept. 9, 2009, 3:47 p.m. UTC | #4
David Miller wrote:
> From: Sébastien Bernard <seb@sfrdev.fr>
> Date: Mon, 07 Sep 2009 18:53:15 +0200
>
>> David Miller wrote:
>>> [great explanation snipped]
>>> --- a/arch/sparc/kernel/nmi.c
>>> +++ b/arch/sparc/kernel/nmi.c
>>> @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs)
>>>  	}
>>>  	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
>>>  		local_inc(&__get_cpu_var(alert_counter));
>>> -		if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz)
>>> +		if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz)
>>>  			die_nmi("BUG: NMI Watchdog detected LOCKUP",
>>>  				regs, panic_on_timeout);
>>>  	} else {
>>
>> Hmm, I tested today, and no, it does not solve the problem. The kernel
>> still hangs at the same place.
>> I'll get the initcall debug output once I have rebuilt a kernel without
>> the config_prom_console.
>
> Then what bug are you talking about?
>
> You stated that disabling the NMI watchdog completely solves your
> problem, right?  That's why I mentioned the above patch to you.
>
My mistake, I didn't look at the log closely enough.
I rebuilt a new kernel (2.6.31-rc9) with no patch and there was no hang.
It gets past nmi_setup but crashes later, when mounting the root
partition.
I'll send the complete logs later.
The funny part is that booting with debug_initcalls=1 makes the kernel
hang.

Seb
Patch

diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
index 2c0cc72..b75bf50 100644
--- a/arch/sparc/kernel/nmi.c
+++ b/arch/sparc/kernel/nmi.c
@@ -103,7 +103,7 @@  notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs)
 	}
 	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
 		local_inc(&__get_cpu_var(alert_counter));
-		if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz)
+		if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz)
 			die_nmi("BUG: NMI Watchdog detected LOCKUP",
 				regs, panic_on_timeout);
 	} else {