
bisect results of MSI-X related panic (help!)

Message ID 4AD3E875.5040800@kernel.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tejun Heo Oct. 13, 2009, 2:39 a.m. UTC
Brandeburg, Jesse wrote:
> On Mon, 12 Oct 2009, Tejun Heo wrote:
>>> any other debugging tricks/ideas?
>> Hmm... stackprotector adds a considerable amount of stack usage and it
>> could be that you're seeing a stack overflow, which would also explain the
>> random crashes you've been seeing.  Do you have DEBUG_STACKOVERFLOW
>> turned on?  This is on x86_64, right?
> 
> Hi, thanks for your response, 
> 
> [root@jbrandeb-hc linux-2.6.32-rc1]# grep STACKO .config
> CONFIG_DEBUG_STACKOVERFLOW=y
> 
> [root@jbrandeb-hc linux-2.6.32-rc1]# grep X86_64 .config
> CONFIG_X86_64=y
> CONFIG_X86_64_SMP=y
> CONFIG_X86_64_ACPI_NUMA=y
> 
> stack size is 8K
> 
> I tried Jarek's suggestion of CPUMASK_OFFSTACK and still panic.
> [66027.266057] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810b4eb0
> [66027.266059]
> [66027.266070] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81472856
> [66027.266071]
> [66027.266081] Pid: 0, comm: swapper Tainted: G        W  2.6.32-rc2-git-debug #6
> [66027.266086] Call Trace:
> 
> that was all I got.  Interesting double fault, that hadn't happened 
> before.
> 
> the symbols might be off slightly since I rebuilt the kernel, but this was 
> my initial poke at the offsets above in gdb:
> (gdb) l *0xffffffff810b4eb0
> 0xffffffff810b4eb0 is in dynamic_irq_cleanup (kernel/irq/chip.c:86).
> 81              desc->handle_irq = handle_bad_irq;
> 82              desc->chip = &no_irq_chip;
> 83              desc->name = NULL;
> 84              clear_kstat_irqs(desc);
> 85              spin_unlock_irqrestore(&desc->lock, flags);
> 86      }

Can you please apply the following patch and try to retrigger the
panic?

Comments

Jesse Brandeburg Oct. 14, 2009, 10:30 p.m. UTC | #1
On Mon, 12 Oct 2009, Tejun Heo wrote:
> Can you please apply the following patch and try to retrigger the
> panic?
> 
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index c166019..f5a1482 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -63,6 +63,9 @@ void dynamic_irq_cleanup(unsigned int irq)
>  	struct irq_desc *desc = irq_to_desc(irq);
>  	unsigned long flags;
> 
> +	printk("XXX dynamic_irq_cleanup() called on %u\n", irq);
> +	dump_stack();
> +
>  	if (!desc) {
>  		WARN(1, KERN_ERR "Trying to cleanup invalid IRQ%d\n", irq);
>  		return;

I'm working on it, but now that I've added a bunch of debug including the 
above printk, my system panics (with a stack protector canary overwrite) 
when loading the first network adapter with 30+ MSI-X vectors.  I can boot 
into single-user mode and bring up netconsole, but as soon as I bring up 
the first port with lots of MSI-X vectors, the system hard locks, with no 
panic message.
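
For readers unfamiliar with the term, here is a rough userspace sketch of what a "stack protector canary overwrite" means. It is not kernel code: the struct, the guard value, and the function names are all made up for illustration. With the stack protector enabled, the compiler places a guard word next to a function's locals and checks it before returning; in the kernel a failed check ends in the "stack-protector: Kernel stack is corrupted in:" panic quoted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CANARY 0x5A5AA5A5C3C33C3CULL	/* hypothetical guard value for this sketch */

/* Stand-in for a stack frame: a local buffer followed by the guard word. */
struct fake_frame {
	char buf[16];
	uint64_t canary;
};

/* "Function prologue": place the guard value after the locals. */
static void frame_enter(struct fake_frame *f)
{
	memset(f->buf, 0, sizeof(f->buf));
	f->canary = CANARY;
}

/* Copy len bytes to the start of the frame; len > sizeof(buf) models an overrun. */
static void frame_write(struct fake_frame *f, const void *src, size_t len)
{
	assert(len <= sizeof(*f));	/* stay inside the object so the sketch is well defined */
	memcpy(f, src, len);
}

/* "Function epilogue": the real check panics instead of returning 0. */
static int frame_intact(const struct fake_frame *f)
{
	return f->canary == CANARY;
}
```

A write of 16 bytes leaves the guard intact, while a write of 20 bytes spills into it and makes frame_intact() return 0. Note that the check only reports that something overran the locals of the function being returned from, not who did the writing.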
 
I have a bit of a theory that the node = -1 (numa_node) stuff might be 
playing some havoc with the code in numa_migrate.c.  I'm not sure if that 
is contributing, but the code in there doesn't seem to be written to handle 
node = -1 very well.  As in, I never see it call smp_processor_id() at the 
bottom before accessing the node value.
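
To make the concern concrete, here is a minimal userspace sketch of the kind of guard that seems to be missing. It is not taken from numa_migrate.c; MAX_NODES, NO_NODE, descs_on_node, resolve_node() and account_desc() are hypothetical names. The assumption is that node = -1 means "no node preference" and must be mapped to a real node (for example the node of the CPU currently running) before it is used as an index.

```c
#include <assert.h>

#define MAX_NODES 4		/* hypothetical table size for this sketch */
#define NO_NODE   (-1)		/* "no node preference", as in node = -1 */

/* Hypothetical per-node counters standing in for per-node kernel data. */
static int descs_on_node[MAX_NODES];

/* Map a requested node to a usable index; -1 falls back to current_node. */
static int resolve_node(int node, int current_node)
{
	assert(current_node >= 0 && current_node < MAX_NODES);
	if (node == NO_NODE)
		return current_node;
	assert(node >= 0 && node < MAX_NODES);
	return node;
}

/* Account a descriptor against a node without ever indexing with -1. */
static void account_desc(int node, int current_node)
{
	descs_on_node[resolve_node(node, current_node)]++;
}
```

Without the fallback, descs_on_node[-1] would write just in front of the array, which is the sort of silent corruption that could show up later as an unrelated-looking crash. Whether that is what actually happens here is still only a guess.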

Not sure if that is relevant, but I wanted to mention it before I went 
home.

What next?  I made it worse so I guess that is something.
Tejun Heo Oct. 15, 2009, 7:30 a.m. UTC | #2
Hello,

Brandeburg, Jesse wrote:
> On Mon, 12 Oct 2009, Tejun Heo wrote:
>> Can you please apply the following patch and try to retrigger the
>> panic?
>>
>> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
>> index c166019..f5a1482 100644
>> --- a/kernel/irq/chip.c
>> +++ b/kernel/irq/chip.c
>> @@ -63,6 +63,9 @@ void dynamic_irq_cleanup(unsigned int irq)
>>  	struct irq_desc *desc = irq_to_desc(irq);
>>  	unsigned long flags;
>>
>> +	printk("XXX dynamic_irq_cleanup() called on %u\n", irq);
>> +	dump_stack();
>> +
>>  	if (!desc) {
>>  		WARN(1, KERN_ERR "Trying to cleanup invalid IRQ%d\n", irq);
>>  		return;
> 
> I'm working on it, but now that I've added a bunch of debug including the 
> above printk, my system panics (with a stack protector canary overwrite) 
> when loading the first network adapter with 30+ MSI-X vectors.  I can boot 
> into single-user mode and bring up netconsole, but as soon as I bring up 
> the first port with lots of MSI-X vectors, the system hard locks, with no 
> panic message.
>  
> I have a bit of a theory that the node = -1 (numa_node) stuff might be 
> playing some havoc with the code in numa_migrate.c.  I'm not sure if that 
> is contributing, but the code in there doesn't seem to be written to handle 
> node = -1 very well.  As in, I never see it call smp_processor_id() at the 
> bottom before accessing the node value.
> 
> Not sure if that is relevant, but I wanted to mention it before I went 
> home.
> 
> What next?  I made it worse so I guess that is something.

I don't know.  At this point, I can't think of anything other than
sprinkling printks and dump_stacks around.  :-(

Thanks.

Patch

diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index c166019..f5a1482 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -63,6 +63,9 @@  void dynamic_irq_cleanup(unsigned int irq)
 	struct irq_desc *desc = irq_to_desc(irq);
 	unsigned long flags;

+	printk("XXX dynamic_irq_cleanup() called on %u\n", irq);
+	dump_stack();
+
 	if (!desc) {
 		WARN(1, KERN_ERR "Trying to cleanup invalid IRQ%d\n", irq);
 		return;