Message ID | 505766fa0909231903t2fe2f7b6y26d2681d6d38ef0e@mail.gmail.com
---|---
State | Changes Requested
Delegated to: | David Miller
From: hyl <heyongli@gmail.com>
Date: Thu, 24 Sep 2009 10:03:25 +0800

> Subject: [PATCH] Sun ldom vnet driver deadlock
>
> If two vnets attach to the same vswitch, the LDOM will report
> two identical IRQs, which then leads to a deadlock on lp->lock
> or vio->lock:
>
> static irqreturn_t ldc_rx(int irq, void *dev_id)
> {
> ...
> out:
> 	spin_unlock_irqrestore(&lp->lock, flags);
> 	/* from here on we run without holding any lock */
> 	send_events(lp, event_mask);
> 	/* vnet event processing will deadlock on
> 	 * lp->lock or the vio->lock */
>
> Signed-off-by: Yongli He <heyongli@gmail.com>

Thank you for this bug report and patch, I am looking at
it now.

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: David Miller <davem@davemloft.net>
Date: Fri, 09 Oct 2009 15:08:29 -0700 (PDT)

> Thank you for this bug report and patch, I am looking at
> it now.

I'm trying to figure out how the deadlock can even occur,
and I've failed so far, please help me :-)

See, we always take the VIO and LDC locks in the same order
(VIO then LDC) and always with interrupts disabled, so it is
not possible to deadlock.

The only way we could deadlock is if:

1) There is some path that takes the LDC lock before the VIO one.

2) There is some path that takes either lock with interrupts
   enabled.

And I cannot find any such case.

It might help if you run your test case with lockdep enabled. It will
find such deadlocks and report them precisely to the kernel logs.

Thank you!
2009/10/10 David Miller <davem@davemloft.net>:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 09 Oct 2009 15:08:29 -0700 (PDT)
>
>> Thank you for this bug report and patch, I am looking at
>> it now.
>
> I'm trying to figure out how the deadlock can even occur,
> and I've failed so far, please help me :-)
>
> See, we always take the VIO and LDC locks in the same order
> (VIO then LDC) and always with interrupts disabled, so it is
> not possible to deadlock.
>
> The only way we could deadlock is if:
>
> 1) There is some path that takes the LDC lock before the VIO one.
>
> 2) There is some path that takes either lock with interrupts
>    enabled.
>
> And I cannot find any such case.

David,

Thank you. I tried to figure out the path that leads to the system
hang, and got this output in the log (the same output has run 18350079
times and keeps going forever ...):

ctl: 17, data:0 err:0, abr:0
runcc:18350079 CPUId:0, event: 0x00000004
Call Trace:
 [00000000004acb30] _handle_IRQ_event+0x50/0x120
 [00000000004acc70] handle_IRQ_event+0x70/0x120
 [00000000004af14c] handle_fasteoi_irq+0xcc/0x180
 [000000000042ee54] handler_irq+0x134/0x160
 [00000000004208b4] tl0_irq5+0x14/0x20
 [00000000004acbac] _handle_IRQ_event+0xcc/0x120
 [00000000004acc70] handle_IRQ_event+0x70/0x120
 [00000000004af14c] handle_fasteoi_irq+0xcc/0x180
 [000000000042ee54] handler_irq+0x134/0x160
 [00000000004208b4] tl0_irq5+0x14/0x20
 [00000000004acbac] _handle_IRQ_event+0xcc/0x120
 [00000000004acc70] handle_IRQ_event+0x70/0x120
 [00000000004af14c] handle_fasteoi_irq+0xcc/0x180
 [000000000042ee54] handler_irq+0x134/0x160
 [00000000004208b4] tl0_irq5+0x14/0x20
 [00000000007e7ffc] _spin_unlock_irqrestore+0x3c/0x60
runcc:18350079 CPUId:0, event: 0x00000004

runcc is the count of times ldc_rx has been run. Dump code:

static irqreturn_t ldc_rx(int irq, void *dev_id)
{
	....
	atomic64_inc(&runcc);
	....
	printk(KERN_INFO "runcc:%lld CPUId:%d, event: 0x%08x\n",
	       atomic_read(&runcc), smp_processor_id(), event_mask);
	dump_stack();
}

Looking at the console output, the system seems to hang on a livelock:
tl0_irq5 is triggered just after the IRQ is re-enabled in the handler
of irq5, ldc_rx. I have no idea about tl0_irq5; I just guess that this
special configuration causes tl0_irq5 to be triggered continuously, and
the trigger condition can never be cleared.

Yongli He

> It might help if you run your test case with lockdep enabled. It will
> find such deadlocks and report them precisely to the kernel logs.
>
> Thank you!
diff --git a/arch/sparc/include/asm/ldc.h b/arch/sparc/include/asm/ldc.h
index bdb524a..dd2b30f 100644
--- a/arch/sparc/include/asm/ldc.h
+++ b/arch/sparc/include/asm/ldc.h
@@ -20,6 +20,7 @@ extern void ldom_power_off(void);
  */
 struct ldc_channel_config {
 	void (*event)(void *arg, int event);
+	spinlock_t *serial_lock;
 
 	u32 mtu;
 	unsigned int rx_irq;
diff --git a/arch/sparc64/kernel/ldc.c b/arch/sparc64/kernel/ldc.c
index a6b75cd..72b9501 100644
--- a/arch/sparc64/kernel/ldc.c
+++ b/arch/sparc64/kernel/ldc.c
@@ -892,9 +892,13 @@ handshake_complete:
 	}
 
 out:
+	if (lp->cfg.serial_lock)
+		spin_lock_irqsave(lp->cfg.serial_lock, flags);
 	spin_unlock_irqrestore(&lp->lock, flags);
 
 	send_events(lp, event_mask);
+	if (lp->cfg.serial_lock)
+		spin_unlock_irqrestore(lp->cfg.serial_lock, flags);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/net/sunvnet.c b/drivers/net/sunvnet.c
index 6415ce1..e58f9da 100644
--- a/drivers/net/sunvnet.c
+++ b/drivers/net/sunvnet.c
@@ -1117,11 +1117,13 @@ static struct vnet * __devinit vnet_find_parent(struct mdesc_handle *hp,
 	return vnet_find_or_create(local_mac);
 }
 
+static spinlock_t event_lock = __SPIN_LOCK_UNLOCKED("vnet_event");
 static struct ldc_channel_config vnet_ldc_cfg = {
 	.event		= vnet_event,
 	.mtu		= 64,
 	.mode		= LDC_MODE_UNRELIABLE,
+	.serial_lock	= &event_lock,
 };