Message ID | 20081010075640.GA5204@ff.dom.local |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Jarek Poplawski wrote:
> On 10-10-2008 07:44, Badalian Vyacheslav wrote:
>
>> Hello all!
>>
>
> Hello Slavon,
>
>
>> Please look to if you have time:
>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>
>> We have deadlocks at few PC one times in week.
>> I can test any patches to detect and fix problem.
>> Now i test 2.6.27-rc kernel at one PC.
>>
>
> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
> diagnosed. Anyway it looks like hardware dependent. The patch below
> can sometimes help. 2.6.27 may have this fixed too (some other way).
>
>

With 2.6.27 I get the lockup now:

[ 6951.841662] BUG: NMI Watchdog detected LOCKUP on CPU3, ip c01fde4c, registers:
[ 6951.841662] Modules linked in: sch_sfq sch_htb netconsole e1000 i2c_i801 e1000e i2c_core
[ 6951.841662]
[ 6951.841662] Pid: 0, comm: swapper Not tainted (2.6.27-fw #1)
[ 6951.841662] EIP: 0060:[<c01fde4c>] EFLAGS: 00000092 CPU: 3
[ 6951.841662] EIP is at __rb_rotate_right+0xc/0x70
[ 6951.841662] EAX: f70c3c68 EBX: f70c3c68 ECX: f70c3c68 EDX: c202c134
[ 6951.841662] ESI: f70c3c68 EDI: f70c3c68 EBP: c202c134 ESP: f785fc2c
[ 6951.841662] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 6951.841662] Process swapper (pid: 0, ti=f785e000 task=f7832940 task.ti=f785e000)
[ 6951.841662] Stack: f70c3c68 f70c3c68 f70c3c68 c01fdf41 f70c3c68 00000000 c202c12c c202c134
[ 6951.841662]        c013a91f f70c3c68 c202c12c c202212c c045b100 c013ae0a 00000000 c013d63d
[ 6951.841662]        9a011800 00000652 00000001 00000282 00000652 f70c3c68 00000000 00000000
[ 6951.841662] Call Trace:
[ 6951.841662]  [<c01fdf41>] rb_insert_color+0x91/0xc0
[ 6951.841662]  [<c013a91f>] enqueue_hrtimer+0x5f/0x80
[ 6951.841662]  [<c013ae0a>] hrtimer_start+0xaa/0x130
[ 6951.841662]  [<c013d63d>] getnstimeofday+0x3d/0xe0
[ 6951.841662]  [<c02de83d>] qdisc_watchdog_schedule+0x3d/0x50
[ 6951.841662]  [<f88ac343>] htb_dequeue+0x683/0x7b0 [sch_htb]
[ 6951.841662]  [<c02ce692>] dev_hard_start_xmit+0x1d2/0x2c0
[ 6951.841662]  [<c02dc87a>] __qdisc_run+0x13a/0x1d0
[ 6951.841662]  [<c02d0ed7>] dev_queue_xmit+0x227/0x4f0
[ 6951.841662]  [<c02f29ff>] ip_finish_output+0x11f/0x280
[ 6951.841662]  [<c02f00e0>] ip_forward+0x290/0x310
[ 6951.841662]  [<c02efe35>] ip_forward_finish+0x25/0x40
[ 6951.841662]  [<c02ee9a2>] ip_rcv_finish+0x122/0x360
[ 6951.841662]  [<c02c8cc6>] __alloc_skb+0x36/0x120
[ 6951.841662]  [<c02c9d02>] __netdev_alloc_skb+0x22/0x50
[ 6951.841662]  [<c02eee20>] ip_rcv+0x0/0x290
[ 6951.841662]  [<c02ce064>] netif_receive_skb+0x274/0x4d0
[ 6951.841662]  [<c0108b1a>] nommu_map_single+0x2a/0x60
[ 6951.841662]  [<f883be39>] e1000_receive_skb+0x49/0x80 [e1000e]
[ 6951.841662]  [<f883e84c>] e1000_clean_rx_irq+0x23c/0x300 [e1000e]
[ 6951.841662]  [<f883b3ad>] e1000_clean+0x1bd/0x570 [e1000e]
[ 6951.841662]  [<c02d03bc>] net_rx_action+0x13c/0x200
[ 6951.841662]  [<c0129b72>] __do_softirq+0x82/0x100
[ 6951.841662]  [<c0129c27>] do_softirq+0x37/0x40
[ 6951.841662]  [<c0106060>] do_IRQ+0x40/0x80
[ 6951.841662]  [<c01134c7>] smp_apic_timer_interrupt+0x57/0x90
[ 6951.841662]  [<c010457f>] common_interrupt+0x23/0x28
[ 6951.841662]  [<c0109aa2>] mwait_idle+0x32/0x40
[ 6951.841662]  [<c01026c8>] cpu_idle+0x48/0xe0
[ 6951.841662]  =======================
[ 6951.841662] Code: 24 08 83 e0 03 09 d0 89 03 8b 1c 24 83 c4 0c c3 89 56 08 eb e3 8d 76 00 8d bc 27 00 00 00 00 83 ec 0c 89 1c 24 89 c3 89 7c 24 08 <89> d7 89 74 24 04 8b 50 08 8b 30 8b 4a 04 83 e6 fc 85 c9 89 48

> Jarek P.
>
> (some offsets are OK when patching 2.6.26)
> ---
>
>  net/sched/sch_htb.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 30c999c..ff9e965 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -162,6 +162,7 @@ struct htb_sched {
>
>  	int rate2quantum;	/* quant = rate / rate2quantum */
>  	psched_time_t now;	/* cached dequeue time */
> +	psched_time_t next_watchdog;
>  	struct qdisc_watchdog watchdog;
>
>  	/* non shaped skbs; let them go directly thru */
> @@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  		}
>  	}
>  	sch->qstats.overlimits++;
> -	qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - PSCHED_TICKS_PER_SEC / HZ) {
> +		qdisc_watchdog_schedule(&q->watchdog, next_event);
> +		q->next_watchdog = next_event;
> +	}
>  fin:
>  	return skb;
>  }
> @@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
>  		}
>  	}
>  	qdisc_watchdog_cancel(&q->watchdog);
> +	q->next_watchdog = 0;
>  	__skb_queue_purge(&q->direct_queue);
>  	sch->q.qlen = 0;
>  	memset(q->row, 0, sizeof(q->row));
>
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oh, sorry, I misunderstood you. I will patch 2.6.27 with this patch and test it.

>> Please look to if you have time:
>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>
>> We have deadlocks at few PC one times in week.
>> I can test any patches to detect and fix problem.
>> Now i test 2.6.27-rc kernel at one PC.
>>
>
> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
> diagnosed. Anyway it looks like hardware dependent. The patch below
> can sometimes help. 2.6.27 may have this fixed too (some other way).
>
> Jarek P.
On Fri, Oct 10, 2008 at 12:52:58PM +0400, Badalian Vyacheslav wrote:
> Oh... sorry. I Wrong you understand... i patch 2.6.27 with this patch
> and will test it...

No, you understood it right. But it seems 2.6.27 fix doesn't work for
you. So, yes, try this patch with 2.6.26 or 2.6.27.

Jarek P.

>
> >> Please look to if you have time:
> >> http://bugzilla.kernel.org/show_bug.cgi?id=11718
> >>
> >> We have deadlocks at few PC one times in week.
> >> I can test any patches to detect and fix problem.
> >> Now i test 2.6.27-rc kernel at one PC.
> >>
> >
> > A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
> > diagnosed. Anyway it looks like hardware dependent. The patch below
> > can sometimes help. 2.6.27 may have this fixed too (some other way).
> >
> > Jarek P.
>
On Fri, Oct 10, 2008 at 09:04:26AM +0000, Jarek Poplawski wrote:
> On Fri, Oct 10, 2008 at 12:52:58PM +0400, Badalian Vyacheslav wrote:
> > Oh... sorry. I Wrong you understand... i patch 2.6.27 with this patch
> > and will test it...
>
> No, you understood it right. But it seems 2.6.27 fix doesn't work for
> you. So, yes, try this patch with 2.6.26 or 2.6.27.

There is also a parameter you could try (without this patch):

modprobe sch_htb htb_hysteresis=1

Jarek P.

> >
> > >> Please look to if you have time:
> > >> http://bugzilla.kernel.org/show_bug.cgi?id=11718
> > >>
> > >> We have deadlocks at few PC one times in week.
> > >> I can test any patches to detect and fix problem.
> > >> Now i test 2.6.27-rc kernel at one PC.
> > >>
> > >
> > > A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
> > > diagnosed. Anyway it looks like hardware dependent. The patch below
> > > can sometimes help. 2.6.27 may have this fixed too (some other way).
> > >
> > > Jarek P.
> >
Jarek Poplawski wrote:
> On 10-10-2008 07:44, Badalian Vyacheslav wrote:
>> Hello all!
>
> Hello Slavon,
>
>> Please look to if you have time:
>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>
>> We have deadlocks at few PC one times in week.
>> I can test any patches to detect and fix problem.
>> Now i test 2.6.27-rc kernel at one PC.
>
> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
> diagnosed. Anyway it looks like hardware dependent. The patch below
> can sometimes help. 2.6.27 may have this fixed too (some other way).

I doubt it's hardware related. What's happening is that the hrtimer
insertion gets into an endless loop because the rb tree (or node)
apparently has a loop. I went through the qdiscs' use of hrtimers
again, but can't spot any error there.

Denys, did your systems also have CONFIG_HIGH_RES_TIMERS=n?
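To illustrate the failure mode described above, here is a minimal userspace sketch, not kernel code. It assumes a hypothetical `struct node` with only a parent pointer (the kernel's `struct rb_node` is laid out differently) and a hypothetical helper `parent_chain_has_cycle()`. Tree fix-up code such as `rb_insert_color()` climbs towards the root through parent links, so if those links ever form a cycle the climb never reaches the root. The helper detects that condition with two pointers moving at different speeds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node: only the upward link matters here. */
struct node {
	struct node *parent;	/* NULL at the root of a healthy tree */
};

/*
 * Returns true if following parent pointers from n never reaches NULL,
 * i.e. the chain loops back on itself. Uses tortoise-and-hare cycle
 * detection, so it terminates even on a corrupted chain.
 */
static bool parent_chain_has_cycle(const struct node *n)
{
	const struct node *slow = n;
	const struct node *fast = n;

	while (fast && fast->parent) {
		slow = slow->parent;		/* one step up */
		fast = fast->parent->parent;	/* two steps up */
		if (slow == fast)
			return true;		/* the pointers met: a loop exists */
	}
	return false;				/* reached the root */
}
```

A plain `while (n->parent) n = n->parent;` walk over the same corrupted chain would spin forever, which is consistent with an NMI watchdog lockup report, although the sketch says nothing about how the tree became corrupted in the first place.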
Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On 10-10-2008 07:44, Badalian Vyacheslav wrote:
>>> Hello all!
>>
>> Hello Slavon,
>>
>>> Please look to if you have time:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>>
>>> We have deadlocks at few PC one times in week.
>>> I can test any patches to detect and fix problem.
>>> Now i test 2.6.27-rc kernel at one PC.
>>
>> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
>> diagnosed. Anyway it looks like hardware dependent. The patch below
>> can sometimes help. 2.6.27 may have this fixed too (some other way).
>
> I doubt its hardware related, whats happening is that the hrtimer
> insertion gets into an endless loop because the rb tree (or node)
> apparently has a loop. I went through the qdiscs' use of hrtimers
> again, but can't spot any error there.
>
> Denys, did your systems also have CONFIG_HIGH_RES_TIMERS=n?

Badalian, please try enabling CONFIG_DEBUG_OBJECTS and post the
results, if any.
Patrick McHardy wrote:
> Patrick McHardy wrote:
>> Jarek Poplawski wrote:
>>> On 10-10-2008 07:44, Badalian Vyacheslav wrote:
>>>> Hello all!
>>>
>>> Hello Slavon,
>>>
>>>> Please look to if you have time:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>>>
>>>> We have deadlocks at few PC one times in week.
>>>> I can test any patches to detect and fix problem.
>>>> Now i test 2.6.27-rc kernel at one PC.
>>>
>>> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
>>> diagnosed. Anyway it looks like hardware dependent. The patch below
>>> can sometimes help. 2.6.27 may have this fixed too (some other way).
>>
>> I doubt its hardware related, whats happening is that the hrtimer
>> insertion gets into an endless loop because the rb tree (or node)
>> apparently has a loop. I went through the qdiscs' use of hrtimers
>> again, but can't spot any error there.
>>
>> Denys, did your systems also have CONFIG_HIGH_RES_TIMERS=n?
>
> Badalian, please try enabling CONFIG_DEBUG_OBJECTS and post the
> results, if any.

I have some results with CONFIG_HIGH_RES_TIMERS=n and
CONFIG_HIGH_RES_TIMERS=y.

OK, I will recompile the kernel; I am just waiting for a crash to reboot =)

Now I have these PCs:
1. 2.6.27 with the patch
2. 2.6.26.6 with htb_hysteresis=1 and CONFIG_DEBUG_OBJECTS=n
3. 2.6.26.6 with htb_hysteresis=1 and CONFIG_DEBUG_OBJECTS=y (waiting for a crash to reboot)
4. 2 servers that deadlocked and did not reboot after the panic (2.6.26.5 kernel); someone has to drive to them to reboot them
5. 4 PCs with 2.6.24-rc7-git2 that do the same shaping but have no crashes (they are on other hardware)
Jarek Poplawski wrote:
> On Fri, Oct 10, 2008 at 09:04:26AM +0000, Jarek Poplawski wrote:
>
>> On Fri, Oct 10, 2008 at 12:52:58PM +0400, Badalian Vyacheslav wrote:
>>
>>> Oh... sorry. I Wrong you understand... i patch 2.6.27 with this patch
>>> and will test it...
>>>
>> No, you understood it right. But it seems 2.6.27 fix doesn't work for
>> you. So, yes, try this patch with 2.6.26 or 2.6.27.
>>
>
> There is also a parameter you could try (without this patch):
>
> modprobe sch_htb htb_hysteresis=1
>
> Jarek P.
>
>
>>>>> Please look to if you have time:
>>>>> http://bugzilla.kernel.org/show_bug.cgi?id=11718
>>>>>
>>>>> We have deadlocks at few PC one times in week.
>>>>> I can test any patches to detect and fix problem.
>>>>> Now i test 2.6.27-rc kernel at one PC.
>>>>>
>>>>>
>>>> A similar bug was reported by Denys Fedoryshchenko but it wasn't fully
>>>> diagnosed. Anyway it looks like hardware dependent. The patch below
>>>> can sometimes help. 2.6.27 may have this fixed too (some other way).
>>>>
>>>> Jarek P.
>>>>
>
>

Sorry for the late answer.

We had power trouble in our server room. It is fixed now, and I will
test all of this again.

With the patch + htb_hysteresis=0, and with htb_hysteresis=1 without the
patch, all PCs ran fine for 2 days and 18 hours. After that we had the
power failure... =(

Thanks
On Thu, Oct 16, 2008 at 12:28:46PM +0400, Badalian Vyacheslav wrote:
...
> Sorry for long answer.
>
> We have troubles with power in our server place. Now its gone and i will
> test again all this.
>
> With patch + htb_hysteresis=0 and htb_hysteresis=1 without patch all PC
> work done 2 days and 18 hours. After this we have power crash.... =(

No need to hurry: you've written it's not everyday. Better try to make
sure there is really a difference after any of these changes.

Thanks,
Jarek P.
Hello!

I have more information.

Statistics of the PCs:
1. 2.6.26.6, Dynamic Timer, HiResTimer, 1000HZ, htb_hysteresis=0 - crashed 1d 18h ago
2. 2.6.26.5, HZ300, no Dynamic Timer, no HiResTimer, htb_hysteresis=0 - uptime 5d 17h (no crashes for now, but it crashed some time ago with htb_hysteresis=1)
3. 2.6.27, 1000HZ, no Dynamic Timer, no HiResTimer, htb_hysteresis=0 + PATCH - uptime 5d 17h (no crashes for now, but it crashed some time ago without the patch)

I also attach the log of the last crash on PC 1:

[10610.110729] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c01fd939, registers:
[10610.110729] Modules linked in: netconsole e1000e i2c_i801 i2c_core e1000
[10610.110729]
[10610.110729] Pid: 0, comm: swapper Not tainted (2.6.26.6-fw #1)
[10610.110729] EIP: 0060:[<c01fd939>] EFLAGS: 00000082 CPU: 1
[10610.110729] EIP is at rb_insert_color+0x19/0xc0
[10610.110729] EAX: f6c23ca4 EBX: f6c23ca4 ECX: 00000000 EDX: f6c23ca4
[10610.110729] ESI: f6c23ca4 EDI: f6c23ca4 EBP: c20190e0 ESP: f7c4dc98
[10610.110729] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[10610.110729] Process swapper (pid: 0, ti=f7c4c000 task=f7c314a0 task.ti=f7c4c000)
[10610.110729] Stack: f6c23ca8 f6c23ca4 f6c23ca4 00000000 c013b672 c20190e0 00000001 c20190d8
[10610.110729]        c20190d8 f6c23ca4 c202d0d8 c04470a0 c013bd4d 00000000 f7c4dcf4 7c491000
[10610.110729]        000009a5 00000001 00000286 f6c23800 ffffffff 00000000 00000000 c02d407e
[10610.110729] Call Trace:
[10610.110729]  [<c013b672>] enqueue_hrtimer+0x72/0xf0
[10610.110729]  [<c013bd4d>] hrtimer_start+0xad/0x150
[10610.110729]  [<c02d407e>] qdisc_watchdog_schedule+0x1e/0x30
[10610.110729]  [<c02d9826>] htb_dequeue+0x6a6/0x810
[10610.110729]  [<c02d3f72>] tc_classify+0x42/0x90
[10610.110729]  [<c02dab22>] sfq_enqueue+0x22/0x230
[10610.110729]  [<c02d9c40>] htb_enqueue+0x0/0x1e0
[10610.110729]  [<c02d2efc>] __qdisc_run+0x19c/0x1d0
[10610.110729]  [<c02d9c40>] htb_enqueue+0x0/0x1e0
[10610.110729]  [<c02c7737>] dev_queue_xmit+0x267/0x380
[10610.110729]  [<c02e8ab0>] ip_forward_finish+0x0/0x40
[10610.110729]  [<c02eb65f>] ip_finish_output+0x11f/0x280
[10610.110729]  [<c02e8d7f>] ip_forward+0x28f/0x2d0
[10610.110729]  [<c02e8ad5>] ip_forward_finish+0x25/0x40
[10610.110729]  [<c02e7612>] ip_rcv_finish+0x122/0x360
[10610.110729]  [<c02bfa87>] __alloc_skb+0x57/0x120
[10610.110729]  [<c0109c8a>] nommu_map_single+0x2a/0x60
[10610.110729]  [<c02e7a90>] ip_rcv+0x0/0x290
[10610.110729]  [<c02c45cb>] netif_receive_skb+0x26b/0x470
[10610.110729]  [<f886c75d>] e1000_receive_skb+0x4d/0x1b0 [e1000e]
[10610.110729]  [<f886f9cc>] e1000_clean_rx_irq+0x23c/0x300 [e1000e]
[10610.110729]  [<f886bf69>] e1000_clean+0x49/0x1f0 [e1000e]
[10610.110729]  [<c02c69d8>] net_rx_action+0xf8/0x1b0
[10610.110729]  [<c012a922>] __do_softirq+0x82/0x100
[10610.110729]  [<c012a9d7>] do_softirq+0x37/0x40
[10610.110729]  [<c012ad27>] irq_exit+0x57/0x80
[10610.110729]  [<c0107120>] do_IRQ+0x40/0x80
[10610.110729]  [<c0114097>] smp_apic_timer_interrupt+0x57/0x90
[10610.110729]  [<c01055a3>] common_interrupt+0x23/0x28
[10610.110729]  [<c010a602>] mwait_idle+0x32/0x40
[10610.110729]  [<c010a5d0>] mwait_idle+0x0/0x40
[10610.110729]  [<c01036f3>] cpu_idle+0x53/0xc0
[10610.110729]  =======================
[10610.110729] Code: c4 0c c3 89 56 04 eb e3 8d 76 00 8d bc 27 00 00 00 00 55 89 d5 57 89 c7 56 53 90 8d b4 26 00 00 00 00 8b 1f 83 e3 fc 74 32 8b 03 <89> d9 a8 01 75 2a 89 c6 83 e6 fc 8b 56 08 39 d3 74 45 85 d2 74

Thanks!

Best regards,
Badalian Vyacheslav

> On Thu, Oct 16, 2008 at 12:28:46PM +0400, Badalian Vyacheslav wrote:
> ...
>
>> Sorry for long answer.
>>
>> We have troubles with power in our server place. Now its gone and i will
>> test again all this.
>>
>> With patch + htb_hysteresis=0 and htb_hysteresis=1 without patch all PC
>> work done 2 days and 18 hours. After this we have power crash.... =(
>>
>
> No need to hurry: you've written it's not everyday. Better try to make
> sure there is really a difference after any of these changes.
>
> Thanks,
> Jarek P.
>
>
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 30c999c..ff9e965 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -162,6 +162,7 @@ struct htb_sched {
 
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
+	psched_time_t next_watchdog;
 	struct qdisc_watchdog watchdog;
 
 	/* non shaped skbs; let them go directly thru */
@@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / HZ) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	}
 fin:
 	return skb;
 }
@@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
 	qdisc_watchdog_cancel(&q->watchdog);
+	q->next_watchdog = 0;
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
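The re-arm condition in the patch can be tried out in userspace with a rough sketch like the one below. It is not the kernel code: `struct watchdog_state`, `should_rearm()`, `schedule_event()` and the `TICKS_PER_JIFFY` value are hypothetical stand-ins for `struct htb_sched`, the new `if` in `htb_dequeue()` and `PSCHED_TICKS_PER_SEC / HZ`. A counter takes the place of the real call to `qdisc_watchdog_schedule()`. The comparison is written with an addition on the left instead of a subtraction on the right, an assumption made here only to avoid unsigned underflow in the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t psched_time_t;

/* Hypothetical stand-in for PSCHED_TICKS_PER_SEC / HZ (one jiffy in ticks). */
#define TICKS_PER_JIFFY 1000u

struct watchdog_state {
	psched_time_t now;		/* cached dequeue time, like q->now */
	psched_time_t next_watchdog;	/* expiry of the last armed timer, 0 if none */
	unsigned int rearm_count;	/* how often the timer would be restarted */
};

/*
 * Mirrors the patch's condition: restart the timer only if the previously
 * armed one is already in the past, or if the new event is at least one
 * jiffy earlier than the armed expiry.
 */
static bool should_rearm(const struct watchdog_state *s, psched_time_t next_event)
{
	if (s->next_watchdog < s->now)
		return true;
	return next_event + TICKS_PER_JIFFY <= s->next_watchdog;
}

/* Counts a restart instead of calling qdisc_watchdog_schedule(). */
static void schedule_event(struct watchdog_state *s, psched_time_t next_event)
{
	if (should_rearm(s, next_event)) {
		s->rearm_count++;
		s->next_watchdog = next_event;
	}
}
```

With `now = 10000` and no timer armed, a request for 15000 arms the timer. A following request for 14500 is skipped because it is less than one jiffy earlier, while a request for 14000 restarts the timer. The effect is fewer `hrtimer_start()` calls per dequeue, which matches the thread's description of the patch as something that "can sometimes help" rather than a root-cause fix.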