Message ID | 20081215111308.GA5853@ff.dom.local |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Now i test patches 2+3. 5 pc work without oops 5 days > On Thu, Dec 11, 2008 at 08:46:06AM +0000, Jarek Poplawski wrote: > >> On Wed, Dec 10, 2008 at 06:14:28PM +0300, Badalian Vyacheslav wrote: >> >>> Hello again! Sorry for long away. >>> >> Hi! >> >> >>> I was go away from this work for long time. >>> >>> May we return to this bug? >>> Servers at last stable kernel 2.6.27.8 >>> HZ=1000, HR=off, DynamicTicks=off, hysteresis=1 >>> Sorry - no patched, update do not i. Do you have fresh patches or ideas >>> for tests? >>> >> Not much, but I can have if you only are willing to test them... >> I attach below a patch which combines 2 patches I sent yesterday to >> netdev (PATCH 7/6 and 8/6) vs. 2.6.27.7 (named testing patch #3 here). >> >> You can still try the testing patch #2 I sent previously (quoted below) >> with or without this new #3 patch. >> >> > > Here is another idea worth checking (instead of patch #2). > > Jarek P. > > --- (testing patch #4) > > diff -Nurp a2.6.27.7/net/sched/sch_htb.c b2.6.27.7/net/sched/sch_htb.c > --- a2.6.27.7/net/sched/sch_htb.c 2008-12-11 08:16:16.000000000 +0000 > +++ b2.6.27.7/net/sched/sch_htb.c 2008-12-15 10:44:32.000000000 +0000 > @@ -924,6 +924,7 @@ static struct sk_buff *htb_dequeue(struc > } > } > sch->qstats.overlimits++; > + qdisc_watchdog_cancel(&q->watchdog); > qdisc_watchdog_schedule(&q->watchdog, next_event); > fin: > return skb; > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello result: Patch 2+3 = uptime 7 days without crashes. May i revert patches and try single new patch? Thanks! > On Thu, Dec 11, 2008 at 08:46:06AM +0000, Jarek Poplawski wrote: > >> On Wed, Dec 10, 2008 at 06:14:28PM +0300, Badalian Vyacheslav wrote: >> >>> Hello again! Sorry for long away. >>> >> Hi! >> >> >>> I was go away from this work for long time. >>> >>> May we return to this bug? >>> Servers at last stable kernel 2.6.27.8 >>> HZ=1000, HR=off, DynamicTicks=off, hysteresis=1 >>> Sorry - no patched, update do not i. Do you have fresh patches or ideas >>> for tests? >>> >> Not much, but I can have if you only are willing to test them... >> I attach below a patch which combines 2 patches I sent yesterday to >> netdev (PATCH 7/6 and 8/6) vs. 2.6.27.7 (named testing patch #3 here). >> >> You can still try the testing patch #2 I sent previously (quoted below) >> with or without this new #3 patch. >> >> > > Here is another idea worth checking (instead of patch #2). > > Jarek P. > > --- (testing patch #4) > > diff -Nurp a2.6.27.7/net/sched/sch_htb.c b2.6.27.7/net/sched/sch_htb.c > --- a2.6.27.7/net/sched/sch_htb.c 2008-12-11 08:16:16.000000000 +0000 > +++ b2.6.27.7/net/sched/sch_htb.c 2008-12-15 10:44:32.000000000 +0000 > @@ -924,6 +924,7 @@ static struct sk_buff *htb_dequeue(struc > } > } > sch->qstats.overlimits++; > + qdisc_watchdog_cancel(&q->watchdog); > qdisc_watchdog_schedule(&q->watchdog, next_event); > fin: > return skb; > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 18, 2008 at 09:43:51AM +0300, Badalian Vyacheslav wrote: > Hello > result: Patch 2+3 = uptime 7 days without crashes. > May i revert patches and try single new patch? Here is my current opinion on this bug: 1) I'm almost sure it's not a htb, but hrtimers bug (some race), 2) the htb patches you've tested are not "the proper" way of fixing it; I see substantial changes in hrtimers code in the "-tip" tree (probably for 2.6.29), which, probably, you'll be advised by hrtimers maintainers to try, and I guess, it's not easy on a production system, So, it's up to you: 1) since these patches work for you, you can stop with testing and wait with these patched kernels until 2.6.29 (I can propose this #2 patch as a temporary fix then), 2) for curiosity you could try this patch #4 alone on one box first (after reverting at least patch #2), but again: if it works, it could be only treated as a temporary hack, and alternative of #2. Thanks, Jarek P. > > > On Thu, Dec 11, 2008 at 08:46:06AM +0000, Jarek Poplawski wrote: > > > >> On Wed, Dec 10, 2008 at 06:14:28PM +0300, Badalian Vyacheslav wrote: > >> > >>> Hello again! Sorry for long away. > >>> > >> Hi! > >> > >> > >>> I was go away from this work for long time. > >>> > >>> May we return to this bug? > >>> Servers at last stable kernel 2.6.27.8 > >>> HZ=1000, HR=off, DynamicTicks=off, hysteresis=1 > >>> Sorry - no patched, update do not i. Do you have fresh patches or ideas > >>> for tests? > >>> > >> Not much, but I can have if you only are willing to test them... > >> I attach below a patch which combines 2 patches I sent yesterday to > >> netdev (PATCH 7/6 and 8/6) vs. 2.6.27.7 (named testing patch #3 here). > >> > >> You can still try the testing patch #2 I sent previously (quoted below) > >> with or without this new #3 patch. > >> > >> > > > > Here is another idea worth checking (instead of patch #2). > > > > Jarek P. > > > > --- (testing patch #4) > > > > diff -Nurp a2.6.27.7/net/sched/sch_htb.c b2.6.27.7/net/sched/sch_htb.c > > --- a2.6.27.7/net/sched/sch_htb.c 2008-12-11 08:16:16.000000000 +0000 > > +++ b2.6.27.7/net/sched/sch_htb.c 2008-12-15 10:44:32.000000000 +0000 > > @@ -924,6 +924,7 @@ static struct sk_buff *htb_dequeue(struc > > } > > } > > sch->qstats.overlimits++; > > + qdisc_watchdog_cancel(&q->watchdog); > > qdisc_watchdog_schedule(&q->watchdog, next_event); > > fin: > > return skb; > > > > > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Thanks for all Jarek! Vyacheslav Badalian > On Thu, Dec 18, 2008 at 09:43:51AM +0300, Badalian Vyacheslav wrote: > >> Hello >> result: Patch 2+3 = uptime 7 days without crashes. >> May i revert patches and try single new patch? >> > > Here is my current opinion on this bug: > > 1) I'm almost sure it's not a htb, but hrtimers bug (some race), > > 2) the htb patches you've tested are not "the proper" way of fixing > it; I see substantial changes in hrtimers code in the "-tip" tree > (probably for 2.6.29), which, probably, you'll be advised by > hrtimers maintainers to try, and I guess, it's not easy on a > production system, > > So, it's up to you: > > 1) since these patches work for you, you can stop with testing and > wait with these patched kernels until 2.6.29 (I can propose this > #2 patch as a temporary fix then), > > 2) for curiosity you could try this patch #4 alone on one box first > (after reverting at least patch #2), but again: if it works, it > could be only treated as a temporary hack, and alternative of #2. > > Thanks, > Jarek P. > > >>> On Thu, Dec 11, 2008 at 08:46:06AM +0000, Jarek Poplawski wrote: >>> >>> >>>> On Wed, Dec 10, 2008 at 06:14:28PM +0300, Badalian Vyacheslav wrote: >>>> >>>> >>>>> Hello again! Sorry for long away. >>>>> >>>>> >>>> Hi! >>>> >>>> >>>> >>>>> I was go away from this work for long time. >>>>> >>>>> May we return to this bug? >>>>> Servers at last stable kernel 2.6.27.8 >>>>> HZ=1000, HR=off, DynamicTicks=off, hysteresis=1 >>>>> Sorry - no patched, update do not i. Do you have fresh patches or ideas >>>>> for tests? >>>>> >>>>> >>>> Not much, but I can have if you only are willing to test them... >>>> I attach below a patch which combines 2 patches I sent yesterday to >>>> netdev (PATCH 7/6 and 8/6) vs. 2.6.27.7 (named testing patch #3 here). >>>> >>>> You can still try the testing patch #2 I sent previously (quoted below) >>>> with or without this new #3 patch. >>>> >>>> >>>> >>> Here is another idea worth checking (instead of patch #2). >>> >>> Jarek P. >>> >>> --- (testing patch #4) >>> >>> diff -Nurp a2.6.27.7/net/sched/sch_htb.c b2.6.27.7/net/sched/sch_htb.c >>> --- a2.6.27.7/net/sched/sch_htb.c 2008-12-11 08:16:16.000000000 +0000 >>> +++ b2.6.27.7/net/sched/sch_htb.c 2008-12-15 10:44:32.000000000 +0000 >>> @@ -924,6 +924,7 @@ static struct sk_buff *htb_dequeue(struc >>> } >>> } >>> sch->qstats.overlimits++; >>> + qdisc_watchdog_cancel(&q->watchdog); >>> qdisc_watchdog_schedule(&q->watchdog, next_event); >>> fin: >>> return skb; >>> >>> >>> >>> > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 18, 2008 at 02:23:33PM +0300, Badalian Vyacheslav wrote:
> Thanks for all Jarek!
The same to you for very valuable reporting and testing Vyacheslav!
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Dec 2008, Jarek Poplawski wrote: > On Thu, Dec 18, 2008 at 09:43:51AM +0300, Badalian Vyacheslav wrote: > > Hello > > result: Patch 2+3 = uptime 7 days without crashes. > > May i revert patches and try single new patch? > > Here is my current opinion on this bug: > > 1) I'm almost sure it's not a htb, but hrtimers bug (some race), > > 2) the htb patches you've tested are not "the proper" way of fixing > it; I see substantial changes in hrtimers code in the "-tip" tree > (probably for 2.6.29), which, probably, you'll be advised by > hrtimers maintainers to try, and I guess, it's not easy on a > production system, > > So, it's up to you: > > 1) since these patches work for you, you can stop with testing and > wait with these patched kernels until 2.6.29 (I can propose this > #2 patch as a temporary fix then), > > 2) for curiosity you could try this patch #4 alone on one box first > (after reverting at least patch #2), but again: if it works, it > could be only treated as a temporary hack, and alternative of #2. I think I am hitting the same issue discussed in this thread and: http://bugzilla.kernel.org/show_bug.cgi?id=11718 and wanted to share my data. My system: 2x Xeon @ 3.06ghz CONFIG_HZ=250 CONFIG_HIGH_RES_TIMERS=y CONFIG_NO_HZ=y (please let me know if more details are useful) With 2.6.27.10 after around 30 minutes or an hour, server tends to hang with: Pid: 0, comm: swapper Not tainted (2.6.27.10 #1) EIP: 0060:[<c023199f>] EFLAGS: 00000286 CPU: 0 EIP is at run_hrtimer_pending+0x31/0xb8 EAX: f6c1d468 EBX: c039b2ae ECX: 00000002 EDX: 00000001 ESI: f6c1d468 EDI: c1807e20 EBP: c0587f28 ESP: c0587f18 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 09a36358 CR3: 36d64000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c0231a3c>] run_hrtimer_softirq+0x16/0x18 [<c0223e6d>] __do_softirq+0x64/0xcd [<c0223f0b>] do_softirq+0x35/0x3a [<c0224199>] irq_exit+0x38/0x6d [<c020eefb>] smp_apic_timer_interrupt+0x71/0x82 [<c0203640>] apic_timer_interrupt+0x28/0x30 [<c0207f9d>] ? default_idle+0x2d/0x42 [<c0201946>] cpu_idle+0xca/0xea [<c045194e>] rest_init+0x4e/0x50 One time with 2.6.27.10: BUG: unable to handle kernel NULL pointer dereference at 000000 [<c0239b6b>] smp_call_function+0x12/0x14 [<c020df20>] native_smp_send_stop+0x1b/0x28 [<c0220304>] panic+0x47/0xdf [<c0203ac2>] oops_end+0x5d/0x71 [<c0203f9f>] die+0x57/0x5f [<c02129a7>] do_page_fault+0x474/0x529 [<c0212533>] ? do_page_fault+0x0/0x529 [<c045eb3a>] error_code+0x72/0x78 [<c024007b>] ? cgroup_get_sb+0x20/0x2b4 [<c02f223c>] ? rb_erase+0x118/0x241 [<c023177d>] __remove_hrtimer+0x5f/0x67 [<c039b2ae>] ? qdisc_watchdog+0x0/0x1c [<c023199a>] run_hrtimer_pending+0x2c/0xb8 [<c0231a3c>] run_hrtimer_softirq+0x16/0x18 [<c0223e6d>] __do_softirq+0x64/0xcd [<c0223f0b>] do_softirq+0x35/0x3a [<c0224199>] irq_exit+0x38/0x6d [<c020eefb>] smp_apic_timer_interrupt+0x71/0x82 [<c0203640>] apic_timer_interrupt+0x28/0x30 [<c0207f9d>] ? default_idle+0x2d/0x42 [<c0201946>] cpu_idle+0xca/0xea [<c045194e>] rest_init+0x4e/0x50 ======================= ---[ end trace 818de8d3237477a5 ]--- Rebooting in 10 seconds.. With 2.6.27.10 plus what is being called patches #2 and #3 in this thread, the server ran for a day or so before hanging again: Pid: 0, comm: swapper Not tainted (2.6.27.10 #3) EIP: 0060:[<c023199f>] EFLAGS: 00000206 CPU: 0 EIP is at run_hrtimer_pending+0x31/0xb8 EAX: f73f8470 EBX: c039b312 ECX: 00000002 EDX: 00000001 ESI: f73f8470 EDI: c1807e20 EBP: c0589f20 ESP: c0589f10 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 08ca3358 CR3: 005ca000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c0231a3c>] run_hrtimer_softirq+0x16/0x18 [<c0223e6d>] __do_softirq+0x64/0xcd [<c0223f0b>] do_softirq+0x35/0x3a [<c0224199>] irq_exit+0x38/0x6d [<c02052b3>] do_IRQ+0x5c/0x75 [<c0203553>] common_interrupt+0x23/0x28 [<c0207f9d>] ? default_idle+0x2d/0x42 [<c0201946>] cpu_idle+0xca/0xea [<c045225e>] rest_init+0x4e/0x50 With 2.6.28 (without the patches from this thread) after about an hour the system hung again. "Show Regs" indicated: Pid: 0, comm: swapper Not tainted (2.6.28 #1) EIP: 0060:[<c03977bd>] EFLAGS: 00000247 CPU: 0 EIP is at __netif_schedule+0xc/0x39 EAX: f6934600 EBX: 00000000 ECX: f6934600 EDX: f6864000 ESI: f6864488 EDI: c1807480 EBP: c05c1f08 ESP: c05c1f04 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 09968358 CR3: 365cd000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Call Trace: [<c03a669d>] qdisc_watchdog+0x18/0x1c [<c02335c3>] run_hrtimer_pending+0x56/0xda [<c03a6685>] ? qdisc_watchdog+0x0/0x1c [<c023365d>] run_hrtimer_softirq+0x16/0x18 [<c0225b1a>] __do_softirq+0x7a/0x11c [<c0225bf1>] do_softirq+0x35/0x3a [<c0225eb5>] irq_exit+0x38/0x6d [<c020f719>] smp_apic_timer_interrupt+0x71/0x7f [<c0203858>] apic_timer_interrupt+0x28/0x30 [<c0208360>] ? default_idle+0x2d/0x42 [<c0201938>] cpu_idle+0x6b/0x84 [<c046b292>] rest_init+0x4e/0x50 Per Jarek's suggestion in bugzilla, I ran 2.6.28 plus Peter Zijlstra's "hrtimer: removing all ur callback modes" patches dated 2008-11-25, 2008-12-04 and 2008-12-08. Uptime was 2 days 22 hours before I hit what appears to be an unrelated bug related to the IPv6 FIB. (Separately reported with subject 'panic with 2.6.28 while doing "ip -6 route"'.) Chris -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:43:13AM +0000, Chris Caputo wrote: ... > Per Jarek's suggestion in bugzilla, I ran 2.6.28 plus Peter Zijlstra's > "hrtimer: removing all ur callback modes" patches dated 2008-11-25, > 2008-12-04 and 2008-12-08. Uptime was 2 days 22 hours before I hit what > appears to be an unrelated bug related to the IPv6 FIB. (Separately > reported with subject 'panic with 2.6.28 while doing "ip -6 route"'.) Chris, this is a very good news, thanks for testing this! Peter, great work! IMHO these patches should go to the stable (at least 2.6.28). Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I will try that patches too, when i got this message, after 3 minutes i got crash of my router :-) after working around 17 hours :-( There is 3 of them by the way, 2 fixes also. hrtimer: removing all ur callback modes hrtimer: removing all ur callback modes, fix hotplug hrtimer: removing all ur callback modes, fix On Wednesday 14 January 2009 08:39:09 Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 02:43:13AM +0000, Chris Caputo wrote: > ... > > > Per Jarek's suggestion in bugzilla, I ran 2.6.28 plus Peter Zijlstra's > > "hrtimer: removing all ur callback modes" patches dated 2008-11-25, > > 2008-12-04 and 2008-12-08. Uptime was 2 days 22 hours before I hit what > > appears to be an unrelated bug related to the IPv6 FIB. (Separately > > reported with subject 'panic with 2.6.28 while doing "ip -6 route"'.) > > Chris, this is a very good news, thanks for testing this! > > Peter, great work! IMHO these patches should go to the stable (at > least 2.6.28). > > Thanks, > Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:17:58PM +0200, Denys Fedoryschenko wrote: > I will try that patches too, when i got this message, after 3 minutes i got > crash of my router :-) after working around 17 hours :-( > There is 3 of them by the way, 2 fixes also. > > hrtimer: removing all ur callback modes > hrtimer: removing all ur callback modes, fix hotplug > hrtimer: removing all ur callback modes, fix Yes, it's a very nasty bug which, BTW you reported the first AFAIK, but I hope it's really gone with these patches. Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday 14 January 2009 14:36:23 Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 02:17:58PM +0200, Denys Fedoryschenko wrote: > > I will try that patches too, when i got this message, after 3 minutes i > > got crash of my router :-) after working around 17 hours :-( > > There is 3 of them by the way, 2 fixes also. > > > > hrtimer: removing all ur callback modes > > hrtimer: removing all ur callback modes, fix hotplug > > hrtimer: removing all ur callback modes, fix > > Yes, it's a very nasty bug which, BTW you reported the first AFAIK, > but I hope it's really gone with these patches. > If it is really fix, it must go to mainline ASAP. Because really many people hitting this bug... and complaining, that "Linux with htb not stable". But i think it is not just fix :-( -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-01-14 at 14:17 +0200, Denys Fedoryschenko wrote: > I will try that patches too, when i got this message, after 3 minutes i got > crash of my router :-) after working around 17 hours :-( > There is 3 of them by the way, 2 fixes also. > > hrtimer: removing all ur callback modes > hrtimer: removing all ur callback modes, fix hotplug > hrtimer: removing all ur callback modes, fix I'm afraid its a bit more than that: ca109491f612aab5c8152207631c0444f63da97f 37810659ea7d9572c5ac284ade272f806ef8f788 a0a99b227da57f81319dd239bc4de811b0f530ec b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 d5fd43c4ae04523e1dcd7794f9c511b289851350 731a55ba0f17064f85903b7bf8e24849ec6cfa20 a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 e3f1d883740b09e5116d4d4e30a6a6987264a83c 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 and 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 Also, all this is rather invasive and large, so I'm not sure it will meet the -stable criteria. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 01:50:04PM +0100, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 14:17 +0200, Denys Fedoryschenko wrote: > > I will try that patches too, when i got this message, after 3 minutes i got > > crash of my router :-) after working around 17 hours :-( > > There is 3 of them by the way, 2 fixes also. > > > > hrtimer: removing all ur callback modes > > hrtimer: removing all ur callback modes, fix hotplug > > hrtimer: removing all ur callback modes, fix > > I'm afraid its a bit more than that: > > ca109491f612aab5c8152207631c0444f63da97f > 37810659ea7d9572c5ac284ade272f806ef8f788 > a0a99b227da57f81319dd239bc4de811b0f530ec > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > d5fd43c4ae04523e1dcd7794f9c511b289851350 > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > e3f1d883740b09e5116d4d4e30a6a6987264a83c > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > and > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > Also, all this is rather invasive and large, so I'm not sure it will > meet the -stable criteria. > Earlier I thought it's very hard to hit, but after Chris reported it triggers within hours, IMHO some minimal fix is necessary. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I'm afraid its a bit more than that: > > ca109491f612aab5c8152207631c0444f63da97f > 37810659ea7d9572c5ac284ade272f806ef8f788 > a0a99b227da57f81319dd239bc4de811b0f530ec > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > d5fd43c4ae04523e1dcd7794f9c511b289851350 > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > e3f1d883740b09e5116d4d4e30a6a6987264a83c > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > and > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > Also, all this is rather invasive and large, so I'm not sure it will > meet the -stable criteria. Then i will try to run 2.6.29-rc1-git4. Kind of risky to run rc1 on loaded router, but i am hitting this bug with hrtimers hardly, so no choice for me seems. And i cannot disable hrtimers because of bad shaper precision. Thanks a lot for info and patches :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 03:05:46PM +0200, Denys Fedoryschenko wrote: > > I'm afraid its a bit more than that: > > > > ca109491f612aab5c8152207631c0444f63da97f > > 37810659ea7d9572c5ac284ade272f806ef8f788 > > a0a99b227da57f81319dd239bc4de811b0f530ec > > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > > d5fd43c4ae04523e1dcd7794f9c511b289851350 > > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > > e3f1d883740b09e5116d4d4e30a6a6987264a83c > > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > > > and > > > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > > > Also, all this is rather invasive and large, so I'm not sure it will > > meet the -stable criteria. > > Then i will try to run 2.6.29-rc1-git4. Kind of risky to run rc1 on loaded > router, but i am hitting this bug with hrtimers hardly, so no choice for me > seems. And i cannot disable hrtimers because of bad shaper precision. > Thanks a lot for info and patches :-) > Denys, since these three patches worked for Chris, I guess it's safer to stay with 2.6.28 anyway (or my patches mentioned earlier in this thread). rc1 is really not stable... Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-01-14 at 13:12 +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 03:05:46PM +0200, Denys Fedoryschenko wrote: > > > I'm afraid its a bit more than that: > > > > > > ca109491f612aab5c8152207631c0444f63da97f > > > 37810659ea7d9572c5ac284ade272f806ef8f788 > > > a0a99b227da57f81319dd239bc4de811b0f530ec > > > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > > > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > > > d5fd43c4ae04523e1dcd7794f9c511b289851350 > > > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > > > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > > > e3f1d883740b09e5116d4d4e30a6a6987264a83c > > > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > > > > > and > > > > > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > > > > > Also, all this is rather invasive and large, so I'm not sure it will > > > meet the -stable criteria. > > > > Then i will try to run 2.6.29-rc1-git4. Kind of risky to run rc1 on loaded > > router, but i am hitting this bug with hrtimers hardly, so no choice for me > > seems. And i cannot disable hrtimers because of bad shaper precision. > > Thanks a lot for info and patches :-) > > > > Denys, since these three patches worked for Chris, I guess it's safer > to stay with 2.6.28 anyway (or my patches mentioned earlier in this > thread). rc1 is really not stable... Yeah, but those three patches have some holes in them still, so I cannot recommend just patching in those three. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday 14 January 2009 15:15:29 Peter Zijlstra wrote: > > > > Denys, since these three patches worked for Chris, I guess it's safer > > to stay with 2.6.28 anyway (or my patches mentioned earlier in this > > thread). rc1 is really not stable... > > Yeah, but those three patches have some holes in them still, so I cannot > recommend just patching in those three. I will test rc1-git4 on few units which is less important and have redundancy, and then will decide. Already i did this 3 patches, but i prefer complete solution. If i will hit any bug, at least i will do my best to report about this bugs and such way to help fix them :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:15:29PM +0100, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 13:12 +0000, Jarek Poplawski wrote: > > On Wed, Jan 14, 2009 at 03:05:46PM +0200, Denys Fedoryschenko wrote: > > > > I'm afraid its a bit more than that: > > > > > > > > ca109491f612aab5c8152207631c0444f63da97f > > > > 37810659ea7d9572c5ac284ade272f806ef8f788 > > > > a0a99b227da57f81319dd239bc4de811b0f530ec > > > > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > > > > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > > > > d5fd43c4ae04523e1dcd7794f9c511b289851350 > > > > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > > > > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > > > > e3f1d883740b09e5116d4d4e30a6a6987264a83c > > > > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > > > > > > > and > > > > > > > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > > > > > > > Also, all this is rather invasive and large, so I'm not sure it will > > > > meet the -stable criteria. > > > > > > Then i will try to run 2.6.29-rc1-git4. Kind of risky to run rc1 on loaded > > > router, but i am hitting this bug with hrtimers hardly, so no choice for me > > > seems. And i cannot disable hrtimers because of bad shaper precision. > > > Thanks a lot for info and patches :-) > > > > > > > Denys, since these three patches worked for Chris, I guess it's safer > > to stay with 2.6.28 anyway (or my patches mentioned earlier in this > > thread). rc1 is really not stable... > > Yeah, but those three patches have some holes in them still, so I cannot > recommend just patching in those three. > OK, I hope Denys can apply more, but what about others? Without any patches the hole seems to be much bigger. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-01-14 at 13:26 +0000, Jarek Poplawski wrote: > OK, I hope Denys can apply more, but what about others? Without any > patches the hole seems to be much bigger. OK, I read most of this thread on netdev, but didn't find a clear clue on the specific hrtimer insertion race. Do you have any clear ideas or should I poke at the htb/hrtimer code a little? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:32:26PM +0100, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 13:26 +0000, Jarek Poplawski wrote: > > > OK, I hope Denys can apply more, but what about others? Without any > > patches the hole seems to be much bigger. > > OK, I read most of this thread on netdev, but didn't find a clear clue > on the specific hrtimer insertion race. > > Do you have any clear ideas or should I poke at the htb/hrtimer code a > little? > I didn't diagnose this entirely, and there were "a few" days since I wrote my opinion on this, so currently I forgot details, but I guess the most suspicious place is adding a hrtimer, when it's on the list, maybe even two times in a row. I guess it's not removed from the rbtree in some case. (I stopped searching after seeing you removed this list or something...) Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:32:26PM +0100, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 13:26 +0000, Jarek Poplawski wrote: > > > OK, I hope Denys can apply more, but what about others? Without any > > patches the hole seems to be much bigger. > > OK, I read most of this thread on netdev, but didn't find a clear clue > on the specific hrtimer insertion race. There is something at the beginning of this thread, plus earlier threads mostly with Denys as sender, and "htb bug" in the subject. > > Do you have any clear ideas or should I poke at the htb/hrtimer code a > little? > ...And htb code is htb_dequeue(): qdisc_watchdog_schedule(). Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-01-14 at 14:13 +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 02:32:26PM +0100, Peter Zijlstra wrote: > > On Wed, 2009-01-14 at 13:26 +0000, Jarek Poplawski wrote: > > > > > OK, I hope Denys can apply more, but what about others? Without any > > > patches the hole seems to be much bigger. > > > > OK, I read most of this thread on netdev, but didn't find a clear clue > > on the specific hrtimer insertion race. > > There is something at the beginning of this thread, plus earlier > threads mostly with Denys as sender, and "htb bug" in the subject. > > > > > Do you have any clear ideas or should I poke at the htb/hrtimer code a > > little? > > > > ....And htb code is htb_dequeue(): qdisc_watchdog_schedule(). Right, found all that... Can't spot anything obviously wrong though.. hrtimer_start*() does remove_hrtimer() which checks STATE_ENQUEUED, STATE_PENDING and pulls it off the relevant list before it continues the enqueue. However a loop in enqueue_hrtimer() would suggest a corrupted RB-tree, but I cannot find an RB-op that doesn't hold base-lock. Hohumm.. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 03:28:03PM +0100, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 14:13 +0000, Jarek Poplawski wrote: > > On Wed, Jan 14, 2009 at 02:32:26PM +0100, Peter Zijlstra wrote: > > > On Wed, 2009-01-14 at 13:26 +0000, Jarek Poplawski wrote: > > > > > > > OK, I hope Denys can apply more, but what about others? Without any > > > > patches the hole seems to be much bigger. > > > > > > OK, I read most of this thread on netdev, but didn't find a clear clue > > > on the specific hrtimer insertion race. > > > > There is something at the beginning of this thread, plus earlier > > threads mostly with Denys as sender, and "htb bug" in the subject. > > > > > > > > Do you have any clear ideas or should I poke at the htb/hrtimer code a > > > little? > > > > > > > ....And htb code is htb_dequeue(): qdisc_watchdog_schedule(). > > Right, found all that... > > Can't spot anything obviously wrong though.. hrtimer_start*() does > remove_hrtimer() which checks STATE_ENQUEUED, STATE_PENDING and pulls it > off the relevant list before it continues the enqueue. > > However a loop in enqueue_hrtimer() would suggest a corrupted RB-tree, > but I cannot find an RB-op that doesn't hold base-lock. > Anyway I think some trace in rbtree seemed to show the parent is the same with the node, if I didn't mess anything. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 14 Jan 2009, Peter Zijlstra wrote: > On Wed, 2009-01-14 at 14:17 +0200, Denys Fedoryschenko wrote: > > I will try that patches too, when i got this message, after 3 minutes i got > > crash of my router :-) after working around 17 hours :-( > > There is 3 of them by the way, 2 fixes also. > > > > hrtimer: removing all ur callback modes > > hrtimer: removing all ur callback modes, fix hotplug > > hrtimer: removing all ur callback modes, fix > > I'm afraid its a bit more than that: > > ca109491f612aab5c8152207631c0444f63da97f > 37810659ea7d9572c5ac284ade272f806ef8f788 > a0a99b227da57f81319dd239bc4de811b0f530ec > b2e3c0adec918ea22b6c9d7c76193dd3aaba9bd4 > 8bdec955b0da2ffbd10eb9b200651dd1f9e366f2 > d5fd43c4ae04523e1dcd7794f9c511b289851350 > 731a55ba0f17064f85903b7bf8e24849ec6cfa20 > a6037b61c2f5fc99c57c15b26d7cfa58bbb34008 > e3f1d883740b09e5116d4d4e30a6a6987264a83c > 82c5b7b527ccc4b5d3cf832437e842f9d2920a79 > > and > > 6e5c172cf7ca1ab878cc6a6a4c1d52fef60f3ee0 > > Also, all this is rather invasive and large, so I'm not sure it will > meet the -stable criteria. In my testing (another 12 hours of uptime, BTW), the three patches I am using are: ca109491f612aab5c8152207631c0444f63da97f 37810659ea7d9572c5ac284ade272f806ef8f788 a0a99b227da57f81319dd239bc4de811b0f530ec I have not tried the other patches. That said, I would not recommend just the three for -stable unless they get a much wider amount of testing, on multiple platforms. I don't see that as likely to happen, plus Peter says they are incomplete, so maybe it is just best to recommend that 2.6.28 users getting crashes while using HTB try these specific patches at first, and then the rest of the patches if they do not work. Thanks, Chris -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 03:28:03PM +0100, Peter Zijlstra wrote: ... > Right, found all that... > > Can't spot anything obviously wrong though.. hrtimer_start*() does > remove_hrtimer() which checks STATE_ENQUEUED, STATE_PENDING and pulls it > off the relevant list before it continues the enqueue. > > However a loop in enqueue_hrtimer() would suggest a corrupted RB-tree, > but I cannot find an RB-op that doesn't hold base-lock. > I've revisited it yesterday, and if I don't miss something, there is possible a scenario similar to this: cpu1: cpu2: run_hrtimer_pending spin_unlock restart = fn(timer) hrtimer_start enqueue_hrtimer hrtimer_start remove_hrtimer (the HRTIMER_STATE_CALLBACK is removed) switch_hrtimer_base spin_lock (not this hrtimer's anymore) __remove_hrtimer list_add_tail enqueue_hrtimer Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2009-01-15 at 09:01 +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 03:28:03PM +0100, Peter Zijlstra wrote: > .... > > Right, found all that... > > > > Can't spot anything obviously wrong though.. hrtimer_start*() does > > remove_hrtimer() which checks STATE_ENQUEUED, STATE_PENDING and pulls it > > off the relevant list before it continues the enqueue. > > > > However a loop in enqueue_hrtimer() would suggest a corrupted RB-tree, > > but I cannot find an RB-op that doesn't hold base-lock. > > > > I've revisited it yesterday, and if I don't miss something, there is > possible a scenario similar to this: > > cpu1: cpu2: > > run_hrtimer_pending > spin_unlock > restart = fn(timer) > > hrtimer_start > enqueue_hrtimer > > hrtimer_start > remove_hrtimer > (the HRTIMER_STATE_CALLBACK is removed) > > switch_hrtimer_base > spin_lock > (not this hrtimer's anymore) > __remove_hrtimer > list_add_tail enqueue_hrtimer > (looking at .28 code) run_hrtimer_pending() reads like: while (pending timers) { __remove_hrtimer(timer, HRTIMER_STATE_CALLBACK); spin_unlock(&cpu_base->lock); fn(timer); spin_lock(&cpu_base->lock); timer->state &= ~HRTIMER_STATE_CALLBACK; // _should_ result in HRTIMER_STATE_INACTIVE if (HRTIMER_RESTART) re-queue else if (timer->state != INACTIVE) { // so another cpu re-queued this timer _while_ we were executing it. if (timer is first && !reprogramm) { __remove_hrtimer(timer, HRTIMER_STATE_PENDING); list_add_tail(timer, &cb_pending); } } } So in the window where we drop the lock, one can, as you said, have another cpu requeue the timer, but the rb_entry and list_entry are free, so it should not cause the data corruption we're seeing. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 15, 2009 at 11:46:48AM +0100, Peter Zijlstra wrote: > On Thu, 2009-01-15 at 09:01 +0000, Jarek Poplawski wrote: ... > > spin_lock > > (not this hrtimer's anymore) > > __remove_hrtimer > > list_add_tail enqueue_hrtimer > > > > (looking at .28 code) > > run_hrtimer_pending() reads like: > > while (pending timers) { > __remove_hrtimer(timer, HRTIMER_STATE_CALLBACK); > spin_unlock(&cpu_base->lock); > > fn(timer); > > spin_lock(&cpu_base->lock); > timer->state &= ~HRTIMER_STATE_CALLBACK; // _should_ result in HRTIMER_STATE_INACTIVE > if (HRTIMER_RESTART) > re-queue > else if (timer->state != INACTIVE) { > // so another cpu re-queued this timer _while_ we were executing it. > if (timer is first && !reprogramm) { > __remove_hrtimer(timer, HRTIMER_STATE_PENDING); > list_add_tail(timer, &cb_pending); > } > } > } > > So in the window where we drop the lock, one can, as you said, have > another cpu requeue the timer, but the rb_entry and list_entry are free, > so it should not cause the data corruption we're seeing. > Can't they be enqueued to the list (without a lock) and rbtree at the same time? Then removing is done for the list only? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff -Nurp a2.6.27.7/net/sched/sch_htb.c b2.6.27.7/net/sched/sch_htb.c --- a2.6.27.7/net/sched/sch_htb.c 2008-12-11 08:16:16.000000000 +0000 +++ b2.6.27.7/net/sched/sch_htb.c 2008-12-15 10:44:32.000000000 +0000 @@ -924,6 +924,7 @@ static struct sk_buff *htb_dequeue(struc } } sch->qstats.overlimits++; + qdisc_watchdog_cancel(&q->watchdog); qdisc_watchdog_schedule(&q->watchdog, next_event); fin: return skb;