Message ID | alpine.LFD.2.00.0907091401040.24704@localhost.localdomain |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > > > I have the feeling that the code relies on some implicit cpu > > > boundness, which is not longer guaranteed with the timer migration > > > changes, but that's a question for the network experts. > > > > As a matter of fact, I've just looked at this __netif_schedule(), > > which really is cpu bound, so you might be 100% right. > > So the watchdog is the one which causes the trouble. The patch below > should fix this. I hope so. On the other hand it seems it should work with this migration yet, so it probably needs additional debugging. Thanks, Jarek P. > --- > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 24d17ce..fbe554f 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -485,7 +485,7 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires) > wd->qdisc->flags |= TCQ_F_THROTTLED; > time = ktime_set(0, 0); > time = ktime_add_ns(time, PSCHED_TICKS2NS(expires)); > - hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS); > + hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS_PINNED); > } > EXPORT_SYMBOL(qdisc_watchdog_schedule); > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 9 Jul 2009, Jarek Poplawski wrote: > On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > > > > > I have the feeling that the code relies on some implicit cpu > > > > boundness, which is not longer guaranteed with the timer migration > > > > changes, but that's a question for the network experts. > > > > > > As a matter of fact, I've just looked at this __netif_schedule(), > > > which really is cpu bound, so you might be 100% right. > > > > So the watchdog is the one which causes the trouble. The patch below > > should fix this. > > I hope so. On the other hand it seems it should work with this > migration yet, so it probably needs additional debugging. Right. I just provided the patch to narrow down the problem, but please test the fix of the hrtimer migration code which I sent out a bit earlier: http://lkml.org/lkml/2009/7/9/150 It fixes a possible endless loop in the timer code which is related to the migration changes. Looking at the backtraces of the spinlock lockup I think that is what you hit. spin_lock(root_lock); qdisc_run(q); __qdisc_run(q); dequeue_skb(q); q->dequeue(q); qdisc_watchdog_schedule(); hrtimer_start(); switch_hrtimer_base(); <- loops forever Now the other CPU is stuck in dev_xmit() spin_lock(root_lock) Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2009 at 04:15:28PM +0200, Thomas Gleixner wrote: > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: > > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > > > > > > > I have the feeling that the code relies on some implicit cpu > > > > > boundness, which is not longer guaranteed with the timer migration > > > > > changes, but that's a question for the network experts. > > > > > > > > As a matter of fact, I've just looked at this __netif_schedule(), > > > > which really is cpu bound, so you might be 100% right. > > > > > > So the watchdog is the one which causes the trouble. The patch below > > > should fix this. > > > > I hope so. On the other hand it seems it should work with this > > migration yet, so it probably needs additional debugging. > > Right. I just provided the patch to narrow down the problem, but > please test the fix of the hrtimer migration code which I sent out a > bit earlier: http://lkml.org/lkml/2009/7/9/150 > > It fixes a possible endless loop in the timer code which is related to > the migration changes. Looking at the backtraces of the spinlock > lockup I think that is what you hit. Actually, Andres and Joao hit this, and I hope they'll try these two patches. Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 9, 2009 at 3:24 PM, Jarek Poplawski<jarkao2@gmail.com> wrote: > On Thu, Jul 09, 2009 at 04:15:28PM +0200, Thomas Gleixner wrote: >> On Thu, 9 Jul 2009, Jarek Poplawski wrote: >> > On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: >> > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: >> > > > > >> > > > > I have the feeling that the code relies on some implicit cpu >> > > > > boundness, which is not longer guaranteed with the timer migration >> > > > > changes, but that's a question for the network experts. >> > > > >> > > > As a matter of fact, I've just looked at this __netif_schedule(), >> > > > which really is cpu bound, so you might be 100% right. >> > > >> > > So the watchdog is the one which causes the trouble. The patch below >> > > should fix this. >> > >> > I hope so. On the other hand it seems it should work with this >> > migration yet, so it probably needs additional debugging. >> >> Right. I just provided the patch to narrow down the problem, but >> please test the fix of the hrtimer migration code which I sent out a >> bit earlier: http://lkml.org/lkml/2009/7/9/150 >> >> It fixes a possible endless loop in the timer code which is related to >> the migration changes. Looking at the backtraces of the spinlock >> lockup I think that is what you hit. > > Actually, Andres and Joao hit this, and I hope they'll try these two > patches. > > Thanks, > Jarek P. > I can only try later on today. Will post back as soon as i do it. Joao Correia -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 9 Jul 2009, Jarek Poplawski wrote: > On Thu, Jul 09, 2009 at 04:15:28PM +0200, Thomas Gleixner wrote: > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: > > > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > > > > > > > > > I have the feeling that the code relies on some implicit cpu > > > > > > boundness, which is not longer guaranteed with the timer migration > > > > > > changes, but that's a question for the network experts. > > > > > > > > > > As a matter of fact, I've just looked at this __netif_schedule(), > > > > > which really is cpu bound, so you might be 100% right. > > > > > > > > So the watchdog is the one which causes the trouble. The patch below > > > > should fix this. > > > > > > I hope so. On the other hand it seems it should work with this > > > migration yet, so it probably needs additional debugging. > > > > Right. I just provided the patch to narrow down the problem, but > > please test the fix of the hrtimer migration code which I sent out a > > bit earlier: http://lkml.org/lkml/2009/7/9/150 > > > > It fixes a possible endless loop in the timer code which is related to > > the migration changes. Looking at the backtraces of the spinlock > > lockup I think that is what you hit. > > Actually, Andres and Joao hit this, and I hope they'll try these two > patches. Please test them separate from each other. The one I sent in this thread was just for narrowing down the issue, but I'm now quite sure that they really hit the issue which is addressed by the hrtimer patch. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Thursday 09 July 2009 16:28:05 Thomas Gleixner wrote: > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > On Thu, Jul 09, 2009 at 04:15:28PM +0200, Thomas Gleixner wrote: > > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > On Thu, Jul 09, 2009 at 02:03:50PM +0200, Thomas Gleixner wrote: > > > > > On Thu, 9 Jul 2009, Jarek Poplawski wrote: > > > > > > > I have the feeling that the code relies on some implicit cpu > > > > > > > boundness, which is not longer guaranteed with the timer > > > > > > > migration changes, but that's a question for the network > > > > > > > experts. > > > > > > > > > > > > As a matter of fact, I've just looked at this __netif_schedule(), > > > > > > which really is cpu bound, so you might be 100% right. > > > > > > > > > > So the watchdog is the one which causes the trouble. The patch > > > > > below should fix this. > > > > > > > > I hope so. On the other hand it seems it should work with this > > > > migration yet, so it probably needs additional debugging. > > > > > > Right. I just provided the patch to narrow down the problem, but > > > please test the fix of the hrtimer migration code which I sent out a > > > bit earlier: http://lkml.org/lkml/2009/7/9/150 > > > > > > It fixes a possible endless loop in the timer code which is related to > > > the migration changes. Looking at the backtraces of the spinlock > > > lockup I think that is what you hit. > > > > Actually, Andres and Joao hit this, and I hope they'll try these two > > patches. > > Please test them separate from each other. The one I sent in this > thread was just for narrowing down the issue, but I'm now quite sure > that they really hit the issue which is addressed by the hrtimer > patch. No crash yet. 15min running (seconds to a minute before). Will let it run for some hours to be sure. Nice! Andres -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andres, On Thu, 9 Jul 2009, Andres Freund wrote: > On Thursday 09 July 2009 16:28:05 Thomas Gleixner wrote: > > Please test them separate from each other. The one I sent in this > > thread was just for narrowing down the issue, but I'm now quite sure > > that they really hit the issue which is addressed by the hrtimer > > patch. > No crash yet. 15min running (seconds to a minute before). > > Will let it run for some hours to be sure. Which one of the patches are you testing ? Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 09 July 2009 18:01:56 Thomas Gleixner wrote: > Andres, > > On Thu, 9 Jul 2009, Andres Freund wrote: > > On Thursday 09 July 2009 16:28:05 Thomas Gleixner wrote: > > > Please test them separate from each other. The one I sent in this > > > thread was just for narrowing down the issue, but I'm now quite sure > > > that they really hit the issue which is addressed by the hrtimer > > > patch. > > > > No crash yet. 15min running (seconds to a minute before). > > > > Will let it run for some hours to be sure. > > Which one of the patches are you testing ? Your "real" one, i.e. de907e8432b08f2d5966c36e0747e97c0e596810 1h30m now... Andres -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andres, On Thu, 9 Jul 2009, Andres Freund wrote: > On Thursday 09 July 2009 18:01:56 Thomas Gleixner wrote: > > Andres, > > > > On Thu, 9 Jul 2009, Andres Freund wrote: > > > On Thursday 09 July 2009 16:28:05 Thomas Gleixner wrote: > > > > Please test them separate from each other. The one I sent in this > > > > thread was just for narrowing down the issue, but I'm now quite sure > > > > that they really hit the issue which is addressed by the hrtimer > > > > patch. > > > > > > No crash yet. 15min running (seconds to a minute before). > > > > > > Will let it run for some hours to be sure. > > > > Which one of the patches are you testing ? > Your "real" one, i.e. de907e8432b08f2d5966c36e0747e97c0e596810 > > 1h30m now... Looks like I hit the nail on the head. Queueing it up for Linus. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 9, 2009 at 6:44 PM, Thomas Gleixner<tglx@linutronix.de> wrote: > Andres, > > On Thu, 9 Jul 2009, Andres Freund wrote: > >> On Thursday 09 July 2009 18:01:56 Thomas Gleixner wrote: >> > Andres, >> > >> > On Thu, 9 Jul 2009, Andres Freund wrote: >> > > On Thursday 09 July 2009 16:28:05 Thomas Gleixner wrote: >> > > > Please test them separate from each other. The one I sent in this >> > > > thread was just for narrowing down the issue, but I'm now quite sure >> > > > that they really hit the issue which is addressed by the hrtimer >> > > > patch. >> > > >> > > No crash yet. 15min running (seconds to a minute before). >> > > >> > > Will let it run for some hours to be sure. >> > >> > Which one of the patches are you testing ? >> Your "real" one, i.e. de907e8432b08f2d5966c36e0747e97c0e596810 >> >> 1h30m now... > > Looks like I hit the nail on the head. Queueing it up for Linus. > > Thanks, > > tglx > Confirmed working from me as well. Thank you for your time, Joao Correia -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 24d17ce..fbe554f 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -485,7 +485,7 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires) wd->qdisc->flags |= TCQ_F_THROTTLED; time = ktime_set(0, 0); time = ktime_add_ns(time, PSCHED_TICKS2NS(expires)); - hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS); + hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS_PINNED); } EXPORT_SYMBOL(qdisc_watchdog_schedule);