Message ID | 20081209102118.GB14862@ff.dom.local |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Jarek Poplawski wrote: > Next event time should consider jiffies used for recounting. Otherwise > qdisc_watchdog_schedule() triggers hrtimer immediately with the event > in the past, and may cause very high ksoftirqd cpu usage (if highres > is on). This patch charges jiffies globally, so we can skip this in > htb_do_events(). > > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > --- > net/sched/sch_htb.c | 14 ++++++++++++-- > 1 files changed, 12 insertions(+), 2 deletions(-) > > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c > index d6eb4a7..102866d 100644 > --- a/net/sched/sch_htb.c > +++ b/net/sched/sch_htb.c > @@ -685,8 +685,11 @@ static psched_time_t htb_do_events(struct htb_sched *q, int level) > if (cl->cmode != HTB_CAN_SEND) > htb_add_to_wait_tree(q, cl, diff); > } > - /* too much load - let's continue on next jiffie (including above) */ > - return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ; > + /* > + * Too much load - let's continue on next jiffie. > + * (Used jiffies are charged later.) > + */ > + return q->now + 1; > This (including the last patch) is really confusing - q->now doesn't contain HZ values but psched ticks. Could you describe the overall algorithm you're trying to implement please? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 09, 2008 at 11:28:44AM +0100, Patrick McHardy wrote: > Jarek Poplawski wrote: >> Next event time should consider jiffies used for recounting. Otherwise >> qdisc_watchdog_schedule() triggers hrtimer immediately with the event >> in the past, and may cause very high ksoftirqd cpu usage (if highres >> is on). This patch charges jiffies globally, so we can skip this in >> htb_do_events(). >> >> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> >> --- >> net/sched/sch_htb.c | 14 ++++++++++++-- >> 1 files changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c >> index d6eb4a7..102866d 100644 >> --- a/net/sched/sch_htb.c >> +++ b/net/sched/sch_htb.c >> @@ -685,8 +685,11 @@ static psched_time_t htb_do_events(struct htb_sched *q, int level) >> if (cl->cmode != HTB_CAN_SEND) >> htb_add_to_wait_tree(q, cl, diff); >> } >> - /* too much load - let's continue on next jiffie (including above) */ >> - return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ; >> + /* >> + * Too much load - let's continue on next jiffie. >> + * (Used jiffies are charged later.) >> + */ >> + return q->now + 1; >> > > This (including the last patch) is really confusing - q->now doesn't > contain HZ values but psched ticks. Could you describe the overall > algorithm you're trying to implement please? The algorithm is we want to "continue on the next jiffie". We know we've lost here a lot of time (~2 jiffies), and this will be added later. Since these jiffies are not precise enough wrt. psched ticks or ktime, and we will add around 2000 (for HZ 1000) psched ticks anyway this +1 here simply doesn't matter and can mean "a bit after q->now". We can try to do this more precisely with additional psched_get_time(), instead of jiffies, but my "tests" didn't show any advantage. What matters is to avoid longer scheduling with the past time. This case can probably happen only with very low rate limits for a large number of classes, and I guess they simply need some spare time, not necessarily at psched tick precision. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > On Tue, Dec 09, 2008 at 11:28:44AM +0100, Patrick McHardy wrote: >>> - /* too much load - let's continue on next jiffie (including above) */ >>> - return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ; >>> + /* >>> + * Too much load - let's continue on next jiffie. >>> + * (Used jiffies are charged later.) >>> + */ >>> + return q->now + 1; >>> >> This (including the last patch) is really confusing - q->now doesn't >> contain HZ values but psched ticks. Could you describe the overall >> algorithm you're trying to implement please? > > The algorithm is we want to "continue on the next jiffie". We know > we've lost here a lot of time (~2 jiffies), and this will be added > later. Since these jiffies are not precise enough wrt. psched ticks > or ktime, and we will add around 2000 (for HZ 1000) psched ticks > anyway this +1 here simply doesn't matter and can mean "a bit after > q->now". This might as well return q->now, no? The elapsed time is added on top later anyways. But anyways, I think both the approach and the patch are wrong. /* charge used jiffies */ start_at = jiffies - start_at; if (start_at > 0) next_event += start_at * PSCHED_TICKS_PER_SEC / HZ; What relationship does the duration it ran for has to the time it should run at again? The focus on jiffies is wrong IMO, the reason why we get high load is because the CPU can't keep up, delaying things even longer is not going to help get the work done. The only reason to look at jiffies is because its a cheap indication that it has ran for too long and we should give other tasks a change to run as well, but it should continue immediately after it did that. So all it should do is make sure that the watchdog is scheduled with a very small positive delay. As for the implementation: the increase in delay (the snippet above) is also done in the case that no packets were available for other reasons (throttling), in which case we might needlessly delay for an extra jiffie if jiffies wrapped while it tried to dequeue. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 09, 2008 at 01:25:15PM +0100, Patrick McHardy wrote: > Jarek Poplawski wrote: >> On Tue, Dec 09, 2008 at 11:28:44AM +0100, Patrick McHardy wrote: >>>> - /* too much load - let's continue on next jiffie (including above) */ >>>> - return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ; >>>> + /* >>>> + * Too much load - let's continue on next jiffie. >>>> + * (Used jiffies are charged later.) >>>> + */ >>>> + return q->now + 1; >>>> >>> This (including the last patch) is really confusing - q->now doesn't >>> contain HZ values but psched ticks. Could you describe the overall >>> algorithm you're trying to implement please? >> >> The algorithm is we want to "continue on the next jiffie". We know >> we've lost here a lot of time (~2 jiffies), and this will be added >> later. Since these jiffies are not precise enough wrt. psched ticks >> or ktime, and we will add around 2000 (for HZ 1000) psched ticks >> anyway this +1 here simply doesn't matter and can mean "a bit after >> q->now". > > This might as well return q->now, no? Yes, but IMHO it looks worse, considering the problem here (we want to avoid scheduling in the past). > The elapsed time is added > on top later anyways. But anyways, I think both the approach and > the patch are wrong. > > /* charge used jiffies */ > start_at = jiffies - start_at; > if (start_at > 0) > next_event += start_at * PSCHED_TICKS_PER_SEC / HZ; > > What relationship does the duration it ran for has to the time it > should run at again? The scheduling times won't be in the past mostly and hrtimers won't trigger too soon, but approximately around we really need and can afford a new try without stopping everything else. > The focus on jiffies is wrong IMO, the reason why we get high > load is because the CPU can't keep up, delaying things even > longer is not going to help get the work done. The only reason to > look at jiffies is because its a cheap indication that it has > ran for too long and we should give other tasks a change to run > as well, but it should continue immediately after it did that. > So all it should do is make sure that the watchdog is scheduled > with a very small positive delay. This needs additional psched_get_time(), and as I've written before there is no apparent advantage in problematic cases, but this would add more overhead for common cases. > As for the implementation: the increase in delay (the snippet > above) is also done in the case that no packets were available > for other reasons (throttling), in which case we might needlessly > delay for an extra jiffie if jiffies wrapped while it tried to > dequeue. But in another similar cases there could be no change in jiffies, but almost a jiffie used for counting, so wrong schedule time as well. Approximatly this all should be fine, and it still can be tuned later. IMHO, this all should not affect "common" cases, which are expected to use less then jiffie here. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > On Tue, Dec 09, 2008 at 01:25:15PM +0100, Patrick McHardy wrote: >> Jarek Poplawski wrote: >>> The algorithm is we want to "continue on the next jiffie". We know >>> we've lost here a lot of time (~2 jiffies), and this will be added >>> later. Since these jiffies are not precise enough wrt. psched ticks >>> or ktime, and we will add around 2000 (for HZ 1000) psched ticks >>> anyway this +1 here simply doesn't matter and can mean "a bit after >>> q->now". >> This might as well return q->now, no? > > Yes, but IMHO it looks worse, considering the problem here (we want to > avoid scheduling in the past). I guess its a matter of taste. >> The elapsed time is added >> on top later anyways. But anyways, I think both the approach and >> the patch are wrong. >> >> /* charge used jiffies */ >> start_at = jiffies - start_at; >> if (start_at > 0) >> next_event += start_at * PSCHED_TICKS_PER_SEC / HZ; >> >> What relationship does the duration it ran for has to the time it >> should run at again? > > The scheduling times won't be in the past mostly and hrtimers won't > trigger too soon, but approximately around we really need and can > afford a new try without stopping everything else. Sure. But it also won't be in the past if we simply add .. lets say the current uptime in ms. My point was that there's absolutely no relationship between those two times and combining them just to get a value thats not in the past is wrong. Especially considering *why* we want a value in the future and what we'll get from that calculation. >> The focus on jiffies is wrong IMO, the reason why we get high >> load is because the CPU can't keep up, delaying things even >> longer is not going to help get the work done. The only reason to >> look at jiffies is because its a cheap indication that it has >> ran for too long and we should give other tasks a change to run >> as well, but it should continue immediately after it did that. >> So all it should do is make sure that the watchdog is scheduled >> with a very small positive delay. > > This needs additional psched_get_time(), and as I've written before > there is no apparent advantage in problematic cases, but this would > add more overhead for common cases. htb_do_events() exceeding two jiffies is fortunately not a common case. You (incorrectly) made the calculation somewhat of a common case by also adding to the delay if the inner classes simply throttled and already returned the exact delay they want. Much better (again considering what we want to achieve here) would be to not use the hrtimer watchdog at all. We want to give lower priority tasks a chance to run, so ideally we'd use a low priority task for wakeup. >> As for the implementation: the increase in delay (the snippet >> above) is also done in the case that no packets were available >> for other reasons (throttling), in which case we might needlessly >> delay for an extra jiffie if jiffies wrapped while it tried to >> dequeue. > > But in another similar cases there could be no change in jiffies, but > almost a jiffie used for counting, so wrong schedule time as well. Its not "wrong". We don't want to delay. Its a courtesy to the remaining system. > Approximatly this all should be fine, and it still can be tuned later. > IMHO, this all should not affect "common" cases, which are expected to > use less then jiffie here. Jiffies might wrap even if it only took only a few nanoseconds. And its not fine, in the case of throttled classes there's no reason to add extra delay *at all*. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
To David Miller: David don't apply yet - this patch needs change. Patrick, read below: On Tue, Dec 09, 2008 at 02:20:00PM +0100, Patrick McHardy wrote: > Jarek Poplawski wrote: >> On Tue, Dec 09, 2008 at 01:25:15PM +0100, Patrick McHardy wrote: >>> Jarek Poplawski wrote: >>>> The algorithm is we want to "continue on the next jiffie". We know >>>> we've lost here a lot of time (~2 jiffies), and this will be added >>>> later. Since these jiffies are not precise enough wrt. psched ticks >>>> or ktime, and we will add around 2000 (for HZ 1000) psched ticks >>>> anyway this +1 here simply doesn't matter and can mean "a bit after >>>> q->now". >>> This might as well return q->now, no? >> >> Yes, but IMHO it looks worse, considering the problem here (we want to >> avoid scheduling in the past). > > I guess its a matter of taste. Exactly. (And could be changed.) > >>> The elapsed time is added >>> on top later anyways. But anyways, I think both the approach and >>> the patch are wrong. >>> >>> /* charge used jiffies */ >>> start_at = jiffies - start_at; >>> if (start_at > 0) >>> next_event += start_at * PSCHED_TICKS_PER_SEC / HZ; >>> >>> What relationship does the duration it ran for has to the time it >>> should run at again? >> >> The scheduling times won't be in the past mostly and hrtimers won't >> trigger too soon, but approximately around we really need and can >> afford a new try without stopping everything else. > > Sure. But it also won't be in the past if we simply add .. lets say > the current uptime in ms. My point was that there's absolutely no > relationship between those two times and combining them just to > get a value thats not in the past is wrong. Especially considering > *why* we want a value in the future and what we'll get from that > calculation. > >>> The focus on jiffies is wrong IMO, the reason why we get high >>> load is because the CPU can't keep up, delaying things even >>> longer is not going to help get the work done. The only reason to >>> look at jiffies is because its a cheap indication that it has >>> ran for too long and we should give other tasks a change to run >>> as well, but it should continue immediately after it did that. >>> So all it should do is make sure that the watchdog is scheduled >>> with a very small positive delay. >> >> This needs additional psched_get_time(), and as I've written before >> there is no apparent advantage in problematic cases, but this would >> add more overhead for common cases. > > htb_do_events() exceeding two jiffies is fortunately not a common > case. You (incorrectly) made the calculation somewhat of a common > case by also adding to the delay if the inner classes simply throttled > and already returned the exact delay they want. I see! You are right and this needs fixing. > Much better (again considering what we want to achieve here) would > be to not use the hrtimer watchdog at all. We want to give lower > priority tasks a chance to run, so ideally we'd use a low priority > task for wakeup. I'm not sure I get ot right: for precise scheduling hrtimers look useful. Do you really mean "at all"? > >>> As for the implementation: the increase in delay (the snippet >>> above) is also done in the case that no packets were available >>> for other reasons (throttling), in which case we might needlessly >>> delay for an extra jiffie if jiffies wrapped while it tried to >>> dequeue. >> >> But in another similar cases there could be no change in jiffies, but >> almost a jiffie used for counting, so wrong schedule time as well. > > Its not "wrong". We don't want to delay. Its a courtesy to the > remaining system. In this case it's probably self-courtesy too: this ksoftirqd takes most of the time and it's useless. > >> Approximatly this all should be fine, and it still can be tuned later. >> IMHO, this all should not affect "common" cases, which are expected to >> use less then jiffie here. > > Jiffies might wrap even if it only took only a few nanoseconds. > And its not fine, in the case of throttled classes there's no > reason to add extra delay *at all*. Yes, you are right with this. I can try too fix this tomorrow, unless you prefer to send your version of this patch. Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > To David Miller: David don't apply yet - this patch needs change. > > Patrick, read below: > > On Tue, Dec 09, 2008 at 02:20:00PM +0100, Patrick McHardy wrote: >>>> >>>> The focus on jiffies is wrong IMO, the reason why we get high >>>> load is because the CPU can't keep up, delaying things even >>>> longer is not going to help get the work done. The only reason to >>>> look at jiffies is because its a cheap indication that it has >>>> ran for too long and we should give other tasks a change to run >>>> as well, but it should continue immediately after it did that. >>>> So all it should do is make sure that the watchdog is scheduled >>>> with a very small positive delay. >>>> >>> This needs additional psched_get_time(), and as I've written before >>> there is no apparent advantage in problematic cases, but this would >>> add more overhead for common cases. >>> >> htb_do_events() exceeding two jiffies is fortunately not a common >> case. You (incorrectly) made the calculation somewhat of a common >> case by also adding to the delay if the inner classes simply throttled >> and already returned the exact delay they want. > > I see! You are right and this needs fixing. > >> Much better (again considering what we want to achieve here) would >> be to not use the hrtimer watchdog at all. We want to give lower >> priority tasks a chance to run, so ideally we'd use a low priority >> task for wakeup. > > I'm not sure I get ot right: for precise scheduling hrtimers look > useful. Do you really mean "at all"? I meant "at all" for the wakeup after we've decided HTB has too much work to do at once. A work queue seems better suited since that makes sure we allow other processes to run, but don't wait unnecessarily long when there is no other work. >>>> As for the implementation: the increase in delay (the snippet >>>> above) is also done in the case that no packets were available >>>> for other reasons (throttling), in which case we might needlessly >>>> delay for an extra jiffie if jiffies wrapped while it tried to >>>> dequeue. >>>> >>> But in another similar cases there could be no change in jiffies, but >>> almost a jiffie used for counting, so wrong schedule time as well. >> Its not "wrong". We don't want to delay. Its a courtesy to the >> remaining system. > > In this case it's probably self-courtesy too: this ksoftirqd takes > most of the time and it's useless. Well, it calls back to HTB, which continues to do real work. But leaving HTB, scheduling a timer just to be called immediately again is useless overhead, I agree. >>> Approximatly this all should be fine, and it still can be tuned later. >>> IMHO, this all should not affect "common" cases, which are expected to >>> use less then jiffie here. >>> >> Jiffies might wrap even if it only took only a few nanoseconds. >> And its not fine, in the case of throttled classes there's no >> reason to add extra delay *at all*. > > Yes, you are right with this. I can try too fix this tomorrow, unless > you prefer to send your version of this patch. I don't have a version of my own, so please go ahead :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 9 Dec 2008 14:45:34 +0000 > To David Miller: David don't apply yet - this patch needs change. What I've done is apply patches 5 and 6 since those are fine and independent of these timer issues. So once you work this stuff out please resubmit the rest. Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 09, 2008 at 10:35:45PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 9 Dec 2008 14:45:34 +0000 > > > To David Miller: David don't apply yet - this patch needs change. > > What I've done is apply patches 5 and 6 since those are > fine and independent of these timer issues. Very nice of you, but IMHO patches 1 and 4 are OK too. I'm withdrawing patches 2 and 3, and will try to discuss this with Patrick more, but this could stay like this too. > So once you work this stuff out please resubmit the rest. Of course, I can resubmit 1 and 4 if you think it's really needed. Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jarek Poplawski <jarkao2@gmail.com> Date: Wed, 10 Dec 2008 09:11:26 +0000 > On Tue, Dec 09, 2008 at 10:35:45PM -0800, David Miller wrote: > > So once you work this stuff out please resubmit the rest. > > Of course, I can resubmit 1 and 4 if you think it's really needed. Please do, there were dependencies. In fact patch 2 changes the very change you made in patch 1. I understand it's logical steps, but you could just do 1 and 2 in the same patch and it'd be OK. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c index d6eb4a7..102866d 100644 --- a/net/sched/sch_htb.c +++ b/net/sched/sch_htb.c @@ -685,8 +685,11 @@ static psched_time_t htb_do_events(struct htb_sched *q, int level) if (cl->cmode != HTB_CAN_SEND) htb_add_to_wait_tree(q, cl, diff); } - /* too much load - let's continue on next jiffie (including above) */ - return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ; + /* + * Too much load - let's continue on next jiffie. + * (Used jiffies are charged later.) + */ + return q->now + 1; } /* Returns class->node+prio from id-tree where classe's id is >= id. NULL @@ -845,6 +848,7 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch) struct htb_sched *q = qdisc_priv(sch); int level; psched_time_t next_event; + unsigned long start_at; /* try to dequeue direct packets as high prio (!) to minimize cpu work */ skb = __skb_dequeue(&q->direct_queue); @@ -857,6 +861,7 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch) if (!sch->q.qlen) goto fin; q->now = psched_get_time(); + start_at = jiffies; next_event = q->now + 5 * PSCHED_TICKS_PER_SEC; @@ -889,6 +894,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch) } } sch->qstats.overlimits++; + /* charge used jiffies */ + start_at = jiffies - start_at; + if (start_at > 0) + next_event += start_at * PSCHED_TICKS_PER_SEC / HZ; + qdisc_watchdog_schedule(&q->watchdog, next_event); fin: return skb;
Next event time should consider jiffies used for recounting. Otherwise qdisc_watchdog_schedule() triggers hrtimer immediately with the event in the past, and may cause very high ksoftirqd cpu usage (if highres is on). This patch charges jiffies globally, so we can skip this in htb_do_events(). Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/sched/sch_htb.c | 14 ++++++++++++-- 1 files changed, 12 insertions(+), 2 deletions(-)