| Message ID | 1270742531-19808-1-git-send-email-chase.douglas@canonical.com |
|---|---|
| State | Accepted |
| Delegated to | Andy Whitcroft |
On Thu, Apr 08, 2010 at 12:02:11PM -0400, Chase Douglas wrote:

> [patch snipped; the full patch and commit message appear below]

This seems to contain all the changes I requested in email and on IRC. I
believe it does what you claim, and it seems upstream thinks it's a vile
but necessary hack until they work out how to fix it properly. It also
fixes an issue someone is seeing. So:

Acked-by: Andy Whitcroft <apw@canonical.com>

-apw
On Thu, 2010-04-08 at 12:02 -0400, Chase Douglas wrote:

> [patch snipped; the full patch and commit message appear below]

OK - this does have some subtlety to it, that's for sure, especially the
dependency on this_rq->calc_load_update and hence the need to reorder its
update after calc_load_account_active. One needs to tread carefully here.
Both the description of the change and the change itself need some hard
thought, but I believe it does the trick as you claim.

Acked-by: Colin King <colin.king@canonical.com>
Applied to Lucid.

-apw
On Thu, Apr 8, 2010 at 12:02 PM, Chase Douglas
<chase.douglas@canonical.com> wrote:

> [patch snipped; the full patch and commit message appear below]

Now that this has been accepted for Lucid, please consider this for
Karmic as well. SRU justification can be found in bug 513848. Thanks!

-- Chase
Chase Douglas wrote:
> On Thu, Apr 8, 2010 at 12:02 PM, Chase Douglas
> <chase.douglas@canonical.com> wrote:
>
>> [patch snipped; the full patch and commit message appear below]
>
> Now that this has been accepted for Lucid, please consider this for
> Karmic as well. SRU justification can be found in bug 513848. Thanks!
>
> -- Chase

If you could just resend the patch with a karmic header, then I can
assign it to my todo queue in patchworks and will be less likely to
forget it.

Thanks,
Stefan
diff --git a/kernel/sched.c b/kernel/sched.c
index 81ede13..c372249 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2967,6 +2967,7 @@ unsigned long nr_iowait(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3021,7 +3022,7 @@ void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3029,6 +3030,25 @@ static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3072,8 +3092,8 @@ static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }
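A note on the final hunk (our reading of the patch, not text from the
thread): the ordering is load-bearing. calc_load_account_active() treats
"jiffies is before this cpu's calc_load_update deadline but inside the
global window" as "this cpu has already contributed its sample" and defers
the delta. Had the deadline been bumped before the call, as in the old
code, that test would already hold on the once-per-window tick itself, so
every delta would land in the deferred counter and calc_load_tasks would
never be updated directly. In simplified form (a sketch; plain comparisons
stand in for the kernel's time_after_eq() and ignore jiffies wraparound):

if (jiffies >= this_rq->calc_load_update) {
	calc_load_account_active(this_rq);	/* runs while this cpu's update is still due */
	this_rq->calc_load_update += LOAD_FREQ;	/* only then push the deadline out */
}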
There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

This change defers calc_load_tasks accounting, after each cpu updates
the count, until after the 10 tick update window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and not be able to update the global
  calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakened by one task, but then has more scheduled before
    the end of the update window, only the first task will be accounted.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
---
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)
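For readers skimming the thread, the deferral logic can be modelled
outside the kernel. The sketch below is a simplified user-space rendering,
not kernel code: jiffies, LOAD_FREQ, struct rq and the C11 atomics are
stand-ins for the kernel's definitions, and the plain comparisons ignore
the jiffies wraparound that time_before()/time_after_eq() handle.

#include <stdatomic.h>

#define LOAD_FREQ 10				/* sampling window length, in ticks */

static atomic_long calc_load_tasks;		/* value sampled for the load avg */
static atomic_long calc_load_tasks_deferred;	/* deltas parked until window end */
static unsigned long calc_load_update;		/* global start of the current window */
static unsigned long jiffies;			/* simulated tick counter */

struct rq {
	long nr_running;		/* stands in for running + uninterruptible */
	long calc_load_active;		/* this cpu's last contribution */
	unsigned long calc_load_update;	/* this cpu's next once-per-window deadline */
};

static void calc_load_account_active(struct rq *rq)
{
	long nr_active = rq->nr_running;
	long delta;

	if (nr_active == rq->calc_load_active)
		return;

	delta = nr_active - rq->calc_load_active;
	rq->calc_load_active = nr_active;

	/*
	 * This cpu already contributed its once-per-window sample (its
	 * deadline is in the future) while the window is still open:
	 * park the delta so the sampled value is undisturbed.
	 */
	if (jiffies < rq->calc_load_update && jiffies >= calc_load_update) {
		if (delta)
			atomic_fetch_add(&calc_load_tasks_deferred, delta);
		return;
	}

	/* Outside the window: fold any parked deltas back in. */
	delta += atomic_exchange(&calc_load_tasks_deferred, 0);
	atomic_fetch_add(&calc_load_tasks, delta);
}

The key design point, per the commit message's first bullet, is the single
global deferred counter: a cpu that goes NOHZ idle mid-window stops
ticking, so per-cpu deferred state could sit unfolded indefinitely,
whereas any cpu can fold the shared counter back in once the window
closes.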