[v3?] sched: update load count only once per cpu in 10 tick update window

Submitter Chase Douglas
Date April 8, 2010, 4:02 p.m.
Message ID <1270742531-19808-1-git-send-email-chase.douglas@canonical.com>
Permalink /patch/49736/
State Accepted
Delegated to: Andy Whitcroft

Comments

Chase Douglas - April 8, 2010, 4:02 p.m.
There's a period of 10 ticks during which calc_load_tasks is updated by
all the cpus for the load avg. Usually all the cpus do this during the
first tick. If any cpus go idle, calc_load_tasks is decremented
accordingly. However, if they wake up, calc_load_tasks is not
incremented. Thus, if cpus go idle during the 10 tick period,
calc_load_tasks may be decremented to a non-representative value. This
issue can lead to systems having a load avg of exactly 0, even though
the real load avg could theoretically be up to NR_CPUS. For example, if
every cpu idles even briefly inside the window, the sampled count can
reach 0 while every cpu is busy again by the time the load avg is
computed.

This change defers calc_load_tasks accounting: after each cpu has
updated the count once, any further deltas it generates are deferred
until after the 10 tick update window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and then be unable to fold a per-cpu
  deferred delta back into the global calc_load_tasks variable in time
  for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakened by one task but then has more scheduled before
    the end of the update window, only the first task will be accounted
    for.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
---
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)
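
For context, the consumer side that turns calc_load_tasks into the
avenrun[] load averages looks roughly like this (a sketch approximating
kernel/sched.c of this era, not part of the patch; FIXED_1, FSHIFT and
the EXP_* constants are the kernel's fixed-point loadavg parameters):

/* Fixed-point exponential decay: load = load*exp + active*(1 - exp). */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/*
 * Runs every LOAD_FREQ ticks, but 10 ticks *after* calc_load_update so
 * that every cpu has had a tick in which to fold its delta into
 * calc_load_tasks first -- this is the "10 tick update window" above.
 */
void calc_global_load(void)
{
	unsigned long upd = calc_load_update + 10;
	long active;

	if (time_before(jiffies, upd))
		return;

	/* One global sample; a transiently-decremented count reads as idle. */
	active = atomic_long_read(&calc_load_tasks);
	active = active > 0 ? active * FIXED_1 : 0;

	avenrun[0] = calc_load(avenrun[0], EXP_1, active);
	avenrun[1] = calc_load(avenrun[1], EXP_5, active);
	avenrun[2] = calc_load(avenrun[2], EXP_15, active);

	calc_load_update += LOAD_FREQ;
}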
Andy Whitcroft - April 9, 2010, 2 p.m.
On Thu, Apr 08, 2010 at 12:02:11PM -0400, Chase Douglas wrote:
> There's a period of 10 ticks during which calc_load_tasks is updated by
> all the cpus for the load avg. Usually all the cpus do this during the
> first tick. If any cpus go idle, calc_load_tasks is decremented
> accordingly. However, if they wake up, calc_load_tasks is not
> incremented.
> 
> [...]

This seems to contain all the changes I requested in email and on IRC.
I believe it does what you claim, and it seems upstream thinks it's a
vile but necessary hack until they work out how to fix it properly. It
also fixes an issue someone is actually seeing. So:

Acked-by: Andy Whitcroft <apw@canonical.com>

-apw
Colin King - April 9, 2010, 4:29 p.m.
On Thu, 2010-04-08 at 12:02 -0400, Chase Douglas wrote:
> There's a period of 10 ticks during which calc_load_tasks is updated by
> all the cpus for the load avg. [...]
> 
> @@ -3072,8 +3092,8 @@ static void update_cpu_load(struct rq *this_rq)
>  	}
>  
>  	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
> -		this_rq->calc_load_update += LOAD_FREQ;
>  		calc_load_account_active(this_rq);
> +		this_rq->calc_load_update += LOAD_FREQ;
>  	}
>  }


OK - this does have some subtlety to it, that's for sure, especially
the dependency on this_rq->calc_load_update and hence the need to
re-order its update to after calc_load_account_active(). One needs to
tread carefully here.
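
To spell that dependency out (a sketch of the patched flow as I read
it): calc_load_account_active() decides whether to defer based on the
per-rq stamp, so update_cpu_load() must not advance the stamp until the
accounting has run:

	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
		/*
		 * calc_load_account_active() treats
		 *   time_before(jiffies, this_rq->calc_load_update)
		 * as "this cpu already updated in the current window"
		 * and defers the delta.  If the stamp were advanced
		 * first (the old order), every regular update would
		 * look like a repeat and be deferred, and the delta
		 * would never reach calc_load_tasks.
		 */
		calc_load_account_active(this_rq);
		this_rq->calc_load_update += LOAD_FREQ;
	}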

The description of the change, and the change itself, need some hard
thought, but I believe it does the trick as you claim.

Acked-by: Colin King <colin.king@canonical.com>
Andy Whitcroft - April 9, 2010, 4:40 p.m.
Applied to Lucid.

-apw
Chase Douglas - April 9, 2010, 5:32 p.m.
On Thu, Apr 8, 2010 at 12:02 PM, Chase Douglas
<chase.douglas@canonical.com> wrote:
> There's a period of 10 ticks during which calc_load_tasks is updated by
> all the cpus for the load avg. [...]

Now that this has been accepted for Lucid, please consider this for
Karmic as well. SRU justification can be found in bug 513848. Thanks!

-- Chase
Stefan Bader - April 12, 2010, 3:18 p.m.
Chase Douglas wrote:
> On Thu, Apr 8, 2010 at 12:02 PM, Chase Douglas
> <chase.douglas@canonical.com> wrote:
>> There's a period of 10 ticks during which calc_load_tasks is updated by
>> all the cpus for the load avg. [...]
> 
> Now that this has been accepted for Lucid, please consider this for
> Karmic as well. SRU justification can be found in bug 513848. Thanks!
> 
> -- Chase
> 
If you could just resend the patch with a Karmic header, I can then
assign it to my todo queue in Patchwork and will be less likely to
forget it. Thanks

Stefan

Patch

diff --git a/kernel/sched.c b/kernel/sched.c
index 81ede13..c372249 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2967,6 +2967,7 @@  unsigned long nr_iowait(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3021,7 +3022,7 @@  void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3029,6 +3030,25 @@  static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3072,8 +3092,8 @@  static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }