Patchwork [084/152] sched: Cure more NO_HZ load average woes

login
register
mail settings
Submitter Chase Douglas
Date Jan. 6, 2011, 3:09 p.m.
Message ID <4D25DB3A.9000301@canonical.com>
Download mbox | patch
Permalink /patch/77726/
State New
Headers show

Comments

Chase Douglas - Jan. 6, 2011, 3:09 p.m.
Hi all,

I received this notification of a stable patch for .36 that should fix
the load avg bugs once and for all. A recap:

I found a bug in the load avg calculation and got a fix pushed upstream.
This was thrown into lucid and maverick. Unfortunately, it caused a
regression for our xen kernels, so it was removed from maverick ec2
IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
master:

74f5187ac873042f502227701ed1727e7c5fbfa9

I believe this patch should be reenabled for all lucid and maverick
kernels, and the following patch should be applied on top. I'm not sure
how everything is falling out with the new stable queue process, so I'm
forwarding this to the list just to be sure it's seen.

Thanks!

-- Chase

-------- Original Message --------
Subject: [084/152] sched: Cure more NO_HZ load average woes
Date: Wed, 05 Jan 2011 16:23:03 -0800
From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
CC: stable-review@kernel.org, torvalds@linux-foundation.org,
akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,        Peter
Zijlstra <a.p.zijlstra@chello.nl>,        Chase Douglas
<chase.douglas@canonical.com>,        Ingo Molnar <mingo@elte.hu>

2.6.36-stable review patch.  If anyone has any objections, please let us
know.

------------------

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream.

There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.

Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high, he also noted that
even with that patch reverted the load avgerage numbers are not
correct.

The problem is that the previous patch only solved half the NO_HZ
problem, it addressed the part of going into NO_HZ mode, not of
comming out of NO_HZ mode. This patch implements that missing half.

When comming out of NO_HZ mode there are two important things to take
care of:

 - Folding the pending idle delta into the global active count.
 - Correctly aging the averages for the idle-duration.

So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.

Reported-by: Damien Wyart <damien.wyart@free.fr>
Reported-by: Tim McGrath <tmhikaru@gmail.com>
Tested-by: Damien Wyart <damien.wyart@free.fr>
Tested-by: Orion Poplawski <orion@cora.nwra.com>
Tested-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 include/linux/sched.h |    2
 kernel/sched.c        |  150
++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/timer.c        |    2
 3 files changed, 141 insertions(+), 13 deletions(-)
John Johansen - Jan. 6, 2011, 6:58 p.m.
On 01/06/2011 07:09 AM, Chase Douglas wrote:
> Hi all,
> 
> I received this notification of a stable patch for .36 that should fix
> the load avg bugs once and for all. A recap:
> 
> I found a bug in the load avg calculation and got a fix pushed upstream.
> This was thrown into lucid and maverick. Unfortunately, it caused a
> regression for our xen kernels, so it was removed from maverick ec2
> IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
> master:
> 
> 74f5187ac873042f502227701ed1727e7c5fbfa9
> 
> I believe this patch should be reenabled for all lucid and maverick
> kernels, and the following patch should be applied on top. I'm not sure
> how everything is falling out with the new stable queue process, so I'm
> forwarding this to the list just to be sure it's seen.
> 
We'll this shouldn't really make a difference for the ec2 kernels that it
was reverted for because those kernels are Hz based.  When we reverted
the patch I looked closely at the difference between just reverting and
applying the upstream version of the patch, which had extra conditionals
to correctly handle the Hz based kernel case, and they were functionally
equivalent.

We ended up just reverting as it was one less patch that was different on
the topic branch to deal with during rebases.  If we are going to apply
this no hz fix to the EC2 kernel then we should, apply the upstream patch
and then this.

In fact we should make sure its the upstream version of Chase's patch
on all our kernels, before applying this fix.

As for the patch it self, it looks good and jives with what I suspected
when I was looking at it for the Lucid EC2 kernel.

Acked-by: John Johansen <john.johansen@canonical.com>


> Thanks!
> 
> -- Chase
> 
> -------- Original Message --------
> Subject: [084/152] sched: Cure more NO_HZ load average woes
> Date: Wed, 05 Jan 2011 16:23:03 -0800
> From: Greg KH <gregkh@suse.de>
> To: linux-kernel@vger.kernel.org, stable@kernel.org
> CC: stable-review@kernel.org, torvalds@linux-foundation.org,
> akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,        Peter
> Zijlstra <a.p.zijlstra@chello.nl>,        Chase Douglas
> <chase.douglas@canonical.com>,        Ingo Molnar <mingo@elte.hu>
> 
> 2.6.36-stable review patch.  If anyone has any objections, please let us
> know.
> 
> ------------------
> 
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream.
> 
> There's a long-running regression that proved difficult to fix and
> which is hitting certain people and is rather annoying in its effects.
> 
> Damien reported that after 74f5187ac8 (sched: Cure load average vs
> NO_HZ woes) his load average is unnaturally high, he also noted that
> even with that patch reverted the load avgerage numbers are not
> correct.
> 
> The problem is that the previous patch only solved half the NO_HZ
> problem, it addressed the part of going into NO_HZ mode, not of
> comming out of NO_HZ mode. This patch implements that missing half.
> 
> When comming out of NO_HZ mode there are two important things to take
> care of:
> 
>  - Folding the pending idle delta into the global active count.
>  - Correctly aging the averages for the idle-duration.
> 
> So with this patch the NO_HZ interaction should be complete and
> behaviour between CONFIG_NO_HZ=[yn] should be equivalent.
> 
> Furthermore, this patch slightly changes the load average computation
> by adding a rounding term to the fixed point multiplication.
> 
> Reported-by: Damien Wyart <damien.wyart@free.fr>
> Reported-by: Tim McGrath <tmhikaru@gmail.com>
> Tested-by: Damien Wyart <damien.wyart@free.fr>
> Tested-by: Orion Poplawski <orion@cora.nwra.com>
> Tested-by: Kyle McMartin <kyle@mcmartin.ca>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Chase Douglas <chase.douglas@canonical.com>
> LKML-Reference: <1291129145.32004.874.camel@laptop>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> 
> ---
>  include/linux/sched.h |    2
>  kernel/sched.c        |  150
> ++++++++++++++++++++++++++++++++++++++++++++++----
>  kernel/timer.c        |    2
>  3 files changed, 141 insertions(+), 13 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int c
>  extern unsigned long this_cpu_load(void);
> 
> 
> -extern void calc_global_load(void);
> +extern void calc_global_load(unsigned long ticks);
> 
>  extern unsigned long get_parent_ip(unsigned long addr);
> 
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2967,6 +2967,15 @@ static long calc_load_fold_active(struct
>  	return delta;
>  }
> 
> +static unsigned long
> +calc_load(unsigned long load, unsigned long exp, unsigned long active)
> +{
> +	load *= exp;
> +	load += active * (FIXED_1 - exp);
> +	load += 1UL << (FSHIFT - 1);
> +	return load >> FSHIFT;
> +}
> +
>  #ifdef CONFIG_NO_HZ
>  /*
>   * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
> @@ -2996,6 +3005,128 @@ static long calc_load_fold_idle(void)
> 
>  	return delta;
>  }
> +
> +/**
> + * fixed_power_int - compute: x^n, in O(log n) time
> + *
> + * @x:         base of the power
> + * @frac_bits: fractional bits of @x
> + * @n:         power to raise @x to.
> + *
> + * By exploiting the relation between the definition of the natural power
> + * function: x^n := x*x*...*x (x multiplied by itself for n times), and
> + * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
> + * (where: n_i \elem {0, 1}, the binary vector representing n),
> + * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
> + * of course trivially computable in O(log_2 n), the length of our binary
> + * vector.
> + */
> +static unsigned long
> +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> +{
> +	unsigned long result = 1UL << frac_bits;
> +
> +	if (n) for (;;) {
> +		if (n & 1) {
> +			result *= x;
> +			result += 1UL << (frac_bits - 1);
> +			result >>= frac_bits;
> +		}
> +		n >>= 1;
> +		if (!n)
> +			break;
> +		x *= x;
> +		x += 1UL << (frac_bits - 1);
> +		x >>= frac_bits;
> +	}
> +
> +	return result;
> +}
> +
> +/*
> + * a1 = a0 * e + a * (1 - e)
> + *
> + * a2 = a1 * e + a * (1 - e)
> + *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
> + *    = a0 * e^2 + a * (1 - e) * (1 + e)
> + *
> + * a3 = a2 * e + a * (1 - e)
> + *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
> + *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
> + *
> + *  ...
> + *
> + * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
> + *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
> + *    = a0 * e^n + a * (1 - e^n)
> + *
> + * [1] application of the geometric series:
> + *
> + *              n         1 - x^(n+1)
> + *     S_n := \Sum x^i = -------------
> + *             i=0          1 - x
> + */
> +static unsigned long
> +calc_load_n(unsigned long load, unsigned long exp,
> +	    unsigned long active, unsigned int n)
> +{
> +
> +	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
> +}
> +
> +/*
> + * NO_HZ can leave us missing all per-cpu ticks calling
> + * calc_load_account_active(), but since an idle CPU folds its delta into
> + * calc_load_tasks_idle per calc_load_account_idle(), all we need to do
> is fold
> + * in the pending idle delta if our idle period crossed a load cycle
> boundary.
> + *
> + * Once we've updated the global active value, we need to apply the
> exponential
> + * weights adjusted to the number of cycles missed.
> + */
> +static void calc_global_nohz(unsigned long ticks)
> +{
> +	long delta, active, n;
> +
> +	if (time_before(jiffies, calc_load_update))
> +		return;
> +
> +	/*
> +	 * If we crossed a calc_load_update boundary, make sure to fold
> +	 * any pending idle changes, the respective CPUs might have
> +	 * missed the tick driven calc_load_account_active() update
> +	 * due to NO_HZ.
> +	 */
> +	delta = calc_load_fold_idle();
> +	if (delta)
> +		atomic_long_add(delta, &calc_load_tasks);
> +
> +	/*
> +	 * If we were idle for multiple load cycles, apply them.
> +	 */
> +	if (ticks >= LOAD_FREQ) {
> +		n = ticks / LOAD_FREQ;
> +
> +		active = atomic_long_read(&calc_load_tasks);
> +		active = active > 0 ? active * FIXED_1 : 0;
> +
> +		avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> +		avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> +		avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
> +
> +		calc_load_update += n * LOAD_FREQ;
> +	}
> +
> +	/*
> +	 * Its possible the remainder of the above division also crosses
> +	 * a LOAD_FREQ period, the regular check in calc_global_load()
> +	 * which comes after this will take care of that.
> +	 *
> +	 * Consider us being 11 ticks before a cycle completion, and us
> +	 * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
> +	 * age us 4 cycles, and the test in calc_global_load() will
> +	 * pick up the final one.
> +	 */
> +}
>  #else
>  static void calc_load_account_idle(struct rq *this_rq)
>  {
> @@ -3005,6 +3136,10 @@ static inline long calc_load_fold_idle(v
>  {
>  	return 0;
>  }
> +
> +static void calc_global_nohz(unsigned long ticks)
> +{
> +}
>  #endif
> 
>  /**
> @@ -3022,24 +3157,17 @@ void get_avenrun(unsigned long *loads, u
>  	loads[2] = (avenrun[2] + offset) << shift;
>  }
> 
> -static unsigned long
> -calc_load(unsigned long load, unsigned long exp, unsigned long active)
> -{
> -	load *= exp;
> -	load += active * (FIXED_1 - exp);
> -	return load >> FSHIFT;
> -}
> -
>  /*
>   * calc_load - update the avenrun load estimates 10 ticks after the
>   * CPUs have updated calc_load_tasks.
>   */
> -void calc_global_load(void)
> +void calc_global_load(unsigned long ticks)
>  {
> -	unsigned long upd = calc_load_update + 10;
>  	long active;
> 
> -	if (time_before(jiffies, upd))
> +	calc_global_nohz(ticks);
> +
> +	if (time_before(jiffies, calc_load_update + 10))
>  		return;
> 
>  	active = atomic_long_read(&calc_load_tasks);
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1322,7 +1322,7 @@ void do_timer(unsigned long ticks)
>  {
>  	jiffies_64 += ticks;
>  	update_wall_time();
> -	calc_global_load();
> +	calc_global_load(ticks);
>  }
> 
>  #ifdef __ARCH_WANT_SYS_ALARM
> 
> 
>
Stefan Bader - Jan. 24, 2011, 2:50 p.m.
On 01/06/2011 04:09 PM, Chase Douglas wrote:
> Hi all,
> 
> I received this notification of a stable patch for .36 that should fix
> the load avg bugs once and for all. A recap:
> 
> I found a bug in the load avg calculation and got a fix pushed upstream.
> This was thrown into lucid and maverick. Unfortunately, it caused a
> regression for our xen kernels, so it was removed from maverick ec2
> IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
> master:
> 
> 74f5187ac873042f502227701ed1727e7c5fbfa9
> 
> I believe this patch should be reenabled for all lucid and maverick
> kernels, and the following patch should be applied on top. I'm not sure
> how everything is falling out with the new stable queue process, so I'm
> forwarding this to the list just to be sure it's seen.
> 
> Thanks!
> 

I must admit that the maths are a bit beyond my understanding. Though given that
the first half is in Maverick and the second went as a stable update for .36,
this seems to be the right thing to do.

Acked-by: Stefan Bader <stefan.bader@canonical.com>

This has now also been reported as a bug

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/706592

Secondary question would be for Lucid. As the patch did not cleanly apply there,
I checked what currently is in Lucid and it seems the first part is
not there. But another patch:

commit 0d843425672f4d2dc99b9004409aae503ef4d39f
Author: Chase Douglas <chase.douglas@canonical.com>
Date:   Thu Apr 8 12:02:11 2010 -0400

    sched: update load count only once per cpu in 10 tick update window

This does not show up upstream and I think I remember vaguely that this one was
replaced by the first upstream patch that is in Maverick.

So I guess the action required there would be to revert the Ubuntu specific
patch and apply both halfs of the upstream solution. Still it would be quite
nice to have some way of verification. Has anybody already some sort of testcase
for this?

-Stefan

> -- Chase
> 
> -------- Original Message --------
> Subject: [084/152] sched: Cure more NO_HZ load average woes
> Date: Wed, 05 Jan 2011 16:23:03 -0800
> From: Greg KH <gregkh@suse.de>
> To: linux-kernel@vger.kernel.org, stable@kernel.org
> CC: stable-review@kernel.org, torvalds@linux-foundation.org,
> akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,        Peter
> Zijlstra <a.p.zijlstra@chello.nl>,        Chase Douglas
> <chase.douglas@canonical.com>,        Ingo Molnar <mingo@elte.hu>
> 
> 2.6.36-stable review patch.  If anyone has any objections, please let us
> know.
> 
> ------------------
> 
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream.
> 
> There's a long-running regression that proved difficult to fix and
> which is hitting certain people and is rather annoying in its effects.
> 
> Damien reported that after 74f5187ac8 (sched: Cure load average vs
> NO_HZ woes) his load average is unnaturally high, he also noted that
> even with that patch reverted the load avgerage numbers are not
> correct.
> 
> The problem is that the previous patch only solved half the NO_HZ
> problem, it addressed the part of going into NO_HZ mode, not of
> comming out of NO_HZ mode. This patch implements that missing half.
> 
> When comming out of NO_HZ mode there are two important things to take
> care of:
> 
>  - Folding the pending idle delta into the global active count.
>  - Correctly aging the averages for the idle-duration.
> 
> So with this patch the NO_HZ interaction should be complete and
> behaviour between CONFIG_NO_HZ=[yn] should be equivalent.
> 
> Furthermore, this patch slightly changes the load average computation
> by adding a rounding term to the fixed point multiplication.
> 
> Reported-by: Damien Wyart <damien.wyart@free.fr>
> Reported-by: Tim McGrath <tmhikaru@gmail.com>
> Tested-by: Damien Wyart <damien.wyart@free.fr>
> Tested-by: Orion Poplawski <orion@cora.nwra.com>
> Tested-by: Kyle McMartin <kyle@mcmartin.ca>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Chase Douglas <chase.douglas@canonical.com>
> LKML-Reference: <1291129145.32004.874.camel@laptop>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> 
> ---
>  include/linux/sched.h |    2
>  kernel/sched.c        |  150
> ++++++++++++++++++++++++++++++++++++++++++++++----
>  kernel/timer.c        |    2
>  3 files changed, 141 insertions(+), 13 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int c
>  extern unsigned long this_cpu_load(void);
> 
> 
> -extern void calc_global_load(void);
> +extern void calc_global_load(unsigned long ticks);
> 
>  extern unsigned long get_parent_ip(unsigned long addr);
> 
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2967,6 +2967,15 @@ static long calc_load_fold_active(struct
>  	return delta;
>  }
> 
> +static unsigned long
> +calc_load(unsigned long load, unsigned long exp, unsigned long active)
> +{
> +	load *= exp;
> +	load += active * (FIXED_1 - exp);
> +	load += 1UL << (FSHIFT - 1);
> +	return load >> FSHIFT;
> +}
> +
>  #ifdef CONFIG_NO_HZ
>  /*
>   * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
> @@ -2996,6 +3005,128 @@ static long calc_load_fold_idle(void)
> 
>  	return delta;
>  }
> +
> +/**
> + * fixed_power_int - compute: x^n, in O(log n) time
> + *
> + * @x:         base of the power
> + * @frac_bits: fractional bits of @x
> + * @n:         power to raise @x to.
> + *
> + * By exploiting the relation between the definition of the natural power
> + * function: x^n := x*x*...*x (x multiplied by itself for n times), and
> + * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
> + * (where: n_i \elem {0, 1}, the binary vector representing n),
> + * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
> + * of course trivially computable in O(log_2 n), the length of our binary
> + * vector.
> + */
> +static unsigned long
> +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> +{
> +	unsigned long result = 1UL << frac_bits;
> +
> +	if (n) for (;;) {
> +		if (n & 1) {
> +			result *= x;
> +			result += 1UL << (frac_bits - 1);
> +			result >>= frac_bits;
> +		}
> +		n >>= 1;
> +		if (!n)
> +			break;
> +		x *= x;
> +		x += 1UL << (frac_bits - 1);
> +		x >>= frac_bits;
> +	}
> +
> +	return result;
> +}
> +
> +/*
> + * a1 = a0 * e + a * (1 - e)
> + *
> + * a2 = a1 * e + a * (1 - e)
> + *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
> + *    = a0 * e^2 + a * (1 - e) * (1 + e)
> + *
> + * a3 = a2 * e + a * (1 - e)
> + *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
> + *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
> + *
> + *  ...
> + *
> + * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
> + *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
> + *    = a0 * e^n + a * (1 - e^n)
> + *
> + * [1] application of the geometric series:
> + *
> + *              n         1 - x^(n+1)
> + *     S_n := \Sum x^i = -------------
> + *             i=0          1 - x
> + */
> +static unsigned long
> +calc_load_n(unsigned long load, unsigned long exp,
> +	    unsigned long active, unsigned int n)
> +{
> +
> +	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
> +}
> +
> +/*
> + * NO_HZ can leave us missing all per-cpu ticks calling
> + * calc_load_account_active(), but since an idle CPU folds its delta into
> + * calc_load_tasks_idle per calc_load_account_idle(), all we need to do
> is fold
> + * in the pending idle delta if our idle period crossed a load cycle
> boundary.
> + *
> + * Once we've updated the global active value, we need to apply the
> exponential
> + * weights adjusted to the number of cycles missed.
> + */
> +static void calc_global_nohz(unsigned long ticks)
> +{
> +	long delta, active, n;
> +
> +	if (time_before(jiffies, calc_load_update))
> +		return;
> +
> +	/*
> +	 * If we crossed a calc_load_update boundary, make sure to fold
> +	 * any pending idle changes, the respective CPUs might have
> +	 * missed the tick driven calc_load_account_active() update
> +	 * due to NO_HZ.
> +	 */
> +	delta = calc_load_fold_idle();
> +	if (delta)
> +		atomic_long_add(delta, &calc_load_tasks);
> +
> +	/*
> +	 * If we were idle for multiple load cycles, apply them.
> +	 */
> +	if (ticks >= LOAD_FREQ) {
> +		n = ticks / LOAD_FREQ;
> +
> +		active = atomic_long_read(&calc_load_tasks);
> +		active = active > 0 ? active * FIXED_1 : 0;
> +
> +		avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> +		avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> +		avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
> +
> +		calc_load_update += n * LOAD_FREQ;
> +	}
> +
> +	/*
> +	 * Its possible the remainder of the above division also crosses
> +	 * a LOAD_FREQ period, the regular check in calc_global_load()
> +	 * which comes after this will take care of that.
> +	 *
> +	 * Consider us being 11 ticks before a cycle completion, and us
> +	 * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
> +	 * age us 4 cycles, and the test in calc_global_load() will
> +	 * pick up the final one.
> +	 */
> +}
>  #else
>  static void calc_load_account_idle(struct rq *this_rq)
>  {
> @@ -3005,6 +3136,10 @@ static inline long calc_load_fold_idle(v
>  {
>  	return 0;
>  }
> +
> +static void calc_global_nohz(unsigned long ticks)
> +{
> +}
>  #endif
> 
>  /**
> @@ -3022,24 +3157,17 @@ void get_avenrun(unsigned long *loads, u
>  	loads[2] = (avenrun[2] + offset) << shift;
>  }
> 
> -static unsigned long
> -calc_load(unsigned long load, unsigned long exp, unsigned long active)
> -{
> -	load *= exp;
> -	load += active * (FIXED_1 - exp);
> -	return load >> FSHIFT;
> -}
> -
>  /*
>   * calc_load - update the avenrun load estimates 10 ticks after the
>   * CPUs have updated calc_load_tasks.
>   */
> -void calc_global_load(void)
> +void calc_global_load(unsigned long ticks)
>  {
> -	unsigned long upd = calc_load_update + 10;
>  	long active;
> 
> -	if (time_before(jiffies, upd))
> +	calc_global_nohz(ticks);
> +
> +	if (time_before(jiffies, calc_load_update + 10))
>  		return;
> 
>  	active = atomic_long_read(&calc_load_tasks);
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1322,7 +1322,7 @@ void do_timer(unsigned long ticks)
>  {
>  	jiffies_64 += ticks;
>  	update_wall_time();
> -	calc_global_load();
> +	calc_global_load(ticks);
>  }
> 
>  #ifdef __ARCH_WANT_SYS_ALARM
> 
> 
>
Chase Douglas - Jan. 24, 2011, 3 p.m.
On 01/24/2011 09:50 AM, Stefan Bader wrote:
> On 01/06/2011 04:09 PM, Chase Douglas wrote:
>> Hi all,
>>
>> I received this notification of a stable patch for .36 that should fix
>> the load avg bugs once and for all. A recap:
>>
>> I found a bug in the load avg calculation and got a fix pushed upstream.
>> This was thrown into lucid and maverick. Unfortunately, it caused a
>> regression for our xen kernels, so it was removed from maverick ec2
>> IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
>> master:
>>
>> 74f5187ac873042f502227701ed1727e7c5fbfa9
>>
>> I believe this patch should be reenabled for all lucid and maverick
>> kernels, and the following patch should be applied on top. I'm not sure
>> how everything is falling out with the new stable queue process, so I'm
>> forwarding this to the list just to be sure it's seen.
>>
>> Thanks!
>>
> 
> I must admit that the maths are a bit beyond my understanding. Though given that
> the first half is in Maverick and the second went as a stable update for .36,
> this seems to be the right thing to do.
> 
> Acked-by: Stefan Bader <stefan.bader@canonical.com>
> 
> This has now also been reported as a bug
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/706592
> 
> Secondary question would be for Lucid. As the patch did not cleanly apply there,
> I checked what currently is in Lucid and it seems the first part is
> not there. But another patch:
> 
> commit 0d843425672f4d2dc99b9004409aae503ef4d39f
> Author: Chase Douglas <chase.douglas@canonical.com>
> Date:   Thu Apr 8 12:02:11 2010 -0400
> 
>     sched: update load count only once per cpu in 10 tick update window
> 
> This does not show up upstream and I think I remember vaguely that this one was
> replaced by the first upstream patch that is in Maverick.
> 
> So I guess the action required there would be to revert the Ubuntu specific
> patch and apply both halfs of the upstream solution. Still it would be quite
> nice to have some way of verification. Has anybody already some sort of testcase
> for this?

I wrote a testcase for the original bug:

http://lkml.org/lkml/2010/3/29/170

As for the second bug, I'm not sure what a good testcase is.

-- Chase
Stefan Bader - Jan. 25, 2011, 11:08 a.m.
On 01/24/2011 04:00 PM, Chase Douglas wrote:
> On 01/24/2011 09:50 AM, Stefan Bader wrote:
>> On 01/06/2011 04:09 PM, Chase Douglas wrote:
>>> Hi all,
>>>
>>> I received this notification of a stable patch for .36 that should fix
>>> the load avg bugs once and for all. A recap:
>>>
>>> I found a bug in the load avg calculation and got a fix pushed upstream.
>>> This was thrown into lucid and maverick. Unfortunately, it caused a
>>> regression for our xen kernels, so it was removed from maverick ec2
>>> IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
>>> master:
>>>
>>> 74f5187ac873042f502227701ed1727e7c5fbfa9
>>>
>>> I believe this patch should be reenabled for all lucid and maverick
>>> kernels, and the following patch should be applied on top. I'm not sure
>>> how everything is falling out with the new stable queue process, so I'm
>>> forwarding this to the list just to be sure it's seen.
>>>
>>> Thanks!
>>>
>>
>> I must admit that the maths are a bit beyond my understanding. Though given that
>> the first half is in Maverick and the second went as a stable update for .36,
>> this seems to be the right thing to do.
>>
>> Acked-by: Stefan Bader <stefan.bader@canonical.com>
>>
>> This has now also been reported as a bug
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/706592
>>
>> Secondary question would be for Lucid. As the patch did not cleanly apply there,
>> I checked what currently is in Lucid and it seems the first part is
>> not there. But another patch:
>>
>> commit 0d843425672f4d2dc99b9004409aae503ef4d39f
>> Author: Chase Douglas <chase.douglas@canonical.com>
>> Date:   Thu Apr 8 12:02:11 2010 -0400
>>
>>     sched: update load count only once per cpu in 10 tick update window
>>
>> This does not show up upstream and I think I remember vaguely that this one was
>> replaced by the first upstream patch that is in Maverick.
>>
>> So I guess the action required there would be to revert the Ubuntu specific
>> patch and apply both halfs of the upstream solution. Still it would be quite
>> nice to have some way of verification. Has anybody already some sort of testcase
>> for this?
> 
> I wrote a testcase for the original bug:
> 
> http://lkml.org/lkml/2010/3/29/170
> 
> As for the second bug, I'm not sure what a good testcase is.
> 
> -- Chase

I played around with the test case this morning and while I think it seems to be
ok for the initial problem (task at cpu% 90 and loadavg 0), I am not sure about
the results with a Lucid kernel having the old fix reverted and the two upstream
halves added. The load rises when running the test case, but never really to 0.9
and even with that test task running and nothing else going on the 1min average
moves between 0.8 and 0.5. That could be misunderstanding of what the loadavg
really means or something else. I just don't feel like it is giving me any real
confidence in any direction at the moment.

-Stefan

Patch

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@  extern unsigned long nr_iowait_cpu(int c
 extern unsigned long this_cpu_load(void);


-extern void calc_global_load(void);
+extern void calc_global_load(unsigned long ticks);

 extern unsigned long get_parent_ip(unsigned long addr);

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2967,6 +2967,15 @@  static long calc_load_fold_active(struct
 	return delta;
 }

+static unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+	load *= exp;
+	load += active * (FIXED_1 - exp);
+	load += 1UL << (FSHIFT - 1);
+	return load >> FSHIFT;
+}
+
 #ifdef CONFIG_NO_HZ
 /*
  * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
@@ -2996,6 +3005,128 @@  static long calc_load_fold_idle(void)

 	return delta;
 }
+
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x:         base of the power
+ * @frac_bits: fractional bits of @x
+ * @n:         power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+	unsigned long result = 1UL << frac_bits;
+
+	if (n) for (;;) {
+		if (n & 1) {
+			result *= x;
+			result += 1UL << (frac_bits - 1);
+			result >>= frac_bits;
+		}
+		n >>= 1;
+		if (!n)
+			break;
+		x *= x;
+		x += 1UL << (frac_bits - 1);
+		x >>= frac_bits;
+	}
+
+	return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ *    = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ *  ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ *    = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ *              n         1 - x^(n+1)
+ *     S_n := \Sum x^i = -------------
+ *             i=0          1 - x
+ */
+static unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+	    unsigned long active, unsigned int n)
+{
+
+	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
+/*
+ * NO_HZ can leave us missing all per-cpu ticks calling
+ * calc_load_account_active(), but since an idle CPU folds its delta into
+ * calc_load_tasks_idle per calc_load_account_idle(), all we need to do
is fold
+ * in the pending idle delta if our idle period crossed a load cycle
boundary.
+ *
+ * Once we've updated the global active value, we need to apply the
exponential
+ * weights adjusted to the number of cycles missed.
+ */
+static void calc_global_nohz(unsigned long ticks)
+{
+	long delta, active, n;
+
+	if (time_before(jiffies, calc_load_update))
+		return;
+
+	/*
+	 * If we crossed a calc_load_update boundary, make sure to fold
+	 * any pending idle changes, the respective CPUs might have
+	 * missed the tick driven calc_load_account_active() update
+	 * due to NO_HZ.
+	 */
+	delta = calc_load_fold_idle();
+	if (delta)
+		atomic_long_add(delta, &calc_load_tasks);
+
+	/*
+	 * If we were idle for multiple load cycles, apply them.
+	 */
+	if (ticks >= LOAD_FREQ) {
+		n = ticks / LOAD_FREQ;
+
+		active = atomic_long_read(&calc_load_tasks);
+		active = active > 0 ? active * FIXED_1 : 0;
+
+		avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+		avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+		avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+
+		calc_load_update += n * LOAD_FREQ;
+	}
+
+	/*
+	 * Its possible the remainder of the above division also crosses
+	 * a LOAD_FREQ period, the regular check in calc_global_load()
+	 * which comes after this will take care of that.
+	 *
+	 * Consider us being 11 ticks before a cycle completion, and us
+	 * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
+	 * age us 4 cycles, and the test in calc_global_load() will
+	 * pick up the final one.
+	 */
+}
 #else
 static void calc_load_account_idle(struct rq *this_rq)
 {
@@ -3005,6 +3136,10 @@  static inline long calc_load_fold_idle(v
 {
 	return 0;
 }
+
+static void calc_global_nohz(unsigned long ticks)
+{
+}
 #endif

 /**
@@ -3022,24 +3157,17 @@  void get_avenrun(unsigned long *loads, u
 	loads[2] = (avenrun[2] + offset) << shift;
 }

-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
-	load *= exp;
-	load += active * (FIXED_1 - exp);
-	return load >> FSHIFT;
-}
-
 /*
  * calc_load - update the avenrun load estimates 10 ticks after the
  * CPUs have updated calc_load_tasks.
  */
-void calc_global_load(void)
+void calc_global_load(unsigned long ticks)
 {
-	unsigned long upd = calc_load_update + 10;
 	long active;

-	if (time_before(jiffies, upd))
+	calc_global_nohz(ticks);
+
+	if (time_before(jiffies, calc_load_update + 10))
 		return;

 	active = atomic_long_read(&calc_load_tasks);
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1322,7 +1322,7 @@  void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_wall_time();
-	calc_global_load();
+	calc_global_load(ticks);
 }

 #ifdef __ARCH_WANT_SYS_ALARM