
[Karmic,v3] sched: update load count only once per cpu in 10 tick update window

Message ID 1271771163-2779-1-git-send-email-chase.douglas@canonical.com
State Accepted
Delegated to: Stefan Bader

Commit Message

Chase Douglas April 20, 2010, 1:46 p.m. UTC
There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up, calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

With this change, once a cpu has updated the count, any further
calc_load_tasks accounting from that cpu is deferred until after the 10
tick update window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and not be able to update the global
  calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakened by one task but has more scheduled before the
    end of the update window, only the first task will be accounted for.
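
To make the deferral concrete, the following is a simplified sketch of
calc_load_account_active() as it looks with the patch below applied. It is
slightly restructured: in_update_window() is only a stand-in for the two
jiffies checks in the actual diff, and the unconditional xchg simplifies the
read-then-xchg in the patch.

/* Global count sampled every LOAD_FREQ by calc_global_load(). */
static atomic_long_t calc_load_tasks;
/* Deltas that arrive while the 10 tick sample window is still open. */
static atomic_long_t calc_load_tasks_deferred;

static void calc_load_account_active(struct rq *this_rq)
{
	long nr_active, delta, deferred;

	nr_active = this_rq->nr_running + (long)this_rq->nr_uninterruptible;
	if (nr_active == this_rq->calc_load_active)
		return;

	delta = nr_active - this_rq->calc_load_active;
	this_rq->calc_load_active = nr_active;

	/*
	 * in_update_window() stands in for the jiffies checks in the real
	 * patch: the global sample window is open, but this cpu has already
	 * folded its count for the current period.  Park the delta so the
	 * pending sample is not skewed by cpus going idle.
	 */
	if (in_update_window(this_rq)) {
		if (delta)
			atomic_long_add(delta, &calc_load_tasks_deferred);
		return;
	}

	/* Normal per-period fold: include anything parked earlier. */
	deferred = atomic_long_xchg(&calc_load_tasks_deferred, 0);
	atomic_long_add(delta + deferred, &calc_load_tasks);
}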

BugLink: http://bugs.launchpad.net/bugs/513848

Stable Release Update Justification:

Impact of bug: On lightly loaded systems with specific workload
characteristics, the load average may not be representative of the real
load. With a specific test case it is possible to load a system with
NR_CPUS tasks and yet see a load avg of 0.00, though this is unlikely to
occur in real workloads.

How addressed: The attached patch reworks the load accounting mechanism
in the kernel scheduler. It ensures that the accounting depends strictly
on time (i.e. a snapshot taken every 5 seconds) and on the number of
runnable and uninterruptible tasks at that time. Previously, the
accounting also depended on whether a cpu went idle shortly after the
5 second snapshot.
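
For reference, the 5 second snapshots feed a fixed-point exponential
average; the update path in kernel/sched.c of this era looks essentially
like the following, which is why a snapshot that has been decremented to 0
drags the reported load avg toward 0.00:

#define FSHIFT		11			/* bits of precision */
#define FIXED_1		(1 << FSHIFT)		/* 1.0 in fixed point */
#define LOAD_FREQ	(5*HZ+1)		/* 5 second sample interval */
#define EXP_1		1884			/* 1/exp(5sec/1min) in fixed point */
#define EXP_5		2014			/* 1/exp(5sec/5min) */
#define EXP_15		2037			/* 1/exp(5sec/15min) */

static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/*
 * calc_global_load() reads the calc_load_tasks snapshot, scales it by
 * FIXED_1 and folds it into the three averages:
 *
 *	avenrun[0] = calc_load(avenrun[0], EXP_1,  active);
 *	avenrun[1] = calc_load(avenrun[1], EXP_5,  active);
 *	avenrun[2] = calc_load(avenrun[2], EXP_15, active);
 */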

Reproduction: See attached reproduction test case. Run it once on a
non-loaded system (boot to rescue mode works well). Top will report the
cpu usage at 90%, but uptime will report a load avg near 0.00 instead of
at least 0.90 as expected.
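
The attached test case itself is not reproduced here. Purely as an
illustration of the timing pattern involved (the numbers below are made up,
not taken from the attachment), a userspace loop along these lines keeps a
cpu busy roughly 90% of the time while repeatedly napping; whenever a nap
lands inside the 10 tick window right after a load sample, the pre-patch
accounting cancels that cpu's contribution:

#include <time.h>

int main(void)
{
	const struct timespec nap = { 0, 10 * 1000 * 1000 };	/* 10 ms */

	for (;;) {
		struct timespec start, now;

		/* ~90 ms of busy work: keeps cpu usage high in top */
		clock_gettime(CLOCK_MONOTONIC, &start);
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000000000L +
			 (now.tv_nsec - start.tv_nsec) < 90 * 1000 * 1000L);

		/* brief idle period: may fall inside the update window */
		nanosleep(&nap, NULL);
	}
}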

Regression potential: The patch has been well received by senior
Ubuntu kernel team members and some of the upstream kernel maintainers
on lkml. For this reason it is assumed to be a good fix for this issue.
The only code path touched by this patch involves the load avg
accounting, so potential regressions could include incorrect load avg
and/or some unforeseen general bug like a null dereference. However, the
likelihood of either is minimal due to proper and thorough patch review.

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

Comments

Stefan Bader April 20, 2010, 6:14 p.m. UTC | #1
I think I am a little confused by the other discussions going on about this.
It sounds a bit like the moment you submitted this, Peter responded with yet
another idea. Or am I just confused?

-Stefan
Chase Douglas April 20, 2010, 6:28 p.m. UTC | #2
On Tue, Apr 20, 2010 at 2:14 PM, Stefan Bader
<stefan.bader@canonical.com> wrote:
> I think I am a little confused by the other discussions going on about this.
> It sounds a bit like the moment you submitted this, Peter responded with yet
> another idea. Or am I just confused?

Peter Zijlstra has another idea about using the idle load balancing
mechanism to do load accounting. However, his idea is so far not fully
thought out, and not tested at all I presume. I pointed out a few
concerns with his approach, and he agreed there are issues that would
need to be worked around. Right now, the only tested and reasonable
solution is my patch. I'm waiting to hear back from him on why he
feels using the ILB mechanism is a better approach, but I've also made
clear that I don't understand the code and can't create a solution
myself. I think you really need to understand ILB deep down to do it
right.

I think it makes sense to put this in Karmic and leave the patch in
Lucid for now until upstream decides its final direction.

P.S.: I CC'd kernel-team cause I figured we all would like to see the
discussion, but Peter un-CC'd it cause of bounces. Is there a proper
way to do this, or should we just not CC our list?

-- Chase
Tim Gardner April 20, 2010, 7:14 p.m. UTC | #3
On 04/20/2010 12:28 PM, Chase Douglas wrote:

> P.S.: I CC'd kernel-team cause I figured we all would like to see the
> discussion, but Peter un-CC'd it cause of bounces. Is there a proper
> way to do this, or should we just not CC our list?
>
> -- Chase
>

I turned off the option in the list serv that annoys unsubscribed email 
submitters.

rtg
Stefan Bader April 20, 2010, 10:10 p.m. UTC | #4
Based on previous review and the current state of the discussion, I have
applied this to Karmic master.

Patch

diff --git a/kernel/sched.c b/kernel/sched.c
index 81ede13..c372249 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2967,6 +2967,7 @@  unsigned long nr_iowait(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3021,7 +3022,7 @@  void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3029,6 +3030,25 @@  static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3072,8 +3092,8 @@  static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }