Patchwork [REGRESSION,2.6.30,v3] sched: update load count only once per cpu in 10 tick update window

Submitter Chase Douglas
Date April 13, 2010, 11:19 p.m.
Message ID <1271200751-18697-1-git-send-email-chase.douglas@canonical.com>
Download mbox | patch
Permalink /patch/50087/
State Superseded
Delegated to: Stefan Bader
Headers show

Comments

Chase Douglas - April 13, 2010, 11:19 p.m.
There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up, calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

This change defers calc_load_tasks accounting: once a cpu has updated the
count, any further updates it makes are deferred until after the 10 tick
update window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and not be able to update the global
  calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakened by one task, but then has more scheduled before
    the end of the update window, only the first task will be accounted
    for.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)
Peter Zijlstra - April 19, 2010, 6:56 p.m.
On Mon, 2010-04-19 at 20:52 +0200, Peter Zijlstra wrote:
> 
> So the only early updates can come from
> pick_next_task_idle()->calc_load_account_active(), so why not specialize
> that callchain instead of the below? 

To clarify, when I wrote that your patch was still below.. ;-)
Chase Douglas - April 19, 2010, 8:16 p.m.
On Mon, Apr 19, 2010 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-04-13 at 16:19 -0700, Chase Douglas wrote:
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick update window.
>>
>> A few points:
>>
>> * A global atomic deferral counter, and not per-cpu vars, is needed
>>   because a cpu may go NOHZ idle and not be able to update the global
>>   calc_load_tasks variable for subsequent load calculations.
>> * It is not enough to add calls to account for the load when a cpu is
>>   awakened:
>>   - Load avg calculation must be independent of cpu load.
>>   - If a cpu is awakend by one tasks, but then has more scheduled before
>>     the end of the update window, only the first task will be accounted.
>
> OK, so what you're saying is that because we update calc_load_tasks from
> entering idle, we decrease earlier than a regular 10 tick sample
> interval would?
>
> Hence you batch these early updates into _deferred and let the next 10
> tick sample roll them over?

Correct

> So the only early updates can come from
> pick_next_task_idle()->calc_load_account_active(), so why not specialize
> that callchain instead of the below?
>
> Also, since its all NO_HZ, why not stick this in with the ILB? Once
> people get around to making that scale better, this can hitch a ride.
>
> Something like the below perhaps? It does run partially from softirq
> context, but since there's a distinct lack of synchronization here that
> didn't seem like an immediate problem.

I understand everything until you move the calc_load_account_active
call to run_rebalance_domains. I take it that when CPUs go NO_HZ idle,
at least one cpu is left to monitor and perform updates as necessary.
Conceptually, it makes sense that this cpu should be handling the load
accounting updates. However, I'm new to this code, so I'm having a
hard time understanding all the cases and timings for when the
scheduler softirq is called. Is it guaranteed to be called during
every 10 tick load update window? If not, then we'll have the issue
where a NO_HZ idle cpu won't be updated to 0 running tasks in time for
the load avg calculation.

Would someone be able to explain how we are guaranteed of the correct
timing for this path?

I also have a concern with run_rebalance_domains: If the designated
no_hz.load_balancer cpu wasn't idle at the last tick or needs
rescheduling, load accounting won't occur for idle cpus. Is it
possible for this to occur every time when called in the 10 tick
update window?

-- Chase

Patch

diff --git a/kernel/sched.c b/kernel/sched.c
index abb36b1..be348cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3010,6 +3010,7 @@  unsigned long this_cpu_load(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3064,7 +3065,7 @@  void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3072,6 +3073,25 @@  static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3106,8 +3126,8 @@  static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }