[REGRESSION,2.6.30,v3] sched: update load count only once per cpu in 10 tick update window

Message ID 1271703130.1676.214.camel@laptop
State Not Applicable
Delegated to: Stefan Bader
Headers show

Commit Message

Peter Zijlstra April 19, 2010, 6:52 p.m.
On Tue, 2010-04-13 at 16:19 -0700, Chase Douglas wrote:
> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
> This change defers calc_load_tasks accounting after each cpu updates the
> count until after the 10 tick update window.
> A few points:
> * A global atomic deferral counter, and not per-cpu vars, is needed
>   because a cpu may go NOHZ idle and not be able to update the global
>   calc_load_tasks variable for subsequent load calculations.
> * It is not enough to add calls to account for the load when a cpu is
>   awakened:
>   - Load avg calculation must be independent of cpu load.
>   - If a cpu is awakend by one tasks, but then has more scheduled before
>     the end of the update window, only the first task will be accounted.

OK, so what you're saying is that because we update calc_load_tasks from
entering idle, we decrease earlier than a regular 10 tick sample
interval would?

Hence you batch these early updates into _deferred and let the next 10
tick sample roll them over?

So the only early updates can come from
pick_next_task_idle()->calc_load_account_active(), so why not specialize
that callchain instead of the below?

Also, since its all NO_HZ, why not stick this in with the ILB? Once
people get around to making that scale better, this can hitch a ride.

Something like the below perhaps? It does run partially from softirq
context, but since there's a distinct lack of synchronization here that
didn't seem like an immediate problem.

 kernel/sched.c          |   10 ++++++----
 kernel/sched_fair.c     |    4 +++-
 kernel/sched_idletask.c |    2 --
 3 files changed, 9 insertions(+), 7 deletions(-)


diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..cdd4d8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2959,6 +2959,11 @@  static void calc_load_account_active(struct rq *this_rq)
 	long nr_active, delta;
+	if (!time_after_eq(jiffies, this_rq->calc_load_update))
+		return;
+	this_rq->calc_load_update += LOAD_FREQ;
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -2998,10 +3003,7 @@  static void update_cpu_load(struct rq *this_rq)
 		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
-	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
-		calc_load_account_active(this_rq);
-	}
+	calc_load_account_active(this_rq);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 88d3053..2c267ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3394,9 +3394,11 @@  static void run_rebalance_domains(struct softirq_action *h)
 			if (need_resched())
+			rq = cpu_rq(balance_cpu);
+			calc_load_account_active(rq);
 			rebalance_domains(balance_cpu, CPU_IDLE);
-			rq = cpu_rq(balance_cpu);
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index bea2b8f..6ca191f 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,8 +23,6 @@  static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 static struct task_struct *pick_next_task_idle(struct rq *rq)
 	schedstat_inc(rq, sched_goidle);
-	/* adjust the active tasks as we might go into a long sleep */
-	calc_load_account_active(rq);
 	return rq->idle;