From patchwork Mon Apr 19 18:52:10 2010
Subject: Re: [REGRESSION 2.6.30][PATCH v3] sched: update load count only
 once per cpu in 10 tick update window
From: Peter Zijlstra
To: Chase Douglas
Cc: Andrew Morton, linux-kernel@vger.kernel.org, "Rafael J. Wysocki",
 kernel-team, Thomas Gleixner, Ingo Molnar
Date: Mon, 19 Apr 2010 20:52:10 +0200
Message-ID: <1271703130.1676.214.camel@laptop>
In-Reply-To: <1271200751-18697-1-git-send-email-chase.douglas@canonical.com>
References: <1271200751-18697-1-git-send-email-chase.douglas@canonical.com>

On Tue, 2010-04-13 at 16:19 -0700, Chase Douglas wrote:
> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up, calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
>
> This change defers the calc_load_tasks accounting done by each cpu
> until after the 10 tick update window.
>
> A few points:
>
> * A global atomic deferral counter, and not per-cpu vars, is needed
>   because a cpu may go NOHZ idle and not be able to update the global
>   calc_load_tasks variable for subsequent load calculations.
> * It is not enough to add calls to account for the load when a cpu is
>   awakened:
>   - Load avg calculation must be independent of cpu load.
>   - If a cpu is awakened by one task, but then has more scheduled before
>     the end of the update window, only the first task will be accounted.

OK, so what you're saying is that because we update calc_load_tasks when
entering idle, we decrease earlier than a regular 10 tick sample interval
would? Hence you batch these early updates into _deferred and let the
next 10 tick sample roll them over?

The only early updates can come from
pick_next_task_idle()->calc_load_account_active(), so why not specialize
that callchain instead of the below?

Also, since it's all NO_HZ, why not stick this in with the ILB? Once
people get around to making that scale better, this can hitch a ride.

Something like the below perhaps? It does run partially from softirq
context, but given the distinct lack of synchronization here, that didn't
seem like an immediate problem.

---
 kernel/sched.c          |   10 ++++++----
 kernel/sched_fair.c     |    4 +++-
 kernel/sched_idletask.c |    2 --
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..cdd4d8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2959,6 +2959,11 @@ static void calc_load_account_active(struct rq *this_rq)
 {
 	long nr_active, delta;
 
+	if (!time_after_eq(jiffies, this_rq->calc_load_update))
+		return;
+
+	this_rq->calc_load_update += LOAD_FREQ;
+
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
 
@@ -2998,10 +3003,7 @@ static void update_cpu_load(struct rq *this_rq)
 		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
 	}
 
-	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
-		calc_load_account_active(this_rq);
-	}
+	calc_load_account_active(this_rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 88d3053..2c267ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3394,9 +3394,11 @@ static void run_rebalance_domains(struct softirq_action *h)
 			if (need_resched())
 				break;
 
+			rq = cpu_rq(balance_cpu);
+			calc_load_account_active(rq);
+
 			rebalance_domains(balance_cpu, CPU_IDLE);
 
-			rq = cpu_rq(balance_cpu);
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
 		}
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index bea2b8f..6ca191f 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,8 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	schedstat_inc(rq, sched_goidle);
-	/* adjust the active tasks as we might go into a long sleep */
-	calc_load_account_active(rq);
 	return rq->idle;
 }
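
To make the windowing concrete: both the original patch and the variant
above hinge on folding each cpu's delta into the global calc_load_tasks
at most once per LOAD_FREQ window. Below is a minimal user-space model of
that fold, a sketch only: struct model_rq, model_account_active(),
CALC_LOAD_FREQ and the tick loop are illustrative stand-ins for the
kernel's struct rq, calc_load_account_active(), LOAD_FREQ and the
scheduler tick, not the real implementation.

#include <stdio.h>

#define CALC_LOAD_FREQ	10	/* ticks per sample window (5*HZ upstream) */

static long calc_load_tasks;	/* global count the load avg is derived from */

struct model_rq {
	long nr_running;
	long nr_uninterruptible;
	long calc_load_active;		/* this cpu's last sampled contribution */
	unsigned long calc_load_update;	/* start of the next sample window */
};

/*
 * Models calc_load_account_active() with the window check folded in, as
 * in the patch above: fold the change in this cpu's active count into
 * the global counter at most once per window, so an early call from the
 * idle path becomes a no-op until the window edge.
 */
static void model_account_active(struct model_rq *rq, unsigned long jiffies)
{
	long nr_active, delta;

	/* stand-in for !time_after_eq(jiffies, calc_load_update) */
	if (jiffies < rq->calc_load_update)
		return;
	rq->calc_load_update += CALC_LOAD_FREQ;

	nr_active = rq->nr_running + rq->nr_uninterruptible;
	delta = nr_active - rq->calc_load_active;
	rq->calc_load_active = nr_active;
	calc_load_tasks += delta;	/* atomic_long_add() in the kernel */

	printf("tick %lu: folded delta %ld, calc_load_tasks = %ld\n",
	       jiffies, delta, calc_load_tasks);
}

int main(void)
{
	struct model_rq rq = { .nr_running = 3 };
	unsigned long jiffies;

	for (jiffies = 0; jiffies < 30; jiffies++) {
		if (jiffies == 15)	/* cpu goes idle mid-window */
			rq.nr_running = 0;
		model_account_active(&rq, jiffies);
	}
	return 0;
}

Built with gcc and run, this folds at ticks 0, 10 and 20 only: the idle
transition at tick 15 shows up as the -3 delta at tick 20 rather than as
an immediate mid-window decrement, which is the deferral behaviour the
time_after_eq() check in calc_load_account_active() enforces.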