Patchwork Lucid pull request, sched: Fix SMT scheduler regression in find_busiest_queue()

login
register
mail settings
Submitter Tim Gardner
Date March 5, 2010, 4:46 p.m.
Message ID <20100305164641.C34AD31B6E@sepang.rtg.net>
Download mbox | patch
Permalink /patch/47012/
State Accepted
Delegated to: Andy Whitcroft
Headers show

Comments

Tim Gardner - March 5, 2010, 4:46 p.m.
This is probably a good patch to have in time for release, even though it
should soon show up in stable. I've noticed no regressions using
'stress -t 300 -c 24 -i 24 -m 24 -d 24' on a (2) CPU Nehalem.

------------------------------
The following changes since commit 9512754e274895426d5305d09a4c667d8621f420:
  Oleg Nesterov (1):
        x86: set_personality_ia32() misses force_personality32

are available in the git repository at:

  git://kernel.ubuntu.com/rtg/ubuntu-lucid.git sched-find_busiest_queue

Suresh Siddha (1):
      sched: Fix SMT scheduler regression in find_busiest_queue()

 kernel/sched.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

From f4d96f2b70af1037d7cab412d79db38846af0d79 Mon Sep 17 00:00:00 2001
From: Suresh Siddha <suresh.b.siddha@intel.com>
Date: Fri, 12 Feb 2010 17:14:22 -0800
Subject: [PATCH] sched: Fix SMT scheduler regression in find_busiest_queue()

Fix a SMT scheduler performance regression that is leading to a scenario
where SMT threads in one core are completely idle while both the SMT threads
in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

   commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
   Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
   Date:   Tue Sep 1 10:34:38 2009 +0200

   sched: Try to deal with low capacity

   @@ -4203,15 +4223,18 @@ find_busiest_queue()
   ...
	for_each_cpu(i, sched_group_cpus(group)) {
   +	unsigned long power = power_of(i);

   ...

   -	wl = weighted_cpuload(i);
   +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
   +	wl /= power;

   -	if (rq->nr_running == 1 && wl > imbalance)
   +	if (capacity && rq->nr_running == 1 && wl > imbalance)
		continue;

On a SMT system, power of the HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power will result in "wl > imbalance" and
ultimately resulting in find_busiest_queue() return NULL, causing
load_balance() to think that the load is well balanced. But infact
one of the tasks can be moved to the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with  imabalance. In that condition, we already know there is only a
single task "rq->nr_running == 1" and the comparison between imbalance,
wl is to make sure that we select the correct priority thread which matches
imbalance. So we really need to compare the imabalnce with the original
weighted load of the cpu and not the scaled load.

But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.

Fix it.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 9000f05c6d1607f79c0deacf42b09693be673f4c)

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 kernel/sched.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)
Andy Whitcroft - March 8, 2010, 10:35 a.m.
Applied to Lucid.

-apw

Patch

diff --git a/kernel/sched.c b/kernel/sched.c
index 60d74cc..4d01095 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4098,12 +4098,23 @@  find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
 			continue;
 
 		rq = cpu_rq(i);
-		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
-		wl /= power;
+		wl = weighted_cpuload(i);
 
+		/*
+		 * When comparing with imbalance, use weighted_cpuload()
+		 * which is not scaled with the cpu power.
+		 */
 		if (capacity && rq->nr_running == 1 && wl > imbalance)
 			continue;
 
+		/*
+		 * For the load comparisons with the other cpu's, consider
+		 * the weighted_cpuload() scaled with the cpu power, so that
+		 * the load can be moved away from the cpu that is potentially
+		 * running at a lower capacity.
+		 */
+		wl = (wl * SCHED_LOAD_SCALE) / power;
+
 		if (wl > max_load) {
 			max_load = wl;
 			busiest = rq;