
[PATCHv4,2/2] powerpc: implement arch_scale_smt_power for Power7

Message ID 1265403478.6089.41.camel@jschopp-laptop (mailing list archive)
State Rejected
Delegated to: Benjamin Herrenschmidt

Commit Message

jschopp@austin.ibm.com Feb. 5, 2010, 8:57 p.m. UTC
On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
there is a performance benefit to idling the higher-numbered threads in
the core.

This patch implements arch_scale_smt_power to dynamically update smt
thread power in these idle cases in order to prefer threads 0,1 over
threads 2,3 within a core.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Version 3 adds the #ifdef to avoid compiling on kernels that don't need it
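
For reference, here is a minimal stand-alone sketch (not part of the patch)
of the arithmetic the patch performs for an SMT4 core with two or more idle
threads, using the default smt_gain of 1178:

	#include <stdio.h>

	int main(void)
	{
		unsigned long smt_gain = 1178;	/* default sd->smt_gain */
		unsigned long boosted, reduced;

		/* threads 0,1: add 75 % to thread power, then divide by
		 * the 4-thread weight */
		boosted = smt_gain + (smt_gain >> 1) + (smt_gain >> 2);
		boosted >>= 2;

		/* threads 2,3: keep 25 % of thread power, then divide by
		 * the 4-thread weight */
		reduced = smt_gain >> 2;
		reduced >>= 2;

		printf("threads 0,1: %lu  threads 2,3: %lu\n", boosted, reduced);
		return 0;
	}

This prints "threads 0,1: 515  threads 2,3: 73", i.e. with two or more idle
threads in the core the scheduler sees threads 0 and 1 as roughly seven
times as powerful as threads 2 and 3.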

Comments

Peter Zijlstra Feb. 14, 2010, 10:12 a.m. UTC | #1
On Fri, 2010-02-05 at 14:57 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> there is a performance benefit to idling the higher-numbered threads in
> the core.
> 
> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.
> 
> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
> ---

> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -620,3 +620,61 @@ void __cpu_die(unsigned int cpu)
>  		smp_ops->cpu_die(cpu);
>  }
>  #endif
> +
> +#ifdef CONFIG_SCHED_SMT
> +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
> +{
> +	int sibling;
> +	int idle_count = 0;
> +	int thread;
> +
> +	/* Setup the default weight and smt_gain used by most cpus for SMT
> +	 * Power.  Doing this right away covers the default case and can be
> +	 * used by cpus that modify it dynamically.
> +	 */
> +	struct cpumask *sibling_map = sched_domain_span(sd);
> +	unsigned long weight = cpumask_weight(sibling_map);
> +	unsigned long smt_gain = sd->smt_gain;
> +
> +
> +	if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) {
> +		for_each_cpu(sibling, sibling_map) {
> +			if (idle_cpu(sibling))
> +				idle_count++;
> +		}
> +
> +		/* the following section attempts to tweak cpu power based
> +		 * on current idleness of the threads dynamically at runtime
> +		 */
> +		if (idle_count > 1) {
> +			thread = cpu_thread_in_core(cpu);
> +			if (thread < 2) {
> +				/* add 75 % to thread power */
> +				smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
> +			} else {
> +				/* subtract 75 % from thread power */
> +				smt_gain = smt_gain >> 2;
> +			}
> +		}
> +	}
> +
> +	/* default smt gain is 1178, weight is # of SMT threads */
> +	switch (weight) {
> +	case 1:
> +		/* divide by 1, do nothing */
> +		break;
> +	case 2:
> +		smt_gain = smt_gain >> 1;
> +		break;
> +	case 4:
> +		smt_gain = smt_gain >> 2;
> +		break;
> +	default:
> +		smt_gain /= weight;
> +		break;
> +	}
> +
> +	return smt_gain;
> +
> +}
> +#endif

Suppose for a moment we have 2 threads (threads 1 and 3 hot-unplugged; we
can construct an equivalent but more complex example for 4 threads) and
we have 4 tasks: 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO. The
SCHED_FIFO task will consume exactly 50% of the walltime of whatever cpu
it ends up on.

In that situation, provided that each cpu's cpu_power is of equal
measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the
cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT
task, so that each task consumes 50%, which is all fair and proper.

However, if you do the above, thread 0 will have +75% = 1.75 and thread
2 will have -75% = 0.25. Then if the RT task lands on thread 0 we'll
have 0.875 vs 0.25, or if it lands on thread 2, 1.75 vs 0.125. In either
case thread 0 will receive too many (if not all) SCHED_OTHER tasks.
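
(Working the numbers: the RT task consumes 50% of whichever cpu it lands
on, so scale_rt_power() cuts that cpu's effective power in half:
1.75 * 0.5 = 0.875 and 0.25 * 0.5 = 0.125.)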

That is, unless these threads 2 and 3 really are _that_ weak, at which
point one wonders why IBM bothered with the silicon ;-)

So tell me again, why is fiddling with the cpu_power a good placement
tool?
Michael Neuling Feb. 17, 2010, 10:20 p.m. UTC | #2
> Suppose for a moment we have 2 threads (threads 1 and 3 hot-unplugged; we
> can construct an equivalent but more complex example for 4 threads) and
> we have 4 tasks: 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO. The
> SCHED_FIFO task will consume exactly 50% of the walltime of whatever cpu
> it ends up on.
> 
> In that situation, provided that each cpu's cpu_power is of equal
> measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the
> cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT
> task, so that each task consumes 50%, which is all fair and proper.
> 
> However, if you do the above, thread 0 will have +75% = 1.75 and thread
> 2 will have -75% = 0.25. Then if the RT task lands on thread 0 we'll
> have 0.875 vs 0.25, or if it lands on thread 2, 1.75 vs 0.125. In either
> case thread 0 will receive too many (if not all) SCHED_OTHER tasks.
> 
> That is, unless these threads 2 and 3 really are _that_ weak, at which
> point one wonders why IBM bothered with the silicon ;-)

Peter,

Threads 2 & 3 aren't weaker than threads 0 & 1, but...

The core has dynamic SMT mode switching which is controlled by the
hypervisor (IBM's PHYP).  There are 3 SMT modes:
	SMT1 uses thread  0
	SMT2 uses threads 0 & 1
	SMT4 uses threads 0, 1, 2 & 3
When in any particular SMT mode, all threads have the same performance
as each other (i.e. at any moment in time, all threads perform the same).

The SMT mode switching works such that when Linux has threads 2 & 3 idle
and 0 & 1 active, it will cede threads 2 and 3 (via the H_CEDE hypercall)
in the idle loop and the hypervisor will automatically switch that core
to SMT2 (independently of other cores).  The opposite is not true: if
threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.

Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
into SMT1 mode.  

If we can get the core into a lower SMT mode (SMT1 is best), the threads
will perform better (since they share fewer core resources).  Hence when
we have idle threads, we want them to be the higher-numbered ones.

So to answer your question, threads 2 and 3 aren't weaker than the other
threads when in SMT4 mode.  It's that if we idle threads 2 & 3, threads
0 & 1 will speed up since we'll move to SMT2 mode.
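
To summarize the behaviour described above ("idle" here meaning the
thread has ceded itself to the hypervisor from the idle loop):

	thread 0 active, threads 1, 2 & 3 idle    -> core drops to SMT1
	threads 0 & 1 active, threads 2 & 3 idle  -> core drops to SMT2
	thread 2 or 3 active                      -> core stays in SMT4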

I'm pretty vague on Linux scheduler details, so I'm a bit at sea as to
how to solve this.  Can you suggest any mechanisms we currently have in
the kernel to reflect these properties, or do you think we need to
develop something new?  If so, any pointers as to where we should look?

Thanks,
Mikey

Patch

Index: linux-2.6.git/arch/powerpc/kernel/smp.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
+++ linux-2.6.git/arch/powerpc/kernel/smp.c
@@ -620,3 +620,61 @@  void __cpu_die(unsigned int cpu)
 		smp_ops->cpu_die(cpu);
 }
 #endif
+
+#ifdef CONFIG_SCHED_SMT
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+	int sibling;
+	int idle_count = 0;
+	int thread;
+
+	/* Setup the default weight and smt_gain used by most cpus for SMT
+	 * Power.  Doing this right away covers the default case and can be
+	 * used by cpus that modify it dynamically.
+	 */
+	struct cpumask *sibling_map = sched_domain_span(sd);
+	unsigned long weight = cpumask_weight(sibling_map);
+	unsigned long smt_gain = sd->smt_gain;
+
+
+	if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) {
+		for_each_cpu(sibling, sibling_map) {
+			if (idle_cpu(sibling))
+				idle_count++;
+		}
+
+		/* the following section attempts to tweak cpu power based
+		 * on current idleness of the threads dynamically at runtime
+		 */
+		if (idle_count > 1) {
+			thread = cpu_thread_in_core(cpu);
+			if (thread < 2) {
+				/* add 75 % to thread power */
+				smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
+			} else {
+				/* subtract 75 % from thread power */
+				smt_gain = smt_gain >> 2;
+			}
+		}
+	}
+
+	/* default smt gain is 1178, weight is # of SMT threads */
+	switch (weight) {
+	case 1:
+		/* divide by 1, do nothing */
+		break;
+	case 2:
+		smt_gain = smt_gain >> 1;
+		break;
+	case 4:
+		smt_gain = smt_gain >> 2;
+		break;
+	default:
+		smt_gain /= weight;
+		break;
+	}
+
+	return smt_gain;
+
+}
+#endif
Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h
===================================================================
--- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h
+++ linux-2.6.git/arch/powerpc/include/asm/cputable.h
@@ -195,6 +195,7 @@  extern const char *powerpc_base_platform
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020000000000000)
 #define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040000000000000)
 #define CPU_FTR_UNALIGNED_LD_STD	LONG_ASM_CONST(0x0080000000000000)
+#define CPU_FTR_ASYNC_SMT4		LONG_ASM_CONST(0x0100000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -409,7 +410,7 @@  extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
 	    CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \
 	    CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
-	    CPU_FTR_DSCR | CPU_FTR_SAO)
+	    CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4)
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \