Patchwork [PATCHv3,2/2] powerpc: implement arch_scale_smt_power for Power7

login
register
mail settings
Submitter jschopp@austin.ibm.com
Date Jan. 28, 2010, 11:24 p.m.
Message ID <1264721088.10385.1.camel@jschopp-laptop>
Download mbox | patch
Permalink /patch/43884/
State Rejected
Headers show

Comments

jschopp@austin.ibm.com - Jan. 28, 2010, 11:24 p.m.
On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads 
there is performance benefit to idling the higher numbered threads in
the core.  

This patch implements arch_scale_smt_power to dynamically update smt
thread power in these idle cases in order to prefer threads 0,1 over
threads 2,3 within a core.


Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Version 2 addresses style and optimization, same basic functionality
Version 3 adds a comment, resent due to mailing format error
Benjamin Herrenschmidt - Jan. 29, 2010, 1:23 a.m.
On Thu, 2010-01-28 at 17:24 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads 
> there is performance benefit to idling the higher numbered threads in
> the core.  
> 
> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.

Almost there :-) Joel, Peter, can you help me figure something out tho ?

On machine that don't have SMT, I would like to avoid calling
arch_scale_smt_power() at all if possible (in addition to not compiling
it in if SMT is not enabled in .config).

Now, I must say I'm utterly confused by how the domains are setup and I
haven't quite managed to sort it out... it looks to me that
SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config
option is set (though each CPU will have its own domain) or am I
misguided ? IE. Is there any sense in having at least a fast exit path
out of arch_scale_smt_power() for non-SMT CPUs ?

Joel, can you look at compiling it out when SMT is not set ? We don't
want to bloat SMP kernels for 32-bit non-SMT embedded platforms.

Oh, and one minor nit:

> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
> ---
> Version 2 addresses style and optimization, same basic functionality
> Version 3 adds a comment, resent due to mailing format error
> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -620,3 +620,59 @@ void __cpu_die(unsigned int cpu)
>  		smp_ops->cpu_die(cpu);
>  }
>  #endif

 ^^^ Please add the /* CONFIG_CPU_HOTPLUG */ (or whatever it is) that's
missing after that #endif :-)

Cheers,
Ben.
Peter Zijlstra - Jan. 29, 2010, 10:13 a.m.
On Fri, 2010-01-29 at 12:23 +1100, Benjamin Herrenschmidt wrote:
> On machine that don't have SMT, I would like to avoid calling
> arch_scale_smt_power() at all if possible (in addition to not compiling
> it in if SMT is not enabled in .config).
> 
> Now, I must say I'm utterly confused by how the domains are setup and I
> haven't quite managed to sort it out... it looks to me that
> SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config
> option is set (though each CPU will have its own domain) or am I
> misguided ? IE. Is there any sense in having at least a fast exit path
> out of arch_scale_smt_power() for non-SMT CPUs ?

The sched_domain creation code is a f'ing stink pile that hurts
everybody's brain.

The AMD magny-cours people sort of cleaned it up a bit but didn't go
nearly far enough. Doing so is somewhere on my todo list, but sadly that
thing is way larger than my spare time.

Now SD_SHARE_CPUPOWER _should_ only be set for SMT domains, because only
SMT siblings share cpupower.

SD_SHARE_PKG_RESOURCES _should_ be set for both SMT and MC, because they
all share the same cache domain.

If it all works out that way in practice on powerpc is another question
entirely ;-)

That said, I'm still not entirely convinced I like this usage of
cpupower, its supposed to be a normalization scale for load-balancing,
not a placement hook.

I'd be much happier with a SD_GROUP_ORDER or something like that, that
works together with SD_PREFER_SIBLING to pack active tasks to cpus in
ascending group order.
jschopp@austin.ibm.com - Jan. 29, 2010, 6:34 p.m.
> That said, I'm still not entirely convinced I like this usage of
> cpupower, its supposed to be a normalization scale for load-balancing,
> not a placement hook.
>   
Even if you do a placement hook you'll need to address it in the load 
balancing as well.  Consider a single 4 thread SMT core with 4 running 
tasks.  If 2 of them exit the remaining 2 will need to be load balanced 
within the core in a way that takes into account the dynamic nature of 
the thread power.  This patch does that.
> I'd be much happier with a SD_GROUP_ORDER or something like that, that
> works together with SD_PREFER_SIBLING to pack active tasks to cpus in
> ascending group order.
>
>   
I don't see this load-balancing patch as mutually exclusive with a patch 
to fix placement.  But even if it is a mutually exclusive solution there 
is no reason we can't fix things now with this patch and then later take 
it out when it's fixed another way.  This patch series is 
straightforward, non-intrusive, and without it the scheduler is broken 
on this processor.
jschopp@austin.ibm.com - Jan. 29, 2010, 6:41 p.m.
Benjamin Herrenschmidt wrote:
> On Thu, 2010-01-28 at 17:24 -0600, Joel Schopp wrote:
>   
>> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads 
>> there is performance benefit to idling the higher numbered threads in
>> the core.  
>>
>> This patch implements arch_scale_smt_power to dynamically update smt
>> thread power in these idle cases in order to prefer threads 0,1 over
>> threads 2,3 within a core.
>>     
>
> Almost there :-) Joel, Peter, can you help me figure something out tho ?
>
> On machine that don't have SMT, I would like to avoid calling
> arch_scale_smt_power() at all if possible (in addition to not compiling
> it in if SMT is not enabled in .config).
>
> Now, I must say I'm utterly confused by how the domains are setup and I
> haven't quite managed to sort it out... it looks to me that
> SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config
> option is set (though each CPU will have its own domain) or am I
> misguided ? IE. Is there any sense in having at least a fast exit path
> out of arch_scale_smt_power() for non-SMT CPUs ?
>
> Joel, can you look at compiling it out when SMT is not set ? We don't
> want to bloat SMP kernels for 32-bit non-SMT embedded platforms.
>   
I can wrap the powerpc definition of arch_scale_smt in an #ifdef, if 
it's not there the scheduler uses the default, which is the same as it 
uses if SMT isn't compiled.

Patch

Index: linux-2.6.git/arch/powerpc/kernel/smp.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
+++ linux-2.6.git/arch/powerpc/kernel/smp.c
@@ -620,3 +620,59 @@  void __cpu_die(unsigned int cpu)
 		smp_ops->cpu_die(cpu);
 }
 #endif
+
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+	int sibling;
+	int idle_count = 0;
+	int thread;
+
+	/* Setup the default weight and smt_gain used by most cpus for SMT
+	 * Power.  Doing this right away covers the default case and can be
+	 * used by cpus that modify it dynamically.
+	 */
+	struct cpumask *sibling_map = sched_domain_span(sd);
+	unsigned long weight = cpumask_weight(sibling_map);
+	unsigned long smt_gain = sd->smt_gain;
+
+
+	if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) {
+		for_each_cpu(sibling, sibling_map) {
+			if (idle_cpu(sibling))
+				idle_count++;
+		}
+
+		/* the following section attempts to tweak cpu power based
+		 * on current idleness of the threads dynamically at runtime
+		 */
+		if (idle_count > 1) {
+			thread = cpu_thread_in_core(cpu);
+			if (thread < 2) {
+				/* add 75 % to thread power */
+				smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
+			} else {
+				 /* subtract 75 % to thread power */
+				smt_gain = smt_gain >> 2;
+			}
+		}
+	}
+
+	/* default smt gain is 1178, weight is # of SMT threads */
+	switch (weight) {
+	case 1:
+		/*divide by 1, do nothing*/
+		break;
+	case 2:
+		smt_gain = smt_gain >> 1;
+		break;
+	case 4:
+		smt_gain = smt_gain >> 2;
+		break;
+	default:
+		smt_gain /= weight;
+		break;
+	}
+
+	return smt_gain;
+
+}
Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h
===================================================================
--- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h
+++ linux-2.6.git/arch/powerpc/include/asm/cputable.h
@@ -195,6 +195,7 @@  extern const char *powerpc_base_platform
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020000000000000)
 #define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040000000000000)
 #define CPU_FTR_UNALIGNED_LD_STD	LONG_ASM_CONST(0x0080000000000000)
+#define CPU_FTR_ASYNC_SMT4		LONG_ASM_CONST(0x0100000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -409,7 +410,7 @@  extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
 	    CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \
 	    CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
-	    CPU_FTR_DSCR | CPU_FTR_SAO)
+	    CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4)
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \