
[6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

Message ID 20100210111025.GF3399@kryten (mailing list archive)
State Accepted, archived
Delegated to: Benjamin Herrenschmidt
Headers show

Commit Message

Anton Blanchard Feb. 10, 2010, 11:10 a.m. UTC
Nick Piggin discovered that lwsync barriers around locks were faster than isync
on 970. That was a long time ago and I completely dropped the ball in testing
his patches across other ppc64 processors.

Turns out the idea helps on other chips. Using a microbenchmark that
uses a lot of threads to contend on a global pthread mutex (and therefore a
global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
and while I couldn't measure an improvement, there was no regression.

This patch uses the lwsync patching code to replace the isyncs with lwsyncs
on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
capable but in reality they treat it as a full sync (i.e. slow). Remove the
CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
method.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Comments

Nick Piggin Feb. 11, 2010, 7:09 a.m. UTC | #1
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> 
> Nick Piggin discovered that lwsync barriers around locks were faster than isync
> on 970. That was a long time ago and I completely dropped the ball in testing
> his patches across other ppc64 processors.
> 
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and therefore a
> global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
> and while I couldn't measure an improvement, there was no regression.

Ah, good to see this one come back. I also tested tbench over localhost
btw which actually did show some speedup on the G5. 

BTW. this was the last thing left:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html

Don't know if you took a look at that again, but maybe it's worth
looking at. Hmm, we do actually seem to be growing the number of smp_mb*
calls in core kernel too.
Benjamin Herrenschmidt Feb. 16, 2010, 4:19 a.m. UTC | #2
On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> 
> Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
> significantly more expensive there. I see a 25% drop in the microbenchmark
> doing pthread_lock/unlock loops on two cpus.
> 
> Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
> the sync->lwsync changes definitely still can, and should, be done. 

So we should use a different feature bit. No biggie. If needed we can
split them more anyways.

Cheers,
Ben.
Olof Johansson Feb. 16, 2010, 4:22 a.m. UTC | #3
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> 
> Nick Piggin discovered that lwsync barriers around locks were faster than isync
> on 970. That was a long time ago and I completely dropped the ball in testing
> his patches across other ppc64 processors.
> 
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and therefore a
> global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
> and while I couldn't measure an improvement, there was no regression.
> 
> This patch uses the lwsync patching code to replace the isyncs with lwsyncs
> on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
> capable but in reality they treat it as a full sync (ie slow). Remove the
> CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
> method.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>

Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
significantly more expensive there. I see a 25% drop in the microbenchmark
doing pthread_lock/unlock loops on two cpus.

Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
the sync->lwsync changes definitely still can, and should, be done.


-Olof
Olof Johansson Feb. 16, 2010, 6:07 a.m. UTC | #4
On Tue, Feb 16, 2010 at 03:19:03PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> > 
> > Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
> > significantly more expensive there. I see a 25% drop in the microbenchmark
> > doing pthread_lock/unlock loops on two cpus.
> > 
> > Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
> > the sync->lwsync changes definitely still can, and should, be done. 
> 
> So we should use a different feature bit. No biggie. If needed we can
> split them more anyways.

Yeah, fine with me.


-Olof
Anton Blanchard Feb. 17, 2010, 9:43 a.m. UTC | #5
Hi Nick,

> Ah, good to see this one come back. I also tested tbench over localhost
> btw which actually did show some speedup on the G5. 
> 
> BTW. this was the last thing left:
> http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
> 
> Don't know if you took a look at that again, but maybe it's worth
> looking at. Hmm, we do actually seem to be growing the number of smp_mb*
> calls in core kernel too.

Interesting idea! I do worry we will get a late night visit from the
architecture police. From memory they want the complete larx, stwcx., bne,
isync sequence to guarantee an acquire barrier.

Anton
Nick Piggin Feb. 17, 2010, 10:41 a.m. UTC | #6
On Wed, Feb 17, 2010 at 08:43:14PM +1100, Anton Blanchard wrote:
>  
> Hi Nick,
> 
> > Ah, good to see this one come back. I also tested tbench over localhost
> > btw which actually did show some speedup on the G5. 
> > 
> > BTW. this was the last thing left:
> > http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
> > 
> > Don't know if you took a look at that again, but maybe it's worth
> > looking at. Hmm, we do actually seem to be growing number of smp_mb*
> > calls in core kernel too.
> 
> Interesting idea! I do worry we will get a late night visit from the
> architecture police. From memory they want the complete larx, stwcx., bne,
> isync sequence to guarantee an acquire barrier.

Yes I suppose the branch can be executed "non speculatively" before the 
lwsync is completed. Whereas the larx/stwcx. will have to complete before
the branch outcome can be known. I suppose probably not worthwhile
avoiding the full IO sync by adding yet more crap to make this work.

Thanks for putting my mind to rest though :)

I'd still be interested to know how expensive the full sync is when you
have a lot of IOs in flight.

Question, are you going to do the hint and isync->lwsync thing for
userspace as well? Too bad the kernel doesn't export synchronisation
primitives to userspace...
Benjamin Herrenschmidt Feb. 17, 2010, 12:12 p.m. UTC | #7
> Yes I suppose the branch can be executed "non speculatively" before the 
> lwsync is completed. Whereas the larx/stwcx. will have to complete before
> the branch outcome can be known. I suppose probably not worthwhile
> avoiding the full IO sync by adding yet more crap to make this work.
> 
> Thanks for putting my mind to rest though :)
> 
> I'd still be interested to know how expensive the full sync is when you
> have a lot of IOs in flight.
> 
> Question, are you going to do the hint and isync->lwsync thing for
> userspace as well? Too bad the kernel doesn't export synchronisation
> primitives to userspace...

We'd love to... the vDSO inherits all the dynamic patching features of
the kernel, and more...

It's really a matter of getting glibc to grok it, and doing so
efficiently (i.e. not just having a function pointer in glibc; it might
be a good use of the IFUNC stuff).

Cheers,
Ben.

Patch

Index: powerpc.git/arch/powerpc/include/asm/synch.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:44.666823141 +1100
+++ powerpc.git/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:45.123073249 +1100
@@ -37,7 +37,11 @@  static inline void isync(void)
 #endif
 
 #ifdef CONFIG_SMP
-#define PPC_ACQUIRE_BARRIER	"\n\tisync\n"
+#define __PPC_ACQUIRE_BARRIER				\
+	START_LWSYNC_SECTION(97);			\
+	isync;						\
+	MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
+#define PPC_ACQUIRE_BARRIER	"\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER	stringify_in_c(LWSYNC) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
Index: powerpc.git/arch/powerpc/include/asm/cputable.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/cputable.h	2010-02-10 17:15:59.816823326 +1100
+++ powerpc.git/arch/powerpc/include/asm/cputable.h	2010-02-10 17:16:45.123073249 +1100
@@ -381,9 +381,9 @@  extern const char *powerpc_base_platform
 #define CPU_FTRS_GENERIC_32	(CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
-#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | \
 	    CPU_FTR_IABR | CPU_FTR_PPC_LE)
-#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | \
 	    CPU_FTR_IABR | \
 	    CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \