| Message ID | 20100210111025.GF3399@kryten (mailing list archive) |
|---|---|
| State | Accepted, archived |
| Delegated to: | Benjamin Herrenschmidt |
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> Nick Piggin discovered that lwsync barriers around locks were faster
> than isync on 970. That was a long time ago and I completely dropped
> the ball in testing his patches across other ppc64 processors.
>
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and
> therefore a global futex), POWER6 improves 8% and POWER7 improves 2%.
> I checked POWER5 and while I couldn't measure an improvement, there
> was no regression.

Ah, good to see this one come back. I also tested tbench over localhost
btw, which actually did show some speedup on the G5.

BTW, this was the last thing left:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html

Don't know if you took a look at that again, but maybe it's worth
looking at. Hmm, we do actually seem to be growing the number of smp_mb*
calls in the core kernel too.
On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> Turns out this one hurts PA6T performance quite a bit, lwsync seems to
> be significantly more expensive there. I see a 25% drop in the
> microbenchmark doing pthread_lock/unlock loops on two cpus.
>
> Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate
> since the sync->lwsync changes definitely still can, and should, be
> done.

So we should use a different feature bit. No biggie. If needed we can
split them more anyways.

Cheers,
Ben.
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> Nick Piggin discovered that lwsync barriers around locks were faster
> than isync on 970. That was a long time ago and I completely dropped
> the ball in testing his patches across other ppc64 processors.
>
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and
> therefore a global futex), POWER6 improves 8% and POWER7 improves 2%.
> I checked POWER5 and while I couldn't measure an improvement, there
> was no regression.
>
> This patch uses the lwsync patching code to replace the isyncs with
> lwsyncs on CPUs that support the instruction. We were marking POWER3
> and RS64 as lwsync capable but in reality they treat it as a full sync
> (ie slow). Remove the CPU_FTR_LWSYNC bit from these CPUs so they
> continue to use the faster isync method.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>

Turns out this one hurts PA6T performance quite a bit; lwsync seems to be
significantly more expensive there. I see a 25% drop in the microbenchmark
doing pthread_lock/unlock loops on two cpus.

Taking out the CPU_FTR_LWSYNC will solve it. It's a bit unfortunate, since
the sync->lwsync changes definitely still can, and should, be done.

-Olof
On Tue, Feb 16, 2010 at 03:19:03PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> >
> > Turns out this one hurts PA6T performance quite a bit, lwsync seems
> > to be significantly more expensive there. I see a 25% drop in the
> > microbenchmark doing pthread_lock/unlock loops on two cpus.
> >
> > Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate
> > since the sync->lwsync changes definitely still can, and should, be
> > done.
>
> So we should use a different feature bit. No biggie. If needed we can
> split them more anyways.

Yeah, fine with me.

-Olof
Hi Nick,

> Ah, good to see this one come back. I also tested tbench over localhost
> btw, which actually did show some speedup on the G5.
>
> BTW, this was the last thing left:
> http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
>
> Don't know if you took a look at that again, but maybe it's worth
> looking at. Hmm, we do actually seem to be growing the number of
> smp_mb* calls in the core kernel too.

Interesting idea! I do worry we will get a late-night visit from the
architecture police. From memory they want the complete larx, stwcx.,
bne, isync sequence to guarantee an acquire barrier.

Anton
On Wed, Feb 17, 2010 at 08:43:14PM +1100, Anton Blanchard wrote:
> Hi Nick,
>
> > Ah, good to see this one come back. I also tested tbench over
> > localhost btw, which actually did show some speedup on the G5.
> >
> > BTW, this was the last thing left:
> > http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
> >
> > Don't know if you took a look at that again, but maybe it's worth
> > looking at. Hmm, we do actually seem to be growing the number of
> > smp_mb* calls in the core kernel too.
>
> Interesting idea! I do worry we will get a late-night visit from the
> architecture police. From memory they want the complete larx, stwcx.,
> bne, isync sequence to guarantee an acquire barrier.

Yes, I suppose the branch can be executed "non speculatively" before the
lwsync is completed, whereas the larx/stwcx. will have to complete before
the branch outcome can be known. I suppose it's probably not worthwhile
avoiding the full IO sync by adding yet more crap to make this work.

Thanks for putting my mind to rest though :)

I'd still be interested to know how expensive the full sync is when you
have a lot of IOs in flight.

Question: are you going to do the hint and isync->lwsync thing for
userspace as well? Too bad the kernel doesn't export synchronisation
primitives to userspace...
> Yes, I suppose the branch can be executed "non speculatively" before
> the lwsync is completed, whereas the larx/stwcx. will have to complete
> before the branch outcome can be known. I suppose it's probably not
> worthwhile avoiding the full IO sync by adding yet more crap to make
> this work.
>
> Thanks for putting my mind to rest though :)
>
> I'd still be interested to know how expensive the full sync is when
> you have a lot of IOs in flight.
>
> Question: are you going to do the hint and isync->lwsync thing for
> userspace as well? Too bad the kernel doesn't export synchronisation
> primitives to userspace...

We'd love to... the vdso inherits all of the dynamic patching features
of the kernel, and more. It's really a matter of getting glibc to grok
it, and doing so efficiently (ie, not just having a function pointer in
glibc; might be a good use of the IFUNC stuff).

Cheers,
Ben.
Index: powerpc.git/arch/powerpc/include/asm/synch.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:44.666823141 +1100
+++ powerpc.git/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:45.123073249 +1100
@@ -37,7 +37,11 @@ static inline void isync(void)
 #endif
 
 #ifdef CONFIG_SMP
-#define PPC_ACQUIRE_BARRIER	"\n\tisync\n"
+#define __PPC_ACQUIRE_BARRIER			\
+	START_LWSYNC_SECTION(97);		\
+	isync;					\
+	MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
+#define PPC_ACQUIRE_BARRIER "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER	stringify_in_c(LWSYNC) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
Index: powerpc.git/arch/powerpc/include/asm/cputable.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/cputable.h	2010-02-10 17:15:59.816823326 +1100
+++ powerpc.git/arch/powerpc/include/asm/cputable.h	2010-02-10 17:16:45.123073249 +1100
@@ -381,9 +381,9 @@ extern const char *powerpc_base_platform
 #define CPU_FTRS_GENERIC_32	(CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
-#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | \
 	CPU_FTR_IABR | CPU_FTR_PPC_LE)
-#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | \
 	CPU_FTR_IABR | \
 	CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
Nick Piggin discovered that lwsync barriers around locks were faster than
isync on 970. That was a long time ago and I completely dropped the ball
in testing his patches across other ppc64 processors.

Turns out the idea helps on other chips. Using a microbenchmark that uses
a lot of threads to contend on a global pthread mutex (and therefore a
global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
and while I couldn't measure an improvement, there was no regression.

This patch uses the lwsync patching code to replace the isyncs with
lwsyncs on CPUs that support the instruction. We were marking POWER3 and
RS64 as lwsync capable, but in reality they treat it as a full sync (ie
slow). Remove the CPU_FTR_LWSYNC bit from these CPUs so they continue to
use the faster isync method.

Signed-off-by: Anton Blanchard <anton@samba.org>
---