Patchwork mutex: optimise generic mutex implementations

Submitter: Nick Piggin
Date: Oct. 12, 2008, 5:46 a.m.
Message ID: <20081012054634.GA12535@wotan.suse.de>
Permalink: /patch/4047/
State: Awaiting Upstream, archived
Delegated to: Paul Mackerras

Comments

Nick Piggin - Oct. 12, 2008, 5:46 a.m.
Speed up generic mutex implementations.

- Atomic operations that both modify the variable and return a value imply
  full SMP memory barriers before and after the memory operations involved
  (a failing atomic_cmpxchg, atomic_add_unless, etc. doesn't imply a barrier
  because it doesn't modify the target). See Documentation/atomic_ops.txt.
  So remove the extra barriers and branches.

- All architectures support atomic_cmpxchg. This has no relation to
  __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path
  unconditionally.

This reduces a simple single-threaded fastpath lock+unlock test from 590 cycles
to 203 cycles on a ppc970 system.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
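
The barrier argument above can be sketched in user space. This is not the
kernel code: it is a minimal analogue of the mutex-dec fastpath after the
patch, written with C11 <stdatomic.h>. The default sequentially consistent
read-modify-write operations stand in for the kernel's value-returning
atomics here, on the assumption that they are a close enough substitute for
illustration (the two memory models are not identical). The names
sketch_mutex, sketch_lock, sketch_unlock, sketch_trylock, sketch_slowpath
and sketch_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* count: 1 = unlocked, 0 = locked, negative = locked with waiters */
typedef struct {
	atomic_int count;
} sketch_mutex;

/* Hypothetical stand-in for fail_fn; it only counts how often it is hit. */
static int slowpath_calls;

static void sketch_slowpath(sketch_mutex *m)
{
	(void)m;
	slowpath_calls++;
}

static void sketch_lock(sketch_mutex *m)
{
	/*
	 * atomic_fetch_sub returns the old value, so old - 1 is the new
	 * value, which is what the kernel's atomic_dec_return would yield.
	 * The operation is already fully ordered, so no separate fence
	 * follows on the success path.
	 */
	if (atomic_fetch_sub(&m->count, 1) - 1 < 0)
		sketch_slowpath(m);
}

static void sketch_unlock(sketch_mutex *m)
{
	/* Likewise, no fence is placed in front of the increment. */
	if (atomic_fetch_add(&m->count, 1) + 1 <= 0)
		sketch_slowpath(m);
}

static int sketch_trylock(sketch_mutex *m)
{
	int expected = 1;

	/* Succeeds only if count was 1; a failed attempt leaves it untouched. */
	return atomic_compare_exchange_strong(&m->count, &expected, 0);
}

/* Single-threaded walk through the fastpaths; returns the slowpath hits. */
static int sketch_demo(void)
{
	sketch_mutex m;

	atomic_init(&m.count, 1);
	slowpath_calls = 0;

	sketch_lock(&m);			/* 1 -> 0 */
	assert(atomic_load(&m.count) == 0);
	assert(sketch_trylock(&m) == 0);	/* already held */
	sketch_unlock(&m);			/* 0 -> 1 */
	assert(sketch_trylock(&m) == 1);	/* 1 -> 0 */
	sketch_unlock(&m);			/* 0 -> 1 */

	return slowpath_calls;
}
```

In this uncontended run the slowpath is never taken, which mirrors the
single-threaded lock+unlock case the cycle counts above refer to.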
Benjamin Herrenschmidt - Oct. 14, 2008, 8:35 a.m.
On Sun, 2008-10-12 at 07:46 +0200, Nick Piggin wrote:
> Speed up generic mutex implementations.
> 
> - atomic operations which both modify the variable and return something imply
>   full smp memory barriers before and after the memory operations involved
>   (failing atomic_cmpxchg, atomic_add_unless, etc don't imply a barrier because
>   they don't modify the target). See Documentation/atomic_ops.txt.
>   So remove extra barriers and branches.
>   
> - All architectures support atomic_cmpxchg. This has no relation to
>   __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path unconditionally
> 
> This reduces a simple single threaded fastpath lock+unlock test from 590 cycles
> to 203 cycles on a ppc970 system.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Looks ok.

Cheers,
Ben.
Ingo Molnar - Oct. 22, 2008, 3:59 p.m.
* Nick Piggin <npiggin@suse.de> wrote:

> Speed up generic mutex implementations.
> 
> - atomic operations which both modify the variable and return something imply
>   full smp memory barriers before and after the memory operations involved
>   (failing atomic_cmpxchg, atomic_add_unless, etc don't imply a barrier because
>   they don't modify the target). See Documentation/atomic_ops.txt.
>   So remove extra barriers and branches.
>   
> - All architectures support atomic_cmpxchg. This has no relation to
>   __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path unconditionally
> 
> This reduces a simple single threaded fastpath lock+unlock test from 590 cycles
> to 203 cycles on a ppc970 system.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

No objections here. Let's merge these two patches via the ppc tree, so
that they get testing on real hardware as well?

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo
David Howells - Oct. 22, 2008, 4:24 p.m.
Nick Piggin <npiggin@suse.de> wrote:

> Speed up generic mutex implementations.
> 
> - atomic operations which both modify the variable and return something imply
>   full smp memory barriers before and after the memory operations involved
>   (failing atomic_cmpxchg, atomic_add_unless, etc don't imply a barrier because
>   they don't modify the target). See Documentation/atomic_ops.txt.
>   So remove extra barriers and branches.
>   
> - All architectures support atomic_cmpxchg. This has no relation to
>   __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path unconditionally
> 
> This reduces a simple single threaded fastpath lock+unlock test from 590 cycles
> to 203 cycles on a ppc970 system.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

This seems to work on FRV which uses the mutex-dec generic algorithm, though
you have to take that with a pinch of salt as I don't have SMP hardware for
it.

Acked-by: David Howells <dhowells@redhat.com>
Benjamin Herrenschmidt - Oct. 23, 2008, 4:43 a.m.
On Wed, 2008-10-22 at 17:59 +0200, Ingo Molnar wrote:
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > Speed up generic mutex implementations.
> > 
> > - atomic operations which both modify the variable and return something imply
> >   full smp memory barriers before and after the memory operations involved
> >   (failing atomic_cmpxchg, atomic_add_unless, etc don't imply a barrier because
> >   they don't modify the target). See Documentation/atomic_ops.txt.
> >   So remove extra barriers and branches.
> >   
> > - All architectures support atomic_cmpxchg. This has no relation to
> >   __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path unconditionally
> > 
> > This reduces a simple single threaded fastpath lock+unlock test from 590 cycles
> > to 203 cycles on a ppc970 system.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> No objections here. Let's merge these two patches via the ppc tree, so
> that they get testing on real hardware as well?
> 
> Acked-by: Ingo Molnar <mingo@elte.hu>

All right, but in that case it will be after -rc1 unless I manage to sneak
something in tomorrow before Linus closes the merge window.

I can't get an update today.

Cheers,
Ben.
Nick Piggin - Oct. 23, 2008, 7:02 a.m.
On Thu, Oct 23, 2008 at 03:43:58PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2008-10-22 at 17:59 +0200, Ingo Molnar wrote:
> > * Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > Speed up generic mutex implementations.
> > > 
> > > - atomic operations which both modify the variable and return something imply
> > >   full smp memory barriers before and after the memory operations involved
> > >   (failing atomic_cmpxchg, atomic_add_unless, etc don't imply a barrier because
> > >   they don't modify the target). See Documentation/atomic_ops.txt.
> > >   So remove extra barriers and branches.
> > >   
> > > - All architectures support atomic_cmpxchg. This has no relation to
> > >   __HAVE_ARCH_CMPXCHG. We can just take the atomic_cmpxchg path unconditionally
> > > 
> > > This reduces a simple single threaded fastpath lock+unlock test from 590 cycles
> > > to 203 cycles on a ppc970 system.
> > > 
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > 
> > No objections here. Let's merge these two patches via the ppc tree, so
> > that they get testing on real hardware as well?
> > 
> > Acked-by: Ingo Molnar <mingo@elte.hu>
> 
> All right, but in that case it will be after -rc1 unless I manage to sneak
> something in tomorrow before Linus closes the merge window.
> 
> I can't get an update today.

Fine with me.

Thanks,
Nick

Patch

Index: linux-2.6/include/asm-generic/mutex-dec.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-dec.h
+++ linux-2.6/include/asm-generic/mutex-dec.h
@@ -22,8 +22,6 @@  __mutex_fastpath_lock(atomic_t *count, v
 {
 	if (unlikely(atomic_dec_return(count) < 0))
 		fail_fn(count);
-	else
-		smp_mb();
 }
 
 /**
@@ -41,10 +39,7 @@  __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_dec_return(count) < 0))
 		return fail_fn(count);
-	else {
-		smp_mb();
-		return 0;
-	}
+	return 0;
 }
 
 /**
@@ -63,7 +58,6 @@  __mutex_fastpath_lock_retval(atomic_t *c
 static inline void
 __mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
 {
-	smp_mb();
 	if (unlikely(atomic_inc_return(count) <= 0))
 		fail_fn(count);
 }
@@ -98,15 +92,9 @@  __mutex_fastpath_trylock(atomic_t *count
 	 * just as efficient (and simpler) as a 'destructive' probing of
 	 * the mutex state would be.
 	 */
-#ifdef __HAVE_ARCH_CMPXCHG
-	if (likely(atomic_cmpxchg(count, 1, 0) == 1)) {
-		smp_mb();
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
 		return 1;
-	}
 	return 0;
-#else
-	return fail_fn(count);
-#endif
 }
 
 #endif
Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -27,8 +27,6 @@  __mutex_fastpath_lock(atomic_t *count, v
 {
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		fail_fn(count);
-	else
-		smp_mb();
 }
 
 /**
@@ -46,10 +44,7 @@  __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		return fail_fn(count);
-	else {
-		smp_mb();
-		return 0;
-	}
+	return 0;
 }
 
 /**
@@ -67,7 +62,6 @@  __mutex_fastpath_lock_retval(atomic_t *c
 static inline void
 __mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
 {
-	smp_mb();
 	if (unlikely(atomic_xchg(count, 1) != 0))
 		fail_fn(count);
 }
@@ -110,7 +104,6 @@  __mutex_fastpath_trylock(atomic_t *count
 		if (prev < 0)
 			prev = 0;
 	}
-	smp_mb();
 
 	return prev;
 }