Patchwork powerpc: implement optimised mutex fastpaths

Submitter: Nick Piggin
Date: Oct. 12, 2008, 5:47 a.m.
Message ID: <20081012054732.GB12535@wotan.suse.de>
Permalink: /patch/4048/
State: Awaiting Upstream
Delegated to: Paul Mackerras

Comments

Nick Piggin - Oct. 12, 2008, 5:47 a.m.
Implement a more optimal mutex fastpath for powerpc, making use of acquire
and release barrier semantics. This takes the mutex lock+unlock benchmark
from 203 to 173 cycles on a G5.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
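The benchmark behind the 203-to-173 figure is not included in the thread. A
minimal harness in that spirit would time a tight lock/unlock loop against
the timebase register, along the lines of the sketch below; the module
boilerplate, iteration count, and timebase-to-cycles scaling are all
assumptions, and on a G5 the timebase ticks at a fraction of the core clock,
so raw deltas must be scaled to get cycle counts.

/*
 * Hypothetical microbenchmark sketch -- the actual harness used for the
 * numbers above is not shown in the thread.
 */
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/kernel.h>

static DEFINE_MUTEX(bench_mutex);

static inline unsigned long mftb(void)
{
	unsigned long tb;

	asm volatile("mftb %0" : "=r" (tb));	/* read the timebase */
	return tb;
}

static int __init bench_init(void)
{
	const unsigned long iters = 1000000;	/* assumed iteration count */
	unsigned long i, start, end;

	start = mftb();
	for (i = 0; i < iters; i++) {
		mutex_lock(&bench_mutex);
		mutex_unlock(&bench_mutex);
	}
	end = mftb();

	/* Timebase ticks per lock+unlock pair; multiply by
	 * core-clock / timebase-frequency to estimate cycles. */
	printk(KERN_INFO "mutex bench: %lu tb ticks per iteration\n",
	       (end - start) / iters);
	return 0;
}

static void __exit bench_exit(void)
{
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");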
Nick Piggin - Oct. 13, 2008, 1:18 a.m.
On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> 
> Implement a more optimal mutex fastpath for powerpc, making use of acquire
> and release barrier semantics. This takes the mutex lock+unlock benchmark
> from 203 to 173 cycles on a G5.
> 
> +static inline int
> +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> +{
> +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> +		return 1;

Oops, I must have sent the wrong version. This needs a return 0 here.

> +}
Scott Wood - Oct. 13, 2008, 4:15 p.m.
On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> +{
> +	int t;
> +
> +	__asm__ __volatile__ (
> +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> +	cmpw	0,%0,%2\n\
> +	bne-	2f\n"
> +	PPC405_ERR77(0,%1)
> +"	stwcx.	%3,0,%1\n\
> +	bne-	1b"
> +	ISYNC_ON_SMP
> +	"\n\
> +2:"
> +	: "=&r" (t)
> +	: "r" (&v->counter), "r" (old), "r" (new)
> +	: "cc", "memory");

This will break if the compiler picks r0 for &v->counter.  Use "b" as the
constraint, or better yet do "lwarx %0, %y1", with a "Z" (v->counter)
constraint.

-Scott
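The "Z"/%y form suggested above would look roughly like this (a sketch, not
part of the posted patch): the "Z" constraint hands gcc an indexed-addressable
memory operand, and the %y output modifier prints it as the "0,reg" (or
"base,index") pair that lwarx and stwcx. expect. The PPC405_ERR77 workaround
is omitted here for brevity.

static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
{
	int t;

	__asm__ __volatile__ (
"1:	lwarx	%0,%y1		# mutex trylock\n"
"	cmpw	0,%0,%2\n"
"	bne-	2f\n"
"	stwcx.	%3,%y1\n"
"	bne-	1b"
	ISYNC_ON_SMP
	"\n2:"
	: "=&r" (t), "+Z" (v->counter)
	: "r" (old), "r" (new)
	: "cc", "memory");

	return t;
}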
Scott Wood - Oct. 13, 2008, 4:20 p.m.
On Mon, Oct 13, 2008 at 11:15:47AM -0500, Scott Wood wrote:
> On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> > +{
> > +	int t;
> > +
> > +	__asm__ __volatile__ (
> > +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> > +	cmpw	0,%0,%2\n\
> > +	bne-	2f\n"
> > +	PPC405_ERR77(0,%1)
> > +"	stwcx.	%3,0,%1\n\
> > +	bne-	1b"
> > +	ISYNC_ON_SMP
> > +	"\n\
> > +2:"
> > +	: "=&r" (t)
> > +	: "r" (&v->counter), "r" (old), "r" (new)
> > +	: "cc", "memory");
> 
> This will break if the compiler picks r0 for &v->counter.  Use "b" as the
> constraint, or better yet do "lwarx %0, %y1", with a "Z" (v->counter)
> constraint.

Sorry, had a morning brain-fart there -- you're not using the base
register: %1 sits in the RB slot and RA is a literal 0, and r0 is only
treated specially as a base (RA), so this is fine as it stands.

-Scott
Nick Piggin - Oct. 14, 2008, 7:06 a.m.
On Mon, Oct 13, 2008 at 11:20:20AM -0500, Scott Wood wrote:
> On Mon, Oct 13, 2008 at 11:15:47AM -0500, Scott Wood wrote:
> > On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > > +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> > > +{
> > > +	int t;
> > > +
> > > +	__asm__ __volatile__ (
> > > +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> > > +	cmpw	0,%0,%2\n\
> > > +	bne-	2f\n"
> > > +	PPC405_ERR77(0,%1)
> > > +"	stwcx.	%3,0,%1\n\
> > > +	bne-	1b"
> > > +	ISYNC_ON_SMP
> > > +	"\n\
> > > +2:"
> > > +	: "=&r" (t)
> > > +	: "r" (&v->counter), "r" (old), "r" (new)
> > > +	: "cc", "memory");
> > 
> > This will break if the compiler picks r0 for &v->counter.  Use "b" as the
> > constraint, or better yet do "lwarx %0, %y1", with a "Z" (v->counter)
> > constraint.
> 
> Sorry, had a morning brain-fart there -- you're not using the base
> register.

OK no problem. Do you think it looks OK? Thanks for reviewing...
Paul Mackerras - Nov. 6, 2008, 4:09 a.m.
Nick Piggin writes:

> On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > 
> > Implement a more optimal mutex fastpath for powerpc, making use of acquire
> > and release barrier semantics. This takes the mutex lock+unlock benchmark
> > from 203 to 173 cycles on a G5.
> > 
> > +static inline int
> > +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> > +{
> > +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> > +		return 1;
> 
> Oops, I must have sent the wrong version. This needs a return 0 here.

Are you going to send the right version?  (or did you send it and I
missed it?)

Also I note that the comment you added just before that routine says:

+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.

yet just doing a __mutex_cmpxchg_lock isn't going to do that.  So
please fix either the comment or the code (or both :).

Paul.
Nick Piggin - Nov. 6, 2008, 5:06 a.m.
On Thu, Nov 06, 2008 at 03:09:08PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
> 
> > On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > > 
> > > Implement a more optimal mutex fastpath for powerpc, making use of acquire
> > > and release barrier semantics. This takes the mutex lock+unlock benchmark
> > > from 203 to 173 cycles on a G5.
> > > 
> > > +static inline int
> > > +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> > > +{
> > > +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> > > +		return 1;
> > 
> > Oops, I must have sent the wrong version. This needs a return 0 here.
> 
> Are you going to send the right version?  (or did you send it and I
> missed it?)

Hmm, I thought I did but perhaps not. Will do...


> Also I note that the comment you added just before that routine says:
> 
> + * Change the count from 1 to a value lower than 1, and return 0 (failure)
> + * if it wasn't 1 originally, or return 1 (success) otherwise. This function
> + * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
> 
> yet just doing a __mutex_cmpxchg_lock isn't going to do that.  So
> please fix either the comment or the code (or both :).

OK.
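
For reference, the trylock with the missing return restored -- the fix Nick
describes above -- would read as follows. Note that a bare cmpxchg leaves the
count untouched on failure, so the "MUST leave the value lower than 1" wording
inherited from mutex-dec.h still needs reconciling, per Paul's comment.

static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
{
	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
		return 1;
	return 0;	/* the line missing from the posted patch */
}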

Patch

Index: linux-2.6/arch/powerpc/include/asm/mutex.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/mutex.h
+++ linux-2.6/arch/powerpc/include/asm/mutex.h
@@ -1,9 +1,136 @@ 
 /*
- * Pull in the generic implementation for the mutex fastpath.
+ * Optimised mutex implementation of include/asm-generic/mutex-dec.h algorithm
+ */
+#ifndef _ASM_POWERPC_MUTEX_H
+#define _ASM_POWERPC_MUTEX_H
+
+static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
+{
+	int t;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%1		# mutex trylock\n\
+	cmpw	0,%0,%2\n\
+	bne-	2f\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%3,0,%1\n\
+	bne-	1b"
+	ISYNC_ON_SMP
+	"\n\
+2:"
+	: "=&r" (t)
+	: "r" (&v->counter), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return t;
+}
+
+static inline int __mutex_dec_return_lock(atomic_t *v)
+{
+	int t;
+
+	__asm__ __volatile__(
+"1:	lwarx	%0,0,%1		# mutex lock\n\
+	addic	%0,%0,-1\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%0,0,%1\n\
+	bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (t)
+	: "r" (&v->counter)
+	: "cc", "memory");
+
+	return t;
+}
+
+static inline int __mutex_inc_return_unlock(atomic_t *v)
+{
+	int t;
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:	lwarx	%0,0,%1		# mutex unlock\n\
+	addic	%0,%0,1\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%0,0,%1 \n\
+	bne-	1b"
+	: "=&r" (t)
+	: "r" (&v->counter)
+	: "cc", "memory");
+
+	return t;
+}
+
+/**
+ *  __mutex_fastpath_lock - try to take the lock by moving the count
+ *                          from 1 to a 0 value
+ *  @count: pointer of type atomic_t
+ *  @fail_fn: function to call if the original value was not 1
+ *
+ * Change the count from 1 to a value lower than 1, and call <fail_fn> if
+ * it wasn't 1 originally. This function MUST leave the value lower than
+ * 1 even when the "1" assertion wasn't true.
+ */
+static inline void
+__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_dec_return_lock(count) < 0))
+		fail_fn(count);
+}
+
+/**
+ *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ *                                 from 1 to a 0 value
+ *  @count: pointer of type atomic_t
+ *  @fail_fn: function to call if the original value was not 1
+ *
+ * Change the count from 1 to a value lower than 1, and call <fail_fn> if
+ * it wasn't 1 originally. This function returns 0 if the fastpath succeeds,
+ * or anything the slow path function returns.
+ */
+static inline int
+__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_dec_return_lock(count) < 0))
+		return fail_fn(count);
+	return 0;
+}
+
+/**
+ *  __mutex_fastpath_unlock - try to promote the count from 0 to 1
+ *  @count: pointer of type atomic_t
+ *  @fail_fn: function to call if the original value was not 0
+ *
+ * Try to promote the count from 0 to 1. If it wasn't 0, call <fail_fn>.
+ * In the failure case, this function is allowed to either set the value to
+ * 1, or to set it to a value lower than 1.
+ */
+static inline void
+__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_inc_return_unlock(count) <= 0))
+		fail_fn(count);
+}
+
+#define __mutex_slowpath_needs_to_unlock()		1
+
+/**
+ * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *  @fail_fn: fallback function
  *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
  */
+static inline int
+__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
+{
+	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
+		return 1;
+}
 
-#include <asm-generic/mutex-dec.h>
+#endif
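
For context, these helpers are invoked from the generic mutex code; in the
2.6.27-era kernel/mutex.c the fastpaths hook in roughly as follows
(paraphrased, comments abridged).

void __sched mutex_lock(struct mutex *lock)
{
	might_sleep();
	/* locking fastpath: the 1->0 transition from unlocked to locked */
	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
}

void __sched mutex_unlock(struct mutex *lock)
{
	/* unlocking fastpath: the 0->1 transition back to unlocked */
	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
}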