diff mbox

powerpc: Spinlock optimization and cleanup

Message ID 54EF91B2.3020704@linux.vnet.ibm.com
State New

Commit Message

Adhemerval Zanella Feb. 26, 2015, 9:35 p.m. UTC
This patch optimizes powerpc spinlock implementation by:

* The current algorithm spins on a lwzx, but only after first issuing a
  lwarx.  The optimization is to avoid that initial lwarx in the
  contended case, since it is much more costly than normal loads.

* Use the correct EH hint bit on the larx for supported ISAs.  For lock
  acquisition, the thread that acquired the lock with a successful stcx
  does not want to give away write ownership of the cacheline.  The
  idea is to make the load reservation "sticky" about retaining write
  authority to the line.  That way, the store that must inevitably come
  to release the lock can succeed quickly and not contend with other
  threads issuing lwarx.  If another thread does a store to the line
  (false sharing), the winning thread must give up write authority to
  the line.  The proper value of EH for the larx for a lock acquisition
  is 1.

It also adds some cleanup: use the defined acquire-semantics instruction
macros and convert the function prototypes to default C style.

Tested on powerpc32-fpu and powerpc64.
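The acquire loop described above has roughly the following shape in
portable C (a sketch using the GCC __atomic builtins with hypothetical
names, not the patch's actual lwarx/stwcx. asm, which additionally
carries the EH hint):

```c
typedef int spinlock_t;

static void
spin_lock_sketch (spinlock_t *lock)
{
  for (;;)
    {
      /* "Test": spin with plain loads while the lock is held, so that
	 waiters do not take write ownership of the cacheline.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
	;
      /* "TAS": one atomic attempt; acquire ordering only matters on
	 the success path.  */
      int expected = 0;
      if (__atomic_compare_exchange_n (lock, &expected, 1, 0,
				       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
	return;
    }
}

static void
spin_unlock_sketch (spinlock_t *lock)
{
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
}
```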

--

	* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
	Optimize first check for contention case and add lwarx hint.
	* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
	Use ANSI prototype.
	* sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
	* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here.

---

Comments

Torvald Riegel Feb. 27, 2015, 12:20 p.m. UTC | #1
On Thu, 2015-02-26 at 18:35 -0300, Adhemerval Zanella wrote:
> This patch optimizes powerpc spinlock implementation by:
> 
> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>   first.  The optimization for such case is to avoid the first lwarx
>   in contention case (which is much more costly than normal loads).

IOW, the previous code was TAS-and-test-and-TAS loop, and you're
changing it into a test-and-TAS loop?

Are you sure that this gives enough benefit for the contended case to
offset potential disadvantages for the uncontended case in which the
lock is not yet in the cache?  That is, whether the unnecessary
transition to write ownership when the lock is already acquired is more
important to avoid than the potential transition to shared first
followed by a transition to ownership?
(I'm not familiar with powerpc enough to answer any of that, I just want
to make sure the reason for our decisions are clear and documented.
I've done a few measurements on other architectures in the past, and
there the test-and-TAS wasn't necessarily better...)

Do you have benchmarks to show the effects?  In this particular case,
I'd trust the machine maintainers with keeping the decisions we make
up-to-date wrt current hardware (ie, maintain the code so that we always
make the best trade-offs in the future); but it would be even better if
we had benchmarks.

Have you investigated with adding back-off for the contended case?

What are your assumptions about the workload and the relative importance
of individual cases (e.g., likelihood of contended vs. uncontended)?  We
should agree on those globally (ie, across archs), and document them.

> * Use the correct EH hint bit on the larx for supported ISA.  For lock
>   acquisition, the thread that acquired the lock with a successful stcx
>   do not want to give away the write ownership on the cacheline. The
>   idea is to make the load reservation "sticky" about retaining write
>   authority to the line.  That way, the store that must inevitably come
>   to release the lock can succeed quickly and not contend with other
>   threads issuing lwarx.  If another thread does a store to the line
>   (false sharing), the winning thread must give up write authority to
>   The proper value of EH for the larx for a lock acquisition is 1.

I would strongly prefer if we could use C code for our locks, including
the spinlocks.

Ignoring the cacheline ownership hint, is there any reason to use
assembly here (with either the current assembly implementation of
atomics that we have, or the new GCC __atomic builtins)?  If so, what
could / would we need to fix in these implementations of atomics?  Is
there a measurable difference?

Regarding the ownership hint, what about extending the atomics that we
have with support for that?  This hint could be useful in other
concurrent code too, and we don't want to have to clone the, say, whole
rwlock implementation for powerpc just to add a few such hints.

> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> new file mode 100644
> index 0000000..7af694f
> --- /dev/null
> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> @@ -0,0 +1,28 @@
> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "pthreadP.h"
> +#include <lowlevellock.h>
> +
> +int
> +pthread_spin_unlock (pthread_spinlock_t *lock)
> +{
> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> +  *lock = 0;
> +  return 0;
> +}

Is there any reason to not use "atomic_store_release (lock)" here?  Or
adapt the generic pthread_spin_unlock to use just a release barrier
(which I'd be open to doing too), and get rid of the powerpc-specific
file?
Adhemerval Zanella Feb. 27, 2015, 2:57 p.m. UTC | #2
On 27-02-2015 09:20, Torvald Riegel wrote:
> On Thu, 2015-02-26 at 18:35 -0300, Adhemerval Zanella wrote:
>> This patch optimizes powerpc spinlock implementation by:
>>
>> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>>   first.  The optimization for such case is to avoid the first lwarx
>>   in contention case (which is much more costly than normal loads).
> IOW, the previous code was TAS-and-test-and-TAS loop, and you're
> changing it into a test-and-TAS loop?

If by 'test-and-TAS loop' you mean it now loops on the lwarx. instruction,
that is not the case: the first test is a sniff 'check' on the value to
avoid the costly LL/SC instruction that creates the cacheline reservation.
The loop after an uncommitted lwarx. is still done using normal load
operations.

>
> Are you sure that this gives enough benefit for the contended case to
> offset potential disadvantages for the uncontended case in which the
> lock is not yet in the cache?  That is, whether the unnecessary
> transition to write ownership when the lock is already acquired is more
> important to avoid than the potential transition to shared first
> followed by a transition to ownership?
> (I'm not familiar with powerpc enough to answer any of that, I just want
> to make sure the reason for our decisions are clear and documented.
> I've done a few measurements on other architectures in the past, and
> there the test-and-TAS wasn't necessarily better...)

In fact this 'hint' that a test sniff should be used came directly from
our chip designers.  However, they were not really clear on why the sniff
is necessary, just that lwarx. is indeed much more costly and can change
contention and add internal bus traffic in the contended case.  So in
short, I am not 100% sure this is the best course of action.

>
> Do you have benchmarks to show the effects?  In this particular case,
> I'd trust the machine maintainers with keeping the decisions we make
> up-to-date wrt current hardware (ie, maintain the code so that we always
> make the best trade-offs in the future); but it would be even better if
> we had benchmarks.

That's the tricky part: we are discussing this w.r.t. some internal
workloads and results are yet to come.  However, I am trying myself to
check some results using a subpar benchmark (basically sysbench memory
using a spinlock instead of pthread_mutex_t).  I am seeing some gains at
high thread counts, but also some regressions at lower thread counts.
The lwarx hint bit is also making a difference, again with both gains
and regressions.  In the end, I think the workload I am using is not
well suited for this kind of evaluation.  Any suggestions?

> Have you investigated with adding back-off for the contended case?

That is something I have not investigated.

>
> What are your assumptions about the workload and the relative importance
> of individual cases (e.g., likelihood of contended vs. uncontended)?  We
> should agree on those globally (ie, across archs), and document them.

In fact I did not make any assumptions; however, I have noted in recent
powerpc porting work I have been involved in that spinlocks are used by
various DBs (usually hand-crafted ones...).  I do not have data about
what kind of case they assume either.

>
>> * Use the correct EH hint bit on the larx for supported ISA.  For lock
>>   acquisition, the thread that acquired the lock with a successful stcx
>>   do not want to give away the write ownership on the cacheline. The
>>   idea is to make the load reservation "sticky" about retaining write
>>   authority to the line.  That way, the store that must inevitably come
>>   to release the lock can succeed quickly and not contend with other
>>   threads issuing lwarx.  If another thread does a store to the line
>>   (false sharing), the winning thread must give up write authority to
>>   The proper value of EH for the larx for a lock acquisition is 1.
> I would strongly prefer if we could use C code for our locks, including
> the spinlocks.
>
> Ignoring the cacheline ownership hint, is there any reason to use
> assembly here (with either the current assembly implementation of
> atomics that we have, or the new GCC __atomic builtins)?  If so, what
> could / would we need to fix in these implementations of atomics?  Is
> there a measurable difference?
>
> Regarding the ownership hint, what about extending the atomics that we
> have with support for that?  This hint could be useful in other
> concurrent code too, and we don't want to have to clone the, say, whole
> rwlock implementation for powerpc just to add a few such hints.

We already have support for the EH hint in the LL/SC instructions for
powerpc.  The problem is that this algorithm does not really fit the
atomic semantics we have so far, as '__arch_compare_and_exchange_val_32_acq'
issues an 'isync' in both the fail and the succeed case.  The trick here
is to emit the 'isync' only after a successful stwcx. instruction.
Basically the idea is to loop checking whether the lock is released,
using plain loads, 'inside' the CAS-acquire.  Suggestions?

>
>> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> new file mode 100644
>> index 0000000..7af694f
>> --- /dev/null
>> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> @@ -0,0 +1,28 @@
>> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
>> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "pthreadP.h"
>> +#include <lowlevellock.h>
>> +
>> +int
>> +pthread_spin_unlock (pthread_spinlock_t *lock)
>> +{
>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>> +  *lock = 0;
>> +  return 0;
>> +}
> Is there any reason to not use "atomic_store_release (lock)" here?  Or
> adapt the generic pthread_spin_unlock to use just a release barrier
> (which I'd be open to doing too), and get rid of the powerpc-specific
> file?

'atomic_store_release (lock)' generates correct code; I will change it, thanks.
Steven Munroe March 2, 2015, 6:40 p.m. UTC | #3
On Fri, 2015-02-27 at 11:57 -0300, Adhemerval Zanella wrote:
> On 27-02-2015 09:20, Torvald Riegel wrote:
> > On Thu, 2015-02-26 at 18:35 -0300, Adhemerval Zanella wrote:
> >> This patch optimizes powerpc spinlock implementation by:
> >>
> >> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
> >>   first.  The optimization for such case is to avoid the first lwarx
> >>   in contention case (which is much more costly than normal loads).
> > IOW, the previous code was TAS-and-test-and-TAS loop, and you're
> > changing it into a test-and-TAS loop?
> 
> If by 'test-and-TAS loop' you mean it is now loop on lwarx. instruction, that
> is not the case: the first test is a sniff 'check' on value to avoid the costly
> LL/SC instruction that create the cacheline reservation.  The loop over a not
> committed lwarx. is still doing using normal load operations.
> 
> >
> > Are you sure that this gives enough benefit for the contended case to
> > offset potential disadvantages for the uncontended case in which the
> > lock is not yet in the cache?  That is, whether the unnecessary
> > transition to write ownership when the lock is already acquired is more
> > important to avoid than the potential transition to shared first
> > followed by a transition to ownership?
> > (I'm not familiar with powerpc enough to answer any of that, I just want
> > to make sure the reason for our decisions are clear and documented.
> > I've done a few measurements on other architectures in the past, and
> > there the test-and-TAS wasn't necessarily better...)
> 
> In fact this 'hint' that a test sniff should be used came from directly
> from our chip designers.  However their were not really clear why the sniff
> is really necessary, just that lwarx. is indeed way more costly and can
> change contention and add internal bus traffic in contend case.  So in 
> a short, I am not 100% sure this is the best course of action.
> 
The background here is that POWER systems have traditionally run the
coherence protocol (and reservations) from the L2 (for scale-up
servers).  This scales well for large servers with many independent
processes, but shows poorly in single-threaded benchmarks.

Recent processor designs (P7+ and P8) have added improvements that
optimize when the lock word is already in the L1. So a simple load as
the prefetch is helpful on these processors.

Also, the LARX engages some heavy machinery that can slow down atomic
updates (while snooping) from other threads.  So it is best to spin on a
simple load, while blocked, until the lock word state change is visible
to this thread.

We are getting this information directly from the designers of the memory
nest.  This detail should be documented in the POWER8 User's Manual that
is available from the OpenPOWER site.

> >
> > Do you have benchmarks to show the effects?  In this particular case,
> > I'd trust the machine maintainers with keeping the decisions we make
> > up-to-date wrt current hardware (ie, maintain the code so that we always
> > make the best trade-offs in the future); but it would be even better if
> > we had benchmarks.
> 
> That's the trick part: we are discussing this w.r.t some internal workloads nad
> results are still yet to come.  However, I am trying myself to check some results
> using a subpar benchmark (basically sysbench memory using spinlock instead of 
> pthread_mutex_t).  I am seeing some results with hight threads counts, but also
> some regression on lower threads counts.  Also, the lwarx hint bit is showing
> some difference and also it shows some gains and regressions.  In the end, I think
> the workloads I am using is not best suited for this kind of evaluation. Any
> suggestions?
> 
> > Have you investigated with adding back-off for the contended case?
> 
> That something I have not investigate.
> 
> >
> > What are your assumptions about the workload and the relative importance
> > of individual cases (e.g., likelihood of contended vs. uncontended)?  We
> > should agree on those globally (ie, across archs), and document them.
> 
> In fact I did not hold any assumptions, however I have noted on recent powerpc
> ports I have been involved that spinlocks are used on various DBs (and usually
> with hand-crafted ones...).  However I do not have data about what kind of case
> they assume either.
> 
> >
> >> * Use the correct EH hint bit on the larx for supported ISA.  For lock
> >>   acquisition, the thread that acquired the lock with a successful stcx
> >>   do not want to give away the write ownership on the cacheline. The
> >>   idea is to make the load reservation "sticky" about retaining write
> >>   authority to the line.  That way, the store that must inevitably come
> >>   to release the lock can succeed quickly and not contend with other
> >>   threads issuing lwarx.  If another thread does a store to the line
> >>   (false sharing), the winning thread must give up write authority to
> >>   The proper value of EH for the larx for a lock acquisition is 1.
> > I would strongly prefer if we could use C code for our locks, including
> > the spinlocks.
> >
> > Ignoring the cacheline ownership hint, is there any reason to use
> > assembly here (with either the current assembly implementation of
> > atomics that we have, or the new GCC __atomic builtins)?  If so, what
> > could / would we need to fix in these implementations of atomics?  Is
> > there a measurable difference?
> >
> > Regarding the ownership hint, what about extending the atomics that we
> > have with support for that?  This hint could be useful in other
> > concurrent code too, and we don't want to have to clone the, say, whole
> > rwlock implementation for powerpc just to add a few such hints.
> 
> We already have support for EH hint in LL/SC instruction for powerpc.  The trick
> in this algorithm is it does not really fix in current atomic semantics we have
> so far, as '__arch_compare_and_exchange_val_32_acq' issues an 'isync' in either
> fail or succeed case.  The trick in this case is just to emit the 'isync' just
> after the stcx. instruction.  Basically the idea is to loop over checking
> when lock is release using default loads 'inside' the CAS-acquire.  Suggestions?
> 
> >
> >> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> >> new file mode 100644
> >> index 0000000..7af694f
> >> --- /dev/null
> >> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> >> @@ -0,0 +1,28 @@
> >> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
> >> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include "pthreadP.h"
> >> +#include <lowlevellock.h>
> >> +
> >> +int
> >> +pthread_spin_unlock (pthread_spinlock_t *lock)
> >> +{
> >> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >> +  *lock = 0;
> >> +  return 0;
> >> +}
> > Is there any reason to not use "atomic_store_release (lock)" here?  Or
> > adapt the generic pthread_spin_unlock to use just a release barrier
> > (which I'd be open to doing too), and get rid of the powerpc-specific
> > file?
> 
> atomic_store_release (lock) is generating correct code, I will change it, thanks.
>
Richard Henderson March 10, 2015, 3:34 p.m. UTC | #4
On 02/27/2015 06:57 AM, Adhemerval Zanella wrote:
> We already have support for EH hint in LL/SC instruction for powerpc.  The trick
> in this algorithm is it does not really fix in current atomic semantics we have
> so far, as '__arch_compare_and_exchange_val_32_acq' issues an 'isync' in either
> fail or succeed case.  The trick in this case is just to emit the 'isync' just
> after the stcx. instruction.  Basically the idea is to loop over checking
> when lock is release using default loads 'inside' the CAS-acquire.  Suggestions?

This does sound like the two memory model inputs to the builtin
__atomic_compare_exchange_n: one for the success case and one for the
failure case.
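For reference, the builtin's separate success/failure memory orders map
directly onto "barrier only after a successful stwcx.": the failed
attempt needs no 'isync' at all.  A sketch (hypothetical helper name,
not glibc code):

```c
#include <stdbool.h>

/* Acquire ordering only when the CAS succeeds; a failed attempt (lock
   already held) is relaxed, matching emitting isync only after a
   successful stwcx.  */
static bool
try_lock (int *lock)
{
  int expected = 0;
  return __atomic_compare_exchange_n (lock, &expected, 1,
				      /* weak */ false,
				      /* success */ __ATOMIC_ACQUIRE,
				      /* failure */ __ATOMIC_RELAXED);
}
```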



r~

Patch

diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
index cc081f8..8a39da9 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
@@ -19,24 +19,26 @@ 
 #include "pthreadP.h"
 
 int
-pthread_spin_lock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_lock (pthread_spinlock_t *lock)
 {
   unsigned int __tmp;
 
   asm volatile (
-       "1:	lwarx	%0,0,%1\n"
+       "0:	lwzx    %0,0,%1\n"
+       "	cmpwi   0,%0,0\n"
+       "	bne	0b\n"
+       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
        "	cmpwi	0,%0,0\n"
        "	bne-	2f\n"
        "	stwcx.	%2,0,%1\n"
        "	bne-	2f\n"
-       "	isync\n"
-       "	.subsection 1\n"
-       "2:	lwzx	%0,0,%1\n"
-       "	cmpwi	0,%0,0\n"
-       "	bne	2b\n"
-       "	b	1b\n"
-       "	.previous"
+                __ARCH_ACQ_INSTR "\n"
+       "        .subsection 1\n"
+       "2:	lwzx    %0,0,%1\n"
+       "        cmpwi   0,%0,0\n"
+       "        bne     2b\n"
+       "        b       1b\n"
+       "        .previous"
        : "=&r" (__tmp)
        : "r" (lock), "r" (1)
        : "cr0", "memory");
diff --git a/sysdeps/powerpc/nptl/pthread_spin_trylock.c b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
index 77a4615..c485aa4 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_trylock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
@@ -20,8 +20,7 @@ 
 #include "pthreadP.h"
 
 int
-pthread_spin_trylock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_trylock (pthread_spinlock_t *lock)
 {
   unsigned int old;
   int err = EBUSY;
diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
new file mode 100644
index 0000000..7af694f
--- /dev/null
+++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
@@ -0,0 +1,28 @@ 
+/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
+   Copyright (C) 2007-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <lowlevellock.h>
+
+int
+pthread_spin_unlock (pthread_spinlock_t *lock)
+{
+  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
+  *lock = 0;
+  return 0;
+}
diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
deleted file mode 100644
index 7af694f..0000000
--- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
+++ /dev/null
@@ -1,28 +0,0 @@ 
-/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
-   Copyright (C) 2007-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-int
-pthread_spin_unlock (pthread_spinlock_t *lock)
-{
-  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
-  *lock = 0;
-  return 0;
-}