
powerpc: remove linux lowlevellock.h

Message ID 5421AF5E.8030402@linux.vnet.ibm.com
State New

Commit Message

Adhemerval Zanella Sept. 23, 2014, 5:35 p.m. UTC
This patch removes the powerpc-specific lowlevellock.h and adjusts the
implementations that rely on the __lll_[rel/acq]_instr defines.

Checked on powerpc32, powerpc64, and powerpc64le.

I will check this in tomorrow if no one opposes.

--

	* sysdeps/unix/sysv/linux/powerpc/lowlevellock.h: Remove file.
	* sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
	(pthread_spin_unlock): Use __ARCH_REL_INSTR instead of
	__lll_rel_instr and __ARCH_ACQ_INSTR instead of __lll_acq_instr.
	* sysdeps/unix/sysv/linux/powerpc/sem_post.c (__new_sem_post):
	Likewise.
	(__old_sem_post): Likewise.

---

Comments

Torvald Riegel Sept. 24, 2014, 10:26 a.m. UTC | #1
On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> index 2b8c84d..4906205 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> @@ -22,7 +22,7 @@
>  int
>  pthread_spin_unlock (pthread_spinlock_t *lock)
>  {
> -  __asm __volatile (__lll_rel_instr ::: "memory");
> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>    *lock = 0;
>    return 0;
>  }

I'd prefer if this would be a memory_order_release barrier (which we
don't have yet, I know, so bear with me).  However, I see that the
current situation is a bit messy:
* It seems that atomic_write_barrier is intended to be a C11
memory_order_release barrier/fence; at least it's used like that in
everything I have seen in glibc.  So atomic_write_barrier would be our
current way to express a release barrier.
* However, PowerPC atomic_write_barrier is "eieio" and is not using
__ARCH_REL_INSTR.  Can you clarify the difference?

I'm interested in having a release barrier there so that this won't get
overlooked once we do the actual transition to C11 atomics.

Also, related to that, it would be ideal if we could use the generic
implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
this one uses atomic_full_barrier, which is stronger than what's used on
the major archs (x86, Power, ARM).
I strongly believe this should use a release barrier, despite what POSIX
is claiming (see #16432), but we'll have to deal with that separately.
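
To make the intent concrete, here is a rough sketch of a C11-style release
unlock, with a stand-in lock type (this is not the glibc code):

  #include <stdatomic.h>

  typedef atomic_int sketch_spinlock_t;   /* stand-in lock type, for illustration */

  static inline void
  sketch_spin_unlock (sketch_spinlock_t *lock)
  {
    /* Release semantics only: on PowerPC this becomes 'lwsync' plus a plain
       store, on x86 just a plain store; both are cheaper than a full barrier.  */
    atomic_store_explicit (lock, 0, memory_order_release);
  }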

> diff --git a/sysdeps/unix/sysv/linux/powerpc/sem_post.c b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
> index f222d9a..831a8dd 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/sem_post.c
> +++ b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
> @@ -30,9 +30,9 @@ __new_sem_post (sem_t *sem)
>  {
>    struct new_sem *isem = (struct new_sem *) sem;
>  
> -  __asm __volatile (__lll_rel_instr ::: "memory");
> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>    atomic_increment (&isem->value);
> -  __asm __volatile (__lll_acq_instr ::: "memory");
> +  __asm __volatile (__ARCH_ACQ_INSTR ::: "memory");
>    if (isem->nwaiters > 0)
>      {
>        int err = lll_futex_wake (&isem->value, 1,
> @@ -55,7 +55,7 @@ __old_sem_post (sem_t *sem)
>  {
>    int *futex = (int *) sem;
>  
> -  __asm __volatile (__lll_rel_instr ::: "memory");
> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>    (void) atomic_increment_val (futex);
>    /* We always have to assume it is a shared semaphore.  */
>    int err = lll_futex_wake (futex, 1, LLL_SHARED);

These are fine, though, as we'll need to change the semaphore algorithm
anyway to fulfill the destruction guarantees (I have a WIP
implementation that I plan to post in the near future).
Adhemerval Zanella Sept. 24, 2014, 11:55 a.m. UTC | #2
On 24-09-2014 07:26, Torvald Riegel wrote:
> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>> index 2b8c84d..4906205 100644
>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>> @@ -22,7 +22,7 @@
>>  int
>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>  {
>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>    *lock = 0;
>>    return 0;
>>  }
> I'd prefer if this would be a memory_order_release barrier (which we
> don't have yet, I know, so bear with me).  However, I see that the
> current situation is a bit messy:
> * It seems that atomic_write_barrier is intended to be a C11
> memory_order_release barrier/fence; at least it's used like that in all
> what I saw in glibc.  So atomic_write_barrier would be our current way
> to express a release barrier.
> * However, PowerPC atomic_write_barrier is "eieio" and is not using
> __ARCH_REL_INSTR.  Can you clarify the difference?

These two pages [1] [2] have detailed information about C11/C++11 memory order
mappings to PowerPC instructions.

As you can see, the store release is indeed 'lwsync; st'.  The first link
also gives an example of a spinlock implementation; however, it uses a full
memory barrier (sync) instead of a write barrier (lwsync).  This is better
explained in the second link:

"If the critical section contains only normal cached memory operations, 
the lwsync barrier may be substituted for the heavier-weight sync instruction
above."

Non-cached memory is handled by the kernel and exported to applications only
in very specific situations (and AFAIK the C11/C++11 specifications don't even
handle it).  The idea is that a process won't try to run a spinlock in a DMA area.

Also, the 'Synchronizes With' section gives some explanation of the relation
between 'eieio' and 'lwsync'.  My understanding is that 'lwsync' is the correct
memory ordering instruction to use in such cases.

[1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
[2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
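
For illustration, a minimal sketch (not glibc code) of that store-release
mapping, written with the same kind of inline asm used in the patch above:

  /* Sketch only: a release store to cacheable memory, i.e. the
     'lwsync; st' mapping from [1].  */
  static inline void
  release_store_int (volatile int *p, int value)
  {
    __asm __volatile ("lwsync" ::: "memory");   /* release fence */
    *p = value;                                 /* plain store */
  }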


>
> I'm interested in having a release barrier there so that this won't get
> overlooked once we do the actual transition to C11 atomics.
>
> Also, related to that, it would be ideal if we could use the generic
> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
> this one uses atomic_full_barrier, which is stronger than what's used on
> the major archs (x86, Power, ARM).
> I strongly believe this should use a release barrier, despite what POSIX
> is claiming (see #16432), but we'll have to deal with that separately.

I believe this is exactly why we have a specific POWER implementation: the
atomic full barrier on powerpc is 'sync', and for this spinlock case 'lwsync' suffices.

>
>> diff --git a/sysdeps/unix/sysv/linux/powerpc/sem_post.c b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
>> index f222d9a..831a8dd 100644
>> --- a/sysdeps/unix/sysv/linux/powerpc/sem_post.c
>> +++ b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
>> @@ -30,9 +30,9 @@ __new_sem_post (sem_t *sem)
>>  {
>>    struct new_sem *isem = (struct new_sem *) sem;
>>  
>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>    atomic_increment (&isem->value);
>> -  __asm __volatile (__lll_acq_instr ::: "memory");
>> +  __asm __volatile (__ARCH_ACQ_INSTR ::: "memory");
>>    if (isem->nwaiters > 0)
>>      {
>>        int err = lll_futex_wake (&isem->value, 1,
>> @@ -55,7 +55,7 @@ __old_sem_post (sem_t *sem)
>>  {
>>    int *futex = (int *) sem;
>>  
>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>    (void) atomic_increment_val (futex);
>>    /* We always have to assume it is a shared semaphore.  */
>>    int err = lll_futex_wake (futex, 1, LLL_SHARED);
> These are fine, though, as we'll need to change the semaphore algorithm
> anyway to fulfill the destruction guarantees (I have a WIP
> implementation that I plan to post in the near future).
>
>
>
Torvald Riegel Sept. 24, 2014, 1:29 p.m. UTC | #3
On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
> On 24-09-2014 07:26, Torvald Riegel wrote:
> > On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> >> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >> index 2b8c84d..4906205 100644
> >> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >> @@ -22,7 +22,7 @@
> >>  int
> >>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>  {
> >> -  __asm __volatile (__lll_rel_instr ::: "memory");
> >> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >>    *lock = 0;
> >>    return 0;
> >>  }
> > I'd prefer if this would be a memory_order_release barrier (which we
> > don't have yet, I know, so bear with me).  However, I see that the
> > current situation is a bit messy:
> > * It seems that atomic_write_barrier is intended to be a C11
> > memory_order_release barrier/fence; at least it's used like that in all
> > what I saw in glibc.  So atomic_write_barrier would be our current way
> > to express a release barrier.
> > * However, PowerPC atomic_write_barrier is "eieio" and is not using
> > __ARCH_REL_INSTR.  Can you clarify the difference?
> 
> These two pages [1] [2] have detailed information about C11/C++11 memory order
> mappings to PowerPC instructions.

I'm aware of these ...

> As you can see the store release is indeed 'lwsync; st'.  First link also
> gives you an example of a spinlock implementation, however it is using a 
> full memory barrier (sync) instead of a write one (lwsync). It is better
> explained in second link:
> 
> "If the critical section contains only normal cached memory operations, 
> the lwsync barrier may be substituted for the heavier-weight sync instruction
> above."
> 
> Non-cached memory are handled by kernel and exported to application only in very
> specific situations (and afaik C11/C++11 specifications don't even handle it).
> The idea is a process won't try to run a spinlock in a DMA area. 
> 
> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
> relation. My understanding is 'lwsync' is the correct memory order instruction
> to use in such cases.
> 
> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html

... but not of the difference between eieio and lwsync.  AFAICT, N2745
states that eieio really applies only to stores.

What are your thoughts about changing atomic_write_barrier to eieio?
Are you aware of any prior reasoning about that (e.g., whether
atomic_write_barrier really only needs to prevent reordering of writes)?
Would there be a significant performance degradation?

> >
> > I'm interested in having a release barrier there so that this won't get
> > overlooked once we do the actual transition to C11 atomics.
> >
> > Also, related to that, it would be ideal if we could use the generic
> > implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
> > this one uses atomic_full_barrier, which is stronger than what's used on
> > the major archs (x86, Power, ARM).

BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
full barrier effectively.

> > I strongly believe this should use a release barrier, despite what POSIX
> > is claiming (see #16432), but we'll have to deal with that separately.
> 
> I believe this is exactly why we have a specific POWER implementation: atomic
> full barrier in powerpc are 'sync' and for this spinlock cases 'lwsync' is suffice.

But it shouldn't be POWER-specific, IMO.  Instead, the generic
implementation should have a release barrier only (so, translating to
lwsync on power), and all archs negatively affected by this should
update their release barriers to be proper release barriers.
Adhemerval Zanella Sept. 24, 2014, 1:58 p.m. UTC | #4
On 24-09-2014 10:29, Torvald Riegel wrote:
> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
>> On 24-09-2014 07:26, Torvald Riegel wrote:
>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>> index 2b8c84d..4906205 100644
>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>> @@ -22,7 +22,7 @@
>>>>  int
>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>>>  {
>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>>>    *lock = 0;
>>>>    return 0;
>>>>  }
>>> I'd prefer if this would be a memory_order_release barrier (which we
>>> don't have yet, I know, so bear with me).  However, I see that the
>>> current situation is a bit messy:
>>> * It seems that atomic_write_barrier is intended to be a C11
>>> memory_order_release barrier/fence; at least it's used like that in all
>>> what I saw in glibc.  So atomic_write_barrier would be our current way
>>> to express a release barrier.
>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
>>> __ARCH_REL_INSTR.  Can you clarify the difference?
>> These two pages [1] [2] have detailed information about C11/C++11 memory order
>> mappings to PowerPC instructions.
> I'm aware of these ...
>
>> As you can see the store release is indeed 'lwsync; st'.  First link also
>> gives you an example of a spinlock implementation, however it is using a 
>> full memory barrier (sync) instead of a write one (lwsync). It is better
>> explained in second link:
>>
>> "If the critical section contains only normal cached memory operations, 
>> the lwsync barrier may be substituted for the heavier-weight sync instruction
>> above."
>>
>> Non-cached memory are handled by kernel and exported to application only in very
>> specific situations (and afaik C11/C++11 specifications don't even handle it).
>> The idea is a process won't try to run a spinlock in a DMA area. 
>>
>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
>> relation. My understanding is 'lwsync' is the correct memory order instruction
>> to use in such cases.
>>
>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
> states that eieio really applies only to stores.
>
> What are your thoughts about changing atomic_write_barrier to eieio?
> Are you aware of any prior reasoning about that (e.g., whether
> atomic_write_barrier really only needs to prevent reordering of writes)?
> Would there be a significant performance degradation?

Reading my last message again, I think I wasn't clear: my understanding is
that 'eieio' is a safe atomic_write_barrier for both cacheable and
cache-inhibited memory, and it is preferred over 'sync' for performance reasons.

Now, for cacheable memory (the default memory allocated by the kernel),
'lwsync' is a safe atomic_write_barrier and it also performs better than
'eieio' (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations).
That's why I see no need to change pthread_spin_unlock to use 'eieio'.
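
A quick way to check this is to compile a small test (the file name here is
just illustrative) and inspect the emitted barrier:

  /* barrier-test.c (name is illustrative): with 'gcc -O2 -S' on powerpc,
     the release store below should be emitted as 'lwsync' followed by the
     store, not as 'eieio'.  */
  #include <stdatomic.h>

  void
  release_store (atomic_int *p, int value)
  {
    atomic_store_explicit (p, value, memory_order_release);
  }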

>
>>> I'm interested in having a release barrier there so that this won't get
>>> overlooked once we do the actual transition to C11 atomics.
>>>
>>> Also, related to that, it would be ideal if we could use the generic
>>> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
>>> this one uses atomic_full_barrier, which is stronger than what's used on
>>> the major archs (x86, Power, ARM).
> BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
> full barrier effectively.

A full barrier does work on PowerPC; the specific optimization to use the
atomic write barrier 'lwsync' was added later.

>
>>> I strongly believe this should use a release barrier, despite what POSIX
>>> is claiming (see #16432), but we'll have to deal with that separately.
>> I believe this is exactly why we have a specific POWER implementation: atomic
>> full barrier in powerpc are 'sync' and for this spinlock cases 'lwsync' is suffice.
> But it shouldn't be POWER-specific, IMO.  Instead, the generic
> implementation should have a release barrier only (so, translating to
> lwsync on power), and all archs negatively affected by this should
> update their release barriers to be proper release barriers.

I agree with this definition.  Do you have any updates on the discussion you
raised in comment #1 of BZ#16432?
Torvald Riegel Sept. 24, 2014, 3:37 p.m. UTC | #5
On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
> On 24-09-2014 10:29, Torvald Riegel wrote:
> > On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
> >> On 24-09-2014 07:26, Torvald Riegel wrote:
> >>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> >>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>> index 2b8c84d..4906205 100644
> >>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>> @@ -22,7 +22,7 @@
> >>>>  int
> >>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>>>  {
> >>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
> >>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >>>>    *lock = 0;
> >>>>    return 0;
> >>>>  }
> >>> I'd prefer if this would be a memory_order_release barrier (which we
> >>> don't have yet, I know, so bear with me).  However, I see that the
> >>> current situation is a bit messy:
> >>> * It seems that atomic_write_barrier is intended to be a C11
> >>> memory_order_release barrier/fence; at least it's used like that in all
> >>> what I saw in glibc.  So atomic_write_barrier would be our current way
> >>> to express a release barrier.
> >>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
> >>> __ARCH_REL_INSTR.  Can you clarify the difference?
> >> These two pages [1] [2] have detailed information about C11/C++11 memory order
> >> mappings to PowerPC instructions.
> > I'm aware of these ...
> >
> >> As you can see the store release is indeed 'lwsync; st'.  First link also
> >> gives you an example of a spinlock implementation, however it is using a 
> >> full memory barrier (sync) instead of a write one (lwsync). It is better
> >> explained in second link:
> >>
> >> "If the critical section contains only normal cached memory operations, 
> >> the lwsync barrier may be substituted for the heavier-weight sync instruction
> >> above."
> >>
> >> Non-cached memory are handled by kernel and exported to application only in very
> >> specific situations (and afaik C11/C++11 specifications don't even handle it).
> >> The idea is a process won't try to run a spinlock in a DMA area. 
> >>
> >> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
> >> relation. My understanding is 'lwsync' is the correct memory order instruction
> >> to use in such cases.
> >>
> >> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> >> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> > ... but not of the difference between eieio and lwsync.  AFAICT, N2745
> > states that eieio really applies only to stores.
> >
> > What are your thoughts about changing atomic_write_barrier to eieio?
> > Are you aware of any prior reasoning about that (e.g., whether
> > atomic_write_barrier really only needs to prevent reordering of writes)?
> > Would there be a significant performance degradation?
> 
> Reading again my last message I think I wasn't clear: my understanding of
> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
> memory. And it is preferred over 'sync' for performance reasons.
> 
> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
> is a safe atomic_write_barrier and it also performance better than 'eieio'
> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
> I see no need to change it to 'eieio' in the pthread_spin_unlock.

Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
meant changing it to using lwsync :)
So, what are your thoughts about that?

> >
> >>> I'm interested in having a release barrier there so that this won't get
> >>> overlooked once we do the actual transition to C11 atomics.
> >>>
> >>> Also, related to that, it would be ideal if we could use the generic
> >>> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
> >>> this one uses atomic_full_barrier, which is stronger than what's used on
> >>> the major archs (x86, Power, ARM).
> > BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
> > full barrier effectively.
> 
> A full barrier do work on PowerPC, the specific optimization to use the atomic
> write 'lwsync' was added later.

Of course the full barrier would work as well.  My concern is about what
glibc should provide: current practice on the major archs (x86 and
POWER) is weaker than what POSIX requires -- yet it is the right semantics in
my opinion.  To me, this indicates that this would work as a default in
practice, and that the default implementation could use a release
barrier too.

> >
> >>> I strongly believe this should use a release barrier, despite what POSIX
> >>> is claiming (see #16432), but we'll have to deal with that separately.
> >> I believe this is exactly why we have a specific POWER implementation: atomic
> >> full barrier in powerpc are 'sync' and for this spinlock cases 'lwsync' is suffice.
> > But it shouldn't be POWER-specific, IMO.  Instead, the generic
> > implementation should have a release barrier only (so, translating to
> > lwsync on power), and all archs negatively affected by this should
> > update their release barriers to be proper release barriers.
> 
> I agree with this definition. Do you have any updated on discussion you raise in
> comment #1 in BZ#16432?

No, I haven't.  I'll think about reviving the discussion.
Adhemerval Zanella Sept. 25, 2014, 9:19 p.m. UTC | #6
On 24-09-2014 12:37, Torvald Riegel wrote:
> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
>> On 24-09-2014 10:29, Torvald Riegel wrote:
>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>> index 2b8c84d..4906205 100644
>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>> @@ -22,7 +22,7 @@
>>>>>>  int
>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>>>>>  {
>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>>>>>    *lock = 0;
>>>>>>    return 0;
>>>>>>  }
>>>>> I'd prefer if this would be a memory_order_release barrier (which we
>>>>> don't have yet, I know, so bear with me).  However, I see that the
>>>>> current situation is a bit messy:
>>>>> * It seems that atomic_write_barrier is intended to be a C11
>>>>> memory_order_release barrier/fence; at least it's used like that in all
>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
>>>>> to express a release barrier.
>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
>>>> mappings to PowerPC instructions.
>>> I'm aware of these ...
>>>
>>>> As you can see the store release is indeed 'lwsync; st'.  First link also
>>>> gives you an example of a spinlock implementation, however it is using a 
>>>> full memory barrier (sync) instead of a write one (lwsync). It is better
>>>> explained in second link:
>>>>
>>>> "If the critical section contains only normal cached memory operations, 
>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
>>>> above."
>>>>
>>>> Non-cached memory are handled by kernel and exported to application only in very
>>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
>>>> The idea is a process won't try to run a spinlock in a DMA area. 
>>>>
>>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
>>>> relation. My understanding is 'lwsync' is the correct memory order instruction
>>>> to use in such cases.
>>>>
>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
>>> states that eieio really applies only to stores.
>>>
>>> What are your thoughts about changing atomic_write_barrier to eieio?
>>> Are you aware of any prior reasoning about that (e.g., whether
>>> atomic_write_barrier really only needs to prevent reordering of writes)?
>>> Would there be a significant performance degradation?
>> Reading again my last message I think I wasn't clear: my understanding of
>> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
>> memory. And it is preferred over 'sync' for performance reasons.
>>
>> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
>> is a safe atomic_write_barrier and it also performance better than 'eieio'
>> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
>> I see no need to change it to 'eieio' in the pthread_spin_unlock.
> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
> meant changing it to using lwsync :)
> So, what are your thoughts about that?

Right, I thought you were referring solely to pthread_spin_unlock.  About the
spinlock itself, now I understand the reasoning and I do agree about
replacing the asm with a macro that emits a release barrier (which for POWER
would be the same anyway).  However, I would like to address it in another
patch.

Now, atomic_write_barrier, as I understand it, was added to GLIBC with
memory_order_seq_cst semantics.  However, as you have noted, in practice it
acts as a release barrier, so I am not sure about changing it to 'lwsync'.
I am checking with Steven Monroe and Paul E. McKenney whether I overlooked
something here and why exactly they chose 'eieio'.

>
>>>>> I'm interested in having a release barrier there so that this won't get
>>>>> overlooked once we do the actual transition to C11 atomics.
>>>>>
>>>>> Also, related to that, it would be ideal if we could use the generic
>>>>> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
>>>>> this one uses atomic_full_barrier, which is stronger than what's used on
>>>>> the major archs (x86, Power, ARM).
>>> BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
>>> full barrier effectively.
>> A full barrier do work on PowerPC, the specific optimization to use the atomic
>> write 'lwsync' was added later.
> Of course the full barrier would work as well.  My concern is about what
> glibc should provide: current practice differs on major archs (x86 and
> POWER) is weaker than what POSIX requires -- yet the right semantics in
> my opinion.  To me, this indicates that this would work as default in
> practice, and that the default implementation could use a release
> barrier too.

I do agree.

>
>>>>> I strongly believe this should use a release barrier, despite what POSIX
>>>>> is claiming (see #16432), but we'll have to deal with that separately.
>>>> I believe this is exactly why we have a specific POWER implementation: atomic
>>>> full barrier in powerpc are 'sync' and for this spinlock cases 'lwsync' is suffice.
>>> But it shouldn't be POWER-specific, IMO.  Instead, the generic
>>> implementation should have a release barrier only (so, translating to
>>> lwsync on power), and all archs negatively affected by this should
>>> update their release barriers to be proper release barriers.
>> I agree with this definition. Do you have any updated on discussion you raise in
>> comment #1 in BZ#16432?
> No I haven't.  I'll think about reviving the discussion.
>
Adhemerval Zanella Sept. 29, 2014, 7:13 p.m. UTC | #7
On 25-09-2014 18:19, Adhemerval Zanella wrote:
> On 24-09-2014 12:37, Torvald Riegel wrote:
>> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
>>> On 24-09-2014 10:29, Torvald Riegel wrote:
>>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
>>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
>>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> index 2b8c84d..4906205 100644
>>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> @@ -22,7 +22,7 @@
>>>>>>>  int
>>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>>>>>>  {
>>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>>>>>>    *lock = 0;
>>>>>>>    return 0;
>>>>>>>  }
>>>>>> I'd prefer if this would be a memory_order_release barrier (which we
>>>>>> don't have yet, I know, so bear with me).  However, I see that the
>>>>>> current situation is a bit messy:
>>>>>> * It seems that atomic_write_barrier is intended to be a C11
>>>>>> memory_order_release barrier/fence; at least it's used like that in all
>>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
>>>>>> to express a release barrier.
>>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
>>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
>>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
>>>>> mappings to PowerPC instructions.
>>>> I'm aware of these ...
>>>>
>>>>> As you can see the store release is indeed 'lwsync; st'.  First link also
>>>>> gives you an example of a spinlock implementation, however it is using a 
>>>>> full memory barrier (sync) instead of a write one (lwsync). It is better
>>>>> explained in second link:
>>>>>
>>>>> "If the critical section contains only normal cached memory operations, 
>>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
>>>>> above."
>>>>>
>>>>> Non-cached memory are handled by kernel and exported to application only in very
>>>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
>>>>> The idea is a process won't try to run a spinlock in a DMA area. 
>>>>>
>>>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
>>>>> relation. My understanding is 'lwsync' is the correct memory order instruction
>>>>> to use in such cases.
>>>>>
>>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
>>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
>>>> states that eieio really applies only to stores.
>>>>
>>>> What are your thoughts about changing atomic_write_barrier to eieio?
>>>> Are you aware of any prior reasoning about that (e.g., whether
>>>> atomic_write_barrier really only needs to prevent reordering of writes)?
>>>> Would there be a significant performance degradation?
>>> Reading again my last message I think I wasn't clear: my understanding of
>>> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
>>> memory. And it is preferred over 'sync' for performance reasons.
>>>
>>> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
>>> is a safe atomic_write_barrier and it also performance better than 'eieio'
>>> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
>>> I see no need to change it to 'eieio' in the pthread_spin_unlock.
>> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
>> meant changing it to using lwsync :)
>> So, what are your thoughts about that?
> Right, I thought you were solely referring to pthread_spin_unlock.  About the
> spinlock itself, now I understand the reasoning and I do agree that about 
> replace the asm with a macro to emit a release barrier (which for POWER
> would be the same anyway).  However, I would like to address it in another
> patch.
>
> Now for atomic_write_barrier, as I understand, was added in GLIBC as 
> memory_order_seq_cst semantics.  However, as you have noted it is acting in 
> practice as release semantic; so am not sure about changing to 'lwsync' 
> I am checking with Steven Monroe and  Paul E. McKenney if I overlooked 
> something here and why exactly they chose 'eieio'.
>
>>>>>> I'm interested in having a release barrier there so that this won't get
>>>>>> overlooked once we do the actual transition to C11 atomics.
>>>>>>
>>>>>> Also, related to that, it would be ideal if we could use the generic
>>>>>> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
>>>>>> this one uses atomic_full_barrier, which is stronger than what's used on
>>>>>> the major archs (x86, Power, ARM).
>>>> BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
>>>> full barrier effectively.
>>> A full barrier do work on PowerPC, the specific optimization to use the atomic
>>> write 'lwsync' was added later.
>> Of course the full barrier would work as well.  My concern is about what
>> glibc should provide: current practice differs on major archs (x86 and
>> POWER) is weaker than what POSIX requires -- yet the right semantics in
>> my opinion.  To me, this indicates that this would work as default in
>> practice, and that the default implementation could use a release
>> barrier too.
> I do agree.

Torvald,

Besides the memory barrier discussion (which I think should be moved to
another thread), do you have any pending issues with the patch itself?
Adhemerval Zanella Oct. 1, 2014, 9:45 p.m. UTC | #8
On 25-09-2014 18:19, Adhemerval Zanella wrote:
> On 24-09-2014 12:37, Torvald Riegel wrote:
>> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
>>> On 24-09-2014 10:29, Torvald Riegel wrote:
>>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
>>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
>>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> index 2b8c84d..4906205 100644
>>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>> @@ -22,7 +22,7 @@
>>>>>>>  int
>>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>>>>>>  {
>>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>>>>>>    *lock = 0;
>>>>>>>    return 0;
>>>>>>>  }
>>>>>> I'd prefer if this would be a memory_order_release barrier (which we
>>>>>> don't have yet, I know, so bear with me).  However, I see that the
>>>>>> current situation is a bit messy:
>>>>>> * It seems that atomic_write_barrier is intended to be a C11
>>>>>> memory_order_release barrier/fence; at least it's used like that in all
>>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
>>>>>> to express a release barrier.
>>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
>>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
>>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
>>>>> mappings to PowerPC instructions.
>>>> I'm aware of these ...
>>>>
>>>>> As you can see the store release is indeed 'lwsync; st'.  First link also
>>>>> gives you an example of a spinlock implementation, however it is using a 
>>>>> full memory barrier (sync) instead of a write one (lwsync). It is better
>>>>> explained in second link:
>>>>>
>>>>> "If the critical section contains only normal cached memory operations, 
>>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
>>>>> above."
>>>>>
>>>>> Non-cached memory are handled by kernel and exported to application only in very
>>>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
>>>>> The idea is a process won't try to run a spinlock in a DMA area. 
>>>>>
>>>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
>>>>> relation. My understanding is 'lwsync' is the correct memory order instruction
>>>>> to use in such cases.
>>>>>
>>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
>>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
>>>> states that eieio really applies only to stores.
>>>>
>>>> What are your thoughts about changing atomic_write_barrier to eieio?
>>>> Are you aware of any prior reasoning about that (e.g., whether
>>>> atomic_write_barrier really only needs to prevent reordering of writes)?
>>>> Would there be a significant performance degradation?
>>> Reading again my last message I think I wasn't clear: my understanding of
>>> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
>>> memory. And it is preferred over 'sync' for performance reasons.
>>>
>>> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
>>> is a safe atomic_write_barrier and it also performance better than 'eieio'
>>> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
>>> I see no need to change it to 'eieio' in the pthread_spin_unlock.
>> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
>> meant changing it to using lwsync :)
>> So, what are your thoughts about that?
> Right, I thought you were solely referring to pthread_spin_unlock.  About the
> spinlock itself, now I understand the reasoning and I do agree that about 
> replace the asm with a macro to emit a release barrier (which for POWER
> would be the same anyway).  However, I would like to address it in another
> patch.
>
> Now for atomic_write_barrier, as I understand, was added in GLIBC as 
> memory_order_seq_cst semantics.  However, as you have noted it is acting in 
> practice as release semantic; so am not sure about changing to 'lwsync' 
> I am checking with Steven Monroe and  Paul E. McKenney if I overlooked 
> something here and why exactly they chose 'eieio'.

Paul E. McKenney's clarification about 'eieio' is:

"The eieio is weaker than a memory_order_release barrier, as it does not
guarantee to order prior loads against later stores, nor does it guarantee
to order prior loads against later loads."

So I see it is safe to use 'lwsync' instead, and if we are aiming to follow
C11 semantics it would also be preferable.  I will put it on my backlog to
work on after the lowlevellock.h cleanup.
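
To make the difference explicit, here is a sketch (not the actual sysdeps
header) of the two barriers side by side:

  /* Sketch only.  'eieio' does not order prior loads against later
     stores or loads, so it is weaker than a release fence:  */
  #define write_barrier_eieio()   __asm __volatile ("eieio" ::: "memory")

  /* 'lwsync' orders prior loads and stores against later stores (and loads
     against loads), which is what a release fence requires:  */
  #define write_barrier_lwsync()  __asm __volatile ("lwsync" ::: "memory")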
Torvald Riegel Oct. 6, 2014, 12:49 p.m. UTC | #9
On Thu, 2014-09-25 at 18:19 -0300, Adhemerval Zanella wrote:
> On 24-09-2014 12:37, Torvald Riegel wrote:
> > On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
> >> On 24-09-2014 10:29, Torvald Riegel wrote:
> >>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
> >>>> On 24-09-2014 07:26, Torvald Riegel wrote:
> >>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> >>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>> index 2b8c84d..4906205 100644
> >>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>> @@ -22,7 +22,7 @@
> >>>>>>  int
> >>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>>>>>  {
> >>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
> >>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >>>>>>    *lock = 0;
> >>>>>>    return 0;
> >>>>>>  }
> >>>>> I'd prefer if this would be a memory_order_release barrier (which we
> >>>>> don't have yet, I know, so bear with me).  However, I see that the
> >>>>> current situation is a bit messy:
> >>>>> * It seems that atomic_write_barrier is intended to be a C11
> >>>>> memory_order_release barrier/fence; at least it's used like that in all
> >>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
> >>>>> to express a release barrier.
> >>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
> >>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
> >>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
> >>>> mappings to PowerPC instructions.
> >>> I'm aware of these ...
> >>>
> >>>> As you can see the store release is indeed 'lwsync; st'.  First link also
> >>>> gives you an example of a spinlock implementation, however it is using a 
> >>>> full memory barrier (sync) instead of a write one (lwsync). It is better
> >>>> explained in second link:
> >>>>
> >>>> "If the critical section contains only normal cached memory operations, 
> >>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
> >>>> above."
> >>>>
> >>>> Non-cached memory are handled by kernel and exported to application only in very
> >>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
> >>>> The idea is a process won't try to run a spinlock in a DMA area. 
> >>>>
> >>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
> >>>> relation. My understanding is 'lwsync' is the correct memory order instruction
> >>>> to use in such cases.
> >>>>
> >>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> >>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> >>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
> >>> states that eieio really applies only to stores.
> >>>
> >>> What are your thoughts about changing atomic_write_barrier to eieio?
> >>> Are you aware of any prior reasoning about that (e.g., whether
> >>> atomic_write_barrier really only needs to prevent reordering of writes)?
> >>> Would there be a significant performance degradation?
> >> Reading again my last message I think I wasn't clear: my understanding of
> >> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
> >> memory. And it is preferred over 'sync' for performance reasons.
> >>
> >> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
> >> is a safe atomic_write_barrier and it also performance better than 'eieio'
> >> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
> >> I see no need to change it to 'eieio' in the pthread_spin_unlock.
> > Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
> > meant changing it to using lwsync :)
> > So, what are your thoughts about that?
> 
> Right, I thought you were solely referring to pthread_spin_unlock.  About the
> spinlock itself, now I understand the reasoning and I do agree that about 
> replace the asm with a macro to emit a release barrier (which for POWER
> would be the same anyway).  However, I would like to address it in another
> patch.
> 
> Now for atomic_write_barrier, as I understand, was added in GLIBC as 
> memory_order_seq_cst semantics.

Do you have any indication for that being the case (e.g., emails by
others)?  It would be surprising because atomic_full_barrier exists as
well.
Torvald Riegel Oct. 6, 2014, 12:51 p.m. UTC | #10
On Wed, 2014-10-01 at 18:45 -0300, Adhemerval Zanella wrote:
> On 25-09-2014 18:19, Adhemerval Zanella wrote:
> > On 24-09-2014 12:37, Torvald Riegel wrote:
> >> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
> >>> On 24-09-2014 10:29, Torvald Riegel wrote:
> >>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
> >>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
> >>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> >>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> index 2b8c84d..4906205 100644
> >>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> @@ -22,7 +22,7 @@
> >>>>>>>  int
> >>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>>>>>>  {
> >>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
> >>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >>>>>>>    *lock = 0;
> >>>>>>>    return 0;
> >>>>>>>  }
> >>>>>> I'd prefer if this would be a memory_order_release barrier (which we
> >>>>>> don't have yet, I know, so bear with me).  However, I see that the
> >>>>>> current situation is a bit messy:
> >>>>>> * It seems that atomic_write_barrier is intended to be a C11
> >>>>>> memory_order_release barrier/fence; at least it's used like that in all
> >>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
> >>>>>> to express a release barrier.
> >>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
> >>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
> >>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
> >>>>> mappings to PowerPC instructions.
> >>>> I'm aware of these ...
> >>>>
> >>>>> As you can see the store release is indeed 'lwsync; st'.  First link also
> >>>>> gives you an example of a spinlock implementation, however it is using a 
> >>>>> full memory barrier (sync) instead of a write one (lwsync). It is better
> >>>>> explained in second link:
> >>>>>
> >>>>> "If the critical section contains only normal cached memory operations, 
> >>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
> >>>>> above."
> >>>>>
> >>>>> Non-cached memory are handled by kernel and exported to application only in very
> >>>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
> >>>>> The idea is a process won't try to run a spinlock in a DMA area. 
> >>>>>
> >>>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
> >>>>> relation. My understanding is 'lwsync' is the correct memory order instruction
> >>>>> to use in such cases.
> >>>>>
> >>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> >>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> >>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
> >>>> states that eieio really applies only to stores.
> >>>>
> >>>> What are your thoughts about changing atomic_write_barrier to eieio?
> >>>> Are you aware of any prior reasoning about that (e.g., whether
> >>>> atomic_write_barrier really only needs to prevent reordering of writes)?
> >>>> Would there be a significant performance degradation?
> >>> Reading again my last message I think I wasn't clear: my understanding of
> >>> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
> >>> memory. And it is preferred over 'sync' for performance reasons.
> >>>
> >>> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
> >>> is a safe atomic_write_barrier and it also performance better than 'eieio'
> >>> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
> >>> I see no need to change it to 'eieio' in the pthread_spin_unlock.
> >> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
> >> meant changing it to using lwsync :)
> >> So, what are your thoughts about that?
> > Right, I thought you were solely referring to pthread_spin_unlock.  About the
> > spinlock itself, now I understand the reasoning and I do agree that about 
> > replace the asm with a macro to emit a release barrier (which for POWER
> > would be the same anyway).  However, I would like to address it in another
> > patch.
> >
> > Now for atomic_write_barrier, as I understand, was added in GLIBC as 
> > memory_order_seq_cst semantics.  However, as you have noted it is acting in 
> > practice as release semantic; so am not sure about changing to 'lwsync' 
> > I am checking with Steven Monroe and  Paul E. McKenney if I overlooked 
> > something here and why exactly they chose 'eieio'.
> 
> Paul E. McKenney clarification about 'eieio' is:
> 
> "The eieio is weaker than a memory_order_release barrier, as it does not
> guarantee to order prior loads against later stores, nor does it guarantee
> to order prior loads against later loads."
> 
> So I see it is safe to use 'lwsync' instead and if we aiming to follow C11
> semantics it would be also preferable. I will put it on my backlog to work with
> after lowlevellock.h cleanup.

Thanks!
Torvald Riegel Oct. 6, 2014, 12:53 p.m. UTC | #11
On Mon, 2014-09-29 at 16:13 -0300, Adhemerval Zanella wrote:
> On 25-09-2014 18:19, Adhemerval Zanella wrote:
> > On 24-09-2014 12:37, Torvald Riegel wrote:
> >> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
> >>> On 24-09-2014 10:29, Torvald Riegel wrote:
> >>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
> >>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
> >>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
> >>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> index 2b8c84d..4906205 100644
> >>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> >>>>>>> @@ -22,7 +22,7 @@
> >>>>>>>  int
> >>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>>>>>>  {
> >>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
> >>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >>>>>>>    *lock = 0;
> >>>>>>>    return 0;
> >>>>>>>  }
> >>>>>> I'd prefer if this would be a memory_order_release barrier (which we
> >>>>>> don't have yet, I know, so bear with me).  However, I see that the
> >>>>>> current situation is a bit messy:
> >>>>>> * It seems that atomic_write_barrier is intended to be a C11
> >>>>>> memory_order_release barrier/fence; at least it's used like that in all
> >>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
> >>>>>> to express a release barrier.
> >>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
> >>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
> >>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
> >>>>> mappings to PowerPC instructions.
> >>>> I'm aware of these ...
> >>>>
> >>>>> As you can see the store release is indeed 'lwsync; st'.  First link also
> >>>>> gives you an example of a spinlock implementation, however it is using a 
> >>>>> full memory barrier (sync) instead of a write one (lwsync). It is better
> >>>>> explained in second link:
> >>>>>
> >>>>> "If the critical section contains only normal cached memory operations, 
> >>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
> >>>>> above."
> >>>>>
> >>>>> Non-cached memory are handled by kernel and exported to application only in very
> >>>>> specific situations (and afaik C11/C++11 specifications don't even handle it).
> >>>>> The idea is a process won't try to run a spinlock in a DMA area. 
> >>>>>
> >>>>> Also the 'Synchronizes With' gives some explanation about 'eieio' and 'lwsync'
> >>>>> relation. My understanding is 'lwsync' is the correct memory order instruction
> >>>>> to use in such cases.
> >>>>>
> >>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> >>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> >>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
> >>>> states that eieio really applies only to stores.
> >>>>
> >>>> What are your thoughts about changing atomic_write_barrier to eieio?
> >>>> Are you aware of any prior reasoning about that (e.g., whether
> >>>> atomic_write_barrier really only needs to prevent reordering of writes)?
> >>>> Would there be a significant performance degradation?
> >>> Reading again my last message I think I wasn't clear: my understanding of
> >>> 'eieio' is to be a safe atomic_write_barrier for cacheable and cache inhibited
> >>> memory. And it is preferred over 'sync' for performance reasons.
> >>>
> >>> Now, for cacheable memory (default memory allocated from kernel), a 'lwsync'
> >>> is a safe atomic_write_barrier and it also performance better than 'eieio'
> >>> (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations). That's why 
> >>> I see no need to change it to 'eieio' in the pthread_spin_unlock.
> >> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
> >> meant changing it to using lwsync :)
> >> So, what are your thoughts about that?
> > Right, I thought you were solely referring to pthread_spin_unlock.  About the
> > spinlock itself, now I understand the reasoning and I do agree that about 
> > replace the asm with a macro to emit a release barrier (which for POWER
> > would be the same anyway).  However, I would like to address it in another
> > patch.
> >
> > Now for atomic_write_barrier, as I understand, was added in GLIBC as 
> > memory_order_seq_cst semantics.  However, as you have noted it is acting in 
> > practice as release semantic; so am not sure about changing to 'lwsync' 
> > I am checking with Steven Monroe and  Paul E. McKenney if I overlooked 
> > something here and why exactly they chose 'eieio'.
> >
> >>>>>> I'm interested in having a release barrier there so that this won't get
> >>>>>> overlooked once we do the actual transition to C11 atomics.
> >>>>>>
> >>>>>> Also, related to that, it would be ideal if we could use the generic
> >>>>>> implementation (nptl/pthread_spin_unlock.c) on Power as well.  However,
> >>>>>> this one uses atomic_full_barrier, which is stronger than what's used on
> >>>>>> the major archs (x86, Power, ARM).
> >>>> BTW, I've noticed that ARM uses the generic spin_lock, so ARM uses a
> >>>> full barrier effectively.
> >>> A full barrier do work on PowerPC, the specific optimization to use the atomic
> >>> write 'lwsync' was added later.
> >> Of course the full barrier would work as well.  My concern is about what
> >> glibc should provide: current practice differs on major archs (x86 and
> >> POWER) is weaker than what POSIX requires -- yet the right semantics in
> >> my opinion.  To me, this indicates that this would work as default in
> >> practice, and that the default implementation could use a release
> >> barrier too.
> > I do agree.
> 
> Torvald,
> 
> Besides the memory barrier discussion (which I think should be moved to another thread)
> do you have any pending issues related to patch itself?

No, I think it's fine.  I'm looking forward to the two separate patches
that spun off of this.
Adhemerval Zanella Oct. 6, 2014, 12:56 p.m. UTC | #12
On 06-10-2014 09:49, Torvald Riegel wrote:
> On Thu, 2014-09-25 at 18:19 -0300, Adhemerval Zanella wrote:
>> On 24-09-2014 12:37, Torvald Riegel wrote:
>>> On Wed, 2014-09-24 at 10:58 -0300, Adhemerval Zanella wrote:
>>>> On 24-09-2014 10:29, Torvald Riegel wrote:
>>>>> On Wed, 2014-09-24 at 08:55 -0300, Adhemerval Zanella wrote:
>>>>>> On 24-09-2014 07:26, Torvald Riegel wrote:
>>>>>>> On Tue, 2014-09-23 at 14:35 -0300, Adhemerval Zanella wrote:
>>>>>>>> diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>>> index 2b8c84d..4906205 100644
>>>>>>>> --- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>>>>>>>> @@ -22,7 +22,7 @@
>>>>>>>>  int
>>>>>>>>  pthread_spin_unlock (pthread_spinlock_t *lock)
>>>>>>>>  {
>>>>>>>> -  __asm __volatile (__lll_rel_instr ::: "memory");
>>>>>>>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>>>>>>>>    *lock = 0;
>>>>>>>>    return 0;
>>>>>>>>  }
>>>>>>> I'd prefer if this would be a memory_order_release barrier (which we
>>>>>>> don't have yet, I know, so bear with me).  However, I see that the
>>>>>>> current situation is a bit messy:
>>>>>>> * It seems that atomic_write_barrier is intended to be a C11
>>>>>>> memory_order_release barrier/fence; at least it's used like that in all
>>>>>>> what I saw in glibc.  So atomic_write_barrier would be our current way
>>>>>>> to express a release barrier.
>>>>>>> * However, PowerPC atomic_write_barrier is "eieio" and is not using
>>>>>>> __ARCH_REL_INSTR.  Can you clarify the difference?
>>>>>> These two pages [1] [2] have detailed information about C11/C++11 memory order
>>>>>> mappings to PowerPC instructions.
>>>>> I'm aware of these ...
>>>>>
> >>>>>> As you can see, the store release is indeed 'lwsync; st'.  The first link
> >>>>>> also gives an example of a spinlock implementation; however, it uses a
> >>>>>> full memory barrier (sync) instead of a write barrier (lwsync).  It is
> >>>>>> better explained in the second link:
>>>>>>
>>>>>> "If the critical section contains only normal cached memory operations, 
>>>>>> the lwsync barrier may be substituted for the heavier-weight sync instruction
>>>>>> above."
>>>>>>
> >>>>>> Non-cached memory is handled by the kernel and exported to applications
> >>>>>> only in very specific situations (and afaik the C11/C++11 specifications
> >>>>>> don't even cover it).  The idea is that a process won't try to run a
> >>>>>> spinlock in a DMA area.
>>>>>>
> >>>>>> Also, the 'Synchronizes With' section gives some explanation of the
> >>>>>> relation between 'eieio' and 'lwsync'.  My understanding is that 'lwsync'
> >>>>>> is the correct memory ordering instruction to use in such cases.
>>>>>>
>>>>>> [1] http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>>>>> [2] http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
>>>>> ... but not of the difference between eieio and lwsync.  AFAICT, N2745
>>>>> states that eieio really applies only to stores.
>>>>>
>>>>> What are your thoughts about changing atomic_write_barrier to eieio?
>>>>> Are you aware of any prior reasoning about that (e.g., whether
>>>>> atomic_write_barrier really only needs to prevent reordering of writes)?
>>>>> Would there be a significant performance degradation?
> >>>> Reading my last message again, I think I wasn't clear: my understanding is
> >>>> that 'eieio' is a safe atomic_write_barrier for both cacheable and
> >>>> cache-inhibited memory, and that it is preferred over 'sync' for
> >>>> performance reasons.
> >>>>
> >>>> Now, for cacheable memory (the default memory allocated by the kernel),
> >>>> 'lwsync' is a safe atomic_write_barrier and it also performs better than
> >>>> 'eieio' (and AFAIK GCC won't emit 'eieio' for C11/C++11 atomic operations).
> >>>> That's why I see no need to change it to 'eieio' in pthread_spin_unlock.
>>> Sorry, when I wrote "changing atomic_write_barrier to eieio" I actually
>>> meant changing it to using lwsync :)
>>> So, what are your thoughts about that?
>> Right, I thought you were solely referring to pthread_spin_unlock.  About the
>> spinlock itself, now I understand the reasoning and I do agree about replacing
>> the asm with a macro that emits a release barrier (which for POWER would be
>> the same instruction anyway).  However, I would like to address it in another
>> patch.
>>
>> Now for atomic_write_barrier: as I understand it, it was added in GLIBC with
>> memory_order_seq_cst semantics.
> Do you have any indication for that being the case (e.g., emails by
> others)?  It would be surprising because atomic_full_barrier exists as
> well.
>
In fact I was misled by the wrong assumption that 'eieio' is 'stronger' than
lwsync, when in fact the contrary is true.
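
For comparison, a small standalone illustration of the mapping from [1] that
this refers to, assuming GCC's C11 atomics: on powerpc a release store is
expected to be preceded by 'lwsync' and a seq_cst store by the heavier 'sync';
neither uses 'eieio'.

#include <stdatomic.h>

static atomic_int value;

/* Expected powerpc code per [1]: lwsync; stw  */
void
store_release (void)
{
  atomic_store_explicit (&value, 1, memory_order_release);
}

/* Expected powerpc code per [1]: sync; stw  */
void
store_seq_cst (void)
{
  atomic_store_explicit (&value, 1, memory_order_seq_cst);
}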

Patch

diff --git a/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h b/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
deleted file mode 100644
index a651d23..0000000
--- a/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
+++ /dev/null
@@ -1,342 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Paul Mackerras <paulus@au.ibm.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#ifndef _LOWLEVELLOCK_H
-#define _LOWLEVELLOCK_H	1
-
-#include <time.h>
-#include <sys/param.h>
-#include <bits/pthreadtypes.h>
-#include <atomic.h>
-#include <kernel-features.h>
-
-#ifndef __NR_futex
-# define __NR_futex		221
-#endif
-#define FUTEX_WAIT		0
-#define FUTEX_WAKE		1
-#define FUTEX_REQUEUE		3
-#define FUTEX_CMP_REQUEUE	4
-#define FUTEX_WAKE_OP		5
-#define FUTEX_OP_CLEAR_WAKE_IF_GT_ONE	((4 << 24) | 1)
-#define FUTEX_LOCK_PI		6
-#define FUTEX_UNLOCK_PI		7
-#define FUTEX_TRYLOCK_PI	8
-#define FUTEX_WAIT_BITSET	9
-#define FUTEX_WAKE_BITSET	10
-#define FUTEX_WAIT_REQUEUE_PI   11
-#define FUTEX_CMP_REQUEUE_PI    12
-#define FUTEX_PRIVATE_FLAG	128
-#define FUTEX_CLOCK_REALTIME	256
-
-#define FUTEX_BITSET_MATCH_ANY	0xffffffff
-
-/* Values for 'private' parameter of locking macros.  Yes, the
-   definition seems to be backwards.  But it is not.  The bit will be
-   reversed before passing to the system call.  */
-#define LLL_PRIVATE	0
-#define LLL_SHARED	FUTEX_PRIVATE_FLAG
-
-#if !defined NOT_IN_libc || defined IS_IN_rtld
-/* In libc.so or ld.so all futexes are private.  */
-# ifdef __ASSUME_PRIVATE_FUTEX
-#  define __lll_private_flag(fl, private) \
-  ((fl) | FUTEX_PRIVATE_FLAG)
-# else
-#  define __lll_private_flag(fl, private) \
-  ((fl) | THREAD_GETMEM (THREAD_SELF, header.private_futex))
-# endif
-#else
-# ifdef __ASSUME_PRIVATE_FUTEX
-#  define __lll_private_flag(fl, private) \
-  (((fl) | FUTEX_PRIVATE_FLAG) ^ (private))
-# else
-#  define __lll_private_flag(fl, private) \
-  (__builtin_constant_p (private)					      \
-   ? ((private) == 0							      \
-      ? ((fl) | THREAD_GETMEM (THREAD_SELF, header.private_futex))	      \
-      : (fl))								      \
-   : ((fl) | (((private) ^ FUTEX_PRIVATE_FLAG)				      \
-	      & THREAD_GETMEM (THREAD_SELF, header.private_futex))))
-# endif
-#endif
-
-#define lll_futex_wait(futexp, val, private) \
-  lll_futex_timed_wait (futexp, val, NULL, private)
-
-#define lll_futex_timed_wait(futexp, val, timespec, private) \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 4, (futexp),		      \
-			      __lll_private_flag (FUTEX_WAIT, private),	      \
-			      (val), (timespec));			      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;		      \
-  })
-
-#define lll_futex_timed_wait_bitset(futexp, val, timespec, clockbit, private) \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-    int __op = FUTEX_WAIT_BITSET | clockbit;				      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
-			      __lll_private_flag (__op, private),	      \
-			      (val), (timespec), NULL /* Unused.  */, 	      \
-			      FUTEX_BITSET_MATCH_ANY);			      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;		      \
-  })
-
-#define lll_futex_wake(futexp, nr, private) \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 4, (futexp),		      \
-			      __lll_private_flag (FUTEX_WAKE, private),	      \
-			      (nr), 0);					      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;		      \
-  })
-
-/* Returns non-zero if error happened, zero if success.  */
-#define lll_futex_requeue(futexp, nr_wake, nr_move, mutex, val, private) \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
-			      __lll_private_flag (FUTEX_CMP_REQUEUE, private),\
-			      (nr_wake), (nr_move), (mutex), (val));	      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
-  })
-
-/* Returns non-zero if error happened, zero if success.  */
-#define lll_futex_wake_unlock(futexp, nr_wake, nr_wake2, futexp2, private) \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
-			      __lll_private_flag (FUTEX_WAKE_OP, private),    \
-			      (nr_wake), (nr_wake2), (futexp2),		      \
-			      FUTEX_OP_CLEAR_WAKE_IF_GT_ONE);		      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
-  })
-
-/* Priority Inheritance support.  */
-#define lll_futex_wait_requeue_pi(futexp, val, mutex, private) \
-  lll_futex_timed_wait_requeue_pi (futexp, val, NULL, 0, mutex, private)
-
-#define lll_futex_timed_wait_requeue_pi(futexp, val, timespec, clockbit,      \
-					mutex, private)			      \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-    int __op = FUTEX_WAIT_REQUEUE_PI | clockbit;			      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 5, (futexp),		      \
-			      __lll_private_flag (__op, private),	      \
-			      (val), (timespec), mutex); 		      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;		      \
-  })
-
-#define lll_futex_cmp_requeue_pi(futexp, nr_wake, nr_move, mutex, val, priv)  \
-  ({									      \
-    INTERNAL_SYSCALL_DECL (__err);					      \
-    long int __ret;							      \
-									      \
-    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
-			      __lll_private_flag (FUTEX_CMP_REQUEUE_PI, priv),\
-			      (nr_wake), (nr_move), (mutex), (val));	      \
-    INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
-  })
-
-
-#ifdef UP
-# define __lll_acq_instr	""
-# define __lll_rel_instr	""
-#else
-# define __lll_acq_instr	"isync"
-# ifdef _ARCH_PWR4
-/*
- * Newer powerpc64 processors support the new "light weight" sync (lwsync)
- * So if the build is using -mcpu=[power4,power5,power5+,970] we can
- * safely use lwsync.
- */
-#  define __lll_rel_instr	"lwsync"
-# else
-/*
- * Older powerpc32 processors don't support the new "light weight"
- * sync (lwsync).  So the only safe option is to use normal sync
- * for all powerpc32 applications.
- */
-#  define __lll_rel_instr	"sync"
-# endif
-#endif
-
-/* Set *futex to ID if it is 0, atomically.  Returns the old value */
-#define __lll_base_trylock(futex, id) \
-  ({ int __val;								      \
-     __asm __volatile ("1:	lwarx	%0,0,%2" MUTEX_HINT_ACQ "\n"	      \
-		       "	cmpwi	0,%0,0\n"			      \
-		       "	bne	2f\n"				      \
-		       "	stwcx.	%3,0,%2\n"			      \
-		       "	bne-	1b\n"				      \
-		       "2:	" __lll_acq_instr			      \
-		       : "=&r" (__val), "=m" (*futex)			      \
-		       : "r" (futex), "r" (id), "m" (*futex)		      \
-		       : "cr0", "memory");				      \
-     __val;								      \
-  })
-
-/* Set *futex to 1 if it is 0, atomically.  Returns the old value */
-#define __lll_trylock(futex) __lll_base_trylock (futex, 1)
-
-#define lll_trylock(lock)	__lll_trylock (&(lock))
-
-/* Set *futex to 2 if it is 0, atomically.  Returns the old value */
-#define __lll_cond_trylock(futex) __lll_base_trylock (futex, 2)
-
-#define lll_cond_trylock(lock)	__lll_cond_trylock (&(lock))
-
-
-extern void __lll_lock_wait_private (int *futex) attribute_hidden;
-extern void __lll_lock_wait (int *futex, int private) attribute_hidden;
-extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
-
-#define lll_lock(lock, private) \
-  (void) ({								      \
-    int *__futex = &(lock);						      \
-    if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
-			  0) != 0)					      \
-      {									      \
-	if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)	      \
-	  __lll_lock_wait_private (__futex);				      \
-	else								      \
-	  __lll_lock_wait (__futex, private);				      \
-      }									      \
-  })
-
-#define lll_robust_lock(lock, id, private) \
-  ({									      \
-    int *__futex = &(lock);						      \
-    int __val = 0;							      \
-    if (__builtin_expect (atomic_compare_and_exchange_bool_acq (__futex, id,  \
-								0), 0))	      \
-      __val = __lll_robust_lock_wait (__futex, private);		      \
-    __val;								      \
-  })
-
-#define lll_cond_lock(lock, private) \
-  (void) ({								      \
-    int *__futex = &(lock);						      \
-    if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 2, 0),\
-			  0) != 0)					      \
-      __lll_lock_wait (__futex, private);				      \
-  })
-
-#define lll_robust_cond_lock(lock, id, private) \
-  ({									      \
-    int *__futex = &(lock);						      \
-    int __val = 0;							      \
-    int __id = id | FUTEX_WAITERS;					      \
-    if (__builtin_expect (atomic_compare_and_exchange_bool_acq (__futex, __id,\
-								0), 0))	      \
-      __val = __lll_robust_lock_wait (__futex, private);		      \
-    __val;								      \
-  })
-
-
-extern int __lll_timedlock_wait
-  (int *futex, const struct timespec *, int private) attribute_hidden;
-extern int __lll_robust_timedlock_wait
-  (int *futex, const struct timespec *, int private) attribute_hidden;
-
-#define lll_timedlock(lock, abstime, private) \
-  ({									      \
-    int *__futex = &(lock);						      \
-    int __val = 0;							      \
-    if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
-			  0) != 0)					      \
-      __val = __lll_timedlock_wait (__futex, abstime, private);		      \
-    __val;								      \
-  })
-
-#define lll_robust_timedlock(lock, abstime, id, private) \
-  ({									      \
-    int *__futex = &(lock);						      \
-    int __val = 0;							      \
-    if (__builtin_expect (atomic_compare_and_exchange_bool_acq (__futex, id,  \
-								0), 0))	      \
-      __val = __lll_robust_timedlock_wait (__futex, abstime, private);	      \
-    __val;								      \
-  })
-
-#define lll_unlock(lock, private) \
-  ((void) ({								      \
-    int *__futex = &(lock);						      \
-    int __val = atomic_exchange_rel (__futex, 0);			      \
-    if (__glibc_unlikely (__val > 1))					      \
-      lll_futex_wake (__futex, 1, private);				      \
-  }))
-
-#define lll_robust_unlock(lock, private) \
-  ((void) ({								      \
-    int *__futex = &(lock);						      \
-    int __val = atomic_exchange_rel (__futex, 0);			      \
-    if (__glibc_unlikely (__val & FUTEX_WAITERS))			      \
-      lll_futex_wake (__futex, 1, private);				      \
-  }))
-
-#define lll_islocked(futex) \
-  (futex != 0)
-
-
-/* Initializers for lock.  */
-#define LLL_LOCK_INITIALIZER		(0)
-#define LLL_LOCK_INITIALIZER_LOCKED	(1)
-
-/* The states of a lock are:
-    0  -  untaken
-    1  -  taken by one user
-   >1  -  taken by more users */
-
-/* The kernel notifies a process which uses CLONE_CHILD_CLEARTID via futex
-   wakeup when the clone terminates.  The memory location contains the
-   thread ID while the clone is running and is reset to zero
-   afterwards.	*/
-#define lll_wait_tid(tid) \
-  do {									      \
-    __typeof (tid) __tid;						      \
-    while ((__tid = (tid)) != 0)					      \
-      lll_futex_wait (&(tid), __tid, LLL_SHARED);			      \
-  } while (0)
-
-extern int __lll_timedwait_tid (int *, const struct timespec *)
-     attribute_hidden;
-
-#define lll_timedwait_tid(tid, abstime) \
-  ({									      \
-    int __res = 0;							      \
-    if ((tid) != 0)							      \
-      __res = __lll_timedwait_tid (&(tid), (abstime));			      \
-    __res;								      \
-  })
-
-#endif	/* lowlevellock.h */
diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
index 2b8c84d..4906205 100644
--- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
+++ b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
@@ -22,7 +22,7 @@ 
 int
 pthread_spin_unlock (pthread_spinlock_t *lock)
 {
-  __asm __volatile (__lll_rel_instr ::: "memory");
+  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
   *lock = 0;
   return 0;
 }
diff --git a/sysdeps/unix/sysv/linux/powerpc/sem_post.c b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
index f222d9a..831a8dd 100644
--- a/sysdeps/unix/sysv/linux/powerpc/sem_post.c
+++ b/sysdeps/unix/sysv/linux/powerpc/sem_post.c
@@ -30,9 +30,9 @@  __new_sem_post (sem_t *sem)
 {
   struct new_sem *isem = (struct new_sem *) sem;
 
-  __asm __volatile (__lll_rel_instr ::: "memory");
+  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
   atomic_increment (&isem->value);
-  __asm __volatile (__lll_acq_instr ::: "memory");
+  __asm __volatile (__ARCH_ACQ_INSTR ::: "memory");
   if (isem->nwaiters > 0)
     {
       int err = lll_futex_wake (&isem->value, 1,
@@ -55,7 +55,7 @@  __old_sem_post (sem_t *sem)
 {
   int *futex = (int *) sem;
 
-  __asm __volatile (__lll_rel_instr ::: "memory");
+  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
   (void) atomic_increment_val (futex);
   /* We always have to assume it is a shared semaphore.  */
   int err = lll_futex_wake (futex, 1, LLL_SHARED);