[RFC] nptl: use compare and exchange for lll_cond_lock

Message ID 5421B1C2.9020509@linux.vnet.ibm.com
State New

Commit Message

Adhemerval Zanella Sept. 23, 2014, 5:45 p.m. UTC
While checking the generated code and macros used in the generic lowlevellock.h,
I noted that powerpc and other architectures use a compare-and-swap instead of a
plain exchange in lll_cond_lock.

I am not really sure which behavior is preferable, since as far as I can tell
both have the same side effects (lll_cond_lock, unlike lll_lock, never stores
the value '1').

So I am proposing this patch to sync the default implementation with what most
architectures (ia64, powerpc, s390, sparc, x86, hppa) use for lll_cond_lock.  I
see that only microblaze and sh (I am not sure about this one; I am not well
versed in its assembly and am being guided by its comments) use
atomic_exchange_acq instead.

Checked on powerpc32 and powerpc64 with my previous lowlevellock.h removal
patch.

--

	* sysdeps/nptl/lowlevellock.h (__lll_cond_lock): Use
	atomic_compare_and_exchange_val_acq instead of atomic_exchange_acq.

---
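For illustration, the two candidate fast paths can be sketched as standalone C
using GCC __atomic builtins (a hypothetical rendering, not glibc's actual
macros; lll_lock_wait_stub stands in for the real __lll_lock_wait slow path):

```c
/* Sketch of the two __lll_cond_lock fast paths discussed in this patch.
   glibc implements these as macros over arch-specific atomics; GCC
   __atomic builtins stand in here.  Futex values: 0 = unlocked,
   2 = locked with possible waiters.  */

static int wait_entered;        /* counts slow-path entries (stub only) */

/* Stub for __lll_lock_wait: the real one loops on FUTEX_WAIT.  */
static void
lll_lock_wait_stub (int *futex)
{
  wait_entered++;
}

/* Current generic version: unconditional exchange.  Always stores 2,
   even when the lock is already held.  */
static void
cond_lock_exchange (int *futex)
{
  if (__atomic_exchange_n (futex, 2, __ATOMIC_ACQUIRE) != 0)
    lll_lock_wait_stub (futex);
}

/* Proposed version: compare-and-swap.  Stores 2 only if the lock was
   free (0); on failure the futex word is left untouched.  */
static void
cond_lock_cas (int *futex)
{
  int expected = 0;
  if (!__atomic_compare_exchange_n (futex, &expected, 2, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    lll_lock_wait_stub (futex);
}
```

The observable difference is only on the contended path: the exchange version
writes 2 into an already-held lock, the CAS version leaves it alone and goes
straight to the slow path.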

Comments

Bernard Ogden Sept. 24, 2014, 1:46 p.m. UTC | #1
IIUC the two options are indeed equivalent - but wouldn't it be more
efficient to synchronize on the non-comparing version?

Adhemerval Zanella Sept. 24, 2014, 2 p.m. UTC | #2
On 24-09-2014 10:46, Bernie Ogden wrote:
> IIUC the two options are indeed equivalent - but wouldn't it be more
> efficient to synchronize on the non-comparing version?

Well, that is exactly the question I posed ;)

Torvald Riegel Sept. 25, 2014, 6 p.m. UTC | #3
On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
> While checking the generated code and macros used in generic lowlevellock.h,
> I noted powerpc and other arch uses uses a compare and swap instead of a plain
> exchange value on lll_cond_lock.
> I am not really sure which behavior would be desirable, since as far I could
> they will have both the same side effects (since lll_cond_lock, different
> from lll_lock, does not hold value of '1').

What do you mean by "[the function] does not hold value of '1'"?

> So I am proposing this patch to sync default implementation for what mostly
> architectures (ia64, ppc, s390, sparc, x86, hppa) uses for lll_cond_lock.  I see
> that only microblaze and sh (I am not sure about this one, I not well versed in
> its assembly and I'm being guided by its comment) used the atomic_exchange_acq
> instead.

I think both versions work from a correctness POV, but doing an
unconditional exchange should be the right thing to do.

The default implementation of __lll_lock_wait will test if the futex
variable equals 2, and if not, do an exchange right away before running
the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
thing that will happen is an exchange.  Thus, it seems that we should do
the exchange right away.

Thoughts?
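(For reference, the wait path described above roughly corresponds to this
standalone sketch of the generic __lll_lock_wait; the real glibc version also
threads through the futex 'private' flag, and futex_wait_stub stands in for the
FUTEX_WAIT syscall, here pretending the holder released the lock.)

```c
/* Simplified sketch of the generic __lll_lock_wait loop.  */

static int futex_wait_calls;

static void
futex_wait_stub (int *futex, int val)
{
  futex_wait_calls++;
  *futex = 0;                   /* pretend the lock holder unlocked */
}

static void
lll_lock_wait_sketch (int *futex)
{
  /* If the futex already says "locked with waiters", block first.  */
  if (*futex == 2)
    futex_wait_stub (futex, 2);

  /* Acquire the lock and advertise waiters in one exchange; if it was
     still held, sleep and retry.  This is the exchange referred to
     above: a failed CAS in __lll_cond_lock is followed by an exchange
     here anyway.  */
  while (__atomic_exchange_n (futex, 2, __ATOMIC_ACQUIRE) != 0)
    futex_wait_stub (futex, 2);
}
```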
Adhemerval Zanella Sept. 26, 2014, 12:20 a.m. UTC | #4
On 25-09-2014 15:00, Torvald Riegel wrote:
> On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
>> While checking the generated code and macros used in generic lowlevellock.h,
>> I noted powerpc and other arch uses uses a compare and swap instead of a plain
>> exchange value on lll_cond_lock.
>> I am not really sure which behavior would be desirable, since as far I could
>> they will have both the same side effects (since lll_cond_lock, different
>> from lll_lock, does not hold value of '1').
> What do you mean by "[the function] does not hold value of '1'"?

Bad wording, in fact; I meant the 'futex' variable used in lll_cond_lock.

>
>> So I am proposing this patch to sync default implementation for what mostly
>> architectures (ia64, ppc, s390, sparc, x86, hppa) uses for lll_cond_lock.  I see
>> that only microblaze and sh (I am not sure about this one, I not well versed in
>> its assembly and I'm being guided by its comment) used the atomic_exchange_acq
>> instead.
> I think both versions work from a correctness POV, but doing an
> unconditional exchange should be the right thing to do.
>
> The default implementation of __lll_lock_wait will test if the futex
> variable equals 2, and if not, do an exchange right away before running
> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
> thing that will happen is an exchange.  Thus, it seems that we should do
> the exchange right away.
>
> Thoughts?

The only 'advantage' I see in using the compare-and-exchange version is that it
might be an optimization on architectures that use LL/SC instead of a CAS
instruction.  For instance, on POWER the exchange version translates to:

	li      r9,2
1:	lwarx   r10,0,r3,1
	stwcx.  r9,0,r3
	bne-    1b
	isync

And the compare-and-exchange version to:

	li      r10,2
	li      r9,0
1:	lwarx   r8,0,r3,1
	cmpw    r8,r9
	bne     2f
	stwcx.  r10,0,r3
	bne-    1b
2:	isync

So for contended cases, if the lock is already taken, it avoids the store
(which on POWER8 costs at least 10 cycles, often more).
Adhemerval Zanella Oct. 10, 2014, 11:43 a.m. UTC | #5
On 25-09-2014 21:20, Adhemerval Zanella wrote:
> On 25-09-2014 15:00, Torvald Riegel wrote:
>> I think both versions work from a correctness POV, but doing an
>> unconditional exchange should be the right thing to do.
>>
>> The default implementation of __lll_lock_wait will test if the futex
>> variable equals 2, and if not, do an exchange right away before running
>> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
>> thing that will happen is an exchange.  Thus, it seems that we should do
>> the exchange right away.
>>
>> Thoughts?
> The only 'advantage' I see on using the compare and exchange version is it might be
> an optimization on architectures that uses LL/SC instead of CAS instruction.  For
> instance on POWER, the exchange version is translated to:
>
> 	li 	r9,2
>  1:     lwarx   10,0,3,1
>         stwcx.  9,0,3
>         bne-    1b
>         isync
>
> And for compare and exchange:
>
> 	li	r10,2
> 	li 	r9,0
> 1:      lwarx   r8,r0,r3,1
>         cmpw    r8,r9
> 	bne     2f
> 	stwcx.  r10,r0,r3
> 	bne-    1b
> 2:      isync
>
> So for contend cases if the lock is taken it avoids the store (which for POWER8 is
> at least 10 cycles to more).

Does this analysis make sense? Also, I'm not sure which is better for x86_64 or other
architectures.

Torvald Riegel Oct. 10, 2014, 12:33 p.m. UTC | #6
On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
> On 25-09-2014 21:20, Adhemerval Zanella wrote:
> > On 25-09-2014 15:00, Torvald Riegel wrote:
> >> I think both versions work from a correctness POV, but doing an
> >> unconditional exchange should be the right thing to do.
> >>
> >> The default implementation of __lll_lock_wait will test if the futex
> >> variable equals 2, and if not, do an exchange right away before running
> >> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
> >> thing that will happen is an exchange.  Thus, it seems that we should do
> >> the exchange right away.
> >>
> >> Thoughts?
> > The only 'advantage' I see on using the compare and exchange version is it might be
> > an optimization on architectures that uses LL/SC instead of CAS instruction.  For
> > instance on POWER, the exchange version is translated to:
> >
> > 	li 	r9,2
> >  1:     lwarx   10,0,3,1
> >         stwcx.  9,0,3
> >         bne-    1b
> >         isync
> >
> > And for compare and exchange:
> >
> > 	li	r10,2
> > 	li 	r9,0
> > 1:      lwarx   r8,r0,r3,1
> >         cmpw    r8,r9
> > 	bne     2f
> > 	stwcx.  r10,r0,r3
> > 	bne-    1b
> > 2:      isync
> >
> > So for contend cases if the lock is taken it avoids the store (which for POWER8 is
> > at least 10 cycles to more).
> 
> Does this analysis make sense?

I can't comment on POWER because I haven't obtained nor seen any data on
this.  Not sure what branch prediction and speculative execution would
make out of this.

However, another thing to note from a correctness perspective is that,
depending on the calling code, it *can make a difference* whether you do
the exchange or the CAS: if the exchange uses a release barrier, then when
the replacement CAS fails and avoids the store, you also do not get the
effect of the release barrier.  This can matter depending on how you
build the calling algorithm.  For the POWER case, you could try to
counter that with an appropriate barrier in the failure path of the CAS.
Anyway, it's not a simple replacement.

> Also, I'm not sure which is better for x86_64 or other
> architectures.

The specific read-modify-write should be faster.  The actual operation is
easier for the chip to see, the window where something else can
interfere should be smaller, etc.  I haven't done recent measurements,
but a few years ago a CAS increment loop scaled significantly worse than
an atomic increment on x86.
Adhemerval Zanella Oct. 10, 2014, 12:45 p.m. UTC | #7
On 10-10-2014 09:33, Torvald Riegel wrote:
> On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
>> On 25-09-2014 21:20, Adhemerval Zanella wrote:
>>> On 25-09-2014 15:00, Torvald Riegel wrote:
>>>> I think both versions work from a correctness POV, but doing an
>>>> unconditional exchange should be the right thing to do.
>>>>
>>>> The default implementation of __lll_lock_wait will test if the futex
>>>> variable equals 2, and if not, do an exchange right away before running
>>>> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
>>>> thing that will happen is an exchange.  Thus, it seems that we should do
>>>> the exchange right away.
>>>>
>>>> Thoughts?
>>> The only 'advantage' I see on using the compare and exchange version is it might be
>>> an optimization on architectures that uses LL/SC instead of CAS instruction.  For
>>> instance on POWER, the exchange version is translated to:
>>>
>>> 	li 	r9,2
>>>  1:     lwarx   10,0,3,1
>>>         stwcx.  9,0,3
>>>         bne-    1b
>>>         isync
>>>
>>> And for compare and exchange:
>>>
>>> 	li	r10,2
>>> 	li 	r9,0
>>> 1:      lwarx   r8,r0,r3,1
>>>         cmpw    r8,r9
>>> 	bne     2f
>>> 	stwcx.  r10,r0,r3
>>> 	bne-    1b
>>> 2:      isync
>>>
>>> So for contend cases if the lock is taken it avoids the store (which for POWER8 is
>>> at least 10 cycles to more).
>> Does this analysis make sense?
> I can't comment on POWER because I haven't obtained nor seen any data on
> this.  Not sure what branch prediction and speculative execution would
> make out of this.
>
> However, another thing to not from a correctness perspective is that
> depending on the calling code, it *can make a difference* whether you do
> the exchange or the CAS: If the exchange uses a release barrier, then if
> the replacement CAS' load avoids the store, you will also not get the
> effect of the release barrier.  Which can matter depending on how you
> build the calling algorithm.  For the POWER case, you could try to
> counter that with an appropriate barrier in the failure path of the CAS.
> Anyway, it's not a simple replacement.

But this change is specific to 'acquire' semantics and, at least on POWER,
it won't generate a memory barrier (isync is not a memory barrier).  I don't see
any correctness issue with this change for this specific usage.

>
>> Also, I'm not sure which is better for x86_64 or other
>> architectures.
> The specifc read-modify-write should be faster.  The actual operation is
> easier to see for the chip, the window where something else can
> interfere should be smaller, etc.  I haven't done recent measurements,
> but a few years ago a CAS increment loop scaled significantly worse than
> an atomic-increment on x86.
>
Right, and I also don't have numbers for powerpc to check whether the
read-modify-write scales better.  Do you recall how you evaluated the CAS loop
scaling on x86_64?  I would like to check on powerpc.
Torvald Riegel Oct. 10, 2014, 1:39 p.m. UTC | #8
On Fri, 2014-10-10 at 09:45 -0300, Adhemerval Zanella wrote:
> On 10-10-2014 09:33, Torvald Riegel wrote:
> > On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
> >> On 25-09-2014 21:20, Adhemerval Zanella wrote:
> >>> On 25-09-2014 15:00, Torvald Riegel wrote:
> >>>> I think both versions work from a correctness POV, but doing an
> >>>> unconditional exchange should be the right thing to do.
> >>>>
> >>>> The default implementation of __lll_lock_wait will test if the futex
> >>>> variable equals 2, and if not, do an exchange right away before running
> >>>> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
> >>>> thing that will happen is an exchange.  Thus, it seems that we should do
> >>>> the exchange right away.
> >>>>
> >>>> Thoughts?
> >>> The only 'advantage' I see on using the compare and exchange version is it might be
> >>> an optimization on architectures that uses LL/SC instead of CAS instruction.  For
> >>> instance on POWER, the exchange version is translated to:
> >>>
> >>> 	li 	r9,2
> >>>  1:     lwarx   10,0,3,1
> >>>         stwcx.  9,0,3
> >>>         bne-    1b
> >>>         isync
> >>>
> >>> And for compare and exchange:
> >>>
> >>> 	li	r10,2
> >>> 	li 	r9,0
> >>> 1:      lwarx   r8,r0,r3,1
> >>>         cmpw    r8,r9
> >>> 	bne     2f
> >>> 	stwcx.  r10,r0,r3
> >>> 	bne-    1b
> >>> 2:      isync
> >>>
> >>> So for contend cases if the lock is taken it avoids the store (which for POWER8 is
> >>> at least 10 cycles to more).
> >> Does this analysis make sense?
> > I can't comment on POWER because I haven't obtained nor seen any data on
> > this.  Not sure what branch prediction and speculative execution would
> > make out of this.
> >
> > However, another thing to not from a correctness perspective is that
> > depending on the calling code, it *can make a difference* whether you do
> > the exchange or the CAS: If the exchange uses a release barrier, then if
> > the replacement CAS' load avoids the store, you will also not get the
> > effect of the release barrier.  Which can matter depending on how you
> > build the calling algorithm.  For the POWER case, you could try to
> > counter that with an appropriate barrier in the failure path of the CAS.
> > Anyway, it's not a simple replacement.
> 
> But this change is specific for 'acquire' semantics and, at least for POWER,
> it won't generate memory barrier (isync is not a memory barrier).  I don't see
> any correctness issue with this change for this specific usage.

Well, it currently is acquire, I agree, and there whether we store or
not does not make a difference.  However, depending on how we interpret
trylock semantics, lock acquisition (not in the lowlevellock per se, but
when lowlevellock is used for pthread_mutex_lock) might have to have
release+store semantics.  So, while this doesn't matter for how we have
lowlevellock implemented now (at least on Power and in the generic
version), it might in the future.

So, yes, this was a general comment.  But it could apply to lowlevellock
in the future, so we should keep this in mind (preferably in a comment
in the code, if we decide to keep the CAS).

> >
> >> Also, I'm not sure which is better for x86_64 or other
> >> architectures.
> > The specifc read-modify-write should be faster.  The actual operation is
> > easier to see for the chip, the window where something else can
> > interfere should be smaller, etc.  I haven't done recent measurements,
> > but a few years ago a CAS increment loop scaled significantly worse than
> > an atomic-increment on x86.
> >
> Right and I don't have also number for powerpc to check if a the read-modify-write should
> scale better. Do you recall how you do evaluated the CAS scaling loop on x86_64? I would
> like to check on powerpc.

IIRC, I just measured throughput on a shared integer counter.  So,
several threads increment, and you measure how throughput of increments
scales with an increasing number of threads.
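(That measurement can be sketched as below; thread and iteration counts are
arbitrary illustrative choices, and timing code is omitted — wrap the run()
calls with clock_gettime(2) to compare throughput.)

```c
/* Sketch of the scaling measurement described above: several threads
   hammer one shared counter, once via atomic fetch-and-add and once
   via a CAS retry loop.  */
#include <pthread.h>
#include <stdatomic.h>

enum { NTHREADS = 4, ITERS = 100000 };

static _Atomic int counter;

static void *
inc_fetch_add (void *arg)
{
  for (int i = 0; i < ITERS; i++)
    atomic_fetch_add_explicit (&counter, 1, memory_order_relaxed);
  return NULL;
}

static void *
inc_cas_loop (void *arg)
{
  for (int i = 0; i < ITERS; i++)
    {
      int old = atomic_load_explicit (&counter, memory_order_relaxed);
      /* Retry until our CAS wins; under contention this loop is what
         scales worse than the single fetch-and-add above.  */
      while (!atomic_compare_exchange_weak_explicit
               (&counter, &old, old + 1,
                memory_order_relaxed, memory_order_relaxed))
        ;
    }
  return NULL;
}

static void
run (void *(*fn) (void *))
{
  pthread_t t[NTHREADS];
  atomic_store (&counter, 0);
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, fn, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);
}
```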

Patch

diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 28f4ba3..ba22734 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -73,7 +73,8 @@  extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
   ((void)                                                               \
    ({                                                                   \
      int *__futex = (futex);                                            \
-     if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))      \
+     if (__glibc_unlikely (						\
+	 atomic_compare_and_exchange_val_acq (__futex, 2, 0) != 0))	\
        __lll_lock_wait (__futex, private);                              \
    }))
 #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)