diff mbox

[4/4] S390: Optimize lock-elision by decrementing adapt_count at unlock.

Message ID 036b04fb-3f33-4e6b-2591-c5af87539cad@linux.vnet.ibm.com
State New
Headers show

Commit Message

Stefan Liebler Jan. 17, 2017, 3:28 p.m. UTC
On 01/11/2017 11:53 AM, Torvald Riegel wrote:
> On Tue, 2016-12-06 at 14:51 +0100, Stefan Liebler wrote:
>> This patch decrements the adapt_count while unlocking the futex
>> instead of before aquiring the futex as it is done on power, too.
>> Furthermore a transaction is only started if the futex is currently free.
>> This check is done after starting the transaction, too.
>> If the futex is not free and the transaction nesting depth is one,
>> we can simply end the started transaction instead of aborting it.
>> The implementation of this check was faulty as it always ended the
>> started transaction.  By using the fallback path, the the outermost
>> transaction was aborted.  Now the outermost transaction is aborted
>> directly.
>>
>> This patch also adds some commentary and aligns the code in
>> elision-trylock.c to the code in elision-lock.c as possible.
>
> I don't think this is quite ready yet.  See below for details.
>
> I'm not too concerned about this fact, given that it's just in
> s390-specific code.  But generally, I'd prefer if arch-specific code
> aims for the same quality and level of consensus about it as what is our
> aim for generic code.
>
>> ChangeLog:
>>
>> 	* sysdeps/unix/sysv/linux/s390/lowlevellock.h
>> 	(__lll_unlock_elision, lll_unlock_elision): Add adapt_count argument.
>> 	* sysdeps/unix/sysv/linux/s390/elision-lock.c:
>> 	(__lll_lock_elision): Decrement adapt_count while unlocking
>> 	instead of before locking.
>> 	* sysdeps/unix/sysv/linux/s390/elision-trylock.c
>> 	(__lll_trylock_elision): Likewise.
>> 	* sysdeps/unix/sysv/linux/s390/elision-unlock.c:
>> 	(__lll_unlock_elision): Likewise.
>> ---
>>  sysdeps/unix/sysv/linux/s390/elision-lock.c    | 37 ++++++++-------
>>  sysdeps/unix/sysv/linux/s390/elision-trylock.c | 62 ++++++++++++++------------
>>  sysdeps/unix/sysv/linux/s390/elision-unlock.c  | 29 ++++++++++--
>>  sysdeps/unix/sysv/linux/s390/lowlevellock.h    |  4 +-
>>  4 files changed, 78 insertions(+), 54 deletions(-)
>>
>> diff --git a/sysdeps/unix/sysv/linux/s390/elision-lock.c b/sysdeps/unix/sysv/linux/s390/elision-lock.c
>> index 3dd7fbc..4a7d546 100644
>> --- a/sysdeps/unix/sysv/linux/s390/elision-lock.c
>> +++ b/sysdeps/unix/sysv/linux/s390/elision-lock.c
>> @@ -50,31 +50,30 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
>>       critical section uses lock elision) and outside of transactions.  Thus,
>>       we need to use atomic accesses to avoid data races.  However, the
>>       value of adapt_count is just a hint, so relaxed MO accesses are
>> -     sufficient.  */
>> -  if (atomic_load_relaxed (adapt_count) > 0)
>> -    {
>> -      /* Lost updates are possible, but harmless.  Due to races this might lead
>> -	 to *adapt_count becoming less than zero.  */
>> -      atomic_store_relaxed (adapt_count,
>> -			    atomic_load_relaxed (adapt_count) - 1);
>> -      goto use_lock;
>> -    }
>> -
>> -  if (aconf.try_tbegin > 0)
>> +     sufficient.
>> +     Do not begin a transaction if another cpu has locked the
>> +     futex with normal locking.  If adapt_count is zero, it remains and the
>> +     next pthread_mutex_lock call will try to start a transaction again.  */
>
> This seems to make an assumption about performance that should be
> explained in the comment.  IIRC, x86 LE does not make this assumption,
> so it's not generally true.  I suppose s390 aborts are really expensive,
> and you don't expect that a lock is in the acquired state often enough
> so that aborts are overall more costly than the overhead of the
> additional load and branch?
>
Yes, aborting a transaction is expensive.
But you are right, there is an additional load and branch and this 
thread will anyway wait in LLL_LOCK for another thread releasing the futex.
See example below.

>> +    if (atomic_load_relaxed (futex) == 0
>> +	&& atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
>>      {
>>        int status = __libc_tbegin_retry ((void *) 0, aconf.try_tbegin - 1);
>>        if (__builtin_expect (status == _HTM_TBEGIN_STARTED,
>>  			    _HTM_TBEGIN_STARTED))
>>  	{
>> -	  if (__builtin_expect (*futex == 0, 1))
>> +	  /* Check the futex to make sure nobody has touched it in the
>> +	     mean time.  This forces the futex into the cache and makes
>> +	     sure the transaction aborts if some other cpu uses the
>> +	     lock (writes the futex).  */
>
> s/cpu/CPU/
I'll correct it in the whole file and in elision-trylock.c.
>
> I'd also say "if another thread acquires the lock concurrently" instead
> of the last part of that sentence.
>
>> +	  if (__builtin_expect (atomic_load_relaxed (futex) == 0, 1))
>>  	    /* Lock was free.  Return to user code in a transaction.  */
>>  	    return 0;
>>
>>  	  /* Lock was busy.  Fall back to normal locking.  */
>> -	  if (__builtin_expect (__libc_tx_nesting_depth (), 1))
>> +	  if (__builtin_expect (__libc_tx_nesting_depth () <= 1, 1))
>
> Use __glibc_likely?
okay.
>
>>  	    {
>>  	      /* In a non-nested transaction there is no need to abort,
>> -		 which is expensive.  */
>> +		 which is expensive.  Simply end the started transaction.  */
>>  	      __libc_tend ();
>>  	      /* Don't try to use transactions for the next couple of times.
>>  		 See above for why relaxed MO is sufficient.  */
>> @@ -92,9 +91,9 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
>>  		 is zero.
>>  		 The adapt_count of this inner mutex is not changed,
>>  		 because using the default lock with the inner mutex
>> -		 would abort the outer transaction.
>> -	      */
>> +		 would abort the outer transaction.  */
>>  	      __libc_tabort (_HTM_FIRST_USER_ABORT_CODE | 1);
>> +	      __builtin_unreachable ();
>>  	    }
>>  	}
>>        else if (status != _HTM_TBEGIN_TRANSIENT)
>> @@ -110,15 +109,15 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
>>  	}
>>        else
>>  	{
>> -	  /* Same logic as above, but for for a number of temporary failures in
>> -	     a row.  */
>> +	  /* The transaction failed for some retries with
>> +	     _HTM_TBEGIN_TRANSIENT.  Use the normal locking now and for the
>> +	     next couple of calls.  */
>>  	  if (aconf.skip_lock_out_of_tbegin_retries > 0)
>>  	    atomic_store_relaxed (adapt_count,
>>  				  aconf.skip_lock_out_of_tbegin_retries);
>>  	}
>>      }
>>
>> -  use_lock:
>>    /* Use normal locking as fallback path if transaction does not succeed.  */
>>    return LLL_LOCK ((*futex), private);
>>  }
>> diff --git a/sysdeps/unix/sysv/linux/s390/elision-trylock.c b/sysdeps/unix/sysv/linux/s390/elision-trylock.c
>> index e21fc26..dee66d4 100644
>> --- a/sysdeps/unix/sysv/linux/s390/elision-trylock.c
>> +++ b/sysdeps/unix/sysv/linux/s390/elision-trylock.c
>> @@ -43,23 +43,36 @@ __lll_trylock_elision (int *futex, short *adapt_count)
>>  	 until their try_tbegin is zero.
>>        */
>>        __libc_tabort (_HTM_FIRST_USER_ABORT_CODE | 1);
>> +      __builtin_unreachable ();
>>      }
>>
>> -  /* Only try a transaction if it's worth it.  See __lll_lock_elision for
>> -     why we need atomic accesses.  Relaxed MO is sufficient because this is
>> -     just a hint.  */
>> -  if (atomic_load_relaxed (adapt_count) <= 0)
>> +  /* adapt_count can be accessed concurrently; these accesses can be both
>> +     inside of transactions (if critical sections are nested and the outer
>> +     critical section uses lock elision) and outside of transactions.  Thus,
>> +     we need to use atomic accesses to avoid data races.  However, the
>> +     value of adapt_count is just a hint, so relaxed MO accesses are
>> +     sufficient.
>> +     Do not begin a transaction if another cpu has locked the
>> +     futex with normal locking.  If adapt_count is zero, it remains and the
>> +     next pthread_mutex_lock call will try to start a transaction again.  */
>> +    if (atomic_load_relaxed (futex) == 0
>> +	&& atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
>>      {
>> -      int status;
>> -
>> -      if (__builtin_expect
>> -	  ((status = __libc_tbegin ((void *) 0)) == _HTM_TBEGIN_STARTED, 1))
>> +      int status = __libc_tbegin ((void *) 0);
>> +      if (__builtin_expect (status  == _HTM_TBEGIN_STARTED,
>> +			    _HTM_TBEGIN_STARTED))
>>  	{
>> -	  if (*futex == 0)
>> +	  /* Check the futex to make sure nobody has touched it in the
>> +	     mean time.  This forces the futex into the cache and makes
>> +	     sure the transaction aborts if some other cpu uses the
>> +	     lock (writes the futex).  */
>> +	  if (__builtin_expect (atomic_load_relaxed (futex) == 0, 1))
>
> __glibc_likely
okay.
>
>> +	    /* Lock was free.  Return to user code in a transaction.  */
>>  	    return 0;
>> -	  /* Lock was busy.  Fall back to normal locking.  */
>> -	  /* Since we are in a non-nested transaction there is no need to abort,
>> -	     which is expensive.  */
>> +
>> +	  /* Lock was busy.  Fall back to normal locking.  Since we are in
>> +	     a non-nested transaction there is no need to abort, which is
>> +	     expensive.  Simply end the started transaction.  */
>>  	  __libc_tend ();
>>  	  /* Note: Changing the adapt_count here might abort a transaction on a
>>  	     different cpu, but that could happen anyway when the futex is
>> @@ -68,27 +81,18 @@ __lll_trylock_elision (int *futex, short *adapt_count)
>>  	  if (aconf.skip_lock_busy > 0)
>>  	    atomic_store_relaxed (adapt_count, aconf.skip_lock_busy);
>>  	}
>> -      else
>> +      else if (status != _HTM_TBEGIN_TRANSIENT)
>>  	{
>> -	  if (status != _HTM_TBEGIN_TRANSIENT)
>> -	    {
>> -	      /* A persistent abort (cc 1 or 3) indicates that a retry is
>> -		 probably futile.  Use the normal locking now and for the
>> -		 next couple of calls.
>> -		 Be careful to avoid writing to the lock.  */
>> -	      if (aconf.skip_trylock_internal_abort > 0)
>> -		*adapt_count = aconf.skip_trylock_internal_abort;
>> -	    }
>> +	  /* A persistent abort (cc 1 or 3) indicates that a retry is
>> +	     probably futile.  Use the normal locking now and for the
>> +	     next couple of calls.
>> +	     Be careful to avoid writing to the lock.  */
>> +	  if (aconf.skip_trylock_internal_abort > 0)
>> +	    *adapt_count = aconf.skip_trylock_internal_abort;
>>  	}
>>        /* Could do some retries here.  */
>>      }
>> -  else
>> -    {
>> -      /* Lost updates are possible, but harmless.  Due to races this might lead
>> -	 to *adapt_count becoming less than zero.  */
>> -      atomic_store_relaxed (adapt_count,
>> -			    atomic_load_relaxed (adapt_count) - 1);
>> -    }
>>
>> +  /* Use normal locking as fallback path if transaction does not succeed.  */
>>    return lll_trylock (*futex);
>>  }
>> diff --git a/sysdeps/unix/sysv/linux/s390/elision-unlock.c b/sysdeps/unix/sysv/linux/s390/elision-unlock.c
>> index 0b1ade9..e68d970 100644
>> --- a/sysdeps/unix/sysv/linux/s390/elision-unlock.c
>> +++ b/sysdeps/unix/sysv/linux/s390/elision-unlock.c
>> @@ -21,16 +21,37 @@
>>  #include <htm.h>
>>
>>  int
>> -__lll_unlock_elision(int *futex, int private)
>> +__lll_unlock_elision(int *futex, short *adapt_count, int private)
>>  {
>>    /* If the lock is free, we elided the lock earlier.  This does not
>>       necessarily mean that we are in a transaction, because the user code may
>> -     have closed the transaction, but that is impossible to detect reliably.  */
>> -  if (*futex == 0)
>> +     have closed the transaction, but that is impossible to detect reliably.
>> +     Relaxed MO access to futex is sufficient as we only need a hint, if we
>
> This comment is incorrect because we don't just need just a hint here.
> The reason why relaxed MO is sufficient is because a correct program
> will only release a lock it has acquired; therefore, it must either
> changed the futex word's value to something !=0 or it must have used
> elision; these are actions by the same thread, so these actions are
> sequenced-before the relaxed load (and thus also happens-before the
> relaxed load).  Therefore, relaxed MO is sufficient.
>
okay.
>> +     started a transaction or acquired the futex in e.g. elision-lock.c.  */
>> +  if (atomic_load_relaxed (futex) == 0)
>>      {
>>        __libc_tend ();
>>      }
>>    else
>> -    lll_unlock ((*futex), private);
>> +    {
>> +      /* Update the adapt_count while unlocking before completing the critical
>> +	 section.  adapt_count is accessed concurrently outside of a
>> +	 transaction or an aquired lock e.g. in elision-lock.c so we need to use
>
> transaction or a critical section (e.g., in elision-lock.c); so, we need
> to use
>
okay

>> +	 atomic accesses.  However, the value of adapt_count is just a hint, so
>> +	 relaxed MO accesses are sufficient.
>> +	 If adapt_count would be decremented while locking, multiple
>> +	 CPUs trying to lock the locked mutex will decrement adapt_count to
>> +	 zero and another CPU will try to start a transaction, which will be
>> +	 immediately aborted as the mutex is locked.
>
> I don't think this is necessarily the case.  It is true that if more
> than one thread decrements, only one would immediately try to use
> elision (because only one decrements from 1 (ignoring lost updates)).

> However, if you decrement in the critical section, and lock acquisitions
> wait until the lock is free *before* loading adapt_count and choosing
> whether to use elision or not, then it shouldn't matter whether you
> decrement closer to the lock acquisition or closer to the release.
Waiting for a free lock is done by futex-syscall within LLL_LOCK (see 
__lll_lock_wait/__lll_lock_wait_private).
On wakeup the lock is immediately acquired if it is free.
Afterwards there is no loading of adapt_count and no decision whether to 
use elision or not.
Following your suggestion means, that we need a further implementation 
like in __lll_lock_wait/__lll_lock_wait_private in order to wait for a 
free lock and then load adapt_count and choose to elide or not!


Please have a look at the following example assuming:
adapt_count = 1;
There are two scenarios below: Imagine that only "Thread 3a" or "Thread 
3b" is used.

Decrement adapt_count while locking (without this patch):
-Thread 1 __lll_lock_elision:
decrements adapt_count to 0 and acquires the lock via LLL_LOCK.
-Thread 2 __lll_lock_elision:
starts a transaction and ends / aborts it immediately as lock is 
acquired. adapt_count is set to 3. LLL_LOCK waits until lock is released 
by Thread 1.
-Thread 1 __lll_unlock_elision:
releases lock.
-Thread 2 __lll_lock_elision:
wakes up and aquires lock via waiting LLL_LOCK.
-Thread 3a __lll_lock_elision:
decrements adapt_count to 2 and waits via LLL_LOCK until lock is 
released by Thread 2.
-Thread 2 __lll_unlock_elision:
releases lock.
-Thread 3b __lll_lock_elision:
decrements adapt_count to 2 and acquires lock via LLL_LOCK.

Decrement adapt_count while unlocking (with this patch):
-Thread 1 __lll_lock_elision:
acquires the lock via LLL_LOCK. adapt_count remains 1.
-Thread 2 __lll_lock_elision:
LLL_LOCK is used as futex is acquired by Thread 1 or adapt_count > 0. It 
waits until lock is released by Thread 1.
-Thread 1 __lll_unlock_elision:
decrements adapt_count to 0 and releases the lock.
-Thread 2 __lll_lock_elision:
wakes up and acquires lock via waiting LLL_LOCK.
-Thread 3a __lll_lock_elision:
LLL_LOCK is used as futex is acquired by Thread 2.
-Thread 2 __lll_unlock_elision:
releases lock. adapt_count remains 0.
-Thread 3b __lll_lock_elision:
starts a transaction.

If futex is not tested before starting a transaction,
the additional load and branch is not needed, Thread 3a will start and 
abort a transaction, set adapt_count to 3 and will wait via LLL_LOCK.
In case Thread 3b, a transaction will be started.

The attached diff removes the futex==0 test.
Later I will make one patch with changelog for this and the other two 
patches.
>
> I think this needs a more thorough analysis (including better
> documentation) and/or a microbenchmark.
>
>> +	 If adapt_count would be decremented while unlocking after completing
>> +	 the critical section, possible waiters will be waked up before
>> +	 decrementing the adapt_count.  Those waked up waiters could have
>> +	 destroyed and freed this mutex!  */
>
> Please fix this sentence.  Or you could just say that POSIX' mutex
> destruction requirements disallow accesses to the mutexes after it has
> been released and thus could have been acquired by another thread.
>
Okay.
>> +      short adapt_count_val = atomic_load_relaxed (adapt_count);
>> +      if (adapt_count_val > 0)
>> +	atomic_store_relaxed (adapt_count, adapt_count_val - 1);
>> +
>> +      lll_unlock ((*futex), private);
>> +    }
>>    return 0;
>>  }
>> diff --git a/sysdeps/unix/sysv/linux/s390/lowlevellock.h b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
>> index 8d564ed..c60f4f7 100644
>> --- a/sysdeps/unix/sysv/linux/s390/lowlevellock.h
>> +++ b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
>> @@ -50,7 +50,7 @@ extern int __lll_timedlock_elision
>>  extern int __lll_lock_elision (int *futex, short *adapt_count, int private)
>>    attribute_hidden;
>>
>> -extern int __lll_unlock_elision(int *futex, int private)
>> +extern int __lll_unlock_elision(int *futex, short *adapt_count, int private)
>>    attribute_hidden;
>>
>>  extern int __lll_trylock_elision(int *futex, short *adapt_count)
>> @@ -59,7 +59,7 @@ extern int __lll_trylock_elision(int *futex, short *adapt_count)
>>  # define lll_lock_elision(futex, adapt_count, private) \
>>    __lll_lock_elision (&(futex), &(adapt_count), private)
>>  #  define lll_unlock_elision(futex, adapt_count, private) \
>> -  __lll_unlock_elision (&(futex), private)
>> +  __lll_unlock_elision (&(futex), &(adapt_count), private)
>>  #  define lll_trylock_elision(futex, adapt_count) \
>>    __lll_trylock_elision(&(futex), &(adapt_count))
>>  #endif  /* ENABLE_LOCK_ELISION */
>
>
>

Comments

Torvald Riegel Jan. 17, 2017, 6:52 p.m. UTC | #1
On Tue, 2017-01-17 at 16:28 +0100, Stefan Liebler wrote:
> On 01/11/2017 11:53 AM, Torvald Riegel wrote:
> > On Tue, 2016-12-06 at 14:51 +0100, Stefan Liebler wrote:
> >> This patch decrements the adapt_count while unlocking the futex
> >> instead of before aquiring the futex as it is done on power, too.
> >> Furthermore a transaction is only started if the futex is currently free.
> >> This check is done after starting the transaction, too.
> >> If the futex is not free and the transaction nesting depth is one,
> >> we can simply end the started transaction instead of aborting it.
> >> The implementation of this check was faulty as it always ended the
> >> started transaction.  By using the fallback path, the the outermost
> >> transaction was aborted.  Now the outermost transaction is aborted
> >> directly.
> >>
> >> This patch also adds some commentary and aligns the code in
> >> elision-trylock.c to the code in elision-lock.c as possible.
> >
> > I don't think this is quite ready yet.  See below for details.
> >
> > I'm not too concerned about this fact, given that it's just in
> > s390-specific code.  But generally, I'd prefer if arch-specific code
> > aims for the same quality and level of consensus about it as what is our
> > aim for generic code.
> >
> >> ChangeLog:
> >>
> >> 	* sysdeps/unix/sysv/linux/s390/lowlevellock.h
> >> 	(__lll_unlock_elision, lll_unlock_elision): Add adapt_count argument.
> >> 	* sysdeps/unix/sysv/linux/s390/elision-lock.c:
> >> 	(__lll_lock_elision): Decrement adapt_count while unlocking
> >> 	instead of before locking.
> >> 	* sysdeps/unix/sysv/linux/s390/elision-trylock.c
> >> 	(__lll_trylock_elision): Likewise.
> >> 	* sysdeps/unix/sysv/linux/s390/elision-unlock.c:
> >> 	(__lll_unlock_elision): Likewise.
> >> ---
> >>  sysdeps/unix/sysv/linux/s390/elision-lock.c    | 37 ++++++++-------
> >>  sysdeps/unix/sysv/linux/s390/elision-trylock.c | 62 ++++++++++++++------------
> >>  sysdeps/unix/sysv/linux/s390/elision-unlock.c  | 29 ++++++++++--
> >>  sysdeps/unix/sysv/linux/s390/lowlevellock.h    |  4 +-
> >>  4 files changed, 78 insertions(+), 54 deletions(-)
> >>
> >> diff --git a/sysdeps/unix/sysv/linux/s390/elision-lock.c b/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >> index 3dd7fbc..4a7d546 100644
> >> --- a/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >> +++ b/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >> @@ -50,31 +50,30 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
> >>       critical section uses lock elision) and outside of transactions.  Thus,
> >>       we need to use atomic accesses to avoid data races.  However, the
> >>       value of adapt_count is just a hint, so relaxed MO accesses are
> >> -     sufficient.  */
> >> -  if (atomic_load_relaxed (adapt_count) > 0)
> >> -    {
> >> -      /* Lost updates are possible, but harmless.  Due to races this might lead
> >> -	 to *adapt_count becoming less than zero.  */
> >> -      atomic_store_relaxed (adapt_count,
> >> -			    atomic_load_relaxed (adapt_count) - 1);
> >> -      goto use_lock;
> >> -    }
> >> -
> >> -  if (aconf.try_tbegin > 0)
> >> +     sufficient.
> >> +     Do not begin a transaction if another cpu has locked the
> >> +     futex with normal locking.  If adapt_count is zero, it remains and the
> >> +     next pthread_mutex_lock call will try to start a transaction again.  */
> >
> > This seems to make an assumption about performance that should be
> > explained in the comment.  IIRC, x86 LE does not make this assumption,
> > so it's not generally true.  I suppose s390 aborts are really expensive,
> > and you don't expect that a lock is in the acquired state often enough
> > so that aborts are overall more costly than the overhead of the
> > additional load and branch?
> >
> Yes, aborting a transaction is expensive.
> But you are right, there is an additional load and branch and this 
> thread will anyway wait in LLL_LOCK for another thread releasing the futex.
> See example below.

Note that I'm not actually arguing against having the check -- I just
want a clear explanation in the code including the assumptions about
performance that motivate this particular choice.

If we don't add this information, things will get messy.  And a future
maintainer will not be aware of these assumptions.

> >> +	 atomic accesses.  However, the value of adapt_count is just a hint, so
> >> +	 relaxed MO accesses are sufficient.
> >> +	 If adapt_count would be decremented while locking, multiple
> >> +	 CPUs trying to lock the locked mutex will decrement adapt_count to
> >> +	 zero and another CPU will try to start a transaction, which will be
> >> +	 immediately aborted as the mutex is locked.
> >
> > I don't think this is necessarily the case.  It is true that if more
> > than one thread decrements, only one would immediately try to use
> > elision (because only one decrements from 1 (ignoring lost updates)).
> 
> > However, if you decrement in the critical section, and lock acquisitions
> > wait until the lock is free *before* loading adapt_count and choosing
> > whether to use elision or not, then it shouldn't matter whether you
> > decrement closer to the lock acquisition or closer to the release.
> Waiting for a free lock is done by futex-syscall within LLL_LOCK (see 
> __lll_lock_wait/__lll_lock_wait_private).
> On wakeup the lock is immediately acquired if it is free.
> Afterwards there is no loading of adapt_count and no decision whether to 
> use elision or not.
> Following your suggestion means, that we need a further implementation 
> like in __lll_lock_wait/__lll_lock_wait_private in order to wait for a 
> free lock and then load adapt_count and choose to elide or not!

Well, what would be required is that after we block using futexes, we
reassess the situation (including potentially using elision), instead of
just proceeding to try to acquire the lock without elision.

> 
> Please have a look at the following example assuming:
> adapt_count = 1;
> There are two scenarios below: Imagine that only "Thread 3a" or "Thread 
> 3b" is used.
> 
> Decrement adapt_count while locking (without this patch):
> -Thread 1 __lll_lock_elision:
> decrements adapt_count to 0 and acquires the lock via LLL_LOCK.
> -Thread 2 __lll_lock_elision:
> starts a transaction and ends / aborts it immediately as lock is 
> acquired. adapt_count is set to 3. LLL_LOCK waits until lock is released 
> by Thread 1.
> -Thread 1 __lll_unlock_elision:
> releases lock.
> -Thread 2 __lll_lock_elision:
> wakes up and aquires lock via waiting LLL_LOCK.
> -Thread 3a __lll_lock_elision:
> decrements adapt_count to 2 and waits via LLL_LOCK until lock is 
> released by Thread 2.
> -Thread 2 __lll_unlock_elision:
> releases lock.
> -Thread 3b __lll_lock_elision:
> decrements adapt_count to 2 and acquires lock via LLL_LOCK.
> 
> Decrement adapt_count while unlocking (with this patch):
> -Thread 1 __lll_lock_elision:
> acquires the lock via LLL_LOCK. adapt_count remains 1.
> -Thread 2 __lll_lock_elision:
> LLL_LOCK is used as futex is acquired by Thread 1 or adapt_count > 0. It 
> waits until lock is released by Thread 1.
> -Thread 1 __lll_unlock_elision:
> decrements adapt_count to 0 and releases the lock.
> -Thread 2 __lll_lock_elision:
> wakes up and acquires lock via waiting LLL_LOCK.
> -Thread 3a __lll_lock_elision:
> LLL_LOCK is used as futex is acquired by Thread 2.
> -Thread 2 __lll_unlock_elision:
> releases lock. adapt_count remains 0.
> -Thread 3b __lll_lock_elision:
> starts a transaction.

I agree that if we do NOT wait for the lock to become free before
checking whether we can use elision and do NOT try to use elision after
we blocked using futexes, then decrementing closer to the release of a
mutex can decrease the window in which another thread can observe
adapt_count==0 but lock!=0.

This may have required more changes, but it would have been cleaner
overall, and I guess we would have ended up with less s390-specific
code.

However, it's too late for this now, given that we're past the freeze.
We should get back to this when master opens up for normal development.

> If futex is not tested before starting a transaction,
> the additional load and branch is not needed, Thread 3a will start and 
> abort a transaction, set adapt_count to 3 and will wait via LLL_LOCK.
> In case Thread 3b, a transaction will be started.
> 
> The attached diff removes the futex==0 test.
> Later I will make one patch with changelog for this and the other two 
> patches.

See above, I didn't request to remove it, but just to clearly document
why you included it.  If you think removing it is just as fine, that's
OK with me too.

> >
> > I think this needs a more thorough analysis (including better
> > documentation) and/or a microbenchmark.

Regarding the patch:

+	     Since we are in a non-nested transaction there is no need to abort,
+	     which is expensive.  Simply end the started transaction.  */

Is that actually true?  You don't seem to check whether you are indeed
not in a nested transaction.  The difference is that you do not need to
acquire the lock because this is trylock, so you can run a little
further as a transaction.  You will get aborted as soon the lock you
failed to lock is released though.
So, it's not quite clear to me what the expected win is; are you aiming
at programs that don't actually need to acquire the lock and whose
transactions finish quickly enough to not get aborted by a concurrent
release of the lock?

The rest of the diff looked OK.
Stefan Liebler Jan. 18, 2017, 12:28 p.m. UTC | #2
On 01/17/2017 07:52 PM, Torvald Riegel wrote:
> On Tue, 2017-01-17 at 16:28 +0100, Stefan Liebler wrote:
>> On 01/11/2017 11:53 AM, Torvald Riegel wrote:
>>> On Tue, 2016-12-06 at 14:51 +0100, Stefan Liebler wrote:
>>>> This patch decrements the adapt_count while unlocking the futex
>>>> instead of before aquiring the futex as it is done on power, too.
>>>> Furthermore a transaction is only started if the futex is currently free.
>>>> This check is done after starting the transaction, too.
>>>> If the futex is not free and the transaction nesting depth is one,
>>>> we can simply end the started transaction instead of aborting it.
>>>> The implementation of this check was faulty as it always ended the
>>>> started transaction.  By using the fallback path, the the outermost
>>>> transaction was aborted.  Now the outermost transaction is aborted
>>>> directly.
>>>>
>>>> This patch also adds some commentary and aligns the code in
>>>> elision-trylock.c to the code in elision-lock.c as possible.
>>>
>>> I don't think this is quite ready yet.  See below for details.
>>>
>>> I'm not too concerned about this fact, given that it's just in
>>> s390-specific code.  But generally, I'd prefer if arch-specific code
>>> aims for the same quality and level of consensus about it as what is our
>>> aim for generic code.
>>>
>>>> ChangeLog:
>>>>
>>>> 	* sysdeps/unix/sysv/linux/s390/lowlevellock.h
>>>> 	(__lll_unlock_elision, lll_unlock_elision): Add adapt_count argument.
>>>> 	* sysdeps/unix/sysv/linux/s390/elision-lock.c:
>>>> 	(__lll_lock_elision): Decrement adapt_count while unlocking
>>>> 	instead of before locking.
>>>> 	* sysdeps/unix/sysv/linux/s390/elision-trylock.c
>>>> 	(__lll_trylock_elision): Likewise.
>>>> 	* sysdeps/unix/sysv/linux/s390/elision-unlock.c:
>>>> 	(__lll_unlock_elision): Likewise.
>>>> ---
>>>>  sysdeps/unix/sysv/linux/s390/elision-lock.c    | 37 ++++++++-------
>>>>  sysdeps/unix/sysv/linux/s390/elision-trylock.c | 62 ++++++++++++++------------
>>>>  sysdeps/unix/sysv/linux/s390/elision-unlock.c  | 29 ++++++++++--
>>>>  sysdeps/unix/sysv/linux/s390/lowlevellock.h    |  4 +-
>>>>  4 files changed, 78 insertions(+), 54 deletions(-)
>>>>
>>>> diff --git a/sysdeps/unix/sysv/linux/s390/elision-lock.c b/sysdeps/unix/sysv/linux/s390/elision-lock.c
>>>> index 3dd7fbc..4a7d546 100644
>>>> --- a/sysdeps/unix/sysv/linux/s390/elision-lock.c
>>>> +++ b/sysdeps/unix/sysv/linux/s390/elision-lock.c
>>>> @@ -50,31 +50,30 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
>>>>       critical section uses lock elision) and outside of transactions.  Thus,
>>>>       we need to use atomic accesses to avoid data races.  However, the
>>>>       value of adapt_count is just a hint, so relaxed MO accesses are
>>>> -     sufficient.  */
>>>> -  if (atomic_load_relaxed (adapt_count) > 0)
>>>> -    {
>>>> -      /* Lost updates are possible, but harmless.  Due to races this might lead
>>>> -	 to *adapt_count becoming less than zero.  */
>>>> -      atomic_store_relaxed (adapt_count,
>>>> -			    atomic_load_relaxed (adapt_count) - 1);
>>>> -      goto use_lock;
>>>> -    }
>>>> -
>>>> -  if (aconf.try_tbegin > 0)
>>>> +     sufficient.
>>>> +     Do not begin a transaction if another cpu has locked the
>>>> +     futex with normal locking.  If adapt_count is zero, it remains and the
>>>> +     next pthread_mutex_lock call will try to start a transaction again.  */
>>>
>>> This seems to make an assumption about performance that should be
>>> explained in the comment.  IIRC, x86 LE does not make this assumption,
>>> so it's not generally true.  I suppose s390 aborts are really expensive,
>>> and you don't expect that a lock is in the acquired state often enough
>>> so that aborts are overall more costly than the overhead of the
>>> additional load and branch?
>>>
>> Yes, aborting a transaction is expensive.
>> But you are right, there is an additional load and branch and this
>> thread will anyway wait in LLL_LOCK for another thread releasing the futex.
>> See example below.
>
> Note that I'm not actually arguing against having the check -- I just
> want a clear explanation in the code including the assumptions about
> performance that motivate this particular choice.
>
> If we don't add this information, things will get messy.  And a future
> maintainer will not be aware of these assumptions.
>
>>>> +	 atomic accesses.  However, the value of adapt_count is just a hint, so
>>>> +	 relaxed MO accesses are sufficient.
>>>> +	 If adapt_count would be decremented while locking, multiple
>>>> +	 CPUs trying to lock the locked mutex will decrement adapt_count to
>>>> +	 zero and another CPU will try to start a transaction, which will be
>>>> +	 immediately aborted as the mutex is locked.
>>>
>>> I don't think this is necessarily the case.  It is true that if more
>>> than one thread decrements, only one would immediately try to use
>>> elision (because only one decrements from 1 (ignoring lost updates)).
>>
>>> However, if you decrement in the critical section, and lock acquisitions
>>> wait until the lock is free *before* loading adapt_count and choosing
>>> whether to use elision or not, then it shouldn't matter whether you
>>> decrement closer to the lock acquisition or closer to the release.
>> Waiting for a free lock is done by futex-syscall within LLL_LOCK (see
>> __lll_lock_wait/__lll_lock_wait_private).
>> On wakeup the lock is immediately acquired if it is free.
>> Afterwards there is no loading of adapt_count and no decision whether to
>> use elision or not.
>> Following your suggestion means, that we need a further implementation
>> like in __lll_lock_wait/__lll_lock_wait_private in order to wait for a
>> free lock and then load adapt_count and choose to elide or not!
>
> Well, what would be required is that after we block using futexes, we
> reassess the situation (including potentially using elision), instead of
> just proceeding to try to acquire the lock without elision.
>
If the lock is released and we have been waking up.
Do we know if other threads are also waiting for the futex?
The strategy of __lll_lock_wait is to keep futex==2 and wakup possible 
waiters with __lll_unlock.

>>
>> Please have a look at the following example assuming:
>> adapt_count = 1;
>> There are two scenarios below: Imagine that only "Thread 3a" or "Thread
>> 3b" is used.
>>
>> Decrement adapt_count while locking (without this patch):
>> -Thread 1 __lll_lock_elision:
>> decrements adapt_count to 0 and acquires the lock via LLL_LOCK.
>> -Thread 2 __lll_lock_elision:
>> starts a transaction and ends / aborts it immediately as lock is
>> acquired. adapt_count is set to 3. LLL_LOCK waits until lock is released
>> by Thread 1.
>> -Thread 1 __lll_unlock_elision:
>> releases lock.
>> -Thread 2 __lll_lock_elision:
>> wakes up and aquires lock via waiting LLL_LOCK.
>> -Thread 3a __lll_lock_elision:
>> decrements adapt_count to 2 and waits via LLL_LOCK until lock is
>> released by Thread 2.
>> -Thread 2 __lll_unlock_elision:
>> releases lock.
>> -Thread 3b __lll_lock_elision:
>> decrements adapt_count to 2 and acquires lock via LLL_LOCK.
>>
>> Decrement adapt_count while unlocking (with this patch):
>> -Thread 1 __lll_lock_elision:
>> acquires the lock via LLL_LOCK. adapt_count remains 1.
>> -Thread 2 __lll_lock_elision:
>> LLL_LOCK is used as futex is acquired by Thread 1 or adapt_count > 0. It
>> waits until lock is released by Thread 1.
>> -Thread 1 __lll_unlock_elision:
>> decrements adapt_count to 0 and releases the lock.
>> -Thread 2 __lll_lock_elision:
>> wakes up and acquires lock via waiting LLL_LOCK.
>> -Thread 3a __lll_lock_elision:
>> LLL_LOCK is used as futex is acquired by Thread 2.
>> -Thread 2 __lll_unlock_elision:
>> releases lock. adapt_count remains 0.
>> -Thread 3b __lll_lock_elision:
>> starts a transaction.
>
> I agree that if we do NOT wait for the lock to become free before
> checking whether we can use elision and do NOT try to use elision after
> we blocked using futexes, then decrementing closer to the release of a
> mutex can decrease the window in which another thread can observe
> adapt_count==0 but lock!=0.
>
> This may have required more changes, but it would have been cleaner
> overall, and I guess we would have ended up with less s390-specific
> code.
>
> However, it's too late for this now, given that we're past the freeze.
> We should get back to this when master opens up for normal development.
>
>> If futex is not tested before starting a transaction,
>> the additional load and branch is not needed, Thread 3a will start and
>> abort a transaction, set adapt_count to 3 and will wait via LLL_LOCK.
>> In case Thread 3b, a transaction will be started.
>>
>> The attached diff removes the futex==0 test.
>> Later I will make one patch with changelog for this and the other two
>> patches.
>
> See above, I didn't request to remove it, but just to clearly document
> why you included it.  If you think removing it is just as fine, that's
> OK with me too.
>
Removing it is fine.
>>>
>>> I think this needs a more thorough analysis (including better
>>> documentation) and/or a microbenchmark.
>
> Regarding the patch:
>
> +	     Since we are in a non-nested transaction there is no need to abort,
> +	     which is expensive.  Simply end the started transaction.  */
>
> Is that actually true?  You don't seem to check whether you are indeed
> not in a nested transaction.  The difference is that you do not need to
> acquire the lock because this is trylock, so you can run a little
> further as a transaction.  You will get aborted as soon the lock you
> failed to lock is released though.
> So, it's not quite clear to me what the expected win is; are you aiming
> at programs that don't actually need to acquire the lock and whose
> transactions finish quickly enough to not get aborted by a concurrent
> release of the lock?
>
Every time we enter __lll_trylock_elision, we check if we are in a 
transaction on this CPU and abort it if needed. This is needed to detect 
multiple elided trylock calls of one thread.
Afterwards we start a transaction (if adapt_count permits it) and then 
we check the futex.
If it is free, we return within a transaction. If another thread 
acquired the lock with a transaction, we can't detect it here. But a 
conflict will abort the transactions.
If the futex is already acquired, the call shall return immediately with 
an error. Thus the current transaction, which was started before, is 
ended. You are right, if the other thread releases the lock, the 
transaction will be aborted.
As ending a transaction is faster than aborting, the transaction is 
ended. The power implementation also ends it. On intel it is aborted, 
but according to the comment, an abort is used instead of an end due to 
visibility reasons while profiling.


> The rest of the diff looked OK.
>
If it's okay, I will make one patch with all the reviewed diffs and post 
it with a ChangeLog in this mail-thread.
Then I would commit it before release to have a consistent state?
Torvald Riegel Jan. 18, 2017, 4:25 p.m. UTC | #3
On Wed, 2017-01-18 at 13:28 +0100, Stefan Liebler wrote:
> On 01/17/2017 07:52 PM, Torvald Riegel wrote:
> > On Tue, 2017-01-17 at 16:28 +0100, Stefan Liebler wrote:
> >> On 01/11/2017 11:53 AM, Torvald Riegel wrote:
> >>> On Tue, 2016-12-06 at 14:51 +0100, Stefan Liebler wrote:
> >>>> This patch decrements the adapt_count while unlocking the futex
> >>>> instead of before aquiring the futex as it is done on power, too.
> >>>> Furthermore a transaction is only started if the futex is currently free.
> >>>> This check is done after starting the transaction, too.
> >>>> If the futex is not free and the transaction nesting depth is one,
> >>>> we can simply end the started transaction instead of aborting it.
> >>>> The implementation of this check was faulty as it always ended the
> >>>> started transaction.  By using the fallback path, the the outermost
> >>>> transaction was aborted.  Now the outermost transaction is aborted
> >>>> directly.
> >>>>
> >>>> This patch also adds some commentary and aligns the code in
> >>>> elision-trylock.c to the code in elision-lock.c as possible.
> >>>
> >>> I don't think this is quite ready yet.  See below for details.
> >>>
> >>> I'm not too concerned about this fact, given that it's just in
> >>> s390-specific code.  But generally, I'd prefer if arch-specific code
> >>> aims for the same quality and level of consensus about it as what is our
> >>> aim for generic code.
> >>>
> >>>> ChangeLog:
> >>>>
> >>>> 	* sysdeps/unix/sysv/linux/s390/lowlevellock.h
> >>>> 	(__lll_unlock_elision, lll_unlock_elision): Add adapt_count argument.
> >>>> 	* sysdeps/unix/sysv/linux/s390/elision-lock.c:
> >>>> 	(__lll_lock_elision): Decrement adapt_count while unlocking
> >>>> 	instead of before locking.
> >>>> 	* sysdeps/unix/sysv/linux/s390/elision-trylock.c
> >>>> 	(__lll_trylock_elision): Likewise.
> >>>> 	* sysdeps/unix/sysv/linux/s390/elision-unlock.c:
> >>>> 	(__lll_unlock_elision): Likewise.
> >>>> ---
> >>>>  sysdeps/unix/sysv/linux/s390/elision-lock.c    | 37 ++++++++-------
> >>>>  sysdeps/unix/sysv/linux/s390/elision-trylock.c | 62 ++++++++++++++------------
> >>>>  sysdeps/unix/sysv/linux/s390/elision-unlock.c  | 29 ++++++++++--
> >>>>  sysdeps/unix/sysv/linux/s390/lowlevellock.h    |  4 +-
> >>>>  4 files changed, 78 insertions(+), 54 deletions(-)
> >>>>
> >>>> diff --git a/sysdeps/unix/sysv/linux/s390/elision-lock.c b/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >>>> index 3dd7fbc..4a7d546 100644
> >>>> --- a/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >>>> +++ b/sysdeps/unix/sysv/linux/s390/elision-lock.c
> >>>> @@ -50,31 +50,30 @@ __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
> >>>>       critical section uses lock elision) and outside of transactions.  Thus,
> >>>>       we need to use atomic accesses to avoid data races.  However, the
> >>>>       value of adapt_count is just a hint, so relaxed MO accesses are
> >>>> -     sufficient.  */
> >>>> -  if (atomic_load_relaxed (adapt_count) > 0)
> >>>> -    {
> >>>> -      /* Lost updates are possible, but harmless.  Due to races this might lead
> >>>> -	 to *adapt_count becoming less than zero.  */
> >>>> -      atomic_store_relaxed (adapt_count,
> >>>> -			    atomic_load_relaxed (adapt_count) - 1);
> >>>> -      goto use_lock;
> >>>> -    }
> >>>> -
> >>>> -  if (aconf.try_tbegin > 0)
> >>>> +     sufficient.
> >>>> +     Do not begin a transaction if another cpu has locked the
> >>>> +     futex with normal locking.  If adapt_count is zero, it remains and the
> >>>> +     next pthread_mutex_lock call will try to start a transaction again.  */
> >>>
> >>> This seems to make an assumption about performance that should be
> >>> explained in the comment.  IIRC, x86 LE does not make this assumption,
> >>> so it's not generally true.  I suppose s390 aborts are really expensive,
> >>> and you don't expect that a lock is in the acquired state often enough
> >>> so that aborts are overall more costly than the overhead of the
> >>> additional load and branch?
> >>>
> >> Yes, aborting a transaction is expensive.
> >> But you are right, there is an additional load and branch and this
> >> thread will anyway wait in LLL_LOCK for another thread releasing the futex.
> >> See example below.
> >
> > Note that I'm not actually arguing against having the check -- I just
> > want a clear explanation in the code including the assumptions about
> > performance that motivate this particular choice.
> >
> > If we don't add this information, things will get messy.  And a future
> > maintainer will not be aware of these assumptions.
> >
> >>>> +	 atomic accesses.  However, the value of adapt_count is just a hint, so
> >>>> +	 relaxed MO accesses are sufficient.
> >>>> +	 If adapt_count would be decremented while locking, multiple
> >>>> +	 CPUs trying to lock the locked mutex will decrement adapt_count to
> >>>> +	 zero and another CPU will try to start a transaction, which will be
> >>>> +	 immediately aborted as the mutex is locked.
> >>>
> >>> I don't think this is necessarily the case.  It is true that if more
> >>> than one thread decrements, only one would immediately try to use
> >>> elision (because only one decrements from 1 (ignoring lost updates)).
> >>
> >>> However, if you decrement in the critical section, and lock acquisitions
> >>> wait until the lock is free *before* loading adapt_count and choosing
> >>> whether to use elision or not, then it shouldn't matter whether you
> >>> decrement closer to the lock acquisition or closer to the release.
> >> Waiting for a free lock is done by futex-syscall within LLL_LOCK (see
> >> __lll_lock_wait/__lll_lock_wait_private).
> >> On wakeup the lock is immediately acquired if it is free.
> >> Afterwards there is no loading of adapt_count and no decision whether to
> >> use elision or not.
> >> Following your suggestion means, that we need a further implementation
> >> like in __lll_lock_wait/__lll_lock_wait_private in order to wait for a
> >> free lock and then load adapt_count and choose to elide or not!
> >
> > Well, what would be required is that after we block using futexes, we
> > reassess the situation (including potentially using elision), instead of
> > just proceeding to try to acquire the lock without elision.
> >
> If the lock is released and we have been waking up.
> Do we know if other threads are also waiting for the futex?
> The strategy of __lll_lock_wait is to keep futex==2 and wakup possible 
> waiters with __lll_unlock.

We don't know if there are other threads waiting iff futex==2.  This is
because we may have shared that "state" with other threads.
However, we can force that there are no other threads waiting by waking
up all of them.  Some more coordination may be required to make all of
them aware of that fact.
This convoying in the futex==2 state is also bad irrespective of lock
elision, I believe; for example, if we have one long critical section
that forces all other short critical sections into the futex==2 state,
these will hand off ownership of the lock through the slow path, instead
of just waking them all at once so that they can use spinning to acquire
the lock with less overhead (because they are short).

Thus, revisiting this in the upcoming development cycle, and in the
generic mutex code, would be good.

> >>
> >> Please have a look at the following example assuming:
> >> adapt_count = 1;
> >> There are two scenarios below: Imagine that only "Thread 3a" or "Thread
> >> 3b" is used.
> >>
> >> Decrement adapt_count while locking (without this patch):
> >> -Thread 1 __lll_lock_elision:
> >> decrements adapt_count to 0 and acquires the lock via LLL_LOCK.
> >> -Thread 2 __lll_lock_elision:
> >> starts a transaction and ends / aborts it immediately as lock is
> >> acquired. adapt_count is set to 3. LLL_LOCK waits until lock is released
> >> by Thread 1.
> >> -Thread 1 __lll_unlock_elision:
> >> releases lock.
> >> -Thread 2 __lll_lock_elision:
> >> wakes up and aquires lock via waiting LLL_LOCK.
> >> -Thread 3a __lll_lock_elision:
> >> decrements adapt_count to 2 and waits via LLL_LOCK until lock is
> >> released by Thread 2.
> >> -Thread 2 __lll_unlock_elision:
> >> releases lock.
> >> -Thread 3b __lll_lock_elision:
> >> decrements adapt_count to 2 and acquires lock via LLL_LOCK.
> >>
> >> Decrement adapt_count while unlocking (with this patch):
> >> -Thread 1 __lll_lock_elision:
> >> acquires the lock via LLL_LOCK. adapt_count remains 1.
> >> -Thread 2 __lll_lock_elision:
> >> LLL_LOCK is used as futex is acquired by Thread 1 or adapt_count > 0. It
> >> waits until lock is released by Thread 1.
> >> -Thread 1 __lll_unlock_elision:
> >> decrements adapt_count to 0 and releases the lock.
> >> -Thread 2 __lll_lock_elision:
> >> wakes up and acquires lock via waiting LLL_LOCK.
> >> -Thread 3a __lll_lock_elision:
> >> LLL_LOCK is used as futex is acquired by Thread 2.
> >> -Thread 2 __lll_unlock_elision:
> >> releases lock. adapt_count remains 0.
> >> -Thread 3b __lll_lock_elision:
> >> starts a transaction.
> >
> > I agree that if we do NOT wait for the lock to become free before
> > checking whether we can use elision and do NOT try to use elision after
> > we blocked using futexes, then decrementing closer to the release of a
> > mutex can decrease the window in which another thread can observe
> > adapt_count==0 but lock!=0.
> >
> > This may have required more changes, but it would have been cleaner
> > overall, and I guess we would have ended up with less s390-specific
> > code.
> >
> > However, it's too late for this now, given that we're past the freeze.
> > We should get back to this when master opens up for normal development.
> >
> >> If futex is not tested before starting a transaction,
> >> the additional load and branch is not needed, Thread 3a will start and
> >> abort a transaction, set adapt_count to 3 and will wait via LLL_LOCK.
> >> In case Thread 3b, a transaction will be started.
> >>
> >> The attached diff removes the futex==0 test.
> >> Later I will make one patch with changelog for this and the other two
> >> patches.
> >
> > See above, I didn't request to remove it, but just to clearly document
> > why you included it.  If you think removing it is just as fine, that's
> > OK with me too.
> >
> Removing it is fine.
> >>>
> >>> I think this needs a more thorough analysis (including better
> >>> documentation) and/or a microbenchmark.
> >
> > Regarding the patch:
> >
> > +	     Since we are in a non-nested transaction there is no need to abort,
> > +	     which is expensive.  Simply end the started transaction.  */
> >
> > Is that actually true?  You don't seem to check whether you are indeed
> > not in a nested transaction.  The difference is that you do not need to
> > acquire the lock because this is trylock, so you can run a little
> > further as a transaction.  You will get aborted as soon the lock you
> > failed to lock is released though.
> > So, it's not quite clear to me what the expected win is; are you aiming
> > at programs that don't actually need to acquire the lock and whose
> > transactions finish quickly enough to not get aborted by a concurrent
> > release of the lock?
> >
> Every time we enter __lll_trylock_elision, we check if we are in a 
> transaction on this CPU and abort it if needed. This is needed to detect 
> multiple elided trylock calls of one thread.

You are right, sorry.  I had forgotten about that point.

> > The rest of the diff looked OK.
> >
> If it's okay, I will make one patch with all the reviewed diffs and post 
> it with a ChangeLog in this mail-thread.
> Then I would commit it before release to have a consistent state?

Yes, please commit this.
diff mbox

Patch

diff --git a/sysdeps/unix/sysv/linux/s390/elision-lock.c b/sysdeps/unix/sysv/linux/s390/elision-lock.c
index faa998e..0081537 100644
--- a/sysdeps/unix/sysv/linux/s390/elision-lock.c
+++ b/sysdeps/unix/sysv/linux/s390/elision-lock.c
@@ -50,30 +50,28 @@  __lll_lock_elision (int *futex, short *adapt_count, EXTRAARG int private)
      critical section uses lock elision) and outside of transactions.  Thus,
      we need to use atomic accesses to avoid data races.  However, the
      value of adapt_count is just a hint, so relaxed MO accesses are
-     sufficient.
-     Do not begin a transaction if another cpu has locked the
-     futex with normal locking.  If adapt_count is zero, it remains and the
-     next pthread_mutex_lock call will try to start a transaction again.  */
-    if (atomic_load_relaxed (futex) == 0
-	&& atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
+     sufficient.  */
+    if (atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
     {
       /* Start a transaction and retry it automatically if it aborts with
 	 _HTM_TBEGIN_TRANSIENT.  This macro calls tbegin at most retry_cnt
 	 + 1 times.  The second argument is considered as retry_cnt.  */
       int status = __libc_tbegin_retry ((void *) 0, aconf.try_tbegin - 1);
-      if (__builtin_expect (status == _HTM_TBEGIN_STARTED,
-			    _HTM_TBEGIN_STARTED))
+      if (__glibc_likely (status == _HTM_TBEGIN_STARTED))
 	{
 	  /* Check the futex to make sure nobody has touched it in the
 	     mean time.  This forces the futex into the cache and makes
-	     sure the transaction aborts if some other cpu uses the
-	     lock (writes the futex).  */
-	  if (__builtin_expect (atomic_load_relaxed (futex) == 0, 1))
+	     sure the transaction aborts if another thread acquires the lock
+	     concurrently.  */
+	  if (__glibc_likely (atomic_load_relaxed (futex) == 0))
 	    /* Lock was free.  Return to user code in a transaction.  */
 	    return 0;
 
-	  /* Lock was busy.  Fall back to normal locking.  */
-	  if (__builtin_expect (__libc_tx_nesting_depth () <= 1, 1))
+	  /* Lock was busy.  Fall back to normal locking.
+	     This can be the case if e.g. adapt_count was decremented to zero
+	     by a former release and another thread has been waken up and
+	     acquired it.  */
+	  if (__glibc_likely (__libc_tx_nesting_depth () <= 1))
 	    {
 	      /* In a non-nested transaction there is no need to abort,
 		 which is expensive.  Simply end the started transaction.  */
diff --git a/sysdeps/unix/sysv/linux/s390/elision-trylock.c b/sysdeps/unix/sysv/linux/s390/elision-trylock.c
index eec172a..aa09073 100644
--- a/sysdeps/unix/sysv/linux/s390/elision-trylock.c
+++ b/sysdeps/unix/sysv/linux/s390/elision-trylock.c
@@ -51,31 +51,29 @@  __lll_trylock_elision (int *futex, short *adapt_count)
      critical section uses lock elision) and outside of transactions.  Thus,
      we need to use atomic accesses to avoid data races.  However, the
      value of adapt_count is just a hint, so relaxed MO accesses are
-     sufficient.
-     Do not begin a transaction if another cpu has locked the
-     futex with normal locking.  If adapt_count is zero, it remains and the
-     next pthread_mutex_lock call will try to start a transaction again.  */
-    if (atomic_load_relaxed (futex) == 0
-	&& atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
+     sufficient.  */
+    if (atomic_load_relaxed (adapt_count) <= 0 && aconf.try_tbegin > 0)
     {
       int status = __libc_tbegin ((void *) 0);
-      if (__builtin_expect (status  == _HTM_TBEGIN_STARTED,
-			    _HTM_TBEGIN_STARTED))
+      if (__glibc_likely (status  == _HTM_TBEGIN_STARTED))
 	{
 	  /* Check the futex to make sure nobody has touched it in the
 	     mean time.  This forces the futex into the cache and makes
-	     sure the transaction aborts if some other cpu uses the
-	     lock (writes the futex).  */
-	  if (__builtin_expect (atomic_load_relaxed (futex) == 0, 1))
+	     sure the transaction aborts if another thread acquires the lock
+	     concurrently.  */
+	  if (__glibc_likely (atomic_load_relaxed (futex) == 0))
 	    /* Lock was free.  Return to user code in a transaction.  */
 	    return 0;
 
-	  /* Lock was busy.  Fall back to normal locking.  Since we are in
-	     a non-nested transaction there is no need to abort, which is
-	     expensive.  Simply end the started transaction.  */
+	  /* Lock was busy.  Fall back to normal locking.
+	     This can be the case if e.g. adapt_count was decremented to zero
+	     by a former release and another thread has been waken up and
+	     acquired it.
+	     Since we are in a non-nested transaction there is no need to abort,
+	     which is expensive.  Simply end the started transaction.  */
 	  __libc_tend ();
 	  /* Note: Changing the adapt_count here might abort a transaction on a
-	     different cpu, but that could happen anyway when the futex is
+	     different CPU, but that could happen anyway when the futex is
 	     acquired, so there's no need to check the nesting depth here.
 	     See above for why relaxed MO is sufficient.  */
 	  if (aconf.skip_lock_busy > 0)
diff --git a/sysdeps/unix/sysv/linux/s390/elision-unlock.c b/sysdeps/unix/sysv/linux/s390/elision-unlock.c
index d9205fa..c062d71 100644
--- a/sysdeps/unix/sysv/linux/s390/elision-unlock.c
+++ b/sysdeps/unix/sysv/linux/s390/elision-unlock.c
@@ -26,8 +26,12 @@  __lll_unlock_elision(int *futex, short *adapt_count, int private)
   /* If the lock is free, we elided the lock earlier.  This does not
      necessarily mean that we are in a transaction, because the user code may
      have closed the transaction, but that is impossible to detect reliably.
-     Relaxed MO access to futex is sufficient as we only need a hint, if we
-     started a transaction or acquired the futex in e.g. elision-lock.c.  */
+     Relaxed MO access to futex is sufficient because a correct program
+     will only release a lock it has acquired; therefore, it must either
+     changed the futex word's value to something !=0 or it must have used
+     elision; these are actions by the same thread, so these actions are
+     sequenced-before the relaxed load (and thus also happens-before the
+     relaxed load).  Therefore, relaxed MO is sufficient.  */
   if (atomic_load_relaxed (futex) == 0)
     {
       __libc_tend ();
@@ -36,17 +40,17 @@  __lll_unlock_elision(int *futex, short *adapt_count, int private)
     {
       /* Update the adapt_count while unlocking before completing the critical
 	 section.  adapt_count is accessed concurrently outside of a
-	 transaction or an aquired lock e.g. in elision-lock.c so we need to use
-	 atomic accesses.  However, the value of adapt_count is just a hint, so
-	 relaxed MO accesses are sufficient.
+	 transaction or a critical section (e.g. in elision-lock.c). So we need
+	 to use atomic accesses.  However, the value of adapt_count is just a
+	 hint, so relaxed MO accesses are sufficient.
 	 If adapt_count would be decremented while locking, multiple
-	 CPUs trying to lock the locked mutex will decrement adapt_count to
+	 CPUs, trying to lock the acquired mutex, will decrement adapt_count to
 	 zero and another CPU will try to start a transaction, which will be
 	 immediately aborted as the mutex is locked.
-	 If adapt_count would be decremented while unlocking after completing
-	 the critical section, possible waiters will be waked up before
-	 decrementing the adapt_count.  Those waked up waiters could have
-	 destroyed and freed this mutex!  */
+	 The update of adapt_count is done before releasing the lock as POSIX'
+	 mutex destruction requirements disallow accesses to the mutex after it
+	 has been released and thus could have been acquired or destroyed by
+	 another thread.  */
       short adapt_count_val = atomic_load_relaxed (adapt_count);
       if (adapt_count_val > 0)
 	atomic_store_relaxed (adapt_count, adapt_count_val - 1);