[PATCHv3] PowerPC: Fix a race condition when eliding a lock

Message ID 1440439895-11812-1-git-send-email-tuliom@linux.vnet.ibm.com
State New

Commit Message

Tulio Magno Quites Machado Filho Aug. 24, 2015, 6:11 p.m. UTC
Changes since v2:
 - Moved part of the source code comments to the commit message.
 - Added O'Donnell's suggestion to the source code comments.

Changes since v1:
 - Improved commit message.
 - Added new comments in the source code alerting to the concurrency details
   intrinsic to this code.
 - Removed compiler barriers from this patch, which will be treated in another
   patch and will be synchronized with the GCC implementation.

8<----------

The previous code evaluated the preprocessor token is_lock_free into a
variable before starting a transaction.  This can cause an error if
another thread acquires the lock (without using a transaction) between
the evaluation of the token and the beginning of the transaction.

This bug can be triggered with the following order of events:
1. The lock accessed by is_lock_free is free.
2. Thread T1 evaluates is_lock_free and stores into register R1 that the
   lock is free.
3. Thread T2 acquires the same lock used in is_lock_free.
4. T1 begins the transaction, creating a memory barrier where is_lock_free
   is false, but R1 is true.
5. T1 reads R1 and doesn't abort the transaction.
6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides
   to unlock a lock acquired by T2, leading to undefined behavior.

This patch delays the evaluation of is_lock_free to inside a transaction
by moving this part of the code to the macro ELIDE_LOCK.
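
For illustration only (this sketch is not part of the patch and uses
simplified code), the change amounts to moving the evaluation from
before the transaction to inside it:

    /* Before (simplified): is_lock_free is evaluated into a parameter
       before tbegin, so the value can be stale by the time the
       transaction starts.  */
    int lock_free = is_lock_free;      /* T2 may acquire the lock here.  */
    if (__builtin_tbegin (0))
      {
        if (lock_free)                 /* Stale copy, not the lock itself.  */
          return true;
        __builtin_tabort (_ABORT_LOCK_BUSY);
      }

    /* After (simplified): the read joins the transaction's read set, so a
       conflicting lock acquisition aborts the transaction.  */
    if (__builtin_tbegin (0))
      {
        if (is_lock_free)
          ret = 1;
        else
          __builtin_tabort (_ABORT_LOCK_BUSY);
      }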

2015-08-24  Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>

	[BZ #18743]
	* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
	code to...
	(ELIDE_LOCK): ...here.
	(__get_new_count): New function with part of the code from
	__elide_lock that updates the value of adapt_count after a
	transaction abort.
	(__elided_trylock): Moved this code to...
	(ELIDE_TRYLOCK): ...here.
---
 NEWS                         |   3 +-
 sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++--------------------
 2 files changed, 63 insertions(+), 53 deletions(-)

Comments

Carlos O'Donell Aug. 24, 2015, 6:58 p.m. UTC | #1
On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
> Changes since v2:
>  - Moved part of the source code comments to the commit message.
>  - Added O'Donnell's suggestion to the source code comments.
> 
> Changes since v1:
>  - Improved commit message.
>  - Added new comments in the source code alerting to the concurrency details
>    intrinsic to this code.
>  - Removed compiler barriers from this patch, which will be treated in another
>    patch and will be synchronized with the GCC implementation.

> +	  if (__builtin_tbegin (0))					\
> +	    {								\
> +	      if (is_lock_free)						\
> +		{							\
> +		  ret = 1;						\
> +		  break;						\
> +		}							\
> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	    }								\
> +	  else								\
> +	    if (!__get_new_count(&adapt_count))				\
> +	      break;							\

Torvald still suggests an atomic_load_relaxed here:
https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html

Is there any technical objection to that request?

It does highlight, as he notes, the transactional and non-transactional
accesses for that evaluated value.

Cheers,
Carlos.
Tulio Magno Quites Machado Filho Aug. 24, 2015, 7:19 p.m. UTC | #2
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>> Changes since v2:
>>  - Moved part of the source code comments to the commit message.
>>  - Added O'Donnell's suggestion to the source code comments.
>> 
>> Changes since v1:
>>  - Improved commit message.
>>  - Added new comments in the source code alerting to the concurrency details
>>    intrinsic to this code.
>>  - Removed compiler barriers from this patch, which will be treated in another
>>    patch and will be synchronized with the GCC implementation.
>
>> +	  if (__builtin_tbegin (0))					\
>> +	    {								\
>> +	      if (is_lock_free)						\
>> +		{							\
>> +		  ret = 1;						\
>> +		  break;						\
>> +		}							\
>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	    }								\
>> +	  else								\
>> +	    if (!__get_new_count(&adapt_count))				\
>> +	      break;							\
>
> Torvald still suggests an atomic_load_relaxed here:
> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
>
> Is there any technical objection to that request?
>
> It does highlight, as he notes, the transactional and non-transactional
> accesses for that evaluated value.

I'm not objecting, but I don't see any value in adding this as I explained at
the end of this message [1].

AFAIU, there are 2 reasons for doing this:
 1. Make the code easier to understand.
 2. Help moving to proper atomic types in the future.

IMHO, reason #1 is very personal and I could change my opinion if I had seen an
example where this kind of change helped to make the code easier to
understand.

Reason #2 is a valid point, but is unrelated to this patch, i.e. I wouldn't
backport the atomic access to glibc 2.21 and 2.22 if the only reason for it
is #2.  So, it would be better as a separate patch.

[1] https://sourceware.org/ml/libc-alpha/2015-08/msg00885.html
Adhemerval Zanella Aug. 24, 2015, 7:28 p.m. UTC | #3
On 24-08-2015 15:58, Carlos O'Donell wrote:
> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>> Changes since v2:
>>  - Moved part of the source code comments to the commit message.
>>  - Added O'Donnell's suggestion to the source code comments.
>>
>> Changes since v1:
>>  - Improved commit message.
>>  - Added new comments in the source code alerting to the concurrency details
>>    intrinsic to this code.
>>  - Removed compiler barriers from this patch, which will be treated in another
>>    patch and will be synchronized with the GCC implementation.
> 
>> +	  if (__builtin_tbegin (0))					\
>> +	    {								\
>> +	      if (is_lock_free)						\
>> +		{							\
>> +		  ret = 1;						\
>> +		  break;						\
>> +		}							\
>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	    }								\
>> +	  else								\
>> +	    if (!__get_new_count(&adapt_count))				\
>> +	      break;							\
> 
> Torvald still suggests an atomic_load_relaxed here:
> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
> 
> Is there any technical objection to that request?
> 
> It does highlight, as he notes, the transactional and non-transactional
> accesses for that evaluated value.

No 'technical' objections, only that the transaction idea is exactly to
avoid the use of atomic accesses altogether.  For relaxed accesses it
won't change anything, but if the idea is to highlight every access
within a transaction with atomics, there might be cases where the
algorithm requires another semantic (like an acquire or release) and we
would have to drop the atomic usage to avoid generating subpar code.


> 
> Cheers,
> Carlos.
>
Carlos O'Donell Aug. 24, 2015, 7:42 p.m. UTC | #4
On 08/24/2015 03:28 PM, Adhemerval Zanella wrote:
> 
> 
> On 24-08-2015 15:58, Carlos O'Donell wrote:
>> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>>> Changes since v2:
>>>  - Moved part of the source code comments to the commit message.
>>>  - Added O'Donnell's suggestion to the source code comments.
>>>
>>> Changes since v1:
>>>  - Improved commit message.
>>>  - Added new comments in the source code alerting to the concurrency details
>>>    intrinsic to this code.
>>>  - Removed compiler barriers from this patch, which will be treated in another
>>>    patch and will be synchronized with the GCC implementation.
>>
>>> +	  if (__builtin_tbegin (0))					\
>>> +	    {								\
>>> +	      if (is_lock_free)						\
>>> +		{							\
>>> +		  ret = 1;						\
>>> +		  break;						\
>>> +		}							\
>>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>>> +	    }								\
>>> +	  else								\
>>> +	    if (!__get_new_count(&adapt_count))				\
>>> +	      break;							\
>>
>> Torvald still suggests an atomic_load_relaxed here:
>> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
>>
>> Is there any technical objection to that request?
>>
>> It does highlight, as he notes, the transactional and non-transactional
>> accesses for that evaluated value.
> 
> No 'technical' objections, only that the transaction idea is exactly to
> avoid the use of atomic accesses altogether.  For relaxed accesses it
> won't change anything, but if the idea is to highlight every access
> within transaction with atomics, it might be case where the algorithm
> will require another semantic (like a acquire or release) and it will
> require to drop the atomic usage to avoid generate subpar code.

I'll respond to Torvald and ask him to explain in more detail.

Cheers,
Carlos.
Torvald Riegel Aug. 25, 2015, 11:34 a.m. UTC | #5
On Mon, 2015-08-24 at 16:28 -0300, Adhemerval Zanella wrote:
> 
> On 24-08-2015 15:58, Carlos O'Donell wrote:
> > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
> >> Changes since v2:
> >>  - Moved part of the source code comments to the commit message.
> >>  - Added O'Donnell's suggestion to the source code comments.
> >>
> >> Changes since v1:
> >>  - Improved commit message.
> >>  - Added new comments in the source code alerting to the concurrency details
> >>    intrinsic to this code.
> >>  - Removed compiler barriers from this patch, which will be treated in another
> >>    patch and will be synchronized with the GCC implementation.
> > 
> >> +	  if (__builtin_tbegin (0))					\
> >> +	    {								\
> >> +	      if (is_lock_free)						\
> >> +		{							\
> >> +		  ret = 1;						\
> >> +		  break;						\
> >> +		}							\
> >> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> >> +	    }								\
> >> +	  else								\
> >> +	    if (!__get_new_count(&adapt_count))				\
> >> +	      break;							\
> > 
> > Torvald still suggests an atomic_load_relaxed here:
> > https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
> > 
> > Is there any technical objection to that request?
> > 
> > It does highlight, as he notes, the transactional and non-transactional
> > accesses for that evaluated value.
> 
> No 'technical' objections, only that the transaction idea is exactly to
> avoid the use of atomic accesses altogether.  For relaxed accesses it
> won't change anything,

Yes.

> but if the idea is to highlight every access
> within transaction with atomics,

No.  The rule I'd like to follow is that if we need atomic accesses to a
memory object somewhere in the code base, that we then consistently use
atomic accesses for all accesses to this object (with the exception of
initialization code, where things like memset can be simpler).

Thus, if you use just transactions to prevent data races for an object,
no atomics are required.  This holds for the vast majority of objects
accessed in transactions (except lock implementations and such).

IMO, not using atomics for accessing is_lock_free in the txn creates an
exception in our rules, and given that there's no problem to use an
MO-relaxed access there, it's just easier to avoid creating an
exception.  And it makes it possible to use an atomic type for it in the
future.
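
As a sketch of what that would look like (the lock field name here is
hypothetical; atomic_load_relaxed is glibc's relaxed-MO load macro), the
check inside the transaction would become something like:

    if (__builtin_tbegin (0))
      {
        /* Relaxed MO is sufficient; the transaction provides the
           ordering.  The field name is illustrative only.  */
        if (atomic_load_relaxed (&lock->__data.__lock) == 0)
          ret = 1;
        else
          __builtin_tabort (_ABORT_LOCK_BUSY);
      }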

> it might be case where the algorithm
> will require another semantic (like a acquire or release) and it will
> require to drop the atomic usage to avoid generate subpar code.

I don't quite understand what you mean, sorry.
Torvald Riegel Aug. 25, 2015, 11:42 a.m. UTC | #6
On Mon, 2015-08-24 at 16:19 -0300, Tulio Magno Quites Machado Filho
wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
> > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
> >> Changes since v2:
> >>  - Moved part of the source code comments to the commit message.
> >>  - Added O'Donnell's suggestion to the source code comments.
> >> 
> >> Changes since v1:
> >>  - Improved commit message.
> >>  - Added new comments in the source code alerting to the concurrency details
> >>    intrinsic to this code.
> >>  - Removed compiler barriers from this patch, which will be treated in another
> >>    patch and will be synchronized with the GCC implementation.
> >
> >> +	  if (__builtin_tbegin (0))					\
> >> +	    {								\
> >> +	      if (is_lock_free)						\
> >> +		{							\
> >> +		  ret = 1;						\
> >> +		  break;						\
> >> +		}							\
> >> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> >> +	    }								\
> >> +	  else								\
> >> +	    if (!__get_new_count(&adapt_count))				\
> >> +	      break;							\
> >
> > Torvald still suggests an atomic_load_relaxed here:
> > https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
> >
> > Is there any technical objection to that request?
> >
> > It does highlight, as he notes, the transactional and non-transactional
> > accesses for that evaluated value.
> 
> I'm not objecting, but I don't see any value in adding this as I explained at
> the end of this message [1].
> 
> AFAIU, there are 2 reasons for doing this:
>  1. Make the code easier to understand.
>  2. Help moving to proper atomic types in the future.
> 
> IMHO, reason #1 is very personal and I could change my opinion if I had seen an
> example where this kind of changes helped to make the code easier to
> understand.

The synchronization around is_lock_free is mixed txnal/nontxnal
synchronization.  Given that, it's not as simple as purely txnal/txnal
synchronization.
As I mentioned in my reply to Carlos, we do have the general rule that
we don't use atomic types right now but assume that every object that
needs to be atomically accessed at some place is atomically accessed
everywhere (ie, what a C11/C++11 atomic type would provide).  Making an
exception for mixed txnal/nontxnal synchronization is creating more
complexity, just because there is another exception.

> Reason #2 is a valid point, but is unrelated to this patch, i.e. I wouldn't
> backport the atomic access to glibc 2.21 and 2.22 if the only reason for it
> is #2.  So, it would be better as a separate patch.

I wouldn't object if you split this into two parts, and only backport
those patches that are necessary for correctness.
Torvald Riegel Aug. 25, 2015, 1:02 p.m. UTC | #7
On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
wrote:
> Changes since v2:
>  - Moved part of the source code comments to the commit message.
>  - Added O'Donnell's suggestion to the source code comments.
> 
> Changes since v1:
>  - Improved commit message.
>  - Added new comments in the source code alerting to the concurrency details
>    intrinsic to this code.
>  - Removed compiler barriers from this patch, which will be treated in another
>    patch and will be synchronized with the GCC implementation.
> 
> 8<----------
> 
> The previous code used to evaluate the preprocessor token is_lock_free to
> a variable before starting a transaction.  This behavior can cause an
> error if another thread got the lock (without using a transaction)
> between the evaluation of the token and the beginning of the transaction.

I find the use of the preprocessor to capture an evaluation really ugly.
And it is error-prone, as the bug fixed by this patch shows.  I would
appreciate a follow-up patch that replaces is_lock_free with either a
function that is called (relying on the compiler to inline it if the
function pointer is constant, which it should be), or an address to a
value that is compared to some caller-provided constant (which would
limit flexibility for the condition somewhat, but is cleaner).
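
To make the suggestion concrete, a hypothetical sketch of the second
variant (not something posted in this thread) could look roughly like:

    /* Hypothetical interface: pass the address of the lock word and the
       value that means "free" instead of capturing an expression with
       the preprocessor.  */
    static inline bool
    __elide_lock (uint8_t *adapt_count, int *lock_word, int free_value)
    {
      if (*adapt_count > 0)
        {
          (*adapt_count)--;
          return false;
        }
      for (int i = __elision_aconf.try_tbegin; i > 0; i--)
        {
          if (__builtin_tbegin (0))
            {
              /* Evaluated inside the transaction, as in the patch.  */
              if (*lock_word == free_value)
                return true;
              __builtin_tabort (_ABORT_LOCK_BUSY);
            }
          else if (!__get_new_count (adapt_count))
            break;
        }
      return false;
    }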

> This bug can be triggered with the following order of events:
> 1. The lock accessed by is_lock_free is free.
> 2. Thread T1 evaluates is_lock_free and stores into register R1 that the
>    lock is free.
> 3. Thread T2 acquires the same lock used in is_lock_free.
> 4. T1 begins the transaction, creating a memory barrier where is_lock_free
>    is false, but R1 is true.
> 5. T1 reads R1 and doesn't abort the transaction.
> 6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides
>    to unlock a lock acquired by T2, leading to undefined behavior.
> 
> This patch delays the evaluation of is_lock_free to inside a transaction
> by moving this part of the code to the macro ELIDE_LOCK.
> 
> 2015-08-24  Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>
> 
> 	[BZ #18743]
> 	* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
> 	code to...
> 	(ELIDE_LOCK): ...here.
> 	(__get_new_count): New function with part of the code from
> 	__elide_lock that updates the value of adapt_count after a
> 	transaction abort.
> 	(__elided_trylock): Moved this code to...
> 	(ELIDE_TRYLOCK): ...here.
> ---
>  NEWS                         |   3 +-
>  sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++--------------------
>  2 files changed, 63 insertions(+), 53 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 6b36c08..76df3fd 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -11,7 +11,8 @@ Version 2.23
>  
>    14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086,
>    18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681,
> -  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824.
> +  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824,
> +  18743.
>  
>  * The obsolete header <regexp.h> has been removed.  Programs that require
>    this header must be updated to use <regex.h> instead.
> diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h
> index 389f5a5..1cc9890 100644
> --- a/sysdeps/powerpc/nptl/elide.h
> +++ b/sysdeps/powerpc/nptl/elide.h
> @@ -23,67 +23,76 @@
>  # include <htm.h>
>  # include <elision-conf.h>
>  
> -/* Returns true if the lock defined by is_lock_free as elided.
> -   ADAPT_COUNT is a pointer to per-lock state variable. */
> -
> +/* Get the new value of adapt_count according to the elision
> +   configurations.  Returns true if the system should retry again or false
> +   otherwise.  */
>  static inline bool
> -__elide_lock (uint8_t *adapt_count, int is_lock_free)
> +__get_new_count (uint8_t *adapt_count)
>  {
> -  if (*adapt_count > 0)
> +  /* A persistent failure indicates that a retry will probably
> +     result in another failure.  Use normal locking now and
> +     for the next couple of calls.  */
> +  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
>      {
> -      (*adapt_count)--;
> +      if (__elision_aconf.skip_lock_internal_abort > 0)
> +	*adapt_count = __elision_aconf.skip_lock_internal_abort;
>        return false;
>      }
> -
> -  for (int i = __elision_aconf.try_tbegin; i > 0; i--)
> -    {
> -      if (__builtin_tbegin (0))
> -	{
> -	  if (is_lock_free)
> -	    return true;
> -	  /* Lock was busy.  */
> -	  __builtin_tabort (_ABORT_LOCK_BUSY);
> -	}
> -      else
> -	{
> -	  /* A persistent failure indicates that a retry will probably
> -	     result in another failure.  Use normal locking now and
> -	     for the next couple of calls.  */
> -	  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
> -	    {
> -	      if (__elision_aconf.skip_lock_internal_abort > 0)
> -		*adapt_count = __elision_aconf.skip_lock_internal_abort;
> -	      break;
> -	    }
> -	  /* Same logic as above, but for a number of temporary failures in a
> -	     a row.  */
> -	  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
> -		   && __elision_aconf.try_tbegin > 0)
> -	    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
> -	}
> -     }
> -
> -  return false;
> +  /* Same logic as above, but for a number of temporary failures in a
> +     a row.  */
> +  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
> +	   && __elision_aconf.try_tbegin > 0)
> +    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
> +  return true;
>  }
>  
> -# define ELIDE_LOCK(adapt_count, is_lock_free) \
> -  __elide_lock (&(adapt_count), is_lock_free)
> -
> -
> -static inline bool
> -__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write)
> -{
> -  if (__elision_aconf.try_tbegin > 0)
> -    {
> -      if (write)
> -	__builtin_tabort (_ABORT_NESTED_TRYLOCK);
> -      return __elide_lock (adapt_count, is_lock_free);
> -    }
> -  return false;
> -}
> +/* CONCURRENCY NOTES:
> +
> +   The value of is_lock_free is read and possibly written to by multiple

As you write above, is_lock_free is not a value, it is an expression.
The expression will read from a memory location, that is then modified
by other threads.  Please clarify that so this is clear to readers (this
misunderstanding seems to be related to the bug you're fixing...).

> +   threads, either during lock acquisition, or elision.  In order to avoid
> +   this evaluation from becoming a data race the access of is_lock_free

It could be a data race because you're not using atomics there, but
that's not the whole point.  (We use the term "data race" specifically
to refer to the C11/C++11 memory model concept of the same name.)
You want to ensure the lock elision synchronization scheme, and thus are
moving it inside the txn.

> +   is placed *inside* the transaction.  Within the transaction is_lock_free
> +   can be accessed with relaxed ordering.  We are assured by the transaction

Please use "memory order" to be specific, or abbreviate with MO as
mentioned on the Concurrency wiki page.

> +   that the value of is_lock_free is consistent with the state of the lock,
> +   otherwise the hardware will abort the transaction.  */

That's not really why this lock elision synchronization scheme
works :)  (At least you're not giving sufficient reason why it is.)

Txn/txn synchronization is relatively simple, basically txns that access
the same memory locations are atomic and ordered.
Txnal/nontxnal synchronization is more complex.  I don't remember how
IBM has specified txn semantics in detail, but the way it's basically
supposed to work is:
* If the txn reads-from data written in a critical section, then it will
also observe the lock acquisition of this critical section or a later
write to the same lock memory locations by this critical section (e.g.,
a lock release).
* If the txn reads-from a lock release, it will observe all prior writes
in the respective critical section.
* If a txn reads-from a memory location, it picks a write to read-from.
It won't change this choice if it reads-from the same location again.
If it can't read-from the same write, it aborts.
(Here, "observe" means that it reads-from those locations, then the
respective writes are visible side effects.)
The result of this (rough) set of rules is that in all executions in
which it doesn't abort, it will not read intermediate state produced by
other critical sections, but is ordered wrt to them and their writes to
application data.

It's hard to describe that in our current terminology because txns
aren't covered in the C11 memory model.

Also note that this implementation would not be correct:
  if (tbegin(0))
    {
       <application code>
       if (!is_lock_free) tabort();
       tend();
    }
This is because the application code could call tend on some path, for
example when reading an intermediate state of a concurrent non-elided
critical section.
I'm mentioning this because this shows that is_lock_free must be
"evaluated" early, and that the compiler must not reorder it to the end
of the critical section.  We could use acquire MO on the load in
is_lock_free for that, but then we get the unnecessary HW acquire fence.
The reason why it should work without that is that the compiler
shouldn't execute side effects that the abstract machine wouldn't do if
is_lock_free evaluated false, and application code that may call tend()
is such a side effect.  Thus, you can't in practice avoid the
is_lock_free check, and once you do it the rules above take care of
correctness.

Intel TSX and IBM's HTMs (IIRC) don't let any side effects, such as page
faults, escape from an aborted txn.  In contrast, AMD's ASF proposal did
let page faults escape, in which case the acquire MO would be required.

I hope this clarifies why I'm saying that mixed txnal/nontxnal
synchronization isn't all that simple.

> +/* Returns 0 if the lock defined by is_lock_free was elided.
> +   ADAPT_COUNT is a per-lock state variable.  */
> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
> +  ({									\
> +    int ret = 0;							\
> +    if (adapt_count > 0)						\
> +      (adapt_count)--;							\
> +    else								\
> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> +	{								\
> +	  if (__builtin_tbegin (0))					\
> +	    {								\
> +	      if (is_lock_free)						\

This evaluation should use relaxed MO atomic accesses, IMO.  But that
can be changed by a follow-up patch.

> +		{							\
> +		  ret = 1;						\
> +		  break;						\
> +		}							\
> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	    }								\
> +	  else								\
> +	    if (!__get_new_count(&adapt_count))				\
> +	      break;							\
> +	}								\
> +    ret;								\
> +  })
>  
>  # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
> -  __elide_trylock (&(adapt_count), is_lock_free, write)
> +  ({								\
> +    int ret = 0;						\
> +    if (__elision_aconf.try_tbegin > 0)				\
> +      {								\

I'd prefer a comment here why you abort if write is true.  Or a
reference to another source file / function where this is explained.
Can be in a follow-up patch.

> +	if (write)						\
> +	  __builtin_tabort (_ABORT_NESTED_TRYLOCK);		\
> +	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
> +      }								\
> +    ret;							\
> +  })
>  
> 
>  static inline bool
Adhemerval Zanella Aug. 25, 2015, 2:41 p.m. UTC | #8
On 25-08-2015 08:34, Torvald Riegel wrote:
> 
> No.  The rule I'd like to follow is that if we need atomic accesses to a
> memory object somewhere in the code base, that we then consistently use
> atomic accesses for all accesses to this object (with the exception of
> initialization code, where things like memset can be simpler).
> 
> Thus, if you use just transactions to prevent data races for an object,
> no atomics are required.  This holds for the vast majority of objects
> accessed in transactions (except lock implementations and such).
> 
> IMO, not using atomics for accessing is_lock_free in the txn creates an
> exception in our rules, and given that there's no problem to use an
> MO-relaxed access there, it's just easier to avoid creating an
> exception.  And it makes it possible to use an atomic type for it in the
> future.

The problem I see is that all hardware transactional memory
implementations either do not explicitly guarantee forward progress for
transactional code or require fallback code for performance reasons, and
then non-transactional code would be required.


> 
>> it might be case where the algorithm
>> will require another semantic (like a acquire or release) and it will
>> require to drop the atomic usage to avoid generate subpar code.
> 
> I don't quite understand what you mean, sorry.
> 

And what I am trying to avoid is to require some atomic access in a
non-MO-relaxed way in a transaction just to follow the atomic access rule.

Anyway, I do see the value of making atomic accesses using the atomic API
inside transactions, and we can handle future cases on a case-by-case basis.
Torvald Riegel Aug. 25, 2015, 4:32 p.m. UTC | #9
On Tue, 2015-08-25 at 11:41 -0300, Adhemerval Zanella wrote:
> 
> On 25-08-2015 08:34, Torvald Riegel wrote:
> > 
> > No.  The rule I'd like to follow is that if we need atomic accesses to a
> > memory object somewhere in the code base, that we then consistently use
> > atomic accesses for all accesses to this object (with the exception of
> > initialization code, where things like memset can be simpler).
> > 
> > Thus, if you use just transactions to prevent data races for an object,
> > no atomics are required.  This holds for the vast majority of objects
> > accessed in transactions (except lock implementations and such).
> > 
> > IMO, not using atomics for accessing is_lock_free in the txn creates an
> > exception in our rules, and given that there's no problem to use an
> > MO-relaxed access there, it's just easier to avoid creating an
> > exception.  And it makes it possible to use an atomic type for it in the
> > future.
> 
> The problem I see is that all hardware transactional memory
> implementations either do not explicitly guarantee forward progress for
> transactional code or require fallback code for performance reasons, and
> then non-transactional code would be required.

Agreed, current HTMs are almost always best-effort, but I don't think a
relaxed-MO atomic should abort a txn.

> 
> > 
> >> it might be case where the algorithm
> >> will require another semantic (like a acquire or release) and it will
> >> require to drop the atomic usage to avoid generate subpar code.
> > 
> > I don't quite understand what you mean, sorry.
> > 
> 
> And what I am trying to avoid is to require some atomic access in a
> non-MO-relaxed way in a transaction just to follow the atomic access rule.

Agreed.

> Anyway, I do see the value of making atomic accesses using the atomic API
> inside transactions, and we can handle future cases on a case-by-case basis.

Good.
Tulio Magno Quites Machado Filho Aug. 25, 2015, 6:35 p.m. UTC | #10
Torvald Riegel <triegel@redhat.com> writes:

> On Mon, 2015-08-24 at 16:19 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> "Carlos O'Donell" <carlos@redhat.com> writes:
>> 
>> > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>> >> Changes since v2:
>> >>  - Moved part of the source code comments to the commit message.
>> >>  - Added O'Donnell's suggestion to the source code comments.
>> >> 
>> >> Changes since v1:
>> >>  - Improved commit message.
>> >>  - Added new comments in the source code alerting to the concurrency details
>> >>    intrinsic to this code.
>> >>  - Removed compiler barriers from this patch, which will be treated in another
>> >>    patch and will be synchronized with the GCC implementation.
>> >
>> Reason #2 is a valid point, but is unrelated to this patch, i.e. I wouldn't
>> backport the atomic access to glibc 2.21 and 2.22 if the only reason for it
>> is #2.  So, it would be better as a separate patch.
>
> I wouldn't object if you split this into two parts, and only backport
> those patches that are necessary for correctness.

Great!  So, let's split this discussion in two: one to review the patch for
bug #18743 and another which is an RFC to create a rule to consistently use
atomic access for any memory object that needs at least one atomic access.

Both discussions are completely unrelated, except that the former started the
latter.
Paul E. Murphy Aug. 25, 2015, 9:06 p.m. UTC | #11
The discussion this patch launched has been quite informative.  Circling
back to the patch: it should fix the bug in question while we determine
how to improve the common code.

My only trivial nit is that the x86 version might benefit from a similar
comment regarding concurrency; otherwise LGTM.

On 08/24/2015 01:11 PM, Tulio Magno Quites Machado Filho wrote:
> Changes since v2:
>  - Moved part of the source code comments to the commit message.
>  - Added O'Donnell's suggestion to the source code comments.
> 
> Changes since v1:
>  - Improved commit message.
>  - Added new comments in the source code alerting to the concurrency details
>    intrinsic to this code.
>  - Removed compiler barriers from this patch, which will be treated in another
>    patch and will be synchronized with the GCC implementation.
> 
> 8<----------
> 
> The previous code used to evaluate the preprocessor token is_lock_free to
> a variable before starting a transaction.  This behavior can cause an
> error if another thread got the lock (without using a transaction)
> between the evaluation of the token and the beginning of the transaction.
> 
> This bug can be triggered with the following order of events:
> 1. The lock accessed by is_lock_free is free.
> 2. Thread T1 evaluates is_lock_free and stores into register R1 that the
>    lock is free.
> 3. Thread T2 acquires the same lock used in is_lock_free.
> 4. T1 begins the transaction, creating a memory barrier where is_lock_free
>    is false, but R1 is true.
> 5. T1 reads R1 and doesn't abort the transaction.
> 6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides
>    to unlock a lock acquired by T2, leading to undefined behavior.
> 
> This patch delays the evaluation of is_lock_free to inside a transaction
> by moving this part of the code to the macro ELIDE_LOCK.
> 
> 2015-08-24  Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>
> 
> 	[BZ #18743]
> 	* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
> 	code to...
> 	(ELIDE_LOCK): ...here.
> 	(__get_new_count): New function with part of the code from
> 	__elide_lock that updates the value of adapt_count after a
> 	transaction abort.
> 	(__elided_trylock): Moved this code to...
> 	(ELIDE_TRYLOCK): ...here.
> ---
>  NEWS                         |   3 +-
>  sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++--------------------
>  2 files changed, 63 insertions(+), 53 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 6b36c08..76df3fd 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -11,7 +11,8 @@ Version 2.23
> 
>    14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086,
>    18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681,
> -  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824.
> +  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824,
> +  18743.
> 
>  * The obsolete header <regexp.h> has been removed.  Programs that require
>    this header must be updated to use <regex.h> instead.
> diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h
> index 389f5a5..1cc9890 100644
> --- a/sysdeps/powerpc/nptl/elide.h
> +++ b/sysdeps/powerpc/nptl/elide.h
> @@ -23,67 +23,76 @@
>  # include <htm.h>
>  # include <elision-conf.h>
> 
> -/* Returns true if the lock defined by is_lock_free as elided.
> -   ADAPT_COUNT is a pointer to per-lock state variable. */
> -
> +/* Get the new value of adapt_count according to the elision
> +   configurations.  Returns true if the system should retry again or false
> +   otherwise.  */
>  static inline bool
> -__elide_lock (uint8_t *adapt_count, int is_lock_free)
> +__get_new_count (uint8_t *adapt_count)
>  {
> -  if (*adapt_count > 0)
> +  /* A persistent failure indicates that a retry will probably
> +     result in another failure.  Use normal locking now and
> +     for the next couple of calls.  */
> +  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
>      {
> -      (*adapt_count)--;
> +      if (__elision_aconf.skip_lock_internal_abort > 0)
> +	*adapt_count = __elision_aconf.skip_lock_internal_abort;
>        return false;
>      }
> -
> -  for (int i = __elision_aconf.try_tbegin; i > 0; i--)
> -    {
> -      if (__builtin_tbegin (0))
> -	{
> -	  if (is_lock_free)
> -	    return true;
> -	  /* Lock was busy.  */
> -	  __builtin_tabort (_ABORT_LOCK_BUSY);
> -	}
> -      else
> -	{
> -	  /* A persistent failure indicates that a retry will probably
> -	     result in another failure.  Use normal locking now and
> -	     for the next couple of calls.  */
> -	  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
> -	    {
> -	      if (__elision_aconf.skip_lock_internal_abort > 0)
> -		*adapt_count = __elision_aconf.skip_lock_internal_abort;
> -	      break;
> -	    }
> -	  /* Same logic as above, but for a number of temporary failures in a
> -	     a row.  */
> -	  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
> -		   && __elision_aconf.try_tbegin > 0)
> -	    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
> -	}
> -     }
> -
> -  return false;
> +  /* Same logic as above, but for a number of temporary failures in a
> +     a row.  */
> +  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
> +	   && __elision_aconf.try_tbegin > 0)
> +    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
> +  return true;
>  }
> 
> -# define ELIDE_LOCK(adapt_count, is_lock_free) \
> -  __elide_lock (&(adapt_count), is_lock_free)
> -
> -
> -static inline bool
> -__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write)
> -{
> -  if (__elision_aconf.try_tbegin > 0)
> -    {
> -      if (write)
> -	__builtin_tabort (_ABORT_NESTED_TRYLOCK);
> -      return __elide_lock (adapt_count, is_lock_free);
> -    }
> -  return false;
> -}
> +/* CONCURRENCY NOTES:
> +
> +   The value of is_lock_free is read and possibly written to by multiple
> +   threads, either during lock acquisition, or elision.  In order to avoid
> +   this evaluation from becoming a data race the access of is_lock_free
> +   is placed *inside* the transaction.  Within the transaction is_lock_free
> +   can be accessed with relaxed ordering.  We are assured by the transaction
> +   that the value of is_lock_free is consistent with the state of the lock,
> +   otherwise the hardware will abort the transaction.  */
> +
> +/* Returns 0 if the lock defined by is_lock_free was elided.
> +   ADAPT_COUNT is a per-lock state variable.  */
> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
> +  ({									\
> +    int ret = 0;							\
> +    if (adapt_count > 0)						\
> +      (adapt_count)--;							\
> +    else								\
> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> +	{								\
> +	  if (__builtin_tbegin (0))					\
> +	    {								\
> +	      if (is_lock_free)						\
> +		{							\
> +		  ret = 1;						\
> +		  break;						\
> +		}							\
> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	    }								\
> +	  else								\
> +	    if (!__get_new_count(&adapt_count))				\
> +	      break;							\
> +	}								\
> +    ret;								\
> +  })
> 
>  # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
> -  __elide_trylock (&(adapt_count), is_lock_free, write)
> +  ({								\
> +    int ret = 0;						\
> +    if (__elision_aconf.try_tbegin > 0)				\
> +      {								\
> +	if (write)						\
> +	  __builtin_tabort (_ABORT_NESTED_TRYLOCK);		\
> +	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
> +      }								\
> +    ret;							\
> +  })
> 
> 
>  static inline bool
>
Tulio Magno Quites Machado Filho Aug. 25, 2015, 10:08 p.m. UTC | #12
Torvald Riegel <triegel@redhat.com> writes:

> On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> Changes since v2:
>>  - Moved part of the source code comments to the commit message.
>>  - Added O'Donnell's suggestion to the source code comments.
>> 
>> Changes since v1:
>>  - Improved commit message.
>>  - Added new comments in the source code alerting to the concurrency details
>>    intrinsic to this code.
>>  - Removed compiler barriers from this patch, which will be treated in another
>>    patch and will be synchronized with the GCC implementation.
>> 
>> 8<----------
>> 
>> The previous code used to evaluate the preprocessor token is_lock_free to
>> a variable before starting a transaction.  This behavior can cause an
>> error if another thread got the lock (without using a transaction)
>> between the evaluation of the token and the beginning of the transaction.
>
> I find the use of the preprocessor to capture an evaluation really ugly.
> And it is error-prone, as the bug fixed by this patch shows.
> I would appreciate a follow-up patch that replaces is_lock_free with either a
> function that is called (relying on the compiler to inline it if the
> function pointer is constant, which it should be), or an address to a
> value that is compared to some caller-provided constant (which would
> limit flexibility for the condition somewhat, but is cleaner).

I completely agree with you.

>> +/* CONCURRENCY NOTES:
>> +
>> +   The value of is_lock_free is read and possibly written to by multiple
>
> As you write above, is_lock_free is not a value, it is an expression.
> The expression will read from a memory location, that is then modified
> by other threads.  Please clarify that so this is clear to readers (this
> misunderstanding seems to be related to the bug you're fixing...).

Ack.  I'll change it to: "The macro expression is_lock_free is..."

>
>> +   threads, either during lock acquisition, or elision.  In order to avoid
>> +   this evaluation from becoming a data race the access of is_lock_free
>
> It could be a data race because you're not using atomics there, but
> that's not the whole point.  (We use the term "data race" specifically
> to refer to the C11/C++11 memory model concept of the same name.)
> You want to ensure the lock elision synchronization scheme, and thus are
> moving it inside the txn.

Are you complaining about the usage of the term "data race"?
If so, what about "race condition"?

>> +   is placed *inside* the transaction.  Within the transaction is_lock_free
>> +   can be accessed with relaxed ordering.  We are assured by the transaction
>
> Please use "memory order" to be specific, or abbreviate with MO as
> mentioned on the Concurrency wiki page.

Ack.

>> +   that the value of is_lock_free is consistent with the state of the lock,
>> +   otherwise the hardware will abort the transaction.  */
>
> That's not really why this lock elision synchronization scheme
> works :)  (At least you're not giving sufficient reason why it is.)

I do agree with you that the comment needs some adjustments, but its
intention is only to describe the concurrency details of the next macro.
It has no intention of describing how lock elision or this macro works.
It was added per request [1] [2].

What do you think about the following?

   Within the transaction we are assured that all memory accesses are atomic
   and is_lock_free can be evaluated with relaxed memory order.  That way, the
   value of is_lock_free is consistent with the state of the lock until the
   end of the transaction.

[1] https://sourceware.org/ml/libc-alpha/2015-07/msg00987.html
[2] https://sourceware.org/ml/libc-alpha/2015-08/msg00870.html

>> +/* Returns 0 if the lock defined by is_lock_free was elided.
>> +   ADAPT_COUNT is a per-lock state variable.  */
>> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
>> +  ({									\
>> +    int ret = 0;							\
>> +    if (adapt_count > 0)						\
>> +      (adapt_count)--;							\
>> +    else								\
>> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> +	{								\
>> +	  if (__builtin_tbegin (0))					\
>> +	    {								\
>> +	      if (is_lock_free)						\
>
> This evaluation should use relaxed MO atomic accesses, IMO.  But that
> can be changed by a follow-up patch.

Ack.

>> +		{							\
>> +		  ret = 1;						\
>> +		  break;						\
>> +		}							\
>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	    }								\
>> +	  else								\
>> +	    if (!__get_new_count(&adapt_count))				\
>> +	      break;							\
>> +	}								\
>> +    ret;								\
>> +  })
>>  
>>  # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
>> -  __elide_trylock (&(adapt_count), is_lock_free, write)
>> +  ({								\
>> +    int ret = 0;						\
>> +    if (__elision_aconf.try_tbegin > 0)				\
>> +      {								\
>
> I'd prefer a comment here why you abort if write is true.  Or a
> reference to another source file / function where this is explained.
> Can be in a follow-up patch.

Let's leave this to another patch.  I want to unblock this bugfix.  ;-)

Anyway, this code is not new.  This patch is just moving it from a function to
a macro.
AFAICS, x86 and powerpc are the only architectures to implement this macro.
They both do the same thing and they don't provide comments.
I'll have to investigate why it was implemented this way.

Thanks!
Torvald Riegel Aug. 26, 2015, 8:43 a.m. UTC | #13
On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho
wrote:
> Torvald Riegel <triegel@redhat.com> writes:
> 
> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
> >> +   threads, either during lock acquisition, or elision.  In order to avoid
> >> +   this evaluation from becoming a data race the access of is_lock_free
> >
> > It could be a data race because you're not using atomics there, but
> > that's not the whole point.  (We use the term "data race" specifically
> > to refer to the C11/C++11 memory model concept of the same name.)
> > You want to ensure the lock elision synchronization scheme, and thus are
> > moving it inside the txn.
> 
> Are you complaining about the usage of the term "data race"?

Yes.

> If so, what about "race condition"?

Well, that would be better, but a race condition is not necessarily a
bad thing.  It's usually better to say which execution or interleaving
you need to avoid, than just saying "race condition".

> >> +   is placed *inside* the transaction.  Within the transaction is_lock_free
> >> +   can be accessed with relaxed ordering.  We are assured by the transaction
> >
> > Please use "memory order" to be specific, or abbreviate with MO as
> > mentioned on the Concurrency wiki page.
> 
> Ack.
> 
> >> +   that the value of is_lock_free is consistent with the state of the lock,
> >> +   otherwise the hardware will abort the transaction.  */
> >
> > That's not really why this lock elision synchronization scheme
> > works :)  (At least you're not giving sufficient reason why it is.)
> 
> I do agree with you that comment needs some adjustments but its intention is
> only to describe the concurrency details of the next macro.  It has no intention
> of describing how lock elision or this macro work.
> It was added per request [1] [2].
>
> What do you think about the following?
> 
>    Within the transaction we are assured that all memory accesses are atomic
>    and is_lock_free can be evaluated with relaxed memory order.  That way, the
>    value of is_lock_free is consistent with the state of the lock until the
>    end of the transaction.

That's good enough for this patch in that it hints at the right thing.  

I'd still like to see this getting improved eventually, and I'd
appreciate it if you could participate in that.  A shared comment for both
the Intel and IBM lock elision implementations would be useful, I
believe, as the general principle used in both is the same.
I'd like to see this documented properly just so that future glibc
developers know how to work with this, and that it gets less likely
that bugs are introduced in the future.

> >> +		{							\
> >> +		  ret = 1;						\
> >> +		  break;						\
> >> +		}							\
> >> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> >> +	    }								\
> >> +	  else								\
> >> +	    if (!__get_new_count(&adapt_count))				\
> >> +	      break;							\
> >> +	}								\
> >> +    ret;								\
> >> +  })
> >>  
> >>  # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
> >> -  __elide_trylock (&(adapt_count), is_lock_free, write)
> >> +  ({								\
> >> +    int ret = 0;						\
> >> +    if (__elision_aconf.try_tbegin > 0)				\
> >> +      {								\
> >
> > I'd prefer a comment here why you abort if write is true.  Or a
> > reference to another source file / function where this is explained.
> > Can be in a follow-up patch.
> 
> Let's leave this to another patch.  I want to unblock this bugfix.  ;-)

OK.
Tulio Magno Quites Machado Filho Aug. 26, 2015, 4:31 p.m. UTC | #14
Torvald Riegel <triegel@redhat.com> writes:

> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> Torvald Riegel <triegel@redhat.com> writes:
>> 
>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
>> >> +   threads, either during lock acquisition, or elision.  In order to avoid
>> >> +   this evaluation from becoming a data race the access of is_lock_free
>> >
>> > It could be a data race because you're not using atomics there, but
>> > that's not the whole point.  (We use the term "data race" specifically
>> > to refer to the C11/C++11 memory model concept of the same name.)
>> > You want to ensure the lock elision synchronization scheme, and thus are
>> > moving it inside the txn.
>> 
>> Are you complaining about the usage of the term "data race"?
>
> Yes.
>
>> If so, what about "race condition"?
>
> Well, that would be better, but a race condition is not necessarily a
> bad thing.  It's usually better to say which execution or interleaving
> you need to avoid, than just saying "race condition".

What about the following?

/* CONCURRENCY NOTES:

   The macro expression is_lock_free is read and possibly written to by
   multiple threads, either during lock acquisition, or elision.  In order to
   avoid this evaluation from becoming a race condition with the lock
   acquirement from the lock primitive, the access of is_lock_free is placed
   *inside* the transaction.  Within the transaction we are assured that all
   memory accesses are atomic and is_lock_free can be evaluated with relaxed
   memory order.  That way, the value of is_lock_free is consistent with the
   state of the lock until the end of the transaction.  */

> I'd still like to see this getting improved eventually, and I'd
> appreciate if you could participate in that.

Ack.  Count me in.
Carlos O'Donell Aug. 26, 2015, 7:25 p.m. UTC | #15
On 08/25/2015 05:06 PM, Paul E. Murphy wrote:
> The discussion this patch launched has been quite informative. Circling
> back to the patch; it should fix the bug in question while we determine
> how to improve the common code.
> 
> My only trivial nit is the x86 version might benefit from a similar
> comment regarding concurrency, otherwise LGTM.

Thanks for the review Paul!

Yes, the x86 code needs the same treatment and I think we'll do that
next by fixing up all callers to use an inline function.

c.
Tulio Magno Quites Machado Filho Sept. 1, 2015, 2 p.m. UTC | #16
Ping, Torvald.


"Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes:

> Torvald Riegel <triegel@redhat.com> writes:
>
>> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho
>> wrote:
>>> Torvald Riegel <triegel@redhat.com> writes:
>>> 
>>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
>>> >> +   threads, either during lock acquisition, or elision.  In order to avoid
>>> >> +   this evaluation from becoming a data race the access of is_lock_free
>>> >
>>> > It could be a data race because you're not using atomics there, but
>>> > that's not the whole point.  (We use the term "data race" specifically
>>> > to refer to the C11/C++11 memory model concept of the same name.)
>>> > You want to ensure the lock elision synchronization scheme, and thus are
>>> > moving it inside the txn.
>>> 
>>> Are you complaining about the usage of the term "data race"?
>>
>> Yes.
>>
>>> If so, what about "race condition"?
>>
>> Well, that would be better, but a race condition is not necessarily a
>> bad thing.  It's usually better to say which execution or interleaving
>> you need to avoid, than just saying "race condition".
>
> What about the following?
>
> /* CONCURRENCY NOTES:
>
>    The macro expression is_lock_free is read and possibly written to by
>    multiple threads, either during lock acquisition, or elision.  In order to
>    avoid this evaluation from becoming a race condition with the lock
>    acquirement from the lock primitive, the access of is_lock_free is placed
>    *inside* the transaction.  Within the transaction we are assured that all
>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>    memory order.  That way, the value of is_lock_free is consistent with the
>    state of the lock until the end of the transaction.  */

Anything else you'd like to change before I push this?

Thanks,
Peter Bergner Sept. 1, 2015, 7:38 p.m. UTC | #17
On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote:
> +/* Returns 0 if the lock defined by is_lock_free was elided.
> +   ADAPT_COUNT is a per-lock state variable.  */
> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
> +  ({									\
> +    int ret = 0;							\
> +    if (adapt_count > 0)						\
> +      (adapt_count)--;							\
> +    else								\
> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> +	{								\
> +	  if (__builtin_tbegin (0))					\
> +	    {								\
> +	      if (is_lock_free)						\
> +		{							\
> +		  ret = 1;						\
> +		  break;						\
> +		}							\
> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	    }								\
> +	  else								\
> +	    if (!__get_new_count(&adapt_count))				\
> +	      break;							\
> +	}								\
> +    ret;								\
> +  })

Interesting.  I was going to suggest following your __builtin_tabort()
call here with a call to __builtin_unreachable() so we can give the
compiler a little more freedom to optimize things on that path.
Then I remembered we had some conditional tabort insns we might
be able to use here, since the tabort above is conditional on
is_lock_free == 0.

The following code seems to generate much better code, but at the
cost of not passing _ABORT_LOCK_BUSY as a failure cause.

/* Returns 0 if the lock defined by is_lock_free was elided.
   ADAPT_COUNT is a per-lock state variable.  */
# define ELIDE_LOCK(adapt_count, is_lock_free)                         \
  ({                                                                   \
    int ret = 0;                                                       \
    if (adapt_count > 0)                                               \
      (adapt_count)--;                                                 \
    else                                                               \
      for (int i = __elision_aconf.try_tbegin; i > 0; i--)             \
       {                                                               \
         if (__builtin_tbegin (0))                                     \
           {                                                           \
             __builtin_tabortwci (0x4, is_lock_free, 0);               \
             ret = 1;                                                  \
             break;                                                    \
           }                                                           \
         else                                                          \
           if (!__get_new_count(&adapt_count))                         \
             break;                                                    \
       }                                                               \
    ret;                                                               \
  })


However, if I look at the code that consumes the failure code, I see
that we're only looking for the persistent bit being set and not any
specific failure cause we passed in:

static inline bool
__get_new_count (uint8_t *adapt_count)
{
  /* A persistent failure indicates that a retry will probably
     result in another failure.  Use normal locking now and
     for the next couple of calls.  */
  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
    {
      if (__elision_aconf.skip_lock_internal_abort > 0)
       *adapt_count = __elision_aconf.skip_lock_internal_abort;
       return false;
    }

  /* Same logic as above, but for a number of temporary failures in a
     a row.  */
  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
          && __elision_aconf.try_tbegin > 0)
    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
  return true;
}

...although looking closer, I see that the definitions we use for
passing in failure codes are:

/* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
   because tabort. always sets the first bit.  */
#define _ABORT_LOCK_BUSY       0x3f   /* Lock already used.  */
#define _ABORT_NESTED_TRYLOCK  0x3e   /* Write operation in trylock.  */
#define _ABORT_SYSCALL         0x3d   /* Syscall issued.  */

I'm not sure I understand the "even" part above, since two of these
values are clearly not even.  In fact, the two odd values above, if
passed to __builtin_tabort(), will lead to the persistent bit being
set, since tabort. sets the upper part of the TEXASR by taking the
least significant 8 bits of the value you passed in and concatenating
0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as
a failure cause example):

   TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1;

Given the persistent bit is bit 7, that means any odd value we pass
into __builtin_tabort(XXX) will set the TEXASR persistent bit.
Did we mean to do that?  Maybe we did, I'm just asking. :-)
I can see that if we're about to do a syscall, that should be persistent.
Are we also treating someone having a lock as persistent too?
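
A small sketch of that arithmetic (the helper name is made up; IBM bit
numbering: bit 0 is the MSB of the 32-bit field, so bit 7 is the LSB of
the top byte, i.e. the failure-persistent bit):

    static inline int
    abort_code_sets_persistent (unsigned int code)
    {
      unsigned int texasr_hi = ((code & 0xff) << 24) | 0x1;
      return (texasr_hi >> 24) & 0x1;
    }

    /* abort_code_sets_persistent (0x3f) == 1   _ABORT_LOCK_BUSY (odd)
       abort_code_sets_persistent (0x3e) == 0   _ABORT_NESTED_TRYLOCK (even)
       abort_code_sets_persistent (0x3d) == 1   _ABORT_SYSCALL (odd)  */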

If we did, then using __builtin_tabortwci() won't work, since we
wouldn't be setting the persistent bit like we are now... unless
we can pass in is_lock_free to __get_new_count and test that
along with the persistent bit still.

Peter
Paul E. Murphy Sept. 1, 2015, 8:25 p.m. UTC | #18
On 09/01/2015 02:38 PM, Peter Bergner wrote:
> ...although looking closer, I see have the definitions we use for
> passing in failure codes is:
> 
> /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
>    because tabort. always sets the first bit.  */

This is a bad comment.  It is a misinterpretation of the ISA; as you've noted,
bit 31 is always set if tabort* is used.  This makes using the conditional
versions impractical within the TLE implementation.

I think the implication here is that you can't trust bit 7 of texasr to deduce the
abort code, as it could be set due to one of bits 8-11.  I'm not sure if the cases
that lead to bits 8-11 getting set are mutually exclusive with calling tabort.  I.e.,
is failure code 1 the only reserved code?

> #define _ABORT_LOCK_BUSY       0x3f   /* Lock already used.  */
> #define _ABORT_NESTED_TRYLOCK  0x3e   /* Write operation in trylock.  */

This should definitely be an odd value too. I'll see to fixing its usage. As
implemented now, we might retry 3 (try_tbegin) times in __lll_trylock_elision
if we hit that one.

> #define _ABORT_SYSCALL         0x3d   /* Syscall issued.  */
> 
> I'm not sure I understand the "even" part above, since two of these
> values are clearly not even.  In fact, the two odd values above if
> passed to __builtin_tabort() will lead to the persistent bit being
> set, since tabort. sets the upper part of the TEXASR by taking the
> least significant 8 bits of the value you passed in and concatenating
> 0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as
> a failure cause example):
> 
>    TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1;
> 
> Given the persistent bit is bit 7, that means any odd value we pass
> into __builtin_tabort(XXX) will set the TEXASR persistent bit.
> Did we mean to do that?  Maybe we did, I'm just asking. :-)
> I can see that if we're about to do a syscall, that should be persistent.
> Are we also treating someone having a lock as persistent too?
> 
> If we did, then using __builtin_tabortwci() won't work, since we
> wouldn't be setting the persistent bit like we are now... unless
> we can pass in is_lock_free to __get_new_count and test that
> along with the persistent bit still.
> 
> Peter
Adhemerval Zanella Sept. 1, 2015, 8:46 p.m. UTC | #19
On 01-09-2015 16:38, Peter Bergner wrote:
> On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote:
>> +/* Returns 0 if the lock defined by is_lock_free was elided.
>> +   ADAPT_COUNT is a per-lock state variable.  */
>> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
>> +  ({									\
>> +    int ret = 0;							\
>> +    if (adapt_count > 0)						\
>> +      (adapt_count)--;							\
>> +    else								\
>> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> +	{								\
>> +	  if (__builtin_tbegin (0))					\
>> +	    {								\
>> +	      if (is_lock_free)						\
>> +		{							\
>> +		  ret = 1;						\
>> +		  break;						\
>> +		}							\
>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	    }								\
>> +	  else								\
>> +	    if (!__get_new_count(&adapt_count))				\
>> +	      break;							\
>> +	}								\
>> +    ret;								\
>> +  })
> 
> Interesting.  I was going to suggest following your __builtin_tabort()
> call here with a call to __builtin_unreachable() so we can give the
> compiler a little more freedom to optimize things on that path.
> Then I remembered we had some conditionl tabort insns we might
> be able to use here, since the tabort above is conditional on
> is_lock_free == 0.
> 
> The following code seems to generate much better code, but at the
> loss of not passing _ABORT_LOCK_BUSY as a failure cause.
> 
> /* Returns 0 if the lock defined by is_lock_free was elided.
>    ADAPT_COUNT is a per-lock state variable.  */
> # define ELIDE_LOCK(adapt_count, is_lock_free)                         \
>   ({                                                                   \
>     int ret = 0;                                                       \
>     if (adapt_count > 0)                                               \
>       (adapt_count)--;                                                 \
>     else                                                               \
>       for (int i = __elision_aconf.try_tbegin; i > 0; i--)             \
>        {                                                               \
>          if (__builtin_tbegin (0))                                     \
>            {                                                           \
>              __builtin_tabortwci (0x4, is_lock_free, 0)                \
>              ret = 1;                                                  \
>              break;                                                    \
>            }                                                           \
>          else                                                          \
>            if (!__get_new_count(&adapt_count))                         \
>              break;                                                    \
>        }                                                               \
>     ret;                                                               \
>   })
> 
> 
> However, if I look at the code that consumes the failure code, I see
> that we're only looking for the persistent bit being set and not any
> specific failure cause we passed in:

The idea of passing an error code in the transaction abort is to differentiate
it from other possible abort types, for instance newer kernels that
might abort in syscalls or a user program that defines its own
transaction abort code.

> 
> static inline bool
> __get_new_count (uint8_t *adapt_count)
> {
>   /* A persistent failure indicates that a retry will probably
>      result in another failure.  Use normal locking now and
>      for the next couple of calls.  */
>   if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
>     {
>       if (__elision_aconf.skip_lock_internal_abort > 0)
>        *adapt_count = __elision_aconf.skip_lock_internal_abort;
>        return false;
>     }
> 
>   /* Same logic as above, but for a number of temporary failures in a
>      a row.  */
>   else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
>           && __elision_aconf.try_tbegin > 0)
>     *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
>   return true;
> }
> 
> ...although looking closer, I see have the definitions we use for
> passing in failure codes is:
> 
> /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
>    because tabort. always sets the first bit.  */
> #define _ABORT_LOCK_BUSY       0x3f   /* Lock already used.  */
> #define _ABORT_NESTED_TRYLOCK  0x3e   /* Write operation in trylock.  */
> #define _ABORT_SYSCALL         0x3d   /* Syscall issued.  */
> 
> I'm not sure I understand the "even" part above, since two of these
> values are clearly not even.  In fact, the two odd values above if
> passed to __builtin_tabort() will lead to the persistent bit being
> set, since tabort. sets the upper part of the TEXASR by taking the
> least significant 8 bits of the value you passed in and concatenating
> 0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as
> a failure cause example):
> 
>    TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1;
> 
> Given the persistent bit is bit 7, that means any odd value we pass
> into __builtin_tabort(XXX) will set the TEXASR persistent bit.
> Did we mean to do that?  Maybe we did, I'm just asking. :-)
> I can see that if we're about to do a syscall, that should be persistent.
> Are we also treating someone having a lock as persistent too?

Indeed the 'odd' comment does not make sense and we should just remove it
(I misread the texasr definition).  My initial idea was to define some codes
that set the persistent failure bit and some that do not.  I think the best
approach would be:

/* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
   and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
   code means a persistent error that should trigger a default lock
   acquisition.  */
#define _ABORT_SYSCALL         0x1   /* Syscall issued.  */
#define _ABORT_LOCK_BUSY       0x2   /* Lock already used.  */
#define _ABORT_NESTED_TRYLOCK  0x4   /* Write operation in trylock.  */

> 
> If we did, then using __builtin_tabortwci() won't work, since we
> wouldn't be setting the persistent bit like we are now... unless
> we can pass in is_lock_free to __get_new_count and test that
> along with the persistent bit still.

I think we can use '__builtin_tabortwci' since the transactional aborts
here are not meant to be persistent (the syscall ones are done in the syscall
wrappers).

What do you think?

> 
> Peter
> 
> 
>
Paul E. Murphy Sept. 1, 2015, 9:24 p.m. UTC | #20
On 09/01/2015 03:46 PM, Adhemerval Zanella wrote:
> Indeed the 'odd' comment does not make sense and we should just remove it
> (I misread texasr definition).  My initial idea was define some codes
> that set the persistent failures and some that do not.  I think I best
> approach would be:
> 
> /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
>    and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
>    code means a persistent error that should trigger a default lock
>    acquisition.  */
> #define _ABORT_SYSCALL         0x1   /* Syscall issued.  */
> #define _ABORT_LOCK_BUSY       0x2   /* Lock already used.  */
> #define _ABORT_NESTED_TRYLOCK  0x4   /* Write operation in trylock.  */

The kernel defines several abort codes; we'll want to work with them, or
recycle them as needed.

I'm not convinced any of the existing codes should be non-persistent:

A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
guaranteed to fail try_tbegin times if there is no contention on the lock.
Aborts get increasingly expensive as you increase the amount of speculative
execution.

A busy lock likely indicates contention in the critical section which
does not benefit from elision, I'd err on the side of a persistent
failure.
Peter Bergner Sept. 1, 2015, 9:29 p.m. UTC | #21
On Tue, 2015-09-01 at 15:25 -0500, Paul E. Murphy wrote:
> 
> On 09/01/2015 02:38 PM, Peter Bergner wrote:
> > ...although looking closer, I see have the definitions we use for
> > passing in failure codes is:
> > 
> > /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
> >    because tabort. always sets the first bit.  */
> 
> This is a bad comment. This is a misinterpretation of the ISA, as you've noted,
> bit 31 is always set if tabort* is used. This makes using the conditional
> versions impractical within the TLE implementation.

I don't understand this.  All tabort* instructions, including the unconditional
tabort., set bit 31.  The only thing that tabort. does that the conditional
tabort[wd]c[i] forms don't do is set bits 0-7 (i.e., failure code & persistent bits).



> I think the implication here is you can't trust bit 7 of texasr to deduce the
> abort code, as it could be set due to one of bits 8-11. I'm not sure if the cases
> leading bits 8-11 to get set are mutually exclusive with calling tabort. I.e is
> failure code 1 the only reserved code?

Bit 7 is supposed to be used to specify whether a transaction failure is
persistent (ie, whether we should retry the tbegin.) or not.  That bit
can be set by either a HW conflict leading to a transaction failure or
an explicit use of tabort.  It cannot be set with conditional tabort[wd]c[i]
instructions, but Adhemerval's other note seems to imply we weren't
supposed to be setting the persistent bit in this code, so using the
conditional tabort[wd]c[i] should be ok here.


Peter
Peter Bergner Sept. 1, 2015, 9:43 p.m. UTC | #22
On Tue, 2015-09-01 at 17:46 -0300, Adhemerval Zanella wrote:
> On 01-09-2015 16:38, Peter Bergner wrote:
> > /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
> >    because tabort. always sets the first bit.  */
[snip]
> Indeed the 'odd' comment does not make sense and we should just remove it
> (I misread texasr definition).  My initial idea was define some codes
> that set the persistent failures and some that do not.  I think I best
> approach would be:
> 
> /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
>    and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
>    code means a persistent error that should trigger a default lock
>    acquisition.  */
> #define _ABORT_SYSCALL         0x1   /* Syscall issued.  */
> #define _ABORT_LOCK_BUSY       0x2   /* Lock already used.  */
> #define _ABORT_NESTED_TRYLOCK  0x4   /* Write operation in trylock.  */

So according to this, only the syscall is deemed persistent.


> I think we can use '__builtin_tabortwci' since the transactional aborts
> here are not meant to be persistent (the syscall one are done in syscall
> wrappers).

If we decide that these uses of the explicit tabort*s are really
to be treated as not persistent, then I agree we could use the
conditional tabort[wd]c[i] instructions, but it seems Paul thinks
otherwise.  I guess we should let Paul describe why.

Peter
Adhemerval Zanella Sept. 1, 2015, 9:58 p.m. UTC | #23
On 01-09-2015 18:24, Paul E. Murphy wrote:
> 
> 
> On 09/01/2015 03:46 PM, Adhemerval Zanella wrote:
>> Indeed the 'odd' comment does not make sense and we should just remove it
>> (I misread texasr definition).  My initial idea was define some codes
>> that set the persistent failures and some that do not.  I think I best
>> approach would be:
>>
>> /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
>>    and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
>>    code means a persistent error that should trigger a default lock
>>    acquisition.  */
>> #define _ABORT_SYSCALL         0x1   /* Syscall issued.  */
>> #define _ABORT_LOCK_BUSY       0x2   /* Lock already used.  */
>> #define _ABORT_NESTED_TRYLOCK  0x4   /* Write operation in trylock.  */
> 
> The kernel defines several abort codes, we'll want to work with them, or
> recycle them as needed.

Yeap, but these are the ones the kernel itself generates (not GLIBC), and I think
it would be useful to use different codes for aborts generated by
GLIBC itself (so an external tool like perf can detect exactly where
the abort came from).

And I used the numbers above just as an example; checking the kernel code, it seems it
defines:

#define TM_CAUSE_PERSISTENT     0x01
#define TM_CAUSE_KVM_RESCHED    0xe0  /* From PAPR */
#define TM_CAUSE_KVM_FAC_UNAV   0xe2  /* From PAPR */
#define TM_CAUSE_RESCHED        0xde
#define TM_CAUSE_TLBI           0xdc
#define TM_CAUSE_FAC_UNAV       0xda
#define TM_CAUSE_SYSCALL        0xd8
#define TM_CAUSE_MISC           0xd6  /* future use */
#define TM_CAUSE_SIGNAL         0xd4
#define TM_CAUSE_ALIGNMENT      0xd2
#define TM_CAUSE_EMULATE        0xd0

So I think we might use 0xaX or something for the GLIBC ones.
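
For example, something along these lines (the values below are just
placeholders I am making up to illustrate the idea, not something we agreed
on):

/* Hypothetical GLIBC-side abort codes kept out of the kernel's TM_CAUSE_*
   range so a tool like perf can tell where an abort came from.  The low
   bit of the code still decides persistence (odd => persistent).  Values
   are illustrative only.  */
#define _ABORT_SYSCALL         0xa1   /* Syscall issued (persistent).  */
#define _ABORT_LOCK_BUSY       0xa2   /* Lock already used.  */
#define _ABORT_NESTED_TRYLOCK  0xa4   /* Write operation in trylock.  */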


> 
> I'm not convinced any of the existing codes should be non-persistent:
> 
> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
> guaranteed to fail try_tbegin times if there is no contention on the lock.
> Aborts get increasingly expensive as you increase the amount of speculative
> execution.
> 
> A busy lock likely indicates contention in the critical section which
> does not benefit from elision, I'd err on the side of a persistent
> failure.
> 

I do not have that much information to decide it, although I do see that, for
pthread_mutex_t at least, _ABORT_LOCK_BUSY is not persistent if the critical
region does not generate enough contention.
Paul E. Murphy Sept. 1, 2015, 10:43 p.m. UTC | #24
On 09/01/2015 04:58 PM, Adhemerval Zanella wrote:

>> I'm not convinced any of the existing codes should be non-persistent:
>>
>> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
>> guaranteed to fail try_tbegin times if there is no contention on the lock.
>> Aborts get increasingly expensive as you increase the amount of speculative
>> execution.
>>
>> A busy lock likely indicates contention in the critical section which
>> does not benefit from elision, I'd err on the side of a persistent
>> failure.
>>
> 
> I do not that much information to decide it, although I do see for pthread_mutex_t 
> at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not 
> generate enough contention.

This seems to violate my understanding of the adaptive algorithm. I agree, there
is a possibility another thread might successfully elide while adapt_count != 0,
*futex == 0, and it happens to be within the retry loop.

Transactions are not free. The question is: Is it cheaper to risk a few more
failures in hopes of the transaction succeeding, or just grabbing the lock?

This becomes less desirable with increasing values of try_tbegin. I've found
11 to be the value which factors out the "false" failures under SMT8.
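
To make the trade-off concrete, a toy model (every number in it is an
assumption for illustration, not a measurement) looks like this:

/* Toy model only: expected cost of allowing up to N transactional retries
   before falling back to the lock.  abort_cost, lock_cost and p_success
   are made-up numbers.  */
#include <stdio.h>

int
main (void)
{
  const double abort_cost = 1.0;   /* Relative cost of one aborted attempt.  */
  const double lock_cost  = 5.0;   /* Relative cost of serializing on the lock.  */
  const double p_success  = 0.3;   /* Chance that a single retry elides.  */

  for (int retries = 1; retries <= 11; retries++)
    {
      double cost = 0.0;
      double p_reach = 1.0;        /* Probability we get to attempt i.  */
      for (int i = 0; i < retries; i++)
	{
	  cost += p_reach * (1.0 - p_success) * abort_cost;
	  p_reach *= 1.0 - p_success;
	}
      cost += p_reach * lock_cost; /* All attempts failed: take the lock.  */
      printf ("retries=%2d expected cost=%.2f (lock right away=%.2f)\n",
	      retries, cost, lock_cost);
    }
  return 0;
}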
Adhemerval Zanella Sept. 2, 2015, 12:45 p.m. UTC | #25
On 01-09-2015 19:43, Paul E. Murphy wrote:
> 
> 
> On 09/01/2015 04:58 PM, Adhemerval Zanella wrote:
> 
>>> I'm not convinced any of the existing codes should be non-persistent:
>>>
>>> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
>>> guaranteed to fail try_tbegin times if there is no contention on the lock.
>>> Aborts get increasingly expensive as you increase the amount of speculative
>>> execution.
>>>
>>> A busy lock likely indicates contention in the critical section which
>>> does not benefit from elision, I'd err on the side of a persistent
>>> failure.
>>>
>>
>> I do not that much information to decide it, although I do see for pthread_mutex_t 
>> at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not 
>> generate enough contention.
> 
> This seems to violate my understanding of the adaptive algorithm. I agree, there
> is a possibility another thread might successfully elide while adapt_count != 0,
> *futex == 0, and it happens to be within the retry loop.
> 
> Transactions are not free. The question is: Is it cheaper to risk a few more
> failures in hopes of the transaction succeeding, or just grabbing the lock?

Well, that is exactly the question.  I do not know; it will heavily depend on the
kind of algorithm and contention you have.

> 
> This becomes less desirable with increasing values of try_tbegin. I've found
> 11 to be the value which factors out the "false" failures under SMT8.
> 

If the workloads you are trying show that a higher try_tbegin is better, then
indeed we can either increase it or set _ABORT_LOCK_BUSY as a persistent failure
so the algorithm will bail out and use locks instead of retrying.

What kind of benchmarks or workloads are you using to evaluate it?
Paul E. Murphy Sept. 2, 2015, 3:28 p.m. UTC | #26
On 09/02/2015 07:45 AM, Adhemerval Zanella wrote:
> 
> 
> On 01-09-2015 19:43, Paul E. Murphy wrote:
>>
>>
>> On 09/01/2015 04:58 PM, Adhemerval Zanella wrote:
>>
>>>> I'm not convinced any of the existing codes should be non-persistent:
>>>>
>>>> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
>>>> guaranteed to fail try_tbegin times if there is no contention on the lock.
>>>> Aborts get increasingly expensive as you increase the amount of speculative
>>>> execution.
>>>>
>>>> A busy lock likely indicates contention in the critical section which
>>>> does not benefit from elision, I'd err on the side of a persistent
>>>> failure.
>>>>
>>>
>>> I do not that much information to decide it, although I do see for pthread_mutex_t 
>>> at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not 
>>> generate enough contention.
>>
>> This seems to violate my understanding of the adaptive algorithm. I agree, there
>> is a possibility another thread might successfully elide while adapt_count != 0,
>> *futex == 0, and it happens to be within the retry loop.
>>
>> Transactions are not free. The question is: Is it cheaper to risk a few more
>> failures in hopes of the transaction succeeding, or just grabbing the lock?
> 
> Well that is exactly the question I do not know it will heavy depend of the
> kind of algorithm and contention you have.
> 
>>
>> This becomes less desirable with increasing values of try_tbegin. I've found
>> 11 to be the value which factors out the "false" failures under SMT8.
>>
> 
> If the workloads you are trying are showing a higher try_tbegin is better, than
> indeed we can either increase it or set _ABORT_LOCK_BUSY as a persistent failure
> so the algorithm will bail out and use locks instead of retrying.
> 
> Which kind of benchmark or workloads are you using to evaluate it?
> 

I've posted my "workload" examples at git@github.com:pmur/glibc-test-tle.git. Take
them with a grain of salt, and think of them as sanity checks. I've been using
various derivations of them to poke at what I think are edge cases. Any suggestions
or additions are appreciated.
Tulio Magno Quites Machado Filho Sept. 3, 2015, 6:58 p.m. UTC | #27
Peter Bergner <bergner@vnet.ibm.com> writes:

> On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote:
>> +/* Returns 0 if the lock defined by is_lock_free was elided.
>> +   ADAPT_COUNT is a per-lock state variable.  */
>> +# define ELIDE_LOCK(adapt_count, is_lock_free)				\
>> +  ({									\
>> +    int ret = 0;							\
>> +    if (adapt_count > 0)						\
>> +      (adapt_count)--;							\
>> +    else								\
>> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> +	{								\
>> +	  if (__builtin_tbegin (0))					\
>> +	    {								\
>> +	      if (is_lock_free)						\
>> +		{							\
>> +		  ret = 1;						\
>> +		  break;						\
>> +		}							\
>> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	    }								\
>> +	  else								\
>> +	    if (!__get_new_count(&adapt_count))				\
>> +	      break;							\
>> +	}								\
>> +    ret;								\
>> +  })
>
> Interesting.  I was going to suggest following your __builtin_tabort()
> call here with a call to __builtin_unreachable() so we can give the
> compiler a little more freedom to optimize things on that path.
> Then I remembered we had some conditionl tabort insns we might
> be able to use here, since the tabort above is conditional on
> is_lock_free == 0.
>
> The following code seems to generate much better code, but at the
> loss of not passing _ABORT_LOCK_BUSY as a failure cause.
>
> /* Returns 0 if the lock defined by is_lock_free was elided.
>    ADAPT_COUNT is a per-lock state variable.  */
> # define ELIDE_LOCK(adapt_count, is_lock_free)                         \
>   ({                                                                   \
>     int ret = 0;                                                       \
>     if (adapt_count > 0)                                               \
>       (adapt_count)--;                                                 \
>     else                                                               \
>       for (int i = __elision_aconf.try_tbegin; i > 0; i--)             \
>        {                                                               \
>          if (__builtin_tbegin (0))                                     \
>            {                                                           \
>              __builtin_tabortwci (0x4, is_lock_free, 0)                \
>              ret = 1;                                                  \
>              break;                                                    \
>            }                                                           \
>          else                                                          \
>            if (!__get_new_count(&adapt_count))                         \
>              break;                                                    \
>        }                                                               \
>     ret;                                                               \
>   })
>
>
> However, if I look at the code that consumes the failure code, I see
> that we're only looking for the persistent bit being set and not any
> specific failure cause we passed in:
>
> static inline bool
> __get_new_count (uint8_t *adapt_count)
> {
>   /* A persistent failure indicates that a retry will probably
>      result in another failure.  Use normal locking now and
>      for the next couple of calls.  */
>   if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
>     {
>       if (__elision_aconf.skip_lock_internal_abort > 0)
>        *adapt_count = __elision_aconf.skip_lock_internal_abort;
>        return false;
>     }
>
>   /* Same logic as above, but for a number of temporary failures in a
>      a row.  */
>   else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
>           && __elision_aconf.try_tbegin > 0)
>     *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
>   return true;
> }

That seems like an interesting suggestion, but out of scope for this patch, IMO.
We're just fixing an issue here.  ;-)
Tulio Magno Quites Machado Filho Sept. 14, 2015, 2:31 p.m. UTC | #28
Ping x2.

"Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes:

> Ping, Torvald.
>
>
> "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes:
>
>> Torvald Riegel <triegel@redhat.com> writes:
>>
>>> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho
>>> wrote:
>>>> Torvald Riegel <triegel@redhat.com> writes:
>>>> 
>>>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
>>>> >> +   threads, either during lock acquisition, or elision.  In order to avoid
>>>> >> +   this evaluation from becoming a data race the access of is_lock_free
>>>> >
>>>> > It could be a data race because you're not using atomics there, but
>>>> > that's not the whole point.  (We use the term "data race" specifically
>>>> > to refer to the C11/C++11 memory model concept of the same name.)
>>>> > You want to ensure the lock elision synchronization scheme, and thus are
>>>> > moving it inside the txn.
>>>> 
>>>> Are you complaining about the usage of the term "data race"?
>>>
>>> Yes.
>>>
>>>> If so, what about "race condition"?
>>>
>>> Well, that would be better, but a race condition is not necessarily a
>>> bad thing.  It's usually better to say which execution or interleaving
>>> you need to avoid, than just saying "race condition".
>>
>> What about the following?
>>
>> /* CONCURRENCY NOTES:
>>
>>    The macro expression is_lock_free is read and possibly written to by
>>    multiple threads, either during lock acquisition, or elision.  In order to
>>    avoid this evaluation from becoming a race condition with the lock
>>    acquirement from the lock primitive, the access of is_lock_free is placed
>>    *inside* the transaction.  Within the transaction we are assured that all
>>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>>    memory order.  That way, the value of is_lock_free is consistent with the
>>    state of the lock until the end of the transaction.  */
>
> Anything else you'd like to change before I push this?
>
> Thanks,
Tulio Magno Quites Machado Filho Oct. 7, 2015, 2:38 p.m. UTC | #29
Ping x3, Torvald.  ;-)

Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> writes:

> Ping x2.
>
> "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes:
>
>> Ping, Torvald.
>>
>>
>> "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes:
>>
>>> Torvald Riegel <triegel@redhat.com> writes:
>>>
>>>> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho
>>>> wrote:
>>>>> Torvald Riegel <triegel@redhat.com> writes:
>>>>> 
>>>>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho
>>>>> >> +   threads, either during lock acquisition, or elision.  In order to avoid
>>>>> >> +   this evaluation from becoming a data race the access of is_lock_free
>>>>> >
>>>>> > It could be a data race because you're not using atomics there, but
>>>>> > that's not the whole point.  (We use the term "data race" specifically
>>>>> > to refer to the C11/C++11 memory model concept of the same name.)
>>>>> > You want to ensure the lock elision synchronization scheme, and thus are
>>>>> > moving it inside the txn.
>>>>> 
>>>>> Are you complaining about the usage of the term "data race"?
>>>>
>>>> Yes.
>>>>
>>>>> If so, what about "race condition"?
>>>>
>>>> Well, that would be better, but a race condition is not necessarily a
>>>> bad thing.  It's usually better to say which execution or interleaving
>>>> you need to avoid, than just saying "race condition".
>>>
>>> What about the following?
>>>
>>> /* CONCURRENCY NOTES:
>>>
>>>    The macro expression is_lock_free is read and possibly written to by
>>>    multiple threads, either during lock acquisition, or elision.  In order to
>>>    avoid this evaluation from becoming a race condition with the lock
>>>    acquirement from the lock primitive, the access of is_lock_free is placed
>>>    *inside* the transaction.  Within the transaction we are assured that all
>>>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>>>    memory order.  That way, the value of is_lock_free is consistent with the
>>>    state of the lock until the end of the transaction.  */
>>
>> Anything else you'd like to change before I push this?
>>
>> Thanks,
Torvald Riegel Oct. 8, 2015, 3:43 p.m. UTC | #30
Sorry for the very long response time.  I've been sick for a while and
am slow in catching up.

On Wed, 2015-08-26 at 13:31 -0300, Tulio Magno Quites Machado Filho
wrote:
> What about the following?
> 
> /* CONCURRENCY NOTES:
> 
>    The macro expression is_lock_free is read and possibly written to by
>    multiple threads, either during lock acquisition, or elision.  In order to
>    avoid this evaluation from becoming a race condition with the lock
>    acquirement from the lock primitive, the access of is_lock_free is placed
>    *inside* the transaction.  Within the transaction we are assured that all
>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>    memory order.  That way, the value of is_lock_free is consistent with the
>    state of the lock until the end of the transaction.  */

I suggest instead:

The evaluation of the macro expression is_lock_free encompasses one or
more loads from memory locations that are concurrently modified by other
threads.  For lock elision to work, this evaluation and the rest of the
critical section protected by the lock must be atomic because an
execution with lock elision must be equivalent to an execution in which
the lock would have been actually acquired and released.  Therefore, we
evaluate is_lock_free inside of the transaction that represents the
critical section for which we want to use lock elision, which ensures
the atomicity that we require.


If you think this is adequate, please commit.
Torvald Riegel Oct. 8, 2015, 3:54 p.m. UTC | #31
On Tue, 2015-09-01 at 16:24 -0500, Paul E. Murphy wrote:
> 
> On 09/01/2015 03:46 PM, Adhemerval Zanella wrote:
> > Indeed the 'odd' comment does not make sense and we should just remove it
> > (I misread texasr definition).  My initial idea was define some codes
> > that set the persistent failures and some that do not.  I think I best
> > approach would be:
> > 
> > /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
> >    and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
> >    code means a persistent error that should trigger a default lock
> >    acquisition.  */
> > #define _ABORT_SYSCALL         0x1   /* Syscall issued.  */
> > #define _ABORT_LOCK_BUSY       0x2   /* Lock already used.  */
> > #define _ABORT_NESTED_TRYLOCK  0x4   /* Write operation in trylock.  */
> 
> The kernel defines several abort codes, we'll want to work with them, or
> recycle them as needed.

Agreed in general.  The abort codes should be part of the ABI, or we'll
get into trouble interpreting them given that transactions can span
across different layers.  I've asked Andi Kleen whether he can make this
happen for x86 (eg, through appropriate documentation somewhere).  It
would be good if this can be clarified/specified on PowerPC too.

Can kernel-level abort codes actually reach userspace, or do we only
need to establish userspace consensus for these codes?  For x86, it's
userspace only AFAIK.

Would it be likely that userspace explicit aborts get interpreted by the
kernel?  They can't be trusted more than one would trust the program.

> I'm not convinced any of the existing codes should be non-persistent:
> 
> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
> guaranteed to fail try_tbegin times if there is no contention on the lock.
> Aborts get increasingly expensive as you increase the amount of speculative
> execution.

Although this depends on the program (e.g., trylock might not be
called all the time), in this case I would also guess that it would be
better to consider it a persistent condition.

> A busy lock likely indicates contention in the critical section which
> does not benefit from elision, I'd err on the side of a persistent
> failure.

I don't think I agree.  An already-acquired lock is something that could
be hit less often by using elision more often on this lock (think about
the "lemming effect").
Paul E. Murphy Oct. 9, 2015, 3:30 p.m. UTC | #32
[Adding Andi if he has an opinion]

On 10/08/2015 10:54 AM, Torvald Riegel wrote:
>> A busy lock likely indicates contention in the critical section which
>> does not benefit from elision, I'd err on the side of a persistent
>> failure.
> 
> I don't think I agree.  An already-acquired lock is something that could
> be hit less likely by using elision more often on this lock (think about
> the "lemming effect").
> 

My line of thinking is that you are already experiencing the effect if
a thread hits a taken lock.  It most likely [1] implies adapt_count > 0,
so you are guaranteed a period of serialization.  The best outcome is
that you manage to run a transaction in between actual lock acquisitions.  The
worst is that you extend the serialization period.

I think perhaps the best of both worlds might be to check the lock, and
set the persistent bit if the lock is busy with waiters. Thoughts?

[1] A pthread condition waiter is an odd exception in that they will set
the lock to busy with waiters when they wake up using an elidable lock,
but not set adapt_count. Also, you could hit the case where adapt_count
goes from 1 -> 0.
Andi Kleen Oct. 9, 2015, 11:46 p.m. UTC | #33
On Fri, Oct 09, 2015 at 10:30:47AM -0500, Paul E. Murphy wrote:
> 
> [Adding Andi if he has an opinion]
> 
> On 10/08/2015 10:54 AM, Torvald Riegel wrote:
> >> A busy lock likely indicates contention in the critical section which
> >> does not benefit from elision, I'd err on the side of a persistent
> >> failure.
> > 
> > I don't think I agree.  An already-acquired lock is something that could
> > be hit less likely by using elision more often on this lock (think about
> > the "lemming effect").
> > 
> 
> My line of thinking is that you are already experiencing the effect if
> a thread hits a taken lock. It most likely [1] implies adapt_count > 0,
> so you are guaranteed a time period of serialization. The best outcome is
> that you manage to run a transaction in between an actual lock. The worst
> being you extend the serialization period.

The lemming effect happens after the taken lock is freed.  It prevents
speedy recovery into the parallel eliding state.

The typical way to mitigate the lemming effect is to wait for the lock
to become free before retrying.  Unfortunately that is difficult
due to the way the glibc mutexes work, as there's no way to do
this with the futex interface.  It would be possible to spin a bit,
but the code doesn't do that.  But to do that you shouldn't do
a persistent abort; you should keep going for a bit.
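
Roughly like the following sketch (the futex pointer and the spin count are
assumptions for illustration; as said, the current code doesn't do this):

/* Sketch only, not what glibc currently does: spin briefly until the lock
   looks free before retrying the transaction, so the whole herd does not
   fall back to the lock at once.  */
#define MAX_ELISION_SPINS 100

static inline int
__sketch_wait_for_unlock (volatile int *futex)
{
  for (int i = 0; i < MAX_ELISION_SPINS; i++)
    if (*futex == 0)	/* Lock looks free again: worth retrying tbegin.  */
      return 1;
  return 0;		/* Still held after spinning: take the lock.  */
}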

> 
> I think perhaps the best of both worlds might be to check the lock, and
> set the persistent bit if the lock is busy with waiters. Thoughts?

There's unfortunately nothing to check when you're experiencing the
lemming effect.  The other users just speculate on their own.

-Andi
Tulio Magno Quites Machado Filho Oct. 19, 2015, 2:01 p.m. UTC | #34
Torvald Riegel <triegel@redhat.com> writes:

> Sorry for the very long response time.

Please do not.  You have been helping me a lot here.  ;-)

> I've been sick since a while and am slow in catching up.

I hope you're better now.

> On Wed, 2015-08-26 at 13:31 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> What about the following?
>> 
>> /* CONCURRENCY NOTES:
>> 
>>    The macro expression is_lock_free is read and possibly written to by
>>    multiple threads, either during lock acquisition, or elision.  In order to
>>    avoid this evaluation from becoming a race condition with the lock
>>    acquirement from the lock primitive, the access of is_lock_free is placed
>>    *inside* the transaction.  Within the transaction we are assured that all
>>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>>    memory order.  That way, the value of is_lock_free is consistent with the
>>    state of the lock until the end of the transaction.  */
>
> I suggest instead:
>
> The evaluation of the macro expression is_lock_free encompasses one or
> more loads from memory locations that are concurrently modified by other
> threads.  For lock elision to work, this evaluation and the rest of the
> critical section protected by the lock must be atomic because an
> execution with lock elision must be equivalent to an execution in which
> the lock would have been actually acquired and released.  Therefore, we
> evaluate is_lock_free inside of the transaction that represents the
> critical section for which we want to use lock elision, which ensures
> the atomicity that we require.
>
>
> If you think this is adequate, please commit.

LGTM.

I will commit.

Thanks again!
diff mbox

Patch

diff --git a/NEWS b/NEWS
index 6b36c08..76df3fd 100644
--- a/NEWS
+++ b/NEWS
@@ -11,7 +11,8 @@  Version 2.23
 
   14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086,
   18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681,
-  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824.
+  18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824,
+  18743.
 
 * The obsolete header <regexp.h> has been removed.  Programs that require
   this header must be updated to use <regex.h> instead.
diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h
index 389f5a5..1cc9890 100644
--- a/sysdeps/powerpc/nptl/elide.h
+++ b/sysdeps/powerpc/nptl/elide.h
@@ -23,67 +23,76 @@ 
 # include <htm.h>
 # include <elision-conf.h>
 
-/* Returns true if the lock defined by is_lock_free as elided.
-   ADAPT_COUNT is a pointer to per-lock state variable. */
-
+/* Get the new value of adapt_count according to the elision
+   configurations.  Returns true if the system should retry again or false
+   otherwise.  */
 static inline bool
-__elide_lock (uint8_t *adapt_count, int is_lock_free)
+__get_new_count (uint8_t *adapt_count)
 {
-  if (*adapt_count > 0)
+  /* A persistent failure indicates that a retry will probably
+     result in another failure.  Use normal locking now and
+     for the next couple of calls.  */
+  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
     {
-      (*adapt_count)--;
+      if (__elision_aconf.skip_lock_internal_abort > 0)
+	*adapt_count = __elision_aconf.skip_lock_internal_abort;
       return false;
     }
-
-  for (int i = __elision_aconf.try_tbegin; i > 0; i--)
-    {
-      if (__builtin_tbegin (0))
-	{
-	  if (is_lock_free)
-	    return true;
-	  /* Lock was busy.  */
-	  __builtin_tabort (_ABORT_LOCK_BUSY);
-	}
-      else
-	{
-	  /* A persistent failure indicates that a retry will probably
-	     result in another failure.  Use normal locking now and
-	     for the next couple of calls.  */
-	  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
-	    {
-	      if (__elision_aconf.skip_lock_internal_abort > 0)
-		*adapt_count = __elision_aconf.skip_lock_internal_abort;
-	      break;
-	    }
-	  /* Same logic as above, but for a number of temporary failures in a
-	     a row.  */
-	  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
-		   && __elision_aconf.try_tbegin > 0)
-	    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
-	}
-     }
-
-  return false;
+  /* Same logic as above, but for a number of temporary failures in a
+     a row.  */
+  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
+	   && __elision_aconf.try_tbegin > 0)
+    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
+  return true;
 }
 
-# define ELIDE_LOCK(adapt_count, is_lock_free) \
-  __elide_lock (&(adapt_count), is_lock_free)
-
-
-static inline bool
-__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write)
-{
-  if (__elision_aconf.try_tbegin > 0)
-    {
-      if (write)
-	__builtin_tabort (_ABORT_NESTED_TRYLOCK);
-      return __elide_lock (adapt_count, is_lock_free);
-    }
-  return false;
-}
+/* CONCURRENCY NOTES:
+
+   The value of is_lock_free is read and possibly written to by multiple
+   threads, either during lock acquisition, or elision.  In order to avoid
+   this evaluation from becoming a data race the access of is_lock_free
+   is placed *inside* the transaction.  Within the transaction is_lock_free
+   can be accessed with relaxed ordering.  We are assured by the transaction
+   that the value of is_lock_free is consistent with the state of the lock,
+   otherwise the hardware will abort the transaction.  */
+
+/* Returns 0 if the lock defined by is_lock_free was elided.
+   ADAPT_COUNT is a per-lock state variable.  */
+# define ELIDE_LOCK(adapt_count, is_lock_free)				\
+  ({									\
+    int ret = 0;							\
+    if (adapt_count > 0)						\
+      (adapt_count)--;							\
+    else								\
+      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
+	{								\
+	  if (__builtin_tbegin (0))					\
+	    {								\
+	      if (is_lock_free)						\
+		{							\
+		  ret = 1;						\
+		  break;						\
+		}							\
+	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
+	    }								\
+	  else								\
+	    if (!__get_new_count(&adapt_count))				\
+	      break;							\
+	}								\
+    ret;								\
+  })
 
 # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
-  __elide_trylock (&(adapt_count), is_lock_free, write)
+  ({								\
+    int ret = 0;						\
+    if (__elision_aconf.try_tbegin > 0)				\
+      {								\
+	if (write)						\
+	  __builtin_tabort (_ABORT_NESTED_TRYLOCK);		\
+	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
+      }								\
+    ret;							\
+  })
 
 
 static inline bool