[BZ #18743] PowerPC: Fix a race condition when eliding a lock

Message ID 1438274936-26493-1-git-send-email-tuliom@linux.vnet.ibm.com
State New

Commit Message

Tulio Magno Quites Machado Filho July 30, 2015, 4:48 p.m. UTC
Note: this patch is updating the NEWS section of glibc 2.22 because that's the
only one available right now.  Whether this patch will end up in glibc 2.23
or 2.22 has yet to be decided.

8<----

The previous code evaluated the preprocessor token is_lock_free into a
variable before starting a transaction.  This behavior can cause an
error if another thread acquires the lock (without using a transaction)
between the evaluation of the token and the beginning of the transaction.

This patch delays the evaluation of is_lock_free to inside a transaction
by moving this part of the code to the macro ELIDE_LOCK.

2015-07-30  Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>

	[BZ #18743]
	* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
	code to...
	(ELIDE_LOCK): ...here.
	(__get_new_count): New function with part of the code from
	__elide_lock that updates the value of adapt_count after a
	transaction abort.
	(__elide_trylock): Move this code to...
	(ELIDE_TRYLOCK): ...here.
---
 NEWS                         |   2 +-
 sysdeps/powerpc/nptl/elide.h | 104 +++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 53 deletions(-)

Comments

Adhemerval Zanella July 30, 2015, 6:43 p.m. UTC | #1
Hi Tulio,

Some comments below:

On 30-07-2015 13:48, Tulio Magno Quites Machado Filho wrote:
> Note: this patch is updating the NEWS section of glibc 2.22 because that's the
> only one available right now.  Whether this patch will end up in glibc 2.23
> or 2.22 has to be defined.
> 
> 8<----
> 
> The previous code used to evaluate the preprocessor token is_lock_free to
> a variable before starting a transaction.  This behavior can cause an
> error if another thread got the lock (without using a transaction)
> between the conversion of the token and the beginning of the transaction.
> 
> This patch delays the evaluation of is_lock_free to inside a transaction
> by moving this part of the code to the macro ELIDE_LOCK.

Could you elaborate more on why this is a race condition on the is_lock_free
variable? My understanding is that 'is_lock_free' is checked only if the
transaction has already started, and it should be safe since 'tbegin.' creates a memory
barrier that immediately precedes the transaction (ISA 2.07, Book II, 
Chapter 1.8):

 40       if (__builtin_tbegin (0))
 41         {
 42           if (is_lock_free)
 43             return true;
 44           /* Lock was busy.  */
 45           __builtin_tabort (_ABORT_LOCK_BUSY);


So if the transaction fails, it will just jump to the else block.

> +    else								\
> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> +	{								\
> +	  asm volatile("" ::: "memory");				\

I can't really understand the requirement of this compiler barrier here.  If
the compiler is moving the 'is_lock_free' access *before* __builtin_tbegin, IMHO
this is a compiler bug.

> +	  if (__builtin_tbegin (0))					\
> +	    {								\
> +	      if (is_lock_free)						\
> +		{							\
> +		  ret = 1;						\
> +		  break;						\
> +		}							\
> +	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	    }								\
> +	  else								\
> +	    if (!__get_new_count(&adapt_count))				\
> +	      break;							\
> +	}								\
> +    ret;								\
> +  })
>  
>  # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
> -  __elide_trylock (&(adapt_count), is_lock_free, write)
> +  ({								\
> +    int ret = 0;						\
> +    if (__elision_aconf.try_tbegin > 0)				\
> +      {								\
> +	if (write)						\
> +	  __builtin_tabort (_ABORT_NESTED_TRYLOCK);		\
> +	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
> +      }								\
> +    ret;							\
> +  })
>  
>  
>  static inline bool
>
Tulio Magno Quites Machado Filho July 31, 2015, 2:06 a.m. UTC | #2
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

>> The previous code used to evaluate the preprocessor token is_lock_free to
>> a variable before starting a transaction.  This behavior can cause an
>> error if another thread got the lock (without using a transaction)
>> between the conversion of the token and the beginning of the transaction.
>> 
>> This patch delays the evaluation of is_lock_free to inside a transaction
>> by moving this part of the code to the macro ELIDE_LOCK.
>
> Could you elaborate more why this is a race-condition on the is_lock_free
> variable? My understanding is 'is_lock_free' is check iff the transaction
> already started and it should be safe since 'tbegin.' creates a memory
> barrier that immediately precedes the transaction (ISA 2.07, Book II, 
> Chapter 1.8):

I completely agree with you here.  The problem is just where the memory access
happens.
And we're dealing with two different problems related to the memory access.
Let me use your example to explain the first one:

>  40       if (__builtin_tbegin (0))
>  41         {
>  42           if (is_lock_free)

If we consider that the memory access happens in this line, there wouldn't be
a problem.  But at this point, is_lock_free is just an int, not the lock
struct we need to access.
In this context, is_lock_free is defined here:

  29 static inline bool
  30 __elide_lock (uint8_t *adapt_count, int is_lock_free)

Keep in mind this macro:

  69 # define ELIDE_LOCK(adapt_count, is_lock_free) \
  70   __elide_lock (&(adapt_count), is_lock_free)

Using pthread_rwlock_rdlock as an example, is_lock_free would be a preprocessor
token with the following contents:

 128   if (ELIDE_LOCK (rwlock->__data.__rwelision,
 129                   rwlock->__data.__lock == 0
 130                   && rwlock->__data.__writer == 0
 131                   && rwlock->__data.__nr_readers == 0))

And this is where the important memory access happens.
If one of these fields is changed by another thread after this point, but
before __builtin_tbegin, we're able to reproduce the first problem I mentioned
previously.
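
To make this concrete, the old macro effectively expanded to something
like this at the call site (an illustrative sketch, not the actual
generated code; the temporary name is hypothetical):

  /* The reads happen here, outside any transaction.  */
  int is_lock_free = rwlock->__data.__lock == 0
                     && rwlock->__data.__writer == 0
                     && rwlock->__data.__nr_readers == 0;
  /* Another thread may take the lock at this point.  */
  __elide_lock (&rwlock->__data.__rwelision, is_lock_free);
  /* ... and __builtin_tbegin only runs inside __elide_lock.  */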

> So if the transaction fails it will just jump to else block.
>
>> +    else								\
>> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> +	{								\
>> +	  asm volatile("" ::: "memory");				\
>
> I can't really understand the requirement of this compiler barrier here.  If
> compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
> this is a compiler bug.

This is the second problem and I agree with you again.  IMHO, __builtin_tbegin
starts a critical path and the compiler should not move memory accesses out
of the critical path or into it.
However, as the current implementations of GCC have this "issue" (some people
may not agree with us), I believe we could carry this compiler barrier at least
until all GCC versions supporting __builtin_tbegin are fixed.

I'm planning to report this to the GCC community.
Carlos O'Donell July 31, 2015, 3:41 a.m. UTC | #3
On 07/30/2015 12:48 PM, Tulio Magno Quites Machado Filho wrote:
> Note: this patch is updating the NEWS section of glibc 2.22 because that's the
> only one available right now.  Whether this patch will end up in glibc 2.23
> or 2.22 has to be defined.

No for 2.22. Please work on 2.23.

You will need to describe in much more detail exactly what synchronization
you are using to ensure that this is race-free now.

For reference look at the detail Siddhesh and Torvald went in to when
fixing TLS concurrency issues here:
https://www.sourceware.org/ml/libc-alpha/2015-07/msg00616.html

If there is any synchronization going on with HTM we need a way to describe
that synchronization and document the concurrency aspect of the change. It
is the only way we'll remember what we did when we come back to look at this
code in 1, 2, 5, 10 years.

Cheers,
Carlos.

Adhemerval Zanella July 31, 2015, 1:24 p.m. UTC | #4
On 30-07-2015 23:06, Tulio Magno Quites Machado Filho wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>>> The previous code used to evaluate the preprocessor token is_lock_free to
>>> a variable before starting a transaction.  This behavior can cause an
>>> error if another thread got the lock (without using a transaction)
>>> between the conversion of the token and the beginning of the transaction.
>>>
>>> This patch delays the evaluation of is_lock_free to inside a transaction
>>> by moving this part of the code to the macro ELIDE_LOCK.
>>
>> Could you elaborate more why this is a race-condition on the is_lock_free
>> variable? My understanding is 'is_lock_free' is check iff the transaction
>> already started and it should be safe since 'tbegin.' creates a memory
>> barrier that immediately precedes the transaction (ISA 2.07, Book II, 
>> Chapter 1.8):
> 
> I completely agree with you here.  The problem is just where the memory access
> happens.
> And we're dealing with 2 different problems related to the memory access.
> Let me use your example to explain the first one:
> 
>>  40       if (__builtin_tbegin (0))
>>  41         {
>>  42           if (is_lock_free)
> 
> If we consider that the memory access happens in this line, there wouldn't be
> a problem.  But in this context, is_lock_free is just an int and not the lock
> struct we need to access.
> In this context, is_lock_free is defined here:
> 
>   29 static inline bool
>   30 __elide_lock (uint8_t *adapt_count, int is_lock_free)
> 
> Keep in mind this macro:
> 
>   69 # define ELIDE_LOCK(adapt_count, is_lock_free) \
>   70   __elide_lock (&(adapt_count), is_lock_free)
> 
> Using pthread_rwlock_rdlock as an example, is_lock_free would be a preprocessor
> token with the following contents:
> 
>  128   if (ELIDE_LOCK (rwlock->__data.__rwelision,
>  129                   rwlock->__data.__lock == 0
>  130                   && rwlock->__data.__writer == 0
>  131                   && rwlock->__data.__nr_readers == 0))
> 
> And this is where the important memory access happens.
> If one of these fields are changed by another thread after this point, but
> before __builtin_tbegin, we're able to reproduce the first problem I mentioned
> previously.

My understanding is these fields will be changed only in the lock-taken path.  Let's
say a thread t1 grabs the lock (either by a transaction or by using a default lock)
and modifies any 'rwlock->__data' fields after the ELIDE_LOCK check and before
the transaction begins on a second thread t2.  My understanding of the issue you
are seeing is that the transaction will start and, since is_lock_free is 1, it won't
abort and will continue as if the lock were taken in both threads.

However my understanding is the transaction won't be able to commit since it
will have a conflict on 'rwlock->__data.lock' or on any memory being accessed
in the critical region.  It will fall back to either retrying the transaction or
taking the normal lock.  Are you saying that the transaction is not being aborted
in the critical region?

> 
>> So if the transaction fails it will just jump to else block.
>>
>>> +    else								\
>>> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>>> +	{								\
>>> +	  asm volatile("" ::: "memory");				\
>>
>> I can't really understand the requirement of this compiler barrier here.  If
>> compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
>> this is a compiler bug.
> 
> This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
> starts a critical path and the compiler should not be moving a memory access out
> of the critical path or into the critical path.
> However, as the current implementations of GCC have this "issue" (some people
> may not agree with us), I believe we could carry this compiler barrier at least
> until all GCC versions supporting __builtin_tbegin are fixed.
> 
> I'm planning to report this to the GCC community.
> 

If compiler is doing memory reordering around HTM builtin it is *clearly* a bug
since transaction begin/end acts a full memory barrier (sync).  You should also
add a compiler barrier in pthread elision code as well.
Tulio Magno Quites Machado Filho July 31, 2015, 2:57 p.m. UTC | #5
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> On 30-07-2015 23:06, Tulio Magno Quites Machado Filho wrote:
>> Using pthread_rwlock_rdlock as an example, is_lock_free would be a preprocessor
>> token with the following contents:
>> 
>>  128   if (ELIDE_LOCK (rwlock->__data.__rwelision,
>>  129                   rwlock->__data.__lock == 0
>>  130                   && rwlock->__data.__writer == 0
>>  131                   && rwlock->__data.__nr_readers == 0))
>> 
>> And this is where the important memory access happens.
>> If one of these fields are changed by another thread after this point, but
>> before __builtin_tbegin, we're able to reproduce the first problem I mentioned
>> previously.
>
> My understanding is this fields will be changed only in the lock taken path. Let's
> say a thread t1 grab the lock (either by a transaction or by using a default lock)
> and modifies any 'rwlock->__data' fields after the ELIDE_LOCK check and before
> the transaction begin on a second thread t2.  My understanding of the issue you
> are seeing is the transaction will initiate and since is_lock_free is 1 it won't
> abort and continue as the a lock taken in both threads.

I agree with you.  But this won't reproduce the bug.  You have to change the
order of events.

Let's say t1 is the first one to read the rwlock->__data fields and they're all 0.
t1 will set is_lock_free to 1.
Then t2 modifies rwlock->__data before t1 starts the transaction (as in: t2 takes
the lock instead of using a transaction).
After that, t1 starts the transaction.  Here comes the problem: the memory
barrier is created with rwlock->__data saying that someone took the lock,
but is_lock_free is still set to 1.  In this scenario, t1 will proceed with the
transaction.
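
As an illustrative sketch (a pseudo-timeline, not real code), the
problematic interleaving with the old __elide_lock is:

  /* t1: is_lock_free = (__lock == 0 && __writer == 0
                         && __nr_readers == 0);     -> evaluates to 1
     t2: takes the lock normally; __data fields become non-zero
     t1: __builtin_tbegin (0)                       -> transaction starts
     t1: if (is_lock_free)                          -> stale 1: t1 elides
                                                       while t2 holds the
                                                       lock  */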

> However my understanding is the transaction won't be able to commit since it
> will have a conflict in the 'rwlock->__data.lock' or in any memory being access
> in the critical region.

Yes, but with the order of events I mentioned, when t1 calls
pthread_rwlock_unlock, it will run the following code:

  38   if (ELIDE_UNLOCK (rwlock->__data.__writer == 0
  39                     && rwlock->__data.__nr_readers == 0))
  40     return 0;

Where ELIDE_UNLOCK is:

  89 static inline bool
  90 __elide_unlock (int is_lock_free)
  91 {
  92   if (is_lock_free)
  93     {
  94       __builtin_tend (0);
  95       return true;
  96     }
  97   return false;
  98 }
  99 
 100 # define ELIDE_UNLOCK(is_lock_free) \
 101   __elide_unlock (is_lock_free)

Remember that t1 started the transaction even though rwlock->__data said someone
was holding the lock.
So now, in __elide_unlock, is_lock_free will be 0.  __elide_unlock will
return false and t1 will unlock a lock acquired by t2.

> Are you saying that the transaction is not being aborted
> in the critical region?

No.  I'm saying the memory access to rwlock->__data is critical and should
happen inside the transaction.  Without this, we can't say the whole process
is atomic.
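
For reference, after the patch the token is substituted textually into
ELIDE_LOCK, so the expansion becomes roughly:

  if (__builtin_tbegin (0))
    {
      /* The reads now happen inside the transaction and are part of
         its read set.  */
      if (rwlock->__data.__lock == 0
          && rwlock->__data.__writer == 0
          && rwlock->__data.__nr_readers == 0)
        {
          ret = 1;
          break;
        }
      __builtin_tabort (_ABORT_LOCK_BUSY);
    }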

> If compiler is doing memory reordering around HTM builtin it is *clearly* a bug
> since transaction begin/end acts a full memory barrier (sync).  You should also
> add a compiler barrier in pthread elision code as well.

Agreed.  But I guess that's out of scope for this particular bug.
I could prepare a separate patch with these changes if you agree with that.
Tulio Magno Quites Machado Filho July 31, 2015, 3:14 p.m. UTC | #6
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 07/30/2015 12:48 PM, Tulio Magno Quites Machado Filho wrote:
>> Note: this patch is updating the NEWS section of glibc 2.22 because that's the
>> only one available right now.  Whether this patch will end up in glibc 2.23
>> or 2.22 has to be defined.
>
> No for 2.22. Please work on 2.23.

Ack.

> You will need to describe in much more detail exactly what synchronization
> you are using to ensure that this is race-free now.

In fact, I'm using the same mechanism already available.  I'm just adding
another step inside the critical path.
But I'll improve my commit message and the comments in the code.

> If there is any synchronization going on with HTM we need a way to describe
> that synchronization and document the concurrency aspect of the change. It
> is the only way we'll remember what we did when we come back to look at this
> code in 1, 2, 5, 10 years.

Agreed.
Torvald Riegel Aug. 3, 2015, 4:21 p.m. UTC | #7
On Thu, 2015-07-30 at 23:06 -0300, Tulio Magno Quites Machado Filho
wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> > So if the transaction fails it will just jump to else block.
> >
> >> +    else								\
> >> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> >> +	{								\
> >> +	  asm volatile("" ::: "memory");				\
> >
> > I can't really understand the requirement of this compiler barrier here.  If
> > compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
> > this is a compiler bug.
> 
> This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
> starts a critical path and the compiler should not be moving a memory access out
> of the critical path or into the critical path.
> However, as the current implementations of GCC have this "issue" (some people
> may not agree with us), I believe we could carry this compiler barrier at least
> until all GCC versions supporting __builtin_tbegin are fixed.

I agree that a compiler should treat transaction begin / commit as
something like a lock acquisition / release, assuming that the
transaction semantics are equivalent to single-global-lock semantics.

As far as glibc is concerned, I believe it's better to work around the
compiler issue by creating a glibc-internal wrapper for __builtin_tbegin
that enforces the semantics we believe it should have (eg, by adding the
compiler barrier).
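
Such a wrapper could look roughly like this (a minimal sketch; the name
and exact barrier placement are hypothetical):

  /* Hypothetical glibc-internal wrapper adding the compiler barriers
     the builtin should imply.  */
  static inline bool
  __libc_tbegin (void)
  {
    bool ret;
    asm volatile ("" ::: "memory");  /* No accesses move across here.  */
    ret = __builtin_tbegin (0);
    asm volatile ("" ::: "memory");  /* No accesses move across here.  */
    return ret;
  }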
Torvald Riegel Aug. 3, 2015, 4:28 p.m. UTC | #8
On Fri, 2015-07-31 at 12:14 -0300, Tulio Magno Quites Machado Filho
wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
> > On 07/30/2015 12:48 PM, Tulio Magno Quites Machado Filho wrote:
> >> Note: this patch is updating the NEWS section of glibc 2.22 because that's the
> >> only one available right now.  Whether this patch will end up in glibc 2.23
> >> or 2.22 has to be defined.
> >
> > No for 2.22. Please work on 2.23.
> 
> Ack.
> 
> > You will need to describe in much more detail exactly what synchronization
> > you are using to ensure that this is race-free now.
> 
> In fact, I'm using the same mechanism already available.  I'm just adding
> another step inside the critical path.
> But I'll improve my commit message and the comments in the code.
> 
> > If there is any synchronization going on with HTM we need a way to describe
> > that synchronization and document the concurrency aspect of the change. It
> > is the only way we'll remember what we did when we come back to look at this
> > code in 1, 2, 5, 10 years.
> 
> Agreed.
> 

I also think using relaxed-MO atomic operations inside the transactions
is better for accesses to data that is accessed concurrently outside of
transactions too.  The HW txn will ensure atomicity and thus there's no
difference no matter what code the compiler generates *inside* of a
transaction (it shouldn't generally move past txn boundaries, of
course).  Nonetheless, using atomic accesses will improve clarity of the
code.
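
As an illustrative sketch of that (using glibc's atomic_load_relaxed;
the rwlock fields are just one instantiation of is_lock_free):

  if (__builtin_tbegin (0))
    {
      /* Relaxed MO suffices: the transaction itself provides
         atomicity.  */
      if (atomic_load_relaxed (&rwlock->__data.__lock) == 0
          && atomic_load_relaxed (&rwlock->__data.__writer) == 0
          && atomic_load_relaxed (&rwlock->__data.__nr_readers) == 0)
        ret = 1;
      else
        __builtin_tabort (_ABORT_LOCK_BUSY);
    }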
Tulio Magno Quites Machado Filho Aug. 19, 2015, 8:55 p.m. UTC | #9
Torvald Riegel <triegel@redhat.com> writes:

> On Thu, 2015-07-30 at 23:06 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>> > So if the transaction fails it will just jump to else block.
>> >
>> >> +    else								\
>> >> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> >> +	{								\
>> >> +	  asm volatile("" ::: "memory");				\
>> >
>> > I can't really understand the requirement of this compiler barrier here.  If
>> > compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
>> > this is a compiler bug.
>> 
>> This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
>> starts a critical path and the compiler should not be moving a memory access out
>> of the critical path or into the critical path.
>> However, as the current implementations of GCC have this "issue" (some people
>> may not agree with us), I believe we could carry this compiler barrier at least
>> until all GCC versions supporting __builtin_tbegin are fixed.
>
> I agree that a compiler should treat transaction begin / commit as
> something like a lock acquisition / release, assuming that the
> transaction semantics are equivalence to a single-global-lock semantics.

This has been reported to GCC as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67281

> As far as glibc is concerned, I believe it's better to work around the
> compiler issue by creating a glibc-internal wrapper for __builtin_tbegin
> that enforces the semantics we believe it should have (eg, by adding the
> compiler barrier).

Agreed.  In fact, that's the right way of solving this issue.
The way I wrote my patch left some room for an instruction to be placed
between the compiler barrier and the beginning of the transaction.

Anyway, I believe that's another problem and should not be confused with the
previous one.  So, I'll split this patch in 2: one that fixes BZ #18743 and
another one that fixes the compiler barriers on powerpc.  I'll synchronize
with GCC so that we use the same semantics (or as close as we can).
Peter Bergner Aug. 20, 2015, 8:48 p.m. UTC | #10
On Thu, 2015-07-30 at 23:06 -0300, Tulio Magno Quites Machado Filho wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> >> +    else								\
> >> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> >> +	{								\
> >> +	  asm volatile("" ::: "memory");				\
> >
> > I can't really understand the requirement of this compiler barrier here.  If
> > compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
> > this is a compiler bug.
> 
> This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
> starts a critical path and the compiler should not be moving a memory access out
> of the critical path or into the critical path.
> However, as the current implementations of GCC have this "issue" (some people
> may not agree with us), I believe we could carry this compiler barrier at least
> until all GCC versions supporting __builtin_tbegin are fixed.

I completely agree we need a barrier here, so I will fix that.  The question
I'm starting to wonder about is whether we need more than just a memory barrier:
do we need a complete optimization barrier, so that no code (even scalar code)
will move past __builtin_tbegin() (and the other HTM builtins)?

Let's look at the following test case:

  long
  foo (long dest, long src0, long src1, long tries)
  {
    long i;
    for (i = 0; i < tries; i++)
      {
        if (__builtin_tbegin (0))
	  {
	    dest = src0 + src1;
	    if (dest == 13)
	      __builtin_tabort(0);
	    __builtin_tend (0);
	  }
      }
    return dest;
  }

If I compile this with -O3 -mcpu=power8 -S, I see the "src0 + src1"
expression hoisted out of the loop, which I can accept as "ok", since
both src0 and src1 are constant for the whole function.  However, I
am also seeing the assignment of dest and the compare of dest
with 13 being moved outside the tbegin/tend pair.  This doesn't
seem correct to me, since the update to dest should only ever be
observable if one or more of the transactions succeeds.  Torvald,
Adhemerval and Tulio, do you guys agree with that assessment?

Peter
Tulio Magno Quites Machado Filho Aug. 21, 2015, 3:19 p.m. UTC | #11
Peter Bergner <bergner@vnet.ibm.com> writes:

> On Thu, 2015-07-30 at 23:06 -0300, Tulio Magno Quites Machado Filho wrote:
>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>> >> +    else								\
>> >> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
>> >> +	{								\
>> >> +	  asm volatile("" ::: "memory");				\
>> >
>> > I can't really understand the requirement of this compiler barrier here.  If
>> > compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
>> > this is a compiler bug.
>> 
>> This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
>> starts a critical path and the compiler should not be moving a memory access out
>> of the critical path or into the critical path.
>> However, as the current implementations of GCC have this "issue" (some people
>> may not agree with us), I believe we could carry this compiler barrier at least
>> until all GCC versions supporting __builtin_tbegin are fixed.
>
> I completely agree we need a barrier here, so I will fix that.  The question
> I'm starting to wonder, is do we need more than just a memory barrier or
> do we need a complete optimization barrier, so no code (even scalar code)
> will move past the __builtin_tbegin() (and other HTM builtins)?
> ...
> observable if one or more of the transactions succeeds.  Torvald,
> Adhemerval and Tulio, do you guys agree with that assessment?

I agree with you.
The compiler shouldn't move instructions that affect memory or any of the
checkpointed registers.
Peter Bergner Aug. 21, 2015, 3:43 p.m. UTC | #12
On Fri, 2015-08-21 at 12:19 -0300, Tulio Magno Quites Machado Filho wrote:
> Peter Bergner <bergner@vnet.ibm.com> writes:
> > I completely agree we need a barrier here, so I will fix that.  The question
> > I'm starting to wonder, is do we need more than just a memory barrier or
> > do we need a complete optimization barrier, so no code (even scalar code)
> > will move past the __builtin_tbegin() (and other HTM builtins)?
> > ...
> > observable if one or more of the transactions succeeds.  Torvald,
> > Adhemerval and Tulio, do you guys agree with that assessment?
> 
> I agree with you.
> The compiler shouldn't move instructions that affect memory or any of the
> checkpointed registers.

Interestingly, Intel seems to have a similar issue:

    https://gcc.gnu.org/PR63672
    https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02999.html

...although Andi only tried adding memory barriers to the builtins
and not full optimization barriers.  It seems the above patch never
made it into trunk.

Peter
Torvald Riegel Aug. 23, 2015, 12:16 p.m. UTC | #13
On Thu, 2015-08-20 at 15:48 -0500, Peter Bergner wrote:
> On Thu, 2015-07-30 at 23:06 -0300, Tulio Magno Quites Machado Filho wrote:
> > Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> > >> +    else								\
> > >> +      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
> > >> +	{								\
> > >> +	  asm volatile("" ::: "memory");				\
> > >
> > > I can't really understand the requirement of this compiler barrier here.  If
> > > compiler is moving around the 'is_lock_free' *before* __builtin_tbegin IMHO
> > > this is a compiler bug.
> > 
> > This is the second problem and I agree with you again.  IMHO,  __builtin_tbegin
> > starts a critical path and the compiler should not be moving a memory access out
> > of the critical path or into the critical path.
> > However, as the current implementations of GCC have this "issue" (some people
> > may not agree with us), I believe we could carry this compiler barrier at least
> > until all GCC versions supporting __builtin_tbegin are fixed.
> 
> I completely agree we need a barrier here, so I will fix that.  The question
> I'm starting to wonder, is do we need more than just a memory barrier or
> do we need a complete optimization barrier, so no code (even scalar code)
> will move past the __builtin_tbegin() (and other HTM builtins)?

Given that the compiler is involved, I believe we need to specify
semantics on the level of language-level memory models.  (Otherwise, for
example if we consider compiler implementation internals such as
reordering, we would have to consider a lot more possibilities -- which
should instead result from the specification.)

First, let's just look at the concurrency aspect of this.

If we consider just transactions with semantics as in the ISO C++
Transactional Memory TS, then we have similar semantics as for a
critical section.  Specifically, tbegin is like a lock acquisition, and
tend is a lock release (with the lock being a single global
implementation-internal lock).

The builtins may want to give additional guarantees regarding mixing
non-transactional and transactional synchronization, to get "closer" to
the strong isolation property HTMs typically guarantee.  However, I
think to be practical, all such semantics would require that any
nontransactional access that is meant to be concurrent with a
transactional access to that same location would have to use atomics for
the former (or a transaction).
If it is required to use atomics for the access within the transaction,
I don't see anything different to doing that in a critical section.
This would be the case for C++ anyway because of the atomic types
separating atomic from nonatomic accesses, currently.

With the atomic builtins, we can access nonatomic types, and can also
use plain accesses (ie, in the transaction).  But even if this is what
we want to allow, I don't yet see a case that would be different from
the case of a critical section.

Second, beyond concurrency, we have to consider whether we get
additional constraints on code generation because of txn aborts.  I
don't think we do because tbegin's result is unknown at compile time, as
well as what it does internally.  From the perspective of code
generation, it either will execute the txn atomically, or will execute
the txn abort/failure path.  Thus, even if at runtime, a call to tabort
decides about the outcome of a previous tbegin, this isn't
distinguishable at compile time because of the atomic execution of txns.

Therefore, I don't think aborts result in constraints that are not yet
there anyway due to tbegin returning a value that is unknown at compile
time.

> Let's look at the following test case:
> 
>   long
>   foo (long dest, long src0, long src1, long tries)
>   {
>     long i;
>     for (i = 0; i < tries; i++)
>       {
>         if (__builtin_tbegin (0))
> 	  {
> 	    dest = src0 + src1;
> 	    if (dest == 13)
> 	      __builtin_tabort(0);
> 	    __builtin_tend (0);
> 	  }
>       }
>     return dest;
>   }
> 
> If I compile this with -O3 -mcpu=power8 -S, I see the "src0 + src1"
> expression hoisted out of the loop, which I can accept as "ok", since
> both src0 and src1 are constant for the whole function.

Yes.  src0 and src1 are observable neither to other threads nor to the tbegin,
so we're back at sequential code for these.  So, no matter what tbegin
returns, the result for these will be the same.

> However, I
> am also seeing the assignment of dest as well as the compare of dest
> with 13 also being moved outside the tbegin/tend pair.  This doesn't
> seem correct to me, since the update to dest should only ever be
> observable if one or more of the transactions succeeds.

I think doing the compare outside is fine.  The assignment to dest could
be done outside the txn if this does not mess with the other code path
for when tbegin returns false.  For example, the following would be a
correct code transformation because we speculate the assignment to dest
and correct any misspeculation on the txn-failure path:
  tmp = dest;
  dest = src0 + src1;
  flag = dest == 13;
  if (__builtin_tbegin (0))
    {
      if (flag)
        __builtin_tabort(0);
      __builtin_tend (0);
    }
  else
    dest = tmp;

In this case, there isn't anything txn-specific about tbegin.  If the
compiler is making the assignment to dest visible on the tbegin-failure
path too, and returning this value, then maybe the compiler believes
that the tbegin return value is known or constant?

To summarize, I think tbegin needs to have lock acquisition semantics
and an unknown return value, and tend needs to have lock release
semantics.

Does everyone agree with this reasoning?

Peter (or someone else), could you take care of documenting this
reasoning (or whatever reasoning we agree on) in the GCC sources for the
HTM builtins (ie, Power, s390, and Intel)?  I can review such a patch,
and can also help with the wording if necessary.  Thanks!
Peter Bergner Sept. 1, 2015, 5:01 p.m. UTC | #14
On Sun, 2015-08-23 at 14:16 +0200, Torvald Riegel wrote:
> On Thu, 2015-08-20 at 15:48 -0500, Peter Bergner wrote:
> > Let's look at the following test case:
> > 
> >   long
> >   foo (long dest, long src0, long src1, long tries)
> >   {
> >     long i;
> >     for (i = 0; i < tries; i++)
> >       {
> >         if (__builtin_tbegin (0))
> > 	  {
> > 	    dest = src0 + src1;
> > 	    if (dest == 13)
> > 	      __builtin_tabort(0);
> > 	    __builtin_tend (0);
> > 	  }
> >       }
> >     return dest;
> >   }
> > 
> > If I compile this with -O3 -mcpu=power8 -S, I see the "src0 + src1"
> > expression hoisted out of the loop, which I can accept as "ok", since
> > both src0 and src1 are constant for the whole function.

Slight update to the above.  I had too many test cases lying around and
I seem to have confused the output of one test case with the source of
another.  The actual source for what I saw above did not have the if ()
around the __builtin_tbegin (0), so we always executed the transaction
body.  That said, looking at the asm for the source above, the add of
src0 and src1 is still hoisted out of the loop.


> Yes.  src0 and src1 are not observable to other threads nor the tbegin,
> so we're back at sequential code for these.  So, no matter what tbegin
> returns, the result for these will be the same.

Agreed.


> > However, I
> > am also seeing the assignment of dest as well as the compare of dest
> > with 13 also being moved outside the tbegin/tend pair.  This doesn't
> > seem correct to me, since the update to dest should only ever be
> > observable if one or more of the transactions succeeds.

Looking at the real asm for the test case above (and not the one without
the if ()), we get basically the following C code:

long
foo (long dest, long src0, long src1, long tries)
{
  long i;

  if (tries < 0)
    return dest;

  tmp = src0 + src1;  // expression hoisted out of loop
  cond = tmp == 13;   // comparison and branch moved out of loop
  if (cond)           // cloned loop bodies for dest == and != 13
    {
      // src0 + src1 == 13
      for (i=0; i < tries; i++)
        {
          if (!__builtin_tbegin (0))
            continue;
          __builtin_tabort (0);
          __builtin_tend (0);
          dest = 13;
        }
        return dest;
    }
  else
    {
      // src0 + src1 != 13
      for (i=0; i < tries; i++)
        {
          if (!__builtin_tbegin (0))
            continue;
          __builtin_tend (0);
          dest = tmp;
        }
        return dest;
    }
}


> I think doing the compare outside is fine.

Agreed.


> The assignment to dest could be done outside the txn if this does not
> mess with the other code path for when tbegin returns false.  

Looking at the two loops the optimizer created, I have to agree with
you on this one too.  In the first loop (src0 + src1 == 13), the tbegin.
is guaranteed to fail, since either it will fail due to some internal
conflict or we'll unconditionally execute the tabort.  This will cause
us to skip the dest = 13 statement.  In the second loop, if the tbegin.
fails, we'll skip the dest = tmp statement.  If it succeeds, we'll
execute the tend. and dest = tmp statements which is what we expect.
So again, I agree with you.  When I was first looking at the generated
asm, I was thinking that the update of dest was somehow always being
performed, but I see that is not the case.

That said, if I change the code so that src0 and src1 are pointers
and the setting of dest is changed to:

    dest = *src0 + *src1;

I'm seeing the loads of *src0 and *src1 hoisted out of the loop,
which is clearly bad!  This happens because the TM hardware instructions
are not marked as memory barriers.  This is true for POWER, Intel and
S390!
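
For reference, the pointer variant of the test case (same structure as
before, only the operands change):

  long
  foo (long dest, long *src0, long *src1, long tries)
  {
    long i;
    for (i = 0; i < tries; i++)
      {
        if (__builtin_tbegin (0))
          {
            /* These loads must stay inside the transaction.  */
            dest = *src0 + *src1;
            if (dest == 13)
              __builtin_tabort (0);
            __builtin_tend (0);
          }
      }
    return dest;
  }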



> To summarize, I think tbegin needs to have lock acquisition semantics
> and an unknown return value, and tend needs to have lock release
> semantics.
> 
> Does everyone agree with this reasoning?

I agree.  Since POWER's (and I also believe Intel's and S390's) TM
hardware instructions have at least those semantics, I believe the
only thing the compiler needs to do is enforce a memory barrier on
those instruction patterns, so the compiler won't schedule loads and
stores across those instructions, correct?

My patch to do this on POWER fixes the unit test cases I have and I
assume the TLE issue we ran into.  Intel and S390 will also need a
similar patch.  I'll mention that as part of my patch, I created a
__TM_FENCE__ macro so the user can tell whether the TM insn patterns
have been fixed to act as fences or not.
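
Callers could then do something like this (hypothetical use of that
macro):

  #ifdef __TM_FENCE__
    /* The HTM insn patterns already act as compiler fences.  */
    ret = __builtin_tbegin (0);
  #else
    /* Older GCC: add the compiler barrier manually.  */
    asm volatile ("" ::: "memory");
    ret = __builtin_tbegin (0);
  #endif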


> Second, beyond concurrency, we have to consider whether we get
> additional constraints on code generation because of txn aborts.

The only additional constraint I can think of that might be useful
is to mark the tabort. as noreturn.  However, noreturn is a function
attribute, so we can't really attach it to a TM insn pattern.
Even if we could, it wouldn't make sense to attach it to our
conditional transaction abort instructions.  Probably users should just
follow their __builtin_tabort (0) uses with __builtin_unreachable().
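
That is, something like this sketch:

  if (dest == 13)
    {
      __builtin_tabort (0);
      /* Inside a transaction the unconditional abort never returns,
         so tell the compiler this point is unreachable.  */
      __builtin_unreachable ();
    }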

I will mention that I did end up adding memory barriers to our tabort
and conditional tabort insn patterns.  It's "safe" and I assume not
much of a performance issue, since aborting a transaction should be
an uncommon occurrence and the tabort is also probably slow too.
I also had to add it to our tsuspend and tresume patterns too, since
we don't want loads/stores moving from a transaction area to a
non-transaction area and vice versa, so it made sense to just
attach the memory barrier to all our transaction insn patterns.



> Peter (or someone else), could you take care of documenting this
> reasoning (or whatever reasoning we agree on) in the GCC sources for the
> HTM builtins (ie, Power, s390, and Intel)?  I can review such a patch,
> and can also help with the wording if necessary.  Thanks!

I'll take a stab at adding this for power and we'll see how generic
it is and whether we can just add it as is or not for Intel and S390.

Thanks for your input!

Peter
Peter Bergner Sept. 3, 2015, 10:07 p.m. UTC | #15
On Sun, 2015-08-23 at 14:16 +0200, Torvald Riegel wrote:
> To summarize, I think tbegin needs to have lock acquisition semantics
> and an unknown return value, and tend needs to have lock release
> semantics.
> 
> Does everyone agree with this reasoning?
> 
> Peter (or someone else), could you take care of documenting this
> reasoning (or whatever reasoning we agree on) in the GCC sources for the
> HTM builtins (ie, Power, s390, and Intel)?  I can review such a patch,
> and can also help with the wording if necessary.  Thanks!

Ok, I submitted the POWER patch to fix this along with some verbiage
about the memory consistency semantics.  Torvald, is the documentation
I added in the patch below what you wanted?

    https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00315.html

Peter
Torvald Riegel Oct. 8, 2015, 4:13 p.m. UTC | #16
On Tue, 2015-09-01 at 12:01 -0500, Peter Bergner wrote:
> On Sun, 2015-08-23 at 14:16 +0200, Torvald Riegel wrote:
> > To summarize, I think tbegin needs to have lock acquisition semantics
> > and an unknown return value, and tend needs to have lock release
> > semantics.
> > 
> > Does everyone agree with this reasoning?
> 
> I agree.  Since POWER's and I also believe Intel's and S390's TM
> hardware instructions have at least those semantics, I believe the
> only thing the compiler needs to do is enforce a memory barrier on
> those instruction patterns, so the compiler won't schedule loads and
> stores across those instructions, correct?

Yes, I think a (code motion) barrier is sufficient.  In the future,
these could be acquire/release-like code motion barriers so that we can
move code into transactions, but we don't have these yet in GCC.  Also,
moving code into a transaction is fine from a safety point of view, but
the compiler would still have to ensure that forward progress
requirements aren't violated.  Thus, just not moving code into or out of
transactions seems best.

> My patch to do this on POWER fixes the unit test cases I have and I
> assume the TLE issue we ran into.  Intel and S390 will also need a
> similar patch.  I'll mention that as part of my patch, I created a
> __TM_FENCE__ macro so the user can tell whether the TM insn patterns
> have been fixed to act as fences or not.
> 
> 
> > Second, beyond concurrency, we have to consider whether we get
> > additional constraints on code generation because of txn aborts.
> 
> The only additional constraint I can think of that might be useful,
> is to mark the tabort. as noreturn.  However, noreturn is a function
> attribute, so we can't really attach it to a TM insn pattern.
> Even if we could, it wouldn't make sense to attach it to our
> conditional transaction abort instructions.  Probably users should just
> follow their __builtin_tabort (0) uses with __builtin_unreachable().
> 
> I will mention that I did end up adding memory barriers to our tabort
> and conditional tabort insn patterns.  It's "safe" and I assume not
> much of a performance issue, since aborting a transaction should be
> an uncommon occurrence and the tabort is also probably slow too.

Agreed.  (But it might not be strictly necessary, as long as the
compiler knows that tabort has an effect that can't be undone.  If, for
example, a tabort will happen anyway, it doesn't matter whether some
memory access is moved to after the tabort.  Moving it before the tabort is
not always possible because the access might have an effect like raising
a segfault.)

> I also had to add it to our tsuspend and tresume patterns too, since
> we don't want loads/stores moving from a transaction area to a
> non-transaction area and vice versa, so it made sense to just
> attach the memory barrier to all our transaction insn patterns.

Good point.

Patch

diff --git a/NEWS b/NEWS
index 5698310..92996d5 100644
--- a/NEWS
+++ b/NEWS
@@ -28,7 +28,7 @@  Version 2.22
   18536, 18539, 18540, 18542, 18544, 18545, 18546, 18547, 18549, 18553,
   18557, 18558, 18569, 18583, 18585, 18586, 18592, 18593, 18594, 18602,
   18612, 18613, 18619, 18633, 18641, 18643, 18648, 18657, 18676, 18694,
-  18696.
+  18696, 18743.
 
 * Cache information can be queried via sysconf() function on s390 e.g. with
   _SC_LEVEL1_ICACHE_SIZE as argument.
diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h
index 389f5a5..1d04ee9 100644
--- a/sysdeps/powerpc/nptl/elide.h
+++ b/sysdeps/powerpc/nptl/elide.h
@@ -23,67 +23,67 @@ 
 # include <htm.h>
 # include <elision-conf.h>
 
-/* Returns true if the lock defined by is_lock_free as elided.
-   ADAPT_COUNT is a pointer to per-lock state variable. */
-
+/* Get the new value of adapt_count according to the elision
+   configurations.  Returns true if the system should retry again or false
+   otherwise.  */
 static inline bool
-__elide_lock (uint8_t *adapt_count, int is_lock_free)
+__get_new_count (uint8_t *adapt_count)
 {
-  if (*adapt_count > 0)
+  /* A persistent failure indicates that a retry will probably
+     result in another failure.  Use normal locking now and
+     for the next couple of calls.  */
+  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
     {
-      (*adapt_count)--;
+      if (__elision_aconf.skip_lock_internal_abort > 0)
+	*adapt_count = __elision_aconf.skip_lock_internal_abort;
       return false;
     }
-
-  for (int i = __elision_aconf.try_tbegin; i > 0; i--)
-    {
-      if (__builtin_tbegin (0))
-	{
-	  if (is_lock_free)
-	    return true;
-	  /* Lock was busy.  */
-	  __builtin_tabort (_ABORT_LOCK_BUSY);
-	}
-      else
-	{
-	  /* A persistent failure indicates that a retry will probably
-	     result in another failure.  Use normal locking now and
-	     for the next couple of calls.  */
-	  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
-	    {
-	      if (__elision_aconf.skip_lock_internal_abort > 0)
-		*adapt_count = __elision_aconf.skip_lock_internal_abort;
-	      break;
-	    }
-	  /* Same logic as above, but for a number of temporary failures in a
-	     a row.  */
-	  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
-		   && __elision_aconf.try_tbegin > 0)
-	    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
-	}
-     }
-
-  return false;
+  /* Same logic as above, but for a number of temporary failures in a
+     a row.  */
+  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
+	   && __elision_aconf.try_tbegin > 0)
+    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
+  return true;
 }
 
-# define ELIDE_LOCK(adapt_count, is_lock_free) \
-  __elide_lock (&(adapt_count), is_lock_free)
-
-
-static inline bool
-__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write)
-{
-  if (__elision_aconf.try_tbegin > 0)
-    {
-      if (write)
-	__builtin_tabort (_ABORT_NESTED_TRYLOCK);
-      return __elide_lock (adapt_count, is_lock_free);
-    }
-  return false;
-}
+/* Returns 0 if the lock defined by is_lock_free was elided.
+   ADAPT_COUNT is a per-lock state variable.  */
+# define ELIDE_LOCK(adapt_count, is_lock_free)				\
+  ({									\
+    int ret = 0;							\
+    if (adapt_count > 0)						\
+      (adapt_count)--;							\
+    else								\
+      for (int i = __elision_aconf.try_tbegin; i > 0; i--)		\
+	{								\
+	  asm volatile("" ::: "memory");				\
+	  if (__builtin_tbegin (0))					\
+	    {								\
+	      if (is_lock_free)						\
+		{							\
+		  ret = 1;						\
+		  break;						\
+		}							\
+	      __builtin_tabort (_ABORT_LOCK_BUSY);			\
+	    }								\
+	  else								\
+	    if (!__get_new_count(&adapt_count))				\
+	      break;							\
+	}								\
+    ret;								\
+  })
 
 # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
-  __elide_trylock (&(adapt_count), is_lock_free, write)
+  ({								\
+    int ret = 0;						\
+    if (__elision_aconf.try_tbegin > 0)				\
+      {								\
+	if (write)						\
+	  __builtin_tabort (_ABORT_NESTED_TRYLOCK);		\
+	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
+      }								\
+    ret;							\
+  })
 
 
 static inline bool