
Add adaptive elision to rwlocks

Message ID 1396652083-18920-1-git-send-email-andi@firstfloor.org
State New

Commit Message

Andi Kleen April 4, 2014, 10:54 p.m. UTC
From: Andi Kleen <ak@linux.intel.com>

This patch relies on the C version of the rwlocks posted earlier.
With C rwlocks it is very straightforward to do adaptive elision
using TSX. It is based on the infrastructure added earlier
for mutexes, but uses its own elision macros. The macros
are fairly general purpose and could be used for other
elision purposes too.

This version is much cleaner than the earlier assembler-based
version, and in particular implements adaptation, which makes
it safer.

I changed the behavior slightly to not require any changes
in the test suite and to fully conform to all expected
behaviors (generally at the cost of not eliding in
various situations).  In particular this means the timedlock
variants are not elided, and a nested trylock aborts the
transaction.

nptl/:

2014-04-04  Andi Kleen  <ak@linux.intel.com>

	* pthread_rwlock_rdlock.c: Include elide.h.
	(pthread_rwlock_rdlock): Add elision.
	* pthread_rwlock_wrlock.c: Include elide.h.
	(pthread_rwlock_wrlock): Add elision.
	* pthread_rwlock_trywrlock.c: Include elide.h.
	(pthread_rwlock_trywrlock): Add elision.
	* pthread_rwlock_tryrdlock.c: Include elide.h.
	(pthread_rwlock_tryrdlock): Add elision.
	* pthread_rwlock_unlock.c: Include elide.h.
	(pthread_rwlock_unlock): Add elision unlock.
	* sysdeps/pthread/pthread.h (__PTHREAD_RWLOCK_ELISION_EXTRA):
	Define to 0 when not already defined.
	(PTHREAD_RWLOCK_INITIALIZER,
	PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP):
	Handle new elision field.
	* sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h:
	(pthread_rwlock_t): Change __pad1 to __rwelision.
	(__PTHREAD_RWLOCK_ELISION_EXTRA): Add.
	* sysdeps/unix/sysv/linux/x86/elide.h:
	New file. Add generic elision macros.
	* sysdeps/unix/sysv/linux/x86/elision-conf.c (elision_init):
	Set retry_try_xbegin to zero when RTM is not available.
---
 nptl/pthread_rwlock_rdlock.c                       |  8 ++-
 nptl/pthread_rwlock_tryrdlock.c                    |  8 ++-
 nptl/pthread_rwlock_trywrlock.c                    |  8 ++-
 nptl/pthread_rwlock_unlock.c                       |  5 ++
 nptl/pthread_rwlock_wrlock.c                       |  8 ++-
 nptl/sysdeps/pthread/pthread.h                     | 10 ++-
 .../unix/sysv/linux/x86/bits/pthreadtypes.h        |  6 +-
 nptl/sysdeps/unix/sysv/linux/x86/elide.h           | 81 ++++++++++++++++++++++
 nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c    |  2 +
 9 files changed, 127 insertions(+), 9 deletions(-)
 create mode 100644 nptl/sysdeps/unix/sysv/linux/x86/elide.h

Comments

Roland McGrath April 4, 2014, 11:45 p.m. UTC | #1
Please preserve the two blank lines between #include's and code.

Use <> rather than "" for #include of any file that might be sysdeps (elide.h).

elide.h is missing the standard header comment (explanatory line plus
license header).

Use uint8_t rather than 'unsigned char' for a one-byte counter
(perhaps not in the installed header if it's one that doesn't/can't
rely on <stdint.h>, but certainly in the rest of the code).

Never use bare 'unsigned', always 'unsigned int'.  (For the type
returned by _xbegin et al, perhaps uint32_t would be better, but
rtmintrin.h uses 'unsigned int', so that is probably right.)
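
For reference, the RTM interface in question is declared along these
lines (paraphrased from GCC's rtmintrin.h; the patch's hle.h provides
equivalent inline versions):

    #define _XBEGIN_STARTED (~0u)   /* all-ones: transaction started */
    unsigned int _xbegin (void);    /* _XBEGIN_STARTED or abort status */
    void _xend (void);              /* commit the transaction */
    void _xabort (unsigned int);    /* abort with an 8-bit code; the
                                       argument must be a constant */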

GNU style for function definitions is:
	type
	name (argtype ...)
not:
	type name(argtype ...)

Use bool rather than int (and #include <stdbool.h>) for Booleans
(e.g. elision_adapt).

Use C99 style, defining local variables only at the site of first
setting (including inside the 'for' initializer clause).

There is no need for braces around an if body that is a single
statement, even if that statement is for or while.  (It's not a real
problem to have the extra braces if you strongly prefer them, but
don't expect they will necessarily be preserved.)

__rwelision is an unsigned type, yet ELIDE_LOCK tests with <= 0.
Make it == 0.

In elision_adapt you have two cases of:

	if (var1 != var2)
	  var1 = var2;

If it is meaningfully worthwhile to write that instead of simply:

	var1 = var2;

then it warrants a comment explaining why.

I see no reason why ELIDE_*LOCK must be macros rather than functions.
If you really know better than the compiler it's important that they
be inlined, then use always_inline.  Give each a comment describing
its interface contract.  (AFAICT elision_adapt should just be rolled
directly into elide_lock.)  

Oh, I see, you are passing in an expression to be evaluated inside.
Can you try passing in the address of a nested function marked as
always_inline and see if the compiler manages to inline everything?
Hmm.  But it's the same expression every time!  So why the false
generality?  They could just be inline functions taking the
pthread_rwlock_t pointer directly.


Thanks,
Roland
Andi Kleen April 5, 2014, 12:37 a.m. UTC | #2
> __rwelision is an unsigned type, yet ELIDE_LOCK tests with <= 0.
> Make it == 0.

Thanks for the review.  I'll do these changes.

> 
> In elision_adapt you have two cases of:
> 
> 	if (var1 != var2)
> 	  var1 = var2;
> 
> If it is meaningfully worthwhile to write that instead of simply:
> 
> 	var1 = var2;

This is an important optimization to avoid unnecessary cache line
transitions and aborts. In general unnecessary writes
can be very expensive in parallel situations, as hardware
may need to do a lot of work to transition the cache line
from SHARED to EXCLUSIVE.

I can add a comment.

> 
> then it warrants a comment explaining why.
> 
> I see no reason why ELIDE_*LOCK must be macros rather than functions.
> If you really know better than the compiler it's important that they
> be inlined, then use always_inline.  Give each a comment describing
> its interface contract.  (AFAICT elision_adapt should just be rolled
> directly into elide_lock.)  
> 
> Oh, I see, you are passing in an expression to be evaluated inside.
> Can you try passing in the address of a nested function marked as
> always_inline and see if the compiler manages to inline everything?
> Hmm.  But it's the same expression every time!  So why the false
> generality?  They could just be inline functions taking the
> pthread_rwlock_t pointer directly.

The macros are generic for any kind of lock type.

It would be different for, e.g., spinlocks.  There are also
some possible future optimizations where it may differ
even for rwlocks.
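
For illustration only (the spinlock type and slow-path function here
are hypothetical, not existing code), another lock type would pass its
own adaptation byte and its own lock-free predicate:

    /* Hypothetical elided spinlock acquire.  */
    if (ELIDE_LOCK (sl->__elision, sl->__lock == 0))
      return 0;                  /* ran elided */
    my_spin_lock_slow (sl);      /* normal slow path */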

So I would prefer to keep the macro. Do you insist on the
inline?

-Andi
Roland McGrath April 5, 2014, 4:42 a.m. UTC | #3
> Thanks for the review.  I'll do these changes.

I should have said that I'm not doing a complete review.  Really only
a style and nit review (though, as you know, getting that the way we
want it often takes more of your effort than the actual logic does).
I don't understand either the TSX semantics or the rwlock requirements
well enough offhand to confidently review the core logic of the change.
I hope someone like Torvald can step in for that.

> This is an important optimization to avoid unnecessary cache line
> transitions and aborts. In general unnecessary writes
> can be very expensive in parallel situations, as hardware
> may need to do a lot of work to transition the cache line
> from SHARED to EXCLUSIVE.
> 
> I can add a comment.

I think you might need to do more than that if you want to ensure the
compiler never makes a different choice for you.  I'm not positive,
but I think the rules of C allow it to transform the one into the
other ("optimizing out" the test and branch in favor of a write that
might be a no-op).  In the absence of volatile (or atomic ops or
whatnot) then I don't think there are any actual guarantees about
anything like e.g.  preserving semantics that if the value was already
what you tested for then the page will never see a write access.  Even
if compilers don't violate our presumptions about this today, they
might in the future unless the C standard constrains them.  And even
if the target hardware in question will never make that a wise
optimization decision and so the compiler will never do it, we should
still express to the compiler the constraints we want it to obey as
best we can.  So--unless I'm wrong about what the standard specifies,
which I'm not entirely sure about--I think we should implement
these cases with something explicit.  It could start out as just a
macro that serves as documentation of the intent, while we look into
what we can or should do to express that intent to the compiler.  e.g.

#define assign_probably_same(var, val) ({ if (var != val) var = val; var; })

(or perhaps get fancier to avoid multiple evaluation or something).

> The macros are generic for any kind of lock type.
> 
> It would be a different e.g. for spinlocks. There are also
> some possible future optimizations where it may be different 
> even for rwlocks.

At first blush this sounds like premature generalization that could
always be done later if and when you actually have additional uses
that could share some of the code.  I certainly won't call that
necessarily an evil (I've clearly been known to do it myself), but its
pre-loaded potential future value always has to be weighed against
other considerations such as the readability/maintainability of the
code that goes in first, before any such putative future use of the
generality materializes.

At the very least, if they are intended to be used somewhat
generically, then certainly they should have thorough comments
describing their interface contracts.

> So I would prefer to keep the macro. Do you insist on the inline?

I insist on doing things the cleanest way that we can do them without
sacrificing meaningful performance.  We always start from the
presumption that nontrivial macros are not the cleanest choice.

If you start with rwlock-specific functions it is obvious that they
can be clean and straightforward as inlines.  If you want to start
with generic functions even before they have multiple users, then it
is more intricate to do something clean.  I didn't try your actual
code, but in a simple experiment with GCC 4.8.2 on some mock-up code
that has the same essential structure, I got it to inline everything
when it's structured as:

	static inline __attribute__ ((always_inline)) bool
	elide_lock (uint8_t *count, bool (*predicate) (void *), void *arg)
	{
	  ... if ((*predicate) (arg)) ...
	}
	static inline __attribute__ ((always_inline)) bool
	rdlock_elidable_p (void *arg)
	{
	  pthread_rwlock_t *rwlock = arg;
	  return (rwlock->__data.__lock == 0
		  && rwlock->__data.__writer == 0
		  && rwlock->__data.__nr_readers == 0);
	}

	...
	  if (elide_lock (&rwlock->__data.__rwelision,
			  &rdlock_elidable_p, rwlock))
	    return 0;
	...

(It unfortunately didn't inline the predicate function when it was
nested and referring to its parent's arguments, which should have come
out to exactly the same code as the above.)

So unless you have a good argument why having code in multi-line
macros is somehow cleaner than that or something else along those
lines, how about we try that?  (Or you can just start simple without
the generality now and revisit the argument about how to implement the
generality later on when you have the second user ready to go in.)


Thanks,
Roland
Alexander Monakov April 7, 2014, 2:41 p.m. UTC | #4
On Fri, 4 Apr 2014, Roland McGrath wrote:
> > This is an important optimization to avoid unnecessary cache line
> > transitions and aborts. In general unnecessary writes
> > can be very expensive in parallel situations, as hardware
> > may need to do a lot of work to transition the cache line
> > from SHARED to EXCLUSIVE.
> > 
> > I can add a comment.
> 
> I think you might need to do more than that if you want to ensure the
> compiler never makes a different choice for you.  I'm not positive,
> but I think the rules of C allow it to transform the one into the
> other ("optimizing out" the test and branch in favor of a write that
> might be a no-op).  In the absence of volatile (or atomic ops or
> whatnot) then I don't think there are any actual guarantees about
> anything like e.g.  preserving semantics that if the value was already
> what you tested for then the page will never see a write access.  Even
> if compilers don't violate our presumptions about this today, they
> might in the future unless the C standard constrains them.  And even
> if the target hardware in question will never make that a wise
> optimization decision and so the compiler will never do it, we should
> still express to the compiler the constraints we want it to obey as
> best we can.  So--unless I'm wrong about what the standard specifies,
> which I'm not entirely sure about--then I think we should implement
> these cases with something explicit.  It could start out as just a
> macro that serves as documentation of the intent, while we look into
> what we can or should do to express that intent to the compiler.  e.g.
> 
> #define assign_probably_same(var, val) ({ if (var != val) var = val; var; })
> 
> (or perhaps get fancier to avoid multiple evaluation or something).

The compiler is not allowed to always translate a conditional store:

    if (*ptr != value)
        *ptr = value;

into an unconditional store, because `ptr' might point to read-only memory.

If the compiler can prove that `ptr' must be pointing to a writable location
(for instance if there is a preceding (dominating) unconditional store), it
can, and likely will, perform the optimization.

I would also suggest making the intent (perform the store only when
necessary) explicit, and making sure to disallow the compiler
optimization, for example:

    if (*ptr != value)
        *(volatile typeof(*ptr)*)ptr = value;

Alexander
Jeff Law April 7, 2014, 3:21 p.m. UTC | #5
On 04/07/14 08:41, Alexander Monakov wrote:
> On Fri, 4 Apr 2014, Roland McGrath wrote:
>>> This is an important optimization to avoid unnecessary cache line
>>> transitions and aborts. In general unnecessary writes
>>> can be very expensive in parallel situations, as hardware
>>> may need to do a lot of work to transition the cache line
>>> from SHARED to EXCLUSIVE.
>>>
>>> I can add a comment.
>>
>> I think you might need to do more than that if you want to ensure the
>> compiler never makes a different choice for you.  I'm not positive,
>> but I think the rules of C allow it to transform the one into the
>> other ("optimizing out" the test and branch in favor of a write that
>> might be a no-op).  In the absence of volatile (or atomic ops or
>> whatnot) then I don't think there are any actual guarantees about
>> anything like e.g.  preserving semantics that if the value was already
>> what you tested for then the page will never see a write access.  Even
>> if compilers don't violate our presumptions about this today, they
>> might in the future unless the C standard constrains them.  And even
>> if the target hardware in question will never make that a wise
>> optimization decision and so the compiler will never do it, we should
>> still express to the compiler the constraints we want it to obey as
>> best we can.  So--unless I'm wrong about what the standard specifies,
>> which I'm not entirely sure about--then I think we should implement
>> these cases with something explicit.  It could start out as just a
>> macro that serves as documentation of the intent, while we look into
>> what we can or should do to express that intent to the compiler.  e.g.
>>
>> #define assign_probably_same(var, val) ({ if (var != val) var = val; var; })
>>
>> (or perhaps get fancier to avoid multiple evaluation or something).
>
> The compiler is not allowed to always translate a conditional store:
>
>      if (*ptr != value)
>          *ptr = value;
>
> into an unconditional store, because `ptr' might point to read-only memory.
>
> If the compiler can prove that `ptr' must be pointing to writeable location
> (for instance if there is a preceding (dominating) unconditional store), it
> can, and likely will, perform the optimization.
You might want to check the C11/C++11 standard to be sure.  There's a 
reasonable chance the memory model disallows this kind of transformation.

jeff
Andi Kleen April 7, 2014, 4:10 p.m. UTC | #6
> If the compiler can prove that `ptr' must be pointing to writeable location
> (for instance if there is a preceding (dominating) unconditional store), it
> can, and likely will, perform the optimization.

Except it's not an optimization, but a pessimization.

Which compiler would do that? It sounds very broken to me.

> 
> I would also suggest making the intent (perform the store only when necessary)
> explicit, and make sure to disallow the compiler optimization, for example:
> 
>     if (*ptr != value)
>         *(volatile typeof(*ptr)*)ptr = value;

That's really ugly.

-Andi
Alexander Monakov April 7, 2014, 4:54 p.m. UTC | #7
On Mon, 7 Apr 2014, Andi Kleen wrote:

> > If the compiler can prove that `ptr' must be pointing to writeable location
> > (for instance if there is a preceding (dominating) unconditional store), it
> > can, and likely will, perform the optimization.
> 
> Except it's not an optimization, but a pessimization

I see where you're coming from, but is that really a pessimization for
the case of non-multithreaded execution?  Also, I (of course) agree with
Jeff Law that such a transformation has a good chance of violating the
memory model imposed by newer standards.

> Which compiler would do that? It sounds very broken to me.

Example:

void foo(int * __restrict__ ptr, int val, volatile int * __restrict__ cond)
{
  *ptr = 0;
  while (*cond);
  if (*ptr != val)
    *ptr = val;
}

In my tests, GCC versions before 4.8 optimize out the first store and the
conditional branch.  GCC 4.8.0 preserves both the first store and the branch.
If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

> > I would also suggest making the intent (perform the store only when necessary)
> > explicit, and make sure to disallow the compiler optimization, for example:
> > 
> >     if (*ptr != value)
> >         *(volatile typeof(*ptr)*)ptr = value;
> 
> That's really ugly.

I simply expanded the kernel's ACCESS_ONCE macro by hand for the sake
of the example.

Alexander
Andi Kleen April 7, 2014, 5:52 p.m. UTC | #8
On Mon, Apr 07, 2014 at 08:54:57PM +0400, Alexander Monakov wrote:
> 
> 
> On Mon, 7 Apr 2014, Andi Kleen wrote:
> 
> > > If the compiler can prove that `ptr' must be pointing to writeable location
> > > (for instance if there is a preceding (dominating) unconditional store), it
> > > can, and likely will, perform the optimization.
> > 
> > Except it's not an optimization, but a pessimization
> 
> I see where you're coming from, but is that really a pessimization for a case
> of non-multithreaded execution? 

If it's write-back memory and the cache line would otherwise be clean,
the cache line eventually has to be flushed, which causes extra work for
the memory controller. The memory controller has a limited working set
(number of open DRAM pages etc.); if that is thrashing, it may make things
significantly slower when it has to close some other actively
used page.

It's also quite bad for power consumption because it's a lot of
extra work.

In fact, I would probably argue that this would be a good target
for a profile-feedback pass to add such if()s.

> 
> In my tests, GCC versions before 4.8 optimize out the first store and the
> conditional branch.  GCC 4.8.0 preserves both the first store and the branch.
> If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

Thanks.

> 
> > > I would also suggest making the intent (perform the store only when necessary)
> > > explicit, and make sure to disallow the compiler optimization, for example:
> > > 
> > >     if (*ptr != value)
> > >         *(volatile typeof(*ptr)*)ptr = value;
> > 
> > That's really ugly.
> 
> I simply expanded kernel's ACCESS_ONCE macro by hand for the sake of the
> example.

Ok I guess we can add an ACCESS_ONCE()

The existing elision code also uses this construct.
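
For reference, the kernel macro in question is essentially:

    #define ACCESS_ONCE(x) (*(volatile __typeof__ (x) *) &(x))

so the guarded store in elision_adapt would become:

    if (*adapt_count != __elision_aconf.skip_lock_busy)
      ACCESS_ONCE (*adapt_count) = __elision_aconf.skip_lock_busy;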

-Andi
Torvald Riegel April 10, 2014, 8:20 p.m. UTC | #9
On Mon, 2014-04-07 at 09:21 -0600, Jeff Law wrote:
> On 04/07/14 08:41, Alexander Monakov wrote:
> > On Fri, 4 Apr 2014, Roland McGrath wrote:
> >>> This is an important optimization to avoid unnecessary cache line
> >>> transitions and aborts. In general unnecessary writes
> >>> can be very expensive in parallel situations, as hardware
> >>> may need to do a lot of work to transition the cache line
> >>> from SHARED to EXCLUSIVE.
> >>>
> >>> I can add a comment.
> >>
> >> I think you might need to do more than that if you want to ensure the
> >> compiler never makes a different choice for you.  I'm not positive,
> >> but I think the rules of C allow it to transform the one into the
> >> other ("optimizing out" the test and branch in favor of a write that
> >> might be a no-op).  In the absence of volatile (or atomic ops or
> >> whatnot) then I don't think there are any actual guarantees about
> >> anything like e.g.  preserving semantics that if the value was already
> >> what you tested for then the page will never see a write access.  Even
> >> if compilers don't violate our presumptions about this today, they
> >> might in the future unless the C standard constrains them.  And even
> >> if the target hardware in question will never make that a wise
> >> optimization decision and so the compiler will never do it, we should
> >> still express to the compiler the constraints we want it to obey as
> >> best we can.  So--unless I'm wrong about what the standard specifies,
> >> which I'm not entirely sure about--then I think we should implement
> >> these cases with something explicit.  It could start out as just a
> >> macro that serves as documentation of the intent, while we look into
> >> what we can or should do to express that intent to the compiler.  e.g.
> >>
> >> #define assign_probably_same(var, val) ({ if (var != val) var = val; var; })
> >>
> >> (or perhaps get fancier to avoid multiple evaluation or something).
> >
> > The compiler is not allowed to always translate a conditional store:
> >
> >      if (*ptr != value)
> >          *ptr = value;
> >
> > into an unconditional store, because `ptr' might point to read-only memory.
> >
> > If the compiler can prove that `ptr' must be pointing to writeable location
> > (for instance if there is a preceding (dominating) unconditional store), it
> > can, and likely will, perform the optimization.
> You might want to check the C11/C++11 standard to be sure.  There's a 
> reasonable chance the memory model disallows this kind of transformation.

I think a compiler might consider this a corner case, in the sense that
the as-if rule potentially allows an idempotent store.  The location is
not atomic-typed (so the compiler may assume only this thread accesses
it) nor volatile, so an idempotent store cannot change the outcome of
the program, I think -- provided we don't consider read-only memory
(which neither C11 nor C++11 specify, I believe).

However, specific platforms might have read-only memory, so unless a
compiler can prove that the memory is always writable at this point in
time, it might not be wise for the compiler to do the (potentially
idempotent) store all the time.
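
A concrete instance of the read-only-memory concern (a sketch using
POSIX mmap): the guarded store below is a well-defined no-op, while a
compiler-introduced unconditional store would fault.

    #include <sys/mman.h>

    /* An anonymous PROT_READ mapping is zero-filled and read-only.  */
    int *p = mmap (NULL, 4096, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (*p != 0)   /* false, so no store is executed...  */
      *p = 0;      /* ...but an unconditional store would SIGSEGV.  */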
Torvald Riegel April 10, 2014, 8:26 p.m. UTC | #10
On Mon, 2014-04-07 at 20:54 +0400, Alexander Monakov wrote:
> 
> On Mon, 7 Apr 2014, Andi Kleen wrote:
> 
> > > If the compiler can prove that `ptr' must be pointing to writeable location
> > > (for instance if there is a preceding (dominating) unconditional store), it
> > > can, and likely will, perform the optimization.
> > 
> > Except it's not an optimization, but a pessimization
> 
> I see where you're coming from, but is that really a pessimization for a case
> of non-multithreaded execution?  Also, I (of course) agree with Jeff Law that
> such transformation has good chances of violating the memory model imposed by
> newer standards.
> 
> > Which compiler would do that? It sounds very broken to me.
> 
> Example:
> 
> void foo(int * __restrict__ ptr, int val, volatile int * __restrict__ cond)
> {
>   *ptr = 0;
>   while (*cond);
>   if (*ptr != val)
>     *ptr = val;
> }
> 
> In my tests, GCC versions before 4.8 optimize out the first store and the
> conditional branch.  GCC 4.8.0 preserves both the first store and the branch.
> If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

If we consider just the standards (which don't provide for something
like read-only memory, I believe (and ptr isn't volatile)), then I think
both pre 4.8 and 4.8 behavior are correct.  I don't know whether that's
actually the intention, but 4.8 might treat the while loop as
synchronization (which it isn't according to C11/C++11) and thus not
merge the stores.
Torvald Riegel April 10, 2014, 8:35 p.m. UTC | #11
On Mon, 2014-04-07 at 19:52 +0200, Andi Kleen wrote:
> On Mon, Apr 07, 2014 at 08:54:57PM +0400, Alexander Monakov wrote: 
> > > > I would also suggest making the intent (perform the store only when necessary)
> > > > explicit, and make sure to disallow the compiler optimization, for example:
> > > > 
> > > >     if (*ptr != value)
> > > >         *(volatile typeof(*ptr)*)ptr = value;
> > > 
> > > That's really ugly.
> > 
> > I simply expanded kernel's ACCESS_ONCE macro by hand for the sake of the
> > example.
> 
> Ok I guess we can add a ACCESS_ONCE()
> 
> The existing elision code also uses this construct.

I'm not a kernel developer, but I think ACCESS_ONCE is rather used there
to get reads in spin-loops right (in terms of forward progress).  If we
transition to C11 atomics eventually, that specific workaround won't be
needed because just having an atomic load in a spin loop will be enough
to tell the compiler to not optimize the load out.

However, this case here is different because it's not about forward
progress but about whether idempotent stores are fine or not.  I guess
it's really a case of inefficient code generation if the compiler is
optimizing out checks for idempotent stores.  Alexander's example is
slightly different because there, the compiler sees that the location
just has been written to.  It shouldn't assume this in the lock elision
code.

Thus, I think it's not worth adding (something other than) ACCESS_ONCE
here; rather, someone should look at the compiler.  Maybe the case would
be clearer for the compiler if we'd actually used atomic accesses and
atomic-typed variables (because for synchronizing code, it's more likely
to be inefficient if you just store, idempotent or not...).
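
With C11 atomics, the conditional update might look like this (a
sketch, assuming the adaptation counter were given an atomic type;
relaxed ordering should suffice because the counter is only a
heuristic):

    #include <stdatomic.h>

    static inline void
    adapt_set (atomic_uchar *adapt_count, unsigned char val)
    {
      /* Read-only probe first; only dirty the cache line when the
         value actually changes.  */
      if (atomic_load_explicit (adapt_count, memory_order_relaxed) != val)
        atomic_store_explicit (adapt_count, val, memory_order_relaxed);
    }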

Patch

diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index a4deed4..413aa76 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -22,7 +22,7 @@ 
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
-
+#include "elide.h"
 
 /* Acquire read lock for RWLOCK.  Slow path. */
 static int __attribute__((noinline))
@@ -102,6 +102,12 @@  __pthread_rwlock_rdlock (pthread_rwlock_t *rwlock)
 
   LIBC_PROBE (rdlock_entry, 1, rwlock);
 
+  if (ELIDE_LOCK (rwlock->__data.__rwelision,
+		  rwlock->__data.__lock == 0
+		  && rwlock->__data.__writer == 0
+		  && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   /* Make sure we are alone.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
diff --git a/nptl/pthread_rwlock_tryrdlock.c b/nptl/pthread_rwlock_tryrdlock.c
index f7b1e6b..fe615bc 100644
--- a/nptl/pthread_rwlock_tryrdlock.c
+++ b/nptl/pthread_rwlock_tryrdlock.c
@@ -19,13 +19,19 @@ 
 #include <errno.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
-
+#include "elide.h"
 
 int
 __pthread_rwlock_tryrdlock (pthread_rwlock_t *rwlock)
 {
   int result = EBUSY;
 
+  if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
+		     rwlock->__data.__lock == 0
+		     && rwlock->__data.__nr_readers == 0
+		     && rwlock->__data.__writer == 0, 0))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
   if (rwlock->__data.__writer == 0
diff --git a/nptl/pthread_rwlock_trywrlock.c b/nptl/pthread_rwlock_trywrlock.c
index 106f157..9891c3d 100644
--- a/nptl/pthread_rwlock_trywrlock.c
+++ b/nptl/pthread_rwlock_trywrlock.c
@@ -19,13 +19,19 @@ 
 #include <errno.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
-
+#include "elide.h"
 
 int
 __pthread_rwlock_trywrlock (pthread_rwlock_t *rwlock)
 {
   int result = EBUSY;
 
+  if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
+		     rwlock->__data.__lock == 0
+		     && rwlock->__data.__nr_readers == 0
+		     && rwlock->__data.__writer == 0, 1))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
   if (rwlock->__data.__writer == 0 && rwlock->__data.__nr_readers == 0)
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index d492383..e3c63e4 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -22,6 +22,7 @@ 
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
+#include "elide.h"
 
 /* Unlock RWLOCK.  */
 int
@@ -29,6 +30,10 @@  __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 {
   LIBC_PROBE (rwlock_unlock, 1, rwlock);
 
+  if (ELIDE_UNLOCK (rwlock->__data.__writer == 0
+		    && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
   if (rwlock->__data.__writer)
     rwlock->__data.__writer = 0;
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 2907681..fc1217c 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -22,7 +22,7 @@ 
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
-
+#include "elide.h"
 
 /* Acquire write lock for RWLOCK.  */
 static int __attribute__((noinline))
@@ -91,6 +91,12 @@  __pthread_rwlock_wrlock (pthread_rwlock_t *rwlock)
 {
   LIBC_PROBE (wrlock_entry, 1, rwlock);
 
+  if (ELIDE_LOCK (rwlock->__data.__rwelision,
+		  rwlock->__data.__lock == 0
+		  && rwlock->__data.__writer == 0
+		  && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   /* Make sure we are alone.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
diff --git a/nptl/sysdeps/pthread/pthread.h b/nptl/sysdeps/pthread/pthread.h
index 1e0c5dc..6992e4a 100644
--- a/nptl/sysdeps/pthread/pthread.h
+++ b/nptl/sysdeps/pthread/pthread.h
@@ -139,19 +139,23 @@  enum
 # endif
 #endif
 
+#ifndef __PTHREAD_RWLOCK_ELISION_EXTRA
+# define __PTHREAD_RWLOCK_ELISION_EXTRA 0
+#endif
+
 /* Read-write lock initializers.  */
 # define PTHREAD_RWLOCK_INITIALIZER \
-  { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } }
+  { { 0, 0, 0, 0, 0, 0, 0, 0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0, 0 } }
 # ifdef __USE_GNU
 #  ifdef __PTHREAD_RWLOCK_INT_FLAGS_SHARED
 #   define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
-  { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,					      \
+  { { 0, 0, 0, 0, 0, 0, 0, 0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0,					      \
 	PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP } }
 #  else
 #   if __BYTE_ORDER == __LITTLE_ENDIAN
 #    define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
   { { 0, 0, 0, 0, 0, 0, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP, \
-      0, 0, 0, 0 } }
+      0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0, 0 } }
 #   else
 #    define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
   { { 0, 0, 0, 0, 0, 0, 0, 0, 0, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP,\
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h b/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
index 28e5144..6152bdf 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
+++ b/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
@@ -183,11 +183,13 @@  typedef union
     unsigned int __nr_writers_queued;
     int __writer;
     int __shared;
-    unsigned long int __pad1;
+    unsigned char __rwelision;
+    unsigned char __pad1[7];
     unsigned long int __pad2;
     /* FLAGS must stay at this position in the structure to maintain
        binary compatibility.  */
     unsigned int __flags;
+# define __PTHREAD_RWLOCK_ELISION_EXTRA 0, {0, 0, 0, 0, 0, 0, 0 }
 # define __PTHREAD_RWLOCK_INT_FLAGS_SHARED	1
   } __data;
 # else
@@ -203,7 +205,7 @@  typedef union
        binary compatibility.  */
     unsigned char __flags;
     unsigned char __shared;
-    unsigned char __pad1;
+    unsigned char __rwelision;
     unsigned char __pad2;
     int __writer;
   } __data;
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/elide.h b/nptl/sysdeps/unix/sysv/linux/x86/elide.h
new file mode 100644
index 0000000..cb56901
--- /dev/null
+++ b/nptl/sysdeps/unix/sysv/linux/x86/elide.h
@@ -0,0 +1,81 @@ 
+#ifndef ELIDE_H
+#define ELIDE_H 1
+
+#include "hle.h"
+#include "elision-conf.h"
+
+/* Adapt elision with ADAPT_COUNT and STATUS and decide retries.  */
+
+static inline int elision_adapt(unsigned char *adapt_count, unsigned status)
+{
+  if (status & _XABORT_RETRY)
+    return 0;
+  if ((status & _XABORT_EXPLICIT)
+      && _XABORT_CODE (status) == _ABORT_LOCK_BUSY)
+    {
+      /* Right now we skip here.  Better would be to wait a bit
+	 and retry.  This likely needs some spinning.  */
+      if (*adapt_count != __elision_aconf.skip_lock_busy)
+	*adapt_count = __elision_aconf.skip_lock_busy;
+    }
+  /* Internal abort.  There is no chance for retry.
+     Use the normal locking and next time use lock.
+     Be careful to avoid writing to the lock.  */
+  else if (*adapt_count != __elision_aconf.skip_lock_internal_abort)
+    *adapt_count = __elision_aconf.skip_lock_internal_abort;
+  return 1;
+}
+
+/* is_lock_free must be executed inside the transaction */
+
+#define ELIDE_LOCK(adapt_count, is_lock_free)			\
+  ({								\
+    int i;							\
+    unsigned status;						\
+    int ret = 0;						\
+								\
+    if ((adapt_count) <= 0) 					\
+      {								\
+        for (i = __elision_aconf.retry_try_xbegin; i > 0; i--)	\
+          {							\
+	    if ((status = _xbegin ()) == _XBEGIN_STARTED)	\
+	      {							\
+	        if (is_lock_free)				\
+	          {						\
+		    ret = 1;					\
+		    break;					\
+	          }						\
+	        _xabort (_ABORT_LOCK_BUSY);			\
+	      }							\
+	    if (!elision_adapt (&(adapt_count), status))	\
+	      break;						\
+          }							\
+      }								\
+    else 							\
+      (adapt_count)--; /* missing updates ok */			\
+    ret;							\
+  })
+
+#define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) ({	\
+  int ret = 0;						\
+  if (__elision_aconf.retry_try_xbegin > 0)		\
+    {  							\
+      if (write)					\
+        _xabort (_ABORT_NESTED_TRYLOCK);		\
+      ret = ELIDE_LOCK (adapt_count, is_lock_free);     \
+    }							\
+  ret;						\
+  })
+
+#define ELIDE_UNLOCK(is_lock_free)		\
+  ({						\
+  int ret = 0;					\
+  if (is_lock_free)				\
+    {						\
+      _xend ();					\
+      ret = 1;					\
+    }						\
+  ret;						\
+  })
+
+#endif
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c b/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
index e6f5d6d..28e48d9 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
+++ b/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
@@ -66,6 +66,8 @@  elision_init (int argc __attribute__ ((unused)),
 #ifdef ENABLE_LOCK_ELISION
   __pthread_force_elision = __libc_enable_secure ? 0 : __elision_available;
 #endif
+  if (!HAS_RTM)
+    __elision_aconf.retry_try_xbegin = 0; /* Disable elision on rwlocks */
 }
 
 #ifdef SHARED