Message ID: 1396652083-18920-1-git-send-email-andi@firstfloor.org
State: New
Please preserve the two blank lines between #include's and code.

Use <> rather than "" for #include of any file that might be sysdeps (elide.h).

elide.h is missing the standard header comment (explanatory line plus license header).

Use uint8_t rather than 'unsigned char' for a one-byte counter (perhaps not in the installed header if it's one that doesn't/can't rely on <stdint.h>, but certainly in the rest of the code).

Never use bare 'unsigned', always 'unsigned int'.  (For the type returned by _xbegin et al, perhaps uint32_t would be better, but rtmintrin.h uses 'unsigned int', so that is probably right.)

GNU style for function definitions is:

	type name (argtype ...)

not:

	type name(argtype ...)

Use bool rather than int (and #include <stdbool.h>) for Booleans (e.g. elision_adapt).

Use C99 style, defining local variables only at the site of first setting (including inside the 'for' initializer clause).

There is no need for braces around an if body that is a single statement, even if that statement is for or while.  (It's not a real problem to have the extra braces if you strongly prefer them, but don't expect they will necessarily be preserved.)

__rwelision is an unsigned type, yet ELIDE_LOCK tests with <= 0.  Make it == 0.

In elision_adapt you have two cases of:

	if (var1 != var2)
	  var1 = var2;

If it is meaningfully worthwhile to write that instead of simply:

	var1 = var2;

then it warrants a comment explaining why.

I see no reason why ELIDE_*LOCK must be macros rather than functions.  If you really know better than the compiler it's important that they be inlined, then use always_inline.  Give each a comment describing its interface contract.  (AFAICT elision_adapt should just be rolled directly into elide_lock.)

Oh, I see, you are passing in an expression to be evaluated inside.  Can you try passing in the address of a nested function marked as always_inline and see if the compiler manages to inline everything?  Hmm.  But it's the same expression every time!  So why the false generality?  They could just be inline functions taking the pthread_rwlock_t pointer directly.

Thanks,
Roland
> __rwelision is an unsigned type, yet ELIDE_LOCK tests with <= 0.
> Make it == 0.

Thanks for the review.  I'll do these changes.

> In elision_adapt you have two cases of:
>
>	if (var1 != var2)
>	  var1 = var2;
>
> If it is meaningfully worthwhile to write that instead of simply:
>
>	var1 = var2;

This is an important optimization to avoid unnecessary cache line transitions and aborts.  In general unnecessary writes can be very expensive in parallel situations, as hardware may need to do a lot of work to transition the cache line from SHARED to EXCLUSIVE.

I can add a comment.

> then it warrants a comment explaining why.
>
> I see no reason why ELIDE_*LOCK must be macros rather than functions.
> If you really know better than the compiler it's important that they
> be inlined, then use always_inline.  Give each a comment describing
> its interface contract.  (AFAICT elision_adapt should just be rolled
> directly into elide_lock.)
>
> Oh, I see, you are passing in an expression to be evaluated inside.
> Can you try passing in the address of a nested function marked as
> always_inline and see if the compiler manages to inline everything?
> Hmm.  But it's the same expression every time!  So why the false
> generality?  They could just be inline functions taking the
> pthread_rwlock_t pointer directly.

The macros are generic for any kind of lock type.  It would be different, e.g., for spinlocks.  There are also some possible future optimizations where it may be different even for rwlocks.

So I would prefer to keep the macro.  Do you insist on the inline?

-Andi
> Thanks for the review.  I'll do these changes.

I should have said that I'm not doing a complete review.  Really only a style and nit review (but as you know that is often what takes more effort from you to get the way we want it than the actual logic does).  I don't understand either the TSX semantics or the rwlock requirements well enough off hand to confidently review the core logic of the change.  I hope someone like Torvald can step in for that.

> This is an important optimization to avoid unnecessary cache line
> transitions and aborts.  In general unnecessary writes
> can be very expensive in parallel situations, as hardware
> may need to do a lot of work to transition the cache line
> from SHARED to EXCLUSIVE.
>
> I can add a comment.

I think you might need to do more than that if you want to ensure the compiler never makes a different choice for you.  I'm not positive, but I think the rules of C allow it to transform the one into the other ("optimizing out" the test and branch in favor of a write that might be a no-op).  In the absence of volatile (or atomic ops or whatnot) I don't think there are any actual guarantees about anything like e.g. preserving the semantics that, if the value was already what you tested for, the page will never see a write access.  Even if compilers don't violate our presumptions about this today, they might in the future unless the C standard constrains them.  And even if the target hardware in question will never make that a wise optimization decision and so the compiler will never do it, we should still express to the compiler the constraints we want it to obey as best we can.

So--unless I'm wrong about what the standard specifies, which I'm not entirely sure about--I think we should implement these cases with something explicit.  It could start out as just a macro that serves as documentation of the intent, while we look into what we can or should do to express that intent to the compiler.  e.g.

	#define assign_probably_same(var, val) ({ if (var != val) var = val; var; })

(or perhaps get fancier to avoid multiple evaluation or something).

> The macros are generic for any kind of lock type.
>
> It would be different, e.g., for spinlocks.  There are also
> some possible future optimizations where it may be different
> even for rwlocks.

At first blush this sounds like premature generalization that could always be done later, if and when you actually have additional uses that could share some of the code.  I certainly won't call that necessarily an evil (I've clearly been known to do it myself), but its pre-loaded potential future value always has to be weighed against other considerations such as the readability/maintainability of the code that goes in first, before any such putative future use of the generality.  At the very least, if they are intended to be used somewhat generically, then certainly they should have thorough comments describing their interface contracts.

> So I would prefer to keep the macro.  Do you insist on the inline?

I insist on doing things the cleanest way that we can do them without sacrificing meaningful performance.  We always start from the presumption that nontrivial macros are not the cleanest choice.  If you start with rwlock-specific functions, it is obvious that they can be clean and straightforward as inlines.  If you want to start with generic functions even before they have multiple users, then it is more intricate to do something clean.

I didn't try your actual code, but in a simple experiment with GCC 4.8.2 on some mock-up code that has the same essential structure, I got it to inline everything when it's structured as:

	static inline __attribute__ ((always_inline)) bool
	elide_lock (uint8_t *count, bool (*predicate) (void *), void *arg)
	{
	  ...
	  if ((*predicate) (arg))
	  ...
	}

	static inline __attribute__ ((always_inline)) bool
	rdlock_elidable_p (void *arg)
	{
	  pthread_rwlock_t *rwlock = arg;
	  return (rwlock->__data.__lock == 0
		  && rwlock->__data.__writer == 0
		  && rwlock->__data.__nr_readers == 0);
	}

	...
	  if (elide_lock (&rwlock->__data.__rwelision,
			  &rdlock_elidable_p, rwlock))
	    return 0;
	...

(It unfortunately didn't inline the predicate function when it was nested and referring to its parent's arguments, which should have come out to exactly the same code as the above.)

So unless you have a good argument why having code in multi-line macros is somehow cleaner than that, or something else along those lines, how about we try that?  (Or you can just start simple without the generality now and revisit the argument about how to implement the generality later on, when you have the second user ready to go in.)

Thanks,
Roland
On Fri, 4 Apr 2014, Roland McGrath wrote:
> I think you might need to do more than that if you want to ensure the
> compiler never makes a different choice for you.  I'm not positive,
> but I think the rules of C allow it to transform the one into the
> other ("optimizing out" the test and branch in favor of a write that
> might be a no-op).  [...]
>
>	#define assign_probably_same(var, val) ({ if (var != val) var = val; var; })
>
> (or perhaps get fancier to avoid multiple evaluation or something).

The compiler is not allowed to always translate a conditional store:

	if (*ptr != value)
	  *ptr = value;

into an unconditional store, because `ptr' might point to read-only memory.

If the compiler can prove that `ptr' must be pointing to a writable location (for instance if there is a preceding (dominating) unconditional store), it can, and likely will, perform the optimization.

I would also suggest making the intent (perform the store only when necessary) explicit, and making sure to disallow the compiler optimization, for example:

	if (*ptr != value)
	  *(volatile typeof(*ptr)*)ptr = value;

Alexander
On 04/07/14 08:41, Alexander Monakov wrote:
> The compiler is not allowed to always translate a conditional store:
>
>	if (*ptr != value)
>	  *ptr = value;
>
> into an unconditional store, because `ptr' might point to read-only memory.
>
> If the compiler can prove that `ptr' must be pointing to a writable location
> (for instance if there is a preceding (dominating) unconditional store), it
> can, and likely will, perform the optimization.

You might want to check the C11/C++11 standard to be sure.  There's a reasonable chance the memory model disallows this kind of transformation.

jeff
> If the compiler can prove that `ptr' must be pointing to a writable location
> (for instance if there is a preceding (dominating) unconditional store), it
> can, and likely will, perform the optimization.

Except it's not an optimization, but a pessimization.  Which compiler would do that?  It sounds very broken to me.

> I would also suggest making the intent (perform the store only when necessary)
> explicit, and making sure to disallow the compiler optimization, for example:
>
>	if (*ptr != value)
>	  *(volatile typeof(*ptr)*)ptr = value;

That's really ugly.

-Andi
On Mon, 7 Apr 2014, Andi Kleen wrote:
> Except it's not an optimization, but a pessimization.

I see where you're coming from, but is that really a pessimization for a case of non-multithreaded execution?  Also, I (of course) agree with Jeff Law that such a transformation has good chances of violating the memory model imposed by newer standards.

> Which compiler would do that?  It sounds very broken to me.

Example:

	void foo(int * __restrict__ ptr, int val, volatile int * __restrict__ cond)
	{
	  *ptr = 0;
	  while (*cond);
	  if (*ptr != val)
	    *ptr = val;
	}

In my tests, GCC versions before 4.8 optimize out the first store and the conditional branch.  GCC 4.8.0 preserves both the first store and the branch.  If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

> That's really ugly.

I simply expanded the kernel's ACCESS_ONCE macro by hand for the sake of the example.

Alexander
On Mon, Apr 07, 2014 at 08:54:57PM +0400, Alexander Monakov wrote:
> I see where you're coming from, but is that really a pessimization for a case
> of non-multithreaded execution?

If it's write-back memory and the cache line would otherwise be clean, the cache line eventually has to be flushed, which causes extra work for the memory controller.  The memory controller has a limited working set (number of open DRAM pages etc.); if that is thrashing, it may make things significantly slower when it has to close some other actively used page.

It's also quite bad for power consumption, because it's a lot of extra work.

In fact I would probably argue that this would be a good target for a profile feedback pass: adding such if()s automatically.

> In my tests, GCC versions before 4.8 optimize out the first store and the
> conditional branch.  GCC 4.8.0 preserves both the first store and the branch.
> If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

Thanks.

> I simply expanded the kernel's ACCESS_ONCE macro by hand for the sake of the
> example.

Ok, I guess we can add an ACCESS_ONCE().  The existing elision code also uses this construct.

-Andi
On Mon, 2014-04-07 at 09:21 -0600, Jeff Law wrote:
> On 04/07/14 08:41, Alexander Monakov wrote:
> > The compiler is not allowed to always translate a conditional store:
> >
> >	if (*ptr != value)
> >	  *ptr = value;
> >
> > into an unconditional store, because `ptr' might point to read-only memory.
> >
> > If the compiler can prove that `ptr' must be pointing to a writable location
> > (for instance if there is a preceding (dominating) unconditional store), it
> > can, and likely will, perform the optimization.
>
> You might want to check the C11/C++11 standard to be sure.  There's a
> reasonable chance the memory model disallows this kind of transformation.

I think a compiler might consider this as a corner case, in the sense of as-if potentially allowing an idempotent store.  The location is not atomic-typed (so the compiler can assume thread-local memory is accessed) nor volatile, so an idempotent store cannot change the outcome of the program, I think -- provided we don't consider read-only memory (which neither C11 nor C++11 specify, I believe).  However, specific platforms might have read-only memory, so unless a compiler can prove that the memory is always writable at this point in time, it might not be wise for the compiler to do the (potentially idempotent) store all the time.
On Mon, 2014-04-07 at 20:54 +0400, Alexander Monakov wrote:
> Example:
>
>	void foo(int * __restrict__ ptr, int val, volatile int * __restrict__ cond)
>	{
>	  *ptr = 0;
>	  while (*cond);
>	  if (*ptr != val)
>	    *ptr = val;
>	}
>
> In my tests, GCC versions before 4.8 optimize out the first store and the
> conditional branch.  GCC 4.8.0 preserves both the first store and the branch.
> If you omit the busy-wait loop, GCC 4.8 performs the optimization as well.

If we consider just the standards (which don't provide for something like read-only memory, I believe (and ptr isn't volatile)), then I think both the pre-4.8 and the 4.8 behavior are correct.  I don't know whether that's actually the intention, but 4.8 might treat the while loop as synchronization (which it isn't according to C11/C++11) and thus not merge the stores.
On Mon, 2014-04-07 at 19:52 +0200, Andi Kleen wrote:
> > > That's really ugly.
> >
> > I simply expanded the kernel's ACCESS_ONCE macro by hand for the sake of the
> > example.
>
> Ok, I guess we can add an ACCESS_ONCE().
>
> The existing elision code also uses this construct.

I'm not a kernel developer, but I think ACCESS_ONCE is rather used there to get reads in spin-loops right (in terms of forward progress).  If we transition to C11 atomics eventually, that specific workaround won't be needed, because just having an atomic load in a spin loop will be enough to tell the compiler not to optimize the load out.

However, this case here is different, because it's not about forward progress but about whether idempotent stores are fine or not.  I guess it's really a case of inefficient code generation if the compiler is optimizing out checks for idempotent stores.  Alexander's example is slightly different, because there the compiler sees that the location has just been written to.  It shouldn't assume this in the lock elision code.

Thus, I think it's not worth adding (something other than) ACCESS_ONCE here; rather, someone should look at the compiler.  Maybe the case would be clearer for the compiler if we actually used atomic accesses and atomic-typed variables (because for synchronizing code, it's more likely to be inefficient if you just store, idempotent or not...).
diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index a4deed4..413aa76 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -22,7 +22,7 @@
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
-
+#include "elide.h"
 
 /* Acquire read lock for RWLOCK.  Slow path.  */
 static int __attribute__((noinline))
@@ -102,6 +102,12 @@ __pthread_rwlock_rdlock (pthread_rwlock_t *rwlock)
 
   LIBC_PROBE (rdlock_entry, 1, rwlock);
 
+  if (ELIDE_LOCK (rwlock->__data.__rwelision,
+		  rwlock->__data.__lock == 0
+		  && rwlock->__data.__writer == 0
+		  && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   /* Make sure we are alone.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
diff --git a/nptl/pthread_rwlock_tryrdlock.c b/nptl/pthread_rwlock_tryrdlock.c
index f7b1e6b..fe615bc 100644
--- a/nptl/pthread_rwlock_tryrdlock.c
+++ b/nptl/pthread_rwlock_tryrdlock.c
@@ -19,13 +19,19 @@
 #include <errno.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
-
+#include "elide.h"
 
 int
 __pthread_rwlock_tryrdlock (pthread_rwlock_t *rwlock)
 {
   int result = EBUSY;
 
+  if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
+		     rwlock->__data.__lock == 0
+		     && rwlock->__data.__nr_readers == 0
+		     && rwlock->__data.__writer, 0))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
   if (rwlock->__data.__writer == 0
diff --git a/nptl/pthread_rwlock_trywrlock.c b/nptl/pthread_rwlock_trywrlock.c
index 106f157..9891c3d 100644
--- a/nptl/pthread_rwlock_trywrlock.c
+++ b/nptl/pthread_rwlock_trywrlock.c
@@ -19,13 +19,19 @@
 #include <errno.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
-
+#include "elide.h"
 
 int
 __pthread_rwlock_trywrlock (pthread_rwlock_t *rwlock)
 {
   int result = EBUSY;
 
+  if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
+		     rwlock->__data.__lock == 0
+		     && rwlock->__data.__nr_readers == 0
+		     && rwlock->__data.__writer, 1))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
   if (rwlock->__data.__writer == 0 &&
       rwlock->__data.__nr_readers == 0)
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index d492383..e3c63e4 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -22,6 +22,7 @@
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
+#include "elide.h"
 
 /* Unlock RWLOCK.  */
 int
@@ -29,6 +30,10 @@ __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 {
   LIBC_PROBE (rwlock_unlock, 1, rwlock);
 
+  if (ELIDE_UNLOCK (rwlock->__data.__writer == 0
+		    && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
   if (rwlock->__data.__writer)
     rwlock->__data.__writer = 0;
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 2907681..fc1217c 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -22,7 +22,7 @@
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
-
+#include "elide.h"
 
 /* Acquire write lock for RWLOCK.  */
 static int __attribute__((noinline))
@@ -91,6 +91,12 @@ __pthread_rwlock_wrlock (pthread_rwlock_t *rwlock)
 {
   LIBC_PROBE (wrlock_entry, 1, rwlock);
 
+  if (ELIDE_LOCK (rwlock->__data.__rwelision,
+		  rwlock->__data.__lock == 0
+		  && rwlock->__data.__writer == 0
+		  && rwlock->__data.__nr_readers == 0))
+    return 0;
+
   /* Make sure we are alone.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
diff --git a/nptl/sysdeps/pthread/pthread.h b/nptl/sysdeps/pthread/pthread.h
index 1e0c5dc..6992e4a 100644
--- a/nptl/sysdeps/pthread/pthread.h
+++ b/nptl/sysdeps/pthread/pthread.h
@@ -139,19 +139,23 @@ enum
 # endif
 #endif
 
+#ifndef __PTHREAD_RWLOCK_ELISION_EXTRA
+# define __PTHREAD_RWLOCK_ELISION_EXTRA 0
+#endif
+
 /* Read-write lock initializers.  */
 # define PTHREAD_RWLOCK_INITIALIZER \
-  { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } }
+  { { 0, 0, 0, 0, 0, 0, 0, 0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0, 0 } }
 # ifdef __USE_GNU
 #  ifdef __PTHREAD_RWLOCK_INT_FLAGS_SHARED
 #   define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
-  { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+  { { 0, 0, 0, 0, 0, 0, 0, 0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0, \
       PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP } }
 #  else
 #   if __BYTE_ORDER == __LITTLE_ENDIAN
 #    define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
   { { 0, 0, 0, 0, 0, 0, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP, \
-      0, 0, 0, 0 } }
+      0, __PTHREAD_RWLOCK_ELISION_EXTRA, 0, 0 } }
 #   else
 #    define PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP \
   { { 0, 0, 0, 0, 0, 0, 0, 0, 0, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP,\
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h b/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
index 28e5144..6152bdf 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
+++ b/nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
@@ -183,11 +183,13 @@ typedef union
     unsigned int __nr_writers_queued;
     int __writer;
     int __shared;
-    unsigned long int __pad1;
+    unsigned char __rwelision;
+    unsigned char __pad1[7];
     unsigned long int __pad2;
     /* FLAGS must stay at this position in the structure to maintain
        binary compatibility.  */
     unsigned int __flags;
+# define __PTHREAD_RWLOCK_ELISION_EXTRA 0, {0, 0, 0, 0, 0, 0, 0 }
 # define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1
   } __data;
 # else
@@ -203,7 +205,7 @@ typedef union
     /* FLAGS must stay at this position in the structure to maintain
        binary compatibility.  */
     unsigned char __flags;
     unsigned char __shared;
-    unsigned char __pad1;
+    unsigned char __rwelision;
     unsigned char __pad2;
     int __writer;
   } __data;
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/elide.h b/nptl/sysdeps/unix/sysv/linux/x86/elide.h
new file mode 100644
index 0000000..cb56901
--- /dev/null
+++ b/nptl/sysdeps/unix/sysv/linux/x86/elide.h
@@ -0,0 +1,81 @@
+#ifndef ELIDE_H
+#define ELIDE_H 1
+
+#include "hle.h"
+#include "elision-conf.h"
+
+/* Adapt elision with ADAPT_COUNT and STATUS and decide retries.  */
+
+static inline int elision_adapt(unsigned char *adapt_count, unsigned status)
+{
+  if (status & _XABORT_RETRY)
+    return 0;
+  if ((status & _XABORT_EXPLICIT)
+      && _XABORT_CODE (status) == _ABORT_LOCK_BUSY)
+    {
+      /* Right now we skip here.  Better would be to wait a bit
+	 and retry.  This likely needs some spinning.  */
+      if (*adapt_count != __elision_aconf.skip_lock_busy)
+	*adapt_count = __elision_aconf.skip_lock_busy;
+    }
+  /* Internal abort.  There is no chance for retry.
+     Use the normal locking and next time use lock.
+     Be careful to avoid writing to the lock.  */
+  else if (*adapt_count != __elision_aconf.skip_lock_internal_abort)
+    *adapt_count = __elision_aconf.skip_lock_internal_abort;
+  return 1;
+}
+
+/* is_lock_free must be executed inside the transaction */
+
+#define ELIDE_LOCK(adapt_count, is_lock_free)			\
+  ({								\
+    int i;							\
+    unsigned status;						\
+    int ret = 0;						\
+								\
+    if ((adapt_count) <= 0)					\
+      {								\
+	for (i = __elision_aconf.retry_try_xbegin; i > 0; i--)	\
+	  {							\
+	    if ((status = _xbegin ()) == _XBEGIN_STARTED)	\
+	      {							\
+		if (is_lock_free)				\
+		  {						\
+		    ret = 1;					\
+		    break;					\
+		  }						\
+		_xabort (_ABORT_LOCK_BUSY);			\
+	      }							\
+	    if (!elision_adapt (&(adapt_count), status))	\
+	      break;						\
+	  }							\
+      }								\
+    else							\
+      (adapt_count)--; /* missing updates ok */			\
+    ret;							\
+  })
+
+#define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) ({	\
+    int ret = 0;						\
+    if (__elision_aconf.retry_try_xbegin > 0)			\
+      {								\
+	if (write)						\
+	  _xabort (_ABORT_NESTED_TRYLOCK);			\
+	ret = ELIDE_LOCK (adapt_count, is_lock_free);		\
+      }								\
+    ret;							\
+  })
+
+#define ELIDE_UNLOCK(is_lock_free)		\
+  ({						\
+    int ret = 0;				\
+    if (is_lock_free)				\
+      {						\
+	_xend ();				\
+	ret = 1;				\
+      }						\
+    ret;					\
+  })
+
+#endif
diff --git a/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c b/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
index e6f5d6d..28e48d9 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
+++ b/nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c
@@ -66,6 +66,8 @@ elision_init (int argc __attribute__ ((unused)),
 #ifdef ENABLE_LOCK_ELISION
   __pthread_force_elision = __libc_enable_secure ? 0 : __elision_available;
 #endif
+  if (!HAS_RTM)
+    __elision_aconf.retry_try_xbegin = 0; /* Disable elision on rwlocks */
 }
 
 #ifdef SHARED
From: Andi Kleen <ak@linux.intel.com>

This patch relies on the C version of the rwlocks posted earlier.  With C rwlocks it is very straightforward to do adaptive elision using TSX.  It is based on the infrastructure added earlier for mutexes, but uses its own elision macros.  The macros are fairly general purpose and could be used for other elision purposes too.

This version is much cleaner than the earlier assembler-based version, and in particular implements adaptation, which makes it safer.

I changed the behavior slightly to not require any changes in the test suite and to fully conform to all expected behaviors (generally at the cost of not eliding in various situations).  In particular this means the timedlock variants are not elided.  Nested trylock aborts.

nptl/:

2014-04-04  Andi Kleen  <ak@linux.intel.com>

	* pthread_rwlock_rdlock.c: Include elide.h.
	(pthread_rwlock_rdlock): Add elision.
	* pthread_rwlock_wrlock.c: Include elide.h.
	(pthread_rwlock_wrlock): Add elision.
	* pthread_rwlock_trywrlock.c: Include elide.h.
	(pthread_rwlock_trywrlock): Add elision.
	* pthread_rwlock_tryrdlock.c: Include elide.h.
	(pthread_rwlock_tryrdlock): Add elision.
	* pthread_rwlock_unlock.c: Include elide.h.
	(pthread_rwlock_unlock): Add elision unlock.
	* sysdeps/pthread/pthread.h
	(__PTHREAD_RWLOCK_ELISION_EXTRA): Handle new define.
	(PTHREAD_RWLOCK_INITIALIZER,
	PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP):
	Handle new elision field.
	* sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h
	(pthread_rwlock_t): Change __pad1 to __rwelision.
	(__PTHREAD_RWLOCK_ELISION_EXTRA): Add.
	* sysdeps/unix/sysv/linux/x86/elide.h: New file.
	Add generic elision macros.
	* sysdeps/unix/sysv/linux/x86/elision-conf.c
	(elision_init): Set try_xbegin to zero when no RTM.
---
 nptl/pthread_rwlock_rdlock.c                    |  8 ++-
 nptl/pthread_rwlock_tryrdlock.c                 |  8 ++-
 nptl/pthread_rwlock_trywrlock.c                 |  8 ++-
 nptl/pthread_rwlock_unlock.c                    |  5 ++
 nptl/pthread_rwlock_wrlock.c                    |  8 ++-
 nptl/sysdeps/pthread/pthread.h                  | 10 ++-
 .../unix/sysv/linux/x86/bits/pthreadtypes.h     |  6 +-
 nptl/sysdeps/unix/sysv/linux/x86/elide.h        | 81 ++++++++++++++++++++++
 nptl/sysdeps/unix/sysv/linux/x86/elision-conf.c |  2 +
 9 files changed, 127 insertions(+), 9 deletions(-)
 create mode 100644 nptl/sysdeps/unix/sysv/linux/x86/elide.h