Message ID: 1414619416.10085.46.camel@triegel.csb
State: New
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> This patch adds atomic operations similar to C11.
>
> The function naming is essentially the C11 names, but with the memory
> order argument removed and added as a suffix.  For example, C11's
>   atomic_store_explicit(&foo, 23, memory_order_release)
> becomes
>   atomic_store_relaxed (&foo, 23);

As previously discussed, I'm concerned about the explicit relaxed loads
and stores being defined in terms of __atomic_* (see
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until
__atomic_* implements relaxed atomics as plain loads/stores not inhibiting
optimization (as far as compatible with standard semantics), as evidenced
by the change not significantly changing code generated for uses in glibc,
I think the glibc implementation should be using plain loads and stores
rather than __atomic_*.
On Wed, 2014-10-29 at 22:00 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
>
> > This patch adds atomic operations similar to C11.
> >
> > The function naming is essentially the C11 names, but with the memory
> > order argument removed and added as a suffix.  For example, C11's
> >   atomic_store_explicit(&foo, 23, memory_order_release)
> > becomes
> >   atomic_store_relaxed (&foo, 23);
>
> As previously discussed, I'm concerned about the explicit relaxed loads
> and stores being defined in terms of __atomic_* (see
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until
> __atomic_* implements relaxed atomics as plain loads/stores not inhibiting
> optimization (as far as compatible with standard semantics), as evidenced
> by the change not significantly changing code generated for uses in glibc,
> I think the glibc implementation should be using plain loads and stores
> rather than __atomic_*.

Yes, see Patch 4/4 for how this looks with pthread_once.  (You review
faster than I can write emails -- nice! :)

I've chosen to keep the builtins for relaxed loads/stores for now, but
we can revisit later when we have transformed more code.
On Wed, 2014-10-29 at 22:50 +0100, Torvald Riegel wrote:
> This patch adds atomic operations similar to C11.
>
> The function naming is essentially the C11 names, but with the memory
> order argument removed and added as a suffix.  For example, C11's
>   atomic_store_explicit(&foo, 23, memory_order_release)
> becomes
>   atomic_store_relaxed (&foo, 23);

Typo, sorry.  The latter should in fact be
  atomic_store_release (&foo, 23);
On Wed, 2014-10-29 at 22:00 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
>
> > This patch adds atomic operations similar to C11.
> >
> > The function naming is essentially the C11 names, but with the memory
> > order argument removed and added as a suffix.  For example, C11's
> >   atomic_store_explicit(&foo, 23, memory_order_release)
> > becomes
> >   atomic_store_relaxed (&foo, 23);
>
> As previously discussed, I'm concerned about the explicit relaxed loads
> and stores being defined in terms of __atomic_* (see
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until
> __atomic_* implements relaxed atomics as plain loads/stores not inhibiting
> optimization (as far as compatible with standard semantics), as evidenced
> by the change not significantly changing code generated for uses in glibc,
> I think the glibc implementation should be using plain loads and stores
> rather than __atomic_*.

Let me reply in some more detail.

First, do you agree that we need to make the compiler aware of
concurrency?  For example, it would be bad if the compiler assumes that
it can safely reload from an atomic variable just because it was able to
prove that the loading thread didn't change it in the meantime.

If we assume that, we can either (1) use __atomic* and check all the
generated code, or (2) use inline asm, or (3) use volatile inline asm.
Any other options?  Plain loads will not reliably make the compiler
aware that it has to take concurrent accesses into account.

That might also mean that atomic_store_relaxed should actually use
inline asm (see the comment in the patch).  Thoughts?

However, I would guess that we won't be really affected by 63273 anyway.
The triggering usage there was very special in that the sanitizer
generates loads of relaxed atomic accesses, and just that.  That's not
what we have in typical glibc code.

If we use a relaxed access, it's either (1) in front of a CAS, so we'll
have an optimization-constraining operation close by anyway, or (2) it's
in combination with an explicit fence next to it (Dekker sync, relaxed
load + acquire fence, etc.), so it's likely that it can't be optimized
as freely anyway.

Are there any other examples where the lack of optimizations of relaxed
accesses in typical concurrent code was really decreasing performance
(i.e., ignoring the sanitizer thing and non-optimized code such as maybe
comes out of templates that are *expected* to be optimized)?
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> First, do you agree that we need to make the compiler aware of
> concurrency?  For example, it would be bad if the compiler assumes that
> it can safely reload from an atomic variable just because it was able to
> prove that the loading thread didn't change it in the meantime.

I believe that the code already makes it explicit where such reloading
would be problematic, either (a) using asm with appropriate clobbers so
the compiler doesn't know the value is unchanged or (b) using volatile.
In principle atomics could avoid optimization limitations from use of
volatile (and allow more code motion than might be allowed by the asms),
but it's not clear if they do improve on volatile at present.

> If we assume that, we can either (1) use __atomic* and check all the
> generated code, or (2) use inline asm, or (3) use volatile inline asm.
> Any other options?  Plain loads will not reliably make the compiler
> aware that it has to take concurrent accesses into account.

As noted, I think appropriate asms or volatile are already in place where
this is an issue for a plain load.

> That might also mean that atomic_store_relaxed should actually use
> inline asm (see the comment in the patch).  Thoughts?

Well, I don't see a need to supplement __atomic_* with inline asm.  In
the case of not using __atomic_*, if you find you are replacing places
with such asms (and there aren't that many) then the replacement may need
such an asm, but if replacing other places it may be less obvious if an
asm is needed (or if it will make any code generation difference; if not,
it should be safe).
On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
>
> > First, do you agree that we need to make the compiler aware of
> > concurrency?  For example, it would be bad if the compiler assumes that
> > it can safely reload from an atomic variable just because it was able to
> > prove that the loading thread didn't change it in the meantime.
>
> I believe that the code already makes it explicit where such reloading
> would be problematic,

Why do you think this is the case?  There is an atomic_forced_read, for
example, but no atomic_read_exactly_once.

> either (a) using asm with appropriate clobbers so
> the compiler doesn't know the value is unchanged

Do you have any examples for this where the clobber is not due to a
compiler barrier used for acquire or release fences?

> or (b) using volatile.

But volatile does not guarantee atomicity.  I also think that current
compilers are probably unlikely to split a load or store into different
parts, but can we be sure about this?  What if, for example, we have
adjacent atomic variables that are loaded from, and they happen to be
not perfectly aligned (e.g., that's fine for x86 atomic accesses IIRC),
and the compiler merges some of the accesses and thus splits the last
one?

Can't we just do the right thing (and safe thing!) and tell the compiler
that this is an atomic access?

> In principle atomics could avoid optimization limitations from use of
> volatile (and allow more code motion than might be allowed by the asms),
> but it's not clear if they do improve on volatile at present.

(I suppose you mean relaxed MO atomic loads/stores.  For everything else
I hope it's clear that atomics are an improvement...)

But I also see no real evidence that using atomics for relaxed
loads/stores would hurt at all.  The sanitizer use case in the BZ you
cited is arguably special.  The pthread_once case didn't show any
difference on the fast path.  I can test with a __atomic*-providing
compiler with and without using __atomic to look for differences, if
that would help you.

> > If we assume that, we can either (1) use __atomic* and check all the
> > generated code, or (2) use inline asm, or (3) use volatile inline asm.
> > Any other options?  Plain loads will not reliably make the compiler
> > aware that it has to take concurrent accesses into account.
>
> As noted, I think appropriate asms or volatile are already in place where
> this is an issue for a plain load.

We don't even have correct barriers everywhere, so I don't really have a
reason to agree to that.

Can we at least agree on having all glibc code use our own
atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
still argue whether to use __atomic* to implement these.  But at least
we can easily switch to __atomic in the future.  Any objections?
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> > On Wed, 29 Oct 2014, Torvald Riegel wrote:
> >
> > > First, do you agree that we need to make the compiler aware of
> > > concurrency?  For example, it would be bad if the compiler assumes that
> > > it can safely reload from an atomic variable just because it was able to
> > > prove that the loading thread didn't change it in the meantime.
> >
> > I believe that the code already makes it explicit where such reloading
> > would be problematic,
>
> Why do you think this is the case?  There is an atomic_forced_read, for
> example, but no atomic_read_exactly_once.

See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads
happen in a particular order:

      size_t n = scope->r_nlist;
      /* Make sure we read the value before proceeding.  Otherwise we
         might use r_list pointing to the initial scope and r_nlist being
         the value after a resize.  That is the only path in dl-open.c not
         protected by GSCOPE.  A read barrier here might be to expensive.  */
      __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
      struct link_map **list = scope->r_list;

(There are lots of cases of asms being used - via macros - to control
exactly when floating-point evaluations take place and associated
exceptions are raised, but that's not relevant to atomics.  In general,
any asm of an empty string is probably doing something like this.)

> > either (a) using asm with appropriate clobbers so
> > the compiler doesn't know the value is unchanged
>
> Do you have any examples for this where the clobber is not due to a
> compiler barrier used for acquire or release fences?

I don't understand the question.

> > or (b) using volatile.
>
> But volatile does not guarantee atomicity.  I also think that current

The point is to guarantee ordering (with the atomicity of aligned word
loads / stores being part of the GNU C language semantics required for
building glibc).

> But I also see no real evidence that using atomics for relaxed
> loads/stores would hurt at all.  The sanitizer use case in the BZ you
> cited is arguably special.  The pthread_once case didn't show any
> difference on the fast path.  I can test with a __atomic*-providing
> compiler with and without using __atomic to look for differences, if
> that would help you.

Yes.  We may find in particular cases (possibly all of them) that the
change does not have significant code generation consequences, but given
the known bug we should actually check that in each case rather than
presuming it.

> Can we at least agree on having all glibc code use our own
> atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> still argue whether to use __atomic* to implement these.  But at least
> we can easily switch to __atomic in the future.  Any objections?

I think explicit atomic loads / stores are fine, subject to checking code
generation before making them use __atomic_*.
On Thu, 2014-10-30 at 17:17 +0000, Joseph S. Myers wrote:
> On Thu, 30 Oct 2014, Torvald Riegel wrote:
>
> > On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> > > On Wed, 29 Oct 2014, Torvald Riegel wrote:
> > >
> > > > First, do you agree that we need to make the compiler aware of
> > > > concurrency?  For example, it would be bad if the compiler assumes that
> > > > it can safely reload from an atomic variable just because it was able to
> > > > prove that the loading thread didn't change it in the meantime.
> > >
> > > I believe that the code already makes it explicit where such reloading
> > > would be problematic,
> >
> > Why do you think this is the case?  There is an atomic_forced_read, for
> > example, but no atomic_read_exactly_once.
>
> See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads
> happen in a particular order:
>
>       size_t n = scope->r_nlist;
>       /* Make sure we read the value before proceeding.  Otherwise we
>          might use r_list pointing to the initial scope and r_nlist being
>          the value after a resize.  That is the only path in dl-open.c not
>          protected by GSCOPE.  A read barrier here might be to expensive.  */
>       __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
>       struct link_map **list = scope->r_list;

That ensures that the compiler won't reorder the reads -- but it does
*not* ensure that reads happen in a particular order when executed on an
arch with a weak memory model.  If we don't care about HW reordering, we
won't need the asm (unless perhaps we communicate with external stuff
through volatile, or do something behind the compiler's back), because
the language semantics for sequential code are taking care of that.

Thus, my guess is that this is a bug.

> (There are lots of cases of asms being used - via macros - to control
> exactly when floating-point evaluations take place and associated
> exceptions are raised, but that's not relevant to atomics.  In general,
> any asm of an empty string is probably doing something like this.)
>
> > > either (a) using asm with appropriate clobbers so
> > > the compiler doesn't know the value is unchanged
> >
> > Do you have any examples for this where the clobber is not due to a
> > compiler barrier used for acquire or release fences?
>
> I don't understand the question.

The fences consist of a compiler barrier and, depending on the arch,
potentially a real HW fence instruction.  The example you gave above is
a compiler barrier, but it looks like it should actually be an acquire
fence.

> > > or (b) using volatile.
> >
> > But volatile does not guarantee atomicity.  I also think that current
>
> The point is to guarantee ordering

volatile does not guarantee ordering in a concurrent setting.  The
guarantee is that the loads/stores will be issued in exactly the same
order as if the abstract machine would execute the program -- but this
doesn't constrain visibility to other threads, or what other threads
observe.

> (with the atomicity of aligned word
> loads / stores being part of the GNU C language semantics required for
> building glibc).

Ok.  We can continue to rely on such things, but we can also just
embrace that C11 now has an actual memory model for concurrent code.

> > But I also see no real evidence that using atomics for relaxed
> > loads/stores would hurt at all.  The sanitizer use case in the BZ you
> > cited is arguably special.  The pthread_once case didn't show any
> > difference on the fast path.  I can test with a __atomic*-providing
> > compiler with and without using __atomic to look for differences, if
> > that would help you.
>
> Yes.  We may find in particular cases (possibly all of them) that the
> change does not have significant code generation consequences, but given
> the known bug we should actually check that in each case rather than
> presuming it.

I can't argue against being conservative by default, but I also can't
imagine this being an actual problem for typical synchronizing code.

> > Can we at least agree on having all glibc code use our own
> > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > still argue whether to use __atomic* to implement these.  But at least
> > we can easily switch to __atomic in the future.  Any objections?
>
> I think explicit atomic loads / stores are fine, subject to checking code
> generation before making them use __atomic_*.

How do you envision to do this?

Compare generated code on all archs before and after a change, for each
algorithm on its own?  Do the C11 transition, and then compare all code?
Look at a few algorithms to see whether it makes a difference, and then
see whether we have anecdotal evidence that it might affect performance?
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> > See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads
> > happen in a particular order:
> >
> >       size_t n = scope->r_nlist;
> >       /* Make sure we read the value before proceeding.  Otherwise we
> >          might use r_list pointing to the initial scope and r_nlist being
> >          the value after a resize.  That is the only path in dl-open.c not
> >          protected by GSCOPE.  A read barrier here might be to expensive.  */
> >       __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
> >       struct link_map **list = scope->r_list;
>
> That ensures that the compiler won't reorder the reads -- but it does
> *not* ensure that reads happen in a particular order when executed on an
> arch with a weak memory model.  If we don't care about HW reordering, we
> won't need the asm (unless perhaps we communicate with external stuff
> through volatile, or do something behind the compiler's back), because
> the language semantics for sequential code are taking care of that.
>
> Thus, my guess is that this is a bug.

My guess is that on the system where a problem reordering was observed, it
was sufficient to stop the compiler from reordering the loads (but there
might be issues for some other system).  And this sort of code, where
there aren't other atomic operations nearby, is a case where we'd need to
be particularly careful about checking the performance impact of inserting
barriers.

> > > Can we at least agree on having all glibc code use our own
> > > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > > still argue whether to use __atomic* to implement these.  But at least
> > > we can easily switch to __atomic in the future.  Any objections?
> >
> > I think explicit atomic loads / stores are fine, subject to checking code
> > generation before making them use __atomic_*.
>
> How do you envision to do this?
>
> Compare generated code on all archs before and after a change, for each
> algorithm on its own?

Comparing code for one architecture, at the time of each conversion, seems
appropriate (with additional benchmarking for cases that don't otherwise
involve atomics).
On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> On Thu, 30 Oct 2014, Torvald Riegel wrote:
>
> > > See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads
> > > happen in a particular order:
> > >
> > >       size_t n = scope->r_nlist;
> > >       /* Make sure we read the value before proceeding.  Otherwise we
> > >          might use r_list pointing to the initial scope and r_nlist being
> > >          the value after a resize.  That is the only path in dl-open.c not
> > >          protected by GSCOPE.  A read barrier here might be to expensive.  */
> > >       __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
> > >       struct link_map **list = scope->r_list;
> >
> > That ensures that the compiler won't reorder the reads -- but it does
> > *not* ensure that reads happen in a particular order when executed on an
> > arch with a weak memory model.  If we don't care about HW reordering, we
> > won't need the asm (unless perhaps we communicate with external stuff
> > through volatile, or do something behind the compiler's back), because
> > the language semantics for sequential code are taking care of that.
> >
> > Thus, my guess is that this is a bug.
>
> My guess is that on the system where a problem reordering was observed, it
> was sufficient to stop the compiler from reordering the loads (but there
> might be issues for some other system).  And this sort of code, where
> there aren't other atomic operations nearby, is a case where we'd need to
> be particularly careful about checking the performance impact of inserting
> barriers.

But if there are issues for other archs, we need to insert a barrier
anyway, agreed?  If we want to write generic, non-arch-specific code,
this would be a generic barrier too -- which, in turn, would boil down
to a normal compiler barrier (i.e., "memory").

We could think about adding specific versions of acquire barriers that
do something like the asm above in that the compiler barrier part just
applies to certain memory locations -- but I doubt that's really worth
it.  If we want to avoid the barrier, we can try other ways to do that.

This piece of code is also similar to the pthread_once fastpath, in that
it's an acquire barrier surrounded by loads.

> > > > Can we at least agree on having all glibc code use our own
> > > > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > > > still argue whether to use __atomic* to implement these.  But at least
> > > > we can easily switch to __atomic in the future.  Any objections?
> > >
> > > I think explicit atomic loads / stores are fine, subject to checking code
> > > generation before making them use __atomic_*.
> >
> > How do you envision to do this?
> >
> > Compare generated code on all archs before and after a change, for each
> > algorithm on its own?
>
> Comparing code for one architecture, at the time of each conversion, seems
> appropriate (with additional benchmarking for cases that don't otherwise
> involve atomics).

How do we handle bug fixes?  When fixing a bug, do you want performance
parity too?

By "involve atomics", do you mean specifically atomic read-modify-write
ops, so atomic ops other than just loads and stores?
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> > My guess is that on the system where a problem reordering was observed, it
> > was sufficient to stop the compiler from reordering the loads (but there
> > might be issues for some other system).  And this sort of code, where
> > there aren't other atomic operations nearby, is a case where we'd need to
> > be particularly careful about checking the performance impact of inserting
> > barriers.
>
> But if there are issues for other archs, we need to insert a barrier
> anyway, agreed?  If we want to write generic, non-arch-specific code,

Yes, subject to considering whether there's any more efficient approach
available.

> > Comparing code for one architecture, at the time of each conversion, seems
> > appropriate (with additional benchmarking for cases that don't otherwise
> > involve atomics).
>
> How do we handle bug fixes?  When fixing a bug, do you want performance
> parity too?

No, but if the fix is in a critical path we should look more carefully at
what the most efficient way to achieve the fix is.

> By "involve atomics", do you mean specifically atomic read-modify-write
> ops, so atomic ops other than just loads and stores?

Yes.
On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> Comparing code for one architecture, at the time of each conversion, seems
> appropriate (with additional benchmarking for cases that don't otherwise
> involve atomics).

So, if you want this code comparison as default, we can as well start
with using relaxed __atomic* loads and stores instead of first using
plain loads and stores, agreed?

Do you have any other feedback on the patch set?
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> > Comparing code for one architecture, at the time of each conversion, seems
> > appropriate (with additional benchmarking for cases that don't otherwise
> > involve atomics).
>
> So, if you want this code comparison as default, we can as well start
> with using relaxed __atomic* loads and stores instead of first using
> plain loads and stores, agreed?

Yes.

> Do you have any other feedback on the patch set?

No.
commit 5921ff5b989972b744f5cb7d8f4f0b16c04ab19e
Author: Torvald Riegel <triegel@redhat.com>
Date:   Sun Sep 14 20:04:54 2014 +0200

    Add atomic operations similar to those provided by C11.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..c4df8f1 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -542,6 +542,218 @@
   ({ __typeof (x) __x; __asm ("" : "=r" (__x) : "0" (x)); __x; })
 #endif
 
+/* This is equal to 1 iff the architecture supports 64b atomic operations.  */
+#ifndef __HAVE_64B_ATOMICS
+#error Unable to determine if 64-bit atomics are present.
+#endif
+
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.  */
+
+/* Each arch can request to use compiler built-ins for C11 atomics.  If it
+   does, all atomics will be based on these.  */
+#if USE_ATOMIC_COMPILER_BUILTINS
+
+/* We require 32b atomic operations; some archs also support 64b atomic
+   operations.  */
+void __atomic_link_error (void);
+# if __HAVE_64B_ATOMICS == 1
+# define __atomic_check_size(mem) \
+   if ((sizeof (*mem) != 4) && (sizeof (*mem) != 8)) \
+     __atomic_link_error ();
+# else
+# define __atomic_check_size(mem) \
+   if (sizeof (*mem) != 4) \
+     __atomic_link_error ();
+# endif
+
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define atomic_load_relaxed(mem) \
+  ({ __atomic_check_size((mem)); __atomic_load_n ((mem), __ATOMIC_RELAXED); })
+# define atomic_load_acquire(mem) \
+  ({ __atomic_check_size((mem)); __atomic_load_n ((mem), __ATOMIC_ACQUIRE); })
+
+# define atomic_store_relaxed(mem, val) \
+  do { \
+    __atomic_check_size((mem)); \
+    __atomic_store_n ((mem), (val), __ATOMIC_RELAXED); \
+  } while (0)
+# define atomic_store_release(mem, val) \
+  do { \
+    __atomic_check_size((mem)); \
+    __atomic_store_n ((mem), (val), __ATOMIC_RELEASE); \
+  } while (0)
+
+/* On failure, this CAS has memory_order_relaxed semantics.  */
+# define atomic_compare_exchange_weak_relaxed(mem, expected, desired) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1, \
+    __ATOMIC_RELAXED, __ATOMIC_RELAXED); })
+# define atomic_compare_exchange_weak_acquire(mem, expected, desired) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1, \
+    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED); })
+# define atomic_compare_exchange_weak_release(mem, expected, desired) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1, \
+    __ATOMIC_RELEASE, __ATOMIC_RELAXED); })
+
+# define atomic_exchange_acquire(mem, desired) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_exchange_n ((mem), (desired), __ATOMIC_ACQUIRE); })
+# define atomic_exchange_release(mem, desired) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_exchange_n ((mem), (desired), __ATOMIC_RELEASE); })
+
+# define atomic_fetch_add_relaxed(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELAXED); })
+# define atomic_fetch_add_acquire(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQUIRE); })
+# define atomic_fetch_add_release(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELEASE); })
+# define atomic_fetch_add_acq_rel(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQ_REL); })
+
+# define atomic_fetch_and_acquire(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_and ((mem), (operand), __ATOMIC_ACQUIRE); })
+
+# define atomic_fetch_or_relaxed(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_or ((mem), (operand), __ATOMIC_RELAXED); })
+# define atomic_fetch_or_acquire(mem, operand) \
+  ({ __atomic_check_size((mem)); \
+  __atomic_fetch_or ((mem), (operand), __ATOMIC_ACQUIRE); })
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.  */
+# ifndef atomic_thread_fence_acquire
+# define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+# define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+# define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+# define atomic_load_relaxed(mem) \
+   ({ __typeof (*(mem)) __atg100_val; \
+   __asm ("" : "=r" (__atg100_val) : "0" (*(mem))); \
+   __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+# define atomic_load_acquire(mem) \
+   ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem); \
+   atomic_thread_fence_acquire (); \
+   __atg101_val; })
+# endif
+
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here?  */
+# define atomic_store_relaxed(mem, val) do { *(mem) = (val); } while (0)
+# endif
+# ifndef atomic_store_release
+# define atomic_store_release(mem, val) \
+   do { \
+     atomic_thread_fence_release (); \
+     atomic_store_relaxed ((mem), (val)); \
+   } while (0)
+# endif
+
+/* On failure, this CAS has memory_order_relaxed semantics.  */
+/* XXX This potentially has one branch more than necessary, but archs
+   currently do not define a CAS that returns both the previous value and
+   the success flag.  */
+# ifndef atomic_compare_exchange_weak_acquire
+# define atomic_compare_exchange_weak_acquire(mem, expected, desired) \
+   ({ typeof (*(expected)) __atg102_expected = *(expected); \
+   *(expected) = \
+     atomic_compare_and_exchange_val_acq ((mem), (desired), *(expected)); \
+   *(expected) == __atg102_expected; })
+# endif
+# ifndef atomic_compare_exchange_weak_relaxed
+/* XXX Fall back to CAS with acquire MO because archs do not define a weaker
+   CAS.  */
+# define atomic_compare_exchange_weak_relaxed(mem, expected, desired) \
+   atomic_compare_exchange_weak_acquire ((mem), (expected), (desired))
+# endif
+# ifndef atomic_compare_exchange_weak_release
+# define atomic_compare_exchange_weak_release(mem, expected, desired) \
+   ({ typeof (*(expected)) __atg103_expected = *(expected); \
+   *(expected) = \
+     atomic_compare_and_exchange_val_rel ((mem), (desired), *(expected)); \
+   *(expected) == __atg103_expected; })
+# endif
+
+# ifndef atomic_exchange_acquire
+# define atomic_exchange_acquire(mem, val) \
+   atomic_exchange_acq ((mem), (val))
+# endif
+# ifndef atomic_exchange_release
+# define atomic_exchange_release(mem, val) \
+   atomic_exchange_rel ((mem), (val))
+# endif
+
+# ifndef atomic_fetch_add_acquire
+# define atomic_fetch_add_acquire(mem, operand) \
+   atomic_exchange_and_add_acq ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_relaxed
+/* XXX Fall back to acquire MO because the MO semantics of
+   atomic_exchange_and_add are not documented; the generic version falls back
+   to atomic_exchange_and_add_acq if atomic_exchange_and_add is not defined,
+   and vice versa.  */
+# define atomic_fetch_add_relaxed(mem, operand) \
+   atomic_fetch_add_acquire ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_release
+# define atomic_fetch_add_release(mem, operand) \
+   atomic_exchange_and_add_rel ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_acq_rel
+# define atomic_fetch_add_acq_rel(mem, operand) \
+   ({ atomic_thread_fence_release (); \
+   atomic_exchange_and_add_acq ((mem), (operand)); })
+# endif
+
+/* XXX The default for atomic_and_val has acquire semantics, but this is not
+   documented.  */
+# ifndef atomic_fetch_and_acquire
+# define atomic_fetch_and_acquire(mem, operand) \
+   atomic_and_val ((mem), (operand))
+# endif
+
+/* XXX The default for atomic_or_val has acquire semantics, but this is not
+   documented.  */
+# ifndef atomic_fetch_or_acquire
+# define atomic_fetch_or_acquire(mem, operand) \
+   atomic_or_val ((mem), (operand))
+# endif
+/* XXX Fall back to acquire MO because archs do not define a weaker
+   atomic_or_val.  */
+# ifndef atomic_fetch_or_relaxed
+# define atomic_fetch_or_relaxed(mem, operand) \
+   atomic_fetch_or_acquire ((mem), (operand))
+# endif
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)