Message ID: 1410719669.4967.160.camel@triegel.csb
On 09/14/2014 08:34 PM, Torvald Riegel wrote:
> I think we should transition to using the C11 memory model and atomics
> instead of the current custom implementation. There are two reasons for
> this:

What's the quality of the C11 memory specification? The things I've seen so far do not inspire confidence, e.g.:

“The value of an object visible to a thread T at a particular point is the initial value of the object, a value stored in the object by T, or a value stored in the object by another thread, according to the rules below.” (5.1.2.4/7, perhaps just an “atomic” is missing somewhere.)

“Unlike memset, any call to the memset_s function shall be evaluated strictly according to the rules of the abstract machine as described in (5.1.2.3).” (K.3.7.4.1/4, if this means anything at all, it seems to require suspending all other threads in the program)
On Mon, 2014-09-15 at 15:56 +0200, Florian Weimer wrote: > On 09/14/2014 08:34 PM, Torvald Riegel wrote: > > I think we should transition to using the C11 memory model and atomics > > instead of the current custom implementation. There are two reasons for > > this: > > What's the quality of the C11 memory specification? The specification's quality is high. The only open issue I'm aware of is that there's no proper wording currently to disallow out-of-thin-air values -- but all the implementers agree that those shouldn't be allowed to happen by an implementation, and this is something C++ SG1 is working on. There's a formalization of the model[1], the cppmem tool I mentioned in my previous email based on that formalization that enumerates all the behaviors allowed by a piece of C11/C++11 code, there are formalizations of HW models (eg, Power) and proofs for how to implement the C11/C++11 model based on them [2]. [1] http://www.cl.cam.ac.uk/~mjb220/n3132.pdf [2] http://www.cl.cam.ac.uk/~pes20/ > The things I've > seen so far do not inspire confidence, e.g.: > > “The value of an object visible to a thread T at a particular point is > the initial value of the object, a value stored in the object by T, or a > value stored in the object by another thread, according to the rules > below.” (5.1.2.4/7, perhaps just an “atomic” is missing somewhere.) That's fine, because there are more "rules below" :) Those specify which values are visible, they require data-race freedom, etc. > “Unlike memset, any call to the memset_s function shall be evaluated > strictly according to the rules of the abstract machine as described in > (5.1.2.3).” (K.3.7.4.1/4, if this means anything at all, it seems to > require suspending all other threads in the program) Even the abstract machine follows the memory model. memset isn't specified to be atomic, so it can participate in data races, so essentially the programmer has to make sure that there are no data races. 
I suggest looking at the formalization of the model[1]. If you're interested in understanding things in detail, that's the resource I'd start with.
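The formalization-backed tooling is easiest to appreciate on a small litmus test. Below is a minimal C11 message-passing example of the kind cppmem enumerates the allowed outcomes for; the code and names are illustrative only, not from the thread:

```c
/* Message-passing litmus test: under the C11 model, the acquire load
   that observes the release store also makes the preceding plain store
   to 'data' visible, so the consumer must read 42 and there is no data
   race on 'data'.  */
#include <stdatomic.h>
#include <pthread.h>

static int data;            /* plain, non-atomic payload */
static atomic_int flag;     /* synchronizes access to data */

static void *
producer (void *arg)
{
  (void) arg;
  data = 42;                                          /* plain store */
  atomic_store_explicit (&flag, 1, memory_order_release);
  return NULL;
}

static void *
consumer (void *arg)
{
  (void) arg;
  /* Spin until the release store becomes visible.  */
  while (atomic_load_explicit (&flag, memory_order_acquire) == 0)
    ;
  return (void *) (long) data;
}

int
run_litmus (void)
{
  pthread_t p, c;
  void *res;
  data = 0;
  atomic_store_explicit (&flag, 0, memory_order_relaxed);
  pthread_create (&p, NULL, producer, NULL);
  pthread_create (&c, NULL, consumer, NULL);
  pthread_join (p, NULL);
  pthread_join (c, &res);
  return (int) (long) res;   /* must be 42 under the C11 rules */
}
```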
Torvald, Thanks for the email, very good questions. David, SPARC pre-v9 question at the bottom for you. On 09/14/2014 02:34 PM, Torvald Riegel wrote: > I think we should transition to using the C11 memory model and atomics > instead of the current custom implementation. There are two reasons for > this: Architecturally I think that glibc transitioning to the C11 memory model is the *only* way forward. > I propose that our phase in transitioning to C11 is to focus on uses of > the atomic operations. In particular, the rules are: > > * All accesses to atomic vars need to use atomic_* functions. IOW, all > non-atomic accesses are not subject to data races. The only exceptions > is initialization (ie, when the variable is not visible to any other > thread); nonetheless, initialization accesses must not result in data > races with other accesses. (This exception isn't allowed by C11, but > eases the transition to C11 atomics and likely works fine in current > implementations; as alternative, we could require MO-relaxed stores for > initialization as well.) At present we rely on small word-length writes to complete atomically, would you suggest we have to wrap those in true atomic operations? Won't this hurt performance? What correctness issue exists? > * Atomic vars aren't explicitly annotated with atomic types, but just > use the base types. They need to be naturally aligned. This makes the > transition easier because we don't get any dependencies on C11 atomic > types. OK. > * On a certain architecture, we typically only use the atomic_* ops if > the HW actually supports these; we expect to have pointer-sized atomics > at most. If the arch has no native support for atomics, it can either > use modified algorithms or emulate atomics differently. I strongly suggest all such machines should emulate atomics in the kernel using kernel-level locks. 
The downside of this is that all atomic vars must use atomic_* functions because otherwise the release of the lock word by a normal store won't order correctly. This already happened on hppa with userspace spinlocks. > * The atomic ops are similar to the _explicit variation of C11's > functions, except that _explicit is replaced with the last part of the > MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst). All > arguments (except the MO, which is dropped) are the same as for C11. > That avoids using the same names yet should make the names easy to > understand for people familiar with C11. OK. > I also propose an incremental transition. In particular, the steps are > roughly: > > 1) Add new C11-like atomics. If GCC supports them on this architecture, > use GCC's atomic builtins. Make them fall back to the existing atomics > otherwise. Attached is a small patch that illustrates this. OK. > 2) Refactor one use (ie, all the accesses belonging to one algorithm or > group of functions that synchronize with each other) at a time. This > involves reviewing the code and basically reimplementing the > synchronization bits in on top of the C11 memory model. We also should > take this opportunity to add any documentation of concurrent code that's > missing (which is often the case). Not OK until we talk about it more. > 3) For non-standard atomic ops (eg, atomic_add_negative()), have a look > at all uses and decide whether we really need to keep them. Agreed e.g. rewrite. > 4) Once all of glibc uses the new atomics, remove the old ones for a > particular arch if the oldest compiler required has support for the > respective builtins. OK. > Open questions: > > * Are the current read/write memory barriers equivalent to C11 > acquire/release fences? I guess that's the case (did I mention lack of > documentation? 
;) ), but we should check whether this is true on every > architecture (ie, whether the HW instructions used for read/write > membars are the same as what the compiler would use for > acquire/release). If not, we can't implement acquire/release based on > read/write membars but need something else for this arch. I'd > appreciate help from the machine maintainers for this one. Don't know. Create an internals manual? Add a new chapter on atomics? :-) > * How do we deal with archs such as older SPARC that don't have CAS and > other archs without HW support for atomics? Using modified algorithms > should be the best-performing option (eg, if we can use one critical > section instead of a complicated alternative that uses lots of atomic > operations). However, that means we'll have to maintain more algorithms > (even if they might be simpler). No. Stop. One algorithm. All arches that can't meet the HW support for atomics must enter the kernel and do the work there. This is just like hppa and ARM do. They use a light-weight syscall mechanism and serialize in the kernel. > Furthermore, do all uses of atomics work well with blocking atomics that > might also not be indivisible steps? For example, the cancellation code > might be affected because a blocking emulation of atomics won't be > async-cancel-safe? It will be safe because the kernel emulation should not deliver a signal during the emulation. > * Which of the catomic_ variants do we really need? Similarly to the > non-native atomics case, we often might be better off which running a > slightly different nonatomic (or just nonsynchronizing) algorithm in the > first place. We'll have to review all the uses to be able to tell. Don't know. > Thoughts? Any feedback and help is welcome! Cheers, Carlos.
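As a sketch of what the first rule means in practice, here is a hypothetical version of the proposed macros mapped onto GCC's __atomic builtins. Only the naming scheme (MO folded into the name, C11-like arguments) comes from the proposal; the exact definitions are an assumption:

```c
/* Hypothetical glibc-style wrappers: C11 _explicit operations with the
   memory order folded into the name, implemented via GCC builtins.  */
#define atomic_load_relaxed(mem) \
  __atomic_load_n ((mem), __ATOMIC_RELAXED)
#define atomic_load_acquire(mem) \
  __atomic_load_n ((mem), __ATOMIC_ACQUIRE)
#define atomic_store_relaxed(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELAXED)
#define atomic_store_release(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELEASE)

/* Per the proposal, atomic variables keep their plain base types (no
   _Atomic), but every concurrent access goes through an atomic_* macro
   so the compiler knows which locations are accessed concurrently.  */
static int flag;

int
demo (void)
{
  atomic_store_relaxed (&flag, 23);    /* was: flag = 23; */
  return atomic_load_relaxed (&flag);  /* was: return flag; */
}
```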
On Sun, 14 Sep 2014, Torvald Riegel wrote:
> 1) Add new C11-like atomics. If GCC supports them on this architecture,
> use GCC's atomic builtins. Make them fall back to the existing atomics
> otherwise. Attached is a small patch that illustrates this.

I'm not clear on how you propose to determine whether to define USE_ATOMIC_COMPILER_BUILTINS. The obvious approach of testing __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 (etc.), together with making sure the GCC version is at least 4.7, runs into architecture-specific problems:

* While the builtins are available in 4.7, on some architectures they may be suboptimal before a later version (see the MIPS code preferring to use them only for 4.8 and later, but using them for 4.7 in the MIPS16 case; generally this applies if the GCC support was originally implemented for __sync_*, with its stronger barrier requirements, but the architecture does support specifying the required memory ordering more precisely, but the GCC support for that was added later).

* In some cases those macros are defined by GCC even though the operations are out of line in libgcc. On PA the HAVE_sync_compare_and_swap* macros are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to be predefined; for Tile GCC has special target-specific code to define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*. Do we believe such out-of-line libgcc versions will work in glibc, and is it preferable to use them or inline asm? (My guess is that they might work except for cases such as the 64-bit ARM atomics with references to libc interfaces.)

So if you have a default based on those macros, you'd need to allow for architectures overriding it.
On Mon, 2014-09-15 at 16:51 +0000, Joseph S. Myers wrote: > On Sun, 14 Sep 2014, Torvald Riegel wrote: > > > 1) Add new C11-like atomics. If GCC supports them on this architecture, > > use GCC's atomic builtins. Make them fall back to the existing atomics > > otherwise. Attached is a small patch that illustrates this. > > I'm not clear on how you propose to determine whether to define > USE_ATOMIC_COMPILER_BUILTINS. Yes, sorry for leaving that unspecified. I intended this flag to be arch-specific, not a global setting; in fact, the comments you previously made on the subject motivated that. We can handle most of the transition to C11 atomics in a non-arch-specific way (because the memory model itself is portable). But for the arch-specific changes I'd really appreciate input / help from the machine maintainers. > * In some cases those macros are defined by GCC even though the operations > are out of line in libgcc. On PA the HAVE_sync_compare_and_swap* macros > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to > be predefined; for Tile GCC has special target-specific code to define > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*. Do we believe such out-of-line libgcc > versions will work in glibc, and is it preferable to use them or inline > asm? (My guess is that they might work except for cases such as the > 64-bit ARM atomics with references to libc interfaces.) I believe they should work from a correctness perspective, but might have a slight performance disadvantage. I'm not sure I understand the ARM case you mention.
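A rough illustration of what such an arch-overridable default might look like. Deriving the flag from __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 is exactly the "obvious approach" Joseph cautions about, and the fallback function here is a made-up, non-atomic stand-in for an arch's existing asm implementation, just to keep the sketch self-contained:

```c
/* Sketch only: a per-arch header would set USE_ATOMIC_COMPILER_BUILTINS
   itself; the predefine test below is the default that archs would need
   to be able to override.  */
#if defined __GNUC__ && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
# define USE_ATOMIC_COMPILER_BUILTINS 1
#else
# define USE_ATOMIC_COMPILER_BUILTINS 0
#endif

#if USE_ATOMIC_COMPILER_BUILTINS
# define atomic_exchange_acquire(mem, val) \
   __atomic_exchange_n ((mem), (val), __ATOMIC_ACQUIRE)
#else
/* Placeholder for the arch's existing asm-based exchange; NOT atomic,
   purely illustrative.  */
static inline int
exchange_acq_fallback (int *mem, int val)
{
  int old = *mem;
  *mem = val;
  return old;
}
# define atomic_exchange_acquire(mem, val) \
   exchange_acq_fallback ((mem), (val))
#endif

int
swap_demo (void)
{
  int x = 1;
  int old = atomic_exchange_acquire (&x, 2);
  return old * 10 + x;   /* old == 1, x == 2 */
}
```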
On Mon, 2014-09-15 at 11:59 -0400, Carlos O'Donell wrote: > Torvald, > > Thanks for the email, very good questions. > > David, > > SPARC pre-v9 question at the bottom for you. > > On 09/14/2014 02:34 PM, Torvald Riegel wrote: > > I think we should transition to using the C11 memory model and atomics > > instead of the current custom implementation. There are two reasons for > > this: > > Architecturally I think that glibc transitioning to the C11 memory model > is the *only* way forward. > > > I propose that our phase in transitioning to C11 is to focus on uses of > > the atomic operations. In particular, the rules are: > > > > * All accesses to atomic vars need to use atomic_* functions. IOW, all > > non-atomic accesses are not subject to data races. The only exceptions > > is initialization (ie, when the variable is not visible to any other > > thread); nonetheless, initialization accesses must not result in data > > races with other accesses. (This exception isn't allowed by C11, but > > eases the transition to C11 atomics and likely works fine in current > > implementations; as alternative, we could require MO-relaxed stores for > > initialization as well.) > > At present we rely on small word-length writes to complete atomically, > would you suggest we have to wrap those in true atomic operations? Yes! That's what the C11 model requires. And we don't use special types for atomically accessed variables, so if we don't use atomic ops for every access there's no way for the compiler to figure out which memory locations are accessed concurrently, and which aren't. > Won't this hurt performance? I don't think so. We'll use memory_order_relaxed in each case where we have a plain memory access right now to concurrently accessed data. The only reasons I can think of that might lead to a decrease in performance are: * Inefficient implementation of atomics, for example due to a too old compiler version (see the example that Joseph brought up). 
That can be worked around on an arch-specific basis, for example by using inline-asm differently. * Compilers that don't optimize across memory_order_relaxed atomic ops and glibc code actually benefits from optimizations by the compiler across current plain memory accesses. I doubt that this actually happens in practice, because it would need a loop or such and other things in the loop would need to be performance-critical -- which is not a pattern I think is frequent in concurrent code. * If we currently have code where the compiler combines several plain memory accesses to concurrently accessed data into one, then we could have more accesses if using memory_order_relaxed atomics. However, such an optimization can easily be not what the programmer intended to happen (e.g., if in a busy-waiting loop -- hence atomic_forced_read...). > What correctness issue exists? The compiler needs to know for which memory accesses it can assume data-race-freedom and for which it doesn't. This results in different optimizations being valid, or not. You can try to make the compiler do the right thing by forcing it towards doing that, but the clean way is to actually tell the compiler what it should do. For example, there were bugs fixed in GCC for optimizations across or involving atomic operations; these optimizations were valid for sequential code, but not in a multi-threaded setting. This shows that things can go wrong if we don't tell the compiler. > > * Atomic vars aren't explicitly annotated with atomic types, but just > use the base types. They need to be naturally aligned. This makes the > transition easier because we don't get any dependencies on C11 atomic > types. > > OK. > > > * On a certain architecture, we typically only use the atomic_* ops if > the HW actually supports these; we expect to have pointer-sized atomics > at most. If the arch has no native support for atomics, it can either > use modified algorithms or emulate atomics differently. 
> > I strongly suggest all such machines should emulate atomics in the kernel > using kernel-level locks. That's fine for me. I just didn't want to make this decision for machine maintainers. > The downside of this is that all atomic vars > must use atomic_* functions because otherwise the release of the lock > word by a normal store won't order correctly. This already happened on > hppa with userspace spinlocks. But that's The Right Thing To Do anyway, so I don't see it as a downside. IMO, when writing concurrent code, the least of your troubles is writing atomic_store_relaxed(&foo, 23) instead of foo = 23; IMHO, it actually aids in readability because it shows where concurrency has to be considered and where not (ie, for all data accessed by only one thread). > > * The atomic ops are similar to the _explicit variation of C11's > > functions, except that _explicit is replaced with the last part of the > > MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst). All > > arguments (except the MO, which is dropped) are the same as for C11. > > That avoids using the same names yet should make the names easy to > > understand for people familiar with C11. > > OK. > > > I also propose an incremental transition. In particular, the steps are > > roughly: > > > > 1) Add new C11-like atomics. If GCC supports them on this architecture, > > use GCC's atomic builtins. Make them fall back to the existing atomics > > otherwise. Attached is a small patch that illustrates this. > > OK. > > > 2) Refactor one use (ie, all the accesses belonging to one algorithm or > > group of functions that synchronize with each other) at a time. This > > involves reviewing the code and basically reimplementing the > > synchronization bits in on top of the C11 memory model. We also should > > take this opportunity to add any documentation of concurrent code that's > > missing (which is often the case). > > Not OK until we talk about it more. So, what do you want to talk about? 
;) > > 3) For non-standard atomic ops (eg, atomic_add_negative()), have a look > > at all uses and decide whether we really need to keep them. > > Agreed e.g. rewrite. > > > 4) Once all of glibc uses the new atomics, remove the old ones for a > > particular arch if the oldest compiler required has support for the > > respective builtins. > > OK. > > > Open questions: > > > > * Are the current read/write memory barriers equivalent to C11 > > acquire/release fences? I guess that's the case (did I mention lack of > > documentation? ;) ), but we should check whether this is true on every > > architecture (ie, whether the HW instructions used for read/write > > membars are the same as what the compiler would use for > > acquire/release). If not, we can't implement acquire/release based on > > read/write membars but need something else for this arch. I'd > > appreciate help from the machine maintainers for this one. > > Don't know. > > Create an internals manual? Add a new chapter on atomics? :-) I had hoped we can do with comments in atomic.h, given that we basically just need to explain our exceptions. I agree we should have documentation for the transition, but I also think this just documents existing state. Every new architecture added should target C11 right away (and every new architecture's designer should have better made it clear how to implement C11 on top of this HW :) > > * How do we deal with archs such as older SPARC that don't have CAS and > > other archs without HW support for atomics? Using modified algorithms > > should be the best-performing option (eg, if we can use one critical > > section instead of a complicated alternative that uses lots of atomic > > operations). However, that means we'll have to maintain more algorithms > > (even if they might be simpler). > > No. Stop. One algorithm. All arches that can't meet the HW support for > atomics must enter the kernel and do the work there. This is just like hppa > and ARM do. 
They use a light-weight syscall mechanism and serialize in the > kernel. Fine with me. > > Furthermore, do all uses of atomics work well with blocking atomics that > > might also not be indivisible steps? For example, the cancellation code > > might be affected because a blocking emulation of atomics won't be > > async-cancel-safe? > > It will be safe because the kernel emulation should not deliver a signal > during the emulation. What about PI code? If we have a non-blocking algorithm in glibc (using nonblocking atomics), then this will work fine with hard real time priorities; when using blocking atomics, we might get priority inversion.
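The correctness hazard behind atomic_forced_read can be shown concretely; this is an illustrative C11 sketch, not glibc code:

```c
/* Why the compiler must be told about concurrent accesses: with a plain
   int flag and

       while (!ready) ;

   the compiler may assume data-race freedom, hoist the load out of the
   loop, and spin forever on a cached register value.  A relaxed atomic
   load must be re-evaluated on every iteration.  */
#include <stdatomic.h>
#include <pthread.h>

static atomic_int ready;

static void *
setter (void *arg)
{
  (void) arg;
  atomic_store_explicit (&ready, 1, memory_order_relaxed);
  return NULL;
}

int
wait_for_ready (void)
{
  pthread_t t;
  pthread_create (&t, NULL, setter, NULL);
  /* Re-loaded each iteration; cannot be hoisted.  */
  while (!atomic_load_explicit (&ready, memory_order_relaxed))
    ;
  pthread_join (t, NULL);
  return atomic_load_explicit (&ready, memory_order_relaxed);
}
```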
On 9/15/2014 11:59 AM, Carlos O'Donell wrote: >> * How do we deal with archs such as older SPARC that don't have CAS and >> >other archs without HW support for atomics? Using modified algorithms >> >should be the best-performing option (eg, if we can use one critical >> >section instead of a complicated alternative that uses lots of atomic >> >operations). However, that means we'll have to maintain more algorithms >> >(even if they might be simpler). > No. Stop. One algorithm. All arches that can't meet the HW support for > atomics must enter the kernel and do the work there. This is just like hppa > and ARM do. They use a light-weight syscall mechanism and serialize in the > kernel. Agreed! And, tilepro is also on that list of platforms with weak atomics that uses a light-weight syscall to serialize in the kernel. (Unlike tilegx, which has fairly rich atomics.) On 9/15/2014 12:51 PM, Joseph S. Myers wrote: > * In some cases those macros are defined by GCC even though the operations > are out of line in libgcc. On PA the HAVE_sync_compare_and_swap* macros > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to > be predefined; for Tile GCC has special target-specific code to define > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*. Do we believe such out-of-line libgcc > versions will work in glibc, and is it preferable to use them or inline > asm? Yes, the tile out-of-line versions should work fine in glibc. The tilepro out-of-line atomics do the lightweight syscalls. On tilegx out-of-line support is only used for 1- and 2-byte atomics, which are not supported in hardware, so the out-of-line code does a 4-byte compare-and-exchange loop to synthesize the operation. However, I suspect we are also not using those sizes in glibc itself - or if we are, we might decide we don't want to be.
On Mon, 15 Sep 2014, Torvald Riegel wrote:
> > * In some cases those macros are defined by GCC even though the operations
> > are out of line in libgcc. On PA the HAVE_sync_compare_and_swap* macros
> > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to
> > be predefined; for Tile GCC has special target-specific code to define
> > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*. Do we believe such out-of-line libgcc
> > versions will work in glibc, and is it preferable to use them or inline
> > asm? (My guess is that they might work except for cases such as the
> > 64-bit ARM atomics with references to libc interfaces.)
>
> I believe they should work from a correctness perspective, but might
> have a slight performance disadvantage. I'm not sure I understand the
> ARM case you mention.

The 64-bit ARM atomics (linux-atomic-64bit.c) rely on kernel helpers that are only available in some kernel versions and involve a function called at startup to test for availability of those helpers and call __write / abort if not available.

I don't know if that would work in all places where glibc uses atomics, but I think we should avoid the issue by making sure not to use 64-bit atomics at all on 32-bit platforms (and hopefully ensure that such atomics produce a compile / link failure if used).
On Tue, 2014-09-16 at 16:06 +0000, Joseph S. Myers wrote: > On Mon, 15 Sep 2014, Torvald Riegel wrote: > > > > * In some cases those macros are defined by GCC even though the operations > > > are out of line in libgcc. On PA the HAVE_sync_compare_and_swap* macros > > > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to > > > be predefined; for Tile GCC has special target-specific code to define > > > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*. Do we believe such out-of-line libgcc > > > versions will work in glibc, and is it preferable to use them or inline > > > asm? (My guess is that they might work except for cases such as the > > > 64-bit ARM atomics with references to libc interfaces.) > > > > I believe they should work from a correctness perspective, but might > > have a slight performance disadvantage. I'm not sure I understand the > > ARM case you mention. > > The 64-bit ARM atomics (linux-atomic-64bit.c) rely on kernel helpers that > are only available in some kernel versions and involve a function called > at startup to test for availability of those helpers and call __write / > abort if not available. > > I don't know if that would work in all places where glibc uses atomics, > but I think we should avoid the issue by making sure not to use 64-bit > atomics at all on 32-bit platforms (and hopefully ensure that such atomics > produce a compile / link failure if used). Agreed. I think we should by default not expect more than pointer-sized atomic ops. If a particular concurrent algorithm implementation really needs more, it's often more efficient to use a solution specific to that algorithm (e.g., exploiting monotonicity in the values or such) rather than try to rely on kernel helpers or locks for that.
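One way to get the compile-time failure Joseph asks for is a static assertion on the access size; the macro name below is invented for illustration:

```c
/* Reject atomic operations wider than a pointer at compile time rather
   than silently falling back to libgcc's out-of-line helpers.  */
#include <stdint.h>

#define atomic_check_size(ptr)                              \
  _Static_assert (sizeof (*(ptr)) <= sizeof (void *),       \
                  "atomic operation wider than a pointer")

static uint32_t counter32;

int
ok_demo (void)
{
  atomic_check_size (&counter32);   /* 4 bytes: accepted everywhere */
  return __atomic_add_fetch (&counter32, 1, __ATOMIC_RELAXED);
}

/* On a 32-bit target this would be rejected at compile time:

     static uint64_t counter64;
     atomic_check_size (&counter64);   // error: wider than a pointer
*/
```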
On Sun, 14 Sep 2014, Torvald Riegel wrote:
> * All accesses to atomic vars need to use atomic_* functions. IOW, all
> non-atomic accesses are not subject to data races. The only exceptions
> is initialization (ie, when the variable is not visible to any other
> thread); nonetheless, initialization accesses must not result in data
> races with other accesses. (This exception isn't allowed by C11, but
> eases the transition to C11 atomics and likely works fine in current
> implementations; as alternative, we could require MO-relaxed stores for
> initialization as well.)

Note <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273> regarding MO-relaxed operations not being well-optimized. It will be important to compare the code generated before and after any changes, and may be necessary to map the "relaxed" operations to plain loads and stores in some cases depending on the compiler version.
On Tue, 2014-09-16 at 18:17 +0000, Joseph S. Myers wrote: > On Sun, 14 Sep 2014, Torvald Riegel wrote: > > > * All accesses to atomic vars need to use atomic_* functions. IOW, all > > non-atomic accesses are not subject to data races. The only exceptions > > is initialization (ie, when the variable is not visible to any other > > thread); nonetheless, initialization accesses must not result in data > > races with other accesses. (This exception isn't allowed by C11, but > > eases the transition to C11 atomics and likely works fine in current > > implementations; as alternative, we could require MO-relaxed stores for > > initialization as well.) > > Note <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273> regarding > MO-relaxed operations not being well-optimized. I mentioned this already in my reply to a question by Carlos (in this thread) -- but thanks for noting the GCC bug: * Compilers that don't optimize across memory_order_relaxed atomic ops and glibc code actually benefits from optimizations by the compiler across current plain memory accesses. I doubt that this actually happens in practice, because it would need a loop or such and other things in the loop would need to be performance-critical -- which is not a pattern I think is frequent in concurrent code. * If we currently have code where the compiler combines several plain memory accesses to concurrently accessed data into one, then we could have more accesses if using memory_order_relaxed atomics. However, such an optimization can easily be not what the programmer intended to happen (e.g., if in a busy-waiting loop -- hence atomic_forced_read...). > It will be important to > compare the code generated before and after any changes, and may be > necessary to map the "relaxed" operations to plain loads and stores in > some cases depending on the compiler version. I agree that we need to compare, but I wouldn't be happy with mapping back to plain loads and stores either. 
Sure, it seems to work right now, but we're really keeping our fingers crossed, actually. My *guess* would be that we won't see slow-down for any of the pthreads synchronization data structures, but might see some in cases where atomics are used for things like statistics counters in malloc or such (ie, where they are surrounded by significant amounts of nonconcurrent code without additional memory barriers and such).
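The statistics-counter scenario can be made concrete. Both functions below compute the same result; the concern (GCC PR 63273) is purely about the generated code, since the compiler may merge the plain increments into a single add of 2 but, at the time, compiled the relaxed ones conservatively as two separate updates:

```c
/* Illustrative sketch of the codegen concern, not glibc code.  */
#include <stdatomic.h>

static long stat_plain;
static atomic_long stat_relaxed;

long
bump_plain (void)
{
  stat_plain++;   /* the compiler may merge these ...  */
  stat_plain++;   /* ... into a single += 2             */
  return stat_plain;
}

long
bump_relaxed (void)
{
  /* Typically compiled as two separate atomic updates.  */
  atomic_fetch_add_explicit (&stat_relaxed, 1, memory_order_relaxed);
  atomic_fetch_add_explicit (&stat_relaxed, 1, memory_order_relaxed);
  return atomic_load_explicit (&stat_relaxed, memory_order_relaxed);
}
```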
On Sun, 2014-09-14 at 20:34 +0200, Torvald Riegel wrote:
> 1) Add new C11-like atomics. If GCC supports them on this architecture,
> use GCC's atomic builtins. Make them fall back to the existing atomics
> otherwise. Attached is a small patch that illustrates this.

I made a scan through the current atomics implementation on different archs. It seems as if there's some disagreement about which operations have which barrier semantics, and which ones are worthwhile to have specific implementations of vs. falling back to a CAS loop.

Things that stood out for me (some may be bugs):

* Membars not being defined on i486 and x86_64 (unless I missed something). The HW memory model for these two is TSO, which would mean that algorithms that rely on atomic_full_barrier won't work. I'm not sure whether we have these in current code, but atomic_full_barrier is used in code I'm working on.
* mips use of __sync_lock_test_and_set for both _acq and _rel is surprising. Is this __sync builtin different on mips, or is the _rel variant broken and needs a stronger barrier?
* powerpc having memory_order_relaxed semantics on some operations while other archs use _acq, yet has acquire semantics for decrement_if_positive. Inconsistency? Just a performance problem or a correctness problem?
* What's the real semantics of atomic_write_barrier? The SPARC code looks as if this wouldn't be equivalent to a memory_order_release fence. But for Power I'd guess it is?

Archs marked with ??? are those I know very little about. Read/write/full membar are the current ones (not C11). _acq and _rel refer to what the current code understands as that, not C11 memory_order_acquire and memory_order_release.

Default:
* If _rel variant not defined by arch, _acq is used.
* All atomic ops without a _acq or _rel suffix (eg, atomic_exchange_and_add) default to have _acq semantics.
* atomic_decrement_if_positive has _acq semantics on the success path (ie, if something gets added) but relaxed semantics on the failure path (ie, if the result of the addition would be negative).
* Full membars are just compiler barriers (ie, __asm ("" ::: "memory")).
* Read/write membars default to full membars.

aarch64:
* Uses the C11 GCC builtins. 8-64b CAS + exchange.
* _acq and _rel implemented as memory_order_acquire / memory_order_release.
* CAS failure path has memory_order_relaxed.
* Only full membar defined.

alpha ???:
* Custom asm for CAS + exchange. 8-64b.
* Full membar defined and equal to read membar. Write membar is different.
* CAS failure path seems to have memory_order_relaxed.
* atomic_exchange_and_add uses full membar.

arm:
* Uses either C11 GCC builtins, the older _sync* builtins, or CAS provided by the OS (_arm_assisted...). 32b only CAS + exchange.
* CAS failure path has memory_order_relaxed.
* _acq and _rel implemented as mo_acquire / mo_release, except when using _arm_assisted*, in which case only _acq is used.
* Only full membar defined.

hppa ???:
* Uses kernel for CAS. ??? How is the kernel aware of the size of the CAS?
* Only _acq CAS defined.
* Membars are not defined.

i486:
* Custom asm for CAS, exchange, exchange_and_add, add, add_negative, add_zero, increment, increment_and_test, decrement, decrement_and_test, bit_set, and, or. All for 8-32b.
* atomic_delay has custom asm.
* atomic_exchange_and_add uses the __sync builtin, so is a full membar.
* Membars are not defined.

ia64 ???:
* CAS based on __sync builtins; exchange uses __sync_lock_test_and_set with proper release membar added for _rel. 32-64b.
* atomic_exchange_and_add is a full membar.
* Full barrier defined as __sync_synchronize.

m68k coldfire ???:
* Uses kernel for CAS. ??? 32b only it seems?
* Uses kernel for full membar.

m68k 68020 ???:
* Custom asm for CAS, exchange, exchange_and_add, add, increment_and_test, decrement_and_test, bit_set, bit_test_set. 8-64b.
* Membars are not defined.

microblaze ???:
* Custom asm for CAS, exchange, exchange_and_add, increment, decrement.
  32b.
* Membars are not defined.

mips ???:
* Uses the C11 GCC builtins, __sync (see Joseph's notes on GCC version
  requirements), or custom asm for CAS, exchange.  32b, and 64b unless
  ABIO32.
* If __sync is used, then more operations use the __sync builtins.
* If custom asm is used, then exchange_and_add is defined too.
* CAS failure path is memory_order_relaxed (in the case of the C11
  builtins and custom asm).
* atomic_exchange uses __sync_lock_test_and_set for both _acq and _rel,
  but has different barriers for _acq and _rel elsewhere.
* Full membar is defined.

powerpc32:
* Custom asm for CAS, exchange.  32b.
* Custom asm also exists for exchange_and_add, increment, decrement,
  decrement_if_positive.  The first three of these don't use any
  membars, so are likely memory_order_relaxed; however,
  decrement_if_positive uses an acquire barrier.
* CAS failure path has an acquire membar on _acq, memory_order_relaxed
  semantics on _rel.
* Read membar defined (lwsync or sync).  Full and write membars defined.

powerpc64:
* Like powerpc32, but custom asm is also provided for 64b (same ops).
* CAS failure path has an acquire membar on _acq, memory_order_relaxed
  semantics on _rel.
* Read membar defined (lwsync or sync).  Full and write membars defined.

s390 ???:
* Custom asm for CAS.  32-64b.
* No membars defined.

sh ???:
* Kernel used for CAS, exchange, exchange_and_add, add_negative,
  add_zero, increment_and_test, decrement_and_test, bit_set,
  bit_test_set.  8-32b.
* No membars defined.

sparc32 v9 ???:
* Custom asm for CAS, exchange.  32b.
* Full membar defined.
* Read membar defined.  I guess this would work as a
  memory_order_acquire barrier (or C11 fence).
* Write membar defined.
  This does not seem to implement a memory_order_release barrier if I
  interpret the SPARC asm correctly, but rather to be like a more
  traditional "don't reorder the preceding write with anything else"
  barrier; it issues a StoreLoad and a StoreStore barrier, but if used
  as memory_order_release, it will be followed by an atomic store, so I
  guess it should be LoadStore and StoreStore?  I haven't checked what
  GCC generates, so I'm not sure.

sparc32 pre-v9 ???:
* Either uses custom asm for a lock-based implementation of atomics or
  uses the v9 code.

sparc64 ???:
* Custom asm for CAS, exchange.  32-64b.
* Membars are the same as for sparc32 v9.

x86_64:
* Uses __sync builtins for CAS; custom asm for exchange,
  exchange_and_add, add, add_zero, increment, increment_and_test,
  decrement, decrement_and_test, bit_set, bit_test_set, and, or.
  8-64b.
* Custom asm for atomic ops that emit the lock prefix conditionally
  (e.g., catomic_add).
* Membars are not defined.
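The generic defaults described at the top of this survey (full membar as a pure compiler barrier, read/write membars falling back to it) can be sketched as a plain C fragment.  This is an illustration of the described defaults, not the verbatim include/atomic.h text:

```c
#include <assert.h>

/* Sketch of the described defaults: the generic full membar is only a
   compiler barrier, and read/write membars fall back to it.  This is
   only correct on archs whose HW model needs no fence instruction.  */
#ifndef atomic_full_barrier
# define atomic_full_barrier() __asm ("" ::: "memory")
#endif
#ifndef atomic_read_barrier
# define atomic_read_barrier() atomic_full_barrier ()
#endif
#ifndef atomic_write_barrier
# define atomic_write_barrier() atomic_full_barrier ()
#endif

static int observed;

static int
probe (void)
{
  observed = 1;
  atomic_write_barrier ();      /* compiler-only: no machine fence */
  return observed;
}
```

The fallback chain is why an arch that defines no membars at all silently gets compiler-only barriers, which is exactly the i486/x86_64 concern above.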
On Tue, 16 Sep 2014, Torvald Riegel wrote: > * mips use of __sync_lock_test_and_set for both _acq and _rel is > surprising. Is this __sync builtin different on mips, or is the _rel > variant broken and needs a stronger barrier? It sounds like a bug - specific to the case of (MIPS16, GCC before 4.7), as that's the only case where __sync_* get used on MIPS, and that case may not be widely used. Please file in Bugzilla. (__sync_* for MIPS16 then in turn ends up using out-of-line libgcc functions that call __sync_* built as non-MIPS16 code, but I see nothing special there to give a stronger barrier.)
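For reference, GCC documents __sync_lock_test_and_set as an acquire barrier only, so a release-style exchange built on the __sync builtins would need a full barrier in front of it.  A minimal sketch (the helper name is hypothetical and not MIPS16-specific):

```c
#include <assert.h>

/* Sketch: __sync_lock_test_and_set is documented by GCC as an acquire
   barrier only.  For release semantics, a full barrier must precede
   it.  The helper name exchange_rel is hypothetical.  */
static int
exchange_rel (int *mem, int newval)
{
  __sync_synchronize ();                 /* order all prior accesses */
  return __sync_lock_test_and_set (mem, newval);
}
```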
On Tue, 2014-09-16 at 23:48 +0200, Torvald Riegel wrote:
> * Membars not being defined on i486 and x86_64 (unless I missed
> something).  The HW memory model for these two is TSO, which would
> mean that algorithms that rely on atomic_full_barrier won't work.  I'm
> not sure whether we have these in current code, but
> atomic_full_barrier is used in code I'm working on.

Based on a quick look at the uses of atomic_full_barrier, this doesn't
seem to cause incorrect behavior *currently*:

* nptl/pthread_mutex_setprioceiling.c has a full barrier immediately
  followed by a futex syscall, which effectively is a full barrier too.
* sysdeps/nptl/fork.c seems to need only an acquire barrier, so on x86
  the compiler-only barrier that the full barrier currently results in
  is sufficient.
* Other generic code that uses full barriers (eg, sem_post) currently
  has custom asm versions on x86/x86_64.  However, once we remove
  those, the lack of a correct full barrier will be a bug.

To summarize, I think we need to fix the full barriers as soon as we
add other code that uses atomic_full_barrier and is used by x86/x86_64.
This is #17403.
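A hypothetical fix for #17403 (a sketch, not a patch from this thread) would give x86_64 a real full barrier.  Under TSO, only StoreLoad reordering needs to be forbidden, which mfence does:

```c
#include <assert.h>

/* Sketch of a real full barrier for x86_64 (hypothetical fix for
   #17403).  TSO already forbids everything except StoreLoad
   reordering; mfence forbids that too, and the "memory" clobber stops
   the compiler from reordering around it.  */
#define atomic_full_barrier() __asm __volatile ("mfence" ::: "memory")

static int a, b;

static int
store_then_load (void)
{
  a = 1;
  atomic_full_barrier ();   /* forbid the StoreLoad reordering TSO allows */
  return b;
}
```

A locked read-modify-write on a dummy location would work as well; mfence is simply the most direct spelling.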
On Tue, 2014-09-16 at 22:02 +0000, Joseph S. Myers wrote: > On Tue, 16 Sep 2014, Torvald Riegel wrote: > > > * mips use of __sync_lock_test_and_set for both _acq and _rel is > > surprising. Is this __sync builtin different on mips, or is the _rel > > variant broken and needs a stronger barrier? > > It sounds like a bug - specific to the case of (MIPS16, GCC before 4.7), > as that's the only case where __sync_* get used on MIPS, and that case may > not be widely used. Please file in Bugzilla. (__sync_* for MIPS16 then > in turn ends up using out-of-line libgcc functions that call __sync_* > built as non-MIPS16 code, but I see nothing special there to give a > stronger barrier.) > https://sourceware.org/bugzilla/show_bug.cgi?id=17404
On Tue, 2014-09-16 at 23:48 +0200, Torvald Riegel wrote:
> sparc32 v9 ???:
> * Custom asm for CAS, exchange.  32b.
> * Full membar defined.
> * Read membar defined.  I guess this would work as a
>   memory_order_acquire barrier (or C11 fence).
> * Write membar defined.  This does not seem to implement a
>   memory_order_release barrier if I interpret the SPARC asm correctly,
>   but rather to be like a more traditional "don't reorder the
>   preceding write with anything else" barrier; it issues a StoreLoad
>   and a StoreStore barrier, but if used as memory_order_release, it
>   will be followed by an atomic store, so I guess it should be
>   LoadStore and StoreStore?  I haven't checked what GCC generates, so
>   I'm not sure.

Dave, on SPARC the membars are currently defined as:

#define atomic_full_barrier() \
  __asm __volatile ("membar #LoadLoad | #LoadStore" \
                    " | #StoreLoad | #StoreStore" : : : "memory")
#define atomic_read_barrier() \
  __asm __volatile ("membar #LoadLoad | #LoadStore" : : : "memory")
#define atomic_write_barrier() \
  __asm __volatile ("membar #StoreLoad | #StoreStore" : : : "memory")

From what I can see in the uses of atomic_write_barrier and in how it is
implemented on other archs, it is intended to be similar to
memory_order_release.  Throughout glibc, it is often used like this:

  x = foo;
  y = bar;
  atomic_write_barrier ();
  flag = 1;  /* or pointer initialization */

So we want to prevent anything prior to the membar from being reordered
with the store to flag becoming visible.  If I interpret the SPARC asm
correctly, that calls for a #LoadStore | #StoreStore membar.  This would
align with SPARC being TSO, which needs no actual HW membar for release
or acquire (same as x86).

However, atomic_write_barrier currently includes #StoreLoad, which
should make it a full barrier.  That is correct, but stronger than
necessary.  Am I missing something, is this intended to be a full
barrier, or is it just an oversight?
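If atomic_write_barrier is indeed meant as a release barrier only, the weaker definition suggested above would look like this (a SPARC-only sketch for illustration, untested):

```c
/* Suggested release-style write barrier for SPARC (sketch, untested):
   order prior loads and stores before the subsequent store, without
   the #StoreLoad part that makes the current definition a full
   barrier.  */
#define atomic_write_barrier() \
  __asm __volatile ("membar #LoadStore | #StoreStore" : : : "memory")
```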
On Sun, 2014-09-14 at 20:34 +0200, Torvald Riegel wrote:
> 2) Refactor one use (ie, all the accesses belonging to one algorithm
> or group of functions that synchronize with each other) at a time.
> This involves reviewing the code and basically reimplementing the
> synchronization bits on top of the C11 memory model.  We also should
> take this opportunity to add any documentation of concurrent code
> that's missing (which is often the case).

While trying to determine the intended semantics of
atomic_write_barrier -- it does indeed seem to be intended as a
memory_order_release fence -- I looked at several uses of
atomic_write_barrier.  This wasn't a detailed analysis, but it seems
that acquire barriers are missing in several places:

* nis/nss_nisplus/<several files> (pthread_once)
* stdlib/msort.c
* malloc/arena.c
* nptl/tpp.c (pthread_once)
* nptl/allocatestack.c
* inet/getnetgrent_r.c (pthread_once)
* include/list.h (concurrent list?)

Those marked with (pthread_once) are, at first sight, incorrect
implementations of exactly-once execution, due to the lack of an
atomic_read_barrier / atomic_thread_fence(memory_order_acquire).  For
archs with weak memory models such as ARM or Power, this can result in
HW race conditions and state corruption.  Even for archs with stronger
models such as x86 and SPARC, it can result in state corruption due to
code reordering by the compiler (it's all plain memory accesses, so the
compiler is allowed to assume that this is sequential code, and can
potentially load speculatively or similar).

I also don't remember seeing documentation of the synchronization
scheme in most of these files.  I think this further motivates
transitioning to the C11 memory model.
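To illustrate the missing-acquire problem in the (pthread_once)-style code, here is a sketch of the exactly-once pattern with the acquire side present, written directly against the GCC __atomic builtins.  The names are hypothetical, and the locking that picks a single initializer is elided:

```c
#include <assert.h>

/* Sketch of exactly-once initialization with the acquire side present.
   Names are hypothetical; real code would additionally take a lock or
   use a CAS so that only one thread runs the initializer.  */
static int once_done;       /* 0 = not initialized, 1 = initialized */
static int guarded_value;

static void
init_once (void)
{
  /* Acquire load: a thread that observes once_done == 1 is guaranteed
     to also observe the initialization stores sequenced before the
     release store below.  Without the acquire, the compiler (and weak
     HW) may reorder the read of guarded_value before this check.  */
  if (__atomic_load_n (&once_done, __ATOMIC_ACQUIRE) == 0)
    {
      guarded_value = 42;                               /* initialization */
      __atomic_store_n (&once_done, 1, __ATOMIC_RELEASE);
    }
}
```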
commit 7bd3b53f2dc61e0bf2ef018140ef1cc83f0827c5
Author: Torvald Riegel <triegel@redhat.com>
Date:   Sun Sep 14 20:04:54 2014 +0200

    Illustrate new function names for atomic ops.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..939d1c3 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -543,6 +543,86 @@
 #endif
 
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.  */
+
+#if USE_ATOMIC_COMPILER_BUILTINS
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define __atomic_load_mo(mem, mo) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __atomic_load (mem, &__atg100_val, mo); \
+     __atg100_val; })
+# define atomic_load_relaxed(mem) __atomic_load_mo ((mem), __ATOMIC_RELAXED)
+# define atomic_load_acquire(mem) __atomic_load_mo ((mem), __ATOMIC_ACQUIRE)
+
+# define __atomic_store_mo(mem, val, mo) \
+  ({ __typeof (*(mem)) __atg101_val = (val); \
+     __atomic_store (mem, &__atg101_val, mo); \
+     __atg101_val; })
+# define atomic_store_relaxed(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELAXED)
+# define atomic_store_release(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELEASE)
+
+/* TODO atomic_exchange relaxed acquire release acq_rel? */
+
+/* TODO atomic_compare_exchange_weak relaxed acquire release acq_rel? */
+/* TODO atomic_compare_exchange_strong relaxed acquire release acq_rel? */
+
+/* TODO atomic_fetch_add relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_sub relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_and relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_or relaxed acquire? release? acq_rel?
*/
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.  */
+# ifndef atomic_thread_fence_acquire
+#  define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+#  define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+#  define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+#  define atomic_load_relaxed(mem) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __asm ("" : "=r" (__atg100_val) : "0" (*(mem))); \
+     __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+#  define atomic_load_acquire(mem) \
+  ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem); \
+     atomic_thread_fence_acquire (); \
+     __atg101_val; })
+# endif
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here too?  */
+#  define atomic_store_relaxed(mem, val) do { *(mem) = (val); } while (0)
+# endif
+# ifndef atomic_store_release
+#  define atomic_store_release(mem, val) \
+  do { \
+    atomic_thread_fence_release (); \
+    atomic_store_relaxed ((mem), (val)); \
+  } while (0)
+# endif
+
+/* TODO same as above */
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS */
+
+
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)
 #endif
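To show how the proposed names would be used, here is a minimal producer/consumer sketch.  The two macros below are stand-ins defined directly on the compiler builtins (assuming they are available); they are not the patch's verbatim definitions:

```c
#include <assert.h>

/* Stand-ins for the proposed C11-like names, assuming the GCC __atomic
   builtins are available; the real patch would define these in
   include/atomic.h behind USE_ATOMIC_COMPILER_BUILTINS.  */
#define atomic_load_acquire(mem) __atomic_load_n ((mem), __ATOMIC_ACQUIRE)
#define atomic_store_release(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELEASE)

static int flag, payload;

static void
producer (void)
{
  payload = 123;
  atomic_store_release (&flag, 1);   /* publishes payload */
}

static int
consumer (void)
{
  if (atomic_load_acquire (&flag) == 1)
    return payload;    /* release/acquire pairing makes 123 visible */
  return -1;
}
```

The release store and acquire load pair up exactly where the old code would have used atomic_write_barrier and (often missing) atomic_read_barrier, which is the refactoring this thread proposes.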