Message ID: 1432063241.11088.34.camel@triegel.csb
On 19/05/15 20:20, Torvald Riegel wrote:
> On Mon, 2015-05-18 at 17:36 +0100, Matthew Wahab wrote:
>> Hello,
>>
>> On 15/05/15 17:22, Torvald Riegel wrote:
>>> This patch improves the documentation of the built-ins for atomic
>>> operations.
>>
>> The "memory model" to "memory order" change does improve things but I think that
>> the patch has some problems. As it is now, it makes some of the descriptions
>> quite difficult to understand and seems to assume more familiarity with details
>> of the C++11 specification than might be expected.
>
> I'd say that's a side effect of the C++11 memory model being the
> reference specification of the built-ins.
>
>> Generally, the memory order descriptions seem to be targeted towards language
>> designers but don't provide for anybody trying to understand how to implement or
>> to use the built-ins.
>
> I agree that the current descriptions aren't a tutorial on the C++11
> memory model. However, given that the model is not GCC-specific, we
> aren't really in a need to provide a tutorial, in the same way that we
> don't provide a C++ tutorial. Users can pick the C++11 memory model
> educational material of their choice, and we need to document what's
> missing to apply the C++11 knowledge to the built-ins we provide.

We seem to have different views about the purpose of the manual page. I'm treating it as a description of the built-in functions provided by gcc to generate the code needed to implement the C++11 model. That is, the built-ins are distinct from C++11 and their descriptions should be, as far as possible, independent of the methods used in the C++11 specification to describe the C++11 memory model.

I understand of course that the __atomics were added in order to support C++11 but that doesn't make them part of C++11 and, since __atomic functions can be made available when C11/C++11 may not be, it seems to make sense to try for stand-alone descriptions.
I'm also concerned that the patch, by describing things in terms of formal C++11 concepts, makes it more difficult for people to know what the built-ins can be expected to do and so makes the built-ins more difficult to use. There is a danger that, rather than take a risk with uncertainty about the behaviour of the __atomics, people will fall back to the __sync functions simply because their expected behaviour is easier to work out.

I don't think that linking to external sites will help either, unless people already want to know C++11. Anybody who just wants to (e.g.) add a memory barrier will take one look at the __sync manual page and use the closest match from there instead.

Note that none of this requires a tutorial of any kind. I'm just suggesting that the manual should describe what behaviour should be expected of the code generated for the functions. For the memory orders, that would mean describing what constraints need to be met by the generated code. The requirement that the atomics should support C++11 could be met by making sure that the description of the expected behaviour is sufficient for C++11.

> There are several resources for implementers, for example the mappings
> maintained by the Cambridge research group. I guess it would be
> sufficient to have such material on the wiki. Is there something
> specific that you'd like to see documented for implementers?
> [...]
> I agree it's not described in the manual, but we're implementing C++11.

(As above) I believe we're supporting the implementation of C++11 and that the distinction is important.

> However, I don't see why happens-before semantics wouldn't apply to
> GCC's implementation of the built-ins; there may be cases where we
> guarantee more, but if one uses the builtins in a way allowed by the C++11
> model, one certainly gets behavior and happens-before relationships as
> specified by C++11.
My understanding is that happens-before is a relation used in the C++11 specification with a specific meaning. I believe that it's used to decide whether something is or is not a data race, so saying that it applies to a gcc built-in would be wrong. Using the gcc built-in rather than the equivalent C++11 library function would result in a program that C++11 regards as invalid. (Again, as I understand it.)

>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>> index 6004681..5b2ded8 100644
>> --- a/gcc/doc/extend.texi
>> +++ b/gcc/doc/extend.texi
>> @@ -8853,19 +8853,19 @@ are not prevented from being speculated to before the barrier.
>>
>> [...] If the data type size maps to one
>> -of the integral sizes that may have lock free support, the generic
>> -version uses the lock free built-in function. Otherwise an
>> +of the integral sizes that may support lock-freedom, the generic
>> +version uses the lock-free built-in function. Otherwise an
>> external call is left to be resolved at run time.
>>
>> =====
>> This is a slightly awkward sentence. Maybe it could be replaced with something
>> on the lines of "The generic function uses the lock-free built-in function when
>> the data-type size makes that possible, otherwise an external call is left to be
>> resolved at run-time."
>> =====
>
> Changed to:
> "It uses the lock-free built-in function if the specific data type size
> makes that possible; otherwise, an external call is left to be resolved
> at run time."

Ok for me.

>> -The memory models integrate both barriers to code motion as well as
>> -synchronization requirements with other threads. They are listed here
>> -in approximately ascending order of strength.
>> +An atomic operation can both constrain code motion by the compiler and
>> +be mapped to a hardware instruction for synchronization between threads
>> +(e.g., a fence). [...]
>>
>> =====
>> This is a little unclear (and inaccurate, aarch64 can use two instructions
>> for fences).
>> I also thought that atomic operations constrain code motion by the
>> hardware. Maybe break the link with the compiler and hardware: "An atomic
>> operation can both constrain code motion and act as a synchronization point
>> between threads".
>> =====
>
> I removed "by the compiler" and used "hardware instruction_s_".

Ok.

>> @table @code
>> @item __ATOMIC_RELAXED
>> -No barriers or synchronization.
>> +Implies no inter-thread ordering constraints.
>>
>> ====
>> It may be useful to be explicit that there are no restrictions on code motion.
>> ====
>
> But there are restrictions, for example those related to the forward
> progress requirements or the coherency rules.

Those are C++11 restrictions, used in the formal description of the model. The expected behaviour of the HW instructions generated by the built-in doesn't require any restrictions to be imposed.

>> @item __ATOMIC_CONSUME
>> -Data dependency only for both barrier and synchronization with another
>> -thread.
>> +This is currently implemented using the stronger @code{__ATOMIC_ACQUIRE}
>> +memory order because of a deficiency in C++11's semantics for
>> +@code{memory_order_consume}.
>>
>> =====
>> It would be useful to have a description of what the __ATOMIC_CONSUME was
>> meant to do, as well as the fact that it currently just maps to
>> __ATOMIC_ACQUIRE. (Or maybe just drop it from the documentation until it's
>> fixed.)
>> =====
>
> I'll leave what it was meant to do to the ISO C++ discussions. I think
> that it's implemented using mo_acquire is already clear from the text,
> or not?

Yes, I just meant don't drop the link to __ATOMIC_CONSUME. It's probably worth saying something to discourage people from using this though. Something on the lines of "This memory order should not be used as the implementation is likely to change".

>> @item __ATOMIC_ACQUIRE
>> -Barrier to hoisting of code and synchronizes with release (or stronger)
>> -semantic stores from another thread.
>> +Creates an inter-thread happens-before constraint from the release (or
>> +stronger) semantic store to this acquire load. Can prevent hoisting
>> +of code to before the operation.
>>
>> =====
>> As noted before, it's not clear what the "inter-thread happens-before"
>> means in this context.
>
> I don't see a way to make this truly freestanding without replicating
> the specification of the C++11 memory model.

(As above) I don't believe that the happens-before applies to the built-ins.

>> Here and elsewhere:
>> "Can prevent <motion> of code" is ambiguous: it doesn't say under what
>> conditions code would or wouldn't be prevented from moving.
>
> Yes, it's not a specification but just an illustration of what can
> result from the specification.

But it makes it difficult to know what behaviour to expect, making it difficult to use.

> [..] Describing which code movement is
> allowed or not, precisely, is way too much detail IMO. There are full
> ISO C++ papers about that (N4455), and even those aren't enumerations of
> all allowed code transformations.

A gcc built-in reduces to a code sequence to be executed by a single thread; describing the expected behaviour of that code sequence shouldn't be so difficult that it needs a paper. Note that this doesn't need a formal (in the sense of formal methods) description, just the sort of description that is normal for a compiler built-in.

>> =====
>>
>> -Note that the scope of a C++11 memory model depends on whether or not
>> -the function being called is a @emph{fence} (such as
>> -@samp{__atomic_thread_fence}). In a fence, all memory accesses are
>> -subject to the restrictions of the memory model. When the function is
>> -an operation on a location, the restrictions apply only to those
>> -memory accesses that could affect or that could depend on the
>> -location.
>> +Note that in the C++11 memory model, @emph{fences} (e.g.,
>> +@samp{__atomic_thread_fence}) take effect in combination with other
>> +atomic operations on specific memory locations (e.g., atomic loads);
>> +operations on specific memory locations do not necessarily affect other
>> +operations in the same way.
>>
>> ====
>> It's very unclear what this paragraph is saying. It seems to suggest that fences
>> only work in combination with other operations. But that doesn't seem right
>> since a __atomic_thread_fence (with appropriate memory order) can be dropped
>> into any piece of code and will act in the way that memory barriers are commonly
>> understood to work.
>
> Not quite. If you control the code generation around the fence (e.g.,
> so you control which loads/stores a compiler generates), and know which
> HW instruction the C++ fence maps to, you can try to use it this way.
>
> However, that's not what the built-in is specified to result in --
> instead, it's specified to implement a C++ fence. And those take effect
> in combination with the reads-from relation etc., for which one needs to
> use atomic operations that access specific memory locations.

As above.

>> ====
>>
>> @@ -9131,5 +9135,5 @@ if (_atomic_always_lock_free (sizeof (long long), 0))
>> @deftypefn {Built-in Function} bool __atomic_is_lock_free (size_t size, void *ptr)
>>
>> This built-in function returns true if objects of @var{size} bytes always
>> +generate lock-free atomic instructions for the target architecture. If
>> +it is not known to be lock-free, a call is made to a runtime routine named
>> ====
>> Probably unrelated to the point of this patch but does this mean
>> "If this built-in function is not known to be lock-free .."?
>> ====
>
> Yes, fixed.
>
> I've also fixed s/model/order/ in the HLE built-ins docs and fixed a
> typo there too.
>
> Unless there are objections, I plan to commit the attached patch at the
> end of this week.

It's obvious that I have concerns.
These come from an apparent difference of opinion about what the built-ins and the manual page are for, so I won't object to the patch going in.

Matthew
On Thu, 2015-05-21 at 16:45 +0100, Matthew Wahab wrote:
> On 19/05/15 20:20, Torvald Riegel wrote:
> > On Mon, 2015-05-18 at 17:36 +0100, Matthew Wahab wrote:
> >> Hello,
> >>
> >> On 15/05/15 17:22, Torvald Riegel wrote:
> >>> This patch improves the documentation of the built-ins for atomic
> >>> operations.
> >>
> >> The "memory model" to "memory order" change does improve things but I think that
> >> the patch has some problems. As it is now, it makes some of the descriptions
> >> quite difficult to understand and seems to assume more familiarity with details
> >> of the C++11 specification than might be expected.
> >
> > I'd say that's a side effect of the C++11 memory model being the
> > reference specification of the built-ins.
> >
> >> Generally, the memory order descriptions seem to be targeted towards language
> >> designers but don't provide for anybody trying to understand how to implement or
> >> to use the built-ins.
> >
> > I agree that the current descriptions aren't a tutorial on the C++11
> > memory model. However, given that the model is not GCC-specific, we
> > aren't really in a need to provide a tutorial, in the same way that we
> > don't provide a C++ tutorial. Users can pick the C++11 memory model
> > educational material of their choice, and we need to document what's
> > missing to apply the C++11 knowledge to the built-ins we provide.
>
> We seem to have different views about the purpose of the manual page. I'm treating it
> as a description of the built-in functions provided by gcc to generate the code
> needed to implement the C++11 model. That is, the built-ins are distinct from C++11
> and their descriptions should be, as far as possible, independent of the methods used
> in the C++11 specification to describe the C++11 memory model.

OK. But we'd need a *precise* specification of what they do if we'd want to make them separate from the C++11 memory model. And we don't have that, would you agree?
It's also not a trivial task, so I wouldn't be optimistic that someone would offer to write such a specification, and have it cross-checked.

> I understand of course that the __atomics were added in order to support C++11 but
> that doesn't make them part of C++11 and, since __atomic functions can be made
> available when C11/C++11 may not be, it seems to make sense to try for stand-alone
> descriptions.

The compiler can very well provide the C++11 *memory model* without creating any dependency on the other language/library pieces of C++11 or C11. Prior to C++11, multi-threaded executions were not defined by the standard, so we're not conflicting with anything in prior language standards, right?

Another way to see this is to say that we just *copy* the C++11 memory model and use it as the memory model that specifies the behavior of the atomic built-ins. That additionally frees us from having to come up with and maintain our GCC-specific specification of atomics and a memory model.

> I'm also concerned that the patch, by describing things in terms of formal C++11
> concepts, makes it more difficult for people to know what the built-ins can be
> expected to do and so makes the built-ins more difficult to use. There is a danger
> that, rather than take a risk with uncertainty about the behaviour of the __atomics,
> people will fall back to the __sync functions simply because their expected
> behaviour is easier to work out.

I hadn't thought about that possible danger, but that would be right. The way I would prefer to counter that is that we add a big fat warning to the __sync built-ins that we don't have a precise specification for them and that there are several corners of hand-waving and potentially further issues, and that this is another reason to prefer the __atomic built-ins. PR 65697 etc. are enough indication for me that we indeed lack a proper specification.
> I don't think that linking to external sites will help either, unless people already
> want to know C++11. Anybody who just wants to (e.g.) add a memory barrier will take
> one look at the __sync manual page and use the closest match from there instead.

Well, "just wants to add a memory barrier" is the start of the problem. The same way one needs to understand a hardware memory model to pick the right HW instruction(s), one needs to understand a programming language memory model to pick a fence and understand its semantics.

> Note that none of this requires a tutorial of any kind. I'm just suggesting that the
> manual should describe what behaviour should be expected of the code generated for
> the functions. For the memory orders, that would mean describing what constraints
> need to be met by the generated code.

I'd bet that if one describes these constraints correctly, you'll get a large document -- even if one removes any introductory or explanatory parts that could make it a tutorial. It's fairly straightforward to describe several simple usage patterns of the atomics (e.g., seq-cst ones, simple acquire/release pairs, producer/consumer, etc.). But describing the actual *constraints* correctly ends up duplicating a specification.

You could certainly try to come up with a simple description of the constraints, and we can iterate until I can't pick any holes in the description anymore. But I really don't think this would be a worthwhile use of our time :) It will certainly need more than a few sentences to be bullet-proof.

If you haven't, please just look at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html and try to specify the constraints just for the valid optimizations described there. This should be a good indication of why I think specifying the reordering / behavior constraints is nontrivial.
> The requirement that the atomics should support
> C++11 could be met by making sure that the description of the expected behaviour is
> sufficient for C++11.

We don't just want the semantics of __atomic* to be sufficient for C++11 but also want them to be *as weak as possible* while still being sufficient for C++11 -- otherwise, we'll make C++11 code less efficient than it can be. Thus, we want semantics that have the same strength as what's needed by C++11.

> > There are several resources for implementers, for example the mappings
> > maintained by the Cambridge research group. I guess it would be
> > sufficient to have such material on the wiki. Is there something
> > specific that you'd like to see documented for implementers?
> > [...]
> > I agree it's not described in the manual, but we're implementing C++11.
>
> (As above) I believe we're supporting the implementation of C++11 and that the
> distinction is important.
>
> > However, I don't see why happens-before semantics wouldn't apply to
> > GCC's implementation of the built-ins; there may be cases where we
> > guarantee more, but if one uses the builtins in a way allowed by the C++11
> > model, one certainly gets behavior and happens-before relationships as
> > specified by C++11.
>
> My understanding is that happens-before is a relation used in the C++11 specification
> with a specific meaning. I believe that it's used to decide whether something is or is
> not a data race

It appears in the data-race definition, right. More generally, it is the program-wide partial order regarding what virtually happens-before what (as-if applies, of course) in a particular execution of a program. It's at the core of actually describing how a multi-threaded program behaves.

> so saying that it applies to a gcc built-in would be wrong.

Simplified, we can map 1:1 between an __atomic built-in and an equivalent atomic operation.
The exception is basically the data definitions for an atomic type (e.g., atomic<T>): While C++11 hides the data and data type required for an atomically accessible variable, the built-ins assume that the caller will target a suitable memory location.

> Using the
> gcc built-in rather than the equivalent C++11 library function would result in
> a program that C++11 regards as invalid. (Again, as I understand it.)

It wouldn't be invalid but simply not defined by C++11. But that's fine because the built-ins are a GCC-specific extension (which is compatible with C++11 atomics, of course).

> >> @table @code
> >> @item __ATOMIC_RELAXED
> >> -No barriers or synchronization.
> >> +Implies no inter-thread ordering constraints.
> >>
> >> ====
> >> It may be useful to be explicit that there are no restrictions on code motion.
> >> ====
> >
> > But there are restrictions, for example those related to the forward
> > progress requirements or the coherency rules.
>
> Those are C++11 restrictions, used in the formal description of the model. The
> expected behaviour of the HW instructions generated by the built-in doesn't require
> any restrictions to be imposed.

No, those show up at the HW level as well. Consider examples of spin-loops with memory_order_relaxed loads, or the coherency rule. For example:

foo.store(1, memory_order_relaxed);
foo.store(2, memory_order_relaxed);
r = foo.load(memory_order_relaxed);

In the absence of other stores, r must never equal 1 (according to one of the coherency rules). If there'd be "no restrictions on code motion", the load could be moved to before the stores, which isn't allowed. Likewise, the assumption is that no hardware will do that either (or the generated code has to enforce this through specific HW instructions).

> >> Here and elsewhere:
> >> "Can prevent <motion> of code" is ambiguous: it doesn't say under what
> >> conditions code would or wouldn't be prevented from moving.
> > Yes, it's not a specification but just an illustration of what can
> > result from the specification.
>
> But it makes it difficult to know what behaviour to expect, making it difficult to use.
>
> > [..] Describing which code movement is
> > allowed or not, precisely, is way too much detail IMO. There are full
> > ISO C++ papers about that (N4455), and even those aren't enumerations of
> > all allowed code transformations.
>
> A gcc built-in reduces to a code sequence to be executed by a single thread;
> describing the expected behaviour of that code sequence shouldn't be so difficult
> that it needs a paper. Note that this doesn't need a formal (in the sense of formal
> methods) description, just the sort of description that is normal for a compiler
> built-in.

The built-ins represent constraints for both the generic code transformations and arch-specific code generation. They are not just fancy ways to get certain instruction sequences. Thus, what's discussed in N4455 is very much relevant.
On 21/05/15 19:26, Torvald Riegel wrote:
> On Thu, 2015-05-21 at 16:45 +0100, Matthew Wahab wrote:
>> On 19/05/15 20:20, Torvald Riegel wrote:
>>> On Mon, 2015-05-18 at 17:36 +0100, Matthew Wahab wrote:
>>>> Hello,
>>>>
>>>> On 15/05/15 17:22, Torvald Riegel wrote:
>>>>> This patch improves the documentation of the built-ins for atomic
>>>>> operations.

I think we're talking at cross-purposes and not really getting anywhere. I've replied to some of your comments below, but it's mostly a restatement of points already made. I'll repeat that, although I have concerns about the patch, I don't object to it going in. Maybe wait a few days to see if anybody else wants to comment but, at this point, since it's a documentation patch and won't break anything, it's better to just commit and deal with any problems that come up.

>> We seem to have different views about the purpose of the manual page. I'm treating it
>> as a description of the built-in functions provided by gcc to generate the code
>> needed to implement the C++11 model. That is, the built-ins are distinct from C++11
>> and their descriptions should be, as far as possible, independent of the methods used
>> in the C++11 specification to describe the C++11 memory model.
>
> OK. But we'd need a *precise* specification of what they do if we'd
> want to make them separate from the C++11 memory model. And we don't
> have that, would you agree?

There is a difference between the sort of description that is needed for a formal specification and the sort that would be needed for a programmer's manual. The best example of this that I can think of is the Standard ML definition (http://sml-family.org). That is a mathematical (so precise) definition that is invaluable if you want an unambiguous specification of the language. But it's useless for anybody who just wants to use Standard ML to write programs.
For that, you need to go to the imprecise descriptions that are given in books about SML and in the documentation for SML compilers and libraries.

The problem with using the formal SML definition is the same as with using the formal C++11 definition: most of it is detail needed to make things in the formal specification come out the right way. That detail, about things that are internal to the definition of the specification, makes it difficult to understand what is intended to be available for the user.

The GCC manual seems to me to be aimed more at the people who want to use GCC to write code and I don't think that the patch makes much allowance for them. I do think that more precise statements about the relationship to C++11 are useful to have. It's the sort of constraint that ought to be documented somewhere. But it seems to be more of interest to compiler writers or, at least, to users who are as knowledgeable as compiler writers. A document targeting that group, such as the GCC internals or a GCC wiki-page, would seem to be a better place for the information.

(Another example of the distinction may be the Intel Itanium ABI documentation which has a programmer's description of the synchronization primitives and a separate, formal description of their behaviour.)

For what it's worth, my view of how C++11, the __atomics and the machine code line up is that each is a distinct layer. Each layer implements the requirements of the higher (more abstract) layer but is otherwise entirely independent. That's why I think that a description of the __atomic built-ins, aimed at compiler users rather than writers and that doesn't expect knowledge of C++11, is desirable and possible.

>> I'm also concerned that the patch, by describing things in terms of formal C++11
>> concepts, makes it more difficult for people to know what the built-ins can be
>> expected to do and so make the built-in more difficult to use[..]
> I hadn't thought about that possible danger, but that would be right.
> The way I would prefer to counter that is that we add a big fat warning
> to the __sync built-ins that we don't have a precise specification for
> them and that there are several corners of hand-waving and potentially
> further issues, and that this is another reason to prefer the __atomic
> built-ins. PR 65697 etc. are enough indication for me that we indeed
> lack a proper specification.

Increasing uncertainty about the __sync built-ins wouldn't make people move to equally uncertain __atomic built-ins. There's enough knowledge and use of the __sync builtins to make them a more comfortable choice than the C++11 atomics and, in the worst case, it would push people to roll their own synchronization functions with assembler or system calls.

> Well, "just wants to add a memory barrier" is the start of the
> problem. The same way one needs to understand a hardware memory model
> to pick the right HW instruction(s), one needs to understand a
> programming language memory model to pick a fence and understand its
> semantics.

Sometimes you just want the hardware instructions and don't care about the programming language semantics. I suspect that for most people using GCC, the only time that the programming language specification matters is when their bug-report gets rejected as invalid code. Which is to say that you may be expecting more knowledge of the C++11 specification than most people actually have.

>> Note that none of this requires a tutorial of any kind. I'm just suggesting that the
>> manual should describe what behaviour should be expected of the code generated for
>> the functions. For the memory orders, that would mean describing what constraints
>> need to be met by the generated code.
>
> I'd bet that if one describes these constraints correctly, you'll get a
> large document -- even if one removes any introductory or explanatory
> parts that could make it a tutorial.
Agreed. It's possible to get an arbitrarily large, precise definition for even the most trivial programming language. It depends on how far you want to go with "precise". That said, my point was that, for user documentation, it's usual to abstract sufficiently to be able to give (a good approximation to) a correct, useful and succinct description. Completeness isn't necessary; that belongs in the compiler writer's docs.

> If you haven't, please just look at
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> and try to specify the constraints just for the valid optimizations
> described there. This should be a good indication of why I think
> specifying the reordering / behavior constraints is nontrivial.

I don't remember seeing this; it's an interesting paper. It deals with things that matter to compiler writers, though, so I don't think it's relevant to the point I was trying to make (that there's a distinction between documentation for compiler users and compiler writers).

>> The requirement that the atomics should support
>> C++11 could be met by making sure that the description of the expected behaviour is
>> sufficient for C++11.
>
> We don't just want the semantics of __atomic* to be sufficient for C++11
> but also want them to be *as weak as possible* to be still sufficient
> for C++11 -- otherwise, we'll make C++11 code less efficient than it can
> be.

Yes, that's implied by 'sufficient'.

>> My understanding is that happens-before is a relation used in the C++11 specification
>> for a specific meaning.[..]
>> so saying that it applies to a gcc built-in would be wrong.
>
> Simplified, we can map 1:1 between an __atomic built-in and an
> equivalent atomic operation. The exception is basically the data
> definitions for an atomic type (e.g., atomic<T>): While C++11 hides the
> data and data type required for an atomically accessible variable, the
> built-ins assume that the caller will target a suitable memory location.
>> Using the
>> gcc built-in rather than the equivalent C++11 library function would result in
>> a program that C++11 regards as invalid. (Again, as I understand it.)
>
> It wouldn't be invalid but simply not defined by C++11. But that's fine
> because the built-ins are a GCC-specific extension (which is compatible
> with C++11 atomics, of course).

But the distinction between the GCC built-ins and the C++11 library functions is still there. As far as the C++11 specification is concerned, the GCC __atomic would not establish the same relationships, such as inter-thread happens-before, as the equivalent C++11 atomic function. As far as the C++11 specification is concerned, the GCC built-in is just another function.

> [...]

Apologies if I missed anything important. As I said, I doubt we'll come to agreement on this so I think that the best way to make progress is to see if anybody else has objections and commit if not.

Matthew
On Fri, 2015-05-22 at 17:41 +0100, Matthew Wahab wrote:
> On 21/05/15 19:26, Torvald Riegel wrote:
> > On Thu, 2015-05-21 at 16:45 +0100, Matthew Wahab wrote:
> >> On 19/05/15 20:20, Torvald Riegel wrote:
> >>> On Mon, 2015-05-18 at 17:36 +0100, Matthew Wahab wrote:
> >>>> Hello,
> >>>>
> >>>> On 15/05/15 17:22, Torvald Riegel wrote:
> >>>>> This patch improves the documentation of the built-ins for atomic
> >>>>> operations.
>
> I think we're talking at cross-purposes and not really getting anywhere. I've replied
> to some of your comments below, but it's mostly a restatement of points already made.

OK. I have a few more comments below.

> >> We seem to have different views about the purpose of the manual page. I'm treating it
> >> as a description of the built-in functions provided by gcc to generate the code
> >> needed to implement the C++11 model. That is, the built-ins are distinct from C++11
> >> and their descriptions should be, as far as possible, independent of the methods used
> >> in the C++11 specification to describe the C++11 memory model.
> >
> > OK. But we'd need a *precise* specification of what they do if we'd
> > want to make them separate from the C++11 memory model. And we don't
> > have that, would you agree?
>
> There is a difference between the sort of description that is needed for a formal
> specification and the sort that would be needed for a programmer's manual. The best
> example of this that I can think of is the Standard ML definition
> (http://sml-family.org). That is a mathematical (so precise) definition that is
> invaluable if you want an unambiguous specification of the language. But it's useless
> for anybody who just wants to use Standard ML to write programs. For that, you need
> to go to the imprecise descriptions that are given in books about SML and in the
> documentation for SML compilers and libraries.
> The problem with using the formal SML definition is the same as with
> using the formal C++11 definition: most of it is detail needed to make
> things in the formal specification come out the right way. That detail,
> about things that are internal to the definition of the specification,
> makes it difficult to understand what is intended to be available for
> the user.

A relation like happens-before is "user-facing".  It is how one reasons
about ordering in a multi-threaded execution.  This isn't internal or
for a corner-case like additional-synchronizes-with or one of the
consistency rules.

> The GCC manual seems to me to be aimed more at the people who want to
> use GCC to write code and I don't think that the patch makes much
> allowance for them. I do think that more precise statements about the
> relationship to C++11 are useful to have. It's the sort of constraint
> that ought to be documented somewhere. But it seems to be more of
> interest to compiler writers or, at least, to users who are as
> knowledgeable as compiler writers. A document targeting that group, such
> as the GCC internals manual or a GCC wiki page, would seem to be a
> better place for the information.
>
> (Another example of the distinction may be the Intel Itanium ABI
> documentation, which has a programmers' description of the
> synchronization primitives and a separate, formal description of their
> behaviour.)
>
> For what it's worth, my view of how C++11, the __atomics and the machine
> code line up is that each is a distinct layer. Each layer implements the
> requirements of the higher (more abstract) layer but is otherwise
> entirely independent. That's why I think that a description of the
> __atomic built-ins, aimed at compiler users rather than writers and that
> doesn't expect knowledge of C++11, is desirable and possible.
> >> I'm also concerned that the patch, by describing things in terms of
> >> formal C++11 concepts, makes it more difficult for people to know what
> >> the built-ins can be expected to do and so makes the built-ins more
> >> difficult to use[..]
> >
> > I hadn't thought about that possible danger, but that would be right.
> > The way I would prefer to counter that is that we add a big fat warning
> > to the __sync built-ins that we don't have a precise specification for
> > them and that there are several corners of hand-waving and potentially
> > further issues, and that this is another reason to prefer the __atomic
> > built-ins.  PR 65697 etc. are enough indication for me that we indeed
> > lack a proper specification.
>
> Increasing uncertainty about the __sync built-ins wouldn't make people
> move to equally uncertain __atomic built-ins. There's enough knowledge
> and use of the __sync built-ins to make them a more comfortable choice
> than the C++11 atomics, and in the worst case it would push people to
> roll their own synchronization functions with assembler or system calls.

I don't buy that.  Sure, some people will be uncomfortable with
anything.  But I don't see how "specified in C++11 and C11" is the same
level of uncertainty as "we don't have a tight specification".  Users
can pick their favorite source of education on the C++11 memory model.
And over time, C++11 / C11 will become commonly known...

> > Well, "just wants to add a memory barrier" is the start of the
> > problem.  The same way one needs to understand a hardware memory model
> > to pick the right HW instruction(s), one needs to understand a
> > programming language memory model to pick a fence and understand its
> > semantics.
>
> Sometimes you just want the hardware instructions and don't care about
> the programming language semantics.

If you tightly control code generation around those uses, that can work.
But compilers will optimize concurrent code, so in general, one can't
pretend the compiler isn't there or is just a fancy assembler.  I'm
stressing this point because I think it's critical that users understand
that in most cases, they have to consider both the compiler and the
hardware when writing concurrent C/C++ code.

> > If you haven't, please just look at
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> > and try to specify the constraints just for the valid optimizations
> > described there.  This should be a good indication of why I think
> > specifying the reordering / behavior constraints is nontrivial.
>
> I don't remember seeing this; it's an interesting paper. It deals with
> things that matter to compiler writers, though, so I don't think it's
> relevant to the point I was trying to make (that there's a distinction
> between documentation for compiler users and compiler writers).

Well, it's optimizations that the compiler is allowed to do.  So if we
give imprecise definitions of the semantics of __atomic* to users, it
can lead to differences between what users think __atomic* provides and
what it actually does.  I pointed to this paper because it shows
examples of optimizations that would be disallowed by simple definitions
of __atomic but are allowed by the standard.

> >> The requirement that the atomics should support C++11 could be met by
> >> making sure that the description of the expected behaviour is
> >> sufficient for C++11.
> >
> > We don't just want the semantics of __atomic* to be sufficient for
> > C++11 but also want them to be *as weak as possible* while still being
> > sufficient for C++11 -- otherwise, we'll make C++11 code less efficient
> > than it can be.
>
> Yes, that's implied by 'sufficient'.

I disagree.  If condition C1 is sufficient to yield condition C2, then
C1 implies C2 -- but they are not necessarily equal.  "Sufficient and
necessary" would be what we want for the mapping.
(For example, it would be *sufficient* to make all __atomic* have
seq-cst behavior, irrespective of the memory order argument -- but it
wouldn't be *necessary*, and it would be less efficient.)
commit 0fb4c8ef5aafc3d48c5aeb6487feb7e3356b43f2
Author: Torvald Riegel <triegel@redhat.com>
Date:   Fri May 15 18:14:40 2015 +0200

    Fix memory order description in atomic ops built-ins docs.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 6004681..4eb1b54 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8853,19 +8853,19 @@ are not prevented from being speculated to before the barrier.
 @section Built-in Functions for Memory Model Aware Atomic Operations
 
 The following built-in functions approximately match the requirements
-for C++11 concurrency and memory models.  They are all
+for the C++11 memory model.  They are all
 identified by being prefixed with @samp{__atomic} and most are
 overloaded so that they work with multiple types.
 
 These functions are intended to replace the legacy @samp{__sync}
-builtins.  The main difference is that the memory model to be used is a
-parameter to the functions.  New code should always use the
+builtins.  The main difference is that the memory order that is requested
+is a parameter to the functions.  New code should always use the
 @samp{__atomic} builtins rather than the @samp{__sync} builtins.
 
 Note that the @samp{__atomic} builtins assume that programs will
-conform to the C++11 model for concurrency.  In particular, they assume
+conform to the C++11 memory model.  In particular, they assume
 that programs are free of data races.  See the C++11 standard for
-detailed definitions.
+detailed requirements.
 
 The @samp{__atomic} builtins can be used with any integral scalar or
 pointer type that is 1, 2, 4, or 8 bytes in length.  16-byte integral
@@ -8874,137 +8874,140 @@ supported by the architecture.
 
 The four non-arithmetic functions (load, store, exchange, and
 compare_exchange) all have a generic version as well.  This generic
-version works on any data type.  If the data type size maps to one
-of the integral sizes that may have lock free support, the generic
-version uses the lock free built-in function.  Otherwise an
+version works on any data type.  It uses the lock-free built-in function
+if the specific data type size makes that possible; otherwise, an
 external call is left to be resolved at run time.  This external call
 is the same format with the addition of a @samp{size_t} parameter
 inserted as the first parameter indicating the size of the object
 being pointed to.  All objects must be the same size.
 
-There are 6 different memory models that can be specified.  These map
-to the C++11 memory models with the same names, see the C++11 standard
+There are 6 different memory orders that can be specified.  These map
+to the C++11 memory orders with the same names, see the C++11 standard
 or the @uref{http://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync,GCC wiki
 on atomic synchronization} for detailed definitions.  Individual
-targets may also support additional memory models for use on specific
+targets may also support additional memory orders for use on specific
 architectures.  Refer to the target documentation for details of
 these.
 
-The memory models integrate both barriers to code motion as well as
-synchronization requirements with other threads.  They are listed here
-in approximately ascending order of strength.
+An atomic operation can both constrain code motion and
+be mapped to hardware instructions for synchronization between threads
+(e.g., a fence).  To which extent this happens is controlled by the
+memory orders, which are listed here in approximately ascending order of
+strength.  The description of each memory order is only meant to roughly
+illustrate the effects and is not a specification; see the C++11
+memory model for precise semantics.
 
 @table @code
 @item __ATOMIC_RELAXED
-No barriers or synchronization.
+Implies no inter-thread ordering constraints.
 @item __ATOMIC_CONSUME
-Data dependency only for both barrier and synchronization with another
-thread.
+This is currently implemented using the stronger @code{__ATOMIC_ACQUIRE}
+memory order because of a deficiency in C++11's semantics for
+@code{memory_order_consume}.
 @item __ATOMIC_ACQUIRE
-Barrier to hoisting of code and synchronizes with release (or stronger)
-semantic stores from another thread.
+Creates an inter-thread happens-before constraint from the release (or
+stronger) semantic store to this acquire load.  Can prevent hoisting
+of code to before the operation.
 @item __ATOMIC_RELEASE
-Barrier to sinking of code and synchronizes with acquire (or stronger)
-semantic loads from another thread.
+Creates an inter-thread happens-before constraint to acquire (or stronger)
+semantic loads that read from this release store.  Can prevent sinking
+of code to after the operation.
 @item __ATOMIC_ACQ_REL
-Barrier in both directions and synchronizes with acquire loads and
-release stores in another thread.
+Combines the effects of both @code{__ATOMIC_ACQUIRE} and
+@code{__ATOMIC_RELEASE}.
 @item __ATOMIC_SEQ_CST
-Barrier in both directions and synchronizes with acquire loads and
-release stores in all threads.
+Enforces total ordering with all other @code{__ATOMIC_SEQ_CST} operations.
 @end table
 
-Note that the scope of a C++11 memory model depends on whether or not
-the function being called is a @emph{fence} (such as
-@samp{__atomic_thread_fence}).  In a fence, all memory accesses are
-subject to the restrictions of the memory model.  When the function is
-an operation on a location, the restrictions apply only to those
-memory accesses that could affect or that could depend on the
-location.
+Note that in the C++11 memory model, @emph{fences} (e.g.,
+@samp{__atomic_thread_fence}) take effect in combination with other
+atomic operations on specific memory locations (e.g., atomic loads);
+operations on specific memory locations do not necessarily affect other
+operations in the same way.
 
 Target architectures are encouraged to provide their own patterns for
-each of these built-in functions.  If no target is provided, the original
+each of the atomic built-in functions.  If no target is provided, the original
 non-memory model set of @samp{__sync} atomic built-in functions are
 used, along with any required synchronization fences surrounding it in
 order to achieve the proper behavior.  Execution in this case is subject
 to the same restrictions as those built-in functions.
 
-If there is no pattern or mechanism to provide a lock free instruction
+If there is no pattern or mechanism to provide a lock-free instruction
 sequence, a call is made to an external routine with the same parameters
 to be resolved at run time.
 
-When implementing patterns for these built-in functions, the memory model
+When implementing patterns for these built-in functions, the memory order
 parameter can be ignored as long as the pattern implements the most
-restrictive @code{__ATOMIC_SEQ_CST} model.  Any of the other memory models
-execute correctly with this memory model but they may not execute as
+restrictive @code{__ATOMIC_SEQ_CST} memory order.  Any of the other memory
+orders execute correctly with this memory order but they may not execute as
 efficiently as they could with a more appropriate implementation of the
 relaxed requirements.
 
-Note that the C++11 standard allows for the memory model parameter to be
+Note that the C++11 standard allows for the memory order parameter to be
 determined at run time rather than at compile time.  These built-in
 functions map any run-time value to @code{__ATOMIC_SEQ_CST} rather
 than invoke a runtime library call or inline a switch statement.  This is
 standard compliant, safe, and the simplest approach for now.
 
-The memory model parameter is a signed int, but only the lower 16 bits are
-reserved for the memory model.  The remainder of the signed int is reserved
+The memory order parameter is a signed int, but only the lower 16 bits are
+reserved for the memory order.  The remainder of the signed int is reserved
 for target use and should be 0.  Use of the predefined atomic values
 ensures proper usage.
 
-@deftypefn {Built-in Function} @var{type} __atomic_load_n (@var{type} *ptr, int memmodel)
+@deftypefn {Built-in Function} @var{type} __atomic_load_n (@var{type} *ptr, int memorder)
 This built-in function implements an atomic load operation.  It returns the
 contents of @code{*@var{ptr}}.
 
-The valid memory model variants are
+The valid memory order variants are
 @code{__ATOMIC_RELAXED}, @code{__ATOMIC_SEQ_CST}, @code{__ATOMIC_ACQUIRE},
 and @code{__ATOMIC_CONSUME}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_load (@var{type} *ptr, @var{type} *ret, int memmodel)
+@deftypefn {Built-in Function} void __atomic_load (@var{type} *ptr, @var{type} *ret, int memorder)
 This is the generic version of an atomic load.  It returns the
 contents of @code{*@var{ptr}} in @code{*@var{ret}}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_store_n (@var{type} *ptr, @var{type} val, int memmodel)
+@deftypefn {Built-in Function} void __atomic_store_n (@var{type} *ptr, @var{type} val, int memorder)
 This built-in function implements an atomic store operation.  It writes
 @code{@var{val}} into @code{*@var{ptr}}.
 
-The valid memory model variants are
+The valid memory order variants are
 @code{__ATOMIC_RELAXED}, @code{__ATOMIC_SEQ_CST}, and @code{__ATOMIC_RELEASE}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_store (@var{type} *ptr, @var{type} *val, int memmodel)
+@deftypefn {Built-in Function} void __atomic_store (@var{type} *ptr, @var{type} *val, int memorder)
 This is the generic version of an atomic store.  It stores the value
 of @code{*@var{val}} into @code{*@var{ptr}}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} @var{type} __atomic_exchange_n (@var{type} *ptr, @var{type} val, int memmodel)
+@deftypefn {Built-in Function} @var{type} __atomic_exchange_n (@var{type} *ptr, @var{type} val, int memorder)
 This built-in function implements an atomic exchange operation.  It writes
 @var{val} into @code{*@var{ptr}}, and returns the previous contents of
 @code{*@var{ptr}}.
 
-The valid memory model variants are
+The valid memory order variants are
 @code{__ATOMIC_RELAXED}, @code{__ATOMIC_SEQ_CST}, @code{__ATOMIC_ACQUIRE},
 @code{__ATOMIC_RELEASE}, and @code{__ATOMIC_ACQ_REL}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_exchange (@var{type} *ptr, @var{type} *val, @var{type} *ret, int memmodel)
+@deftypefn {Built-in Function} void __atomic_exchange (@var{type} *ptr, @var{type} *val, @var{type} *ret, int memorder)
 This is the generic version of an atomic exchange.  It stores the
 contents of @code{*@var{val}} into @code{*@var{ptr}}. The original value
 of @code{*@var{ptr}} is copied into @code{*@var{ret}}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} bool __atomic_compare_exchange_n (@var{type} *ptr, @var{type} *expected, @var{type} desired, bool weak, int success_memmodel, int failure_memmodel)
+@deftypefn {Built-in Function} bool __atomic_compare_exchange_n (@var{type} *ptr, @var{type} *expected, @var{type} desired, bool weak, int success_memorder, int failure_memorder)
 This built-in function implements an atomic compare and exchange operation.
 This compares the contents of @code{*@var{ptr}} with the contents of
 @code{*@var{expected}}. If equal, the operation is a @emph{read-modify-write}
-which writes @var{desired} into @code{*@var{ptr}}.  If they are not
+operation that writes @var{desired} into @code{*@var{ptr}}.  If they are not
 equal, the operation is a @emph{read} and the current contents of
 @code{*@var{ptr}} is written into @code{*@var{expected}}.
 @var{weak} is true for weak compare_exchange, and false for
 the strong variation.  Many targets
@@ -9013,17 +9016,17 @@ the strong variation.
 
 True is returned if @var{desired} is written into
 @code{*@var{ptr}} and the operation is considered to conform to the
-memory model specified by @var{success_memmodel}.  There are no
-restrictions on what memory model can be used here.
+memory order specified by @var{success_memorder}.  There are no
+restrictions on what memory order can be used here.
 
 False is returned otherwise, and the operation is considered to conform
-to @var{failure_memmodel}. This memory model cannot be
+to @var{failure_memorder}. This memory order cannot be
 @code{__ATOMIC_RELEASE} nor @code{__ATOMIC_ACQ_REL}.  It also cannot be a
-stronger model than that specified by @var{success_memmodel}.
+stronger order than that specified by @var{success_memorder}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} bool __atomic_compare_exchange (@var{type} *ptr, @var{type} *expected, @var{type} *desired, bool weak, int success_memmodel, int failure_memmodel)
+@deftypefn {Built-in Function} bool __atomic_compare_exchange (@var{type} *ptr, @var{type} *expected, @var{type} *desired, bool weak, int success_memorder, int failure_memorder)
 This built-in function implements the generic version of
 @code{__atomic_compare_exchange}.  The function is virtually identical to
 @code{__atomic_compare_exchange_n}, except the desired value is also a
@@ -9031,12 +9034,12 @@ pointer.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} @var{type} __atomic_add_fetch (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_sub_fetch (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_and_fetch (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_xor_fetch (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_or_fetch (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_nand_fetch (@var{type} *ptr, @var{type} val, int memmodel)
+@deftypefn {Built-in Function} @var{type} __atomic_add_fetch (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_sub_fetch (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_and_fetch (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_xor_fetch (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_or_fetch (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_nand_fetch (@var{type} *ptr, @var{type} val, int memorder)
 These built-in functions perform the operation suggested by the name, and
@@ -9044,16 +9047,16 @@ return the result of the operation. That is,
 
 @smallexample
 @{ *ptr @var{op}= val; return *ptr; @}
 @end smallexample
 
-All memory models are valid.
+All memory orders are valid.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} @var{type} __atomic_fetch_add (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_fetch_sub (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_fetch_and (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_fetch_xor (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_fetch_or (@var{type} *ptr, @var{type} val, int memmodel)
-@deftypefnx {Built-in Function} @var{type} __atomic_fetch_nand (@var{type} *ptr, @var{type} val, int memmodel)
+@deftypefn {Built-in Function} @var{type} __atomic_fetch_add (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_fetch_sub (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_fetch_and (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_fetch_xor (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_fetch_or (@var{type} *ptr, @var{type} val, int memorder)
+@deftypefnx {Built-in Function} @var{type} __atomic_fetch_nand (@var{type} *ptr, @var{type} val, int memorder)
 These built-in functions perform the operation suggested by the name, and
@@ -9061,11 +9064,11 @@ return the value that had previously been in @code{*@var{ptr}}. That is,
 
 @smallexample
 @{ tmp = *ptr; *ptr @var{op}= val; return tmp; @}
 @end smallexample
 
-All memory models are valid.
+All memory orders are valid.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} bool __atomic_test_and_set (void *ptr, int memmodel)
+@deftypefn {Built-in Function} bool __atomic_test_and_set (void *ptr, int memorder)
 
 This built-in function performs an atomic test-and-set operation on
 the byte at @code{*@var{ptr}}.  The byte is set to some implementation
@@ -9074,11 +9077,11 @@ if the previous contents were ``set''.
 It should be only used for operands of type @code{bool} or @code{char}. For
 other types only part of the value may be set.
 
-All memory models are valid.
+All memory orders are valid.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_clear (bool *ptr, int memmodel)
+@deftypefn {Built-in Function} void __atomic_clear (bool *ptr, int memorder)
 
 This built-in function performs an atomic clear operation on
 @code{*@var{ptr}}.  After the operation, @code{*@var{ptr}} contains 0.
@@ -9087,22 +9090,22 @@ in conjunction with @code{__atomic_test_and_set}.
 For other types it may only clear partially. If the type is not @code{bool}
 prefer using @code{__atomic_store}.
 
-The valid memory model variants are
+The valid memory order variants are
 @code{__ATOMIC_RELAXED}, @code{__ATOMIC_SEQ_CST}, and
 @code{__ATOMIC_RELEASE}.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_thread_fence (int memmodel)
+@deftypefn {Built-in Function} void __atomic_thread_fence (int memorder)
 
 This built-in function acts as a synchronization fence between threads
-based on the specified memory model.
+based on the specified memory order.
 
 All memory orders are valid.
 
 @end deftypefn
 
-@deftypefn {Built-in Function} void __atomic_signal_fence (int memmodel)
+@deftypefn {Built-in Function} void __atomic_signal_fence (int memorder)
 
 This built-in function acts as a synchronization fence between a thread
 and signal handlers based in the same thread.
 
@@ -9114,7 +9117,7 @@ All memory orders are valid.
 
 @deftypefn {Built-in Function} bool __atomic_always_lock_free (size_t size, void *ptr)
 
 This built-in function returns true if objects of @var{size} bytes always
-generate lock free atomic instructions for the target architecture.
+generate lock-free atomic instructions for the target architecture.
 @var{size} must resolve to a compile-time constant and the result also
 resolves to a compile-time constant.
 
@@ -9131,9 +9134,9 @@ if (_atomic_always_lock_free (sizeof (long long), 0))
 
 @deftypefn {Built-in Function} bool __atomic_is_lock_free (size_t size, void *ptr)
 
 This built-in function returns true if objects of @var{size} bytes always
-generate lock free atomic instructions for the target architecture.  If
-it is not known to be lock free a call is made to a runtime routine named
-@code{__atomic_is_lock_free}.
+generate lock-free atomic instructions for the target architecture.  If
+the built-in function is not known to be lock-free, a call is made to a
+runtime routine named @code{__atomic_is_lock_free}.
 
 @var{ptr} is an optional pointer to the object that may be used to determine
 alignment.  A value of 0 indicates typical alignment should be used.  The
@@ -9204,20 +9207,20 @@ functions above, except they perform multiplication, instead of addition.
 
 The x86 architecture supports additional memory ordering flags
 to mark lock critical sections for hardware lock elision.
-These must be specified in addition to an existing memory model to
+These must be specified in addition to an existing memory order to
 atomic intrinsics.
 
 @table @code
 @item __ATOMIC_HLE_ACQUIRE
 Start lock elision on a lock variable.
-Memory model must be @code{__ATOMIC_ACQUIRE} or stronger.
+Memory order must be @code{__ATOMIC_ACQUIRE} or stronger.
 @item __ATOMIC_HLE_RELEASE
 End lock elision on a lock variable.
-Memory model must be @code{__ATOMIC_RELEASE} or stronger.
+Memory order must be @code{__ATOMIC_RELEASE} or stronger.
 @end table
 
-When a lock acquire fails it is required for good performance to abort
-the transaction quickly. This can be done with a @code{_mm_pause}
+When a lock acquire fails, it is required for good performance to abort
+the transaction quickly. This can be done with a @code{_mm_pause}.
 
 @smallexample
 #include <immintrin.h> // For _mm_pause