Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

Message ID 51C085EF.1040303@redhat.com
State New

Commit Message

Paolo Bonzini June 18, 2013, 4:08 p.m. UTC
On 18/06/2013 16:50, Paul E. McKenney wrote:
> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>      only because I stand with one foot in theory and the other in
>      practice.  If I were a pure practitioner, I would probably strongly
>      prefer the Java formulation.

Awesome answer, and this last paragraph sums it up pretty well.

That was basically my understanding, too.  I still do not completely 
get the relationship between Java semantics and ACQ_REL, but I can 
sidestep the issue for adding portable atomics to QEMU.  QEMU 
developers and Linux developers have some overlap, and Java volatiles 
are simple to understand in terms of memory barriers (which Linux 
uses); hence, I'll treat ourselves as pure practitioners.

I will just not use __atomic_load/__atomic_store to implement the 
primitives, and always express them in terms of memory barriers.  Of 
course, the memory barriers themselves are defined using 
__atomic_thread_fence.

I attach v2 of the patch below; Ping Fan, feel free to pick it up.

Thanks again!

Paolo

----------- 8< -------------

From 5c71a423156d1f0d61f17afa99b099bf139c64ae Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon, 13 May 2013 13:29:47 +0200
Subject: [PATCH] add a header file for atomic operations

We're already using them in several places, but __sync builtins are just
too ugly to type, and do not provide seqcst load/store operations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/atomics.txt         | 345 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/display/qxl.c         |   3 +-
 hw/virtio/vhost.c        |   9 +-
 include/qemu/atomic.h    | 190 +++++++++++++++++++++-----
 migration.c              |   3 +-
 tests/test-thread-pool.c |   8 +-
 6 files changed, 514 insertions(+), 44 deletions(-)
 create mode 100644 docs/atomics.txt

Comments

Torvald Riegel June 18, 2013, 4:38 p.m. UTC | #1
On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> On 18/06/2013 16:50, Paul E. McKenney wrote:
> > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> >      only because I stand with one foot in theory and the other in
> >      practice.  If I were a pure practitioner, I would probably strongly
> >      prefer the Java formulation.
> 
> Awesome answer, and this last paragraph sums it up pretty well.

I disagree that for non-Java code the Java model should be better.  Both
C11 and C++11 use the same model, and I don't see a reason to not use it
if you're writing C/C++ code anyway.

The C++ model is definitely useful for practitioners; just because it
uses seq-cst memory order as safe default doesn't mean that programmers
that can deal with weaker ordering guarantees can't make use of those
weaker ones.
I thought Paul was referring to seq-cst as default; if that wasn't the
point he wanted to make, I actually don't understand his theory/practice
comparison (never mind that whenever you need to reason about concurrent
stuff, having a solid formal framework as the one by the Cambridge group
is definitely helpful).  Seq-cst and acq-rel are just different
guarantees -- this doesn't mean that one is better than the other; you
need to understand anyway what you're doing and which one you need.
Often, ensuring a synchronized-with edge by pairing release/acquire will
be sufficient, but that doesn't say anything about the Java vs. C/C++
model.

> That was basically my understanding, too.  I still do not completely 
> get the relationship between Java semantics and ACQ_REL, but I can 
> sidestep the issue for adding portable atomics to QEMU.  QEMU 
> developers and Linux developers have some overlap, and Java volatiles 
> are simple to understand in terms of memory barriers (which Linux 
> uses); hence, I'll treat ourselves as pure practitioners.

I don't think that this is the conclusion here.  I strongly suggest to
just go with the C11/C++11 model, instead of rolling your own or trying
to replicate the Java model.  That would also allow you to just point to
the C11 model and any information / tutorials about it instead of having
to document your own (see the patch), and you can make use of any
(future) tool support (e.g., race detectors).

> I will just not use __atomic_load/__atomic_store to implement the 
> primitives, and always express them in terms of memory barriers.

Why?  (If there's some QEMU-specific reason, just let me know; I know
little about QEMU..)
I would assume that using the __atomic* builtins is just fine if they're
available.


Torvald
Paul E. McKenney June 19, 2013, 1:57 a.m. UTC | #2
On Tue, Jun 18, 2013 at 06:38:38PM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> > On 18/06/2013 16:50, Paul E. McKenney wrote:
> > > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> > >      only because I stand with one foot in theory and the other in
> > >      practice.  If I were a pure practitioner, I would probably strongly
> > >      prefer the Java formulation.
> > 
> > Awesome answer, and this last paragraph sums it up pretty well.
> 
> I disagree that for non-Java code the Java model should be better.  Both
> C11 and C++11 use the same model, and I don't see a reason to not use it
> if you're writing C/C++ code anyway.
> 
> The C++ model is definitely useful for practitioners; just because it
> uses seq-cst memory order as safe default doesn't mean that programmers
> that can deal with weaker ordering guarantees can't make use of those
> weaker ones.
> I thought Paul was referring to seq-cst as default; if that wasn't the
> point he wanted to make, I actually don't understand his theory/practice
> comparison (never mind that whenever you need to reason about concurrent
> stuff, having a solid formal framework as the one by the Cambridge group
> is definitely helpful).  Seq-cst and acq-rel are just different
> guarantees -- this doesn't mean that one is better than the other; you
> need to understand anyway what you're doing and which one you need.
> Often, ensuring a synchronized-with edge by pairing release/acquire will
> be sufficient, but that doesn't say anything about the Java vs. C/C++
> model.

Having the Cambridge group's model and tooling around has definitely
made my life much easier!

And yes, I do very much see a need for a wide range of ordering/overhead
tradeoffs, at least until such time as someone figures a way around
either the atomic nature of matter or the finite speed of light.  ;-)

							Thanx, Paul

> > That was basically my understanding, too.  I still do not completely 
> > get the relationship between Java semantics and ACQ_REL, but I can 
> > sidestep the issue for adding portable atomics to QEMU.  QEMU 
> > developers and Linux developers have some overlap, and Java volatiles 
> > are simple to understand in terms of memory barriers (which Linux 
> > uses); hence, I'll treat ourselves as pure practitioners.
> 
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).
> 
> > I will just not use __atomic_load/__atomic_store to implement the 
> > primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU..)
> I would assume that using the __atomic* builtins is just fine if they're
> available.
> 
> 
> Torvald
>
Paolo Bonzini June 19, 2013, 9:31 a.m. UTC | #3
On 18/06/2013 18:38, Torvald Riegel wrote:
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).

I'm definitely not rolling my own, but I think there is some value in
using the Java model.  Warning: the explanation came out quite
verbose... tl;dr at the end.


One reason is that implementing SC for POWER is quite expensive, while
this is not the case for Java volatile (which I'm still not convinced is
acq-rel, because it also orders volatile stores and volatile loads).
People working on QEMU are often used to manually placed barriers on
Linux, and Linux barriers do not fully give you seq-cst semantics.  They
give you something much more similar to the Java model.

The Java model gives good performance and is easier to understand than
the non-seqcst modes of atomic builtins.  It is pretty much impossible
to understand the latter without a formal model; I see the importance of
a formal model, but at the same time it is hard not to appreciate the
detailed-but-practical style of the Linux documentation.


Second, the Java model has very good "practical" documentation from
sources I trust.  Note the part about trust: I found way too many Java
tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.

Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
(especially now that Richard and Paul completed my understanding of
them).  There are substantially fewer practical documents for C11/C++11
that are similarly authoritative.  I obviously trust Cambridge for
C11/C++11, but their material is very concise or just refers to the
formal model.  The formal model is not what I want when my question is
simply "why is lwsync good for acquire and release, but not for
seqcst?", for example.  And the papers sometime refer to "private
communication" between the authors and other people, which can be
annoying.  Hans Boehm and Herb Sutter have good poster and slide
material, but they do not have the same level of completeness as Paul's
Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...


Third, we must support old GCC (even as old as 4.2), so we need
hand-written assembly for atomics anyway.  This again goes back to
documentation and the JSR-133 cookbook.  It not only gives you
instructions on how to implement the model (which is also true for the
Cambridge web pages on C11/C++11), but is also a good base for writing
our own documentation.  It helped me understand existing code using
barriers, optimize it, and put this knowledge into words.  I just
couldn't find anything as useful for C11/C++11.


In short, the C11/C++11 model is not what most developers are used to
here, hardware is not 100% mature for it (for example ARMv8 has seqcst
load/store; perhaps POWER will grow that in time), is harder to
optimize, and has (as of 2013) less "practical" documentation from
sources I trust.

Besides, since what I'm using is weaker than SC, there's always the
possibility of switching to SC in the future when enough of these issues
are solved.  In case you really need SC _now_, it is easy to do it using
fetch-and-add (for loads) or xchg (for stores).
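
To make that concrete, here is a rough, untested sketch of what I mean,
written with the __atomic builtins (sc_load/sc_store are made-up names,
not part of the patch):

    /* SC load emulated with a read-modify-write that adds 0.  */
    static inline int sc_load(int *ptr)
    {
        return __atomic_fetch_add(ptr, 0, __ATOMIC_SEQ_CST);
    }

    /* SC store emulated with an exchange whose old value is ignored.  */
    static inline void sc_store(int *ptr, int val)
    {
        (void) __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
    }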

>> I will just not use __atomic_load/__atomic_store to implement the 
>> primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU..)

I guess I mentioned the QEMU-specific reasons above.

> I would assume that using the __atomic* builtins is just fine if they're
> available.

It would implement slightly different semantics based on the compiler
version, so I think it's dangerous.

Paolo
Torvald Riegel June 19, 2013, 1:15 p.m. UTC | #4
On Wed, 2013-06-19 at 11:31 +0200, Paolo Bonzini wrote:
> On 18/06/2013 18:38, Torvald Riegel wrote:
> > I don't think that this is the conclusion here.  I strongly suggest to
> > just go with the C11/C++11 model, instead of rolling your own or trying
> > to replicate the Java model.  That would also allow you to just point to
> > the C11 model and any information / tutorials about it instead of having
> > to document your own (see the patch), and you can make use of any
> > (future) tool support (e.g., race detectors).
> 
> I'm definitely not rolling my own, but I think there is some value in
> using the Java model.  Warning: the explanation came out quite
> verbose... tl;dr at the end.

That's fine -- it is a nontrival topic after all.  My reply is equally
lengthy :)

> 
> 
> One reason is that implementing SC for POWER is quite expensive,

Sure, but you don't have to use SC fences or atomics if you don't want
them.  Note that C11/C++11 as well as the __atomic* builtins allow you
to specify a memory order.  It's perfectly fine to use acquire fences or
release fences.  There shouldn't be just one kind of barrier/fence.

> while
> this is not the case for Java volatile (which I'm still not convinced is
> acq-rel, because it also orders volatile stores and volatile loads).

But you have the same choice with C11/C++11.  You seemed to be fine with
using acquire/release in your code, so if that's what you want, just use
it :)

I vaguely remember that the Java model gives stronger guarantees than
C/C++ in the presence of data races; something about no values
out-of-thin-air.  Maybe that's where the stronger-than-acq/rel memory
orders for volatile stores come from.

> People working on QEMU are often used to manually placed barriers on
> Linux, and Linux barriers do not fully give you seq-cst semantics.  They
> give you something much more similar to the Java model.

Note that there is a reason why C11/C++11 don't just have barriers
combined with ordinary memory accesses: The compiler needs to be aware
which accesses are sequential code (so it can assume that they are
data-race-free) and which are potentially concurrent with other accesses
to the same data.  When you just use normal memory accesses with
barriers, you're relying on non-specified behavior in a compiler, and
you have no guarantee that the compiler reads a value just once, for
example; you can try to make this very likely be correct by careful
placement of asm compiler barriers, but this is likely to be more
difficult than just using atomics, which will do the right thing.
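
To illustrate with a contrived sketch (hypothetical names, nothing from
QEMU):

    int flag;   /* plain, non-atomic int shared between threads */

    /* The compiler may hoist the load of 'flag' out of the loop and keep
     * it in a register, because nothing tells it that another thread can
     * change 'flag' concurrently.  */
    while (!flag)
        ;

    /* A relaxed or acquire atomic load forces one real load per
     * iteration, with the requested ordering.  */
    while (!__atomic_load_n(&flag, __ATOMIC_ACQUIRE))
        ;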

Linux folks seem to be doing fine in this area, but they also seem to
mostly use existing concurrency abstractions such as locks or RCU.
Thus, maybe it's not too good an indication of the ease of use of their
style of expressing low-level synchronization.

> The Java model gives good performance and is easier to understand than
> the non-seqcst modes of atomic builtins.  It is pretty much impossible
> to understand the latter without a formal model;

I'm certainly biased because I've been looking at this a lot.  But I
believe that there are common synchronization idioms which are
relatively straightforward to understand.  Acquire/release pairs should
be one such case.  Locks are essentially the same everywhere.

> I see the importance of
> a formal model, but at the same time it is hard not to appreciate the
> detailed-but-practical style of the Linux documentation.

I don't think the inherent complexity programmers *need* to deal with is
different because in the end, if you have two ways to express the same
ordering guarantees, you have to reason about the very same ordering
guarantees.  This is what makes up most of the complexity.  For example,
if you want to use acq/rel widely, you need to understand what this
means for your concurrent code; this won't change significantly
depending on how you express it.

I think the C++11 model is pretty compact, meaning no unnecessary fluff
(maybe one would need fewer memory orders, but those aren't too hard to
ignore).  I believe that if you'd want to specify the Linux model
including any implicit assumptions on the compilers, you'd end up with
the same level of detail in the model.

Maybe the issue that you see with C11/C++11 is that it offers more than
you actually need.  If you can summarize what kind of synchronization /
concurrent code you are primarily looking at, I can try to help outline
a subset of it (i.e., something like code style but just for
synchronization).

> Second, the Java model has very good "practical" documentation from
> sources I trust.  Note the part about trust: I found way too many Java
> tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.
> 
> Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
> (especially now that Richard and Paul completed my understanding of
> them).  There are substantially fewer practical documents for C11/C++11
> that are similarly authoritative.

That's true.  But both models are younger in the sense of being
available to practitioners.  Boehm's libatomic-ops library was close,
but compiler support for them wasn't available until recently.  But
maybe it's useful to look 1 or 2 years ahead; I suspect that there will
be much more documentation in the near future.

> I obviously trust Cambridge for
> C11/C++11, but their material is very concise or just refers to the
> formal model.

Yes, their publications are really about the model.  It's not a
tutorial, but useful for reference.  BTW, have you read their C++ paper
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
or the POPL paper?  The former has more detail (no page limit).

If you haven't yet, I suggest giving their cppmem tool a try too:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
It's a browser-based tool that shows you all the possible interleavings
of small pieces of synchronization code, together with a graph with the
relationships as in their model.  Very helpful to explore whether your
synchronization code does the right thing.  At least, I'm not aware of
anything like that for Java.

> The formal model is not what I want when my question is
> simply "why is lwsync good for acquire and release, but not for
> seqcst?", for example.

But that's a question that is not about the programming-language-level
memory model, but about POWER's memory model.  I suppose that's not
something you'd have to deal with frequently, or are you indeed
primarily interested in emulating a particular architecture's model?
The Cambridge group also has a formal model for POWER's memory model,
perhaps this can help answer this question.

> And the papers sometime refer to "private
> communication" between the authors and other people, which can be
> annoying.  Hans Boehm and Herb Sutter have good poster and slide
> material, but they do not have the same level of completeness as Paul's
> Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...
> 
> 
> Third, we must support old GCC (even as old as 4.2), so we need
> hand-written assembly for atomics anyway.

Perhaps you could make use of GCC's libatomic, or at least take code
from it.  Boehm's libatomic-ops might be another choice.  But I agree
that you can't just switch to the newer atomics support; it will take
some more time until its availability is the norm.

> This again goes back to
> documentation and the JSR-133 cookbook.  It not only gives you
> instructions on how to implement the model (which is also true for the
> Cambridge web pages on C11/C++11), but is also a good base for writing
> our own documentation.  It helped me understand existing code using
> barriers, optimize it, and put this knowledge into words.  I just
> couldn't find anything as useful for C11/C++11.
> 
> 
> In short, the C11/C++11 model is not what most developers are used to
> here

I guess so.  But you also have to consider the legacy that you create.
I do think the C11/C++11 model will be used widely, and more and more
people will get used to it.  Thus, it's a question of whether you want to
deviate from what will likely be the common approach in the not too far
away future.

> hardware is not 100% mature for it (for example ARMv8 has seqcst
> load/store; perhaps POWER will grow that in time),

I'm not quite sure what you mean, or at least how what you might mean is
related to the choice between the C/C++ and Java models.

> is harder to
> optimize, and has (as of 2013) less "practical" documentation from
> sources I trust.
> 
> Besides, since what I'm using is weaker than SC, there's always the
> possibility of switching to SC in the future when enough of these issues
> are solved.

C11/C++11 is not _equal to_ SC.  It includes SC (because it has to
anyway to enable certain concurrent algorithms), and C++ has SC as
default, but this doesn't mean that the C/C++ model equates to SC
semantics.

> In case you really need SC _now_, it is easy to do it using
> fetch-and-add (for loads) or xchg (for stores).

Not on POWER and other architectures with similarly weak memory
models :)  On x86 and such the atomic read-modify-write ops give you
strong ordering, but that's an artifact of those models.  You can get SC
on most of the archs without having to use atomic read-modify-write ops.

If the fetch-and-add abstractions in your model always give you SC, then
that's more an indication that they don't allow you the fine-grained
control that you want, especially on POWER :)

> 
> >> I will just not use __atomic_load/__atomic_store to implement the 
> >> primitives, and always express them in terms of memory barriers.
> > 
> > Why?  (If there's some QEMU-specific reason, just let me know; I know
> > little about QEMU..)
> 
> I guess I mentioned the QEMU-specific reasons above.
> 
> > I would assume that using the __atomic* builtins is just fine if they're
> > available.
> 
> It would implement slightly different semantics based on the compiler
> version, so I think it's dangerous.

If __atomic is available, the semantics don't change.  The __sync
builtins do provide a certain subset of the semantics you get with the
__atomic builtins, however.


Torvald
Paolo Bonzini June 19, 2013, 3:14 p.m. UTC | #5
On 19/06/2013 15:15, Torvald Riegel wrote:
>> One reason is that implementing SC for POWER is quite expensive,
> 
> Sure, but you don't have to use SC fences or atomics if you don't want
> them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> to specify a memory order.  It's perfectly fine to use acquire fences or
> release fences.  There shouldn't be just one kind of barrier/fence.

Agreed.  For example Linux uses four: consume (read_barrier_depends),
acquire (rmb), release (wmb), SC (mb).  In addition, in Linux loads and
stores are always relaxed; some RMW ops are SC but others are relaxed.
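
Roughly, the correspondence I have in mind is the following sketch (not
the literal macros in the patch, which also add compiler barriers and
per-architecture special cases), when the __atomic builtins are
available:

    #define smp_read_barrier_depends()  __atomic_thread_fence(__ATOMIC_CONSUME)
    #define smp_rmb()                   __atomic_thread_fence(__ATOMIC_ACQUIRE)
    #define smp_wmb()                   __atomic_thread_fence(__ATOMIC_RELEASE)
    #define smp_mb()                    __atomic_thread_fence(__ATOMIC_SEQ_CST)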

I want to do something similar in QEMU, with as few changes as possible.
 In the end I settled for the following:

(1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
but RMW shouldn't be that bad.  I don't care if reference counting is a
little slower than it could be, for example);

(2) I'd like to have some kind of non-reordering load/store too, either
SC (which I've improperly referred to as C11/C++11 in my previous email)
or Java volatile.

   [An aside: Java guarantees that volatile stores are not reordered
   with volatile loads.  This is not guaranteed by just using release
   stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].

As long as you only have a producer and a consumer, C11 is fine, because
all you need is load-acquire/store-release.  In fact, if it weren't for
the experience factor, C11 is easier than manually placing acquire and
release barriers.  But as soon as two or more threads are reading _and_
writing the shared memory, it gets complicated and I want to provide
something simple that people can use.  This is the reason for (2) above.
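
For the simple producer/consumer case, I mean something like this sketch
('data', 'ready', compute() and consume() are made-up names):

    /* Producer: publish the data, then set the flag with release
     * semantics so that the store to 'data' is visible before the flag.  */
    data = compute();
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);

    /* Consumer: the acquire load of the flag pairs with the release store
     * above, so 'data' is guaranteed to be visible after the loop.  */
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {
        /* spin */
    }
    consume(data);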

There will still be a few cases that need to be optimized, and this is
where the difficult requirements come in:

(R1) the primitives *should* not be alien to people who know Linux.

(R2) those optimizations *must* be easy to do and review; at least as
easy as these things go.

The two are obviously related.  Ease of review is why it is important to
make things familiar to people who know Linux.

In C11, relaxing SC loads and stores is complicated, and more
specifically hard to explain!  I cannot do that myself, and much less
explain that to the community.  I cannot make them do that.
Unfortunately, relaxing SC loads and stores is important on POWER which
has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
fails both requirements. :(

By contrast, Java volatile semantics are easily converted to a sequence
of relaxed loads, relaxed stores, and acq/rel/sc fences.  It's almost an
algorithm; I tried to do that myself and succeeded, and I could document it
nicely.  Even better, there are authoritative sources that confirm my
writing and should be accessible to people who did synchronization
"stuff" in Linux (no formal models :)).  In this respect, Java satisfies
both requirements.
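
The conversion I have in mind is roughly the following sketch (the same
mapping as in the JSR-133 cookbook, and the same one the patch documents
for atomic_mb_read/atomic_mb_set; 'x' and 'val' are placeholders):

    /* "Volatile" read: relaxed load followed by an acquire fence
     * (QEMU's smp_rmb).  */
    val = __atomic_load_n(&x, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_ACQUIRE);

    /* "Volatile" write: release fence (smp_wmb), relaxed store, then a
     * full fence (smp_mb) so that the store cannot be reordered with
     * later "volatile" loads.  */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    __atomic_store_n(&x, val, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);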

And the loss is limited, since things such as Dekker's algorithm are
rare in practice.  (In particular, RCU can be implemented just fine with
Java volatile semantics, but load-acquire/store-release is not enough).

[Nothing really important after this point, I think].

> Note that there is a reason why C11/C++11 don't just have barriers
> combined with ordinary memory accesses: The compiler needs to be aware
> which accesses are sequential code (so it can assume that they are
> data-race-free) and which are potentially concurrent with other accesses
> to the same data.  [...]
> you can try to make this very likely be correct by careful
> placement of asm compiler barriers, but this is likely to be more
> difficult than just using atomics, which will do the right thing.

Note that asm is just for older compilers (and even then I try to use
GCC intrinsics as much as possible).

On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
thread fences) to properly annotate references.  rth also suggested that
I use load/store(relaxed) instead of C volatile.

> Maybe the issue that you see with C11/C++11 is that it offers more than
> you actually need.  If you can summarize what kind of synchronization /
> concurrent code you are primarily looking at, I can try to help outline
> a subset of it (i.e., something like code style but just for
> synchronization).

Is the above explanation clearer?

>> I obviously trust Cambridge for
>> C11/C++11, but their material is very concise or just refers to the
>> formal model.
> 
> Yes, their publications are really about the model.  It's not a
> tutorial, but useful for reference.  BTW, have you read their C++ paper
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> or the POPL paper?  The former has more detail (no page limit).

I know it, but I cannot say I tried hard to understand it.

> If you haven't yet, I suggest giving their cppmem tool a try too:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

I saw and tried the similar tool for POWER.  The problem with these
tools is that they require you to abstract your program into an input
to the tool.  It works when _writing_ the code, but not when reviewing it.

>> The formal model is not what I want when my question is
>> simply "why is lwsync good for acquire and release, but not for
>> seqcst?", for example.
> 
> But that's a question that is not about the programming-language-level
> memory model, but about POWER's memory model.  I suppose that's not
> something you'd have to deal with frequently, or are you indeed
> primarily interested in emulating a particular architecture's model?
> The Cambridge group also has a formal model for POWER's memory model,
> perhaps this can help answer this question.

I'm more familiar with Cambridge's POWER work than the C++ work, and it
didn't. :)  I care about the POWER memory model because it's roughly the
only one where

    load(seqcst) => load(relaxed) + fence(acquire)
    store(seqcst) => fence(release) + store(relaxed) + fence(seqcst)

is not a valid (though perhaps suboptimal) mapping.  Note that the
primitives on the RHS are basically what they use in the Linux kernel.

>> In short, the C11/C++11 model is not what most developers are used to
>> here
> 
> I guess so.  But you also have to consider the legacy that you create.
> I do think the C11/C++11 model will be used widely, and more and more
> people will get used to it.

I don't think many people will learn how to use the various non-seqcst
modes...  At least so far I punted. :)

Many idioms will certainly be wrapped within the C++ standard library
(smart pointers and all that), but C starts at a disadvantage.  You have
a point, though.

>> In case you really need SC _now_, it is easy to do it using
>> fetch-and-add (for loads) or xchg (for stores).
> 
> Not on POWER and other architectures with similarly weak memory models :)  
> If the fetch-and-add abstractions in your model always give you SC, then
> that's more an indication that they don't allow you the fine-grained
> control that you want, especially on POWER :)

Indeed, they don't allow fine-grained control.  But I need to strike a
balance between control and ease of use, otherwise C11 would have been
an obvious choice (except for the detail of older compilers, but let's
ignore it).

Paolo
Torvald Riegel June 19, 2013, 8:25 p.m. UTC | #6
On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> On 19/06/2013 15:15, Torvald Riegel wrote:
> >> One reason is that implementing SC for POWER is quite expensive,
> > 
> > Sure, but you don't have to use SC fences or atomics if you don't want
> > them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> > to specify a memory order.  It's perfectly fine to use acquire fences or
> > release fences.  There shouldn't be just one kind of barrier/fence.
> 
> Agreed.  For example Linux uses four: consume (read_barrier_depends),
> acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
> stores are always relaxed, some RMW ops are SC but others are relaxed.
> 
> I want to do something similar in QEMU, with as few changes as possible.
>  In the end I settled for the following:
> 
> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> but RMW shouldn't be that bad.  I don't care if reference counting is a
> little slower than it could be, for example);

I doubt relaxed RMW ops are sufficient even for reference counting.
Typically, the reference counter is used in a way conceptually similar to a lock,
so you need the acquire/release (modulo funny optimizations).  The only
use case that comes to my mind right now for relaxed RMW is really just
statistics counters or such, or cases where you can "re-use" another
barrier.

> (2) I'd like to have some kind of non-reordering load/store too, either
> SC (which I've improperly referred to as C11/C++11 in my previous email)
> or Java volatile.

Often you probably don't need more than acq/rel, as Paul pointed out.
SC becomes important once you do something like Dekker-style sync, so
cases where you sync via several separate variables to avoid the cache
misses in some common case.  Once you go through one variable in the
end, acq/rel should be fine.

>    [An aside: Java guarantees that volatile stores are not reordered
>    with volatile loads.  This is not guaranteed by just using release
>    stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].
   Or maybe Java volatile is acq for loads and seq_cst for stores...
> 
> As long as you only have a producer and a consumer, C11 is fine, because
> all you need is load-acquire/store-release.  In fact, if it weren't for
> the experience factor, C11 is easier than manually placing acquire and
> release barriers.  But as soon as two or more threads are reading _and_
> writing the shared memory, it gets complicated and I want to provide
> something simple that people can use.  This is the reason for (2) above.

I can't quite follow you here.  There is a total order for all
modifications to a single variable, and if you use acq/rel combined with
loads and stores on this variable, then you basically can make use of
the total order.  (All loads that read-from a certain store get a
synchronized-with (and thus happens-before edge) with the store, and the
stores are in a total order.)  This is independent of the number of
readers and writers.  The difference starts once you want to sync with
more than one variable, and need to establish an order between those
accesses.

> There will still be a few cases that need to be optimized, and here are
> where the difficult requirements come:
> 
> (R1) the primitives *should* not be alien to people who know Linux.
> 
> (R2) those optimizations *must* be easy to do and review; at least as
> easy as these things go.
> 
> The two are obviously related.  Ease of review is why it is important to
> make things familiar to people who know Linux.
> 
> In C11, relaxing SC loads and stores is complicated, and more
> specifically hard to explain!

I can't see why that would be harder than reasoning about equally weaker
Java semantics.  But you obviously know your community, and I don't :)

> I cannot do that myself, and much less
> explain that to the community.  I cannot make them do that.
> Unfortunately, relaxing SC loads and stores is important on POWER which
> has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
> fails both requirements. :(
> 
> By contrast, Java volatile semantics are easily converted to a sequence
> of relaxed loads, relaxed stores, and acq/rel/sc fences.

The same holds for C11/C++11.  If you look at either the standard or the
Batty model, you'll see that for every pair like store(rel)--load(acq),
there is also store(rel)--fence(acq)+load(relaxed),
store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
giving the same semantics.  Likewise for SC.

> It's almost an
> algorithm; I tried to do that myself and succeeded, I could document it
> nicely.  Even better, there are authoritative sources that confirm my
> writing and should be accessible to people who did synchronization
> "stuff" in Linux (no formal models :)).  In this respect, Java satisfies
> both requirements.
> 
> And the loss is limited, since things such as Dekker's algorithm are
> rare in practice.  (In particular, RCU can be implemented just fine with
> Java volatile semantics, but load-acquire/store-release is not enough).

You can also build Dekker with SC stores and acq loads, if I'm not
mistaken.  Typically one would probably use SC fences and relaxed
stores/loads.
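
Something like this sketch of the store-buffering pattern (made-up names;
this is the ordering that acquire/release alone cannot give you):

    /* Thread 0 announces itself, then checks for thread 1.  */
    __atomic_store_n(&flag[0], 1, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* orders the store above
                                                  against the load below */
    if (__atomic_load_n(&flag[1], __ATOMIC_RELAXED) == 0) {
        /* thread 1 has not announced itself yet */
    }

    /* Thread 1 does the same with the indices swapped.  With the two SC
     * fences, at least one of the two threads is guaranteed to see the
     * other's flag set.  */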

> 
> [Nothing really important after this point, I think].
> 
> > Note that there is a reason why C11/C++11 don't just have barriers
> > combined with ordinary memory accesses: The compiler needs to be aware
> > which accesses are sequential code (so it can assume that they are
> > data-race-free) and which are potentially concurrent with other accesses
> > to the same data.  [...]
> > you can try to make this very likely be correct by careful
> > placement of asm compiler barriers, but this is likely to be more
> > difficult than just using atomics, which will do the right thing.
> 
> Note that asm is just for older compilers (and even then I try to use
> GCC intrinsics as much as possible).
> 
> On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
> thread fences) to properly annotate references.  rth also suggested that
> I use load/store(relaxed) instead of C volatile.

I agree with rth's suggestion.

> > Maybe the issue that you see with C11/C++11 is that it offers more than
> > you actually need.  If you can summarize what kind of synchronization /
> > concurrent code you are primarily looking at, I can try to help outline
> > a subset of it (i.e., something like code style but just for
> > synchronization).
> 
> Is the above explanation clearer?

Yes, thanks.

> 
> >> I obviously trust Cambridge for
> >> C11/C++11, but their material is very concise or just refers to the
> >> formal model.
> > 
> > Yes, their publications are really about the model.  It's not a
> > tutorial, but useful for reference.  BTW, have you read their C++ paper
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> > or the POPL paper?  The former has more detail (no page limit).
> 
> I know it, but I cannot say I tried hard to understand it.
> 
> > If you haven't yet, I suggest giving their cppmem tool a try too:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> I saw and tried the similar tool for POWER.  The problem with these
> tools, is that they require you to abstract your program into an input
> to the tool.  It works when _writing_ the code, but not when reviewing it.

I agree that it isn't a model checker for existing programs (but that
would likely be quite slow :)).  But it can help people to learn the
idioms.

> > I guess so.  But you also have to consider the legacy that you create.
> > I do think the C11/C++11 model will be used widely, and more and more
> > people will get used to it.
> 
> I don't think many people will learn how to use the various non-seqcst
> modes...  At least so far I punted. :)

But you already use similarly weaker orderings that the other
abstractions provide (e.g., Java), so you're half-way there :)

Torvald
Paolo Bonzini June 20, 2013, 7:53 a.m. UTC | #7
On 19/06/2013 22:25, Torvald Riegel wrote:
> On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
>> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
>> but RMW shouldn't be that bad.  I don't care if reference counting is a
>> little slower than it could be, for example);
> 
> I doubt relaxed RMW ops are sufficient even for reference counting.

They are enough on the increment side, or so says boost...

http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters
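
The increment side would then be just something like this sketch
('obj->refcount' is a made-up field name):

    /* Increment side only: boost says a relaxed RMW is enough here,
     * because a thread can only increment a counter it already holds a
     * reference to.  */
    __atomic_fetch_add(&obj->refcount, 1, __ATOMIC_RELAXED);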

>>    [An aside: Java guarantees that volatile stores are not reordered
>>    with volatile loads.  This is not guaranteed by just using release
>>    stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].
>
> Or maybe Java volatile is acq for loads and seq_cst for stores...

Perhaps (but I'm not 100% sure).

>> As long as you only have a producer and a consumer, C11 is fine, because
>> all you need is load-acquire/store-release.  In fact, if it weren't for
>> the experience factor, C11 is easier than manually placing acquire and
>> release barriers.  But as soon as two or more threads are reading _and_
>> writing the shared memory, it gets complicated and I want to provide
>> something simple that people can use.  This is the reason for (2) above.
> 
> I can't quite follow you here.  There is a total order for all
> modifications to a single variable, and if you use acq/rel combined with
> loads and stores on this variable, then you basically can make use of
> the total order.  (All loads that read-from a certain store get a
> synchronized-with (and thus happens-before edge) with the store, and the
> stores are in a total order.)  This is independent of the number of
> readers and writers.  The difference starts once you want to sync with
> more than one variable, and need to establish an order between those
> accesses.

You're right of course.  More specifically, the complication arises when
there is a thread where some variables are stored while others are loaded.

>> There will still be a few cases that need to be optimized, and here are
>> where the difficult requirements come:
>>
>> (R1) the primitives *should* not be alien to people who know Linux.
>>
>> (R2) those optimizations *must* be easy to do and review; at least as
>> easy as these things go.
>>
>> The two are obviously related.  Ease of review is why it is important to
>> make things familiar to people who know Linux.
>>
>> In C11, relaxing SC loads and stores is complicated, and more
>> specifically hard to explain!
> 
> I can't see why that would be harder than reasoning about equally weaker
> Java semantics.  But you obviously know your community, and I don't :)

Because Java semantics are "almost" SC, and as Paul mentioned the
difference doesn't matter in practice (IRIW/RWC is where it matters, WRC
works even on Power; see
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/ppc051.html#toc5, row
WRC+lwsyncs).  It hasn't ever mattered for Linux, at least.

>> By contrast, Java volatile semantics are easily converted to a sequence
>> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> 
> The same holds for C11/C++11.  If you look at either the standard or the
> Batty model, you'll see that for every pair like store(rel)--load(acq),
> there is also store(rel)--fence(acq)+load(relaxed),
> store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> giving the same semantics.  Likewise for SC.

Do you have a pointer to that?  It would help.

> You can also build Dekker with SC stores and acq loads, if I'm not
> mistaken.  Typically one would probably use SC fences and relaxed
> stores/loads.

Yes.

>>> I guess so.  But you also have to consider the legacy that you create.
> >>> I do think the C11/C++11 model will be used widely, and more and more
> >>> people will get used to it.
>>
>> I don't think many people will learn how to use the various non-seqcst
>> modes...  At least so far I punted. :)
> 
> But you already use similarly weaker orderings that the other
> abstractions provide (e.g., Java), so you're half-way there :)

True.  On the other hand you can treat Java like "kinda SC but don't
worry, you won't see the difference".  It is both worrisome and appealing...

Paolo
Torvald Riegel June 22, 2013, 10:55 a.m. UTC | #8
On Thu, 2013-06-20 at 09:53 +0200, Paolo Bonzini wrote:
> On 19/06/2013 22:25, Torvald Riegel wrote:
> > On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> >> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> >> but RMW shouldn't be that bad.  I don't care if reference counting is a
> >> little slower than it could be, for example);
> > 
> > I doubt relaxed RMW ops are sufficient even for reference counting.
> 
> They are enough on the increment side, or so says boost...
> 
> http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters

Oh, right, for this kind of refcounting it's okay on the increment side.
But the explanation on the page you referenced isn't correct I think:
"...passing an existing reference from one thread to another must
already provide any required synchronization." is not sufficient because
that would just create a happens-before from the reference-passing
source to the other thread that gets the reference.
The relaxed RMW increment works because of the modification order being
consistent with happens-before (see 6.17 in the model), so we can never
reach a value of zero for the refcount once we incremented the reference
even with a relaxed RMW.

IMO, the acquire fence in the release is not 100% correct according to
my understanding of the memory model:
    if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
      boost::atomic_thread_fence(boost::memory_order_acquire);
      delete x;
    }
"delete x" is unconditional, and I guess not specified to read all of
what x points to.  The acquire fence would only result in a
synchronizes-with edge if there is a reads-from edge between the release
store and a load that reads the store's value and is sequenced before the
acquire fence.
Thus, I think the compiler could be allowed to reorder the fence after
the delete in some cases (e.g., when there's no destructor called or it
doesn't have any conditionals in it), but I guess it's likely to not
ever try to do that in practice.
Regarding the hardware fences that this maps to, I suppose this just
happens to work fine on most architectures, perhaps just because
"delete" will access some of the memory when releasing the memory.

Changing the release to the following would be correct, and would probably add
little additional overhead:
    if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) { 
      if (x->refcount.load(boost::memory_order_acquire) == 0)
        delete x;
    }

That makes the delete conditional and thus forces it to happen after we
have ensured the happens-before edge that we need.

> >> By contrast, Java volatile semantics are easily converted to a sequence
> >> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> > 
> > The same holds for C11/C++11.  If you look at either the standard or the
> > Batty model, you'll see that for every pair like store(rel)--load(acq),
> > there is also store(rel)--fence(acq)+load(relaxed),
> > store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> > giving the same semantics.  Likewise for SC.
> 
> Do you have a pointer to that?  It would help.

In the full model (n3132.pdf), see 6.12 (which then references which
parts in the standard lead to those parts of the model).  SC fences are
also acquire and release fences, so this covers synchronizes-with via
reads-from too.  6.17 has more constraints on SC fences and modification
order, so we get something similar for the ordering of just writes.


Torvald

Patch

diff --git a/docs/atomics.txt b/docs/atomics.txt
new file mode 100644
index 0000000..e2ce04b
--- /dev/null
+++ b/docs/atomics.txt
@@ -0,0 +1,345 @@ 
+CPUs perform independent memory operations effectively in random order,
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest).  Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking.  Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization.  Anything else is considerably harder, but it's
+also justified more often than one would like.  The two tools that
+are provided by qemu/atomic.h are memory barriers and atomic operations.
+
+Macros defined by qemu/atomic.h fall in three camps:
+
+- compiler barriers: barrier();
+
+- weak atomic access and manual memory barriers: atomic_read(),
+  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
+
+- sequentially consistent atomic access: everything else.
+
+
+COMPILER MEMORY BARRIER
+=======================
+
+barrier() prevents the compiler from moving the memory accesses on either
+side of it to the other side.  The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+barrier() is mostly used within qemu/atomic.h itself.  On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution.  In this
+case, qemu/atomic.h will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, barrier() can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+SEQUENTIALLY CONSISTENT ATOMIC ACCESS
+=====================================
+
+Most of the operations in the qemu/atomic.h header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+qemu/atomic.h provides the following set of atomic read-modify-write
+operations:
+
+    typeof(*ptr) atomic_inc(ptr)
+    typeof(*ptr) atomic_dec(ptr)
+    typeof(*ptr) atomic_add(ptr, val)
+    typeof(*ptr) atomic_sub(ptr, val)
+    typeof(*ptr) atomic_and(ptr, val)
+    typeof(*ptr) atomic_or(ptr, val)
+    typeof(*ptr) atomic_xchg(ptr, val)
+    typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of *ptr.  These operations are
+polymorphic; they operate on any type that is as wide as an int.
+
+Sequentially consistent loads and stores can be done using:
+
+    atomic_add(ptr, 0) for loads
+    atomic_xchg(ptr, val) for stores
+
+However, they are quite expensive on some platforms, notably POWER and
+ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
+weaker constraints:
+
+    typeof(*ptr) atomic_mb_read(ptr)
+    void         atomic_mb_set(ptr, val)
+
+The semantics of these primitives map to Java volatile variables,
+and are strongly related to memory barriers as used in the Linux
+kernel (see below).
+
+As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
+be reordered with each other, and it is also not possible to reorder
+"normal" accesses around them.
+
+However, and this is the important difference between
+atomic_mb_read/atomic_mb_set and sequential consistency, it is important
+for both threads to access the same volatile variable.  It is not the
+case that everything visible to thread A when it writes volatile field f
+becomes visible to thread B after it reads volatile field g. The store
+and load have to "match" (i.e., be performed on the same volatile
+field) to achieve the right semantics.
+
+
+These operations operate on any type that is as wide as an int or smaller.
+
+
+WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+If the algorithm you are writing includes both writes and reads on
+the same side, it is generally simpler to use sequentially consistent
+primitives.
+
+When using this model, variables are accessed with atomic_read() and
+atomic_set(), and restrictions on the ordering of accesses are enforced
+using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
+memory barriers.
+
+atomic_read() and atomic_set() prevent the compiler from using
+optimizations that might otherwise optimize accesses out of existence
+on the one hand, or that might create unsolicited accesses on the other.
+In general this should not have any effect, because the same compiler
+barriers are already implied by memory barriers.  However, it is useful
+to do so, because it tells readers which variables are shared with
+other threads, and which are local to the current thread or protected
+by other, more mundane means.
+
+Memory barriers control the order of references to shared memory.
+They come in four kinds:
+
+- smp_rmb() guarantees that all the LOAD operations specified before
+  the barrier will appear to happen before all the LOAD operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_rmb() puts a partial ordering on loads, but is not
+  required to have any effect on stores.
+
+- smp_wmb() guarantees that all the STORE operations specified before
+  the barrier will appear to happen before all the STORE operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_wmb() puts a partial ordering on stores, but is not
+  required to have any effect on loads.
+
+- smp_mb() guarantees that all the LOAD and STORE operations specified
+  before the barrier will appear to happen before all the LOAD and
+  STORE operations specified after the barrier with respect to the other
+  components of the system.
+
+  smp_mb() puts a partial ordering on both loads and stores.  It is
+  stronger than both a read and a write memory barrier; it implies both
+  smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
+  barrier from overtaking LOADs coming after the barrier and vice versa.
+
+- smp_read_barrier_depends() is a weaker kind of read barrier.  On
+  most processors, whenever two loads are performed such that the
+  second depends on the result of the first (e.g., the first load
+  retrieves the address to which the second load will be directed),
+  the processor will guarantee that the first LOAD will appear to happen
+  before the second with respect to the other components of the system.
+  However, this is not always true---for example, it was not true on
+  Alpha processors.  Whenever this kind of access happens to shared
+  memory (that is not protected by a lock), a read barrier is needed,
+  and smp_read_barrier_depends() can be used instead of smp_rmb().
+
+  Note that the first load really has to have a _data_ dependency and not
+  a control dependency.  If the address for the second load is dependent
+  on the first load, but the dependency is through a conditional rather
+  than actually loading the address itself, then it's a _control_
+  dependency and a full read barrier or better is required.
+
+
+This is the set of barriers that is required *between* two atomic_read()
+and atomic_set() operations to achieve sequential consistency:
+
+                    |               2nd operation             |
+                    |-----------------------------------------|
+     1st operation  | (after last) | atomic_read | atomic_set |
+     ---------------+--------------+-------------+------------|
+     (before first) |              | none        | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
+     ---------------+--------------+-------------+------------|
+     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+
+       * Or smp_read_barrier_depends().
+
+      ** This requires a load-store barrier.  How to achieve this varies
+         depending on the machine, but in practice smp_rmb()+smp_wmb()
+         should have the desired effect.  For example, on PowerPC the
+         lwsync instruction is a combined load-load, load-store and
+         store-store barrier.
+
+     *** This requires a store-load barrier.  On most machines, the only
+         way to achieve this is a full barrier.
+
+
+You can see that the two possible definitions of atomic_mb_read()
+and atomic_mb_set() are the following:
+
+    1) atomic_mb_read(p)   = atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
+
+    2) atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
+
+Usually the former is used, because smp_mb() is expensive and a program
+normally has more reads than writes.  Therefore it makes more sense to
+make atomic_mb_set() the more expensive operation.
+
+There are two common cases in which atomic_mb_read and atomic_mb_set
+generate too many memory barriers, and thus it can be useful to manually
+place barriers instead:
+
+- when a data structure has one thread that is always a writer
+  and one thread that is always a reader, manual placement of
+  memory barriers makes the write side faster.  Furthermore,
+  correctness is easy to check for in this case using the "pairing"
+  trick that is explained below:
+
+     thread 1                                thread 1
+     -------------------------               ------------------------
+     (other writes)
+                                             smp_wmb()
+     atomic_mb_set(&a, x)                    atomic_set(&a, x)
+                                             smp_wmb()
+     atomic_mb_set(&b, y)                    atomic_set(&b, y)
+
+                                       =>
+     thread 2                                thread 2
+     -------------------------               ------------------------
+     y = atomic_mb_read(&b)                  y = atomic_read(&b)
+                                             smp_rmb()
+     x = atomic_mb_read(&a)                  x = atomic_read(&a)
+                                             smp_rmb()
+
+- sometimes, a thread is accessing many variables that are otherwise
+  unrelated to each other (for example because, apart from the current
+  thread, exactly one other thread will read or write each of these
+  variables).  In this case, it is possible to "hoist" the implicit
+  barriers provided by atomic_mb_read() and atomic_mb_set() outside
+  a loop.  For example, the above definition of atomic_mb_read() gives
+  the following transformation:
+
+     n = 0;                                  n = 0;
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
+                                             smp_rmb();
+
+  Similarly, atomic_mb_set() can be transformed, hoisting the smp_wmb()
+  before the loop and the smp_mb() after it:
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+                                             smp_mb();
+
+
+The two tricks can be combined.  In this case, splitting a loop in
+two lets you hoist the barriers out of the loops _and_ eliminate the
+expensive smp_mb():
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+       atomic_mb_set(&b[i], false);          smp_wmb();
+     }                                       for (i = 0; i < 10; i++)
+                                               atomic_set(&b[i], false);
+                                             smp_mb();
+
+  The other thread can still use atomic_mb_read()/atomic_mb_set().
+
+
+Memory barrier pairing
+----------------------
+
+A useful rule of thumb is that memory barriers should always, or almost
+always, be paired with another barrier.  In the case of QEMU, however,
+note that the other barrier may actually be in a driver that runs in
+the guest!
+
+For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
+both count as read barriers.  A read barrier shall pair with a write
+barrier or a full barrier; a write barrier shall pair with a read
+barrier or a full barrier.  A full barrier can pair with anything.
+For example:
+
+        thread 1             thread 2
+        ===============      ===============
+        a = 1;
+        smp_wmb();
+        b = 2;               x = b;
+                             smp_rmb();
+                             y = a;
+
+Note that the "writing" thread is accessing the variables in the
+opposite order from the "reading" thread.  This is expected: stores
+before the write barrier will normally match the loads after the
+read barrier, and vice versa.  The same is true for more than two
+accesses and for data dependency barriers:
+
+        thread 1             thread 2
+        ===============      ===============
+        b[2] = 1;
+        smp_wmb();
+        x->i = 2;
+        smp_wmb();
+        a = x;               x = a;
+                             smp_read_barrier_depends();
+                             y = x->i;
+                             smp_read_barrier_depends();
+                             z = b[y];
+
+smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
+with atomic_mb_set().
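+
+For example, a writer that publishes data with an explicit smp_wmb() can
+pair with a reader that uses atomic_mb_read(), whose implied read barrier
+completes the pairing (a sketch, reusing the variables above):
+
+        thread 1             thread 2
+        ===============      ===============
+        a = 1;
+        smp_wmb();
+        atomic_set(&b, 2);
+                             y = atomic_mb_read(&b);
+                             x = a;
+
+Symmetrically, smp_rmb() on the reading side pairs with atomic_mb_set()
+on the writing side.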
+
+
+COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
+============================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are usually on a 32-bit int type and
+  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
+  and use normal C types.
+
+- atomic_read and atomic_set in Linux give no guarantee at all;
+  atomic_read and atomic_set in QEMU include a compiler barrier
+  (similar to the ACCESS_ONCE macro in Linux).
+
+- most atomic read-modify-write operations in Linux return void;
+  in QEMU, all of them return the old value of the variable.
+
+- different atomic read-modify-write operations in Linux imply
+  a different set of memory barriers; in QEMU, all of them enforce
+  sequential consistency, which means they imply full memory barriers
+  before and after the operation.
+
+- Linux does not have an equivalent of atomic_mb_read() and
+  atomic_mb_set().  In particular, note that set_mb() is a little
+  weaker than atomic_mb_set().
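+
+For example (a sketch with made-up variable names), a counter update that
+Linux code would write as
+
+    atomic_t counter = ATOMIC_INIT(0);
+    atomic_inc(&counter);             /* returns void, no barrier implied */
+
+becomes in QEMU
+
+    int counter = 0;
+    int old = atomic_inc(&counter);   /* returns 0, implies full barriers */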
+
+
+SOURCES
+=======
+
+* Documentation/memory-barriers.txt from the Linux kernel
+
+* "The JSR-133 Cookbook for Compiler Writers", available at
+  http://g.oswego.edu/dl/jmm/cookbook.html
diff --git a/hw/display/qxl.c b/hw/display/qxl.c
index 0c852de..a85e178 100644
--- a/hw/display/qxl.c
+++ b/hw/display/qxl.c
@@ -23,6 +23,7 @@ 
 #include "qemu-common.h"
 #include "qemu/timer.h"
 #include "qemu/queue.h"
+#include "qemu/atomic.h"
 #include "monitor/monitor.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
@@ -1725,7 +1726,7 @@  static void qxl_send_events(PCIQXLDevice *d, uint32_t events)
         trace_qxl_send_events_vm_stopped(d->id, events);
         return;
     }
-    old_pending = __sync_fetch_and_or(&d->ram->int_pending, le_events);
+    old_pending = atomic_or(&d->ram->int_pending, le_events);
     if ((old_pending & le_events) == le_events) {
         return;
     }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 96ab625..8f6ab13 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -16,6 +16,7 @@ 
 #include <sys/ioctl.h>
 #include "hw/virtio/vhost.h"
 #include "hw/hw.h"
+#include "qemu/atomic.h"
 #include "qemu/range.h"
 #include <linux/vhost.h>
 #include "exec/address-spaces.h"
@@ -47,11 +48,9 @@  static void vhost_dev_sync_region(struct vhost_dev *dev,
             addr += VHOST_LOG_CHUNK;
             continue;
         }
-        /* Data must be read atomically. We don't really
-         * need the barrier semantics of __sync
-         * builtins, but it's easier to use them than
-         * roll our own. */
-        log = __sync_fetch_and_and(from, 0);
+        /* Data must be read atomically. We don't really need barrier semantics
+         * but it's easier to use atomic_* than roll our own. */
+        log = atomic_xchg(from, 0);
         while ((bit = sizeof(log) > sizeof(int) ?
                 ffsll(log) : ffs(log))) {
             hwaddr page_addr;
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 10becb6..04d64d0 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -1,68 +1,194 @@ 
-#ifndef __QEMU_BARRIER_H
-#define __QEMU_BARRIER_H 1
+/*
+ * Simple interface for atomic operations.
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
 
-/* Compiler barrier */
-#define barrier()   asm volatile("" ::: "memory")
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
 
-#if defined(__i386__)
+#include "qemu/compiler.h"
 
-#include "qemu/compiler.h"        /* QEMU_GNUC_PREREQ */
+/* For C11 atomic ops */
 
-/*
- * Because of the strongly ordered x86 storage model, wmb() and rmb() are nops
- * on x86(well, a compiler barrier only).  Well, at least as long as
- * qemu doesn't do accesses to write-combining memory or non-temporal
- * load/stores from C code.
- */
-#define smp_wmb()   barrier()
-#define smp_rmb()   barrier()
+/* Compiler barrier */
+#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
+
+#ifndef __ATOMIC_RELAXED
 
 /*
- * We use GCC builtin if it's available, as that can use
- * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
- * However, on i386, there seem to be known bugs as recently as 4.3.
- * */
-#if QEMU_GNUC_PREREQ(4, 4)
-#define smp_mb() __sync_synchronize()
+ * We use GCC builtin if it's available, as that can use mfence on
+ * 32-bit as well, e.g. if built with -march=pentium-m. However, on
+ * i386 the spec is buggy, and the implementation followed it until
+ * 4.3 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793).
+ */
+#if defined(__i386__) || defined(__x86_64__)
+#if !QEMU_GNUC_PREREQ(4, 4)
+#if defined __x86_64__
+#define smp_mb()    ({ asm volatile("mfence" ::: "memory"); (void)0; })
 #else
-#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#define smp_mb()    ({ asm volatile("lock; addl $0,0(%%esp) " ::: "memory"); (void)0; })
+#endif
+#endif
 #endif
 
-#elif defined(__x86_64__)
 
+#ifdef __alpha__
+#define smp_read_barrier_depends()   asm volatile("mb":::"memory")
+#endif
+
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
+
+/*
+ * Because of the strongly ordered storage model, wmb() and rmb() are nops
+ * here (a compiler barrier only).  QEMU doesn't do accesses to write-combining
+ * memory or non-temporal load/stores from C code.
+ */
 #define smp_wmb()   barrier()
 #define smp_rmb()   barrier()
-#define smp_mb() asm volatile("mfence" ::: "memory")
+
+/*
+ * __sync_lock_test_and_set() is documented to be an acquire barrier only,
+ * but it is a full barrier at the hardware level.  Add a compiler barrier
+ * to make it a full barrier also at the compiler level.
+ */
+#define atomic_xchg(ptr, i)    (barrier(), __sync_lock_test_and_set(ptr, i))
+
+/*
+ * Load/store with Java volatile semantics.
+ */
+#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))
 
 #elif defined(_ARCH_PPC)
 
 /*
  * We use an eieio() for wmb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
- * each other
+ * each other.
+ *
+ * smp_mb has the same problem as on x86 for not-very-new GCC
+ * (http://patchwork.ozlabs.org/patch/126184/, Nov 2011).
  */
-#define smp_wmb()   asm volatile("eieio" ::: "memory")
-
+#define smp_wmb()   ({ asm volatile("eieio" ::: "memory"); (void)0; })
 #if defined(__powerpc64__)
-#define smp_rmb()   asm volatile("lwsync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
 #else
-#define smp_rmb()   asm volatile("sync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
 #endif
+#define smp_mb()    ({ asm volatile("sync" ::: "memory"); (void)0; })
 
-#define smp_mb()   asm volatile("sync" ::: "memory")
+#endif /* _ARCH_PPC */
 
-#else
+#endif /* C11 atomics */
 
 /*
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill for wmb() and rmb().
+ * be overkill for smp_wmb() and smp_rmb().
  */
+#ifndef smp_mb
+#define smp_mb()    __sync_synchronize()
+#endif
+
+#ifndef smp_wmb
+#ifdef __ATOMIC_RELEASE
+#define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
+#else
 #define smp_wmb()   __sync_synchronize()
-#define smp_mb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_rmb
+#ifdef __ATOMIC_ACQUIRE
+#define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
+#else
 #define smp_rmb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_read_barrier_depends
+#ifdef __ATOMIC_CONSUME
+#define smp_read_barrier_depends()   __atomic_thread_fence(__ATOMIC_CONSUME)
+#else
+#define smp_read_barrier_depends()   barrier()
+#endif
+#endif
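+
+/* atomic_read() and atomic_set() access the variable exactly once and
+ * include a compiler barrier, similar to Linux's ACCESS_ONCE macro; they
+ * imply no processor memory barrier (see docs/atomics.txt).
+ */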
 
+#ifndef atomic_read
+#define atomic_read(ptr)       (*(volatile __typeof__(*ptr) *) (ptr))
 #endif
 
+#ifndef atomic_set
+#define atomic_set(ptr, i)     ((*(volatile __typeof__(*ptr) *) (ptr)) = (i))
+#endif
+
+/* These have the same semantics as Java volatile variables.
+ * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
+ * "1. Issue a StoreStore barrier (wmb) before each volatile store.
+ *  2. Issue a StoreLoad barrier after each volatile store.
+ *     Note that you could instead issue one before each volatile load, but
+ *     this would be slower for typical programs using volatiles in which
+ *     reads greatly outnumber writes. Alternatively, if available, you
+ *     can implement volatile store as an atomic instruction (for example
+ *     XCHG on x86) and omit the barrier. This may be more efficient if
+ *     atomic instructions are cheaper than StoreLoad barriers.
+ *  3. Issue LoadLoad and LoadStore barriers after each volatile load."
+ *
+ * If you prefer to think in terms of "pairing" of memory barriers,
+ * an atomic_mb_read pairs with an atomic_mb_set.
+ *
+ * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
+ * while an atomic_mb_set is a st.rel followed by a memory barrier.
+ *
+ * These are a bit weaker than __atomic_load/store with __ATOMIC_SEQ_CST
+ * (see docs/atomics.txt), and I'm not sure that __ATOMIC_ACQ_REL is enough.
+ * Just always use the barriers manually by the rules above.
+ */
+#ifndef atomic_mb_read
+#define atomic_mb_read(ptr)    ({           \
+    typeof(*ptr) _val = atomic_read(ptr);   \
+    smp_rmb();                              \
+    _val;                                   \
+})
+#endif
+
+#ifndef atomic_mb_set
+#define atomic_mb_set(ptr, i)  do {         \
+    smp_wmb();                              \
+    atomic_set(ptr, i);                     \
+    smp_mb();                               \
+} while (0)
+#endif
+
+#ifndef atomic_xchg
+#ifdef __ATOMIC_SEQ_CST
+#define atomic_xchg(ptr, i)    ({                           \
+    typeof(*ptr) _new = (i), _old;                          \
+    __atomic_exchange(ptr, &_new, &_old, __ATOMIC_SEQ_CST); \
+    _old;                                                   \
+})
+#elif defined __clang__
+#define atomic_xchg(ptr, i)    __sync_swap(ptr, i)
+#else
+/* __sync_lock_test_and_set() is documented to be an acquire barrier only.  */
+#define atomic_xchg(ptr, i)    (smp_mb(), __sync_lock_test_and_set(ptr, i))
+#endif
+#endif
+
+/* Provide shorter names for GCC atomic builtins.  These all return the old
+ * value of the variable and imply full memory barriers.  */
+#define atomic_inc(ptr)        __sync_fetch_and_add(ptr, 1)
+#define atomic_dec(ptr)        __sync_fetch_and_add(ptr, -1)
+#define atomic_add             __sync_fetch_and_add
+#define atomic_sub             __sync_fetch_and_sub
+#define atomic_and             __sync_fetch_and_and
+#define atomic_or              __sync_fetch_and_or
+#define atomic_cmpxchg         __sync_val_compare_and_swap
+
 #endif
diff --git a/migration.c b/migration.c
index 058f9e6..83f5691 100644
--- a/migration.c
+++ b/migration.c
@@ -290,8 +290,7 @@  static void migrate_fd_cleanup(void *opaque)
 
 static void migrate_finish_set_state(MigrationState *s, int new_state)
 {
-    if (__sync_val_compare_and_swap(&s->state, MIG_STATE_ACTIVE,
-                                    new_state) == new_state) {
+    if (atomic_cmpxchg(&s->state, MIG_STATE_ACTIVE, new_state) == new_state) {
         trace_migrate_set_state(new_state);
     }
 }
diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
index 22915aa..dbf2fc8 100644
--- a/tests/test-thread-pool.c
+++ b/tests/test-thread-pool.c
@@ -17,15 +17,15 @@  typedef struct {
 static int worker_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    return __sync_fetch_and_add(&data->n, 1);
+    return atomic_inc(&data->n);
 }
 
 static int long_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     g_usleep(2000000);
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     return 0;
 }
 
@@ -169,7 +169,7 @@  static void test_cancel(void)
     /* Cancel the jobs that haven't been started yet.  */
     num_canceled = 0;
     for (i = 0; i < 100; i++) {
-        if (__sync_val_compare_and_swap(&data[i].n, 0, 3) == 0) {
+        if (atomic_cmpxchg(&data[i].n, 0, 3) == 0) {
             data[i].ret = -ECANCELED;
             bdrv_aio_cancel(data[i].aiocb);
             active--;