Patchwork Add a memory barrier to guest memory access functions

login
register
mail settings
Submitter Benjamin Herrenschmidt
Date May 21, 2012, 1:56 a.m.
Message ID <1337565405.2458.12.camel@pasglop>
Download mbox | patch
Permalink /patch/160305/
State New
Headers show

Comments

Benjamin Herrenschmidt - May 21, 2012, 1:56 a.m.
The emulated devices can run simultaneously with the guest, so
we need to be careful with ordering of load and stores done by
them to the guest system memory, which need to be observed in
the right order by the guest operating system.

This adds barriers to some standard guest memory access functions
along with a comment explaining the semantics to exec.c

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

So after the discussion with Paolo, I removed the specific accessors,
just used a normal smp_mb() in only two places, cpu_physical_memory_rw
and cpu_physical_memory_map.

I don't see an obvious need to provide a "relaxed" variant of the
later at this stage, a quick grep doesn't seem to show that most cases
where it's used are either not performance sensitive or the barrier
makes sense, but feel free to prove me wrong :-)

If we really want that, my suggestion would be to change the "is_write"
flag into a proper bitmask of direction and relaxed attribute (which we
can use for more attributes in the future if needed). 

Also, we probably want an smp_mb() when shooting MSIs (not LSIs, those
are not ordered, that's why the guest driver needs to do an MMIO read
after an LSI, but MSIs are). I haven't looked at that yet, we can do
it from a separate patch if needed.

 exec.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)
Paolo Bonzini - May 21, 2012, 8:11 a.m.
Il 21/05/2012 03:56, Benjamin Herrenschmidt ha scritto:
> I don't see an obvious need to provide a "relaxed" variant of the
> later at this stage, a quick grep doesn't seem to show that most cases
> where it's used are either not performance sensitive or the barrier
> makes sense, but feel free to prove me wrong :-)

The only problem here is that you have useless memory barriers when
calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).

Paolo
Michael S. Tsirkin - May 21, 2012, 8:31 a.m.
On Mon, May 21, 2012 at 10:11:06AM +0200, Paolo Bonzini wrote:
> Il 21/05/2012 03:56, Benjamin Herrenschmidt ha scritto:
> > I don't see an obvious need to provide a "relaxed" variant of the
> > later at this stage, a quick grep doesn't seem to show that most cases
> > where it's used are either not performance sensitive or the barrier
> > makes sense, but feel free to prove me wrong :-)
> 
> The only problem here is that you have useless memory barriers when
> calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).
> 
> Paolo

More than that. smp_mb is pretty expensive. You
often can do just smp_wmb and smp_rmb and that is
very cheap.
Many operations run in the vcpu context
or start when guest exits to host and work
is bounced from there and thus no barrier is needed
here.

Example? start_xmit in e1000. Executed in vcpu context
so no barrier is needed.

virtio of course is another example since it does its own
barriers. But even without that, virtio_blk_handle_output
runs in vcpu context.

But more importantly, this hack just sweeps the
dirt under the carpet. Understanding the interaction
with guest drivers is important anyway. So
I really don't see why don't we audit devices
and add proper barriers.
Benjamin Herrenschmidt - May 21, 2012, 8:58 a.m.
On Mon, 2012-05-21 at 11:31 +0300, Michael S. Tsirkin wrote:
> On Mon, May 21, 2012 at 10:11:06AM +0200, Paolo Bonzini wrote:
> > Il 21/05/2012 03:56, Benjamin Herrenschmidt ha scritto:
> > > I don't see an obvious need to provide a "relaxed" variant of the
> > > later at this stage, a quick grep doesn't seem to show that most cases
> > > where it's used are either not performance sensitive or the barrier
> > > makes sense, but feel free to prove me wrong :-)
> > 
> > The only problem here is that you have useless memory barriers when
> > calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).
> > 
> > Paolo
> 
> More than that. smp_mb is pretty expensive. You
> often can do just smp_wmb and smp_rmb and that is
> very cheap.

Except that you mostly don't know at that level what you can or cannot
do, it depends on the caller. We should have the standard accessors do
it the "safe" way and have performance sensitive stuff do map/unmap, at
least that's the result of the discussions with Anthony.

If we can address the virtqueue_map_sg problem, I think we should be
good, I'll look at it tomorrow. Maybe the right way for now is to remove
the barrier I added to "map" and only leave the one in _rw

> Many operations run in the vcpu context
> or start when guest exits to host and work
> is bounced from there and thus no barrier is needed
> here.

But we don't always know it unless we start sprinkling the drivers.

> Example? start_xmit in e1000. Executed in vcpu context
> so no barrier is needed.

If it's performance sensitive, it can always use map/unmap and
hand-tuned barriers. I suspect the cost of the barrier is drowned in the
cost of the exit tho, isn't it ?

Also while it doesn't need barriers for the read it does of the
descriptor it still needs one barrier for the write back to it.

> virtio of course is another example since it does its own
> barriers. But even without that, virtio_blk_handle_output
> runs in vcpu context.
> 
> But more importantly, this hack just sweeps the
> dirt under the carpet. Understanding the interaction
> with guest drivers is important anyway. So
> I really don't see why don't we audit devices
> and add proper barriers.

Because my experience is that you'll never get them all right and will
keep getting more breakage as devices are added. This has always been
the case kernel wise and I don't think qemu will get it any better.

That's why Linus always insisted that our basic MMIO accessors provide
full ordering vs. each other and vs. accesses to main memory (DMA)
limiting the extent of the barriers needed for drivers for example.

It's a small price to pay considered the risk and how nasty those bugs
are to debug. And performance critical drivers are few and can be hand
tuned.

Cheers,
Ben.
Benjamin Herrenschmidt - May 21, 2012, 9:07 a.m.
On Mon, 2012-05-21 at 18:58 +1000, Benjamin Herrenschmidt wrote:

> Except that you mostly don't know at that level what you can or cannot
> do, it depends on the caller. We should have the standard accessors do
> it the "safe" way and have performance sensitive stuff do map/unmap, at
> least that's the result of the discussions with Anthony.
> 
> If we can address the virtqueue_map_sg problem, I think we should be
> good, I'll look at it tomorrow. Maybe the right way for now is to remove
> the barrier I added to "map" and only leave the one in _rw

One thing that might alleviate some of your concerns would possibly be
to "remember" in a global (to be replaced with a thread var eventually)
the last transfer direction and use a simple test to chose the barrier,
ie, store + store -> wmb, load + load -> rmb, other -> mb.

But first I'd be curious if some x86 folks could actually measure the
impact of the patch as I proposed it. That would give us an idea of how
bad the performance problem is and how far we need to go to address it.

Cheers,
Ben.
Benjamin Herrenschmidt - May 21, 2012, 9:16 a.m.
On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:

> One thing that might alleviate some of your concerns would possibly be
> to "remember" in a global (to be replaced with a thread var eventually)
> the last transfer direction and use a simple test to chose the barrier,
> ie, store + store -> wmb, load + load -> rmb, other -> mb.
> 
> But first I'd be curious if some x86 folks could actually measure the
> impact of the patch as I proposed it. That would give us an idea of how
> bad the performance problem is and how far we need to go to address it.

Another option.... go back to something more like the original patch,
ie, put the barrier in the new dma_* accessors (and provide a
non-barrier one while at it) rather than the low level cpu_physical_*
accessor.

That makes it a lot easier for selected driver to be converted to avoid
the barrier in thing like code running in the vcpu context. It also
means that virtio doesn't get any added barriers which is what we want
as well.

IE. Have something along the lines (based on the accessors added by the
iommu series) (using __ kernel-style, feel free to provide a better
naming)

static inline int __dma_memory_rw( ... args ... )
{
    if (!dma_has_iommu(dma)) {
        /* Fast-path for no IOMMU */
        cpu_physical_memory_rw( ... args ...);
        return 0;
    } else {
        return iommu_dma_memory_rw( ... args ...);
    }
}

static inline int dma_memory_rw( ... args ... )
{
	smp_mb(); /* Or use finer grained as discussied earlier */

	return __dma_memory_rw( ... args ... )
}

And corresponding __dma_memory_read/__dma_memory_write (again, feel
free to suggest a more "qemu'ish" naming if you don't like __, it's
a kernel habit, not sure what you guys do in qemu land).

Cheers,
Ben.
Michael S. Tsirkin - May 21, 2012, 9:34 a.m.
On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> 
> > One thing that might alleviate some of your concerns would possibly be
> > to "remember" in a global (to be replaced with a thread var eventually)
> > the last transfer direction and use a simple test to chose the barrier,
> > ie, store + store -> wmb, load + load -> rmb, other -> mb.

But how do you know guest did a store?

> > 
> > But first I'd be curious if some x86 folks could actually measure the
> > impact of the patch as I proposed it. That would give us an idea of how
> > bad the performance problem is and how far we need to go to address it.
> 
> Another option.... go back to something more like the original patch,
> ie, put the barrier in the new dma_* accessors (and provide a
> non-barrier one while at it) rather than the low level cpu_physical_*
> accessor.
> 
> That makes it a lot easier for selected driver to be converted to avoid
> the barrier in thing like code running in the vcpu context. It also
> means that virtio doesn't get any added barriers which is what we want
> as well.
> 
> IE. Have something along the lines (based on the accessors added by the
> iommu series) (using __ kernel-style, feel free to provide a better
> naming)
> 
> static inline int __dma_memory_rw( ... args ... )
> {
>     if (!dma_has_iommu(dma)) {
>         /* Fast-path for no IOMMU */
>         cpu_physical_memory_rw( ... args ...);
>         return 0;
>     } else {
>         return iommu_dma_memory_rw( ... args ...);
>     }
> }
> 
> static inline int dma_memory_rw( ... args ... )
> {
> 	smp_mb(); /* Or use finer grained as discussied earlier */
> 
> 	return __dma_memory_rw( ... args ... )

Heh. But don't we need an mb afterwards too?

> }
> 
> And corresponding __dma_memory_read/__dma_memory_write (again, feel
> free to suggest a more "qemu'ish" naming if you don't like __, it's
> a kernel habit, not sure what you guys do in qemu land).
> 
> Cheers,
> Ben.

And my preference is to first convert everyone to __ variants and
carefully switch devices to the barrier version after a bit of
consideration.
Benjamin Herrenschmidt - May 21, 2012, 9:53 a.m.
On Mon, 2012-05-21 at 12:34 +0300, Michael S. Tsirkin wrote:
> On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> > 
> > > One thing that might alleviate some of your concerns would possibly be
> > > to "remember" in a global (to be replaced with a thread var eventually)
> > > the last transfer direction and use a simple test to chose the barrier,
> > > ie, store + store -> wmb, load + load -> rmb, other -> mb.
> 
> But how do you know guest did a store?

This isn't vs. guest access, but vs DMA access, ie we are ordering DMA
accesses vs. each other. The guest is still responsible to do it's own
side of the barriers as usual.

> > > But first I'd be curious if some x86 folks could actually measure the
> > > impact of the patch as I proposed it. That would give us an idea of how
> > > bad the performance problem is and how far we need to go to address it.
> > 
> > Another option.... go back to something more like the original patch,
> > ie, put the barrier in the new dma_* accessors (and provide a
> > non-barrier one while at it) rather than the low level cpu_physical_*
> > accessor.
> > 
> > That makes it a lot easier for selected driver to be converted to avoid
> > the barrier in thing like code running in the vcpu context. It also
> > means that virtio doesn't get any added barriers which is what we want
> > as well.
> > 
> > IE. Have something along the lines (based on the accessors added by the
> > iommu series) (using __ kernel-style, feel free to provide a better
> > naming)
> > 
> > static inline int __dma_memory_rw( ... args ... )
> > {
> >     if (!dma_has_iommu(dma)) {
> >         /* Fast-path for no IOMMU */
> >         cpu_physical_memory_rw( ... args ...);
> >         return 0;
> >     } else {
> >         return iommu_dma_memory_rw( ... args ...);
> >     }
> > }
> > 
> > static inline int dma_memory_rw( ... args ... )
> > {
> > 	smp_mb(); /* Or use finer grained as discussied earlier */
> > 
> > 	return __dma_memory_rw( ... args ... )
> 
> Heh. But don't we need an mb afterwards too?

Not really no, but we can discuss the fine point, I'm pretty sure
one-before is enough as long as we ensure MSIs are properly ordered.

> > }
> > 
> > And corresponding __dma_memory_read/__dma_memory_write (again, feel
> > free to suggest a more "qemu'ish" naming if you don't like __, it's
> > a kernel habit, not sure what you guys do in qemu land).
> > 
> > Cheers,
> > Ben.
> 
> And my preference is to first convert everyone to __ variants and
> carefully switch devices to the barrier version after a bit of
> consideration.

I very strongly disagree. This is exactly the wrong approach. In pretty
much -all- cases the ordered versions are going to be safer, since they
basically provide the similar ordering semantics to what a PCI bus would
provide.

IE. Just making the default accessors ordered means that all devices
written with the assumption that the guest will see accesses in the
order they are written in the emulated device will be correct, which
means pretty much all of them (well, almost).

 --> It actually fixes a real bug that affects almost all devices
     that do DMA today in qemu

Then, fine-tuning performance critical one by selectively removing
barriers allows to improve performance where it would be othewise
harmed.

So on that I will not compromise.

However, I think it might be better to leave the barrier in the dma
accessor since that's how you also get iommu transparency etc... so it's
not a bad place to put them, and leave the cpu_physical_* for use by
lower level device drivers which are thus responsible also for dealing
with ordering if they have to.

Cheers,
Ben.
Michael S. Tsirkin - May 21, 2012, 10:31 a.m.
On Mon, May 21, 2012 at 07:53:23PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 12:34 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > > One thing that might alleviate some of your concerns would possibly be
> > > > to "remember" in a global (to be replaced with a thread var eventually)
> > > > the last transfer direction and use a simple test to chose the barrier,
> > > > ie, store + store -> wmb, load + load -> rmb, other -> mb.
> > 
> > But how do you know guest did a store?
> 
> This isn't vs. guest access, but vs DMA access, ie we are ordering DMA
> accesses vs. each other. The guest is still responsible to do it's own
> side of the barriers as usual.
> > > > But first I'd be curious if some x86 folks could actually measure the
> > > > impact of the patch as I proposed it. That would give us an idea of how
> > > > bad the performance problem is and how far we need to go to address it.
> > > 
> > > Another option.... go back to something more like the original patch,
> > > ie, put the barrier in the new dma_* accessors (and provide a
> > > non-barrier one while at it) rather than the low level cpu_physical_*
> > > accessor.
> > > 
> > > That makes it a lot easier for selected driver to be converted to avoid
> > > the barrier in thing like code running in the vcpu context. It also
> > > means that virtio doesn't get any added barriers which is what we want
> > > as well.
> > > 
> > > IE. Have something along the lines (based on the accessors added by the
> > > iommu series) (using __ kernel-style, feel free to provide a better
> > > naming)
> > > 
> > > static inline int __dma_memory_rw( ... args ... )
> > > {
> > >     if (!dma_has_iommu(dma)) {
> > >         /* Fast-path for no IOMMU */
> > >         cpu_physical_memory_rw( ... args ...);
> > >         return 0;
> > >     } else {
> > >         return iommu_dma_memory_rw( ... args ...);
> > >     }
> > > }
> > > 
> > > static inline int dma_memory_rw( ... args ... )
> > > {
> > > 	smp_mb(); /* Or use finer grained as discussied earlier */
> > > 
> > > 	return __dma_memory_rw( ... args ... )
> > 
> > Heh. But don't we need an mb afterwards too?
> 
> Not really no, but we can discuss the fine point, I'm pretty sure
> one-before is enough as long as we ensure MSIs are properly ordered.

Hmm. MSI injection causes IPI. So that does an SMP
barrier I think. But see below about the use of
write-combining in guest.

> > > }
> > > 
> > > And corresponding __dma_memory_read/__dma_memory_write (again, feel
> > > free to suggest a more "qemu'ish" naming if you don't like __, it's
> > > a kernel habit, not sure what you guys do in qemu land).
> > > 
> > > Cheers,
> > > Ben.
> > 
> > And my preference is to first convert everyone to __ variants and
> > carefully switch devices to the barrier version after a bit of
> > consideration.
> 
> I very strongly disagree. This is exactly the wrong approach. In pretty
> much -all- cases the ordered versions are going to be safer, since they
> basically provide the similar ordering semantics to what a PCI bus would
> provide.
> 
> IE. Just making the default accessors ordered means that all devices
> written with the assumption that the guest will see accesses in the
> order they are written in the emulated device will be correct, which
> means pretty much all of them (well, almost).
> 
>  --> It actually fixes a real bug that affects almost all devices
>      that do DMA today in qemu

In theory fine but practical examples that affect x86?
We might want to at least document some of them.

wmb and rmb are nops so there's no bug in practice.
So the only actual rule which might be violated by qemu is that
read flushes out writes.
It's unlikely you will find real examples where this matters
but I'm interested to hear otherwise.


I also note that guests do use write-combining e.g. for vga.
One wonders whether stronger barrriers are needed
because of that?


> Then, fine-tuning performance critical one by selectively removing
> barriers allows to improve performance where it would be othewise
> harmed.

So that breaks attempts to bisect performance regressions.
Not good.

> So on that I will not compromise.
> 
> However, I think it might be better to leave the barrier in the dma
> accessor since that's how you also get iommu transparency etc... so it's
> not a bad place to put them, and leave the cpu_physical_* for use by
> lower level device drivers which are thus responsible also for dealing
> with ordering if they have to.
> 
> Cheers,
> Ben.

You claim to understand what matters for all devices I doubt that.

Why don't we add safe APIs, then go over devices and switch over?
I counted 97 pci_dma_ accesses.
33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.

Let maintainers make a decision where does speed matter.
Benjamin Herrenschmidt - May 21, 2012, 11:45 a.m.
On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:

> > IE. Just making the default accessors ordered means that all devices
> > written with the assumption that the guest will see accesses in the
> > order they are written in the emulated device will be correct, which
> > means pretty much all of them (well, almost).
> > 
> >  --> It actually fixes a real bug that affects almost all devices
> >      that do DMA today in qemu
> 
> In theory fine but practical examples that affect x86?
> We might want to at least document some of them.

x86 I don't know, I suspect mostly none that have actually been hit but
I could be wrong, I'm not familiar enough with it.

I have practical examples that affect power though :-) And I'm
reasonably confident they'll affect ARM as soon as people start doing
serious SMP on it etc...

IE. The code is provably wrong without barriers.

> wmb and rmb are nops so there's no bug in practice.
> So the only actual rule which might be violated by qemu is that
> read flushes out writes.
> It's unlikely you will find real examples where this matters
> but I'm interested to hear otherwise.

wmb and rmb are not nops on powerpc and arm to name a couple.

mb is more than just "read flush writes" (besides it's not a statement
about flushing, it's a statement about ordering. whether it has a
flushing side effect on x86 is a separate issue, it doesn't on power for
example).

Real flushing out writes matters very much in real life in two very
different contexts that tend to not affect emulation in qemu as much.

One is flushing write in the opposite direction (or rather, having the
read response queued up behind those writes) which is critical to
ensuring proper completion of DMAs after an LSI from a guest driver
perspective on real HW typically.

The other classic case is to flush posted MMIO writes in order to ensure
that a subsequent delay is respected.

Most of those don't actually matter when doing emulation. Besides a
barrier won't provide you the second guarantee, you need a nastier
construct at least on some architectures like power.

However, we do need to ensure that read and writes are properly ordered
vs. each other (regardless of any "flush" semantic) or things could go
very wrong on OO architectures (here too, my understanding on x86 is
limited).

> I also note that guests do use write-combining e.g. for vga.
> One wonders whether stronger barrriers are needed
> because of that?


What I'm trying to address here is really to ensure that load and stores
issued by qemu emulated devices "appear" in the right order in respect
to guest driver code running simultaneously on different CPUs (both P
and V in this context).

I've somewhat purposefully for now left alone the problem ordering
"accross" transfer directions which in the case of PCI (and most "sane"
busses) means read flushing writes in the other direction. (Here too
it's not really a flush, it's just an ordering statement between the
write and the read response).

If you want to look at that other problem more closely, it breaks down
into two parts:

 - Guest writes vs. qemu reads. My gut feeling is that this should take
care of itself for the most part, tho you might want to bracket PIO
operations with barriers for extra safety. But we might want to dig
more.

 - qemu writes vs. guest reads. Here too, this only matters as long as
the guest can observe the cat inside the box. This means the guest gets
to "observe" the writes vs. some other event that might need to be
synchronized with the read, such as an interrupt. Here my gut feeling is
that we might be safer by having a barrier before any interrupt get shot
to the guest (and a syscall/ioctl is not a memory barrier, though at
least with kvm on power, a guest entry is so we should be fine) but here
too, it might be a good idea to dig more.

However, both of these can (and probably should) be treated separately
and are rather unlikely to cause problems in practice. In most case
these have to do with flushing non coherent DMA buffers in intermediary
bridges, and while those could be considered as akin to a store queue
that hasn't yet reached coherency on a core, in practice I think we are
fine.

So let's go back to the problem at hand, which is to make sure that load
and stores done by emulated devices get observed by the guest code in
the right order.

My preference is to do it in the dma_* accessors (with possibly
providing __relaxed versions), which means "low level" stuff using cpu_*
directly such as virtio doesn't pay the price (or rather is responsible
for doing its own barriers, which is good since typically that's our
real perf sensitive stuff).

Anthony earlier preferred all the way down into cpu_physical_* which is
what my latest round of patches did. But I can understand that this
makes people uncomfortable.

In any case, we yet have to get some measurements of the actual cost of
those barriers on x86. I can try to give it a go on my laptop next week,
but I'm not an x86 expert and don't have that much different x86 HW
around (such as large SMP machines where the cost might differ or
different generations of x86 processors etc...).

> > Then, fine-tuning performance critical one by selectively removing
> > barriers allows to improve performance where it would be othewise
> > harmed.
> 
> So that breaks attempts to bisect performance regressions.
> Not good.

Disagreed. In all cases safety trumps performances. Almost all our
devices were written without any thought given to ordering, so they
basically can and should be considered as all broken. Since thinking
about ordering is something that, by experience, very few programmer can
do and get right, the default should absolutely be fully ordered.

Performance regressions aren't a big deal to bisect in that case: If
there's a regression for a given driver and it points to this specific
patch adding the barriers then we know precisely where the regression
come from, and we can get some insight about how this specific driver
can be improved to use more relaxed accessors.

I don't see the problem.

One thing that might be worth looking at is if indeed mb() is so much
more costly than just wmb/rmb, in which circumstances we could have some
smarts in the accessors to make them skip the full mb based on knowledge
of previous access direction, though here too I would be tempted to only
do that if absolutely necessary (ie if we can't instead just fix the
sensitive driver to use explicitly relaxed accessors).

> > So on that I will not compromise.
> > 
> > However, I think it might be better to leave the barrier in the dma
> > accessor since that's how you also get iommu transparency etc... so it's
> > not a bad place to put them, and leave the cpu_physical_* for use by
> > lower level device drivers which are thus responsible also for dealing
> > with ordering if they have to.
> > 
> > Cheers,
> > Ben.
> 
> You claim to understand what matters for all devices I doubt that.

It's pretty obvious that anything that does DMA using a classic
descriptor + buffers structure is broken without appropriate ordering.

And yes, I claim to have a fairly good idea of the problem, but I don't
think throwing credentials around is going to be helpful.
 
> Why don't we add safe APIs, then go over devices and switch over?
> I counted 97 pci_dma_ accesses.
> 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> 
> Let maintainers make a decision where does speed matter.

No. Let's fix the general bug first. Then let's people who know the
individual drivers intimately and understand their access patterns make
the call as to when things can/should be relaxed.

Ben.
Michael S. Tsirkin - May 21, 2012, 12:18 p.m.
On Mon, May 21, 2012 at 09:45:58PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:
> 
> > > IE. Just making the default accessors ordered means that all devices
> > > written with the assumption that the guest will see accesses in the
> > > order they are written in the emulated device will be correct, which
> > > means pretty much all of them (well, almost).
> > > 
> > >  --> It actually fixes a real bug that affects almost all devices
> > >      that do DMA today in qemu
> > 
> > In theory fine but practical examples that affect x86?
> > We might want to at least document some of them.
> 
> x86 I don't know, I suspect mostly none that have actually been hit but
> I could be wrong, I'm not familiar enough with it.
> 
> I have practical examples that affect power though :-) And I'm
> reasonably confident they'll affect ARM as soon as people start doing
> serious SMP on it etc...
> 
> IE. The code is provably wrong without barriers.
> 
> > wmb and rmb are nops so there's no bug in practice.
> > So the only actual rule which might be violated by qemu is that
> > read flushes out writes.
> > It's unlikely you will find real examples where this matters
> > but I'm interested to hear otherwise.
> 
> wmb and rmb are not nops on powerpc and arm to name a couple.
> 
> mb is more than just "read flush writes" (besides it's not a statement
> about flushing, it's a statement about ordering. whether it has a
> flushing side effect on x86 is a separate issue, it doesn't on power for
> example).

I referred to reads not bypassing writes on PCI.
This is the real argument in my eyes: that we
should behave the way real hardware does.

> Real flushing out writes matters very much in real life in two very
> different contexts that tend to not affect emulation in qemu as much.
> 
> One is flushing write in the opposite direction (or rather, having the
> read response queued up behind those writes) which is critical to
> ensuring proper completion of DMAs after an LSI from a guest driver
> perspective on real HW typically.
> 
> The other classic case is to flush posted MMIO writes in order to ensure
> that a subsequent delay is respected.
> 
> Most of those don't actually matter when doing emulation. Besides a
> barrier won't provide you the second guarantee, you need a nastier
> construct at least on some architectures like power.

Exactly. This is what I was saying too.

> However, we do need to ensure that read and writes are properly ordered
> vs. each other (regardless of any "flush" semantic) or things could go
> very wrong on OO architectures (here too, my understanding on x86 is
> limited).

Right. Here's a compromize:
- add smp_rmb() on any DMA read
- add smp_wmb( on any DMA write
This is almost zero cost on x86 at least.
So we are not regressing existing setups.

Are there any places where devices do read after write?
My preferred way is to find them and do pci_dma_flush() invoking
smp_mb(). If there is such a case it's likely on datapath anyway
so we do care.

But I can also live with a global flag "latest_dma_read"
and on read we could do
	if (unlikely(latest_dma_read))
		smp_mb();

if you really insist on it
though I do think it's inelegant.

> > I also note that guests do use write-combining e.g. for vga.
> > One wonders whether stronger barrriers are needed
> > because of that?
> 
> 
> What I'm trying to address here is really to ensure that load and stores
> issued by qemu emulated devices "appear" in the right order in respect
> to guest driver code running simultaneously on different CPUs (both P
> and V in this context).
> 
> I've somewhat purposefully for now left alone the problem ordering
> "accross" transfer directions which in the case of PCI (and most "sane"
> busses) means read flushing writes in the other direction. (Here too
> it's not really a flush, it's just an ordering statement between the
> write and the read response).
> 
> If you want to look at that other problem more closely, it breaks down
> into two parts:
> 
>  - Guest writes vs. qemu reads. My gut feeling is that this should take
> care of itself for the most part, tho you might want to bracket PIO
> operations with barriers for extra safety. But we might want to dig
> more.
> 
>  - qemu writes vs. guest reads. Here too, this only matters as long as
> the guest can observe the cat inside the box. This means the guest gets
> to "observe" the writes vs. some other event that might need to be
> synchronized with the read, such as an interrupt. Here my gut feeling is
> that we might be safer by having a barrier before any interrupt get shot
> to the guest (and a syscall/ioctl is not a memory barrier, though at
> least with kvm on power, a guest entry is so we should be fine) but here
> too, it might be a good idea to dig more.
> 
> However, both of these can (and probably should) be treated separately
> and are rather unlikely to cause problems in practice. In most case
> these have to do with flushing non coherent DMA buffers in intermediary
> bridges, and while those could be considered as akin to a store queue
> that hasn't yet reached coherency on a core, in practice I think we are
> fine.
> 
> So let's go back to the problem at hand, which is to make sure that load
> and stores done by emulated devices get observed by the guest code in
> the right order.
> 
> My preference is to do it in the dma_* accessors (with possibly
> providing __relaxed versions), which means "low level" stuff using cpu_*
> directly such as virtio doesn't pay the price (or rather is responsible
> for doing its own barriers, which is good since typically that's our
> real perf sensitive stuff).
> 
> Anthony earlier preferred all the way down into cpu_physical_* which is
> what my latest round of patches did. But I can understand that this
> makes people uncomfortable.
> 
> In any case, we yet have to get some measurements of the actual cost of
> those barriers on x86. I can try to give it a go on my laptop next week,
> but I'm not an x86 expert and don't have that much different x86 HW
> around (such as large SMP machines where the cost might differ or
> different generations of x86 processors etc...).
> 
> > > Then, fine-tuning performance critical one by selectively removing
> > > barriers allows to improve performance where it would be othewise
> > > harmed.
> > 
> > So that breaks attempts to bisect performance regressions.
> > Not good.
> 
> Disagreed. In all cases safety trumps performances.

You said above x86 is unaffected. This is portability, not safety.

> Almost all our
> devices were written without any thought given to ordering, so they
> basically can and should be considered as all broken.

Problem is, a lot of code is likely broken even after you sprinkle
barriers around. For example qemu might write A then B where guest driver
expects to see B written before A.

> Since thinking
> about ordering is something that, by experience, very few programmer can
> do and get right, the default should absolutely be fully ordered.

Give it bus ordering. That is not fully ordered.

> Performance regressions aren't a big deal to bisect in that case: If
> there's a regression for a given driver and it points to this specific
> patch adding the barriers then we know precisely where the regression
> come from, and we can get some insight about how this specific driver
> can be improved to use more relaxed accessors.
> 
> I don't see the problem.
> 
> One thing that might be worth looking at is if indeed mb() is so much
> more costly than just wmb/rmb, in which circumstances we could have some
> smarts in the accessors to make them skip the full mb based on knowledge
> of previous access direction, though here too I would be tempted to only
> do that if absolutely necessary (ie if we can't instead just fix the
> sensitive driver to use explicitly relaxed accessors).

We did this in virtio and yes it is measureable.
branches are pretty cheap though.

> > > So on that I will not compromise.
> > > 
> > > However, I think it might be better to leave the barrier in the dma
> > > accessor since that's how you also get iommu transparency etc... so it's
> > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > lower level device drivers which are thus responsible also for dealing
> > > with ordering if they have to.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > You claim to understand what matters for all devices I doubt that.
> 
> It's pretty obvious that anything that does DMA using a classic
> descriptor + buffers structure is broken without appropriate ordering.
> 
> And yes, I claim to have a fairly good idea of the problem, but I don't
> think throwing credentials around is going to be helpful.
>  
> > Why don't we add safe APIs, then go over devices and switch over?
> > I counted 97 pci_dma_ accesses.
> > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > 
> > Let maintainers make a decision where does speed matter.
> 
> No. Let's fix the general bug first. Then let's people who know the
> individual drivers intimately and understand their access patterns make
> the call as to when things can/should be relaxed.
> 
> Ben.

As a maintainer of a device, if you send me a patch I can review.
If you change core APIs creating performance regressions
I don't even know what to review without wasting time debugging
and bisecting.

According to what you said you want to fix kvm on powerpc.
Good. Find a way that looks non intrusive on x86 please.
Paolo Bonzini - May 21, 2012, 3:16 p.m.
Il 21/05/2012 14:18, Michael S. Tsirkin ha scritto:
>> > Almost all our
>> > devices were written without any thought given to ordering, so they
>> > basically can and should be considered as all broken.
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example qemu might write A then B where guest driver
> expects to see B written before A.

This would be a bug in the guest driver, and usually relatively easy to
reproduce.  The specs (I know of UHCI) should be very precise on this
for obvious reasons.

Paolo
Benjamin Herrenschmidt - May 21, 2012, 9:58 p.m.
On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:

> > mb is more than just "read flush writes" (besides it's not a statement
> > about flushing, it's a statement about ordering. whether it has a
> > flushing side effect on x86 is a separate issue, it doesn't on power for
> > example).
> 
> I referred to reads not bypassing writes on PCI.

Again, from which originator ? From a given initiator, nothing bypasses
anything, so the right thing to do here is a full mb(). However, I
suspect what you are talking about here is read -responses- not
bypassing writes in the direction of the response (ie, the "flushing"
semantic of reads) which is a different matter. Also don't forget that
this is only a semantic of PCI, not of the system fabric, ie, a device
DMA read doesn't flush a CPU write that is still in that CPU store
queue.

> This is the real argument in my eyes: that we
> should behave the way real hardware does.

But that doesn't really make much sense since we don't actually have a
non-coherent bus sitting in the middle :-)

However we should as much as possible be observed to behave as such, I
agree, though I don't think we need to bother too much about timings
since we don't really have way to enforce the immediate visibility of
stores within the coherent domain without a bunch of arch specific very
very heavy hammers which we really don't want to wield at this point.

> > Real flushing out writes matters very much in real life in two very
> > different contexts that tend to not affect emulation in qemu as much.
> > 
> > One is flushing write in the opposite direction (or rather, having the
> > read response queued up behind those writes) which is critical to
> > ensuring proper completion of DMAs after an LSI from a guest driver
> > perspective on real HW typically.
> > 
> > The other classic case is to flush posted MMIO writes in order to ensure
> > that a subsequent delay is respected.
> > 
> > Most of those don't actually matter when doing emulation. Besides a
> > barrier won't provide you the second guarantee, you need a nastier
> > construct at least on some architectures like power.
> 
> Exactly. This is what I was saying too.

Right and I'm reasonably sure that none of those above is our problem. 

As I said, at this point, what I want to sort out is purely the
observable ordering of DMA transactions. The side effect of reads in one
direction on writes in the other direction is an orthogonal problem
which as I wrote above is probably not hurting us.

> > However, we do need to ensure that read and writes are properly ordered
> > vs. each other (regardless of any "flush" semantic) or things could go
> > very wrong on OO architectures (here too, my understanding on x86 is
> > limited).
> 
> Right. Here's a compromize:
> - add smp_rmb() on any DMA read
> - add smp_wmb( on any DMA write
> This is almost zero cost on x86 at least.
> So we are not regressing existing setups.

And it's not correct. With that setup, DMA writes can pass DMA reads
(and vice-versa) which doesn't correspond to the guarantees of the PCI
spec. The question I suppose is whether this is a problem in practice...

> Are there any places where devices do read after write?

It's possible, things like update of a descriptor followed by reading of
the next one, etc...  I don't have an example hot in mind right know of
a device that would be hurt but I'm a bit nervous as this would be a
violation of the PCI guaranteed ordering.

> My preferred way is to find them and do pci_dma_flush() invoking
> smp_mb(). If there is such a case it's likely on datapath anyway
> so we do care.
> 
> But I can also live with a global flag "latest_dma_read"
> and on read we could do
> 	if (unlikely(latest_dma_read))
> 		smp_mb();
> 
> if you really insist on it
> though I do think it's inelegant.

Again, why do you object on simply making the default accessors fully
ordered ? Do you think it will be a measurable different in most cases ?

Shouldn't we measure it first ?

> You said above x86 is unaffected. This is portability, not safety.

x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
the missing ordering between loads and stores, I don't know, as I said,
I don't fully know the x86 memory model.

In any case, opposing "portability" to "safety" the way you do it means
you are making assumptions that basically "qemu is written for x86 and
nothing else matters".

If that's your point of view, so be it and be clear about it, but I will
disagree :-) And while I can understand that powerpc might not be
considered as the most important arch around at this point in time,
these problems are going to affect ARM as well.

> > Almost all our
> > devices were written without any thought given to ordering, so they
> > basically can and should be considered as all broken.
> 
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example qemu might write A then B where guest driver
> expects to see B written before A.

No, this is totally unrelated bugs, nothing to do with barriers. You are
mixing up two completely different problems and using one as an excuse
to not fix the other one :-)

A device with the above problem would be broken today on x86 regardless.

> > Since thinking
> > about ordering is something that, by experience, very few programmer can
> > do and get right, the default should absolutely be fully ordered.
> 
> Give it bus ordering. That is not fully ordered.

It pretty much is actually, look at your PCI spec :-)

> > Performance regressions aren't a big deal to bisect in that case: If
> > there's a regression for a given driver and it points to this specific
> > patch adding the barriers then we know precisely where the regression
> > come from, and we can get some insight about how this specific driver
> > can be improved to use more relaxed accessors.
> > 
> > I don't see the problem.
> > 
> > One thing that might be worth looking at is if indeed mb() is so much
> > more costly than just wmb/rmb, in which circumstances we could have some
> > smarts in the accessors to make them skip the full mb based on knowledge
> > of previous access direction, though here too I would be tempted to only
> > do that if absolutely necessary (ie if we can't instead just fix the
> > sensitive driver to use explicitly relaxed accessors).
> 
> We did this in virtio and yes it is measureable.

You did it in virtio in a very hot spot on a performance critical
driver. My argument is that:

 - We can do it in a way that doesn't affect virtio at all (by using the
dma accessors instead of cpu_*)

 - Only few drivers have that kind of performance criticality and they
can be easily hand fixed.

> branches are pretty cheap though.

Depends, not always but yes, cheaper than barriers in many cases.

> > > > So on that I will not compromise.
> > > > 
> > > > However, I think it might be better to leave the barrier in the dma
> > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > lower level device drivers which are thus responsible also for dealing
> > > > with ordering if they have to.
> > > > 
> > > > Cheers,
> > > > Ben.
> > > 
> > > You claim to understand what matters for all devices I doubt that.
> > 
> > It's pretty obvious that anything that does DMA using a classic
> > descriptor + buffers structure is broken without appropriate ordering.
> > 
> > And yes, I claim to have a fairly good idea of the problem, but I don't
> > think throwing credentials around is going to be helpful.
> >  
> > > Why don't we add safe APIs, then go over devices and switch over?
> > > I counted 97 pci_dma_ accesses.
> > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > 
> > > Let maintainers make a decision where does speed matter.
> > 
> > No. Let's fix the general bug first. Then let's people who know the
> > individual drivers intimately and understand their access patterns make
> > the call as to when things can/should be relaxed.
> > 
> > Ben.
> 
> As a maintainer of a device, if you send me a patch I can review.
> If you change core APIs creating performance regressions
> I don't even know what to review without wasting time debugging
> and bisecting.

Well, if you don't know then you ask on the list and others (such as
myself or a certain mst) who happens to know those issues will help that
lone clueless maintainer, seriously it's not like it's hard, and a lot
easier than just keeping everything broken and hoping we get to audit
them all properly.

> According to what you said you want to fix kvm on powerpc.
> Good. Find a way that looks non intrusive on x86 please.

And ARM. And any other OO arch.

Ben.
Anthony Liguori - May 21, 2012, 10:18 p.m.
On 05/21/2012 03:31 AM, Michael S. Tsirkin wrote:
> More than that. smp_mb is pretty expensive. You
> often can do just smp_wmb and smp_rmb and that is
> very cheap.
> Many operations run in the vcpu context
> or start when guest exits to host and work
> is bounced from there and thus no barrier is needed
> here.
>
> Example? start_xmit in e1000. Executed in vcpu context
> so no barrier is needed.
>
> virtio of course is another example since it does its own
> barriers. But even without that, virtio_blk_handle_output
> runs in vcpu context.
>
> But more importantly, this hack just sweeps the
> dirt under the carpet. Understanding the interaction
> with guest drivers is important anyway. So

But this isn't what this series is about.

This series is only attempting to make sure that writes are ordered with respect 
to other writes in main memory.

It's based on the assumption that write ordering is well defined (and typically 
strict) on most busses including PCI.  I have not confirmed this myself but I 
trust that Ben has.

So the only problem trying to be solved here is to make sure that if a write A 
is issued by the device model while it's on PCPU 0, if PCPU 1 does a write B to 
another location, and then the device model runs on PCPU 2 and does a read of 
both A and B, it will only see the new value of B if the it sees the new value of A.

Whether the driver on VCPU 0 (which may be on any PCPU) also sees the write 
ordering is irrelevant.

If you want to avoid taking a barrier on every write, we can make use of map() 
and issue explicit barriers (as virtio does).

Regards,

Anthony Liguori

> I really don't see why don't we audit devices
> and add proper barriers.
>
Michael S. Tsirkin - May 21, 2012, 10:22 p.m.
On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> 
> > > mb is more than just "read flush writes" (besides it's not a statement
> > > about flushing, it's a statement about ordering. whether it has a
> > > flushing side effect on x86 is a separate issue, it doesn't on power for
> > > example).
> > 
> > I referred to reads not bypassing writes on PCI.
> 
> Again, from which originator ? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the "flushing"
> semantic of reads) which is a different matter.

No. My spec says:
A3, A4
A Posted Request must be able to pass Non-Posted Requests to avoid
deadlocks.


> Also don't forget that
> this is only a semantic of PCI, not of the system fabric, ie, a device
> DMA read doesn't flush a CPU write that is still in that CPU store
> queue.

We need to emulate what hardware does IMO.
So if usb has different rules it needs a different barriers.


> > This is the real argument in my eyes: that we
> > should behave the way real hardware does.
> 
> But that doesn't really make much sense since we don't actually have a
> non-coherent bus sitting in the middle :-)
> 
> However we should as much as possible be observed to behave as such, I
> agree, though I don't think we need to bother too much about timings
> since we don't really have way to enforce the immediate visibility of
> stores within the coherent domain without a bunch of arch specific very
> very heavy hammers which we really don't want to wield at this point.
> 
> > > Real flushing out writes matters very much in real life in two very
> > > different contexts that tend to not affect emulation in qemu as much.
> > > 
> > > One is flushing write in the opposite direction (or rather, having the
> > > read response queued up behind those writes) which is critical to
> > > ensuring proper completion of DMAs after an LSI from a guest driver
> > > perspective on real HW typically.
> > > 
> > > The other classic case is to flush posted MMIO writes in order to ensure
> > > that a subsequent delay is respected.
> > > 
> > > Most of those don't actually matter when doing emulation. Besides a
> > > barrier won't provide you the second guarantee, you need a nastier
> > > construct at least on some architectures like power.
> > 
> > Exactly. This is what I was saying too.
> 
> Right and I'm reasonably sure that none of those above is our problem. 
> 
> As I said, at this point, what I want to sort out is purely the
> observable ordering of DMA transactions. The side effect of reads in one
> direction on writes in the other direction is an orthogonal problem
> which as I wrote above is probably not hurting us.
> 
> > > However, we do need to ensure that read and writes are properly ordered
> > > vs. each other (regardless of any "flush" semantic) or things could go
> > > very wrong on OO architectures (here too, my understanding on x86 is
> > > limited).
> > 
> > Right. Here's a compromize:
> > - add smp_rmb() on any DMA read
> > - add smp_wmb( on any DMA write
> > This is almost zero cost on x86 at least.
> > So we are not regressing existing setups.
> 
> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice-versa) which doesn't correspond to the guarantees of the PCI
> spec.

Cite the spec please. Express spec matches this at least.

> The question I suppose is whether this is a problem in practice...
> 
> > Are there any places where devices do read after write?
> 
> It's possible, things like update of a descriptor followed by reading of
> the next one, etc...  I don't have an example hot in mind right know of
> a device that would be hurt but I'm a bit nervous as this would be a
> violation of the PCI guaranteed ordering.
> 
> > My preferred way is to find them and do pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on datapath anyway
> > so we do care.
> > 
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> > 	if (unlikely(latest_dma_read))
> > 		smp_mb();
> > 
> > if you really insist on it
> > though I do think it's inelegant.
> 
> Again, why do you object on simply making the default accessors fully
> ordered ? Do you think it will be a measurable different in most cases ?
> 
> Shouldn't we measure it first ?

It's a lot of work. We measured the effect for virtio in
the past. I don't think we need to redo it.

> > You said above x86 is unaffected. This is portability, not safety.
> 
> x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
> the missing ordering between loads and stores, I don't know, as I said,
> I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

> In any case, opposing "portability" to "safety" the way you do it means
> you are making assumptions that basically "qemu is written for x86 and
> nothing else matters".

No. But find a way to fix power without hurting working setups.

> If that's your point of view, so be it and be clear about it, but I will
> disagree :-) And while I can understand that powerpc might not be
> considered as the most important arch around at this point in time,
> these problems are going to affect ARM as well.
> 
> > > Almost all our
> > > devices were written without any thought given to ordering, so they
> > > basically can and should be considered as all broken.
> > 
> > Problem is, a lot of code is likely broken even after you sprinkle
> > barriers around. For example qemu might write A then B where guest driver
> > expects to see B written before A.
> 
> No, this is totally unrelated bugs, nothing to do with barriers. You are
> mixing up two completely different problems and using one as an excuse
> to not fix the other one :-)
> 
> A device with the above problem would be broken today on x86 regardless.
> 
> > > Since thinking
> > > about ordering is something that, by experience, very few programmer can
> > > do and get right, the default should absolutely be fully ordered.
> > 
> > Give it bus ordering. That is not fully ordered.
> 
> It pretty much is actually, look at your PCI spec :-)

I looked. 2.4.1.  Transaction Ordering Rules

> > > Performance regressions aren't a big deal to bisect in that case: If
> > > there's a regression for a given driver and it points to this specific
> > > patch adding the barriers then we know precisely where the regression
> > > come from, and we can get some insight about how this specific driver
> > > can be improved to use more relaxed accessors.
> > > 
> > > I don't see the problem.
> > > 
> > > One thing that might be worth looking at is if indeed mb() is so much
> > > more costly than just wmb/rmb, in which circumstances we could have some
> > > smarts in the accessors to make them skip the full mb based on knowledge
> > > of previous access direction, though here too I would be tempted to only
> > > do that if absolutely necessary (ie if we can't instead just fix the
> > > sensitive driver to use explicitly relaxed accessors).
> > 
> > We did this in virtio and yes it is measureable.
> 
> You did it in virtio in a very hot spot on a performance critical
> driver. My argument is that:
> 
>  - We can do it in a way that doesn't affect virtio at all (by using the
> dma accessors instead of cpu_*)
> 
>  - Only few drivers have that kind of performance criticality and they
> can be easily hand fixed.
> 
> > branches are pretty cheap though.
> 
> Depends, not always but yes, cheaper than barriers in many cases.
> 
> > > > > So on that I will not compromise.
> > > > > 
> > > > > However, I think it might be better to leave the barrier in the dma
> > > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > > lower level device drivers which are thus responsible also for dealing
> > > > > with ordering if they have to.
> > > > > 
> > > > > Cheers,
> > > > > Ben.
> > > > 
> > > > You claim to understand what matters for all devices I doubt that.
> > > 
> > > It's pretty obvious that anything that does DMA using a classic
> > > descriptor + buffers structure is broken without appropriate ordering.
> > > 
> > > And yes, I claim to have a fairly good idea of the problem, but I don't
> > > think throwing credentials around is going to be helpful.
> > >  
> > > > Why don't we add safe APIs, then go over devices and switch over?
> > > > I counted 97 pci_dma_ accesses.
> > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > > 
> > > > Let maintainers make a decision where does speed matter.
> > > 
> > > No. Let's fix the general bug first. Then let's people who know the
> > > individual drivers intimately and understand their access patterns make
> > > the call as to when things can/should be relaxed.
> > > 
> > > Ben.
> > 
> > As a maintainer of a device, if you send me a patch I can review.
> > If you change core APIs creating performance regressions
> > I don't even know what to review without wasting time debugging
> > and bisecting.
> 
> Well, if you don't know then you ask on the list and others (such as
> myself or a certain mst) who happens to know those issues will help that
> lone clueless maintainer, seriously it's not like it's hard, and a lot
> easier than just keeping everything broken and hoping we get to audit
> them all properly.
> 
> > According to what you said you want to fix kvm on powerpc.
> > Good. Find a way that looks non intrusive on x86 please.
> 
> And ARM. And any other OO arch.
> 
> Ben.
Benjamin Herrenschmidt - May 21, 2012, 10:26 p.m.
On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
> But this isn't what this series is about.
> 
> This series is only attempting to make sure that writes are ordered
> with respect 
> to other writes in main memory.

Actually, it applies to both reads and writes. They can't pass each
other either and that can be fairly important.

It's in fact the main contention point because if it was only writes we
could just use wmb and be done with it (that's a nop on x86).

Because we are trying to order everything (and specifically store
followed by a load), we need a full barrier which is more expensive on
x86.

Ben.

> It's based on the assumption that write ordering is well defined (and
> typically 
> strict) on most busses including PCI.  I have not confirmed this
> myself but I 
> trust that Ben has.
> 
> So the only problem trying to be solved here is to make sure that if a
> write A 
> is issued by the device model while it's on PCPU 0, if PCPU 1 does a
> write B to 
> another location, and then the device model runs on PCPU 2 and does a
> read of 
> both A and B, it will only see the new value of B if the it sees the
> new value of A.
> 
> Whether the driver on VCPU 0 (which may be on any PCPU) also sees the
> write 
> ordering is irrelevant.
> 
> If you want to avoid taking a barrier on every write, we can make use
> of map() 
> and issue explicit barriers (as virtio does).
> 
>
Anthony Liguori - May 21, 2012, 10:31 p.m.
On 05/21/2012 05:26 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
>> But this isn't what this series is about.
>>
>> This series is only attempting to make sure that writes are ordered
>> with respect
>> to other writes in main memory.
>
> Actually, it applies to both reads and writes. They can't pass each
> other either and that can be fairly important.

That's fine but that's a detail of the bus.

> It's in fact the main contention point because if it was only writes we
> could just use wmb and be done with it (that's a nop on x86).
>
> Because we are trying to order everything (and specifically store
> followed by a load), we need a full barrier which is more expensive on
> x86.

I think the thing to do is make the barrier implemented in the dma API and allow 
it to be overridden by the bus.  The default implementation should be a full 
barrier.

If we can establish that the bus guarantees a weaker ordering guarantee, a bus 
could override the default implementation and do something weaker.

Regards,

Anthony Liguori
Michael S. Tsirkin - May 21, 2012, 10:37 p.m.
On Mon, May 21, 2012 at 05:18:21PM -0500, Anthony Liguori wrote:
> On 05/21/2012 03:31 AM, Michael S. Tsirkin wrote:
> >More than that. smp_mb is pretty expensive. You
> >often can do just smp_wmb and smp_rmb and that is
> >very cheap.
> >Many operations run in the vcpu context
> >or start when guest exits to host and work
> >is bounced from there and thus no barrier is needed
> >here.
> >
> >Example? start_xmit in e1000. Executed in vcpu context
> >so no barrier is needed.
> >
> >virtio of course is another example since it does its own
> >barriers. But even without that, virtio_blk_handle_output
> >runs in vcpu context.
> >
> >But more importantly, this hack just sweeps the
> >dirt under the carpet. Understanding the interaction
> >with guest drivers is important anyway. So
> 
> But this isn't what this series is about.
> 
> This series is only attempting to make sure that writes are ordered
> with respect to other writes in main memory.

They it should use smp_wmb() not smp_mb().
I would be 100% fine with that.

> It's based on the assumption that write ordering is well defined
> (and typically strict) on most busses including PCI.  I have not
> confirmed this myself but I trust that Ben has.
> 
> So the only problem trying to be solved here is to make sure that if
> a write A is issued by the device model while it's on PCPU 0, if
> PCPU 1 does a write B to another location, and then the device model
> runs on PCPU 2 and does a read of both A and B, it will only see the
> new value of B if the it sees the new value of A.
> 
> Whether the driver on VCPU 0 (which may be on any PCPU) also sees
> the write ordering is irrelevant.
> 
> If you want to avoid taking a barrier on every write, we can make
> use of map() and issue explicit barriers (as virtio does).
> 
> Regards,
> 
> Anthony Liguori
> 
> >I really don't see why don't we audit devices
> >and add proper barriers.
> >
Michael S. Tsirkin - May 21, 2012, 10:44 p.m.
On Mon, May 21, 2012 at 05:31:06PM -0500, Anthony Liguori wrote:
> On 05/21/2012 05:26 PM, Benjamin Herrenschmidt wrote:
> >On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
> >>But this isn't what this series is about.
> >>
> >>This series is only attempting to make sure that writes are ordered
> >>with respect
> >>to other writes in main memory.
> >
> >Actually, it applies to both reads and writes. They can't pass each
> >other either and that can be fairly important.
> 
> That's fine but that's a detail of the bus.
> 
> >It's in fact the main contention point because if it was only writes we
> >could just use wmb and be done with it (that's a nop on x86).
> >
> >Because we are trying to order everything (and specifically store
> >followed by a load), we need a full barrier which is more expensive on
> >x86.
> 
> I think the thing to do is make the barrier implemented in the dma
> API and allow it to be overridden by the bus.  The default
> implementation should be a full barrier.

I think what's called for is what Ben proposed: track
last transaction and use the appropriate barrier.

> If we can establish that the bus guarantees a weaker ordering
> guarantee, a bus could override the default implementation and do
> something weaker.
> 
> Regards,
> 
> Anthony Liguori

OK. Just not another level of indirect function callbacks please.  Make
it a library so each bus can do the right thing.  There are not so many
buses.
Benjamin Herrenschmidt - May 21, 2012, 10:56 p.m.
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator ? From a given initiator, nothing
> bypasses
> > anything, so the right thing to do here is a full mb(). However, I
> > suspect what you are talking about here is read -responses- not
> > bypassing writes in the direction of the response (ie, the
> "flushing"
> > semantic of reads) which is a different matter.
> 
> No. My spec says:
> A3, A4
> A Posted Request must be able to pass Non-Posted Requests to avoid
> deadlocks.

Right, a read + write can become write + read at the target, I forgot
about that, or you can deadlock due to the flush semantics, but a write
+ read must remain in order or am I missing something ?

And write + read afaik is typically the one that x86 can re-order
without a barrier isn't it ?

> > Also don't forget that
> > this is only a semantic of PCI, not of the system fabric, ie, a
> device
> > DMA read doesn't flush a CPU write that is still in that CPU store
> > queue.
> 
> We need to emulate what hardware does IMO.
> So if usb has different rules it needs a different barriers.

Who talks about USB here ? Whatever rules USB has only matter between
the USB device and the USB controller emulation, USB doesn't sit
directly on the memory bus doing DMA, it all goes through the HCI, which
adheres the the ordering rules of whatever bus it sits on.

Here, sanity must prevail :-) I suggest ordering by default...

> > And it's not correct. With that setup, DMA writes can pass DMA reads
> > (and vice-versa) which doesn't correspond to the guarantees of the
> PCI
> > spec.
> 
> Cite the spec please. Express spec matches this at least.

Sure, see above. Yes I did forgot that a read + write could be
re-ordered on PCI but that isn't the case of a write + read, or am I
reading the table sideways ?

> It's a lot of work. We measured the effect for virtio in
> the past. I don't think we need to redo it.

virtio is specifically our high performance case and what I'm proposing
isn't affecting it.

> > > You said above x86 is unaffected. This is portability, not safety.
> > 
> > x86 is unaffected by the missing wmb/rmb, it might not be unaffected
> by
> > the missing ordering between loads and stores, I don't know, as I
> said,
> > I don't fully know the x86 memory model.
> 
> You don't need to understand it. Assume memory-barriers.h is correct.

In which case we still need a full mb() unless we can convince ourselves
that the ordering between a write and a subsequent read can be relaxed
safely and I'm really not sure about it.

> > In any case, opposing "portability" to "safety" the way you do it
> means
> > you are making assumptions that basically "qemu is written for x86
> and
> > nothing else matters".
> 
> No. But find a way to fix power without hurting working setups.

And ARM ;-)

Arguably x86 is wrong too anyway, at least from a strict interpretation
off the spec (and unless I missed something).

> > If that's your point of view, so be it and be clear about it, but I
> will
> > disagree :-) And while I can understand that powerpc might not be
> > considered as the most important arch around at this point in time,
> > these problems are going to affect ARM as well.
> > 
> > > > Almost all our
> > > > devices were written without any thought given to ordering, so
> they
> > > > basically can and should be considered as all broken.
> > > 
> > > Problem is, a lot of code is likely broken even after you sprinkle
> > > barriers around. For example qemu might write A then B where guest
> driver
> > > expects to see B written before A.
> > 
> > No, this is totally unrelated bugs, nothing to do with barriers. You
> are
> > mixing up two completely different problems and using one as an
> excuse
> > to not fix the other one :-)
> > 
> > A device with the above problem would be broken today on x86
> regardless.
> > 
> > > > Since thinking
> > > > about ordering is something that, by experience, very few
> programmer can
> > > > do and get right, the default should absolutely be fully
> ordered.
> > > 
> > > Give it bus ordering. That is not fully ordered.
> > 
> > It pretty much is actually, look at your PCI spec :-)
> 
> I looked. 2.4.1.  Transaction Ordering Rules
> 
> > > > Performance regressions aren't a big deal to bisect in that
> case: If
> > > > there's a regression for a given driver and it points to this
> specific
> > > > patch adding the barriers then we know precisely where the
> regression
> > > > come from, and we can get some insight about how this specific
> driver
> > > > can be improved to use more relaxed accessors.
> > > > 
> > > > I don't see the problem.
> > > > 
> > > > One thing that might be worth looking at is if indeed mb() is so
> much
> > > > more costly than just wmb/rmb, in which circumstances we could
> have some
> > > > smarts in the accessors to make them skip the full mb based on
> knowledge
> > > > of previous access direction, though here too I would be tempted
> to only
> > > > do that if absolutely necessary (ie if we can't instead just fix
> the
> > > > sensitive driver to use explicitly relaxed accessors).
> > > 
> > > We did this in virtio and yes it is measureable.
> > 
> > You did it in virtio in a very hot spot on a performance critical
> > driver. My argument is that:
> > 
> >  - We can do it in a way that doesn't affect virtio at all (by using
> the
> > dma accessors instead of cpu_*)
> > 
> >  - Only few drivers have that kind of performance criticality and
> they
> > can be easily hand fixed.
> > 
> > > branches are pretty cheap though.
> > 
> > Depends, not always but yes, cheaper than barriers in many cases.
> > 
> > > > > > So on that I will not compromise.
> > > > > > 
> > > > > > However, I think it might be better to leave the barrier in
> the dma
> > > > > > accessor since that's how you also get iommu transparency
> etc... so it's
> > > > > > not a bad place to put them, and leave the cpu_physical_*
> for use by
> > > > > > lower level device drivers which are thus responsible also
> for dealing
> > > > > > with ordering if they have to.
> > > > > > 
> > > > > > Cheers,
> > > > > > Ben.
> > > > > 
> > > > > You claim to understand what matters for all devices I doubt
> that.
> > > > 
> > > > It's pretty obvious that anything that does DMA using a classic
> > > > descriptor + buffers structure is broken without appropriate
> ordering.
> > > > 
> > > > And yes, I claim to have a fairly good idea of the problem, but
> I don't
> > > > think throwing credentials around is going to be helpful.
> > > >  
> > > > > Why don't we add safe APIs, then go over devices and switch
> over?
> > > > > I counted 97 pci_dma_ accesses.
> > > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > > > 
> > > > > Let maintainers make a decision where does speed matter.
> > > > 
> > > > No. Let's fix the general bug first. Then let's people who know
> the
> > > > individual drivers intimately and understand their access
> patterns make
> > > > the call as to when things can/should be relaxed.
> > > > 
> > > > Ben.
> > > 
> > > As a maintainer of a device, if you send me a patch I can review.
> > > If you change core APIs creating performance regressions
> > > I don't even know what to review without wasting time debugging
> > > and bisecting.
> > 
> > Well, if you don't know then you ask on the list and others (such as
> > myself or a certain mst) who happens to know those issues will help
> that
> > lone clueless maintainer, seriously it's not like it's hard, and a
> lot
> > easier than just keeping everything broken and hoping we get to
> audit
> > them all properly.
> > 
> > > According to what you said you want to fix kvm on powerpc.
> > > Good. Find a way that looks non intrusive on x86 please.
> > 
> > And ARM. And any other OO arch.
> > 
> > Ben.
> 
> 
>
Benjamin Herrenschmidt - May 21, 2012, 11:02 p.m.
On Tue, 2012-05-22 at 01:44 +0300, Michael S. Tsirkin wrote:
> 
> OK. Just not another level of indirect function callbacks please.  Make
> it a library so each bus can do the right thing.  There are not so many
> buses. 

I think we are a long way from having to deal with subtly different
ordering rules, at this stage I'm keen on just sticking it in the inline
dma_ accessor, period.

We can always have flags in the dma context to indicate the sort of
ordering, avoids indirect pointer calls which can be costly (often
mispredicted).

Ben.
Benjamin Herrenschmidt - May 22, 2012, midnight
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator ? From a given initiator, nothing
> bypasses
> > anything, so the right thing to do here is a full mb(). However, I
> > suspect what you are talking about here is read -responses- not
> > bypassing writes in the direction of the response (ie, the
> "flushing"
> > semantic of reads) which is a different matter.
> 
> No. My spec says:
> A3, A4
> A Posted Request must be able to pass Non-Posted Requests to avoid
> deadlocks. 

An additional note about that one: It only applies obviously if the
initiator doesn't wait for the read response before shooting the write.

Most devices actually do wait. Here too, we would have to explicitly
make sure if what is the right semantic on an individual device basis.

Ben.
Rusty Russell - May 22, 2012, 4:19 a.m.
On Tue, 22 May 2012 07:58:17 +1000, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> > 	if (unlikely(latest_dma_read))
> > 		smp_mb();
> > 
> > if you really insist on it
> > though I do think it's inelegant.
> 
> Again, why do you object on simply making the default accessors fully
> ordered ? Do you think it will be a measurable different in most cases ?
> 
> Shouldn't we measure it first ?

Yes.  It seems clear to me that qemu's default DMA operations should be
strictly ordered.  It's just far easier to get right.

After that, we can get tricky with conditional barriers, and we can get
tricky with using special unordered variants in critical drivers, but I
really don't want to be chasing subtle SMP ordering problems in
production.

If you're working on ARM or PPC, "it works for x86" makes it *worse*,
not better, since you have fewer users to find bugs.

Cheers,
Rusty.
Michael S. Tsirkin - May 22, 2012, 5:11 a.m.
On Tue, May 22, 2012 at 08:56:12AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > > Again, from which originator ? From a given initiator, nothing
> > bypasses
> > > anything, so the right thing to do here is a full mb(). However, I
> > > suspect what you are talking about here is read -responses- not
> > > bypassing writes in the direction of the response (ie, the
> > "flushing"
> > > semantic of reads) which is a different matter.
> > 
> > No. My spec says:
> > A3, A4
> > A Posted Request must be able to pass Non-Posted Requests to avoid
> > deadlocks.
> 
> Right, a read + write can become write + read at the target, I forgot
> about that, or you can deadlock due to the flush semantics, but a write
> + read must remain in order or am I missing something ?

Exactly.

> And write + read afaik is typically the one that x86 can re-order
> without a barrier isn't it ?

AFAIK without a barrier, x86 can reorder them however you initiate them.

Patch

diff --git a/exec.c b/exec.c
index 363ec98..997dbb0 100644
--- a/exec.c
+++ b/exec.c
@@ -16,6 +16,34 @@ 
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
+
+/*
+ * Note on memory barriers usage:
+ *
+ * In order for emulated devices "DMA" operations to appear
+ * with a consistent ordering to the guest, we provide some
+ * amount of ordering guarantees:
+ *
+ * cpy_physical_memory_rw() (and all functions using it) is
+ * ordered vs. all previous accesses (it begins with a full
+ * memory barrier)
+ *
+ * This include all the new dma_* accessors.
+ *
+ * The old variants of ld* and st* that have not been convered
+ * to dma_ are not ordered. Users are reponsible for using their
+ * own ordering.
+ *
+ * cpu_physical_memory_map() provides a memory barrier vs. all
+ * previous accesses. There is no implicit barrier on unmap.
+ * If ordering is required between accessed within the map/unmmap
+ * sequence, then it needs to be done explicitely.
+ *
+ * This means that a typical block driver using map/unmap accross
+ * the transfer of a block followed by dma_ writes to signal
+ * completion or interrupt shouldn't require the addition of
+ * explicit barriers.
+ */ 
 #include "config.h"
 #ifdef _WIN32
 #include <windows.h>
@@ -25,6 +53,7 @@ 
 #endif
 
 #include "qemu-common.h"
+#include "qemu-barrier.h"
 #include "cpu.h"
 #include "tcg.h"
 #include "hw/hw.h"
@@ -3516,6 +3545,10 @@  void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
     target_phys_addr_t page;
     MemoryRegionSection *section;
 
+    /* Provides ordering vs. previous accesses, see comments
+     * at the top of this file */
+    smp_mb();
+
     while (len > 0) {
         page = addr & TARGET_PAGE_MASK;
         l = (page + TARGET_PAGE_SIZE) - addr;
@@ -3713,6 +3746,10 @@  void *cpu_physical_memory_map(target_phys_addr_t addr,
     ram_addr_t rlen;
     void *ret;
 
+    /* Provides ordering vs. previous accesses, see comments
+     * at the top of this file */
+    smp_mb();
+
     while (len > 0) {
         page = addr & TARGET_PAGE_MASK;
         l = (page + TARGET_PAGE_SIZE) - addr;