Patchwork virtio: Make memory barriers be memory barriers

Submitter David Gibson
Date Sept. 1, 2011, 6:09 a.m.
Message ID <1314857389-13363-1-git-send-email-david@gibson.dropbear.id.au>
Permalink /patch/112819/
State New

Comments

David Gibson - Sept. 1, 2011, 6:09 a.m.
The virtio code already has memory barrier wmb() macros in the code.
However, they were defined as no-ops.  The comment claims that real
barriers are not necessary because the code does not run concurrently.
However, with kvm and the io-thread enabled, this is not true and this qemu
code can indeed run concurrently with the guest kernel.  This does not
cause problems on x86 due to its strongly ordered storage model, but it
causes a race leading to virtio errors on POWER, which has a relaxed storage
ordering model.

Specifically, QEMU puts a new element into the "used" ring and then
updates the ring's free-running counter.  Without a barrier between these
two writes, under the right circumstances the guest Linux driver can
receive an interrupt, read the counter change, but find that the ring
element to be handled still has an old value, leading to an
"id %u is not a head!\n" error message.

The problem is easy to reproduce on POWER using virtio-net with heavy
traffic.
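The ordering requirement described above can be sketched as follows (a minimal illustration only; the struct names and ring size are made up and are not QEMU's actual hw/virtio.c code):

```c
#include <stdint.h>

#define RING_SIZE 256
#define wmb() __sync_synchronize()

struct used_elem { uint32_t id; uint32_t len; };
struct used_ring {
    uint16_t idx;                  /* free-running counter read by guest */
    struct used_elem ring[RING_SIZE];
};

/* Device side: publish a completed element, then bump the counter. */
static void push_used(struct used_ring *u, uint32_t head, uint32_t len)
{
    struct used_elem *e = &u->ring[u->idx % RING_SIZE];
    e->id  = head;                 /* (1) fill in the ring element       */
    e->len = len;
    wmb();                         /* order (1) before (2); without this,
                                      POWER may make idx visible before
                                      the element, so the guest reads a
                                      stale id                           */
    u->idx++;                      /* (2) publish to the guest           */
}
```

Without the wmb(), a strongly ordered machine like x86 happens to preserve the store order anyway, which is why the bug only shows up on relaxed-ordering hardware.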

The patch defines wmb() as __sync_synchronize(), a cross platform memory
barrier primitive available in sufficiently recent gcc versions (gcc 4.2
and after?).  If we care about older gccs then this patch will need to
be updated with some sort of fallback.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)
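The fallback the commit message leaves open might look like this sketch (hypothetical: the gcc version check and the per-arch instructions below are illustrative, not part of the patch):

```c
/* Hypothetical fallback for compilers lacking __sync_synchronize();
 * the version check and per-arch asm here are illustrative only. */
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 2))
#define wmb() __sync_synchronize()
#elif defined(__powerpc__)
#define wmb() __asm__ __volatile__("eieio" ::: "memory")
#elif defined(__i386__) || defined(__x86_64__)
#define wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
/* Last resort: at least stop the compiler from reordering. */
#define wmb() __asm__ __volatile__("" ::: "memory")
#endif
```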
Paolo Bonzini - Sept. 1, 2011, 6:47 a.m.
On 09/01/2011 08:09 AM, David Gibson wrote:
> The patch defines wmb() as __sync_synchronize(), a cross platform memory
> barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> and after?).  If we care about older gccs then this patch will need to
> be updated with some sort of fallback.

Yes, though RHEL5's 4.1 has it backported.

> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
> Signed-off-by: David Gibson<david@gibson.dropbear.id.au>

It will most definitely break OpenBSD, but anyway:

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo
Sasha Levin - Sept. 1, 2011, 7:37 a.m.
On Thu, 2011-09-01 at 16:09 +1000, David Gibson wrote:
> The virtio code already has memory barrier wmb() macros in the code.
> However they are was defined as no-ops.  The comment claims that real
> barriers are not necessary because the code does not run concurrent.
> However, with kvm and io-thread enabled, this is not true and this qemu
> code can indeed run concurrently with the guest kernel.  This does not
> cause problems on x86 due to it's strongly ordered storage model, but it
> causes a race leading to virtio errors on POWER which has a relaxed storage
> ordering model.
> 
> Specifically, the QEMU puts new element into the "used" ring and then
> updates the ring free-running counter.  Without a barrier between these
> under the right circumstances, the guest linux driver can receive an
> interrupt, read the counter change but find the ring element to be handled
> still has an old value, leading to an "id %u is not a head!\n" error
> message.
> 
> The problem is easy to reproduce on POWER using virtio-net with heavy
> traffic.
> 
> The patch defines wmb() as __sync_synchronize(), a cross platform memory
> barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> and after?).  If we care about older gccs then this patch will need to
> be updated with some sort of fallback.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio.c |   10 ++--------
>  1 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 13aa0fa..c9f0e75 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -21,14 +21,8 @@
>   * x86 pagesize again. */
>  #define VIRTIO_PCI_VRING_ALIGN         4096
>  
> -/* QEMU doesn't strictly need write barriers since everything runs in
> - * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
> - * KVM or if kqemu gets SMP support.
> - * In any case, we must prevent the compiler from reordering the code.
> - * TODO: we likely need some rmb()/mb() as well.
> - */
> -
> -#define wmb() __asm__ __volatile__("": : :"memory")
> + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
> + #define wmb() __sync_synchronize()

That asm directive also implicitly provided a compiler barrier, I could
find whether __sync_synchronize() provides one as well.

Any idea if it does?

>  
>  typedef struct VRingDesc
>  {
Sasha Levin - Sept. 1, 2011, 7:38 a.m.
On Thu, 2011-09-01 at 10:37 +0300, Sasha Levin wrote:
> On Thu, 2011-09-01 at 16:09 +1000, David Gibson wrote:
> > The virtio code already has memory barrier wmb() macros in the code.
> > However they are was defined as no-ops.  The comment claims that real
> > barriers are not necessary because the code does not run concurrent.
> > However, with kvm and io-thread enabled, this is not true and this qemu
> > code can indeed run concurrently with the guest kernel.  This does not
> > cause problems on x86 due to it's strongly ordered storage model, but it
> > causes a race leading to virtio errors on POWER which has a relaxed storage
> > ordering model.
> > 
> > Specifically, the QEMU puts new element into the "used" ring and then
> > updates the ring free-running counter.  Without a barrier between these
> > under the right circumstances, the guest linux driver can receive an
> > interrupt, read the counter change but find the ring element to be handled
> > still has an old value, leading to an "id %u is not a head!\n" error
> > message.
> > 
> > The problem is easy to reproduce on POWER using virtio-net with heavy
> > traffic.
> > 
> > The patch defines wmb() as __sync_synchronize(), a cross platform memory
> > barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> > and after?).  If we care about older gccs then this patch will need to
> > be updated with some sort of fallback.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio.c |   10 ++--------
> >  1 files changed, 2 insertions(+), 8 deletions(-)
> > 
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index 13aa0fa..c9f0e75 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -21,14 +21,8 @@
> >   * x86 pagesize again. */
> >  #define VIRTIO_PCI_VRING_ALIGN         4096
> >  
> > -/* QEMU doesn't strictly need write barriers since everything runs in
> > - * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
> > - * KVM or if kqemu gets SMP support.
> > - * In any case, we must prevent the compiler from reordering the code.
> > - * TODO: we likely need some rmb()/mb() as well.
> > - */
> > -
> > -#define wmb() __asm__ __volatile__("": : :"memory")
> > + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
> > + #define wmb() __sync_synchronize()
> 
> That asm directive also implicitly provided a compiler barrier, I could

couldn't.

> find whether __sync_synchronize() provides one as well.
> 
> Any idea if it does?
> 
> >  
> >  typedef struct VRingDesc
> >  {
>
Paolo Bonzini - Sept. 1, 2011, 7:56 a.m.
On 09/01/2011 09:37 AM, Sasha Levin wrote:
>> >  -#define wmb() __asm__ __volatile__("": : :"memory")
>> >  + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
>> >  + #define wmb() __sync_synchronize()
>
> That asm directive also implicitly provided a compiler barrier, I could
                      ^^^^

only

> find whether __sync_synchronize() provides one as well.
>
> Any idea if it does?

Of course, __sync_synchronize() is a stronger barrier.  The asm is 
simply not a synchronization primitive.

Paolo
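The distinction Paolo draws can be made concrete with a sketch (which instruction __sync_synchronize() actually emits depends on the target):

```c
/* A pure compiler barrier: forbids the compiler from moving or caching
 * memory accesses across this point, but emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* __sync_synchronize() emits a real hardware fence (e.g. mfence on x86,
 * sync on POWER, membar on SPARC) and also acts as a compiler barrier,
 * since gcc may not reorder memory accesses across it. */
static inline void full_fence(void)
{
    __sync_synchronize();
}
```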
Michael S. Tsirkin - Sept. 1, 2011, 3:30 p.m.
On Thu, Sep 01, 2011 at 04:09:49PM +1000, David Gibson wrote:
> The virtio code already has memory barrier wmb() macros in the code.
> However they are was defined as no-ops.  The comment claims that real
> barriers are not necessary because the code does not run concurrent.
> However, with kvm and io-thread enabled, this is not true and this qemu
> code can indeed run concurrently with the guest kernel.  This does not
> cause problems on x86 due to it's strongly ordered storage model, but it
> causes a race leading to virtio errors on POWER which has a relaxed storage
> ordering model.

Why not limit the change to ppc then?
Paolo Bonzini - Sept. 1, 2011, 4:14 p.m.
On 09/01/2011 05:30 PM, Michael S. Tsirkin wrote:
>> >  The virtio code already has memory barrier wmb() macros in the code.
>> >  However they are was defined as no-ops.  The comment claims that real
>> >  barriers are not necessary because the code does not run concurrent.
>> >  However, with kvm and io-thread enabled, this is not true and this qemu
>> >  code can indeed run concurrently with the guest kernel.  This does not
>> >  cause problems on x86 due to it's strongly ordered storage model, but it
>> >  causes a race leading to virtio errors on POWER which has a relaxed storage
>> >  ordering model.
>
> Why not limit the change to ppc then?

Because the bug is masked by the x86 memory model, but conceptually it 
is still there even on x86.  It is not really true that x86 never needs 
memory barriers, though it doesn't need them in this case:

http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

Paolo
Michael S. Tsirkin - Sept. 1, 2011, 4:34 p.m.
On Thu, Sep 01, 2011 at 06:14:34PM +0200, Paolo Bonzini wrote:
> On 09/01/2011 05:30 PM, Michael S. Tsirkin wrote:
> >>>  The virtio code already has memory barrier wmb() macros in the code.
> >>>  However they are was defined as no-ops.  The comment claims that real
> >>>  barriers are not necessary because the code does not run concurrent.
> >>>  However, with kvm and io-thread enabled, this is not true and this qemu
> >>>  code can indeed run concurrently with the guest kernel.  This does not
> >>>  cause problems on x86 due to it's strongly ordered storage model, but it
> >>>  causes a race leading to virtio errors on POWER which has a relaxed storage
> >>>  ordering model.
> >
> >Why not limit the change to ppc then?
> 
> Because the bug is masked by the x86 memory model, but it is still
> there even there conceptually.  It is not really true that x86 does
> not need memory barriers, though it doesn't in this case:
> 
> http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> 
> Paolo

Right.
To summarize, on x86 we probably want wmb and rmb to be compiler
barrier only. Only mb might in theory need to be an mfence.
But there might be reasons why that is not an issue either
if we look closely enough.
Paolo Bonzini - Sept. 1, 2011, 8:31 p.m.
> > > Why not limit the change to ppc then?
> >
> > Because the bug is masked by the x86 memory model, but it is still
> > there even there conceptually. It is not really true that x86 does
> > not need memory barriers, though it doesn't in this case:
> >
> > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> >
> > Paolo
> 
> Right.
> To summarize, on x86 we probably want wmb and rmb to be compiler
> barrier only. Only mb might in theory need to be an mfence.

No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
not provide those, so they should become __sync_synchronize() too,
or you should use inline assembly.

> But there might be reasons why that is not an issue either
> if we look closely enough.

Since the ring buffers are not using locked instructions (no xchg
or cmpxchg) the barriers simply must be there, even on x86.  Whether
it works in practice is not interesting, only the formal model is
interesting.

Paolo
David Gibson - Sept. 2, 2011, 12:08 a.m.
On Thu, Sep 01, 2011 at 08:47:49AM +0200, Paolo Bonzini wrote:
> On 09/01/2011 08:09 AM, David Gibson wrote:
> >The patch defines wmb() as __sync_synchronize(), a cross platform memory
> >barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> >and after?).  If we care about older gccs then this patch will need to
> >be updated with some sort of fallback.
> 
> Yes, though RHEL5's 4.1 has it backported.
> 
> >Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
> >Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> 
> It will most definitely break OpenBSD, but anyway:

Uh, why?
David Gibson - Sept. 2, 2011, 12:09 a.m.
On Thu, Sep 01, 2011 at 09:56:25AM +0200, Paolo Bonzini wrote:
> On 09/01/2011 09:37 AM, Sasha Levin wrote:
> >>>  -#define wmb() __asm__ __volatile__("": : :"memory")
> >>>  + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
> >>>  + #define wmb() __sync_synchronize()
> >
> >That asm directive also implicitly provided a compiler barrier, I could
>                      ^^^^
> 
> only
> 
> >find whether __sync_synchronize() provides one as well.
> >
> >Any idea if it does?
> 
> Of course, __sync_synchronize() is a stronger barrier.  The asm is
> simply not a synchronization primitive.

Good to hear that confirmed, I hadn't checked myself.  But then a cpu
memory barrier primitive without a compiler barrier would be an insane
thing to have.
David Gibson - Sept. 2, 2011, 12:11 a.m.
On Thu, Sep 01, 2011 at 06:14:34PM +0200, Paolo Bonzini wrote:
> On 09/01/2011 05:30 PM, Michael S. Tsirkin wrote:
> >>>  The virtio code already has memory barrier wmb() macros in the code.
> >>>  However they are was defined as no-ops.  The comment claims that real
> >>>  barriers are not necessary because the code does not run concurrent.
> >>>  However, with kvm and io-thread enabled, this is not true and this qemu
> >>>  code can indeed run concurrently with the guest kernel.  This does not
> >>>  cause problems on x86 due to it's strongly ordered storage model, but it
> >>>  causes a race leading to virtio errors on POWER which has a relaxed storage
> >>>  ordering model.
> >
> >Why not limit the change to ppc then?
> 
> Because the bug is masked by the x86 memory model, but it is still
> there even there conceptually.  It is not really true that x86 does
> not need memory barriers, though it doesn't in this case:
> 
> http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

Not to mention that ppc is not the only non-x86 architecture.  I don't
know all their storage models off hand, but the point is that there is
a required order to these writes, so there should be a memory barrier.
Paolo Bonzini - Sept. 2, 2011, 6:11 a.m.
On 09/02/2011 02:11 AM, David Gibson wrote:
>>> >  >Why not limit the change to ppc then?
>> >
>> >  Because the bug is masked by the x86 memory model, but it is still
>> >  there even there conceptually.  It is not really true that x86 does
>> >  not need memory barriers, though it doesn't in this case:
>> >
>> >  http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> Not to mention that pcc is not the only non-x86 architecture.  I don't
> know all their storage models off hand, but the point is that there is
> a required order to these writes, so there should be a memory barrier.

Indeed, I interpreted Michael's question more as "why not limit the 
change to non-x86".  I think we should cater to all memory models except 
perhaps the Alpha's.

Paolo
Paolo Bonzini - Sept. 2, 2011, 6:49 a.m.
On 09/02/2011 02:08 AM, David Gibson wrote:
>> >
>>> >  >Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>> >  >Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
>> >
>> >  It will most definitely break OpenBSD, but anyway:
> Uh, why?

They use an ancient compiler because they do not want to use GPLv3.  I 
thought it was 4.1.something but actually it is 4.2.1, so it should work.

Paolo
Michael S. Tsirkin - Sept. 2, 2011, 3:45 p.m.
On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > Why not limit the change to ppc then?
> > >
> > > Because the bug is masked by the x86 memory model, but it is still
> > > there even there conceptually. It is not really true that x86 does
> > > not need memory barriers, though it doesn't in this case:
> > >
> > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > >
> > > Paolo
> > 
> > Right.
> > To summarize, on x86 we probably want wmb and rmb to be compiler
> > barrier only. Only mb might in theory need to be an mfence.
> 
> No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> not provide those, so they should become __sync_synchronize() too,
> or you should use inline assembly.
> 
> > But there might be reasons why that is not an issue either
> > if we look closely enough.
> 
> Since the ring buffers are not using locked instructions (no xchg
> or cmpxchg) the barriers simply must be there, even on x86.  Whether
> it works in practice is not interesting, only the formal model is
> interesting.
> 
> Paolo

Well, can you describe an issue in virtio that lfence/sfence help solve
in terms of a memory model please?
Pls note that guest uses smp_ variants for barriers.
Michael S. Tsirkin - Sept. 2, 2011, 3:57 p.m.
On Fri, Sep 02, 2011 at 08:11:29AM +0200, Paolo Bonzini wrote:
> On 09/02/2011 02:11 AM, David Gibson wrote:
> >>>>  >Why not limit the change to ppc then?
> >>>
> >>>  Because the bug is masked by the x86 memory model, but it is still
> >>>  there even there conceptually.  It is not really true that x86 does
> >>>  not need memory barriers, though it doesn't in this case:
> >>>
> >>>  http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> >Not to mention that pcc is not the only non-x86 architecture.  I don't
> >know all their storage models off hand, but the point is that there is
> >a required order to these writes, so there should be a memory barrier.
> 
> Indeed, I interpreted Michael's question more as "why not limit the
> change to non-x86".  I think we should cater to all memory models
> except perhaps the Alpha's.
> 
> Paolo

Yes, x86 is pretty popular.
Blue Swirl - Sept. 3, 2011, 11:53 a.m.
On Fri, Sep 2, 2011 at 6:49 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 09/02/2011 02:08 AM, David Gibson wrote:
>>>
>>> >
>>>>
>>>> >  >Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>> >  >Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
>>>
>>> >
>>> >  It will most definitely break OpenBSD, but anyway:
>>
>> Uh, why?
>
> They use an ancient compiler because they do not want to use GPLv3.  I
> thought it was 4.1.something but actually it is 4.2.1, so it should work.

It works:

$ cat >sync_synch.c
void f(void)
{
  __sync_synchronize();
}
$ gcc -c sync_synch.c -Wall
$ objdump -d sync_synch.o

sync_synch.o:     file format elf64-sparc

Disassembly of section .text:

0000000000000000 <f>:
   0:   9d e3 bf 40     save  %sp, -192, %sp
   4:   81 43 e0 0f     membar  #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
   8:   81 cf e0 08     rett  %i7 + 8
   c:   01 00 00 00     nop
$ gcc -v
Reading specs from
/usr/bin/../lib/gcc-lib/sparc64-unknown-openbsd4.9/4.2.1/specs
Target: sparc64-unknown-openbsd4.9
Configured with: OpenBSD/sparc64 system compiler
Thread model: posix
gcc version 4.2.1 20070719
David Gibson - Sept. 3, 2011, 2:46 p.m.
On Fri, Sep 02, 2011 at 06:45:50PM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > > Why not limit the change to ppc then?
> > > >
> > > > Because the bug is masked by the x86 memory model, but it is still
> > > > there even there conceptually. It is not really true that x86 does
> > > > not need memory barriers, though it doesn't in this case:
> > > >
> > > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > > >
> > > > Paolo
> > > 
> > > Right.
> > > To summarize, on x86 we probably want wmb and rmb to be compiler
> > > barrier only. Only mb might in theory need to be an mfence.
> > 
> > No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> > not provide those, so they should become __sync_synchronize() too,
> > or you should use inline assembly.
> > 
> > > But there might be reasons why that is not an issue either
> > > if we look closely enough.
> > 
> > Since the ring buffers are not using locked instructions (no xchg
> > or cmpxchg) the barriers simply must be there, even on x86.  Whether
> > it works in practice is not interesting, only the formal model is
> > interesting.
> > 
> > Paolo
> 
> Well, can you describe an issue in virtio that lfence/sfence help solve
> in terms of a memory model please?
> Pls note that guest uses smp_ variants for barriers.

Ok, so, I'm having a bit of trouble with the fact that I'm having to
argue the case that things the protocol requires to be memory
barriers actually *be* memory barriers on all platforms.

I mean argue for a richer set of barriers, with per-arch minimal
implementations instead of the large but portable hammer of
sync_synchronize, if you will.  But just leaving them out on x86!?
Seriously, wtf?  Do you enjoy having software that works chiefly by
accident?
Paolo Bonzini - Sept. 3, 2011, 4:19 p.m.
On 09/02/2011 05:45 PM, Michael S. Tsirkin wrote:
> Well, can you describe an issue in virtio that lfence/sfence help solve
> in terms of a memory model please?
> Pls note that guest uses smp_ variants for barriers.

     /* Make sure buffer is written before we update index. */
     wmb();

Without it, a guest could see a partially updated buffer, because the 
buffer and index writes are unlocked stores to different locations.

Even if the guest uses barriers, with ioeventfd it will only order the 
CPU that is running the guest, not the one that is running the iothread. 
In fact I'm surprised that it works at all under x86 with ioeventfd.

Paolo
Michael S. Tsirkin - Sept. 4, 2011, 8:47 a.m.
On Sat, Sep 03, 2011 at 06:19:23PM +0200, Paolo Bonzini wrote:
> On 09/02/2011 05:45 PM, Michael S. Tsirkin wrote:
> >Well, can you describe an issue in virtio that lfence/sfence help solve
> >in terms of a memory model please?
> >Pls note that guest uses smp_ variants for barriers.
> 
>     /* Make sure buffer is written before we update index. */
>     wmb();
> 
> Without it, a guest could see a partially updated buffer, because
> the buffer and index writes are unlocked stores to different
> locations.

Sorry, I still don't understand. Yes, they are unlocked stores to different
locations. How does it follow that a guest could see a partially updated
buffer? Just to make sure - we are talking about x86 here, not ppc,
right?

> Even if the guest uses barriers, with ioeventfd it will only order
> the CPU that is running the guest, not the one that is running the
> iothread.  In fact I'm surprised that it works at all under x86 with
> ioeventfd.
> 
> Paolo

I can try to explain if I understand the problem you see.
Michael S. Tsirkin - Sept. 4, 2011, 9:16 a.m.
On Sun, Sep 04, 2011 at 12:46:35AM +1000, David Gibson wrote:
> On Fri, Sep 02, 2011 at 06:45:50PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > > > Why not limit the change to ppc then?
> > > > >
> > > > > Because the bug is masked by the x86 memory model, but it is still
> > > > > there even there conceptually. It is not really true that x86 does
> > > > > not need memory barriers, though it doesn't in this case:
> > > > >
> > > > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > > > >
> > > > > Paolo
> > > > 
> > > > Right.
> > > > To summarize, on x86 we probably want wmb and rmb to be compiler
> > > > barrier only. Only mb might in theory need to be an mfence.
> > > 
> > > No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> > > not provide those, so they should become __sync_synchronize() too,
> > > or you should use inline assembly.
> > > 
> > > > But there might be reasons why that is not an issue either
> > > > if we look closely enough.
> > > 
> > > Since the ring buffers are not using locked instructions (no xchg
> > > or cmpxchg) the barriers simply must be there, even on x86.  Whether
> > > it works in practice is not interesting, only the formal model is
> > > interesting.
> > > 
> > > Paolo
> > 
> > Well, can you describe an issue in virtio that lfence/sfence help solve
> > in terms of a memory model please?
> > Pls note that guest uses smp_ variants for barriers.
> 
> Ok, so, I'm having a bit of trouble with the fact that I'm having to
> argue the case that things the protocol requiress to be memory
> barriers actually *be* memory barriers on all platforms.
> 
> I mean argue for a richer set of barriers, with per-arch minimal
> implementations instead of the large but portable hammer of
> sync_synchronize, if you will.

That's what I'm saying really. On x86 the richer set of barriers
need not insert code at all for both wmb and rmb macros. All we
might need is an 'optimization barrier'- e.g. linux does
 __asm__ __volatile__("": : :"memory")
ppc needs something like sync_synchronize there.


> But just leaving them out on x86!?
> Seriously, wtf?  Do you enjoy having software that works chiefly by
> accident?

I'm surprised at the controversy too. People seem to argue that
x86 cpu does not reorder stores and that we need an sfence between
stores to prevent the guest from seeing them out of order, at
the same time.


> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson
David Gibson - Sept. 5, 2011, 4:43 a.m.
On Sun, Sep 04, 2011 at 12:16:43PM +0300, Michael S. Tsirkin wrote:
> On Sun, Sep 04, 2011 at 12:46:35AM +1000, David Gibson wrote:
> > On Fri, Sep 02, 2011 at 06:45:50PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > > > > Why not limit the change to ppc then?
> > > > > >
> > > > > > Because the bug is masked by the x86 memory model, but it is still
> > > > > > there even there conceptually. It is not really true that x86 does
> > > > > > not need memory barriers, though it doesn't in this case:
> > > > > >
> > > > > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > > > > >
> > > > > > Paolo
> > > > > 
> > > > > Right.
> > > > > To summarize, on x86 we probably want wmb and rmb to be compiler
> > > > > barrier only. Only mb might in theory need to be an mfence.
> > > > 
> > > > No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> > > > not provide those, so they should become __sync_synchronize() too,
> > > > or you should use inline assembly.
> > > > 
> > > > > But there might be reasons why that is not an issue either
> > > > > if we look closely enough.
> > > > 
> > > > Since the ring buffers are not using locked instructions (no xchg
> > > > or cmpxchg) the barriers simply must be there, even on x86.  Whether
> > > > it works in practice is not interesting, only the formal model is
> > > > interesting.
> > > > 
> > > > Paolo
> > > 
> > > Well, can you describe an issue in virtio that lfence/sfence help solve
> > > in terms of a memory model please?
> > > Pls note that guest uses smp_ variants for barriers.
> > 
> > Ok, so, I'm having a bit of trouble with the fact that I'm having to
> > argue the case that things the protocol requiress to be memory
> > barriers actually *be* memory barriers on all platforms.
> > 
> > I mean argue for a richer set of barriers, with per-arch minimal
> > implementations instead of the large but portable hammer of
> > sync_synchronize, if you will.
> 
> That's what I'm saying really. On x86 the richer set of barriers
> need not insert code at all for both wmb and rmb macros. All we
> might need is an 'optimization barrier'- e.g. linux does
>  __asm__ __volatile__("": : :"memory")
> ppc needs something like sync_synchronize there.

But you're approaching this the wrong way around - correctness should
come first.  That is, we should first ensure that there is a
sufficient memory barrier to satisfy the protocol.  Then, *if* there
is a measurable performance improvement and *if* we can show that a
weaker barrier is sufficient on a given platform, then we can whittle
it down to a lighter barrier.

> > But just leaving them out on x86!?
> > Seriously, wtf?  Do you enjoy having software that works chiefly by
> > accident?
> 
> I'm surprised at the controversy too. People seem to argue that
> x86 cpu does not reorder stores and that we need an sfence between
> stores to prevent the guest from seeing them out of order, at
> the same time.

I don't know the x86 storage model well enough to definitively say
that the barrier is not necessary there - nor to say that it is
necessary.  All I know is that the x86 model is quite strongly
ordered, and I assume that is why the lack of barrier has not caused
an observed problem on x86.

Again, correctness first.  sync_synchronize should be a sufficient
barrier for wmb() on all platforms.  If you really don't want it, the
onus is on you to show that (a) it's safe to do so and (b) it's
actually worth it.
Paolo Bonzini - Sept. 5, 2011, 7:41 a.m.
On 09/04/2011 11:16 AM, Michael S. Tsirkin wrote:
>> >  I mean argue for a richer set of barriers, with per-arch minimal
>> >  implementations instead of the large but portable hammer of
>> >  sync_synchronize, if you will.
>
> That's what I'm saying really. On x86 the richer set of barriers
> need not insert code at all for both wmb and rmb macros. All we
> might need is an 'optimization barrier'- e.g. linux does
>   __asm__ __volatile__("": : :"memory")
> ppc needs something like sync_synchronize there.

No, rmb and wmb need to generate code.  You are right that in some 
places there will be some extra barriers.

If you want a richer set of barriers, that must be something like 
{rr,rw,wr,ww}_mb{_acq,_rel,} (again not counting the Alpha).  On x86, 
then, all the rr/rw/ww barriers will be compiler barriers because the 
hardware already enforces ordering.  The other three map to 
lfence/sfence/mfence:

    barrier     assembly  why?
    ---------------------------------------------------------------------
    wr_mb_acq   lfence    prevents the read from moving up -> acquire
    wr_mb_rel   sfence    prevents the write from moving down -> release
    wr_mb       mfence    (full barrier)

But if you stick to rmb/wmb/mb, then the correct definition of rmb is 
"the least strict barrier that provides all three of rr_mb(), 
rw_mb_rel() and wr_mb_acq()".  This is, as expected, an lfence. 
Similarly, wmb must provide all three of ww_mb(), wr_mb_rel() and 
rw_mb_acq(), and this is an sfence.

So the right place to put an #ifdef is not "wmb()", but the _uses_ of 
wmb() where you know you need a barrier that is less strict.  That's why 
I say David's patch is correct; on top of that you may change the 
particular uses of wmb() in virtio.c to compiler barriers, for example 
when you only care about ordering writes after writes.

Likewise, there may even be places in which you could #ifdef out a full 
memory barrier.  For example, if you only care about ordering writes 
with respect to reads, x86 hardware is already providing that and you 
could omit the mb().

I think in general it is premature optimization, though.
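The x86 column of the mapping above could be spelled out like this (a sketch with hypothetical macro names; on x86 only the wr, i.e. store-load, cases need a real fence):

```c
/* Sketch of the x86 mapping described above; macro names are
 * hypothetical.  rr/ww ordering comes free on x86, so those are
 * compiler barriers; the store-load (wr) variants need real fences. */
#if defined(__i386__) || defined(__x86_64__)
#define rr_mb()     __asm__ __volatile__(""       ::: "memory")
#define ww_mb()     __asm__ __volatile__(""       ::: "memory")
#define wr_mb_acq() __asm__ __volatile__("lfence" ::: "memory")
#define wr_mb_rel() __asm__ __volatile__("sfence" ::: "memory")
#define wr_mb()     __asm__ __volatile__("mfence" ::: "memory")
#else
/* Portable fallback: the full barrier is always sufficient. */
#define rr_mb()     __sync_synchronize()
#define ww_mb()     __sync_synchronize()
#define wr_mb_acq() __sync_synchronize()
#define wr_mb_rel() __sync_synchronize()
#define wr_mb()     __sync_synchronize()
#endif
```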

Regarding specific examples in virtio where lfence and sfence could be 
used, there may be one when using event signaling.  In the backend you 
write first the index of your response, then you check whether to 
generate an event.  (I think) the following requirements hold:

* if you read the event-index too early, you might skip an event and 
deadlock.  So you need at least a read barrier.

* you can write the response-index after reading the event-index, as 
long as you write it before waking up the guest.

So, in that case an x86 lfence should be enough, though again without 
more consideration I would use a full barrier just to be sure.

Paolo
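
The event-signaling ordering Paolo describes can be sketched as follows. The structure, field names, and notification test here are illustrative assumptions, not virtio.c's actual layout or semantics; a conservative full barrier is used, as he suggests:

```c
/* Sketch of the backend's publish-then-check-event ordering discussed
 * above.  Hypothetical ring layout; the comparison below is a stand-in
 * for the real virtio event-index test. */
#include <assert.h>

#define mb() __sync_synchronize()   /* conservative full barrier */

struct ring {
    volatile unsigned used_idx;     /* written by the backend */
    volatile unsigned event_idx;    /* written by the guest */
};

/* Returns nonzero if the guest should be notified for this index. */
static int publish_and_check(struct ring *r, unsigned new_idx)
{
    r->used_idx = new_idx;  /* 1. write the response index first */
    mb();                   /* 2. order the store before the load below;
                             *    reading event_idx too early could skip
                             *    an event and deadlock */
    return r->event_idx >= new_idx;  /* 3. decide whether to signal */
}
```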
Michael S. Tsirkin - Sept. 5, 2011, 8:06 a.m.
On Mon, Sep 05, 2011 at 09:41:19AM +0200, Paolo Bonzini wrote:
> On 09/04/2011 11:16 AM, Michael S. Tsirkin wrote:
> >>>  I mean argue for a richer set of barriers, with per-arch minimal
> >>>  implementations instead of the large but portable hammer of
> >>>  sync_synchronize, if you will.
> >
> >That's what I'm saying really. On x86 the richer set of barriers
> >need not insert code at all for both wmb and rmb macros. All we
> >might need is an 'optimization barrier'- e.g. linux does
> >  __asm__ __volatile__("": : :"memory")
> >ppc needs something like sync_synchronize there.
> 
> No, rmb and wmb need to generate code.

If they do, we'll have to surround each of their uses with
#ifndef x86 as you suggest later, which is just messy.
Maybe we should rename the barriers to smp_rmb/smp_wmb/smp_mb -
this is what linux guests do.

> You are right that in some
> places there will be some extra barriers.
> 
> If you want a richer set of barriers, that must be something like
> {rr,rw,wr,ww}_mb{_acq,_rel,} (again not counting the Alpha).  On
> x86, then, all the rr/rw/ww barriers will be compiler barriers
> because the hardware already enforces ordering.  The other three map
> to lfence/sfence/mfence:
> 
>    barrier     assembly  why?
>    ---------------------------------------------------------------------
>    wr_mb_acq   lfence    prevents the read from moving up -> acquire
>    wr_mb_rel   sfence    prevents the write from moving down -> release
>    wr_mb       mfence    (full barrier)
> 
> But if you stick to rmb/wmb/mb, then the correct definition of rmb
> is "the least strict barrier that provides all three of rr_mb(),
> rw_mb_rel() and wr_mb_acq()".  This is, as expected, an lfence.
> Similarly, wmb must provide all three of ww_mb(), wr_mb_rel() and
> rw_mb_acq(), and this is an sfence.
> 
> So the right place to put an #ifdef is not "wmb()", but the _uses_
> of wmb() where you know you need a barrier that is less strict.
> That's why I say David patch is correct; on top of that you may
> change the particular uses of wmb() in virtio.c to compiler
> barriers, for example when you only care about ordering writes after
> writes.
> Likewise, there may even be places in which you could #ifdef out a
> full memory barrier.  For example, if you only care about ordering
> writes with respect to reads, x86 hardware is already providing that
> and you could omit the mb().
> I think in general it is premature optimization, though.
> 
> Regarding specific examples in virtio where lfence and sfence could
> be used, there may be one when using event signaling.  In the
> backend you write first the index of your response, then you check
> whether to generate an event.  (I think) the following requirements
> hold:
> 
> * if you read the event-index too early, you might skip an event and
> deadlock.  So you need at least a read barrier.
> * you can write the response-index after reading the event-index, as
> long as you write it before waking up the guest.
> 
> So, in that case an x86 lfence should be enough, though again
> without more consideration I would use a full barrier just to be
> sure.
> 
> Paolo

lfence in this place will not affect the relative order of read versus
write.
Michael S. Tsirkin - Sept. 5, 2011, 9:19 a.m.
On Mon, Sep 05, 2011 at 02:43:16PM +1000, David Gibson wrote:
> On Sun, Sep 04, 2011 at 12:16:43PM +0300, Michael S. Tsirkin wrote:
> > On Sun, Sep 04, 2011 at 12:46:35AM +1000, David Gibson wrote:
> > > On Fri, Sep 02, 2011 at 06:45:50PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > > > > > Why not limit the change to ppc then?
> > > > > > >
> > > > > > > Because the bug is masked by the x86 memory model, but it is still
> > > > > > > there even there conceptually. It is not really true that x86 does
> > > > > > > not need memory barriers, though it doesn't in this case:
> > > > > > >
> > > > > > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > > > > > >
> > > > > > > Paolo
> > > > > > 
> > > > > > Right.
> > > > > > To summarize, on x86 we probably want wmb and rmb to be compiler
> > > > > > barrier only. Only mb might in theory need to be an mfence.
> > > > > 
> > > > > No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> > > > > not provide those, so they should become __sync_synchronize() too,
> > > > > or you should use inline assembly.
> > > > > 
> > > > > > But there might be reasons why that is not an issue either
> > > > > > if we look closely enough.
> > > > > 
> > > > > Since the ring buffers are not using locked instructions (no xchg
> > > > > or cmpxchg) the barriers simply must be there, even on x86.  Whether
> > > > > it works in practice is not interesting, only the formal model is
> > > > > interesting.
> > > > > 
> > > > > Paolo
> > > > 
> > > > Well, can you describe an issue in virtio that lfence/sfence help solve
> > > > in terms of a memory model please?
> > > > Pls note that guest uses smp_ variants for barriers.
> > > 
> > > Ok, so, I'm having a bit of trouble with the fact that I'm having to
> > > argue the case that things the protocol requires to be memory
> > > barriers actually *be* memory barriers on all platforms.
> > > 
> > > I mean argue for a richer set of barriers, with per-arch minimal
> > > implementations instead of the large but portable hammer of
> > > sync_synchronize, if you will.
> > 
> > That's what I'm saying really. On x86 the richer set of barriers
> > need not insert code at all for both wmb and rmb macros. All we
> > might need is an 'optimization barrier'- e.g. linux does
> >  __asm__ __volatile__("": : :"memory")
> > ppc needs something like sync_synchronize there.
> 
> But you're approaching this the wrong way around - correctness should
> come first.  That is, we should first ensure that there is a
> sufficient memory barrier to satisfy the protocol.  Then, *if* there
> is a measurable performance improvement and *if* we can show that a
> weaker barrier is sufficient on a given platform, then we can whittle
> it down to a lighter barrier.

You are only looking at ppc. But on x86 this code ships in
production. So changes should be made in a way that reduces
the potential for regressions, balancing risk versus potential benefit.
I'm trying to point out a way to do this.

> > > But just leaving them out on x86!?
> > > Seriously, wtf?  Do you enjoy having software that works chiefly by
> > > accident?
> > 
> > I'm surprised at the controversy too. People seem to argue that
> > x86 cpu does not reorder stores and that we need an sfence between
> > stores to prevent the guest from seeing them out of order, at
> > the same time.
> 
> I don't know the x86 storage model well enough to definitively say
> that the barrier is not necessary there - nor to say that it is
> necessary.  All I know is that the x86 model is quite strongly
> ordered, and I assume that is why the lack of barrier has not caused
> an observed problem on x86.

Please review Documentation/memory-barriers.txt as one reference,
then look at how SMP barriers are implemented on various systems.
In particular, note how it says 'Mandatory barriers should not be used
to control SMP effects'.

> Again, correctness first.  sync_synchronize should be a sufficient
> barrier for wmb() on all platforms.  If you really don't want it, the
> onus is on you

Just for fun, I did a quick hack replacing all barriers with mb()
in the userspace virtio test. This is on x86.

Before:
[mst@tuck virtio]$ sudo time ./virtio_test 
spurious wakeus: 0x1da
24.53user 14.63system 0:41.91elapsed 93%CPU (0avgtext+0avgdata
464maxresident)k
0inputs+0outputs (0major+154minor)pagefaults 0swaps

After:
[mst@tuck virtio]$ sudo time ./virtio_test 
spurious wakeus: 0x218
33.97user 6.22system 0:42.10elapsed 95%CPU (0avgtext+0avgdata
464maxresident)k
0inputs+0outputs (0major+154minor)pagefaults 0swaps

So user time went up significantly, as expected. Surprisingly, the kernel
side started working more efficiently even though the
kernel was not changed, with the net effect close to evening out.

So a risk of performance regressions from unnecessary fencing
seems to be non-zero, assuming that time doesn't lie.
This might be worth investigating, but I'm out of time right now.


> to show that (a) it's safe to do so and
> (b) it's actually worth it.

Worth what? I'm asking you to minimise disruption to other platforms
while you fix ppc.
Paolo Bonzini - Sept. 5, 2011, 9:42 a.m.
> > No, rmb and wmb need to generate code.
> 
> If they do we'll have to surround each their use with
> ifndef x86 as you suggest later. Which is just messy.

[1 hour later]

I see what you mean now.  You assume there are no accesses to
write-combining memory (of course) or non-temporal load/stores
(because they are accessed only with assembly), so you can
make rmb/wmb no-ops on x86.  I was confused by the kernel's
(and liburcu's) choice to use lfence/sfence for rmb/wmb.

Then it's indeed better to move the wmb() defines to
qemu-barrier.h, where they can be made processor-dependent.
S390, it seems, also does not need rmb/wmb.

Paolo
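
Processor-dependent definitions along the lines Paolo suggests for qemu-barrier.h might look like the following sketch. The macro names mirror Linux's smp_* convention as Michael proposed; the architecture list and definitions are assumptions, not QEMU's actual code:

```c
/* Sketch of processor-dependent barriers as suggested above for
 * qemu-barrier.h.  Illustrative only. */
#include <assert.h>

#if defined(__i386__) || defined(__x86_64__) || defined(__s390__)
/* Strongly ordered hardware: plain loads/stores are already ordered,
 * so only the compiler needs restraining (assuming no write-combining
 * memory or non-temporal accesses). */
#define smp_wmb() __asm__ __volatile__("" : : : "memory")
#define smp_rmb() __asm__ __volatile__("" : : : "memory")
#else
/* Weakly ordered (e.g. ppc): emit a real fence. */
#define smp_wmb() __sync_synchronize()
#define smp_rmb() __sync_synchronize()
#endif

/* Example use: publish a used-ring element, then bump its index. */
static void vring_push(volatile unsigned *elem, volatile unsigned *idx,
                       unsigned val)
{
    *elem = val;
    smp_wmb();      /* element must be visible before the index update */
    (*idx)++;
}
```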
David Gibson - Sept. 6, 2011, 3:12 a.m.
On Mon, Sep 05, 2011 at 12:19:46PM +0300, Michael S. Tsirkin wrote:
> On Mon, Sep 05, 2011 at 02:43:16PM +1000, David Gibson wrote:
> > On Sun, Sep 04, 2011 at 12:16:43PM +0300, Michael S. Tsirkin wrote:
> > > On Sun, Sep 04, 2011 at 12:46:35AM +1000, David Gibson wrote:
> > > > On Fri, Sep 02, 2011 at 06:45:50PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 01, 2011 at 04:31:09PM -0400, Paolo Bonzini wrote:
> > > > > > > > > Why not limit the change to ppc then?
> > > > > > > >
> > > > > > > > Because the bug is masked by the x86 memory model, but it is still
> > > > > > > > there even there conceptually. It is not really true that x86 does
> > > > > > > > not need memory barriers, though it doesn't in this case:
> > > > > > > >
> > > > > > > > http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > > > > > > >
> > > > > > > > Paolo
> > > > > > > 
> > > > > > > Right.
> > > > > > > To summarize, on x86 we probably want wmb and rmb to be compiler
> > > > > > > barrier only. Only mb might in theory need to be an mfence.
> > > > > > 
> > > > > > No, wmb needs to be sfence and rmb needs to be lfence.  GCC does
> > > > > > not provide those, so they should become __sync_synchronize() too,
> > > > > > or you should use inline assembly.
> > > > > > 
> > > > > > > But there might be reasons why that is not an issue either
> > > > > > > if we look closely enough.
> > > > > > 
> > > > > > Since the ring buffers are not using locked instructions (no xchg
> > > > > > or cmpxchg) the barriers simply must be there, even on x86.  Whether
> > > > > > it works in practice is not interesting, only the formal model is
> > > > > > interesting.
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Well, can you describe an issue in virtio that lfence/sfence help solve
> > > > > in terms of a memory model please?
> > > > > Pls note that guest uses smp_ variants for barriers.
> > > > 
> > > > Ok, so, I'm having a bit of trouble with the fact that I'm having to
> > > > argue the case that things the protocol requires to be memory
> > > > barriers actually *be* memory barriers on all platforms.
> > > > 
> > > > I mean argue for a richer set of barriers, with per-arch minimal
> > > > implementations instead of the large but portable hammer of
> > > > sync_synchronize, if you will.
> > > 
> > > That's what I'm saying really. On x86 the richer set of barriers
> > > need not insert code at all for both wmb and rmb macros. All we
> > > might need is an 'optimization barrier'- e.g. linux does
> > >  __asm__ __volatile__("": : :"memory")
> > > ppc needs something like sync_synchronize there.
> > 
> > But you're approaching this the wrong way around - correctness should
> > come first.  That is, we should first ensure that there is a
> > sufficient memory barrier to satisfy the protocol.  Then, *if* there
> > is a measurable performance improvement and *if* we can show that a
> > weaker barrier is sufficient on a given platform, then we can whittle
> > it down to a lighter barrier.
> 
> You are only looking at ppc. But on x86 this code ships in
> production. So changes should be made in a way to reduce
> a potential for regressions, balancing risk versus potential benefit.
> I'm trying to point out a way to do this.

Oh, please.  Adding a stronger barrier has a minuscule chance of
breaking things.  And this in a project that has build-breaking
regressions with tedious frequency.

> > > > But just leaving them out on x86!?
> > > > Seriously, wtf?  Do you enjoy having software that works chiefly by
> > > > accident?
> > > 
> > > I'm surprised at the controversy too. People seem to argue that
> > > x86 cpu does not reorder stores and that we need an sfence between
> > > stores to prevent the guest from seeing them out of order, at
> > > the same time.
> > 
> > I don't know the x86 storage model well enough to definitively say
> > that the barrier is not necessary there - nor to say that it is
> > necessary.  All I know is that the x86 model is quite strongly
> > ordered, and I assume that is why the lack of barrier has not caused
> > an observed problem on x86.
> 
> Please review Documentation/memory-barriers.txt as one reference.
> then look at how SMP barriers are implemented at various systems.
> In particular, note how it says 'Mandatory barriers should not be used
> to control SMP effects'.

No, again, correctness first; the onus of showing it's safe is on
those who want weaker barriers.

> > Again, correctness first.  sync_synchronize should be a sufficient
> > barrier for wmb() on all platforms.  If you really don't want it, the
> > onus is on you
> 
> Just for fun, I did a quick hack replacing all barriers with mb()
> in the userspace virtio test. This is on x386.
> 
> Before:
> [mst@tuck virtio]$ sudo time ./virtio_test 
> spurious wakeus: 0x1da
> 24.53user 14.63system 0:41.91elapsed 93%CPU (0avgtext+0avgdata
> 464maxresident)k
> 0inputs+0outputs (0major+154minor)pagefaults 0swaps
> 
> After:
> [mst@tuck virtio]$ sudo time ./virtio_test 
> spurious wakeus: 0x218
> 33.97user 6.22system 0:42.10elapsed 95%CPU (0avgtext+0avgdata
> 464maxresident)k
> 0inputs+0outputs (0major+154minor)pagefaults 0swaps
> 
> So user time went up significantly, as expected. Surprisingly the kernel
> side started working more efficiently - surprising since
> kernel was not changed - with net effect close to evening out.

Right.  So small overall performance impact, and that's on a dedicated
testcase which does nothing but the virtio protocol.  I *strongly*
suspect the extra cost of the memory barriers will be well and truly
lost in the rest of the overhead of the qemu networking code.

> So a risk of performance regressions from unnecessary fencing
> seems to be non-zero, assuming that time doesn't lie.
> This might be worth investigating, but I'm out of time right now.
> 
> 
> > to show that (a) it's safe to do so and
> > (b) it's actually worth it.
> 
> Worth what? I'm asking you to minimuse disruption to other platforms
> while you fix ppc.

I'm not "fixing ppc".  I'm fixing a fundamental flaw in the protocol
implementation.  _So far_ I've only observed the effects on ppc, but
that doesn't mean they don't exist.
Paolo Bonzini - Sept. 6, 2011, 6:55 a.m.
On 09/06/2011 05:12 AM, David Gibson wrote:
> I'm not "fixing ppc".  I'm fixing a fundamental flaw in the protocol
> implementation._So far_  I've only observed the effects on ppc, but
> that doesn't mean they don't exist.

Actually Michael is right.  The implementation is correct on x86, though 
wrong anywhere else (perhaps s390?).  On those architectures you do not 
need rmb() and wmb().

So doing the __sync_synchronize() on ppc only as he proposed is wrong; 
but not doing it (and leaving only a compiler barrier) on x86 is 
correct.  See http://g.oswego.edu/dl/jmm/cookbook.html under 
"Multiprocessors".

Paolo
David Gibson - Sept. 6, 2011, 9:02 a.m.
On Tue, Sep 06, 2011 at 08:55:42AM +0200, Paolo Bonzini wrote:
> On 09/06/2011 05:12 AM, David Gibson wrote:
> >I'm not "fixing ppc".  I'm fixing a fundamental flaw in the protocol
> >implementation._So far_  I've only observed the effects on ppc, but
> >that doesn't mean they don't exist.
> 
> Actually Michael is right.  The implementation is correct on x86,
> though wrong anywhere else (perhaps s390?).  On those architectures
> you do not need rmb() and wmb().
> 
> So doing the __sync_synchronize() on ppc only as he proposed is
> wrong; but not doing it (and leaving only a compiler barrier) on x86
> is correct.  See http://g.oswego.edu/dl/jmm/cookbook.html under
> "Multiprocessors".

Well, in any case, I've realised that we ought to merge the barriers
in virtio with those in qemu-barrier.h, which are also unsafe on
non-x86.

I have a draft two patch series to do that, which I intend to post
soon.  Unfortunately, I haven't been able to test it on ppc kvm yet,
because there are other bugs which are making qemu not even boot the
pseries machine on a ppc64 host.  Still trying to work out what the
hell's going on there, smells like bad extension or masking of CTR
(SLOF seems to get stuck in a bdnz loop with something close to 2^57
in CTR).
Avi Kivity - Sept. 6, 2011, 9:28 a.m.
On 09/06/2011 09:55 AM, Paolo Bonzini wrote:
> On 09/06/2011 05:12 AM, David Gibson wrote:
>> I'm not "fixing ppc".  I'm fixing a fundamental flaw in the protocol
>> implementation._So far_  I've only observed the effects on ppc, but
>> that doesn't mean they don't exist.
>
> Actually Michael is right.  The implementation is correct on x86, 
> though wrong anywhere else (perhaps s390?).  On those architectures 
> you do not need rmb() and wmb().

Are we sure?  Nothing prevents the guest from using weakly-ordered 
writes, is there?  For example, using MOVNTDQ.

Although in that case the guest is probably required to issue an SFENCE.
Michael S. Tsirkin - Sept. 6, 2011, 9:35 a.m.
On Tue, Sep 06, 2011 at 12:28:40PM +0300, Avi Kivity wrote:
> On 09/06/2011 09:55 AM, Paolo Bonzini wrote:
> >On 09/06/2011 05:12 AM, David Gibson wrote:
> >>I'm not "fixing ppc".  I'm fixing a fundamental flaw in the protocol
> >>implementation._So far_  I've only observed the effects on ppc, but
> >>that doesn't mean they don't exist.
> >
> >Actually Michael is right.  The implementation is correct on x86,
> >though wrong anywhere else (perhaps s390?).  On those
> >architectures you do not need rmb() and wmb().
> 
> Are we sure?  Nothing prevents the guest from using weakly-ordered
> writes, is there?  For example, using MOVNTDQ.

Yes, but the macros in question are here to order qemu accesses,
not guest accesses.
> 
> Although in that case the guest is probably required to issue an SFENCE.
> -- 
> error compiling committee.c: too many arguments to function
Paolo Bonzini - Sept. 6, 2011, 9:38 a.m.
On 09/06/2011 11:28 AM, Avi Kivity wrote:
>>
>> Actually Michael is right.  The implementation is correct on x86,
>> though wrong anywhere else (perhaps s390?).  On those architectures
>> you do not need rmb() and wmb().
>
> Are we sure?  Nothing prevents the guest from using weakly-ordered
> writes, is there?  For example, using MOVNTDQ.
>
> Although in that case the guest is probably required to issue an SFENCE.

Yes, that's the guest problem.  You cannot do an SFENCE on the guest's 
behalf anyway.

Paolo

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index 13aa0fa..c9f0e75 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -21,14 +21,8 @@ 
  * x86 pagesize again. */
 #define VIRTIO_PCI_VRING_ALIGN         4096
 
-/* QEMU doesn't strictly need write barriers since everything runs in
- * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
- * KVM or if kqemu gets SMP support.
- * In any case, we must prevent the compiler from reordering the code.
- * TODO: we likely need some rmb()/mb() as well.
- */
-
-#define wmb() __asm__ __volatile__("": : :"memory")
+ /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
+ #define wmb() __sync_synchronize()
 
 typedef struct VRingDesc
 {
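
The cover letter notes that __sync_synchronize() requires a sufficiently recent GCC and that older compilers would need a fallback. A guarded definition could look like the sketch below; the version cutoff (4.2) and the per-arch fallbacks are assumptions, not tested boundaries:

```c
/* Sketch of a wmb() with a fallback for compilers lacking
 * __sync_synchronize(), as the cover letter suggests.  The GCC 4.2
 * cutoff is an assumption. */
#include <assert.h>

#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 2))
#define wmb() __sync_synchronize()
#elif defined(__powerpc__) || defined(__powerpc64__)
#define wmb() __asm__ __volatile__("sync" : : : "memory")
#elif defined(__i386__) || defined(__x86_64__)
#define wmb() __asm__ __volatile__("sfence" : : : "memory")
#else
#error "no wmb() implementation for this compiler/architecture"
#endif

/* Smoke test: a store followed by the barrier is still observable. */
static int ordered_store(volatile int *p)
{
    *p = 1;
    wmb();
    return *p;
}
```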