
[PATCHv2,3/3] virtio: order index/descriptor reads

Message ID 6d0b21bf59b3998868b7b88bb7a39a47e9660e18.1335186822.git.mst@redhat.com
State New

Commit Message

Michael S. Tsirkin April 23, 2012, 1:19 p.m. UTC
virtio has the equivalent of:

	if (vq->last_avail_index != vring_avail_idx(vq)) {
		read descriptor head at vq->last_avail_index;
	}

In theory, the processor can reorder the descriptor head
read to happen speculatively before the index read.
This would trigger the following race:

	host descriptor head read <- reads invalid head from ring
		guest writes valid descriptor head
		guest writes avail index
	host avail index read <- observes valid index

As a result, the host will use an invalid head value.
I have not observed this in the field, but after
the experience with the previous two races
I think it is prudent to address this theoretical race condition.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio.c    |    5 +++++
 qemu-barrier.h |    7 ++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)
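
For illustration, the ordering the patch enforces on the host read path looks roughly like this; the struct and helper below are simplified stand-ins for this discussion, not QEMU's actual VirtQueue code:

#include <stdint.h>

/* Portable fallback full barrier, as in qemu-barrier.h. */
#define smp_rmb() __sync_synchronize()

/* Simplified stand-in for the avail ring shared with the guest. */
struct avail_ring {
    volatile uint16_t idx;        /* written last by the guest */
    volatile uint16_t ring[256];  /* descriptor heads, written first */
};

/* Returns -1 if no new head is available, otherwise the next head. */
static int pop_head(struct avail_ring *avail, uint16_t *last_avail_idx)
{
    if (*last_avail_idx == avail->idx) {
        return -1;                /* index read saw nothing new */
    }
    /* Without this, the head read below could be satisfied speculatively
     * before the index read above and return a stale slot. */
    smp_rmb();
    return avail->ring[(*last_avail_idx)++ % 256];
}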

Comments

Paolo Bonzini April 24, 2012, 1:48 p.m. UTC | #1
On 23/04/2012 15:19, Michael S. Tsirkin wrote:
> virtio has the equivalent of:
> 
> 	if (vq->last_avail_index != vring_avail_idx(vq)) {
> 		read descriptor head at vq->last_avail_index;
> 	}
> 
> In theory, the processor can reorder the descriptor head
> read to happen speculatively before the index read.
> This would trigger the following race:
> 
> 	host descriptor head read <- reads invalid head from ring
> 		guest writes valid descriptor head
> 		guest writes avail index
> 	host avail index read <- observes valid index
> 
> As a result, the host will use an invalid head value.
> I have not observed this in the field, but after
> the experience with the previous two races
> I think it is prudent to address this theoretical race condition.

I think your fix is right, but it is not needed on x86.  lfence is only
required for weakly-ordered memory types, so says Intel, and AMD is even
clearer: "Loads from differing memory types may
be performed out of order, in particular between WC/WC+ and other memory
types".

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  hw/virtio.c    |    5 +++++
>  qemu-barrier.h |    7 ++++++-
>  2 files changed, 11 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/virtio.c b/hw/virtio.c
> index def0bf1..c081e1b 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -287,6 +287,11 @@ static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
>                       idx, vring_avail_idx(vq));
>          exit(1);
>      }
> +    /* On success, callers read a descriptor at vq->last_avail_idx.
> +     * Make sure descriptor read does not bypass avail index read. */
> +    if (num_heads) {
> +        smp_rmb();
> +    }
>  
>      return num_heads;
>  }
> diff --git a/qemu-barrier.h b/qemu-barrier.h
> index f6722a8..39aa0b0 100644
> --- a/qemu-barrier.h
> +++ b/qemu-barrier.h
> @@ -24,10 +24,13 @@
>  #define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
>  #endif
>  
> +#define smp_rmb() smp_mb()
> +
>  #elif defined(__x86_64__)
>  
>  #define smp_wmb()   barrier()
>  #define smp_mb() asm volatile("mfence" ::: "memory")
> +#define smp_rmb() asm volatile("lfence" ::: "memory")
>  
>  #elif defined(_ARCH_PPC)
>  
> @@ -38,6 +41,7 @@
>   */
>  #define smp_wmb()   asm volatile("eieio" ::: "memory")
>  #define smp_mb()   asm volatile("eieio" ::: "memory")
> +#define smp_rmb()   asm volatile("eieio" ::: "memory")

rmb() is lwsync on PPC, not eieio.

I would be grateful if, instead of fixing the qemu-barrier.h parts of
the patches, you picked up the (sole) patch in the atomics branch of
git://github.com/bonzini/qemu.git.  The constructs there are more
complete than what we have in qemu-barrier.h, and the memory barriers
are based on http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
(written by people who know much more than me :)).

Paolo

>  
>  #else
>  
> @@ -45,10 +49,11 @@
>   * For (host) platforms we don't have explicit barrier definitions
>   * for, we use the gcc __sync_synchronize() primitive to generate a
>   * full barrier.  This should be safe on all platforms, though it may
> - * be overkill for wmb().
> + * be overkill for wmb() and rmb().
>   */
>  #define smp_wmb()   __sync_synchronize()
>  #define smp_mb()   __sync_synchronize()
> +#define smp_rmb()   __sync_synchronize()
>  
>  #endif
>
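
For reference, the PPC correction Paolo points out would amount to something like the following in qemu-barrier.h (a sketch only; older 32-bit parts that lack lwsync would need plain sync instead):

#if defined(_ARCH_PPC)
/* lwsync orders load-load, load-store and store-store accesses to
 * cacheable memory, which covers what smp_rmb() needs; eieio is not a
 * read barrier. */
#define smp_rmb()   asm volatile("lwsync" ::: "memory")
#endif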
Michael S. Tsirkin April 24, 2012, 2:38 p.m. UTC | #2
On Tue, Apr 24, 2012 at 03:48:27PM +0200, Paolo Bonzini wrote:
> On 23/04/2012 15:19, Michael S. Tsirkin wrote:
> > virtio has the equivalent of:
> > 
> > 	if (vq->last_avail_index != vring_avail_idx(vq)) {
> > 		read descriptor head at vq->last_avail_index;
> > 	}
> > 
> > In theory, the processor can reorder the descriptor head
> > read to happen speculatively before the index read.
> > This would trigger the following race:
> > 
> > 	host descriptor head read <- reads invalid head from ring
> > 		guest writes valid descriptor head
> > 		guest writes avail index
> > 	host avail index read <- observes valid index
> > 
> > As a result, the host will use an invalid head value.
> > I have not observed this in the field, but after
> > the experience with the previous two races
> > I think it is prudent to address this theoretical race condition.
> 
> I think your fix is right, but it is not needed on x86.  lfence is only
> required for weakly-ordered memory types, so says Intel,


I see this in spec:

The LFENCE instruction establishes a memory fence for loads. It
guarantees ordering between two loads and prevents speculative loads
from passing the load fence (that is, no speculative loads are allowed
until all loads specified before the load fence have been carried out).

Speculative load is exactly what we are handling here.

> and AMD is even
> clearer: "Loads from differing memory types may
> be performed out of order, in particular between WC/WC+ and other memory
> types".

This does not say loads from same memory type are ordered.
Does it say so anywhere else? Anyway, worth special-casing AMD?


> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  hw/virtio.c    |    5 +++++
> >  qemu-barrier.h |    7 ++++++-
> >  2 files changed, 11 insertions(+), 1 deletions(-)
> > 
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index def0bf1..c081e1b 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -287,6 +287,11 @@ static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
> >                       idx, vring_avail_idx(vq));
> >          exit(1);
> >      }
> > +    /* On success, callers read a descriptor at vq->last_avail_idx.
> > +     * Make sure descriptor read does not bypass avail index read. */
> > +    if (num_heads) {
> > +        smp_rmb();
> > +    }
> >  
> >      return num_heads;
> >  }
> > diff --git a/qemu-barrier.h b/qemu-barrier.h
> > index f6722a8..39aa0b0 100644
> > --- a/qemu-barrier.h
> > +++ b/qemu-barrier.h
> > @@ -24,10 +24,13 @@
> >  #define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
> >  #endif
> >  
> > +#define smp_rmb() smp_mb()
> > +
> >  #elif defined(__x86_64__)
> >  
> >  #define smp_wmb()   barrier()
> >  #define smp_mb() asm volatile("mfence" ::: "memory")
> > +#define smp_rmb() asm volatile("lfence" ::: "memory")
> >  
> >  #elif defined(_ARCH_PPC)
> >  
> > @@ -38,6 +41,7 @@
> >   */
> >  #define smp_wmb()   asm volatile("eieio" ::: "memory")
> >  #define smp_mb()   asm volatile("eieio" ::: "memory")
> > +#define smp_rmb()   asm volatile("eieio" ::: "memory")
> 
> rmb() is lwsync on PPC, not eieio.
> 
> I would be grateful if, instead of fixing the qemu-barrier.h parts of
> the patches, you picked up the (sole) patch in the atomics branch of
> git://github.com/bonzini/qemu.git.  The constructs there are more
> complete than what we have in qemu-barrier.h,

Sorry this is just a bugfix in virtio, don't see a reason to make
it depend on a wholesale rework of atomics.
We can switch to qemu/atomics when it is ready.

> and the memory barriers
> are based on http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> (written by people who know much more than me :)).
> 
> Paolo
> >  
> >  #else
> >  
> > @@ -45,10 +49,11 @@
> >   * For (host) platforms we don't have explicit barrier definitions
> >   * for, we use the gcc __sync_synchronize() primitive to generate a
> >   * full barrier.  This should be safe on all platforms, though it may
> > - * be overkill for wmb().
> > + * be overkill for wmb() and rmb().
> >   */
> >  #define smp_wmb()   __sync_synchronize()
> >  #define smp_mb()   __sync_synchronize()
> > +#define smp_rmb()   __sync_synchronize()
> >  
> >  #endif
> >
Paolo Bonzini April 24, 2012, 3:01 p.m. UTC | #3
On 24/04/2012 16:38, Michael S. Tsirkin wrote:
>>> I think it is prudent to address this theoretical race condition.
>>
>> I think your fix is right, but it is not needed on x86.  lfence is only
>> required for weakly-ordered memory types, so says Intel,
> 
> I see this in spec:
> 
> The LFENCE instruction establishes a memory fence for loads. It
> guarantees ordering between two loads and prevents speculative loads
> from passing the load fence (that is, no speculative loads are allowed
> until all loads specified before the load fence have been carried out).
> 
> Speculative load is exactly what we are handling here.

In my copy of the Intel manual "6.2.2 Memory Ordering in P6 and More
Recent Processor Families" says very explicitly "Reads are not reordered
with other reads".

At least this is the illusion that the processor gives; that is,
speculative loads are discarded if another processor writes to the same
location.

But it's not just speculative loads, it can also be out-of-order
execution as AMD says in "7.1.1 Read Ordering".  For cacheable memory
types "Out-of-order reads are allowed to the extent that they can be
performed transparently to software, such that the appearance of
in-order execution is maintained. Out-of-order reads can occur as a
result of out-of-order instruction execution or speculative execution.
The processor can read memory and perform cache refills out-of-order to
allow out-of-order execution to proceed".

>> and AMD is even clearer: "Loads from differing memory types may be
>> performed out of order, in particular between WC/WC+ and other
>> memory types".
> 
> This does not say loads from same memory type are ordered.

That would really be a bad trick for AMD to play.  Luckily they say that
in 7.1.1, as cited above.

>> I would be grateful if, instead of fixing the qemu-barrier.h parts of
>> the patches, you picked up the (sole) patch in the atomics branch of
>> git://github.com/bonzini/qemu.git.  The constructs there are more
>> complete than what we have in qemu-barrier.h,
> 
> Sorry this is just a bugfix in virtio, don't see a reason to make
> it depend on a wholesale rework of atomics.

The reason is that your fixes didn't work on PPC, and were suboptimal on
x86.

Paolo
Michael S. Tsirkin April 24, 2012, 3:11 p.m. UTC | #4
On Tue, Apr 24, 2012 at 05:01:07PM +0200, Paolo Bonzini wrote:
> On 24/04/2012 16:38, Michael S. Tsirkin wrote:
> >>> I think it is prudent to address this theoretical race condition.
> >>
> >> I think your fix is right, but it is not needed on x86.  lfence is only
> >> required for weakly-ordered memory types, so says Intel,
> > 
> > I see this in spec:
> > 
> > The LFENCE instruction establishes a memory fence for loads. It
> > guarantees ordering between two loads and prevents speculative loads
> > from passing the load fence (that is, no speculative loads are allowed
> > until all loads specified before the load fence have been carried out).
> > 
> > Speculative load is exactly what we are handling here.
> 
> In my copy of the Intel manual "6.2.2 Memory Ordering in P6 and More
> Recent Processor Families" says very explicitly "Reads are not reordered
> with other reads".
> 
> At least this is the illusion that the processor gives; that is,
> speculative loads are discarded if another processor writes to the same
> location.

In my copy it's 8.2.2. Anyway - this explicitly talks about a single processor system.

> 
> But it's not just speculative loads, it can also be out-of-order
> execution as AMD says in "7.1.1 Read Ordering".  For cacheable memory
> types "Out-of-order reads are allowed to the extent that they can be
> performed transparently to software, such that the appearance of
> in-order execution is maintained. Out-of-order reads can occur as a
> result of out-of-order instruction execution or speculative execution.
> The processor can read memory and perform cache refills out-of-order to
> allow out-of-order execution to proceed".
> >> and AMD is even clearer: "Loads from differing memory types may be
> >> performed out of order, in particular between WC/WC+ and other
> >> memory types".
> > 
> > This does not say loads from same memory type are ordered.
> 
> That would really be a bad trick for AMD to play.  Luckily they say that
> in 7.1.1, as cited above.

Have not looked but I think they talk about a single processor here as well.
lfence is for controlling smp effects.

> >> I would be grateful if, instead of fixing the qemu-barrier.h parts of
> >> the patches, you picked up the (sole) patch in the atomics branch of
> >> git://github.com/bonzini/qemu.git.  The constructs there are more
> >> complete than what we have in qemu-barrier.h,
> > 
> > Sorry this is just a bugfix in virtio, don't see a reason to make
> > it depend on a wholesale rework of atomics.
> 
> The reason is that your fixes didn't work on PPC, and were suboptimal on
> x86.
> 
> Paolo

I'll fix PPC but I'll stick to the barriers the way Linux implements
them. The pairing rules for these are well documented, so we
just need to stick to the rules.

Let's argue about C11 atomics in some other thread.
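
The pairing rule in question is the usual one from the Linux barrier documentation: the guest's smp_wmb() between writing a descriptor head and bumping the avail index pairs with the host's smp_rmb() between reading the index and reading the head. A sketch of the guest-side producer, with illustrative names rather than actual driver code:

#include <stdint.h>

#define smp_wmb() __sync_synchronize()   /* portable fallback, as in qemu-barrier.h */

/* Simplified stand-in for the avail ring shared with the host. */
struct avail_ring {
    volatile uint16_t idx;
    volatile uint16_t ring[256];
};

/* Guest-side producer: publish the head first, then expose it via the
 * index.  The smp_wmb() here pairs with the host's smp_rmb() between its
 * index read and head read. */
static void push_head(struct avail_ring *avail, uint16_t *next_idx, uint16_t head)
{
    avail->ring[*next_idx % 256] = head;  /* 1. write the descriptor head */
    smp_wmb();                            /* order it before the index write */
    avail->idx = ++*next_idx;             /* 2. publish the new index */
}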
Paolo Bonzini April 24, 2012, 3:40 p.m. UTC | #5
On 24/04/2012 17:11, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2012 at 05:01:07PM +0200, Paolo Bonzini wrote:
>> On 24/04/2012 16:38, Michael S. Tsirkin wrote:
>>>>> I think it is prudent to address this theoretical race condition.
>>>>
>>>> I think your fix is right, but it is not needed on x86.  lfence is only
>>>> required for weakly-ordered memory types, so says Intel,
>>>
>>> I see this in spec:
>>>
>>> The LFENCE instruction establishes a memory fence for loads. It
>>> guarantees ordering between two loads and prevents speculative loads
>>> from passing the load fence (that is, no speculative loads are allowed
>>> until all loads specified before the load fence have been carried out).
>>>
>>> Speculative load is exactly what we are handling here.
>>
>> In my copy of the Intel manual "6.2.2 Memory Ordering in P6 and More
>> Recent Processor Families" says very explicitly "Reads are not reordered
>> with other reads".
>>
>> At least this is the illusion that the processor gives; that is,
>> speculative loads are discarded if another processor writes to the same
>> location.
> 
> In my copy it's 8.2.2. Anyway - this explicitly talks about a single processor system.

Not really: "Individual processors use the same ordering principles as
in a single-processor system".  You were talking of a speculative load
from processor A passing an earlier load from the same processor.

>> But it's not just speculative loads, it can also be out-of-order
>> execution as AMD says in "7.1.1 Read Ordering".  For cacheable memory
>> types "Out-of-order reads are allowed to the extent that they can be
>> performed transparently to software, such that the appearance of
>> in-order execution is maintained. Out-of-order reads can occur as a
>> result of out-of-order instruction execution or speculative execution.
>> The processor can read memory and perform cache refills out-of-order to
>> allow out-of-order execution to proceed".
>>>> and AMD is even clearer: "Loads from differing memory types may be
>>>> performed out of order, in particular between WC/WC+ and other
>>>> memory types".
>>>
>>> This does not say loads from same memory type are ordered.
>>
>> That would really be a bad trick for AMD to play.  Luckily they say that
>> in 7.1.1, as cited above.
> 
> Have not looked but I think they talk about a single processor here as well.
> lfence is for controlling smp effects.

They don't mention LFENCE in the part on SMP, because read-read ordering
is guaranteed.

They only mention LFENCE in "7.4.1 Memory Barrier Interaction with
Memory Types": "Memory types other than WB may allow weaker ordering in
certain respects. When the ordering of memory accesses to differing
memory types must be strictly enforced, software can use the LFENCE,
MFENCE or SFENCE barrier instructions to force loads and stores to
proceed in program order.  Table 7-3 on page 173 summarizes the cases
where a memory barrier must be inserted between two memory operations".

And table 7-3 says:

* A load (wp, wt, wb, wc, wc+) may pass a previous non-conflicting store
(wp, wt, wb, wc,wc+, nt)

* A load (wc, wc+) may pass a previous load (wp, wt, wb, wc, wc+). To
ensure memory order, an LFENCE instruction must be inserted between the
two loads.

Intel docs are worse, but they are really saying the same thing.
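
To make the distinction concrete, the reordering x86 does permit on ordinary WB memory is a later load passing an earlier store, which is why smp_mb() still needs mfence or a locked operation even though a read barrier can be a no-op. The classic illustration, as a sketch rather than QEMU code:

/* Dekker-style store->load reordering, permitted on x86 WB memory:
 * each CPU's load may complete before its own store is globally
 * visible, so r1 == r2 == 0 is a possible outcome unless a full
 * barrier sits between the store and the load on both sides. */
static volatile int flag_a, flag_b;

static void cpu_a(int *r1)
{
    flag_a = 1;
    /* smp_mb() needed here to rule out r1 == 0 && r2 == 0 */
    *r1 = flag_b;
}

static void cpu_b(int *r2)
{
    flag_b = 1;
    /* smp_mb() needed here as well */
    *r2 = flag_a;
}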

>>>> I would be grateful if, instead of fixing the qemu-barrier.h parts of
>>>> the patches, you picked up the (sole) patch in the atomics branch of
>>>> git://github.com/bonzini/qemu.git.  The constructs there are more
>>>> complete than what we have in qemu-barrier.h,
>>>
>>> Sorry this is just a bugfix in virtio, don't see a reason to make
>>> it depend on a wholesale rework of atomics.
>>
>> The reason is that your fixes didn't work on PPC, and were suboptimal on
>> x86
> 
> I'll fix PPC but I'll stick to the barriers the way Linux implements
> them. The pairing rules for these are well documented, so we
> just need to stick to the rules.

Sure, and smp_rmb() *is* a no-op on Linux:

#ifdef CONFIG_SMP
#define smp_mb()        mb()
#ifdef CONFIG_X86_PPRO_FENCE
# define smp_rmb()      rmb()        <-- this is an lfence on x86_64
#else
# define smp_rmb()      barrier()    <-- this is not
#endif
#ifdef CONFIG_X86_OOSTORE
# define smp_wmb()      wmb()
#else
# define smp_wmb()      barrier()
#endif
#endif

config X86_PPRO_FENCE
   ---help---
   Old PentiumPro multiprocessor systems had errata that could cause
   memory operations to violate the x86 ordering standard in rare cases.
   Enabling this option will attempt to work around some (but not all)
   occurrences of this problem, at the cost of much heavier spinlock and
   memory barrier operations.

   If unsure, say n here. Even distro kernels should think twice before
   enabling this: there are few systems, and an unlikely bug.

> Let's argue about C11 atomics in some other thread.

How are C11 atomics relevant?  The branch I pointed to is implementing
Linux kernel semantics.

Paolo
Michael S. Tsirkin April 24, 2012, 4:08 p.m. UTC | #6
On Tue, Apr 24, 2012 at 05:40:07PM +0200, Paolo Bonzini wrote:
> >>>> I would be grateful if, instead of fixing the qemu-barrier.h parts of
> >>>> the patches, you picked up the (sole) patch in the atomics branch of
> >>>> git://github.com/bonzini/qemu.git.  The constructs there are more
> >>>> complete than what we have in qemu-barrier.h,
> >>>
> >>> Sorry this is just a bugfix in virtio, don't see a reason to make
> >>> it depend on a wholesale rework of atomics.
> >>
> >> The reason is that your fixes didn't work on PPC, and were suboptimal on
> >> x86
> > 
> > I'll fix PPC but I'll stick to the barriers the way Linux implements
> > them. The pairing rules for these are well documented, so we
> > just need to stick to the rules.
> 
> Sure, and smp_rmb() *is* a no-op on Linux:
> 
> #ifdef CONFIG_SMP
> #define smp_mb()        mb()
> #ifdef CONFIG_X86_PPRO_FENCE
> # define smp_rmb()      rmb()        <-- this is an lfence on x86_64
> #else
> # define smp_rmb()      barrier()    <-- this is not
> #endif
> #ifdef CONFIG_X86_OOSTORE
> # define smp_wmb()      wmb()
> #else
> # define smp_wmb()      barrier()
> #endif
> #endif

Hmm, you are right. I'll make it a compiler barrier and add a comment
similar to wmb on x86 explaining that we don't use non-temporals.
Thanks for clarifying this.
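
The revision described here would look roughly like the following sketch (not the final committed hunk):

#define barrier()   asm volatile("" ::: "memory")   /* as in qemu-barrier.h */

#if defined(__i386__) || defined(__x86_64__)
/*
 * x86 keeps ordinary (WB) loads ordered with respect to each other, so a
 * read barrier only has to constrain the compiler; lfence would matter
 * only for weakly-ordered memory types or non-temporal loads, which we
 * don't use on guest memory.  Same reasoning as for smp_wmb().
 */
#define smp_rmb()   barrier()
#endif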
Paolo Bonzini April 24, 2012, 4:15 p.m. UTC | #7
On 24/04/2012 18:08, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2012 at 05:40:07PM +0200, Paolo Bonzini wrote:
>>>>>> I would be grateful if, instead of fixing the qemu-barrier.h parts of
>>>>>> the patches, you picked up the (sole) patch in the atomics branch of
>>>>>> git://github.com/bonzini/qemu.git.  The constructs there are more
>>>>>> complete than what we have in qemu-barrier.h,
>>>>>
>>>>> Sorry this is just a bugfix in virtio, don't see a reason to make
>>>>> it depend on a wholesale rework of atomics.
>>>>
>>>> The reason is that your fixes didn't work on PPC, and were suboptimal on
>>>> x86
>>>
>>> I'll fix PPC but I'll stick to the barriers the way Linux implements
>>> them. The pairing rules for these are well documented, so we
>>> just need to stick to the rules.
>>
>> Sure, and smp_rmb() *is* a no-op on Linux:
>>
>> #ifdef CONFIG_SMP
>> #define smp_mb()        mb()
>> #ifdef CONFIG_X86_PPRO_FENCE
>> # define smp_rmb()      rmb()        <-- this is an lfence on x86_64
>> #else
>> # define smp_rmb()      barrier()    <-- this is not
>> #endif
>> #ifdef CONFIG_X86_OOSTORE
>> # define smp_wmb()      wmb()
>> #else
>> # define smp_wmb()      barrier()
>> #endif
>> #endif
> 
> Hmm, you are right. I'll make it a compiler barrier and add a comment
> similar to wmb on x86 explaining that we don't use non-temporals.
> Thanks for clarifying this.

No problem. :)

If you search the qemu-devel archives, you can find me saying very wrong
things on memory barriers.  When I realized that, I did my homework, and
the homework was the atomics patch.

BTW, one of the authors of the C11 atomics stuff is Paul McKenney, so
there is some cross-pollination between C and Linux atomics.

Paolo

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index def0bf1..c081e1b 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -287,6 +287,11 @@  static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
                      idx, vring_avail_idx(vq));
         exit(1);
     }
+    /* On success, callers read a descriptor at vq->last_avail_idx.
+     * Make sure descriptor read does not bypass avail index read. */
+    if (num_heads) {
+        smp_rmb();
+    }
 
     return num_heads;
 }
diff --git a/qemu-barrier.h b/qemu-barrier.h
index f6722a8..39aa0b0 100644
--- a/qemu-barrier.h
+++ b/qemu-barrier.h
@@ -24,10 +24,13 @@ 
 #define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
 #endif
 
+#define smp_rmb() smp_mb()
+
 #elif defined(__x86_64__)
 
 #define smp_wmb()   barrier()
 #define smp_mb() asm volatile("mfence" ::: "memory")
+#define smp_rmb() asm volatile("lfence" ::: "memory")
 
 #elif defined(_ARCH_PPC)
 
@@ -38,6 +41,7 @@ 
  */
 #define smp_wmb()   asm volatile("eieio" ::: "memory")
 #define smp_mb()   asm volatile("eieio" ::: "memory")
+#define smp_rmb()   asm volatile("eieio" ::: "memory")
 
 #else
 
@@ -45,10 +49,11 @@ 
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill for wmb().
+ * be overkill for wmb() and rmb().
  */
 #define smp_wmb()   __sync_synchronize()
 #define smp_mb()   __sync_synchronize()
+#define smp_rmb()   __sync_synchronize()
 
 #endif