Patchwork virtio: add missing mb() on notification

login
register
mail settings
Submitter Michael S. Tsirkin
Date April 22, 2012, 2:54 p.m.
Message ID <20120422145447.GA10299@redhat.com>
Download mbox | patch
Permalink /patch/154285/
State New
Headers show

Comments

Michael S. Tsirkin - April 22, 2012, 2:54 p.m.
During normal operation, virtio host first writes a used index
and then checks whether it should interrupt the guest
by reading guest avail flag/used event index values.
Guest does the reverse: writes the index/flag,
then checks the used ring.

The ordering is important: if host avail flag read bypasses the used
index write, we could in effect get this timing:

host avail flag/used event index read
		guest enable interrupts: this performs
			 avail flag/used event index write
		guest check used ring: ring is empty
host used index write

This timing results in a lost interrupt: guest will never be
notified about the used ring update.

This has actually been observed in the field, when using qemu-kvm such
that the guest vcpu and qemu io run on different host cpus,
but only seems to trigger on some specific hardware,
and only with userspace virtio: vhost has the necessary smp_mb() in
place to prevent the reordering, so the same workload stalls forever
waiting for an interrupt with vhost=off but works fine with vhost=on.

Insert an smp_mb() barrier operation in userspace virtio to
ensure the correct ordering.
Applying this patch fixed the race condition we have observed.
Tested on x86_64. I checked the code generated by the new macro
for i386 and ppc but didn't run virtio.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio.c    |    2 ++
 qemu-barrier.h |   23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)
Stefan Hajnoczi - April 23, 2012, 10:14 a.m.
On Sun, Apr 22, 2012 at 3:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> During normal operation, virtio host first writes a used index
> and then checks whether it should interrupt the guest
> by reading guest avail flag/used event index values.
> Guest does the reverse: writes the index/flag,
> then checks the used ring.
>
> The ordering is important: if host avail flag read bypasses the used
> index write, we could in effect get this timing:
>
> host avail flag/used event index read
>                guest enable interrupts: this performs
>                         avail flag/used event index write
>                guest check used ring: ring is empty
> host used index write
>
> This timing results in a lost interrupt: guest will never be
> notified about the used ring update.
>
> This has actually been observed in the field, when using qemu-kvm such
> that the guest vcpu and qemu io run on different host cpus,
> but only seems to trigger on some specific hardware,
> and only with userspace virtio: vhost has the necessary smp_mb() in
> place to prevent the reordering, so the same workload stalls forever
> waiting for an interrupt with vhost=off but works fine with vhost=on.
>
> Insert an smp_mb() barrier operation in userspace virtio to
> ensure the correct ordering.
> Applying this patch fixed the race condition we have observed.
> Tested on x86_64. I checked the code generated by the new macro
> for i386 and ppc but didn't run virtio.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  hw/virtio.c    |    2 ++
>  qemu-barrier.h |   23 ++++++++++++++++++++---
>  2 files changed, 22 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index f805790..6449746 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -693,6 +693,8 @@  static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
     bool v;
+    /* We need to expose used array entries before checking used event. */
+    smp_mb();
     /* Always notify when queue is empty (when feature acknowledge) */
     if (((vdev->guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY)) &&
          !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx)) {
diff --git a/qemu-barrier.h b/qemu-barrier.h
index c11bb2b..f6722a8 100644
--- a/qemu-barrier.h
+++ b/qemu-barrier.h
@@ -4,7 +4,7 @@ 
 /* Compiler barrier */
 #define barrier()   asm volatile("" ::: "memory")
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__)
 
 /*
  * Because of the strongly ordered x86 storage model, wmb() is a nop
@@ -13,15 +13,31 @@ 
  * load/stores from C code.
  */
 #define smp_wmb()   barrier()
+/*
+ * We use GCC builtin if it's available, as that can use
+ * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
+ * However, on i386, there seem to be known bugs as recently as 4.3.
+ * */
+#if defined(_GNUC__) && __GNUC__ >= 4 && __GNUC_MINOR__ >= 4
+#define smp_mb() __sync_synchronize()
+#else
+#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#endif
+
+#elif defined(__x86_64__)
+
+#define smp_wmb()   barrier()
+#define smp_mb() asm volatile("mfence" ::: "memory")
 
 #elif defined(_ARCH_PPC)
 
 /*
- * We use an eieio() for a wmb() on powerpc.  This assumes we don't
+ * We use an eieio() for wmb() and mb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
  * each other
  */
 #define smp_wmb()   asm volatile("eieio" ::: "memory")
+#define smp_mb()   asm volatile("eieio" ::: "memory")
 
 #else
 
@@ -29,9 +45,10 @@ 
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill.
+ * be overkill for wmb().
  */
 #define smp_wmb()   __sync_synchronize()
+#define smp_mb()   __sync_synchronize()
 
 #endif