
[1/1] virtio-blk: fix race on guest notifiers

Message ID 20170302130422.81380-1-pasic@linux.vnet.ibm.com
State New

Commit Message

Halil Pasic March 2, 2017, 1:04 p.m. UTC
Commits 03de2f527 "virtio-blk: do not use vring in dataplane" and
9ffe337c08 "virtio-blk: always use dataplane path if ioeventfd is active"
substantially changed how guest notifications are done for virtio-blk.
Due to a race condition, interrupts are lost when the irqfd behind the
guest notifier is torn down after notify_guest_bh has been scheduled but
before it actually runs.

Let's fix this by forcing guest notifications before cleaning up the
irqfds. Let's also add some explanatory comments.

Cc: qemu-stable@nongnu.org
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reported-by: Michael A. Tebolt <miket@us.ibm.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
---

This patch supersedes my previous attempt to fix the same issue, which
was posted under the title "virtio: fallback from irqfd to non-irqfd
notify".

Tested briefly without an iothread and overnight with an iothread, but
more testing would not hurt.

@Paolo: Could you please check the comments? They reflect my
understanding, which may be very wrong.
---
 hw/block/dataplane/virtio-blk.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
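
As a generic illustration of the class of race described in the commit
message (plain POSIX eventfd, not QEMU code, and all names are made up):
a deferred notification is queued, but the fd it would signal is torn
down before the deferred work runs, so the event is silently lost. The
patch below avoids this by flushing the deferred notify before teardown.

/* Sketch of the race class only, not QEMU code. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

static int guest_notifier = -1;   /* stand-in for the irqfd */
static int notify_pending;        /* stand-in for the scheduled BH */

static void notify_guest(void)
{
    uint64_t one = 1;

    if (notify_pending) {
        notify_pending = 0;
        if (write(guest_notifier, &one, sizeof(one)) != sizeof(one)) {
            printf("notification lost\n");
        }
    }
}

int main(void)
{
    guest_notifier = eventfd(0, 0);

    notify_pending = 1;           /* the notify BH has been scheduled ... */
    close(guest_notifier);        /* ... but the irqfd is torn down first */
    guest_notifier = -1;

    notify_guest();               /* the deferred notify has nowhere to go */
    return 0;
}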

Comments

Cornelia Huck March 2, 2017, 2:49 p.m. UTC | #1
On Thu,  2 Mar 2017 14:04:22 +0100
Halil Pasic <pasic@linux.vnet.ibm.com> wrote:

> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 5556f0e..13dd14d 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>      }
> 
> -    /* Drain and switch bs back to the QEMU main loop */
> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
> +     * device will not submit (nor comple) any requests until dataplane

s/comple/complete/

> +     * starts again.
> +     */
>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
> 
> +    /* Notify guest before the guest notifiers get cleaned up */
> +    qemu_bh_cancel(s->bh);
> +    notify_guest_bh(s);
> +

Hm... does virtio-scsi dataplane need a similar treatment? Or am I
missing something?

>      aio_context_release(s->ctx);
> 
>      for (i = 0; i < nvqs; i++) {
Paolo Bonzini March 2, 2017, 3:32 p.m. UTC | #2
On 02/03/2017 15:49, Cornelia Huck wrote:
> On Thu,  2 Mar 2017 14:04:22 +0100
> Halil Pasic <pasic@linux.vnet.ibm.com> wrote:
> 
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 5556f0e..13dd14d 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>>      }
>>
>> -    /* Drain and switch bs back to the QEMU main loop */
>> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
>> +     * device will not submit (nor comple) any requests until dataplane
> 
> s/comple/complete/
> 
>> +     * starts again.
>> +     */
>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>>
>> +    /* Notify guest before the guest notifiers get cleaned up */
>> +    qemu_bh_cancel(s->bh);
>> +    notify_guest_bh(s);
>> +
> 
> Hm... does virtio-scsi dataplane need a similar treatment? Or am I
> missing something?

No, the BH optimization is specific to virtio-blk.  Thanks for the patch
Halil!

Paolo

>>      aio_context_release(s->ctx);
>>
>>      for (i = 0; i < nvqs; i++) {
>
Halil Pasic March 2, 2017, 3:55 p.m. UTC | #3
On 03/02/2017 04:32 PM, Paolo Bonzini wrote:
> 
> 
> On 02/03/2017 15:49, Cornelia Huck wrote:
>> On Thu,  2 Mar 2017 14:04:22 +0100
>> Halil Pasic <pasic@linux.vnet.ibm.com> wrote:
>>
>>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>>> index 5556f0e..13dd14d 100644
>>> --- a/hw/block/dataplane/virtio-blk.c
>>> +++ b/hw/block/dataplane/virtio-blk.c
>>> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>>>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>>>      }
>>>
>>> -    /* Drain and switch bs back to the QEMU main loop */
>>> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
>>> +     * device will not submit (nor comple) any requests until dataplane
>>
>> s/comple/complete/
>>

Will fix.

>>> +     * starts again.
>>> +     */
>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());

I'm wondering if synchronization is needed for batch_notify_vqs. I think
the set_bit can happen in the iothread, but the notify_guest_bh below runs
in the main event loop. Is it OK like this (could we miss bits set in the
other thread)? Does blk_set_aio_context include a barrier?

>>>
>>> +    /* Notify guest before the guest notifiers get cleaned up */
>>> +    qemu_bh_cancel(s->bh);
>>> +    notify_guest_bh(s);
>>> +
>>
>> Hm... does virtio-scsi dataplane need a similar treatment? Or am I
>> missing something?
> 
> No, the BH optimization is specific to virtio-blk.  Thanks for the patch
> Halil!
> 

You are welcome ;)!

Regards,
Halil
Paolo Bonzini March 2, 2017, 4:21 p.m. UTC | #4
On 02/03/2017 16:55, Halil Pasic wrote:
>>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
> I'm wondering if synchronization is needed for batch_notify_vqs. I think
> the set_bit can happen in the iothread, but the notify_guest_bh below runs
> in the main event loop. Is it OK like this (could we miss bits set in the
> other thread)? Does blk_set_aio_context include a barrier?

Another good question---yes, it does (via bdrv_set_aio_context's call to
aio_context_acquire).

But it's something to remember for when I get round to removing
aio_context_acquire!

Paolo
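
To illustrate the barrier argument above, here is a minimal pthread sketch
(not QEMU code; ctx_lock and batch_bits are made-up stand-ins for the
AioContext lock and batch_notify_vqs): whoever acquires the lock after the
iothread released it is guaranteed to see the bits set while the iothread
held it.

/* Minimal sketch, not QEMU code: the mutex release in the "iothread" and
 * the later mutex acquire in the "main loop" order the bitmap accesses,
 * which is the role aio_context_acquire plays inside bdrv_set_aio_context.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* AioContext lock stand-in */
static unsigned long batch_bits;                             /* batch_notify_vqs stand-in */

static void *iothread_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ctx_lock);
    batch_bits |= 1UL << 0;          /* the set_bit done in the iothread */
    pthread_mutex_unlock(&ctx_lock); /* the unlock publishes the store */
    return NULL;
}

int main(void)
{
    pthread_t iothread;

    pthread_create(&iothread, NULL, iothread_fn, NULL);
    pthread_join(iothread, NULL);

    /* acquire after the iothread's release, cf. aio_context_acquire */
    pthread_mutex_lock(&ctx_lock);
    printf("bits seen by the main loop: %lx\n", batch_bits);
    pthread_mutex_unlock(&ctx_lock);
    return 0;
}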
Halil Pasic March 3, 2017, 7:43 p.m. UTC | #5
On 03/02/2017 05:21 PM, Paolo Bonzini wrote:
> 
> 
> On 02/03/2017 16:55, Halil Pasic wrote:
>>>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>> I'm wondering if synchronization is needed for batch_notify_vqs. I think
>> the set_bit can happen in the iothread, but the notify_guest_bh below runs
>> in the main event loop. Is it OK like this (could we miss bits set in the
>> other thread)? Does blk_set_aio_context include a barrier?
> 
> Another good question---yes, it does (via bdrv_set_aio_context's call to
> aio_context_acquire).
> 
> But it's something to remember for when I get round to removing
> aio_context_acquire!

Uh, this is complicated. I'm not out of questions, but I fear taking too
much of your precious time. I will ask again nevertheless, but please
just cut the conversation with -EBUSY if it gets too expensive.

What I understand is that batch_notify_vqs is effectively protected
by blk_get_aio_context(s->conf.conf.blk)->lock, which might or might not
be the same as qemu_get_aio_context()->lock. If it isn't the same, that
means we are running with an iothread. Now let us have a look at the
drain. AFAIU, in

#define BDRV_POLL_WHILE(bs, cond) ({                       \
    bool waited_ = false;                                  \
    BlockDriverState *bs_ = (bs);                          \
    AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
    if (aio_context_in_iothread(ctx_)) {                   \
        while ((cond)) {                                   \
            aio_poll(ctx_, true);                          \
            waited_ = true;                                \
        }                                                  \
    } else {                                               \
        assert(qemu_get_current_aio_context() ==           \
               qemu_get_aio_context());                    \
        /* Ask bdrv_dec_in_flight to wake up the main      \
         * QEMU AioContext.  Extra I/O threads never take  \
         * other I/O threads' AioContexts (see for example \
         * block_job_defer_to_main_loop for how to do it). \
         */                                                \
        assert(!bs_->wakeup);                              \
        bs_->wakeup = true;                                \
        while ((cond)) {                                   \
            aio_context_release(ctx_);                     \
            aio_poll(qemu_get_aio_context(), true);        \
            aio_context_acquire(ctx_);                     \
            waited_ = true;                                \
        }                                                  \
        bs_->wakeup = false;                               \
    }                                                      \
    waited_; })

where it gets interesting (I am running with 2 iothreads, one assigned to
my block device), we are going to take the "else" branch, and will end up
releasing the ctx belonging to the iothread and then acquiring it again,
basically waiting until the requests are done.

Since virtio_blk_rw_complete acquires and releases the same ctx, this
makes a lot of sense. If we did not release the ctx, we would end up
with a deadlock.


What I do not entirely understand is the atomic_read/write on
bs->in_flight, which tells us when we are done waiting. In

static void blk_aio_complete(BlkAioEmAIOCB *acb)
{
    if (acb->has_returned) {
        bdrv_dec_in_flight(acb->common.bs);
        acb->common.cb(acb->common.opaque, acb->rwco.ret);
        qemu_aio_unref(acb);
    }
}

the atomic decrement happens before the callback, that is, before
virtio_blk_rw_complete is called. My guess here is that this does not
matter, because we are holding the ctx anyway, so it is ensured that we
will not observe 0 until the last virtio_blk_rw_complete has completed
(the mutex is recursive). But then I do not understand what is the point
of acquiring the mutex in virtio_blk_rw_complete?

TL;DR

Is patch b9e413dd required for the correctness of this patch?

What is the role of the aio_context_acquire/release introduced by
b9e413dd in virtio_blk_rw_complete?

Regards,
Halil
Paolo Bonzini March 6, 2017, 2:55 p.m. UTC | #6
On 03/03/2017 20:43, Halil Pasic wrote:
> Uh, this is complicated. I'm not out of questions, but I fear taking too
> much of your precious time. I will ask again nevertheless, but please
> just cut the conversation with -EBUSY if it gets too expensive.

It's the opposite!  I need other people to look at it, understand it and
poke holes in my own reasoning.

>         bs_->wakeup = true;                                \
>         while ((cond)) {                                   \
>             aio_context_release(ctx_);                     \
>             aio_poll(qemu_get_aio_context(), true);        \
>             aio_context_acquire(ctx_);                     \
>             waited_ = true;                                \
>         }                                                  \
>         bs_->wakeup = false;                               \
>     }                                                      \
>     waited_; })
> 
> where it gets interesting (I am running with 2 iothreads, one assigned to
> my block device), we are going to take the "else" branch, and will end up
> releasing the ctx belonging to the iothread and then acquiring it again,
> basically waiting until the requests are done.
> 
> Since virtio_blk_rw_complete acquires and releases the same ctx, this
> makes a lot of sense. If we did not release the ctx, we would end up
> with a deadlock.

Related to this, note that the above release/wait/acquire logic is
itself new to 2.8.  Before, QEMU would run aio_poll(other_aio_context)
directly in the main thread.  This relied on recursive mutexes and
special callbacks to pass the lock between the I/O and main threads.

This worked but it hid some thread-unsafe idioms, so I removed this in
the commits ending at 65c1b5b ("iothread: release AioContext around
aio_poll", 2016-10-28).  BDRV_POLL_WHILE provides an abstraction that
will not change once aio_context_acquire/release disappears in favor of
fine-grained locks.

> But then I do not understand what is the point of acquiring
> the mutex in virtio_blk_rw_complete?

When QEMU does I/O in the main thread, it can call
BlockBackend/BlockDriverState functions.  Even though their completion
is processed in the I/O thread (via BDRV_POLL_WHILE), you still need a
lock to handle mutual exclusion between the two.

In the case of virtio_blk_rw_complete the mutex is needed because, for
example, block_acct_done is not thread safe yet.

> Is patch b9e413dd required for the correctness of this patch?
> What is the role of the aio_context_acquire/release introduced by
> b9e413dd in virtio_blk_rw_complete?

Patch b9e413dd should have no semantic change.  The acquire/release that
are added are all balanced by removed acquire/releases elsewhere, for
example in block/linux-aio.c.

Paolo
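
For illustration, a rough sketch of the locking pattern described above
(paraphrased, not the exact hunk from b9e413dd; example_rw_complete is a
made-up name): the completion path takes the BlockBackend's AioContext
lock around work such as accounting that is not thread safe on its own,
so it behaves the same whether it runs in the main loop or an iothread.

/* Sketch only, paraphrasing the pattern rather than quoting the hunk. */
static void example_rw_complete(BlockBackend *blk, BlockAcctCookie *acct)
{
    AioContext *ctx = blk_get_aio_context(blk);

    aio_context_acquire(ctx);
    /* e.g. accounting, which is not thread safe on its own */
    block_acct_done(blk_get_stats(blk), acct);
    /* ... complete the request and raise the guest notification ... */
    aio_context_release(ctx);
}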
Halil Pasic March 6, 2017, 10:27 p.m. UTC | #7
On 03/06/2017 03:55 PM, Paolo Bonzini wrote:
> 
> On 03/03/2017 20:43, Halil Pasic wrote:
>> Uh, this is complicated. I'm not out of questions, but I fear taking too
>> much of your precious time. I will ask again nevertheless, but please
>> just cut the conversation with -EBUSY if it gets too expensive.
> It's the opposite!  I need other people to look at it, understand it and
> poke holes in my own reasoning.
> 

Thanks for the explanation. I intend to send out a v2 (with
the typo fixed) tomorrow.

Regards,
Halil

Patch

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 5556f0e..13dd14d 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -258,9 +258,16 @@  void virtio_blk_data_plane_stop(VirtIODevice *vdev)
         virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
     }
 
-    /* Drain and switch bs back to the QEMU main loop */
+    /* Drain and switch bs back to the QEMU main loop. After drain, the
+     * device will not submit (nor comple) any requests until dataplane
+     * starts again.
+     */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
 
+    /* Notify guest before the guest notifiers get cleaned up */
+    qemu_bh_cancel(s->bh);
+    notify_guest_bh(s);
+
     aio_context_release(s->ctx);
 
     for (i = 0; i < nvqs; i++) {