Message ID: 152750903916.663961.9369851345277129751.stgit@bahia.lan
State: New
Series: [v4] block: fix QEMU crash with scsi-hd and drive_del
On 28.05.2018 at 14:03, Greg Kurz wrote:
> Removing a drive with drive_del while it is being used to run an I/O
> intensive workload can cause QEMU to crash.
>
> An AIO flush can yield at some point:
>
> blk_aio_flush_entry()
>  blk_co_flush(blk)
>   bdrv_co_flush(blk->root->bs)
>    ...
>     qemu_coroutine_yield()
>
> and let the HMP command to run, free blk->root and give control
> back to the AIO flush:
>
> hmp_drive_del()
>  blk_remove_bs()
>   bdrv_root_unref_child(blk->root)
>    child_bs = blk->root->bs
>    bdrv_detach_child(blk->root)
>     bdrv_replace_child(blk->root, NULL)
>      blk->root->bs = NULL
>    g_free(blk->root) <============== blk->root becomes stale
>   bdrv_unref(child_bs)
>    bdrv_delete(child_bs)
>     bdrv_close()
>      bdrv_drained_begin()
>       bdrv_do_drained_begin()
>        bdrv_drain_recurse()
>         aio_poll()
>          ...
>          qemu_coroutine_switch()
>
> and the AIO flush completion ends up dereferencing blk->root:
>
> blk_aio_complete()
>  scsi_aio_complete()
>   blk_get_aio_context(blk)
>    bs = blk_bs(blk)
> ie, bs = blk->root ? blk->root->bs : NULL
>          ^^^^^^^^^
>            stale
>
> The problem is that we should avoid making block driver graph
> changes while we have in-flight requests. Let's drain all I/O
> for this BB before calling bdrv_root_unref_child().
>
> Signed-off-by: Greg Kurz <groug@kaod.org>

Thanks, applied to the block branch.

Kevin
On 28.05.2018 at 14:03, Greg Kurz wrote:
> Removing a drive with drive_del while it is being used to run an I/O
> intensive workload can cause QEMU to crash.
[...]
> Signed-off-by: Greg Kurz <groug@kaod.org>

Hmm... It sounded convincing, but 'make check-tests/test-replication'
fails now. The good news is that with the drain fixes, for which I sent
v2 today, it passes, so instead of staging it in my block branch, I'll
put it at the end of my branch for the drain fixes.

Might take a bit longer than planned until it's in master, sorry.

Kevin
On Tue, 29 May 2018 22:19:17 +0200
Kevin Wolf <kwolf@redhat.com> wrote:

> On 28.05.2018 at 14:03, Greg Kurz wrote:
> > Removing a drive with drive_del while it is being used to run an I/O
> > intensive workload can cause QEMU to crash.
[...]
> > Signed-off-by: Greg Kurz <groug@kaod.org>
>
> Hmm... It sounded convincing, but 'make check-tests/test-replication'
> fails now. The good news is that with the drain fixes, for which I sent
> v2 today, it passes, so instead of staging it in my block branch, I'll
> put it at the end of my branch for the drain fixes.
>
> Might take a bit longer than planned until it's in master, sorry.
>
> Kevin

Works for me :) Thanks !

--
Greg
On 2018-05-28 14:03, Greg Kurz wrote:
> Removing a drive with drive_del while it is being used to run an I/O
> intensive workload can cause QEMU to crash.
[...]
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
> v4: - call blk_drain() in blk_remove_bs() (Kevin)
>
> v3: - start drained section before modifying the graph (Stefan)
>
> v2: - drain I/O requests when detaching the BDS (Stefan, Paolo)
> ---
>  block/block-backend.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 89f47b00ea24..bee1f0e41461 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -768,6 +768,11 @@ void blk_remove_bs(BlockBackend *blk)
>  
>      blk_update_root_state(blk);
>  
> +    /* bdrv_root_unref_child() will cause blk->root to become stale and may
> +     * switch to a completion coroutine later on. Let's drain all I/O here
> +     * to avoid that and a potential QEMU crash.
> +     */
> +    blk_drain(blk);
>      bdrv_root_unref_child(blk->root);
>      blk->root = NULL;
>  }

For some reason, this patch breaks iotest 083 (with -nbd) on tmpfs for me.

Only on tmpfs, though, so it's probably not going to be just a simple
reference output fix.

Max
On 2018-06-26 21:30, Max Reitz wrote:
> On 2018-05-28 14:03, Greg Kurz wrote:
>> Removing a drive with drive_del while it is being used to run an I/O
>> intensive workload can cause QEMU to crash.
[...]
>> Signed-off-by: Greg Kurz <groug@kaod.org>
>
> For some reason, this patch breaks iotest 083 (with -nbd) on tmpfs for me.
>
> Only on tmpfs, though, so it's probably not going to be just a simple
> reference output fix.

Scratch that, it seems that it's not just tmpfs but just breakage in
general. I suppose that's better.

Max
Quoting Kevin Wolf (2018-05-29 15:19:17)
> On 28.05.2018 at 14:03, Greg Kurz wrote:
> > Removing a drive with drive_del while it is being used to run an I/O
> > intensive workload can cause QEMU to crash.
[...]
> > Signed-off-by: Greg Kurz <groug@kaod.org>
>
> Hmm... It sounded convincing, but 'make check-tests/test-replication'
> fails now. The good news is that with the drain fixes, for which I sent
> v2 today, it passes, so instead of staging it in my block branch, I'll
> put it at the end of my branch for the drain fixes.
>
> Might take a bit longer than planned until it's in master, sorry.

I'm getting the below test-replication failure/trace trying to backport
this patch for 2.12.1 (using this tree:
https://github.com/mdroth/qemu/commits/stable-2.12-staging-f45280cbf)

Is this the same issue you saw, and if so, are the drain fixes
appropriate for 2.12.x? Are there other prereqs/follow-ups you're
aware of that would also be needed?

mdroth@sif:~/w/qemu-build4$ MALLOC_PERTURB_=1 gdb --args tests/test-replication
(gdb) run
Starting program: /home/mdroth/dev/kvm/qemu-build4/tests/test-replication
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff50bf700 (LWP 3916)]
[New Thread 0x7ffff48be700 (LWP 3917)]
/replication/primary/read: OK
/replication/primary/write: OK
/replication/primary/start: OK
/replication/primary/stop: OK
/replication/primary/do_checkpoint: OK
/replication/primary/get_error_all: OK
/replication/secondary/read: OK
/replication/secondary/write: OK
/replication/secondary/start:
[New Thread 0x7fffea9f5700 (LWP 3918)]
[New Thread 0x7fffea1f4700 (LWP 3919)]
[New Thread 0x7fffe99f3700 (LWP 3920)]
[New Thread 0x7fffe91f2700 (LWP 3921)]
[New Thread 0x7fffe89f1700 (LWP 3922)]
[New Thread 0x7fffdbfff700 (LWP 3923)]
[New Thread 0x7fffdb7fe700 (LWP 3924)]
[New Thread 0x7fffda1ef700 (LWP 3925)]
[New Thread 0x7fffd99ee700 (LWP 3926)]
[New Thread 0x7fffd91ed700 (LWP 3927)]
[New Thread 0x7fffd89ec700 (LWP 3928)]
[New Thread 0x7fffbffff700 (LWP 3929)]
[New Thread 0x7fffbf7fe700 (LWP 3930)]
[New Thread 0x7fffbeffd700 (LWP 3931)]
[New Thread 0x7fffbe7fc700 (LWP 3932)]
[New Thread 0x7fffbdffb700 (LWP 3933)]
[New Thread 0x7fffbd7fa700 (LWP 3934)]
[New Thread 0x7fffbcff9700 (LWP 3935)]
[New Thread 0x7fff9bfff700 (LWP 3936)]
[New Thread 0x7fff9b7fe700 (LWP 3937)]
[New Thread 0x7fff9affd700 (LWP 3938)]
[New Thread 0x7fff9a7fc700 (LWP 3939)]
[New Thread 0x7fff99ffb700 (LWP 3940)]
OK
/replication/secondary/stop:

Thread 1 "test-replicatio" received signal SIGSEGV, Segmentation fault.
qemu_mutex_unlock_impl (mutex=mutex@entry=0x101010101010161,
    file=file@entry=0x5555556734b0 "/home/mdroth/w/qemu4.git/util/async.c",
    line=line@entry=507) at /home/mdroth/w/qemu4.git/util/qemu-thread-posix.c:94
94          assert(mutex->initialized);
(gdb) bt
#0  qemu_mutex_unlock_impl (mutex=mutex@entry=0x101010101010161,
    file=file@entry=0x5555556734b0 "/home/mdroth/w/qemu4.git/util/async.c",
    line=line@entry=507) at /home/mdroth/w/qemu4.git/util/qemu-thread-posix.c:94
#1  0x00005555556231f5 in aio_context_release (ctx=ctx@entry=0x101010101010101)
    at /home/mdroth/w/qemu4.git/util/async.c:507
#2  0x00005555555c6895 in bdrv_drain_recurse (bs=bs@entry=0x555555d06250)
    at /home/mdroth/w/qemu4.git/block/io.c:197
#3  0x00005555555c6f0f in bdrv_do_drained_begin (bs=0x555555d06250,
    recursive=<optimized out>, parent=0x0) at /home/mdroth/w/qemu4.git/block/io.c:290
#4  0x00005555555c6ef0 in bdrv_parent_drained_begin (ignore=0x0, bs=0x555555d56fb0)
    at /home/mdroth/w/qemu4.git/block/io.c:53
#5  bdrv_do_drained_begin (bs=0x555555d56fb0, recursive=<optimized out>,
    parent=0x0) at /home/mdroth/w/qemu4.git/block/io.c:288
#6  0x00005555555c6ef0 in bdrv_parent_drained_begin (ignore=0x0, bs=0x555555d37210)
    at /home/mdroth/w/qemu4.git/block/io.c:53
#7  bdrv_do_drained_begin (bs=0x555555d37210, recursive=<optimized out>,
    parent=0x0) at /home/mdroth/w/qemu4.git/block/io.c:288
#8  0x00005555555c6ef0 in bdrv_parent_drained_begin (ignore=0x0, bs=0x555555d18450)
    at /home/mdroth/w/qemu4.git/block/io.c:53
#9  bdrv_do_drained_begin (bs=0x555555d18450, recursive=<optimized out>,
    parent=0x0) at /home/mdroth/w/qemu4.git/block/io.c:288
#10 0x00005555555ba6d9 in blk_drain (blk=0x555555d03a20)
    at /home/mdroth/w/qemu4.git/block/block-backend.c:1591
#11 0x00005555555bb12a in blk_remove_bs (blk=blk@entry=0x555555d03a20)
    at /home/mdroth/w/qemu4.git/block/block-backend.c:775
#12 0x00005555555bb366 in blk_delete (blk=0x555555d03a20)
    at /home/mdroth/w/qemu4.git/block/block-backend.c:401
#13 blk_unref (blk=0x555555d03a20)
    at /home/mdroth/w/qemu4.git/block/block-backend.c:450
#14 0x0000555555572515 in teardown_secondary ()
    at /home/mdroth/w/qemu4.git/tests/test-replication.c:373
#15 0x0000555555572f2a in test_secondary_stop ()
    at /home/mdroth/w/qemu4.git/tests/test-replication.c:477
#16 0x00007ffff77134aa in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#17 0x00007ffff77133db in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x00007ffff77133db in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#19 0x00007ffff7713682 in g_test_run_suite () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x00007ffff77136a1 in g_test_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#21 0x000055555557164f in main (argc=<optimized out>, argv=<optimized out>)
    at /home/mdroth/w/qemu4.git/tests/test-replication.c:580
(gdb)

> Kevin
On 18.07.2018 at 23:07, Michael Roth wrote:
> Quoting Kevin Wolf (2018-05-29 15:19:17)
> > On 28.05.2018 at 14:03, Greg Kurz wrote:
> > > Removing a drive with drive_del while it is being used to run an I/O
> > > intensive workload can cause QEMU to crash.
[...]
> > Might take a bit longer than planned until it's in master, sorry.
>
> I'm getting the below test-replication failure/trace trying to backport
> this patch for 2.12.1 (using this tree:
> https://github.com/mdroth/qemu/commits/stable-2.12-staging-f45280cbf)
>
> Is this the same issue you saw, and if so, are the drain fixes
> appropriate for 2.12.x? Are there other prereqs/follow-ups you're
> aware of that would also be needed?

I'm not completely sure any more, but yes, I think this might have been
the one. My rework of the bdrv_drain_*() functions fixed quite a few
bugs, including this one, but the work done since 2.12 is two rather
long and quite intrusive series, so I'm not sure if backporting them
for 2.12.1 is a good idea.

Kevin
diff --git a/block/block-backend.c b/block/block-backend.c
index 89f47b00ea24..bee1f0e41461 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -768,6 +768,11 @@ void blk_remove_bs(BlockBackend *blk)
 
     blk_update_root_state(blk);
 
+    /* bdrv_root_unref_child() will cause blk->root to become stale and may
+     * switch to a completion coroutine later on. Let's drain all I/O here
+     * to avoid that and a potential QEMU crash.
+     */
+    blk_drain(blk);
     bdrv_root_unref_child(blk->root);
     blk->root = NULL;
 }
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.

An AIO flush can yield at some point:

blk_aio_flush_entry()
 blk_co_flush(blk)
  bdrv_co_flush(blk->root->bs)
   ...
    qemu_coroutine_yield()

and let the HMP command run, free blk->root and give control
back to the AIO flush:

hmp_drive_del()
 blk_remove_bs()
  bdrv_root_unref_child(blk->root)
   child_bs = blk->root->bs
   bdrv_detach_child(blk->root)
    bdrv_replace_child(blk->root, NULL)
     blk->root->bs = NULL
   g_free(blk->root) <============== blk->root becomes stale
  bdrv_unref(child_bs)
   bdrv_delete(child_bs)
    bdrv_close()
     bdrv_drained_begin()
      bdrv_do_drained_begin()
       bdrv_drain_recurse()
        aio_poll()
         ...
         qemu_coroutine_switch()

and the AIO flush completion ends up dereferencing blk->root:

blk_aio_complete()
 scsi_aio_complete()
  blk_get_aio_context(blk)
   bs = blk_bs(blk)
ie, bs = blk->root ? blk->root->bs : NULL
         ^^^^^^^^^
           stale

The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().

Signed-off-by: Greg Kurz <groug@kaod.org>
---
v4: - call blk_drain() in blk_remove_bs() (Kevin)

v3: - start drained section before modifying the graph (Stefan)

v2: - drain I/O requests when detaching the BDS (Stefan, Paolo)
---
 block/block-backend.c | 5 +++++
 1 file changed, 5 insertions(+)