[PULL,21/35] block: fix QEMU crash with scsi-hd and drive_del

Message ID 20180618164504.24488-22-kwolf@redhat.com
State New
Series [PULL,01/35] test-bdrv-drain: bdrv_drain() works with cross-AioContext events

Commit Message

Kevin Wolf June 18, 2018, 4:44 p.m. UTC
From: Greg Kurz <groug@kaod.org>

Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.

An AIO flush can yield at some point:

blk_aio_flush_entry()
 blk_co_flush(blk)
  bdrv_co_flush(blk->root->bs)
   ...
    qemu_coroutine_yield()

and let the HMP command run, free blk->root, and give control
back to the AIO flush:

    hmp_drive_del()
     blk_remove_bs()
      bdrv_root_unref_child(blk->root)
       child_bs = blk->root->bs
       bdrv_detach_child(blk->root)
        bdrv_replace_child(blk->root, NULL)
         blk->root->bs = NULL
        g_free(blk->root) <============== blk->root becomes stale
       bdrv_unref(child_bs)
        bdrv_delete(child_bs)
         bdrv_close()
          bdrv_drained_begin()
           bdrv_do_drained_begin()
            bdrv_drain_recurse()
             aio_poll()
              ...
              qemu_coroutine_switch()

and the AIO flush completion ends up dereferencing blk->root:

  blk_aio_complete()
   scsi_aio_complete()
    blk_get_aio_context(blk)
     bs = blk_bs(blk)
  i.e., bs = blk->root ? blk->root->bs : NULL
             ^^^^^^^^^
             stale
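
For reference, blk_bs() is essentially the one-line accessor shown in
the trace above; a minimal C sketch (simplified, struct details elided)
makes the use-after-free concrete:

    BlockDriverState *blk_bs(BlockBackend *blk)
    {
        /* blk->root was freed by bdrv_detach_child() but not cleared,
         * so this conditional dereferences freed memory. */
        return blk->root ? blk->root->bs : NULL;
    }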

The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/block-backend.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Eric Blake Aug. 6, 2018, 10:04 p.m. UTC | #1
On 06/18/2018 11:44 AM, Kevin Wolf wrote:
> From: Greg Kurz <groug@kaod.org>
> 
> Removing a drive with drive_del while it is being used to run an I/O
> intensive workload can cause QEMU to crash.
> 
...
> The problem is that we should avoid making block driver graph
> changes while we have in-flight requests. Let's drain all I/O
> for this BB before calling bdrv_root_unref_child().
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/block-backend.c | 5 +++++
>   1 file changed, 5 insertions(+)

Interestingly enough, git bisect is reliably pointing to this commit as 
the culprit in the changed output of iotests './check -nbd 83':

083 8s ... - output mismatch (see 083.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/083.out	2018-02-26 11:40:31.605792220 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/083.out.bad	2018-08-06 16:57:03.411861660 -0500
@@ -41,7 +41,6 @@

  === Check disconnect after neg2 ===

-Connection closed
  read failed: Input/output error

  === Check disconnect 8 neg2 ===
@@ -54,7 +53,6 @@

  === Check disconnect before request ===

-Connection closed
  read failed: Input/output error

  === Check disconnect after request ===
@@ -116,7 +114,6 @@

  === Check disconnect after neg-classic ===

-Connection closed
  read failed: Input/output error

  === Check disconnect before neg1 ===
Failures: 083
Failed 1 of 1 tests


> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 2d1a3463e8..6b75bca317 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
>   
>       blk_update_root_state(blk);
>   
> +    /* bdrv_root_unref_child() will cause blk->root to become stale and may
> +     * switch to a completion coroutine later on. Let's drain all I/O here
> +     * to avoid that and a potential QEMU crash.
> +     */
> +    blk_drain(blk);
>       bdrv_root_unref_child(blk->root);
>       blk->root = NULL;
>   }
> 

Test 83 sets up a client that intentionally disconnects at critical 
points in the NBD protocol exchange, to ensure that the server reacts 
sanely.  I suspect that somewhere in the NBD code, the server detects 
the disconnect and was somehow calling into blk_remove_bs() (although I 
could not quickly find the backtrace); and that prior to this patch, the 
'Connection closed' message resulted from other NBD coroutines getting a 
shot at the (now-closed) connection, while after this patch, the 
additional blk_drain() somehow tweaks things in a way that prevents the 
other NBD coroutines from printing a message.  If so, then the change in 
83 reference output is probably intentional, and we should update it.

But I'm having a hard time convincing myself that this is the case, 
particularly since I'm not even sure how to easily debug the assumptions 
I made above.

I'm very weak on the whole notion of what blk_drain() vs.
blk_remove_bs() is really supposed to be doing, and could easily be
persuaded that the change in output is a regression instead of a fix.

Thoughts?
Eric Blake Aug. 7, 2018, 7:57 p.m. UTC | #2
On 08/06/2018 05:04 PM, Eric Blake wrote:
> On 06/18/2018 11:44 AM, Kevin Wolf wrote:
>> From: Greg Kurz <groug@kaod.org>
>>
>> Removing a drive with drive_del while it is being used to run an I/O
>> intensive workload can cause QEMU to crash.
>>
> ...

> 
> Test 83 sets up a client that intentionally disconnects at critical 
> points in the NBD protocol exchange, to ensure that the server reacts 
> sanely.

Rather, nbd-fault-injector.py is a server that disconnects at critical 
points, and the test is of client reaction.

>  I suspect that somewhere in the NBD code, the server detects 
> the disconnect and was somehow calling into blk_remove_bs() (although I 
> could not quickly find the backtrace); and that prior to this patch, the 
> 'Connection closed' message resulted from other NBD coroutines getting a 
> shot at the (now-closed) connection, while after this patch, the 
> additional blk_drain() somehow tweaks things in a way that prevents the 
> other NBD coroutines from printing a message.  If so, then the change in 
> 83 reference output is probably intentional, and we should update it.

It seems like this condition is racy, and that the race is more likely 
to be lost prior to this patch than after. It's a question of whether 
the client has time to start a request to the server prior to the server 
hanging up, as the message is generated during 
nbd_co_do_receive_one_chunk.  Here's a demonstration of the fact that 
things are racy:

$ git revert f45280cbf
$ make
$ cd tests/qemu-iotests
$ cat fault.txt
[inject-error "a"]
event=neg2
when=after
$ python nbd-fault-injector.py localhost:10809 ./fault.txt &
Listening on 127.0.0.1:10809
$ ../../qemu-io -f raw nbd://localhost:10809 -c 'r 0 512'
Closing connection on rule match inject-error "a"
Connection closed
read failed: Input/output error
$ python nbd-fault-injector.py localhost:10809 ./fault.txt &
Listening on 127.0.0.1:10809
$ ../../qemu-io -f raw nbd://localhost:10809
Closing connection on rule match inject-error "a"
qemu-io> r 0 512
read failed: Input/output error
qemu-io> q

So, whether the read command is kicked off quickly (via -c) or slowly
(via typing into qemu-io) determines whether the message appears.

What's more, in commit f140e300, we specifically called out in the 
commit message that maybe it was better to trace when we detect 
connection closed rather than log it to stdout, and in all cases in that 
commit, the additional 'Connection closed' messages do not add any 
information to the error message already displayed by the rest of the code.

I don't know how much the proposed NBD reconnect code will change things 
in 3.1.  Meanwhile, we've missed any chance for 3.0 to fix test 83.

> 
> But I'm having a hard time convincing myself that this is the case, 
> particularly since I'm not even sure how to easily debug the assumptions 
> I made above.
> 
> I'm very weak on the whole notion of what blk_drain() vs. 
> blk_remove_bs() is really supposed to be doing, and could easily be 
> persuaded that the change in output is a regression instead of a fix.

At this point, I don't think we have a regression, just a bad
iotests reference output. The extra blk_drain() merely adds more time 
before the NBD code can send out its first request, and thus makes it 
more likely that the fault injector has closed the connection before the 
read request is issued rather than after (the message only appears when 
read beats the race), but the NBD code shouldn't be printing the error 
message in the first place, and 083 needs to be tweaked to remove the 
noisy lines added in f140e300 (not just the three lines that are 
reliably different due to this patch, but all other such lines due to 
strategic server drops at other points in the NBD protocol).
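
As a sketch, the kind of reference-output tweak being suggested would
mirror the failing hunks shown earlier (offsets illustrative):

    --- a/tests/qemu-iotests/083.out
    +++ b/tests/qemu-iotests/083.out
    @@ -41,7 +41,6 @@
     
     === Check disconnect after neg2 ===
     
    -Connection closed
     read failed: Input/output error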
Vladimir Sementsov-Ogievskiy Aug. 8, 2018, 9:33 a.m. UTC | #3
07.08.2018 22:57, Eric Blake wrote:
> [...]
>
> What's more, in commit f140e300, we specifically called out in the 
> commit message that maybe it was better to trace when we detect 
> connection closed rather than log it to stdout, and in all cases in 
> that commit, the additional 'Connection closed' messages do not add 
> any information to the error message already displayed by the rest of 
> the code.
>
> [...]
>
> At this point, I don't think we have a regression, just a bad 
> iotests reference output. The extra blk_drain() merely adds more time 
> before the NBD code can send out its first request, and thus makes it 
> more likely that the fault injector has closed the connection before 
> the read request is issued rather than after (the message only appears 
> when read beats the race), but the NBD code shouldn't be printing the 
> error message in the first place, and 083 needs to be tweaked to 
> remove the noisy lines added in f140e300 (not just the three lines 
> that are reliably different due to this patch, but all other such 
> lines due to strategic server drops at other points in the NBD protocol).
>

Ok, agree, I'll do it in reconnect series.
Vladimir Sementsov-Ogievskiy Aug. 8, 2018, 11:40 a.m. UTC | #4
07.08.2018 22:57, Eric Blake wrote:
> [...]
>
> I don't know how much the proposed NBD reconnect code will change 
> things in 3.1.  Meanwhile, we've missed any chance for 3.0 to fix test 
> 83.

hmm, isn't the 3.0 release planned for August 14th?

Eric Blake Aug. 8, 2018, 12:53 p.m. UTC | #5
On 08/08/2018 06:40 AM, Vladimir Sementsov-Ogievskiy wrote:

>> I don't know how much the proposed NBD reconnect code will change 
>> things in 3.1.  Meanwhile, we've missed any chance for 3.0 to fix test 
>> 83.
> 
> hmm, isn't the 3.0 release planned for August 14th?

Yes, but -rc4 has already been cut, and there will not be an -rc5 unless 
there is a MAJOR regression identified in qemu proper (a failing iotest 
is not major).
Vladimir Sementsov-Ogievskiy Aug. 8, 2018, 2:32 p.m. UTC | #6
08.08.2018 12:33, Vladimir Sementsov-Ogievskiy wrote:
> 07.08.2018 22:57, Eric Blake wrote:
>> [...]
>>
>> What's more, in commit f140e300, we specifically called out in the 
>> commit message that maybe it was better to trace when we detect 
>> connection closed rather than log it to stdout, and in all cases in 
>> that commit, the additional 'Connection closed' messages do not add 
>> any information to the error message already displayed by the rest of 
>> the code.
>>
>> [...]
>>
>> At this point, I don't think we have a regression, just a bad 
>> iotests reference output. The extra blk_drain() merely adds more time 
>> before the NBD code can send out its first request, and thus makes it 
>> more likely that the fault injector has closed the connection before 
>> the read request is issued rather than after (the message only 
>> appears when read beats the race), but the NBD code shouldn't be 
>> printing the error message in the first place, and 083 needs to be 
>> tweaked to remove the noisy lines added in f140e300 (not just the 
>> three lines that are reliably different due to this patch, but all 
>> other such lines due to strategic server drops at other points in the 
>> NBD protocol).
>>
>
> Ok, agree, I'll do it in reconnect series.
>


hmm, do what?

I was going to change these error messages to be traces, but now I'm not
sure that it's a good idea. We have a generic errp returned from the
function, so why drop it from the logs? Fixing the iotest is not a good
reason; it would be better to adjust the test itself a bit (just commit
the changed output) and forget about it. Is the iotest itself racy? Did
you see different output from running iotest 83 itself, rather than only
when testing by hand?
Eric Blake Aug. 8, 2018, 2:53 p.m. UTC | #7
On 08/08/2018 09:32 AM, Vladimir Sementsov-Ogievskiy wrote:

>>> What's more, in commit f140e300, we specifically called out in the 
>>> commit message that maybe it was better to trace when we detect 
>>> connection closed rather than log it to stdout, and in all cases in 
>>> that commit, the additional 'Connection closed' messages do not add 
>>> any information to the error message already displayed by the rest of 
>>> the code.
>>>

>> Ok, agree, I'll do it in reconnect series.
>>
> 
> 
> hmm, do what?
> 
> I was going to change these error messages to be traces, but now I'm not 
> sure that it's a good idea.

Traces are fine. They won't show up in iotests, but will show up when 
debugging a failed connection.
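
For illustration only, a conversion along those lines (event and
variable names hypothetical, not taken from any posted series) might
look like this in block/trace-events:

    nbd_connection_closed(const char *err) "connection closed: %s"

and at the reporting site in the NBD client code:

    /* Hypothetical sketch: emit a tracepoint instead of printing
     * "Connection closed" on stdout. */
    trace_nbd_connection_closed(error_get_pretty(local_err));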

> We have a generic errp returned from the 
> function, so why drop it from the logs?

Because it is redundant with the very next line already in the log. Any 
error encountered when trying to write to a disconnected server is 
redundant with an already-reported error due to detecting EOF on reading 
from the server.

> Fixing the iotest is not a good 
> reason; it would be better to adjust the test itself a bit (just commit 
> the changed output) and forget about it. Is the iotest itself racy? Did 
> you see different output from running iotest 83 itself, rather than 
> only when testing by hand?

The condition for the output of the 'Connection closed' message is racy 
- it depends entirely on the timing of whether the client was able to 
send() a read request to the server prior to the server disconnecting 
immediately after negotiation ended.  If the client loses the race and 
detects the server hangup prior to writing anything, you get one path; 
if the client wins the race and successfully writes the request and only 
later learns that the server has disconnected when trying to read the 
response to that request, you get the other path. The window for the 
race changed (and the iotests did not seem to ever expose it short of 
this particular change to the block layer to do an extra drain), but I 
could still imagine scenarios where iotests will trigger the opposite 
path of the race from what is expected, depending on load, since I don't 
see any synchronization points between the two processes where the 
server is hanging up after negotiation without reading the client's 
request, but where the client may or may not have had time to get its 
request sent to the server's queue.

So, even though I have not seen the iotest fail directly because of a 
race, I think that this commit causing failures in the iotest is 
evidence that the test is not robust with those extra 'Connection 
closed' messages being output.  Switching the output to a trace 
instead should be just fine; overall, the client's attempt to read when 
the server hangs up will be an EIO failure whether the client was 
able to send() its request and merely failed to get a reply (server 
disconnect was slow), or was not even able to send() its request 
(server disconnect was fast).

Patch

diff --git a/block/block-backend.c b/block/block-backend.c
index 2d1a3463e8..6b75bca317 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
 
     blk_update_root_state(blk);
 
+    /* bdrv_root_unref_child() will cause blk->root to become stale and may
+     * switch to a completion coroutine later on. Let's drain all I/O here
+     * to avoid that and a potential QEMU crash.
+     */
+    blk_drain(blk);
     bdrv_root_unref_child(blk->root);
     blk->root = NULL;
 }