vl.c/exit: pause cpus before closing block devices

Message ID 20170713190116.21608-1-dgilbert@redhat.com
State New

Commit Message

Dr. David Alan Gilbert July 13, 2017, 7:01 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

There's a rare segfault on exit if the guest is accessing
I/O during exit.
It always hits atomic_inc(&bs->in_flight) with a NULL
bs. That call was added recently in commit 99723548, but I don't see it
as the cause.

Flip the order in vl.c so we pause the CPUs before closing the block
devices; that way nothing should be trying to access them once
they're gone.

This was originally Red Hat bz https://bugzilla.redhat.com/show_bug.cgi?id=1451015

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reported-by: Cong Li <coli@redhat.com>

--
This is a very rare race; I'll leave the test running in a loop to see if
we hit anything else and to check this really fixes it.

I do worry that there are other cases that could trigger this - e.g.
hot-unplug or ejecting a CD.

---
 vl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stefan Hajnoczi July 17, 2017, 10:17 a.m. UTC | #1
On Thu, Jul 13, 2017 at 08:01:16PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> There's a rare exit seg if the guest is accessing
> IO during exit.
> It's always hitting the atomic_inc(&bs->in_flight) with a NULL
> bs. This was added recently in 99723548  but I don't see it
> as the cause.
> 
> Flip vl.c around so we pause the cpus before closing the block devices,
> that way we shouldn't have anything trying to access them when
> they're gone.
> 
> This was originally Red Hat bz https://bugzilla.redhat.com/show_bug.cgi?id=1451015
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reported-by: Cong Li <coli@redhat.com>
> 
> --
> This is a very rare race, I'll leave it running in a loop to see if
> we hit anything else and to check this really fixes it.
> 
> I do worry if there are other cases that can trigger this - e.g.
> hot-unplug or ejecting a CD.
> 
> ---
>  vl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Dr. David Alan Gilbert July 17, 2017, 10:26 a.m. UTC | #2
* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> [...]
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Thanks; and the test I left running seems solid - ~12k runs
over the weekend with no segfault.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
John Snow July 17, 2017, 4:43 p.m. UTC | #3
On 07/17/2017 06:26 AM, Dr. David Alan Gilbert wrote:
> [...]

the root cause of this bug is related to this as well:
https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html

From commit 99723548 we started assuming (incorrectly?) that blk_
functions always WILL have an attached BDS, but this is not always true,
for instance, flushing the cache from an empty CDROM.

Paolo, can we move the flight counter increment outside of the
block-backend layer, is that safe?

--js
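
For illustration, the shape of the crash and the obvious guard - a
hypothetical sketch, not the fix that was merged. Commit 99723548 made
the blk_aio_* paths call bdrv_inc_in_flight(blk_bs(blk)) unconditionally,
and blk_bs() returns NULL for an empty drive:

    /* Hypothetical sketch: guard the counter added in 99723548 so an
     * empty BlockBackend (e.g. a CD-ROM drive with no medium, where
     * blk_bs() returns NULL) no longer crashes in atomic_inc(). */
    static void blk_inc_in_flight(BlockBackend *blk)
    {
        BlockDriverState *bs = blk_bs(blk);

        if (bs) {
            bdrv_inc_in_flight(bs);
        }
    }

The guard alone would leave such requests invisible to drain, which is
why the thread goes on to look for a better home for the counter.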
Alberto Garcia Aug. 2, 2017, 2:42 p.m. UTC | #4
On Thu, Jul 13, 2017 at 08:01:16PM +0100, Dr. David Alan Gilbert (git) wrote:
> --- a/vl.c
> +++ b/vl.c
> @@ -4787,8 +4787,8 @@ int main(int argc, char **argv, char **envp)
>      replay_disable_events();
>      iothread_stop_all();
>  
> -    bdrv_close_all();
>      pause_all_vcpus();
> +    bdrv_close_all();
>      res_free();

I haven't debugged it yet, but on my computer iotest 093 stops working
(it never finishes) after this change.

Berto
Dr. David Alan Gilbert Aug. 3, 2017, 4:45 p.m. UTC | #5
* Alberto Garcia (berto@igalia.com) wrote:
> On Thu, Jul 13, 2017 at 08:01:16PM +0100, Dr. David Alan Gilbert (git) wrote:
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -4787,8 +4787,8 @@ int main(int argc, char **argv, char **envp)
> >      replay_disable_events();
> >      iothread_stop_all();
> >  
> > -    bdrv_close_all();
> >      pause_all_vcpus();
> > +    bdrv_close_all();
> >      res_free();
> 
> I haven't debugged it yet, but on my computer iotest 093 stops working
> (it never finishes) after this change.

Yes, I can reproduce that here (I have to run 093 explicitly - it
doesn't run automatically for me):

(gdb) where
#0  0x00007feee3780b76 in ppoll () at /lib64/libc.so.6
#1  0x0000564aed293199 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:77
#2  0x0000564aed293199 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>)
    at /home/dgilbert/git/qemu/util/qemu-timer.c:322
#3  0x0000564aed294de1 in aio_poll (ctx=ctx@entry=0x564aefa578e0, blocking=<optimized out>)
    at /home/dgilbert/git/qemu/util/aio-posix.c:622
#4  0x0000564aed215924 in bdrv_drain_recurse (bs=bs@entry=0x564aefa6c320) at /home/dgilbert/git/qemu/block/io.c:188
#5  0x0000564aed216345 in bdrv_drain_all_begin () at /home/dgilbert/git/qemu/block/io.c:353
#6  0x0000564aed2164b9 in bdrv_drain_all () at /home/dgilbert/git/qemu/block/io.c:382
#7  0x0000564aed1c9b63 in bdrv_close_all () at /home/dgilbert/git/qemu/block.c:3113
#8  0x0000564aeceb04fe in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at /home/dgilbert/git/qemu/vl.c:4795

Dave

> Berto
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini Aug. 3, 2017, 10:36 p.m. UTC | #6
----- Original Message -----
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> To: "Alberto Garcia" <berto@igalia.com>
> Cc: qemu-devel@nongnu.org, pbonzini@redhat.com, jsnow@redhat.com
> Sent: Thursday, August 3, 2017 6:45:17 PM
> Subject: Re: [Qemu-devel] [PATCH] vl.c/exit: pause cpus before closing block devices
> 
> * Alberto Garcia (berto@igalia.com) wrote:
> > On Thu, Jul 13, 2017 at 08:01:16PM +0100, Dr. David Alan Gilbert (git)
> > wrote:
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -4787,8 +4787,8 @@ int main(int argc, char **argv, char **envp)
> > >      replay_disable_events();
> > >      iothread_stop_all();
> > >  
> > > -    bdrv_close_all();
> > >      pause_all_vcpus();
> > > +    bdrv_close_all();
> > >      res_free();
> > 
> > I haven't debugged it yet, but on my computer iotest 093 stops working
> > (it never finishes) after this change.
> 
> Yes, I can reproduce that here (I have to run 093 explicitly - it
> doesn't run automatically for me):

The culprit so to speak is this:

        if (qtest_enabled()) {
            /* For testing block IO throttling only */
            tg->clock_type = QEMU_CLOCK_VIRTUAL;
        }

So after pause_all_vcpus(), the clock doesn't advance and bdrv_close_all
hangs.  Should throttling be disabled by the time bdrv_close drains the
BlockDriverState, and likewise for bdrv_close_all?

Paolo
Stefan Hajnoczi Aug. 4, 2017, 9:56 a.m. UTC | #7
On Thu, Aug 3, 2017 at 11:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
>
> The culprit so to speak is this:
>
>         if (qtest_enabled()) {
>             /* For testing block IO throttling only */
>             tg->clock_type = QEMU_CLOCK_VIRTUAL;
>         }
>
> So after pause_all_vcpus(), the clock doesn't advance and bdrv_close_all
> hangs.  Should throttling be disabled by the time bdrv_close drains the
> BlockDriverState, and likewise for bdrv_close_all?

bdrv_drain_all() is called before blk_remove_all_bs(), so throttling
is still enabled while draining:

void bdrv_close_all(void)
{
    block_job_cancel_sync_all();
    nbd_export_close_all();

    /* Drop references from requests still in flight, such as canceled block
     * jobs whose AIO context has not been polled yet */
    bdrv_drain_all();

    blk_remove_all_bs();

Perhaps we need a bdrv_throttle_disable_all() function.  Manos is
moving throttling to a block driver so there is no longer a 1:1
relationship between BB and throttling - there might be multiple
throttling nodes in a BDS graph.

Stefan
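
A rough sketch of what such a helper could look like against today's
code, where throttle state still hangs off the BlockBackend - purely
hypothetical, and written before Manos' throttle-node work lands:

    /* Hypothetical sketch: drop I/O limits on every BlockBackend so a
     * final drain cannot block on throttle timers that will never fire
     * once the vcpus (and hence QEMU_CLOCK_VIRTUAL) are paused. */
    void bdrv_throttle_disable_all(void)
    {
        BlockBackend *blk = NULL;

        /* blk_all_next() iterates over every existing BlockBackend */
        while ((blk = blk_all_next(blk)) != NULL) {
            if (blk_get_public(blk)->throttle_state) {
                blk_io_limits_disable(blk);
            }
        }
    }

bdrv_close_all() could then call this before bdrv_drain_all(), releasing
queued throttled requests before anything waits on them.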
Stefan Hajnoczi Aug. 4, 2017, 9:58 a.m. UTC | #8
On Mon, Jul 17, 2017 at 5:43 PM, John Snow <jsnow@redhat.com> wrote:
> [...]
>
> the root cause of this bug is related to this as well:
> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
>
> From commit 99723548 we started assuming (incorrectly?) that blk_
> functions always WILL have an attached BDS, but this is not always true,
> for instance, flushing the cache from an empty CDROM.
>
> Paolo, can we move the flight counter increment outside of the
> block-backend layer, is that safe?

I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
regardless of the throttling timer issue discussed below.  BB cannot
assume that the BDS graph is non-empty.

Stefan
Paolo Bonzini Aug. 4, 2017, 11:46 a.m. UTC | #9
On 04/08/2017 11:58, Stefan Hajnoczi wrote:
>> the root cause of this bug is related to this as well:
>> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
>>
>> From commit 99723548 we started assuming (incorrectly?) that blk_
>> functions always WILL have an attached BDS, but this is not always true,
>> for instance, flushing the cache from an empty CDROM.
>>
>> Paolo, can we move the flight counter increment outside of the
>> block-backend layer, is that safe?
> I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
> regardless of the throttling timer issue discussed below.  BB cannot
> assume that the BDS graph is non-empty.

Can we make bdrv_aio_* return NULL (even temporarily) if there is no
attached BDS?  That would make it much easier to fix.

Paolo
Stefan Hajnoczi Aug. 8, 2017, 12:53 p.m. UTC | #10
On Fri, Aug 04, 2017 at 01:46:17PM +0200, Paolo Bonzini wrote:
> On 04/08/2017 11:58, Stefan Hajnoczi wrote:
> >> the root cause of this bug is related to this as well:
> >> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
> >>
> >> From commit 99723548 we started assuming (incorrectly?) that blk_
> >> functions always WILL have an attached BDS, but this is not always true,
> >> for instance, flushing the cache from an empty CDROM.
> >>
> >> Paolo, can we move the flight counter increment outside of the
> >> block-backend layer, is that safe?
> > I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
> > regardless of the throttling timer issue discussed below.  BB cannot
> > assume that the BDS graph is non-empty.
> 
> Can we make bdrv_aio_* return NULL (even temporarily) if there is no
> attached BDS?  That would make it much easier to fix.

There are many blk_aio_*() callers.  Returning NULL forces them to
perform extra error handling.

When you say "temporarily" do you mean it returns NULL but schedules a
one-shot BH to invoke the callback?  I wonder if we can use a singleton
aiocb instead of NULL for -ENOMEDIUM errors.

Stefan
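
A sketch of that idea, assuming the blk_abort_aio_request() helper that
already exists in block-backend.c (it schedules a one-shot BH which
invokes cb(opaque, ret)) - hypothetical wiring, shown for blk_aio_flush():

    /* Hypothetical sketch: never return NULL; an empty BB completes
     * the request asynchronously with -ENOMEDIUM via the normal
     * completion callback, so callers need no extra error handling. */
    BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                              BlockCompletionFunc *cb, void *opaque)
    {
        if (!blk_is_available(blk)) {
            return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
        }

        return blk_aio_prwv(blk, 0, 0, NULL, blk_aio_flush_entry, 0,
                            cb, opaque);
    }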
Kevin Wolf Aug. 8, 2017, 1:03 p.m. UTC | #11
On 08.08.2017 at 14:53, Stefan Hajnoczi wrote:
> On Fri, Aug 04, 2017 at 01:46:17PM +0200, Paolo Bonzini wrote:
> > On 04/08/2017 11:58, Stefan Hajnoczi wrote:
> > >> the root cause of this bug is related to this as well:
> > >> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
> > >>
> > >> From commit 99723548 we started assuming (incorrectly?) that blk_
> > >> functions always WILL have an attached BDS, but this is not always true,
> > >> for instance, flushing the cache from an empty CDROM.
> > >>
> > >> Paolo, can we move the flight counter increment outside of the
> > >> block-backend layer, is that safe?
> > > I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
> > > regardless of the throttling timer issue discussed below.  BB cannot
> > > assume that the BDS graph is non-empty.
> > 
> > Can we make bdrv_aio_* return NULL (even temporarily) if there is no
> > attached BDS?  That would make it much easier to fix.
> 
> There are many blk_aio_*() callers.  Returning NULL forces them to
> perform extra error handling.

Yes, that's my concern. We removed NULL returns a long time ago. Most
callers probably don't check for it any more.

> When you say "temporarily" do you mean it returns NULL but schedules a
> one-shot BH to invoke the callback?  I wonder if we can use a singleton
> aiocb instead of NULL for -ENOMEDIUM errors.

This doesn't help. As soon as you involve BHs, you need to consider
them during blk_drain(); otherwise the drain can return too early.

And if you want to consider them during blk_drain()... well, I made an
attempt; maybe we can make it work with some more changes. But I'm
starting to see that it's not a trivial change - though admittedly, the
NULL return thing doesn't look trivial either.

Kevin
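
One conceivable direction for that non-trivial change - a hypothetical
sketch, not code from this thread: give the BlockBackend its own
in-flight counter (a new blk->in_flight field) that blk_drain() polls,
so even a BH-completed request stays visible with no BDS attached:

    /* Hypothetical sketch: BB-level request accounting. */
    static void blk_inc_in_flight(BlockBackend *blk)
    {
        atomic_inc(&blk->in_flight);
    }

    static void blk_dec_in_flight(BlockBackend *blk)
    {
        atomic_dec(&blk->in_flight);
    }

    static void error_callback_bh(void *opaque)
    {
        struct BlockBackendAIOCB *acb = opaque;

        acb->common.cb(acb->common.opaque, acb->ret);
        /* only after the callback has run may a pending blk_drain()
         * return, so drop the count last */
        blk_dec_in_flight(acb->blk);
        qemu_aio_unref(acb);
    }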
Paolo Bonzini Aug. 8, 2017, 1:07 p.m. UTC | #12
On 08/08/2017 14:53, Stefan Hajnoczi wrote:
> On Fri, Aug 04, 2017 at 01:46:17PM +0200, Paolo Bonzini wrote:
>> On 04/08/2017 11:58, Stefan Hajnoczi wrote:
>>>> the root cause of this bug is related to this as well:
>>>> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
>>>>
>>>> From commit 99723548 we started assuming (incorrectly?) that blk_
>>>> functions always WILL have an attached BDS, but this is not always true,
>>>> for instance, flushing the cache from an empty CDROM.
>>>>
>>>> Paolo, can we move the flight counter increment outside of the
>>>> block-backend layer, is that safe?
>>> I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
>>> regardless of the throttling timer issue discussed below.  BB cannot
>>> assume that the BDS graph is non-empty.
>>
>> Can we make bdrv_aio_* return NULL (even temporarily) if there is no
>> attached BDS?  That would make it much easier to fix.
> 
> There are many blk_aio_*() callers.  Returning NULL forces them to
> perform extra error handling.

Most of them are for non-removable devices and check for a
non-empty BB at startup.  The others (ide-cd, scsi-cd, sd) check the
BlockBackend's blk_is_available or blk_is_inserted state themselves.

It does require some auditing of course, but remember that anything that
would return NULL would currently be crashing in bdrv_inc_in_flight.
We'd be seeing NULL pointer dereferences left and right if it were that bad.

> When you say "temporarily" do you mean it returns NULL but schedules a
> one-shot BH to invoke the callback?  I wonder if we can use a singleton
> aiocb instead of NULL for -ENOMEDIUM errors.

No, I mean undoing that in 2.11.

Paolo

Patch

diff --git a/vl.c b/vl.c
index f7560de622..d53f715e91 100644
--- a/vl.c
+++ b/vl.c
@@ -4787,8 +4787,8 @@  int main(int argc, char **argv, char **envp)
     replay_disable_events();
     iothread_stop_all();
 
-    bdrv_close_all();
     pause_all_vcpus();
+    bdrv_close_all();
     res_free();
 
     /* vhost-user must be cleaned up before chardevs.  */