
[v2,11/17] block-backend: Decrease in_flight only after callback

Message ID 20180913125217.23173-12-kwolf@redhat.com
State New
Series Fix some jobs/drain/aio_poll related hangs

Commit Message

Kevin Wolf Sept. 13, 2018, 12:52 p.m. UTC
Request callbacks can do pretty much anything, including operations that
will yield from the coroutine (such as draining the backend). In that
case, a decreased in_flight would be visible to other code and could
lead to a drain completing while the callback hasn't actually completed
yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
---
 block/block-backend.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini Sept. 13, 2018, 3:10 p.m. UTC | #1
On 13/09/2018 14:52, Kevin Wolf wrote:
> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> +            /* If we are in the main thread, the callback is allowed to unref
> +             * the BlockBackend, so we have to hold an additional reference */
> +            blk_ref(acb->rwco.blk);
> +        }
>          acb->common.cb(acb->common.opaque, acb->rwco.ret);
> +        blk_dec_in_flight(acb->rwco.blk);
> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> +            blk_unref(acb->rwco.blk);
> +        }

Is this something that happens only for some specific callers?  That is,
which callers are sure that the callback is invoked from the main thread?

Thanks,

Paolo
Kevin Wolf Sept. 13, 2018, 4:59 p.m. UTC | #2
Am 13.09.2018 um 17:10 hat Paolo Bonzini geschrieben:
> On 13/09/2018 14:52, Kevin Wolf wrote:
> > +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > +            /* If we are in the main thread, the callback is allowed to unref
> > +             * the BlockBackend, so we have to hold an additional reference */
> > +            blk_ref(acb->rwco.blk);
> > +        }
> >          acb->common.cb(acb->common.opaque, acb->rwco.ret);
> > +        blk_dec_in_flight(acb->rwco.blk);
> > +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > +            blk_unref(acb->rwco.blk);
> > +        }
> 
> Is this something that happens only for some specific callers?  That is,
> which callers are sure that the callback is invoked from the main thread?

I can't seem to reproduce the problem I saw any more even when reverting
the bdrv_ref/unref pair. If I remember correctly it was actually a
nested aio_poll() that was running a block job completion or something
like that - which would obviously only happen on the main thread because
the job intentionally defers to the main thread.

The only reason I made this conditional is that I think bdrv_unref()
still isn't safe outside the main thread, is it?

Kevin
Max Reitz Sept. 13, 2018, 8:50 p.m. UTC | #3
On 13.09.18 14:52, Kevin Wolf wrote:
> Request callbacks can do pretty much anything, including operations that
> will yield from the coroutine (such as draining the backend). In that
> case, a decreased in_flight would be visible to other code and could
> lead to a drain completing while the callback hasn't actually completed
> yet.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Fam Zheng <famz@redhat.com>
> ---
>  block/block-backend.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>
Fam Zheng Sept. 14, 2018, 7:47 a.m. UTC | #4
On Thu, 09/13 18:59, Kevin Wolf wrote:
> Am 13.09.2018 um 17:10 hat Paolo Bonzini geschrieben:
> > On 13/09/2018 14:52, Kevin Wolf wrote:
> > > +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > > +            /* If we are in the main thread, the callback is allowed to unref
> > > +             * the BlockBackend, so we have to hold an additional reference */
> > > +            blk_ref(acb->rwco.blk);
> > > +        }
> > >          acb->common.cb(acb->common.opaque, acb->rwco.ret);
> > > +        blk_dec_in_flight(acb->rwco.blk);
> > > +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > > +            blk_unref(acb->rwco.blk);
> > > +        }
> > 
> > Is this something that happens only for some specific callers?  That is,
> > which callers are sure that the callback is invoked from the main thread?
> 
> I can't seem to reproduce the problem I saw any more even when reverting
> the bdrv_ref/unref pair. If I remember correctly it was actually a
> nested aio_poll() that was running a block job completion or something
> like that - which would obviously only happen on the main thread because
> the job intentionally defers to the main thread.
> 
> The only reason I made this conditional is that I think bdrv_unref()
> still isn't safe outside the main thread, is it?

I think that is correct.

Fam
Paolo Bonzini Sept. 14, 2018, 3:12 p.m. UTC | #5
On 13/09/2018 18:59, Kevin Wolf wrote:
> Am 13.09.2018 um 17:10 hat Paolo Bonzini geschrieben:
>> On 13/09/2018 14:52, Kevin Wolf wrote:
>>> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
>>> +            /* If we are in the main thread, the callback is allowed to unref
>>> +             * the BlockBackend, so we have to hold an additional reference */
>>> +            blk_ref(acb->rwco.blk);
>>> +        }
>>>          acb->common.cb(acb->common.opaque, acb->rwco.ret);
>>> +        blk_dec_in_flight(acb->rwco.blk);
>>> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
>>> +            blk_unref(acb->rwco.blk);
>>> +        }
>>
>> Is this something that happens only for some specific callers?  That is,
>> which callers are sure that the callback is invoked from the main thread?
> 
> I can't seem to reproduce the problem I saw any more even when reverting
> the bdrv_ref/unref pair. If I remember correctly it was actually a
> nested aio_poll() that was running a block job completion or something
> like that - which would obviously only happen on the main thread because
> the job intentionally defers to the main thread.
> 
> The only reason I made this conditional is that I think bdrv_unref()
> still isn't safe outside the main thread, is it?

Yes, making it conditional is correct, but it is quite fishy even with
the conditional.

As you mention, you could have a nested aio_poll() in the main thread,
for example invoked from a bottom half, but in that case I'd rather
track the caller that is creating the bottom half and see if it lacks a
bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
missing).

Paolo
Kevin Wolf Sept. 14, 2018, 5:14 p.m. UTC | #6
Am 14.09.2018 um 17:12 hat Paolo Bonzini geschrieben:
> On 13/09/2018 18:59, Kevin Wolf wrote:
> > Am 13.09.2018 um 17:10 hat Paolo Bonzini geschrieben:
> >> On 13/09/2018 14:52, Kevin Wolf wrote:
> >>> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> >>> +            /* If we are in the main thread, the callback is allowed to unref
> >>> +             * the BlockBackend, so we have to hold an additional reference */
> >>> +            blk_ref(acb->rwco.blk);
> >>> +        }
> >>>          acb->common.cb(acb->common.opaque, acb->rwco.ret);
> >>> +        blk_dec_in_flight(acb->rwco.blk);
> >>> +        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> >>> +            blk_unref(acb->rwco.blk);
> >>> +        }
> >>
> >> Is this something that happens only for some specific callers?  That is,
> >> which callers are sure that the callback is invoked from the main thread?
> > 
> > I can't seem to reproduce the problem I saw any more even when reverting
> > the bdrv_ref/unref pair. If I remember correctly it was actually a
> > nested aio_poll() that was running a block job completion or something
> > like that - which would obviously only happen on the main thread because
> > the job intentionally defers to the main thread.
> > 
> > The only reason I made this conditional is that I think bdrv_unref()
> > still isn't safe outside the main thread, is it?
> 
> Yes, making it conditional is correct, but it is quite fishy even with
> the conditional.
> 
> As you mention, you could have a nested aio_poll() in the main thread,
> for example invoked from a bottom half, but in that case I'd rather
> track the caller that is creating the bottom half and see if it lacks a
> bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
> missing).

I went back to the commit where I first added the patch (it already
contained the ref/unref pair) and tried if I could reproduce a bug with
the pair removed. I couldn't.

I'm starting to think that maybe I was just overly cautious with the
ref/unref. I may have confused the nested aio_poll() crash with a
different situation. I've dealt with so many crashes and hangs while
working on this series that it's quite possible.

Kevin
Paolo Bonzini Sept. 14, 2018, 5:38 p.m. UTC | #7
On 14/09/2018 19:14, Kevin Wolf wrote:
>> As you mention, you could have a nested aio_poll() in the main thread,
>> for example invoked from a bottom half, but in that case I'd rather
>> track the caller that is creating the bottom half and see if it lacks a
>> bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
>> missing).
> I went back to the commit where I first added the patch (it already
> contained the ref/unref pair) and tried if I could reproduce a bug with
> the pair removed. I couldn't.
> 
> I'm starting to think that maybe I was just overly cautious with the
> ref/unref. I may have confused the nested aio_poll() crash with a
> different situation. I've dealt with so many crashes and hangs while
> working on this series that it's quite possible.

Are you going to drop the patch then?

Thanks,

Paolo
Kevin Wolf Sept. 17, 2018, 11:48 a.m. UTC | #8
Am 14.09.2018 um 19:38 hat Paolo Bonzini geschrieben:
> On 14/09/2018 19:14, Kevin Wolf wrote:
> >> As you mention, you could have a nested aio_poll() in the main thread,
> >> for example invoked from a bottom half, but in that case I'd rather
> >> track the caller that is creating the bottom half and see if it lacks a
> >> bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
> >> missing).
> > I went back to the commit where I first added the patch (it already
> > contained the ref/unref pair) and tried if I could reproduce a bug with
> > the pair removed. I couldn't.
> > 
> > I'm starting to think that maybe I was just overly cautious with the
> > ref/unref. I may have confused the nested aio_poll() crash with a
> > different situation. I've dealt with so many crashes and hangs while
> > working on this series that it's quite possible.
> 
> Are you going to drop the patch then?

I think I can drop the ref/unref pair, but not the whole patch (whose
main point is reordering dec_in_flight vs. the AIO callback).

Kevin
Paolo Bonzini Sept. 17, 2018, 12:38 p.m. UTC | #9
On 17/09/2018 13:48, Kevin Wolf wrote:
> Am 14.09.2018 um 19:38 hat Paolo Bonzini geschrieben:
>> On 14/09/2018 19:14, Kevin Wolf wrote:
>>>> As you mention, you could have a nested aio_poll() in the main thread,
>>>> for example invoked from a bottom half, but in that case I'd rather
>>>> track the caller that is creating the bottom half and see if it lacks a
>>>> bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
>>>> missing).
>>> I went back to the commit where I first added the patch (it already
>>> contained the ref/unref pair) and tried if I could reproduce a bug with
>>> the pair removed. I couldn't.
>>>
>>> I'm starting to think that maybe I was just overly cautious with the
>>> ref/unref. I may have confused the nested aio_poll() crash with a
>>> different situation. I've dealt with so many crashes and hangs while
>>> working on this series that it's quite possible.
>>
>> Are you going to drop the patch then?
> 
> I think I can drop the ref/unref pair, but not the whole patch (whose
> main point is reordering dec_in_flight vs. the AIO callback).

You're right, though I think I did that on purpose back in the day.
IIRC it was related to bdrv_drain, which might never complete if called
from an AIO callback.

Paolo
Kevin Wolf Sept. 17, 2018, 12:53 p.m. UTC | #10
Am 17.09.2018 um 14:38 hat Paolo Bonzini geschrieben:
> On 17/09/2018 13:48, Kevin Wolf wrote:
> > Am 14.09.2018 um 19:38 hat Paolo Bonzini geschrieben:
> >> On 14/09/2018 19:14, Kevin Wolf wrote:
> >>>> As you mention, you could have a nested aio_poll() in the main thread,
> >>>> for example invoked from a bottom half, but in that case I'd rather
> >>>> track the caller that is creating the bottom half and see if it lacks a
> >>>> bdrv_ref/bdrv_unref (or perhaps it's even higher in the tree that is
> >>>> missing).
> >>> I went back to the commit where I first added the patch (it already
> >>> contained the ref/unref pair) and tried if I could reproduce a bug with
> >>> the pair removed. I couldn't.
> >>>
> >>> I'm starting to think that maybe I was just overly cautious with the
> >>> ref/unref. I may have confused the nested aio_poll() crash with a
> >>> different situation. I've dealt with so many crashes and hangs while
> >>> working on this series that it's quite possible.
> >>
> >> Are you going to drop the patch then?
> > 
> > I think I can drop the ref/unref pair, but not the whole patch (whose
> > main point is reordering dec_in_flight vs. the AIO callback).
> 
> You're right, though I think I did that on purpose back in the day.
> IIRC it was related to bdrv_drain, which might never complete if called
> from an AIO callback.

Hm... This seems to become a common pattern, it's the same as for the
job completion callbacks (only improved enough for the bug at hand to
disappear instead of properly fixed in "blockjob: Lie better in
child_job_drained_poll()").

Either you say there is no activity even though there is still a
callback pending, then bdrv_drain() called from elsewhere will return
too early and we get a bug. Or you say there is activity, then any
nested drain inside that callback will deadlock and we get a bug, too.
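
To make the deadlock half of that concrete, consider a hypothetical
completion callback (illustration only, not code from this series) that
drains its own BlockBackend:

    #include "sysemu/block-backend.h"

    static void nested_drain_cb(void *opaque, int ret)
    {
        BlockBackend *blk = opaque;

        /* With this patch, the request that invokes this callback is still
         * counted in blk's in_flight.  blk_drain() polls until in_flight
         * drops to 0, so this call can only ever return if the drain code
         * somehow knows to ignore the request that is currently
         * completing. */
        blk_drain(blk);
    }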

So I suppose we need some way to know which activities to ignore during
drain, depending on who is the caller? :-/

Kevin
Paolo Bonzini Sept. 17, 2018, 3:59 p.m. UTC | #11
On 17/09/2018 14:53, Kevin Wolf wrote:
>>> I think I can drop the ref/unref pair, but not the whole patch (whose
>>> main point is reordering dec_in_flight vs. the AIO callback).
>>
>> You're right, though I think I did that on purpose back in the day.
>> IIRC it was related to bdrv_drain, which might never complete if called
>> from an AIO callback.
> 
> Hm... This seems to become a common pattern, it's the same as for the
> job completion callbacks (only improved enough for the bug at hand to
> disappear instead of properly fixed in "blockjob: Lie better in
> child_job_drained_poll()").
> 
> Either you say there is no activity even though there is still a
> callback pending, then bdrv_drain() called from elsewhere will return
> too early and we get a bug. Or you say there is activity, then any
> nested drain inside that callback will deadlock and we get a bug, too.
> 
> So I suppose we need some way to know which activities to ignore during
> drain, depending on who is the caller? :-/

Some alternatives:

1) anything that needs to invoke I/O for callbacks must use inc/dec
in_flight manually.  Simplest, but hard to assert.  At least SCSI and
IDE are broken.

2) callbacks cannot indirectly result in bdrv_drain.  Sounds easier
since there are not many AIO callers anymore - plus it also seems easier
to add debugging code for.

3) we provide device callbacks for "am I busy" and invoke them from
bdrv_drain's poll loop.

Just off the top of my head.  Not sure which is best.

Paolo
Kevin Wolf Sept. 17, 2018, 4:51 p.m. UTC | #12
Am 17.09.2018 um 17:59 hat Paolo Bonzini geschrieben:
> On 17/09/2018 14:53, Kevin Wolf wrote:
> >>> I think I can drop the ref/unref pair, but not the whole patch (whose
> >>> main point is reordering dec_in_flight vs. the AIO callback).
> >>
> >> You're right, though I think I did that on purpose back in the day.
> >> IIRC it was related to bdrv_drain, which might never complete if called
> >> from an AIO callback.
> > 
> > Hm... This seems to become a common pattern, it's the same as for the
> > job completion callbacks (only improved enough for the bug at hand to
> > disappear instead of properly fixed in "blockjob: Lie better in
> > child_job_drained_poll()").
> > 
> > Either you say there is no activity even though there is still a
> > callback pending, then bdrv_drain() called from elsewhere will return
> > too early and we get a bug. Or you say there is activity, then any
> > nested drain inside that callback will deadlock and we get a bug, too.
> > 
> > So I suppose we need some way to know which activities to ignore during
> > drain, depending on who is the caller? :-/
> 
> Some alternatives:
> 
> 1) anything that needs to invoke I/O for callbacks must use inc/dec
> in_flight manually.  Simplest, but hard to assert.  At least SCSI and
> IDE are broken.
> 
> 2) callbacks cannot indirectly result in bdrv_drain.  Sounds easier
> since there are not many AIO callers anymore - plus it also seems easier
> to add debugging code for.
> 
> 3) we provide device callbacks for "am I busy" and invoke them from
> bdrv_drain's poll loop.
> 
> Just off the top of my head.  Not sure which is best.

I don't see how 1) and 3) can solve the problem. You still need to
declare something busy when someone else drains (so the drain doesn't
return too early) and not busy when it calls a nested drain (so that it
doesn't deadlock).

2) would obviously solve the problem, but I'm afraid it's not realistic
when you consider that block jobs happen _only_ in callbacks. The fact that
much of this is disguised as coroutine code doesn't change the call chains.
You also can't use BHs to escape the AIO callback, because then the BH is
logically part of the operation that needs to be completed for an
external drain to return, so you would still have to call the job busy
and would deadlock in a nested drain in that BH. So you're back to
square one.

For devices, it might be more realistic (devices rarely have a reason to
drain a block node, much less in a request completion), but note that
"cannot indirectly result in a drain" forbids any kind of nested
aio_poll(). I'm not sure if we can audit all code to fulfill this
condition (or how to assert it in the code).

Kevin
Paolo Bonzini Sept. 17, 2018, 5:08 p.m. UTC | #13
On 17/09/2018 18:51, Kevin Wolf wrote:
> Am 17.09.2018 um 17:59 hat Paolo Bonzini geschrieben:
>> On 17/09/2018 14:53, Kevin Wolf wrote:
>>>>> I think I can drop the ref/unref pair, but not the whole patch (whose
>>>>> main point is reordering dec_in_flight vs. the AIO callback).
>>>>
>>>> You're right, though I think I did that on purpose back in the day.
>>>> IIRC it was related to bdrv_drain, which might never complete if called
>>>> from an AIO callback.
>>>
>>> Hm... This seems to become a common pattern, it's the same as for the
>>> job completion callbacks (only improved enough for the bug at hand to
>>> disappear instead of properly fixed in "blockjob: Lie better in
>>> child_job_drained_poll()").
>>>
>>> Either you say there is no activity even though there is still a
>>> callback pending, then bdrv_drain() called from elsewhere will return
>>> too early and we get a bug. Or you say there is activity, then any
>>> nested drain inside that callback will deadlock and we get a bug, too.
>>>
>>> So I suppose we need some way to know which activities to ignore during
>>> drain, depending on who is the caller? :-/
>>
>> Some alternatives:
>>
>> 1) anything that needs to invoke I/O for callbacks must use inc/dec
>> in_flight manually.  Simplest, but hard to assert.  At least SCSI and
>> IDE are broken.
>>
>> 2) callbacks cannot indirectly result in bdrv_drain.  Sounds easier
>> since there are not many AIO callers anymore - plus it also seems easier
>> to add debugging code for.
>>
>> 3) we provide device callbacks for "am I busy" and invoke them from
>> bdrv_drain's poll loop.
>>
>> Just off the top of my head.  Not sure which is best.
> 
> I don't see how 1) and 3) can solve the problem. You still need to
> declare something busy when someone else drains (so the drain doesn't
> return too early) and not busy when it calls a nested drain (so that it
> doesn't deadlock).
> 
> 2) would obviously solve the problem, but I'm afraid it's not realistic
> when you consider that block jobs happen _only_ in callbacks. The fact that
> much of this is disguised as coroutine code doesn't change the call chains.
> You also can't use BHs to escape the AIO callback, because then the BH is
> logically part of the operation that needs to be completed for an
> external drain to return, so you would still have to call the job busy
> and would deadlock in a nested drain in that BH. So you're back to
> square one.

But then basically the main issue is mirror.c's call to
bdrv_drained_begin/end.  There are no other calls to
bdrv_drained_begin/end inside coroutines IIRC.

Another long-standing idea is to replace aio_disable/enable_external
with implementations of the (currently unused) dev_ops callbacks
.drained_begin and .drained_end.  What if jobs used those callbacks to
pause themselves(*), and block/mirror.c had a pause point before the
call to bdrv_drained_begin?

	(*) of course block/mirror.c would need some smartness to
	    not pause itself when the job itself is asking to drain!
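
A naive sketch of the idea (hypothetical code, roughly as it would sit in
blockjob.c; the flag that tells a job-initiated drain apart is exactly the
smartness that is still missing):

    static void block_job_dev_drained_begin(void *opaque)
    {
        BlockJob *job = opaque;

        if (!job->drain_started_by_job) {   /* hypothetical flag */
            job_pause(&job->job);
        }
    }

    static void block_job_dev_drained_end(void *opaque)
    {
        BlockJob *job = opaque;

        if (!job->drain_started_by_job) {
            job_resume(&job->job);
        }
    }

    static const BlockDevOps block_job_dev_ops = {
        .drained_begin = block_job_dev_drained_begin,
        .drained_end   = block_job_dev_drained_end,
    };

    /* ...and at job creation:
     * blk_set_dev_ops(job->blk, &block_job_dev_ops, job);
     */
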
Paolo
Kevin Wolf Sept. 18, 2018, 11:34 a.m. UTC | #14
Am 17.09.2018 um 19:08 hat Paolo Bonzini geschrieben:
> On 17/09/2018 18:51, Kevin Wolf wrote:
> > Am 17.09.2018 um 17:59 hat Paolo Bonzini geschrieben:
> >> On 17/09/2018 14:53, Kevin Wolf wrote:
> >>>>> I think I can drop the ref/unref pair, but not the whole patch (whose
> >>>>> main point is reordering dec_in_flight vs. the AIO callback).
> >>>>
> >>>> You're right, though I think I did that on purpose back in the day.
> >>>> IIRC it was related to bdrv_drain, which might never complete if called
> >>>> from an AIO callback.
> >>>
> >>> Hm... This seems to become a common pattern, it's the same as for the
> >>> job completion callbacks (only improved enough for the bug at hand to
> >>> disappear instead of properly fixed in "blockjob: Lie better in
> >>> child_job_drained_poll()").
> >>>
> >>> Either you say there is no activity even though there is still a
> >>> callback pending, then bdrv_drain() called from elsewhere will return
> >>> too early and we get a bug. Or you say there is activity, then any
> >>> nested drain inside that callback will deadlock and we get a bug, too.
> >>>
> >>> So I suppose we need some way to know which activities to ignore during
> >>> drain, depending on who is the caller? :-/
> >>
> >> Some alternatives:
> >>
> >> 1) anything that needs to invoke I/O for callbacks must use inc/dec
> >> in_flight manually.  Simplest, but hard to assert.  At least SCSI and
> >> IDE are broken.
> >>
> >> 2) callbacks cannot indirectly result in bdrv_drain.  Sounds easier
> >> since there are not many AIO callers anymore - plus it also seems easier
> >> to add debugging code for.
> >>
> >> 3) we provide device callbacks for "am I busy" and invoke them from
> >> bdrv_drain's poll loop.
> >>
> >> Just off the top of my head.  Not sure which is best.
> > 
> > I don't see how 1) and 3) can solve the problem. You still need to
> > declare something busy when someone else drains (so the drain doesn't
> > return too early) and not busy when it calls a nested drain (so that it
> > doesn't deadlock).
> > 
> > 2) would obviously solve the problem, but I'm afraid it's not realistic
> > when you consider that block jobs happen _only_ in callbacks. The fact that
> > much of this is disguised as coroutine code doesn't change the call chains.
> > You also can't use BHs to escape the AIO callback, because then the BH is
> > logically part of the operation that needs to be completed for an
> > external drain to return, so you would still have to call the job busy
> > and would deadlock in a nested drain in that BH. So you're back to
> > square one.
> 
> But then basically the main issue is mirror.c's call to
> bdrv_drained_begin/end.  There are no other calls to
> bdrv_drained_begin/end inside coroutines IIRC.

Coroutine or not doesn't matter. What matters is that you drain inside
some (high-level) operation that already needs to be completed itself
for drain to return. This is the case for any block job completions that
make changes to the graph (which means everything except backup).

> Another long-standing idea is to replace aio_disable/enable_external
> with implementations of the (currently unused) dev_ops callbacks
> .drained_begin and .drained_end.  What if jobs used those callbacks to
> pause themselves(*), and block/mirror.c had a pause point before the
> call to bdrv_drained_begin?
> 
> 	(*) of course block/mirror.c would need some smartness to
> 	    not pause itself when the job itself is asking to drain!

Yes, exactly this required smartness is the kind of problem I've been
talking about here all the time!

Block jobs already implement .drained_begin/poll/end callbacks, they
just attach them directly to the BDS (where they are available as
BdrvChildRole callbacks) instead of a BB. And this is why they will
deadlock if .drained_poll() correctly returns true as long as the job
completion is still pending (or halfway done).

And to get back to the specific patch we're discussing, AIO request
completion in the BlockBackend needs the same kind of smartness so that
it doesn't deadlock, but still makes other callers of drain wait for it.

I still don't know what this smartness should look like
implementation-wise, though.

Kevin
Paolo Bonzini Sept. 18, 2018, 2:12 p.m. UTC | #15
On 18/09/2018 13:34, Kevin Wolf wrote:
>> But then basically the main issue is mirror.c's call to
>> bdrv_drained_begin/end.  There are no other calls to
>> bdrv_drained_begin/end inside coroutines IIRC.
>
> Coroutine or not doesn't matter. What matters is that you drain inside
> some (high-level) operation that already needs to be completed itself
> for drain to return.

Indeed.  However, are there any calls to bdrv_drained_begin/end that are
inside inc_in_flight/dec_in_flight, and are not inside coroutines?
(Sorry if I don't see the high-level issue, I'm more familiar with the
low-level functioning...).

It seems to me that if we fixed the mirror bdrv_(co_)drained_begin case,
then it would be good enough.

Paolo

> This is the case for any block job completions that
> make changes to the graph (which means everything except backup).
Kevin Wolf Sept. 18, 2018, 2:56 p.m. UTC | #16
Am 18.09.2018 um 16:12 hat Paolo Bonzini geschrieben:
> On 18/09/2018 13:34, Kevin Wolf wrote:
> >> But then basically the main issue is mirror.c's call to
> >> bdrv_drained_begin/end.  There are no other calls to
> >> bdrv_drained_begin/end inside coroutines IIRC.
> >
> > Coroutine or not doesn't matter. What matters is that you drain inside
> > some (high-level) operation that already needs to be completed itself
> > for drain to return.
> 
> Indeed.  However, are there any calls to bdrv_drained_begin/end that are
> inside inc_in_flight/dec_in_flight, and are not inside coroutines?
> (Sorry if I don't see the high-level issue, I'm more familiar with the
> low-level functioning...).

inc_in_flight is only one thing that signals pending activity.
Essentially, what we're interested in is everything that will make
bdrv_drain_poll() return true.

For a BDS that is in use by a block job, that BDS will be considered
busy as long as the job hasn't paused yet, because otherwise the job can
still issue new requests to the BDS (see child_job_drained_poll()).
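
Roughly, the check in child_job_drained_poll() looks like this
(paraphrased sketch; the exact code in the tree has a few more details):

    static bool child_job_drained_poll(BdrvChild *c)
    {
        BlockJob *bjob = c->opaque;
        Job *job = &bjob->job;

        /* A paused or completed job cannot issue new requests any more,
         * so the node counts as quiescent for this parent; anything else
         * is pending activity and keeps the drain polling. */
        if (!job->busy || job_is_completed(job)) {
            return false;
        }
        return true;
    }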

A closely related fix in this series is "blockjob: Lie better in
child_job_drained_poll()". Jobs need to make bdrv_drain_poll() return
true for their BDSes even between returning from .run (called .start
before John's recent series) and .prepare/.commit/.abort (which used to
be the deferred_to_main_loop part), because otherwise drain would return
while the completion BH is still pending, and that BH would then change
the graph in the middle of a drained section, which is obviously a big
problem.

This is not purely theoretical, but a crash we observed in practice:
bdrv_reopen() suffered use-after-free crashes because after its initial
bdrv_subtree_drained_begin(), a node was removed from the graph in a
block job completion, but it still was in the ReopenQueue. (IIRC the
reopen belonged to one job and the completion to a second job that
completed at the same time.)

So there is no choice here: The bdrv_subtree_drained_begin() in reopen
_must_ make sure that the block job is really quiescent and no callbacks
for it are pending any more.

At the same time, the completion callbacks of the job (.prepare/.commit/
.abort) use drain themselves because they contain reopens and perform
graph reconfigurations. In this case, we don't want to wait for the
completion callback to finish because that's a deadlock.

Whether we want to wait on the job doesn't depend on where in its code
the job currently is. It only depends on whether the drain is issued by
the job itself or by someone else. This is what makes it hard to return
the right thing in child_job_drained_poll() because I don't see an easy
way to know the caller.

Just setting busy = false temporarily while the job calls
bdrv_drained_begin() would be wrong: if another thread concurrently
drained the same BDS, that drain would still have to wait for the job!
With the job in the exact same state, child_job_drained_poll() has to
return different values for the two threads.
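
As a pseudo-timeline of that race (illustration only, not code):

    /*
     *   job thread (in its completion cb)     other thread
     *   ---------------------------------     --------------------------
     *   busy = false          (hypothetical)
     *   bdrv_drained_begin(bs)                bdrv_drained_begin(bs)
     *     -> must not wait for the job          -> must keep waiting until
     *        (or it deadlocks)                     the job is quiescent
     */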

> It seems to me that if we fixed the mirror bdrv_(co_)drained_begin
> case, then it would be good enough.

It's only one special case of the problem. If we could solve the problem
for the other job types, this one could be fixed the same way.

Kevin
Paolo Bonzini Sept. 19, 2018, 4:57 p.m. UTC | #17
On 18/09/2018 16:56, Kevin Wolf wrote:
> Am 18.09.2018 um 16:12 hat Paolo Bonzini geschrieben:
>> On 18/09/2018 13:34, Kevin Wolf wrote:
>>>> But then basically the main issue is mirror.c's call to
>>>> bdrv_drained_begin/end.  There are no other calls to
>>>> bdrv_drained_begin/end inside coroutines IIRC.
>>>
>>> Coroutine or not doesn't matter. What matters is that you drain inside
>>> some (high-level) operation that already needs to be completed itself
>>> for drain to return.
>>
>> Indeed.  However, are there any calls to bdrv_drained_begin/end that are
>> inside inc_in_flight/dec_in_flight, and are not inside coroutines?
>> (Sorry if I don't see the high-level issue, I'm more familiar with the
>> low-level functioning...).
> 
> inc_in_flight is only one thing that signals pending activity.
> Essentially, what we're interested in is everything that will make
> bdrv_drain_poll() return true.
> 
> For a BDS that is in use by a block job, that BDS will be considered
> busy as long as the job hasn't paused yet, because otherwise the job can
> still issue new requests to the BDS (see child_job_drained_poll()).
> 
> So there is no choice here: The bdrv_subtree_drained_begin() in reopen
> _must_ make sure that the block job is really quiescent and no callbacks
> for it are pending any more.
> 
> At the same time, the completion callbacks of the job (.prepare/.commit/
> .abort) use drain themselves because they contain reopens and perform
> graph reconfigurations. In this case, we don't want to wait for the
> completion callback to finish because that's a deadlock.
> 
> Whether we want to wait on the job doesn't depend on where in its code
> the job currently is. It only depends on whether the drain is issued by
> the job itself or by someone else. This is what makes it hard to return
> the right thing in child_job_drained_poll() because I don't see an easy
> way to know the caller.

I see.  But anyway, I now think your patch is correct (without the
ref/unref) as long as the AIO callback does not call drain directly:

- calling it through a coroutine is okay, because then
bdrv_drained_begin goes through bdrv_co_yield_to_drain and you have
in_flight=2 when bdrv_co_yield_to_drain yields, then soon in_flight=1
when the aio_co_wake in the AIO callback completes, then in_flight=0
after the bottom half starts.

- calling it through a bottom half would be okay too, as long as the AIO
callback remembers to do inc_in_flight/dec_in_flight just like
bdrv_co_yield_to_drain and bdrv_co_drain_bh_cb do

A few more important cases that come to mind:

- a coroutine that yields because of I/O is okay, with a sequence
similar to bdrv_co_yield_to_drain

- a coroutine that yields with no I/O pending will correctly decrease
in_flight to zero before yielding

- calling more AIO from the callback won't overflow the counter just
because of mutual recursion, because AIO functions always yield at least
once before invoking the callback.

Paolo

Patch

diff --git a/block/block-backend.c b/block/block-backend.c
index 1b2d7a6ff5..2efaf0c968 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1338,8 +1338,16 @@  static const AIOCBInfo blk_aio_em_aiocb_info = {
 static void blk_aio_complete(BlkAioEmAIOCB *acb)
 {
     if (acb->has_returned) {
-        blk_dec_in_flight(acb->rwco.blk);
+        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
+            /* If we are in the main thread, the callback is allowed to unref
+             * the BlockBackend, so we have to hold an additional reference */
+            blk_ref(acb->rwco.blk);
+        }
         acb->common.cb(acb->common.opaque, acb->rwco.ret);
+        blk_dec_in_flight(acb->rwco.blk);
+        if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
+            blk_unref(acb->rwco.blk);
+        }
         qemu_aio_unref(acb);
     }
 }