[RFC,0/9] block: Rewrite block drain begin/end

Message ID 20171129144956.11409-1-famz@redhat.com

Message

Fam Zheng Nov. 29, 2017, 2:49 p.m. UTC
While we look at the fixes for 2.11, I briefly prototyped this series to see if
it makes sense as a simplification of the drain API for 2.12.

The idea is to let AioContext manage quiesce callbacks, then the block layer
only needs to do the in-flight request waiting. This lets us get rid of the
callback recursion (both up and down).
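
For an idea of the shape of the new API, the AioContext side would look
roughly like this (aio_context_drained_begin/end come from the patch titles
below; the other names are guessed, so the actual patches may differ):

    /* include/block/aio.h -- sketch only, needs "qemu/queue.h" */
    typedef struct AioDrainOps AioDrainOps;
    struct AioDrainOps {
        /* Stop this user from submitting new requests. */
        void (*drained_begin)(void *opaque);
        /* Let it submit requests again. */
        void (*drained_end)(void *opaque);
        void *opaque;
        bool is_begun;      /* only call drained_end after drained_begin */
        QLIST_ENTRY(AioDrainOps) next;
    };

    void aio_context_add_drain_ops(AioContext *ctx, AioDrainOps *ops);
    void aio_context_del_drain_ops(AioContext *ctx, AioDrainOps *ops);

    /* Walk the registered ops; no recursion over the BDS graph. */
    void aio_context_drained_begin(AioContext *ctx);
    void aio_context_drained_end(AioContext *ctx);

Block jobs, throttle and qed would register their ops when they are attached
to a context and unregister them when detached.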

Many commit logs and code comments are definitely missing, but it would be good
to get high level review and maybe some testing already.

Fam

Fam Zheng (9):
  block: Remove unused bdrv_requests_pending
  aio: Add drain begin/end API to AioContext
  blockjob: Implement AioContext drain ops
  throttle: Implement AioContext drain ops
  qed: Implement AioContext drain ops
  block: Use aio_context_drained_begin in bdrv_set_aio_context
  block: Switch to use AIO drained begin/end API
  block: Drop old drained_{begin,end} callbacks
  blockjob: Drop unused functions

 block.c                        |  30 +--------
 block/block-backend.c          |  22 -------
 block/io.c                     | 134 +++--------------------------------------
 block/qed.c                    |  34 ++++++++---
 block/throttle.c               |  34 ++++++++---
 blockjob.c                     |  67 ++++++++-------------
 include/block/aio.h            |  27 ++++++++-
 include/block/block.h          |  16 -----
 include/block/block_int.h      |  12 ----
 include/block/blockjob_int.h   |  14 -----
 include/sysemu/block-backend.h |   8 ---
 util/async.c                   |  73 ++++++++++++++++++++++
 12 files changed, 186 insertions(+), 285 deletions(-)

Comments

Kevin Wolf Nov. 29, 2017, 5:25 p.m. UTC | #1
On 29.11.2017 at 15:49, Fam Zheng wrote:
> While we look at the fixes for 2.11, I briefly prototyped this series
> to see if it makes sense as a simplification of the drain API for
> 2.12.
> 
> The idea is to let AioContext manage quiesce callbacks, then the block
> layer only needs to do the in-flight request waiting. This lets us get
> rid of the callback recursion (both up and down).

So essentially you don't drain individual nodes any more, but whole
AioContexts. I have a feeling that this would be a step in the wrong
direction.

Not only would it completely bypass the path I/O requests take and
potentially drain a lot more than is actually necessary, but it also
requires that all nodes that are connected in a tree are in the same
AioContext.

Paolo can say more on this, but my understanding was that the long-term
plan is to make the block layer fully thread safe so that any thread
could call things on any node. I remember Paolo saying that AioContext
was even fully going away in the future. I don't see how this is
compatible with your approach.

And finally, I don't really think that the recursion is even a problem.
The problem is with graph changes made by callbacks that drain allows to
run. With your changes, it might be a bit easier to avoid that
bdrv_drain() itself gets into trouble due to graph changes, but this
doesn't solve the problem for any (possibly indirect) callers of
bdrv_drain().

Kevin
Fam Zheng Nov. 30, 2017, 2:03 a.m. UTC | #2
On Wed, 11/29 18:25, Kevin Wolf wrote:
> On 29.11.2017 at 15:49, Fam Zheng wrote:
> > While we look at the fixes for 2.11, I briefly prototyped this series
> > to see if it makes sense as a simplification of the drain API for
> > 2.12.
> > 
> > The idea is to let AioContext manage quiesce callbacks, then the block
> > layer only needs to do the in-flight request waiting. This lets us get
> > rid of the callback recursion (both up and down).
> 
> So essentially you don't drain individual nodes any more, but whole
> AioContexts. I have a feeling that this would be a step in the wrong
> direction.
> 
> Not only would it completely bypass the path I/O requests take and
> potentially drain a lot more than is actually necessary, but it also
> requires that all nodes that are connected in a tree are in the same
> AioContext.

Yeah, good point. Initially I wanted to introduce a BlockGraph object which
manages the per-graph draining (i.e. where to register the drain callbacks),
but I felt lazy and used AioContext.

Will that make it better?  BlockGraph would be a proper abstraction and will not
limit the API to one AioContext per tree.

> 
> And finally, I don't really think that the recursion is even a problem.
> The problem is with graph changes made by callbacks that drain allows to
> run. With your changes, it might be a bit easier to avoid that
> bdrv_drain() itself gets into trouble due to graph changes, but this
> doesn't solve the problem for any (possibly indirect) callers of
> bdrv_drain().

The recursion is the one big place that can be easily broken by graph changes;
fixing this doesn't make the situation any worse. We could still fix the
indirect callers by taking references or by introducing "ubiquitous coroutines".

Fam
Kevin Wolf Nov. 30, 2017, 10:31 a.m. UTC | #3
On 30.11.2017 at 03:03, Fam Zheng wrote:
> On Wed, 11/29 18:25, Kevin Wolf wrote:
> > On 29.11.2017 at 15:49, Fam Zheng wrote:
> > > While we look at the fixes for 2.11, I briefly prototyped this series
> > > to see if it makes sense as a simplification of the drain API for
> > > 2.12.
> > > 
> > > The idea is to let AioContext manage quiesce callbacks, then the block
> > > layer only needs to do the in-flight request waiting. This lets us get
> > > rid of the callback recursion (both up and down).
> > 
> > So essentially you don't drain individual nodes any more, but whole
> > AioContexts. I have a feeling that this would be a step in the wrong
> > direction.
> > 
> > Not only would it completely bypass the path I/O requests take and
> > potentially drain a lot more than is actually necessary, but it also
> > requires that all nodes that are connected in a tree are in the same
> > AioContext.
> 
> Yeah, good point. Initially I wanted to introduce a BlockGraph object
> which manages the per-graph draining (i.e. where to register the
> drain callbacks), but I felt lazy and used AioContext.
> 
> Will that make it better?  BlockGraph would be a proper abstraction
> and will not limit the API to one AioContext per tree.

There is only a single graph, so this would mean going back to global
bdrv_drain_all() exclusively.

What you really mean is probably connected components in the graph, but
do we really want to manage merging and splitting objects representing
connected components when a node is added or removed from the graph?
Especially when that graph change occurs in a drain callback?

You can also still easily introduce bugs where graph changes during a
drain end up with nodes not being drained or drained twice, where you
still access the next pointer of a deleted node, or where you
accidentally switch to draining a different component.

It's probably possible to get this right, but essentially you're just
switching from iterating a tree to iterating a list. You get roughly the
same set of problems that you have to consider as today, and getting it
right should be about the same difficulty.

> > And finally, I don't really think that the recursion is even a problem.
> > The problem is with graph changes made by callbacks that drain allows to
> > run. With your changes, it might be a bit easier to avoid that
> > bdrv_drain() itself gets into trouble due to graph changes, but this
> > doesn't solve the problem for any (possibly indirect) callers of
> > bdrv_drain().
> 
> The recursion is the one big place that can be easily broken by graph changes;
> fixing this doesn't make the situation any worse. We could still fix the
> indirect callers by taking references or by introducing "ubiquitous coroutines".

But hiding a bug in 80% of the cases where it shows isn't enough.

I think the only real solution is to forbid graph changes until after
any critical operation has completed. I haven't tried it out in
practice, but I suppose we could use a CoMutex around them and take it
in bdrv_drained_begin/end() and all other places that can get into
trouble with graph changes.
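
Just to illustrate the direction (completely untested, all names made up):

    #include "qemu/coroutine.h"

    /* One global CoMutex serializing graph changes against drained
     * sections; qemu_co_mutex_init() would be called at startup. */
    static CoMutex bdrv_graph_mutex;

    /* Every operation that rewires the graph takes the lock... */
    static void coroutine_fn bdrv_co_change_child(BdrvChild *child,
                                                  BlockDriverState *new_bs)
    {
        qemu_co_mutex_lock(&bdrv_graph_mutex);
        /* ... detach the old node, attach new_bs ... */
        qemu_co_mutex_unlock(&bdrv_graph_mutex);
    }

    /* ... and bdrv_drained_begin/end and the other critical sections
     * would hold it around their polling loops, so that no graph change
     * can slip in while they iterate over nodes. */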

Kevin
Fam Zheng Nov. 30, 2017, 2:34 p.m. UTC | #4
On Thu, 11/30 11:31, Kevin Wolf wrote:
> What you really mean is probably connected components in the graph, but
> do we really want to manage merging and splitting objects representing
> connected components when a node is added or removed from the graph?
> Especially when that graph change occurs in a drain callback?
> 
> You can also still easily introduce bugs where graph changes during a
> drain end up with nodes not being drained or drained twice, where you
> still access the next pointer of a deleted node, or where you
> accidentally switch to draining a different component.
> 
> It's probably possible to get this right, but essentially you're just
> switching from iterating a tree to iterating a list. You get roughly the
> same set of problems that you have to consider as today, and getting it
> right should be about the same difficulty.

Not quite. The essential idea is to redo the drain begin/end in a correct way:

bdrv_drained_begin(bs):
    1.a) stop all sources of new requests in the connected components, including
        * aio_disable_external()
        * stop QED timer
        * stop block job
    1.b) aio_poll() until:
        * no request is in flight for bs
        * no progress is made (any progress may generate new requests)

bdrv_drained_end(bs):
    2.a) resume stopped sources of new requests
    2.b) no need to do aio_poll() because the main loop will move on

1.a) can be done either with a graph recursion like now, or with a list
traversal if the callbacks are managed somewhere, like in this series. Or
better, BlockGraph would be the manager. The rule is that these operations
will not cause graph changes, so it's an improvement.

1.b) doesn't need recursion IMO; only looking at bs->in_flight and the return
value of aio_poll() should be enough.

For 2.a), if the drain begin/end callbacks are managed in a list, it's easy to
only call drained_end() for entries that got drained_begin() earlier, as shown
in patch 2.
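
In code, the intended flow is roughly this (a sketch of the intent, not
literally what the series does; aio_context_drained_begin/end are the
AioContext-level hooks named in patch 6):

    void bdrv_drained_begin(BlockDriverState *bs)
    {
        AioContext *ctx = bdrv_get_aio_context(bs);
        bool progress;

        /* 1.a) stop all sources of new requests via the registered
         * callbacks (aio_disable_external, QED timer, block jobs, ...).
         * No graph recursion, and no graph changes allowed here. */
        aio_context_drained_begin(ctx);

        /* 1.b) poll until bs is quiet and no more progress is made. */
        do {
            progress = aio_poll(ctx, atomic_read(&bs->in_flight) > 0);
        } while (progress || atomic_read(&bs->in_flight) > 0);
    }

    void bdrv_drained_end(BlockDriverState *bs)
    {
        /* 2.a) let the stopped request sources resume; 2.b) no polling
         * needed, the main loop will move on by itself. */
        aio_context_drained_end(bdrv_get_aio_context(bs));
    }

(This sketch assumes the caller runs in the thread that owns ctx; the real
code would need the same cross-AioContext handling as BDRV_POLL_WHILE today.)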

> 
> > > And finally, I don't really think that the recursion is even a problem.
> > > The problem is with graph changes made by callbacks that drain allows to
> > > run. With your changes, it might be a bit easier to avoid that
> > > bdrv_drain() itself gets into trouble due to graph changes, but this
> > > doesn't solve the problem for any (possibly indirect) callers of
> > > bdrv_drain().
> > 
> > The recursion is the one big place that can be easily broken by graph changes;
> > fixing this doesn't make the situation any worse. We could still fix the
> > indirect callers by taking references or by introducing "ubiquitous coroutines".
> 
> But hiding a bug in 80% of the cases where it shows isn't enough.

I think they are separate bugs. And with the one that this series is fixing,
the others (the bdrv_drain*() callers' bugs) may not even show up, because
bdrv_drained_begin() has already crashed.

> 
> I think the only real solution is to forbid graph changes until after
> any critical operation has completed. I haven't tried it out in
> practice, but I suppose we could use a CoMutex around them and take it
> in bdrv_drained_begin/end() and all other places that can get into
> trouble with graph changes.

Yes, I agree, but that (using CoMutex around graph change) requires that
everything, especially the defer_to_main_loop_bh, runs in a coroutine context,
which is exactly what I mean by "introducing 'ubiquitous coroutines'", because
currently we don't have them.

Fam
Kevin Wolf Nov. 30, 2017, 3:10 p.m. UTC | #5
On 30.11.2017 at 15:34, Fam Zheng wrote:
> On Thu, 11/30 11:31, Kevin Wolf wrote:
> > What you really mean is probably connected components in the graph, but
> > do we really want to manage merging and splitting objects representing
> > connected components when a node is added or removed from the graph?
> > Especially when that graph change occurs in a drain callback?
> > 
> > You can also still easily introduce bugs where graph changes during a
> > drain end up with nodes not being drained or drained twice, where you
> > still access the next pointer of a deleted node, or where you
> > accidentally switch to draining a different component.
> > 
> > It's probably possible to get this right, but essentially you're just
> > switching from iterating a tree to iterating a list. You get roughly the
> > same set of problems that you have to consider as today, and getting it
> > right should be about the same difficulty.
> 
> Not quite. The essential idea is to redo the drain begin/end in a correct way:
> 
> bdrv_drained_begin(bs):
>     1.a) stop all sources of new requests in the connected components, including
>         * aio_disable_external()
>         * stop QED timer
>         * stop block job
>     1.b) aio_poll() until:
>         * no request is in flight for bs
>         * no progress is made (any progress may generate new requests)
> 
> bdrv_drained_end(bs):
>     2.a) resume stopped sources of new requests
>     2.b) no need to do aio_poll() because the main loop will move on
> 
> 1.a) can be done either with a graph recursion like now, or with a list
> traversal if the callbacks are managed somewhere, like in this series. Or
> better, BlockGraph would be the manager. The rule is that these operations
> will not cause graph changes, so it's an improvement.
> 
> 1.b) doesn't need recursion IMO; only looking at bs->in_flight and the return
> value of aio_poll() should be enough.

The trouble is with 1.b), which can cause graph changes, including nodes
being added to or removed from a connected component. This means that
you can get nodes in the component for which 1.a) hasn't been executed,
or nodes that are outside the component, but for which 1.a) has still
been executed.

You then need to compensate for that somehow, and basically execute 1.a)
for any new nodes (otherwise you have nodes in the component that could
still be issuing requests; skipping drained_end for them is not enough)
and execute 2.*) for the removed ones.

> For 2.a), if the drain begin/end callbacks are managed in a list, it's
> easy to only call drained_end() for entries that got drained_begin()
> earlier, as shown in patch 2.

Yes, but not draining them was a bug in the first place.

I'll give you that trying to separate 1.a) and 1.b), so that we have only
a single BDRV_POLL_WHILE() after the most critical code, is an
interesting idea. But as long as 1.b) can involve graph changes, it
remains quite tricky.

And of course, it can't solve the problem with graph changes in callers
of bdrv_drained_begin/end because at least one BDRV_POLL_WHILE() will
stay there that can cause graph changes.

> > > > And finally, I don't really think that the recursion is even a problem.
> > > > The problem is with graph changes made by callbacks that drain allows to
> > > > run. With your changes, it might be a bit easier to avoid that
> > > > bdrv_drain() itself gets into trouble due to graph changes, but this
> > > > doesn't solve the problem for any (possibly indirect) callers of
> > > > bdrv_drain().
> > > 
> > > The recursion is the one big place that can be easily broken by graph changes;
> > > fixing this doesn't make the situation any worse. We could still fix the
> > > indirect callers by taking references or by introducing "ubiquitous coroutines".
> > 
> > But hiding a bug in 80% of the cases where it shows isn't enough.
> 
> I think they are separate bugs. And with the one that this series is fixing,
> the others (the bdrv_drain*() callers' bugs) may not even show up, because
> bdrv_drained_begin() has already crashed.
> 
> > 
> > I think the only real solution is to forbid graph changes until after
> > any critical operation has completed. I haven't tried it out in
> > practice, but I suppose we could use a CoMutex around them and take it
> > in bdrv_drained_begin/end() and all other places that can get into
> > trouble with graph changes.
> 
> Yes, I agree, but that (using CoMutex around graph change) requires that
> everything, especially the defer_to_main_loop_bh, runs in a coroutine
> context, which is exactly what I mean by "introducing 'ubiquitous
> coroutines'", because currently we don't have them.

Is it hard to do, though? Instead of using a BH to switch to the main
loop and outside of coroutine context, you could use aio_co_schedule()
and yield, which would leave you in the main loop, but still in
coroutine context.
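
I.e. something like this at the end of the job coroutine (sketch):

    /* instead of scheduling the defer_to_main_loop_bh: */
    aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
    qemu_coroutine_yield();
    /* now in the main loop, still in coroutine context */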

Would this have any bad side effects I'm not aware of?

I'm not sure about other places, of course. We'd have to check that.

Kevin
Paolo Bonzini Nov. 30, 2017, 4:04 p.m. UTC | #6
On 30/11/2017 16:10, Kevin Wolf wrote:
>> Yes, I agree, but that (using CoMutex around graph change) requires that
>> everything, especially the defer_to_main_loop_bh, runs in a coroutine
>> context, which is exactly what I mean by "introducing 'ubiquitous
>> coroutines'", because currently we don't have them.
> Is it hard to do, though? Instead of using a BH to switch to the main
> loop and outside of coroutine context, you could use aio_co_schedule()
> and yield, which would leave you in the main loop, but still in
> coroutine context.

Not that I can think of, but just aio_co_schedule wouldn't work, because
"the coroutine must have yielded unless ctx is the context in which the
coroutine is running (i.e. the value of qemu_get_current_aio_context()
from the coroutine itself)".

So you'd have to use a bottom half that calls aio_co_schedule.  But that
would work.
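
Something like this, i.e. schedule the BH in the coroutine's own context so
that it can only run after the yield (hypothetical helper, untested):

    typedef struct {
        Coroutine *co;
    } MoveToMainLoop;

    static void move_to_main_loop_bh(void *opaque)
    {
        MoveToMainLoop *data = opaque;

        /* Runs in the coroutine's own AioContext, after it has yielded,
         * so the aio_co_schedule() contract is satisfied. */
        aio_co_schedule(qemu_get_aio_context(), data->co);
    }

    static void coroutine_fn move_to_main_loop(void)
    {
        MoveToMainLoop data = { .co = qemu_coroutine_self() };

        aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
                                move_to_main_loop_bh, &data);
        qemu_coroutine_yield();
        /* From here on we run in the main loop, still in a coroutine. */
    }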

Paolo

> Would this have any bad side effects I'm not aware of?
Fam Zheng Dec. 1, 2017, 9:51 a.m. UTC | #7
On Thu, 11/30 17:04, Paolo Bonzini wrote:
> On 30/11/2017 16:10, Kevin Wolf wrote:
> >> Yes, I agree, but that (using CoMutex around graph change) requires that
> >> everything, especially the defer_to_main_loop_bh, runs in a coroutine
> >> context, which is exactly what I mean by "introducing 'ubiquitous
> >> coroutines'", because currently we don't have them.
> > Is it hard to do, though? Instead of using a BH to switch to the main
> > loop and outside of coroutine context, you could use aio_co_schedule()
> > and yield, which would leave you in the main loop, but still in
> > coroutine context.
> 
> Not that I can think of, but just aio_co_schedule wouldn't work, because
> "the coroutine must have yielded unless ctx is the context in which the
> coroutine is running (i.e. the value of qemu_get_current_aio_context()
> from the coroutine itself)".
> 
> So you'd have to use a bottom half that calls aio_co_schedule.  But that
> would work.
> 

We have QMP commands that can manipulate the graph, none of which run in
coroutines. I think running QMP commands in coroutines has its merits,
especially regarding the nested event loops.

Also, bdrv_close_all() and similar calls at the end of main() do draining too,
which I'm not sure how to deal with. Maybe special-case them and forget about
the draining CoMutex?

Fam
Kevin Wolf Dec. 1, 2017, 12:24 p.m. UTC | #8
On 01.12.2017 at 10:51, Fam Zheng wrote:
> On Thu, 11/30 17:04, Paolo Bonzini wrote:
> > On 30/11/2017 16:10, Kevin Wolf wrote:
> > >> Yes, I agree, but that (using CoMutex around graph change) requires that
> > >> everything, especially the defer_to_main_loop_bh, runs in a coroutine
> > >> context, which is exactly what I mean by "introducing 'ubiquitous
> > >> coroutines'", because currently we don't have them.
> > > Is it hard to do, though? Instead of using a BH to switch to the main
> > > loop and outside of coroutine context, you could use aio_co_schedule()
> > > and yield, which would leave you in the main loop, but still in
> > > coroutine context.
> > 
> > Not that I can think of, but just aio_co_schedule wouldn't work, because
> > "the coroutine must have yielded unless ctx is the context in which the
> > coroutine is running (i.e. the value of qemu_get_current_aio_context()
> > from the coroutine itself)".
> > 
> > So you'd have to use a bottom half that calls aio_co_schedule.  But that
> > would work.
> 
> We have QMP commands that can manipulate the graph, none of which run in
> coroutines. I think running QMP commands in coroutines has its merits,
> especially regarding the nested event loops.
> 
> Also, bdrv_close_all() and similar calls at the end of main() do draining too,
> which I'm not sure how to deal with. Maybe special-case them and forget about
> the draining CoMutex?

All of these cases have in common that they are the outermost layer with
respect to the block subsystem. This means that a nested event loop
there should be harmless because the callbacks called by it won't
influence callers further up in the call stack.

Kevin