
[07/16] block: change drain to look only at one child at a time

Message ID 1455645388-32401-8-git-send-email-pbonzini@redhat.com
State New

Commit Message

Paolo Bonzini Feb. 16, 2016, 5:56 p.m. UTC
bdrv_requests_pending is checking children to also wait until internal
requests (such as metadata writes) have completed.  However, checking
children is in general overkill because, apart from this special case,
the parent's in_flight count will always be incremented by at least one
for every request in the child.

Since internal requests in a child are only generated by its parent,
instead visit the tree parent-first, and then wait for internal I/O in
the children to complete.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Comments

Fam Zheng March 9, 2016, 3:41 a.m. UTC | #1
On Tue, 02/16 18:56, Paolo Bonzini wrote:
> bdrv_requests_pending is checking children to also wait until internal
> requests (such as metadata writes) have completed.  However, checking
> children is in general overkill because, apart from this special case,
> the parent's in_flight count will always be incremented by at least one
> for every request in the child.

While the patch looks good, I'm not sure I understand this: aren't the
children's requests generated by the children themselves?  How does the
parent's in_flight count get incremented for them?

> 
> Since internal requests in a child are only generated by its parent,
> instead visit the tree parent-first, and then wait for internal I/O in
> the children to complete.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index a9a23a6..e0c9215 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>      }
>  }
>  
> +static bool bdrv_drain_io_recurse(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +    bool waited = false;
> +
> +    while (atomic_read(&bs->in_flight) > 0) {
> +        aio_poll(bdrv_get_aio_context(bs), true);
> +        waited = true;
> +    }
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        waited |= bdrv_drain_io_recurse(child->bs);
> +    }
> +
> +    return waited;
> +}
> +
>  /*
>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>   * and suspend block driver's internal I/O until next request arrives.
> @@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
>      bdrv_no_throttling_begin(bs);
>      bdrv_io_unplugged_begin(bs);
>      bdrv_drain_recurse(bs);
> -    while (bdrv_requests_pending(bs)) {
> -        /* Keep iterating */
> -         aio_poll(bdrv_get_aio_context(bs), true);
> -    }
> +    bdrv_drain_io_recurse(bs);
>      bdrv_io_unplugged_end(bs);
>      bdrv_no_throttling_end(bs);
>  }
> @@ -319,10 +333,7 @@ void bdrv_drain_all(void)
>              aio_context_acquire(aio_context);
>              while ((bs = bdrv_next(bs))) {
>                  if (aio_context == bdrv_get_aio_context(bs)) {
> -                    if (bdrv_requests_pending(bs)) {
> -                        aio_poll(aio_context, true);
> -                        waited = true;
> -                    }
> +                    waited |= bdrv_drain_io_recurse(bs);
>                  }
>              }
>              aio_context_release(aio_context);
> -- 
> 2.5.0
> 
> 
>
Paolo Bonzini March 9, 2016, 7:49 a.m. UTC | #2
On 09/03/2016 04:41, Fam Zheng wrote:
> > bdrv_requests_pending is checking children to also wait until internal
> > requests (such as metadata writes) have completed.  However, checking
> > children is in general overkill because, apart from this special case,
> > the parent's in_flight count will always be incremented by at least one
> > for every request in the child.
> 
> While the patch looks good, I'm not sure I understand this: aren't the
> children's requests generated by the children themselves?  How does the
> parent's in_flight count get incremented for them?

What I mean is that there are two cases of bs generating I/O on
bs->file->bs:

1) writes caused by an operation on bs, e.g. a bdrv_aio_write to bs.  In
this case, the bdrv_aio_write increments bs's in_flight count.

2) asynchronous metadata writes or flushes.  Such writes can be started
even if bs's in_flight count is zero, but not after the .bdrv_drain
callback has been invoked.  So the correct sequence is:

    - finish current I/O

    - call bdrv_drain callback

    - recurse on children

This is what is needed for QED's callback to work correctly, and I'll
implement it this way in v2.

Paolo
Stefan Hajnoczi March 16, 2016, 4:39 p.m. UTC | #3
On Tue, Feb 16, 2016 at 06:56:19PM +0100, Paolo Bonzini wrote:
> bdrv_requests_pending is checking children to also wait until internal
> requests (such as metadata writes) have completed.  However, checking
> children is in general overkill because, apart from this special case,
> the parent's in_flight count will always be incremented by at least one
> for every request in the child.
> 
> Since internal requests in a child are only generated by its parent,
> instead visit the tree parent-first, and then wait for internal I/O in
> the children to complete.

This assumption holds if the BDS graph is a tree.  When there are
multiple roots (i.e. it's a directed acyclic graph), especially with
roots at different levels, I wonder if pre-order bdrv_drain
traversal leaves open the possibility that I/Os sneak into parts of the
BDS graph that we thought were already drained.

The advantage of the current approach is that it really waits until the
whole (sub)tree is idle.

Concrete example: image fleecing involves adding an ephemeral qcow2 file
which is exported via NBD.  The guest keeps writing to the disk image as
normal but the original data will be backed up to the ephemeral qcow2
file before being overwritten, so that the client sees a point-in-time
snapshot of the disk image.

The tree looks like this:

          [NBD export]
             /
            v
[guest] temporary qcow2
   \    /
    v  v
    disk

Block backend access is in square brackets.  Nodes without square
brackets are BDS nodes.

If the guest wants to drain the disk, it's possible for new I/O requests
to enter the disk BDS while we're recursing to disk's children because
the NBD export socket fd is in the same AIOContext.  The socket fd is
therefore handled during aio_poll() calls.

I'm not 100% sure that this is a problem, but I wonder if you've thought
about this?

> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index a9a23a6..e0c9215 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>      }
>  }
>  
> +static bool bdrv_drain_io_recurse(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +    bool waited = false;
> +
> +    while (atomic_read(&bs->in_flight) > 0) {
> +        aio_poll(bdrv_get_aio_context(bs), true);
> +        waited = true;
> +    }
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        waited |= bdrv_drain_io_recurse(child->bs);
> +    }
> +
> +    return waited;
> +}
> +
>  /*
>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>   * and suspend block driver's internal I/O until next request arrives.
> @@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
>      bdrv_no_throttling_begin(bs);
>      bdrv_io_unplugged_begin(bs);
>      bdrv_drain_recurse(bs);
> -    while (bdrv_requests_pending(bs)) {
> -        /* Keep iterating */
> -         aio_poll(bdrv_get_aio_context(bs), true);
> -    }
> +    bdrv_drain_io_recurse(bs);
>      bdrv_io_unplugged_end(bs);
>      bdrv_no_throttling_end(bs);
>  }
> @@ -319,10 +333,7 @@ void bdrv_drain_all(void)
>              aio_context_acquire(aio_context);
>              while ((bs = bdrv_next(bs))) {
>                  if (aio_context == bdrv_get_aio_context(bs)) {
> -                    if (bdrv_requests_pending(bs)) {
> -                        aio_poll(aio_context, true);
> -                        waited = true;
> -                    }
> +                    waited |= bdrv_drain_io_recurse(bs);
>                  }
>              }
>              aio_context_release(aio_context);
> -- 
> 2.5.0
> 
>
Paolo Bonzini March 16, 2016, 5:41 p.m. UTC | #4
On 16/03/2016 17:39, Stefan Hajnoczi wrote:
> The tree looks like this:
> 
>           [NBD export]
>              /
>             v
> [guest] temporary qcow2
>    \    /
>     v  v
>     disk
> 
> Block backend access is in square brackets.  Nodes without square
> brackets are BDS nodes.
> 
> If the guest wants to drain the disk, it's possible for new I/O requests
> to enter the disk BDS while we're recursing to disk's children because
> the NBD export socket fd is in the same AIOContext.  The socket fd is
> therefore handled during aio_poll() calls.
> 
> I'm not 100% sure that this is a problem, but I wonder if you've thought
> about this?

I hadn't, but I think this is handled by using
bdrv_drained_begin/bdrv_drained_end instead of bdrv_drain.  The NBD
export registers its callback as "external", and it is thus disabled
between bdrv_drained_begin and bdrv_drained_end.

It will indeed become more complex when BDSes no longer have a "home
AioContext" due to multiqueue.  I suspect that we should rethink the
strategy for enabling and disabling external callbacks.  For example we
could add callbacks to each BlockBackend that enable/disable external
callbacks, and when bdrv_drained_begin is called on a BDS, we call the
callbacks for all BlockBackends that are included in this BDS.  I'm not
sure if there's a way to go from a BDS to all the BBs above it.

Paolo
Fam Zheng March 17, 2016, 12:57 a.m. UTC | #5
On Wed, 03/16 18:41, Paolo Bonzini wrote:
> 
> 
> On 16/03/2016 17:39, Stefan Hajnoczi wrote:
> > The tree looks like this:
> > 
> >           [NBD export]
> >              /
> >             v
> > [guest] temporary qcow2
> >    \    /
> >     v  v
> >     disk
> > 
> > Block backend access is in square brackets.  Nodes without square
> > brackets are BDS nodes.
> > 
> > If the guest wants to drain the disk, it's possible for new I/O requests
> > to enter the disk BDS while we're recursing to disk's children because
> > the NBD export socket fd is in the same AIOContext.  The socket fd is
> > therefore handled during aio_poll() calls.
> > 
> > I'm not 100% sure that this is a problem, but I wonder if you've thought
> > about this?
> 
> I hadn't, but I think this is handled by using
> bdrv_drained_begin/bdrv_drained_end instead of bdrv_drain.  The NBD
> export registers its callback as "external", and it is thus disabled
> between bdrv_drained_begin and bdrv_drained_end.
> 
> It will indeed become more complex when BDSes no longer have a "home

It probably means BBs won't have a "home AioContext" either, in that case.

> AioContext" due to multiqueue.  I suspect that we should rethink the
> strategy for enabling and disabling external callbacks.  For example we
> could add callbacks to each BlockBackend that enable/disable external
> callbacks, and when bdrv_drained_begin is called on a BDS, we call the
> callbacks for all BlockBackends that are included in this BDS.  I'm not
> sure if there's a way to go from a BDS to all the BBs above it.

If none of the bdrv_drained_begin callers is in a hot path, I think we can
simply call disable on each AioContext that can send requests to the BB/BDS.

Fam

Patch

diff --git a/block/io.c b/block/io.c
index a9a23a6..e0c9215 100644
--- a/block/io.c
+++ b/block/io.c
@@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
     }
 }
 
+static bool bdrv_drain_io_recurse(BlockDriverState *bs)
+{
+    BdrvChild *child;
+    bool waited = false;
+
+    while (atomic_read(&bs->in_flight) > 0) {
+        aio_poll(bdrv_get_aio_context(bs), true);
+        waited = true;
+    }
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        waited |= bdrv_drain_io_recurse(child->bs);
+    }
+
+    return waited;
+}
+
 /*
  * Wait for pending requests to complete on a single BlockDriverState subtree,
  * and suspend block driver's internal I/O until next request arrives.
@@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
     bdrv_no_throttling_begin(bs);
     bdrv_io_unplugged_begin(bs);
     bdrv_drain_recurse(bs);
-    while (bdrv_requests_pending(bs)) {
-        /* Keep iterating */
-         aio_poll(bdrv_get_aio_context(bs), true);
-    }
+    bdrv_drain_io_recurse(bs);
     bdrv_io_unplugged_end(bs);
     bdrv_no_throttling_end(bs);
 }
@@ -319,10 +333,7 @@ void bdrv_drain_all(void)
             aio_context_acquire(aio_context);
             while ((bs = bdrv_next(bs))) {
                 if (aio_context == bdrv_get_aio_context(bs)) {
-                    if (bdrv_requests_pending(bs)) {
-                        aio_poll(aio_context, true);
-                        waited = true;
-                    }
+                    waited |= bdrv_drain_io_recurse(bs);
                 }
             }
             aio_context_release(aio_context);