
[v7,22/24] block: Rewrite bdrv_close_all()

Message ID 1447108773-6836-23-git-send-email-mreitz@redhat.com
State New

Commit Message

Max Reitz Nov. 9, 2015, 10:39 p.m. UTC
This patch rewrites bdrv_close_all(): Until now, all root BDSs have been
force-closed. This is bad because it can lead to cached data not being
flushed to disk.

Instead, try to make all reference holders relinquish their reference
voluntarily:

1. All BlockBackend users are handled by making all BBs simply eject
   their BDS tree. Since a BDS can never be on top of a BB, this will
   not cause any of the issues as seen with the force-closing of BDSs.
   The references will be relinquished and any further access to the BB
   will fail gracefully.
2. All BDSs which are owned by the monitor itself (because they do not
   have a BB) are relinquished next.
3. Besides BBs and the monitor, block jobs and other BDSs are the only
   things left that can hold a reference to BDSs. After every remaining
   block job has been canceled, there should not be any BDSs left. The
   loop added here will always terminate (as long as NDEBUG is not
   defined): either all_bdrv_states becomes empty, or no block job is
   left to cancel and the assertion fails.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

Comments

Fam Zheng Nov. 12, 2015, 7:34 a.m. UTC | #1
On Mon, 11/09 23:39, Max Reitz wrote:
> This patch rewrites bdrv_close_all(): Until now, all root BDSs have been
> force-closed. This is bad because it can lead to cached data not being
> flushed to disk.
> 
> Instead, try to make all reference holders relinquish their reference
> voluntarily:
> 
> 1. All BlockBackend users are handled by making all BBs simply eject
>    their BDS tree. Since a BDS can never be on top of a BB, this will
>    not cause any of the issues as seen with the force-closing of BDSs.
>    The references will be relinquished and any further access to the BB
>    will fail gracefully.
> 2. All BDSs which are owned by the monitor itself (because they do not
>    have a BB) are relinquished next.
> 3. Besides BBs and the monitor, block jobs and other BDSs are the only
>    things left that can hold a reference to BDSs. After every remaining
>    block job has been canceled, there should not be any BDSs left. The
>    loop added here will always terminate (as long as NDEBUG is not
>    defined): either all_bdrv_states becomes empty, or no block job is
>    left to cancel and the assertion fails.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 45 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/block.c b/block.c
> index b935dff..e3d5aea 100644
> --- a/block.c
> +++ b/block.c
> @@ -1905,9 +1905,7 @@ static void bdrv_close(BlockDriverState *bs)
>  {
>      BdrvAioNotifier *ban, *ban_next;
>  
> -    if (bs->job) {
> -        block_job_cancel_sync(bs->job);
> -    }
> +    assert(!bs->job);
>  
>      /* Disable I/O limits and drain all pending throttled requests */
>      if (bs->throttle_state) {
> @@ -1971,13 +1969,44 @@ static void bdrv_close(BlockDriverState *bs)
>  void bdrv_close_all(void)
>  {
>      BlockDriverState *bs;
> +    AioContext *aio_context;
> +    int original_refcount = 0;
>  
> -    QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
> -        AioContext *aio_context = bdrv_get_aio_context(bs);
> +    /* Drop references from requests still in flight, such as canceled block
> +     * jobs whose AIO context has not been polled yet */
> +    bdrv_drain_all();
>  
> -        aio_context_acquire(aio_context);
> -        bdrv_close(bs);
> -        aio_context_release(aio_context);
> +    blockdev_close_all_bdrv_states();
> +    blk_remove_all_bs();
> +
> +    /* Cancel all block jobs */
> +    while (!QTAILQ_EMPTY(&all_bdrv_states)) {
> +        QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) {
> +            aio_context = bdrv_get_aio_context(bs);
> +
> +            aio_context_acquire(aio_context);
> +            if (bs->job) {
> +                /* So we can safely query the current refcount */
> +                bdrv_ref(bs);
> +                original_refcount = bs->refcnt;
> +
> +                block_job_cancel_sync(bs->job);
> +                aio_context_release(aio_context);
> +                break;
> +            }
> +            aio_context_release(aio_context);
> +        }
> +
> +        /* All the remaining BlockDriverStates are referenced directly or
> +         * indirectly from block jobs, so there needs to be at least one BDS
> +         * directly used by a block job */
> +        assert(bs);
> +
> +        /* Wait for the block job to release its reference */
> +        while (bs->refcnt >= original_refcount) {
> +            aio_poll(aio_context, true);
> +        }
> +        bdrv_unref(bs);

If at this point bs->refcnt is greater than 1, why don't we care where
the remaining references come from?

Fam

>      }
>  }
>  
> -- 
> 2.6.2
> 
>
Max Reitz Nov. 16, 2015, 4:15 p.m. UTC | #2
On 12.11.2015 08:34, Fam Zheng wrote:
> On Mon, 11/09 23:39, Max Reitz wrote:
>> This patch rewrites bdrv_close_all(): Until now, all root BDSs have been
>> force-closed. This is bad because it can lead to cached data not being
>> flushed to disk.
>>
>> Instead, try to make all reference holders relinquish their reference
>> voluntarily:
>>
>> 1. All BlockBackend users are handled by making all BBs simply eject
>>    their BDS tree. Since a BDS can never be on top of a BB, this will
>>    not cause any of the issues as seen with the force-closing of BDSs.
>>    The references will be relinquished and any further access to the BB
>>    will fail gracefully.
>> 2. All BDSs which are owned by the monitor itself (because they do not
>>    have a BB) are relinquished next.
>> 3. Besides BBs and the monitor, block jobs and other BDSs are the only
>>    things left that can hold a reference to BDSs. After every remaining
>>    block job has been canceled, there should not be any BDSs left. The
>>    loop added here will always terminate (as long as NDEBUG is not
>>    defined): either all_bdrv_states becomes empty, or no block job is
>>    left to cancel and the assertion fails.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>  block.c | 45 +++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 37 insertions(+), 8 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index b935dff..e3d5aea 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1905,9 +1905,7 @@ static void bdrv_close(BlockDriverState *bs)
>>  {
>>      BdrvAioNotifier *ban, *ban_next;
>>  
>> -    if (bs->job) {
>> -        block_job_cancel_sync(bs->job);
>> -    }
>> +    assert(!bs->job);
>>  
>>      /* Disable I/O limits and drain all pending throttled requests */
>>      if (bs->throttle_state) {
>> @@ -1971,13 +1969,44 @@ static void bdrv_close(BlockDriverState *bs)
>>  void bdrv_close_all(void)
>>  {
>>      BlockDriverState *bs;
>> +    AioContext *aio_context;
>> +    int original_refcount = 0;
>>  
>> -    QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
>> -        AioContext *aio_context = bdrv_get_aio_context(bs);
>> +    /* Drop references from requests still in flight, such as canceled block
>> +     * jobs whose AIO context has not been polled yet */
>> +    bdrv_drain_all();
>>  
>> -        aio_context_acquire(aio_context);
>> -        bdrv_close(bs);
>> -        aio_context_release(aio_context);
>> +    blockdev_close_all_bdrv_states();
>> +    blk_remove_all_bs();
>> +
>> +    /* Cancel all block jobs */
>> +    while (!QTAILQ_EMPTY(&all_bdrv_states)) {
>> +        QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) {
>> +            aio_context = bdrv_get_aio_context(bs);
>> +
>> +            aio_context_acquire(aio_context);
>> +            if (bs->job) {
>> +                /* So we can safely query the current refcount */
>> +                bdrv_ref(bs);
>> +                original_refcount = bs->refcnt;
>> +
>> +                block_job_cancel_sync(bs->job);
>> +                aio_context_release(aio_context);
>> +                break;
>> +            }
>> +            aio_context_release(aio_context);
>> +        }
>> +
>> +        /* All the remaining BlockDriverStates are referenced directly or
>> +         * indirectly from block jobs, so there needs to be at least one BDS
>> +         * directly used by a block job */
>> +        assert(bs);
>> +
>> +        /* Wait for the block job to release its reference */
>> +        while (bs->refcnt >= original_refcount) {
>> +            aio_poll(aio_context, true);
>> +        }
>> +        bdrv_unref(bs);
> 
> If at this point bs->refcnt is greater than 1, why don't we care where
> the remaining references come from?

We do care. A BDS will not be removed from all_bdrv_states until it is
deleted (i.e. its refcount becomes 0). Therefore, this loop will
continue until all BDSs have been deleted.

So where might additional references come from? Since this loop only
cares about direct or indirect references from block jobs, that's
exactly it:

(1) You might have multiple block jobs running on a BDS in the future.
    Then, you'll cancel them one by one, and after having canceled the
    first one, the refcount will still be greater than one before the
    bdrv_unref().

(2) Imagine a BDS A with a parent BDS B. There are block jobs running on
    both of them. Now, B is referenced by both its block job and by A
    (indirectly by the block job referencing A). If we cancel the job on
    B before the one on A, then the refcount on B will still be greater
    than 1 before bdrv_unref() because it is still referenced by its
    parent (until the block job on A is canceled, too).

The first cannot happen right now, but the second one may; I'm not sure
(it depends on whether op blockers allow it).


And we do make sure that there are no additional references besides:
- directly from the monitor (monitor-owned BDSs)
- from a BB
- from block jobs
- (+ everything transitively through the respective BDS tree)

If there were additional references, the inner loop would at some point
no longer find a BDS with a block job while all_bdrv_states is still not
empty. That's what the "assert(bs)" after the inner loop is for.

If you can imagine another way where a reference to a BDS may come from,
that would be a bug in this patch and we have to make sure to respect
that case, too.

Max
Fam Zheng Nov. 18, 2015, 2:52 a.m. UTC | #3
On Mon, 11/16 17:15, Max Reitz wrote:
> >> @@ -1971,13 +1969,44 @@ static void bdrv_close(BlockDriverState *bs)
> >>  void bdrv_close_all(void)
> >>  {
> >>      BlockDriverState *bs;
> >> +    AioContext *aio_context;
> >> +    int original_refcount = 0;
> >>  
> >> -    QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
> >> -        AioContext *aio_context = bdrv_get_aio_context(bs);
> >> +    /* Drop references from requests still in flight, such as canceled block
> >> +     * jobs whose AIO context has not been polled yet */
> >> +    bdrv_drain_all();
> >>  
> >> -        aio_context_acquire(aio_context);
> >> -        bdrv_close(bs);
> >> -        aio_context_release(aio_context);
> >> +    blockdev_close_all_bdrv_states();
> >> +    blk_remove_all_bs();
> >> +
> >> +    /* Cancel all block jobs */
> >> +    while (!QTAILQ_EMPTY(&all_bdrv_states)) {
> >> +        QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) {
> >> +            aio_context = bdrv_get_aio_context(bs);
> >> +
> >> +            aio_context_acquire(aio_context);
> >> +            if (bs->job) {
> >> +                /* So we can safely query the current refcount */
> >> +                bdrv_ref(bs);
> >> +                original_refcount = bs->refcnt;
> >> +
> >> +                block_job_cancel_sync(bs->job);
> >> +                aio_context_release(aio_context);
> >> +                break;
> >> +            }
> >> +            aio_context_release(aio_context);
> >> +        }
> >> +
> >> +        /* All the remaining BlockDriverStates are referenced directly or
> >> +         * indirectly from block jobs, so there needs to be at least one BDS
> >> +         * directly used by a block job */
> >> +        assert(bs);
> >> +
> >> +        /* Wait for the block job to release its reference */
> >> +        while (bs->refcnt >= original_refcount) {
> >> +            aio_poll(aio_context, true);
> >> +        }
> >> +        bdrv_unref(bs);
> > 
> > If at this point bs->refcnt is greater than 1, why don't we care where
> > the remaining references come from?
> 
> We do care. A BDS will not be removed from all_bdrv_states until it is
> deleted (i.e. its refcount becomes 0). Therefore, this loop will
> continue until all BDSs have been deleted.
> 
> So where might additional references come from? Since this loop only
> cares about direct or indirect references from block jobs, that's
> exactly it:
> 
> (1) You might have multiple block jobs running on a BDS in the future.
>     Then, you'll cancel them one by one, and after having canceled the
>     first one, the refcount will still be greater than one before the
>     bdrv_unref().
> 
> (2) Imagine a BDS A with a parent BDS B. There are block jobs running on
>     both of them. Now, B is referenced by both its block job and by A
>     (indirectly by the block job referencing A). If we cancel the job on
>     B before the one on A, then the refcount on B will still be greater
>     than 1 before bdrv_unref() because it is still referenced by its
>     parent (until the block job on A is canceled, too).
> 
> The first cannot happen right now, but the second one may; I'm not sure
> (it depends on whether op blockers allow it).

OK, that makes sense. all_bdrv_states is the central place for making
sure all refcounts reach 0.

> 
> 
> And we do make sure that there are no additional references besides:
> - directly from the monitor (monitor-owned BDSs)
> - from a BB
> - from block jobs
> - (+ everything transitively through the respective BDS tree)
> 
> If there were additional references, the inner loop would at some point
> no longer find a BDS with a block job while all_bdrv_states is still not
> empty. That's what the "assert(bs)" after the inner loop is for.
> 
> If you can imagine another way where a reference to a BDS may come from,
> that would be a bug in this patch and we have to make sure to respect
> that case, too.

I think the only reference I'm not sure about is the one in xen_disk:
blk_unref() is called in the .free and .disconnect callbacks, but I have
no idea whether those are called before bdrv_close_all().

Otherwise this patch looks good to me.

Thanks,

Fam

Patch

diff --git a/block.c b/block.c
index b935dff..e3d5aea 100644
--- a/block.c
+++ b/block.c
@@ -1905,9 +1905,7 @@  static void bdrv_close(BlockDriverState *bs)
 {
     BdrvAioNotifier *ban, *ban_next;
 
-    if (bs->job) {
-        block_job_cancel_sync(bs->job);
-    }
+    assert(!bs->job);
 
     /* Disable I/O limits and drain all pending throttled requests */
     if (bs->throttle_state) {
@@ -1971,13 +1969,44 @@  static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
     BlockDriverState *bs;
+    AioContext *aio_context;
+    int original_refcount = 0;
 
-    QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
-        AioContext *aio_context = bdrv_get_aio_context(bs);
+    /* Drop references from requests still in flight, such as canceled block
+     * jobs whose AIO context has not been polled yet */
+    bdrv_drain_all();
 
-        aio_context_acquire(aio_context);
-        bdrv_close(bs);
-        aio_context_release(aio_context);
+    blockdev_close_all_bdrv_states();
+    blk_remove_all_bs();
+
+    /* Cancel all block jobs */
+    while (!QTAILQ_EMPTY(&all_bdrv_states)) {
+        QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) {
+            aio_context = bdrv_get_aio_context(bs);
+
+            aio_context_acquire(aio_context);
+            if (bs->job) {
+                /* So we can safely query the current refcount */
+                bdrv_ref(bs);
+                original_refcount = bs->refcnt;
+
+                block_job_cancel_sync(bs->job);
+                aio_context_release(aio_context);
+                break;
+            }
+            aio_context_release(aio_context);
+        }
+
+        /* All the remaining BlockDriverStates are referenced directly or
+         * indirectly from block jobs, so there needs to be at least one BDS
+         * directly used by a block job */
+        assert(bs);
+
+        /* Wait for the block job to release its reference */
+        while (bs->refcnt >= original_refcount) {
+            aio_poll(aio_context, true);
+        }
+        bdrv_unref(bs);
     }
 }