diff mbox series

[v3,1/9] block: add .bdrv_need_rw_file_child_during_reopen_rw handler

Message ID 20190725091900.30542-2-vsementsov@virtuozzo.com
State New
Headers show
Series qcow2-bitmaps: rewrite reopening logic | expand

Commit Message

Vladimir Sementsov-Ogievskiy July 25, 2019, 9:18 a.m. UTC
On reopen to rw parent may need rw access to child in .prepare, for
example qcow2 needs to write IN_USE flags into stored bitmaps
(currently it is done in a hacky way after commit and don't work).
So, let's introduce such logic.

The drawback is that in worst case bdrv_reopen_set_read_only may finish
with error and in some intermediate state: some nodes reopened RW and
some are not. But this is a way to fix bug around reopening qcow2
bitmaps in the following commits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h |   2 +
 block.c                   | 144 ++++++++++++++++++++++++++++++++++----
 2 files changed, 133 insertions(+), 13 deletions(-)

Comments

Max Reitz July 31, 2019, 12:09 p.m. UTC | #1
On 25.07.19 11:18, Vladimir Sementsov-Ogievskiy wrote:
> On reopen to rw parent may need rw access to child in .prepare, for
> example qcow2 needs to write IN_USE flags into stored bitmaps
> (currently it is done in a hacky way after commit and don't work).
> So, let's introduce such logic.
> 
> The drawback is that in worst case bdrv_reopen_set_read_only may finish
> with error and in some intermediate state: some nodes reopened RW and
> some are not. But this is a way to fix bug around reopening qcow2
> bitmaps in the following commits.

This commit message doesn’t really explain what this patch does.

> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block_int.h |   2 +
>  block.c                   | 144 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 133 insertions(+), 13 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 3aa1e832a8..7bd6fd68dd 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -531,6 +531,8 @@ struct BlockDriver {
>                               uint64_t parent_perm, uint64_t parent_shared,
>                               uint64_t *nperm, uint64_t *nshared);
>  
> +     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
> +
>      /**
>       * Bitmaps should be marked as 'IN_USE' in the image on reopening image
>       * as rw. This handler should realize it. It also should unset readonly
> diff --git a/block.c b/block.c
> index cbd8da5f3b..3c8e1c59b4 100644
> --- a/block.c
> +++ b/block.c
> @@ -1715,10 +1715,12 @@ static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
>                                       uint64_t *shared_perm);
>  
>  typedef struct BlockReopenQueueEntry {
> -     bool prepared;
> -     bool perms_checked;
> -     BDRVReopenState state;
> -     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
> +    bool reopened_file_child_rw;
> +    bool changed_file_child_perm_rw;
> +    bool prepared;
> +    bool perms_checked;
> +    BDRVReopenState state;
> +    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>  } BlockReopenQueueEntry;
>  
>  /*
> @@ -3421,6 +3423,105 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>                                     keep_old_opts);
>  }
>  
> +static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
> +                                             bool read_only,
> +                                             Error **errp)
> +{
> +    BlockReopenQueue *queue;
> +    QDict *opts = qdict_new();
> +
> +    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
> +
> +    queue = bdrv_reopen_queue(NULL, bs, opts, true);
> +
> +    return bdrv_reopen_multiple(queue, errp);
> +}
> +
> +/*
> + * handle_recursive_reopens
> + *
> + * On fail it needs rollback_recursive_reopens to be called.

It would be nice if this description actually said anything about what
the function is supposed to do.

> + */
> +static int handle_recursive_reopens(BlockReopenQueueEntry *current,
> +                                    Error **errp)
> +{
> +    int ret;
> +    BlockDriverState *bs = current->state.bs;
> +
> +    /*
> +     * We use the fact that in reopen-queue children are always following
> +     * parents.
> +     * TODO: Switch BlockReopenQueue to be QTAILQ and use
> +     *       QTAILQ_FOREACH_REVERSE.

Why don’t you do that first?  It would make the code more obvious at
least to me.

> +     */
> +    if (QSIMPLEQ_NEXT(current, entry)) {
> +        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
> +    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
> +    {
> +        if (!bdrv_is_writable(bs->file->bs)) {
> +            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);

Hm.  Sorry, I find this all a bit hard to understand.  (No comments and
all.)

I understand that this is for an RO -> RW transition?  Everything is
still RO, but the parent will need an RW child before it transitions to
RW itself.


I’m going to be honest up front, I don’t like this very much.  But I
think it may be a reasonable solution for now.

As I remember, the problem was that when reopening a qcow2 node from RO
to RW, we need to write something in .prepare() (because it can fail),
but naturally no .prepare() is called after any .commit(), so no matter
the order of nodes in the ReopenQueue, the child node will never be RW
by this point.

Hm.  To me that mostly means that making the whole reopen process a
transaction was just a dream that turns out not to work.

OK, so what would be the real, proper, all-encompassing fix?  I suppose
we’d need a way to express reopen dependency relationships.  So if a
node depends on one or more of its children to be reopened before/after
it can be reopened itself, we’d need to pull them out of the ReopenQueue
(along with their dependencies) and do a separate bdrv_reopen_multiple()
on them.  So we’d want to split the ReopenQueue into as few subqueues as
possible, so that all dependencies are satisfied.

One such dependency is if you change a node from RO to RW, and that
change requires an RW child.

The reverse dependency occurs is if you change a node from RW to RO, and
the nodes wants to write something to its child, so it needs to remain
RW until then.

Currently, the former is just broken (hence this patch).  The latter is
kind of addressed by virtue of “Writing happens in .prepare(), and
parents come before children in the ReopenQueue”.


OK, so you address one specific case of a dependency, namely of a node
on its bs->file when it comes to writableness.  Not too bad, supporting
bs->file as the only dependency makes sense.  Everything else would
become complicated.


Besides being more specific than the general solution I tried to sketch
above, there is one more difference, though: In that description, I said
we should remove the node and its dependencies from the ReopenQueue, and
commit it earlier.  That has three implicatons:

First, this patch reopens the file if it is not writable, but the parent
needs it to be writable.  I think that’s wrong.  We should take the
file’s entry of the ReopenQueue and reopen it accordingly, not blindly
reopen it RW.  (If the user didn’t specify the file to be reopened RW,
that should be an error.)

Second, you need to take the dependencies into account.  (I don’t know
whether this one is a problem in practice.)  If the file node itself has
a child that needs to be RW, then you need to take that from the
ReopenQueue, too, and repoen it RW, too.

Third, the reopen may require some other options on the file.
Temporarily reopening it RW without changing those options may not be
what the user wants.  (So another reason to take the existing
ReopenQueue entry.)


So -- without having tried, of course -- I think a better design would
be to look for bs->file->bs in the ReopenQueue, recursively all of its
children, and move all of those entries into a new queue, and then
invoke bdrv_reopen_multiple() on that one first.

The question then becomes how to roll back those changes...  I don’t
know whether just having bdrv_reopen() partially succeed is so bad.
Otherwise, we’d need a function to translate an existing node's state
into a BdrvReopenQueueEntry so we can thus return to the old state.

> +            if (ret < 0) {
> +                return ret;
> +            }
> +            current->reopened_file_child_rw = true;
> +        }
> +
> +        if (!(bs->file->perm & BLK_PERM_WRITE)) {
> +            ret = bdrv_child_try_set_perm(bs->file,
> +                                          bs->file->perm | BLK_PERM_WRITE,
> +                                          bs->file->shared_perm, errp);

bdrv_child_try_set_perm() is dangerous, because its effect will be
undone whenever something happens that causes the permissions to be
refreshed.  (Hence the comment in block_int.h which says to avoid it.)
Generally, bdrv_child_refresh_perms() should be enough.  If it isn’t,
I’d say something’s off.

Max

> +            if (ret < 0) {
> +                return ret;
> +            }
> +            current->changed_file_child_perm_rw = true;
> +        }
> +    }
> +
> +    return 0;
> +}
Vladimir Sementsov-Ogievskiy Aug. 1, 2019, 2:02 p.m. UTC | #2
31.07.2019 15:09, Max Reitz wrote:
> On 25.07.19 11:18, Vladimir Sementsov-Ogievskiy wrote:
>> On reopen to rw parent may need rw access to child in .prepare, for
>> example qcow2 needs to write IN_USE flags into stored bitmaps
>> (currently it is done in a hacky way after commit and don't work).
>> So, let's introduce such logic.
>>
>> The drawback is that in worst case bdrv_reopen_set_read_only may finish
>> with error and in some intermediate state: some nodes reopened RW and
>> some are not. But this is a way to fix bug around reopening qcow2
>> bitmaps in the following commits.
> 
> This commit message doesn’t really explain what this patch does.
> 
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   include/block/block_int.h |   2 +
>>   block.c                   | 144 ++++++++++++++++++++++++++++++++++----
>>   2 files changed, 133 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 3aa1e832a8..7bd6fd68dd 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -531,6 +531,8 @@ struct BlockDriver {
>>                                uint64_t parent_perm, uint64_t parent_shared,
>>                                uint64_t *nperm, uint64_t *nshared);
>>   
>> +     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
>> +
>>       /**
>>        * Bitmaps should be marked as 'IN_USE' in the image on reopening image
>>        * as rw. This handler should realize it. It also should unset readonly
>> diff --git a/block.c b/block.c
>> index cbd8da5f3b..3c8e1c59b4 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1715,10 +1715,12 @@ static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
>>                                        uint64_t *shared_perm);
>>   
>>   typedef struct BlockReopenQueueEntry {
>> -     bool prepared;
>> -     bool perms_checked;
>> -     BDRVReopenState state;
>> -     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>> +    bool reopened_file_child_rw;
>> +    bool changed_file_child_perm_rw;
>> +    bool prepared;
>> +    bool perms_checked;
>> +    BDRVReopenState state;
>> +    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>>   } BlockReopenQueueEntry;
>>   
>>   /*
>> @@ -3421,6 +3423,105 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>>                                      keep_old_opts);
>>   }
>>   
>> +static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
>> +                                             bool read_only,
>> +                                             Error **errp)
>> +{
>> +    BlockReopenQueue *queue;
>> +    QDict *opts = qdict_new();
>> +
>> +    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
>> +
>> +    queue = bdrv_reopen_queue(NULL, bs, opts, true);
>> +
>> +    return bdrv_reopen_multiple(queue, errp);
>> +}
>> +
>> +/*
>> + * handle_recursive_reopens
>> + *
>> + * On fail it needs rollback_recursive_reopens to be called.
> 
> It would be nice if this description actually said anything about what
> the function is supposed to do.
> 
>> + */
>> +static int handle_recursive_reopens(BlockReopenQueueEntry *current,
>> +                                    Error **errp)
>> +{
>> +    int ret;
>> +    BlockDriverState *bs = current->state.bs;
>> +
>> +    /*
>> +     * We use the fact that in reopen-queue children are always following
>> +     * parents.
>> +     * TODO: Switch BlockReopenQueue to be QTAILQ and use
>> +     *       QTAILQ_FOREACH_REVERSE.
> 
> Why don’t you do that first?  It would make the code more obvious at
> least to me.
> 
>> +     */
>> +    if (QSIMPLEQ_NEXT(current, entry)) {
>> +        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
>> +    {
>> +        if (!bdrv_is_writable(bs->file->bs)) {
>> +            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);
> 
> Hm.  Sorry, I find this all a bit hard to understand.  (No comments and
> all.)
> 
> I understand that this is for an RO -> RW transition?  Everything is
> still RO, but the parent will need an RW child before it transitions to
> RW itself.
> 
> 
> I’m going to be honest up front, I don’t like this very much.  But I
> think it may be a reasonable solution for now.
> 
> As I remember, the problem was that when reopening a qcow2 node from RO
> to RW, we need to write something in .prepare() (because it can fail),
> but naturally no .prepare() is called after any .commit(), so no matter
> the order of nodes in the ReopenQueue, the child node will never be RW
> by this point.


Exactly.

> 
> Hm.  To me that mostly means that making the whole reopen process a
> transaction was just a dream that turns out not to work.
> 
> OK, so what would be the real, proper, all-encompassing fix?  I suppose
> we’d need a way to express reopen dependency relationships.  So if a
> node depends on one or more of its children to be reopened before/after
> it can be reopened itself, we’d need to pull them out of the ReopenQueue
> (along with their dependencies) and do a separate bdrv_reopen_multiple()
> on them.  So we’d want to split the ReopenQueue into as few subqueues as
> possible, so that all dependencies are satisfied.


Hmm. But actully, considering our case as example, reopening qcow2 RO->RW,
in general not dependent on reopening file child fist, but on just RW file
child.. As well as reopening qcow2 RW->RO depends on file child to be RW.

But I agree, that we'd better not doing reopens different from what we have
in the queue.

> 
> One such dependency is if you change a node from RO to RW, and that
> change requires an RW child.

Yes.

> 
> The reverse dependency occurs is if you change a node from RW to RO, and
> the nodes wants to write something to its child, so it needs to remain
> RW until then.

And yes, but actually this is the same dependency, but it should be resolved
in other way.

> 
> Currently, the former is just broken (hence this patch).  The latter is
> kind of addressed by virtue of “Writing happens in .prepare(), and
> parents come before children in the ReopenQueue”.
> 
> 
> OK, so you address one specific case of a dependency, namely of a node
> on its bs->file when it comes to writableness.  Not too bad, supporting
> bs->file as the only dependency makes sense.  Everything else would
> become complicated.
> 
> 
> Besides being more specific than the general solution I tried to sketch
> above, there is one more difference, though: In that description, I said
> we should remove the node and its dependencies from the ReopenQueue, and
> commit it earlier.  That has three implicatons:
> 
> First, this patch reopens the file if it is not writable, but the parent
> needs it to be writable.  I think that’s wrong.  We should take the
> file’s entry of the ReopenQueue and reopen it accordingly, not blindly
> reopen it RW.  (If the user didn’t specify the file to be reopened RW,
> that should be an error.)
> 
> Second, you need to take the dependencies into account.  (I don’t know
> whether this one is a problem in practice.)  If the file node itself has
> a child that needs to be RW, then you need to take that from the
> ReopenQueue, too, and repoen it RW, too.
> 
> Third, the reopen may require some other options on the file.
> Temporarily reopening it RW without changing those options may not be
> what the user wants.  (So another reason to take the existing
> ReopenQueue entry.)
> 
> 
> So -- without having tried, of course -- I think a better design would
> be to look for bs->file->bs in the ReopenQueue, recursively all of its
> children, and move all of those entries into a new queue, and then
> invoke bdrv_reopen_multiple() on that one first.

Why all children recursively? Shouldn't we instead only follow defined
dependencies?
Or it just seems bad to put not full subtree into bdrv_reopen multiple() ?


> 
> The question then becomes how to roll back those changes...  I don’t
> know whether just having bdrv_reopen() partially succeed is so bad.
> Otherwise, we’d need a function to translate an existing node's state
> into a BdrvReopenQueueEntry so we can thus return to the old state.

And this rollback may fail too and we are still in partial success state.

But if we roll-back simple ro->rw reopen it's safe enough: in worst case
file will be rw, but marked ro, so Qemu may have more access rights than
needed but will not use them, this is why I was concentrating around
only ro->rw case..

So, what about go similar way to this patch, i.e. only reopen ro->rw children
which we need to be rw, not touching other flags, but check, that in reopen
queue we have this child, and it is going to be reopened RW, and if not,
return error immediately?

> 
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +            current->reopened_file_child_rw = true;
>> +        }
>> +
>> +        if (!(bs->file->perm & BLK_PERM_WRITE)) {
>> +            ret = bdrv_child_try_set_perm(bs->file,
>> +                                          bs->file->perm | BLK_PERM_WRITE,
>> +                                          bs->file->shared_perm, errp);
> 
> bdrv_child_try_set_perm() is dangerous, because its effect will be
> undone whenever something happens that causes the permissions to be
> refreshed.  (Hence the comment in block_int.h which says to avoid it.)
> Generally, bdrv_child_refresh_perms() should be enough.  If it isn’t,
> I’d say something’s off.
> 
> Max
> 
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +            current->changed_file_child_perm_rw = true;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>
Max Reitz Aug. 1, 2019, 7:06 p.m. UTC | #3
On 01.08.19 16:02, Vladimir Sementsov-Ogievskiy wrote:
> 31.07.2019 15:09, Max Reitz wrote:

[...]

>> So -- without having tried, of course -- I think a better design would
>> be to look for bs->file->bs in the ReopenQueue, recursively all of its
>> children, and move all of those entries into a new queue, and then
>> invoke bdrv_reopen_multiple() on that one first.
> 
> Why all children recursively? Shouldn't we instead only follow defined
> dependencies?
> Or it just seems bad to put not full subtree into bdrv_reopen multiple() ?

For example, making a node RW without opening its children RW seems
wrong.  Whenever we reopen a node, we should reopen all of its children,
if they are in the queue.

>> The question then becomes how to roll back those changes...  I don’t
>> know whether just having bdrv_reopen() partially succeed is so bad.
>> Otherwise, we’d need a function to translate an existing node's state
>> into a BdrvReopenQueueEntry so we can thus return to the old state.
> 
> And this rollback may fail too and we are still in partial success state.
> 
> But if we roll-back simple ro->rw reopen it's safe enough: in worst case
> file will be rw, but marked ro, so Qemu may have more access rights than
> needed but will not use them, this is why I was concentrating around
> only ro->rw case..

In practice, this is always so.  The “children need to be reopened
before parent” case is always a sign of more permissions being taken;
whereas “children need to be reopened after parent” is a sign of
permissions being released.

What we want to fix now is the former “reopen children before parent”
case.  Because this is always a sign of taking more permissions, a
partial success/failure state means we always have taken more
permissions than we need to.

> So, what about go similar way to this patch, i.e. only reopen ro->rw children
> which we need to be rw, not touching other flags, but check, that in reopen
> queue we have this child, and it is going to be reopened RW, and if not,
> return error immediately?

If the RO -> RW change for the child is accompanied by other options
being changed, the user may find it vital to change these flags along
with the RO/RW access.  We shouldn’t ignore them.

Max
Vladimir Sementsov-Ogievskiy Aug. 2, 2019, 8:49 a.m. UTC | #4
01.08.2019 22:06, Max Reitz wrote:
> On 01.08.19 16:02, Vladimir Sementsov-Ogievskiy wrote:
>> 31.07.2019 15:09, Max Reitz wrote:
> 
> [...]
> 
>>> So -- without having tried, of course -- I think a better design would
>>> be to look for bs->file->bs in the ReopenQueue, recursively all of its
>>> children, and move all of those entries into a new queue, and then
>>> invoke bdrv_reopen_multiple() on that one first.
>>
>> Why all children recursively? Shouldn't we instead only follow defined
>> dependencies?
>> Or it just seems bad to put not full subtree into bdrv_reopen multiple() ?
> 
> For example, making a node RW without opening its children RW seems
> wrong.  Whenever we reopen a node, we should reopen all of its children,
> if they are in the queue.
> 
>>> The question then becomes how to roll back those changes...  I don’t
>>> know whether just having bdrv_reopen() partially succeed is so bad.
>>> Otherwise, we’d need a function to translate an existing node's state
>>> into a BdrvReopenQueueEntry so we can thus return to the old state.
>>
>> And this rollback may fail too and we are still in partial success state.
>>
>> But if we roll-back simple ro->rw reopen it's safe enough: in worst case
>> file will be rw, but marked ro, so Qemu may have more access rights than
>> needed but will not use them, this is why I was concentrating around
>> only ro->rw case..
> 
> In practice, this is always so.  The “children need to be reopened
> before parent” case is always a sign of more permissions being taken;
> whereas “children need to be reopened after parent” is a sign of
> permissions being released.
> 
> What we want to fix now is the former “reopen children before parent”
> case.  Because this is always a sign of taking more permissions, a
> partial success/failure state means we always have taken more
> permissions than we need to.

OK, I'll try.

> 
>> So, what about go similar way to this patch, i.e. only reopen ro->rw children
>> which we need to be rw, not touching other flags, but check, that in reopen
>> queue we have this child, and it is going to be reopened RW, and if not,
>> return error immediately?
> 
> If the RO -> RW change for the child is accompanied by other options
> being changed, the user may find it vital to change these flags along
> with the RO/RW access.  We shouldn’t ignore them.
> 

Hmm, understand, OK.
Kevin Wolf Aug. 2, 2019, 3:42 p.m. UTC | #5
Am 31.07.2019 um 14:09 hat Max Reitz geschrieben:
> On 25.07.19 11:18, Vladimir Sementsov-Ogievskiy wrote:
> > On reopen to rw parent may need rw access to child in .prepare, for
> > example qcow2 needs to write IN_USE flags into stored bitmaps
> > (currently it is done in a hacky way after commit and don't work).
> > So, let's introduce such logic.
> > 
> > The drawback is that in worst case bdrv_reopen_set_read_only may finish
> > with error and in some intermediate state: some nodes reopened RW and
> > some are not. But this is a way to fix bug around reopening qcow2
> > bitmaps in the following commits.
> 
> This commit message doesn’t really explain what this patch does.
> 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > ---
> >  include/block/block_int.h |   2 +
> >  block.c                   | 144 ++++++++++++++++++++++++++++++++++----
> >  2 files changed, 133 insertions(+), 13 deletions(-)
> > 
> > diff --git a/include/block/block_int.h b/include/block/block_int.h
> > index 3aa1e832a8..7bd6fd68dd 100644
> > --- a/include/block/block_int.h
> > +++ b/include/block/block_int.h
> > @@ -531,6 +531,8 @@ struct BlockDriver {
> >                               uint64_t parent_perm, uint64_t parent_shared,
> >                               uint64_t *nperm, uint64_t *nshared);
> >  
> > +     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
> > +
> >      /**
> >       * Bitmaps should be marked as 'IN_USE' in the image on reopening image
> >       * as rw. This handler should realize it. It also should unset readonly
> > diff --git a/block.c b/block.c
> > index cbd8da5f3b..3c8e1c59b4 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -1715,10 +1715,12 @@ static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
> >                                       uint64_t *shared_perm);
> >  
> >  typedef struct BlockReopenQueueEntry {
> > -     bool prepared;
> > -     bool perms_checked;
> > -     BDRVReopenState state;
> > -     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
> > +    bool reopened_file_child_rw;
> > +    bool changed_file_child_perm_rw;
> > +    bool prepared;
> > +    bool perms_checked;
> > +    BDRVReopenState state;
> > +    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
> >  } BlockReopenQueueEntry;
> >  
> >  /*
> > @@ -3421,6 +3423,105 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
> >                                     keep_old_opts);
> >  }
> >  
> > +static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
> > +                                             bool read_only,
> > +                                             Error **errp)
> > +{
> > +    BlockReopenQueue *queue;
> > +    QDict *opts = qdict_new();
> > +
> > +    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
> > +
> > +    queue = bdrv_reopen_queue(NULL, bs, opts, true);
> > +
> > +    return bdrv_reopen_multiple(queue, errp);
> > +}
> > +
> > +/*
> > + * handle_recursive_reopens
> > + *
> > + * On fail it needs rollback_recursive_reopens to be called.
> 
> It would be nice if this description actually said anything about what
> the function is supposed to do.
> 
> > + */
> > +static int handle_recursive_reopens(BlockReopenQueueEntry *current,
> > +                                    Error **errp)
> > +{
> > +    int ret;
> > +    BlockDriverState *bs = current->state.bs;
> > +
> > +    /*
> > +     * We use the fact that in reopen-queue children are always following
> > +     * parents.
> > +     * TODO: Switch BlockReopenQueue to be QTAILQ and use
> > +     *       QTAILQ_FOREACH_REVERSE.
> 
> Why don’t you do that first?  It would make the code more obvious at
> least to me.
> 
> > +     */
> > +    if (QSIMPLEQ_NEXT(current, entry)) {
> > +        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
> > +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
> > +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
> > +    {
> > +        if (!bdrv_is_writable(bs->file->bs)) {
> > +            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);
> 
> Hm.  Sorry, I find this all a bit hard to understand.  (No comments and
> all.)
> 
> I understand that this is for an RO -> RW transition?  Everything is
> still RO, but the parent will need an RW child before it transitions to
> RW itself.
> 
> 
> I’m going to be honest up front, I don’t like this very much.  But I
> think it may be a reasonable solution for now.
> 
> As I remember, the problem was that when reopening a qcow2 node from RO
> to RW, we need to write something in .prepare() (because it can fail),
> but naturally no .prepare() is called after any .commit(), so no matter
> the order of nodes in the ReopenQueue, the child node will never be RW
> by this point.
> 
> Hm.  To me that mostly means that making the whole reopen process a
> transaction was just a dream that turns out not to work.

This patch already looks somewhat complicated to me, and what you're
proposing below sounds another order of magnitude more complex.

But I think the important point is your last sentence above. Once we
acknowledge that we can't possibly maintain full transaction semantics,
I'll ask this naive question: What prevents us from just keeping the
additional write in .commit?

It is expected to work in the common case, so we're only talking about
the behaviour in error cases anyway. If something fails and we can't do
things in a transactionable way, we need to decide what the surprise
result will look like, because we can neither guarantee a rollback nor
successful completion.

If the write fails unexpectedly, and we end up with a qcow2 image that
is opened r/w, but has read-only bitmaps - wouldn't that be a reasonable
result? It seems much easier to explain than some dependency subchain
already being committed and the rest rolled back.

So, I guess my question is, what is the specifc scenario you're trying
to fix with this series (why isn't the final patch a test case that
would answer this question?), and are we sure that the cure isn't worse
than the disease?

Kevin
Vladimir Sementsov-Ogievskiy Aug. 2, 2019, 4:23 p.m. UTC | #6
02.08.2019 18:42, Kevin Wolf wrote:
> Am 31.07.2019 um 14:09 hat Max Reitz geschrieben:
>> On 25.07.19 11:18, Vladimir Sementsov-Ogievskiy wrote:
>>> On reopen to rw parent may need rw access to child in .prepare, for
>>> example qcow2 needs to write IN_USE flags into stored bitmaps
>>> (currently it is done in a hacky way after commit and don't work).
>>> So, let's introduce such logic.
>>>
>>> The drawback is that in worst case bdrv_reopen_set_read_only may finish
>>> with error and in some intermediate state: some nodes reopened RW and
>>> some are not. But this is a way to fix bug around reopening qcow2
>>> bitmaps in the following commits.
>>
>> This commit message doesn’t really explain what this patch does.
>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   include/block/block_int.h |   2 +
>>>   block.c                   | 144 ++++++++++++++++++++++++++++++++++----
>>>   2 files changed, 133 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>>> index 3aa1e832a8..7bd6fd68dd 100644
>>> --- a/include/block/block_int.h
>>> +++ b/include/block/block_int.h
>>> @@ -531,6 +531,8 @@ struct BlockDriver {
>>>                                uint64_t parent_perm, uint64_t parent_shared,
>>>                                uint64_t *nperm, uint64_t *nshared);
>>>   
>>> +     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
>>> +
>>>       /**
>>>        * Bitmaps should be marked as 'IN_USE' in the image on reopening image
>>>        * as rw. This handler should realize it. It also should unset readonly
>>> diff --git a/block.c b/block.c
>>> index cbd8da5f3b..3c8e1c59b4 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -1715,10 +1715,12 @@ static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
>>>                                        uint64_t *shared_perm);
>>>   
>>>   typedef struct BlockReopenQueueEntry {
>>> -     bool prepared;
>>> -     bool perms_checked;
>>> -     BDRVReopenState state;
>>> -     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>>> +    bool reopened_file_child_rw;
>>> +    bool changed_file_child_perm_rw;
>>> +    bool prepared;
>>> +    bool perms_checked;
>>> +    BDRVReopenState state;
>>> +    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>>>   } BlockReopenQueueEntry;
>>>   
>>>   /*
>>> @@ -3421,6 +3423,105 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>>>                                      keep_old_opts);
>>>   }
>>>   
>>> +static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
>>> +                                             bool read_only,
>>> +                                             Error **errp)
>>> +{
>>> +    BlockReopenQueue *queue;
>>> +    QDict *opts = qdict_new();
>>> +
>>> +    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
>>> +
>>> +    queue = bdrv_reopen_queue(NULL, bs, opts, true);
>>> +
>>> +    return bdrv_reopen_multiple(queue, errp);
>>> +}
>>> +
>>> +/*
>>> + * handle_recursive_reopens
>>> + *
>>> + * On fail it needs rollback_recursive_reopens to be called.
>>
>> It would be nice if this description actually said anything about what
>> the function is supposed to do.
>>
>>> + */
>>> +static int handle_recursive_reopens(BlockReopenQueueEntry *current,
>>> +                                    Error **errp)
>>> +{
>>> +    int ret;
>>> +    BlockDriverState *bs = current->state.bs;
>>> +
>>> +    /*
>>> +     * We use the fact that in reopen-queue children are always following
>>> +     * parents.
>>> +     * TODO: Switch BlockReopenQueue to be QTAILQ and use
>>> +     *       QTAILQ_FOREACH_REVERSE.
>>
>> Why don’t you do that first?  It would make the code more obvious at
>> least to me.
>>
>>> +     */
>>> +    if (QSIMPLEQ_NEXT(current, entry)) {
>>> +        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +    }
>>> +
>>> +    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
>>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
>>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
>>> +    {
>>> +        if (!bdrv_is_writable(bs->file->bs)) {
>>> +            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);
>>
>> Hm.  Sorry, I find this all a bit hard to understand.  (No comments and
>> all.)
>>
>> I understand that this is for an RO -> RW transition?  Everything is
>> still RO, but the parent will need an RW child before it transitions to
>> RW itself.
>>
>>
>> I’m going to be honest up front, I don’t like this very much.  But I
>> think it may be a reasonable solution for now.
>>
>> As I remember, the problem was that when reopening a qcow2 node from RO
>> to RW, we need to write something in .prepare() (because it can fail),
>> but naturally no .prepare() is called after any .commit(), so no matter
>> the order of nodes in the ReopenQueue, the child node will never be RW
>> by this point.
>>
>> Hm.  To me that mostly means that making the whole reopen process a
>> transaction was just a dream that turns out not to work.
> 
> This patch already looks somewhat complicated to me, and what you're
> proposing below sounds another order of magnitude more complex.

Agree :) However, at this point I've almost implemented it (it's not a reason
to chose more complex way, of course).

> 
> But I think the important point is your last sentence above. Once we
> acknowledge that we can't possibly maintain full transaction semantics,
> I'll ask this naive question: What prevents us from just keeping the
> additional write in .commit?

Hmm, it's what we've started with. The only thing to do is to reverse order
of commits, to commit file child before parent (and this way it works in
Virtuozzo now).
And I proposed it long ago:
https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg04528.html


> 
> It is expected to work in the common case, so we're only talking about
> the behaviour in error cases anyway. If something fails and we can't do
> things in a transactionable way, we need to decide what the surprise
> result will look like, because we can neither guarantee a rollback nor
> successful completion.
> 
> If the write fails unexpectedly, and we end up with a qcow2 image that
> is opened r/w, but has read-only bitmaps - wouldn't that be a reasonable
> result? It seems much easier to explain than some dependency subchain
> already being committed and the rest rolled back.

And this is how it works now (except it doesn't work because of commit order).
In our long ago conversation (link above) you pointed that the problem here
is that we don't return an error from actually failed commit and it is
ignored..

> 
> So, I guess my question is, what is the specifc scenario you're trying
> to fix with this series (why isn't the final patch a test case that
> would answer this question?), and are we sure that the cure isn't worse
> than the disease?
> 

Test appears at 03 and tests what already works, and lacks test-cases which
are broken, and they are added in 08 when all is fixed.

And here are two things to fix:
First is that we lose bitmaps on reopening to RO and it is described and
fixed in 06.
Second is that we cant set IN_USE flag when reopening to RW and it is fixed
finally in 08.


=====

So, if we decide to keep things simple, what to do? Just reorder commits to
satisfy dependencies, if any?

Should we add return code to commit, which should always be 0 except very rare
case?
Vladimir Sementsov-Ogievskiy Aug. 2, 2019, 4:41 p.m. UTC | #7
02.08.2019 19:23, Vladimir Sementsov-Ogievskiy wrote:
> 02.08.2019 18:42, Kevin Wolf wrote:
>> Am 31.07.2019 um 14:09 hat Max Reitz geschrieben:
>>> On 25.07.19 11:18, Vladimir Sementsov-Ogievskiy wrote:
>>>> On reopen to rw parent may need rw access to child in .prepare, for
>>>> example qcow2 needs to write IN_USE flags into stored bitmaps
>>>> (currently it is done in a hacky way after commit and don't work).
>>>> So, let's introduce such logic.
>>>>
>>>> The drawback is that in worst case bdrv_reopen_set_read_only may finish
>>>> with error and in some intermediate state: some nodes reopened RW and
>>>> some are not. But this is a way to fix bug around reopening qcow2
>>>> bitmaps in the following commits.
>>>
>>> This commit message doesn’t really explain what this patch does.
>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>   include/block/block_int.h |   2 +
>>>>   block.c                   | 144 ++++++++++++++++++++++++++++++++++----
>>>>   2 files changed, 133 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>>>> index 3aa1e832a8..7bd6fd68dd 100644
>>>> --- a/include/block/block_int.h
>>>> +++ b/include/block/block_int.h
>>>> @@ -531,6 +531,8 @@ struct BlockDriver {
>>>>                                uint64_t parent_perm, uint64_t parent_shared,
>>>>                                uint64_t *nperm, uint64_t *nshared);
>>>> +     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
>>>> +
>>>>       /**
>>>>        * Bitmaps should be marked as 'IN_USE' in the image on reopening image
>>>>        * as rw. This handler should realize it. It also should unset readonly
>>>> diff --git a/block.c b/block.c
>>>> index cbd8da5f3b..3c8e1c59b4 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -1715,10 +1715,12 @@ static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
>>>>                                        uint64_t *shared_perm);
>>>>   typedef struct BlockReopenQueueEntry {
>>>> -     bool prepared;
>>>> -     bool perms_checked;
>>>> -     BDRVReopenState state;
>>>> -     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>>>> +    bool reopened_file_child_rw;
>>>> +    bool changed_file_child_perm_rw;
>>>> +    bool prepared;
>>>> +    bool perms_checked;
>>>> +    BDRVReopenState state;
>>>> +    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
>>>>   } BlockReopenQueueEntry;
>>>>   /*
>>>> @@ -3421,6 +3423,105 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>>>>                                      keep_old_opts);
>>>>   }
>>>> +static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
>>>> +                                             bool read_only,
>>>> +                                             Error **errp)
>>>> +{
>>>> +    BlockReopenQueue *queue;
>>>> +    QDict *opts = qdict_new();
>>>> +
>>>> +    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
>>>> +
>>>> +    queue = bdrv_reopen_queue(NULL, bs, opts, true);
>>>> +
>>>> +    return bdrv_reopen_multiple(queue, errp);
>>>> +}
>>>> +
>>>> +/*
>>>> + * handle_recursive_reopens
>>>> + *
>>>> + * On fail it needs rollback_recursive_reopens to be called.
>>>
>>> It would be nice if this description actually said anything about what
>>> the function is supposed to do.
>>>
>>>> + */
>>>> +static int handle_recursive_reopens(BlockReopenQueueEntry *current,
>>>> +                                    Error **errp)
>>>> +{
>>>> +    int ret;
>>>> +    BlockDriverState *bs = current->state.bs;
>>>> +
>>>> +    /*
>>>> +     * We use the fact that in reopen-queue children are always following
>>>> +     * parents.
>>>> +     * TODO: Switch BlockReopenQueue to be QTAILQ and use
>>>> +     *       QTAILQ_FOREACH_REVERSE.
>>>
>>> Why don’t you do that first?  It would make the code more obvious at
>>> least to me.
>>>
>>>> +     */
>>>> +    if (QSIMPLEQ_NEXT(current, entry)) {
>>>> +        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
>>>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
>>>> +        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
>>>> +    {
>>>> +        if (!bdrv_is_writable(bs->file->bs)) {
>>>> +            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);
>>>
>>> Hm.  Sorry, I find this all a bit hard to understand.  (No comments and
>>> all.)
>>>
>>> I understand that this is for an RO -> RW transition?  Everything is
>>> still RO, but the parent will need an RW child before it transitions to
>>> RW itself.
>>>
>>>
>>> I’m going to be honest up front, I don’t like this very much.  But I
>>> think it may be a reasonable solution for now.
>>>
>>> As I remember, the problem was that when reopening a qcow2 node from RO
>>> to RW, we need to write something in .prepare() (because it can fail),
>>> but naturally no .prepare() is called after any .commit(), so no matter
>>> the order of nodes in the ReopenQueue, the child node will never be RW
>>> by this point.
>>>
>>> Hm.  To me that mostly means that making the whole reopen process a
>>> transaction was just a dream that turns out not to work.
>>
>> This patch already looks somewhat complicated to me, and what you're
>> proposing below sounds another order of magnitude more complex.
> 
> Agree :) However, at this point I've almost implemented it (it's not a reason
> to chose more complex way, of course).
> 
>>
>> But I think the important point is your last sentence above. Once we
>> acknowledge that we can't possibly maintain full transaction semantics,
>> I'll ask this naive question: What prevents us from just keeping the
>> additional write in .commit?
> 
> Hmm, it's what we've started with. The only thing to do is to reverse order
> of commits, to commit file child before parent (and this way it works in
> Virtuozzo now).
> And I proposed it long ago:
> https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg04528.html
> 
> 
>>
>> It is expected to work in the common case, so we're only talking about
>> the behaviour in error cases anyway. If something fails and we can't do
>> things in a transactionable way, we need to decide what the surprise
>> result will look like, because we can neither guarantee a rollback nor
>> successful completion.
>>
>> If the write fails unexpectedly, and we end up with a qcow2 image that
>> is opened r/w, but has read-only bitmaps - wouldn't that be a reasonable
>> result? It seems much easier to explain than some dependency subchain
>> already being committed and the rest rolled back.
> 
> And this is how it works now (except it doesn't work because of commit order).
> In our long ago conversation (link above) you pointed that the problem here
> is that we don't return an error from actually failed commit and it is
> ignored..
> 
>>
>> So, I guess my question is, what is the specifc scenario you're trying
>> to fix with this series (why isn't the final patch a test case that
>> would answer this question?), and are we sure that the cure isn't worse
>> than the disease?
>>
> 
> Test appears at 03 and tests what already works, and lacks test-cases which
> are broken, and they are added in 08 when all is fixed.
> 
> And here are two things to fix:
> First is that we lose bitmaps on reopening to RO and it is described and
> fixed in 06.
> Second is that we cant set IN_USE flag when reopening to RW and it is fixed
> finally in 08.
> 
> 
> =====
> 
> So, if we decide to keep things simple, what to do? Just reorder commits to
> satisfy dependencies, if any?
> 
> Should we add return code to commit, which should always be 0 except very rare
> case?
> 
> 


And another idea is to postpone IN_USE setting until we really need to modify the
bitmap.. Not sure that it is simpler.
diff mbox series

Patch

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 3aa1e832a8..7bd6fd68dd 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -531,6 +531,8 @@  struct BlockDriver {
                              uint64_t parent_perm, uint64_t parent_shared,
                              uint64_t *nperm, uint64_t *nshared);
 
+     bool (*bdrv_need_rw_file_child_during_reopen_rw)(BlockDriverState *bs);
+
     /**
      * Bitmaps should be marked as 'IN_USE' in the image on reopening image
      * as rw. This handler should realize it. It also should unset readonly
diff --git a/block.c b/block.c
index cbd8da5f3b..3c8e1c59b4 100644
--- a/block.c
+++ b/block.c
@@ -1715,10 +1715,12 @@  static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
                                      uint64_t *shared_perm);
 
 typedef struct BlockReopenQueueEntry {
-     bool prepared;
-     bool perms_checked;
-     BDRVReopenState state;
-     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
+    bool reopened_file_child_rw;
+    bool changed_file_child_perm_rw;
+    bool prepared;
+    bool perms_checked;
+    BDRVReopenState state;
+    QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
 } BlockReopenQueueEntry;
 
 /*
@@ -3421,6 +3423,105 @@  BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                    keep_old_opts);
 }
 
+static int bdrv_reopen_set_read_only_drained(BlockDriverState *bs,
+                                             bool read_only,
+                                             Error **errp)
+{
+    BlockReopenQueue *queue;
+    QDict *opts = qdict_new();
+
+    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
+
+    queue = bdrv_reopen_queue(NULL, bs, opts, true);
+
+    return bdrv_reopen_multiple(queue, errp);
+}
+
+/*
+ * handle_recursive_reopens
+ *
+ * On fail it needs rollback_recursive_reopens to be called.
+ */
+static int handle_recursive_reopens(BlockReopenQueueEntry *current,
+                                    Error **errp)
+{
+    int ret;
+    BlockDriverState *bs = current->state.bs;
+
+    /*
+     * We use the fact that in reopen-queue children are always following
+     * parents.
+     * TODO: Switch BlockReopenQueue to be QTAILQ and use
+     *       QTAILQ_FOREACH_REVERSE.
+     */
+    if (QSIMPLEQ_NEXT(current, entry)) {
+        ret = handle_recursive_reopens(QSIMPLEQ_NEXT(current, entry), errp);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    if ((current->state.flags & BDRV_O_RDWR) && bs->file && bs->drv &&
+        bs->drv->bdrv_need_rw_file_child_during_reopen_rw &&
+        bs->drv->bdrv_need_rw_file_child_during_reopen_rw(bs))
+    {
+        if (!bdrv_is_writable(bs->file->bs)) {
+            ret = bdrv_reopen_set_read_only_drained(bs->file->bs, false, errp);
+            if (ret < 0) {
+                return ret;
+            }
+            current->reopened_file_child_rw = true;
+        }
+
+        if (!(bs->file->perm & BLK_PERM_WRITE)) {
+            ret = bdrv_child_try_set_perm(bs->file,
+                                          bs->file->perm | BLK_PERM_WRITE,
+                                          bs->file->shared_perm, errp);
+            if (ret < 0) {
+                return ret;
+            }
+            current->changed_file_child_perm_rw = true;
+        }
+    }
+
+    return 0;
+}
+
+static int rollback_recursive_reopens(BlockReopenQueue *bs_queue,
+                                      Error **errp)
+{
+    int ret;
+    BlockReopenQueueEntry *bs_entry;
+
+    /*
+     * We use the fact that in reopen-queue children are always following
+     * parents.
+     */
+    QSIMPLEQ_FOREACH(bs_entry, bs_queue, entry) {
+        BlockDriverState *bs = bs_entry->state.bs;
+
+        if (bs_entry->changed_file_child_perm_rw) {
+            ret = bdrv_child_try_set_perm(bs->file,
+                                          bs->file->perm & ~BLK_PERM_WRITE,
+                                          bs->file->shared_perm, errp);
+
+            if (ret < 0) {
+                return ret;
+            }
+        }
+
+        if (bs_entry->reopened_file_child_rw) {
+            ret = bdrv_reopen_set_read_only_drained(bs, true, errp);
+
+            if (ret < 0) {
+                return ret;
+            }
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Reopen multiple BlockDriverStates atomically & transactionally.
  *
@@ -3440,11 +3541,18 @@  BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
  */
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 {
-    int ret = -1;
+    int ret;
     BlockReopenQueueEntry *bs_entry, *next;
 
     assert(bs_queue != NULL);
 
+    ret = handle_recursive_reopens(QSIMPLEQ_FIRST(bs_queue), errp);
+    if (ret < 0) {
+        goto rollback_recursion;
+    }
+
+    ret = -1;
+
     QSIMPLEQ_FOREACH(bs_entry, bs_queue, entry) {
         assert(bs_entry->state.bs->quiesce_counter > 0);
         if (bdrv_reopen_prepare(&bs_entry->state, bs_queue, errp)) {
@@ -3485,7 +3593,7 @@  int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 
     ret = 0;
 cleanup_perm:
-    QSIMPLEQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+    QSIMPLEQ_FOREACH(bs_entry, bs_queue, entry) {
         BDRVReopenState *state = &bs_entry->state;
 
         if (!bs_entry->perms_checked) {
@@ -3502,7 +3610,7 @@  cleanup_perm:
         }
     }
 cleanup:
-    QSIMPLEQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+    QSIMPLEQ_FOREACH(bs_entry, bs_queue, entry) {
         if (ret) {
             if (bs_entry->prepared) {
                 bdrv_reopen_abort(&bs_entry->state);
@@ -3513,8 +3621,23 @@  cleanup:
         if (bs_entry->state.new_backing_bs) {
             bdrv_unref(bs_entry->state.new_backing_bs);
         }
+    }
+
+rollback_recursion:
+    if (ret) {
+        Error *local_err = NULL;
+        int ret2 = rollback_recursive_reopens(bs_queue, &local_err);
+
+        if (ret2 < 0) {
+            error_reportf_err(local_err, "Failed to rollback partially "
+                              "succeeded reopen to RW: ");
+        }
+    }
+
+    QSIMPLEQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
         g_free(bs_entry);
     }
+
     g_free(bs_queue);
 
     return ret;
@@ -3524,14 +3647,9 @@  int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
                               Error **errp)
 {
     int ret;
-    BlockReopenQueue *queue;
-    QDict *opts = qdict_new();
-
-    qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
 
     bdrv_subtree_drained_begin(bs);
-    queue = bdrv_reopen_queue(NULL, bs, opts, true);
-    ret = bdrv_reopen_multiple(queue, errp);
+    ret = bdrv_reopen_set_read_only_drained(bs, read_only, errp);
     bdrv_subtree_drained_end(bs);
 
     return ret;