diff mbox series

[v2,1/3] block: Make bdrv_refresh_limits() non-recursive

Message ID 20220216105355.30729-2-hreitz@redhat.com
State New
Series block: Make bdrv_refresh_limits() non-recursive

Commit Message

Hanna Czenczek Feb. 16, 2022, 10:53 a.m. UTC
bdrv_refresh_limits() recurses down to the node's children.  That does
not seem necessary: When we refresh limits on some node, and then
recurse down and were to change one of its children's BlockLimits, then
that would mean we noticed the changed limits by pure chance.  The fact
that we refresh the parent's limits has nothing to do with it, so the
reason for the change probably happened before this point in time, and
we should have refreshed the limits then.

On the other hand, we do not have infrastructure for noticing that block
limits change after they have been initialized for the first time (this
would require propagating the change upwards to the respective node's
parents), and so evidently we consider this case impossible.

If this case is impossible, then we will not need to recurse down in
bdrv_refresh_limits().  Every node's limits are initialized in
bdrv_open_driver(), and are refreshed whenever its children change.
We want to use the children's limits to get some initial default, but we
can just take them, we do not need to refresh them.

The problem with recursing is that bdrv_refresh_limits() is not atomic.
It begins with zeroing BDS.bl, and only then sets proper, valid limits.
If we do not drain all nodes whose limits are refreshed, then concurrent
I/O requests can encounter invalid request_alignment values and crash
qemu.  Therefore, a recursing bdrv_refresh_limits() requires the whole
subtree to be drained, which is currently not ensured by most callers.
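
To make the non-atomicity concrete, here is a minimal, self-contained
sketch.  It is not QEMU code: ToyLimits, toy_refresh_limits() and the
observer hook are hypothetical names, and the hook merely stands in for
an I/O request that runs while the node is not drained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for BlockLimits; only one field is modeled. */
typedef struct ToyLimits {
    uint32_t request_alignment;
} ToyLimits;

/* Hook called between the two steps, standing in for concurrent I/O. */
typedef void (*ToyObserver)(const ToyLimits *bl, void *opaque);

/*
 * Two-step refresh as described above: zero the limits first, then fill
 * in valid values.  The observer sees the state in between.
 */
void toy_refresh_limits(ToyLimits *bl, uint32_t new_alignment,
                        ToyObserver mid, void *opaque)
{
    memset(bl, 0, sizeof(*bl));            /* step 1: limits invalid */
    if (mid) {
        mid(bl, opaque);
    }
    bl->request_alignment = new_alignment; /* step 2: limits valid again */
}

/* Records the alignment that is visible in the middle of a refresh. */
void toy_record_alignment(const ToyLimits *bl, void *opaque)
{
    *(uint32_t *)opaque = bl->request_alignment;
}

/* Returns the alignment a request would observe mid-refresh. */
uint32_t toy_alignment_seen_mid_refresh(void)
{
    ToyLimits bl = { .request_alignment = 512 };
    uint32_t seen = UINT32_MAX;

    toy_refresh_limits(&bl, 4096, toy_record_alignment, &seen);
    assert(bl.request_alignment == 4096);  /* final state is valid */
    return seen;
}
```

In this model the observer sees an alignment of 0 even though both the
old and the new value are valid; any code that divides by or rounds to
that value would misbehave at this point.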

A non-recursive bdrv_refresh_limits() only requires the node in question
to not receive I/O requests, and this is done by most callers in some
way or another:
- bdrv_open_driver() deals with a new node with no parents yet
- bdrv_set_file_or_backing_noperm() acts on a drained node
- bdrv_reopen_commit() acts only on drained nodes
- bdrv_append() should in theory require the node to be drained; in
  practice most callers just lock the AioContext, which should at least
  be enough to prevent concurrent I/O requests from accessing invalid
  limits

So we can resolve the bug by making bdrv_refresh_limits() non-recursive.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1879437
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 4 ----
 1 file changed, 4 deletions(-)
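
The effect of dropping the recursion can be sketched with a toy graph.
Everything below is a simplified model under stated assumptions: ToyNode
and toy_refresh() are hypothetical, and merging is reduced to taking the
larger request alignment, which is only an approximation of what
bdrv_merge_limits() does for the full BlockLimits structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 4

typedef struct ToyNode {
    uint32_t driver_alignment;  /* what the node's own driver requires */
    uint32_t request_alignment; /* stand-in for bs->bl.request_alignment */
    unsigned refresh_count;     /* how often this node was zeroed */
    struct ToyNode *children[TOY_MAX_CHILDREN];
    size_t nb_children;
} ToyNode;

/* Refresh one node; with recurse=true, refresh its subtree first. */
void toy_refresh(ToyNode *n, bool recurse)
{
    uint32_t align = n->driver_alignment;

    n->request_alignment = 0;   /* transient invalid state */
    n->refresh_count++;

    for (size_t i = 0; i < n->nb_children; i++) {
        ToyNode *c = n->children[i];

        if (recurse) {
            toy_refresh(c, true);
        }
        /* Merge step (assumption: keep the stricter alignment). */
        if (c->request_alignment > align) {
            align = c->request_alignment;
        }
    }
    n->request_alignment = align;
}

/* Counts how many descendant refreshes one refresh of the root causes. */
unsigned toy_descendants_touched(bool recurse)
{
    ToyNode leaf = { .driver_alignment = 4096 };
    ToyNode mid = { .driver_alignment = 512,
                    .children = { &leaf }, .nb_children = 1 };
    ToyNode root = { .driver_alignment = 1,
                     .children = { &mid }, .nb_children = 1 };

    /* Limits are initialized bottom-up when the nodes are opened. */
    toy_refresh(&leaf, false);
    toy_refresh(&mid, false);
    leaf.refresh_count = 0;
    mid.refresh_count = 0;

    toy_refresh(&root, recurse);
    assert(root.request_alignment == 4096); /* same result either way */
    return leaf.refresh_count + mid.refresh_count;
}
```

Both variants arrive at the same limits for the root, but only the
recursive one puts the descendants through the transient zeroed state,
which is why it would need the whole subtree to be drained.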

Comments

Kevin Wolf March 3, 2022, 4:56 p.m. UTC | #1
On 16.02.2022 at 11:53, Hanna Reitz wrote:
> bdrv_refresh_limits() recurses down to the node's children.  That does
> not seem necessary: When we refresh limits on some node, and then
> recurse down and were to change one of its children's BlockLimits, then
> that would mean we noticed the changed limits by pure chance.  The fact
> that we refresh the parent's limits has nothing to do with it, so the
> reason for the change probably happened before this point in time, and
> we should have refreshed the limits then.
> 
> On the other hand, we do not have infrastructure for noticing that block
> limits change after they have been initialized for the first time (this
> would require propagating the change upwards to the respective node's
> parents), and so evidently we consider this case impossible.

I like your optimistic approach, but my interpretation would have been
that this is simply a bug. ;-)

blockdev-reopen allows changing options that affect the block limits
(most importantly probably request_alignment), so this should be
propagated to the parents. I think we'll actually not see failures if we
forget to do this, but parents can either advertise excessive alignment
requirements or they may run into RMW when accessing the child, so this
would only affect performance. This is probably why nobody reported it
yet.
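
A small sketch of the two outcomes mentioned above, using hypothetical
helpers (toy_is_aligned() and toy_padded_bytes() are not QEMU functions)
and example alignments of 512 and 4096 bytes chosen only for
illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the request needs no padding for a node with this alignment. */
bool toy_is_aligned(uint64_t offset, uint64_t bytes, uint32_t align)
{
    assert(align > 0);
    return offset % align == 0 && bytes % align == 0;
}

/* Bytes actually touched once the request is rounded out to 'align'. */
uint64_t toy_padded_bytes(uint64_t offset, uint64_t bytes, uint32_t align)
{
    assert(align > 0);
    uint64_t start = offset - offset % align;
    uint64_t end = offset + bytes;

    if (end % align) {
        end += align - end % align;
    }
    return end - start;
}
```

A parent that still assumes an alignment of 512 considers a 512-byte
write at offset 512 aligned, while a child reopened with an alignment of
4096 has to widen it to 4096 bytes, i.e. RMW.  In the opposite direction
the parent keeps advertising 4096 although the child could take 512-byte
requests directly.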

> If this case is impossible, then we will not need to recurse down in
> bdrv_refresh_limits().  Every node's limits are initialized in
> bdrv_open_driver(), and are refreshed whenever its children change.
> We want to use the children's limits to get some initial default, but
> we can just take them, we do not need to refresh them.

I think even if we need to propagate to the parents, we still don't need
to propagate to the children because the children have already been
refreshed by whatever changed their options (like bdrv_reopen_commit()).
And parent limits don't influence the child limits at all.

So this patch looks good to me, just not the reasoning.

Kevin

Hanna Czenczek March 4, 2022, 12:44 p.m. UTC | #2
On 03.03.22 17:56, Kevin Wolf wrote:
> On 16.02.2022 at 11:53, Hanna Reitz wrote:
>> bdrv_refresh_limits() recurses down to the node's children.  That does
>> not seem necessary: When we refresh limits on some node, and then
>> recurse down and were to change one of its children's BlockLimits, then
>> that would mean we noticed the changed limits by pure chance.  The fact
>> that we refresh the parent's limits has nothing to do with it, so the
>> reason for the change probably happened before this point in time, and
>> we should have refreshed the limits then.
>>
>> On the other hand, we do not have infrastructure for noticing that block
>> limits change after they have been initialized for the first time (this
>> would require propagating the change upwards to the respective node's
>> parents), and so evidently we consider this case impossible.
> I like your optimistic approach, but my interpretation would have been
> that this is simply a bug. ;-)
>
> blockdev-reopen allows changing options that affect the block limits
> (most importantly probably request_alignment), so this should be
> propagated to the parents. I think we'll actually not see failures if we
> forget to do this, but parents can either advertise excessive alignment
> requirements or they may run into RMW when accessing the child, so this
> would only affect performance. This is probably why nobody reported it
> yet.

Ah, right, I forgot this for parents of parents...  I thought the block 
limits of a node might change if its children list changes, and so we 
should bdrv_refresh_limits() when that children list changes, but forgot 
that we really do need to propagate this up, right.

>> If this case is impossible, then we will not need to recurse down in
>> bdrv_refresh_limits().  Every node's limits are initialized in
>> bdrv_open_driver(), and are refreshed whenever its children change.
>> We want to use the children's limits to get some initial default, but
>> we can just take them, we do not need to refresh them.
> I think even if we need to propagate to the parents, we still don't need
> to propagate to the children because the children have already been
> refreshed by whatever changed their options (like bdrv_reopen_commit()).
> And parent limits don't influence the child limits at all.
>
> So this patch looks good to me, just not the reasoning.

OK, so, uh, can we just drop these two paragraphs?  (“On the other 
hand...” and “If this case is impossible…”)

Or we could replace them with a note hinting at the potential bug that 
would need to be fixed, e.g.

“
Consequently, we should actually propagate block limits changes upwards,
not downwards.  That is a separate and pre-existing issue, though, and
so will not be addressed in this patch.
”

Question is, if we at some point do propagate this upwards, won’t this 
cause exactly the same problem that this patch is trying to get around, 
i.e. that we might call bdrv_refresh_limits() on non-drained parent nodes?

Hanna

Kevin Wolf March 4, 2022, 2:14 p.m. UTC | #3
On 04.03.2022 at 13:44, Hanna Reitz wrote:
> On 03.03.22 17:56, Kevin Wolf wrote:
> > On 16.02.2022 at 11:53, Hanna Reitz wrote:
> > > bdrv_refresh_limits() recurses down to the node's children.  That does
> > > not seem necessary: When we refresh limits on some node, and then
> > > recurse down and were to change one of its children's BlockLimits, then
> > > that would mean we noticed the changed limits by pure chance.  The fact
> > > that we refresh the parent's limits has nothing to do with it, so the
> > > reason for the change probably happened before this point in time, and
> > > we should have refreshed the limits then.
> > > 
> > > On the other hand, we do not have infrastructure for noticing that block
> > > limits change after they have been initialized for the first time (this
> > > would require propagating the change upwards to the respective node's
> > > parents), and so evidently we consider this case impossible.
> > I like your optimistic approach, but my interpretation would have been
> > that this is simply a bug. ;-)
> > 
> > blockdev-reopen allows changing options that affect the block limits
> > (most importantly probably request_alignment), so this should be
> > propagated to the parents. I think we'll actually not see failures if we
> > forget to do this, but parents can either advertise excessive alignment
> > requirements or they may run into RMW when accessing the child, so this
> > would only affect performance. This is probably why nobody reported it
> > yet.
> 
> Ah, right, I forgot this for parents of parents...  I thought the
> block limits of a node might change if its children list changes, and
> so we should bdrv_refresh_limits() when that children list changes,
> but forgot that we really do need to propagate this up, right.

I mean the case that you mention is true as well. A few places do call
bdrv_refresh_limits() after changing the graph, but I don't know if it
covers all cases.

> > > If this case is impossible, then we will not need to recurse down in
> > > bdrv_refresh_limits().  Every node's limits are initialized in
> > > bdrv_open_driver(), and are refreshed whenever its children change.
> > > We want to use the children's limits to get some initial default, but
> > > we can just take them, we do not need to refresh them.
> > I think even if we need to propagate to the parents, we still don't need
> > to propagate to the children because the children have already been
> > refreshed by whatever changed their options (like bdrv_reopen_commit()).
> > And parent limits don't influence the child limits at all.
> > 
> > So this patch looks good to me, just not the reasoning.
> 
> OK, so, uh, can we just drop these two paragraphs?  (“On the other hand...”
> and “If this case is impossible…”)
> 
> Or we could replace them with a note hinting at the potential bug that would
> need to be fixed, e.g.
> 
> “
> Consequently, we should actually propagate block limits changes upwards,
> not downwards.  That is a separate and pre-existing issue, though, and
> so will not be addressed in this patch.
> ”

Ok, I'm replacing this in my tree.

> Question is, if we at some point do propagate this upwards, won’t this cause
> exactly the same problem that this patch is trying to get around, i.e. that
> we might call bdrv_refresh_limits() on non-drained parent nodes?

Drain also propagates upwards, so at least those callers that drain the
node itself won't have the problem. And the other cases from the commit
message look like they shouldn't have any parents.
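
As a rough sketch of why upward-propagating drain helps here: the model
below is hypothetical (ToyBDS, toy_drained_begin() and
toy_refresh_upwards() are made-up names, and each node has a single
parent for brevity, whereas real nodes can have several).  It only
illustrates that draining a node also quiesces every ancestor an upward
refresh would touch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyBDS {
    struct ToyBDS *parent;  /* single parent, a simplification */
    int quiesce_counter;    /* > 0: no new I/O reaches this node */
    unsigned refreshes;     /* how often limits were refreshed */
} ToyBDS;

/* Draining a node quiesces the node and all of its ancestors. */
void toy_drained_begin(ToyBDS *n)
{
    for (; n; n = n->parent) {
        n->quiesce_counter++;
    }
}

void toy_drained_end(ToyBDS *n)
{
    for (; n; n = n->parent) {
        assert(n->quiesce_counter > 0);
        n->quiesce_counter--;
    }
}

/*
 * Hypothetical upward refresh: refuses to run if any node it would
 * touch could still receive I/O.
 */
bool toy_refresh_upwards(ToyBDS *n)
{
    for (ToyBDS *p = n; p; p = p->parent) {
        if (p->quiesce_counter == 0) {
            return false;
        }
    }
    for (ToyBDS *p = n; p; p = p->parent) {
        p->refreshes++;
    }
    return true;
}

/* Builds root <- mid <- leaf and tries an upward refresh from leaf. */
bool toy_upward_refresh_is_safe(bool drain_first)
{
    ToyBDS root = { 0 };
    ToyBDS mid = { .parent = &root };
    ToyBDS leaf = { .parent = &mid };
    bool ok;

    if (drain_first) {
        toy_drained_begin(&leaf);
    }
    ok = toy_refresh_upwards(&leaf);
    if (drain_first) {
        toy_drained_end(&leaf);
    }
    return ok;
}
```

In this model a caller that drains the node before refreshing covers all
ancestors automatically; a caller that does not drain would be rejected.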

Kevin
Hanna Czenczek March 4, 2022, 2:59 p.m. UTC | #4
On 04.03.22 15:14, Kevin Wolf wrote:
> On 04.03.2022 at 13:44, Hanna Reitz wrote:
>> On 03.03.22 17:56, Kevin Wolf wrote:
>>> On 16.02.2022 at 11:53, Hanna Reitz wrote:
>>>> bdrv_refresh_limits() recurses down to the node's children.  That does
>>>> not seem necessary: When we refresh limits on some node, and then
>>>> recurse down and were to change one of its children's BlockLimits, then
>>>> that would mean we noticed the changed limits by pure chance.  The fact
>>>> that we refresh the parent's limits has nothing to do with it, so the
>>>> reason for the change probably happened before this point in time, and
>>>> we should have refreshed the limits then.
>>>>
>>>> On the other hand, we do not have infrastructure for noticing that block
>>>> limits change after they have been initialized for the first time (this
>>>> would require propagating the change upwards to the respective node's
>>>> parents), and so evidently we consider this case impossible.
>>> I like your optimistic approach, but my interpretation would have been
>>> that this is simply a bug. ;-)
>>>
>>> blockdev-reopen allows changing options that affect the block limits
>>> (most importantly probably request_alignment), so this should be
>>> propagated to the parents. I think we'll actually not see failures if we
>>> forget to do this, but parents can either advertise excessive alignment
>>> requirements or they may run into RMW when accessing the child, so this
>>> would only affect performance. This is probably why nobody reported it
>>> yet.
>> Ah, right, I forgot this for parents of parents...  I thought the
>> block limits of a node might change if its children list changes, and
>> so we should bdrv_refresh_limits() when that children list changes,
>> but forgot that we really do need to propagate this up, right.
> I mean the case that you mention is true as well. A few places do call
> bdrv_refresh_limits() after changing the graph, but I don't know if it
> covers all cases.
>
>>>> If this case is impossible, then we will not need to recurse down in
>>>> bdrv_refresh_limits().  Every node's limits are initialized in
>>>> bdrv_open_driver(), and are refreshed whenever its children change.
>>>> We want to use the children's limits to get some initial default, but
>>>> we can just take them, we do not need to refresh them.
>>> I think even if we need to propagate to the parents, we still don't need
>>> to propagate to the children because the children have already been
>>> refreshed by whatever changed their options (like bdrv_reopen_commit()).
>>> And parent limits don't influence the child limits at all.
>>>
>>> So this patch looks good to me, just not the reasoning.
>> OK, so, uh, can we just drop these two paragraphs?  (“On the other hand...”
>> and “If this case is impossible…”)
>>
>> Or we could replace them with a note hinting at the potential bug that would
>> need to be fixed, e.g.
>>
>> “
>> Consequently, we should actually propagate block limits changes upwards,
>> not downwards.  That is a separate and pre-existing issue, though, and
>> so will not be addressed in this patch.
>> ”
> Ok, I'm replacing this in my tree.
>
>> Question is, if we at some point do propagate this upwards, won’t this cause
>> exactly the same problem that this patch is trying to get around, i.e. that
>> we might call bdrv_refresh_limits() on non-drained parent nodes?
> Drain also propagates upwards, so at least those callers that drain the
> node itself won't have the problem. And the other cases from the commit
> message look like they shouldn't have any parents.

Finally some good news today :)

Patch

diff --git a/block/io.c b/block/io.c
index 4e4cb556c5..c3e7301613 100644
--- a/block/io.c
+++ b/block/io.c
@@ -189,10 +189,6 @@ void bdrv_refresh_limits(BlockDriverState *bs, Transaction *tran, Error **errp)
     QLIST_FOREACH(c, &bs->children, next) {
         if (c->role & (BDRV_CHILD_DATA | BDRV_CHILD_FILTERED | BDRV_CHILD_COW))
         {
-            bdrv_refresh_limits(c->bs, tran, errp);
-            if (*errp) {
-                return;
-            }
             bdrv_merge_limits(&bs->bl, &c->bs->bl);
             have_limits = true;
         }