[for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap
diff mbox series

Message ID 20191205193049.30666-1-vsementsov@virtuozzo.com
State New
Headers show
Series
  • [for-4.2?] block/qcow2-bitmap: fix crash bug in qcow2_co_remove_persistent_dirty_bitmap
Related show

Commit Message

Vladimir Sementsov-Ogievskiy Dec. 5, 2019, 7:30 p.m. UTC
Here is double bug:

First, return error but not set errp. This may lead to:
qmp block-dirty-bitmap-remove may report success when actually failed

block-dirty-bitmap-remove used in a transaction will crash, as
qmp_transaction will think that it returned success and will cal
block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
NULL

Second (like in anecdote), this case is not an error at all. As it is
documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
definition, absence of bitmap is not an error, and similar case handled
at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
there is no bitmaps at all..

But when there are some bitmaps, but not the requested one, it return
error with errp unset.

Fix that.

Fixes: b56a1e31759b750
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi all!

Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
sorry for introducing it, and it sad that it's found so late..

Personally, I think that this one worth rc5, as it makes new bitmap
interfaces unusable. But the decision is yours.

Last minute edit: hmm, actually, transaction action introduced in
4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
command is a regression... Maybe it's OK for stable.

 block/qcow2-bitmap.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

John Snow Dec. 5, 2019, 7:39 p.m. UTC | #1
On 12/5/19 2:30 PM, Vladimir Sementsov-Ogievskiy wrote:
> Here is double bug:
> 
> First, return error but not set errp. This may lead to:
> qmp block-dirty-bitmap-remove may report success when actually failed
> 
> block-dirty-bitmap-remove used in a transaction will crash, as
> qmp_transaction will think that it returned success and will cal
> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
> NULL
> 
> Second (like in anecdote), this case is not an error at all. As it is
> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
> definition, absence of bitmap is not an error, and similar case handled
> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
> there is no bitmaps at all..
> 
> But when there are some bitmaps, but not the requested one, it return
> error with errp unset.
> 
> Fix that.
> 
> Fixes: b56a1e31759b750
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
> sorry for introducing it, and it sad that it's found so late..
> 
> Personally, I think that this one worth rc5, as it makes new bitmap
> interfaces unusable. But the decision is yours.
> 
> Last minute edit: hmm, actually, transaction action introduced in
> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
> command is a regression... Maybe it's OK for stable.
> 

Might be hard to justify an RC5 for the sake of this patch.

If we need an RC5 for other reasons, I'd want to include this.

>  block/qcow2-bitmap.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 8abaf632fc..c6c8ebbe89 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>      Qcow2BitmapList *bm_list;
>  
>      if (s->nb_bitmaps == 0) {
> -        /* Absence of the bitmap is not an error: see explanation above
> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
> +        /*
> +         * Absence of the bitmap is not an error: see explanation above
> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
> +         */
>          return 0;
>      }
>  
> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>  
>      bm = find_bitmap_by_name(bm_list, name);
>      if (bm == NULL) {
> -        ret = -EINVAL;
> +        /* Absence of the bitmap is not an error, see above. */
> +        ret = 0;
>          goto out;
>      }
>  
>
Eric Blake Dec. 5, 2019, 8:09 p.m. UTC | #2
On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
> Here is double bug:
> 
> First, return error but not set errp. This may lead to:
> qmp block-dirty-bitmap-remove may report success when actually failed
> 
> block-dirty-bitmap-remove used in a transaction will crash, as
> qmp_transaction will think that it returned success and will cal

call

> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
> NULL
> 
> Second (like in anecdote), this case is not an error at all. As it is
> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
> definition, absence of bitmap is not an error, and similar case handled
> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
> there is no bitmaps at all..

double .

> 
> But when there are some bitmaps, but not the requested one, it return
> error with errp unset.
> 
> Fix that.
> 
> Fixes: b56a1e31759b750
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
> sorry for introducing it, and it sad that it's found so late..
> 
> Personally, I think that this one worth rc5, as it makes new bitmap
> interfaces unusable. But the decision is yours.
> 
> Last minute edit: hmm, actually, transaction action introduced in
> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
> command is a regression... Maybe it's OK for stable.

Libvirt REALLY wants to use transaction bitmap management (and require 
qemu 4.2) for its incremental backup patches that Peter is almost ready 
to merge in.  I'm trying to ascertain:

When exactly can this bug hit?  Can you give a sequence of QMP commands 
that will trigger it for easy reproduction?  Are there workarounds (such 
as checking that a bitmap exists prior to attempting to remove it)?  If 
it does NOT get fixed in rc5, how will libvirt be able to probe whether 
the fix is in place?

> 
>   block/qcow2-bitmap.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 8abaf632fc..c6c8ebbe89 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>       Qcow2BitmapList *bm_list;
>   
>       if (s->nb_bitmaps == 0) {
> -        /* Absence of the bitmap is not an error: see explanation above
> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
> +        /*
> +         * Absence of the bitmap is not an error: see explanation above
> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
> +         */
>           return 0;
>       }
>   
> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>   
>       bm = find_bitmap_by_name(bm_list, name);
>       if (bm == NULL) {
> -        ret = -EINVAL;
> +        /* Absence of the bitmap is not an error, see above. */
> +        ret = 0;
>           goto out;
>       }
>   
>
John Snow Dec. 5, 2019, 8:16 p.m. UTC | #3
On 12/5/19 3:09 PM, Eric Blake wrote:
> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>> Here is double bug:
>>
>> First, return error but not set errp. This may lead to:
>> qmp block-dirty-bitmap-remove may report success when actually failed
>>
>> block-dirty-bitmap-remove used in a transaction will crash, as
>> qmp_transaction will think that it returned success and will cal
> 
> call
> 
>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>> NULL
>>
>> Second (like in anecdote), this case is not an error at all. As it is
>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>> definition, absence of bitmap is not an error, and similar case handled
>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>> there is no bitmaps at all..
> 
> double .
> 
>>
>> But when there are some bitmaps, but not the requested one, it return
>> error with errp unset.
>>
>> Fix that.
>>
>> Fixes: b56a1e31759b750
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> Hi all!
>>
>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>> sorry for introducing it, and it sad that it's found so late..
>>
>> Personally, I think that this one worth rc5, as it makes new bitmap
>> interfaces unusable. But the decision is yours.
>>
>> Last minute edit: hmm, actually, transaction action introduced in
>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>> command is a regression... Maybe it's OK for stable.
> 
> Libvirt REALLY wants to use transaction bitmap management (and require
> qemu 4.2) for its incremental backup patches that Peter is almost ready
> to merge in.  I'm trying to ascertain:
> 
> When exactly can this bug hit?  Can you give a sequence of QMP commands
> that will trigger it for easy reproduction?  Are there workarounds (such
> as checking that a bitmap exists prior to attempting to remove it)?  If
> it does NOT get fixed in rc5, how will libvirt be able to probe whether
> the fix is in place?
> 

It looks like:

- You need to have at least one bitmap
- You need to use transactional remove
- you need to misspell the bitmap name
- The remove will fail (return -EINVAL) but doesn't set errp
- The caller chokes on incomplete information, state->bitmap is NULL

>>
>>   block/qcow2-bitmap.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>> index 8abaf632fc..c6c8ebbe89 100644
>> --- a/block/qcow2-bitmap.c
>> +++ b/block/qcow2-bitmap.c
>> @@ -1469,8 +1469,10 @@ int coroutine_fn
>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       Qcow2BitmapList *bm_list;
>>         if (s->nb_bitmaps == 0) {
>> -        /* Absence of the bitmap is not an error: see explanation above
>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>> +        /*
>> +         * Absence of the bitmap is not an error: see explanation above
>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>> +         */
>>           return 0;
>>       }
>>   @@ -1485,7 +1487,8 @@ int coroutine_fn
>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>         bm = find_bitmap_by_name(bm_list, name);
>>       if (bm == NULL) {
>> -        ret = -EINVAL;
>> +        /* Absence of the bitmap is not an error, see above. */
>> +        ret = 0;
>>           goto out;
>>       }
>>  
>
Eric Blake Dec. 5, 2019, 9:53 p.m. UTC | #4
On 12/5/19 2:16 PM, John Snow wrote:

>>> Last minute edit: hmm, actually, transaction action introduced in
>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>> command is a regression... Maybe it's OK for stable.
>>
>> Libvirt REALLY wants to use transaction bitmap management (and require
>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>> to merge in.  I'm trying to ascertain:
>>
>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>> that will trigger it for easy reproduction?  Are there workarounds (such
>> as checking that a bitmap exists prior to attempting to remove it)?  If
>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>> the fix is in place?
>>
> 
> It looks like:
> 
> - You need to have at least one bitmap
> - You need to use transactional remove
> - you need to misspell the bitmap name
> - The remove will fail (return -EINVAL) but doesn't set errp
> - The caller chokes on incomplete information, state->bitmap is NULL

So in libvirt's case, as long as libvirt manages bitmaps completely, 
it's a bug in libvirt to request deletion of a bitmap that doesn't 
exist.  Or, someone messes with a qcow2 image of an offline guest behind 
libvirt's back without updating libvirt's metadata of what bitmaps 
should exist, and then if libvirt fails to check that a bitmap actually 
exists, a user may be able to coerce libvirt into requesting a bitmap 
deletion that will cause a qemu crash, but that's the user's fault for 
going behind libvirt's back.  Or, libvirt could add code that instead of 
trying to blindly delete a bitmap, it first makes a QMP call to ensure 
the bitmap still exists (annoying, but harmless even when the bug is 
fixed), instead of blaming the bug on the user operating behind 
libvirt's back.

The bug is nasty, but feels to be enough of a corner case that I think 
deferring to 5.0 with CC: stable (and then downstreams can backport it 
at will) is the right approach; no need to hold up 4.2 if this is the 
only flaw.  But I'm also not opposed to it going in 4.2 if we have 
anything else serious.
John Snow Dec. 5, 2019, 10 p.m. UTC | #5
On 12/5/19 4:53 PM, Eric Blake wrote:
> On 12/5/19 2:16 PM, John Snow wrote:
> 
>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>> 4.2, so crash is not a regression, only broken
>>>> block-dirty-bitmap-remove
>>>> command is a regression... Maybe it's OK for stable.
>>>
>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>> to merge in.  I'm trying to ascertain:
>>>
>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>> that will trigger it for easy reproduction?  Are there workarounds (such
>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>> the fix is in place?
>>>
>>
>> It looks like:
>>
>> - You need to have at least one bitmap
>> - You need to use transactional remove
>> - you need to misspell the bitmap name
>> - The remove will fail (return -EINVAL) but doesn't set errp
>> - The caller chokes on incomplete information, state->bitmap is NULL
> 
> So in libvirt's case, as long as libvirt manages bitmaps completely,
> it's a bug in libvirt to request deletion of a bitmap that doesn't
> exist.  Or, someone messes with a qcow2 image of an offline guest behind
> libvirt's back without updating libvirt's metadata of what bitmaps
> should exist, and then if libvirt fails to check that a bitmap actually
> exists, a user may be able to coerce libvirt into requesting a bitmap
> deletion that will cause a qemu crash, but that's the user's fault for
> going behind libvirt's back.  Or, libvirt could add code that instead of
> trying to blindly delete a bitmap, it first makes a QMP call to ensure
> the bitmap still exists (annoying, but harmless even when the bug is
> fixed), instead of blaming the bug on the user operating behind
> libvirt's back.
> 
> The bug is nasty, but feels to be enough of a corner case that I think
> deferring to 5.0 with CC: stable (and then downstreams can backport it
> at will) is the right approach; no need to hold up 4.2 if this is the
> only flaw.  But I'm also not opposed to it going in 4.2 if we have
> anything else serious.
> 

Further, the NASTIEST problem is with transactional remove, which is new
to 4.2.

Normal remove is also broken, but won't choke because it doesn't hold
undo information.

Vladimir, do you agree with this assessment? Do we have it correct?

--js
Vladimir Sementsov-Ogievskiy Dec. 6, 2019, 10:18 a.m. UTC | #6
05.12.2019 23:16, John Snow wrote:
> 
> 
> On 12/5/19 3:09 PM, Eric Blake wrote:
>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> Here is double bug:
>>>
>>> First, return error but not set errp. This may lead to:
>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>
>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>> qmp_transaction will think that it returned success and will cal
>>
>> call
>>
>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>> NULL
>>>
>>> Second (like in anecdote), this case is not an error at all. As it is
>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>> definition, absence of bitmap is not an error, and similar case handled
>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>> there is no bitmaps at all..
>>
>> double .
>>
>>>
>>> But when there are some bitmaps, but not the requested one, it return
>>> error with errp unset.
>>>
>>> Fix that.
>>>
>>> Fixes: b56a1e31759b750
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>
>>> Hi all!
>>>
>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>> sorry for introducing it, and it sad that it's found so late..
>>>
>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>> interfaces unusable. But the decision is yours.
>>>
>>> Last minute edit: hmm, actually, transaction action introduced in
>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>> command is a regression... Maybe it's OK for stable.
>>
>> Libvirt REALLY wants to use transaction bitmap management (and require
>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>> to merge in.  I'm trying to ascertain:
>>
>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>> that will trigger it for easy reproduction?  Are there workarounds (such
>> as checking that a bitmap exists prior to attempting to remove it)?  If
>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>> the fix is in place?
>>
> 
> It looks like:
> 
> - You need to have at least one bitmap
> - You need to use transactional remove
> - you need to misspell the bitmap name
> - The remove will fail (return -EINVAL) but doesn't set errp
> - The caller chokes on incomplete information, state->bitmap is NULL


No, that would be too simple, the thing is worse. Absolutele correct removing is broken, without any misspelling

Bug triggers when we are removing persistent bitmap that is not stored yet in the image AND at least one (another) bitmap already stored in the image. So, something like:

1. create persistent bitmap A
2. shutdown vm  (bitmap A is synced)
3. start vm
4. create persistent bitmap B
5. remove bitmap B - it fails (and crashes if in transaction)

====

Hmm, workaround..

I'm afraid that the only thing is not remove persistent bitmaps, which were never synced to the image. So, instead the sequence above, we need


1. create persistent bitmap A
2. shutdown vm
3. start vm
4. create persistent bitmap B
5. remember, that we want to remove bitmap B after vm shutdown
...
  some other operations
...
6. vm shutdown
7. start vm in stopped mode, and remove all bitmaps marked for removing
8. stop vm

But, I think that in real circumstances, vm shutdown is rare thing...


> 
>>>
>>>    block/qcow2-bitmap.c | 9 ++++++---
>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>>> index 8abaf632fc..c6c8ebbe89 100644
>>> --- a/block/qcow2-bitmap.c
>>> +++ b/block/qcow2-bitmap.c
>>> @@ -1469,8 +1469,10 @@ int coroutine_fn
>>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>>        Qcow2BitmapList *bm_list;
>>>          if (s->nb_bitmaps == 0) {
>>> -        /* Absence of the bitmap is not an error: see explanation above
>>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>>> +        /*
>>> +         * Absence of the bitmap is not an error: see explanation above
>>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>>> +         */
>>>            return 0;
>>>        }
>>>    @@ -1485,7 +1487,8 @@ int coroutine_fn
>>> qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>>          bm = find_bitmap_by_name(bm_list, name);
>>>        if (bm == NULL) {
>>> -        ret = -EINVAL;
>>> +        /* Absence of the bitmap is not an error, see above. */
>>> +        ret = 0;
>>>            goto out;
>>>        }
>>>   
>>
>
Vladimir Sementsov-Ogievskiy Dec. 6, 2019, 2:29 p.m. UTC | #7
05.12.2019 23:09, Eric Blake wrote:
> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>> Here is double bug:
>>
>> First, return error but not set errp. This may lead to:
>> qmp block-dirty-bitmap-remove may report success when actually failed
>>
>> block-dirty-bitmap-remove used in a transaction will crash, as
>> qmp_transaction will think that it returned success and will cal
> 
> call
> 
>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>> NULL
>>
>> Second (like in anecdote), this case is not an error at all. As it is
>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>> definition, absence of bitmap is not an error, and similar case handled
>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>> there is no bitmaps at all..
> 
> double .
> 
>>
>> But when there are some bitmaps, but not the requested one, it return
>> error with errp unset.
>>
>> Fix that.
>>
>> Fixes: b56a1e31759b750
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> Hi all!
>>
>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>> sorry for introducing it, and it sad that it's found so late..
>>
>> Personally, I think that this one worth rc5, as it makes new bitmap
>> interfaces unusable. But the decision is yours.
>>
>> Last minute edit: hmm, actually, transaction action introduced in
>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>> command is a regression... Maybe it's OK for stable.
> 
> Libvirt REALLY wants to use transaction bitmap management (and require qemu 4.2) for its incremental backup patches that Peter is almost ready to merge in.  I'm trying to ascertain:

Side question:

Is libvirt prepared to inconsistent bitmaps?

For example, migration with enabled dirty-bitmaps capability will fail if there are inconsistent bitmaps.
Also, incremental backup may be affected (you see, that you have your bitmap, try to do something with it,
aiming to start next incremental backup, but bitmap is inconsistent, and operation fails)..

In fact, we in Virtuozzo now faced with these broken migration and backup because of inconsistent bitmaps
(which were ignored in past), and we have to apply a temporary fix like

--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -986,8 +986,8 @@ bool qcow2_load_dirty_bitmaps(BlockDriverState *bs, Error **errp)
      QSIMPLEQ_FOREACH(bm, bm_list, entry) {
          BdrvDirtyBitmap *bitmap;

-        if ((bm->flags & BME_FLAG_IN_USE) &&
-            bdrv_find_dirty_bitmap(bs, bm->name))
+        if ((bm->flags & BME_FLAG_IN_USE) /* &&
+            bdrv_find_dirty_bitmap(bs, bm->name) */)
          {
              /*
               * We already have corresponding BdrvDirtyBitmap, and bitmap in the
@@ -1000,6 +1000,13 @@ bool qcow2_load_dirty_bitmaps(BlockDriverState *bs, Error **errp)
               * should not load it on migration target, as we already have this
               * bitmap, being migrated.
               */
+
+            /*
+             * VZ7.12:
+             *
+             * We don't load inconsistent bitmaps at all, as they block
+             * migration. So the second condition is commented out.
+             */
              continue;
          }

And correct way is to handle inconsistent bitmaps in libvirt somehow.


Note, that the source of inconsistent bitmaps are crashes, hard resets, etc, when Qemu is unable to store bitmaps correctly.


> 
> When exactly can this bug hit?  Can you give a sequence of QMP commands that will trigger it for easy reproduction?  Are there workarounds (such as checking that a bitmap exists prior to attempting to remove it)?  If it does NOT get fixed in rc5, how will libvirt be able to probe whether the fix is in place?
> 
>>
>>   block/qcow2-bitmap.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
>> index 8abaf632fc..c6c8ebbe89 100644
>> --- a/block/qcow2-bitmap.c
>> +++ b/block/qcow2-bitmap.c
>> @@ -1469,8 +1469,10 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       Qcow2BitmapList *bm_list;
>>       if (s->nb_bitmaps == 0) {
>> -        /* Absence of the bitmap is not an error: see explanation above
>> -         * bdrv_remove_persistent_dirty_bitmap() definition. */
>> +        /*
>> +         * Absence of the bitmap is not an error: see explanation above
>> +         * bdrv_co_remove_persistent_dirty_bitmap() definition.
>> +         */
>>           return 0;
>>       }
>> @@ -1485,7 +1487,8 @@ int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>>       bm = find_bitmap_by_name(bm_list, name);
>>       if (bm == NULL) {
>> -        ret = -EINVAL;
>> +        /* Absence of the bitmap is not an error, see above. */
>> +        ret = 0;
>>           goto out;
>>       }
>>
>
Eric Blake Dec. 6, 2019, 2:36 p.m. UTC | #8
[adding in Peter Maydell, as there is now potential talk of other 
4.2-worthy patches]

On 12/6/19 4:18 AM, Vladimir Sementsov-Ogievskiy wrote:
> 05.12.2019 23:16, John Snow wrote:
>>
>>
>> On 12/5/19 3:09 PM, Eric Blake wrote:
>>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> Here is double bug:
>>>>
>>>> First, return error but not set errp. This may lead to:
>>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>>
>>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>>> qmp_transaction will think that it returned success and will cal
>>>
>>> call
>>>
>>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>>> NULL
>>>>
>>>> Second (like in anecdote), this case is not an error at all. As it is
>>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>>> definition, absence of bitmap is not an error, and similar case handled
>>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>>> there is no bitmaps at all..
>>>
>>> double .
>>>
>>>>
>>>> But when there are some bitmaps, but not the requested one, it return
>>>> error with errp unset.
>>>>
>>>> Fix that.
>>>>
>>>> Fixes: b56a1e31759b750
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>
>>>> Hi all!
>>>>
>>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>>> sorry for introducing it, and it sad that it's found so late..
>>>>
>>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>>> interfaces unusable. But the decision is yours.
>>>>
>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>> 4.2, so crash is not a regression, only broken block-dirty-bitmap-remove
>>>> command is a regression... Maybe it's OK for stable.
>>>
>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>> to merge in.  I'm trying to ascertain:
>>>
>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>> that will trigger it for easy reproduction?  Are there workarounds (such
>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>> the fix is in place?
>>>
>>
>> It looks like:
>>
>> - You need to have at least one bitmap
>> - You need to use transactional remove
>> - you need to misspell the bitmap name
>> - The remove will fail (return -EINVAL) but doesn't set errp
>> - The caller chokes on incomplete information, state->bitmap is NULL
> 
> 
> No, that would be too simple, the thing is worse. Absolutele correct removing is broken, without any misspelling
> 
> Bug triggers when we are removing persistent bitmap that is not stored yet in the image AND at least one (another) bitmap already stored in the image. So, something like:
> 
> 1. create persistent bitmap A
> 2. shutdown vm  (bitmap A is synced)
> 3. start vm
> 4. create persistent bitmap B
> 5. remove bitmap B - it fails (and crashes if in transaction)
> 
> ====
> 
> Hmm, workaround..
> 
> I'm afraid that the only thing is not remove persistent bitmaps, which were never synced to the image. So, instead the sequence above, we need
> 
> 
> 1. create persistent bitmap A
> 2. shutdown vm
> 3. start vm
> 4. create persistent bitmap B
> 5. remember, that we want to remove bitmap B after vm shutdown
> ...
>    some other operations
> ...
> 6. vm shutdown
> 7. start vm in stopped mode, and remove all bitmaps marked for removing
> 8. stop vm
> 
> But, I think that in real circumstances, vm shutdown is rare thing...

This is sounding a bit more serious. As I said earlier, it shouldn't 
delay 4.2 on its own, but if the fix is obvious (and other than 
comments, it is a single change from 'ret = -EINVAL' to 'ret = 0' which 
fixes a definite reproducible crash), I think it rises to the level of 
acceptable.

I've been so worried about the question of which release, that I don't 
know if I've previously offered:
Reviewed-by: Eric Blake <eblake@redhat.com>
John Snow Dec. 6, 2019, 7:02 p.m. UTC | #9
On 12/6/19 9:36 AM, Eric Blake wrote:
> [adding in Peter Maydell, as there is now potential talk of other
> 4.2-worthy patches]
> 
> On 12/6/19 4:18 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 05.12.2019 23:16, John Snow wrote:
>>>
>>>
>>> On 12/5/19 3:09 PM, Eric Blake wrote:
>>>> On 12/5/19 1:30 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Here is double bug:
>>>>>
>>>>> First, return error but not set errp. This may lead to:
>>>>> qmp block-dirty-bitmap-remove may report success when actually failed
>>>>>
>>>>> block-dirty-bitmap-remove used in a transaction will crash, as
>>>>> qmp_transaction will think that it returned success and will cal
>>>>
>>>> call
>>>>
>>>>> block_dirty_bitmap_remove_commit which will crash, as state->bitmap is
>>>>> NULL
>>>>>
>>>>> Second (like in anecdote), this case is not an error at all. As it is
>>>>> documented in the comment above bdrv_co_remove_persistent_dirty_bitmap
>>>>> definition, absence of bitmap is not an error, and similar case
>>>>> handled
>>>>> at start of qcow2_co_remove_persistent_dirty_bitmap, it returns 0 when
>>>>> there is no bitmaps at all..
>>>>
>>>> double .
>>>>
>>>>>
>>>>> But when there are some bitmaps, but not the requested one, it return
>>>>> error with errp unset.
>>>>>
>>>>> Fix that.
>>>>>
>>>>> Fixes: b56a1e31759b750
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>
>>>>> Hi all!
>>>>>
>>>>> Ohm, suddenly we faced this bug. It's a regression in 4.2. I'm very
>>>>> sorry for introducing it, and it sad that it's found so late..
>>>>>
>>>>> Personally, I think that this one worth rc5, as it makes new bitmap
>>>>> interfaces unusable. But the decision is yours.
>>>>>
>>>>> Last minute edit: hmm, actually, transaction action introduced in
>>>>> 4.2, so crash is not a regression, only broken
>>>>> block-dirty-bitmap-remove
>>>>> command is a regression... Maybe it's OK for stable.
>>>>
>>>> Libvirt REALLY wants to use transaction bitmap management (and require
>>>> qemu 4.2) for its incremental backup patches that Peter is almost ready
>>>> to merge in.  I'm trying to ascertain:
>>>>
>>>> When exactly can this bug hit?  Can you give a sequence of QMP commands
>>>> that will trigger it for easy reproduction?  Are there workarounds
>>>> (such
>>>> as checking that a bitmap exists prior to attempting to remove it)?  If
>>>> it does NOT get fixed in rc5, how will libvirt be able to probe whether
>>>> the fix is in place?
>>>>
>>>
>>> It looks like:
>>>
>>> - You need to have at least one bitmap
>>> - You need to use transactional remove
>>> - you need to misspell the bitmap name
>>> - The remove will fail (return -EINVAL) but doesn't set errp
>>> - The caller chokes on incomplete information, state->bitmap is NULL
>>
>>
>> No, that would be too simple, the thing is worse. Absolutele correct
>> removing is broken, without any misspelling
>>
>> Bug triggers when we are removing persistent bitmap that is not stored
>> yet in the image AND at least one (another) bitmap already stored in
>> the image. So, something like:
>>
>> 1. create persistent bitmap A
>> 2. shutdown vm  (bitmap A is synced)
>> 3. start vm
>> 4. create persistent bitmap B
>> 5. remove bitmap B - it fails (and crashes if in transaction)
>>
>> ====
>>
>> Hmm, workaround..
>>
>> I'm afraid that the only thing is not remove persistent bitmaps, which
>> were never synced to the image. So, instead the sequence above, we need
>>
>>
>> 1. create persistent bitmap A
>> 2. shutdown vm
>> 3. start vm
>> 4. create persistent bitmap B
>> 5. remember, that we want to remove bitmap B after vm shutdown
>> ...
>>    some other operations
>> ...
>> 6. vm shutdown
>> 7. start vm in stopped mode, and remove all bitmaps marked for removing
>> 8. stop vm
>>
>> But, I think that in real circumstances, vm shutdown is rare thing...
> 
> This is sounding a bit more serious. As I said earlier, it shouldn't
> delay 4.2 on its own, but if the fix is obvious (and other than
> comments, it is a single change from 'ret = -EINVAL' to 'ret = 0' which
> fixes a definite reproducible crash), I think it rises to the level of
> acceptable.
> 
> I've been so worried about the question of which release, that I don't
> know if I've previously offered:
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 

Oh, that is quite a bit more serious than I thought too.

Yeah, I want this in 4.2 if at all possible.

Reviewed-by: John Snow <jsnow@redhat.com>
Eric Blake Dec. 6, 2019, 7:48 p.m. UTC | #10
On 12/6/19 1:02 PM, John Snow wrote:

>>> I'm afraid that the only thing is not remove persistent bitmaps, which
>>> were never synced to the image. So, instead the sequence above, we need
>>>
>>>
>>> 1. create persistent bitmap A
>>> 2. shutdown vm
>>> 3. start vm
>>> 4. create persistent bitmap B
>>> 5. remember, that we want to remove bitmap B after vm shutdown
>>> ...
>>>     some other operations
>>> ...
>>> 6. vm shutdown
>>> 7. start vm in stopped mode, and remove all bitmaps marked for removing
>>> 8. stop vm
>>>
>>> But, I think that in real circumstances, vm shutdown is rare thing...

Part of me wonders if we would have detected this MUCH sooner if I had 
gotten my wish of having the qcow2 metadata updated on creation of any 
persistent bitmap (not actually writing out the bitmap itself, just 
updating the bitmap table to mark that there is a new persistent 
inconsistent bitmap), so that a) qemu-img info -U can see the persistent 
bitmap's existence, and b) an unexpected abrupt crash of qemu does not 
lose the existence of the bitmap.  At the time I raised the question, 
the push-back at the time was a desire to minimize writes to the qcow2 
metadata at all costs, warranting our current extreme code contortions 
to keep persistent bitmaps were kept in memory only until VM shutdown. 
But had we been doing it, we would spot problems like this without 
having to do VM shutdown, and our code might actually be simpler than 
our current contortions.  Maybe we should still revisit that decision 
(of course, that question is independent of this patch, and therefore 
5.0 material at earliest).

Patch
diff mbox series

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 8abaf632fc..c6c8ebbe89 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1469,8 +1469,10 @@  int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
     Qcow2BitmapList *bm_list;
 
     if (s->nb_bitmaps == 0) {
-        /* Absence of the bitmap is not an error: see explanation above
-         * bdrv_remove_persistent_dirty_bitmap() definition. */
+        /*
+         * Absence of the bitmap is not an error: see explanation above
+         * bdrv_co_remove_persistent_dirty_bitmap() definition.
+         */
         return 0;
     }
 
@@ -1485,7 +1487,8 @@  int coroutine_fn qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
 
     bm = find_bitmap_by_name(bm_list, name);
     if (bm == NULL) {
-        ret = -EINVAL;
+        /* Absence of the bitmap is not an error, see above. */
+        ret = 0;
         goto out;
     }