[4/6] dirty-bitmaps: clean-up bitmaps loading and migration logic

Message ID 700dffe4-f3a8-8f70-052c-9f6f8ffbe3d3@redhat.com
State New

Commit Message

John Snow July 21, 2018, 2:41 a.m. UTC
On 06/26/2018 09:50 AM, Vladimir Sementsov-Ogievskiy wrote:
> This patch aims to bring the following behavior:

Just as a primer for anyone else reading this email (nobody) who might
not understand the terminology (less than nobody), it might be helpful
to remember that:

-INACTIVATE occurs either as a starting state for a QEMU awaiting an
incoming migration, or as a finishing state for a QEMU that has handed
off control of its disks to a target QEMU during a migration. Despite
being questionable grammar, it largely makes sense.

-INVALIDATE occurs only on inactive images (it is a NOP on active
images); and is used to obliterate any cache/state that may exist from a
prior open call. This also removes the BDRV_O_INACTIVE flag, and
therefore also "activates" an image. It is called in two cases, AFAICT:

(1) To engage an image after a shared storage migration, by the
destination QEMU
(2) To re-engage a previously inactivated image after a failed
migration, by the source QEMU

And for those of you who already knew all of this, this is a chance for
you to correct me if I was wrong.
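
To make that concrete, here is a toy model (Python, purely illustrative
names, not QEMU code) of the transitions described above:

    class ToyImage:
        def __init__(self, inactive):
            self.inactive = inactive        # models BDRV_O_INACTIVE
            self.bitmaps = []               # in-memory bitmap state

    def invalidate_cache(img, bitmaps_on_disk):
        # NOP on active images; on inactive ones, drop stale state,
        # clear the inactive flag and re-read state from the image.
        if not img.inactive:
            return
        img.inactive = False
        img.bitmaps = list(bitmaps_on_disk)

    src = ToyImage(inactive=True)           # e.g. a source after a failed migration
    invalidate_cache(src, ['bitmap0'])
    assert not src.inactive and src.bitmaps == ['bitmap0']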

As further recap, bitmaps can be migrated generally in two ways:

(1) For persistent bitmaps only: as part of a shared storage migration,
they can be flushed to disk before the handoff. This does not require
any special migration capabilities.
(2) For either shared or non-shared storage migrations, bitmaps
regardless of their persistence can be migrated using
migration/block-dirty-bitmap.c. This feature has to be opted into.
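
To make option (2) concrete, here is a minimal sketch of the QMP payloads
involved, mirroring what iotest 169 (quoted later in this thread) does;
the migration URI is a made-up placeholder and the transport to QEMU's
QMP socket is omitted:

    import json

    # Opt in to sending dirty bitmaps over the migration stream, then migrate.
    commands = [
        {'execute': 'migrate-set-capabilities',
         'arguments': {'capabilities': [
             {'capability': 'dirty-bitmaps', 'state': True}]}},
        {'execute': 'migrate',
         'arguments': {'uri': 'unix:/tmp/mig.sock'}},   # placeholder URI
    ]
    for cmd in commands:
        print(json.dumps(cmd))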

> 
> 1. We don't load bitmaps, when started in inactive mode. It's the case
> of incoming migration. In this case we wait for bitmaps migration
> through migration channel (if 'dirty-bitmaps' capability is enabled) or
> for invalidation (to load bitmaps from the image).
> 

OK, so performing qcow2_open while (flags & BDRV_O_INACTIVE) -- when
we're awaiting an incoming migration, generally -- will forgo
attempting to load ANY bitmaps, on the premise that:

(1) We will need to have re-loaded them on invalidate anyway, or
(2) We will be receiving them through the postcopy migration mechanisms.

This sounds correct to me; they won't get set as IN_USE on disk and if
anyone tries to query or address them prior to handoff, they won't
exist, so they can't be misused, either.

> 2. We don't remove persistent bitmaps on inactivation. Instead, we only
> remove bitmaps after storing. This is the only way to restore bitmaps,
> if we decided to resume source after [failed] migration with
> 'dirty-bitmaps' capability enabled (which means, that bitmaps were not
> stored).
> 

So currently, when we inactivate we remove the in-memory bitmaps that we
considered to be associated with the image -- ONLY the persistent ones.
bdrv_release_persistent_dirty_bitmaps() managed this.

However, the current hook for this in bdrv_inactivate_recurse is a bit
of a hack -- it just muses that the bitmaps "should already be stored by
the format driver" -- and it's correct, they SHOULD be, but we can just
move the hook to precisely when we store the bitmaps. This is indeed more
resilient.

For any bitmaps theoretically left behind in this state, we can rest
assured that we cannot write to an INACTIVE node anyway, so they won't
be able to change on us. Either they'll get erased on close, or we'll
re-use them on INVALIDATE.


So now, in the shared storage + no-bitmap-migrate case:

- We flush the bitmaps to disk anyway. The bitmaps are removed on-store.
If we need to INVALIDATE and become active again, we just re-read them
from disk. Any bitmaps we had that were NOT persistent never got
removed, so we are fine.

And in the migrate-bitmap case:

You opt not to allow them to be flushed to disk, which means that not
deleting them is definitely mandatory, in case of failure.



The change that correlates to this bullet-point (delete only on store)
is good regardless, but as of right now I'm confused as to why we can't
flush bitmaps we want to transmit across the live migration channel to
disk anyway.

I guess you want to avoid a double-load in the case where we do a shared
storage migration and a bitmap migration?

The problem I have here is that we simply skip flushing to disk, so the
bitmap remains marked IN_USE on disk even when it isn't truly in use, and
the method of coping with this is to *ignore* bitmaps that are IN_USE
instead of erroring out and refusing to load them.

I think that's dangerous, unless I'm missing something -- I want QEMU to
*error* if it sees an IN_USE bitmap. I think it ought to, as it's not
safe to modify only some of these bitmaps.
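
To spell out the two policies being argued about, a small self-contained
sketch (Python, illustrative names only, not the actual qcow2 code):

    from dataclasses import dataclass

    @dataclass
    class BitmapDirEntry:       # stand-in for a qcow2 bitmap directory entry
        name: str
        in_use: bool            # set while the on-disk data may be inconsistent

    def load_bitmaps(entries, on_in_use='skip'):
        loaded = []
        for e in entries:
            if e.in_use:
                if on_in_use == 'error':    # the behavior argued for above
                    raise RuntimeError("bitmap '%s' is IN_USE" % e.name)
                continue                    # the patch's behavior: silently skip
            loaded.append(e.name)
        return loaded

    entries = [BitmapDirEntry('a', False), BitmapDirEntry('b', True)]
    print(load_bitmaps(entries))            # ['a']; 'b' is silently dropped
    # load_bitmaps(entries, 'error') would raise instead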

I think it's very strange to NOT flush bitmaps to disk on INACTIVATE,
and I think we MUST do so.

> 3. We load bitmaps on open and any invalidation, it's ok for all cases:
>   - normal open

"Normal" open in the !(BDRV_O_INACTIVE) sense, yes.

>   - migration target invalidation with dirty-bitmaps capability
>     (bitmaps are migrating through the migration channel, they are not
>      stored, so they should have IN_USE flag set and will be skipped
>      when loading. However, it would fail if bitmaps are read-only[1])

I don't like this as stated above...

>   - migration target invalidation without dirty-bitmaps capability
>     (normal load of the bitmaps, if migrated with shared storage)

This is good.

>   - source invalidation with dirty-bitmaps capability
>     (skip because IN_USE)

I don't like the idea of skipping bitmaps that are IN_USE and continuing
on as if that's OK. That could be hiding deeper problems.

>   - source invalidation without dirty-bitmaps capability
>     (bitmaps were dropped, reload them)

This is good.

> 
> [1]: to accurately handle this, migration of read-only bitmaps is
>      explicitly forbidden in this patch.
> 
> A new mechanism for not storing bitmaps when migrating with the
> dirty-bitmaps capability is introduced: a migration field in BdrvDirtyBitmap.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/dirty-bitmap.h   |  2 +-
>  block.c                        | 11 ++++---
>  block/dirty-bitmap.c           | 36 +++++++++--------------
>  block/qcow2-bitmap.c           | 16 +++++++++++
>  block/qcow2.c                  | 65 ++++++++++++++++++++++++++++++++++++++++--
>  migration/block-dirty-bitmap.c | 10 +++++--
>  6 files changed, 109 insertions(+), 31 deletions(-)
> 
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index 259bd27c40..95c7847ec6 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -26,7 +26,6 @@ BdrvDirtyBitmap *bdrv_find_dirty_bitmap(BlockDriverState *bs,
>                                          const char *name);
>  void bdrv_release_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap *bitmap);
>  void bdrv_release_named_dirty_bitmaps(BlockDriverState *bs);
> -void bdrv_release_persistent_dirty_bitmaps(BlockDriverState *bs);
>  void bdrv_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>                                           const char *name,
>                                           Error **errp);
> @@ -72,6 +71,7 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap,
>  void bdrv_dirty_bitmap_set_qmp_locked(BdrvDirtyBitmap *bitmap, bool qmp_locked);
>  void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
>                               Error **errp);
> +void bdrv_dirty_bitmap_set_migration(BdrvDirtyBitmap *bitmap, bool migration);
>  
>  /* Functions that require manual locking.  */
>  void bdrv_dirty_bitmap_lock(BdrvDirtyBitmap *bitmap);
> diff --git a/block.c b/block.c
> index 1b8147c1b3..07d7a974d2 100644
> --- a/block.c
> +++ b/block.c
> @@ -4396,6 +4396,7 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
>      uint64_t perm, shared_perm;
>      Error *local_err = NULL;
>      int ret;
> +    BdrvDirtyBitmap *bm;
>  
>      if (!bs->drv)  {
>          return;
> @@ -4445,6 +4446,12 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
>          }
>      }
>  
> +    for (bm = bdrv_dirty_bitmap_next(bs, NULL); bm;
> +         bm = bdrv_dirty_bitmap_next(bs, bm))
> +    {
> +        bdrv_dirty_bitmap_set_migration(bm, false);
> +    }
> +
>      ret = refresh_total_sectors(bs, bs->total_sectors);
>      if (ret < 0) {
>          bs->open_flags |= BDRV_O_INACTIVE;
> @@ -4559,10 +4566,6 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
>          }
>      }
>  
> -    /* At this point persistent bitmaps should be already stored by the format
> -     * driver */
> -    bdrv_release_persistent_dirty_bitmaps(bs);
> -
>      return 0;
>  }
>  
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index c9b8a6fd52..a13c3bdcfa 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -55,6 +55,10 @@ struct BdrvDirtyBitmap {
>                                     and this bitmap must remain unchanged while
>                                     this flag is set. */
>      bool persistent;            /* bitmap must be saved to owner disk image */
> +    bool migration;             /* Bitmap is selected for migration, it should
> +                                   not be stored on the next inactivation
> +                                   (persistent flag doesn't matter until next
> +                                   invalidation).*/
>      QLIST_ENTRY(BdrvDirtyBitmap) list;
>  };
>  
> @@ -384,26 +388,6 @@ void bdrv_release_named_dirty_bitmaps(BlockDriverState *bs)
>  }
>  
>  /**
> - * Release all persistent dirty bitmaps attached to a BDS (for use in
> - * bdrv_inactivate_recurse()).
> - * There must not be any frozen bitmaps attached.
> - * This function does not remove persistent bitmaps from the storage.
> - * Called with BQL taken.
> - */
> -void bdrv_release_persistent_dirty_bitmaps(BlockDriverState *bs)
> -{
> -    BdrvDirtyBitmap *bm, *next;
> -
> -    bdrv_dirty_bitmaps_lock(bs);
> -    QLIST_FOREACH_SAFE(bm, &bs->dirty_bitmaps, list, next) {
> -        if (bdrv_dirty_bitmap_get_persistance(bm)) {
> -            bdrv_release_dirty_bitmap_locked(bm);
> -        }
> -    }
> -    bdrv_dirty_bitmaps_unlock(bs);
> -}
> -
> -/**
>   * Remove persistent dirty bitmap from the storage if it exists.
>   * Absence of bitmap is not an error, because we have the following scenario:
>   * BdrvDirtyBitmap can have .persistent = true but not yet saved and have no
> @@ -756,16 +740,24 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap, bool persistent)
>      qemu_mutex_unlock(bitmap->mutex);
>  }
>  
> +/* Called with BQL taken. */
> +void bdrv_dirty_bitmap_set_migration(BdrvDirtyBitmap *bitmap, bool migration)
> +{
> +    qemu_mutex_lock(bitmap->mutex);
> +    bitmap->migration = migration;
> +    qemu_mutex_unlock(bitmap->mutex);
> +}
> +
>  bool bdrv_dirty_bitmap_get_persistance(BdrvDirtyBitmap *bitmap)
>  {
> -    return bitmap->persistent;
> +    return bitmap->persistent && !bitmap->migration;
>  }
>  
>  bool bdrv_has_changed_persistent_bitmaps(BlockDriverState *bs)
>  {
>      BdrvDirtyBitmap *bm;
>      QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
> -        if (bm->persistent && !bm->readonly) {
> +        if (bm->persistent && !bm->readonly && !bm->migration) {
>              return true;
>          }
>      }
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 69485aa1de..c9662b21c7 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1413,6 +1413,22 @@ void qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs, Error **errp)
>          g_free(tb);
>      }
>  
> +    QSIMPLEQ_FOREACH(bm, bm_list, entry) {
> +        /* For safety, we remove bitmap after storing.
> +         * We may be here in two cases:
> +         * 1. bdrv_close. It's ok to drop bitmap.
> +         * 2. inactivation. It means migration without 'dirty-bitmaps'
> +         *    capability, so bitmaps are not marked with
> +         *    BdrvDirtyBitmap.migration flags. It's not bad to drop them too,
> +         *    and reload on invalidation.
> +         */
> +        if (bm->dirty_bitmap == NULL) {
> +            continue;
> +        }
> +
> +        bdrv_release_dirty_bitmap(bs, bm->dirty_bitmap);
> +    }
> +
>      bitmap_list_free(bm_list);
>      return;
>  
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 0044ff58e7..f5c99f4ed4 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1487,8 +1487,69 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>          s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
>      }
>  
> -    if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
> -        update_header = false;
> +    /* == Handle persistent dirty bitmaps ==
> +     *
> +     * We want load dirty bitmaps in three cases:
> +     *
> +     * 1. Normal open of the disk in active mode, not related to invalidation
> +     *    after migration.
> +     *
> +     * 2. Invalidation of the target vm after pre-copy phase of migration, if
> +     *    bitmaps are _not_ migrating through migration channel, i.e.
> +     *    'dirty-bitmaps' capability is disabled.
> +     *
> +     * 3. Invalidation of source vm after failed or canceled migration.
> +     *    This is a very interesting case. There are two possible types of
> +     *    bitmaps:
> +     *
> +     *    A. Stored on inactivation and removed. They should be loaded from the
> +     *       image.
> +     *
> +     *    B. Not stored: not-persistent bitmaps and bitmaps, migrated through
> +     *       the migration channel (with dirty-bitmaps capability).
> +     *

In my draft below, I suggest that "not stored" can only happen with
non-persistent bitmaps. These are fine to keep in-memory and we don't
need to do anything special to them on INACTIVATE/INVALIDATE cases.

> +     *    On the other hand, there are two possible sub-cases:
> +     *
> +     *    3.1 disk was changed by somebody else while were inactive. In this
> +     *        case all in-RAM dirty bitmaps (both persistent and not) are
> +     *        definitely invalid. And we don't have any method to determine
> +     *        this.
> +     *

I think if the disk was changed out from under us we do indeed have
bigger problems to worry about WRT the memory state, so I wouldn't worry
about this right now.

> +     *        Simple and safe thing is to just drop all the bitmaps of type B on
> +     *        inactivation. But in this case we lose bitmaps in valid 4.2 case.
> +     *
> +     *        On the other hand, resuming source vm, if disk was already changed
> +     *        is a bad thing anyway: not only bitmaps, the whole vm state is
> +     *        out of sync with disk.
> +     *
> +     *        This means, that user or management tool, who for some reason
> +     *        decided to resume source vm, after disk was already changed by
> +     *        target vm, should at least drop all dirty bitmaps by hand.
> +     *
> +     *        So, we can ignore this case for now, but TODO: "generation"
> +     *        extension for qcow2, to determine, that image was changed after
> +     *        last inactivation. And if it is changed, we will drop (or at least
> +     *        mark as 'invalid' all the bitmaps of type B, both persistent
> +     *        and not).
> +     *
> +     *    3.2 disk was _not_ changed while were inactive. Bitmaps may be saved
> +     *        to disk ('dirty-bitmaps' capability disabled), or not saved
> +     *        ('dirty-bitmaps' capability enabled), but we don't need to care
> +     *        of: let's load bitmaps as always: stored bitmaps will be loaded,
> +     *        and not stored has flag IN_USE=1 in the image and will be skipped
> +     *        on loading.
> +     *
> +     * One remaining possible case when we don't want load bitmaps:
> +     *
> +     * 4. Open disk in inactive mode in target vm (bitmaps are migrating or
> +     *    will be loaded on invalidation, no needs try loading them before)
> +     */
> +
> +    if (!(bdrv_get_flags(bs) & BDRV_O_INACTIVE)) {
> +        /* It's case 1, 2 or 3.2. Or 3.1 which is BUG in management layer. */
> +        bool header_updated = qcow2_load_dirty_bitmaps(bs, &local_err);
> +
> +        update_header = update_header && !header_updated;
>      }
>      if (local_err != NULL) {
>          error_propagate(errp, local_err);
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 477826330c..ae4e88354a 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -313,6 +313,12 @@ static int init_dirty_bitmap_migration(void)
>                  goto fail;
>              }
>  
> +            if (bdrv_dirty_bitmap_readonly(bitmap)) {
> +                error_report("Can't migrate read-only dirty bitmap: '%s",
> +                             bdrv_dirty_bitmap_name(bitmap));
> +                goto fail;
> +            }
> +
>              bdrv_ref(bs);
>              bdrv_dirty_bitmap_set_qmp_locked(bitmap, true);
>  
> @@ -335,9 +341,9 @@ static int init_dirty_bitmap_migration(void)
>          }
>      }
>  
> -    /* unset persistance here, to not roll back it */
> +    /* unset migration flags here, to not roll back it */

Stale comment, but I'm proposing we get rid of this idea anyway.

>      QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) {
> -        bdrv_dirty_bitmap_set_persistance(dbms->bitmap, false);
> +        bdrv_dirty_bitmap_set_migration(dbms->bitmap, true);
>      }
>  
>      if (QSIMPLEQ_EMPTY(&dirty_bitmap_mig_state.dbms_list)) {
> 

So having read this, I think the problem is in cases where we migrate
bitmaps in potentially two ways simultaneously: persistence via shared
storage alongside the dirty-bitmaps migration capability.

You've solved this by opting to not flush bitmaps to disk, and then just
skipping them on-load.

I'd rather do something like this:
- Always flush bitmaps to disk on inactivate.
- Skip loading bitmaps if inactive.
- When migrating, omit bitmaps from the stream if they're persistent and
we're doing a shared storage migration. (i.e., stream only when we have to.)
- Refuse to load an image at all if it has IN_USE bitmaps that we are
expected to keep consistent with the disk. (We won't be able to!)

I think this has a lot of good properties:
- disk state is consistent (we never have IN_USE when that's not true)
- it avoids unnecessary traffic for the migration stream
- it gets rid of the special migration bool in bitmaps
- it allows us to hard reject qcow2 files with IN_USE bitmaps, which is
important for consistency.
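
Sketching that policy as toy pseudocode (Python, all names illustrative,
not actual QEMU interfaces):

    def on_inactivate(node):
        for bm in node.bitmaps:
            if bm.persistent:
                node.store_to_image(bm)     # always flush persistent bitmaps

    def should_stream(bm, shared_storage):
        # put a bitmap in the migration stream only when the disk can't carry it
        return not (bm.persistent and shared_storage)

    def on_open_or_invalidate(node, inactive):
        if inactive:
            return                          # don't load bitmaps while inactive
        for entry in node.image_bitmap_directory:
            if entry.in_use:
                raise RuntimeError("refusing image with IN_USE bitmaps")
            node.load(entry)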

This proposal breaks your test suite, though, because you don't *actually*
initiate a block migration; you just simulate it by giving it a blank
qcow2 to migrate the bitmaps to -- so you can fix it by actually doing
a block migration of the empty disks.

Here's a patch presented as patch 7/6 that:

- Removes set_migration flag, functions, and calls
- Changes the migbitmap behavior
- Adjusts test 169 accordingly

(Because I'm positive thunderbird is about to mangle this, I've mirrored
it on github:
https://github.com/jnsnow/qemu/commit/4f3f2a30c60b793312a3b994f8defbb2f2402121
)

Comments

Dr. David Alan Gilbert Aug. 1, 2018, 10:20 a.m. UTC | #1
* John Snow (jsnow@redhat.com) wrote:

<snip>

> I'd rather do something like this:
> - Always flush bitmaps to disk on inactivate.

Does that increase the time taken by the inactivate measurably?
If it's small relative to everything else that's fine; it's just that I
always worry a little, since I think this happens after we've stopped the
CPU on the source and so is part of the 'downtime'.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Vladimir Sementsov-Ogievskiy Aug. 1, 2018, 12:24 p.m. UTC | #2
21.07.2018 05:41, John Snow wrote:
> On 06/26/2018 09:50 AM, Vladimir Sementsov-Ogievskiy wrote:
>> This patch aims to bring the following behavior:
> Just as a primer for anyone else reading this email (nobody) who might
> not understand the terminology (less than nobody), it might be helpful
> to remember that:
>
> -INACTIVATE occurs either as a starting state for a QEMU awaiting an
> incoming migration, or as a finishing state for a QEMU that has handed
> off control of its disks to a target QEMU during a migration. Despite
> being questionable grammar, it largely makes sense.
>
> -INVALIDATE occurs only on inactive images (it is a NOP on active
> images); and is used to obliterate any cache/state that may exist from a
> prior open call. This also removes the BDRV_O_INACTIVE flag, and
> therefore also "activates" an image. It is called in two cases, AFAICT:
>
> (1) To engage an image after a shared storage migration, by the
> destination QEMU
> (2) To re-engage a previously inactivated image after a failed
> migration, by the source QEMU
>
> And for those of you who already knew all of this, this is a chance for
> you to correct me if I was wrong.
>
> As further recap, bitmaps can be migrated generally in two ways:
>
> (1) For persistent bitmaps only: as part of a shared storage migration,
> they can be flushed to disk before the handoff. This does not require
> any special migration capabilities.
> (2) For either shared or non-shared storage migrations, bitmaps
> regardless of their persistence can be migrated using
> migration/block-dirty-bitmap.c. This feature has to be opted into.
>
>> 1. We don't load bitmaps, when started in inactive mode. It's the case
>> of incoming migration. In this case we wait for bitmaps migration
>> through migration channel (if 'dirty-bitmaps' capability is enabled) or
>> for invalidation (to load bitmaps from the image).
>>
> OK, so performing qcow2_open while (flags & BDRV_O_INACTIVE) -- when
> we're awaiting an incoming migration, generally -- will forego
> attempting to load ANY bitmaps, under the pretense that:
>
> (1) We will need to have re-loaded them on invalidate anyway, or
> (2) We will be receiving them through the postcopy migration mechanisms.
>
> This sounds correct to me; they won't get set as IN_USE on disk and if
> anyone tries to query or address them prior to handoff, they won't
> exist, so they can't be misused, either.
>
>> 2. We don't remove persistent bitmaps on inactivation. Instead, we only
>> remove bitmaps after storing. This is the only way to restore bitmaps,
>> if we decided to resume source after [failed] migration with
>> 'dirty-bitmaps' capability enabled (which means, that bitmaps were not
>> stored).
>>
> So currently, when we inactivate we remove the in-memory bitmaps that we
> considered to be associated with the bitmap -- ONLY for persistent ones.
> bdrv_release_persistent_dirty_bitmaps() managed this.
>
> However, the current hook for this in  bdrv_inactivate_recurse is a bit
> of a hack -- it just muses that the bitmaps "should already be stored by
> the format driver" -- and it's correct, they SHOULD be, but we can just
> move the hook to precisely when we store the bitmap. This is indeed more
> resilient.
>
> For any bitmaps theoretically left behind in this state, we can rest
> assured that we cannot write to an INACTIVE node anyway, so they won't
> be able to change on us. Either they'll get erased on close, or we'll
> re-use them on INVALIDATE.
>
>
> So now, in the shared storage + no-bitmap-migrate case:
>
> - We flush the bitmaps to disk anyway. The bitmaps are removed on-store.
> If we need to INVALIDATE and become active again, we just re-read them
> from disk. Any bitmaps we had that were NOT persistent never got
> removed, so we are fine.
>
> And in the migrate-bitmap case:
>
> You opt not to allow them to be flushed to disk, which means that not
> deleting them is definitely mandatory, in case of failure.
>
>
>
> The change that correlates to this bullet-point (delete only on store)
> is good regardless, but as of right now I'm confused as to why we can't
> flush bitmaps we want to transmit across the live migration channel to
> disk anyway.
>
> I guess you want to avoid a double-load in the case where we do a shared
> storage migration and a bitmap migration?

storing a lot of bitmap data on inactivate can significantly increase 
downtime of migration.

>
> The problem I have here is that this means that we simply ignore
> flushing to disk, so the bitmap remains IN_USE even when it isn't truly
> IN_USE, and that the method of coping with this means *ignoring* bitmaps
> that are IN_USE instead of erroring out and failing to load.

"IN_USE" means that bitmap was not successfully stored after last usage, 
and should be considered as inconsistent

>
> I think that's dangerous, unless I'm missing something -- I want QEMU to
> *error* if it sees an IN_USE bitmap. I think it ought to, as it's not
> safe to modify only some of these bitmaps.

hmm, what is unsafe about it? if a bitmap is IN_USE it will not be loaded,
so it's up to libvirt to error out on its absence or to recreate it.
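
For illustration, recreating a dropped bitmap from the management side is
a single QMP command; the node and bitmap names below are placeholders:

    import json

    # If a bitmap is absent (or was IN_USE and skipped), management can
    # recreate it and fall back to one full backup for that disk.
    recreate = {'execute': 'block-dirty-bitmap-add',
                'arguments': {'node': 'drive0',      # placeholder node name
                              'name': 'bitmap0',     # placeholder bitmap name
                              'persistent': True}}
    print(json.dumps(recreate))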

>
> I think it's very strange to NOT flush bitmaps to disk on INACTIVATE,
> and I think we MUST do so.

Then we can't do online migration.

>
>> 3. We load bitmaps on open and any invalidation, it's ok for all cases:
>>    - normal open
> "Normal" open in the !(BDRV_O_INACTIVE) sense, yes.
>
>>    - migration target invalidation with dirty-bitmaps capability
>>      (bitmaps are migrating through the migration channel, they are not
>>       stored, so they should have IN_USE flag set and will be skipped
>>       when loading. However, it would fail if bitmaps are read-only[1])
> I don't like this as stated above...
>
>>    - migration target invalidation without dirty-bitmaps capability
>>      (normal load of the bitmaps, if migrated with shared storage)
> This is good.
>
>>    - source invalidation with dirty-bitmaps capability
>>      (skip because IN_USE)
> I don't like the idea of skipping bitmaps that are IN_USE and continuing
> on as if that's OK. That could be hiding deeper problems.

I think it's a normal way to implement online migration of bitmaps
without storing them (which would increase downtime), while keeping the
possibility to resume the source on migration failure.

>
>>    - source invalidation without dirty-bitmaps capability
>>      (bitmaps were dropped, reload them)
> This is good.
>
>> [1]: to accurately handle this, migration of read-only bitmaps is
>>       explicitly forbidden in this patch.
>>
>> A new mechanism for not storing bitmaps when migrating with the
>> dirty-bitmaps capability is introduced: a migration field in BdrvDirtyBitmap.
>>
<snip>
> So having read this, I think the problem is in cases where we migrate
> bitmaps in potentially two ways simultaneously: persistent/shared
> alongside migrate bitmaps capability.
>
> You've solved this by opting to not flush bitmaps to disk, and then just
> skipping them on-load.
>
> I'd rather do something like this:
> - Always flush bitmaps to disk on inactivate.

this is the most significant of your changes and it is unacceptable for 
us. We can't store bitmaps in migration downtime, as it is slow.
64k granulated bitmap for 16tb disk has 30mb size. And what if we have 
several of such bitmaps?

Again, we can't do it. So, I'm for my version of patch series.

> - Skip loading bitmaps if inactive.
> - When migrating, omit bitmaps from the stream if they're persistent and
> we're doing a shared storage migration. (i.e, stream only when we have to.)
> - Refuse to load an image at all if it has IN_USE bitmaps that we are
> expected to keep consistent with the disk. (We won't be able to!)
>
> I think this has a lot of good properties:
> - disk state is consistent (we never have IN_USE when that's not true)
> - it avoids unnecessary traffic for the migration stream
> - it gets rid of the special migration bool in bitmaps
> - it allows us to hard reject qcow2 files with IN_USE bitmaps, which is
> important for consistency.
>
> this breaks your test suite, though, because you don't *actually*
> initiate a block migration, you just simulate it by giving it a blank
> qcow2 to migrate the bitmaps to -- so you can fix it by actually doing
> a block migration of the empty disks.
>
> Here's a patch presented as patch 7/6 that:
>
> - Removes set_migration flag, functions, and calls
> - Changes the migbitmap behavior
> - Adjusts test 169 accordingly
>
> (Because I'm positive thunderbird is about to mangle this, I've mirrored
> it on github:
> https://github.com/jnsnow/qemu/commit/4f3f2a30c60b793312a3b994f8defbb2f2402121
> )
>
> diff --git a/block.c b/block.c
> index 5f1a133417..f16bf106d1 100644
> --- a/block.c
> +++ b/block.c
> @@ -4334,7 +4334,6 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
>       uint64_t perm, shared_perm;
>       Error *local_err = NULL;
>       int ret;
> -    BdrvDirtyBitmap *bm;
>
>       if (!bs->drv)  {
>           return;
> @@ -4384,12 +4383,6 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
>           }
>       }
>
> -    for (bm = bdrv_dirty_bitmap_next(bs, NULL); bm;
> -         bm = bdrv_dirty_bitmap_next(bs, bm))
> -    {
> -        bdrv_dirty_bitmap_set_migration(bm, false);
> -    }
> -
>       ret = refresh_total_sectors(bs, bs->total_sectors);
>       if (ret < 0) {
>           bs->open_flags |= BDRV_O_INACTIVE;
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index a13c3bdcfa..e44061fe2e 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -55,10 +55,6 @@ struct BdrvDirtyBitmap {
>                                      and this bitmap must remain unchanged while
>                                      this flag is set. */
>       bool persistent;            /* bitmap must be saved to owner disk image */
> -    bool migration;             /* Bitmap is selected for migration, it should
> -                                   not be stored on the next inactivation
> -                                   (persistent flag doesn't matter until next
> -                                   invalidation).*/
>       QLIST_ENTRY(BdrvDirtyBitmap) list;
>   };
>
> @@ -740,24 +736,16 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap, bool persistent)
>       qemu_mutex_unlock(bitmap->mutex);
>   }
>
> -/* Called with BQL taken. */
> -void bdrv_dirty_bitmap_set_migration(BdrvDirtyBitmap *bitmap, bool
> migration)
> -{
> -    qemu_mutex_lock(bitmap->mutex);
> -    bitmap->migration = migration;
> -    qemu_mutex_unlock(bitmap->mutex);
> -}
> -
>   bool bdrv_dirty_bitmap_get_persistance(BdrvDirtyBitmap *bitmap)
>   {
> -    return bitmap->persistent && !bitmap->migration;
> +    return bitmap->persistent;
>   }
>
>   bool bdrv_has_changed_persistent_bitmaps(BlockDriverState *bs)
>   {
>       BdrvDirtyBitmap *bm;
>       QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
> -        if (bm->persistent && !bm->readonly && !bm->migration) {
> +        if (bm->persistent && !bm->readonly) {
>               return true;
>           }
>       }
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index ae4e88354a..428a86303d 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -295,6 +295,12 @@ static int init_dirty_bitmap_migration(void)
>                   continue;
>               }
>
> +            /* Skip persistant bitmaps, unless it's a block migration: */
> +            if (bdrv_dirty_bitmap_get_persistance(bitmap) &&
> +                !migrate_use_block()) {
> +                continue;
> +            }
> +
>               if (drive_name == NULL) {
>                   error_report("Found bitmap '%s' in unnamed node %p. It can't "
>                                "be migrated", bdrv_dirty_bitmap_name(bitmap), bs);
> @@ -341,11 +347,6 @@ static int init_dirty_bitmap_migration(void)
>           }
>       }
>
> -    /* unset migration flags here, to not roll back it */
> -    QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) {
> -        bdrv_dirty_bitmap_set_migration(dbms->bitmap, true);
> -    }
> -
>       if (QSIMPLEQ_EMPTY(&dirty_bitmap_mig_state.dbms_list)) {
>           dirty_bitmap_mig_state.no_bitmaps = true;
>       }
> diff --git a/tests/qemu-iotests/169 b/tests/qemu-iotests/169
> index 69850c4c67..eaff111480 100755
> --- a/tests/qemu-iotests/169
> +++ b/tests/qemu-iotests/169
> @@ -105,8 +105,7 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
>                   break
>
>           # test that bitmap is still here
> -        removed = (not migrate_bitmaps) and persistent
> -        self.check_bitmap(self.vm_a, False if removed else sha256)
> +        self.check_bitmap(self.vm_a, False if persistent else sha256)
>
>           self.vm_a.qmp('cont')
>
> @@ -142,6 +141,8 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
>           mig_caps = [{'capability': 'events', 'state': True}]
>           if migrate_bitmaps:
>               mig_caps.append({'capability': 'dirty-bitmaps', 'state': True})
> +        if not shared_storage:
> +            mig_caps.append({'capability': 'block', 'state': True})
>
>           result = self.vm_a.qmp('migrate-set-capabilities',
>                                  capabilities=mig_caps)
> @@ -191,6 +192,8 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
>               # possible error
>               log = self.vm_b.get_log()
>               log = re.sub(r'^\[I \d+\.\d+\] OPENED\n', '', log)
> +            log = re.sub(r'Receiving block device images\n', '', log)
> +            log = re.sub(r'Completed \d+ %\r?\n?', '', log)
>               log = re.sub(r'\[I \+\d+\.\d+\] CLOSED\n?$', '', log)
>               self.assertEqual(log, '')
John Snow Aug. 1, 2018, 5:34 p.m. UTC | #3
On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
> * John Snow (jsnow@redhat.com) wrote:
> 
> <snip>
> 
>> I'd rather do something like this:
>> - Always flush bitmaps to disk on inactivate.
> 
> Does that increase the time taken by the inactivate measurably?
> If it's small relative to everything else that's fine; it's just I
> always worry a little since I think this happens after we've stopped the
> CPU on the source, so is part of the 'downtime'.
> 
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

I'm worried that if we don't, we're leaving unusable, partially complete
files behind us. That's a bad design and we shouldn't push for it just
because it's theoretically faster.

--js
Dr. David Alan Gilbert Aug. 1, 2018, 5:40 p.m. UTC | #4
* John Snow (jsnow@redhat.com) wrote:
> 
> 
> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
> > * John Snow (jsnow@redhat.com) wrote:
> > 
> > <snip>
> > 
> >> I'd rather do something like this:
> >> - Always flush bitmaps to disk on inactivate.
> > 
> > Does that increase the time taken by the inactivate measurably?
> > If it's small relative to everything else that's fine; it's just I
> > always worry a little since I think this happens after we've stopped the
> > CPU on the source, so is part of the 'downtime'.
> > 
> > Dave
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> I'm worried that if we don't, we're leaving behind unusable, partially
> complete files behind us. That's a bad design and we shouldn't push for
> it just because it's theoretically faster.

Oh I don't care about theoretical speed; but if it's actually unusably
slow in practice then it needs fixing.

Dave

> --js
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Denis V. Lunev Aug. 1, 2018, 6:42 p.m. UTC | #5
On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
> * John Snow (jsnow@redhat.com) wrote:
>>
>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>> * John Snow (jsnow@redhat.com) wrote:
>>>
>>> <snip>
>>>
>>>> I'd rather do something like this:
>>>> - Always flush bitmaps to disk on inactivate.
>>> Does that increase the time taken by the inactivate measurably?
>>> If it's small relative to everything else that's fine; it's just I
>>> always worry a little since I think this happens after we've stopped the
>>> CPU on the source, so is part of the 'downtime'.
>>>
>>> Dave
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>> I'm worried that if we don't, we're leaving behind unusable, partially
>> complete files behind us. That's a bad design and we shouldn't push for
>> it just because it's theoretically faster.
> Oh I don't care about theoretical speed; but if it's actually unusably
> slow in practice then it needs fixing.
>
> Dave

This is not "theoretical" speed. This is real practical speed and
instability.
EACH IO operation can be performed unpredictably slow and thus with
IO operations in mind you can not even calculate or predict downtime,
which should be done according to the migration protocol.

That is why we have very specifically (for the purpose) improved
migration protocol to migrate CBT via postcopy method, which
does not influence downtime.

That is why we strictly opposes any CBT writing operation in migration
code. It should also be noted, that CBT can be calculated for all discs,
including raw but could be written for QCOW2 only. With external CBT storage
for such discs the situation during migration would become even worse.

Den
Dr. David Alan Gilbert Aug. 1, 2018, 6:55 p.m. UTC | #6
* Denis V. Lunev (den@openvz.org) wrote:
> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
> > * John Snow (jsnow@redhat.com) wrote:
> >>
> >> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
> >>> * John Snow (jsnow@redhat.com) wrote:
> >>>
> >>> <snip>
> >>>
> >>>> I'd rather do something like this:
> >>>> - Always flush bitmaps to disk on inactivate.
> >>> Does that increase the time taken by the inactivate measurably?
> >>> If it's small relative to everything else that's fine; it's just I
> >>> always worry a little since I think this happens after we've stopped the
> >>> CPU on the source, so is part of the 'downtime'.
> >>>
> >>> Dave
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> >> I'm worried that if we don't, we're leaving behind unusable, partially
> >> complete files behind us. That's a bad design and we shouldn't push for
> >> it just because it's theoretically faster.
> > Oh I don't care about theoretical speed; but if it's actually unusably
> > slow in practice then it needs fixing.
> >
> > Dave
> 
> This is not "theoretical" speed. This is real practical speed and
> instability.
> EACH IO operation can be performed unpredictably slow and thus with
> IO operations in mind you can not even calculate or predict downtime,
> which should be done according to the migration protocol.

We end up doing some IO anyway, even ignoring these new bitmaps: at the
end of the migration, when we pause the CPU, we do a bdrv_inactivate_all
to flush any outstanding writes, so we've already got that unpredictable
slowness.

So, not being a block person, but with some interest in making sure
downtime doesn't increase, I just wanted to understand whether the
amount of writes we're talking about here is comparable to that
which already exists or a lot smaller or a lot larger.
If the amount of IO you're talking about is much smaller than what
we typically already do, then John has a point and you may as well
do the write.
If the amount of IO for the bitmap is much larger and would slow
the downtime a lot then you've got a point and that would be unworkable.

Dave

> That is why we have very specifically (for the purpose) improved
> migration protocol to migrate CBT via postcopy method, which
> does not influence downtime.
> 
> That is why we strictly opposes any CBT writing operation in migration
> code. It should also be noted, that CBT can be calculated for all discs,
> including raw but could be written for QCOW2 only. With external CBT storage
> for such discs the situation during migration would become even worse.

> Den
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
John Snow Aug. 1, 2018, 6:56 p.m. UTC | #7
On 08/01/2018 02:42 PM, Denis V. Lunev wrote:
> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>> * John Snow (jsnow@redhat.com) wrote:
>>>
>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>
>>>> <snip>
>>>>
>>>>> I'd rather do something like this:
>>>>> - Always flush bitmaps to disk on inactivate.
>>>> Does that increase the time taken by the inactivate measurably?
>>>> If it's small relative to everything else that's fine; it's just I
>>>> always worry a little since I think this happens after we've stopped the
>>>> CPU on the source, so is part of the 'downtime'.
>>>>
>>>> Dave
>>>> --
>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>> complete files behind us. That's a bad design and we shouldn't push for
>>> it just because it's theoretically faster.
>> Oh I don't care about theoretical speed; but if it's actually unusably
>> slow in practice then it needs fixing.
>>
>> Dave
> 
> This is not "theoretical" speed. This is real practical speed and
> instability.

It's theoretical until I see performance testing numbers; do you have
any? How much faster does the pivot happen by avoiding making the qcow2
consistent on close?

I don't argue that it's faster to simply not write data, but what's not
obvious is how much time that actually saves in practice, and whether
it's worth doing unintuitive and undocumented things like "the source
file loses bitmaps after a migration because it was faster."

> EACH IO operation can be performed unpredictably slow and thus with
> IO operations in mind you can not even calculate or predict downtime,
> which should be done according to the migration protocol.
> 
> That is why we have very specifically (for the purpose) improved
> migration protocol to migrate CBT via postcopy method, which
> does not influence downtime.
> 
> That is why we strictly opposes any CBT writing operation in migration
> code. It should also be noted, that CBT can be calculated for all discs,
> including raw but could be written for QCOW2 only. With external CBT storage
> for such discs the situation during migration would become even worse.
> 
> Den
>
Denis V. Lunev Aug. 1, 2018, 8:25 p.m. UTC | #8
On 08/01/2018 09:55 PM, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>> * John Snow (jsnow@redhat.com) wrote:
>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>> I'd rather do something like this:
>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>> Does that increase the time taken by the inactivate measurably?
>>>>> If it's small relative to everything else that's fine; it's just I
>>>>> always worry a little since I think this happens after we've stopped the
>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>
>>>>> Dave
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>
>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>> it just because it's theoretically faster.
>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>> slow in practice then it needs fixing.
>>>
>>> Dave
>> This is not "theoretical" speed. This is real practical speed and
>> instability.
>> EACH IO operation can be performed unpredictably slow and thus with
>> IO operations in mind you can not even calculate or predict downtime,
>> which should be done according to the migration protocol.
> We end up doing some IO anyway, even ignoring these new bitmaps,
> at the end of the migration when we pause the CPU, we do a
> bdrv_inactivate_all to flush any outstanding writes; so we've already
> got that unpredictable slowness.
>
> So, not being a block person, but with some interest in making sure
> downtime doesn't increase, I just wanted to understand whether the
> amount of writes we're talking about here is comparable to that
> which already exists or a lot smaller or a lot larger.
> If the amount of IO you're talking about is much smaller than what
> we typically already do, then John has a point and you may as well
> do the write.
> If the amount of IO for the bitmap is much larger and would slow
> the downtime a lot then you've got a point and that would be unworkable.
>
> Dave
This is not a theoretical difference.

For a 1 TB drive and 64 KB bitmap granularity the size of the bitmap is
2 MB plus some metadata (64 KB). Thus we will have to write 2 MB of data
per bitmap. In some cases there are 2, 3 or 5 bitmaps, so we will have
10 MB of data. With a 16 TB drive the amount of data to write is
multiplied by 16, which gives 160 MB to write. More disks and bigger
sizes mean more data to write.

The above amount should be multiplied by 2: x MB written on the source
and x MB read on the target, which gives 320 MB in total.

That is why this is not good - we have linear increase with the
size and amount of disks.
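
To spell the arithmetic out, a back-of-the-envelope sketch only, assuming
one dirty bit per 64 kb cluster and ignoring the ~64 Kb of metadata:

# Back-of-the-envelope sketch of the sizes above: one dirty bit per
# 64 KiB cluster; the ~64 KiB of bitmap metadata is ignored.
KiB, MiB, TiB = 1024, 1024 ** 2, 1024 ** 4

def bitmap_bytes(disk_bytes, granularity=64 * KiB):
    clusters = (disk_bytes + granularity - 1) // granularity
    return (clusters + 7) // 8          # one bit per cluster, in bytes

print(bitmap_bytes(1 * TiB) // MiB)             # 2   -> 2 Mb per bitmap, 1 Tb disk
print(5 * bitmap_bytes(1 * TiB) // MiB)         # 10  -> 5 bitmaps
print(5 * bitmap_bytes(16 * TiB) // MiB)        # 160 -> 5 bitmaps, 16 Tb disk
print(2 * 5 * bitmap_bytes(16 * TiB) // MiB)    # 320 -> written on source + read on target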

There are also some thoughts on normal guest IO. Theoretically
we could replay the IO on the target and close the file
immediately, or block writes to the changed areas and notify the
target upon IO completion, or invent other fancy dances.
At least we are thinking about these optimizations for the regular
migration paths right now.

The problem is that such things are not needed for CBT right now,
but they will become necessary, and they would be rendered pretty
much useless by introducing this stuff.

Den
Denis V. Lunev Aug. 1, 2018, 8:31 p.m. UTC | #9
On 08/01/2018 09:56 PM, John Snow wrote:
>
> On 08/01/2018 02:42 PM, Denis V. Lunev wrote:
>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>> * John Snow (jsnow@redhat.com) wrote:
>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>> I'd rather do something like this:
>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>> Does that increase the time taken by the inactivate measurably?
>>>>> If it's small relative to everything else that's fine; it's just I
>>>>> always worry a little since I think this happens after we've stopped the
>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>
>>>>> Dave
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>
>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>> it just because it's theoretically faster.
>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>> slow in practice then it needs fixing.
>>>
>>> Dave
>> This is not "theoretical" speed. This is real practical speed and
>> instability.
> It's theoretical until I see performance testing numbers; do you have
> any? How much faster does the pivot happen by avoiding making the qcow2
> consistent on close?
>
> I don't argue that it's faster to just simply not write data, but what's
> not obvious is how much time it actually saves in practice and if that's
> worth doing unintuitive and undocumented things like "the source file
> loses bitmaps after a migration because it was faster."
John,

please see my letter to Dave. Speaking of theoretical things, we can
avoid waiting for any guest IO at all; at least we have started
research in that direction.

With this code merged we'll be doing IO where we could avoid it
completely. That is why this approach should not be taken.

You should know, as we discussed originally, that technically we can
lose the CBT completely and it is only a one-time problem. The data
will not be lost; you will just have to run a full backup once. Thus
there is no need to create something much slower and more complex
when there are faster ways.

Den
Denis V. Lunev Aug. 1, 2018, 8:47 p.m. UTC | #10
On 08/01/2018 09:56 PM, John Snow wrote:
>
> On 08/01/2018 02:42 PM, Denis V. Lunev wrote:
>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>> * John Snow (jsnow@redhat.com) wrote:
>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>> I'd rather do something like this:
>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>> Does that increase the time taken by the inactivate measurably?
>>>>> If it's small relative to everything else that's fine; it's just I
>>>>> always worry a little since I think this happens after we've stopped the
>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>
>>>>> Dave
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>
>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>> it just because it's theoretically faster.
>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>> slow in practice then it needs fixing.
>>>
>>> Dave
>> This is not "theoretical" speed. This is real practical speed and
>> instability.
> It's theoretical until I see performance testing numbers; do you have
> any? How much faster does the pivot happen by avoiding making the qcow2
> consistent on close?
>
> I don't argue that it's faster to just simply not write data, but what's
> not obvious is how much time it actually saves in practice and if that's
> worth doing unintuitive and undocumented things like "the source file
> loses bitmaps after a migration because it was faster."

Also, frankly speaking, I do not understand the goal of this purism.

There are 2 main cases - shared and non-shared storage. On shared
storage:
- normally migration finishes successfully. The source is shut down,
  the target is started. The CBT data in the file on shared storage is
  __IMMEDIATELY__ marked as stale on the target, i.e. you save CBT
  on the source (with IO over a networked fs), load CBT on the target
  (with IO over a networked fs), mark CBT as stale (IO again). The CBT
  data that was just written is immediately marked as lost.
- failed migration. OK, we have CBT data written on the source, CBT
  data read on the source, CBT data marked stale. Thus any CBT on
  disk while the VM is running is pure overhead.

The same applies when we use non-shared migration. In this
case the situation is even worse: you save the CBT and throw it away
once the migration completes.

Please also note that CBT saving gives almost no protection against
power losses, as the power would have to be lost at a very specific
moment for the data to survive, and most likely we would have to
drop the CBT completely anyway.

Den

Normally migration is executed as follows:
- source is gently shutdowned, all da
John Snow Aug. 1, 2018, 10:28 p.m. UTC | #11
On 08/01/2018 04:47 PM, Denis V. Lunev wrote:
> On 08/01/2018 09:56 PM, John Snow wrote:
>>
>> On 08/01/2018 02:42 PM, Denis V. Lunev wrote:
>>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>> I'd rather do something like this:
>>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>>> Does that increase the time taken by the inactivate measurably?
>>>>>> If it's small relative to everything else that's fine; it's just I
>>>>>> always worry a little since I think this happens after we've stopped the
>>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>>
>>>>>> Dave
>>>>>> --
>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>
>>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>>> it just because it's theoretically faster.
>>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>>> slow in practice then it needs fixing.
>>>>
>>>> Dave
>>> This is not "theoretical" speed. This is real practical speed and
>>> instability.
>> It's theoretical until I see performance testing numbers; do you have
>> any? How much faster does the pivot happen by avoiding making the qcow2
>> consistent on close?
>>
>> I don't argue that it's faster to just simply not write data, but what's
>> not obvious is how much time it actually saves in practice and if that's
>> worth doing unintuitive and undocumented things like "the source file
>> loses bitmaps after a migration because it was faster."
> 
> Also, frankly speaking, I do not understand the goal of this purism.
> 

The goal of my series originally was just to limit some corner cases. At
the time it was not evident that avoiding a flush was a *goal* of that
series rather than a *side-effect* or a means to an end (avoiding
migrating a bitmap over two different channels).

It was not immediately obvious to me that intentionally leaving behind
partially flushed qcow2 files was expected behavior. I still think it's
probably not the best behavior in general, but it's also not really
catastrophic either. If you had benchmarks it'd be useful to show an
obvious benefit to doing something unconventional.

In this case, I *do* consider not writing metadata back out to disk on
close something "unconventional."

Clearly my series missed an important case, so it can't be used
at all, and the status quo is also broken for several cases and
cannot be used either. With your performance concerns in mind, I'm
looking at Vladimir's series again. It might just require some more
concise comments explaining why you're taking the exact approach that
you are.

--js

> There 2 main cases - shared and non-shared storage. On shared
> storage:
> - normally migration is finished successfully. Source is shut down,
>   target is started. The data in the file on shared storage would be
>   __IMMEDIATELY__ marked as stale on target, i.e. you will save CBT
>  on source (with IO over networked fs), load CBT on target (with IO
>  over networked FS), mark CBT as stale (IO again). CBT data written
>  is marked as lost
> - failed migration. OK, we have CBT data written on source, CBT
>   data read on source, CBT data marked stale. Thus any CBT on
>   disk while VM is running is pure overhead.
> 
> The same situation is when we use non-shared migration. In this
> case the situation is even worse. You save CBT and put it to trash
> upon migration complete.
> 
> Please also note, that CBT saving almost does not protect us
> from powerlosses as the power should be lost at the very
> specific moment to allow data to survive and most likely
> we will have to drop CBT completely.
> 
> Den
>
Dr. David Alan Gilbert Aug. 2, 2018, 9:29 a.m. UTC | #12
* Denis V. Lunev (den@openvz.org) wrote:
> On 08/01/2018 09:55 PM, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (den@openvz.org) wrote:
> >> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
> >>> * John Snow (jsnow@redhat.com) wrote:
> >>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
> >>>>> * John Snow (jsnow@redhat.com) wrote:
> >>>>>
> >>>>> <snip>
> >>>>>
> >>>>>> I'd rather do something like this:
> >>>>>> - Always flush bitmaps to disk on inactivate.
> >>>>> Does that increase the time taken by the inactivate measurably?
> >>>>> If it's small relative to everything else that's fine; it's just I
> >>>>> always worry a little since I think this happens after we've stopped the
> >>>>> CPU on the source, so is part of the 'downtime'.
> >>>>>
> >>>>> Dave
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>
> >>>> I'm worried that if we don't, we're leaving behind unusable, partially
> >>>> complete files behind us. That's a bad design and we shouldn't push for
> >>>> it just because it's theoretically faster.
> >>> Oh I don't care about theoretical speed; but if it's actually unusably
> >>> slow in practice then it needs fixing.
> >>>
> >>> Dave
> >> This is not "theoretical" speed. This is real practical speed and
> >> instability.
> >> EACH IO operation can be performed unpredictably slow and thus with
> >> IO operations in mind you can not even calculate or predict downtime,
> >> which should be done according to the migration protocol.
> > We end up doing some IO anyway, even ignoring these new bitmaps,
> > at the end of the migration when we pause the CPU, we do a
> > bdrv_inactivate_all to flush any outstanding writes; so we've already
> > got that unpredictable slowness.
> >
> > So, not being a block person, but with some interest in making sure
> > downtime doesn't increase, I just wanted to understand whether the
> > amount of writes we're talking about here is comparable to that
> > which already exists or a lot smaller or a lot larger.
> > If the amount of IO you're talking about is much smaller than what
> > we typically already do, then John has a point and you may as well
> > do the write.
> > If the amount of IO for the bitmap is much larger and would slow
> > the downtime a lot then you've got a point and that would be unworkable.
> >
> > Dave
> This is not theoretical difference.
> 
> For 1 Tb drive and 64 kb bitmap granularity the size of bitmap is
> 2 Mb + some metadata (64 Kb). Thus we will have to write
> 2 Mb of data per bitmap.

OK, this was about my starting point; I think your Mb here is Byte not
Bit; so assuming a drive of 200MByte/s, that's 2/200 = 1/100th of a
second = 10ms; now 10ms I'd say is small enough not to worry about downtime
increases, since the number we normally hope for is in the 300ms-ish
range.
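
Spelling that estimate out (a sketch only; the 200MByte/s drive and the
bitmap sizes are the assumptions above, not measurements):

# Rough downtime cost of flushing bitmaps on inactivate, using the
# assumed 200 MByte/s drive above (an assumption, not a measurement).
MiB = 1024 ** 2

def flush_ms(bitmap_bytes, n_bitmaps, write_bps=200 * MiB):
    return 1000.0 * bitmap_bytes * n_bitmaps / write_bps

print(flush_ms(2 * MiB, 1))     # ~10 ms  : one bitmap, 1 Tb disk
print(flush_ms(2 * MiB, 5))     # ~50 ms  : five bitmaps
print(flush_ms(32 * MiB, 5))    # ~800 ms : five bitmaps, 16 Tb disk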

> For some case there are 2-3-5 bitmaps
> this we will have 10 Mb of data.

OK, remembering I'm not a block person, can you just explain why
you need 5 bitmaps?
But with 5 bitmaps that's 50ms, which is starting to get worrying.

> With 16 Tb drive the amount of
> data to write will be multiplied by 16 which gives 160 Mb to
> write. More disks and bigger the size - more data to write.

Yeh and that's going on for a second and way too big.

(Although that feels like you could fix it by adding bitmaps on your
bitmaps hierarchically so you didn't write them all; but that's
getting way more complex).

> Above amount should be multiplied by 2 - x Mb to be written
> on source, x Mb to be read on target which gives 320 Mb to
> write.
> 
> That is why this is not good - we have linear increase with the
> size and amount of disks.
> 
> There is also some thoughts on normal guest IO. Theoretically
> we can think on replaying IO on the target closing the file
> immediately or block writes to changed areas and notify
> target upon IO completion or invent other fancy dances.
> At least we think right now on these optimizations for regular
> migration paths.
> 
> The problem right that such things are not needed now for CBT
> but will become necessary and pretty much useless upon
> introducing this stuff.

I don't quite understand the last two paragraphs.

However, coming back to my question; it was really saying that
normal guest IO during the end of the migration will cause
a delay; I'm expecting that to be fairly unrelated to the size
of the disk; more to do with workload; so I guess in your case
the worry is the case of big large disks giving big large
bitmaps.

Dave

> Den
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Denis V. Lunev Aug. 2, 2018, 9:38 a.m. UTC | #13
On 08/02/2018 12:29 PM, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>> On 08/01/2018 09:55 PM, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (den@openvz.org) wrote:
>>>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>> I'd rather do something like this:
>>>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>>>> Does that increase the time taken by the inactivate measurably?
>>>>>>> If it's small relative to everything else that's fine; it's just I
>>>>>>> always worry a little since I think this happens after we've stopped the
>>>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>>>
>>>>>>> Dave
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>
>>>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>>>> it just because it's theoretically faster.
>>>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>>>> slow in practice then it needs fixing.
>>>>>
>>>>> Dave
>>>> This is not "theoretical" speed. This is real practical speed and
>>>> instability.
>>>> EACH IO operation can be performed unpredictably slow and thus with
>>>> IO operations in mind you can not even calculate or predict downtime,
>>>> which should be done according to the migration protocol.
>>> We end up doing some IO anyway, even ignoring these new bitmaps,
>>> at the end of the migration when we pause the CPU, we do a
>>> bdrv_inactivate_all to flush any outstanding writes; so we've already
>>> got that unpredictable slowness.
>>>
>>> So, not being a block person, but with some interest in making sure
>>> downtime doesn't increase, I just wanted to understand whether the
>>> amount of writes we're talking about here is comparable to that
>>> which already exists or a lot smaller or a lot larger.
>>> If the amount of IO you're talking about is much smaller than what
>>> we typically already do, then John has a point and you may as well
>>> do the write.
>>> If the amount of IO for the bitmap is much larger and would slow
>>> the downtime a lot then you've got a point and that would be unworkable.
>>>
>>> Dave
>> This is not theoretical difference.
>>
>> For 1 Tb drive and 64 kb bitmap granularity the size of bitmap is
>> 2 Mb + some metadata (64 Kb). Thus we will have to write
>> 2 Mb of data per bitmap.
> OK, this was about my starting point; I think your Mb here is Byte not
> Bit; so assuming a drive of 200MByte/s, that's 200/2=1/100th of a
> second = 10ms; now 10ms I'd say is small enough not to worry about downtime
> increases, since the number we normally hope for is in the 300ms ish
> range.
>
>> For some case there are 2-3-5 bitmaps
>> this we will have 10 Mb of data.
> OK, remembering I'm not a block person can you just explain why
> you need 5 bitmaps?
> But with 5 bitmaps that's 50ms, that's starting to get worrying.
>
>> With 16 Tb drive the amount of
>> data to write will be multiplied by 16 which gives 160 Mb to
>> write. More disks and bigger the size - more data to write.
> Yeh and that's going on for a second and way too big.
>
> (Although that feels like you could fix it by adding bitmaps on your
> bitmaps hierarchically so you didn't write them all; but that's
> getting way more complex).
>
>> Above amount should be multiplied by 2 - x Mb to be written
>> on source, x Mb to be read on target which gives 320 Mb to
>> write.
>>
>> That is why this is not good - we have linear increase with the
>> size and amount of disks.
>>
>> There is also some thoughts on normal guest IO. Theoretically
>> we can think on replaying IO on the target closing the file
>> immediately or block writes to changed areas and notify
>> target upon IO completion or invent other fancy dances.
>> At least we think right now on these optimizations for regular
>> migration paths.
>>
>> The problem right that such things are not needed now for CBT
>> but will become necessary and pretty much useless upon
>> introducing this stuff.
> I don't quite understand the last two paragraphs.
We are thinking right now about how to eliminate the delay on regular IO
during migration. There are some thoughts and internal work in
progress. That is why I am worried.

> However, coming back to my question; it was really saying that
> normal guest IO during the end of the migration will cause
> a delay; I'm expecting that to be fairly unrelated to the size
> of the disk; more to do with workload; so I guess in your case
> the worry is the case of big large disks giving big large
> bitmaps.

exactly!

Den
Dr. David Alan Gilbert Aug. 2, 2018, 9:50 a.m. UTC | #14
* Denis V. Lunev (den@openvz.org) wrote:


> > I don't quite understand the last two paragraphs.
> we are thinking right now to eliminate delay on regular IO
> for migration. There is some thoughts and internal work in
> progress. That is why I am worrying.

What downtime are you typically seeing and what are you aiming for?

It would be good if you could explain what you're planning to
fix there so we can get a feel for it nearer the start of it
rather than at the end of the reviewing!

Dave

> 
> > However, coming back to my question; it was really saying that
> > normal guest IO during the end of the migration will cause
> > a delay; I'm expecting that to be fairly unrelated to the size
> > of the disk; more to do with workload; so I guess in your case
> > the worry is the case of big large disks giving big large
> > bitmaps.
> 
> exactly!
> 
> Den
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Vladimir Sementsov-Ogievskiy Aug. 2, 2018, 10:23 a.m. UTC | #15
02.08.2018 01:28, John Snow wrote:
>
> On 08/01/2018 04:47 PM, Denis V. Lunev wrote:
>> On 08/01/2018 09:56 PM, John Snow wrote:
>>> On 08/01/2018 02:42 PM, Denis V. Lunev wrote:
>>>> On 08/01/2018 08:40 PM, Dr. David Alan Gilbert wrote:
>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>> On 08/01/2018 06:20 AM, Dr. David Alan Gilbert wrote:
>>>>>>> * John Snow (jsnow@redhat.com) wrote:
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>> I'd rather do something like this:
>>>>>>>> - Always flush bitmaps to disk on inactivate.
>>>>>>> Does that increase the time taken by the inactivate measurably?
>>>>>>> If it's small relative to everything else that's fine; it's just I
>>>>>>> always worry a little since I think this happens after we've stopped the
>>>>>>> CPU on the source, so is part of the 'downtime'.
>>>>>>>
>>>>>>> Dave
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>
>>>>>> I'm worried that if we don't, we're leaving behind unusable, partially
>>>>>> complete files behind us. That's a bad design and we shouldn't push for
>>>>>> it just because it's theoretically faster.
>>>>> Oh I don't care about theoretical speed; but if it's actually unusably
>>>>> slow in practice then it needs fixing.
>>>>>
>>>>> Dave
>>>> This is not "theoretical" speed. This is real practical speed and
>>>> instability.
>>> It's theoretical until I see performance testing numbers; do you have
>>> any? How much faster does the pivot happen by avoiding making the qcow2
>>> consistent on close?
>>>
>>> I don't argue that it's faster to just simply not write data, but what's
>>> not obvious is how much time it actually saves in practice and if that's
>>> worth doing unintuitive and undocumented things like "the source file
>>> loses bitmaps after a migration because it was faster."
>> Also, frankly speaking, I do not understand the goal of this purism.
>>
> The goal of my series originally was just to limit some corner cases. At
> the time it was not evident that avoiding a flush was a *goal* of that
> series rather than a *side-effect* or a means to an end (avoiding
> migrating a bitmap over two different channels).

It's not a goal and it's not a side-effect. The flush is already avoided
currently, in the upstream code, and your series is a degradation there.
The goal is different: to make the whole thing more obvious and to make
it possible to restore bitmap persistence after a failed migration.

>
> It was not immediately obvious to me that intentionally leaving behind
> partially flushed qcow2 files was expected behavior. I still think it's
> probably not the best behavior in general, but it's also not really
> catastrophic either. If you had benchmarks it'd be useful to show an
> obvious benefit to doing something unconventional.
>
> In this case, I *do* consider not writing metadata back out to disk on
> close something "unconventional."
>
> Clearly my series missed missed an important case, so it can't be used
> at all, and the status quo is also broken for several cases and also
> cannot be used. With your performance concerns in mind, I'm looking at
> Vladimir's series again. It might just require some more concise
> comments explaining why you're taking the exact approach that you are.
>
> --js
>
>> There 2 main cases - shared and non-shared storage. On shared
>> storage:
>> - normally migration is finished successfully. Source is shut down,
>>    target is started. The data in the file on shared storage would be
>>    __IMMEDIATELY__ marked as stale on target, i.e. you will save CBT
>>   on source (with IO over networked fs), load CBT on target (with IO
>>   over networked FS), mark CBT as stale (IO again). CBT data written
>>   is marked as lost
>> - failed migration. OK, we have CBT data written on source, CBT
>>    data read on source, CBT data marked stale. Thus any CBT on
>>    disk while VM is running is pure overhead.
>>
>> The same situation is when we use non-shared migration. In this
>> case the situation is even worse. You save CBT and put it to trash
>> upon migration complete.
>>
>> Please also note, that CBT saving almost does not protect us
>> from powerlosses as the power should be lost at the very
>> specific moment to allow data to survive and most likely
>> we will have to drop CBT completely.
>>
>> Den
>>
Denis V. Lunev Aug. 2, 2018, 7:05 p.m. UTC | #16
On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>
>
>>> I don't quite understand the last two paragraphs.
>> we are thinking right now to eliminate delay on regular IO
>> for migration. There is some thoughts and internal work in
>> progress. That is why I am worrying.
> What downtime are you typicaly seeing and what are you aiming for?
>
> It would be good if you could explain what you're planning to
> fix there so we can get a feel for it nearer the start of it
> rather than at the end of the reviewing!
>
> Dave
The ultimate goal is to reliably reach 100 ms with ongoing IO, and
you are perfectly correct about reviewing :)

Though the problem is that right now we are just trying to
invent something suitable :(

Den

>>> However, coming back to my question; it was really saying that
>>> normal guest IO during the end of the migration will cause
>>> a delay; I'm expecting that to be fairly unrelated to the size
>>> of the disk; more to do with workload; so I guess in your case
>>> the worry is the case of big large disks giving big large
>>> bitmaps.
>> exactly!
>>
>> Den
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
John Snow Aug. 2, 2018, 7:10 p.m. UTC | #17
On 08/02/2018 03:05 PM, Denis V. Lunev wrote:
> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
>> * Denis V. Lunev (den@openvz.org) wrote:
>>
>>
>>>> I don't quite understand the last two paragraphs.
>>> we are thinking right now to eliminate delay on regular IO
>>> for migration. There is some thoughts and internal work in
>>> progress. That is why I am worrying.
>> What downtime are you typicaly seeing and what are you aiming for?
>>
>> It would be good if you could explain what you're planning to
>> fix there so we can get a feel for it nearer the start of it
>> rather than at the end of the reviewing!
>>
>> Dave
> The ultimate goal is to reliable reach 100 ms with ongoing IO and
> you are perfectly correct about reviewing :)
> 
> Though the problem is that right now we are just trying to
> invent something suitable :(
> 
> Den
> 

Yeah, I gotcha -- please give me just a little time. I'm not against
using Vlad's series, it just surprised me. David and I have talked about
what you're trying to do and we are in agreement that it is the right
thing to do for now.

I have some more questions about Vlad's series, but I want to test it a
bit myself before I waste time asking them. I'll stage a working version
as soon as I can.

--js
Dr. David Alan Gilbert Aug. 3, 2018, 8:33 a.m. UTC | #18
* Denis V. Lunev (den@openvz.org) wrote:
> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (den@openvz.org) wrote:
> >
> >
> >>> I don't quite understand the last two paragraphs.
> >> we are thinking right now to eliminate delay on regular IO
> >> for migration. There is some thoughts and internal work in
> >> progress. That is why I am worrying.
> > What downtime are you typicaly seeing and what are you aiming for?
> >
> > It would be good if you could explain what you're planning to
> > fix there so we can get a feel for it nearer the start of it
> > rather than at the end of the reviewing!
> >
> > Dave
> The ultimate goal is to reliable reach 100 ms with ongoing IO and
> you are perfectly correct about reviewing :)

That would be neat.

> Though the problem is that right now we are just trying to
> invent something suitable :(

OK, some brain-storm level ideas:

  a) Throttle the write bandwidth at later stages of migration
     (I think that's been suggested before)
  b) Switch to some active-sync like behaviour where the writes
     are sent over the network as they happen to the destination
     (mreitz has some prototype code for that type of behaviour
     for postcopy)
  c) Write the writes into a buffer that gets migrated over the
    migration stream to get committed on the destination side.

As I say, brainstorm level ideas only!
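
To make (a) a bit more concrete, a toy sketch only; none of these names
are QEMU APIs, just an illustration of throttling writes in the final
phase:

# Toy token-bucket limiter illustrating (a): throttle guest write
# bandwidth during the final phase of migration. Purely illustrative,
# not QEMU code; assumes individual writes fit within one second's budget.
import time

class WriteThrottle:
    def __init__(self, bytes_per_sec):
        self.rate = float(bytes_per_sec)
        self.tokens = self.rate          # allow up to one second of burst
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        # Refill the bucket from elapsed time, then sleep until the
        # write fits into the budget.
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# e.g. cap guest writes at 50 MByte/s while migration converges:
#   throttle = WriteThrottle(50 * 1024 * 1024)
#   throttle.wait_for(len(buf)); do_write(buf)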

Dave


> Den
> 
> >>> However, coming back to my question; it was really saying that
> >>> normal guest IO during the end of the migration will cause
> >>> a delay; I'm expecting that to be fairly unrelated to the size
> >>> of the disk; more to do with workload; so I guess in your case
> >>> the worry is the case of big large disks giving big large
> >>> bitmaps.
> >> exactly!
> >>
> >> Den
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Vladimir Sementsov-Ogievskiy Aug. 3, 2018, 8:44 a.m. UTC | #19
03.08.2018 11:33, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (den@openvz.org) wrote:
>>>
>>>
>>>>> I don't quite understand the last two paragraphs.
>>>> we are thinking right now to eliminate delay on regular IO
>>>> for migration. There is some thoughts and internal work in
>>>> progress. That is why I am worrying.
>>> What downtime are you typicaly seeing and what are you aiming for?
>>>
>>> It would be good if you could explain what you're planning to
>>> fix there so we can get a feel for it nearer the start of it
>>> rather than at the end of the reviewing!
>>>
>>> Dave
>> The ultimate goal is to reliable reach 100 ms with ongoing IO and
>> you are perfectly correct about reviewing :)
> That would be neat.
>
>> Though the problem is that right now we are just trying to
>> invent something suitable :(
> OK, some brain-storm level ideas:
>
>    a) Throttle the write bandwidth at later stages of migration
>       (I think that's been suggested before)
>    b) Switch to some active-sync like behaviour where the writes
>       are sent over the network as they happen to the destination
>       (mreitz has some prototype code for that type of behaviour
>       for postcopy)
>    c) Write the writes into a buffer that gets migrated over the
>      migration stream to get committed on the destination side.
>
> As I say, brainstorm level ideas only!
>
> Dave

Are you talking about bitmaps migration? With the current approach
(postcopy) there should be no problem with downtime, as the data can be
sent after the target VM starts.

>
>> Den
>>
>>>>> However, coming back to my question; it was really saying that
>>>>> normal guest IO during the end of the migration will cause
>>>>> a delay; I'm expecting that to be fairly unrelated to the size
>>>>> of the disk; more to do with workload; so I guess in your case
>>>>> the worry is the case of big large disks giving big large
>>>>> bitmaps.
>>>> exactly!
>>>>
>>>> Den
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Aug. 3, 2018, 8:49 a.m. UTC | #20
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> 03.08.2018 11:33, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (den@openvz.org) wrote:
> > > On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
> > > > * Denis V. Lunev (den@openvz.org) wrote:
> > > > 
> > > > 
> > > > > > I don't quite understand the last two paragraphs.
> > > > > we are thinking right now to eliminate delay on regular IO
> > > > > for migration. There is some thoughts and internal work in
> > > > > progress. That is why I am worrying.
> > > > What downtime are you typicaly seeing and what are you aiming for?
> > > > 
> > > > It would be good if you could explain what you're planning to
> > > > fix there so we can get a feel for it nearer the start of it
> > > > rather than at the end of the reviewing!
> > > > 
> > > > Dave
> > > The ultimate goal is to reliable reach 100 ms with ongoing IO and
> > > you are perfectly correct about reviewing :)
> > That would be neat.
> > 
> > > Though the problem is that right now we are just trying to
> > > invent something suitable :(
> > OK, some brain-storm level ideas:
> > 
> >    a) Throttle the write bandwidth at later stages of migration
> >       (I think that's been suggested before)
> >    b) Switch to some active-sync like behaviour where the writes
> >       are sent over the network as they happen to the destination
> >       (mreitz has some prototype code for that type of behaviour
> >       for postcopy)
> >    c) Write the writes into a buffer that gets migrated over the
> >      migration stream to get committed on the destination side.
> > 
> > As I say, brainstorm level ideas only!
> > 
> > Dave
> 
> Do you say about bitmaps migration? With current approach (postcopy) there
> should not be problems with downtime, as data can be sent after target vm
> start

The ideas above were meant to suggest ways to reduce the downtime due to
*write data* not bitmaps.

Dave

> > 
> > > Den
> > > 
> > > > > > However, coming back to my question; it was really saying that
> > > > > > normal guest IO during the end of the migration will cause
> > > > > > a delay; I'm expecting that to be fairly unrelated to the size
> > > > > > of the disk; more to do with workload; so I guess in your case
> > > > > > the worry is the case of big large disks giving big large
> > > > > > bitmaps.
> > > > > exactly!
> > > > > 
> > > > > Den
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> -- 
> Best regards,
> Vladimir
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Denis V. Lunev Aug. 3, 2018, 8:59 a.m. UTC | #21
On 08/03/2018 11:33 AM, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (den@openvz.org) wrote:
>>>
>>>
>>>>> I don't quite understand the last two paragraphs.
>>>> we are thinking right now to eliminate delay on regular IO
>>>> for migration. There is some thoughts and internal work in
>>>> progress. That is why I am worrying.
>>> What downtime are you typicaly seeing and what are you aiming for?
>>>
>>> It would be good if you could explain what you're planning to
>>> fix there so we can get a feel for it nearer the start of it
>>> rather than at the end of the reviewing!
>>>
>>> Dave
>> The ultimate goal is to reliable reach 100 ms with ongoing IO and
>> you are perfectly correct about reviewing :)
> That would be neat.
>
>> Though the problem is that right now we are just trying to
>> invent something suitable :(
> OK, some brain-storm level ideas:
>
>   a) Throttle the write bandwidth at later stages of migration
>      (I think that's been suggested before)
yes

>   b) Switch to some active-sync like behaviour where the writes
>      are sent over the network as they happen to the destination
>      (mreitz has some prototype code for that type of behaviour
>      for postcopy)
This will not work. Even with the sync mirror (which we submitted
2 years ago but which was not accepted), the usual downtime will be
around 1 sec due to requests in flight.

>   c) Write the writes into a buffer that gets migrated over the
>     migration stream to get committed on the destination side.
Yes, this is an option, but the buffer can become too big.

For shared disk migration the options are the following:
- writes without metadata updates could simply be allowed to finish;
  there is no need to wait for them, but their completions should be
  reported to the destination
- metadata updates are not allowed; they should be transferred to the
  destination

For non-shared disk migration we do not need to wait for local IO, we
just need to restart it on the target. Alternatively those areas could
be marked as blocked for IO and re-synced once the writes complete.

These are raw ideas, which should be improved and tweaked.

Den

> As I say, brainstorm level ideas only!
>
> Dave
>
>
>> Den
>>
>>>>> However, coming back to my question; it was really saying that
>>>>> normal guest IO during the end of the migration will cause
>>>>> a delay; I'm expecting that to be fairly unrelated to the size
>>>>> of the disk; more to do with workload; so I guess in your case
>>>>> the worry is the case of big large disks giving big large
>>>>> bitmaps.
>>>> exactly!
>>>>
>>>> Den
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Aug. 3, 2018, 9:10 a.m. UTC | #22
* Denis V. Lunev (den@openvz.org) wrote:
> On 08/03/2018 11:33 AM, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (den@openvz.org) wrote:
> >> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
> >>> * Denis V. Lunev (den@openvz.org) wrote:
> >>>
> >>>
> >>>>> I don't quite understand the last two paragraphs.
> >>>> we are thinking right now to eliminate delay on regular IO
> >>>> for migration. There is some thoughts and internal work in
> >>>> progress. That is why I am worrying.
> >>> What downtime are you typicaly seeing and what are you aiming for?
> >>>
> >>> It would be good if you could explain what you're planning to
> >>> fix there so we can get a feel for it nearer the start of it
> >>> rather than at the end of the reviewing!
> >>>
> >>> Dave
> >> The ultimate goal is to reliable reach 100 ms with ongoing IO and
> >> you are perfectly correct about reviewing :)
> > That would be neat.
> >
> >> Though the problem is that right now we are just trying to
> >> invent something suitable :(
> > OK, some brain-storm level ideas:
> >
> >   a) Throttle the write bandwidth at later stages of migration
> >      (I think that's been suggested before)
> yes
> 
> >   b) Switch to some active-sync like behaviour where the writes
> >      are sent over the network as they happen to the destination
> >      (mreitz has some prototype code for that type of behaviour
> >      for postcopy)
> will not work. Even with the sync mirror (which we have submitted
> 2 years ago, but not accepted), usual downtime will be around 1 sec
> due to requests in flight.

I'm confused why it would be that high; are you saying that would be
longer than the downtime with the current writes near the end of
migration on the source?

> >   c) Write the writes into a buffer that gets migrated over the
> >     migration stream to get committed on the destination side.
> yes. this is an option but the buffer can be too big.

Combined with some throttling you should be able to bound the size;
especially since it should be done only for the very last part of the
migration.

> For the shared disk migration the options are the following:
> - without metadata updates writes could be just passed to finish, there
>   is no need to wait. But completions should be reported to destination
> - metadata updates are not allowed, they should be transferred to the
>   destination
> 
> For non-shared disk migration we do not need to wait local IO, we just need
> to restart them on target. Alternatively these areas could be marked as
> blocked for IO and re-sync again once writes are completed.
> 
> These are raw ideas, which should be improved and tweaked.

Nod; it needs some thinking about.

Dave

> Den
> 
> > As I say, brainstorm level ideas only!
> >
> > Dave
> >
> >
> >> Den
> >>
> >>>>> However, coming back to my question; it was really saying that
> >>>>> normal guest IO during the end of the migration will cause
> >>>>> a delay; I'm expecting that to be fairly unrelated to the size
> >>>>> of the disk; more to do with workload; so I guess in your case
> >>>>> the worry is the case of big large disks giving big large
> >>>>> bitmaps.
> >>>> exactly!
> >>>>
> >>>> Den
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Vladimir Sementsov-Ogievskiy Sept. 17, 2018, 3:51 p.m. UTC | #23
Hmm, ping, anybody here ?)

02.08.2018 22:10, John Snow wrote:
>
> On 08/02/2018 03:05 PM, Denis V. Lunev wrote:
>> On 08/02/2018 12:50 PM, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (den@openvz.org) wrote:
>>>
>>>
>>>>> I don't quite understand the last two paragraphs.
>>>> we are thinking right now to eliminate delay on regular IO
>>>> for migration. There is some thoughts and internal work in
>>>> progress. That is why I am worrying.
>>> What downtime are you typicaly seeing and what are you aiming for?
>>>
>>> It would be good if you could explain what you're planning to
>>> fix there so we can get a feel for it nearer the start of it
>>> rather than at the end of the reviewing!
>>>
>>> Dave
>> The ultimate goal is to reliable reach 100 ms with ongoing IO and
>> you are perfectly correct about reviewing :)
>>
>> Though the problem is that right now we are just trying to
>> invent something suitable :(
>>
>> Den
>>
> Yeah, I gotcha -- please give me just a little time. I'm not against
> using Vlad's series, it just surprised me. David and I have talked about
> what you're trying to do and we are in agreement that it is the right
> thing to do for now.
>
> I have some more questions about Vlad's series, but I want to test it a
> bit myself before I waste time asking them. I'll stage a working version
> as soon as I can.
>
> --js
John Snow Sept. 17, 2018, 6:33 p.m. UTC | #24
On 09/17/2018 11:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> Hmm, ping, anybody here ?)
> 

Was preparing to stage on Friday, working on it now.

I never understood why you forbid the transfer of read only bitmaps
though, can you point that out for me?

--js
Vladimir Sementsov-Ogievskiy Sept. 18, 2018, 10:37 a.m. UTC | #25
17.09.2018 21:33, John Snow wrote:
>
> On 09/17/2018 11:51 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Hmm, ping, anybody here ?)
>>
> Was preparing to stage on Friday, working on it now.
>
> I never understood why you forbid the transfer of read only bitmaps
> though, can you point that out for me?
>
> --js

A readonly bitmap means it is not marked in_use in the qcow2 file.

Hm, I don't remember the exact reason, but at least for shared migration
it is unsafe to migrate a readonly bitmap without special handling: we
migrate it, it becomes not-readonly, it changes in RAM, then (assume)
storing fails and we are left with an old version of the bitmap in the
qcow2 file which is not marked in_use. So it would have to be handled
carefully somehow; the simplest option is just to forbid it.
Vladimir Sementsov-Ogievskiy Oct. 15, 2018, 9:42 a.m. UTC | #26
ping

18.09.2018 13:37, Vladimir Sementsov-Ogievskiy wrote:
> 17.09.2018 21:33, John Snow wrote:
>>
>> On 09/17/2018 11:51 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Hmm, ping, anybody here ?)
>>>
>> Was preparing to stage on Friday, working on it now.
>>
>> I never understood why you forbid the transfer of read only bitmaps
>> though, can you point that out for me?
>>
>> --js
>
> readonly bitmap - means it is not marked in_use in qcow2 file.
>
> Hm, I don't remember exact reason, but at least for shared migration 
> it is unsafe to migrate readonly bitmap without special handling: we 
> migrate it, it becomes not-readonly, it changes in RAM, then (assume) 
> storing failed and we have old version ofthe bitmap in qcow2, which is 
> not marked in_use.. So it should be somehow carefully handled, the 
> simplest option is just to forbid it.
>
Vladimir Sementsov-Ogievskiy Oct. 29, 2018, 5:52 p.m. UTC | #27
ping2

Hi, what about this, finally? It looks like everyone sees a lot of
comments and thinks that there should be a new version, but actually
the remaining discussion is mostly unrelated and/or the questions have
been answered. Only the rephrasing by Eric in patch 2 could be applied,
to change something.
The series fixes a bug in bitmaps migration - look at patch 5.

18.09.2018 13:37, Vladimir Sementsov-Ogievskiy wrote:
> 17.09.2018 21:33, John Snow wrote:
>>
>> On 09/17/2018 11:51 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Hmm, ping, anybody here ?)
>>>
>> Was preparing to stage on Friday, working on it now.
>>
>> I never understood why you forbid the transfer of read only bitmaps
>> though, can you point that out for me?
>>
>> --js
>
> readonly bitmap - means it is not marked in_use in qcow2 file.
>
> Hm, I don't remember exact reason, but at least for shared migration 
> it is unsafe to migrate readonly bitmap without special handling: we 
> migrate it, it becomes not-readonly, it changes in RAM, then (assume) 
> storing failed and we have old version ofthe bitmap in qcow2, which is 
> not marked in_use.. So it should be somehow carefully handled, the 
> simplest option is just to forbid it.
>
John Snow Oct. 29, 2018, 6:06 p.m. UTC | #28
On 10/29/2018 01:52 PM, Vladimir Sementsov-Ogievskiy wrote:
> ping2
> 
> Hi, what about this finally? Looks like everyone see a lot of comments 
> and thing, that there should be a new version, but actually, there is 
> almost unrelated discussion and/or answered questions. Only rephrasing 
> by Eric in patch 2 may be applied, to change something.
> The series fix the bug in bitmaps migration, look at patch 5.
> 

Yeah, I'm not sure of the comments but we can patch them up later. I've
been sick for the last two weeks and then I just got back from KVM
Forum. Running some tests now and will send a PR out.

We'll change what we need to for the RC.

--js
diff mbox series

Patch

diff --git a/block.c b/block.c
index 5f1a133417..f16bf106d1 100644
--- a/block.c
+++ b/block.c
@@ -4334,7 +4334,6 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
     uint64_t perm, shared_perm;
     Error *local_err = NULL;
     int ret;
-    BdrvDirtyBitmap *bm;

     if (!bs->drv)  {
         return;
@@ -4384,12 +4383,6 @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
         }
     }

-    for (bm = bdrv_dirty_bitmap_next(bs, NULL); bm;
-         bm = bdrv_dirty_bitmap_next(bs, bm))
-    {
-        bdrv_dirty_bitmap_set_migration(bm, false);
-    }
-
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         bs->open_flags |= BDRV_O_INACTIVE;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index a13c3bdcfa..e44061fe2e 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -55,10 +55,6 @@  struct BdrvDirtyBitmap {
                                   and this bitmap must remain unchanged while
                                   this flag is set. */
     bool persistent;            /* bitmap must be saved to owner disk image */
-    bool migration;             /* Bitmap is selected for migration, it should
-                                   not be stored on the next inactivation
-                                   (persistent flag doesn't matter until next
-                                   invalidation).*/
     QLIST_ENTRY(BdrvDirtyBitmap) list;
 };

@@ -740,24 +736,16 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap, bool persistent)
     qemu_mutex_unlock(bitmap->mutex);
 }

-/* Called with BQL taken. */
-void bdrv_dirty_bitmap_set_migration(BdrvDirtyBitmap *bitmap, bool migration)
-{
-    qemu_mutex_lock(bitmap->mutex);
-    bitmap->migration = migration;
-    qemu_mutex_unlock(bitmap->mutex);
-}
-
 bool bdrv_dirty_bitmap_get_persistance(BdrvDirtyBitmap *bitmap)
 {
-    return bitmap->persistent && !bitmap->migration;
+    return bitmap->persistent;
 }

 bool bdrv_has_changed_persistent_bitmaps(BlockDriverState *bs)
 {
     BdrvDirtyBitmap *bm;
     QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
-        if (bm->persistent && !bm->readonly && !bm->migration) {
+        if (bm->persistent && !bm->readonly) {
             return true;
         }
     }
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index ae4e88354a..428a86303d 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -295,6 +295,12 @@  static int init_dirty_bitmap_migration(void)
                 continue;
             }

+            /* Skip persistant bitmaps, unless it's a block migration: */
+            if (bdrv_dirty_bitmap_get_persistance(bitmap) &&
+                !migrate_use_block()) {
+                continue;
+            }
+
             if (drive_name == NULL) {
                 error_report("Found bitmap '%s' in unnamed node %p. It can't "
                              "be migrated", bdrv_dirty_bitmap_name(bitmap), bs);
@@ -341,11 +347,6 @@  static int init_dirty_bitmap_migration(void)
         }
     }

-    /* unset migration flags here, to not roll back it */
-    QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) {
-        bdrv_dirty_bitmap_set_migration(dbms->bitmap, true);
-    }
-
     if (QSIMPLEQ_EMPTY(&dirty_bitmap_mig_state.dbms_list)) {
         dirty_bitmap_mig_state.no_bitmaps = true;
     }
diff --git a/tests/qemu-iotests/169 b/tests/qemu-iotests/169
index 69850c4c67..eaff111480 100755
--- a/tests/qemu-iotests/169
+++ b/tests/qemu-iotests/169
@@ -105,8 +105,7 @@  class TestDirtyBitmapMigration(iotests.QMPTestCase):
                 break

         # test that bitmap is still here
-        removed = (not migrate_bitmaps) and persistent
-        self.check_bitmap(self.vm_a, False if removed else sha256)
+        self.check_bitmap(self.vm_a, False if persistent else sha256)

         self.vm_a.qmp('cont')

@@ -142,6 +141,8 @@  class TestDirtyBitmapMigration(iotests.QMPTestCase):
         mig_caps = [{'capability': 'events', 'state': True}]
         if migrate_bitmaps:
             mig_caps.append({'capability': 'dirty-bitmaps', 'state': True})
+        if not shared_storage:
+            mig_caps.append({'capability': 'block', 'state': True})

         result = self.vm_a.qmp('migrate-set-capabilities',
                                capabilities=mig_caps)
@@ -191,6 +192,8 @@  class TestDirtyBitmapMigration(iotests.QMPTestCase):
             # possible error
             log = self.vm_b.get_log()
             log = re.sub(r'^\[I \d+\.\d+\] OPENED\n', '', log)
+            log = re.sub(r'Receiving block device images\n', '', log)
+            log = re.sub(r'Completed \d+ %\r?\n?', '', log)
             log = re.sub(r'\[I \+\d+\.\d+\] CLOSED\n?$', '', log)
             self.assertEqual(log, '')