[v4,00/18] bitmaps: introduce 'bitmap' sync mode

Message ID 20190709232550.10724-1-jsnow@redhat.com

Message

John Snow July 9, 2019, 11:25 p.m. UTC
This series adds a new "BITMAP" sync mode that is meant to replace the
existing "INCREMENTAL" sync mode.

This mode's behavior can be modified by specifying one of three bitmap sync
modes, passed as an argument to the job.

The three bitmap sync modes are:
- ON-SUCCESS: This is an alias for the old incremental mode. The bitmap is
              conditionally synchronized based on the return code of the job
              upon completion.
- NEVER: This is, effectively, the differential backup mode. It never clears
         the bitmap, as the name suggests.
- ALWAYS: Here is the new, exciting thing. The bitmap is always synchronized,
          even on failure. On success, this is identical to incremental, but
          on failure it clears only the bits that were copied successfully.
          This can be used to "resume" incremental backups from later points
          in time.
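
As a quick illustration, a job using the new mode might be issued over QMP
roughly like this; the device, bitmap, and target names here are only
examples, and "bitmap0" must already exist on the source node (e.g. created
earlier with block-dirty-bitmap-add):

  { "execute": "drive-backup",
    "arguments": {
        "job-id": "backup0",
        "device": "drive0",
        "target": "inc.0.qcow2",
        "format": "qcow2",
        "sync": "bitmap",
        "bitmap": "bitmap0",
        "bitmap-mode": "on-success" } }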

I wrote this series by accident on my way to implement incremental mode
for mirror, but this happened first -- the problem is that Mirror mode
uses its existing modes in a very particular way; and this was the best
way to add bitmap support into the mirror job properly.

Summary:
- 01-03: refactor blockdev-backup and drive-backup to share more interface code
- 04-05: add the new 'bitmap' sync mode with sync policy 'on-success',
         which is functionally identical to 'incremental' sync mode.
- 06:    add sync policy 'never' ("Differential" backups.)
- 07-11: rework some merging code to facilitate patch 12;
- 12:    add sync policy 'always' ("Resumable" backups)
- 13-16: test infrastructure changes to support patch 17;
- 17:    new iotest!
- 18:    minor policy loosening as a QOL improvement

Future work:
 - Update bitmaps.rst to explain these. (WIP, it's hard, sorry!)
 - Add these modes to Mirror. (Done*, but needs tests.)
 - Allow the use of bitmaps and bitmap sync modes with non-BITMAP modes;
   This will allow for resumable/re-tryable full backups.

===
V4:
===

[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/18:[----] [--] 'qapi/block-core: Introduce BackupCommon'
002/18:[----] [--] 'drive-backup: create do_backup_common'
003/18:[----] [--] 'blockdev-backup: utilize do_backup_common'
004/18:[----] [--] 'qapi: add BitmapSyncMode enum'
005/18:[----] [--] 'block/backup: Add mirror sync mode 'bitmap''
006/18:[----] [--] 'block/backup: add 'never' policy to bitmap sync mode'
007/18:[----] [--] 'hbitmap: Fix merge when b is empty, and result is not an alias of a'
008/18:[----] [--] 'hbitmap: enable merging across granularities'
009/18:[0004] [FC] 'block/dirty-bitmap: add bdrv_dirty_bitmap_merge_internal'
010/18:[----] [--] 'block/dirty-bitmap: add bdrv_dirty_bitmap_get'
011/18:[0008] [FC] 'block/backup: upgrade copy_bitmap to BdrvDirtyBitmap'
012/18:[----] [--] 'block/backup: add 'always' bitmap sync policy'
013/18:[----] [--] 'iotests: add testing shim for script-style python tests'
014/18:[----] [--] 'iotests: teach run_job to cancel pending jobs'
015/18:[----] [--] 'iotests: teach FilePath to produce multiple paths'
016/18:[----] [--] 'iotests: Add virtio-scsi device helper'
017/18:[0063] [FC] 'iotests: add test 257 for bitmap-mode backups'
018/18:[----] [--] 'block/backup: loosen restriction on readonly bitmaps'

Changes:
009: Added assertions.
011: Moved copy bitmap to source node.
017: Rework get_bitmap to tolerate multiple anonymous bitmaps
     Update test output to accommodate the same.

===
V3:
===

Changes:
001: Made suggested doc fixes.
     Changed 'since' to 4.2.
002: Added bds and aio_context to backup_common
     Removed accidental extraneous unref on target_bs
     Removed local_err propagation
003: Fallout from #002; hoist aio_context acquisition up into do_blockdev_backup
004: 'conditional' --> 'on-success'
005: Rediscover the lost stanza that ensures a bitmap mode was given
     Fallout from 2, 3, 4.
006: Block comment fix for patchew
     Fallout from #4
009: Fix assert() style issue. Why'd they let a macro be lowercase like that?
     Probably to make specifically my life difficult.
010: Fix style issue {
011: Fix long lines
     rename "bs" --> "target_bs" where appropriate
     Free copy_bitmap from the right node
012: Multiline comment changes for patchew
     Fallout from #4
015: Fix long line for patchew
     Reinstate that second newline that Max likes
017: Fallout from #4.

===
V2:
===

Changes:
004: Fixed typo
     Change @conditional docstring
005: Moved desugaring code into blockdev.c, facilitated by patches 1-3.
006: Change @never docstring slightly.
007: Merge will clear the target bitmap when both component bitmaps are empty,
     and the target bitmap is not an alias of either component bitmap.
008: Check orig_size (logical size) instead of size (actual size) to enable
         cross-granularity merging.
     Fix the sparse merge itself, based on the block/backup code.
     Clear the target bitmap before cross-granularity merge.
     Assert that the actual size is also equal when logical size and
         granularities are equal.
---: Dropped bdrv_dirty_bitmap_claim.
012: Rewrote the cleanup logic to hopefully be clearer.
     Use merge instead of the dropped reclaim.
015: Changed docstring
     Factored out filename pattern generation.
017: Fix mkpattern indent.
     Use integer division!!!
     Add parentheses to boolean assignment
     Change test run ordering; update output to reflect this
     Use virtio-scsi-ccw when appropriate
     Update test output to reflect new test running order
018: Fallout from patches 1-3; restrictions only need to be loosened in one
         place instead of two.

Changes not made:
- Allowing 'cancel' to skip synchronization on cancel:
  Decided against it, opting for consistency. The user asked for a sync,
  and it's simpler logistically to execute on that desire.
  Use the new mode carefully, please!

John Snow (18):
  qapi/block-core: Introduce BackupCommon
  drive-backup: create do_backup_common
  blockdev-backup: utilize do_backup_common
  qapi: add BitmapSyncMode enum
  block/backup: Add mirror sync mode 'bitmap'
  block/backup: add 'never' policy to bitmap sync mode
  hbitmap: Fix merge when b is empty, and result is not an alias of a
  hbitmap: enable merging across granularities
  block/dirty-bitmap: add bdrv_dirty_bitmap_merge_internal
  block/dirty-bitmap: add bdrv_dirty_bitmap_get
  block/backup: upgrade copy_bitmap to BdrvDirtyBitmap
  block/backup: add 'always' bitmap sync policy
  iotests: add testing shim for script-style python tests
  iotests: teach run_job to cancel pending jobs
  iotests: teach FilePath to produce multiple paths
  iotests: Add virtio-scsi device helper
  iotests: add test 257 for bitmap-mode backups
  block/backup: loosen restriction on readonly bitmaps

 block/backup.c                |  135 +-
 block/dirty-bitmap.c          |   73 +-
 block/mirror.c                |    8 +-
 block/replication.c           |    2 +-
 blockdev.c                    |  208 ++-
 include/block/block_int.h     |    7 +-
 include/block/dirty-bitmap.h  |    4 +-
 migration/block.c             |    5 +-
 nbd/server.c                  |    2 +-
 qapi/block-core.json          |  138 +-
 tests/qemu-iotests/040        |    6 +-
 tests/qemu-iotests/093        |    6 +-
 tests/qemu-iotests/139        |    7 +-
 tests/qemu-iotests/238        |    5 +-
 tests/qemu-iotests/257        |  416 ++++++
 tests/qemu-iotests/257.out    | 2247 +++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group      |    1 +
 tests/qemu-iotests/iotests.py |  100 +-
 util/hbitmap.c                |   49 +-
 19 files changed, 3107 insertions(+), 312 deletions(-)
 create mode 100755 tests/qemu-iotests/257
 create mode 100644 tests/qemu-iotests/257.out

Comments

Max Reitz July 10, 2019, 2:48 p.m. UTC | #1
On 10.07.19 01:25, John Snow wrote:
> This series adds a new "BITMAP" sync mode that is meant to replace the
> existing "INCREMENTAL" sync mode.

So who’s going to take it? :-)

get_maintainer.pl says I’m responsible for 14/18 patches, and you are
for 9/18.  But you did prefix the title with “bitmaps”, which
MAINTAINERS clearly states is your territory. ;-)

Max
John Snow July 10, 2019, 5:21 p.m. UTC | #2
On 7/10/19 10:48 AM, Max Reitz wrote:
> On 10.07.19 01:25, John Snow wrote:
>> This series adds a new "BITMAP" sync mode that is meant to replace the
>> existing "INCREMENTAL" sync mode.
> 
> So who’s going to take it? :-)
> 
> get_maintainer.pl says I’m responsible for 14/18 patches, and you are
> for 9/18.  But you did prefix the title with “bitmaps”, which
> MAINTAINERS clearly states is your territory. ;-)
> 
> Max
> 

I'll do it, since there's more to come and it'll be easier for me to
stage it all in one giant chunk.

Thank you for your patience and reviews! This series has helped me
scratch a major itch that I've had ever since I implemented incremental
backups.

--js
John Snow July 15, 2019, 8 p.m. UTC | #3
On 7/9/19 7:25 PM, John Snow wrote:
> This series adds a new "BITMAP" sync mode that is meant to replace the
> existing "INCREMENTAL" sync mode.
>
> [...]

Thanks, applied to my bitmaps tree:

https://github.com/jnsnow/qemu/commits/bitmaps
https://github.com/jnsnow/qemu.git

--js
Fabian Grünbichler July 22, 2019, 12:17 p.m. UTC | #4
On Tue, Jul 09, 2019 at 07:25:32PM -0400, John Snow wrote:
> This series adds a new "BITMAP" sync mode that is meant to replace the
> existing "INCREMENTAL" sync mode.
> 
> This mode's behavior can be modified by specifying one of three bitmap sync
> modes, passed as an argument to the job.
> 
> The three bitmap sync modes are:
> - ON-SUCCESS: This is an alias for the old incremental mode. The bitmap is
>               conditionally synchronized based on the return code of the job
>               upon completion.
> - NEVER: This is, effectively, the differential backup mode. It never clears
>          the bitmap, as the name suggests.
> - ALWAYS: Here is the new, exciting thing. The bitmap is always synchronized,
>           even on failure. On success, this is identical to incremental, but
>           on failure it clears only the bits that were copied successfully.
>           This can be used to "resume" incremental backups from later points
>           in time.
> 
> I wrote this series by accident on my way to implement incremental mode
> for mirror, but this happened first -- the problem is that Mirror mode
> uses its existing modes in a very particular way; and this was the best
> way to add bitmap support into the mirror job properly.
> 
> [...]
> 
> Future work:
> [..]
>  - Add these modes to Mirror. (Done*, but needs tests.)

are these mirror patches available somewhere for testing in combination
with this series? your bitmaps branch does not seem to contain them ;)

we've been experimenting with Ma Haocong's patch (v4 from February) to add
"incremental"/differential sync to drive-mirror recently with positive
results so far, and this sounds like it is another attempt at getting
this properly integrated into Qemu.
John Snow July 22, 2019, 5:21 p.m. UTC | #5
On 7/22/19 8:17 AM, Fabian Grünbichler wrote:
> On Tue, Jul 09, 2019 at 07:25:32PM -0400, John Snow wrote:
>> This series adds a new "BITMAP" sync mode that is meant to replace the
>> existing "INCREMENTAL" sync mode.
>>
>> [...]
>>
>> Future work:
>> [..]
>>  - Add these modes to Mirror. (Done*, but needs tests.)
> 
> are these mirror patches available somewhere for testing in combination
> with this series? your bitmaps branch does not seem to contain them ;)
> 
> we've been experimenting with Ma Haocong's patch (v4 from February) to add
> "incremental"/differential sync to drive-mirror recently with positive
> results so far, and this sounds like it is another attempt at getting
> this properly integrated into Qemu.
> 

Not available quite yet; I added it in fairly hastily but haven't done
the testing I want to do yet, so I wouldn't feel comfortable sharing it
before I do my own due diligence on it. Give me a chance to polish it so
that the testing effort isn't wasted :)

Can you share some of your use-cases for how you are using the
"incremental mirror" so far? It might be useful for the patch
justification if I can point to production use cases. (And good for
allocating time, too.)

--js
Fabian Grünbichler July 23, 2019, 9:47 a.m. UTC | #6
On Mon, Jul 22, 2019 at 01:21:02PM -0400, John Snow wrote:
> 
> 
> On 7/22/19 8:17 AM, Fabian Grünbichler wrote:
> > On Tue, Jul 09, 2019 at 07:25:32PM -0400, John Snow wrote:
> >> This series adds a new "BITMAP" sync mode that is meant to replace the
> >> existing "INCREMENTAL" sync mode.
> >>
> >> [...]
> >>
> >> Future work:
> >> [..]
> >>  - Add these modes to Mirror. (Done*, but needs tests.)
> > 
> > are these mirror patches available somewhere for testing in combination
> > with this series? your bitmaps branch does not seem to contain them ;)
> > 
> > we've been experimenting with Ma Haocong's patch (v4 from February) to add
> > "incremental"/differential sync to drive-mirror recently with positive
> > results so far, and this sounds like it is another attempt at getting
> > this properly integrated into Qemu.
> > 
> 
> Not available quite yet; I added it in fairly hastily but haven't done
> the testing I want to do yet, so I wouldn't feel comfortable sharing it
> before I do my own due diligence on it. Give me a chance to polish it so
> that the testing effort isn't wasted :)

fair enough, and no hurries :)

> 
> Can you share some of your use-cases for how you are using the
> "incremental mirror" so far? It might be useful for the patch
> justification if I can point to production use cases. (And good for
> allocating time, too.)

it's basically the same use case that the original "incremental mirror"
patch (series)[1] from two years ago had (no affiliation with the author
though) - we have a guest disk replication feature for ZFS/zvols in a
clustered hypervisor setting, and would like to re-use the already
replicated disk state when live-migrating a VM. Qemu does not know
anything about the replication, since it happens on the storage layer
with zfs send/zfs receive. note that for VMs, we use zvols which are
block devices backed by ZFS (or rather, ZFS datasets exposed as block
devices), minus the file system part of regular ZFS datasets. from
Qemu's PoV these (replicated) disks are just regular block devices (and not
image-backed disks on a filesystem, or accessed via some special
BlockDriver like Ceph's RBD images).

we currently support live migration
1) with disks on shared/distributed storage (easy enough)
2) with regular (non-replicated, local) disks (via nbd/drive-mirror)
3) with unused disks on the storage level (disks are not known to Qemu/the VM)

1-3 can be mixed and matched arbitrarily in one guest, e.g. with one
disk on a shared Ceph cluster, one disk that is not in use on an NFS
share, and another disk on a local LVM-thin pool. 2) and 3) also allow
switching the underlying storage on the fly, since they transfer the
full disk (content) anyway.

we also support offline migration with shared, local, unused and/or
replicated disks (all on the storage level with no involvement of Qemu).

as you can see there is a gap in the live-migration feature matrix: when
replication is used, you either have to poweroff the VM to re-use the
replication state (storage-only migration), or drop the replication
state and do a full local-disk live-migration before re-creating the
replication state from scratch (which is bad, since replication can have
multiple target hosts, and re-establishing the whole disk can take a
while if it's big).

our basic approach is (currently) the following:

1) get disk info
2) Qemu: add dirty bitmaps for currently used, replicated disks
3) storage/ZFS: do a regular replication of all replicated disks (used AND unused)
4) storage: do a regular storage migration of all regular unused local disks
5a) Qemu: do a regular drive-mirror of all currently used, local disks
5b) Qemu: do an incremental drive-mirror for all currently used, replicated disks
6) Qemu: wait for convergence of drive-mirror jobs
7) Qemu: do a regular live-migration of VM
8) Qemu: once converged and VM is suspended, complete drive-mirror jobs
9) Qemu: resume now fully migrated VM on target node
10) Qemu/storage: clean up on source node

5b) with bitmaps from 2) is what is currently missing on the Qemu side,
but seems easy enough to support (like I said, we are currently using Ma
Haocong's patch for testing, but want to get this feature upstream one
way or another instead of carrying our own, possibly incompatible in the
near-future version).
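
for reference, step 2) is plain QMP that already exists today, something like
the following (node and bitmap names made up):

  { "execute": "block-dirty-bitmap-add",
    "arguments": { "node": "drive-scsi0", "name": "repl-bitmap" } }

step 5b) would then just reference that bitmap by name in whatever the final
bitmap-aware drive-mirror interface ends up looking like.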

2) and 3) are obviously not atomic, so the bitmaps will contain some
writes that have been replicated already on the block/storage layer
below the VM, and those writes will be done a second time in step 5b).

we can work around this by adding another short down time by
freezing/suspending prior to 2) until after doing the ZFS snapshots at
the start of 3), in case these duplicate writes turn out to be
problematic after all. this downtime would be rather short, as the bulk
of the replication work (actually transferring the latest delta) can
happen after unfreezing/resuming the VM. so far we haven't encountered
any problems in our (albeit limited) testing though, so if possible we
would naturally like to avoid the additional downtime altogether ;)

looking forward to your patch(es) :)

1: <CAKVPjOZ8Y8U2zHgo_06aozrdd9_Cq6txWrX5F4HnFefAUjimyQ@mail.gmail.com>
and <20170504105444.8940-1-daniel.kucera@gmail.com>
John Snow July 23, 2019, 4:58 p.m. UTC | #7
On 7/23/19 5:47 AM, Fabian Grünbichler wrote:
> On Mon, Jul 22, 2019 at 01:21:02PM -0400, John Snow wrote:
>>
>>
>> On 7/22/19 8:17 AM, Fabian Grünbichler wrote:
>>> On Tue, Jul 09, 2019 at 07:25:32PM -0400, John Snow wrote:
>>>> This series adds a new "BITMAP" sync mode that is meant to replace the
>>>> existing "INCREMENTAL" sync mode.
>>>>
>>>> [...]
>>>>
>>>> Future work:
>>>> [..]
>>>>  - Add these modes to Mirror. (Done*, but needs tests.)
>>>
>>> are these mirror patches available somewhere for testing in combination
>>> with this series? your bitmaps branch does not seem to contain them ;)
>>>
>>> we've been experimenting with Ma Haocong's patch (v4 from February) to add
>>> "incremental"/differential sync to drive-mirror recently with positive
>>> results so far, and this sounds like it is another attempt at getting
>>> this properly integrated into Qemu.
>>>
>>
>> Not available quite yet; I added it in fairly hastily but haven't done
>> the testing I want to do yet, so I wouldn't feel comfortable sharing it
>> before I do my own due diligence on it. Give me a chance to polish it so
>> that the testing effort isn't wasted :)
> 
> fair enough, and no hurries :)
> 
>>
>> Can you share some of your use-cases for how you are using the
>> "incremental mirror" so far? It might be useful for the patch
>> justification if I can point to production use cases. (And good for
>> allocating time, too.)
> 
> it's basically the same use case that the original "incremental mirror"
> patch (series)[1] from two years ago had (no affiliation with the author
> though) - we have a guest disk replication feature for ZFS/zvols in a
> clustered hypervisor setting, and would like to re-use the already
> replicated disk state when live-migrating a VM. Qemu does not know
> anything about the replication, since it happens on the storage layer
> with zfs send/zfs receive. note that for VMs, we use zvols which are
> block devices backed by ZFS (or rather, ZFS datasets exposed as block
> devices), minus the file system part of regular ZFS datasets. from
> Qemu's PoV these (replicated) disks are just regular block devices (and not
> image-backed disks on a filesystem, or accessed via some special
> BlockDriver like Ceph's RBD images).
> 
> we currently support live migration
> 1) with disks on shared/distributed storage (easy enough)
> 2) with regular (non-replicated, local) disks (via nbd/drive-mirror)
> 3) with unused disks on the storage level (disks are not known to Qemu/the VM)
> 
> 1-3 can be mixed and matched arbitrarily in one guest, e.g. with one
> disk on a shared Ceph cluster, one disk that is not in use on an NFS
> share, and another disk on a local LVM-thin pool. 2) and 3) also allow
> switching the underlying storage on the fly, since they transfer the
> full disk (content) anyway.
> 
> we also support offline migration with shared, local, unused and/or
> replicated disks (all on the storage level with no involvement of Qemu).
> 
> as you can see there is a gap in the live-migration feature matrix: when
> replication is used, you either have to poweroff the VM to re-use the
> replication state (storage-only migration), or drop the replication
> state and do a full local-disk live-migration before re-creating the
> replication state from scratch (which is bad, since replication can have
> multiple target hosts, and re-establishing the whole disk can take a
> while if it's big).
> 
> our basic approach is (currently) the following:
> 
> 1) get disk info
> 2) Qemu: add dirty bitmaps for currently used, replicated disks
> 3) storage/ZFS: do a regular replication of all replicated disks (used AND unused)

I take it that the ZFS replication is not an ongoing process but
something that terminates, so you need QEMU to pick up the difference
that occurred during that time?

(Which I imagine the bitmap will pick up some writes that DO get
replicated, but copying some extra is safe.)

> 4) storage: do a regular storage migration of all regular unused local disks
> 5a) Qemu: do a regular drive-mirror of all currently used, local disks
> 5b) Qemu: do an incremental drive-mirror for all currently used, replicated disks

To mirror anything written since the replication started, based on this
timeline.

> 6) Qemu: wait for convergence of drive-mirror jobs
> 7) Qemu: do a regular live-migration of VM
> 8) Qemu: once converged and VM is suspended, complete drive-mirror jobs
> 9) Qemu: resume now fully migrated VM on target node
> 10) Qemu/storage: clean up on source node
> 
> 5b) with bitmaps from 2) is what is currently missing on the Qemu side,
> but seems easy enough to support (like I said, we are currently using Ma
> Haocong's patch for testing, but want to get this feature upstream one
> way or another instead of carrying our own, possibly incompatible in the
> near-future version).
> 

It will look VERY similar. Switching should be easy; the only difference
will be:

sync=BITMAP instead of sync=INCREMENTAL, and
bitmap_mode=NEVER provided explicitly to match Ma Haocong's patch behavior.

You can alternatively use the other bitmap policies depending on what
you want:

NEVER leaves the bitmap alone entirely like Ma Haocong's patch does. It
reflects a kind of "differential backup" intent; changes accumulate in
the bitmap if it was enabled.

ON-SUCCESS will reset any bits copied out if the job completes
successfully (note that this includes mirror cancellation after sync as
well as a COMPLETE instruction that includes the pivot.)

ALWAYS will reset any bits successfully copied out, regardless of the
final state of the job. You can use this one to resume the mirror on
failures.

You should be able to get the exact behavior you've already programmed
for, and maybe some new toys.
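
As a rough sketch only (nothing on the mirror side is merged yet, so the
argument names may still change, and the device/bitmap/target names here are
invented), the QMP invocation would look something like:

  { "execute": "drive-mirror",
    "arguments": {
        "job-id": "mirror0",
        "device": "drive0",
        "target": "nbd://destination:10809/drive0",
        "mode": "existing",
        "sync": "bitmap",
        "bitmap": "bitmap0",
        "bitmap-mode": "never" } }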

> 2) and 3) are obviously not atomic, so the bitmaps will contain some
> writes that have been replicated already on the block/storage layer
> below the VM, and those writes will be done a second time in step 5b).
> 
> we can work around this by adding another short down time by
> freezing/suspending prior to 2) until after doing the ZFS snapshots at
> the start of 3), in case these duplicate writes turn out to be
> problematic after all. this downtime would be rather short, as the bulk
> of the replication work (actually transferring the latest delta) can
> happen after unfreezing/resuming the VM. so far we haven't encountered
> any problems in our (albeit limited) testing though, so if possible we
> would naturally like to avoid the additional downtime altogether ;)
> 
> looking forward to your patch(es) :)
> 
> 1: <CAKVPjOZ8Y8U2zHgo_06aozrdd9_Cq6txWrX5F4HnFefAUjimyQ@mail.gmail.com>
> and <20170504105444.8940-1-daniel.kucera@gmail.com>
> 

Thanks for the writeup! My goal is to have this in for 4.2 alongside all
of the other bitmap changes I've queued so far.

--js
Fabian Grünbichler July 24, 2019, 8:41 a.m. UTC | #8
On Tue, Jul 23, 2019 at 12:58:10PM -0400, John Snow wrote:
> 
> 
> On 7/23/19 5:47 AM, Fabian Grünbichler wrote:
> > On Mon, Jul 22, 2019 at 01:21:02PM -0400, John Snow wrote:
> >>
> >>
> >> On 7/22/19 8:17 AM, Fabian Grünbichler wrote:
> >>> On Tue, Jul 09, 2019 at 07:25:32PM -0400, John Snow wrote:
> >>>> This series adds a new "BITMAP" sync mode that is meant to replace the
> >>>> existing "INCREMENTAL" sync mode.
> >>>>
> >>>> [...]
> >>>>
> >>>> Future work:
> >>>> [..]
> >>>>  - Add these modes to Mirror. (Done*, but needs tests.)
> >>>
> >>> are these mirror patches available somewhere for testing in combination
> >>> with this series? your bitmaps branch does not seem to contain them ;)
> >>>
> >>> we've been experimenting with Ma Haocong's patch (v4 from February) to add
> >>> "incremental"/differential sync to drive-mirror recently with positive
> >>> results so far, and this sounds like it is another attempt at getting
> >>> this properly integrated into Qemu.
> >>>
> >>
> >> Not available quite yet; I added it in fairly hastily but haven't done
> >> the testing I want to do yet, so I wouldn't feel comfortable sharing it
> >> before I do my own due diligence on it. Give me a chance to polish it so
> >> that the testing effort isn't wasted :)
> > 
> > fair enough, and no hurries :)
> > 
> >>
> >> Can you share some of your use-cases for how you are using the
> >> "incremental mirror" so far? It might be useful for the patch
> >> justification if I can point to production use cases. (And good for
> >> allocating time, too.)
> > 
> > it's basically the same use case that the original "incremental mirror"
> > patch (series)[1] from two years ago had (no affiliation with the author
> > though) - we have a guest disk replication feature for ZFS/zvols in a
> > clustered hypervisor setting, and would like to re-use the already
> > replicated disk state when live-migrating a VM. Qemu does not know
> > anything about the replication, since it happens on the storage layer
> > with zfs send/zfs receive. note that for VMs, we use zvols which are
> > block devices backed by ZFS (or rather, ZFS datasets exposed as block
> > devices), minus the file system part of regular ZFS datasets. from
> > Qemu's PoV these (replicated) disks are just regular block devices (and not
> > image-backed disks on a filesystem, or accessed via some special
> > BlockDriver like Ceph's RBD images).
> > 
> > we currently support live migration
> > 1) with disks on shared/distributed storage (easy enough)
> > 2) with regular (non-replicated, local) disks (via nbd/drive-mirror)
> > 3) with unused disks on the storage level (disks are not known to Qemu/the VM)
> > 
> > 1-3 can be mixed and matched arbitrarily in one guest, e.g. with one
> > disk on a shared Ceph cluster, one disk that is not in use on an NFS
> > share, and another disk on a local LVM-thin pool. 2) and 3) also allow
> > switching the underlying storage on the fly, since they transfer the
> > full disk (content) anyway.
> > 
> > we also support offline migration with shared, local, unused and/or
> > replicated disks (all on the storage level with no involvement of Qemu).
> > 
> > as you can see there is a gap in the live-migration feature matrix: when
> > replication is used, you either have to poweroff the VM to re-use the
> > replication state (storage-only migration), or drop the replication
> > state and do a full local-disk live-migration before re-creating the
> > replication state from scratch (which is bad, since replication can have
> > multiple target hosts, and re-establishing the whole disk can take a
> > while if it's big).
> > 
> > our basic approach is (currently) the following:
> > 
> > 1) get disk info
> > 2) Qemu: add dirty bitmaps for currently used, replicated disks
> > 3) storage/ZFS: do a regular replication of all replicated disks (used AND unused)
> 
> I take it that the ZFS replication is not an ongoing process but
> something that terminates, so you need QEMU to pick up the difference
> that occurred during that time?

yes exactly. the ZFS replication creates a new snapshot, and transfers
all intermediate deltas since the last successfully replicated snapshot.

> 
> (Which I imagine the bitmap will pick up some writes that DO get
> replicated, but copying some extra is safe.)

yep, see the note regarding this in my last mail ;)

> 
> > 4) storage: do a regular storage migration of all regular unused local disks
> > 5a) Qemu: do a regular drive-mirror of all currently used, local disks
> > 5b) Qemu: do an incremental drive-mirror for all currently used, replicated disks
> 
> To mirror anything written since the replication started, based on this
> timeline.

yes. or rather, to mirror anything written since shortly before the
replication started, which means to mirror anything written since the
(now) last replication snapshot, plus some extra writes that happened
right before and are included in that snapshot.

> 
> > 6) Qemu: wait for convergence of drive-mirror jobs
> > 7) Qemu: do a regular live-migration of VM
> > 8) Qemu: once converged and VM is suspended, complete drive-mirror jobs
> > 9) Qemu: resume now fully migrated VM on target node
> > 10) Qemu/storage: clean up on source node
> > 
> > 5b) with bitmaps from 2) is what is currently missing on the Qemu side,
> > but seems easy enough to support (like I said, we are currently using Ma
> > Haocong's patch for testing, but want to get this feature upstream one
> > way or another instead of carrying our own, possibly incompatible in the
> > near-future version).
> > 
> 
> It will look VERY similar. Switching should be easy; the only difference
> will be:
> 
> sync=BITMAP instead of sync=INCREMENTAL, and
> bitmap_mode=NEVER provided explicitly to match Ma Haocong's patch behavior.
> 
> You can alternatively use the other bitmap policies depending on what
> you want:
> 
> NEVER leaves the bitmap alone entirely like Ma Haocong's patch does. It
> reflects a kind of "differential backup" intent; changes accumulate in
> the bitmap if it was enabled.
> 
> ON-SUCCESS will reset any bits copied out if the job completes
> successfully (note that this includes mirror cancellation after sync as
> well as a COMPLETE instruction that includes the pivot.)
> 
> ALWAYS will reset any bits successfully copied out, regardless of the
> final state of the job. You can use this one to resume the mirror on
> failures.
> 
> You should be able to get the exact behavior you've already programmed
> for, and maybe some new toys.

that sounds promising. for this specific use case we don't care what
happens to the bitmap, since on success we switch to the migration
target VM and stop the old/source one (and thus the bitmap), and on
failure we abort the migration (and thus drop the bitmap) and a new
attempt will start with a new bitmap+replication run.

it might be interesting for other use cases though.

> 
> > 2) and 3) are obviously not atomic, so the bitmaps will contain some
> > writes that have been replicated already on the block/storage layer
> > below the VM, and those writes will be done a second time in step 5b).
> > 
> > we can work around this by adding another short down time by
> > freezing/suspending prior to 2) until after doing the ZFS snapshots at
> > the start of 3), in case these duplicate writes turn out to be
> > problematic after all. this downtime would be rather short, as the bulk
> > of the replication work (actually transferring the latest delta) can
> > happen after unfreezing/resuming the VM. so far we haven't encountered
> > any problems in our (albeit limited) testing though, so if possible we
> > would naturally like to avoid the additional downtime altogether ;)
> > 
> > looking forward to your patch(es) :)
> > 
> > 1: <CAKVPjOZ8Y8U2zHgo_06aozrdd9_Cq6txWrX5F4HnFefAUjimyQ@mail.gmail.com>
> > and <20170504105444.8940-1-daniel.kucera@gmail.com>
> > 
> 
> Thanks for the writeup! My goal is to have this in for 4.2 alongside all
> of the other bitmap changes I've queued so far.

that sounds great! feel free to CC me on subsequent series if you want
early testing ;) I do try to keep up with -devel anyways, but sometimes
stuff slips through.