[3/3] block: mirror - zero unallocated target sectors when zero init not present

Message ID c3dcba707e0d2b976d788eb228336237fd7d612c.1443410673.git.jcody@redhat.com
State New

Commit Message

Jeff Cody Sept. 28, 2015, 3:29 a.m. UTC
During mirror, if the target device does not have support zero
initialization, a mirror may result in a corrupt image.

For instance, on mirror to a host device with format = raw, whatever
random data is on the target device will still be there for unallocated
sectors.

This is because during the mirror, we set the dirty bitmap to copy only
sectors allocated above 'base'.  In the case of target devices where we
cannot assume unallocated sectors will be read as zeroes, we need to
explicitly zero out this data.

In order to avoid zeroing out all sectors of the target device prior to
mirroring, we do zeroing as part of the block job.  A second dirty
bitmap cache is created, to track sectors that are unallocated above
'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
on the target - if they are not, then zeroes are explicitly written.

This only occurs under two conditions:

    1. 'mode' != "existing"
    2. bdrv_has_zero_init(target) == 0

We perform the mirroring through mirror_iteration() as before, except
in two passes.  If the above two conditions are met, the first pass
is using the bitmap tracking unallocated sectors, to write the needed
zeroes.  Then, the second pass is performed, to mirror the actual data
as before.

If the above two conditions are not met, then the first pass is skipped,
and only the second pass (the one with the actual data) is performed.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/mirror.c            | 109 ++++++++++++++++++++++++++++++++++------------
 blockdev.c                |   2 +-
 include/block/block_int.h |   3 +-
 qapi/block-core.json      |   6 ++-
 4 files changed, 87 insertions(+), 33 deletions(-)
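
The two-pass scheme described in the commit message can be illustrated with a minimal standalone sketch. This is a toy model and not QEMU code: one byte stands in for one sector, and every name in it (ToyMirror, toy_zero_pass, toy_data_pass, toy_mirror_run) is hypothetical. The target_has_zero_init flag only stands in for the result of bdrv_has_zero_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MIRROR_SECTORS 8

/* Toy model, not QEMU code: one byte stands in for one sector. */
typedef struct ToyMirror {
    unsigned char source[TOY_MIRROR_SECTORS]; /* guest-visible source data */
    bool allocated[TOY_MIRROR_SECTORS];       /* allocated above 'base'? */
    unsigned char target[TOY_MIRROR_SECTORS]; /* may start out as garbage */
    bool target_has_zero_init;  /* stands in for bdrv_has_zero_init() */
} ToyMirror;

/* First pass: zero target sectors that are unallocated on the source and
 * do not already read back as zero. Returns the number of zero writes. */
static size_t toy_zero_pass(ToyMirror *m)
{
    size_t writes = 0;

    for (size_t i = 0; i < TOY_MIRROR_SECTORS; i++) {
        if (!m->allocated[i] && m->target[i] != 0) {
            m->target[i] = 0;
            writes++;
        }
    }
    return writes;
}

/* Second pass: copy the allocated sectors. Returns the number of copies. */
static size_t toy_data_pass(ToyMirror *m)
{
    size_t copies = 0;

    for (size_t i = 0; i < TOY_MIRROR_SECTORS; i++) {
        if (m->allocated[i]) {
            m->target[i] = m->source[i];
            copies++;
        }
    }
    return copies;
}

/* Runs the zero pass only under the two conditions from the commit
 * message, then always runs the data pass. Returns the total writes. */
static size_t toy_mirror_run(ToyMirror *m, bool mode_existing)
{
    size_t ops = 0;

    if (!mode_existing && !m->target_has_zero_init) {
        ops += toy_zero_pass(m);
    }
    ops += toy_data_pass(m);
    assert(ops <= 2 * TOY_MIRROR_SECTORS);
    return ops;
}
```

In this model, a target that starts out full of stale bytes ends up with zeroes in every sector that is unallocated on the source, but only when the first pass runs. With mode_existing set, the stale bytes survive.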

Comments

Paolo Bonzini Sept. 28, 2015, 2:13 p.m. UTC | #1
On 28/09/2015 05:29, Jeff Cody wrote:
> This only occurs under two conditions:
> 
>     1. 'mode' != "existing"
>     2. bdrv_has_zero_init(target) == 0
> 

I'm not sure if mode != "existing" actually matters.  I think what
actually matters is sync == "full".

The reasons are:

1) with sync != "full", unallocated target sectors should remain
unallocated on the destination because they are supposed to point to the
backing file.

2) even with mode == "existing" you expect the data to be consistent at
the end of the mirroring

Paolo
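
The difference between the posted condition and the one suggested in this reply can be written down as two small predicates. This is only a sketch: ToySyncMode and both function names are hypothetical, and the boolean target_has_zero_init stands in for the result of bdrv_has_zero_init(target).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sync modes discussed in the thread. */
typedef enum ToySyncMode {
    TOY_SYNC_FULL,
    TOY_SYNC_TOP,
    TOY_SYNC_NONE
} ToySyncMode;

/* Condition as posted in the patch: keyed on 'mode'. */
static bool toy_zero_unallocated_posted(bool mode_existing,
                                        bool target_has_zero_init)
{
    return !mode_existing && !target_has_zero_init;
}

/* Condition as suggested in review: keyed on 'sync'. */
static bool toy_zero_unallocated_suggested(ToySyncMode sync,
                                           bool target_has_zero_init)
{
    assert(sync == TOY_SYNC_FULL || sync == TOY_SYNC_TOP ||
           sync == TOY_SYNC_NONE);
    return sync == TOY_SYNC_FULL && !target_has_zero_init;
}
```

For a full mirror onto an existing target without zero init, the posted predicate returns false (no zeroing) while the suggested one returns true. For a sync == "top" mirror, the suggested predicate never zeroes, regardless of mode.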
Kevin Wolf Sept. 28, 2015, 3:07 p.m. UTC | #2
Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben:
> During mirror, if the target device does not have support zero
> initialization, a mirror may result in a corrupt image.

I think you want to check this sentence. ("During mirror [...], a
mirror may result [...]")

> For instance, on mirror to a host device with format = raw, whatever
> random data is on the target device will still be there for unallocated
> sectors.
> 
> This is because during the mirror, we set the dirty bitmap to copy only
> sectors allocated above 'base'.  In the case of target devices where we
> cannot assume unallocated sectors will be read as zeroes, we need to
> explicitly zero out this data.
> 
> In order to avoid zeroing out all sectors of the target device prior to
> mirroring, we do zeroing as part of the block job.  A second dirty
> bitmap cache is created, to track sectors that are unallocated above
> 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> on the target - if they are not, then zeroes are explicitly written.

Why do you need a bitmap? You never change the bitmap after initialising
it, so couldn't you instead just check the allocation status when you
need it?

In fact, why do we need two passes? I would have expected that commit
dcfb3beb already does the trick, with checking allocation status and
writing zeroes during the normal single pass.

If that commit fails to solve the problem, I guess I first need to
understand why before I can continue reviewing this one...
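
For comparison, the single-pass alternative raised here (query the allocation status on demand instead of precomputing a second bitmap) might look like the following toy sketch. It is not what commit dcfb3beb does and it is not QEMU code. ToyIsAllocatedFn, toy_flags_is_allocated and toy_single_pass are hypothetical names, one byte stands in for one sector, and the sketch assumes the loop visits every sector rather than only the dirty ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: reports whether a source sector is allocated. */
typedef bool (*ToyIsAllocatedFn)(const void *opaque, size_t sector);

/* Example callback backed by a plain array of flags. */
static bool toy_flags_is_allocated(const void *opaque, size_t sector)
{
    const bool *flags = opaque;
    return flags[sector];
}

/* Single pass over all sectors: the allocation status is looked up when
 * it is needed, so no second bitmap is kept. Returns the write count. */
static size_t toy_single_pass(const unsigned char *source,
                              unsigned char *target, size_t nb_sectors,
                              ToyIsAllocatedFn is_allocated,
                              const void *opaque)
{
    size_t writes = 0;

    assert(source && target && is_allocated);
    for (size_t i = 0; i < nb_sectors; i++) {
        if (is_allocated(opaque, i)) {
            target[i] = source[i];   /* copy real data */
            writes++;
        } else if (target[i] != 0) {
            target[i] = 0;           /* zero stale data */
            writes++;
        }
        /* else: unallocated and already zero, nothing to write */
    }
    return writes;
}
```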

> This only occurs under two conditions:
> 
>     1. 'mode' != "existing"
>     2. bdrv_has_zero_init(target) == 0
> 
> We perform the mirroring through mirror_iteration() as before, except
> in two passes.  If the above two conditions are met, the first pass
> is using the bitmap tracking unallocated sectors, to write the needed
> zeroes.  Then, the second pass is performed, to mirror the actual data
> as before.
> 
> If the above two conditions are not met, then the first pass is skipped,
> and only the second pass (the one with the actual data) is performed.
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  block/mirror.c            | 109 ++++++++++++++++++++++++++++++++++------------
>  blockdev.c                |   2 +-
>  include/block/block_int.h |   3 +-
>  qapi/block-core.json      |   6 ++-
>  4 files changed, 87 insertions(+), 33 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 405e5c4..b599176 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob {
>      int64_t bdev_length;
>      unsigned long *cow_bitmap;
>      BdrvDirtyBitmap *dirty_bitmap;
> -    HBitmapIter hbi;
> +    HBitmapIter zero_hbi;
> +    HBitmapIter allocated_hbi;
> +    HBitmapIter *hbi;
>      uint8_t *buf;
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
> @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob {
>      int sectors_in_flight;
>      int ret;
>      bool unmap;
> +    bool zero_unallocated;
> +    bool zero_cycle;
>      bool waiting_for_io;
>  } MirrorBlockJob;
>  
> @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>      int pnum;
>      int64_t ret;
>  
> -    s->sector_num = hbitmap_iter_next(&s->hbi);
> +    s->sector_num = hbitmap_iter_next(s->hbi);
>      if (s->sector_num < 0) {
> -        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> -        s->sector_num = hbitmap_iter_next(&s->hbi);
> +        bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi);
> +        s->sector_num = hbitmap_iter_next(s->hbi);
>          trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
>          assert(s->sector_num >= 0);
>      }
> @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>           */
>          if (next_sector > hbitmap_next_sector
>              && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
> -            hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
> +            hbitmap_next_sector = hbitmap_iter_next(s->hbi);
>          }
>  
>          next_sector += sectors_per_chunk;
> @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>      s->sectors_in_flight += nb_sectors;
>      trace_mirror_one_iteration(s, sector_num, nb_sectors);
>  
> -    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> -                                      nb_sectors, &pnum);
> -    if (ret < 0 || pnum < nb_sectors ||
> -            (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> -        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> -                       mirror_read_complete, op);
> -    } else if (ret & BDRV_BLOCK_ZERO) {
> -        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> -                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> -                              mirror_write_complete, op);
> +    if (s->zero_cycle) {
> +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                                  mirror_write_complete, op);
> +        }

This function appears to assume that every call issues an AIO request,
and that the completion callback is what drives progress. In the
BDRV_BLOCK_ZERO case, however, no request is issued. I'm not sure exactly
what the consequences are, but at the very least we are applying block
job throttling to areas of the image where we do nothing.
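
One plausible consequence can be shown with a toy model of the in-flight accounting. This is a sketch under assumptions, not QEMU code: ToyJob, toy_write_complete and toy_zero_iteration are hypothetical names, and the model assumes an operation is counted as in flight before any request is issued and is only uncounted by its completion callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the job state; not the real MirrorBlockJob. */
typedef struct ToyJob {
    int in_flight;   /* operations started but not yet completed */
    int completed;   /* operations whose completion callback has run */
} ToyJob;

/* Completion callback: the only place where in_flight is decremented. */
static void toy_write_complete(ToyJob *job)
{
    assert(job->in_flight > 0);
    job->in_flight--;
    job->completed++;
}

/* One iteration of the zeroing pass. The operation is accounted for
 * before any request is issued. */
static void toy_zero_iteration(ToyJob *job, bool target_reads_zero,
                               bool complete_when_skipped)
{
    job->in_flight++;
    if (!target_reads_zero) {
        /* Stands in for an asynchronous zero write; the toy model runs
         * the completion callback immediately. */
        toy_write_complete(job);
    } else if (complete_when_skipped) {
        /* No I/O needed, but the operation is still completed. */
        toy_write_complete(job);
    }
    /* Otherwise nothing ever completes this operation. */
}
```

In this model, skipping the write without calling the completion callback leaves in_flight permanently raised, so anything that waits for it to reach zero would never finish. Whether the real job behaves this way depends on code not shown in the quoted hunk.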

>      } else {
> -        assert(!(ret & BDRV_BLOCK_DATA));
> -        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> -                         mirror_write_complete, op);
> +        ret = bdrv_get_block_status_above(source, NULL, sector_num,
> +                                          nb_sectors, &pnum);
> +        if (ret < 0 || pnum < nb_sectors ||
> +                (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> +            bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> +                           mirror_read_complete, op);
> +        } else if (ret & BDRV_BLOCK_ZERO) {
> +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                                  mirror_write_complete, op);
> +        } else {
> +            assert(!(ret & BDRV_BLOCK_DATA));
> +            bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> +                             mirror_write_complete, op);
> +        }
>      }
>      return delay_ns;
>  }

Kevin
Kevin Wolf Sept. 28, 2015, 3:10 p.m. UTC | #3
Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben:
> During mirror, if the target device does not have support zero
> initialization, a mirror may result in a corrupt image.
> 
> For instance, on mirror to a host device with format = raw, whatever
> random data is on the target device will still be there for unallocated
> sectors.
> 
> This is because during the mirror, we set the dirty bitmap to copy only
> sectors allocated above 'base'.  In the case of target devices where we
> cannot assume unallocated sectors will be read as zeroes, we need to
> explicitly zero out this data.
> 
> In order to avoid zeroing out all sectors of the target device prior to
> mirroring, we do zeroing as part of the block job.  A second dirty
> bitmap cache is created, to track sectors that are unallocated above
> 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> on the target - if they are not, then zeroes are explicitly written.
> 
> This only occurs under two conditions:
> 
>     1. 'mode' != "existing"
>     2. bdrv_has_zero_init(target) == 0
> 
> We perform the mirroring through mirror_iteration() as before, except
> in two passes.  If the above two conditions are met, the first pass
> is using the bitmap tracking unallocated sectors, to write the needed
> zeroes.  Then, the second pass is performed, to mirror the actual data
> as before.
> 
> If the above two conditions are not met, then the first pass is skipped,
> and only the second pass (the one with the actual data) is performed.
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>

Also, this makes qemu-iotests 097 fail for me.

Kevin
Stefan Hajnoczi Sept. 28, 2015, 3:23 p.m. UTC | #4
On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote:
> +    if (s->zero_cycle) {
> +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                                  mirror_write_complete, op);

mirror_write_complete will advance s->common.offset.  Won't the progress
be incorrect if we do that for both zeroing and regular mirroring?
Max Reitz Sept. 28, 2015, 5:32 p.m. UTC | #5
On 28.09.2015 05:29, Jeff Cody wrote:
> During mirror, if the target device does not have support zero
> initialization, a mirror may result in a corrupt image.
> 
> For instance, on mirror to a host device with format = raw, whatever
> random data is on the target device will still be there for unallocated
> sectors.
> 
> This is because during the mirror, we set the dirty bitmap to copy only
> sectors allocated above 'base'.  In the case of target devices where we
> cannot assume unallocated sectors will be read as zeroes, we need to
> explicitly zero out this data.
> 
> In order to avoid zeroing out all sectors of the target device prior to
> mirroring, we do zeroing as part of the block job.  A second dirty
> bitmap cache is created, to track sectors that are unallocated above
> 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> on the target - if they are not, then zeroes are explicitly written.
> 
> This only occurs under two conditions:
> 
>     1. 'mode' != "existing"
>     2. bdrv_has_zero_init(target) == 0
> 
> We perform the mirroring through mirror_iteration() as before, except
> in two passes.  If the above two conditions are met, the first pass
> is using the bitmap tracking unallocated sectors, to write the needed
> zeroes.  Then, the second pass is performed, to mirror the actual data
> as before.
> 
> If the above two conditions are not met, then the first pass is skipped,
> and only the second pass (the one with the actual data) is performed.
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  block/mirror.c            | 109 ++++++++++++++++++++++++++++++++++------------
>  blockdev.c                |   2 +-
>  include/block/block_int.h |   3 +-
>  qapi/block-core.json      |   6 ++-
>  4 files changed, 87 insertions(+), 33 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 405e5c4..b599176 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob {
>      int64_t bdev_length;
>      unsigned long *cow_bitmap;
>      BdrvDirtyBitmap *dirty_bitmap;
> -    HBitmapIter hbi;
> +    HBitmapIter zero_hbi;
> +    HBitmapIter allocated_hbi;
> +    HBitmapIter *hbi;
>      uint8_t *buf;
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
> @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob {
>      int sectors_in_flight;
>      int ret;
>      bool unmap;
> +    bool zero_unallocated;
> +    bool zero_cycle;
>      bool waiting_for_io;
>  } MirrorBlockJob;
>  
> @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>      int pnum;
>      int64_t ret;
>  
> -    s->sector_num = hbitmap_iter_next(&s->hbi);
> +    s->sector_num = hbitmap_iter_next(s->hbi);
>      if (s->sector_num < 0) {
> -        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> -        s->sector_num = hbitmap_iter_next(&s->hbi);
> +        bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi);
> +        s->sector_num = hbitmap_iter_next(s->hbi);
>          trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
>          assert(s->sector_num >= 0);
>      }
> @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>           */
>          if (next_sector > hbitmap_next_sector
>              && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
> -            hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
> +            hbitmap_next_sector = hbitmap_iter_next(s->hbi);
>          }
>  
>          next_sector += sectors_per_chunk;
> @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>      s->sectors_in_flight += nb_sectors;
>      trace_mirror_one_iteration(s, sector_num, nb_sectors);
>  
> -    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> -                                      nb_sectors, &pnum);
> -    if (ret < 0 || pnum < nb_sectors ||
> -            (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> -        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> -                       mirror_read_complete, op);
> -    } else if (ret & BDRV_BLOCK_ZERO) {
> -        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> -                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> -                              mirror_write_complete, op);
> +    if (s->zero_cycle) {
> +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                                  mirror_write_complete, op);
> +        }
>      } else {
> -        assert(!(ret & BDRV_BLOCK_DATA));
> -        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> -                         mirror_write_complete, op);
> +        ret = bdrv_get_block_status_above(source, NULL, sector_num,
> +                                          nb_sectors, &pnum);
> +        if (ret < 0 || pnum < nb_sectors ||
> +                (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> +            bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> +                           mirror_read_complete, op);
> +        } else if (ret & BDRV_BLOCK_ZERO) {
> +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                                  mirror_write_complete, op);
> +        } else {
> +            assert(!(ret & BDRV_BLOCK_DATA));
> +            bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> +                             mirror_write_complete, op);
> +        }
>      }
>      return delay_ns;
>  }
>  
> -static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
> +static int mirror_do_iteration(MirrorBlockJob *s, uint64_t *last_pause_ns)
>  {
>      int ret;
>  
> @@ -347,7 +360,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
>           * We do so every SLICE_TIME nanoseconds, or when there is an error,
>           * or when the source is clean, whichever comes first.
>           */
> -        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns < SLICE_TIME
> +        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - *last_pause_ns < SLICE_TIME
>              && s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
>              if (s->in_flight == MAX_IN_FLIGHT || s->buf_free_count == 0 ||
>                  (cnt == 0 && s->in_flight > 0)) {
> @@ -371,6 +384,14 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
>                      goto immediate_exit;
>                  }
>              } else {
> +
> +                if (s->zero_cycle) {
> +                    /* this is not the end of the streaming cycle,
> +                     * if we are just filling in zeroes for unallocated
> +                     * sectors prior to streaming the real data */
> +                    goto immediate_exit;
> +                }
> +
>                  /* We're out of the streaming phase.  From now on, if the job
>                   * is cancelled we will actually complete all pending I/O and
>                   * report completion.  This way, block-job-cancel will leave
> @@ -419,7 +440,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
>              s->common.cancelled = false;
>              break;
>          }
> -        last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +        *last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      }
>  
>  immediate_exit:
> @@ -511,6 +532,15 @@ static void coroutine_fn mirror_run(void *opaque)
>                                   checking for a NULL string */
>      int ret = 0;
>      int n;
> +    BdrvDirtyBitmap *zero_dirty_bitmap;
> +    BdrvDirtyBitmap *allocated_dirty_bitmap = s->dirty_bitmap;
> +
> +    zero_dirty_bitmap = bdrv_create_dirty_bitmap(s->target,
> +                                                 s->granularity, NULL, true,
> +                                                 NULL);
> +    if (zero_dirty_bitmap == NULL) {
> +        goto immediate_exit;
> +    }

I think I'd like the error to be reported to the user; but in any case,
you have to set ret to a negative value.
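
The point about ret can be illustrated with a generic goto-cleanup sketch. It is standalone C rather than QEMU code: toy_job_run is a hypothetical function, and calloc() merely stands in for the bitmap creation that may fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical job body using a single cleanup label. */
static int toy_job_run(bool simulate_alloc_failure)
{
    int ret = 0;
    unsigned char *bitmap = NULL;

    if (!simulate_alloc_failure) {
        bitmap = calloc(16, 1);   /* stands in for creating the bitmap */
    }
    if (bitmap == NULL) {
        /* Without this assignment ret would still be 0 and the caller
         * would see the failed job as a success. */
        ret = -ENOMEM;
        goto immediate_exit;
    }

    bitmap[0] = 1;                /* placeholder for the real work */
    assert(bitmap[0] == 1);

immediate_exit:
    free(bitmap);                 /* free(NULL) is a no-op */
    return ret;
}
```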

>  
>      if (block_job_is_cancelled(&s->common)) {
>          goto immediate_exit;
> @@ -588,14 +618,33 @@ static void coroutine_fn mirror_run(void *opaque)
>              assert(n > 0);
>              if (ret == 1) {
>                  bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
> +            } else if (s->zero_unallocated) {
> +                bdrv_set_dirty_bitmap(zero_dirty_bitmap, sector_num, n);
>              }
>              sector_num += n;
>          }
>      }
>  
> -    bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> +    bdrv_dirty_iter_init(s->dirty_bitmap, &s->allocated_hbi);
>  
> -    ret = mirror_do_iteration(s, last_pause_ns);
> +    if (s->zero_unallocated) {
> +        bdrv_dirty_iter_init(zero_dirty_bitmap, &s->zero_hbi);
> +        s->dirty_bitmap = zero_dirty_bitmap;
> +        s->hbi = &s->zero_hbi;
> +
> +        s->zero_cycle = true;
> +        ret = mirror_do_iteration(s, &last_pause_ns);
> +        if (ret < 0) {
> +            goto immediate_exit;
> +        }
> +
> +        mirror_drain(s);
> +        s->zero_cycle = false;
> +    }
> +
> +    s->dirty_bitmap = allocated_dirty_bitmap;
> +    s->hbi = &s->allocated_hbi;
> +    ret = mirror_do_iteration(s, &last_pause_ns);
>  
>  immediate_exit:
>      if (s->in_flight > 0) {
> @@ -611,7 +660,8 @@ immediate_exit:
>      qemu_vfree(s->buf);
>      g_free(s->cow_bitmap);
>      g_free(s->in_flight_bitmap);
> -    bdrv_release_dirty_bitmap(bs, s->dirty_bitmap);
> +    bdrv_release_dirty_bitmap(bs, allocated_dirty_bitmap);
> +    bdrv_release_dirty_bitmap(NULL, zero_dirty_bitmap);
>      bdrv_iostatus_disable(s->target);
>  
>      data = g_malloc(sizeof(*data));
> @@ -702,7 +752,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
>                               int64_t buf_size,
>                               BlockdevOnError on_source_error,
>                               BlockdevOnError on_target_error,
> -                             bool unmap,
> +                             bool unmap, bool existing,
>                               BlockCompletionFunc *cb,
>                               void *opaque, Error **errp,
>                               const BlockJobDriver *driver,
> @@ -737,6 +787,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
>          return;
>      }
>  
> +    s->zero_unallocated = !existing && !bdrv_has_zero_init(target);

I think this should be set only if we're doing a full mirror operation.
For instance, I could do a none, top or incremental mirror to a new
qcow2 file, which would give it a backing file, obviously. You're lucky
in that qcow2 claims to always have zero initialization, when this is in
fact not true (someone ought to fix that...): With a backing file, an
overlay file just cannot have zero initialization, it's impossible
(well, unless the backing file is completely zero).

So if qcow2 were to answer correctly, i.e. "No, with a backing file I do
not have zero init", then this would overwrite all sectors which are
supposed to be unallocated because they are present in the backing file.
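
A toy model of an overlay with a backing file shows why zeroing would be harmful here. The sketch is not the qcow2 implementation: ToyOverlay, toy_overlay_read and toy_overlay_write_zero are hypothetical, and one byte stands in for one sector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_OVERLAY_SECTORS 4

/* Toy overlay: unallocated sectors fall through to the backing data. */
typedef struct ToyOverlay {
    bool allocated[TOY_OVERLAY_SECTORS];
    unsigned char data[TOY_OVERLAY_SECTORS];
    const unsigned char *backing;
} ToyOverlay;

static unsigned char toy_overlay_read(const ToyOverlay *o, size_t sector)
{
    assert(sector < TOY_OVERLAY_SECTORS);
    return o->allocated[sector] ? o->data[sector] : o->backing[sector];
}

/* Writing zeroes allocates the sector in the overlay, which hides
 * whatever the backing file holds at that position. */
static void toy_overlay_write_zero(ToyOverlay *o, size_t sector)
{
    assert(sector < TOY_OVERLAY_SECTORS);
    o->allocated[sector] = true;
    o->data[sector] = 0;
}
```

Before the zero write, an unallocated overlay sector reads back the backing data. Afterwards it reads as zero, so the guest-visible content has changed even though the mirror was never supposed to touch that sector.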

>      s->replaces = g_strdup(replaces);
>      s->on_source_error = on_source_error;
>      s->on_target_error = on_target_error;
> @@ -767,7 +818,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
>                    int64_t speed, uint32_t granularity, int64_t buf_size,
>                    MirrorSyncMode mode, BlockdevOnError on_source_error,
>                    BlockdevOnError on_target_error,
> -                  bool unmap,
> +                  bool unmap, bool existing,
>                    BlockCompletionFunc *cb,
>                    void *opaque, Error **errp)
>  {
> @@ -782,8 +833,8 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
>      base = mode == MIRROR_SYNC_MODE_TOP ? bs->backing_hd : NULL;
>      mirror_start_job(bs, target, replaces,
>                       speed, granularity, buf_size,
> -                     on_source_error, on_target_error, unmap, cb, opaque, errp,
> -                     &mirror_job_driver, is_none_mode, base);
> +                     on_source_error, on_target_error, unmap, existing, cb,
> +                     opaque, errp, &mirror_job_driver, is_none_mode, base);
>  }
>  
>  void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
> @@ -830,7 +881,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
>  
>      bdrv_ref(base);
>      mirror_start_job(bs, base, NULL, speed, 0, 0,
> -                     on_error, on_error, false, cb, opaque, &local_err,
> +                     on_error, on_error, false, false, cb, opaque, &local_err,

This should probably be true; the commit target is already existing,
after all. Also, without it being true, iotest 097 fails.

>                       &commit_active_job_driver, false, base);
>      if (local_err) {
>          error_propagate(errp, local_err);
> diff --git a/blockdev.c b/blockdev.c
> index cb9f78d..c06ac60 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -2816,7 +2816,7 @@ void qmp_drive_mirror(const char *device, const char *target,
>                   has_replaces ? replaces : NULL,
>                   speed, granularity, buf_size, sync,
>                   on_source_error, on_target_error,
> -                 unmap,
> +                 unmap, mode == NEW_IMAGE_MODE_EXISTING,
>                   block_job_cb, bs, &local_err);
>      if (local_err != NULL) {
>          bdrv_unref(target_bs);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 14ad4c3..21a8988 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -614,6 +614,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
>   * @on_source_error: The action to take upon error reading from the source.
>   * @on_target_error: The action to take upon error writing to the target.
>   * @unmap: Whether to unmap target where source sectors only contain zeroes.
> + * @existing: Whether target image is an existing image prior to the QMP cmd.
>   * @cb: Completion function for the job.
>   * @opaque: Opaque pointer value passed to @cb.
>   * @errp: Error object.
> @@ -628,7 +629,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
>                    int64_t speed, uint32_t granularity, int64_t buf_size,
>                    MirrorSyncMode mode, BlockdevOnError on_source_error,
>                    BlockdevOnError on_target_error,
> -                  bool unmap,
> +                  bool unmap, bool existing,
>                    BlockCompletionFunc *cb,
>                    void *opaque, Error **errp);
>  
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index bb2189e..033afb4 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -952,8 +952,10 @@
>  #            broken Quorum files. (Since 2.1)
>  #
>  # @mode: #optional whether and how QEMU should create a new image, default is
> -#        'absolute-paths'.
> -#

This empty line should stay.

> +#        'absolute-paths'.  If mode != 'existing', and the target does not
> +#         have zero init (sparseness), then the target image will have sectors
> +#         zeroed out that correspond to sectors in an unallocated state in the
> +#         source image.

As I said above, this should only happen if @sync == 'full'.

Max

>  # @speed:  #optional the maximum speed, in bytes per second
>  #
>  # @sync: what parts of the disk image should be copied to the destination
>
Eric Blake Sept. 28, 2015, 8:31 p.m. UTC | #6
On 09/28/2015 08:13 AM, Paolo Bonzini wrote:
> 
> 
> On 28/09/2015 05:29, Jeff Cody wrote:
>> This only occurs under two conditions:
>>
>>     1. 'mode' != "existing"
>>     2. bdrv_has_zero_init(target) == 0
>>
> 
> I'm not sure if mode != "existing" actually matters.  I think what
> actually matters is sync == "full".

When mode == 'existing' for a shallow mirror (sync != 'full'), that is
the caller stating that the guest-visible contents of the destination
match the guest-visible contents of the backing image.  The only sectors
to be copied are those that differ from the backing file, and we should
not be zeroing unrelated sectors because the user has already promised
they have the same guest-visible content as the backing image would report.

When mode == 'existing' for a full mirror (sync == 'full'), that is the
caller stating that they want every single sector of the destination
written to hold the current state of the source (of course, allowing for
optimizations such as skipping the write where the contents will read
back the same as if the write had been performed).

I think Paolo is right: we care about zeroing unallocated sectors for
sync == 'full', regardless of whether mode == 'existing'.

I also think the reason Jeff confused it for mode == 'existing' is that
the other modes let qemu create the file, but qemu does not create block
devices (the only way to mirror to a block device is via mode ==
'existing'), and it is primarily block devices where zero init is not
guaranteed.

> 
> The reasons are:
> 
> 1) with sync != "full", unallocated target sectors should remain
> unallocated on the destination because they are supposed to point to the
> backing file.
> 
> 2) even with mode == "existing" you expect the data to be consistent at
> the end of the mirroring
> 
> Paolo
> 
>
Jeff Cody Sept. 28, 2015, 9:32 p.m. UTC | #7
On Sep 28, 2015 4:12 PM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>
>

Replying from mobile; please excuse any formatting issues.

>
> On 28/09/2015 05:29, Jeff Cody wrote:
> > This only occurs under two conditions:
> >
> >     1. 'mode' != "existing"
> >     2. bdrv_has_zero_init(target) == 0
> >
>
> I'm not sure if mode != "existing" actually matters.  I think what
> actually matters is sync == "full".
>
> The reasons are:
>
> 1) with sync != "full", unallocated target sectors should remain
> unallocated on the destination because they are supposed to point to the
> backing file.

I guess that makes sense.  What about the case when the target is a raw
device without zero init?  There is no backing file... Of course, perhaps
in the raw case the user should be using sync==full anyway.

>
> 2) even with mode == "existing" you expect the data to be consistent at
> the end of the mirroring
>

The reason I added the "existing" exception was so the user could avoid the
time penalty of zeroing out the data if they knew the target had already
explicitly been zeroed.  Do you think it is fair to assume that if the user
specified existing, that they take responsibility for setting up the target
image how they like (including data initialization)?  Or should we add
another option for mirror, to allow the user to bypass the zero fill?

Thanks,
Jeff
Jeff Cody Sept. 28, 2015, 9:57 p.m. UTC | #8
On Sep 28, 2015 5:31 PM, "Kevin Wolf" <kwolf@redhat.com> wrote:
>

(Responding from mobile phone again)

> Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben:
> > During mirror, if the target device does not have support zero
> > initialization, a mirror may result in a corrupt image.
>
> I think you want to check this sentence. ("During mirror [...], a
> mirror may result [...]")
>

Yes, thanks.

> > For instance, on mirror to a host device with format = raw, whatever
> > random data is on the target device will still be there for unallocated
> > sectors.
> >
> > This is because during the mirror, we set the dirty bitmap to copy only
> > sectors allocated above 'base'.  In the case of target devices where we
> > cannot assume unallocated sectors will be read as zeroes, we need to
> > explicitly zero out this data.
> >
> > In order to avoid zeroing out all sectors of the target device prior to
> > mirroring, we do zeroing as part of the block job.  A second dirty
> > bitmap cache is created, to track sectors that are unallocated above
> > 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> > on the target - if they are not, then zeroes are explicitly written.
>
> Why do you need a bitmap? You never change the bitmap after initialising
> it, so couldn't you instead just check the allocation status when you
> need it?

The main reason was really to maximize code reuse, and be able to use the
same iteration code in the mirror coroutine.

>
> In fact, why do we need two passes? I would have expected that commit
> dcfb3beb already does the trick, with checking allocation status and
> writing zeroes during the normal single pass.
>
> If that commit fails to solve the problem, I guess I first need to
> understand why before I can continue reviewing this one...
>

Responding from memory right now, but that commit only helps if the guest
unmaps data, changing the sectors to unallocated after the mirror begins.

However, before we get to this point we've already generated our bitmap of
dirty sectors in mirror_run(), and those are explicitly only sectors that
are allocated above the source.  Inside the iteration, we'll only pick up
the unallocated sectors if they have been changed by the guest.
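
To make that concrete, here is a toy model (not the mirror.c code; every
name in it is made up for illustration, and one byte stands in for one
sector).  It initialises the dirty bitmap from the source allocation status
only, mirrors the dirty sectors, and counts how many target sectors still
differ from what the guest would read on the source:

```c
#include <stdbool.h>
#include <stddef.h>

#define NB_SECTORS 8

/* Toy model: one byte stands in for one sector. */
struct toy_image {
    unsigned char data[NB_SECTORS];
    bool allocated[NB_SECTORS];   /* allocated above 'base' on the source */
};

/* Initial dirty bitmap as described above: only allocated sectors. */
static void toy_init_dirty(const struct toy_image *src, bool *dirty)
{
    for (size_t i = 0; i < NB_SECTORS; i++) {
        dirty[i] = src->allocated[i];
    }
}

/* Copy every dirty sector to the target; clean sectors are left alone. */
static void toy_mirror(const struct toy_image *src, const bool *dirty,
                       unsigned char *target)
{
    for (size_t i = 0; i < NB_SECTORS; i++) {
        if (dirty[i]) {
            target[i] = src->data[i];
        }
    }
}

/* Count sectors where the target differs from what the guest reads on
 * the source (unallocated source sectors are assumed to read as zero). */
static size_t toy_count_mismatches(const struct toy_image *src,
                                   const unsigned char *target)
{
    size_t bad = 0;

    for (size_t i = 0; i < NB_SECTORS; i++) {
        unsigned char expect = src->allocated[i] ? src->data[i] : 0;
        if (target[i] != expect) {
            bad++;
        }
    }
    return bad;
}
```

With a target that starts out holding stale non-zero bytes, every sector
that is unallocated on the source is counted as a mismatch, unless the
guest happens to dirty it during the job.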

> > This only occurs under two conditions:
> >
> >     1. 'mode' != "existing"
> >     2. bdrv_has_zero_init(target) == NULL
> >
> > We perform the mirroring through mirror_iteration() as before, except
> > in two passes.  If the above two conditions are met, the first pass
> > is using the bitmap tracking unallocated sectors, to write the needed
> > zeroes.  Then, the second pass is performed, to mirror the actual data
> > as before.
> >
> > If the above two conditions are not met, then the first pass is skipped,
> > and only the second pass (the one with the actual data) is performed.
> >
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > ---
> > >  block/mirror.c            | 109 ++++++++++++++++++++++++++++++++++------------
> >  blockdev.c                |   2 +-
> >  include/block/block_int.h |   3 +-
> >  qapi/block-core.json      |   6 ++-
> >  4 files changed, 87 insertions(+), 33 deletions(-)
> >
> > diff --git a/block/mirror.c b/block/mirror.c
> > index 405e5c4..b599176 100644
> > --- a/block/mirror.c
> > +++ b/block/mirror.c
> > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob {
> >      int64_t bdev_length;
> >      unsigned long *cow_bitmap;
> >      BdrvDirtyBitmap *dirty_bitmap;
> > -    HBitmapIter hbi;
> > +    HBitmapIter zero_hbi;
> > +    HBitmapIter allocated_hbi;
> > +    HBitmapIter *hbi;
> >      uint8_t *buf;
> >      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
> >      int buf_free_count;
> > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob {
> >      int sectors_in_flight;
> >      int ret;
> >      bool unmap;
> > +    bool zero_unallocated;
> > +    bool zero_cycle;
> >      bool waiting_for_io;
> >  } MirrorBlockJob;
> >
> > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn
mirror_iteration(MirrorBlockJob *s)
> >      int pnum;
> >      int64_t ret;
> >
> > -    s->sector_num = hbitmap_iter_next(&s->hbi);
> > +    s->sector_num = hbitmap_iter_next(s->hbi);
> >      if (s->sector_num < 0) {
> > -        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> > -        s->sector_num = hbitmap_iter_next(&s->hbi);
> > +        bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi);
> > +        s->sector_num = hbitmap_iter_next(s->hbi);
> > >          trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
> >          assert(s->sector_num >= 0);
> >      }
> > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn
mirror_iteration(MirrorBlockJob *s)
> >           */
> >          if (next_sector > hbitmap_next_sector
> >              && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
> > -            hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
> > +            hbitmap_next_sector = hbitmap_iter_next(s->hbi);
> >          }
> >
> >          next_sector += sectors_per_chunk;
> > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn
mirror_iteration(MirrorBlockJob *s)
> >      s->sectors_in_flight += nb_sectors;
> >      trace_mirror_one_iteration(s, sector_num, nb_sectors);
> >
> > -    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> > -                                      nb_sectors, &pnum);
> > -    if (ret < 0 || pnum < nb_sectors ||
> > -            (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> > -        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> > -                       mirror_read_complete, op);
> > -    } else if (ret & BDRV_BLOCK_ZERO) {
> > -        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > -                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > -                              mirror_write_complete, op);
> > +    if (s->zero_cycle) {
> > > +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> > > +        if (!(ret & BDRV_BLOCK_ZERO)) {
> > > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > +                                  mirror_write_complete, op);
> > +        }
>
> It seems to be expected that this function always involves an AIO
> request and the completion event is what helps making progress. For the
> BDRV_BLOCK_ZERO case, we don't do that however. I'm not sure what
> exactly this means, but at least I think we are applying block job
> throttling to doing nothing with some areas of the image.
>
> >      } else {
> > -        assert(!(ret & BDRV_BLOCK_DATA));
> > -        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> > -                         mirror_write_complete, op);
> > +        ret = bdrv_get_block_status_above(source, NULL, sector_num,
> > +                                          nb_sectors, &pnum);
> > +        if (ret < 0 || pnum < nb_sectors ||
> > +                (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> > +            bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> > +                           mirror_read_complete, op);
> > +        } else if (ret & BDRV_BLOCK_ZERO) {
> > > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > +                                  mirror_write_complete, op);
> > +        } else {
> > +            assert(!(ret & BDRV_BLOCK_DATA));
> > +            bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> > +                             mirror_write_complete, op);
> > +        }
> >      }
> >      return delay_ns;
> >  }
>
> Kevin
>
Jeff Cody Sept. 28, 2015, 9:58 p.m. UTC | #9
On Sep 28, 2015 5:34 PM, "Kevin Wolf" <kwolf@redhat.com> wrote:
>
> Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben:
> > During mirror, if the target device does not have support zero
> > initialization, a mirror may result in a corrupt image.
> >
> > For instance, on mirror to a host device with format = raw, whatever
> > random data is on the target device will still be there for unallocated
> > sectors.
> >
> > This is because during the mirror, we set the dirty bitmap to copy only
> > sectors allocated above 'base'.  In the case of target devices where we
> > cannot assume unallocated sectors will be read as zeroes, we need to
> > explicitely zero out this data.
> >
> > In order to avoid zeroing out all sectors of the target device prior to
> > mirroring, we do zeroing as part of the block job.  A second dirty
> > bitmap cache is created, to track sectors that are unallocated above
> > 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> > on the target - if they are not, then zeroes are explicitly written.
> >
> > This only occurs under two conditions:
> >
> >     1. 'mode' != "existing"
> >     2. bdrv_has_zero_init(target) == NULL
> >
> > We perform the mirroring through mirror_iteration() as before, except
> > in two passes.  If the above two conditions are met, the first pass
> > is using the bitmap tracking unallocated sectors, to write the needed
> > zeroes.  Then, the second pass is performed, to mirror the actual data
> > as before.
> >
> > If the above two conditions are not met, then the first pass is skipped,
> > and only the second pass (the one with the actual data) is performed.
> >
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
>
> Also, this makes qemu-iotests 097 fail for me.
>

OK, thanks - I'll check that out tomorrow afternoon.  I ran iotests on all
the tests I thought dealt with mirror, but I must have missed that one with
my grep.

Jeff
Eric Blake Sept. 29, 2015, 2:48 a.m. UTC | #10
On 09/28/2015 03:32 PM, Jeff Cody wrote:

> I guess that makes sense.  What about  the case when the target is a raw
> device without zero init?  There is no backing file... Of course, perhaps
> in the raw case the user should be using sync==full anyways.
> 
>>
>> 2) even with mode == "existing" you expect the data to be consistent at
>> the end of the mirroring
>>
> 
> The reason I added the "existing" exception was so the user could avoid the
> time penalty of zeroing out the data if they knew the target had already
> explicitly been zeroed.  Do you think it is fair to assume that if the user
> specified existing, that they take responsibility for setting up the target
> image how they like (including data initialization)?  Or should we add
> another option for mirror, to allow the user to bypass the zero fill?

mode == 'existing' puts the burden on the caller to ensure that the file
they are passing in starts with known contents (either contents don't
matter because we are doing sync == 'full' to write every sector, or
contents MUST initially match what the guest would see looking at the
backing image when doing a shallow clone).  But if there is a way for a
user to pass in an existing file which they have pre-zeroed, even though
the file would normally be treated as though it did not have zero fill,
then the option to bypass a redundant zero fill might be useful.  I'm
not sure it's worth implementing without a known user, though, and I
don't know that libvirt would use it.
Kevin Wolf Sept. 29, 2015, 8:10 a.m. UTC | #11
Am 28.09.2015 um 22:31 hat Eric Blake geschrieben:
> On 09/28/2015 08:13 AM, Paolo Bonzini wrote:
> > 
> > 
> > On 28/09/2015 05:29, Jeff Cody wrote:
> >> This only occurs under two conditions:
> >>
> >>     1. 'mode' != "existing"
> >>     2. bdrv_has_zero_init(target) == NULL
> >>
> > 
> > I'm not sure if mode != "existing" actually matters.  I think what
> > actually matters is sync == "full".
> 
> When mode == 'existing' for a shallow mirror (sync != 'full'), that is
> the caller stating that the guest-visible contents of the destination
> match the guest-visible contents of the backing image.  The only sectors
> to be copied are those that differ from the backing file, and we should
> not be zeroing unrelated sectors because the user has already promised
> they have the same guest-visible content as the backing image would report.

Where is this promise documented? I wasn't aware of it and can't seem to
find it in the QAPI documentation of drive-mirror.

> When mode == 'existing' for a full mirror (sync == 'full'), that is the
> caller stating that they want every single sector of the destination
> written to hold the current state of the source (of course, allowing for
> optimizations such as skipping the write where the contents will read
> back the same as if the write had been performed).
> 
> I think Paolo is right: we care about zeroing unallocated sectors for
> sync == 'full', regardless of whether mode == 'existing'.

I agree.

> I also think the reason Jeff confused it for mode == 'existing' is that
> the other modes let qemu create the file, but qemu does not create block
> devices (the only way to mirror to a block device is via mode ==
> 'existing'), and it is primarily block devices where zero init is not
> guaranteed.

'qemu-img create' works on block devices (even though for raw it doesn't
do more than checking if it's large enough; but for qcow2, it's obvious
that it's necessary), so I'm pretty sure that mode != 'existing' works
on them as well.

Kevin
Kevin Wolf Sept. 29, 2015, 8:28 a.m. UTC | #12
Am 28.09.2015 um 23:57 hat Jeff Cody geschrieben:
> On Sep 28, 2015 5:31 PM, "Kevin Wolf" <kwolf@redhat.com> wrote:
> >
> 
> (Responding from mobile phone again)
> 
> > Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben:
> > > During mirror, if the target device does not have support zero
> > > initialization, a mirror may result in a corrupt image.
> >
> > I think you want to check this sentence. ("During mirror [...], a
> > mirror may result [...]")
> >
> 
> Yes, thanks.
> 
> > > For instance, on mirror to a host device with format = raw, whatever
> > > random data is on the target device will still be there for unallocated
> > > sectors.
> > >
> > > This is because during the mirror, we set the dirty bitmap to copy only
> > > sectors allocated above 'base'.  In the case of target devices where we
> > > cannot assume unallocated sectors will be read as zeroes, we need to
> > > explicitely zero out this data.
> > >
> > > In order to avoid zeroing out all sectors of the target device prior to
> > > mirroring, we do zeroing as part of the block job.  A second dirty
> > > bitmap cache is created, to track sectors that are unallocated above
> > > 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> > > on the target - if they are not, then zeroes are explicitly written.
> >
> > Why do you need a bitmap? You never change the bitmap after initialising
> > it, so couldn't you instead just check the allocation status when you
> > need it?
> 
> The main reason was really to maximize code reuse, and be able to use the same
> iteration code in the mirror coroutine.
> 
> >
> > In fact, why do we need two passes? I would have expected that commit
> > dcfb3beb already does the trick, with checking allocation status and
> > writing zeroes during the normal single pass.
> >
> > If that commit fails to solve the problem, I guess I first need to
> > understand why before I can continue reviewing this one...
> >
> 
> Responding from memory right now, but that commit only helps if the guest
> unmaps data, changing the sectors to unallocated after the mirror begins.
> 
> However, before we get to this point we've already generated our bitmap of
> dirty sectors in mirror_run(), and those are explicitly only sectors that are
> allocated above the source.  Inside the iteration, we'll only pick up the
> unallocated sectors if they have been changed by the guest.

So the problem is just that the sectors aren't included in the
initialisation of the dirty bitmap? If so, why do we need a second
bitmap, rather than simply OR-ing the zero bitmap into the normal one?

Is the real fix that for sync == full, we need to set the whole bitmap
initially, regardless of the allocation status? After all, the
initialisation of the bitmap is how the sync modes are defined in the
QAPI documentation.
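
As a rough sketch of what that could look like (a toy model only; the enum,
the function name and the flat bool arrays are hypothetical and not the
mirror.c interfaces), the initial dirty bitmap would depend on the sync
mode alone for 'full', and on the allocation status only for the shallow
modes:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sync modes discussed here. */
enum toy_sync_mode { TOY_SYNC_FULL, TOY_SYNC_TOP, TOY_SYNC_NONE };

/* Fill 'dirty' for the start of the job and return the number of dirty
 * sectors.  'allocated_in_top' says whether a sector is allocated in the
 * top image; it is ignored for TOY_SYNC_FULL. */
static size_t toy_init_dirty_bitmap(enum toy_sync_mode sync,
                                    const bool *allocated_in_top,
                                    bool *dirty, size_t nb_sectors)
{
    size_t count = 0;

    for (size_t i = 0; i < nb_sectors; i++) {
        switch (sync) {
        case TOY_SYNC_FULL:
            dirty[i] = true;                 /* regardless of allocation */
            break;
        case TOY_SYNC_TOP:
            dirty[i] = allocated_in_top[i];  /* only what the top layer holds */
            break;
        case TOY_SYNC_NONE:
            dirty[i] = false;                /* only later guest writes */
            break;
        }
        if (dirty[i]) {
            count++;
        }
    }
    return count;
}
```

The cost of the 'full' branch is that every sector goes through the
iteration, so keeping the target sparse then depends on what the iteration
does with sectors that read as zero.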

Hm... Of course, you can only rely on zero initialisation as long as
nobody (including an earlier mirror iteration) has written to the target
before, so if you want to keep the target sparse if possible, you do
need a second bitmap; a normal dirty bitmap for the target would be
enough though.
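
A minimal sketch of that idea, assuming a per-sector "already written"
bitmap for the target (all names are hypothetical; this is neither
convert_write() nor mirror.c code): a zero sector can be skipped only while
the target still has its initial zero contents at that position.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_zero_action { TOY_SKIP, TOY_WRITE_ZEROES };

/* Record that the job has written data to this target sector. */
static void toy_note_data_write(bool *target_written, size_t sector)
{
    target_written[sector] = true;
}

/* Decide what to do with a source sector that reads as zero.  Zero
 * initialisation of the target can only be trusted for sectors that
 * nobody has written to yet. */
static enum toy_zero_action toy_handle_zero_sector(bool target_has_zero_init,
                                                   bool *target_written,
                                                   size_t sector)
{
    if (target_has_zero_init && !target_written[sector]) {
        return TOY_SKIP;          /* still reads as zero, stay sparse */
    }
    target_written[sector] = true;
    return TOY_WRITE_ZEROES;      /* stale or overwritten, zero explicitly */
}
```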

With this, you could probably implement something very similar to
convert_write() in qemu-img. Unifying qemu-img convert and mirroring
seems to be worthwhile in the long run anyway.

Kevin

> > > This only occurs under two conditions:
> > >
> > >     1. 'mode' != "existing"
> > >     2. bdrv_has_zero_init(target) == NULL
> > >
> > > We perform the mirroring through mirror_iteration() as before, except
> > > in two passes.  If the above two conditions are met, the first pass
> > > is using the bitmap tracking unallocated sectors, to write the needed
> > > zeroes.  Then, the second pass is performed, to mirror the actual data
> > > as before.
> > >
> > > If the above two conditions are not met, then the first pass is skipped,
> > > and only the second pass (the one with the actual data) is performed.
> > >
> > > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > ---
> > > >  block/mirror.c            | 109 ++++++++++++++++++++++++++++++++++------------
> > >  blockdev.c                |   2 +-
> > >  include/block/block_int.h |   3 +-
> > >  qapi/block-core.json      |   6 ++-
> > >  4 files changed, 87 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/block/mirror.c b/block/mirror.c
> > > index 405e5c4..b599176 100644
> > > --- a/block/mirror.c
> > > +++ b/block/mirror.c
> > > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob {
> > >      int64_t bdev_length;
> > >      unsigned long *cow_bitmap;
> > >      BdrvDirtyBitmap *dirty_bitmap;
> > > -    HBitmapIter hbi;
> > > +    HBitmapIter zero_hbi;
> > > +    HBitmapIter allocated_hbi;
> > > +    HBitmapIter *hbi;
> > >      uint8_t *buf;
> > >      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
> > >      int buf_free_count;
> > > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob {
> > >      int sectors_in_flight;
> > >      int ret;
> > >      bool unmap;
> > > +    bool zero_unallocated;
> > > +    bool zero_cycle;
> > >      bool waiting_for_io;
> > >  } MirrorBlockJob;
> > >
> > > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration
> (MirrorBlockJob *s)
> > >      int pnum;
> > >      int64_t ret;
> > >
> > > -    s->sector_num = hbitmap_iter_next(&s->hbi);
> > > +    s->sector_num = hbitmap_iter_next(s->hbi);
> > >      if (s->sector_num < 0) {
> > > -        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> > > -        s->sector_num = hbitmap_iter_next(&s->hbi);
> > > +        bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi);
> > > +        s->sector_num = hbitmap_iter_next(s->hbi);
> > > >          trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
> > >          assert(s->sector_num >= 0);
> > >      }
> > > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration
> (MirrorBlockJob *s)
> > >           */
> > >          if (next_sector > hbitmap_next_sector
> > >              && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
> > > -            hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
> > > +            hbitmap_next_sector = hbitmap_iter_next(s->hbi);
> > >          }
> > >
> > >          next_sector += sectors_per_chunk;
> > > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration
> (MirrorBlockJob *s)
> > >      s->sectors_in_flight += nb_sectors;
> > >      trace_mirror_one_iteration(s, sector_num, nb_sectors);
> > >
> > > -    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> > > -                                      nb_sectors, &pnum);
> > > -    if (ret < 0 || pnum < nb_sectors ||
> > > -            (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> > > -        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> > > -                       mirror_read_complete, op);
> > > -    } else if (ret & BDRV_BLOCK_ZERO) {
> > > -        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > > -                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > > -                              mirror_write_complete, op);
> > > +    if (s->zero_cycle) {
> > > > +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> > > +        if (!(ret & BDRV_BLOCK_ZERO)) {
> > > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > > +                                  mirror_write_complete, op);
> > > +        }
> >
> > It seems to be expected that this function always involves an AIO
> > request and the completion event is what helps making progress. For the
> > BDRV_BLOCK_ZERO case, we don't do that however. I'm not sure what
> > exactly this means, but at least I think we are applying block job
> > throttling to doing nothing with some areas of the image.
> >
> > >      } else {
> > > -        assert(!(ret & BDRV_BLOCK_DATA));
> > > -        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> > > -                         mirror_write_complete, op);
> > > +        ret = bdrv_get_block_status_above(source, NULL, sector_num,
> > > +                                          nb_sectors, &pnum);
> > > +        if (ret < 0 || pnum < nb_sectors ||
> > > +                (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> > > +            bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> > > +                           mirror_read_complete, op);
> > > +        } else if (ret & BDRV_BLOCK_ZERO) {
> > > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > > +                                  mirror_write_complete, op);
> > > +        } else {
> > > +            assert(!(ret & BDRV_BLOCK_DATA));
> > > +            bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> > > +                             mirror_write_complete, op);
> > > +        }
> > >      }
> > >      return delay_ns;
> > >  }
> >
> > Kevin
> >
>
Kevin Wolf Sept. 29, 2015, 8:39 a.m. UTC | #13
Am 28.09.2015 um 19:32 hat Max Reitz geschrieben:
> On 28.09.2015 05:29, Jeff Cody wrote:
> > During mirror, if the target device does not have support zero
> > initialization, a mirror may result in a corrupt image.
> > 
> > For instance, on mirror to a host device with format = raw, whatever
> > random data is on the target device will still be there for unallocated
> > sectors.
> > 
> > This is because during the mirror, we set the dirty bitmap to copy only
> > sectors allocated above 'base'.  In the case of target devices where we
> > cannot assume unallocated sectors will be read as zeroes, we need to
> > explicitely zero out this data.
> > 
> > In order to avoid zeroing out all sectors of the target device prior to
> > mirroring, we do zeroing as part of the block job.  A second dirty
> > bitmap cache is created, to track sectors that are unallocated above
> > 'base'.  These sectors are then checked for status of BDRV_BLOCK_ZERO
> > on the target - if they are not, then zeroes are explicitly written.
> > 
> > This only occurs under two conditions:
> > 
> >     1. 'mode' != "existing"
> >     2. bdrv_has_zero_init(target) == NULL
> > 
> > We perform the mirroring through mirror_iteration() as before, except
> > in two passes.  If the above two conditions are met, the first pass
> > is using the bitmap tracking unallocated sectors, to write the needed
> > zeroes.  Then, the second pass is performed, to mirror the actual data
> > as before.
> > 
> > If the above two conditions are not met, then the first pass is skipped,
> > and only the second pass (the one with the actual data) is performed.
> > 
> > Signed-off-by: Jeff Cody <jcody@redhat.com>

> > @@ -737,6 +787,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
> >          return;
> >      }
> >  
> > +    s->zero_unallocated = !existing && !bdrv_has_zero_init(target);
> 
> I think this should be set only if we're doing a full mirror operation.
> For instance, I could do a none, top or incremental mirror to a new
> qcow2 file, which would give it a backing file, obviously. You're lucky
> in that qcow2 claims to always have zero initialization, when this is in
> fact not true (someone's ought to fix that...): With a backing file, an
> overlay file just cannot have zero initialization, it's impossible
> (well, unless the backing file is completely zero).

bdrv_has_zero_init() takes care of that, in theory. The "problem" here
is that the target is opened with BDRV_O_NO_BACKING, so the block layer
doesn't consider this an image with a backing file.

Is there anything better than bs->backing_hd that we could check?

Kevin
Paolo Bonzini Sept. 29, 2015, 8:42 a.m. UTC | #14
On 29/09/2015 10:10, Kevin Wolf wrote:
>> When mode == 'existing' for a shallow mirror (sync != 'full'),
>> that is the caller stating that the guest-visible contents of the
>> destination match the guest-visible contents of the backing
>> image.  The only sectors to be copied are those that differ from
>> the backing file, and we should not be zeroing unrelated sectors
>> because the user has already promised they have the same
>> guest-visible content as the backing image would report.
> 
> Where is this promise documented? I wasn't aware of it and can't
> seem to find it in the QAPI documentation of drive-mirror.

I don't think it is really a promise, but it's the only sensible way
to use this combination.

Paolo

>> When mode == 'existing' for a full mirror (sync == 'full'), that
>> is the caller stating that they want every single sector of the
>> destination written to hold the current state of the source (of
>> course, allowing for optimizations such as skipping the write
>> where the contents will read back the same as if the write had
>> been performed).
>> 
>> I think Paolo is right: we care about zeroing unallocated sectors
>> for sync == 'full', regardless of whether mode == 'existing'.
> 
> I agree.
> 
>> I also think the reason Jeff confused it for mode == 'existing'
>> is that the other modes let qemu create the file, but qemu does
>> not create block devices (the only way to mirror to a block
>> device is via mode == 'existing'), and it is primarily block
>> devices where zero init is not guaranteed.
> 
> 'qemu-img create' works on block devices (even though for raw it
> doesn't do more than checking if it's large enough; but for qcow2,
> it's obvious that it's necessary), so I'm pretty sure that mode !=
> 'existing' works on them as well.
> 
> Kevin
>
Kevin Wolf Sept. 29, 2015, 9:35 a.m. UTC | #15
Am 29.09.2015 um 10:42 hat Paolo Bonzini geschrieben:
> 
> 
> On 29/09/2015 10:10, Kevin Wolf wrote:
> >> When mode == 'existing' for a shallow mirror (sync != 'full'),
> >> that is the caller stating that the guest-visible contents of the
> >> destination match the guest-visible contents of the backing
> >> image.  The only sectors to be copied are those that differ from
> >> the backing file, and we should not be zeroing unrelated sectors
> >> because the user has already promised they have the same
> >> guest-visible content as the backing image would report.
> > 
> > Where is this promise documented? I wasn't aware of it and can't
> > seem to find it in the QAPI documentation of drive-mirror.
> 
> I don't think it is really a promise, but it's the only sensible way
> to use this combination.

The caller could be copying the backing file in the background and it
may not yet be finished. So I don't think we can rely on a promise that
isn't explicitly mentioned anywhere. We don't do this now, but assuming
the promise means that we could e.g. read the backing file in order to
optimise sparseness in the target (if it happens to have the same data
as its backing file) - and I don't think this would be valid with our
currently documented API.

Anyway, the conclusion that we shouldn't zero unrelated sectors is still
right. But it's because we document which sectors we copy, not because
we can make assumptions about the user.

Kevin

> Paolo
> 
> >> When mode == 'existing' for a full mirror (sync == 'full'), that
> >> is the caller stating that they want every single sector of the
> >> destination written to hold the current state of the source (of
> >> course, allowing for optimizations such as skipping the write
> >> where the contents will read back the same as if the write had
> >> been performed).
> >> 
> >> I think Paolo is right: we care about zeroing unallocated sectors
> >> for sync == 'full', regardless of whether mode == 'existing'.
> > 
> > I agree.
> > 
> >> I also think the reason Jeff confused it for mode == 'existing'
> >> is that the other modes let qemu create the file, but qemu does
> >> not create block devices (the only way to mirror to a block
> >> device is via mode == 'existing'), and it is primarily block
> >> devices where zero init is not guaranteed.
> > 
> > 'qemu-img create' works on block devices (even though for raw it
> > doesn't do more than checking if it's large enough; but for qcow2,
> > it's obvious that it's necessary), so I'm pretty sure that mode !=
> > 'existing' works on them as well.
> > 
> > Kevin
> >
Paolo Bonzini Sept. 29, 2015, 10:52 a.m. UTC | #16
On 29/09/2015 11:35, Kevin Wolf wrote:
> The caller could be copying the backing file in the background and it
> may not yet be finished.

Yes, and this is permitted (the destination file of mirroring is opened
with BDRV_O_NO_BACKING).

Some more assumptions arise when block-job-complete is invoked, because
at this point the content must not change under the guest's feet.
Because block-job-complete does bdrv_open_backing_file on the
destination, for sync!='full' it means that either 1) the image has no
backing file, but it starts with the content of the backing file or 2)
the image's backing file is complete at the time block-job-complete is
invoked.

For mode!='existing' it is always case (2), and the backing file is
complete all the time; for mode=='existing' the backing file could be
copied in the background, and case (1) could happen as well.  An example
of case (1) is replacing sync=='full' with a "fast copy" of the backing
file (e.g. via btrfs's COW copies) and sync=='top'.  This should be valid.

Of course, if block-job-complete is never called, all bets are off.

> We don't do this now, but assuming
> the promise means that we could e.g. read the backing file in order to
> optimise sparseness in the target (if it happens to have the same data
> as its backing file) - and I don't think this would be valid with our
> currently documented API.

Accessing the backing file of the target is never valid indeed.

> Anyway, the conclusion that we shouldn't zero unrelated sectors is still
> right. But it's because we document which sectors we copy, not because
> we can make assumptions about the user.

Right.

Paolo
Paolo Bonzini Sept. 29, 2015, 2:47 p.m. UTC | #17
On 29/09/2015 10:39, Kevin Wolf wrote:
> bdrv_has_zero_init() takes care of that, in theory. The "problem"
> here is that the target is opened with BDRV_O_NO_BACKING, so the
> block layer doesn't consider this an image with a backing file.

I think bdrv_has_zero_init() is working right. If you read the qcow2
file as it was opened (i.e. with BDRV_O_NO_BACKING), unallocated areas
will indeed read as zeroes.

Of course if the file is opened with BDRV_O_NO_BACKING but does have a
backing file, you ought not to read unallocated areas at all.

So it's not the answer (of bdrv_has_zero_init) that is wrong, but the
question that was not well-specified.

> Is there anything better than bs->backing_hd that we could check?

It's simply sync == 'full', I think.  Then the problematic case never
even reaches bdrv_has_zero_init.
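
Roughly like this (a sketch with made-up names; the probe below merely
stands in for a zero-init query on the target and counts its calls, it is
not the block layer function):

```c
#include <stdbool.h>

/* Hypothetical probe standing in for a zero-init query on the target.
 * The counter makes the short-circuit observable. */
static int toy_probe_calls;

static bool toy_target_has_zero_init(void)
{
    toy_probe_calls++;
    return false;   /* model a raw host device with stale contents */
}

/* Zero unallocated sectors only for a full sync.  For the shallow modes
 * the probe is never consulted, so a target opened without its backing
 * file cannot give a misleading answer here. */
static bool toy_zero_unallocated(bool sync_is_full)
{
    return sync_is_full && !toy_target_has_zero_init();
}
```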

Paolo
Jeff Cody Sept. 30, 2015, 2:43 p.m. UTC | #18
On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote:
> 
> 
> On 29/09/2015 11:35, Kevin Wolf wrote:
> > The caller could be copying the backing file in the background and it
> > may not yet be finished.
> 
> Yes, and this is permitted (the destination file of mirroring is opened
> with BDRV_O_NO_BACKING).
> 
> Some more assumptions arise when block-job-complete is invoked, because
> at this point the content must not change under the guest's feet.
> Because block-job-complete does bdrv_open_backing_file on the
> destination, for sync!='full' it means that either 1) the image has no
> backing file, but it starts with the content of the backing file or 2)
> the image's backing file is complete at the time block-job-complete is
> invoked.
> 
> For mode!='existing' it is always case (2), and the backing file is
> complete all the time; for mode=='existing' the backing file could be
> copied in the background, and case (1) could happen as well.  An example
> of case (1) is replacing sync=='full' with a "fast copy" of the backing
> file (e.g. via btrfs's COW copies) and sync=='top'.  This should be valid.

One issue is that QEMU will do mode!='existing' && sync!='full' for
drivers that do not support backing files (raw host devices, for
instance).  We could refuse to start a mirror in the case of:

    mode != 'existing' && sync != 'full' && !target->drv->supports_backing

Alternatively, we could do the two-pass zero approach in this patch,
except under the following conditions:

    sync == 'full' || (mode != 'existing' && !target->drv->supports_backing)

(In the sync == 'full' case, we could also just queue all sectors, as
Kevin suggested)
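
The two conditions above can be written out as predicates. This is only a
sketch: MirrorRequest, SyncMode and both function names are hypothetical
stand-ins for the QMP arguments and for target->drv->supports_backing, not
existing QEMU types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the QMP arguments; not QEMU types. */
typedef enum { SYNC_FULL, SYNC_TOP, SYNC_NONE } SyncMode;

typedef struct {
    SyncMode sync;
    bool mode_existing;           /* mode == 'existing' */
    bool target_supports_backing; /* target->drv->supports_backing */
} MirrorRequest;

/* First option: refuse to start the mirror for this combination. */
static bool mirror_should_refuse(const MirrorRequest *r)
{
    assert(r != NULL);
    return !r->mode_existing && r->sync != SYNC_FULL &&
           !r->target_supports_backing;
}

/* Second option: the cases in which the two-pass zeroing is skipped. */
static bool mirror_skips_zero_pass(const MirrorRequest *r)
{
    assert(r != NULL);
    return r->sync == SYNC_FULL ||
           (!r->mode_existing && !r->target_supports_backing);
}
```

For a raw host device with sync='top' and a freshly created target, the first
predicate is true, so that request would be rejected under the first option.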

> 
> Of course, if block-job-complete is never called, all bets are off.
> 
> > We don't do this now, but assuming
> > the promise means that we could e.g. read the backing file in order to
> > optimise sparseness in the target (if it happens to have the same data
> > as its backing file) - and I don't think this would be valid with our
> > currently documented API.
> 
> Accessing the backing file of the target is never valid indeed.
> 
> > Anyway, the conclusion that we shouldn't zero unrelated sectors is still
> > right. But it's because we document which sectors we copy, not because
> > we can make assumptions about the user.
> 
> Right.
> 
> Paolo
Jeff Cody Sept. 30, 2015, 3:11 p.m. UTC | #19
On Mon, Sep 28, 2015 at 04:23:16PM +0100, Stefan Hajnoczi wrote:
> On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote:
> > +    if (s->zero_cycle) {
> > +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> > +        if (!(ret & BDRV_BLOCK_ZERO)) {
> > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > +                                  mirror_write_complete, op);
> 
> mirror_write_complete will advance s->common.offset.  Won't the progress
> be incorrect if we do that for both zeroing and regular mirroring?

Good point.  However, is it really wrong to count it in the progress,
if we do the zero mirror pass?  I
Paolo Bonzini Sept. 30, 2015, 3:16 p.m. UTC | #20
On 30/09/2015 16:43, Jeff Cody wrote:
> One issue is that QEMU will do mode!='existing' && sync!='full' for
> drivers that do not support backing files (raw host devices, for
> instance).

Yup, this can be used to get a mirror of future operations (the idea was
to support things such as antiviruses, where the antivirus connects to
QEMU via NBD).

I think it's okay to ignore this case, since the resulting image is
bogus anyway.  What is interesting is only the stream of writes which
you can get through NBD.

Paolo
Kevin Wolf Sept. 30, 2015, 3:26 p.m. UTC | #21
Am 30.09.2015 um 16:43 hat Jeff Cody geschrieben:
> On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 29/09/2015 11:35, Kevin Wolf wrote:
> > > The caller could be copying the backing file in the background and it
> > > may not yet be finished.
> > 
> > Yes, and this is permitted (the destination file of mirroring is opened
> > with BDRV_O_NO_BACKING).
> > 
> > Some more assumptions arise when block-job-complete is invoked, because
> > at this point the content must not change under the guest's feet.
> > Because block-job-complete does bdrv_open_backing_file on the
> > destination, for sync!='full' it means that either 1) the image has no
> > backing file, but it starts with the content of the backing file or 2)
> > the image's backing file is complete at the time block-job-complete is
> > invoked.
> > 
> > For mode!='existing' it is always case (2), and the backing file is
> > complete all the time; for mode=='existing' the backing file could be
> > copied in the background, and case (1) could happen as well.  An example
> > of case (1) is replacing sync=='full' with a "fast copy" of the backing
> > file (e.g. via btrfs's COW copies) and sync=='top'.  This should be valid.
> 
> One issue is that QEMU will do mode!='existing' && sync!='full' for
> drivers that do not support backing files (raw host devices, for
> instance).  We could refuse to start a mirror in the case of:
> 
>     mode != 'existing' && sync != 'full' && !target->drv->supports_backing
> 
> Alternatively, we could do the two-pass zero approach in this patch,
> except under the following conditions:
> 
>     sync == 'full' || (mode != 'existing' && !target->drv->supports_backing)
> 
> (In the sync == 'full' case, we could also just queue all sectors, as
> Kevin suggested)

I don't think that mode == 'existing' should play any role in the
behaviour of any block job. There's no reason why doing an external
'qemu-img create' should make it do anything different compared to
images created using the monitor.

Kevin
Kevin Wolf Sept. 30, 2015, 3:28 p.m. UTC | #22
Am 30.09.2015 um 17:11 hat Jeff Cody geschrieben:
> On Mon, Sep 28, 2015 at 04:23:16PM +0100, Stefan Hajnoczi wrote:
> > On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote:
> > > +    if (s->zero_cycle) {
> > > +        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
> > > +        if (!(ret & BDRV_BLOCK_ZERO)) {
> > > +            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> > > +                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> > > +                                  mirror_write_complete, op);
> > 
> > mirror_write_complete will advance s->common.offset.  Won't the progress
> > be incorrect if we do that for both zeroing and regular mirroring?
> 
> Good point.  However, is it really wrong to count it in the progress,
> if we do the zero mirror pass?  I

It's wrong as long as you increment the progress (offset), but don't
consider it in the expected value for completion (length).
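
As a rough illustration of that invariant, assuming a toy counter pair rather
than QEMU's real job structure (JobProgress, progress_expect and progress_done
are hypothetical names): any work that will advance offset has to be added to
len first, whether it comes from the zero pass or from the data pass.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a job's progress counters (hypothetical, not QEMU's BlockJob). */
typedef struct {
    int64_t offset; /* work completed so far */
    int64_t len;    /* work expected in total */
} JobProgress;

/* Announce work before it is issued, so that len accounts for it. */
static void progress_expect(JobProgress *p, int64_t bytes)
{
    assert(bytes >= 0);
    p->len += bytes;
}

/* Record completed work; offset must never overtake len. */
static void progress_done(JobProgress *p, int64_t bytes)
{
    assert(bytes >= 0);
    p->offset += bytes;
    assert(p->offset <= p->len);
}
```

If the zero pass called progress_done() without a matching progress_expect(),
the second assertion would fire, which is the mismatch described above.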

Kevin
Jeff Cody Sept. 30, 2015, 4:02 p.m. UTC | #23
On Wed, Sep 30, 2015 at 05:26:28PM +0200, Kevin Wolf wrote:
> Am 30.09.2015 um 16:43 hat Jeff Cody geschrieben:
> > On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 29/09/2015 11:35, Kevin Wolf wrote:
> > > > The caller could be copying the backing file in the background and it
> > > > may not yet be finished.
> > > 
> > > Yes, and this is permitted (the destination file of mirroring is opened
> > > with BDRV_O_NO_BACKING).
> > > 
> > > Some more assumptions arise when block-job-complete is invoked, because
> > > at this point the content must not change under the guest's feet.
> > > Because block-job-complete does bdrv_open_backing_file on the
> > > destination, for sync!='full' it means that either 1) the image has no
> > > backing file, but it starts with the content of the backing file or 2)
> > > the image's backing file is complete at the time block-job-complete is
> > > invoked.
> > > 
> > > For mode!='existing' it is always case (2), and the backing file is
> > > complete all the time; for mode=='existing' the backing file could be
> > > copied in the background, and case (1) could happen as well.  An example
> > > of case (1) is replacing sync=='full' with a "fast copy" of the backing
> > > file (e.g. via btrfs's COW copies) and sync=='top'.  This should be valid.
> > 
> > One issue is that QEMU will do mode!='existing' && sync!='full' for
> > drivers that do not support backing files (raw host devices, for
> > instance).  We could refuse to start a mirror in the case of:
> > 
> >     mode != 'existing' && sync != 'full' && !target->drv->supports_backing
> > 
> > Alternatively, we could do the two-pass zero approach in this patch,
> > except under the following conditions:
> > 
> >     sync == 'full' || (mode != 'existing' && !target->drv->supports_backing)
> > 
> > (In the sync == 'full' case, we could also just queue all sectors, as
> > Kevin suggested)
> 
> I don't think that mode == 'existing' should play any role in the
> behaviour of any block job. There's no reason why doing an external
> 'qemu-img create' should make it do anything different compared to
> images created using the monitor.
>

As a general rule for blockjobs, I disagree.

Right away, there is a key difference: we don't know that the image is
(or should be) empty.  With mode != "existing", we know the image
should be empty, since we just created it (although for a host device,
it may have extraneous data in it).  So I think it is not so much what
we can assume about an existing image, as it is what we cannot assume.
And that could potentially influence some block jobs.

That said, I think I agree with simplifying the mirror case, as you
suggested earlier.  Namely, just adding all sectors into the dirty
bitmap when sync=='full'. That approach obviates the need for patches
1 & 2, and makes the single resulting patch pretty small.
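
A minimal sketch of that simplification, assuming a toy per-chunk bool array
in place of the real dirty bitmap (init_dirty_map and its parameters are
hypothetical, not QEMU functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-chunk maps (hypothetical; QEMU tracks this in a BdrvDirtyBitmap). */
static void init_dirty_map(bool *dirty, const bool *allocated,
                           size_t nb_chunks, bool sync_full)
{
    assert(dirty != NULL && allocated != NULL);
    for (size_t i = 0; i < nb_chunks; i++) {
        /* sync == 'full': queue every chunk, so unallocated chunks are
         * written as well and stale data on the target is overwritten.
         * Otherwise only the allocated chunks are queued. */
        dirty[i] = sync_full || allocated[i];
    }
}
```

With sync_full set, no separate zeroing pass is needed, because every chunk
goes through the normal mirror path.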

The case of:

    sync != 'full' &&
    mode != 'existing' &&
    !target->drv->supports_backing &&
    !bdrv_has_zero_init(target)

will result in an image with possible extraneous data, but I think I
agree with Paolo that it is either A) unimportant for the use case, or
B) user error.
Paolo Bonzini Sept. 30, 2015, 4:06 p.m. UTC | #24
On 30/09/2015 18:02, Jeff Cody wrote:
> As a general rule for blockjobs, I disagree.
> 
> Right away, there is a key difference: we don't know that the image is
> (or should be) empty.

Not necessarily empty.  sync='top' && mode='existing' &&
!target->backing_file, for example, makes sense if the target is a copy
of source->bs.

In fact, commit of the active layer is almost exactly a mode='existing'
drive-mirror operation.

But if you use mode == 'existing', and don't provide an image that
follows the rules, it's garbage-in garbage-out.  The sequence of
operations makes sense, but the resulting image does not.

Paolo

> With mode != "existing", we know the image
> should be empty, since we just created it (although for a host device,
> it may have extraneous data in it).  So I think it is not so much what
> we can assume about an existing image, as it is what we cannot assume.
> And that could potentially influence some block jobs.
Kevin Wolf Oct. 1, 2015, 8:23 a.m. UTC | #25
Am 30.09.2015 um 18:06 hat Paolo Bonzini geschrieben:
> 
> 
> On 30/09/2015 18:02, Jeff Cody wrote:
> > As a general rule for blockjobs, I disagree.
> > 
> > Right away, there is a key difference: we don't know that the image is
> > (or should be) empty.
> 
> Not necessarily empty.  sync='top' && mode='existing' &&
> !target->backing_file, for example, makes sense if the target is a copy
> of source->bs.
> 
> In fact, commit of the active layer is almost exactly a mode='existing'
> drive-mirror operation.
> 
> But if you use mode == 'existing', and don't provide an image that
> follows the rules, it's garbage-in garbage-out.  The sequence of
> operations makes sense, but the resulting image does not.

Yes, that's the point I was trying to make. The behaviour of our block
jobs doesn't depend on the contents of the target image, and I don't see
a future block job where this would change.

So if the user thinks it makes sense to start with a non-empty image and
keep some non-zero data in places that the block job didn't touch,
that's their choice (and as you mentioned, there are examples where it
does indeed make sense).

The block job code need not care about that, it just does its job
without ever looking at the contents of the target image.

Kevin

Patch

diff --git a/block/mirror.c b/block/mirror.c
index 405e5c4..b599176 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -50,7 +50,9 @@  typedef struct MirrorBlockJob {
     int64_t bdev_length;
     unsigned long *cow_bitmap;
     BdrvDirtyBitmap *dirty_bitmap;
-    HBitmapIter hbi;
+    HBitmapIter zero_hbi;
+    HBitmapIter allocated_hbi;
+    HBitmapIter *hbi;
     uint8_t *buf;
     QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
     int buf_free_count;
@@ -60,6 +62,8 @@  typedef struct MirrorBlockJob {
     int sectors_in_flight;
     int ret;
     bool unmap;
+    bool zero_unallocated;
+    bool zero_cycle;
     bool waiting_for_io;
 } MirrorBlockJob;
 
@@ -166,10 +170,10 @@  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     int pnum;
     int64_t ret;
 
-    s->sector_num = hbitmap_iter_next(&s->hbi);
+    s->sector_num = hbitmap_iter_next(s->hbi);
     if (s->sector_num < 0) {
-        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
-        s->sector_num = hbitmap_iter_next(&s->hbi);
+        bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi);
+        s->sector_num = hbitmap_iter_next(s->hbi);
         trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
         assert(s->sector_num >= 0);
     }
@@ -287,7 +291,7 @@  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
          */
         if (next_sector > hbitmap_next_sector
             && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
-            hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
+            hbitmap_next_sector = hbitmap_iter_next(s->hbi);
         }
 
         next_sector += sectors_per_chunk;
@@ -300,25 +304,34 @@  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
 
-    ret = bdrv_get_block_status_above(source, NULL, sector_num,
-                                      nb_sectors, &pnum);
-    if (ret < 0 || pnum < nb_sectors ||
-            (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
-        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
-                       mirror_read_complete, op);
-    } else if (ret & BDRV_BLOCK_ZERO) {
-        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
-                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
-                              mirror_write_complete, op);
+    if (s->zero_cycle) {
+        ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum);
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
+                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
+                                  mirror_write_complete, op);
+        }
     } else {
-        assert(!(ret & BDRV_BLOCK_DATA));
-        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
-                         mirror_write_complete, op);
+        ret = bdrv_get_block_status_above(source, NULL, sector_num,
+                                          nb_sectors, &pnum);
+        if (ret < 0 || pnum < nb_sectors ||
+                (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
+            bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
+                           mirror_read_complete, op);
+        } else if (ret & BDRV_BLOCK_ZERO) {
+            bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
+                                  s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
+                                  mirror_write_complete, op);
+        } else {
+            assert(!(ret & BDRV_BLOCK_DATA));
+            bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
+                             mirror_write_complete, op);
+        }
     }
     return delay_ns;
 }
 
-static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
+static int mirror_do_iteration(MirrorBlockJob *s, uint64_t *last_pause_ns)
 {
     int ret;
 
@@ -347,7 +360,7 @@  static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
          * We do so every SLICE_TIME nanoseconds, or when there is an error,
          * or when the source is clean, whichever comes first.
          */
-        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns < SLICE_TIME
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - *last_pause_ns < SLICE_TIME
             && s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
             if (s->in_flight == MAX_IN_FLIGHT || s->buf_free_count == 0 ||
                 (cnt == 0 && s->in_flight > 0)) {
@@ -371,6 +384,14 @@  static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
                     goto immediate_exit;
                 }
             } else {
+
+                if (s->zero_cycle) {
+                    /* this is not the end of the streaming cycle,
+                     * if we are just filling in zeroes for unallocated
+                     * sectors prior to streaming the real data */
+                    goto immediate_exit;
+                }
+
                 /* We're out of the streaming phase.  From now on, if the job
                  * is cancelled we will actually complete all pending I/O and
                  * report completion.  This way, block-job-cancel will leave
@@ -419,7 +440,7 @@  static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns)
             s->common.cancelled = false;
             break;
         }
-        last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+        *last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }
 
 immediate_exit:
@@ -511,6 +532,15 @@  static void coroutine_fn mirror_run(void *opaque)
                                  checking for a NULL string */
     int ret = 0;
     int n;
+    BdrvDirtyBitmap *zero_dirty_bitmap;
+    BdrvDirtyBitmap *allocated_dirty_bitmap = s->dirty_bitmap;
+
+    zero_dirty_bitmap = bdrv_create_dirty_bitmap(s->target,
+                                                 s->granularity, NULL, true,
+                                                 NULL);
+    if (zero_dirty_bitmap == NULL) {
+        goto immediate_exit;
+    }
 
     if (block_job_is_cancelled(&s->common)) {
         goto immediate_exit;
@@ -588,14 +618,33 @@  static void coroutine_fn mirror_run(void *opaque)
             assert(n > 0);
             if (ret == 1) {
                 bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
+            } else if (s->zero_unallocated) {
+                bdrv_set_dirty_bitmap(zero_dirty_bitmap, sector_num, n);
             }
             sector_num += n;
         }
     }
 
-    bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
+    bdrv_dirty_iter_init(s->dirty_bitmap, &s->allocated_hbi);
 
-    ret = mirror_do_iteration(s, last_pause_ns);
+    if (s->zero_unallocated) {
+        bdrv_dirty_iter_init(zero_dirty_bitmap, &s->zero_hbi);
+        s->dirty_bitmap = zero_dirty_bitmap;
+        s->hbi = &s->zero_hbi;
+
+        s->zero_cycle = true;
+        ret = mirror_do_iteration(s, &last_pause_ns);
+        if (ret < 0) {
+            goto immediate_exit;
+        }
+
+        mirror_drain(s);
+        s->zero_cycle = false;
+    }
+
+    s->dirty_bitmap = allocated_dirty_bitmap;
+    s->hbi = &s->allocated_hbi;
+    ret = mirror_do_iteration(s, &last_pause_ns);
 
 immediate_exit:
     if (s->in_flight > 0) {
@@ -611,7 +660,8 @@  immediate_exit:
     qemu_vfree(s->buf);
     g_free(s->cow_bitmap);
     g_free(s->in_flight_bitmap);
-    bdrv_release_dirty_bitmap(bs, s->dirty_bitmap);
+    bdrv_release_dirty_bitmap(bs, allocated_dirty_bitmap);
+    bdrv_release_dirty_bitmap(NULL, zero_dirty_bitmap);
     bdrv_iostatus_disable(s->target);
 
     data = g_malloc(sizeof(*data));
@@ -702,7 +752,7 @@  static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
                              int64_t buf_size,
                              BlockdevOnError on_source_error,
                              BlockdevOnError on_target_error,
-                             bool unmap,
+                             bool unmap, bool existing,
                              BlockCompletionFunc *cb,
                              void *opaque, Error **errp,
                              const BlockJobDriver *driver,
@@ -737,6 +787,7 @@  static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
         return;
     }
 
+    s->zero_unallocated = !existing && !bdrv_has_zero_init(target);
     s->replaces = g_strdup(replaces);
     s->on_source_error = on_source_error;
     s->on_target_error = on_target_error;
@@ -767,7 +818,7 @@  void mirror_start(BlockDriverState *bs, BlockDriverState *target,
                   int64_t speed, uint32_t granularity, int64_t buf_size,
                   MirrorSyncMode mode, BlockdevOnError on_source_error,
                   BlockdevOnError on_target_error,
-                  bool unmap,
+                  bool unmap, bool existing,
                   BlockCompletionFunc *cb,
                   void *opaque, Error **errp)
 {
@@ -782,8 +833,8 @@  void mirror_start(BlockDriverState *bs, BlockDriverState *target,
     base = mode == MIRROR_SYNC_MODE_TOP ? bs->backing_hd : NULL;
     mirror_start_job(bs, target, replaces,
                      speed, granularity, buf_size,
-                     on_source_error, on_target_error, unmap, cb, opaque, errp,
-                     &mirror_job_driver, is_none_mode, base);
+                     on_source_error, on_target_error, unmap, existing, cb,
+                     opaque, errp, &mirror_job_driver, is_none_mode, base);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
@@ -830,7 +881,7 @@  void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
 
     bdrv_ref(base);
     mirror_start_job(bs, base, NULL, speed, 0, 0,
-                     on_error, on_error, false, cb, opaque, &local_err,
+                     on_error, on_error, false, false, cb, opaque, &local_err,
                      &commit_active_job_driver, false, base);
     if (local_err) {
         error_propagate(errp, local_err);
diff --git a/blockdev.c b/blockdev.c
index cb9f78d..c06ac60 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2816,7 +2816,7 @@  void qmp_drive_mirror(const char *device, const char *target,
                  has_replaces ? replaces : NULL,
                  speed, granularity, buf_size, sync,
                  on_source_error, on_target_error,
-                 unmap,
+                 unmap, mode == NEW_IMAGE_MODE_EXISTING,
                  block_job_cb, bs, &local_err);
     if (local_err != NULL) {
         bdrv_unref(target_bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 14ad4c3..21a8988 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -614,6 +614,7 @@  void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  * @on_source_error: The action to take upon error reading from the source.
  * @on_target_error: The action to take upon error writing to the target.
  * @unmap: Whether to unmap target where source sectors only contain zeroes.
+ * @existing: Whether target image is an existing image prior to the QMP cmd.
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
@@ -628,7 +629,7 @@  void mirror_start(BlockDriverState *bs, BlockDriverState *target,
                   int64_t speed, uint32_t granularity, int64_t buf_size,
                   MirrorSyncMode mode, BlockdevOnError on_source_error,
                   BlockdevOnError on_target_error,
-                  bool unmap,
+                  bool unmap, bool existing,
                   BlockCompletionFunc *cb,
                   void *opaque, Error **errp);
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index bb2189e..033afb4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -952,8 +952,10 @@ 
 #            broken Quorum files. (Since 2.1)
 #
 # @mode: #optional whether and how QEMU should create a new image, default is
-#        'absolute-paths'.
-#
+#        'absolute-paths'.  If mode != 'existing', and the target does not
+#        have zero init (sparseness), then the target image will have sectors
+#        zeroed out that correspond to sectors in an unallocated state in the
+#        source image.
 # @speed:  #optional the maximum speed, in bytes per second
 #
 # @sync: what parts of the disk image should be copied to the destination