[v4,3/8] mirror: Do zero write on target if sectors not allocated

Message ID 1432266060-22104-4-git-send-email-famz@redhat.com
State New

Commit Message

Fam Zheng May 22, 2015, 3:40 a.m. UTC
If the guest discards a source cluster, mirroring it with bdrv_aio_readv is
overkill. Some protocols zero on discard, in which case it's best to use
bdrv_aio_write_zeroes; otherwise, bdrv_aio_discard is enough.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/mirror.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Comments

Eric Blake May 22, 2015, 8:20 p.m. UTC | #1
On 05/21/2015 09:40 PM, Fam Zheng wrote:
> If guest discards a source cluster, mirroring with bdrv_aio_readv is overkill.
> Some protocols do zero upon discard, where it's best to use
> bdrv_aio_write_zeroes, otherwise, bdrv_aio_discard will be enough.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/mirror.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> +
> +    ret = bdrv_get_block_status(source, NULL, sector_num, nb_sectors, &pnum);

Ah, you are checking the entire chain for allocation, so if it is
unallocated through all layers, then the destination doesn't need to
allocate it either.  But is this the correct location to start with,
when the block-mirror is shallow?

> +    if (ret < 0 || pnum < nb_sectors ||
> +            (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO))) {
> +        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,

Do we still want to call bdrv_aio_readv() if ret < 0 (where it will
likely fail), or should this 'if' be broken into two clauses?

> +                       mirror_read_complete, op);
> +    } else if (ret & BDRV_BLOCK_ZERO) {
> +        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                              mirror_write_complete, op);
> +    } else {
> +        assert(!(ret & BDRV_BLOCK_ALLOCATED));
> +        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> +                         mirror_write_complete, op);
> +    }

I'm okay with what happens when you mirror to a flat image, as in
copying "base <- active" into "copy".  There, the copy can omit clusters
that are not allocated anywhere in the source chain, and can also
omit clusters if the source has all zero in the cluster, and the
destination would read back all zero even if the cluster is unallocated.

But I'm worried about a shallow copy.  If I start with "base <- active",
where "active" has an explicit zero cluster that is overwriting an
allocated non-zero cluster in "base", and I'm creating the shallow clone
to "base <- copy", then the default of 'unmap=true' says that
bdrv_aio_write_zeroes() may attempt to unmap the cluster in "copy".  At
which point, doesn't that mean that reading from "copy" will dredge up
the non-zero data from "base", which is NOT a faithful mirroring of
"active"?

Or symbolically, suppose I have this layout, with letters for non-zero
clusters, 0 for explicit zero clusters, and - for unallocated clusters:

base   AAA000---
active -0B-0B-0B   # Guest sees A0B00B00B

If I'm understanding your code correctly, a deep block-mirror will
create either:

copy   A-B--B--B   # Guest sees A0B00B00B, unmap was true, image is sparse

or

copy   A0B00B-0B   # Guest sees A0B00B00B, unmap was false, image is allocated

But a shallow block-mirror will cause:

base   AAA000---
copy   -0B-0B-0B   # Guest sees A0B00B00B, unmap was false

or

base   AAA000---
copy   --B--B--B   # Guest sees AAB00B00B, unmap was true

Whoops - unmapping a cluster in the destination which was all zeros in
the source caused corruption in what the guest sees.
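
The layouts above can be checked mechanically. What follows is a minimal toy model, not QEMU code: guest_view() is a hypothetical helper that resolves a two-layer chain written in the same notation (letters for non-zero clusters, 0 for explicit zero clusters, - for unallocated clusters), on the assumption that a cluster unallocated in every layer reads back as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of a two-layer backing chain (illustration only).
 * Each character is one cluster:
 *   letter = non-zero data, '0' = explicit zero cluster, '-' = unallocated.
 * 'out' must have room for strlen(top) + 1 bytes.
 */
static void guest_view(const char *base, const char *top, char *out)
{
    size_t n = strlen(top);

    assert(strlen(base) >= n);
    for (size_t i = 0; i < n; i++) {
        char c = top[i];
        if (c == '-') {
            c = base[i];   /* unallocated on top: fall through to the backing image */
        }
        if (c == '-') {
            c = '0';       /* unallocated in every layer: assumed to read as zero */
        }
        out[i] = c;
    }
    out[n] = '\0';
}
```

With base "AAA000---", resolving the source layer "-0B-0B-0B" gives A0B00B00B, while resolving the unmapped shallow copy "--B--B--B" gives AAB00B00B, which is the mismatch in the second cluster described above.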
Paolo Bonzini May 25, 2015, 2:36 p.m. UTC | #2
On 22/05/2015 05:40, Fam Zheng wrote:
> +    ret = bdrv_get_block_status(source, NULL, sector_num, nb_sectors, &pnum);
> +    if (ret < 0 || pnum < nb_sectors ||
> +            (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO))) {
> +        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> +                       mirror_read_complete, op);
> +    } else if (ret & BDRV_BLOCK_ZERO) {
> +        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
> +                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> +                              mirror_write_complete, op);
> +    } else {
> +        assert(!(ret & BDRV_BLOCK_ALLOCATED));
> +        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> +                         mirror_write_complete, op);
> +    }

This doesn't work if you have a backing file.  You want to test
BDRV_BLOCK_DATA, not BDRV_BLOCK_ALLOCATED.

On the other hand, if BDRV_BLOCK_ALLOCATED is nonzero, you need to
recurse on bs->backing_hd.  The logic is very similar to
bdrv_is_allocated_above, but you need to write bdrv_get_block_status_above.

Paolo
Paolo Bonzini May 25, 2015, 2:38 p.m. UTC | #3
On 22/05/2015 22:20, Eric Blake wrote:
> But I'm worried about a shallow copy.  If I start with "base <-
> active", where "active" has an explicit zero cluster that is
> overwriting an allocated non-zero cluster in "base", and I'm
> creating the shallow clone to "base <- copy", then the default of
> 'unmap=true' says that bdrv_aio_write_zeroes() may attempt to unmap
> the cluster in "copy".  At which point, doesn't that mean that
> reading from "copy" will dredge up the non-zero data from "base",
> which is NOT a faithful mirroring of "active"?

No, bdrv_aio_write_zeroes+BDRV_REQ_MAY_UNMAP only unmaps if it results
in zeroes.  In addition, unlike bdrv_aio_discard,
bdrv_aio_write_zeroes will do a real write of zeroes if [sector_num,
sector_num+nb_sectors) is not aligned to the disk's unmap granularity.

Paolo
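
The two conditions in this reply can be written down as a small decision function. This is a simplified sketch, not the block layer's actual code path: choose_zero_method(), the ZeroMethod labels and the unmap_reads_zero parameter are all hypothetical names, and the sketch treats a misaligned request as a plain write rather than splitting it into aligned and unaligned parts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome labels for this sketch; not QEMU identifiers. */
typedef enum { ZERO_BY_UNMAP, ZERO_BY_WRITE } ZeroMethod;

/*
 * Decide how a zero-write request could be satisfied.
 *   may_unmap        - the caller passed the "may unmap" hint
 *   unmap_reads_zero - unmapping this range would read back as zeroes
 *   granularity      - unmap granularity of the target, in sectors (> 0)
 */
static ZeroMethod choose_zero_method(int64_t sector_num, int nb_sectors,
                                     bool may_unmap, bool unmap_reads_zero,
                                     int64_t granularity)
{
    bool aligned;

    assert(granularity > 0);
    aligned = sector_num % granularity == 0 &&
              nb_sectors % granularity == 0;

    if (may_unmap && unmap_reads_zero && aligned) {
        return ZERO_BY_UNMAP;
    }
    /* Otherwise the request is satisfied by really writing zeroes. */
    return ZERO_BY_WRITE;
}
```

Under this model, the explicit zero cluster that shadows non-zero data in "base" has unmap_reads_zero == false, so it is written as zeroes instead of being unmapped.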
Paolo Bonzini May 25, 2015, 2:41 p.m. UTC | #4
On 25/05/2015 16:36, Paolo Bonzini wrote:
> 
> 
> On 22/05/2015 05:40, Fam Zheng wrote:
>> +    ret = bdrv_get_block_status(source, NULL, sector_num, nb_sectors, &pnum);
>> +    if (ret < 0 || pnum < nb_sectors ||
>> +            (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO))) {
>> +        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
>> +                       mirror_read_complete, op);
>> +    } else if (ret & BDRV_BLOCK_ZERO) {
>> +        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
>> +                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
>> +                              mirror_write_complete, op);
>> +    } else {
>> +        assert(!(ret & BDRV_BLOCK_ALLOCATED));
>> +        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
>> +                         mirror_write_complete, op);
>> +    }
> 
> This doesn't work if you have a backing file.  You want to test
> BDRV_BLOCK_DATA, not BDRV_BLOCK_ALLOCATED.
> 
> On the other hand, if BDRV_BLOCK_ALLOCATED is nonzero, you need to
> recurse on bs->backing_hd.  The logic is very similar to
> bdrv_is_allocated_above, but you need to write bdrv_get_block_status_above.

Oops, I totally missed the "NULL" in the first line.  Still, I think
BDRV_BLOCK_DATA is a better check than BDRV_BLOCK_ALLOCATED.

Paolo
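
One way to combine the two review suggestions in this thread (test the data flag instead of the allocated flag, and give ret < 0 its own clause) is sketched here. It is an illustration under stated assumptions, not a proposed replacement for the patch: classify(), MirrorAction and the SKETCH_BLOCK_* values are hypothetical stand-ins for the real status flags, and whether an error should abort or still fall back to a read remains the open question raised in comment #1.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag bits for this sketch; the real flags are defined by QEMU. */
#define SKETCH_BLOCK_DATA 0x01
#define SKETCH_BLOCK_ZERO 0x02

/* Hypothetical labels for the action the mirror iteration would take. */
typedef enum {
    MIRROR_ERROR,        /* status query failed; handled in its own clause */
    MIRROR_READ,         /* read from the source and copy the data */
    MIRROR_WRITE_ZEROES, /* write zeroes on the target */
    MIRROR_DISCARD       /* discard the range on the target */
} MirrorAction;

static MirrorAction classify(int64_t ret, int pnum, int nb_sectors)
{
    assert(nb_sectors > 0);

    if (ret < 0) {
        return MIRROR_ERROR;
    }
    if (pnum < nb_sectors) {
        return MIRROR_READ;          /* status does not cover the whole request */
    }
    if (ret & SKETCH_BLOCK_ZERO) {
        return MIRROR_WRITE_ZEROES;  /* known to read as zeroes */
    }
    if (ret & SKETCH_BLOCK_DATA) {
        return MIRROR_READ;          /* real, non-zero data */
    }
    return MIRROR_DISCARD;           /* no data and not known to be zero */
}
```

Apart from the separate error case, the branches mirror the order of the patch: a range flagged as zero goes to the zero-write path even if it also carries data, and only a range with neither flag is discarded.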

Patch

diff --git a/block/mirror.c b/block/mirror.c
index 85995b2..87f5ac4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -164,6 +164,8 @@  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
     uint64_t delay_ns = 0;
     MirrorOp *op;
+    int pnum;
+    int64_t ret;
 
     s->sector_num = hbitmap_iter_next(&s->hbi);
     if (s->sector_num < 0) {
@@ -290,8 +292,21 @@  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     s->in_flight++;
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
-    bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
-                   mirror_read_complete, op);
+
+    ret = bdrv_get_block_status(source, NULL, sector_num, nb_sectors, &pnum);
+    if (ret < 0 || pnum < nb_sectors ||
+            (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO))) {
+        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
+                       mirror_read_complete, op);
+    } else if (ret & BDRV_BLOCK_ZERO) {
+        bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
+                              s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
+                              mirror_write_complete, op);
+    } else {
+        assert(!(ret & BDRV_BLOCK_ALLOCATED));
+        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
+                         mirror_write_complete, op);
+    }
     return delay_ns;
 }