
[2/2] block/backup: fix fleecing scheme: use serialized writes

Message ID 20180703180751.243496-3-vsementsov@virtuozzo.com
State New
Series fix image fleecing

Commit Message

Vladimir Sementsov-Ogievskiy July 3, 2018, 6:07 p.m. UTC
The fleecing scheme works as follows: we want a kind of temporary snapshot
of the active drive A. We create a temporary image B, with B->backing = A.
Then we start backup(sync=none) from A to B. From this point, B reads as a
point-in-time snapshot of A (A continues to be the active drive, accepting
guest IO).

This scheme needs some additional synchronization between reads from B
and backup COW operations; otherwise, the following situation is
theoretically possible:

(assume B is qcow2 and the client is an NBD client reading from B)

1. The client starts reading, takes the qcow2 mutex in qcow2_co_preadv, and
   gets as far as L2 table loading (assume a cache miss).

2. A guest write arrives => backup COW => qcow2 write =>
   tries to take the qcow2 mutex => waits.

3. The L2 table is loaded, we see that the cluster is UNALLOCATED, go to
   "case QCOW2_CLUSTER_UNALLOCATED" and unlock the mutex before
   bdrv_co_preadv(bs->backing, ...).

4. With the mutex unlocked, the backup COW continues; the guest write
   finally finishes and changes the cluster in the active disk A.

5. Only now does bdrv_co_preadv(bs->backing, ...) actually run, and it
   reads the _new, updated_ data.

To avoid this, let's make all COW writes serializing, so that they do not
intersect with reads from B.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Comments

Kevin Wolf July 4, 2018, 10:39 a.m. UTC | #1
On 03.07.2018 20:07, Vladimir Sementsov-Ogievskiy wrote:
> Fleecing scheme works as follows: we want a kind of temporary snapshot
> of active drive A. We create temporary image B, with B->backing = A.
> Then we start backup(sync=none) from A to B. From this point, B reads
> as point-in-time snapshot of A (A continues to be active drive,
> accepting guest IO).
> 
> This scheme needs some additional synchronization between reads from B
> and backup COW operations, otherwise, the following situation is
> theoretically possible:
> 
> (assume B is qcow2, client is NBD client, reading from B)
> 
> 1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
>    goes up to l2 table loading (assume cache miss)
> 
> 2) guest write => backup COW => qcow2 write =>
>    try to take qcow2 mutex => waiting
> 
> 3. l2 table loaded, we see that cluster is UNALLOCATED, go to
>    "case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
>    bdrv_co_preadv(bs->backing, ...)
> 
> 4) aha, mutex unlocked, backup COW continues, and we finally finish
>    guest write and change cluster in our active disk A
> 
> 5. actually, do bdrv_co_preadv(bs->backing, ...) and read
>    _new updated_ data.
> 
> To avoid this, let's make all COW writes serializing, to not intersect
> with reads from B.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

I think this should work, even though we can still improve it a bit.

First, I don't think we need to disable copy offloading as long as we
add BDRV_REQ_SERIALISING support to bdrv_co_copy_range_internal(). The
thing that's a bit strange there is that there is only a single flags
parameter for both source and target. We may need to split that so
that we can pass BDRV_REQ_NO_SERIALISING for the source and
BDRV_REQ_SERIALISING for the target.
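
For illustration, the split could look roughly like this (just a sketch of
the idea, not a final prototype; the read_flags/write_flags names are made
up here):

    int coroutine_fn bdrv_co_copy_range(BdrvChild *src, uint64_t src_offset,
                                        BdrvChild *dst, uint64_t dst_offset,
                                        uint64_t bytes,
                                        BdrvRequestFlags read_flags,
                                        BdrvRequestFlags write_flags);

    /* backup could then ask for something like:
     *   read_flags  = BDRV_REQ_NO_SERIALISING   (source side)
     *   write_flags = BDRV_REQ_SERIALISING      (target side)
     */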

> +    /* Detect image fleecing */
> +    fleecing = sync_mode == MIRROR_SYNC_MODE_NONE && target->backing->bs == bs;

The other thing is that I'm not convinced of the condition here. This is
the narrowest thinkable condition to recognise the exact fleecing setup
we have in mind right now. However, it is not the condition to flag all
setups that are affected by the same problem.

I don't think sync_mode actually makes a meaningful difference here. If
someone wants to create a full backup and already access it while the
job is still in progress, they will still want it to be consistent.

It also doesn't make a difference whether the fleecing overlay has the
source directly attached as a backing node or whether there is a filter
node between them.

So while I'm not sure whether it really covers all interesting cases,
this condition would already be a lot better:

    fleecing = bdrv_chain_contains(target, bs);

Maybe rename the variable to serialise_target_writes because it's no
longer restricted to the exact fleecing case.

> +    if (fleecing) {
> +        if (compress) {
> +            error_setg(errp, "Image fleecing doesn't support compressed mode.");
> +            return NULL;
> +        }
> +    }

Why not fleecing && compress instead of a nested if? And why is
fleecing even at odds with compression?

Kevin
Vladimir Sementsov-Ogievskiy July 4, 2018, 12:31 p.m. UTC | #2
04.07.2018 13:39, Kevin Wolf wrote:
> On 03.07.2018 20:07, Vladimir Sementsov-Ogievskiy wrote:
>> Fleecing scheme works as follows: we want a kind of temporary snapshot
>> of active drive A. We create temporary image B, with B->backing = A.
>> Then we start backup(sync=none) from A to B. From this point, B reads
>> as point-in-time snapshot of A (A continues to be active drive,
>> accepting guest IO).
>>
>> This scheme needs some additional synchronization between reads from B
>> and backup COW operations, otherwise, the following situation is
>> theoretically possible:
>>
>> (assume B is qcow2, client is NBD client, reading from B)
>>
>> 1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
>>     goes up to l2 table loading (assume cache miss)
>>
>> 2) guest write => backup COW => qcow2 write =>
>>     try to take qcow2 mutex => waiting
>>
>> 3. l2 table loaded, we see that cluster is UNALLOCATED, go to
>>     "case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
>>     bdrv_co_preadv(bs->backing, ...)
>>
>> 4) aha, mutex unlocked, backup COW continues, and we finally finish
>>     guest write and change cluster in our active disk A
>>
>> 5. actually, do bdrv_co_preadv(bs->backing, ...) and read
>>     _new updated_ data.
>>
>> To avoid this, let's make all COW writes serializing, to not intersect
>> with reads from B.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> I think this should work, even though we can still improve it a bit.
>
> First, I don't think we need to disable copy offloading as long as we
> add BDRV_REQ_SERIALISING support to bdrv_co_copy_range_internal(). The
> thing that's a bit strange there is that there is only a single flags
> parameter for both source and target. We may need to split that so
> that we can pass BDRV_REQ_NO_SERIALISING for the source and
> BDRV_REQ_SERIALISING for the target.

Ok

>
>> +    /* Detect image fleecing */
>> +    fleecing = sync_mode == MIRROR_SYNC_MODE_NONE && target->backing->bs == bs;
> The other thing is that I'm not convinced of the condition here. This is
> the narrowest thinkable condition to recognise the exact fleecing setup
> we have in mind right now. However, it is not the condition to flag all
> setups that are affected by the same problem.
>
> I don't think sync_mode actually makes a meaningful difference here. If
> someone wants to create a full backup and already access it while the
> job is still in progress, they will still want it to be consistent.
>
> It also doesn't make a difference whether the fleecing overlay has the
> source directly attached as a backing node or whether there is a filter
> node between them.
>
> So while I'm not sure whether it really covers all interesting cases,
> this condition would already be a lot better:
>
>      fleecing = bdrv_chain_contains(target, bs);
>
> Maybe rename the variable to serialise_target_writes because it's no
> longer restricted to the exact fleecing case.

Ok

>
>> +    if (fleecing) {
>> +        if (compress) {
>> +            error_setg(errp, "Image fleecing doesn't support compressed mode.");
>> +            return NULL;
>> +        }
>> +    }
> Why not fleecing && compress instead of a nested if?

hmm, really strange

>   And why is
> fleecing even at odds with compression?

I thought that compressed writes went some other way; now I see that they
don't. I will drop the restriction.
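
So the detection in the respin would be roughly (an illustrative sketch,
using the rename you suggest; the exact code may differ):

    /* Serialise target writes whenever the target is above the source in
     * the backing chain (covers fleecing, with or without filter nodes in
     * between), independently of sync_mode. */
    bool serialise_target_writes = bdrv_chain_contains(target, bs);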

>
> Kevin

Thank you for the review, I'll resend soon.
Vladimir Sementsov-Ogievskiy July 4, 2018, 1:20 p.m. UTC | #3
04.07.2018 13:39, Kevin Wolf wrote:
> On 03.07.2018 20:07, Vladimir Sementsov-Ogievskiy wrote:
>> Fleecing scheme works as follows: we want a kind of temporary snapshot
>> of active drive A. We create temporary image B, with B->backing = A.
>> Then we start backup(sync=none) from A to B. From this point, B reads
>> as point-in-time snapshot of A (A continues to be active drive,
>> accepting guest IO).
>>
>> This scheme needs some additional synchronization between reads from B
>> and backup COW operations, otherwise, the following situation is
>> theoretically possible:
>>
>> (assume B is qcow2, client is NBD client, reading from B)
>>
>> 1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
>>     goes up to l2 table loading (assume cache miss)
>>
>> 2) guest write => backup COW => qcow2 write =>
>>     try to take qcow2 mutex => waiting
>>
>> 3. l2 table loaded, we see that cluster is UNALLOCATED, go to
>>     "case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
>>     bdrv_co_preadv(bs->backing, ...)
>>
>> 4) aha, mutex unlocked, backup COW continues, and we finally finish
>>     guest write and change cluster in our active disk A
>>
>> 5. actually, do bdrv_co_preadv(bs->backing, ...) and read
>>     _new updated_ data.
>>
>> To avoid this, let's make all COW writes serializing, to not intersect
>> with reads from B.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> I think this should work, even though we can still improve it a bit.
>
> First, I don't think we need to disable copy offloading as long as we
> add BDRV_REQ_SERIALISING support to bdrv_co_copy_range_internal(). The
> thing that's a bit strange there is that there is only a single flags
> parameter for both source and target. We may need to split that so
> that we can pass BDRV_REQ_NO_SERIALISING for the source and
> BDRV_REQ_SERIALISING for the target.

Here is an interesting point: why does copy_range skip waiting for
serialising requests on the write side when BDRV_REQ_NO_SERIALISING is
passed? I think that flag should only influence the read, shouldn't it?
Is it a bug?

Another question: I don't see where request alignment is handled in the
copy_range path, like it is in bdrv_co_pwritev/preadv?

>
>> +    /* Detect image fleecing */
>> +    fleecing = sync_mode == MIRROR_SYNC_MODE_NONE && target->backing->bs == bs;
> The other thing is that I'm not convinced of the condition here. This is
> the narrowest thinkable condition to recognise the exact fleecing setup
> we have in mind right now. However, it is not the condition to flag all
> setups that are affected by the same problem.
>
> I don't think sync_mode actually makes a meaningful difference here. If
> someone wants to create a full backup and already access it while the
> job is still in progress, they will still want it to be consistent.
>
> It also doesn't make a difference whether the fleecing overlay has the
> source directly attached as a backing node or whether there is a filter
> node between them.
>
> So while I'm not sure whether it really covers all interesting cases,
> this condition would already be a lot better:
>
>      fleecing = bdrv_chain_contains(target, bs);
>
> Maybe rename the variable to serialise_target_writes because it's no
> longer restricted to the exact fleecing case.
>
>> +    if (fleecing) {
>> +        if (compress) {
>> +            error_setg(errp, "Image fleecing doesn't support compressed mode.");
>> +            return NULL;
>> +        }
>> +    }
> Why not fleecing && compress instead of a nested if? And why is
> fleecing even at odds with compression?
>
> Kevin

Patch

diff --git a/block/backup.c b/block/backup.c
index 81895ddbe2..ab46a7d43d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -47,6 +47,8 @@  typedef struct BackupBlockJob {
     HBitmap *copy_bitmap;
     bool use_copy_range;
     int64_t copy_range_size;
+
+    bool fleecing;
 } BackupBlockJob;
 
 static const BlockJobDriver backup_job_driver;
@@ -102,6 +104,7 @@  static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
     QEMUIOVector qiov;
     BlockBackend *blk = job->common.blk;
     int nbytes;
+    int sflags = job->fleecing ? BDRV_REQ_SERIALISING : 0;
 
     hbitmap_reset(job->copy_bitmap, start / job->cluster_size, 1);
     nbytes = MIN(job->cluster_size, job->len - start);
@@ -124,11 +127,12 @@  static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
 
     if (qemu_iovec_is_zero(&qiov)) {
         ret = blk_co_pwrite_zeroes(job->target, start,
-                                   qiov.size, BDRV_REQ_MAY_UNMAP);
+                                   qiov.size, sflags | BDRV_REQ_MAY_UNMAP);
     } else {
         ret = blk_co_pwritev(job->target, start,
                              qiov.size, &qiov,
-                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+                             job->compress ? BDRV_REQ_WRITE_COMPRESSED :
+                                             sflags);
     }
     if (ret < 0) {
         trace_backup_do_cow_write_fail(job, start, ret);
@@ -614,6 +618,7 @@  BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     BlockDriverInfo bdi;
     BackupBlockJob *job = NULL;
     int ret;
+    bool fleecing;
 
     assert(bs);
     assert(target);
@@ -668,6 +673,15 @@  BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
         return NULL;
     }
 
+    /* Detect image fleecing */
+    fleecing = sync_mode == MIRROR_SYNC_MODE_NONE && target->backing->bs == bs;
+    if (fleecing) {
+        if (compress) {
+            error_setg(errp, "Image fleecing doesn't support compressed mode.");
+            return NULL;
+        }
+    }
+
     len = bdrv_getlength(bs);
     if (len < 0) {
         error_setg_errno(errp, -len, "unable to get length for '%s'",
@@ -700,6 +714,7 @@  BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     job->sync_bitmap = sync_mode == MIRROR_SYNC_MODE_INCREMENTAL ?
                        sync_bitmap : NULL;
     job->compress = compress;
+    job->fleecing = fleecing;
 
     /* If there is no backing file on the target, we cannot rely on COW if our
      * backup cluster size is smaller than the target cluster size. Even for
@@ -727,7 +742,7 @@  BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     } else {
         job->cluster_size = MAX(BACKUP_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
     }
-    job->use_copy_range = true;
+    job->use_copy_range = !fleecing;
     job->copy_range_size = MIN_NON_ZERO(blk_get_max_transfer(job->common.blk),
                                         blk_get_max_transfer(job->target));
     job->copy_range_size = MAX(job->cluster_size,