From patchwork Fri Jan 18 16:22:48 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Paolo Bonzini <pbonzini@redhat.com>
X-Patchwork-Id: 213669
Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.gnu.org (lists.gnu.org [208.118.235.17])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by ozlabs.org (Postfix) with ESMTPS id 6AC972C0084
	for <incoming@patchwork.ozlabs.org>;
	Sat, 19 Jan 2013 03:23:04 +1100 (EST)
Received: from localhost ([::1]:33473 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from
	<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>)
	id 1TwEic-0005s1-Io
	for incoming@patchwork.ozlabs.org; Fri, 18 Jan 2013 11:23:02 -0500
Received: from eggs.gnu.org ([208.118.235.92]:51253)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1TwEiU-0005rX-6F
	for qemu-devel@nongnu.org; Fri, 18 Jan 2013 11:22:56 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1TwEiS-0003CD-OH
	for qemu-devel@nongnu.org; Fri, 18 Jan 2013 11:22:54 -0500
Received: from mx1.redhat.com ([209.132.183.28]:20236)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1TwEiS-0003C9-GH
	for qemu-devel@nongnu.org; Fri, 18 Jan 2013 11:22:52 -0500
Received: from int-mx10.intmail.prod.int.phx2.redhat.com
	(int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r0IGMpNO026559
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <qemu-devel@nongnu.org>; Fri, 18 Jan 2013 11:22:51 -0500
Received: from yakj.usersys.redhat.com (ovpn-112-26.ams2.redhat.com
	[10.36.112.26])
	by int-mx10.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with
	ESMTP id r0IGMnEq024792; Fri, 18 Jan 2013 11:22:49 -0500
Message-ID: <50F976D8.60303@redhat.com>
Date: Fri, 18 Jan 2013 17:22:48 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Kevin Wolf <kwolf@redhat.com>
References: <1358357479-7912-1-git-send-email-pbonzini@redhat.com>
	<1358357479-7912-6-git-send-email-pbonzini@redhat.com>
	<50F9668A.4080308@redhat.com>
In-Reply-To: <50F9668A.4080308@redhat.com>
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.23
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 209.132.183.28
Cc: qemu-devel@nongnu.org, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v2 05/12] mirror: perform COW if the
	cluster size is bigger than the granularity
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Il 18/01/2013 16:13, Kevin Wolf ha scritto:
> Am 16.01.2013 18:31, schrieb Paolo Bonzini:
>> When mirroring runs, the backing files for the target may not yet be
>> ready.  However, this means that a copy-on-write operation on the target
>> would fill the missing sectors with zeros.  Copy-on-write only happens
>> if the granularity of the dirty bitmap is smaller than the cluster size
>> (and only for clusters that are allocated in the source after the job
>> has started copying).  So far, the granularity was fixed to 1MB; to avoid
>> the problem we detected the situation and required the backing files to
>> be available in that case only.
>>
>> However, we want to lower the granularity for efficiency, so we need
>> a better solution.  The solution is to always copy a whole cluster the
>> first time it is touched.  The code keeps a bitmap of clusters that
>> have already been allocated by the mirroring job, and only does "manual"
>> copy-on-write if the chunk being copied is zero in the bitmap.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>         v1->v2: rebased for moved include files
>>
>>  block/mirror.c             |   60 +++++++++++++++++++++++++++++++++++++------
>>  blockdev.c                 |   15 ++---------
>>  tests/qemu-iotests/041     |   21 +++++++++++++++
>>  tests/qemu-iotests/041.out |    4 +-
>>  trace-events               |    1 +
>>  5 files changed, 78 insertions(+), 23 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 20cb1e7..ee45e2e 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -15,6 +15,7 @@
>>  #include "block/blockjob.h"
>>  #include "block/block_int.h"
>>  #include "qemu/ratelimit.h"
>> +#include "qemu/bitmap.h"
>>  
>>  enum {
>>      /*
>> @@ -36,6 +37,8 @@ typedef struct MirrorBlockJob {
>>      bool synced;
>>      bool should_complete;
>>      int64_t sector_num;
>> +    size_t buf_size;
>> +    unsigned long *cow_bitmap;
>>      HBitmapIter hbi;
>>      uint8_t *buf;
>>  } MirrorBlockJob;
>> @@ -60,7 +63,7 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>      BlockDriverState *target = s->target;
>>      QEMUIOVector qiov;
>>      int ret, nb_sectors;
>> -    int64_t end;
>> +    int64_t end, sector_num, cluster_num;
>>      struct iovec iov;
>>  
>>      s->sector_num = hbitmap_iter_next(&s->hbi);
>> @@ -71,22 +74,41 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>          assert(s->sector_num >= 0);
>>      }
>>  
>> +    /* If we have no backing file yet in the destination, and the cluster size
>> +     * is very large, we need to do COW ourselves.  The first time a cluster is
>> +     * copied, copy it entirely.
>> +     *
>> +     * Because both BDRV_SECTORS_PER_DIRTY_CHUNK and the cluster size are
>> +     * powers of two, the number of sectors to copy cannot exceed one cluster.
>> +     */
>> +    sector_num = s->sector_num;
>> +    nb_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK;
>> +    cluster_num = sector_num / BDRV_SECTORS_PER_DIRTY_CHUNK;
>> +    if (s->cow_bitmap && !test_bit(cluster_num, s->cow_bitmap)) {
>> +        trace_mirror_cow(s, sector_num);
>> +        bdrv_round_to_clusters(s->target,
>> +                               sector_num, BDRV_SECTORS_PER_DIRTY_CHUNK,
>> +                               &sector_num, &nb_sectors);
>> +        bitmap_set(s->cow_bitmap, sector_num / BDRV_SECTORS_PER_DIRTY_CHUNK,
>> +                   nb_sectors / BDRV_SECTORS_PER_DIRTY_CHUNK);
> 
> Here the bit in the cow_bitmap is set before the COW has actually been
> performed. It could still fail.
> 
>> +    }
>> +
>>      end = s->common.len >> BDRV_SECTOR_BITS;
>> -    nb_sectors = MIN(BDRV_SECTORS_PER_DIRTY_CHUNK, end - s->sector_num);
>> -    bdrv_reset_dirty(source, s->sector_num, nb_sectors);
>> +    nb_sectors = MIN(nb_sectors, end - sector_num);
>> +    bdrv_reset_dirty(source, sector_num, nb_sectors);
>>  
>>      /* Copy the dirty cluster.  */
>>      iov.iov_base = s->buf;
>>      iov.iov_len  = nb_sectors * 512;
>>      qemu_iovec_init_external(&qiov, &iov, 1);
>>  
>> -    trace_mirror_one_iteration(s, s->sector_num, nb_sectors);
>> -    ret = bdrv_co_readv(source, s->sector_num, nb_sectors, &qiov);
>> +    trace_mirror_one_iteration(s, sector_num, nb_sectors);
>> +    ret = bdrv_co_readv(source, sector_num, nb_sectors, &qiov);
>>      if (ret < 0) {
>>          *p_action = mirror_error_action(s, true, -ret);
>>          goto fail;
>>      }
>> -    ret = bdrv_co_writev(target, s->sector_num, nb_sectors, &qiov);
>> +    ret = bdrv_co_writev(target, sector_num, nb_sectors, &qiov);
>>      if (ret < 0) {
>>          *p_action = mirror_error_action(s, false, -ret);
>>          s->synced = false;
>> @@ -96,7 +118,7 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>  
>>  fail:
>>      /* Try again later.  */
>> -    bdrv_set_dirty(source, s->sector_num, nb_sectors);
>> +    bdrv_set_dirty(source, sector_num, nb_sectors);
> 
> If it does, we mark the whole cluster dirty now, but in the cow_bitmap
> it's still marked at present on the target. When restarting the job,
> wouldn't it copy only the start of the cluster next time and corrupt the
> rest of it?

Yes, very good catch.

I think this should fix it.


I haven't written a testcase for it, it's tricky but should be doable.
Do you want me to respin, or can it be done as a followup?

I would prefer a followup also because it will give a better pointer when
we backport this fix to the RHEL6 code.

Paolo

diff --git a/block/mirror.c b/block/mirror.c
index 82abc2f..0fc140a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -87,6 +87,9 @@ static void mirror_iteration_done(MirrorOp *op)
     cluster_num = op->sector_num / s->granularity;
     nb_chunks = op->nb_sectors / s->granularity;
     bitmap_clear(s->in_flight_bitmap, cluster_num, nb_chunks);
+    if (s->cow_bitmap) {
+        bitmap_set(s->cow_bitmap, cluster_num, nb_chunks);
+    }
 
     trace_mirror_iteration_done(s, op->sector_num, op->nb_sectors);
     g_slice_free(MirrorOp, op);
@@ -217,9 +220,6 @@ static void coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
         /* We have enough free space to copy these sectors.  */
         bitmap_set(s->in_flight_bitmap, next_cluster, added_chunks);
-        if (s->cow_bitmap) {
-            bitmap_set(s->cow_bitmap, next_cluster, added_chunks);
-        }
 
         nb_sectors += added_sectors;
         nb_chunks += added_chunks;