Patchwork [v2,05/12] mirror: perform COW if the cluster size is bigger than the granularity

Submitter Paolo Bonzini
Date Jan. 18, 2013, 4:22 p.m.
Message ID <50F976D8.60303@redhat.com>
Permalink /patch/213669/
State New

Comments

Paolo Bonzini - Jan. 18, 2013, 4:22 p.m.
On 18/01/2013 16:13, Kevin Wolf wrote:
> On 16.01.2013 18:31, Paolo Bonzini wrote:
>> When mirroring runs, the backing files for the target may not yet be
>> ready.  However, this means that a copy-on-write operation on the target
>> would fill the missing sectors with zeros.  Copy-on-write only happens
>> if the granularity of the dirty bitmap is smaller than the cluster size
>> (and only for clusters that are allocated in the source after the job
>> has started copying).  So far, the granularity was fixed to 1MB; to avoid
>> the problem we detected the situation and required the backing files to
>> be available in that case only.
>>
>> However, we want to lower the granularity for efficiency, so we need
>> a better solution.  The solution is to always copy a whole cluster the
>> first time it is touched.  The code keeps a bitmap of clusters that
>> have already been allocated by the mirroring job, and only does "manual"
>> copy-on-write if the chunk being copied is zero in the bitmap.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>         v1->v2: rebased for moved include files
>>
>>  block/mirror.c             |   60 +++++++++++++++++++++++++++++++++++++------
>>  blockdev.c                 |   15 ++---------
>>  tests/qemu-iotests/041     |   21 +++++++++++++++
>>  tests/qemu-iotests/041.out |    4 +-
>>  trace-events               |    1 +
>>  5 files changed, 78 insertions(+), 23 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 20cb1e7..ee45e2e 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -15,6 +15,7 @@
>>  #include "block/blockjob.h"
>>  #include "block/block_int.h"
>>  #include "qemu/ratelimit.h"
>> +#include "qemu/bitmap.h"
>>  
>>  enum {
>>      /*
>> @@ -36,6 +37,8 @@ typedef struct MirrorBlockJob {
>>      bool synced;
>>      bool should_complete;
>>      int64_t sector_num;
>> +    size_t buf_size;
>> +    unsigned long *cow_bitmap;
>>      HBitmapIter hbi;
>>      uint8_t *buf;
>>  } MirrorBlockJob;
>> @@ -60,7 +63,7 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>      BlockDriverState *target = s->target;
>>      QEMUIOVector qiov;
>>      int ret, nb_sectors;
>> -    int64_t end;
>> +    int64_t end, sector_num, cluster_num;
>>      struct iovec iov;
>>  
>>      s->sector_num = hbitmap_iter_next(&s->hbi);
>> @@ -71,22 +74,41 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>          assert(s->sector_num >= 0);
>>      }
>>  
>> +    /* If we have no backing file yet in the destination, and the cluster size
>> +     * is very large, we need to do COW ourselves.  The first time a cluster is
>> +     * copied, copy it entirely.
>> +     *
>> +     * Because both BDRV_SECTORS_PER_DIRTY_CHUNK and the cluster size are
>> +     * powers of two, the number of sectors to copy cannot exceed one cluster.
>> +     */
>> +    sector_num = s->sector_num;
>> +    nb_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK;
>> +    cluster_num = sector_num / BDRV_SECTORS_PER_DIRTY_CHUNK;
>> +    if (s->cow_bitmap && !test_bit(cluster_num, s->cow_bitmap)) {
>> +        trace_mirror_cow(s, sector_num);
>> +        bdrv_round_to_clusters(s->target,
>> +                               sector_num, BDRV_SECTORS_PER_DIRTY_CHUNK,
>> +                               &sector_num, &nb_sectors);
>> +        bitmap_set(s->cow_bitmap, sector_num / BDRV_SECTORS_PER_DIRTY_CHUNK,
>> +                   nb_sectors / BDRV_SECTORS_PER_DIRTY_CHUNK);
> 
> Here the bit in the cow_bitmap is set before the COW has actually been
> performed. It could still fail.
> 
>> +    }
>> +
>>      end = s->common.len >> BDRV_SECTOR_BITS;
>> -    nb_sectors = MIN(BDRV_SECTORS_PER_DIRTY_CHUNK, end - s->sector_num);
>> -    bdrv_reset_dirty(source, s->sector_num, nb_sectors);
>> +    nb_sectors = MIN(nb_sectors, end - sector_num);
>> +    bdrv_reset_dirty(source, sector_num, nb_sectors);
>>  
>>      /* Copy the dirty cluster.  */
>>      iov.iov_base = s->buf;
>>      iov.iov_len  = nb_sectors * 512;
>>      qemu_iovec_init_external(&qiov, &iov, 1);
>>  
>> -    trace_mirror_one_iteration(s, s->sector_num, nb_sectors);
>> -    ret = bdrv_co_readv(source, s->sector_num, nb_sectors, &qiov);
>> +    trace_mirror_one_iteration(s, sector_num, nb_sectors);
>> +    ret = bdrv_co_readv(source, sector_num, nb_sectors, &qiov);
>>      if (ret < 0) {
>>          *p_action = mirror_error_action(s, true, -ret);
>>          goto fail;
>>      }
>> -    ret = bdrv_co_writev(target, s->sector_num, nb_sectors, &qiov);
>> +    ret = bdrv_co_writev(target, sector_num, nb_sectors, &qiov);
>>      if (ret < 0) {
>>          *p_action = mirror_error_action(s, false, -ret);
>>          s->synced = false;
>> @@ -96,7 +118,7 @@ static int coroutine_fn mirror_iteration(MirrorBlockJob *s,
>>  
>>  fail:
>>      /* Try again later.  */
>> -    bdrv_set_dirty(source, s->sector_num, nb_sectors);
>> +    bdrv_set_dirty(source, sector_num, nb_sectors);
> 
> If it does, we mark the whole cluster dirty now, but in the cow_bitmap
> it's still marked as present on the target. When restarting the job,
> wouldn't it copy only the start of the cluster next time and corrupt the
> rest of it?

Yes, very good catch.

I think this should fix it.


I haven't written a testcase for it, it's tricky but should be doable.
Do you want me to respin, or can it be done as a followup?

I would prefer a followup also because it will give a better pointer when
we backport this fix to the RHEL6 code.

Paolo
Kevin Wolf - Jan. 18, 2013, 5:05 p.m.
On 18.01.2013 17:22, Paolo Bonzini wrote:
> I haven't written a testcase for it, it's tricky but should be doable.
> Do you want me to respin, or can it be done as a followup?

I think I would prefer a respin, but if you think otherwise, I won't insist.

> I would prefer a followup also because it will give a better pointer when
> we backport this fix to the RHEL6 code.

That's not really a valid argument for upstream.

Also, wouldn't we backport the fixed version in the first place so that
a pointer isn't even needed? This code doesn't seem to exist yet in RHEL 6.

Kevin
Paolo Bonzini - Jan. 18, 2013, 5:33 p.m.
> On 18.01.2013 17:22, Paolo Bonzini wrote:
> > I haven't written a testcase for it, it's tricky but should be
> > doable.
> > Do you want me to respin, or can it be done as a followup?
> 
> I think I would prefer a respin, but if you think otherwise, I won't
> insist.

Okay, I'll respin.  I'll just note that this series now is in danger of
missing 1.4 (after 1.2 and 1.3) because only Laszlo and Eric gave it a
decent review in the six months since it was first posted.

Had I been employed by any other company, I'd probably just have kept
the code in house and forgotten about upstream.

> Also, wouldn't we backport the fixed version in the first place so that
> a pointer isn't even needed? This code doesn't seem to exist yet in
> RHEL 6.

Oops, you're right.  We fixed the problem in a different way.

Paolo
Kevin Wolf - Jan. 21, 2013, 10:17 a.m.
On 18.01.2013 18:33, Paolo Bonzini wrote:
> 
>> On 18.01.2013 17:22, Paolo Bonzini wrote:
>>> I haven't written a testcase for it, it's tricky but should be
>>> doable.
>>> Do you want me to respin, or can it be done as a followup?
>>
>> I think I would prefer a respin, but if you think otherwise, I won't
>> insist.
> 
> Okay, I'll respin.  I'll just note that this series now is in danger of
> missing 1.4 (after 1.2 and 1.3) because only Laszlo and Eric gave it a
> decent review in the six months since it was first posted.
> 
> Had I been employed by any other company, I'd probably just have kept
> the code in house and forgotten about upstream.

I hope this doesn't imply that you feel I'm happy or even just
indifferent about it. This is just what happens when you get a huge
number of patches and have only very few reviewers. I hope it has got a
bit better since Stefan supports me in maintaining the block layer, but
I'm afraid we're still not good enough with it. Any helpful suggestions
are appreciated.

Kevin
Paolo Bonzini - Jan. 21, 2013, 11:15 a.m.
On 21/01/2013 11:17, Kevin Wolf wrote:
> On 18.01.2013 18:33, Paolo Bonzini wrote:
>>
>>> On 18.01.2013 17:22, Paolo Bonzini wrote:
>>>> I haven't written a testcase for it, it's tricky but should be
>>>> doable.
>>>> Do you want me to respin, or can it be done as a followup?
>>>
>>> I think I would prefer a respin, but if you think otherwise, I won't
>>> insist.
>>
>> Okay, I'll respin.  I'll just note that this series now is in danger of
>> missing 1.4 (after 1.2 and 1.3) because only Laszlo and Eric gave it a
>> decent review in the six months since it was first posted.
>>
>> Had I been employed by any other company, I'd probably just have kept
>> the code in house and forgotten about upstream.
> 
> I hope this doesn't imply that you feel I'm happy or even just
> indifferent about it. This is just what happens when you get a huge
> number of patches and have only very few reviewers. I hope it has got a
> bit better since Stefan supports me in maintaining the block layer, but
> I'm afraid we're still not good enough with it. Any helpful suggestions
> are appreciated.

No, I don't think you're happy.  And I'm sorry if it felt like a
complaint, it wasn't meant to be---I had plenty of other patches
committed by either you or Stefan or Anthony, so I cannot really
complain about anything. :)

The problem is that we have lots of patches that are not ready, posted
too early and without really following review comments.  These patches
consume a huge amount of review bandwidth.  And more often than not they
are never committed, because people disappear when the patches are
almost ready.

At the same time, patches that are almost ready from the beginning tend
to fall through the cracks.  It is not exclusive to the block layer, see
for example Alberto Garcia's serial port patches.

Paolo
Paolo Bonzini - Jan. 21, 2013, 2:10 p.m.
On 18/01/2013 17:22, Paolo Bonzini wrote:
>> > If it does, we mark the whole cluster dirty now, but in the cow_bitmap
>> > it's still marked as present on the target. When restarting the job,
>> > wouldn't it copy only the start of the cluster next time and corrupt the
>> > rest of it?
> Yes, very good catch.

Actually, it works.  Because the whole destination cluster is marked
dirty, all of it is ultimately copied correctly.  The special handling
of COW is required for all other sources of dirty data, where the dirty
bitmap can include a subset of a destination cluster, but not in this case.

I'll include the testcase, and test the patch I attached to my previous
message.

Paolo
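
The argument above can be checked on a toy model. The sketch below is not QEMU code: `ToyMirror` and every `toy_*` name are hypothetical, one `int` stands for one dirty-bitmap chunk, and the target is assumed to zero-fill a cluster when it first allocates it (standing in for COW against a missing backing file). As in v2, the COW bits are set before the write is known to have succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: none of these names are QEMU APIs. */
enum { CHUNKS = 8, CHUNKS_PER_CLUSTER = 4, CLUSTERS = CHUNKS / CHUNKS_PER_CLUSTER };

typedef struct ToyMirror {
    int  source[CHUNKS];       /* one int stands for one dirty-bitmap chunk */
    int  target[CHUNKS];
    bool allocated[CLUSTERS];  /* target clusters that already exist */
    bool dirty[CHUNKS];        /* chunks still to be copied */
    bool cow[CHUNKS];          /* chunks whose cluster was already copied whole */
    bool manual_cow;           /* false: never widen a copy to a whole cluster */
    int  fail_writes;          /* number of upcoming target writes that fail */
} ToyMirror;

void toy_set_range(bool *bits, int start, int n, bool value)
{
    for (int i = 0; i < n; i++) {
        bits[start + i] = value;
    }
}

void toy_init(ToyMirror *m, int dirty_chunk, int fail_writes, bool manual_cow)
{
    assert(dirty_chunk >= 0 && dirty_chunk < CHUNKS);
    memset(m, 0, sizeof(*m));
    for (int i = 0; i < CHUNKS; i++) {
        m->source[i] = i + 1;  /* non-zero, so zero-filled chunks show up */
    }
    m->dirty[dirty_chunk] = true;
    m->fail_writes = fail_writes;
    m->manual_cow = manual_cow;
}

/* Assumption: allocating a target cluster zero-fills it before the write. */
bool toy_write(ToyMirror *m, int start, int n)
{
    if (m->fail_writes > 0) {
        m->fail_writes--;
        return false;
    }
    for (int c = start / CHUNKS_PER_CLUSTER;
         c <= (start + n - 1) / CHUNKS_PER_CLUSTER; c++) {
        if (!m->allocated[c]) {
            memset(&m->target[c * CHUNKS_PER_CLUSTER], 0,
                   CHUNKS_PER_CLUSTER * sizeof(int));
            m->allocated[c] = true;
        }
    }
    memcpy(&m->target[start], &m->source[start], n * sizeof(int));
    return true;
}

/* One copy step; returns false once nothing is dirty. */
bool toy_iteration(ToyMirror *m)
{
    int start = 0, n = 1;

    while (start < CHUNKS && !m->dirty[start]) {
        start++;
    }
    if (start == CHUNKS) {
        return false;
    }
    if (m->manual_cow && !m->cow[start]) {
        /* First touch: widen to the whole cluster, set COW bits early. */
        start -= start % CHUNKS_PER_CLUSTER;
        n = CHUNKS_PER_CLUSTER;
        toy_set_range(m->cow, start, n, true);
    }
    toy_set_range(m->dirty, start, n, false);
    if (!toy_write(m, start, n)) {
        /* Try again later: the whole attempted range becomes dirty again. */
        toy_set_range(m->dirty, start, n, true);
    }
    return true;
}

void toy_run(ToyMirror *m)
{
    while (toy_iteration(m)) {
    }
}

bool toy_cluster_matches(const ToyMirror *m, int cluster)
{
    int first = cluster * CHUNKS_PER_CLUSTER;
    return memcmp(&m->source[first], &m->target[first],
                  CHUNKS_PER_CLUSTER * sizeof(int)) == 0;
}
```

With chunk 5 dirty and the first write failing, the retry copies the cluster one chunk at a time, but since all four chunks were re-marked dirty, cluster 1 of the target ends up equal to the source. With `manual_cow` off and no failure, only chunk 5 is copied and the rest of the cluster stays zero-filled, which is the case the manual COW exists to prevent.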
Stefan Hajnoczi - Jan. 22, 2013, 9:28 a.m.
On Mon, Jan 21, 2013 at 12:15:51PM +0100, Paolo Bonzini wrote:
> On 21/01/2013 11:17, Kevin Wolf wrote:
> > On 18.01.2013 18:33, Paolo Bonzini wrote:
> >>
> >>> On 18.01.2013 17:22, Paolo Bonzini wrote:
> >>>> I haven't written a testcase for it, it's tricky but should be
> >>>> doable.
> >>>> Do you want me to respin, or can it be done as a followup?
> >>>
> >>> I think I would prefer a respin, but if you think otherwise, I won't
> >>> insist.
> >>
> >> Okay, I'll respin.  I'll just note that this series now is in danger of
> >> missing 1.4 (after 1.2 and 1.3) because only Laszlo and Eric gave it a
> >> decent review in the six months since it was first posted.
> >>
> >> Had I been employed by any other company, I'd probably just have kept
> >> the code in house and forgotten about upstream.
> > 
> > I hope this doesn't imply that you feel I'm happy or even just
> > indifferent about it. This is just what happens when you get a huge
> > number of patches and have only very few reviewers. I hope it has got a
> > bit better since Stefan supports me in maintaining the block layer, but
> > I'm afraid we're still not good enough with it. Any helpful suggestions
> > are appreciated.
> 
> No, I don't think you're happy.  And I'm sorry if it felt like a
> complaint, it wasn't meant to be---I had plenty of other patches
> committed by either you or Stefan or Anthony, so I cannot really
> complain about anything. :)
> 
> The problem is that we have lots of patches that are not ready, posted
> too early and without really following review comments.  These patches
> consume a huge amount of review bandwidth.  And more often than not they
> are never committed, because people disappear when the patches are
> almost ready.

This is due to growth.  QEMU is attracting new contributors all the
time.  Their new features need more integration help not just because
the contributors are new to the QEMU codebase, but also because the
features stretch the boundaries of what QEMU is designed for.

I'd rather that RFC series are posted and slowly digested by the
community than the alternatives, which are throwing "finished" code over
the wall without prior discussion (and the contributor is faced with big
change requests from the community, gets frustrated because they thought
it was done, and leaves) or simply keeping code out of the QEMU source
tree altogether.

> At the same time, patches that are almost ready from the beginning tend
> to fall through the cracks.  It is not exclusive to the block layer, see
> for example Alberto Garcia's serial port patches.

We need more reviewers.  I've noticed Eric Blake doing a lot of good
code review over the past months.  When people review code it frees up
maintainers to spend more time applying patches.

Stefan

Patch

diff --git a/block/mirror.c b/block/mirror.c
index 82abc2f..0fc140a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -87,6 +87,9 @@  static void mirror_iteration_done(MirrorOp *op)
     cluster_num = op->sector_num / s->granularity;
     nb_chunks = op->nb_sectors / s->granularity;
     bitmap_clear(s->in_flight_bitmap, cluster_num, nb_chunks);
+    if (s->cow_bitmap) {
+        bitmap_set(s->cow_bitmap, cluster_num, nb_chunks);
+    }
 
     trace_mirror_iteration_done(s, op->sector_num, op->nb_sectors);
     g_slice_free(MirrorOp, op);
@@ -217,9 +220,6 @@  static void coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
         /* We have enough free space to copy these sectors.  */
         bitmap_set(s->in_flight_bitmap, next_cluster, added_chunks);
-        if (s->cow_bitmap) {
-            bitmap_set(s->cow_bitmap, next_cluster, added_chunks);
-        }
 
         nb_sectors += added_sectors;
         nb_chunks += added_chunks;
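
The effect of moving the `bitmap_set()` call can be sketched in isolation. This is a toy model under stated assumptions, not the QEMU implementation: `ToyJob`, `toy_job_submit` and `toy_job_done` are hypothetical names, plain `bool` arrays stand in for the two bitmaps, and the sketch does not model whether the completion path should treat failed operations differently, since that is not visible in the hunks above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are QEMU APIs. */
enum { TOY_CHUNKS = 8 };

typedef struct ToyJob {
    bool in_flight[TOY_CHUNKS];  /* chunks with a copy operation pending */
    bool cow[TOY_CHUNKS];        /* chunks recorded as already copied whole */
} ToyJob;

/* Submission: only the in-flight state changes; COW bits are left alone. */
void toy_job_submit(ToyJob *j, int first, int n)
{
    assert(first >= 0 && n > 0 && first + n <= TOY_CHUNKS);
    for (int i = first; i < first + n; i++) {
        j->in_flight[i] = true;
    }
}

/* Completion: clear the in-flight bits and only now record the COW bits. */
void toy_job_done(ToyJob *j, int first, int n)
{
    assert(first >= 0 && n > 0 && first + n <= TOY_CHUNKS);
    for (int i = first; i < first + n; i++) {
        j->in_flight[i] = false;
        j->cow[i] = true;
    }
}
```

Between `toy_job_submit` and `toy_job_done`, the chunks are in flight but not yet recorded in the COW bitmap, so an operation that never completes cannot leave a cluster marked as copied.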