Patchwork [RFC,2/4] block: add live block commit functionality

login
register
mail settings
Submitter Jeff Cody
Date July 31, 2012, 5:16 a.m.
Message ID <a2017b092460dd5f35123ec65a5237b3d46ed9f7.1343710713.git.jcody@redhat.com>
Download mbox | patch
Permalink /patch/174145/
State New
Headers show

Comments

Jeff Cody - July 31, 2012, 5:16 a.m.
This adds the live commit coroutine.  This iteration focuses on the
commit only below the active layer, and not the active layer itself.

The behaviour is similar to block streaming; the sectors are walked
through, and anything that exists above 'base' is committed back down
into base.  At the end, intermediate images are deleted, and the
chain stiched together.  Images are restored to their original open
flags.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/Makefile.objs |    2 +-
 block/commit.c      |  200 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block_int.h         |   19 +++++
 trace-events        |    2 +
 4 files changed, 222 insertions(+), 1 deletion(-)
 create mode 100644 block/commit.c
Eric Blake - July 31, 2012, 5:51 p.m.
On 07/30/2012 11:16 PM, Jeff Cody wrote:
> This adds the live commit coroutine.  This iteration focuses on the
> commit only below the active layer, and not the active layer itself.
> 
> The behaviour is similar to block streaming; the sectors are walked
> through, and anything that exists above 'base' is committed back down
> into base.  At the end, intermediate images are deleted, and the
> chain stiched together.  Images are restored to their original open

s/stiched/stitched/

> +
> +enum {
> +    /*
> +     * Size of data buffer for populating the image file.  This should be large
> +     * enough to process multiple clusters in a single call, so that populating
> +     * contiguous regions of the image is efficient.
> +     */
> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
> +};

Paolo's latest round of patches got to the point of making this
configurable for drive-mirror; is that something you should be copying here?

> +++ b/block_int.h
> @@ -342,4 +342,23 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
>                    BlockDriverCompletionFunc *cb,
>                    void *opaque, Error **errp);
>  
> +/**
> + * commit_start:
> + * @bs: Top Block device
> + * @base: Block device that will be written into, and become the new top
> + * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
> + * @on_error: The action to take upon error.

Do we need to distinguish between rerror and werror?

> + * @cb: Completion function for the job.
> + * @opaque: Opaque pointer value passed to @cb.
> + * @orig_base_flags: The original open flags for the base image
> + * @orig_top_flags: The original open flags for the top image
> + * @errp: Error object.
> + *
> + */
> +void commit_start(BlockDriverState *bs, BlockDriverState *base,
> +                 BlockDriverState *top, int64_t speed,
> +                 BlockdevOnError on_error, BlockDriverCompletionFunc *cb,
> +                 void *opaque, int orig_base_flags, int orig_top_flags,
> +                 Error **errp);
> +
>  #endif /* BLOCK_INT_H */
Jeff Cody - July 31, 2012, 5:55 p.m.
On 07/31/2012 01:51 PM, Eric Blake wrote:
> On 07/30/2012 11:16 PM, Jeff Cody wrote:
>> This adds the live commit coroutine.  This iteration focuses on the
>> commit only below the active layer, and not the active layer itself.
>>
>> The behaviour is similar to block streaming; the sectors are walked
>> through, and anything that exists above 'base' is committed back down
>> into base.  At the end, intermediate images are deleted, and the
>> chain stiched together.  Images are restored to their original open
> 
> s/stiched/stitched/
> 
>> +
>> +enum {
>> +    /*
>> +     * Size of data buffer for populating the image file.  This should be large
>> +     * enough to process multiple clusters in a single call, so that populating
>> +     * contiguous regions of the image is efficient.
>> +     */
>> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
>> +};
> 
> Paolo's latest round of patches got to the point of making this
> configurable for drive-mirror; is that something you should be copying here?

Yes

> 
>> +++ b/block_int.h
>> @@ -342,4 +342,23 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
>>                    BlockDriverCompletionFunc *cb,
>>                    void *opaque, Error **errp);
>>  
>> +/**
>> + * commit_start:
>> + * @bs: Top Block device
>> + * @base: Block device that will be written into, and become the new top
>> + * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
>> + * @on_error: The action to take upon error.
> 
> Do we need to distinguish between rerror and werror?

Good question - I don't think that would change the error handling, but
it may be useful information to get the user.  If you prefer rerror and
werror distinction, I'll add that in.

> 
>> + * @cb: Completion function for the job.
>> + * @opaque: Opaque pointer value passed to @cb.
>> + * @orig_base_flags: The original open flags for the base image
>> + * @orig_top_flags: The original open flags for the top image
>> + * @errp: Error object.
>> + *
>> + */
>> +void commit_start(BlockDriverState *bs, BlockDriverState *base,
>> +                 BlockDriverState *top, int64_t speed,
>> +                 BlockdevOnError on_error, BlockDriverCompletionFunc *cb,
>> +                 void *opaque, int orig_base_flags, int orig_top_flags,
>> +                 Error **errp);
>> +
>>  #endif /* BLOCK_INT_H */
>
Kevin Wolf - Aug. 1, 2012, 6:32 a.m.
Am 31.07.2012 19:55, schrieb Jeff Cody:
> On 07/31/2012 01:51 PM, Eric Blake wrote:
>> On 07/30/2012 11:16 PM, Jeff Cody wrote:
>>> This adds the live commit coroutine.  This iteration focuses on the
>>> commit only below the active layer, and not the active layer itself.
>>>
>>> The behaviour is similar to block streaming; the sectors are walked
>>> through, and anything that exists above 'base' is committed back down
>>> into base.  At the end, intermediate images are deleted, and the
>>> chain stiched together.  Images are restored to their original open
>>
>> s/stiched/stitched/
>>
>>> +
>>> +enum {
>>> +    /*
>>> +     * Size of data buffer for populating the image file.  This should be large
>>> +     * enough to process multiple clusters in a single call, so that populating
>>> +     * contiguous regions of the image is efficient.
>>> +     */
>>> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
>>> +};
>>
>> Paolo's latest round of patches got to the point of making this
>> configurable for drive-mirror; is that something you should be copying here?
> 
> Yes

Though its use is very limited for live commit. For the mirror it's
important because a larger number can mean that more data is
unnecessarily written, and the target can become larger than the source.
For live commit, I think using a larger buffer is always better.

Hm, part of the difference is that I assume that commit uses
bdrv_is_allocated() to check whether some data must really be copied.
But then, there's no reason why mirroring couldn't do that as well. Paolo?

Kevin
Paolo Bonzini - Aug. 1, 2012, 7:07 a.m.
Il 01/08/2012 08:32, Kevin Wolf ha scritto:
>>>> >>> +enum {
>>>> >>> +    /*
>>>> >>> +     * Size of data buffer for populating the image file.  This should be large
>>>> >>> +     * enough to process multiple clusters in a single call, so that populating
>>>> >>> +     * contiguous regions of the image is efficient.
>>>> >>> +     */
>>>> >>> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
>>>> >>> +};
>>> >>
>>> >> Paolo's latest round of patches got to the point of making this
>>> >> configurable for drive-mirror; is that something you should be copying here?
>> > 
>> > Yes
> Though its use is very limited for live commit. For the mirror it's
> important because a larger number can mean that more data is
> unnecessarily written, and the target can become larger than the source.

Note that the latest version of mirroring has _two_ knobs:

1) granularity is what decides how much data could be written
unnecessarily, because of the dirty bitmap.

2) buffer size is what decides how much I/O is in flight at one time.

The default values are resp. the cluster size (or 64K for raw) and 10M.
 The two together give mirroring some self-tuning capability.  For
example in the first part of the mirroring you will likely proceed in
10M chunks with no concurrency; once you're synchronized, you'll
probably send several chunks, perhaps all 64K if the guest does random
writes.

Live commit as it is done now doesn't need any of this complication; it
is just a background operation that does not need to compete with the
guest.  So using a larger buffer is indeed always better, and 512K is a
nice intermediate value between mirroring's 64K and 10M extremes.

> For live commit, I think using a larger buffer is always better.
> 
> Hm, part of the difference is that I assume that commit uses
> bdrv_is_allocated() to check whether some data must really be copied.
> But then, there's no reason why mirroring couldn't do that as well. Paolo?

We copy a cluster at a time, and that's also the resolution of
bdrv_is_allocated so we wouldn't gain anything.  Nice idea though, I had
to mull about it to find the flaw. :)

Paolo
Jeff Cody - Aug. 1, 2012, 11:23 a.m.
On 08/01/2012 03:07 AM, Paolo Bonzini wrote:
> Il 01/08/2012 08:32, Kevin Wolf ha scritto:
>>>>>>>> +enum {
>>>>>>>> +    /*
>>>>>>>> +     * Size of data buffer for populating the image file.  This should be large
>>>>>>>> +     * enough to process multiple clusters in a single call, so that populating
>>>>>>>> +     * contiguous regions of the image is efficient.
>>>>>>>> +     */
>>>>>>>> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
>>>>>>>> +};
>>>>>>
>>>>>> Paolo's latest round of patches got to the point of making this
>>>>>> configurable for drive-mirror; is that something you should be copying here?
>>>>
>>>> Yes
>> Though its use is very limited for live commit. For the mirror it's
>> important because a larger number can mean that more data is
>> unnecessarily written, and the target can become larger than the source.
> 
> Note that the latest version of mirroring has _two_ knobs:
> 
> 1) granularity is what decides how much data could be written
> unnecessarily, because of the dirty bitmap.
> 
> 2) buffer size is what decides how much I/O is in flight at one time.
> 
> The default values are resp. the cluster size (or 64K for raw) and 10M.
>  The two together give mirroring some self-tuning capability.  For
> example in the first part of the mirroring you will likely proceed in
> 10M chunks with no concurrency; once you're synchronized, you'll
> probably send several chunks, perhaps all 64K if the guest does random
> writes.
> 
> Live commit as it is done now doesn't need any of this complication; it
> is just a background operation that does not need to compete with the
> guest.  So using a larger buffer is indeed always better, and 512K is a
> nice intermediate value between mirroring's 64K and 10M extremes.
> 

We may want these same adjust knobs for the 'second phase' of live
commit, however.  It will commit down the top (active) layer, which will
have similar properties to mirroring, since the guest can still be
writing to the active layer.  If we do need it for the second phase,
perhaps those controls will be useful now.


>> For live commit, I think using a larger buffer is always better.
>>
>> Hm, part of the difference is that I assume that commit uses
>> bdrv_is_allocated() to check whether some data must really be copied.
>> But then, there's no reason why mirroring couldn't do that as well. Paolo?
> 
> We copy a cluster at a time, and that's also the resolution of
> bdrv_is_allocated so we wouldn't gain anything.  Nice idea though, I had
> to mull about it to find the flaw. :)
> 
> Paolo
>

Patch

diff --git a/block/Makefile.objs b/block/Makefile.objs
index f1a394a..9d07d3c 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -9,4 +9,4 @@  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
 block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
 
-common-obj-y += stream.o mirror.o
+common-obj-y += stream.o mirror.o commit.o
diff --git a/block/commit.c b/block/commit.c
new file mode 100644
index 0000000..6b29f79
--- /dev/null
+++ b/block/commit.c
@@ -0,0 +1,200 @@ 
+/*
+ * Live block commit
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Jeff Cody   <jcody@redhat.com>
+ *  Based on stream.c by Stefan Hajnoczi
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "trace.h"
+#include "block_int.h"
+#include "blockjob.h"
+#include "qemu/ratelimit.h"
+
+enum {
+    /*
+     * Size of data buffer for populating the image file.  This should be large
+     * enough to process multiple clusters in a single call, so that populating
+     * contiguous regions of the image is efficient.
+     */
+    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
+};
+
+#define SLICE_TIME 100000000ULL /* ns */
+
+typedef struct CommitBlockJob {
+    BlockJob common;
+    RateLimit limit;
+    BlockDriverState *active;
+    BlockDriverState *top;
+    BlockDriverState *base;
+    BlockdevOnError on_error;
+    int base_flags;
+    int top_flags;
+} CommitBlockJob;
+
+static int coroutine_fn commit_populate(BlockDriverState *bs,
+                                        BlockDriverState *base,
+                                        int64_t sector_num, int nb_sectors,
+                                        void *buf)
+{
+    if (bdrv_read(bs, sector_num, buf, nb_sectors)) {
+        return -EIO;
+    }
+    if (bdrv_write(base, sector_num, buf, nb_sectors)) {
+        return -EIO;
+    }
+    return 0;
+}
+
+static void coroutine_fn commit_run(void *opaque)
+{
+    CommitBlockJob *s = opaque;
+    BlockDriverState *active = s->active;
+    BlockDriverState *top = s->top;
+    BlockDriverState *base = s->base;
+    int64_t sector_num, end;
+    int error = 0;
+    int ret = 0;
+    int n = 0;
+    void *buf;
+    int bytes_written = 0;
+
+    s->common.len = bdrv_getlength(top);
+    if (s->common.len < 0) {
+        block_job_completed(&s->common, s->common.len);
+        return;
+    }
+
+    end = s->common.len >> BDRV_SECTOR_BITS;
+    buf = qemu_blockalign(top, COMMIT_BUFFER_SIZE);
+
+    for (sector_num = 0; sector_num < end; sector_num += n) {
+        uint64_t delay_ns = 0;
+        bool copy;
+
+wait:
+        /* Note that even when no rate limit is applied we need to yield
+         * with no pending I/O here so that qemu_aio_flush() returns.
+         */
+        block_job_sleep_ns(&s->common, rt_clock, delay_ns);
+        if (block_job_is_cancelled(&s->common)) {
+            break;
+        }
+        /* Copy if allocated above the base */
+        ret = bdrv_co_is_allocated_above(top, base, sector_num,
+                                         COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
+                                         &n);
+        copy = (ret == 1);
+        trace_commit_one_iteration(s, sector_num, n, ret);
+        if (ret >= 0 && copy) {
+            if (s->common.speed) {
+                delay_ns = ratelimit_calculate_delay(&s->limit, n);
+                if (delay_ns > 0) {
+                    goto wait;
+                }
+            }
+            ret = commit_populate(top, base, sector_num, n, buf);
+            bytes_written += n * BDRV_SECTOR_SIZE;
+        }
+        if (ret < 0) {
+            BlockErrorAction action =
+                block_job_error_action(&s->common, s->common.bs, s->on_error,
+                                       true, -ret);
+            if (action == BDRV_ACTION_STOP) {
+                n = 0;
+                continue;
+            }
+            if (error == 0) {
+                error = ret;
+            }
+            if (action == BDRV_ACTION_REPORT) {
+                break;
+            }
+        }
+        ret = 0;
+
+        /* Publish progress */
+        s->common.offset += n * BDRV_SECTOR_SIZE;
+    }
+
+    if (!block_job_is_cancelled(&s->common) && sector_num == end && ret == 0) {
+        /* success */
+        if (bdrv_delete_intermediate(active, top, base)) {
+            /* something went wrong! */
+            /* TODO:add error reporting here */
+        }
+    }
+
+    /* restore base open flags here if appropriate (e.g., change the base back
+     * to r/o) */
+    if (s->base_flags != bdrv_get_flags(base)) {
+        bdrv_reopen(base, s->base_flags);
+    }
+    if (s->top_flags != bdrv_get_flags(top)) {
+        bdrv_reopen(top, s->top_flags);
+    }
+
+    qemu_vfree(buf);
+    block_job_completed(&s->common, ret);
+}
+
+static void commit_set_speed(BlockJob *job, int64_t speed, Error **errp)
+{
+    CommitBlockJob *s = container_of(job, CommitBlockJob, common);
+
+    if (speed < 0) {
+        error_set(errp, QERR_INVALID_PARAMETER, "speed");
+        return;
+    }
+    ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
+}
+
+static BlockJobType commit_job_type = {
+    .instance_size = sizeof(CommitBlockJob),
+    .job_type      = "commit",
+    .set_speed     = commit_set_speed,
+};
+
+void commit_start(BlockDriverState *bs, BlockDriverState *base,
+                 BlockDriverState *top, int64_t speed,
+                 BlockdevOnError on_error, BlockDriverCompletionFunc *cb,
+                 void *opaque, int orig_base_flags, int orig_top_flags,
+                 Error **errp)
+{
+    CommitBlockJob *s;
+
+    if ((on_error == BLOCKDEV_ON_ERROR_STOP ||
+         on_error == BLOCKDEV_ON_ERROR_ENOSPC) &&
+        !bdrv_iostatus_is_enabled(bs)) {
+        error_set(errp, QERR_INVALID_PARAMETER_COMBINATION);
+        return;
+    }
+
+    s = block_job_create(&commit_job_type, bs, speed, cb, opaque, errp);
+    if (!s) {
+        return;
+    }
+
+    s->base   = base;
+    s->top    = top;
+    s->active = bs;
+
+    s->base_flags = orig_base_flags;
+    s->top_flags  = orig_top_flags;
+
+    s->on_error = on_error;
+    s->common.co = qemu_coroutine_create(commit_run);
+
+    trace_commit_start(bs, base, top, s, s->common.co, opaque,
+                       orig_base_flags, orig_top_flags);
+    qemu_coroutine_enter(s->common.co, s);
+
+    return;
+}
diff --git a/block_int.h b/block_int.h
index de52263..e3c2121 100644
--- a/block_int.h
+++ b/block_int.h
@@ -342,4 +342,23 @@  void mirror_start(BlockDriverState *bs, BlockDriverState *target,
                   BlockDriverCompletionFunc *cb,
                   void *opaque, Error **errp);
 
+/**
+ * commit_start:
+ * @bs: Top Block device
+ * @base: Block device that will be written into, and become the new top
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @on_error: The action to take upon error.
+ * @cb: Completion function for the job.
+ * @opaque: Opaque pointer value passed to @cb.
+ * @orig_base_flags: The original open flags for the base image
+ * @orig_top_flags: The original open flags for the top image
+ * @errp: Error object.
+ *
+ */
+void commit_start(BlockDriverState *bs, BlockDriverState *base,
+                 BlockDriverState *top, int64_t speed,
+                 BlockdevOnError on_error, BlockDriverCompletionFunc *cb,
+                 void *opaque, int orig_base_flags, int orig_top_flags,
+                 Error **errp);
+
 #endif /* BLOCK_INT_H */
diff --git a/trace-events b/trace-events
index d61bcba..3897de7 100644
--- a/trace-events
+++ b/trace-events
@@ -74,6 +74,8 @@  bdrv_co_do_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t c
 # block/stream.c
 stream_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d"
 stream_start(void *bs, void *base, void *s, void *co, void *opaque) "bs %p base %p s %p co %p opaque %p"
+commit_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d"
+commit_start(void *bs, void *base, void *top, void *s, void *co, void *opaque, int base_flags, int top_flags) "bs %p base %p top %p s %p co %p opaque %p base_flags %d top_flags %d"
 
 # block/mirror.c
 mirror_start(void *bs, void *s, void *co, void *opaque) "bs %p s %p co %p opaque %p"