Message ID | a2017b092460dd5f35123ec65a5237b3d46ed9f7.1343710713.git.jcody@redhat.com |
---|---|
State | New |
Headers | show |
On 07/30/2012 11:16 PM, Jeff Cody wrote: > This adds the live commit coroutine. This iteration focuses on the > commit only below the active layer, and not the active layer itself. > > The behaviour is similar to block streaming; the sectors are walked > through, and anything that exists above 'base' is committed back down > into base. At the end, intermediate images are deleted, and the > chain stiched together. Images are restored to their original open s/stiched/stitched/ > + > +enum { > + /* > + * Size of data buffer for populating the image file. This should be large > + * enough to process multiple clusters in a single call, so that populating > + * contiguous regions of the image is efficient. > + */ > + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ > +}; Paolo's latest round of patches got to the point of making this configurable for drive-mirror; is that something you should be copying here? > +++ b/block_int.h > @@ -342,4 +342,23 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, > BlockDriverCompletionFunc *cb, > void *opaque, Error **errp); > > +/** > + * commit_start: > + * @bs: Top Block device > + * @base: Block device that will be written into, and become the new top > + * @speed: The maximum speed, in bytes per second, or 0 for unlimited. > + * @on_error: The action to take upon error. Do we need to distinguish between rerror and werror? > + * @cb: Completion function for the job. > + * @opaque: Opaque pointer value passed to @cb. > + * @orig_base_flags: The original open flags for the base image > + * @orig_top_flags: The original open flags for the top image > + * @errp: Error object. > + * > + */ > +void commit_start(BlockDriverState *bs, BlockDriverState *base, > + BlockDriverState *top, int64_t speed, > + BlockdevOnError on_error, BlockDriverCompletionFunc *cb, > + void *opaque, int orig_base_flags, int orig_top_flags, > + Error **errp); > + > #endif /* BLOCK_INT_H */
On 07/31/2012 01:51 PM, Eric Blake wrote: > On 07/30/2012 11:16 PM, Jeff Cody wrote: >> This adds the live commit coroutine. This iteration focuses on the >> commit only below the active layer, and not the active layer itself. >> >> The behaviour is similar to block streaming; the sectors are walked >> through, and anything that exists above 'base' is committed back down >> into base. At the end, intermediate images are deleted, and the >> chain stiched together. Images are restored to their original open > > s/stiched/stitched/ > >> + >> +enum { >> + /* >> + * Size of data buffer for populating the image file. This should be large >> + * enough to process multiple clusters in a single call, so that populating >> + * contiguous regions of the image is efficient. >> + */ >> + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ >> +}; > > Paolo's latest round of patches got to the point of making this > configurable for drive-mirror; is that something you should be copying here? Yes > >> +++ b/block_int.h >> @@ -342,4 +342,23 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, >> BlockDriverCompletionFunc *cb, >> void *opaque, Error **errp); >> >> +/** >> + * commit_start: >> + * @bs: Top Block device >> + * @base: Block device that will be written into, and become the new top >> + * @speed: The maximum speed, in bytes per second, or 0 for unlimited. >> + * @on_error: The action to take upon error. > > Do we need to distinguish between rerror and werror? Good question - I don't think that would change the error handling, but it may be useful information to get the user. If you prefer rerror and werror distinction, I'll add that in. > >> + * @cb: Completion function for the job. >> + * @opaque: Opaque pointer value passed to @cb. >> + * @orig_base_flags: The original open flags for the base image >> + * @orig_top_flags: The original open flags for the top image >> + * @errp: Error object. >> + * >> + */ >> +void commit_start(BlockDriverState *bs, BlockDriverState *base, >> + BlockDriverState *top, int64_t speed, >> + BlockdevOnError on_error, BlockDriverCompletionFunc *cb, >> + void *opaque, int orig_base_flags, int orig_top_flags, >> + Error **errp); >> + >> #endif /* BLOCK_INT_H */ >
Am 31.07.2012 19:55, schrieb Jeff Cody: > On 07/31/2012 01:51 PM, Eric Blake wrote: >> On 07/30/2012 11:16 PM, Jeff Cody wrote: >>> This adds the live commit coroutine. This iteration focuses on the >>> commit only below the active layer, and not the active layer itself. >>> >>> The behaviour is similar to block streaming; the sectors are walked >>> through, and anything that exists above 'base' is committed back down >>> into base. At the end, intermediate images are deleted, and the >>> chain stiched together. Images are restored to their original open >> >> s/stiched/stitched/ >> >>> + >>> +enum { >>> + /* >>> + * Size of data buffer for populating the image file. This should be large >>> + * enough to process multiple clusters in a single call, so that populating >>> + * contiguous regions of the image is efficient. >>> + */ >>> + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ >>> +}; >> >> Paolo's latest round of patches got to the point of making this >> configurable for drive-mirror; is that something you should be copying here? > > Yes Though its use is very limited for live commit. For the mirror it's important because a larger number can mean that more data is unnecessarily written, and the target can become larger than the source. For live commit, I think using a larger buffer is always better. Hm, part of the difference is that I assume that commit uses bdrv_is_allocated() to check whether some data must really be copied. But then, there's no reason why mirroring couldn't do that as well. Paolo? Kevin
Il 01/08/2012 08:32, Kevin Wolf ha scritto: >>>> >>> +enum { >>>> >>> + /* >>>> >>> + * Size of data buffer for populating the image file. This should be large >>>> >>> + * enough to process multiple clusters in a single call, so that populating >>>> >>> + * contiguous regions of the image is efficient. >>>> >>> + */ >>>> >>> + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ >>>> >>> +}; >>> >> >>> >> Paolo's latest round of patches got to the point of making this >>> >> configurable for drive-mirror; is that something you should be copying here? >> > >> > Yes > Though its use is very limited for live commit. For the mirror it's > important because a larger number can mean that more data is > unnecessarily written, and the target can become larger than the source. Note that the latest version of mirroring has _two_ knobs: 1) granularity is what decides how much data could be written unnecessarily, because of the dirty bitmap. 2) buffer size is what decides how much I/O is in flight at one time. The default values are resp. the cluster size (or 64K for raw) and 10M. The two together give mirroring some self-tuning capability. For example in the first part of the mirroring you will likely proceed in 10M chunks with no concurrency; once you're synchronized, you'll probably send several chunks, perhaps all 64K if the guest does random writes. Live commit as it is done now doesn't need any of this complication; it is just a background operation that does not need to compete with the guest. So using a larger buffer is indeed always better, and 512K is a nice intermediate value between mirroring's 64K and 10M extremes. > For live commit, I think using a larger buffer is always better. > > Hm, part of the difference is that I assume that commit uses > bdrv_is_allocated() to check whether some data must really be copied. > But then, there's no reason why mirroring couldn't do that as well. Paolo? We copy a cluster at a time, and that's also the resolution of bdrv_is_allocated so we wouldn't gain anything. Nice idea though, I had to mull about it to find the flaw. :) Paolo
On 08/01/2012 03:07 AM, Paolo Bonzini wrote: > Il 01/08/2012 08:32, Kevin Wolf ha scritto: >>>>>>>> +enum { >>>>>>>> + /* >>>>>>>> + * Size of data buffer for populating the image file. This should be large >>>>>>>> + * enough to process multiple clusters in a single call, so that populating >>>>>>>> + * contiguous regions of the image is efficient. >>>>>>>> + */ >>>>>>>> + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ >>>>>>>> +}; >>>>>> >>>>>> Paolo's latest round of patches got to the point of making this >>>>>> configurable for drive-mirror; is that something you should be copying here? >>>> >>>> Yes >> Though its use is very limited for live commit. For the mirror it's >> important because a larger number can mean that more data is >> unnecessarily written, and the target can become larger than the source. > > Note that the latest version of mirroring has _two_ knobs: > > 1) granularity is what decides how much data could be written > unnecessarily, because of the dirty bitmap. > > 2) buffer size is what decides how much I/O is in flight at one time. > > The default values are resp. the cluster size (or 64K for raw) and 10M. > The two together give mirroring some self-tuning capability. For > example in the first part of the mirroring you will likely proceed in > 10M chunks with no concurrency; once you're synchronized, you'll > probably send several chunks, perhaps all 64K if the guest does random > writes. > > Live commit as it is done now doesn't need any of this complication; it > is just a background operation that does not need to compete with the > guest. So using a larger buffer is indeed always better, and 512K is a > nice intermediate value between mirroring's 64K and 10M extremes. > We may want these same adjust knobs for the 'second phase' of live commit, however. It will commit down the top (active) layer, which will have similar properties to mirroring, since the guest can still be writing to the active layer. If we do need it for the second phase, perhaps those controls will be useful now. >> For live commit, I think using a larger buffer is always better. >> >> Hm, part of the difference is that I assume that commit uses >> bdrv_is_allocated() to check whether some data must really be copied. >> But then, there's no reason why mirroring couldn't do that as well. Paolo? > > We copy a cluster at a time, and that's also the resolution of > bdrv_is_allocated so we wouldn't gain anything. Nice idea though, I had > to mull about it to find the flaw. :) > > Paolo >
diff --git a/block/Makefile.objs b/block/Makefile.objs index f1a394a..9d07d3c 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -9,4 +9,4 @@ block-obj-$(CONFIG_LIBISCSI) += iscsi.o block-obj-$(CONFIG_CURL) += curl.o block-obj-$(CONFIG_RBD) += rbd.o -common-obj-y += stream.o mirror.o +common-obj-y += stream.o mirror.o commit.o diff --git a/block/commit.c b/block/commit.c new file mode 100644 index 0000000..6b29f79 --- /dev/null +++ b/block/commit.c @@ -0,0 +1,200 @@ +/* + * Live block commit + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Jeff Cody <jcody@redhat.com> + * Based on stream.c by Stefan Hajnoczi + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + * + */ + +#include "trace.h" +#include "block_int.h" +#include "blockjob.h" +#include "qemu/ratelimit.h" + +enum { + /* + * Size of data buffer for populating the image file. This should be large + * enough to process multiple clusters in a single call, so that populating + * contiguous regions of the image is efficient. + */ + COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */ +}; + +#define SLICE_TIME 100000000ULL /* ns */ + +typedef struct CommitBlockJob { + BlockJob common; + RateLimit limit; + BlockDriverState *active; + BlockDriverState *top; + BlockDriverState *base; + BlockdevOnError on_error; + int base_flags; + int top_flags; +} CommitBlockJob; + +static int coroutine_fn commit_populate(BlockDriverState *bs, + BlockDriverState *base, + int64_t sector_num, int nb_sectors, + void *buf) +{ + if (bdrv_read(bs, sector_num, buf, nb_sectors)) { + return -EIO; + } + if (bdrv_write(base, sector_num, buf, nb_sectors)) { + return -EIO; + } + return 0; +} + +static void coroutine_fn commit_run(void *opaque) +{ + CommitBlockJob *s = opaque; + BlockDriverState *active = s->active; + BlockDriverState *top = s->top; + BlockDriverState *base = s->base; + int64_t sector_num, end; + int error = 0; + int ret = 0; + int n = 0; + void *buf; + int bytes_written = 0; + + s->common.len = bdrv_getlength(top); + if (s->common.len < 0) { + block_job_completed(&s->common, s->common.len); + return; + } + + end = s->common.len >> BDRV_SECTOR_BITS; + buf = qemu_blockalign(top, COMMIT_BUFFER_SIZE); + + for (sector_num = 0; sector_num < end; sector_num += n) { + uint64_t delay_ns = 0; + bool copy; + +wait: + /* Note that even when no rate limit is applied we need to yield + * with no pending I/O here so that qemu_aio_flush() returns. + */ + block_job_sleep_ns(&s->common, rt_clock, delay_ns); + if (block_job_is_cancelled(&s->common)) { + break; + } + /* Copy if allocated above the base */ + ret = bdrv_co_is_allocated_above(top, base, sector_num, + COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE, + &n); + copy = (ret == 1); + trace_commit_one_iteration(s, sector_num, n, ret); + if (ret >= 0 && copy) { + if (s->common.speed) { + delay_ns = ratelimit_calculate_delay(&s->limit, n); + if (delay_ns > 0) { + goto wait; + } + } + ret = commit_populate(top, base, sector_num, n, buf); + bytes_written += n * BDRV_SECTOR_SIZE; + } + if (ret < 0) { + BlockErrorAction action = + block_job_error_action(&s->common, s->common.bs, s->on_error, + true, -ret); + if (action == BDRV_ACTION_STOP) { + n = 0; + continue; + } + if (error == 0) { + error = ret; + } + if (action == BDRV_ACTION_REPORT) { + break; + } + } + ret = 0; + + /* Publish progress */ + s->common.offset += n * BDRV_SECTOR_SIZE; + } + + if (!block_job_is_cancelled(&s->common) && sector_num == end && ret == 0) { + /* success */ + if (bdrv_delete_intermediate(active, top, base)) { + /* something went wrong! */ + /* TODO:add error reporting here */ + } + } + + /* restore base open flags here if appropriate (e.g., change the base back + * to r/o) */ + if (s->base_flags != bdrv_get_flags(base)) { + bdrv_reopen(base, s->base_flags); + } + if (s->top_flags != bdrv_get_flags(top)) { + bdrv_reopen(top, s->top_flags); + } + + qemu_vfree(buf); + block_job_completed(&s->common, ret); +} + +static void commit_set_speed(BlockJob *job, int64_t speed, Error **errp) +{ + CommitBlockJob *s = container_of(job, CommitBlockJob, common); + + if (speed < 0) { + error_set(errp, QERR_INVALID_PARAMETER, "speed"); + return; + } + ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME); +} + +static BlockJobType commit_job_type = { + .instance_size = sizeof(CommitBlockJob), + .job_type = "commit", + .set_speed = commit_set_speed, +}; + +void commit_start(BlockDriverState *bs, BlockDriverState *base, + BlockDriverState *top, int64_t speed, + BlockdevOnError on_error, BlockDriverCompletionFunc *cb, + void *opaque, int orig_base_flags, int orig_top_flags, + Error **errp) +{ + CommitBlockJob *s; + + if ((on_error == BLOCKDEV_ON_ERROR_STOP || + on_error == BLOCKDEV_ON_ERROR_ENOSPC) && + !bdrv_iostatus_is_enabled(bs)) { + error_set(errp, QERR_INVALID_PARAMETER_COMBINATION); + return; + } + + s = block_job_create(&commit_job_type, bs, speed, cb, opaque, errp); + if (!s) { + return; + } + + s->base = base; + s->top = top; + s->active = bs; + + s->base_flags = orig_base_flags; + s->top_flags = orig_top_flags; + + s->on_error = on_error; + s->common.co = qemu_coroutine_create(commit_run); + + trace_commit_start(bs, base, top, s, s->common.co, opaque, + orig_base_flags, orig_top_flags); + qemu_coroutine_enter(s->common.co, s); + + return; +} diff --git a/block_int.h b/block_int.h index de52263..e3c2121 100644 --- a/block_int.h +++ b/block_int.h @@ -342,4 +342,23 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, BlockDriverCompletionFunc *cb, void *opaque, Error **errp); +/** + * commit_start: + * @bs: Top Block device + * @base: Block device that will be written into, and become the new top + * @speed: The maximum speed, in bytes per second, or 0 for unlimited. + * @on_error: The action to take upon error. + * @cb: Completion function for the job. + * @opaque: Opaque pointer value passed to @cb. + * @orig_base_flags: The original open flags for the base image + * @orig_top_flags: The original open flags for the top image + * @errp: Error object. + * + */ +void commit_start(BlockDriverState *bs, BlockDriverState *base, + BlockDriverState *top, int64_t speed, + BlockdevOnError on_error, BlockDriverCompletionFunc *cb, + void *opaque, int orig_base_flags, int orig_top_flags, + Error **errp); + #endif /* BLOCK_INT_H */ diff --git a/trace-events b/trace-events index d61bcba..3897de7 100644 --- a/trace-events +++ b/trace-events @@ -74,6 +74,8 @@ bdrv_co_do_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t c # block/stream.c stream_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d" stream_start(void *bs, void *base, void *s, void *co, void *opaque) "bs %p base %p s %p co %p opaque %p" +commit_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d" +commit_start(void *bs, void *base, void *top, void *s, void *co, void *opaque, int base_flags, int top_flags) "bs %p base %p top %p s %p co %p opaque %p base_flags %d top_flags %d" # block/mirror.c mirror_start(void *bs, void *s, void *co, void *opaque) "bs %p s %p co %p opaque %p"
This adds the live commit coroutine. This iteration focuses on the commit only below the active layer, and not the active layer itself. The behaviour is similar to block streaming; the sectors are walked through, and anything that exists above 'base' is committed back down into base. At the end, intermediate images are deleted, and the chain stiched together. Images are restored to their original open flags. Signed-off-by: Jeff Cody <jcody@redhat.com> --- block/Makefile.objs | 2 +- block/commit.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++ block_int.h | 19 +++++ trace-events | 2 + 4 files changed, 222 insertions(+), 1 deletion(-) create mode 100644 block/commit.c