[v14,7/8] Implement new driver for block replication

Message ID	1452676712-24239-8-git-send-email-xiecl.fnst@cn.fujitsu.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Changlong Xie <xiecl.fnst@cn.fujitsu.com> To: qemu devel <qemu-devel@nongnu.org>, Fam Zheng <famz@redhat.com>, Max Reitz <mreitz@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com> Date: Wed, 13 Jan 2016 17:18:31 +0800 Message-ID: <1452676712-24239-8-git-send-email-xiecl.fnst@cn.fujitsu.com> In-Reply-To: <1452676712-24239-1-git-send-email-xiecl.fnst@cn.fujitsu.com> References: <1452676712-24239-1-git-send-email-xiecl.fnst@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain Cc: Gonglei <arei.gonglei@huawei.com>, zhanghailiang <zhang.zhanghailiang@huawei.com>, fnstml-hwcolo@cn.fujitsu.com Subject: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Changlong Xie Jan. 13, 2016, 9:18 a.m. UTC

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block/Makefile.objs              |   1 +
 block/replication-comm.c         |  66 +++++
 block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
 include/block/replication-comm.h |  50 ++++
 qapi/block-core.json             |  13 +
 5 files changed, 720 insertions(+)
 create mode 100644 block/replication-comm.c
 create mode 100644 block/replication.c
 create mode 100644 include/block/replication-comm.h

Eric Blake Jan. 20, 2016, 12:04 a.m. UTC | #1

On 01/13/2016 02:18 AM, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> ---
>  block/Makefile.objs              |   1 +
>  block/replication-comm.c         |  66 +++++
>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>  include/block/replication-comm.h |  50 ++++
>  qapi/block-core.json             |  13 +
>  5 files changed, 720 insertions(+)
>  create mode 100644 block/replication-comm.c
>  create mode 100644 block/replication.c
>  create mode 100644 include/block/replication-comm.h
> 

Just a high-level overview, mainly on the user-visible interface and
things that caught my eye.

> +++ b/block/replication-comm.c
> @@ -0,0 +1,66 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED

Do you want to claim 2016 in any of this?


> +
> +enum {
> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/

Space before */

> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
> +};

Should this be an enum type in a .json file?


> +
> +static int replication_open(BlockDriverState *bs, QDict *options,
> +                            int flags, Error **errp)
> +{

> +
> +fail:
> +    qemu_opts_del(opts);
> +    /* propagate error */
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +    }

It's safe to call error_propagate() unconditionally (you could drop the
'if (local_err)').


> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> +                                               int64_t sector_num,
> +                                               int nb_sectors)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 1) {
> +        /* It is secondary qemu and failover are running*/

Space before */

> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    if (!s->secondary_disk->job) {
> +        error_setg(errp, "Backup job is cancelled unexpectedly");

Maybe s/is/was/

> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +    int64_t active_length, hidden_length, disk_length;
> +    AioContext *aio_context;
> +    Error *local_err = NULL;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
> +        error_setg(errp, "Block replication is running or done");
> +        return;
> +    }
> +
> +    if (s->mode != mode) {
> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
> +                   " but receives %d", s->mode, mode);

s/receives/got/


> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
> +{
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    if (s->error) {
> +        error_setg(errp, "I/O error occurs");

s/occurs/occurred/

> +++ b/qapi/block-core.json
> @@ -1975,6 +1975,19 @@
>              '*read-pattern': 'QuorumReadPattern' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.6
> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> +

Interface looks okay.

Changlong Xie Jan. 20, 2016, 7:45 a.m. UTC | #2

On 01/20/2016 08:04 AM, Eric Blake wrote:
> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> ---
>>   block/Makefile.objs              |   1 +
>>   block/replication-comm.c         |  66 +++++
>>   block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>>   include/block/replication-comm.h |  50 ++++
>>   qapi/block-core.json             |  13 +
>>   5 files changed, 720 insertions(+)
>>   create mode 100644 block/replication-comm.c
>>   create mode 100644 block/replication.c
>>   create mode 100644 include/block/replication-comm.h
>>
>
> Just a high-level overview, mainly on the user-visible interface and
> things that caught my eye.

Hi eric, thanks for your patience.

>
>> +++ b/block/replication-comm.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>
> Do you want to claim 2016 in any of this?
>
>

Will correct all Copyright issues in next version.

>> +
>> +enum {
>> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
>> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
>> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
>> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
>
> Space before */

Will fix this space issue

>
>> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
>> +};
>
> Should this be an enum type in a .json file?
>

Since this is just an internal enum that only used in 
block/replication.c, i think we don't need put it in *.json file.

>
>> +
>> +static int replication_open(BlockDriverState *bs, QDict *options,
>> +                            int flags, Error **errp)
>> +{
>
>> +
>> +fail:
>> +    qemu_opts_del(opts);
>> +    /* propagate error */
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +    }
>
> It's safe to call error_propagate() unconditionally (you could drop the
> 'if (local_err)').

You're right, i'll fix all relevant issue in my patchset.

>
>
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +                                               int64_t sector_num,
>> +                                               int nb_sectors)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 1) {
>> +        /* It is secondary qemu and failover are running*/
>
> Space before */

Will fix

>
>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    if (!s->secondary_disk->job) {
>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>
> Maybe s/is/was/
>

will fix

>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +    int64_t active_length, hidden_length, disk_length;
>> +    AioContext *aio_context;
>> +    Error *local_err = NULL;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
>> +        error_setg(errp, "Block replication is running or done");
>> +        return;
>> +    }
>> +
>> +    if (s->mode != mode) {
>> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
>> +                   " but receives %d", s->mode, mode);
>
> s/receives/got/
>

will fix

>
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
>> +{
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    if (s->error) {
>> +        error_setg(errp, "I/O error occurs");
>
> s/occurs/occurred/
>

Will fix

>> +++ b/qapi/block-core.json
>> @@ -1975,6 +1975,19 @@
>>               '*read-pattern': 'QuorumReadPattern' } }
>>
>>   ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>> +
>
> Interface looks okay.


Thanks
	-Xie

>

Stefan Hajnoczi Jan. 27, 2016, 2:46 p.m. UTC | #3

On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> ---
>  block/Makefile.objs              |   1 +
>  block/replication-comm.c         |  66 +++++
>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>  include/block/replication-comm.h |  50 ++++
>  qapi/block-core.json             |  13 +
>  5 files changed, 720 insertions(+)
>  create mode 100644 block/replication-comm.c
>  create mode 100644 block/replication.c
>  create mode 100644 include/block/replication-comm.h
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index fa05f37..7037662 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
>  block-obj-y += accounting.o
>  block-obj-y += write-threshold.o
>  block-obj-y += backup.o
> +block-obj-y += replication-comm.o replication.o
>  
>  common-obj-y += stream.o
>  common-obj-y += commit.o
> diff --git a/block/replication-comm.c b/block/replication-comm.c
> new file mode 100644
> index 0000000..8af748b
> --- /dev/null
> +++ b/block/replication-comm.c
> @@ -0,0 +1,66 @@
> +/*
> + * Replication Block filter

Is the start/stop/checkpoint callback interface only useful for block
replication?

This seems like a generic interface for registering with COLO.  Other
components (networking, etc) might also need start/stop/checkpoint
callbacks.  If that's the case then this code should be outside block/
and the brs->bs field should either be void *opaque or removed (the
caller needs to use container_of()).

> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "block/replication-comm.h"
> +
> +static QLIST_HEAD(, BlockReplicationState) block_replication_states;
> +
> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
> +                                             BlockReplicationOps *ops)
> +{
> +    BlockReplicationState *brs;
> +
> +    brs = g_new0(BlockReplicationState, 1);
> +    brs->bs = bs;
> +    brs->ops = ops;
> +    QLIST_INSERT_HEAD(&block_replication_states, brs, node);
> +
> +    return brs;
> +}
> +
> +void block_replication_remove(BlockReplicationState *brs)
> +{
> +    QLIST_REMOVE(brs, node);
> +    g_free(brs);
> +}
> +
> +void block_replication_start_all(ReplicationMode mode, Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->start) {
> +            brs->ops->start(brs, mode, errp);
> +        }
> +    }
> +}
> +
> +void block_replication_do_checkpoint_all(Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->checkpoint) {
> +            brs->ops->checkpoint(brs, errp);
> +        }
> +    }
> +}
> +
> +void block_replication_stop_all(bool failover, Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->stop) {
> +            brs->ops->stop(brs, failover, errp);
> +        }
> +    }
> +}
> diff --git a/block/replication.c b/block/replication.c
> new file mode 100644
> index 0000000..29c677a
> --- /dev/null
> +++ b/block/replication.c
> @@ -0,0 +1,590 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu-common.h"
> +#include "block/blockjob.h"
> +#include "block/nbd.h"
> +#include "block/replication-comm.h"
> +
> +typedef struct BDRVReplicationState {
> +    ReplicationMode mode;
> +    int replication_state;
> +    BlockDriverState *active_disk;
> +    BlockDriverState *hidden_disk;
> +    BlockDriverState *secondary_disk;
> +    BlockDriverState *top_bs;
> +    BlockReplicationState *brs;
> +    Error *blocker;
> +    int orig_hidden_flags;
> +    int orig_secondary_flags;
> +    int error;
> +} BDRVReplicationState;
> +
> +enum {
> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
> +};
> +
> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp);
> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp);
> +static void replication_stop(BlockReplicationState *brs, bool failover,
> +                             Error **errp);
> +
> +#define REPLICATION_MODE        "mode"
> +static QemuOptsList replication_runtime_opts = {
> +    .name = "replication",
> +    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = REPLICATION_MODE,
> +            .type = QEMU_OPT_STRING,
> +        },
> +        { /* end of list */ }
> +    },
> +};
> +
> +static BlockReplicationOps replication_ops = {
> +    .start = replication_start,
> +    .checkpoint = replication_do_checkpoint,
> +    .stop = replication_stop,
> +};
> +
> +static int replication_open(BlockDriverState *bs, QDict *options,
> +                            int flags, Error **errp)
> +{
> +    int ret;
> +    BDRVReplicationState *s = bs->opaque;
> +    Error *local_err = NULL;
> +    QemuOpts *opts = NULL;
> +    const char *mode;
> +
> +    ret = -EINVAL;
> +    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &local_err);
> +    if (local_err) {
> +        goto fail;
> +    }
> +
> +    mode = qemu_opt_get(opts, REPLICATION_MODE);
> +    if (!mode) {
> +        error_setg(&local_err, "Missing the option mode");
> +        goto fail;
> +    }
> +
> +    if (!strcmp(mode, "primary")) {
> +        s->mode = REPLICATION_MODE_PRIMARY;
> +    } else if (!strcmp(mode, "secondary")) {
> +        s->mode = REPLICATION_MODE_SECONDARY;
> +    } else {
> +        error_setg(&local_err,
> +                   "The option mode's value should be primary or secondary");
> +        goto fail;
> +    }
> +
> +    s->brs = block_replication_new(bs, &replication_ops);
> +
> +    ret = 0;
> +
> +fail:
> +    qemu_opts_del(opts);
> +    /* propagate error */
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +    }
> +    return ret;
> +}
> +
> +static void replication_close(BlockDriverState *bs)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +
> +    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
> +        replication_stop(s->brs, false, NULL);
> +        block_replication_remove(s->brs);
> +    }
> +}
> +
> +static int64_t replication_getlength(BlockDriverState *bs)
> +{
> +    return bdrv_getlength(bs->file->bs);
> +}
> +
> +static int replication_get_io_status(BDRVReplicationState *s)
> +{
> +    switch (s->replication_state) {
> +    case BLOCK_REPLICATION_NONE:
> +        return -EIO;
> +    case BLOCK_REPLICATION_RUNNING:
> +        return 0;
> +    case BLOCK_REPLICATION_FAILOVER:
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
> +    case BLOCK_REPLICATION_FAILOVER_FAILED:
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
> +    case BLOCK_REPLICATION_DONE:
> +        /*
> +         * active commit job completes, and active disk and secondary_disk
> +         * is swapped, so we can operate bs->file directly
> +         */
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
> +    default:
> +        abort();
> +    }
> +}
> +
> +static int replication_return_value(BDRVReplicationState *s, int ret)
> +{
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        return ret;
> +    }
> +
> +    if (ret < 0) {
> +        s->error = ret;
> +        ret = 0;
> +    }
> +
> +    return ret;
> +}
> +
> +static coroutine_fn int replication_co_readv(BlockDriverState *bs,
> +                                             int64_t sector_num,
> +                                             int remaining_sectors,
> +                                             QEMUIOVector *qiov)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    if (s->mode == REPLICATION_MODE_PRIMARY) {
> +        /* We only use it to forward primary write requests */
> +        return -EIO;
> +    }
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
> +    return replication_return_value(s, ret);
> +}
> +
> +static coroutine_fn int replication_co_writev(BlockDriverState *bs,
> +                                              int64_t sector_num,
> +                                              int remaining_sectors,
> +                                              QEMUIOVector *qiov)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    QEMUIOVector hd_qiov;
> +    uint64_t bytes_done = 0;
> +    BlockDriverState *top = bs->file->bs;
> +    BlockDriverState *base = s->secondary_disk;
> +    BlockDriverState *target;
> +    int ret, n;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 0) {
> +        ret = bdrv_co_writev(top, sector_num,
> +                             remaining_sectors, qiov);
> +        return replication_return_value(s, ret);
> +    }
> +
> +    /*
> +     * Failover failed, only write to active disk if the sectors
> +     * have already been allocated in active disk/hidden disk.
> +     */
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +    while (remaining_sectors > 0) {
> +        ret = bdrv_is_allocated_above(top, base, sector_num,
> +                                      remaining_sectors, &n);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        qemu_iovec_reset(&hd_qiov);
> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * BDRV_SECTOR_SIZE);
> +
> +        target = ret ? top : base;
> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        remaining_sectors -= n;
> +        sector_num += n;
> +        bytes_done += n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> +                                               int64_t sector_num,
> +                                               int nb_sectors)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 1) {
> +        /* It is secondary qemu and failover are running*/
> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
> +    return replication_return_value(s, ret);
> +}
> +
> +static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
> +                                                    BlockDriverState *candidate)
> +{
> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
> +}
> +
> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    if (!s->secondary_disk->job) {
> +        error_setg(errp, "Backup job is cancelled unexpectedly");
> +        return;
> +    }
> +
> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);

What happens to in-flight requests to the active and hidden disks?

> +    if (ret < 0) {
> +        error_setg(errp, "Cannot make active disk empty");
> +        return;
> +    }
> +
> +    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
> +    if (ret < 0) {
> +        error_setg(errp, "Cannot make hidden disk empty");
> +        return;
> +    }
> +}
> +
> +static void reopen_backing_file(BDRVReplicationState *s, bool writable,
> +                                Error **errp)
> +{
> +    BlockReopenQueue *reopen_queue = NULL;
> +    int orig_hidden_flags, orig_secondary_flags;
> +    int new_hidden_flags, new_secondary_flags;
> +    Error *local_err = NULL;
> +
> +    if (writable) {
> +        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
> +        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
> +        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
> +        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
> +    } else {
> +        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
> +        new_hidden_flags = s->orig_hidden_flags;
> +        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
> +        new_secondary_flags = s->orig_secondary_flags;
> +    }
> +
> +    if (orig_hidden_flags != new_hidden_flags) {
> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
> +                                         new_hidden_flags);
> +    }
> +
> +    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
> +                                         new_secondary_flags);
> +    }
> +
> +    if (reopen_queue) {
> +        bdrv_reopen_multiple(reopen_queue, &local_err);
> +        if (local_err != NULL) {
> +            error_propagate(errp, local_err);
> +        }
> +    }
> +}
> +
> +static void backup_job_cleanup(BDRVReplicationState *s)
> +{
> +    bdrv_op_unblock_all(s->top_bs, s->blocker);
> +    error_free(s->blocker);
> +    reopen_backing_file(s, false, NULL);
> +}
> +
> +static void backup_job_completed(void *opaque, int ret)
> +{
> +    BDRVReplicationState *s = opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
> +        /* The backup job is cancelled unexpectedly */
> +        s->error = -EIO;
> +    }
> +
> +    backup_job_cleanup(s);
> +}
> +
> +static BlockDriverState *get_top_bs(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +
> +    while (!bs->blk) {
> +        if (QLIST_EMPTY(&bs->parents)) {
> +            return NULL;
> +        }
> +
> +        child = QLIST_FIRST(&bs->parents);
> +        if (QLIST_NEXT(child, next_parent)) {
> +            return NULL;
> +        }
> +
> +        bs = child->parent;
> +    }
> +
> +    return bs;
> +}
> +
> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +    int64_t active_length, hidden_length, disk_length;
> +    AioContext *aio_context;
> +    Error *local_err = NULL;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
> +        error_setg(errp, "Block replication is running or done");
> +        return;
> +    }
> +
> +    if (s->mode != mode) {
> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
> +                   " but receives %d", s->mode, mode);
> +        return;
> +    }
> +
> +    switch (s->mode) {
> +    case REPLICATION_MODE_PRIMARY:
> +        break;
> +    case REPLICATION_MODE_SECONDARY:
> +        s->active_disk = bs->file->bs;
> +        if (!bs->file->bs->backing) {
> +            error_setg(errp, "Active disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->hidden_disk = s->active_disk->backing->bs;
> +        if (!s->hidden_disk->backing) {
> +            error_setg(errp, "Hidden disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->secondary_disk = s->hidden_disk->backing->bs;
> +        if (!s->secondary_disk->blk) {
> +            error_setg(errp, "The secondary disk doesn't have block backend");
> +            return;
> +        }

Kevin: Is code allowed to stash away BlockDriverState pointers for
convenience or should it keep the BdrvChild pointers instead?  In order
for replication to work as expected, the graph shouldn't change but for
consistency maybe BdrvChild is best.

> +
> +        s->top_bs = get_top_bs(bs);
> +        if (!s->top_bs) {
> +            error_setg(errp, "Cannot get the top block driver state to do"
> +                       " internal backup");
> +            return;
> +        }
> +
> +        /* verify the length */
> +        active_length = bdrv_getlength(s->active_disk);
> +        hidden_length = bdrv_getlength(s->hidden_disk);
> +        disk_length = bdrv_getlength(s->secondary_disk);
> +        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
> +            active_length != hidden_length || hidden_length != disk_length) {
> +            error_setg(errp, "active disk, hidden disk, secondary disk's length"
> +                       " are not the same");
> +            return;
> +        }
> +
> +        if (!s->active_disk->drv->bdrv_make_empty ||
> +            !s->hidden_disk->drv->bdrv_make_empty) {
> +            error_setg(errp,
> +                       "active disk or hidden disk doesn't support make_empty");
> +            return;
> +        }
> +
> +        /* reopen the backing file in r/w mode */
> +        reopen_backing_file(s, true, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +
> +        /* start backup job now */
> +        error_setg(&s->blocker,
> +                   "block device is in use by internal backup job");
> +        bdrv_op_block_all(s->top_bs, s->blocker);
> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
> +        bdrv_ref(s->hidden_disk);

Why is the explicit reference to hidden_disk (but not secondary_disk or
active_disk) is necessary?

> +
> +        aio_context = bdrv_get_aio_context(bs);
> +        aio_context_acquire(aio_context);
> +        backup_start(s->secondary_disk, s->hidden_disk, 0,
> +                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
> +                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
> +                     s, NULL, &local_err);
> +        aio_context_release(aio_context);

This is the only place that uses AioContext to protect against other
threads accessing BlockDriverState at the same time.  Any operation like
replication_start/stop/checkpoint where you are not sure if the calling
thread is the same thread that holds the AioContext must use
aio_context_acquire/release() to protect against race conditions.

For example, replication_do_checkpoint() might be called from a thread
that doesn't hold AioContext so it must acquire brs->bs AioContext
before accessing BDRVReplicationState.

> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            backup_job_cleanup(s);
> +            bdrv_unref(s->hidden_disk);
> +            return;
> +        }
> +        break;
> +    default:
> +        abort();
> +    }
> +
> +    s->replication_state = BLOCK_REPLICATION_RUNNING;
> +
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        secondary_do_checkpoint(s, errp);
> +    }
> +
> +    s->error = 0;
> +}
> +
> +
> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
> +{
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    if (s->error) {
> +        error_setg(errp, "I/O error occurs");
> +        return;
> +    }
> +
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        secondary_do_checkpoint(s, errp);
> +    }
> +}
> +
> +static void replication_done(void *opaque, int ret)
> +{
> +    BlockDriverState *bs = opaque;
> +    BDRVReplicationState *s = bs->opaque;
> +
> +    if (ret == 0) {
> +        s->replication_state = BLOCK_REPLICATION_DONE;
> +
> +        /* refresh top bs's filename */
> +        bdrv_refresh_filename(bs);
> +        s->active_disk = NULL;
> +        s->secondary_disk = NULL;
> +        s->hidden_disk = NULL;
> +        s->error = 0;
> +    } else {
> +        s->replication_state = BLOCK_REPLICATION_FAILOVER_FAILED;
> +        s->error = -EIO;
> +    }
> +}
> +
> +static void replication_stop(BlockReplicationState *brs, bool failover, Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    switch (s->mode) {
> +    case REPLICATION_MODE_PRIMARY:
> +        s->replication_state = BLOCK_REPLICATION_DONE;
> +        s->error = 0;
> +        break;
> +    case REPLICATION_MODE_SECONDARY:
> +        if (!failover) {
> +            /*
> +             * This BDS will be closed, and the job should be completed
> +             * before the BDS is closed, because we will access hidden
> +             * disk, secondary disk in backup_job_completed().
> +             */
> +            if (s->secondary_disk->job) {
> +                block_job_cancel_sync(s->secondary_disk->job);
> +            }
> +            secondary_do_checkpoint(s, errp);
> +            s->replication_state = BLOCK_REPLICATION_DONE;
> +            return;
> +        }
> +
> +        s->replication_state = BLOCK_REPLICATION_FAILOVER;
> +        if (s->secondary_disk->job) {
> +            block_job_cancel(s->secondary_disk->job);
> +        }
> +
> +        commit_active_start(s->active_disk, s->secondary_disk, 0,
> +                            BLOCKDEV_ON_ERROR_REPORT, replication_done,
> +                            bs, errp, true);
> +        break;
> +    default:
> +        abort();
> +    }
> +}
> +
> +BlockDriver bdrv_replication = {
> +    .format_name                = "replication",
> +    .protocol_name              = "replication",
> +    .instance_size              = sizeof(BDRVReplicationState),
> +
> +    .bdrv_open                  = replication_open,
> +    .bdrv_close                 = replication_close,
> +
> +    .bdrv_getlength             = replication_getlength,
> +    .bdrv_co_readv              = replication_co_readv,
> +    .bdrv_co_writev             = replication_co_writev,
> +    .bdrv_co_discard            = replication_co_discard,
> +
> +    .is_filter                  = true,
> +    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
> +
> +    .has_variable_length        = true,
> +};
> +
> +static void bdrv_replication_init(void)
> +{
> +    bdrv_register(&bdrv_replication);
> +}
> +
> +block_init(bdrv_replication_init);
> diff --git a/include/block/replication-comm.h b/include/block/replication-comm.h
> new file mode 100644
> index 0000000..6483e56
> --- /dev/null
> +++ b/include/block/replication-comm.h
> @@ -0,0 +1,50 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef REPLICATION_H
> +#define REPLICATION_H
> +
> +#include "qemu/queue.h"
> +#include "block/block_int.h"
> +
> +typedef struct BlockReplicationOps BlockReplicationOps;
> +typedef struct BlockReplicationState BlockReplicationState;
> +typedef void (*Start)(BlockReplicationState *brs, ReplicationMode mode, Error **errp);
> +typedef void (*Stop)(BlockReplicationState *brs, bool failover, Error **errp);
> +typedef void (*Checkpoint)(BlockReplicationState *brs, Error **errp);
> +
> +struct BlockReplicationState {
> +    BlockDriverState *bs;
> +    BlockReplicationOps *ops;
> +    QLIST_ENTRY(BlockReplicationState) node;
> +};
> +
> +struct BlockReplicationOps{
> +    Start start;
> +    Stop stop;
> +    Checkpoint checkpoint;
> +};
> +
> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
> +                                             BlockReplicationOps *ops);
> +
> +void block_replication_remove(BlockReplicationState *brs);
> +
> +void block_replication_start_all(ReplicationMode mode, Error **errp);
> +
> +void block_replication_do_checkpoint_all(Error **errp);
> +
> +void block_replication_stop_all(bool failover, Error **errp);
> +
> +#endif /* REPLICATION_H */
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 2e86ede..d73f46b 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1975,6 +1975,19 @@
>              '*read-pattern': 'QuorumReadPattern' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.6
> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> +
> +##
>  # @BlockdevOptions
>  #
>  # Options for creating a block device.
> -- 
> 1.9.3
> 
> 
>

Wen Congyang Jan. 28, 2016, 1:13 a.m. UTC | #4

On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> ---
>>  block/Makefile.objs              |   1 +
>>  block/replication-comm.c         |  66 +++++
>>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>>  include/block/replication-comm.h |  50 ++++
>>  qapi/block-core.json             |  13 +
>>  5 files changed, 720 insertions(+)
>>  create mode 100644 block/replication-comm.c
>>  create mode 100644 block/replication.c
>>  create mode 100644 include/block/replication-comm.h
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index fa05f37..7037662 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
>>  block-obj-y += accounting.o
>>  block-obj-y += write-threshold.o
>>  block-obj-y += backup.o
>> +block-obj-y += replication-comm.o replication.o
>>  
>>  common-obj-y += stream.o
>>  common-obj-y += commit.o
>> diff --git a/block/replication-comm.c b/block/replication-comm.c
>> new file mode 100644
>> index 0000000..8af748b
>> --- /dev/null
>> +++ b/block/replication-comm.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Replication Block filter
> 
> Is the start/stop/checkpoint callback interface only useful for block
> replication?
> 
> This seems like a generic interface for registering with COLO.  Other
> components (networking, etc) might also need start/stop/checkpoint
> callbacks.  If that's the case then this code should be outside block/
> and the brs->bs field should either be void *opaque or removed (the
> caller needs to use container_of()).

Yes, we will do it in the next version.

> 
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "block/replication-comm.h"
>> +
>> +static QLIST_HEAD(, BlockReplicationState) block_replication_states;
>> +
>> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
>> +                                             BlockReplicationOps *ops)
>> +{
>> +    BlockReplicationState *brs;
>> +
>> +    brs = g_new0(BlockReplicationState, 1);
>> +    brs->bs = bs;
>> +    brs->ops = ops;
>> +    QLIST_INSERT_HEAD(&block_replication_states, brs, node);
>> +
>> +    return brs;
>> +}
>> +
>> +void block_replication_remove(BlockReplicationState *brs)
>> +{
>> +    QLIST_REMOVE(brs, node);
>> +    g_free(brs);
>> +}
>> +
>> +void block_replication_start_all(ReplicationMode mode, Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->start) {
>> +            brs->ops->start(brs, mode, errp);
>> +        }
>> +    }
>> +}
>> +
>> +void block_replication_do_checkpoint_all(Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->checkpoint) {
>> +            brs->ops->checkpoint(brs, errp);
>> +        }
>> +    }
>> +}
>> +
>> +void block_replication_stop_all(bool failover, Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->stop) {
>> +            brs->ops->stop(brs, failover, errp);
>> +        }
>> +    }
>> +}
>> diff --git a/block/replication.c b/block/replication.c
>> new file mode 100644
>> index 0000000..29c677a
>> --- /dev/null
>> +++ b/block/replication.c
>> @@ -0,0 +1,590 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu-common.h"
>> +#include "block/blockjob.h"
>> +#include "block/nbd.h"
>> +#include "block/replication-comm.h"
>> +
>> +typedef struct BDRVReplicationState {
>> +    ReplicationMode mode;
>> +    int replication_state;
>> +    BlockDriverState *active_disk;
>> +    BlockDriverState *hidden_disk;
>> +    BlockDriverState *secondary_disk;
>> +    BlockDriverState *top_bs;
>> +    BlockReplicationState *brs;
>> +    Error *blocker;
>> +    int orig_hidden_flags;
>> +    int orig_secondary_flags;
>> +    int error;
>> +} BDRVReplicationState;
>> +
>> +enum {
>> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
>> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
>> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
>> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
>> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
>> +};
>> +
>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp);
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp);
>> +static void replication_stop(BlockReplicationState *brs, bool failover,
>> +                             Error **errp);
>> +
>> +#define REPLICATION_MODE        "mode"
>> +static QemuOptsList replication_runtime_opts = {
>> +    .name = "replication",
>> +    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = REPLICATION_MODE,
>> +            .type = QEMU_OPT_STRING,
>> +        },
>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +static BlockReplicationOps replication_ops = {
>> +    .start = replication_start,
>> +    .checkpoint = replication_do_checkpoint,
>> +    .stop = replication_stop,
>> +};
>> +
>> +static int replication_open(BlockDriverState *bs, QDict *options,
>> +                            int flags, Error **errp)
>> +{
>> +    int ret;
>> +    BDRVReplicationState *s = bs->opaque;
>> +    Error *local_err = NULL;
>> +    QemuOpts *opts = NULL;
>> +    const char *mode;
>> +
>> +    ret = -EINVAL;
>> +    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (local_err) {
>> +        goto fail;
>> +    }
>> +
>> +    mode = qemu_opt_get(opts, REPLICATION_MODE);
>> +    if (!mode) {
>> +        error_setg(&local_err, "Missing the option mode");
>> +        goto fail;
>> +    }
>> +
>> +    if (!strcmp(mode, "primary")) {
>> +        s->mode = REPLICATION_MODE_PRIMARY;
>> +    } else if (!strcmp(mode, "secondary")) {
>> +        s->mode = REPLICATION_MODE_SECONDARY;
>> +    } else {
>> +        error_setg(&local_err,
>> +                   "The option mode's value should be primary or secondary");
>> +        goto fail;
>> +    }
>> +
>> +    s->brs = block_replication_new(bs, &replication_ops);
>> +
>> +    ret = 0;
>> +
>> +fail:
>> +    qemu_opts_del(opts);
>> +    /* propagate error */
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static void replication_close(BlockDriverState *bs)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +
>> +    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
>> +        replication_stop(s->brs, false, NULL);
>> +        block_replication_remove(s->brs);
>> +    }
>> +}
>> +
>> +static int64_t replication_getlength(BlockDriverState *bs)
>> +{
>> +    return bdrv_getlength(bs->file->bs);
>> +}
>> +
>> +static int replication_get_io_status(BDRVReplicationState *s)
>> +{
>> +    switch (s->replication_state) {
>> +    case BLOCK_REPLICATION_NONE:
>> +        return -EIO;
>> +    case BLOCK_REPLICATION_RUNNING:
>> +        return 0;
>> +    case BLOCK_REPLICATION_FAILOVER:
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
>> +    case BLOCK_REPLICATION_FAILOVER_FAILED:
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
>> +    case BLOCK_REPLICATION_DONE:
>> +        /*
>> +         * active commit job completes, and active disk and secondary_disk
>> +         * is swapped, so we can operate bs->file directly
>> +         */
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
>> +    default:
>> +        abort();
>> +    }
>> +}
>> +
>> +static int replication_return_value(BDRVReplicationState *s, int ret)
>> +{
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret < 0) {
>> +        s->error = ret;
>> +        ret = 0;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static coroutine_fn int replication_co_readv(BlockDriverState *bs,
>> +                                             int64_t sector_num,
>> +                                             int remaining_sectors,
>> +                                             QEMUIOVector *qiov)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    if (s->mode == REPLICATION_MODE_PRIMARY) {
>> +        /* We only use it to forward primary write requests */
>> +        return -EIO;
>> +    }
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
>> +    return replication_return_value(s, ret);
>> +}
>> +
>> +static coroutine_fn int replication_co_writev(BlockDriverState *bs,
>> +                                              int64_t sector_num,
>> +                                              int remaining_sectors,
>> +                                              QEMUIOVector *qiov)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    QEMUIOVector hd_qiov;
>> +    uint64_t bytes_done = 0;
>> +    BlockDriverState *top = bs->file->bs;
>> +    BlockDriverState *base = s->secondary_disk;
>> +    BlockDriverState *target;
>> +    int ret, n;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 0) {
>> +        ret = bdrv_co_writev(top, sector_num,
>> +                             remaining_sectors, qiov);
>> +        return replication_return_value(s, ret);
>> +    }
>> +
>> +    /*
>> +     * Failover failed, only write to active disk if the sectors
>> +     * have already been allocated in active disk/hidden disk.
>> +     */
>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> +    while (remaining_sectors > 0) {
>> +        ret = bdrv_is_allocated_above(top, base, sector_num,
>> +                                      remaining_sectors, &n);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        qemu_iovec_reset(&hd_qiov);
>> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * BDRV_SECTOR_SIZE);
>> +
>> +        target = ret ? top : base;
>> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        remaining_sectors -= n;
>> +        sector_num += n;
>> +        bytes_done += n * BDRV_SECTOR_SIZE;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +                                               int64_t sector_num,
>> +                                               int nb_sectors)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 1) {
>> +        /* It is secondary qemu and failover are running*/
>> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
>> +    return replication_return_value(s, ret);
>> +}
>> +
>> +static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
>> +                                                    BlockDriverState *candidate)
>> +{
>> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
>> +}
>> +
>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    if (!s->secondary_disk->job) {
>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>> +        return;
>> +    }
>> +
>> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
> 
> What happens to in-flight requests to the active and hidden disks?

we MUST call do_checkpoint() when the vm is stopped.

> 
>> +    if (ret < 0) {
>> +        error_setg(errp, "Cannot make active disk empty");
>> +        return;
>> +    }
>> +
>> +    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
>> +    if (ret < 0) {
>> +        error_setg(errp, "Cannot make hidden disk empty");
>> +        return;
>> +    }
>> +}
>> +
>> +static void reopen_backing_file(BDRVReplicationState *s, bool writable,
>> +                                Error **errp)
>> +{
>> +    BlockReopenQueue *reopen_queue = NULL;
>> +    int orig_hidden_flags, orig_secondary_flags;
>> +    int new_hidden_flags, new_secondary_flags;
>> +    Error *local_err = NULL;
>> +
>> +    if (writable) {
>> +        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
>> +        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
>> +        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
>> +        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
>> +    } else {
>> +        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
>> +        new_hidden_flags = s->orig_hidden_flags;
>> +        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
>> +        new_secondary_flags = s->orig_secondary_flags;
>> +    }
>> +
>> +    if (orig_hidden_flags != new_hidden_flags) {
>> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
>> +                                         new_hidden_flags);
>> +    }
>> +
>> +    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
>> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
>> +                                         new_secondary_flags);
>> +    }
>> +
>> +    if (reopen_queue) {
>> +        bdrv_reopen_multiple(reopen_queue, &local_err);
>> +        if (local_err != NULL) {
>> +            error_propagate(errp, local_err);
>> +        }
>> +    }
>> +}
>> +
>> +static void backup_job_cleanup(BDRVReplicationState *s)
>> +{
>> +    bdrv_op_unblock_all(s->top_bs, s->blocker);
>> +    error_free(s->blocker);
>> +    reopen_backing_file(s, false, NULL);
>> +}
>> +
>> +static void backup_job_completed(void *opaque, int ret)
>> +{
>> +    BDRVReplicationState *s = opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
>> +        /* The backup job is cancelled unexpectedly */
>> +        s->error = -EIO;
>> +    }
>> +
>> +    backup_job_cleanup(s);
>> +}
>> +
>> +static BlockDriverState *get_top_bs(BlockDriverState *bs)
>> +{
>> +    BdrvChild *child;
>> +
>> +    while (!bs->blk) {
>> +        if (QLIST_EMPTY(&bs->parents)) {
>> +            return NULL;
>> +        }
>> +
>> +        child = QLIST_FIRST(&bs->parents);
>> +        if (QLIST_NEXT(child, next_parent)) {
>> +            return NULL;
>> +        }
>> +
>> +        bs = child->parent;
>> +    }
>> +
>> +    return bs;
>> +}
>> +
>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +    int64_t active_length, hidden_length, disk_length;
>> +    AioContext *aio_context;
>> +    Error *local_err = NULL;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
>> +        error_setg(errp, "Block replication is running or done");
>> +        return;
>> +    }
>> +
>> +    if (s->mode != mode) {
>> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
>> +                   " but receives %d", s->mode, mode);
>> +        return;
>> +    }
>> +
>> +    switch (s->mode) {
>> +    case REPLICATION_MODE_PRIMARY:
>> +        break;
>> +    case REPLICATION_MODE_SECONDARY:
>> +        s->active_disk = bs->file->bs;
>> +        if (!bs->file->bs->backing) {
>> +            error_setg(errp, "Active disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->hidden_disk = s->active_disk->backing->bs;
>> +        if (!s->hidden_disk->backing) {
>> +            error_setg(errp, "Hidden disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->secondary_disk = s->hidden_disk->backing->bs;
>> +        if (!s->secondary_disk->blk) {
>> +            error_setg(errp, "The secondary disk doesn't have block backend");
>> +            return;
>> +        }
> 
> Kevin: Is code allowed to stash away BlockDriverState pointers for
> convenience or should it keep the BdrvChild pointers instead?  In order
> for replication to work as expected, the graph shouldn't change but for
> consistency maybe BdrvChild is best.
> 
>> +
>> +        s->top_bs = get_top_bs(bs);
>> +        if (!s->top_bs) {
>> +            error_setg(errp, "Cannot get the top block driver state to do"
>> +                       " internal backup");
>> +            return;
>> +        }
>> +
>> +        /* verify the length */
>> +        active_length = bdrv_getlength(s->active_disk);
>> +        hidden_length = bdrv_getlength(s->hidden_disk);
>> +        disk_length = bdrv_getlength(s->secondary_disk);
>> +        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
>> +            active_length != hidden_length || hidden_length != disk_length) {
>> +            error_setg(errp, "active disk, hidden disk, secondary disk's length"
>> +                       " are not the same");
>> +            return;
>> +        }
>> +
>> +        if (!s->active_disk->drv->bdrv_make_empty ||
>> +            !s->hidden_disk->drv->bdrv_make_empty) {
>> +            error_setg(errp,
>> +                       "active disk or hidden disk doesn't support make_empty");
>> +            return;
>> +        }
>> +
>> +        /* reopen the backing file in r/w mode */
>> +        reopen_backing_file(s, true, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +
>> +        /* start backup job now */
>> +        error_setg(&s->blocker,
>> +                   "block device is in use by internal backup job");
>> +        bdrv_op_block_all(s->top_bs, s->blocker);
>> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
>> +        bdrv_ref(s->hidden_disk);
> 
> Why is the explicit reference to hidden_disk (but not secondary_disk or
> active_disk) is necessary?

IIRC, we should reference the backup target before calling backup_start(),
and we will reference the backup source in backup_start().

> 
>> +
>> +        aio_context = bdrv_get_aio_context(bs);
>> +        aio_context_acquire(aio_context);
>> +        backup_start(s->secondary_disk, s->hidden_disk, 0,
>> +                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
>> +                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
>> +                     s, NULL, &local_err);
>> +        aio_context_release(aio_context);
> 
> This is the only place that uses AioContext to protect against other
> threads accessing BlockDriverState at the same time.  Any operation like
> replication_start/stop/checkpoint where you are not sure if the calling
> thread is the same thread that holds the AioContext must use
> aio_context_acquire/release() to protect against race conditions.
> 
> For example, replication_do_checkpoint() might be called from a thread
> that doesn't hold AioContext so it must acquire brs->bs AioContext
> before accessing BDRVReplicationState.

OK, I will check it and fix it in the next version.

Thanks
Wen Congyang

> 
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            backup_job_cleanup(s);
>> +            bdrv_unref(s->hidden_disk);
>> +            return;
>> +        }
>> +        break;
>> +    default:
>> +        abort();
>> +    }
>> +
>> +    s->replication_state = BLOCK_REPLICATION_RUNNING;
>> +
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        secondary_do_checkpoint(s, errp);
>> +    }
>> +
>> +    s->error = 0;
>> +}
>> +
>> +
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
>> +{
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    if (s->error) {
>> +        error_setg(errp, "I/O error occurs");
>> +        return;
>> +    }
>> +
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        secondary_do_checkpoint(s, errp);
>> +    }
>> +}
>> +
>> +static void replication_done(void *opaque, int ret)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BDRVReplicationState *s = bs->opaque;
>> +
>> +    if (ret == 0) {
>> +        s->replication_state = BLOCK_REPLICATION_DONE;
>> +
>> +        /* refresh top bs's filename */
>> +        bdrv_refresh_filename(bs);
>> +        s->active_disk = NULL;
>> +        s->secondary_disk = NULL;
>> +        s->hidden_disk = NULL;
>> +        s->error = 0;
>> +    } else {
>> +        s->replication_state = BLOCK_REPLICATION_FAILOVER_FAILED;
>> +        s->error = -EIO;
>> +    }
>> +}
>> +
>> +static void replication_stop(BlockReplicationState *brs, bool failover, Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    switch (s->mode) {
>> +    case REPLICATION_MODE_PRIMARY:
>> +        s->replication_state = BLOCK_REPLICATION_DONE;
>> +        s->error = 0;
>> +        break;
>> +    case REPLICATION_MODE_SECONDARY:
>> +        if (!failover) {
>> +            /*
>> +             * This BDS will be closed, and the job should be completed
>> +             * before the BDS is closed, because we will access hidden
>> +             * disk, secondary disk in backup_job_completed().
>> +             */
>> +            if (s->secondary_disk->job) {
>> +                block_job_cancel_sync(s->secondary_disk->job);
>> +            }
>> +            secondary_do_checkpoint(s, errp);
>> +            s->replication_state = BLOCK_REPLICATION_DONE;
>> +            return;
>> +        }
>> +
>> +        s->replication_state = BLOCK_REPLICATION_FAILOVER;
>> +        if (s->secondary_disk->job) {
>> +            block_job_cancel(s->secondary_disk->job);
>> +        }
>> +
>> +        commit_active_start(s->active_disk, s->secondary_disk, 0,
>> +                            BLOCKDEV_ON_ERROR_REPORT, replication_done,
>> +                            bs, errp, true);
>> +        break;
>> +    default:
>> +        abort();
>> +    }
>> +}
>> +
>> +BlockDriver bdrv_replication = {
>> +    .format_name                = "replication",
>> +    .protocol_name              = "replication",
>> +    .instance_size              = sizeof(BDRVReplicationState),
>> +
>> +    .bdrv_open                  = replication_open,
>> +    .bdrv_close                 = replication_close,
>> +
>> +    .bdrv_getlength             = replication_getlength,
>> +    .bdrv_co_readv              = replication_co_readv,
>> +    .bdrv_co_writev             = replication_co_writev,
>> +    .bdrv_co_discard            = replication_co_discard,
>> +
>> +    .is_filter                  = true,
>> +    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
>> +
>> +    .has_variable_length        = true,
>> +};
>> +
>> +static void bdrv_replication_init(void)
>> +{
>> +    bdrv_register(&bdrv_replication);
>> +}
>> +
>> +block_init(bdrv_replication_init);
>> diff --git a/include/block/replication-comm.h b/include/block/replication-comm.h
>> new file mode 100644
>> index 0000000..6483e56
>> --- /dev/null
>> +++ b/include/block/replication-comm.h
>> @@ -0,0 +1,50 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef REPLICATION_H
>> +#define REPLICATION_H
>> +
>> +#include "qemu/queue.h"
>> +#include "block/block_int.h"
>> +
>> +typedef struct BlockReplicationOps BlockReplicationOps;
>> +typedef struct BlockReplicationState BlockReplicationState;
>> +typedef void (*Start)(BlockReplicationState *brs, ReplicationMode mode, Error **errp);
>> +typedef void (*Stop)(BlockReplicationState *brs, bool failover, Error **errp);
>> +typedef void (*Checkpoint)(BlockReplicationState *brs, Error **errp);
>> +
>> +struct BlockReplicationState {
>> +    BlockDriverState *bs;
>> +    BlockReplicationOps *ops;
>> +    QLIST_ENTRY(BlockReplicationState) node;
>> +};
>> +
>> +struct BlockReplicationOps{
>> +    Start start;
>> +    Stop stop;
>> +    Checkpoint checkpoint;
>> +};
>> +
>> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
>> +                                             BlockReplicationOps *ops);
>> +
>> +void block_replication_remove(BlockReplicationState *brs);
>> +
>> +void block_replication_start_all(ReplicationMode mode, Error **errp);
>> +
>> +void block_replication_do_checkpoint_all(Error **errp);
>> +
>> +void block_replication_stop_all(bool failover, Error **errp);
>> +
>> +#endif /* REPLICATION_H */
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 2e86ede..d73f46b 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1975,6 +1975,19 @@
>>              '*read-pattern': 'QuorumReadPattern' } }
>>  
>>  ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>> +
>> +##
>>  # @BlockdevOptions
>>  #
>>  # Options for creating a block device.
>> -- 
>> 1.9.3
>>
>>
>>

Stefan Hajnoczi Jan. 28, 2016, 3:15 p.m. UTC | #5

On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> > On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> >> +{
> >> +    Error *local_err = NULL;
> >> +    int ret;
> >> +
> >> +    if (!s->secondary_disk->job) {
> >> +        error_setg(errp, "Backup job is cancelled unexpectedly");
> >> +        return;
> >> +    }
> >> +
> >> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +
> >> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
> > 
> > What happens to in-flight requests to the active and hidden disks?
> 
> we MUST call do_checkpoint() when the vm is stopped.

Please document the environment under which the block replication
callback functions run.

I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
time if the disk is slow/failing.  bdrv_drain_all() blocks until all
in-flight I/O requests have completed.  What does the Primary do if the
Secondary becomes unresponsive?

> >> +    switch (s->mode) {
> >> +    case REPLICATION_MODE_PRIMARY:
> >> +        break;
> >> +    case REPLICATION_MODE_SECONDARY:
> >> +        s->active_disk = bs->file->bs;
> >> +        if (!bs->file->bs->backing) {
> >> +            error_setg(errp, "Active disk doesn't have backing file");
> >> +            return;
> >> +        }
> >> +
> >> +        s->hidden_disk = s->active_disk->backing->bs;
> >> +        if (!s->hidden_disk->backing) {
> >> +            error_setg(errp, "Hidden disk doesn't have backing file");
> >> +            return;
> >> +        }
> >> +
> >> +        s->secondary_disk = s->hidden_disk->backing->bs;
> >> +        if (!s->secondary_disk->blk) {
> >> +            error_setg(errp, "The secondary disk doesn't have block backend");
> >> +            return;
> >> +        }
> > 
> > Kevin: Is code allowed to stash away BlockDriverState pointers for
> > convenience or should it keep the BdrvChild pointers instead?  In order
> > for replication to work as expected, the graph shouldn't change but for
> > consistency maybe BdrvChild is best.

I asked Kevin about this on IRC and he agreed that BdrvChild should be
used instead of holding on to BlockDriverState * pointers.  Although
these pointers will not change during replication (if the op blockers
are set up correctly), it's more consistent and certainly safer to go
through BdrvChild.

> >> +        /* start backup job now */
> >> +        error_setg(&s->blocker,
> >> +                   "block device is in use by internal backup job");
> >> +        bdrv_op_block_all(s->top_bs, s->blocker);
> >> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
> >> +        bdrv_ref(s->hidden_disk);
> > 
> > Why is the explicit reference to hidden_disk (but not secondary_disk or
> > active_disk) is necessary?
> 
> IIRC, we should reference the backup target before calling backup_start(),
> and we will reference the backup source in backup_start().

I'm not sure why this is necessary since they are part of the backing
chain.

If it is necessary, please add a comment so it's clear why the reference
is being taken.

Stefan

Changlong Xie Jan. 29, 2016, 3:13 a.m. UTC | #6

On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>>>> +{
>>>> +    Error *local_err = NULL;
>>>> +    int ret;
>>>> +
>>>> +    if (!s->secondary_disk->job) {
>>>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
>>>
>>> What happens to in-flight requests to the active and hidden disks?
>>
>> we MUST call do_checkpoint() when the vm is stopped.
>
> Please document the environment under which the block replication
> callback functions run.

OK

>
> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> in-flight I/O requests have completed.  What does the Primary do if the
> Secondary becomes unresponsive?

Actually, we knew this problem. But currently, there seems no better way 
to resolve it. If you have any ideas?

>
>>>> +    switch (s->mode) {
>>>> +    case REPLICATION_MODE_PRIMARY:
>>>> +        break;
>>>> +    case REPLICATION_MODE_SECONDARY:
>>>> +        s->active_disk = bs->file->bs;
>>>> +        if (!bs->file->bs->backing) {
>>>> +            error_setg(errp, "Active disk doesn't have backing file");
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        s->hidden_disk = s->active_disk->backing->bs;
>>>> +        if (!s->hidden_disk->backing) {
>>>> +            error_setg(errp, "Hidden disk doesn't have backing file");
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        s->secondary_disk = s->hidden_disk->backing->bs;
>>>> +        if (!s->secondary_disk->blk) {
>>>> +            error_setg(errp, "The secondary disk doesn't have block backend");
>>>> +            return;
>>>> +        }
>>>
>>> Kevin: Is code allowed to stash away BlockDriverState pointers for
>>> convenience or should it keep the BdrvChild pointers instead?  In order
>>> for replication to work as expected, the graph shouldn't change but for
>>> consistency maybe BdrvChild is best.
>
> I asked Kevin about this on IRC and he agreed that BdrvChild should be
> used instead of holding on to BlockDriverState * pointers.  Although
> these pointers will not change during replication (if the op blockers
> are set up correctly), it's more consistent and certainly safer to go
> through BdrvChild.

Ok

>
>>>> +        /* start backup job now */
>>>> +        error_setg(&s->blocker,
>>>> +                   "block device is in use by internal backup job");
>>>> +        bdrv_op_block_all(s->top_bs, s->blocker);
>>>> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
>>>> +        bdrv_ref(s->hidden_disk);
>>>
>>> Why is the explicit reference to hidden_disk (but not secondary_disk or
>>> active_disk) is necessary?
>>
>> IIRC, we should reference the backup target before calling backup_start(),
>> and we will reference the backup source in backup_start().
>
> I'm not sure why this is necessary since they are part of the backing
> chain.
>

Just as Wen said, we should reference the backup target before calling 
backup_start() to protect it from destroying, if backup job is stopped 
unexpectedly.

> If it is necessary, please add a comment so it's clear why the reference
> is being taken.
>

Ok

> Stefan
>

Stefan Hajnoczi Jan. 29, 2016, 3:46 p.m. UTC | #7

On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >in-flight I/O requests have completed.  What does the Primary do if the
> >Secondary becomes unresponsive?
> 
> Actually, we knew this problem. But currently, there seems no better way to
> resolve it. If you have any ideas?

Is it possible to hold the checkpoint information and acknowledge the
checkpoint right away, without waiting for bdrv_drain_all() or any
Secondory guest activity to complete?

I think this really means falling back to microcheckpointing until the
Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
would prevent vcpus from running and when the last pending I/O finishes
the Secondary could apply the last checkpoint.  This approach does not
block QEMU (the monitor, etc).

Wen Congyang Feb. 1, 2016, 1:13 a.m. UTC | #8

On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>> in-flight I/O requests have completed.  What does the Primary do if the
>>> Secondary becomes unresponsive?
>>
>> Actually, we knew this problem. But currently, there seems no better way to
>> resolve it. If you have any ideas?
> 
> Is it possible to hold the checkpoint information and acknowledge the
> checkpoint right away, without waiting for bdrv_drain_all() or any
> Secondory guest activity to complete?

There is no way to know that secondary becomes unreponsive.

> 
> I think this really means falling back to microcheckpointing until the
> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> would prevent vcpus from running and when the last pending I/O finishes
> the Secondary could apply the last checkpoint.  This approach does not
> block QEMU (the monitor, etc).
> 

If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
We should do failover in this case.

Thanks
Wen Congyang

Stefan Hajnoczi Feb. 2, 2016, 2:34 p.m. UTC | #9

On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> > On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> >> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >>> in-flight I/O requests have completed.  What does the Primary do if the
> >>> Secondary becomes unresponsive?
> >>
> >> Actually, we knew this problem. But currently, there seems no better way to
> >> resolve it. If you have any ideas?
> > 
> > Is it possible to hold the checkpoint information and acknowledge the
> > checkpoint right away, without waiting for bdrv_drain_all() or any
> > Secondory guest activity to complete?
> 
> There is no way to know that secondary becomes unreponsive.

I meant whether it is necessary for the Secondary to vm_stop() and apply
the checkpoint before acknowledging the checkpoint to the Primary?

> > I think this really means falling back to microcheckpointing until the
> > Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> > would prevent vcpus from running and when the last pending I/O finishes
> > the Secondary could apply the last checkpoint.  This approach does not
> > block QEMU (the monitor, etc).
> > 
> 
> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
> We should do failover in this case.

This is dangerous because it means that a delay/failure in the Secondary
would cause the Primary to fail over to the broken Secondary.  All the
more reason not to perform blocking operations on the Secondary in the
checkpoint code path.

Stefan

Wen Congyang Feb. 3, 2016, 1:29 a.m. UTC | #10

On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>> Secondary becomes unresponsive?
>>>>
>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>> resolve it. If you have any ideas?
>>>
>>> Is it possible to hold the checkpoint information and acknowledge the
>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>> Secondory guest activity to complete?
>>
>> There is no way to know that secondary becomes unreponsive.
> 
> I meant whether it is necessary for the Secondary to vm_stop() and apply
> the checkpoint before acknowledging the checkpoint to the Primary?

I don't understand this.
Here is the COLO checkpoint flow:

    Primary                                                Secondary
    new checkpoint notice                 --->
    vm_stop()                                              vm_stop()
    vm state(device state, memory, cpu)   --->
                                                           load state
                                          <---             done
    vm_start()                                             vm_start()
> 
>>> I think this really means falling back to microcheckpointing until the
>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>> would prevent vcpus from running and when the last pending I/O finishes
>>> the Secondary could apply the last checkpoint.  This approach does not
>>> block QEMU (the monitor, etc).
>>>
>>
>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>> We should do failover in this case.
> 
> This is dangerous because it means that a delay/failure in the Secondary
> would cause the Primary to fail over to the broken Secondary.  All the
> more reason not to perform blocking operations on the Secondary in the
> checkpoint code path.

If the secondary is broken, primary qemu will take over.

Thanks
Wen Congyang

> 
> Stefan
>

Stefan Hajnoczi Feb. 3, 2016, 9:32 a.m. UTC | #11

On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> > On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
> >> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> >>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >>>>> in-flight I/O requests have completed.  What does the Primary do if the
> >>>>> Secondary becomes unresponsive?
> >>>>
> >>>> Actually, we knew this problem. But currently, there seems no better way to
> >>>> resolve it. If you have any ideas?
> >>>
> >>> Is it possible to hold the checkpoint information and acknowledge the
> >>> checkpoint right away, without waiting for bdrv_drain_all() or any
> >>> Secondory guest activity to complete?
> >>
> >> There is no way to know that secondary becomes unreponsive.
> > 
> > I meant whether it is necessary for the Secondary to vm_stop() and apply
> > the checkpoint before acknowledging the checkpoint to the Primary?
> 
> I don't understand this.
> Here is the COLO checkpoint flow:
> 
>     Primary                                                Secondary
>     new checkpoint notice                 --->
>     vm_stop()                                              vm_stop()
>     vm state(device state, memory, cpu)   --->
>                                                            load state
>                                           <---             done
>     vm_start()                                             vm_start()

If the Secondary's vm_stop() call blocks then the Primary is stuck too.

I was wondering whether the Secondary can do:

<---  done
      vm_stop()
      load state

It simply receives the checkpoint data into a buffer and immediately
replies with "done".  vm_stop() and load state is only performed after
sending "done".

The advantage is that the Primary will not be delayed by the Secondary.
It's an approach that doesn't block.

But perhaps it's a problem if the Secondary is slower than the Primary
since the Secondary still needs to complete vm_stop() and load state
before it can resume execution?

> >>> I think this really means falling back to microcheckpointing until the
> >>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> >>> would prevent vcpus from running and when the last pending I/O finishes
> >>> the Secondary could apply the last checkpoint.  This approach does not
> >>> block QEMU (the monitor, etc).
> >>>
> >>
> >> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
> >> We should do failover in this case.
> > 
> > This is dangerous because it means that a delay/failure in the Secondary
> > would cause the Primary to fail over to the broken Secondary.  All the
> > more reason not to perform blocking operations on the Secondary in the
> > checkpoint code path.
> 
> If the secondary is broken, primary qemu will take over.

Does the Primary use a timeout between "new checkpoint notice" and
Secondary's "done" so it can move on if the Secondary is unresponsive?

Stefan

Wen Congyang Feb. 3, 2016, 9:55 a.m. UTC | #12

On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>>>> Secondary becomes unresponsive?
>>>>>>
>>>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>>>> resolve it. If you have any ideas?
>>>>>
>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>> Secondory guest activity to complete?
>>>>
>>>> There is no way to know that secondary becomes unreponsive.
>>>
>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>
>> I don't understand this.
>> Here is the COLO checkpoint flow:
>>
>>     Primary                                                Secondary
>>     new checkpoint notice                 --->
>>     vm_stop()                                              vm_stop()
>>     vm state(device state, memory, cpu)   --->
>>                                                            load state
>>                                           <---             done
>>     vm_start()                                             vm_start()
> 
> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
> 
> I was wondering whether the Secondary can do:
> 
> <---  done
>       vm_stop()
>       load state
> 
> It simply receives the checkpoint data into a buffer and immediately
> replies with "done".  vm_stop() and load state is only performed after
> sending "done".

Secondary vm is running, so we should also get the pages that are dirtied
by secondary vm, but not dirtied by primary vm.
We have two ways to do it:
1. Cache all original memory in the secondary qemu
2. Send the dirty pfn list to primary qemu, and get it.

If we ack the checkpoint and the call vm_stop(), we only can select 1. It
means that secondary qemu costs more memory.
In COLO mode, we will compare the output socket, and will do checkpoint if
the application level data is different. If we ack the checkpoint and the
call vm_stop(), the client can not get any more data until secondary vm
is running again. So we still 'wait' the secondary vm.


> 
> The advantage is that the Primary will not be delayed by the Secondary.
> It's an approach that doesn't block.
> 
> But perhaps it's a problem if the Secondary is slower than the Primary
> since the Secondary still needs to complete vm_stop() and load state
> before it can resume execution?
> 
>>>>> I think this really means falling back to microcheckpointing until the
>>>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>>>> would prevent vcpus from running and when the last pending I/O finishes
>>>>> the Secondary could apply the last checkpoint.  This approach does not
>>>>> block QEMU (the monitor, etc).
>>>>>
>>>>
>>>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>>>> We should do failover in this case.
>>>
>>> This is dangerous because it means that a delay/failure in the Secondary
>>> would cause the Primary to fail over to the broken Secondary.  All the
>>> more reason not to perform blocking operations on the Secondary in the
>>> checkpoint code path.
>>
>> If the secondary is broken, primary qemu will take over.
> 
> Does the Primary use a timeout between "new checkpoint notice" and
> Secondary's "done" so it can move on if the Secondary is unresponsive?

To hailiang:
IIRC, we don't use a timeout but I think we can do it. In our design, there is
an exteranl heartbeat to check primary and secondary status, and decide when
to do checkpoint.

Thanks
Wen Congyang


> 
> Stefan
>

Zhanghailiang Feb. 3, 2016, 11:25 a.m. UTC | #13

On 2016/2/3 17:55, Wen Congyang wrote:
> On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
>> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>>>>> Secondary becomes unresponsive?
>>>>>>>
>>>>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>>>>> resolve it. If you have any ideas?
>>>>>>
>>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>>> Secondory guest activity to complete?
>>>>>
>>>>> There is no way to know that secondary becomes unreponsive.
>>>>
>>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>>
>>> I don't understand this.
>>> Here is the COLO checkpoint flow:
>>>
>>>      Primary                                                Secondary
>>>      new checkpoint notice                 --->
>>>      vm_stop()                                              vm_stop()
>>>      vm state(device state, memory, cpu)   --->
>>>                                                             load state
>>>                                            <---             done
>>>      vm_start()                                             vm_start()
>>
>> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
>>
>> I was wondering whether the Secondary can do:
>>
>> <---  done
>>        vm_stop()
>>        load state
>>
>> It simply receives the checkpoint data into a buffer and immediately
>> replies with "done".  vm_stop() and load state is only performed after
>> sending "done".
>
> Secondary vm is running, so we should also get the pages that are dirtied
> by secondary vm, but not dirtied by primary vm.
> We have two ways to do it:
> 1. Cache all original memory in the secondary qemu
> 2. Send the dirty pfn list to primary qemu, and get it.
>
> If we ack the checkpoint and the call vm_stop(), we only can select 1. It
> means that secondary qemu costs more memory.
> In COLO mode, we will compare the output socket, and will do checkpoint if
> the application level data is different. If we ack the checkpoint and the
> call vm_stop(), the client can not get any more data until secondary vm
> is running again. So we still 'wait' the secondary vm.
>
>
>>
>> The advantage is that the Primary will not be delayed by the Secondary.
>> It's an approach that doesn't block.
>>
>> But perhaps it's a problem if the Secondary is slower than the Primary
>> since the Secondary still needs to complete vm_stop() and load state
>> before it can resume execution?
>>
>>>>>> I think this really means falling back to microcheckpointing until the
>>>>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>>>>> would prevent vcpus from running and when the last pending I/O finishes
>>>>>> the Secondary could apply the last checkpoint.  This approach does not
>>>>>> block QEMU (the monitor, etc).
>>>>>>
>>>>>
>>>>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>>>>> We should do failover in this case.
>>>>
>>>> This is dangerous because it means that a delay/failure in the Secondary
>>>> would cause the Primary to fail over to the broken Secondary.  All the
>>>> more reason not to perform blocking operations on the Secondary in the
>>>> checkpoint code path.
>>>
>>> If the secondary is broken, primary qemu will take over.
>>
>> Does the Primary use a timeout between "new checkpoint notice" and
>> Secondary's "done" so it can move on if the Secondary is unresponsive?
>
> To hailiang:
> IIRC, we don't use a timeout but I think we can do it. In our design, there is

Yes, we may need a timeout to help detecting the unresponsive case
which can not be caught by the external heartbeat module.
I will investigate it.

Thanks,
Hailiang

> an exteranl heartbeat to check primary and secondary status, and decide when
> to do checkpoint.
>
> Thanks
> Wen Congyang
>
>
>>
>> Stefan
>>
>
>
>
>
> .
>

[v14,7/8] Implement new driver for block replication

Commit Message

Comments

Patch