diff mbox

[v10,08/10] Implement new driver for block replication

Message ID 1443161858-20533-9-git-send-email-wency@cn.fujitsu.com
State New
Headers show

Commit Message

Wen Congyang Sept. 25, 2015, 6:17 a.m. UTC
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 471 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 472 insertions(+)
 create mode 100644 block/replication.c

Comments

Stefan Hajnoczi Oct. 12, 2015, 4:25 p.m. UTC | #1
On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> +static void backup_job_completed(void *opaque, int ret)
> +{
> +    BDRVReplicationState *s = opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
> +        /* The backup job is cancelled unexpectedly */
> +        s->error = -EIO;
> +    }
> +
> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> +                  s->active_disk->backing_blocker);
> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> +                  s->hidden_disk->backing_blocker);
> +
> +    bdrv_put_ref_bh_schedule(s->secondary_disk);

Why is bdrv_put_ref_bh_schedule() necessary?
Stefan Hajnoczi Oct. 12, 2015, 4:27 p.m. UTC | #2
On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> +        /* start backup job now */
> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> +                        s->active_disk->backing_blocker);
> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> +                        s->hidden_disk->backing_blocker);

Why is it safe to unblock these operations?

Why do they have to be blocked for non-replication users?

Stefan
Stefan Hajnoczi Oct. 12, 2015, 4:31 p.m. UTC | #3
On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> +static void replication_start(BlockDriverState *bs, ReplicationMode mode,
> +                              Error **errp)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int64_t active_length, hidden_length, disk_length;
> +    AioContext *aio_context;
> +    Error *local_err = NULL;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
> +        error_setg(errp, "Block replication is running or done");
> +        return;
> +    }
> +
> +    if (s->mode != mode) {
> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
> +                   " but receives %d", s->mode, mode);
> +        return;
> +    }
> +
> +    switch (s->mode) {
> +    case REPLICATION_MODE_PRIMARY:
> +        break;
> +    case REPLICATION_MODE_SECONDARY:
> +        s->active_disk = bs->file;
> +        if (!bs->file->backing_hd) {
> +            error_setg(errp, "Active disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->hidden_disk = s->active_disk->backing_hd;
> +        if (!s->hidden_disk->backing_hd) {
> +            error_setg(errp, "Hidden disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->secondary_disk = s->hidden_disk->backing_hd;
> +        if (!s->secondary_disk->blk) {
> +            error_setg(errp, "The secondary disk doesn't have block backend");
> +            return;
> +        }
...
> +        aio_context = bdrv_get_aio_context(bs);
> +        aio_context_acquire(aio_context);
> +        bdrv_set_aio_context(s->secondary_disk, aio_context);

Why is this bdrv_set_aio_context() call necessary?

Child BDS nodes are in the same AioContext as their parents.  Other
block jobs need something like this because they operate on a second BDS
which is not bs' backing file chain.  I think you have a different
situation here so it's not needed.
Wen Congyang Oct. 13, 2015, 8:59 a.m. UTC | #4
On 10/13/2015 12:25 AM, Stefan Hajnoczi wrote:
> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>> +static void backup_job_completed(void *opaque, int ret)
>> +{
>> +    BDRVReplicationState *s = opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
>> +        /* The backup job is cancelled unexpectedly */
>> +        s->error = -EIO;
>> +    }
>> +
>> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>> +                  s->active_disk->backing_blocker);
>> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>> +                  s->hidden_disk->backing_blocker);
>> +
>> +    bdrv_put_ref_bh_schedule(s->secondary_disk);
> 
> Why is bdrv_put_ref_bh_schedule() necessary?

It is copied from block_job_cb(). According to the comments in bdrv_put_ref_bh_schedule():
/*
 * Release a BDS reference in a BH
 *
 * It is not safe to use bdrv_unref() from a callback function when the callers
 * still need the BlockDriverState.  In such cases we schedule a BH to release
 * the reference.
 */

If the comment is right, I think it is necessary to call bdrv_put_ref_bh_schedule() here.
Because the job is created on the BDS s->secondary disk, backup_job_completed() is
called in block_job_completed(), and we will still use s->secondary_disk in block_job_release().

Thanks
Wen Congyang

> .
>
Wen Congyang Oct. 13, 2015, 9:08 a.m. UTC | #5
On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>> +        /* start backup job now */
>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>> +                        s->active_disk->backing_blocker);
>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>> +                        s->hidden_disk->backing_blocker);
> 
> Why is it safe to unblock these operations?
> 
> Why do they have to be blocked for non-replication users?

hidden_disk and secondary disk are opened as backing file, so it is blocked for
non-replication users.
What can I do if I don't unblock it and want to do backup?

Thanks
Wen Congyang

> 
> Stefan
> .
>
Wen Congyang Oct. 13, 2015, 9:09 a.m. UTC | #6
On 10/13/2015 12:31 AM, Stefan Hajnoczi wrote:
> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>> +static void replication_start(BlockDriverState *bs, ReplicationMode mode,
>> +                              Error **errp)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int64_t active_length, hidden_length, disk_length;
>> +    AioContext *aio_context;
>> +    Error *local_err = NULL;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
>> +        error_setg(errp, "Block replication is running or done");
>> +        return;
>> +    }
>> +
>> +    if (s->mode != mode) {
>> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
>> +                   " but receives %d", s->mode, mode);
>> +        return;
>> +    }
>> +
>> +    switch (s->mode) {
>> +    case REPLICATION_MODE_PRIMARY:
>> +        break;
>> +    case REPLICATION_MODE_SECONDARY:
>> +        s->active_disk = bs->file;
>> +        if (!bs->file->backing_hd) {
>> +            error_setg(errp, "Active disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->hidden_disk = s->active_disk->backing_hd;
>> +        if (!s->hidden_disk->backing_hd) {
>> +            error_setg(errp, "Hidden disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->secondary_disk = s->hidden_disk->backing_hd;
>> +        if (!s->secondary_disk->blk) {
>> +            error_setg(errp, "The secondary disk doesn't have block backend");
>> +            return;
>> +        }
> ...
>> +        aio_context = bdrv_get_aio_context(bs);
>> +        aio_context_acquire(aio_context);
>> +        bdrv_set_aio_context(s->secondary_disk, aio_context);
> 
> Why is this bdrv_set_aio_context() call necessary?
> 
> Child BDS nodes are in the same AioContext as their parents.  Other
> block jobs need something like this because they operate on a second BDS
> which is not bs' backing file chain.  I think you have a different
> situation here so it's not needed.

I think you are right. I will check it and remove it.

Thanks
Wen Congyang

> .
>
Fam Zheng Oct. 13, 2015, 9:41 a.m. UTC | #7
On Tue, 10/13 16:59, Wen Congyang wrote:
> On 10/13/2015 12:25 AM, Stefan Hajnoczi wrote:
> > On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> >> +static void backup_job_completed(void *opaque, int ret)
> >> +{
> >> +    BDRVReplicationState *s = opaque;
> >> +
> >> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
> >> +        /* The backup job is cancelled unexpectedly */
> >> +        s->error = -EIO;
> >> +    }
> >> +
> >> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> >> +                  s->active_disk->backing_blocker);
> >> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> >> +                  s->hidden_disk->backing_blocker);
> >> +
> >> +    bdrv_put_ref_bh_schedule(s->secondary_disk);
> > 
> > Why is bdrv_put_ref_bh_schedule() necessary?
> 
> It is copied from block_job_cb(). According to the comments in bdrv_put_ref_bh_schedule():
> /*
>  * Release a BDS reference in a BH
>  *
>  * It is not safe to use bdrv_unref() from a callback function when the callers
>  * still need the BlockDriverState.  In such cases we schedule a BH to release
>  * the reference.
>  */
> 
> If the comment is right, I think it is necessary to call bdrv_put_ref_bh_schedule() here.
> Because the job is created on the BDS s->secondary disk, backup_job_completed() is
> called in block_job_completed(), and we will still use s->secondary_disk in block_job_release().

Where is the matching bdrv_ref called?

Fam
Wen Congyang Oct. 13, 2015, 9:46 a.m. UTC | #8
On 10/13/2015 05:41 PM, Fam Zheng wrote:
> On Tue, 10/13 16:59, Wen Congyang wrote:
>> On 10/13/2015 12:25 AM, Stefan Hajnoczi wrote:
>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>>>> +static void backup_job_completed(void *opaque, int ret)
>>>> +{
>>>> +    BDRVReplicationState *s = opaque;
>>>> +
>>>> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
>>>> +        /* The backup job is cancelled unexpectedly */
>>>> +        s->error = -EIO;
>>>> +    }
>>>> +
>>>> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>>>> +                  s->active_disk->backing_blocker);
>>>> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>>>> +                  s->hidden_disk->backing_blocker);
>>>> +
>>>> +    bdrv_put_ref_bh_schedule(s->secondary_disk);
>>>
>>> Why is bdrv_put_ref_bh_schedule() necessary?
>>
>> It is copied from block_job_cb(). According to the comments in bdrv_put_ref_bh_schedule():
>> /*
>>  * Release a BDS reference in a BH
>>  *
>>  * It is not safe to use bdrv_unref() from a callback function when the callers
>>  * still need the BlockDriverState.  In such cases we schedule a BH to release
>>  * the reference.
>>  */
>>
>> If the comment is right, I think it is necessary to call bdrv_put_ref_bh_schedule() here.
>> Because the job is created on the BDS s->secondary disk, backup_job_completed() is
>> called in block_job_completed(), and we will still use s->secondary_disk in block_job_release().
> 
> Where is the matching bdrv_ref called?

It is in block_job_create()....

source: we call in bdrv_ref() in block_job_create(), and the user should unref it.
target: the user call bdrv_ref(), and we will unref it in the job

I don't know why we design it like this...

Thanks
Wen Congyang

> 
> Fam
> .
>
Fam Zheng Oct. 13, 2015, 10:12 a.m. UTC | #9
On Tue, 10/13 17:46, Wen Congyang wrote:
> On 10/13/2015 05:41 PM, Fam Zheng wrote:
> > On Tue, 10/13 16:59, Wen Congyang wrote:
> >> On 10/13/2015 12:25 AM, Stefan Hajnoczi wrote:
> >>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> >>>> +static void backup_job_completed(void *opaque, int ret)
> >>>> +{
> >>>> +    BDRVReplicationState *s = opaque;
> >>>> +
> >>>> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
> >>>> +        /* The backup job is cancelled unexpectedly */
> >>>> +        s->error = -EIO;
> >>>> +    }
> >>>> +
> >>>> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> >>>> +                  s->active_disk->backing_blocker);
> >>>> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> >>>> +                  s->hidden_disk->backing_blocker);
> >>>> +
> >>>> +    bdrv_put_ref_bh_schedule(s->secondary_disk);
> >>>
> >>> Why is bdrv_put_ref_bh_schedule() necessary?
> >>
> >> It is copied from block_job_cb(). According to the comments in bdrv_put_ref_bh_schedule():
> >> /*
> >>  * Release a BDS reference in a BH
> >>  *
> >>  * It is not safe to use bdrv_unref() from a callback function when the callers
> >>  * still need the BlockDriverState.  In such cases we schedule a BH to release
> >>  * the reference.
> >>  */
> >>
> >> If the comment is right, I think it is necessary to call bdrv_put_ref_bh_schedule() here.
> >> Because the job is created on the BDS s->secondary disk, backup_job_completed() is
> >> called in block_job_completed(), and we will still use s->secondary_disk in block_job_release().
> > 
> > Where is the matching bdrv_ref called?
> 
> It is in block_job_create()....
> 
> source: we call in bdrv_ref() in block_job_create(), and the user should unref it.
> target: the user call bdrv_ref(), and we will unref it in the job
> 
> I don't know why we design it like this...
> 

Maybe it's better to unref it in block_job_release. Then we can simply drop
bdrv_put_ref_bh_schedule.

Fam
Wen Congyang Oct. 14, 2015, 12:55 a.m. UTC | #10
On 10/13/2015 06:12 PM, Fam Zheng wrote:
> On Tue, 10/13 17:46, Wen Congyang wrote:
>> On 10/13/2015 05:41 PM, Fam Zheng wrote:
>>> On Tue, 10/13 16:59, Wen Congyang wrote:
>>>> On 10/13/2015 12:25 AM, Stefan Hajnoczi wrote:
>>>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>>>>>> +static void backup_job_completed(void *opaque, int ret)
>>>>>> +{
>>>>>> +    BDRVReplicationState *s = opaque;
>>>>>> +
>>>>>> +    if (s->replication_state != BLOCK_REPLICATION_DONE) {
>>>>>> +        /* The backup job is cancelled unexpectedly */
>>>>>> +        s->error = -EIO;
>>>>>> +    }
>>>>>> +
>>>>>> +    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>>>>>> +                  s->active_disk->backing_blocker);
>>>>>> +    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>>>>>> +                  s->hidden_disk->backing_blocker);
>>>>>> +
>>>>>> +    bdrv_put_ref_bh_schedule(s->secondary_disk);
>>>>>
>>>>> Why is bdrv_put_ref_bh_schedule() necessary?
>>>>
>>>> It is copied from block_job_cb(). According to the comments in bdrv_put_ref_bh_schedule():
>>>> /*
>>>>  * Release a BDS reference in a BH
>>>>  *
>>>>  * It is not safe to use bdrv_unref() from a callback function when the callers
>>>>  * still need the BlockDriverState.  In such cases we schedule a BH to release
>>>>  * the reference.
>>>>  */
>>>>
>>>> If the comment is right, I think it is necessary to call bdrv_put_ref_bh_schedule() here.
>>>> Because the job is created on the BDS s->secondary disk, backup_job_completed() is
>>>> called in block_job_completed(), and we will still use s->secondary_disk in block_job_release().
>>>
>>> Where is the matching bdrv_ref called?
>>
>> It is in block_job_create()....
>>
>> source: we call in bdrv_ref() in block_job_create(), and the user should unref it.
>> target: the user call bdrv_ref(), and we will unref it in the job
>>
>> I don't know why we design it like this...
>>
> 
> Maybe it's better to unref it in block_job_release. Then we can simply drop
> bdrv_put_ref_bh_schedule.

I agree with it.

Thanks
Wen Congyang

> 
> Fam
> .
>
Stefan Hajnoczi Oct. 14, 2015, 2:27 p.m. UTC | #11
On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
> > On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> >> +        /* start backup job now */
> >> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> >> +                        s->active_disk->backing_blocker);
> >> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> >> +                        s->hidden_disk->backing_blocker);
> > 
> > Why is it safe to unblock these operations?
> > 
> > Why do they have to be blocked for non-replication users?
> 
> hidden_disk and secondary disk are opened as backing file, so it is blocked for
> non-replication users.
> What can I do if I don't unblock it and want to do backup?

CCing Jeff Cody, block jobs maintainer

You need to explain why it is safe remove this protection.  We can't
merge code that may be unsafe.

I think we can investigate further by asking: when does QEMU code assume
the backing file is read-only?

I haven't checked but these cases come to mind:

Operations that move data between BDS in the backing chain (e.g. commit
and stream block jobs) will lose or overwrite data if the backing file
is being written to by another coroutine.

We need to prevent users from running these operations at the same time.

Also, accessing bs->backing_blocker is a layering violation.  No one
outside block.c:bdrv_set_backing_hd() is supposed to access this field.

Let's figure out the safety concerns first and then the
bs->backing_blocker access will probably be eliminated as part of the
solution.

Stefan
Wen Congyang Oct. 15, 2015, 2:19 a.m. UTC | #12
On 10/14/2015 10:27 PM, Stefan Hajnoczi wrote:
> On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
>> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>>>> +        /* start backup job now */
>>>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>>>> +                        s->active_disk->backing_blocker);
>>>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>>>> +                        s->hidden_disk->backing_blocker);
>>>
>>> Why is it safe to unblock these operations?
>>>
>>> Why do they have to be blocked for non-replication users?
>>
>> hidden_disk and secondary disk are opened as backing file, so it is blocked for
>> non-replication users.
>> What can I do if I don't unblock it and want to do backup?
> 
> CCing Jeff Cody, block jobs maintainer
> 
> You need to explain why it is safe remove this protection.  We can't
> merge code that may be unsafe.
> 
> I think we can investigate further by asking: when does QEMU code assume
> the backing file is read-only?

The backing file is opened in read-only mode. I want to reopen it in read-write
mode here in the next version(So the patch 1 will be dropped)

> 
> I haven't checked but these cases come to mind:
> 
> Operations that move data between BDS in the backing chain (e.g. commit
> and stream block jobs) will lose or overwrite data if the backing file
> is being written to by another coroutine.
> 
> We need to prevent users from running these operations at the same time.

Yes, but qemu doesn't provide such API.

> 
> Also, accessing bs->backing_blocker is a layering violation.  No one
> outside block.c:bdrv_set_backing_hd() is supposed to access this field.

I agree with it.

Thanks
Wen Congyang

> 
> Let's figure out the safety concerns first and then the
> bs->backing_blocker access will probably be eliminated as part of the
> solution.
> 
> Stefan
> .
>
Stefan Hajnoczi Oct. 15, 2015, 2:55 p.m. UTC | #13
On Thu, Oct 15, 2015 at 10:19:17AM +0800, Wen Congyang wrote:
> On 10/14/2015 10:27 PM, Stefan Hajnoczi wrote:
> > On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
> >> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
> >>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> >>>> +        /* start backup job now */
> >>>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> >>>> +                        s->active_disk->backing_blocker);
> >>>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> >>>> +                        s->hidden_disk->backing_blocker);
> >>>
> >>> Why is it safe to unblock these operations?
> >>>
> >>> Why do they have to be blocked for non-replication users?
> >>
> >> hidden_disk and secondary disk are opened as backing file, so it is blocked for
> >> non-replication users.
> >> What can I do if I don't unblock it and want to do backup?
> > 
> > CCing Jeff Cody, block jobs maintainer
> > 
> > You need to explain why it is safe remove this protection.  We can't
> > merge code that may be unsafe.
> > 
> > I think we can investigate further by asking: when does QEMU code assume
> > the backing file is read-only?
> 
> The backing file is opened in read-only mode. I want to reopen it in read-write
> mode here in the next version(So the patch 1 will be dropped)
> 
> > 
> > I haven't checked but these cases come to mind:
> > 
> > Operations that move data between BDS in the backing chain (e.g. commit
> > and stream block jobs) will lose or overwrite data if the backing file
> > is being written to by another coroutine.
> > 
> > We need to prevent users from running these operations at the same time.
> 
> Yes, but qemu doesn't provide such API.

This series can't be merged unless it is safe.

Have you looked at op blockers and thought about how to prevent unsafe
operations?

Stefan
Wen Congyang Oct. 16, 2015, 12:52 a.m. UTC | #14
On 10/15/2015 10:55 PM, Stefan Hajnoczi wrote:
> On Thu, Oct 15, 2015 at 10:19:17AM +0800, Wen Congyang wrote:
>> On 10/14/2015 10:27 PM, Stefan Hajnoczi wrote:
>>> On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
>>>> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
>>>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>>>>>> +        /* start backup job now */
>>>>>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>>>>>> +                        s->active_disk->backing_blocker);
>>>>>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>>>>>> +                        s->hidden_disk->backing_blocker);
>>>>>
>>>>> Why is it safe to unblock these operations?
>>>>>
>>>>> Why do they have to be blocked for non-replication users?
>>>>
>>>> hidden_disk and secondary disk are opened as backing file, so it is blocked for
>>>> non-replication users.
>>>> What can I do if I don't unblock it and want to do backup?
>>>
>>> CCing Jeff Cody, block jobs maintainer
>>>
>>> You need to explain why it is safe remove this protection.  We can't
>>> merge code that may be unsafe.
>>>
>>> I think we can investigate further by asking: when does QEMU code assume
>>> the backing file is read-only?
>>
>> The backing file is opened in read-only mode. I want to reopen it in read-write
>> mode here in the next version(So the patch 1 will be dropped)
>>
>>>
>>> I haven't checked but these cases come to mind:
>>>
>>> Operations that move data between BDS in the backing chain (e.g. commit
>>> and stream block jobs) will lose or overwrite data if the backing file
>>> is being written to by another coroutine.
>>>
>>> We need to prevent users from running these operations at the same time.
>>
>> Yes, but qemu doesn't provide such API.
> 
> This series can't be merged unless it is safe.
> 
> Have you looked at op blockers and thought about how to prevent unsafe
> operations?

Hmm, only block jobs will write to the backing file? If so, op blockers can
prevent unsafe operations.

Thanks
Wen Congyang

> 
> Stefan
> .
>
Wen Congyang Oct. 16, 2015, 2:22 a.m. UTC | #15
On 10/15/2015 10:55 PM, Stefan Hajnoczi wrote:
> On Thu, Oct 15, 2015 at 10:19:17AM +0800, Wen Congyang wrote:
>> On 10/14/2015 10:27 PM, Stefan Hajnoczi wrote:
>>> On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
>>>> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
>>>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
>>>>>> +        /* start backup job now */
>>>>>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
>>>>>> +                        s->active_disk->backing_blocker);
>>>>>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
>>>>>> +                        s->hidden_disk->backing_blocker);
>>>>>
>>>>> Why is it safe to unblock these operations?
>>>>>
>>>>> Why do they have to be blocked for non-replication users?
>>>>
>>>> hidden_disk and secondary disk are opened as backing file, so it is blocked for
>>>> non-replication users.
>>>> What can I do if I don't unblock it and want to do backup?
>>>
>>> CCing Jeff Cody, block jobs maintainer
>>>
>>> You need to explain why it is safe remove this protection.  We can't
>>> merge code that may be unsafe.
>>>
>>> I think we can investigate further by asking: when does QEMU code assume
>>> the backing file is read-only?
>>
>> The backing file is opened in read-only mode. I want to reopen it in read-write
>> mode here in the next version(So the patch 1 will be dropped)
>>
>>>
>>> I haven't checked but these cases come to mind:
>>>
>>> Operations that move data between BDS in the backing chain (e.g. commit
>>> and stream block jobs) will lose or overwrite data if the backing file
>>> is being written to by another coroutine.
>>>
>>> We need to prevent users from running these operations at the same time.
>>
>> Yes, but qemu doesn't provide such API.
> 
> This series can't be merged unless it is safe.
> 
> Have you looked at op blockers and thought about how to prevent unsafe
> operations?

What about this solution:
1. unblock it in bdrv_set_backing_hd()
2. block it in qmp_block_commit(), qmp_block_stream(), qmp_block_backup()..., to
   prevent unsafe operations

Thanks
Wen Congyang

> 
> Stefan
> .
>
Stefan Hajnoczi Oct. 16, 2015, 11:37 a.m. UTC | #16
On Fri, Oct 16, 2015 at 10:22:05AM +0800, Wen Congyang wrote:
> On 10/15/2015 10:55 PM, Stefan Hajnoczi wrote:
> > On Thu, Oct 15, 2015 at 10:19:17AM +0800, Wen Congyang wrote:
> >> On 10/14/2015 10:27 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Oct 13, 2015 at 05:08:17PM +0800, Wen Congyang wrote:
> >>>> On 10/13/2015 12:27 AM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Sep 25, 2015 at 02:17:36PM +0800, Wen Congyang wrote:
> >>>>>> +        /* start backup job now */
> >>>>>> +        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
> >>>>>> +                        s->active_disk->backing_blocker);
> >>>>>> +        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
> >>>>>> +                        s->hidden_disk->backing_blocker);
> >>>>>
> >>>>> Why is it safe to unblock these operations?
> >>>>>
> >>>>> Why do they have to be blocked for non-replication users?
> >>>>
> >>>> hidden_disk and secondary disk are opened as backing file, so it is blocked for
> >>>> non-replication users.
> >>>> What can I do if I don't unblock it and want to do backup?
> >>>
> >>> CCing Jeff Cody, block jobs maintainer
> >>>
> >>> You need to explain why it is safe remove this protection.  We can't
> >>> merge code that may be unsafe.
> >>>
> >>> I think we can investigate further by asking: when does QEMU code assume
> >>> the backing file is read-only?
> >>
> >> The backing file is opened in read-only mode. I want to reopen it in read-write
> >> mode here in the next version(So the patch 1 will be dropped)
> >>
> >>>
> >>> I haven't checked but these cases come to mind:
> >>>
> >>> Operations that move data between BDS in the backing chain (e.g. commit
> >>> and stream block jobs) will lose or overwrite data if the backing file
> >>> is being written to by another coroutine.
> >>>
> >>> We need to prevent users from running these operations at the same time.
> >>
> >> Yes, but qemu doesn't provide such API.
> > 
> > This series can't be merged unless it is safe.
> > 
> > Have you looked at op blockers and thought about how to prevent unsafe
> > operations?
> 
> What about this solution:
> 1. unblock it in bdrv_set_backing_hd()
> 2. block it in qmp_block_commit(), qmp_block_stream(), qmp_block_backup()..., to
>    prevent unsafe operations

Come to think of it, currently QEMU only supports 1 block job per BDS.

This means that as long as COLO has a backup job running, no other block
jobs can interfere.

There still might be a risk with monitor commands like 'commit'.

Stefan
diff mbox

Patch

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@  block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 0000000..813f610
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,471 @@ 
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+
+typedef struct BDRVReplicationState {
+    ReplicationMode mode;
+    int replication_state;
+    BlockDriverState *active_disk;
+    BlockDriverState *hidden_disk;
+    BlockDriverState *secondary_disk;
+    int error;
+} BDRVReplicationState;
+
+enum {
+    BLOCK_REPLICATION_NONE,     /* block replication is not started */
+    BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+    BLOCK_REPLICATION_DONE,     /* block replication is done(failover) */
+};
+
+#define COMMIT_CLUSTER_BITS 16
+#define COMMIT_CLUSTER_SIZE (1 << COMMIT_CLUSTER_BITS)
+#define COMMIT_SECTORS_PER_CLUSTER (COMMIT_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp);
+
+#define REPLICATION_MODE        "mode"
+static QemuOptsList replication_runtime_opts = {
+    .name = "replication",
+    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+    .desc = {
+        {
+            .name = REPLICATION_MODE,
+            .type = QEMU_OPT_STRING,
+        },
+        { /* end of list */ }
+    },
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+                            int flags, Error **errp)
+{
+    int ret;
+    BDRVReplicationState *s = bs->opaque;;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    const char *mode;
+
+    ret = -EINVAL;
+    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        goto fail;
+    }
+
+    mode = qemu_opt_get(opts, REPLICATION_MODE);
+    if (!mode) {
+        error_setg(&local_err, "Missing the option mode");
+        goto fail;
+    }
+
+    if (!strcmp(mode, "primary")) {
+        s->mode = REPLICATION_MODE_PRIMARY;
+    } else if (!strcmp(mode, "secondary")) {
+        s->mode = REPLICATION_MODE_SECONDARY;
+    } else {
+        error_setg(&local_err,
+                   "The option mode's value should be primary or secondary");
+        goto fail;
+    }
+
+    ret = 0;
+
+fail:
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+        replication_stop(bs, false, NULL);
+    }
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+    switch (s->replication_state) {
+    case BLOCK_REPLICATION_NONE:
+        return -EIO;
+    case BLOCK_REPLICATION_RUNNING:
+        return 0;
+    case BLOCK_REPLICATION_DONE:
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+    default:
+        abort();
+    }
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        return ret;
+    }
+
+    if (ret < 0) {
+        s->error = ret;
+        ret = 0;
+    }
+
+    return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             int remaining_sectors,
+                                             QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        /* We only use it to forward primary write requests */
+        return -EIO;
+    }
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /*
+     * After failover, because we don't commit active disk/hidden disk
+     * to secondary disk, so we should read from active disk directly.
+     */
+    ret = bdrv_co_readv(bs->file, sector_num, remaining_sectors, qiov);
+    return replication_return_value(s, ret);
+}
+
+static coroutine_fn int replication_co_writev(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int remaining_sectors,
+                                              QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    QEMUIOVector hd_qiov;
+    uint64_t bytes_done = 0;
+    BlockDriverState *top = bs->file;
+    BlockDriverState *base = s->secondary_disk;
+    BlockDriverState *target;
+    int ret, n;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 0) {
+        ret = bdrv_co_writev(bs->file, sector_num, remaining_sectors, qiov);
+        return replication_return_value(s, ret);
+    }
+
+    /*
+     * Only write to active disk if the sectors have
+     * already been allocated in active disk/hidden disk.
+     */
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+    while (remaining_sectors > 0) {
+        ret = bdrv_is_allocated_above(top, base, sector_num,
+                                      remaining_sectors, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
+
+        target = ret ? top: base;
+        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
+        if (ret < 0) {
+            return ret;
+        }
+
+        remaining_sectors -= n;
+        sector_num += n;
+        bytes_done += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static coroutine_fn int replication_co_discard(BlockDriverState *bs,
+                                               int64_t sector_num,
+                                               int nb_sectors)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 1) {
+        /* It is secondary qemu and we are after failover */
+        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    ret = bdrv_co_discard(bs->file, sector_num, nb_sectors);
+    return replication_return_value(s, ret);
+}
+
+static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
+                                                    BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file, candidate);
+}
+
+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!s->secondary_disk->job) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make active disk empty");
+        return;
+    }
+
+    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make hidden disk empty");
+        return;
+    }
+}
+
+static void backup_job_completed(void *opaque, int ret)
+{
+    BDRVReplicationState *s = opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_DONE) {
+        /* The backup job is cancelled unexpectedly */
+        s->error = -EIO;
+    }
+
+    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                  s->active_disk->backing_blocker);
+    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                  s->hidden_disk->backing_blocker);
+
+    bdrv_put_ref_bh_schedule(s->secondary_disk);
+}
+
+static void replication_start(BlockDriverState *bs, ReplicationMode mode,
+                              Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int64_t active_length, hidden_length, disk_length;
+    AioContext *aio_context;
+    Error *local_err = NULL;
+
+    if (s->replication_state != BLOCK_REPLICATION_NONE) {
+        error_setg(errp, "Block replication is running or done");
+        return;
+    }
+
+    if (s->mode != mode) {
+        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
+                   " but receives %d", s->mode, mode);
+        return;
+    }
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        s->active_disk = bs->file;
+        if (!bs->file->backing_hd) {
+            error_setg(errp, "Active disk doesn't have backing file");
+            return;
+        }
+
+        s->hidden_disk = s->active_disk->backing_hd;
+        if (!s->hidden_disk->backing_hd) {
+            error_setg(errp, "Hidden disk doesn't have backing file");
+            return;
+        }
+
+        s->secondary_disk = s->hidden_disk->backing_hd;
+        if (!s->secondary_disk->blk) {
+            error_setg(errp, "The secondary disk doesn't have block backend");
+            return;
+        }
+
+        /* verify the length */
+        active_length = bdrv_getlength(s->active_disk);
+        hidden_length = bdrv_getlength(s->hidden_disk);
+        disk_length = bdrv_getlength(s->secondary_disk);
+        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
+            active_length != hidden_length || hidden_length != disk_length) {
+            error_setg(errp, "active disk, hidden disk, secondary disk's length"
+                       " are not the same");
+            return;
+        }
+
+        if (!s->active_disk->drv->bdrv_make_empty ||
+            !s->hidden_disk->drv->bdrv_make_empty) {
+            error_setg(errp,
+                       "active disk or hidden disk doesn't support make_empty");
+            return;
+        }
+
+        /* start backup job now */
+        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                        s->active_disk->backing_blocker);
+        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                        s->hidden_disk->backing_blocker);
+        bdrv_ref(s->hidden_disk);
+
+        aio_context = bdrv_get_aio_context(bs);
+        aio_context_acquire(aio_context);
+        bdrv_set_aio_context(s->secondary_disk, aio_context);
+        backup_start(s->secondary_disk, s->hidden_disk, 0,
+                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
+                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
+                     s, &local_err);
+        aio_context_release(aio_context);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                          s->active_disk->backing_blocker);
+            bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                          s->hidden_disk->backing_blocker);
+            bdrv_unref(s->hidden_disk);
+            return;
+        }
+        break;
+    default:
+        abort();
+    }
+
+    s->replication_state = BLOCK_REPLICATION_RUNNING;
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+
+    s->error = 0;
+}
+
+static void replication_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    if (s->error) {
+        error_setg(errp, "I/O error occurs");
+        return;
+    }
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+}
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    s->replication_state = BLOCK_REPLICATION_DONE;
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        if (!failover) {
+            /*
+             * The guest will be shutdown, and secondary disk is the
+             * same as the primary disk. Just make active disk and
+             * hidden disk empty.
+             */
+            secondary_do_checkpoint(s, errp);
+            return;
+        }
+
+        if (s->secondary_disk->job) {
+            block_job_cancel(s->secondary_disk->job);
+        }
+        break;
+    default:
+        abort();
+    }
+}
+
+BlockDriver bdrv_replication = {
+    .format_name                = "replication",
+    .protocol_name              = "replication",
+    .instance_size              = sizeof(BDRVReplicationState),
+
+    .bdrv_open                  = replication_open,
+    .bdrv_close                 = replication_close,
+
+    .bdrv_getlength             = replication_getlength,
+    .bdrv_co_readv              = replication_co_readv,
+    .bdrv_co_writev             = replication_co_writev,
+    .bdrv_co_discard            = replication_co_discard,
+
+    .is_filter                  = true,
+    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
+
+    .bdrv_start_replication     = replication_start,
+    .bdrv_do_checkpoint         = replication_do_checkpoint,
+    .bdrv_stop_replication      = replication_stop,
+
+    .has_variable_length        = true,
+};
+
+static void bdrv_replication_init(void)
+{
+    bdrv_register(&bdrv_replication);
+}
+
+block_init(bdrv_replication_init);