
[RFC,01/14] docs: block replication's description

Message ID 1423710438-14377-2-git-send-email-wency@cn.fujitsu.com
State New

Commit Message

Wen Congyang Feb. 12, 2015, 3:07 a.m. UTC
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 docs/block-replication.txt

Comments

Fam Zheng Feb. 12, 2015, 7:21 a.m. UTC | #1
Hi Congyang,

On Thu, 02/12 11:07, Wen Congyang wrote:
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.

I'm a little confused by the tenses ("will be" versus "are") and terms. I am
reading them as "s/will be/are/g"

Why do you need this buffer?

If both primary and secondary write to the same sector, what is saved in the
buffer?

Fam

> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
Wen Congyang Feb. 12, 2015, 7:40 a.m. UTC | #2
On 02/12/2015 03:21 PM, Fam Zheng wrote:
> Hi Congyang,
> 
> On Thu, 02/12 11:07, Wen Congyang wrote:
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
> 
> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> reading them as "s/will be/are/g"
> 
> Why do you need this buffer?

We only sync the disk until the next checkpoint. Before the next checkpoint, the
secondary VM writes to the buffer.

> 
> If both primary and secondary write to the same sector, what is saved in the
> buffer?

The primary content will be written to the secondary disk, and the secondary content
is saved in the buffer.

Thanks
Wen Congyang

> 
> Fam
> 
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
> .
>
Fam Zheng Feb. 12, 2015, 8:44 a.m. UTC | #3
On Thu, 02/12 15:40, Wen Congyang wrote:
> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> > Hi Congyang,
> > 
> > On Thu, 02/12 11:07, Wen Congyang wrote:
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> > 
> > I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> > reading them as "s/will be/are/g"
> > 
> > Why do you need this buffer?
> 
> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> vm write to the buffer.
> 
> > 
> > If both primary and secondary write to the same sector, what is saved in the
> > buffer?
> 
> The primary content will be written to the secondary disk, and the secondary content
> is saved in the buffer.

I wonder if alternatively this is possible with an imaginary "writable backing
image" feature, as described below.

When we have a normal backing chain,

               {virtio-blk dev 'foo'}
                         |
                         |
                         |
    [base] <- [mid] <- (foo)

Where [base] and [mid] are read-only and (foo) is writable. When we add an overlay
on top of an existing image,

               {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
                         |                              |
                         |                              |
                         |                              |
    [base] <- [mid] <- (foo)  <---------------------- (bar)

It's important to make sure that writes to 'foo' don't break data for 'bar'.
We can utilize an automatic hidden drive-backup target:

               {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
                         |                                                          |
                         |                                                          |
                         v                                                          v

    [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)

                         v                              ^
                         v                              ^
                         v                              ^
                         v                              ^
                         >>>> drive-backup sync=none >>>>

So when guest writes to 'foo', the old data is moved to (hidden target), which
remains unchanged from (bar)'s PoV.

The drive in the middle is called hidden because QEMU creates it automatically;
the naming is arbitrary.

It is interesting because it is a more generalized case of image fleecing,
where the (hidden target) is exposed via an NBD server for data scanning
(read-only) purposes.

More interestingly, with the above facility, it is also possible to create a
guest-visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
cheaply. Or call it a shadow copy if you will.
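
To make this concrete, a minimal hand-rolled sketch could look like the
following (file and device names are invented for illustration, QEMU would
normally create the hidden target automatically, and the sync=none backup mode
is exactly what the image fleecing work adds):

    # overlay chain: foo <- hidden <- bar (all qcow2)
    qemu-img create -f qcow2 -b foo.qcow2 hidden.qcow2
    qemu-img create -f qcow2 -b hidden.qcow2 bar.qcow2

    # QMP: before each guest write to 'foo', copy the old sector contents
    # into the hidden target, so 'bar' keeps seeing the point-in-time data
    { "execute": "drive-backup",
      "arguments": { "device": "foo", "target": "hidden.qcow2",
                     "sync": "none", "mode": "existing" } }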

Back to the COLO case, the configuration will be very similar:


                      {primary wr}                                                {secondary vm}
                            |                                                           |
                            |                                                           |
                            |                                                           |
                            v                                                           v

   [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)

                            v                              ^
                            v                              ^
                            v                              ^
                            v                              ^
                            >>>> drive-backup sync=none >>>>

The workflow analogue is:

> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.

Primary write requests are forwarded to secondary QEMU as well.

> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.

Before Primary write requests are written to (nbd target), aka the Secondary
disk, the original sector content is read from it and copied to (hidden buf
disk) by drive-backup. It obviously will not overwrite the data in (active
disk).

> >> +    3) Primary write requests will be written to Secondary disk.

Primary write requests are written to (nbd target).

> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.

Secondary write requests are written to (active disk) as usual.

Finally, when a checkpoint arrives, if you want to sync with the primary, just drop
the data in (hidden buf disk) and (active disk); when failover happens, if you
want to promote the secondary VM, you can commit (active disk) to (nbd target) and
drop the data in (hidden buf disk).
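
As a rough sketch of those two actions (again with invented names, assuming the
images can be safely detached or the operations wired up inside QEMU; dropping
buffered data by recreating the empty overlays is only one conceivable
implementation):

    # checkpoint: discard secondary-side state by recreating the empty overlays
    qemu-img create -f qcow2 -b nbd-target.qcow2 hidden-buf.qcow2
    qemu-img create -f qcow2 -b hidden-buf.qcow2 active-disk.qcow2

    # failover: with the hidden buffer already emptied, fold the active disk's
    # data down into the nbd target via QMP block-commit
    { "execute": "block-commit",
      "arguments": { "device": "secondary0", "top": "active-disk.qcow2",
                     "base": "nbd-target.qcow2" } }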

Fam
Wen Congyang Feb. 12, 2015, 9:33 a.m. UTC | #4
On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>

What is the active disk? Are there two disk images?

Thanks
Wen Congyang

> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
>
Yang Hongyang Feb. 12, 2015, 9:36 a.m. UTC | #5
Hi Fam,

在 02/12/2015 04:44 PM, Fam Zheng 写道:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
>
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
>
> When we have a normal backing chain,
>
>                 {virtio-blk dev 'foo'}
>                           |
>                           |
>                           |
>      [base] <- [mid] <- (foo)
>
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
>
>                 {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                           |                              |
>                           |                              |
>                           |                              |
>      [base] <- [mid] <- (foo)  <---------------------- (bar)
>
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
>
>                 {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                           |                                                          |
>                           |                                                          |
>                           v                                                          v
>
>      [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>
>                           v                              ^
>                           v                              ^
>                           v                              ^
>                           v                              ^
>                           >>>> drive-backup sync=none >>>>
>
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
>
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
>
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
>
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
>
> Back to the COLO case, the configuration will be very similar:
>
>
>                        {primary wr}                                                {secondary vm}
>                              |                                                           |
>                              |                                                           |
>                              |                                                           |
>                              v                                                           v
>
>     [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>
>                              v                              ^
>                              v                              ^
>                              v                              ^
>                              v                              ^
>                              >>>> drive-backup sync=none >>>>
>
> The workflow analogue is:
>
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>
> Primary write requests are forwarded to secondary QEMU as well.
>
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
>
>>>> +    3) Primary write requests will be written to Secondary disk.
>
> Primary write requests are written to (nbd target).
>
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
>
> Secondary write request will be written in (active disk) as usual.
>
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).

If I understand correctly, you split the Disk Buffer into a hidden buf disk +
an active disk. What we need to implement is only a buf disk (to be used as the
hidden buf disk and the active disk mentioned above); apart from this, we can
use existing mechanisms like backing files and drive-backup?

>
> Fam
> .
>
Fam Zheng Feb. 12, 2015, 9:44 a.m. UTC | #6
On Thu, 02/12 17:33, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> 
> What is active disk? There are two disk images?

It starts as an empty image with (hidden buf disk) as backing file, which in
turn has (nbd target) as backing file.
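
Roughly, assuming plain qcow2 files (names invented for illustration), the
chain could be built and inspected like this:

    qemu-img create -f qcow2 -b secondary-disk.img hidden-buf.qcow2
    qemu-img create -f qcow2 -b hidden-buf.qcow2 active-disk.qcow2

    # prints info for active-disk.qcow2, hidden-buf.qcow2 and
    # secondary-disk.img in turn
    qemu-img info --backing-chain active-disk.qcow2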

Fam

> 
> Thanks
> Wen Congyang
> 
> > 
> > The workflow analogue is:
> > 
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> > 
> > Primary write requests are forwarded to secondary QEMU as well.
> > 
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> > 
> > Before Primary write requests are written to (nbd target), aka the Secondary
> > disk, the orignal sector content is read from it and copied to (hidden buf
> > disk) by drive-backup. It obviously will not overwrite the data in (active
> > disk).
> > 
> >>>> +    3) Primary write requests will be written to Secondary disk.
> > 
> > Primary write requests are written to (nbd target).
> > 
> >>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>> +       will overwrite the existing sector content in the buffer.
> > 
> > Secondary write request will be written in (active disk) as usual.
> > 
> > Finally, when checkpoint arrives, if you want to sync with primary, just drop
> > data in (hidden buf disk) and (active disk); when failover happends, if you
> > want to promote secondary vm, you can commit (active disk) to (nbd target), and
> > drop data in (hidden buf disk).
> > 
> > Fam
> > .
> > 
> 
>
Fam Zheng Feb. 12, 2015, 9:46 a.m. UTC | #7
On Thu, 02/12 17:36, Hongyang Yang wrote:
> Hi Fam,
> 
> 在 02/12/2015 04:44 PM, Fam Zheng 写道:
> >On Thu, 02/12 15:40, Wen Congyang wrote:
> >>On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>Hi Congyang,
> >>>
> >>>On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>+== Workflow ==
> >>>>+The following is the image of block replication workflow:
> >>>>+
> >>>>+        +----------------------+            +------------------------+
> >>>>+        |Primary Write Requests|            |Secondary Write Requests|
> >>>>+        +----------------------+            +------------------------+
> >>>>+                  |                                       |
> >>>>+                  |                                      (4)
> >>>>+                  |                                       V
> >>>>+                  |                              /-------------\
> >>>>+                  |      Copy and Forward        |             |
> >>>>+                  |---------(1)----------+       | Disk Buffer |
> >>>>+                  |                      |       |             |
> >>>>+                  |                     (3)      \-------------/
> >>>>+                  |                 speculative      ^
> >>>>+                  |                write through    (2)
> >>>>+                  |                      |           |
> >>>>+                  V                      V           |
> >>>>+           +--------------+           +----------------+
> >>>>+           | Primary Disk |           | Secondary Disk |
> >>>>+           +--------------+           +----------------+
> >>>>+
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >>>
> >>>I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>reading them as "s/will be/are/g"
> >>>
> >>>Why do you need this buffer?
> >>
> >>We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>vm write to the buffer.
> >>
> >>>
> >>>If both primary and secondary write to the same sector, what is saved in the
> >>>buffer?
> >>
> >>The primary content will be written to the secondary disk, and the secondary content
> >>is saved in the buffer.
> >
> >I wonder if alternatively this is possible with an imaginary "writable backing
> >image" feature, as described below.
> >
> >When we have a normal backing chain,
> >
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> >
> >Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >to an existing image on top,
> >
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >
> >It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >We can utilize an automatic hidden drive-backup target:
> >
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> >
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> >
> >So when guest writes to 'foo', the old data is moved to (hidden target), which
> >remains unchanged from (bar)'s PoV.
> >
> >The drive in the middle is called hidden because QEMU creates it automatically,
> >the naming is arbitrary.
> >
> >It is interesting because it is a more generalized case of image fleecing,
> >where the (hidden target) is exposed via NBD server for data scanning (read
> >only) purpose.
> >
> >More interestingly, with above facility, it is also possible to create a guest
> >visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >cheaply. Or call it shadow copy if you will.
> >
> >Back to the COLO case, the configuration will be very similar:
> >
> >
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> >
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> >
> >The workflow analogue is:
> >
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >
> >Primary write requests are forwarded to secondary QEMU as well.
> >
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >
> >Before Primary write requests are written to (nbd target), aka the Secondary
> >disk, the orignal sector content is read from it and copied to (hidden buf
> >disk) by drive-backup. It obviously will not overwrite the data in (active
> >disk).
> >
> >>>>+    3) Primary write requests will be written to Secondary disk.
> >
> >Primary write requests are written to (nbd target).
> >
> >>>>+    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>+       will overwrite the existing sector content in the buffer.
> >
> >Secondary write request will be written in (active disk) as usual.
> >
> >Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >data in (hidden buf disk) and (active disk); when failover happends, if you
> >want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >drop data in (hidden buf disk).
> 
> If I understand correctly, you split the Disk Buffer into a hidden buf disk +
> an active disk. What we need to do is only to implement a buf disk(will be
> used as hidden buf disk and active disk as mentioned), apart from this, we can
> use the existing mechinism like backing-file/drive-backup?
> 

Yes, but you need a separate driver to take care of the buffer logic as
introduced in this series, which is less generic, but does the same thing we
will need in the image fleecing use case.

Fam
Wen Congyang Feb. 12, 2015, 10:11 a.m. UTC | #8
On 02/12/2015 05:44 PM, Fam Zheng wrote:
> On Thu, 02/12 17:33, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>>
>>> It is interesting because it is a more generalized case of image fleecing,
>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>> only) purpose.
>>>
>>> More interestingly, with above facility, it is also possible to create a guest
>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>> cheaply. Or call it shadow copy if you will.
>>>
>>> Back to the COLO case, the configuration will be very similar:
>>>
>>>
>>>                       {primary wr}                                                {secondary vm}
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             v                                                           v
>>>
>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             >>>> drive-backup sync=none >>>>
>>
>> What is active disk? There are two disk images?
> 
> It starts as an empty image with (hidden buf disk) as backing file, which in
> turn has (nbd target) as backing file.

It's too complicated, and I don't understand it.
1. What is the active disk? Does it use raw or a new block driver?
2. Does the hidden buf disk use a new block driver?
3. Is the nbd target the hidden buf disk's backing image? If it is opened read-only, we will
   export an NBD device with a read-only BlockDriverState, but the NBD server needs to write to it.

Thanks
Wen Congyang

> 
> Fam
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> The workflow analogue is:
>>>
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>
>>> Primary write requests are forwarded to secondary QEMU as well.
>>>
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>
>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>> disk).
>>>
>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>
>>> Primary write requests are written to (nbd target).
>>>
>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>> +       will overwrite the existing sector content in the buffer.
>>>
>>> Secondary write request will be written in (active disk) as usual.
>>>
>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>> drop data in (hidden buf disk).
>>>
>>> Fam
>>> .
>>>
>>
>>
> .
>
Fam Zheng Feb. 12, 2015, 10:26 a.m. UTC | #9
On Thu, 02/12 18:11, Wen Congyang wrote:
> On 02/12/2015 05:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 17:33, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>>
> >>> It is interesting because it is a more generalized case of image fleecing,
> >>> where the (hidden target) is exposed via NBD server for data scanning (read
> >>> only) purpose.
> >>>
> >>> More interestingly, with above facility, it is also possible to create a guest
> >>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >>> cheaply. Or call it shadow copy if you will.
> >>>
> >>> Back to the COLO case, the configuration will be very similar:
> >>>
> >>>
> >>>                       {primary wr}                                                {secondary vm}
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             v                                                           v
> >>>
> >>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >>>
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             >>>> drive-backup sync=none >>>>
> >>
> >> What is active disk? There are two disk images?
> > 
> > It starts as an empty image with (hidden buf disk) as backing file, which in
> > turn has (nbd target) as backing file.
> 
> It's too complicated..., and I don't understand it.
> 1. What is active disk? Use raw or a new block driver?

It is an empty qcow2 image with the same length as your Secondary Disk.

> 2. Hidden buf disk use new block driver?

It is an empty qcow2 image with the same length as your Secondary Disk, too.

> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.

NBD target is your Secondary Disk. It is opened read-write.

The patches to enable opening it read-write, and to start drive-backup
between it and the hidden buf disk, are all work in progress; they are the
core concept of image fleecing.
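
For concreteness, here is a minimal sketch of how the two empty overlays
could be created on the secondary host. The file names, and the assumption
that the Secondary Disk is a raw image called secondary.img, are made up
for illustration:

    # hidden buf disk: empty qcow2 overlay on top of the Secondary Disk
    qemu-img create -f qcow2 \
        -o backing_file=secondary.img,backing_fmt=raw hidden-buf.qcow2

    # active disk: empty qcow2 overlay on top of the hidden buf disk
    qemu-img create -f qcow2 \
        -o backing_file=hidden-buf.qcow2,backing_fmt=qcow2 active.qcow2

Because a qcow2 overlay inherits its virtual size from its backing file,
both images automatically get the same length as the Secondary Disk.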

Fam

> >>>
> >>> The workflow analogue is:
> >>>
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>
> >>> Primary write requests are forwarded to secondary QEMU as well.
> >>>
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>
> >>> Before Primary write requests are written to (nbd target), aka the Secondary
> >>> disk, the orignal sector content is read from it and copied to (hidden buf
> >>> disk) by drive-backup. It obviously will not overwrite the data in (active
> >>> disk).
> >>>
> >>>>>> +    3) Primary write requests will be written to Secondary disk.
> >>>
> >>> Primary write requests are written to (nbd target).
> >>>
> >>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>>> +       will overwrite the existing sector content in the buffer.
> >>>
> >>> Secondary write request will be written in (active disk) as usual.
> >>>
> >>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >>> data in (hidden buf disk) and (active disk); when failover happends, if you
> >>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >>> drop data in (hidden buf disk).
> >>>
> >>> Fam
> >>> .
> >>>
> >>
> >>
> > .
> > 
>
Wen Congyang Feb. 13, 2015, 5:09 a.m. UTC | #10
On 02/12/2015 06:26 PM, famz@redhat.com wrote:
> On Thu, 02/12 18:11, Wen Congyang wrote:
>> On 02/12/2015 05:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 17:33, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>
>>>> What is active disk? There are two disk images?
>>>
>>> It starts as an empty image with (hidden buf disk) as backing file, which in
>>> turn has (nbd target) as backing file.
>>
>> It's too complicated..., and I don't understand it.
>> 1. What is active disk? Use raw or a new block driver?
> 
> It is an empty qcow2 image with the same length as your Secondary Disk.
> 
>> 2. Hidden buf disk use new block driver?
> 
> It is an empty qcow2 image with the same length as your Secondary Disk, too.
> 
>> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.
> 
> NBD target is your Secondary Disk. It is opened read-write.
> 
> The patches to enable opening it read-write, and to start drive-backup
> between it and the hidden buf disk, are all work in progress; they are the
> core concept of image fleecing.

What is image fleecing? Are you implementing it now?

Thanks
Wen Congyang

> 
> Fam
> 
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>>>
>>> .
>>>
>>
> .
>
Fam Zheng Feb. 13, 2015, 7:01 a.m. UTC | #11
On Fri, 02/13 13:09, Wen Congyang wrote:
> What is image fleecing?
> 

It's the name of the feature that enables the built-in NBD server to export
a thin point-in-time snapshot created via drive-backup sync=none.

It lets a host-side data scanning tool access a disk snapshot of a running VM.
The workflow in theory is:

1. guest uses "disk0" as its virtio-blk device.

2. in qmp, use blockdev-add (drive-backup) to add an empty "target0" qcow2
image, that uses "disk0" as its backing file, and use nbd-server-add to export
this empty image with NBD. This way, all reads coming from NBD client will
produce data of "disk0".

3. in qmp, start blockdev-backup from "disk0" to "target0" with "sync=none".
After this point, all guest data written to "disk0" will COW the original data
to "target0", in other words, reading "target0" will effectively produce a
point-in-time snapshot of the time when blockdev-backup started.

4. after step 3, the disk data seen by the NBD client is the stable snapshot.
Because of the COW mechanism in blockdev-backup, "target0" is thin, and can be
dropped once the inspection process is done.
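
Roughly, and only as a sketch of where this is heading, the sequence could
look like the following. The drive id "disk0", the export/drive id
"target0", the image size and the NBD address are made-up examples, and the
blockdev-add call of step 2 is left out on purpose, because its syntax for
referencing an existing drive as backing is exactly the missing piece:

    # step 2 (partly): create the thin target image; same virtual size as
    # disk0 (20G is only an example), backing wired up later via blockdev-add
    qemu-img create -f qcow2 target0.qcow2 20G

    # steps 2/3 in QMP: export "target0" read-only, then start the
    # sync=none backup job from "disk0" into it
    { "execute": "nbd-server-start",
      "arguments": { "addr": { "type": "inet",
                               "data": { "host": "127.0.0.1", "port": "10809" } } } }
    { "execute": "nbd-server-add", "arguments": { "device": "target0" } }
    { "execute": "blockdev-backup",
      "arguments": { "device": "disk0", "target": "target0", "sync": "none" } }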

>  Are you implementing it now?

I worked on it. Most parts of the series are merged; the remaining part is
relatively small, namely to

1) enable adding "target0" in step 2 (currently in blockdev-add it's not
possible to reference an existing drive as backing file);

2) enable "blockdev-backup" from "disk0" to "target0", which is obviously not
possible because 1) is not done.

I do have the patches at my tree, just that they need to be refreshed. :)

https://github.com/famz/qemu/tree/image-fleecing

Fam
John Snow Feb. 13, 2015, 8:29 p.m. UTC | #12
On 02/13/2015 02:01 AM, Fam Zheng wrote:
> On Fri, 02/13 13:09, Wen Congyang wrote:
>> What is image fleecing?
>>
>
> It's the name of the feature that enables the built-in NBD server to export
> a thin point-in-time snapshot created via drive-backup sync=none.
>
> It lets a host-side data scanning tool access a disk snapshot of a running VM.
> The workflow in theory is:
>
> 1. guest uses "disk0" as its virtio-blk device.
>
> 2. in qmp, use blockdev-add (drive-backup) to add an empty "target0" qcow2
> image, that uses "disk0" as its backing file, and use nbd-server-add to export
> this empty image with NBD. This way, all reads coming from NBD client will
> produce data of "disk0".
>
> 3. in qmp, start blockdev-backup from "disk0" to "target0" with "sync=none".
> After this point, all guest data written to "disk0" will COW the original data
> to "target0", in other words, reading "target0" will effectively produce a
> point-in-time snapshot of the time when blockdev-backup started.
>
> 4. after step 3, the disk data seen by the NBD client is the stable snapshot.
> Because of the COW mechanism in blockdev-backup, "target0" is thin, and can be
> dropped once the inspection process is done.
>
>>   Are you implementing it now?
>
> I worked on it. Most parts of the series are merged; the remaining part is
> relatively small, namely to
>
> 1) enable adding "target0" in step 2 (currently in blockdev-add it's not
> possible to reference an existing drive as backing file);
>
> 2) enable "blockdev-backup" from "disk0" to "target0", which is obviously not
> possible because 1) is not done.
>
> I do have the patches at my tree, just that they need to be refreshed. :)
>
> https://github.com/famz/qemu/tree/image-fleecing
>
> Fam
>

I had intended to pick up these patches after I got incremental backup 
working, as Fam had started both of these projects and I inherited them 
-- though I hadn't begun work in earnest on refining and testing this 
particular feature yet.

--js
Wen Congyang Feb. 24, 2015, 7:50 a.m. UTC | #13
On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.

I don't understand this. In which function is the hidden target created automatically?

Thanks
Wen Congyang

> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>
> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
>
Fam Zheng Feb. 25, 2015, 2:46 a.m. UTC | #14
On Tue, 02/24 15:50, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> 
> I don't understand this. In which function is the hidden target created automatically?
> 

It's to be determined. This part is only in my mind :)

Fam

> 
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> > 
> > The workflow analogue is:
> > 
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> > 
> > Primary write requests are forwarded to secondary QEMU as well.
> > 
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> > 
> > Before Primary write requests are written to (nbd target), aka the Secondary
> > disk, the orignal sector content is read from it and copied to (hidden buf
> > disk) by drive-backup. It obviously will not overwrite the data in (active
> > disk).
> > 
> >>>> +    3) Primary write requests will be written to Secondary disk.
> > 
> > Primary write requests are written to (nbd target).
> > 
> >>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>> +       will overwrite the existing sector content in the buffer.
> > 
> > Secondary write request will be written in (active disk) as usual.
> > 
> > Finally, when checkpoint arrives, if you want to sync with primary, just drop
> > data in (hidden buf disk) and (active disk); when failover happends, if you
> > want to promote secondary vm, you can commit (active disk) to (nbd target), and
> > drop data in (hidden buf disk).
> > 
> > Fam
> > .
> > 
>
Wen Congyang Feb. 25, 2015, 8:11 a.m. UTC | #15
On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)

foo's backing file is mid, and mid's backing file is base?

So foo is a snapshot of base?

Thanks
Wen Congyang

> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>
> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
>
Fam Zheng Feb. 25, 2015, 8:18 a.m. UTC | #16
On Wed, 02/25 16:11, Wen Congyang wrote:
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> 
> foo's backing file is mid, and mid's backing file is base?

Yes.
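
For illustration, such a chain could be created along these lines (file
names and formats are made-up examples):

    qemu-img create -f qcow2 -o backing_file=base.img,backing_fmt=raw mid.qcow2
    qemu-img create -f qcow2 -o backing_file=mid.qcow2,backing_fmt=qcow2 foo.qcow2

"qemu-img info --backing-chain foo.qcow2" then shows the chain
foo -> mid -> base.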

Fam
Wen Congyang Feb. 25, 2015, 8:36 a.m. UTC | #17
On 02/25/2015 10:46 AM, Fam Zheng wrote:
> On Tue, 02/24 15:50, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>
>> I don't understand this. In which function is the hidden target created automatically?
>>
> 
> It's to be determined. This part is only in my mind :)

Is the hidden target only used for COLO?

Thanks
Wen Congyang

> 
> Fam
> 
>>
>>>
>>> It is interesting because it is a more generalized case of image fleecing,
>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>> only) purpose.
>>>
>>> More interestingly, with above facility, it is also possible to create a guest
>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>> cheaply. Or call it shadow copy if you will.
>>>
>>> Back to the COLO case, the configuration will be very similar:
>>>
>>>
>>>                       {primary wr}                                                {secondary vm}
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             v                                                           v
>>>
>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             >>>> drive-backup sync=none >>>>
>>>
>>> The workflow analogue is:
>>>
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>
>>> Primary write requests are forwarded to secondary QEMU as well.
>>>
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>
>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>> disk).
>>>
>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>
>>> Primary write requests are written to (nbd target).
>>>
>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>> +       will overwrite the existing sector content in the buffer.
>>>
>>> Secondary write request will be written in (active disk) as usual.
>>>
>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>> drop data in (hidden buf disk).
>>>
>>> Fam
>>> .
>>>
>>
> .
>
Fam Zheng Feb. 25, 2015, 8:58 a.m. UTC | #18
On Wed, 02/25 16:36, Wen Congyang wrote:
> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> > On Tue, 02/24 15:50, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>
> >> I don't understand this. In which function is the hidden target created automatically?
> >>
> > 
> > It's to be determined. This part is only in my mind :)
> 
> Is the hidden target only used for COLO?
> 

I'm not sure I get your question.

In this case, yes: this is a dedicated target that is only written to by
COLO's secondary VM.

In other general cases, this infrastructure could also be used for backup or
image fleecing.

Fam

> 
> > 
> > Fam
> > 
> >>
> >>>
> >>> It is interesting because it is a more generalized case of image fleecing,
> >>> where the (hidden target) is exposed via NBD server for data scanning (read
> >>> only) purpose.
> >>>
> >>> More interestingly, with above facility, it is also possible to create a guest
> >>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >>> cheaply. Or call it shadow copy if you will.
> >>>
> >>> Back to the COLO case, the configuration will be very similar:
> >>>
> >>>
> >>>                       {primary wr}                                                {secondary vm}
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             v                                                           v
> >>>
> >>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >>>
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             >>>> drive-backup sync=none >>>>
> >>>
> >>> The workflow analogue is:
> >>>
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>
> >>> Primary write requests are forwarded to secondary QEMU as well.
> >>>
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>
> >>> Before Primary write requests are written to (nbd target), aka the Secondary
> >>> disk, the orignal sector content is read from it and copied to (hidden buf
> >>> disk) by drive-backup. It obviously will not overwrite the data in (active
> >>> disk).
> >>>
> >>>>>> +    3) Primary write requests will be written to Secondary disk.
> >>>
> >>> Primary write requests are written to (nbd target).
> >>>
> >>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>>> +       will overwrite the existing sector content in the buffer.
> >>>
> >>> Secondary write request will be written in (active disk) as usual.
> >>>
> >>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >>> data in (hidden buf disk) and (active disk); when failover happends, if you
> >>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >>> drop data in (hidden buf disk).
> >>>
> >>> Fam
> >>> .
> >>>
> >>
> > .
> > 
> 
>
Wen Congyang Feb. 25, 2015, 9:10 a.m. UTC | #19
On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>

Why would the nbd target ever have a backing image?

> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).

We cannot simply drop the data in (hidden buf disk). We have to commit
(hidden buf disk) to (nbd target) first, and then commit (active disk) to
(nbd target).
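
For illustration only, that failover path could be written as two QMP
block-commit calls, assuming a made-up drive id "colo-disk0" and the file
names used elsewhere in this thread (this is only a sketch, not what the
series implements internally):

    { "execute": "block-commit",
      "arguments": { "device": "colo-disk0",
                     "top":    "hidden-disk.qcow2",
                     "base":   "nbd-target.qcow2" } }

    { "execute": "block-commit",
      "arguments": { "device": "colo-disk0",
                     "base":   "nbd-target.qcow2" } }

The first call merges (hidden buf disk) down into (nbd target); the second,
with "top" omitted, commits (active disk) as the active layer and therefore
needs a block-job-complete once the job reaches the ready state.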

Thanks
Wen Congyang

> 
> Fam
> .
>
Fam Zheng Feb. 25, 2015, 9:45 a.m. UTC | #20
On Wed, 02/25 17:10, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> 
> Why nbd target has backing image ever?

It's not strictly necessary; it depends on your VM disk configuration (for
example, at the time the VM boots, your image may already point to a backing
file, etc.).
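
For example (file names are made up), the Secondary disk may itself have been
created as an overlay of some existing image:

    qemu-img create -f qcow2 -b /path/to/base.qcow2 secondary.qcow2

In that case the chain below (nbd target) simply carries over from however
the disk was provisioned.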

Fam
Wen Congyang Feb. 25, 2015, 9:58 a.m. UTC | #21
On 02/25/2015 04:58 PM, Fam Zheng wrote:
> On Wed, 02/25 16:36, Wen Congyang wrote:
>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>
>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>
>>>
>>> It's to be determined. This part is only in my mind :)
>>
>> Does hidden target is only used for COLO?
>>
> 
> I'm not sure I get your question.
> 
> In this case yes, this is a dedicate target that's only written to by COLO's
> secondary VM.
> 
> In other general cases, this infrastructure could also be used for backup or
> image fleecing.

In the COLO case, we can create (hidden buf disk) when starting block
replication. In other, more general cases, I don't know when (hidden buf
disk) should be created.

Thanks
Wen Congyang

> 
> Fam
> 
>>
>>>
>>> Fam
>>>
>>>>
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
>>
> .
>
Wen Congyang Feb. 26, 2015, 6:38 a.m. UTC | #22
On 02/25/2015 10:46 AM, Fam Zheng wrote:
> On Tue, 02/24 15:50, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>
>> I don't understand this. In which function, the hidden target is created automatically?
>>
> 
> It's to be determined. This part is only in my mind :)

What about this:
-drive file=nbd-target,if=none,id=nbd-target0 \
-drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0

Thanks
Wen Congyang

> 
> Fam
>
Fam Zheng Feb. 26, 2015, 8:44 a.m. UTC | #23
On Thu, 02/26 14:38, Wen Congyang wrote:
> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> > On Tue, 02/24 15:50, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>
> >> I don't understand this. In which function, the hidden target is created automatically?
> >>
> > 
> > It's to be determined. This part is only in my mind :)
> 
> What about this:
> -drive file=nbd-target,if=none,id=nbd-target0 \
> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
> 

It's close. I assume backing.backing references another drive as its
backing_hd; in that case you cannot also have the other backing.file.*
options - they conflict. It would be something along the lines of:

-drive file=nbd-target,if=none,id=nbd-target0 \
-drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
-drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0

Or for simplicity, s/backing.backing=/backing=/g

Yes, adding this "backing=$drive_id" option is also exactly what we expect
in order to support image fleecing, but we haven't figured out how to allow
that without breaking other QMP operations like block jobs, etc.

Fam
Wen Congyang Feb. 26, 2015, 9:07 a.m. UTC | #24
On 02/26/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/26 14:38, Wen Congyang wrote:
>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>
>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>
>>>
>>> It's to be determined. This part is only in my mind :)
>>
>> What about this:
>> -drive file=nbd-target,if=none,id=nbd-target0 \
>> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
>>
> 
> It's close. I suppose backing.backing is referencing another drive as its
> backing_hd, then you cannot have the other backing.file.* option - they
> conflict. It would be something along:
> 
> -drive file=nbd-target,if=none,id=nbd-target0 \
> -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
> -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
> 
> Or for simplicity, s/backing.backing=/backing=/g

If we use backing=drive_id, backing.backing and backing.file.* do not
conflict: backing.backing=$drive_id means that the backing file's backing
file's id is $drive_id.
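
In other words, under this reading the -drive line in the example above would
describe roughly the following chain (shown only for illustration, using the
names from that example):

    (nbd-target0) <-backing.backing- [hidden-disk] <-backing.file.filename- (active-disk)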

> 
> Yes, adding these "backing=$drive_id" option is also exactly what we expect
> in order to support image-fleecing, but we haven't figured how to allow that
> without breaking other qmp operations like block jobs, etc.

I don't understand this. In which cases would QMP operations be broken? Can
you give me some examples?

Thanks
Wen Congyang

> 
> Fam
> .
>
Fam Zheng Feb. 26, 2015, 10:02 a.m. UTC | #25
On Thu, 02/26 17:07, Wen Congyang wrote:
> On 02/26/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/26 14:38, Wen Congyang wrote:
> >> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> >>> On Tue, 02/24 15:50, Wen Congyang wrote:
> >>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>>>> Hi Congyang,
> >>>>>>>
> >>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>>>> +== Workflow ==
> >>>>>>>> +The following is the image of block replication workflow:
> >>>>>>>> +
> >>>>>>>> +        +----------------------+            +------------------------+
> >>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>>>> +        +----------------------+            +------------------------+
> >>>>>>>> +                  |                                       |
> >>>>>>>> +                  |                                      (4)
> >>>>>>>> +                  |                                       V
> >>>>>>>> +                  |                              /-------------\
> >>>>>>>> +                  |      Copy and Forward        |             |
> >>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>>>> +                  |                      |       |             |
> >>>>>>>> +                  |                     (3)      \-------------/
> >>>>>>>> +                  |                 speculative      ^
> >>>>>>>> +                  |                write through    (2)
> >>>>>>>> +                  |                      |           |
> >>>>>>>> +                  V                      V           |
> >>>>>>>> +           +--------------+           +----------------+
> >>>>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>>>> +           +--------------+           +----------------+
> >>>>>>>> +
> >>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>>>> +       QEMU.
> >>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>>>> +       original sector content will be read from Secondary disk and
> >>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>>>> +       sector content in the Disk buffer.
> >>>>>>>
> >>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>>>> reading them as "s/will be/are/g"
> >>>>>>>
> >>>>>>> Why do you need this buffer?
> >>>>>>
> >>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>>>> vm write to the buffer.
> >>>>>>
> >>>>>>>
> >>>>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>>>> buffer?
> >>>>>>
> >>>>>> The primary content will be written to the secondary disk, and the secondary content
> >>>>>> is saved in the buffer.
> >>>>>
> >>>>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>>>> image" feature, as described below.
> >>>>>
> >>>>> When we have a normal backing chain,
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}
> >>>>>                          |
> >>>>>                          |
> >>>>>                          |
> >>>>>     [base] <- [mid] <- (foo)
> >>>>>
> >>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>>>> to an existing image on top,
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>>>                          |                              |
> >>>>>                          |                              |
> >>>>>                          |                              |
> >>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>>>
> >>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>>>> We can utilize an automatic hidden drive-backup target:
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>>>                          |                                                          |
> >>>>>                          |                                                          |
> >>>>>                          v                                                          v
> >>>>>
> >>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>>>
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          >>>> drive-backup sync=none >>>>
> >>>>>
> >>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>>>> remains unchanged from (bar)'s PoV.
> >>>>>
> >>>>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>>>> the naming is arbitrary.
> >>>>
> >>>> I don't understand this. In which function, the hidden target is created automatically?
> >>>>
> >>>
> >>> It's to be determined. This part is only in my mind :)
> >>
> >> What about this:
> >> -drive file=nbd-target,if=none,id=nbd-target0 \
> >> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
> >>
> > 
> > It's close. I suppose backing.backing is referencing another drive as its
> > backing_hd, then you cannot have the other backing.file.* option - they
> > conflict. It would be something along:
> > 
> > -drive file=nbd-target,if=none,id=nbd-target0 \
> > -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
> > -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
> > 
> > Or for simplicity, s/backing.backing=/backing=/g
> 
> If using backing=drive_id, backing.backing and backing.file.* are not conflict.
> backing.backing=$drive_id means that: backing file's backing file's id is $drive_id.

I see.

> 
> > 
> > Yes, adding these "backing=$drive_id" option is also exactly what we expect
> > in order to support image-fleecing, but we haven't figured how to allow that
> > without breaking other qmp operations like block jobs, etc.
> 
> I don't understand this. In which case, qmp operations will be broken? Can you give
> me some examples?
> 

I don't mean there is a fundamental blocker here, but in order to relax the
assumption that "only the top BDS can have a BlockBackend", we need to think
through the whole block layer and add finer-grained checks/restrictions where
necessary; otherwise allowing arbitrary backing references will be a mess.

Some random questions I'm now aware of:

1. nbd-target0 is writable here; without the drive-backup, hidden0 could be
corrupted by writes to it. So there needs to be a new convention and
invariant to follow.

2. In QMP, a block-commit of hidden0 into nbd-target0 or into its backing
file will corrupt data (from nbd-target0's perspective).

3. Unclear implications of "change" and "eject" when there is a backing
reference.

4. Can a drive be backing-referenced by more than one other drive?

Just my two cents; I still need to think about this systematically.

Fam
Wen Congyang Feb. 27, 2015, 2:27 a.m. UTC | #26
On 02/26/2015 06:02 PM, Fam Zheng wrote:
> On Thu, 02/26 17:07, Wen Congyang wrote:
>> On 02/26/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/26 14:38, Wen Congyang wrote:
>>>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>>>> Hi Congyang,
>>>>>>>>>
>>>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>>>> +== Workflow ==
>>>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>>>> +
>>>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>>>> +                  |                                       |
>>>>>>>>>> +                  |                                      (4)
>>>>>>>>>> +                  |                                       V
>>>>>>>>>> +                  |                              /-------------\
>>>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>>>> +                  |                      |       |             |
>>>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>>>> +                  |                 speculative      ^
>>>>>>>>>> +                  |                write through    (2)
>>>>>>>>>> +                  |                      |           |
>>>>>>>>>> +                  V                      V           |
>>>>>>>>>> +           +--------------+           +----------------+
>>>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>>>> +           +--------------+           +----------------+
>>>>>>>>>> +
>>>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>>>> +       QEMU.
>>>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>>>
>>>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>>>> reading them as "s/will be/are/g"
>>>>>>>>>
>>>>>>>>> Why do you need this buffer?
>>>>>>>>
>>>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>>>> vm write to the buffer.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>>>> buffer?
>>>>>>>>
>>>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>>>> is saved in the buffer.
>>>>>>>
>>>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>>>> image" feature, as described below.
>>>>>>>
>>>>>>> When we have a normal backing chain,
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}
>>>>>>>                          |
>>>>>>>                          |
>>>>>>>                          |
>>>>>>>     [base] <- [mid] <- (foo)
>>>>>>>
>>>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>>>> to an existing image on top,
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>>>                          |                              |
>>>>>>>                          |                              |
>>>>>>>                          |                              |
>>>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>>>
>>>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>>>                          |                                                          |
>>>>>>>                          |                                                          |
>>>>>>>                          v                                                          v
>>>>>>>
>>>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>>>
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>>>
>>>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>>>> remains unchanged from (bar)'s PoV.
>>>>>>>
>>>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>>>> the naming is arbitrary.
>>>>>>
>>>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>>>
>>>>>
>>>>> It's to be determined. This part is only in my mind :)
>>>>
>>>> What about this:
>>>> -drive file=nbd-target,if=none,id=nbd-target0 \
>>>> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
>>>>
>>>
>>> It's close. I suppose backing.backing is referencing another drive as its
>>> backing_hd, then you cannot have the other backing.file.* option - they
>>> conflict. It would be something along:
>>>
>>> -drive file=nbd-target,if=none,id=nbd-target0 \
>>> -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
>>> -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
>>>
>>> Or for simplicity, s/backing.backing=/backing=/g
>>
>> If using backing=drive_id, backing.backing and backing.file.* are not conflict.
>> backing.backing=$drive_id means that: backing file's backing file's id is $drive_id.
> 
> I see.
> 
>>
>>>
>>> Yes, adding these "backing=$drive_id" option is also exactly what we expect
>>> in order to support image-fleecing, but we haven't figured how to allow that
>>> without breaking other qmp operations like block jobs, etc.
>>
>> I don't understand this. In which case, qmp operations will be broken? Can you give
>> me some examples?
>>
> 
> I don't mean there is a fundamental stopper for this, but in order to relax the
> assumption that "only top BDS can have a BlockBackend", we need to think
> through the whole block layer, and add new finer checks/restrictions where it's
> necessary, otherwise it will be a mess to allow arbitrary backing reference.
> 
> Some random questions I'm now aware of:
> 
> 1. nbd-target0 is writable here, without the drive-backup, hidden0 could be
> corrupted by writings to it. So there need to be a new convention and
> invariance to follow.

Hmm, now I understand why the hidden disk should be opened automatically.
If we use a backing reference, I think we should open a hidden disk and set
up the drive-backup (sync=none) automatically, and block any conflicting
operations (commit, change, eject?).
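
Only as a sketch of what that automatic setup would amount to, if it were
expressed as a QMP command instead of being wired up internally (the drive id
and target file name are made up here):

    { "execute": "drive-backup",
      "arguments": { "device": "nbd-target0",
                     "target": "hidden-disk.qcow2",
                     "sync":   "none",
                     "mode":   "existing" } }

That is, every write to the Secondary disk first copies the old sector
content into the hidden disk, which is the copy-before-write behaviour the
chain relies on.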

> 
> 2. in qmp, block-commit hidden0 to nbd-target0 or it's backing file, will
> corrupt data (from nbd-target0's perspective).
> 
> 3. unclear implications of "change" and "eject" when there is backing
> reference.
> 
> 4. can a drive be backing referenced by more than one other drives?

We can forbid that at first.

Thanks
Wen Congyang

> 
> Just two cents, and I still need to think about it systematically.
> 
> Fam
> .
>
Fam Zheng Feb. 27, 2015, 2:32 a.m. UTC | #27
On Fri, 02/27 10:27, Wen Congyang wrote:
> > 1. nbd-target0 is writable here, without the drive-backup, hidden0 could be
> > corrupted by writings to it. So there need to be a new convention and
> > invariance to follow.
> 
> Hmm, I understand while the hidden-disk should be opened automatically now.
> If we use backing reference, I think we should open a hindden-disk, and set
> drive backup automatically. Block any conflict operations(commit, change, eject?)

This might be a good idea.

> 
> > 
> > 2. in qmp, block-commit hidden0 to nbd-target0 or it's backing file, will
> > corrupt data (from nbd-target0's perspective).
> > 
> > 3. unclear implications of "change" and "eject" when there is backing
> > reference.
> > 
> > 4. can a drive be backing referenced by more than one other drives?
> 
> We can forbid it first.
> 

Yes, probably with a new op blocker type.

Fam
Wen Congyang March 3, 2015, 7:53 a.m. UTC | #28
On 02/12/2015 06:26 PM, famz@redhat.com wrote:
> On Thu, 02/12 18:11, Wen Congyang wrote:
>> On 02/12/2015 05:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 17:33, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>
>>>> What is active disk? There are two disk images?
>>>
>>> It starts as an empty image with (hidden buf disk) as backing file, which in
>>> turn has (nbd target) as backing file.
>>
>> It's too complicated..., and I don't understand it.
>> 1. What is active disk? Use raw or a new block driver?
> 
> It is an empty qcow2 image with the same lenght as your Secondary Disk.

I tested qcow2_make_empty()'s performance. The result shows that it may take
about 100 ms (on a normal SATA disk). That is not acceptable for COLO, so I
think the disk buffer is still necessary (just use it to replace qcow2).

Thanks
Wen Congyang

> 
>> 2. Hidden buf disk use new block driver?
> 
> It is an empty qcow2 image with the same length as your Secondary Disk, too.
> 
>> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.
> 
> NBD target is your Secondary Disk. It is opened read-write.
> 
> The patches to enable opening it read-write, and starting drive-backup
> between it and the hidden buf disk, are all work in progress (the core
> concept of image fleecing).
> 
> Fam
> 
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>>>
>>> .
>>>
>>
> .
>
Fam Zheng March 3, 2015, 7:59 a.m. UTC | #29
On Tue, 03/03 15:53, Wen Congyang wrote:
> I test qcow2_make_empty()'s performance. The result shows that it may
> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> I think disk buff is necessary(just use it to replace qcow2).

Why not tmpfs or ramdisk?

Fam
Wen Congyang March 3, 2015, 12:12 p.m. UTC | #30
On 03/03/2015 03:59 PM, Fam Zheng wrote:
> On Tue, 03/03 15:53, Wen Congyang wrote:
>> I test qcow2_make_empty()'s performance. The result shows that it may
>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>> I think disk buff is necessary(just use it to replace qcow2).
> 
> Why not tmpfs or ramdisk?

I tested it, and it only takes 2-3ms.
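
For illustration, a rough sketch of that kind of setup, with both empty qcow2
overlays placed on tmpfs so that emptying them at checkpoint time stays cheap.
All mount points, sizes and file names below are made up:

    mount -t tmpfs -o size=2G none /mnt/colo-buffer

    # hidden buf disk: empty overlay whose backing file is the secondary disk
    qemu-img create -f qcow2 \
        -o backing_file=/vms/secondary-1.raw,backing_fmt=raw \
        /mnt/colo-buffer/hidden.qcow2

    # active disk: empty overlay on top of the hidden buf disk
    qemu-img create -f qcow2 \
        -o backing_file=/mnt/colo-buffer/hidden.qcow2,backing_fmt=qcow2 \
        /mnt/colo-buffer/active.qcow2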

Thanks
Wen Congyang

> 
> Fam
> .
>
Dr. David Alan Gilbert March 4, 2015, 4:35 p.m. UTC | #31
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

Hi,

> ---
>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..59150b8
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,129 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The block replication is used for continuous checkpoints. It is designed
> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> +scene that Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.

Can you explain how the block data is synchronised with the main checkpoint
stream?  I.e. when the secondary receives a new checkpoint, how does it know
it has received all of the block writes from the primary associated with that
checkpoint, and that all the following writes it receives are for the
next checkpoint period?

Dave

> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                  virtio-blk
> +                          ||        ^                         ^
> +--------.                 ||        |                         |
> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
> +--------'                 ||                   backing
> +
> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
> +and the disk buffer uses bdrv_add_before_write_notifier to implement
> +copy-on-write, similar to block/backup.c.
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +c. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=first,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +
> +Secondary:
> +  -drive if=xxx,driver=blkcolo,export=xxx,\
> +         backing.file.filename=1.raw,\
> +         backing.driver=raw
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd_server_start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd_server_start's other options
> -- 
> 2.1.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wen Congyang March 5, 2015, 1:03 a.m. UTC | #32
On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> 
> Hi,
> 
>> ---
>>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 129 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..59150b8
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,129 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +The block replication is used for continuous checkpoints. It is designed
>> +for COLO that Secondary VM is running. It can also be applied for FT/HA
>> +scene that Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
> 
> Can you explain how the block data is synchronised with the main checkpoint
> stream?  i.e. when the secondary receives a new checkpoint how does it know
> it's received all of the block writes from the primary associated with that
> checkpoint and that all the following writes that it receives are for the
> next checkpoint period?

The NBD server will take care of it. Writing to the NBD client returns only
after the NBD server replies with the result (ACK or error).

Thanks
Wen Congyang

> 
> Dave
> 
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                  virtio-blk
>> +                          ||        ^                         ^
>> +--------.                 ||        |                         |
>> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
>> +--------'                 ||                   backing
>> +
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
>> +and the disk buffer uses bdrv_add_before_write_notifier to implement
>> +copy-on-write, similar to block/backup.c.
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transfered to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +c. bdrv_stop_replication()
>> +   It is called when failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication.
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=first,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +
>> +Secondary:
>> +  -drive if=xxx,driver=blkcolo,export=xxx,\
>> +         backing.file.filename=1.raw,\
>> +         backing.driver=raw
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd_server_start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd_server_start's other options
>> -- 
>> 2.1.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>
Dr. David Alan Gilbert March 5, 2015, 7:04 p.m. UTC | #33
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> >> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > 
> > Hi,
> > 
> >> ---
> >>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 129 insertions(+)
> >>  create mode 100644 docs/block-replication.txt
> >>
> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> >> new file mode 100644
> >> index 0000000..59150b8
> >> --- /dev/null
> >> +++ b/docs/block-replication.txt
> >> @@ -0,0 +1,129 @@
> >> +Block replication
> >> +----------------------------------------
> >> +Copyright Fujitsu, Corp. 2015
> >> +Copyright (c) 2015 Intel Corporation
> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> >> +
> >> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> +See the COPYING file in the top-level directory.
> >> +
> >> +The block replication is used for continuous checkpoints. It is designed
> >> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> >> +scene that Secondary VM is not running.
> >> +
> >> +This document gives an overview of block replication's design.
> >> +
> >> +== Background ==
> >> +High availability solutions such as micro checkpoint and COLO will do
> >> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> >> +identical right after a VM checkpoint, but becomes different as the VM
> >> +executes till the next checkpoint. To support disk contents checkpoint,
> >> +the modified disk contents in the Secondary VM must be buffered, and are
> >> +only dropped at next checkpoint time. To reduce the network transportation
> >> +effort at the time of checkpoint, the disk modification operations of
> >> +Primary disk are asynchronously forwarded to the Secondary node.
> > 
> > Can you explain how the block data is synchronised with the main checkpoint
> > stream?  i.e. when the secondary receives a new checkpoint how does it know
> > it's received all of the block writes from the primary associated with that
> > checkpoint and that all the following writes that it receives are for the
> > next checkpoint period?
> 
> NBD server will do it. Writing to NBD client will return after NBD server replies
> the result(ACK or error).

Ah OK, so if the NBD client is synchronous then yes, I can see that
(I was confused by the word 'asynchronously' in your description above,
but I guess that means asynchronous with respect to the checkpoint stream).
I see that 'do_colo_transaction' keeps the primary stopped until after
the secondary does blk_do_checkpoint and then sends 'LOADED'.

I think yes, that should work, although potentially you could make it faster:
since the primary doesn't need to know that its writes have been committed
until the next checkpoint, if you could mark the separation between the two
checkpoints, then you could start the primary running again earlier. But
that's all more complicated; this should work OK.

Thanks for the explanation,

Dave

> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >> +
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> >> +    3) Primary write requests will be written to Secondary disk.
> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.
> >> +
> >> +== Architecture ==
> >> +We are going to implement COLO block replication from many basic
> >> +blocks that are already in QEMU.
> >> +
> >> +         virtio-blk       ||
> >> +             ^            ||                            .----------
> >> +             |            ||                            | Secondary
> >> +        1 Quorum          ||                            '----------
> >> +         /      \         ||
> >> +        /        \        ||
> >> +   Primary      2 NBD  ------->  2 NBD
> >> +     disk       client    ||     server                  virtio-blk
> >> +                          ||        ^                         ^
> >> +--------.                 ||        |                         |
> >> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
> >> +--------'                 ||                   backing
> >> +
> >> +1) The disk on the primary is represented by a block device with two
> >> +children, providing replication between a primary disk and the host that
> >> +runs the secondary VM. The read pattern for quorum can be extended to
> >> +make the primary always read from the local disk instead of going through
> >> +NBD.
> >> +
> >> +2) The secondary disk receives writes from the primary VM through QEMU's
> >> +embedded NBD server (speculative write-through).
> >> +
> >> +3) The disk on the secondary is represented by a custom block device
> >> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
> >> +and the disk buffer uses bdrv_add_before_write_notifier to implement
> >> +copy-on-write, similar to block/backup.c.
> >> +
> >> +== New block driver interface ==
> >> +We add three block driver interfaces to control block replication:
> >> +a. bdrv_start_replication()
> >> +   Start block replication, called in migration/checkpoint thread.
> >> +   We must call bdrv_start_replication() in secondary QEMU before
> >> +   calling bdrv_start_replication() in primary QEMU.
> >> +b. bdrv_do_checkpoint()
> >> +   This interface is called after all VM state is transfered to
> >> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> >> +c. bdrv_stop_replication()
> >> +   It is called when failover. We will flush the Disk buffer into
> >> +   Secondary Disk and stop block replication.
> >> +
> >> +== Usage ==
> >> +Primary:
> >> +  -drive if=xxx,driver=quorum,read-pattern=first,\
> >> +         children.0.file.filename=1.raw,\
> >> +         children.0.driver=raw,\
> >> +         children.1.file.driver=nbd+colo,\
> >> +         children.1.file.host=xxx,\
> >> +         children.1.file.port=xxx,\
> >> +         children.1.file.export=xxx,\
> >> +         children.1.driver=raw
> >> +  Note:
> >> +  1. NBD Client should not be the first child of quorum.
> >> +  2. There should be only one NBD Client.
> >> +  3. host is the secondary physical machine's hostname or IP
> >> +  4. Each disk must have its own export name.
> >> +
> >> +Secondary:
> >> +  -drive if=xxx,driver=blkcolo,export=xxx,\
> >> +         backing.file.filename=1.raw,\
> >> +         backing.driver=raw
> >> +  Then run qmp command:
> >> +    nbd_server_start host:port
> >> +  Note:
> >> +  1. The export name for the same disk must be the same in primary
> >> +     and secondary QEMU command line
> >> +  2. The qmp command nbd_server_start must be run before running the
> >> +     qmp command migrate on primary QEMU
> >> +  3. Don't use nbd_server_start's other options
> >> -- 
> >> 2.1.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > .
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wen Congyang March 11, 2015, 6:44 a.m. UTC | #34
On 03/03/2015 03:59 PM, Fam Zheng wrote:
> On Tue, 03/03 15:53, Wen Congyang wrote:
>> I test qcow2_make_empty()'s performance. The result shows that it may
>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>> I think disk buff is necessary(just use it to replace qcow2).
> 
> Why not tmpfs or ramdisk?

Another problem:
After failover, secondary write requests will still be written to (active disk)?
It would be better to write them to (nbd target). Is there any existing feature
that can be reused to implement this?

Thanks
Wen Congyang

> 
> Fam
> .
>
Fam Zheng March 11, 2015, 6:49 a.m. UTC | #35
On Wed, 03/11 14:44, Wen Congyang wrote:
> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> > On Tue, 03/03 15:53, Wen Congyang wrote:
> >> I test qcow2_make_empty()'s performance. The result shows that it may
> >> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >> I think disk buff is necessary(just use it to replace qcow2).
> > 
> > Why not tmpfs or ramdisk?
> 
> Another problem:
> After failover, secondary write request will be written in (active disk)?
> It is better to write request to (nbd target). Is there any feature can
> be reused to implement it?

You can use block commit or stream to move the data.
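
As a rough sketch of the commit direction (device and file names here are
made up), folding the active overlay back into the NBD target after failover
could look like this in QMP:

    { "execute": "block-commit",
      "arguments": { "device": "colo-disk0",
                     "top":  "/mnt/colo-buffer/active.qcow2",
                     "base": "/vms/secondary-1.raw" } }

Since the top image is the active layer, QEMU runs this as a live commit, so
the job has to be finished with block-job-complete once it reports ready.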

Fam
Wen Congyang March 11, 2015, 7:01 a.m. UTC | #36
On 03/11/2015 02:49 PM, Fam Zheng wrote:
> On Wed, 03/11 14:44, Wen Congyang wrote:
>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>
>>> Why not tmpfs or ramdisk?
>>
>> Another problem:
>> After failover, secondary write request will be written in (active disk)?
>> It is better to write request to (nbd target). Is there any feature can
>> be reused to implement it?
> 
> You can use block commit or stream to move the data.

When doing failover, we can use it to move the data. After failover,
though, I would need an endless job to keep moving the data.

Thanks
Wen Congyang

> 
> Fam
> 
> .
>
Fam Zheng March 11, 2015, 7:04 a.m. UTC | #37
On Wed, 03/11 15:01, Wen Congyang wrote:
> On 03/11/2015 02:49 PM, Fam Zheng wrote:
> > On Wed, 03/11 14:44, Wen Congyang wrote:
> >> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> >>> On Tue, 03/03 15:53, Wen Congyang wrote:
> >>>> I test qcow2_make_empty()'s performance. The result shows that it may
> >>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >>>> I think disk buff is necessary(just use it to replace qcow2).
> >>>
> >>> Why not tmpfs or ramdisk?
> >>
> >> Another problem:
> >> After failover, secondary write request will be written in (active disk)?
> >> It is better to write request to (nbd target). Is there any feature can
> >> be reused to implement it?
> > 
> > You can use block commit or stream to move the data.
> 
> When doing failover, we can use it to move the data. After failover,
> I need an endless job to move the data.
> 

I see what you mean. After failover, does the nbd server receive more data
(i.e. do you need a buffer to stash data from the other side)? If you commit
(active disk) to (nbd target), all the writes will go to a single image.

Fam
Wen Congyang March 11, 2015, 7:12 a.m. UTC | #38
On 03/11/2015 03:04 PM, Fam Zheng wrote:
> On Wed, 03/11 15:01, Wen Congyang wrote:
>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>
>>>>> Why not tmpfs or ramdisk?
>>>>
>>>> Another problem:
>>>> After failover, secondary write request will be written in (active disk)?
>>>> It is better to write request to (nbd target). Is there any feature can
>>>> be reused to implement it?
>>>
>>> You can use block commit or stream to move the data.
>>
>> When doing failover, we can use it to move the data. After failover,
>> I need an endless job to move the data.
>>
> 
> I see what you mean. After failover, does the nbd server receive more data
> (i.e. do you need a buffer to stash data from the other side)? If you commit
> (active disk) to (nbd target), all the writes will go to a single image.

After failover (the primary host is down), only the secondary QEMU keeps
running, and the NBD server doesn't receive any more data.

Thanks
Wen Congyang

> 
> Fam
> 
> .
>
Wen Congyang March 13, 2015, 9:01 a.m. UTC | #39
On 03/11/2015 02:49 PM, Fam Zheng wrote:
> On Wed, 03/11 14:44, Wen Congyang wrote:
>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>
>>> Why not tmpfs or ramdisk?
>>
>> Another problem:
>> After failover, secondary write request will be written in (active disk)?
>> It is better to write request to (nbd target). Is there any feature can
>> be reused to implement it?
> 
> You can use block commit or stream to move the data.

Can the stream job move the data? I can't find the write ops in block/stream.c.

Thanks
Wen Congyang

> 
> Fam
> 
> .
>
Fam Zheng March 13, 2015, 9:05 a.m. UTC | #40
On Fri, 03/13 17:01, Wen Congyang wrote:
> On 03/11/2015 02:49 PM, Fam Zheng wrote:
> > On Wed, 03/11 14:44, Wen Congyang wrote:
> >> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> >>> On Tue, 03/03 15:53, Wen Congyang wrote:
> >>>> I test qcow2_make_empty()'s performance. The result shows that it may
> >>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >>>> I think disk buff is necessary(just use it to replace qcow2).
> >>>
> >>> Why not tmpfs or ramdisk?
> >>
> >> Another problem:
> >> After failover, secondary write request will be written in (active disk)?
> >> It is better to write request to (nbd target). Is there any feature can
> >> be reused to implement it?
> > 
> > You can use block commit or stream to move the data.
> 
> Can the job stream move the data? I don't find the write ops in block/stream.c.

It is bdrv_co_copy_on_readv that moves data.

Fam
Wen Congyang March 16, 2015, 6:19 a.m. UTC | #41
On 03/13/2015 05:05 PM, Fam Zheng wrote:
> On Fri, 03/13 17:01, Wen Congyang wrote:
>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>
>>>>> Why not tmpfs or ramdisk?
>>>>
>>>> Another problem:
>>>> After failover, secondary write request will be written in (active disk)?
>>>> It is better to write request to (nbd target). Is there any feature can
>>>> be reused to implement it?
>>>
>>> You can use block commit or stream to move the data.
>>
>> Can the job stream move the data? I don't find the write ops in block/stream.c.
> 
> It is bdrv_co_copy_on_readv that moves data.

Does the stream job move the data from the base to the top?

Thanks
Wen Congyang

> 
> Fam
> .
>
Paolo Bonzini March 25, 2015, 12:41 p.m. UTC | #42
On 16/03/2015 07:19, Wen Congyang wrote:
> On 03/13/2015 05:05 PM, Fam Zheng wrote:
>> On Fri, 03/13 17:01, Wen Congyang wrote:
>>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>>
>>>>>> Why not tmpfs or ramdisk?
>>>>>
>>>>> Another problem:
>>>>> After failover, secondary write request will be written in (active disk)?
>>>>> It is better to write request to (nbd target). Is there any feature can
>>>>> be reused to implement it?
>>>>
>>>> You can use block commit or stream to move the data.
>>>
>>> Can the job stream move the data? I don't find the write ops in block/stream.c.
>>
>> It is bdrv_co_copy_on_readv that moves data.
> 
> Does the job stream move the data from base to top?

Yes.  block-commit goes in the other direction.
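
Roughly, with a made-up device name: block-stream pulls data from the backing
chain up into the active image, while block-commit pushes an overlay's data
down towards its backing file.

    pull (stream): fill the active image from its backing chain
      { "execute": "block-stream",
        "arguments": { "device": "colo-disk0" } }

    push (commit): fold the active overlay down into its backing file
      { "execute": "block-commit",
        "arguments": { "device": "colo-disk0" } }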

Paolo
diff mbox

Patch

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..59150b8
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,129 @@ 
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpointing. It is designed
+for COLO, where the Secondary VM is running. It can also be applied to
+FT/HA scenarios where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro-checkpointing and COLO perform
+consecutive checkpoints. The VM state of the Primary VM and Secondary VM is
+identical right after a VM checkpoint, but diverges as the VM executes
+until the next checkpoint. To support disk content checkpointing, the
+modified disk contents in the Secondary VM must be buffered, and are only
+dropped at the next checkpoint. To reduce the network transfer cost at
+checkpoint time, the disk modification operations of the Primary disk are
+asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests are copied and forwarded to Secondary
+       QEMU.
+    2) Before a Primary write request is written to the Secondary disk,
+       the original sector content is read from the Secondary disk and
+       buffered in the Disk buffer; it does not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests are then written to the Secondary disk.
+    4) Secondary write requests are buffered in the Disk buffer and
+       overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement COLO block replication from building blocks
+that already exist in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                  virtio-blk
+                          ||        ^                         ^
+--------.                 ||        |                         |
+Primary |                 ||  Secondary disk <--------- COLO buffer 3
+--------'                 ||                   backing
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+("COLO buffer"). The disk buffer's backing image is the secondary disk,
+and the disk buffer uses bdrv_add_before_write_notifier to implement
+copy-on-write, similar to block/backup.c.
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state has been transferred to
+   the Secondary QEMU. The Disk buffer is dropped in this interface.
+c. bdrv_stop_replication()
+   It is called on failover. We flush the Disk buffer into the
+   Secondary Disk and stop block replication.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=first,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+         children.1.file.driver=nbd+colo,\
+         children.1.file.host=xxx,\
+         children.1.file.port=xxx,\
+         children.1.file.export=xxx,\
+         children.1.driver=raw
+  Note:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+  3. host is the secondary physical machine's hostname or IP address.
+  4. Each disk must have its own export name.
+
+Secondary:
+  -drive if=xxx,driver=blkcolo,export=xxx,\
+         backing.file.filename=1.raw,\
+         backing.driver=raw
+  Then run the QMP command:
+    nbd_server_start host:port
+  Note:
+  1. The export name for the same disk must be the same in the primary
+     and secondary QEMU command lines.
+  2. The QMP command nbd_server_start must be run before running the
+     QMP command migrate on the primary QEMU.
+  3. Don't use nbd_server_start's other options.
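
For completeness, a hypothetical end-to-end bring-up that follows the ordering
rules in the Usage section above. All hostnames, ports and image names are
invented, blkcolo/nbd+colo are the drivers proposed by this series, and the
exact migrate invocation depends on the rest of the COLO patches.

  Secondary host: start QEMU with the COLO buffer drive, then start the NBD
  server from the monitor:

    qemu-system-x86_64 ... \
           -drive if=virtio,driver=blkcolo,export=disk0,\
                  backing.file.filename=1.raw,\
                  backing.driver=raw
    (qemu) nbd_server_start 192.168.0.2:8889

  Primary host: quorum of the local image and the NBD client, then start the
  migration:

    qemu-system-x86_64 ... \
           -drive if=virtio,driver=quorum,read-pattern=first,\
                  children.0.file.filename=1.raw,\
                  children.0.driver=raw,\
                  children.1.file.driver=nbd+colo,\
                  children.1.file.host=192.168.0.2,\
                  children.1.file.port=8889,\
                  children.1.file.export=disk0,\
                  children.1.driver=raw
    (qemu) migrate -d tcp:192.168.0.2:8888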