Patchwork [2/2] sheepdog: support 'qemu-img snapshot -a'

login
register
mail settings
Submitter namei.unix@gmail.com
Date June 6, 2013, 11:57 a.m.
Message ID <1370519849-10620-3-git-send-email-namei.unix@gmail.com>
Download mbox | patch
Permalink /patch/249396/
State New
Headers show

Comments

namei.unix@gmail.com - June 6, 2013, 11:57 a.m.
Just call sd_create_branch() to rollback the image is good enough

Cc: qemu-devel@nongnu.org
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/sheepdog.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
Kevin Wolf - June 6, 2013, 12:46 p.m.
Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
> Just call sd_create_branch() to rollback the image is good enough
> 
> Cc: qemu-devel@nongnu.org
> Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/sheepdog.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 94218ac..cb5ca4a 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -2072,9 +2072,11 @@ static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>      }
>  
>      if (!s->inode.vm_state_size) {
> -        error_report("Invalid snapshot");
> -        ret = -ENOENT;
> -        goto out;
> +        /* qemu-img asks us to rollback, we need to do it right now */
> +        ret = sd_create_branch(s);
> +        if (ret) {
> +            goto out;
> +        }
>      }

I'm not sure how snapshots work internally for Sheepdog, but it seems
odd to me that you need to do this only for disk-only snapshots, but not
when the snapshot has VM state. (Also, note that 'qemu-img snapshot -a'
works on images with a VM state, so the comment doesn't seem to be
completely accurate)

Would you mind explaining to me how this works in detail?

Kevin
namei.unix@gmail.com - June 6, 2013, 1:09 p.m.
On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
>> Just call sd_create_branch() to rollback the image is good enough
>>
>> Cc: qemu-devel@nongnu.org
>> Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
>> Cc: Kevin Wolf <kwolf@redhat.com>
>> Cc: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
>> ---
>>  block/sheepdog.c |    8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/sheepdog.c b/block/sheepdog.c
>> index 94218ac..cb5ca4a 100644
>> --- a/block/sheepdog.c
>> +++ b/block/sheepdog.c
>> @@ -2072,9 +2072,11 @@ static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>>      }
>>  
>>      if (!s->inode.vm_state_size) {
>> -        error_report("Invalid snapshot");
>> -        ret = -ENOENT;
>> -        goto out;
>> +        /* qemu-img asks us to rollback, we need to do it right now */
>> +        ret = sd_create_branch(s);
>> +        if (ret) {
>> +            goto out;
>> +        }
>>      }
> 
> I'm not sure how snapshots work internally for Sheepdog, but it seems
> odd to me that you need to do this only for disk-only snapshots, but not
> when the snapshot has VM state. (Also, note that 'qemu-img snapshot -a'
> works on images with a VM state, so the comment doesn't seem to be
> completely accurate)
> 
> Would you mind explaining to me how this works in detail?
> 

Hmm, the original code isn't written by me and this snapshot mechanism
exists since day 0. I just hacks it to work now. So I'll try to explain
on my understanding.

When we do a snapshot such as 'savedvm' or 'qemu-img snapshot', the
active vdi is snapshotted and marked as snapshot and a new vdi is
created as copy-on-write on the previous active vdi, then this new vdi
becomes active vdi. For e.g,

As1 --> As2 --> A

We take snapshot of vdi A twice, tagged s1 and s2 respectively. I guess
this is quit similar to qcow2 snapshots, only inode object with a bitmap
is created.

So when we 'loadvm' or 'qemu-img snapshot -a' to A, current logic just
handle 'loadvm', that .bdrv_snapshot_goto only reloads inode object,
that is, for e.g, we 'savevm s1', and mark it as snapshot, the chain
would like

As1 --> As2 --> A
 |
 v
 just reload As1's inode object

Only when the write comes from VM, we do the following stuff
 - delete active vdi A
 - created a new inode based on the previously reloaded As1's inode

The chain will look like:

As1 --> As2
 |
 V
 A

This is how sheepdog handles savevm/loadvm.

So for 'qemu-img snapshot -a', we should create the branch in the
.bdrv_snapshot_goto.

As you pointed out, we need to consider vm state even for 'snapshot -a',
so I need to rework the patch 2/2.

Thanks,
Yuan
namei.unix@gmail.com - June 6, 2013, 1:41 p.m.
On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> I'm not sure how snapshots work internally for Sheepdog, but it seems
> odd to me that you need to do this only for disk-only snapshots, but not
> when the snapshot has VM state. (Also, note that 'qemu-img snapshot -a'
> works on images with a VM state, so the comment doesn't seem to be
> completely accurate)

Seems that I misunderstood your comments. What my 2/2 patch tried to do
is enable 'qemu-img snapshot -a' to rollback the sheepdog disk states
only and it is correct. So what I need fixing is comment, right?

how about the commenting as 'qemu-img asks us to rollback disk only, we
can't rely on the write request to sd_create_branch, so just call it
directly' ?

Thanks,
Yuan
Kevin Wolf - June 7, 2013, 7:31 a.m.
Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> > Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
> >> Just call sd_create_branch() to rollback the image is good enough
> >>
> >> Cc: qemu-devel@nongnu.org
> >> Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> >> Cc: Kevin Wolf <kwolf@redhat.com>
> >> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> >> ---
> >>  block/sheepdog.c |    8 +++++---
> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/block/sheepdog.c b/block/sheepdog.c
> >> index 94218ac..cb5ca4a 100644
> >> --- a/block/sheepdog.c
> >> +++ b/block/sheepdog.c
> >> @@ -2072,9 +2072,11 @@ static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
> >>      }
> >>  
> >>      if (!s->inode.vm_state_size) {
> >> -        error_report("Invalid snapshot");
> >> -        ret = -ENOENT;
> >> -        goto out;
> >> +        /* qemu-img asks us to rollback, we need to do it right now */
> >> +        ret = sd_create_branch(s);
> >> +        if (ret) {
> >> +            goto out;
> >> +        }
> >>      }
> > 
> > I'm not sure how snapshots work internally for Sheepdog, but it seems
> > odd to me that you need to do this only for disk-only snapshots, but not
> > when the snapshot has VM state. (Also, note that 'qemu-img snapshot -a'
> > works on images with a VM state, so the comment doesn't seem to be
> > completely accurate)
> > 
> > Would you mind explaining to me how this works in detail?
> > 
> 
> Hmm, the original code isn't written by me and this snapshot mechanism
> exists since day 0. I just hacks it to work now. So I'll try to explain
> on my understanding.
> 
> When we do a snapshot such as 'savedvm' or 'qemu-img snapshot', the
> active vdi is snapshotted and marked as snapshot and a new vdi is
> created as copy-on-write on the previous active vdi, then this new vdi
> becomes active vdi. For e.g,
> 
> As1 --> As2 --> A
> 
> We take snapshot of vdi A twice, tagged s1 and s2 respectively. I guess
> this is quit similar to qcow2 snapshots, only inode object with a bitmap
> is created.
> 
> So when we 'loadvm' or 'qemu-img snapshot -a' to A, current logic just
> handle 'loadvm', that .bdrv_snapshot_goto only reloads inode object,
> that is, for e.g, we 'savevm s1', and mark it as snapshot, the chain
> would like
> 
> As1 --> As2 --> A
>  |
>  v
>  just reload As1's inode object
> 
> Only when the write comes from VM, we do the following stuff
>  - delete active vdi A
>  - created a new inode based on the previously reloaded As1's inode

Thanks, this is the part that I missed.

I'm not sure however why the actual switch is delayed until the first
write. This seems inconsistent with qcow2 snapshots.

Do you know if there is a reason why we can't always do this already
during bdrv_snapshot_goto()?

> The chain will look like:
> 
> As1 --> As2
>  |
>  V
>  A
> 
> This is how sheepdog handles savevm/loadvm.
> 
> So for 'qemu-img snapshot -a', we should create the branch in the
> .bdrv_snapshot_goto.
> 
> As you pointed out, we need to consider vm state even for 'snapshot -a',
> so I need to rework the patch 2/2.

Yes, the presence of VM state is independent from whether you're using
qemu-img or loadvm. And it actually goes both ways: qemu-img can revert
to snapshots that have a VM state, and loadvm can be used with images
that don't have a VM state (if you have multiple images, only one of
them has the VM state).

Kevin
namei.unix@gmail.com - June 7, 2013, 1:48 p.m.
On 06/07/2013 03:31 PM, Kevin Wolf wrote:
> Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
>> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
>>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
>>>> Just call sd_create_branch() to rollback the image is good enough
>>>>
>>>> Cc: qemu-devel@nongnu.org
>>>> Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
>>>> Cc: Kevin Wolf <kwolf@redhat.com>
>>>> Cc: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
>>>> ---
>>>>  block/sheepdog.c |    8 +++++---
>>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/sheepdog.c b/block/sheepdog.c
>>>> index 94218ac..cb5ca4a 100644
>>>> --- a/block/sheepdog.c
>>>> +++ b/block/sheepdog.c
>>>> @@ -2072,9 +2072,11 @@ static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>>>>      }
>>>>  
>>>>      if (!s->inode.vm_state_size) {
>>>> -        error_report("Invalid snapshot");
>>>> -        ret = -ENOENT;
>>>> -        goto out;
>>>> +        /* qemu-img asks us to rollback, we need to do it right now */
>>>> +        ret = sd_create_branch(s);
>>>> +        if (ret) {
>>>> +            goto out;
>>>> +        }
>>>>      }
>>>
>>> I'm not sure how snapshots work internally for Sheepdog, but it seems
>>> odd to me that you need to do this only for disk-only snapshots, but not
>>> when the snapshot has VM state. (Also, note that 'qemu-img snapshot -a'
>>> works on images with a VM state, so the comment doesn't seem to be
>>> completely accurate)
>>>
>>> Would you mind explaining to me how this works in detail?
>>>
>>
>> Hmm, the original code isn't written by me and this snapshot mechanism
>> exists since day 0. I just hacks it to work now. So I'll try to explain
>> on my understanding.
>>
>> When we do a snapshot such as 'savedvm' or 'qemu-img snapshot', the
>> active vdi is snapshotted and marked as snapshot and a new vdi is
>> created as copy-on-write on the previous active vdi, then this new vdi
>> becomes active vdi. For e.g,
>>
>> As1 --> As2 --> A
>>
>> We take snapshot of vdi A twice, tagged s1 and s2 respectively. I guess
>> this is quit similar to qcow2 snapshots, only inode object with a bitmap
>> is created.
>>
>> So when we 'loadvm' or 'qemu-img snapshot -a' to A, current logic just
>> handle 'loadvm', that .bdrv_snapshot_goto only reloads inode object,
>> that is, for e.g, we 'savevm s1', and mark it as snapshot, the chain
>> would like
>>
>> As1 --> As2 --> A
>>  |
>>  v
>>  just reload As1's inode object
>>
>> Only when the write comes from VM, we do the following stuff
>>  - delete active vdi A
>>  - created a new inode based on the previously reloaded As1's inode
> 
> Thanks, this is the part that I missed.
> 
> I'm not sure however why the actual switch is delayed until the first
> write. This seems inconsistent with qcow2 snapshots.
> 
> Do you know if there is a reason why we can't always do this already
> during bdrv_snapshot_goto()?
> 

I think the reason is sd_load_vmstate() need to load vm state objects
with the correct inode object.

I tried to remove

  if (!s->inode.vm_state_size)

and make sd_create_branch unconditional. This means 'loadvm' command
will try to call sd_create_branch() inside sd_snapshot_goto(). But
failed with reading the wrong snapshot because the vdi's inode object is
changed by sd_create_branch().

>> The chain will look like:
>>
>> As1 --> As2
>>  |
>>  V
>>  A
>>
>> This is how sheepdog handles savevm/loadvm.
>>
>> So for 'qemu-img snapshot -a', we should create the branch in the
>> .bdrv_snapshot_goto.
>>
>> As you pointed out, we need to consider vm state even for 'snapshot -a',
>> so I need to rework the patch 2/2.
> 
> Yes, the presence of VM state is independent from whether you're using
> qemu-img or loadvm. And it actually goes both ways: qemu-img can revert
> to snapshots that have a VM state, and loadvm can be used with images
> that don't have a VM state (if you have multiple images, only one of
> them has the VM state).
> 

Seems not true of current code. If I 'loadvm' a snapshot without a
vmstate, I'll get 'qemu-system-x86_64: This is a disk-only snapshot.
Revert to it offline using qemu-img'.

But 'qemu-img snapshot -a' works as you said, it can rollback to the
snapshot regardless of vmstate.

Also this is a difference to store vmstate for sheepdog images. *Every*
snapshot image can have its own vmstate stored in sheepdog cluster. That
is, we can have multiple snapshot with its own private vmstate for sheepdog.

I think my patch did the correct thing, just rollback the disk state of
the snapshot for 'qemu-img snapshot -a'. Anyway, I found a minor issue
of 2/2 patch, so I'll resend the set.

Thanks,
Yuan
Kevin Wolf - June 7, 2013, 3:22 p.m.
Am 07.06.2013 um 15:48 hat Liu Yuan geschrieben:
> On 06/07/2013 03:31 PM, Kevin Wolf wrote:
> > Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
> >> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> >>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
> >> Only when the write comes from VM, we do the following stuff
> >>  - delete active vdi A
> >>  - created a new inode based on the previously reloaded As1's inode
> > 
> > Thanks, this is the part that I missed.
> > 
> > I'm not sure however why the actual switch is delayed until the first
> > write. This seems inconsistent with qcow2 snapshots.
> > 
> > Do you know if there is a reason why we can't always do this already
> > during bdrv_snapshot_goto()?
> > 
> 
> I think the reason is sd_load_vmstate() need to load vm state objects
> with the correct inode object.
> 
> I tried to remove
> 
>   if (!s->inode.vm_state_size)
> 
> and make sd_create_branch unconditional. This means 'loadvm' command
> will try to call sd_create_branch() inside sd_snapshot_goto(). But
> failed with reading the wrong snapshot because the vdi's inode object is
> changed by sd_create_branch().

Ok, I think I start to understand how these things work. Basically,
qemu's savevm functionality is designed to work with qcow2 like this:

1. Save the VM state to the active layer
2. create_snapshot saves the active layer including VM state
3. [ Forget to remove the VM state from the active layer :-) ]
4. loadvm loads the snapshot again, including VM state
5. VM state is restored from the active layer

So for Sheepdog, the problem is that step 2 doesn't include the VM state,
right? So our options are either to change Sheepdog so that step 2 does
involve moving the VM state (might end up rather ugly) or you need to
swap steps 1 and 2, so that you first switch to the new snapshot and
then write the VM state.

Because I think what we're doing currently is not only hard to
understand, but actually wrong. Consider this:

1. savevm with a Sheepdog image
2. Exit qemu before any write request is sent
3. Restart qemu with the Sheepdog image and notice how the wrong
   snapshot is active. Oops.

Or doesn't Sheepdog have any notion of an "active snapshot" and this is
only a runtime decision?

> >> The chain will look like:
> >>
> >> As1 --> As2
> >>  |
> >>  V
> >>  A
> >>
> >> This is how sheepdog handles savevm/loadvm.
> >>
> >> So for 'qemu-img snapshot -a', we should create the branch in the
> >> .bdrv_snapshot_goto.
> >>
> >> As you pointed out, we need to consider vm state even for 'snapshot -a',
> >> so I need to rework the patch 2/2.
> > 
> > Yes, the presence of VM state is independent from whether you're using
> > qemu-img or loadvm. And it actually goes both ways: qemu-img can revert
> > to snapshots that have a VM state, and loadvm can be used with images
> > that don't have a VM state (if you have multiple images, only one of
> > them has the VM state).
> > 
> 
> Seems not true of current code. If I 'loadvm' a snapshot without a
> vmstate, I'll get 'qemu-system-x86_64: This is a disk-only snapshot.
> Revert to it offline using qemu-img'.

Try again with multiple disks. Only one disk gets the VM state, the
other ones get only the disk snapshot.

> But 'qemu-img snapshot -a' works as you said, it can rollback to the
> snapshot regardless of vmstate.
> 
> Also this is a difference to store vmstate for sheepdog images. *Every*
> snapshot image can have its own vmstate stored in sheepdog cluster. That
> is, we can have multiple snapshot with its own private vmstate for sheepdog.

Sorry, can't parse. :-(

> I think my patch did the correct thing, just rollback the disk state of
> the snapshot for 'qemu-img snapshot -a'. Anyway, I found a minor issue
> of 2/2 patch, so I'll resend the set.

I think I agree that the problem is less with your patch, but more with
the preexisting code. We should get it sorted out anyway.

Kevin
namei.unix@gmail.com - June 7, 2013, 4:14 p.m.
On 06/07/2013 11:22 PM, Kevin Wolf wrote:
> Am 07.06.2013 um 15:48 hat Liu Yuan geschrieben:
>> On 06/07/2013 03:31 PM, Kevin Wolf wrote:
>>> Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
>>>> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
>>>>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
>>>> Only when the write comes from VM, we do the following stuff
>>>>  - delete active vdi A
>>>>  - created a new inode based on the previously reloaded As1's inode
>>>
>>> Thanks, this is the part that I missed.
>>>
>>> I'm not sure however why the actual switch is delayed until the first
>>> write. This seems inconsistent with qcow2 snapshots.
>>>
>>> Do you know if there is a reason why we can't always do this already
>>> during bdrv_snapshot_goto()?
>>>
>>
>> I think the reason is sd_load_vmstate() need to load vm state objects
>> with the correct inode object.
>>
>> I tried to remove
>>
>>   if (!s->inode.vm_state_size)
>>
>> and make sd_create_branch unconditional. This means 'loadvm' command
>> will try to call sd_create_branch() inside sd_snapshot_goto(). But
>> failed with reading the wrong snapshot because the vdi's inode object is
>> changed by sd_create_branch().
> 
> Ok, I think I start to understand how these things work. Basically,
> qemu's savevm functionality is designed to work with qcow2 like this:
> 
> 1. Save the VM state to the active layer
> 2. create_snapshot saves the active layer including VM state
> 3. [ Forget to remove the VM state from the active layer :-) ]
> 4. loadvm loads the snapshot again, including VM state
> 5. VM state is restored from the active layer
> 
> So for Sheepdog, the problem is that step 2 doesn't include the VM state,
> right? So our options are either to change Sheepdog so that step 2 does
> involve moving the VM state (might end up rather ugly) or you need to
> swap steps 1 and 2, so that you first switch to the new snapshot and
> then write the VM state.
> 

Sorry, I didn't fully understand the above example. If sheepdog takes
snapshots, vmstate will go with the exact current active disk into the
cluster storage, for e.g

1 we take two snapshots on the disk 'test' with tag A and B respectively
2 disk(A) + vmstate(A) will be stored as indexed by A's vdi
  disk(B) + vmstate(B) will be stored as indexed by B's vdi

The chain is A --> B --> C, C is the current active vdi. Note, both A, B
and C has different vdi_id.

The problem show up when we do loadvm A, the process is:

1 reload inode of A (in snapshot_goto)
2 read vmstate via A's vdi_id (loadvm_state)
3 delete C and create C1, reload inode of C1 (sd_create_branch on write)

So if at stage 1, we call sd_create_branch(), the inode will point to C1
and stage 2 will fail to read vmstate because vdi_id is C1's.

This is why we rely on the write to call sd_create_branch(). I am not
sure if we can fix sheepdog's loadvm without touching the core code of QEMU.


> Because I think what we're doing currently is not only hard to
> understand, but actually wrong. Consider this:
> 
> 1. savevm with a Sheepdog image
> 2. Exit qemu before any write request is sent
> 3. Restart qemu with the Sheepdog image and notice how the wrong
>    snapshot is active. Oops.
> 

Sheepdog indeed has this problem.

> Or doesn't Sheepdog have any notion of an "active snapshot" and this is
> only a runtime decision?
> 


>>>> The chain will look like:
>>>>
>>>> As1 --> As2
>>>>  |
>>>>  V
>>>>  A
>>>>
>>>> This is how sheepdog handles savevm/loadvm.
>>>>
>>>> So for 'qemu-img snapshot -a', we should create the branch in the
>>>> .bdrv_snapshot_goto.
>>>>
>>>> As you pointed out, we need to consider vm state even for 'snapshot -a',
>>>> so I need to rework the patch 2/2.
>>>
>>> Yes, the presence of VM state is independent from whether you're using
>>> qemu-img or loadvm. And it actually goes both ways: qemu-img can revert
>>> to snapshots that have a VM state, and loadvm can be used with images
>>> that don't have a VM state (if you have multiple images, only one of
>>> them has the VM state).
>>>
>>
>> Seems not true of current code. If I 'loadvm' a snapshot without a
>> vmstate, I'll get 'qemu-system-x86_64: This is a disk-only snapshot.
>> Revert to it offline using qemu-img'.
> 
> Try again with multiple disks. Only one disk gets the VM state, the
> other ones get only the disk snapshot.
> 
>> But 'qemu-img snapshot -a' works as you said, it can rollback to the
>> snapshot regardless of vmstate.
>>
>> Also this is a difference to store vmstate for sheepdog images. *Every*
>> snapshot image can have its own vmstate stored in sheepdog cluster. That
>> is, we can have multiple snapshot with its own private vmstate for sheepdog.
> 
> Sorry, can't parse. :-(
> 

I misunderstood your statements, please ignore it.

Thanks,
Yuan
Kevin Wolf - June 7, 2013, 4:48 p.m.
Am 07.06.2013 um 18:14 hat Liu Yuan geschrieben:
> On 06/07/2013 11:22 PM, Kevin Wolf wrote:
> > Am 07.06.2013 um 15:48 hat Liu Yuan geschrieben:
> >> On 06/07/2013 03:31 PM, Kevin Wolf wrote:
> >>> Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
> >>>> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> >>>>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
> >>>> Only when the write comes from VM, we do the following stuff
> >>>>  - delete active vdi A
> >>>>  - created a new inode based on the previously reloaded As1's inode
> >>>
> >>> Thanks, this is the part that I missed.
> >>>
> >>> I'm not sure however why the actual switch is delayed until the first
> >>> write. This seems inconsistent with qcow2 snapshots.
> >>>
> >>> Do you know if there is a reason why we can't always do this already
> >>> during bdrv_snapshot_goto()?
> >>>
> >>
> >> I think the reason is sd_load_vmstate() need to load vm state objects
> >> with the correct inode object.
> >>
> >> I tried to remove
> >>
> >>   if (!s->inode.vm_state_size)
> >>
> >> and make sd_create_branch unconditional. This means 'loadvm' command
> >> will try to call sd_create_branch() inside sd_snapshot_goto(). But
> >> failed with reading the wrong snapshot because the vdi's inode object is
> >> changed by sd_create_branch().
> > 
> > Ok, I think I start to understand how these things work. Basically,
> > qemu's savevm functionality is designed to work with qcow2 like this:
> > 
> > 1. Save the VM state to the active layer
> > 2. create_snapshot saves the active layer including VM state
> > 3. [ Forget to remove the VM state from the active layer :-) ]
> > 4. loadvm loads the snapshot again, including VM state
> > 5. VM state is restored from the active layer
> > 
> > So for Sheepdog, the problem is that step 2 doesn't include the VM state,
> > right? So our options are either to change Sheepdog so that step 2 does
> > involve moving the VM state (might end up rather ugly) or you need to
> > swap steps 1 and 2, so that you first switch to the new snapshot and
> > then write the VM state.
> > 
> 
> Sorry, I didn't fully understand the above example. If sheepdog takes
> snapshots, vmstate will go with the exact current active disk into the
> cluster storage, for e.g
> 
> 1 we take two snapshots on the disk 'test' with tag A and B respectively
> 2 disk(A) + vmstate(A) will be stored as indexed by A's vdi
>   disk(B) + vmstate(B) will be stored as indexed by B's vdi
> 
> The chain is A --> B --> C, C is the current active vdi. Note, both A, B
> and C has different vdi_id.
> 
> The problem show up when we do loadvm A, the process is:
> 
> 1 reload inode of A (in snapshot_goto)
> 2 read vmstate via A's vdi_id (loadvm_state)
> 3 delete C and create C1, reload inode of C1 (sd_create_branch on write)
> 
> So if at stage 1, we call sd_create_branch(), the inode will point to C1
> and stage 2 will fail to read vmstate because vdi_id is C1's.
> 
> This is why we rely on the write to call sd_create_branch(). I am not
> sure if we can fix sheepdog's loadvm without touching the core code of QEMU.

Hm, yes, I was confused. :-)

Anyway, the options stay the same: Either C1 must somehow inherit the VM
state from A on the Sheepdog level, or we must make sure to get this
order:

1. Switch to (read-only) snapshot A
2. Load the VM state
3. Create (writable) snapshot C1 and switch to it

This is a different order as required for qcow2 (where C1 would inherit
the VM state from A, so our current first 1, then 3, then 2 works), but
if we can't fix it on the Sheepdog side, we'll have to touch the core
code of qemu. I don't like much to touch the block.c code for it, but if
it's required to fix a bug, let's just do it.

> > Because I think what we're doing currently is not only hard to
> > understand, but actually wrong. Consider this:
> > 
> > 1. savevm with a Sheepdog image
> > 2. Exit qemu before any write request is sent
> > 3. Restart qemu with the Sheepdog image and notice how the wrong
> >    snapshot is active. Oops.
> > 
> 
> Sheepdog indeed has this problem.

Thanks for confirming.

Kevin
namei.unix@gmail.com - June 7, 2013, 5:23 p.m.
On 06/08/2013 12:48 AM, Kevin Wolf wrote:
> Hm, yes, I was confused. :-)
> 
> Anyway, the options stay the same: Either C1 must somehow inherit the VM
> state from A on the Sheepdog level, or we must make sure to get this
> order:
> 
> 1. Switch to (read-only) snapshot A
> 2. Load the VM state
> 3. Create (writable) snapshot C1 and switch to it
> 
> This is a different order as required for qcow2 (where C1 would inherit
> the VM state from A, so our current first 1, then 3, then 2 works), but
> if we can't fix it on the Sheepdog side, we'll have to touch the core
> code of qemu. I don't like much to touch the block.c code for it, but if
> it's required to fix a bug, let's just do it.

Hmm, I found actually we can inherit A's vdi_id from C1. I've sent v3
and I think this version will fix the possible bug you pointed out and
do loadvm the way core code assumes.

Thanks,
Yuan

Patch

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 94218ac..cb5ca4a 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2072,9 +2072,11 @@  static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
     }
 
     if (!s->inode.vm_state_size) {
-        error_report("Invalid snapshot");
-        ret = -ENOENT;
-        goto out;
+        /* qemu-img asks us to rollback, we need to do it right now */
+        ret = sd_create_branch(s);
+        if (ret) {
+            goto out;
+        }
     }
 
     s->is_snapshot = true;