Patchwork [3/3] disk: don't read from disk until the guest starts

Submitter Anthony Liguori
Date Sept. 11, 2010, 2:04 p.m.
Message ID <1284213896-12705-4-git-send-email-aliguori@us.ibm.com>
Permalink /patch/64516/
State New
Headers show

Comments

Anthony Liguori - Sept. 11, 2010, 2:04 p.m.
This fixes a couple of nasty problems relating to live migration.

1) When dealing with shared storage with weak coherence (e.g. NFS), even if
   we re-read, we may end up with undesired caching.  By delaying any reads
   until we absolutely have to, we decrease the likelihood of any undesirable
   caching.

2) When dealing with copy-on-read, the local storage acts as a cache.  We need
   to avoid any reads so as not to pollute the local cache.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Stefan Hajnoczi - Sept. 11, 2010, 5:24 p.m.
On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>   we re-read, we may end up with undesired caching.  By delaying any reads
>   until we absolutely have to, we decrease the likelihood of any undesirable
>   caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>   to make sure to avoid any reads to avoid polluting the local cache.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index 1e466d1..57d8db3 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>        return;
>     }
>
> +    guess_geometry(s);
> +

Does the same change need to be made in ide_cfata_identify()?

I quickly checked the VMStateDescription and don't see cylinders,
heads, sectors being saved for migration.  I am concerned that IDE
will break after migration if the following happens:
1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads,
sectors are not initialized.
2. Normal I/O is performed, invoking ide_get_sector() which uses
geometry information that has not been initialized.

Did I miss something?

> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index bd6bbe6..0bf17ec 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>
>     bdrv_get_geometry(s->bs, &capacity);
>     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
> +    if (cylinders == 0) {
> +        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
> +    }
> +

bdrv_guess_geometry() can be called unconditionally.  The call to
bdrv_get_geometry_hint() can be eliminated.  bdrv_guess_geometry()
updates the geometry hint and does not probe the boot sector after the
first time.
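Stefan's point, that the hint caching already makes the conditional redundant, can be sketched roughly as follows. Everything here is a stand-in (struct, function name, geometry values), not the real QEMU block-layer API:

```c
#include <assert.h>

/* Rough sketch of the caching behaviour described above: the first call
 * probes the disk and records a geometry hint; later calls just return
 * the cached hint, so callers need no "cylinders == 0" guard.  All
 * names and values are illustrative, not the real bdrv_* interface. */
typedef struct {
    int hint_cyls, hint_heads, hint_secs;
    int probes;                       /* how many expensive reads happened */
} BlockSketch;

static void sketch_guess_geometry(BlockSketch *bs, int *c, int *h, int *s)
{
    if (bs->hint_cyls == 0) {
        bs->probes++;                 /* boot-sector probe: at most once */
        bs->hint_cyls = 1024;
        bs->hint_heads = 16;
        bs->hint_secs = 63;
    }
    *c = bs->hint_cyls;
    *h = bs->hint_heads;
    *s = bs->hint_secs;
}
```

Calling it twice probes only once, which is why the caller's guard can go away.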

Stefan
Anthony Liguori - Sept. 11, 2010, 5:34 p.m.
On 09/11/2010 12:24 PM, Stefan Hajnoczi wrote:
> On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>    we re-read, we may end up with undesired caching.  By delaying any reads
>>    until we absolutely have to, we decrease the likelihood of any undesirable
>>    caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>    to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 1e466d1..57d8db3 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>         return;
>>      }
>>
>> +    guess_geometry(s);
>> +
>>      
> Does the same change need to be made in ide_cfata_identify()?
>
> I quickly checked the VMStateDescription and don't see cylinders,
> heads, sectors being saved for migration.  I am concerned that IDE
> will break after migration if the following happens:
> 1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads,
> sectors are not initialized.
> 2. Normal I/O is performed, invoking ide_get_sector() which uses
> geometry information that has not been initialized.
>
> Did I miss something?
>    

Nope, I just missed some places that need calls.  I've renamed 
guess_geometry() to init_geometry() and added calls in 
ide_cfata_identify(), ide_get_sector(), and ide_set_sector().

That should cover it all.
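The shape of that fix, as I understand it, is a lazy idempotent initializer called from every path that consumes geometry, so a migrated guest that never issues IDENTIFY still gets valid fields. A sketch with a made-up struct and probe values, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the lazy-init pattern: a "done once" flag (hypothetical,
 * not in the real IDEState) guards a probe, and every consumer of
 * geometry calls the initializer first. */
typedef struct {
    int cylinders, heads, sectors;
    int geometry_valid;               /* hypothetical guard flag */
} IDESketch;

static void sketch_init_geometry(IDESketch *s)
{
    if (s->geometry_valid) {
        return;                       /* idempotent: probe at most once */
    }
    s->cylinders = 1024;              /* stand-in for the real probe */
    s->heads = 16;
    s->sectors = 63;
    s->geometry_valid = 1;
}

/* CHS arithmetic in the style of ide_get_sector(); without the guard,
 * a migrated guest would compute with zero-initialized geometry. */
static int64_t sketch_get_sector(IDESketch *s, int c, int h, int sec)
{
    sketch_init_geometry(s);
    return ((int64_t)c * s->heads + h) * s->sectors + (sec - 1);
}
```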

>> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
>> index bd6bbe6..0bf17ec 100644
>> --- a/hw/virtio-blk.c
>> +++ b/hw/virtio-blk.c
>> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>>
>>      bdrv_get_geometry(s->bs,&capacity);
>>      bdrv_get_geometry_hint(s->bs,&cylinders,&heads,&secs);
>> +    if (cylinders == 0) {
>> +        bdrv_guess_geometry(s->bs,&cylinders,&heads,&secs);
>> +    }
>> +
>>      
> bdrv_guess_geometry() can be called unconditionally.  The call to
> bdrv_get_geometry_hint() can be eliminated.  bdrv_guess_geometry()
> updates the geometry hint and does not probe the boot sector after the
> first time.
>    

Yeah, that's a fair point.

Regards,

Anthony Liguori

> Stefan
>
>
Avi Kivity - Sept. 12, 2010, 10:42 a.m.
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>     we re-read, we may end up with undesired caching.  By delaying any reads
>     until we absolutely have to, we decrease the likelihood of any undesirable
>     caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>     to make sure to avoid any reads to avoid polluting the local cache.
>
> +
>   static void ide_identify(IDEState *s)
>   {
>       uint16_t *p;
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>   	return;
>       }
>
> +    guess_geometry(s);
> +

This can cause a disk read, no?  Shouldn't it be made asynchronous?

Or just move it to just before the guest starts?
Anthony Liguori - Sept. 12, 2010, 1:08 p.m.
On 09/12/2010 05:42 AM, Avi Kivity wrote:
>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), 
>> even if
>>     we re-read, we may end up with undesired caching.  By delaying 
>> any reads
>>     until we absolutely have to, we decrease the likelihood of any 
>> undesirable
>>     caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a 
>> cache.  We need
>>     to make sure to avoid any reads to avoid polluting the local cache.
>>
>> +
>>   static void ide_identify(IDEState *s)
>>   {
>>       uint16_t *p;
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>       return;
>>       }
>>
>> +    guess_geometry(s);
>> +
>
> This can cause a disk read, no?  Shouldn't it be made asynchronous?

Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous, though; not because of the block layer, but because of how 
these functions are called.

>
> Or just move it to just before the guest starts?

We don't really have a notion of "guest starts" today although maybe we 
should.

Regards,

Anthony Liguori
Avi Kivity - Sept. 12, 2010, 1:26 p.m.
On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>
>
> Yes, it should.  I'm not sure there's an easy way to make it 
> asynchronous though not because of the block layer but because of how 
> these functions are called.

Sorry to harp on the subject, but that's the standard problem with state 
machines.  Every time you want to do a blocking operation in a function, 
you have to put all its locals in some structure, split the function 
into two, do some scheduling, etc.

>>
>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today although maybe 
> we should.

Wasn't there some qdev callback that represents this?  Faint memory from 
the reset thread.
Anthony Liguori - Sept. 12, 2010, 3:29 p.m.
On 09/12/2010 08:26 AM, Avi Kivity wrote:
>  On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>>
>>
>> Yes, it should.  I'm not sure there's an easy way to make it 
>> asynchronous though not because of the block layer but because of how 
>> these functions are called.
>
> Sorry to harp on the subject, but that's the standard problem with 
> state machines.  Every time you want to do a blocking operation in a 
> function, you have to put all its locals in some structure, split the 
> function into two, do some scheduling, etc.

We can't block the VCPU thread for arbitrarily long periods of time.  If 
we get a PIO operation requesting information about geometry, we can't 
wait for a disk read in order to satisfy that request.

We need to kick off the I/O operations in the background such that the 
data is available before the PIO operation happens.  This isn't SM vs. 
threads at all, this is simply about the fact that we can't do block I/O 
during a PIO operation.
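The constraint can be made concrete: the slow read is started ahead of time (say, at device init) and the PIO path only consumes data that is already there. All names below are hypothetical; in real code the prefetch would be an asynchronous bdrv read completing via callback, while here it completes inline:

```c
#include <assert.h>
#include <string.h>

/* Sketch of "kick off I/O in the background": a completion callback
 * flips a flag, and the PIO handler must never wait for it. */
typedef struct {
    int data_ready;
    unsigned char identify_buf[512];
} DevSketch;

static void sketch_read_done(DevSketch *d)
{
    memset(d->identify_buf, 0, sizeof(d->identify_buf));
    d->data_ready = 1;
}

static void sketch_prefetch(DevSketch *d)
{
    sketch_read_done(d);              /* stand-in for an aio read */
}

/* PIO handler: fails fast instead of blocking when data isn't ready */
static int sketch_pio_identify(DevSketch *d)
{
    return d->data_ready ? 0 : -1;
}
```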

>>>
>>> Or just move it to just before the guest starts?
>>
>> We don't really have a notion of "guest starts" today although maybe 
>> we should.
>
> Wasn't there some qdev callback that represents this?  Faint memory 
> from the reset thread.

Yes, realize().  I guess that's a reasonable approach.

Regards,

Anthony Liguori
Avi Kivity - Sept. 12, 2010, 4:04 p.m.
On 09/12/2010 05:29 PM, Anthony Liguori wrote:
> On 09/12/2010 08:26 AM, Avi Kivity wrote:
>>  On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>>>
>>>
>>> Yes, it should.  I'm not sure there's an easy way to make it 
>>> asynchronous though not because of the block layer but because of 
>>> how these functions are called.
>>
>> Sorry to harp on the subject, but that's the standard problem with 
>> state machines.  Every time you want to do a blocking operation in a 
>> function, you have to put all its locals in some structure, split the 
>> function into two, do some scheduling, etc.
>
> We can't block the VCPU thread for arbitrarily long periods of time.  
> If we get a PIO operation requesting information about geometry, we 
> can't wait for a disk read in order to satisfy that request.
>
> We need to kick off the I/O operations in the background such that the 
> data is available before the PIO operation happens.  This isn't SM vs. 
> threads at all, this is simply about the fact that we can't do block 
> I/O during a PIO operation.

Isn't this an identify command, where the guest can only read the data 
after the interface indicates it is ready (or posts an interrupt)?  I 
thought the guest interface was already async.

The code appears to support this:

         switch(val) {
         case WIN_IDENTIFY:
             if (s->bs && s->drive_kind != IDE_CD) {
                 if (s->drive_kind != IDE_CFATA)
                     ide_identify(s);
                 else
                     ide_cfata_identify(s);
                 s->status = READY_STAT | SEEK_STAT;
                 ide_transfer_start(s, s->io_buffer, 512, ide_transfer_stop);
             } else {
                 if (s->drive_kind == IDE_CD) {
                     ide_set_signature(s);
                 }
                 ide_abort_command(s);
             }
             ide_set_irq(s->bus);
             break;

but I may be misinterpreting it.  If I'm right, all we need to do is 
push this whole block to a thread.
Kevin Wolf - Sept. 13, 2010, 8:32 a.m.
Am 11.09.2010 16:04, schrieb Anthony Liguori:
> This fixes a couple nasty problems relating to live migration.
> 
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>    we re-read, we may end up with undesired caching.  By delaying any reads
>    until we absolutely have to, we decrease the likelihood of any undesirable
>    caching.
> 
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>    to make sure to avoid any reads to avoid polluting the local cache.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

I think we should also delay opening the image file at all until the
latest possible point, to avoid introducing new problems of this kind.
Ideally, opening the image would be the very last thing we do before
telling the migration source that we're set and it should close the
images.

Even better would be to only open the image when the source has already
closed it (we could completely avoid the invalidation/reopen then), but
I think you were afraid that we might lose the image on both ends.

Kevin
Anthony Liguori - Sept. 13, 2010, 1:29 p.m.
On 09/13/2010 03:32 AM, Kevin Wolf wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>    
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>     we re-read, we may end up with undesired caching.  By delaying any reads
>>     until we absolutely have to, we decrease the likelihood of any undesirable
>>     caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>     to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>      
> I think we should also delay even opening the image file at all to the
> latest possible point to avoid that new problems of this kind are
> introduced. Ideally, opening the image would be the very last thing we
> do before telling the migration source that we're set and it should
> close the images.
>    

There are a lot of possible failure scenarios that opening an image file 
can introduce.  Fortunately, I don't think we have a strict requirement 
for it provided that we make a couple of reasonable changes.

> Even better would be to only open the image when the source has already
> closed it (we could completely avoid the invalidation/reopen then), but
> I think you were afraid that we might lose the image on both ends.
>    

Yeah, one of the key design points of live migration is to minimize the 
number of failure scenarios where you lose a VM.  If someone typed the 
wrong command line or shared storage hasn't been mounted yet and we 
delay failure until live migration is in the critical path, that would 
be terribly unfortunate.

Regards,

Anthony Liguori

> Kevin
>
Kevin Wolf - Sept. 13, 2010, 1:39 p.m.
Am 13.09.2010 15:29, schrieb Anthony Liguori:
> On 09/13/2010 03:32 AM, Kevin Wolf wrote:
>> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>>    
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>>     we re-read, we may end up with undesired caching.  By delaying any reads
>>>     until we absolutely have to, we decrease the likelihood of any undesirable
>>>     caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>>     to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>>      
>> I think we should also delay even opening the image file at all to the
>> latest possible point to avoid that new problems of this kind are
>> introduced. Ideally, opening the image would be the very last thing we
>> do before telling the migration source that we're set and it should
>> close the images.
>>    
> 
> There's a lot of possible failure scenarios that opening an image file 
> can introduce.  Fortunately, I don't think we have a strict requirement 
> for it provided that we make a couple of reasonable changes.
>
>> Even better would be to only open the image when the source has already
>> closed it (we could completely avoid the invalidation/reopen then), but
>> I think you were afraid that we might lose the image on both ends.
>>    
> 
> Yeah, one of the key design points of live migration is to minimize the 
> number of failure scenarios where you lose a VM.  If someone typed the 
> wrong command line or shared storage hasn't been mounted yet and we 
> delay failure until live migration is in the critical path, that would 
> be terribly unfortunate.

We would catch most of them if we try to open the image when migration
starts and immediately close it again until migration is (almost)
completed, so that no other code can possibly use it before the source
has really closed it.

Kevin
Anthony Liguori - Sept. 13, 2010, 1:42 p.m.
On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>> Yeah, one of the key design points of live migration is to minimize the
>> number of failure scenarios where you lose a VM.  If someone typed the
>> wrong command line or shared storage hasn't been mounted yet and we
>> delay failure until live migration is in the critical path, that would
>> be terribly unfortunate.
>>      
> We would catch most of them if we try to open the image when migration
> starts and immediately close it again until migration is (almost)
> completed, so that no other code can possibly use it before the source
> has really closed it.
>    

I think the only real advantage is that we fix NFS migration, right?

But if we do invalidate_cache() as you suggested with a close/open of 
the qcow2 layer, and also acquire and release a lock in the file layer 
by propagating the invalidate_cache(), that should work robustly with NFS.

I think that's a simpler change.  Do you see additional advantages to 
delaying the open?
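A rough shape for that propagation, assuming nothing about the real driver interface (the struct, field names, and function are all made up): the format layer drops its cached metadata, then forwards the call to the protocol layer underneath, which could cycle a lock to force NFS to revalidate.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only; not the real BlockDriver API. */
typedef struct DrvSketch DrvSketch;
struct DrvSketch {
    DrvSketch *file;                  /* underlying protocol driver */
    int invalidated;
};

static void sketch_invalidate_cache(DrvSketch *d)
{
    d->invalidated = 1;               /* e.g. qcow2: reload L1/refcounts */
    if (d->file) {
        /* e.g. file layer: release and reacquire a lock */
        sketch_invalidate_cache(d->file);
    }
}
```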

Regards,

Anthony Liguori

> Kevin
>
Kevin Wolf - Sept. 13, 2010, 2:13 p.m.
Am 13.09.2010 15:42, schrieb Anthony Liguori:
> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>> Yeah, one of the key design points of live migration is to minimize the
>>> number of failure scenarios where you lose a VM.  If someone typed the
>>> wrong command line or shared storage hasn't been mounted yet and we
>>> delay failure until live migration is in the critical path, that would
>>> be terribly unfortunate.
>>>      
>> We would catch most of them if we try to open the image when migration
>> starts and immediately close it again until migration is (almost)
>> completed, so that no other code can possibly use it before the source
>> has really closed it.
>>    
> 
> I think the only real advantage is that we fix NFS migration, right?

That's the one that we know about, yes.

The rest is not a specific scenario, but a strong feeling that having an
image opened twice at the same time is dangerous. As soon as an
open/close sequence writes to the image for some format, we probably
have a bug. For example, what about this mounted flag that you were
discussing for QED?

> But if we do invalidate_cache() as you suggested with a close/open of 
> the qcow2 layer, and also acquire and release a lock in the file layer 
> by propagating the invalidate_cache(), that should work robustly with NFS.
> 
> I think that's a simpler change.  Do you see additional advantages to 
> delaying the open?

Just that it makes it very obvious if a device model is doing bad things
and accessing the image before it should. The difference is a failed
request vs. silently corrupted data.

Kevin
Anthony Liguori - Sept. 13, 2010, 2:34 p.m.
On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>> I think the only real advantage is that we fix NFS migration, right?
>>      
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an
> image opened twice at the same time feels dangerous.

We've never really had clear semantics about live migration and block 
driver's life cycles.  At a high level, for live migration to work, we 
need the following sequence:

1) src> flush all pending writes to disk
2) <barrier>
3) dst> invalidate any cached data
4) dst> start guest

We've gotten away with ignoring (3) because raw disks never cache anything.  
But that assumes that we're on top of cache coherent storage.  If we 
don't have fully cache coherent storage, we need to do more.

We need to extend (3) to also flush the cache of the underlying 
storage.  There are two ways we can solve this, we can either ensure 
that (3) is a nop by not having any operations that would cause caching 
until after (3), or we can find a way to inject a flush into the 
underlying cache.

Since the latter approach requires storage-specific knowledge, I went with 
the former approach.  Of course, if you did a close-open sequence at 
(3), it may achieve the same goal but only really for NFS.  If you have 
something weaker than close-to-open coherency, you still need to do 
something unique in step (3).

I don't know that I see a perfect model.  Pushing reads past point (3) 
is easy and fixes raw on top of NFS.  I think we want to do that because 
it's low-hanging fruit.  A block driver hook for (3) also seems 
appealing because we can make use of it easily in QED.

That said, I'm open to suggestions of a better model.  Delaying open 
(especially if you open, then close, then open again) seems a bit hacky.

With respect to the devices, I think the question of when block devices 
can begin accessing drives is orthogonal to this discussion.  Even 
without delaying open, we could simply not give them their 
BlockDriverStates until realize() or something like that.

Regards,

Anthony Liguori

>   As soon as an
> open/close sequence writes to the image for some format, we probably
> have a bug. For example, what about this mounted flag that you were
> discussing for QED?
>
>    
>> But if we do invalidate_cache() as you suggested with a close/open of
>> the qcow2 layer, and also acquire and release a lock in the file layer
>> by propagating the invalidate_cache(), that should work robustly with NFS.
>>
>> I think that's a simpler change.  Do you see additional advantages to
>> delaying the open?
>>      
> Just that it makes it very obvious if a device model is doing bad things
> and accessing the image before it should. The difference is a failed
> request vs. silently corrupted data.
>
> Kevin
>
>
Stefan Hajnoczi - Sept. 13, 2010, 7:29 p.m.
On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>> Yeah, one of the key design points of live migration is to minimize the
>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>> delay failure until live migration is in the critical path, that would
>>>> be terribly unfortunate.
>>>>
>>> We would catch most of them if we try to open the image when migration
>>> starts and immediately close it again until migration is (almost)
>>> completed, so that no other code can possibly use it before the source
>>> has really closed it.
>>>
>>
>> I think the only real advantage is that we fix NFS migration, right?
>
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an
> image opened twice at the same time feels dangerous. As soon as an
> open/close sequence writes to the image for some format, we probably
> have a bug. For example, what about this mounted flag that you were
> discussing for QED?

There is some room left to work in, even if we can't check in open().
One idea would be to do the check asynchronously once I/O begins.  It
is actually easy to check L1/L2 tables as they are loaded.

The only barrier relationship between I/O and checking is that an
allocating write (which will need to update L1/L2 tables) is only
allowed after check completes.  Otherwise reads and non-allocating
writes may proceed while the image is not yet fully checked.  We can
detect when a table element is an invalid offset and discard it.
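The per-entry validation could look roughly like this; the cluster and image sizes are invented for illustration, and the real checks would of course be format-specific:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of checking table entries as they are loaded: an offset that
 * is misaligned or out of range is discarded (treated as unallocated)
 * rather than trusted.  Constants are made up. */
#define SK_CLUSTER_SIZE  65536ULL
#define SK_IMAGE_SIZE    (1024 * SK_CLUSTER_SIZE)

static uint64_t sketch_load_l2_entry(uint64_t raw)
{
    if (raw % SK_CLUSTER_SIZE != 0 || raw >= SK_IMAGE_SIZE) {
        return 0;                     /* discard: reads see unallocated */
    }
    return raw;
}
```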

Stefan
Kevin Wolf - Sept. 13, 2010, 8:03 p.m.
Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>> delay failure until live migration is in the critical path, that would
>>>>> be terribly unfortunate.
>>>>>
>>>> We would catch most of them if we try to open the image when migration
>>>> starts and immediately close it again until migration is (almost)
>>>> completed, so that no other code can possibly use it before the source
>>>> has really closed it.
>>>>
>>>
>>> I think the only real advantage is that we fix NFS migration, right?
>>
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an
>> image opened twice at the same time feels dangerous. As soon as an
>> open/close sequence writes to the image for some format, we probably
>> have a bug. For example, what about this mounted flag that you were
>> discussing for QED?
> 
> There is some room left to work in, even if we can't check in open().
> One idea would be to do the check asynchronously once I/O begins.  It
> is actually easy to check L1/L2 tables as they are loaded.
> 
> The only barrier relationship between I/O and checking is that an
> allocating write (which will need to update L1/L2 tables) is only
> allowed after check completes.  Otherwise reads and non-allocating
> writes may proceed while the image is not yet fully checked.  We can
> detect when a table element is an invalid offset and discard it.

I'm not even talking about such complicated things. You wanted to have a
dirty flag in the header, right? So when we allow opening an image
twice, you get this sequence with migration:

Source: open
Destination: open (with dirty image)
Source: close

The image is now marked as clean, even though the destination is still
working on it.

Kevin
Anthony Liguori - Sept. 13, 2010, 8:09 p.m.
On 09/13/2010 03:03 PM, Kevin Wolf wrote:
> Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
>    
>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>      
>>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>>        
>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>          
>>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>>> delay failure until live migration is in the critical path, that would
>>>>>> be terribly unfortunate.
>>>>>>
>>>>>>              
>>>>> We would catch most of them if we try to open the image when migration
>>>>> starts and immediately close it again until migration is (almost)
>>>>> completed, so that no other code can possibly use it before the source
>>>>> has really closed it.
>>>>>
>>>>>            
>>>> I think the only real advantage is that we fix NFS migration, right?
>>>>          
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that having an
>>> image opened twice at the same time feels dangerous. As soon as an
>>> open/close sequence writes to the image for some format, we probably
>>> have a bug. For example, what about this mounted flag that you were
>>> discussing for QED?
>>>        
>> There is some room left to work in, even if we can't check in open().
>> One idea would be to do the check asynchronously once I/O begins.  It
>> is actually easy to check L1/L2 tables as they are loaded.
>>
>> The only barrier relationship between I/O and checking is that an
>> allocating write (which will need to update L1/L2 tables) is only
>> allowed after check completes.  Otherwise reads and non-allocating
>> writes may proceed while the image is not yet fully checked.  We can
>> detect when a table element is an invalid offset and discard it.
>>      
> I'm not even talking about such complicated things. You wanted to have a
> dirty flag in the header, right? So when we allow opening an image
> twice, you get this sequence with migration:
>
> Source: open
> Destination: open (with dirty image)
> Source: close
>
> The image is now marked as clean, even though the destination is still
> working on it.
>    

The dirty flag should be read on demand (which is the first time we 
fetch an L1/L2 table).

I agree that the life cycle of the block drivers is getting fuzzy.  Need 
to think quite a bit here.

Regards,

Anthony Liguori

> Kevin
>
Kevin Wolf - Sept. 14, 2010, 8:28 a.m.
Am 13.09.2010 22:09, schrieb Anthony Liguori:
> On 09/13/2010 03:03 PM, Kevin Wolf wrote:
>> Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
>>    
>>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>      
>>>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>>>        
>>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>>          
>>>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>>>> delay failure until live migration is in the critical path, that would
>>>>>>> be terribly unfortunate.
>>>>>>>
>>>>>>>              
>>>>>> We would catch most of them if we try to open the image when migration
>>>>>> starts and immediately close it again until migration is (almost)
>>>>>> completed, so that no other code can possibly use it before the source
>>>>>> has really closed it.
>>>>>>
>>>>>>            
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>>>          
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that having an
>>>> image opened twice at the same time feels dangerous. As soon as an
>>>> open/close sequence writes to the image for some format, we probably
>>>> have a bug. For example, what about this mounted flag that you were
>>>> discussing for QED?
>>>>        
>>> There is some room left to work in, even if we can't check in open().
>>> One idea would be to do the check asynchronously once I/O begins.  It
>>> is actually easy to check L1/L2 tables as they are loaded.
>>>
>>> The only barrier relationship between I/O and checking is that an
>>> allocating write (which will need to update L1/L2 tables) is only
>>> allowed after check completes.  Otherwise reads and non-allocating
>>> writes may proceed while the image is not yet fully checked.  We can
>>> detect when a table element is an invalid offset and discard it.
>>>      
>> I'm not even talking about such complicated things. You wanted to have a
>> dirty flag in the header, right? So when we allow opening an image
>> twice, you get this sequence with migration:
>>
>> Source: open
>> Destination: open (with dirty image)
>> Source: close
>>
>> The image is now marked as clean, even though the destination is still
>> working on it.
>>    
> 
> The dirty flag should be read on demand (which is the first time we 
> fetch an L1/L2 table).
> 
> I agree that the life cycle of the block drivers is getting fuzzy.  Need 
> to think quite a bit here.

The point is that after the close the flag is wrong on disk. That's
completely unrelated to when you read it. If anything, you could also
delay writing the flag until the first write, so that the close comes first.

But honestly, I'm not very excited about pulling half of the open()
logic into read/write functions.

Kevin
Avi Kivity - Sept. 14, 2010, 9:47 a.m.
On 09/13/2010 04:34 PM, Anthony Liguori wrote:
> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>> I think the only real advantage is that we fix NFS migration, right?
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an
>> image opened twice at the same time feels dangerous.
>
> We've never really had clear semantics about live migration and block 
> driver's life cycles.  At a high level, for live migration to work, we 
> need the following sequence:
>
> 1) src> flush all pending writes to disk
> 2) <barrier>
> 3) dst> invalidate any cached data
> 4) dst> start guest

That's pretty complicated, compared to

1) src> close
2) dst> open
3) dst> start guest

There are two failure scenarios with this model:

1. dst cannot open the image

We fix that by killing dst and continuing src (which has to re-open its 
images).

2. dst cannot open the image, and src cannot either

In this case, what would be gained by having an image handle open in one 
of the hosts, but no way to open it again?  As soon as the surviving 
qemu exited (or crashed), the image would be lost forever.

To get (1) working correctly we need an event that tells management that 
all initialization has completed and the guest is ready to run (so 
management can terminate the source).
Anthony Liguori - Sept. 14, 2010, 12:51 p.m.
On 09/14/2010 04:47 AM, Avi Kivity wrote:
>  On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>> I think the only real advantage is that we fix NFS migration, right?
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that 
>>> having an
>>> image opened twice at the same time feels dangerous.
>>
>> We've never really had clear semantics about live migration and block 
>> driver's life cycles.  At a high level, for live migration to work, 
>> we need the following sequence:
>>
>> 1) src> flush all pending writes to disk
>> 2) <barrier>
>> 3) dst> invalidate any cached data
>> 4) dst> start guest
>
> That's pretty complicated, compared to
>
> 1) src> close

1.5) <barrier>

> 2) dst> open
> 3) dst> start guest

You need to make sure the open happens *after* the close.

You're just using close to flush all pending writes or open to 
invalidate any cached data.

Regards,

Anthony Liguori

> There are two failure scenarios with this model:
>
> 1. dst cannot open the image
>
> We fix that by killing dst and continuing src (which has to re-open 
> its images).
>
> 2. dst cannot open the image, and src cannot either
>
> In this case, what would be gained by having an image handle open in 
> one of the hosts, but no way to open it again?  As soon as the 
> surviving qemu exited (or crashed), the image would be lost forever.
>
> To get (1) working correctly we need an event that tells management 
> that all initialization has completed and the guest is ready to run 
> (so management can terminate the source).
>
Avi Kivity - Sept. 14, 2010, 1:16 p.m.
On 09/14/2010 02:51 PM, Anthony Liguori wrote:
> On 09/14/2010 04:47 AM, Avi Kivity wrote:
>>  On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that 
>>>> having an
>>>> image opened twice at the same time feels dangerous.
>>>
>>> We've never really had clear semantics about live migration and 
>>> block driver's life cycles.  At a high level, for live migration to 
>>> work, we need the following sequence:
>>>
>>> 1) src> flush all pending writes to disk
>>> 2) <barrier>
>>> 3) dst> invalidate any cached data
>>> 4) dst> start guest
>>
>> That's pretty complicated, compared to
>>
>> 1) src> close
>
> 1.5) <barrier>
>
>> 2) dst> open
>> 3) dst> start guest
>
> You need to make sure the open happens *after* the close.
>
> You're just using close to flush all pending writes or open to 
> invalidate any cached data.


Right, that's simpler than teaching all block format drivers about 
invalidating any cached data, or using nfs locks to force the host to do 
the same.

It also complies with close-to-open consistency without relying on 
implementation details.
Juan Quintela - Sept. 15, 2010, 4:10 p.m.
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 09/12/2010 05:42 AM, Avi Kivity wrote:
>>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS),
>>> even if
>>>     we re-read, we may end up with undesired caching.  By delaying
>>> any reads
>>>     until we absolutely have to, we decrease the likelihood of any
>>> undesirable
>>>     caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a
>>> cache.  We need
>>>     to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> +
>>>   static void ide_identify(IDEState *s)
>>>   {
>>>       uint16_t *p;
>>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>>       return;
>>>       }
>>>
>>> +    guess_geometry(s);
>>> +
>>
>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>
> Yes, it should.  I'm not sure there's an easy way to make it
> asynchronous though not because of the block layer but because of how
> these functions are called.
>
>>
>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today although maybe
> we should.

I agree that we should have more states.  Expecting Migration/Migration
done are quite special cases that are handled ad hoc now.  Having
some state like NeverRun would make it trivial to have vm_start() do the
right thing on the first run.

Later, Juan.

> Regards,
>
> Anthony Liguori
Juan Quintela - Sept. 15, 2010, 4:16 p.m.
Kevin Wolf <kwolf@redhat.com> wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>> This fixes a couple nasty problems relating to live migration.
>> 
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>    we re-read, we may end up with undesired caching.  By delaying any reads
>>    until we absolutely have to, we decrease the likelihood of any undesirable
>>    caching.
>> 
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>    to make sure to avoid any reads to avoid polluting the local cache.
>> 
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> I think we should also delay even opening the image file at all to the
> latest possible point to avoid that new problems of this kind are
> introduced. Ideally, opening the image would be the very last thing we
> do before telling the migration source that we're set and it should
> close the images.
>
> Even better would be to only open the image when the source has already
> closed it (we could completely avoid the invalidation/reopen then), but
> I think you were afraid that we might lose the image on both ends.

But then we have formats like qcow2 that store the "number_of_sectors"
inside the file.

And we need the size of the disk to initialize some structures :(

About having the image opened in two different machines .....

- Having it opened in two places makes it "interesting" for coherence reasons.
- Not having it makes the failure case of anything failing during
  migration ... interesting.

The real solution for this problem is to just send the
size/head/cylinder/sector ... values in the migration stream and not
have to open the image.

The problem here is that, with qemu's architecture, we have to know these
values before we call the migration functions.

As far as I can see, until we are able to create a machine from inside
the monitor (i.e. qemu initialization & machine creation are decoupled),
we are not going to be able to fix migration properly (using that very
same mechanism to create the machine during migration).

Later, Juan.

Patch

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 1e466d1..57d8db3 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -94,6 +94,33 @@  static void put_le16(uint16_t *p, unsigned int v)
     *p = cpu_to_le16(v);
 }
 
+static int guess_geometry(IDEState *s)
+{
+    int cylinders, heads, secs;
+
+    if (s->cylinders != 0) {
+        return -1;
+    }
+
+    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    if (cylinders < 1 || cylinders > 16383) {
+        error_report("cyls must be between 1 and 16383");
+        return -1;
+    }
+    if (heads < 1 || heads > 16) {
+        error_report("heads must be between 1 and 16");
+        return -1;
+    }
+    if (secs < 1 || secs > 63) {
+        error_report("secs must be between 1 and 63");
+        return -1;
+    }
+    s->cylinders = cylinders;
+    s->heads = heads;
+    s->sectors = secs;
+    return 0;
+}
+
 static void ide_identify(IDEState *s)
 {
     uint16_t *p;
@@ -105,6 +132,8 @@  static void ide_identify(IDEState *s)
 	return;
     }
 
+    guess_geometry(s);
+
     memset(s->io_buffer, 0, 512);
     p = (uint16_t *)s->io_buffer;
     put_le16(p + 0, 0x0040);
@@ -2614,27 +2643,13 @@  void ide_bus_reset(IDEBus *bus)
 int ide_init_drive(IDEState *s, BlockDriverState *bs,
                    const char *version, const char *serial)
 {
-    int cylinders, heads, secs;
     uint64_t nb_sectors;
 
     s->bs = bs;
-    bdrv_get_geometry(bs, &nb_sectors);
-    bdrv_guess_geometry(bs, &cylinders, &heads, &secs);
-    if (cylinders < 1 || cylinders > 16383) {
-        error_report("cyls must be between 1 and 16383");
-        return -1;
-    }
-    if (heads < 1 || heads > 16) {
-        error_report("heads must be between 1 and 16");
-        return -1;
-    }
-    if (secs < 1 || secs > 63) {
-        error_report("secs must be between 1 and 63");
-        return -1;
-    }
-    s->cylinders = cylinders;
-    s->heads = heads;
-    s->sectors = secs;
+    bdrv_get_geometry(s->bs, &nb_sectors);
+    s->cylinders = 0;
+    s->heads = 0;
+    s->sectors = 0;
     s->nb_sectors = nb_sectors;
     /* The SMART values should be preserved across power cycles
        but they aren't.  */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index bd6bbe6..0bf17ec 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -427,6 +427,10 @@  static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 
     bdrv_get_geometry(s->bs, &capacity);
     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
+    if (cylinders == 0) {
+        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    }
+
     memset(&blkcfg, 0, sizeof(blkcfg));
     stq_raw(&blkcfg.capacity, capacity);
     stl_raw(&blkcfg.seg_max, 128 - 2);
@@ -501,7 +505,6 @@  static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
 VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
 {
     VirtIOBlock *s;
-    int cylinders, heads, secs;
     static int virtio_blk_id;
     DriveInfo *dinfo;
 
@@ -525,7 +528,6 @@  VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
     s->conf = conf;
     s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
-    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     /* NB: per existing s/n string convention the string is terminated
      * by '\0' only when less than sizeof (s->sn)