diff mbox

[1/3] block: allow migration to work with image files

Message ID 1284213896-12705-2-git-send-email-aliguori@us.ibm.com
State New
Headers show

Commit Message

Anthony Liguori Sept. 11, 2010, 2:04 p.m. UTC
Image files have two types of data: immutable data that describes things like
image size, backing files, etc. and mutable data that includes offset and
reference count tables.

Today, image formats aggressively cache mutable data to improve performance.  In
some cases, this happens before a guest even starts.  When dealing with live
migration, since a file is open on two machines, the caching of meta data can
lead to data corruption.

This patch addresses this by introducing a mechanism to invalidate any cached
mutable data a block driver may have which is then used by the live migration
code.

NB, this still requires coherent shared storage.  Addressing migration without
coherent shared storage (i.e. NFS) requires additional work.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

Comments

Avi Kivity Sept. 12, 2010, 10:37 a.m. UTC | #1
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
>

Note: even the logical size is, in principle, mutable.  If we introduce 
external snapshots, then so are backing files?
Anthony Liguori Sept. 12, 2010, 1:06 p.m. UTC | #2
On 09/12/2010 05:37 AM, Avi Kivity wrote:
>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> Image files have two types of data: immutable data that describes 
>> things like
>> image size, backing files, etc. and mutable data that includes offset 
>> and
>> reference count tables.
>>
>
> Note: even the logical size is, in principle, mutable.  If we 
> introduce external snapshots, then so are backing files?

Maybe it should be, "data the can be changed during a live migration".

Backing files and logical size shouldn't change during live migration.

But even so, I think the interface make sense.  It's basically, drop 
anything you have cached that may change during migration.  What needs 
to be read is dependent on features/format.

Regards,

Anthony Liguori
Avi Kivity Sept. 12, 2010, 1:28 p.m. UTC | #3
On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>
> Backing files and logical size shouldn't change during live migration.

Why not?

>
> But even so, I think the interface make sense.  It's basically, drop 
> anything you have cached that may change during migration.  What needs 
> to be read is dependent on features/format.

Sure.  Certainly as an incremental step.
Anthony Liguori Sept. 12, 2010, 3:26 p.m. UTC | #4
On 09/12/2010 08:28 AM, Avi Kivity wrote:
>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>
>> Backing files and logical size shouldn't change during live migration.
>
> Why not?

To make our lives easier.

Regards,

Anthony Liguori

>>
>> But even so, I think the interface make sense.  It's basically, drop 
>> anything you have cached that may change during migration.  What 
>> needs to be read is dependent on features/format.
>
> Sure.  Certainly as an incremental step.
>
Avi Kivity Sept. 12, 2010, 4:06 p.m. UTC | #5
On 09/12/2010 05:26 PM, Anthony Liguori wrote:
> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>
>>> Backing files and logical size shouldn't change during live migration.
>>
>> Why not?
>
> To make our lives easier.

It means management needs to block volume resize while a live migration 
takes place.  Since live migration is typically done by the system 
automatically, while volume resize happens in response to user request, 
this isn't a good idea.  Both in terms of user experience, and in terms 
of pushing more complexity to management.
Anthony Liguori Sept. 12, 2010, 5:10 p.m. UTC | #6
On 09/12/2010 11:06 AM, Avi Kivity wrote:
>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>
>>>> Backing files and logical size shouldn't change during live migration.
>>>
>>> Why not?
>>
>> To make our lives easier.
>
> It means management needs to block volume resize while a live 
> migration takes place.  Since live migration is typically done by the 
> system automatically, while volume resize happens in response to user 
> request, this isn't a good idea.  Both in terms of user experience, 
> and in terms of pushing more complexity to management.

We don't do volume resize today so it's a moot point.

Regards,

Anthony Liguori
Avi Kivity Sept. 12, 2010, 5:51 p.m. UTC | #7
On 09/12/2010 07:10 PM, Anthony Liguori wrote:
> On 09/12/2010 11:06 AM, Avi Kivity wrote:
>>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>>
>>>>> Backing files and logical size shouldn't change during live 
>>>>> migration.
>>>>
>>>> Why not?
>>>
>>> To make our lives easier.
>>
>> It means management needs to block volume resize while a live 
>> migration takes place.  Since live migration is typically done by the 
>> system automatically, while volume resize happens in response to user 
>> request, this isn't a good idea.  Both in terms of user experience, 
>> and in terms of pushing more complexity to management.
>
> We don't do volume resize today so it's a moot point.
>

Let's be prepared for the future.
Kevin Wolf Sept. 13, 2010, 8:21 a.m. UTC | #8
Am 11.09.2010 16:04, schrieb Anthony Liguori:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
> 
> Today, image formats aggressively cache mutable data to improve performance.  In
> some cases, this happens before a guest even starts.  When dealing with live
> migration, since a file is open on two machines, the caching of meta data can
> lead to data corruption.
> 
> This patch addresses this by introducing a mechanism to invalidate any cached
> mutable data a block driver may have which is then used by the live migration
> code.
> 
> NB, this still requires coherent shared storage.  Addressing migration without
> coherent shared storage (i.e. NFS) requires additional work.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> 
> diff --git a/block.c b/block.c
> index ebbc376..cd2ee31 100644
> --- a/block.c
> +++ b/block.c
> @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
>      return bs->device_name;
>  }
>  
> +void bdrv_invalidate_cache(BlockDriverState *bs)
> +{
> +    if (bs->drv && bs->drv->bdrv_invalidate_cache) {
> +        bs->drv->bdrv_invalidate_cache(bs);
> +    }
> +}

There is a simple generic implementation:

drv = bs->drv;
drv->close(bs);
drv->open(bs, bs->open_flags);

Note that this only reopens e.g. the qcow2 layer, but not the image
file, which is bs->file.

This works for all simple case, that is, one format on top of one or
more protocols, where the protocols don't need invalidation. I think
this includes everything that is possible today. With -blockdev we might
need to revise this to include the lower layers, too. (But only
sometimes, because we don't want to reopen the file)

Kevin
Anthony Liguori Sept. 13, 2010, 1:27 p.m. UTC | #9
On 09/13/2010 03:21 AM, Kevin Wolf wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>    
>> Image files have two types of data: immutable data that describes things like
>> image size, backing files, etc. and mutable data that includes offset and
>> reference count tables.
>>
>> Today, image formats aggressively cache mutable data to improve performance.  In
>> some cases, this happens before a guest even starts.  When dealing with live
>> migration, since a file is open on two machines, the caching of meta data can
>> lead to data corruption.
>>
>> This patch addresses this by introducing a mechanism to invalidate any cached
>> mutable data a block driver may have which is then used by the live migration
>> code.
>>
>> NB, this still requires coherent shared storage.  Addressing migration without
>> coherent shared storage (i.e. NFS) requires additional work.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>
>> diff --git a/block.c b/block.c
>> index ebbc376..cd2ee31 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
>>       return bs->device_name;
>>   }
>>
>> +void bdrv_invalidate_cache(BlockDriverState *bs)
>> +{
>> +    if (bs->drv&&  bs->drv->bdrv_invalidate_cache) {
>> +        bs->drv->bdrv_invalidate_cache(bs);
>> +    }
>> +}
>>      
> There is a simple generic implementation:
>
> drv = bs->drv;
> drv->close(bs);
> drv->open(bs, bs->open_flags);
>
> Note that this only reopens e.g. the qcow2 layer, but not the image
> file, which is bs->file.
>    

That's a pretty good idea for a general implementation, I'll update the 
patches accordingly.

Regards,

Anthony Liguori

> This works for all simple case, that is, one format on top of one or
> more protocols, where the protocols don't need invalidation. I think
> this includes everything that is possible today. With -blockdev we might
> need to revise this to include the lower layers, too. (But only
> sometimes, because we don't want to reopen the file)
>
> Kevin
>
Juan Quintela Sept. 15, 2010, 3:53 p.m. UTC | #10
Anthony Liguori <aliguori@us.ibm.com> wrote:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
>
> Today, image formats aggressively cache mutable data to improve performance.  In
> some cases, this happens before a guest even starts.  When dealing with live
> migration, since a file is open on two machines, the caching of meta data can
> lead to data corruption.
>
> This patch addresses this by introducing a mechanism to invalidate any cached
> mutable data a block driver may have which is then used by the live migration
> code.
>
> NB, this still requires coherent shared storage.  Addressing migration without
> coherent shared storage (i.e. NFS) requires additional work.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

For the NFS case, we just need 2 different case:
- outgoing: nothing to do (generic code already do an fsync())
- incoming: we need to reopen the image.

For the rest, I agree.  Once told that, I can implement my changes here.

Later, Juan.
Juan Quintela Sept. 15, 2010, 3:57 p.m. UTC | #11
Avi Kivity <avi@redhat.com> wrote:
>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>
>> Backing files and logical size shouldn't change during live migration.
>
> Why not?

It is supposed to be the same in both sides O:-)
If you want to extend a disk, you need to told qemu about that.  Either
before or after migration, but not "during" migration.

>
>>
>> But even so, I think the interface make sense.  It's basically, drop
>> anything you have cached that may change during migration.  What
>> needs to be read is dependent on features/format.
>
> Sure.  Certainly as an incremental step.

Later, Juan.
Juan Quintela Sept. 15, 2010, 4 p.m. UTC | #12
Avi Kivity <avi@redhat.com> wrote:
>  On 09/12/2010 07:10 PM, Anthony Liguori wrote:
>> On 09/12/2010 11:06 AM, Avi Kivity wrote:
>>>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>>>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>>>
>>>>>> Backing files and logical size shouldn't change during live
>>>>>> migration.
>>>>>
>>>>> Why not?
>>>>
>>>> To make our lives easier.
>>>
>>> It means management needs to block volume resize while a live
>>> migration takes place.  Since live migration is typically done by
>>> the system automatically, while volume resize happens in response
>>> to user request, this isn't a good idea.  Both in terms of user
>>> experience, and in terms of pushing more complexity to management.
>>
>> We don't do volume resize today so it's a moot point.
>>
>
> Let's be prepared for the future.

This is better done when we would be able to create the machine during
migration.  As we have everything organized today, this is basically
imposible :(

The problem is not related with live migration, is related with the fact
that we need to give the same arguments in both source & target machines.

Later, Juan.
Juan Quintela Sept. 15, 2010, 4:03 p.m. UTC | #13
Kevin Wolf <kwolf@redhat.com> wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:

> There is a simple generic implementation:
>
> drv = bs->drv;
> drv->close(bs);
> drv->open(bs, bs->open_flags);
>
> Note that this only reopens e.g. the qcow2 layer, but not the image
> file, which is bs->file.
>
> This works for all simple case, that is, one format on top of one or
> more protocols, where the protocols don't need invalidation. I think
> this includes everything that is possible today. With -blockdev we might
> need to revise this to include the lower layers, too. (But only
> sometimes, because we don't want to reopen the file)

we require it for nfs consistency.  We need to:

source: fsync()
target: open() (after the fsync).

About your 1st comment, I am doing:

bdrv_close(bdrv);
bdrv_open(bdrv);

Do you mean that this don't close and open again the file with qcow2?

Later, Juan.
Kevin Wolf Sept. 16, 2010, 7:54 a.m. UTC | #14
Am 15.09.2010 18:03, schrieb Juan Quintela:
> Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 11.09.2010 16:04, schrieb Anthony Liguori:
> 
>> There is a simple generic implementation:
>>
>> drv = bs->drv;
>> drv->close(bs);
>> drv->open(bs, bs->open_flags);
>>
>> Note that this only reopens e.g. the qcow2 layer, but not the image
>> file, which is bs->file.
>>
>> This works for all simple case, that is, one format on top of one or
>> more protocols, where the protocols don't need invalidation. I think
>> this includes everything that is possible today. With -blockdev we might
>> need to revise this to include the lower layers, too. (But only
>> sometimes, because we don't want to reopen the file)
> 
> we require it for nfs consistency.  We need to:
> 
> source: fsync()
> target: open() (after the fsync).
> 
> About your 1st comment, I am doing:
> 
> bdrv_close(bdrv);
> bdrv_open(bdrv);
> 
> Do you mean that this don't close and open again the file with qcow2?

It does. bdrv_open/close take care of the whole stack.

Kevin
diff mbox

Patch

diff --git a/block.c b/block.c
index ebbc376..cd2ee31 100644
--- a/block.c
+++ b/block.c
@@ -1453,6 +1453,22 @@  const char *bdrv_get_device_name(BlockDriverState *bs)
     return bs->device_name;
 }
 
+void bdrv_invalidate_cache(BlockDriverState *bs)
+{
+    if (bs->drv && bs->drv->bdrv_invalidate_cache) {
+        bs->drv->bdrv_invalidate_cache(bs);
+    }
+}
+
+void bdrv_invalidate_cache_all(void)
+{
+    BlockDriverState *bs;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, list) {
+        bdrv_invalidate_cache(bs);
+    }
+}
+
 void bdrv_flush(BlockDriverState *bs)
 {
     if (bs->open_flags & BDRV_O_NO_FLUSH) {
diff --git a/block.h b/block.h
index 5f64380..387f6d3 100644
--- a/block.h
+++ b/block.h
@@ -141,6 +141,10 @@  BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
         unsigned long int req, void *buf,
         BlockDriverCompletionFunc *cb, void *opaque);
 
+/* Invalidate any cached metadata used by image formats */
+void bdrv_invalidate_cache(BlockDriverState *bs);
+void bdrv_invalidate_cache_all(void);
+
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
diff --git a/block_int.h b/block_int.h
index e8e7156..bca99b2 100644
--- a/block_int.h
+++ b/block_int.h
@@ -60,6 +60,7 @@  struct BlockDriver {
     void (*bdrv_close)(BlockDriverState *bs);
     int (*bdrv_create)(const char *filename, QEMUOptionParameter *options);
     void (*bdrv_flush)(BlockDriverState *bs);
+    void (*bdrv_invalidate_cache)(BlockDriverState *bs);
     int (*bdrv_is_allocated)(BlockDriverState *bs, int64_t sector_num,
                              int nb_sectors, int *pnum);
     int (*bdrv_set_key)(BlockDriverState *bs, const char *key);
diff --git a/migration.c b/migration.c
index 468d517..64d3d15 100644
--- a/migration.c
+++ b/migration.c
@@ -69,6 +69,9 @@  void process_incoming_migration(QEMUFile *f)
 
     incoming_expected = false;
 
+    /* Make sure all file formats flush their mutable metadata */
+    bdrv_invalidate_cache_all();
+
     if (autostart)
         vm_start();
 }
@@ -370,6 +373,7 @@  void migrate_fd_put_ready(void *opaque)
 
         qemu_aio_flush();
         bdrv_flush_all();
+        bdrv_invalidate_cache_all();
         if ((qemu_savevm_state_complete(s->mon, s->file)) < 0) {
             if (old_vm_running) {
                 vm_start();