[5/7] block: allow migration to work with image files (v2)

Message ID 1321113420-3252-5-git-send-email-aliguori@us.ibm.com
State New

Commit Message

Anthony Liguori Nov. 12, 2011, 3:56 p.m. UTC
Image files have two types of data: immutable data that describes things like
image size, backing files, etc., and mutable data that includes offset and
reference count tables.

Today, image formats aggressively cache mutable data to improve performance.  In
some cases, this happens before a guest even starts.  When dealing with live
migration, since the same file is open on two machines, the caching of metadata can
lead to data corruption.

This patch addresses the problem by introducing a mechanism to invalidate any
cached mutable data a block driver may have; the live migration code then uses
this mechanism.

NB, this still requires coherent shared storage.  Addressing migration without
coherent shared storage (e.g. NFS) requires additional work.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1 -> v2
 - rebase to latest master
---
 block.c     |   16 ++++++++++++++++
 block.h     |    4 ++++
 block_int.h |    5 +++++
 cpus.c      |    1 +
 migration.c |    3 +++
 5 files changed, 29 insertions(+), 0 deletions(-)
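
The hunks below add only the generic hook and its call sites; the driver-side
callback is implemented elsewhere in the series.  As an illustrative sketch
only (the names here are assumptions, not part of this patch), a format driver
would opt in by filling in the new BlockDriver member:

    /* Hypothetical sketch only: the real qcow2 callback is added in a
     * separate patch of this series. */
    static void qcow2_invalidate_cache(BlockDriverState *bs);

    static BlockDriver bdrv_qcow2 = {
        .format_name            = "qcow2",
        .instance_size          = sizeof(BDRVQcowState),
        .bdrv_invalidate_cache  = qcow2_invalidate_cache,
        /* ... remaining callbacks unchanged ... */
    };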

Comments

Juan Quintela Nov. 14, 2011, 1:11 p.m. UTC | #1
> diff --git a/cpus.c b/cpus.c
> index 82530c4..ae5ec99 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>          vm_state_notify(0, state);
>          qemu_aio_flush();
>          bdrv_flush_all();
> +        bdrv_invalidate_cache_all();
>          monitor_protocol_event(QEVENT_STOP, NULL);
>      }

This is too much. Reopening all qcow2 images each time that we stop the
VM looks excessive, no?

Later, Juan.
Anthony Liguori Nov. 14, 2011, 2:10 p.m. UTC | #2
On 11/14/2011 07:11 AM, Juan Quintela wrote:
>
>> diff --git a/cpus.c b/cpus.c
>> index 82530c4..ae5ec99 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>           vm_state_notify(0, state);
>>           qemu_aio_flush();
>>           bdrv_flush_all();
>> +        bdrv_invalidate_cache_all();
>>           monitor_protocol_event(QEVENT_STOP, NULL);
>>       }
>
> This is too much. Reopening all qcow2 images each time that we stop the
> VM looks excessive, no?

This general code came in via:

http://mid.gmane.org/cover.1290613959.git.mst@redhat.com

That series made migration stable after issuing a stop operation.  I believe the 
justification was for debugging purposes or something like that.

At any rate, invalidating the cache is part of what's required to make things 
stable.  If you look at something like cache=unsafe, the only way the metadata 
will get flushed is via a bdrv_close, since bdrv_flush is a nop.

So this is needed as long as we care about supporting this use-case.
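
(For reference, the nop behaviour comes from BDRV_O_NO_FLUSH, which
cache=unsafe sets at open time.  A simplified sketch of the flush path, from
memory, so the exact callback name may differ:)

    int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
    {
        if (bs->open_flags & BDRV_O_NO_FLUSH) {
            return 0;   /* cache=unsafe: the flush never reaches the disk */
        }
        if (bs->drv && bs->drv->bdrv_co_flush_to_disk) {
            return bs->drv->bdrv_co_flush_to_disk(bs);
        }
        return 0;       /* no driver callback: nothing to do */
    }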

Regards,

Anthony Liguori

>
> Later, Juan.
>
Juan Quintela Nov. 14, 2011, 7:46 p.m. UTC | #3
Anthony Liguori <aliguori@us.ibm.com> wrote:
> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>
>>> diff --git a/cpus.c b/cpus.c
>>> index 82530c4..ae5ec99 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>           vm_state_notify(0, state);
>>>           qemu_aio_flush();
>>>           bdrv_flush_all();
>>> +        bdrv_invalidate_cache_all();
>>>           monitor_protocol_event(QEVENT_STOP, NULL);
>>>       }
>>
>> This is too much. Reopening all qcow2 images each time that we stop the
>> VM looks excessive, no?
>
> This general code came in via:
>
> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>
> That series made migration stable after issuing a stop operation.  I
> believe the justification was for debugging purposes or something like
> that.
>
> At any rate, invalidating the cache is part of what's required to make
> things stable.  If you look at something like cache=unsafe, the only
> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
> is a nop.
>
> So this is needed as long as we care about supporting this use-case.

Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:

(qemu)stop

And now all your qcow2 block devices are closed, or perhaps fail to
re-open(). That looks like too much to me (TM).

Kevin?

Later, Juan.
Anthony Liguori Nov. 14, 2011, 7:49 p.m. UTC | #4
On 11/14/2011 01:46 PM, Juan Quintela wrote:
> Anthony Liguori<aliguori@us.ibm.com>  wrote:
>> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>>
>>>> diff --git a/cpus.c b/cpus.c
>>>> index 82530c4..ae5ec99 100644
>>>> --- a/cpus.c
>>>> +++ b/cpus.c
>>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>>            vm_state_notify(0, state);
>>>>            qemu_aio_flush();
>>>>            bdrv_flush_all();
>>>> +        bdrv_invalidate_cache_all();
>>>>            monitor_protocol_event(QEVENT_STOP, NULL);
>>>>        }
>>>
>>> This is too much. Reopening all qcow2 images each time that we stop the
>>> VM looks excessive, no?
>>
>> This general code came in via:
>>
>> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>>
>> That series made migration stable after issuing a stop operation.  I
>> believe the justification was for debugging purposes or something like
>> that.
>>
>> At any rate, invalidating the cache is part of what's required to make
>> things stable.  If you look at something like cache=unsafe, the only
>> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
>> is a nop.
>>
>> So this is needed as long as we care about supporting this use-case.
>
> Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:
>
> (qemu)stop
>
> And now all your qcow2 block devices are closed, or perhaps fail to
> re-open(). That looks like too much to me (TM).
>
> Kevin?

Look closely at the patch.  It doesn't actually close()/open() anything.

It just invokes the bdrv_close() routine, which calls the free functions on the
l1/l2 caches.  bdrv_open() doesn't actually open anything (it assumes
the file is already open); it just reads the header and metadata over again.

For something that's basically a hack, it turned out to work very cleanly :-)
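
Sketched out (from that description; the actual callback is in the qcow2 patch
of this series, so take the details here with a grain of salt):

    static void qcow2_invalidate_cache(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        int flags = s->flags;   /* open flags stashed by qcow2_open() */

        /* Frees the L1 table and the L2/refcount caches; the underlying
         * file descriptor stays open throughout. */
        qcow2_close(bs);

        /* Re-reads the header and metadata from the still-open file. */
        memset(s, 0, sizeof(BDRVQcowState));
        qcow2_open(bs, flags);
    }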

Regards,

Anthony Liguori

>
> Later, Juan.
>
>
Kevin Wolf Nov. 14, 2011, 8:11 p.m. UTC | #5
Am 14.11.2011 20:49, schrieb Anthony Liguori:
> On 11/14/2011 01:46 PM, Juan Quintela wrote:
>> Anthony Liguori<aliguori@us.ibm.com>  wrote:
>>> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>>>
>>>>> diff --git a/cpus.c b/cpus.c
>>>>> index 82530c4..ae5ec99 100644
>>>>> --- a/cpus.c
>>>>> +++ b/cpus.c
>>>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>>>            vm_state_notify(0, state);
>>>>>            qemu_aio_flush();
>>>>>            bdrv_flush_all();
>>>>> +        bdrv_invalidate_cache_all();
>>>>>            monitor_protocol_event(QEVENT_STOP, NULL);
>>>>>        }
>>>>
>>>> This is too much. Reopening all qcow2 images each time that we stop the
>>>> VM looks excessive, no?
>>>
>>> This general code came in via:
>>>
>>> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>>>
>>> That series made migration stable after issuing a stop operation.  I
>>> believe the justification was for debugging purposes or something like
>>> that.
>>>
>>> At any rate, invalidating the cache is part of what's required to make
>>> things stable.  If you look at something like cache=unsafe, the only
>>> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
>>> is a nop.
>>>
>>> So this is needed as long as we care about supporting this use-case.
>>
>> Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:
>>
>> (qemu)stop
>>
>> And now all your qcow2 block devices are closed, or perhaps fail to
>> re-open(). That looks like too much to me (TM).
>>
>> Kevin?
> 
> Look closely at the patch.  It doesn't actually close()/open() anything.
> 
> It just invokes the bdrv_close() routine, which calls the free functions on the
> l1/l2 caches.  bdrv_open() doesn't actually open anything (it assumes
> the file is already open); it just reads the header and metadata over again.
> 
> For something that's basically a hack, it turned out to work very cleanly :-)

But why do we need to do it on stop?

I don't think it even makes sense logically: bdrv_invalidate_cache()
means "throw all your caches away and refetch everything from disk".
What do we gain from doing this on stop? To some degree I could
understand if you did it on cont, so that you can modify an image on the
host while the VM is stopped (though I would still consider it criminal
:-)).

Kevin
Anthony Liguori Nov. 14, 2011, 8:12 p.m. UTC | #6
On 11/14/2011 02:11 PM, Kevin Wolf wrote:
> Am 14.11.2011 20:49, schrieb Anthony Liguori:
>> On 11/14/2011 01:46 PM, Juan Quintela wrote:
>>> Anthony Liguori<aliguori@us.ibm.com>   wrote:
>>>> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>>>>
>>>>>> diff --git a/cpus.c b/cpus.c
>>>>>> index 82530c4..ae5ec99 100644
>>>>>> --- a/cpus.c
>>>>>> +++ b/cpus.c
>>>>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>>>>             vm_state_notify(0, state);
>>>>>>             qemu_aio_flush();
>>>>>>             bdrv_flush_all();
>>>>>> +        bdrv_invalidate_cache_all();
>>>>>>             monitor_protocol_event(QEVENT_STOP, NULL);
>>>>>>         }
>>>>>
>>>>> This is too much. Reopening all qcow2 images each time that we stop the
>>>>> VM looks excessive, no?
>>>>
>>>> This general code came in via:
>>>>
>>>> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>>>>
>>>> That series made migration stable after issuing a stop operation.  I
>>>> believe the justification was for debugging purposes or something like
>>>> that.
>>>>
>>>> At any rate, invalidating the cache is part of what's required to make
>>>> things stable.  If you look at something like cache=unsafe, the only
>>>> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
>>>> is a nop.
>>>>
>>>> So this is needed as long as we care about supporting this use-case.
>>>
>>> Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:
>>>
>>> (qemu)stop
>>>
>>> And now all your qcow2 block devices are closed, or perhaps fail to
>>> re-open(). That looks like too much to me (TM).
>>>
>>> Kevin?
>>
>> Look closely at the patch.  It doesn't actually close()/open() anything.
>>
>> It just invokes the bdrv_close() routine, which calls the free functions on the
>> l1/l2 caches.  bdrv_open() doesn't actually open anything (it assumes
>> the file is already open); it just reads the header and metadata over again.
>>
>> For something that's basically a hack, it turned out to work very cleanly :-)
>
> But why do we need to do it on stop?
>
> I don't think it even makes sense logically: bdrv_invalidate_cache()
> means "throw all your caches away and refetch everything from disk".
> What do we gain from doing this on stop? To some degree I could
> understand if you did it on cont, so that you can modify an image on the
> host while the VM is stopped (though I would still consider it criminal
> :-)).

Michael basically was trying to avoid having a VM's state change after you 
stopped the guest.

With something like cache=unsafe that periodically flushes based on a timer (I 
think), you want to make sure that that doesn't happen after stop occurs.

Regards,

Anthony Liguori

>
> Kevin
>
>
Juan Quintela Nov. 14, 2011, 8:15 p.m. UTC | #7
Kevin Wolf <kwolf@redhat.com> wrote:
> Am 14.11.2011 20:49, schrieb Anthony Liguori:
>> On 11/14/2011 01:46 PM, Juan Quintela wrote:
>>> Anthony Liguori<aliguori@us.ibm.com>  wrote:
>>>> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>>>>
>>>>>> diff --git a/cpus.c b/cpus.c
>>>>>> index 82530c4..ae5ec99 100644
>>>>>> --- a/cpus.c
>>>>>> +++ b/cpus.c
>>>>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>>>>            vm_state_notify(0, state);
>>>>>>            qemu_aio_flush();
>>>>>>            bdrv_flush_all();
>>>>>> +        bdrv_invalidate_cache_all();
>>>>>>            monitor_protocol_event(QEVENT_STOP, NULL);
>>>>>>        }
>>>>>
>>>>> This is too much. Reopening all qcow2 images each time that we stop the
>>>>> VM looks excessive, no?
>>>>
>>>> This general code came in via:
>>>>
>>>> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>>>>
>>>> That series made migration stable after issuing a stop operation.  I
>>>> believe the justification was for debugging purposes or something like
>>>> that.
>>>>
>>>> At any rate, invalidating the cache is part of what's required to make
>>>> things stable.  If you look at something like cache=unsafe, the only
>>>> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
>>>> is a nop.
>>>>
>>>> So this is needed as long as we care about supporting this use-case.
>>>
>>> Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:
>>>
>>> (qemu)stop
>>>
>>> And now all your qcow2 block devices are closed, or perhaps fail to
>>> re-open(). That looks like too much to me (TM).
>>>
>>> Kevin?
>> 
>> Look closely at the patch.  It doesn't actually close()/open() anything.

Sorry, someday I will remember the difference between bdrv_open() and
bdrv_file_open().

>> It just invokes the bdrv_close() routine, which calls the free functions on the
>> l1/l2 caches.  bdrv_open() doesn't actually open anything (it assumes
>> the file is already open); it just reads the header and metadata over again.
>> 
>> For something that's basically a hack, it turned out to work very cleanly :-)
>
> But why do we need to do it on stop?
>
> I don't think it even makes sense logically: bdrv_invalidate_cache()
> means "throw all your caches away and refetch everything from disk".
> What do we gain from doing this on stop? To some degree I could
> understand if you did it on cont, so that you can modify an image on the
> host while the VM is stopped (though I would still consider it criminal
> :-)).

Fully agree.  When I answered, I was thinking that I "could" want it on
"cont", just to be able to do evil things.  But I thought it was
"criminal" just to write the idea O:-)

Later, Juan.
Juan Quintela Nov. 14, 2011, 8:28 p.m. UTC | #8
Anthony Liguori <anthony@codemonkey.ws> wrote:

>
> Michael basically was trying to avoid having a VM's state change after
> you stopped the guest.
>
> With something like cache=unsafe that periodically flushes based on a
> timer (I think), you want to make sure that that doesn't happen after
> stop occurs.

Even then, we want to have "qcow2_flush_dirty_buffer_to_disk()" or
whatever the method is going to be called.  Doing a full "drop all
caches" and "reread" is using a cannon to fight flies, IMHO.
Especially because there are "lots" of uses of stop, and only a minority
of them want this flushed to disk (if ever).
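
Something along these lines, say (the function name is made up, but the qcow2
cache helpers it would use already exist):

    /* Hypothetical: write back dirty metadata without dropping the caches. */
    static int qcow2_flush_dirty_buffer_to_disk(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        int ret;

        ret = qcow2_cache_flush(bs, s->l2_table_cache);
        if (ret < 0) {
            return ret;
        }
        return qcow2_cache_flush(bs, s->refcount_block_cache);
    }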

My recollection is that mst's problems were with networking (vhost) rather than
with something touching the guest image while the VM is stopped.  This is not
the case here, as we are not doing anything with the guest memory, no?

Later, Juan.
Kevin Wolf Nov. 14, 2011, 8:36 p.m. UTC | #9
Am 14.11.2011 21:12, schrieb Anthony Liguori:
> On 11/14/2011 02:11 PM, Kevin Wolf wrote:
>> Am 14.11.2011 20:49, schrieb Anthony Liguori:
>>> On 11/14/2011 01:46 PM, Juan Quintela wrote:
>>>> Anthony Liguori<aliguori@us.ibm.com>   wrote:
>>>>> On 11/14/2011 07:11 AM, Juan Quintela wrote:
>>>>>>
>>>>>>> diff --git a/cpus.c b/cpus.c
>>>>>>> index 82530c4..ae5ec99 100644
>>>>>>> --- a/cpus.c
>>>>>>> +++ b/cpus.c
>>>>>>> @@ -398,6 +398,7 @@ static void do_vm_stop(RunState state)
>>>>>>>             vm_state_notify(0, state);
>>>>>>>             qemu_aio_flush();
>>>>>>>             bdrv_flush_all();
>>>>>>> +        bdrv_invalidate_cache_all();
>>>>>>>             monitor_protocol_event(QEVENT_STOP, NULL);
>>>>>>>         }
>>>>>>
>>>>>> This is too much. Reopening all qcow2 images each time that we stop the
>>>>>> VM looks excessive, no?
>>>>>
>>>>> This general code came in via:
>>>>>
>>>>> http://mid.gmane.org/cover.1290613959.git.mst@redhat.com
>>>>>
>>>>> That series made migration stable after issuing a stop operation.  I
>>>>> believe the justification was for debugging purposes or something like
>>>>> that.
>>>>>
>>>>> At any rate, invalidating the cache is part of what's required to make
>>>>> things stable.  If you look at something like cache=unsafe, the only
>>>>> way the metadata will get flushed is via a bdrv_close, since bdrv_flush
>>>>> is a nop.
>>>>>
>>>>> So this is needed as long as we care about supporting this use-case.
>>>>
>>>> Then we need a "proper" qcow2 invalidate call.  Doing this at the qemu toplevel:
>>>>
>>>> (qemu)stop
>>>>
>>>> And now all your qcow2 block devices are closed, or perhaps fail to
>>>> re-open(). That looks like too much to me (TM).
>>>>
>>>> Kevin?
>>>
>>> Look closely at the patch.  It doesn't actually close()/open() anything.
>>>
>>> It just invokes the bdrv_close() routine, which calls the free functions on the
>>> l1/l2 caches.  bdrv_open() doesn't actually open anything (it assumes
>>> the file is already open); it just reads the header and metadata over again.
>>>
>>> For something that's basically a hack, it turned out to work very cleanly :-)
>>
>> But why do we need to do it on stop?
>>
>> I don't think it even makes sense logically: bdrv_invalidate_cache()
>> means "throw all your caches away and refetch everything from disk".
>> What do we gain from doing this on stop? To some degree I could
>> understand if you did it on cont, so that you can modify an image on the
>> host while the VM is stopped (though I would still consider it criminal
>> :-)).
> 
> Michael basically was trying to avoid having a VM's state change after you 
> stopped the guest.
> 
> With something like cache=unsafe that periodically flushes based on a timer (I 
> think), you want to make sure that that doesn't happen after stop occurs.

This is a good point, but neither does cache=unsafe use a timer nor can
I see how invalidating the cache would avoid such behaviour. And
throwing away any unwritten changes doesn't really make it better.

Kevin
Anthony Liguori Nov. 14, 2011, 8:49 p.m. UTC | #10
On 11/14/2011 02:36 PM, Kevin Wolf wrote:
> Am 14.11.2011 21:12, schrieb Anthony Liguori:
>>> I don't think it even makes sense logically: bdrv_invalidate_cache()
>>> means "throw all your caches away and refetch everything from disk".
>>> What do we gain from doing this on stop? To some degree I could
>>> understand if you did it on cont, so that you can modify an image on the
>>> host while the VM is stopped (though I would still consider it criminal
>>> :-)).
>>
>> Michael basically was trying to avoid having a VM's state change after you
>> stopped the guest.
>>
>> With something like cache=unsafe that periodically flushes based on a timer (I
>> think), you want to make sure that that doesn't happen after stop occurs.
>
> This is a good point, but neither does cache=unsafe use a timer nor can
> I see how invalidating the cache would avoid such behaviour. And
> throwing away any unwritten changes doesn't really make it better.

I don't think there's any real harm in removing it, so I'll remove it in the
next rev.

Regards,

Anthony Liguori

>
> Kevin
>

Patch

diff --git a/block.c b/block.c
index 86910b0..d015887 100644
--- a/block.c
+++ b/block.c
@@ -2839,6 +2839,22 @@  int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
     }
 }
 
+void bdrv_invalidate_cache(BlockDriverState *bs)
+{
+    if (bs->drv && bs->drv->bdrv_invalidate_cache) {
+        bs->drv->bdrv_invalidate_cache(bs);
+    }
+}
+
+void bdrv_invalidate_cache_all(void)
+{
+    BlockDriverState *bs;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, list) {
+        bdrv_invalidate_cache(bs);
+    }
+}
+
 int bdrv_flush(BlockDriverState *bs)
 {
     Coroutine *co;
diff --git a/block.h b/block.h
index 051a25d..a826059 100644
--- a/block.h
+++ b/block.h
@@ -197,6 +197,10 @@  BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
         unsigned long int req, void *buf,
         BlockDriverCompletionFunc *cb, void *opaque);
 
+/* Invalidate any cached metadata used by image formats */
+void bdrv_invalidate_cache(BlockDriverState *bs);
+void bdrv_invalidate_cache_all(void);
+
 /* Ensure contents are flushed to disk.  */
 int bdrv_flush(BlockDriverState *bs);
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
diff --git a/block_int.h b/block_int.h
index 1ec4921..77c0187 100644
--- a/block_int.h
+++ b/block_int.h
@@ -88,6 +88,11 @@  struct BlockDriver {
         int64_t sector_num, int nb_sectors);
 
     /*
+     * Invalidate any cached meta-data.
+     */
+    void (*bdrv_invalidate_cache)(BlockDriverState *bs);
+
+    /*
      * Flushes all data that was already written to the OS all the way down to
      * the disk (for example raw-posix calls fsync()).
      */
diff --git a/cpus.c b/cpus.c
index 82530c4..ae5ec99 100644
--- a/cpus.c
+++ b/cpus.c
@@ -398,6 +398,7 @@  static void do_vm_stop(RunState state)
         vm_state_notify(0, state);
         qemu_aio_flush();
         bdrv_flush_all();
+        bdrv_invalidate_cache_all();
         monitor_protocol_event(QEVENT_STOP, NULL);
     }
 }
diff --git a/migration.c b/migration.c
index 6764d3a..8280d71 100644
--- a/migration.c
+++ b/migration.c
@@ -89,6 +89,9 @@  void process_incoming_migration(QEMUFile *f)
     qemu_announce_self();
     DPRINTF("successfully loaded vm state\n");
 
+    /* Make sure all file formats flush their mutable metadata */
+    bdrv_invalidate_cache_all();
+
     if (autostart) {
         vm_start();
     } else {