migration: flush the bdrv before stopping VM

Message ID 1426582438-9698-1-git-send-email-liang.z.li@intel.com
State New

Commit Message

Li, Liang Z March 17, 2015, 8:53 a.m. UTC
If the guest performs file write operations during live migration, the
VM downtime can be much longer than max_downtime. The cause is
bdrv_flush_all(), which is a time-consuming operation when a lot of
data has to be flushed to disk.

By calling bdrv_flush_all() before stopping the VM, we reduce the time
later consumed by bdrv_flush_all() inside vm_stop_force_state(), which
means the VM downtime can be reduced.

Testing shows this optimization can reduce the VM downtime from more
than 20 seconds to about 100 milliseconds.

Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 migration/migration.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Juan Quintela March 17, 2015, 12:12 p.m. UTC | #1
Liang Li <liang.z.li@intel.com> wrote:
> If there are file write operations in the guest when doing live
> migration, the VM downtime will be much longer than the max_downtime,
> this is caused by bdrv_flush_all(), this function is a time consuming
> operation if there a lot of data have to be flushed to disk.
>
> By adding bdrv_flush_all() before VM stop, we can reduce the time
> consumed by bdrv_flush_all() in vm_stop_force_state, this means the
> VM down time can be reduced.
>
> The test shows this optimization can help to reduce the VM downtime
> from more than 20 seconds to about 100 milliseconds.
>
> Signed-off-by: Liang Li <liang.z.li@intel.com>


This needs further review/changes on the block layer.

First, an explanation of why I think this doesn't fix the full problem.
With this patch we fix the case where the block layer is dirty but
basically nothing is dirtying guest memory: we move the 20 seconds of
block-layer flush out of max_downtime and spend them, instead, before we
decide that the amount of dirty memory is small enough to be transferred
during max_downtime.  But it is still going to take 20 seconds to flush
the block layer, and during those 20 seconds the amount of memory that
can get dirtied is HUGE.

I think our options are:

- Tell the block layer at the beginning of migration:
  "Hey, we are migrating, could you please start flushing data now, and
  don't let the caches grow too much, please, pretty please."
  (I leave the API design to the block layer.)
- Add at that point a new function:
  bdrv_flush_all_start()
  which starts writing out the cached data, and we "hope" that by the
  time we have migrated all memory it has also finished (so our last
  call to bdrv_flush_all() has less work to do).
- Add another function:
  int bdrv_flush_all_timeout(int timeout)
  which returns when either the flush completes or the timeout expires,
  and tells us which of the two happened, so we can go back to the
  iterative stage if it is taking too long (a rough sketch of this idea
  follows the list).
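
To make the third option concrete, here is a rough, untested sketch using
only POSIX primitives rather than QEMU's real block layer; both
bdrv_flush_all_timeout() and do_flush_all() are made up for illustration,
they are not existing APIs:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done_cond = PTHREAD_COND_INITIALIZER;
static bool flush_done;

static void do_flush_all(void)
{
    sync();                       /* placeholder for the real flush work */
}

static void *flush_worker(void *opaque)
{
    do_flush_all();
    pthread_mutex_lock(&flush_lock);
    flush_done = true;
    pthread_cond_signal(&flush_done_cond);
    pthread_mutex_unlock(&flush_lock);
    return NULL;
}

/* Returns 0 if the flush finished within timeout_ms, -ETIMEDOUT otherwise.
 * On timeout the worker keeps flushing in the background.  (One flush at a
 * time; concurrent callers are not handled in this sketch.) */
int bdrv_flush_all_timeout(int timeout_ms)
{
    pthread_t tid;
    struct timespec deadline;
    int ret = 0;

    flush_done = false;
    pthread_create(&tid, NULL, flush_worker, NULL);
    pthread_detach(tid);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&flush_lock);
    while (!flush_done) {
        if (pthread_cond_timedwait(&flush_done_cond, &flush_lock,
                                   &deadline) == ETIMEDOUT) {
            ret = flush_done ? 0 : -ETIMEDOUT;
            break;
        }
    }
    pthread_mutex_unlock(&flush_lock);
    return ret;
}

The migration thread could then return to the iterative stage when it sees
-ETIMEDOUT, while the flush keeps making progress in the background.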

Notice that *normally* bdrv_flush_all() is very fast; the problem is
that sometimes it gets really, really slow (NFS decided to go slow, TCP
dropped a packet, whatever).

Right now, we don't have an interface to detect those cases and go back
to the iterative stage.

So, I agree with the diagnosis that there is a problem here, but I
think that the solution is more complex than this.  You helped one
workload by making a different one worse, and I am not sure which of
the two compromises is better :-(

Does this make sense?

Later, Juan.


> ---
>  migration/migration.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c805f1..fc4735c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -655,6 +655,10 @@ static void *migration_thread(void *opaque)
>                  qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
>                  old_vm_running = runstate_is_running();
>  
> +                /* do flush here is aimed to shorten the VM downtime,
> +                 * bdrv_flush_all is a time consuming operation
> +                 * when the guest has done some file writing */
> +                bdrv_flush_all();
>                  ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>                  if (ret >= 0) {
>                      qemu_file_set_rate_limit(s->file, INT64_MAX);
Li, Liang Z March 18, 2015, 3:19 a.m. UTC | #2
> This needs further review/changes on the block layer.
> 
> First explanation, why I think this don't fix the full problem.
> Whith this patch, we fix the problem where we have a dirty block layer but
> basically nothing dirtying the memory on the guest (we are moving the 20
> seconds from max_downtime for the blocklayer flush), to 20 seconds until
> we have decided that the amount of dirty memory is small enough to be
> transferred during max_downtime.  But it is still going to take 20 seconds to
> flush the block layer, and during that 20 seconds, the amount of memory that
> can be dirty is HUGE.

It's true.

> I think our ouptions are:
> 
> - tell the block layer at the beggining of migration
>   Hey, we are migrating, could you please start flusing data now, and
>   don't get the caches to grow too much, please, pretty please.
>   (I left the API to the block layer)
> - Add on that point a new function:
>   bdrvr_flush_all_start()
>   That starts the sending of pages, and we "hope" that by the time that
>   we have migrated all memory, they have also finished (so our last
>   call to block_flush_all() have less work to do)
> - Add another function:
>   int bdrv_flush_all_timeout(int timeout)
>   that returns if timeout pass, telling if it has migrated all pages or
>   timeout has passed.  So we can got back to the iterative stage if it
>   has taken too long.
> 
> Notice that *normally* bdrv_flush_all() is very fast, the problem is that
> sometimes it get really, really slow (NFS decided to go slow, TCP drop a
> packet, whatever).
> 
> Right now, we don't have an interface to detect that cases and got back to
> the iterative stage.

How about going back to the iterative stage when we detect that
pending_size is larger than max_size, like this:

+                /* do flush here is aimed to shorten the VM downtime,
+                 * bdrv_flush_all is a time consuming operation
+                 * when the guest has done some file writing */
+                bdrv_flush_all();
+                pending_size = qemu_savevm_state_pending(s->file, max_size);
+                if (pending_size && pending_size >= max_size) {
+                    qemu_mutex_unlock_iothread();
+                    continue;
+                }
                  ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
                  if (ret >= 0) {
                      qemu_file_set_rate_limit(s->file, INT64_MAX);

and this is quite simple.


> So, I agree whit the diagnosis that there is a problem there, but I think that
> the solution is more complex that this.  You helped one load making a
> different other worse.  I am not sure which of the two compromises is
> better :-(
> 
> Makes this sense?
> 
> Later, Juan.
>
Kevin Wolf March 18, 2015, 11:17 a.m. UTC | #3
[ Cc: qemu-block ]

On 18.03.2015 at 04:19, Li, Liang Z wrote:
> > This needs further review/changes on the block layer.
> > 
> > First explanation, why I think this don't fix the full problem.
> > Whith this patch, we fix the problem where we have a dirty block layer but
> > basically nothing dirtying the memory on the guest (we are moving the 20
> > seconds from max_downtime for the blocklayer flush), to 20 seconds until
> > we have decided that the amount of dirty memory is small enough to be
> > transferred during max_downtime.  But it is still going to take 20 seconds to
> > flush the block layer, and during that 20 seconds, the amount of memory that
> > can be dirty is HUGE.
> 
> It's true.

What kind of cache is it actually that takes 20s to flush here?

Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use
a writeback cache for some metadata that might need to be written back,
but it is small and shouldn't take any significant time.

Then there is the kernel page cache or, for some network protocols,
caches in the respective client libraries. That sounds like the right
size for a 20s stall, but don't we require cache.direct=on for live
migration anyway for coherency, i.e. bypassing any such cache?

Lastly there may be a disk cache, but that is too small as well.

> > I think our ouptions are:
> > 
> > - tell the block layer at the beggining of migration
> >   Hey, we are migrating, could you please start flusing data now, and
> >   don't get the caches to grow too much, please, pretty please.
> >   (I left the API to the block layer)
> > - Add on that point a new function:
> >   bdrvr_flush_all_start()
> >   That starts the sending of pages, and we "hope" that by the time that
> >   we have migrated all memory, they have also finished (so our last
> >   call to block_flush_all() have less work to do)
> > - Add another function:
> >   int bdrv_flush_all_timeout(int timeout)
> >   that returns if timeout pass, telling if it has migrated all pages or
> >   timeout has passed.  So we can got back to the iterative stage if it
> >   has taken too long.

The problem is that the block layer really doesn't have an option to
control what is getting synced once the data is cached outside of qemu.
Essentially we can do an fdatasync() or we can leave it, that's the only
choice we have.

Now doing an asynchronous fdatasync() in the background is completely
reasonable in order to reduce the amount of data to be flushed later.
But the patch is doing it while holding both the BQL and the AIOContext
lock of the device, which doesn't feel right. Maybe it should schedule a
BH in the AIOContext instead and flush from there asynchronously.


The other thing is that flushing once doesn't mean that new data isn't
accumulating in the cache, especially if you decide to do the background
flush at the start of the migration.

The obvious way to avoid that would be to switch to a writethrough mode,
so any write request goes directly to the disk. This will, however,
impact performance so heavily that it's probably not a realistic option.

An alternative approach could be to repeat the background flush
periodically, either time based or after every x bytes that are written
to a device. Time based should actually be quite easy to implement.
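
For the time-based variant, something of roughly this shape would do. This
is only a sketch with plain POSIX primitives and made-up names, not QEMU's
actual API; the real thing would hang off a QEMU timer or a coroutine in
the device's AioContext rather than a raw thread, and fd / interval_ms are
placeholders:

/* Illustration only: a time-based periodic background flush. */
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

struct periodic_flush {
    int fd;                  /* image file descriptor to keep flushing */
    unsigned interval_ms;    /* how often to flush */
    atomic_bool stop;        /* set to true to stop the loop */
};

static void *periodic_flush_thread(void *opaque)
{
    struct periodic_flush *pf = opaque;
    struct timespec ts = {
        .tv_sec = pf->interval_ms / 1000,
        .tv_nsec = (long)(pf->interval_ms % 1000) * 1000000L,
    };

    while (!atomic_load(&pf->stop)) {
        fdatasync(pf->fd);       /* write back what has accumulated so far */
        nanosleep(&ts, NULL);    /* then wait before the next round */
    }
    fdatasync(pf->fd);           /* one final flush when stopping */
    return NULL;
}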


Once we have the flushing in the background, the migration code can
apply any timeouts it wants. If the timeout is exceeded, the flush
wouldn't be aborted but keep running in the background, but migration
can go back to the iterative state anyway.

> > Notice that *normally* bdrv_flush_all() is very fast, the problem is that
> > sometimes it get really, really slow (NFS decided to go slow, TCP drop a
> > packet, whatever).
> > 
> > Right now, we don't have an interface to detect that cases and got back to
> > the iterative stage.
> 
> How about go back to the iterative stage when detect that the pending_size is larger
> Than max_size, like this:
> 
> +                /* do flush here is aimed to shorten the VM downtime,
> +                 * bdrv_flush_all is a time consuming operation
> +                 * when the guest has done some file writing */
> +                bdrv_flush_all();
> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
> +                if (pending_size && pending_size >= max_size) {
> +                    qemu_mutex_unlock_iothread();
> +                    continue;
> +                }
>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>                   if (ret >= 0) {
>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
> 
> and this is quite simple.

Yes, but it is too simple. If you hold all the locks during
bdrv_flush_all(), your VM will effectively stop as soon as it performs
the next I/O access, so you don't win much. And you still don't have a
timeout for cases where the flush takes really long.

Kevin
Juan Quintela March 18, 2015, 12:36 p.m. UTC | #4
Kevin Wolf <kwolf@redhat.com> wrote:
> [ Cc: qemu-block ]
>
> Am 18.03.2015 um 04:19 hat Li, Liang Z geschrieben:
>> > This needs further review/changes on the block layer.
>> > 
>> > First explanation, why I think this don't fix the full problem.
>> > Whith this patch, we fix the problem where we have a dirty block layer but
>> > basically nothing dirtying the memory on the guest (we are moving the 20
>> > seconds from max_downtime for the blocklayer flush), to 20 seconds until
>> > we have decided that the amount of dirty memory is small enough to be
>> > transferred during max_downtime.  But it is still going to take 20 seconds to
>> > flush the block layer, and during that 20 seconds, the amount of memory that
>> > can be dirty is HUGE.
>> 
>> It's true.
>
> What kind of cache is it actually that takes 20s to flush here?

That is a very good question.  When I measured this (long, long ago),
testing with the same workload, bdrv_flush_all() could take from
100-300ms (what I expected, and something I can live with) to several
seconds, which I can't live with.

Basically (this was around RHEL6 times, whatever upstream was current at
the time), my notes from back then point to:

aio.c: qemu_aio_wait()
        .....
        ret = select(max_fd, &rdfds, &wrfds, NULL, NULL);

Notice that we are doing a select() without a timeout in the iothread,
which is bad.

I know that the code has changed a lot in that area; the select() doesn't
exist anymore.  But this exemplifies the problem: something asks the
block layer to do an operation, and it blocks until it finishes, even if
this time it is taking much longer than usual for whatever reason.  What
I would really like is for bdrv_flush_all() to take a timeout parameter
and return an error in case it is taking too long.

> Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use
> a writeback cache for some metadata that might need to be written back,
> but it is small and shouldn't take any significant time.
>
> Then we have the kernel page cache, or for some network protocols caches
> in the respective libs. This sounds like the right size for a 20s stall,
> but don't we require cache.direct=on for live migration anyway for
> coherency, i.e. bypassing any such cache?
>
> Lastly there may be a disk cache, but it's too small either.

>> > I think our ouptions are:
>> > 
>> > - tell the block layer at the beggining of migration
>> >   Hey, we are migrating, could you please start flusing data now, and
>> >   don't get the caches to grow too much, please, pretty please.
>> >   (I left the API to the block layer)
>> > - Add on that point a new function:
>> >   bdrvr_flush_all_start()
>> >   That starts the sending of pages, and we "hope" that by the time that
>> >   we have migrated all memory, they have also finished (so our last
>> >   call to block_flush_all() have less work to do)
>> > - Add another function:
>> >   int bdrv_flush_all_timeout(int timeout)
>> >   that returns if timeout pass, telling if it has migrated all pages or
>> >   timeout has passed.  So we can got back to the iterative stage if it
>> >   has taken too long.
>
> The problem is that the block layer really doesn't have an option to
> control what is getting synced once the data is cached outside of qemu.
> Essentially we can do an fdatasync() or we can leave it, that's the only
> choice we have.

See my explanation: if all that qemu is doing is an fdatasync(), just
spawn a thread that does the fdatasync(), and if the thread doesn't
finish within <timeout>, return an error.  You can implement that
behaviour however you want.


> Now doing an asynchronous fdatasync() in the background is completely
> reasonable in order to reduce the amount of data to be flushed later.
> But the patch is doing it while holding both the BQL and the AIOContext
> lock of the device, which doesn't feel right. Maybe it should schedule a
> BH in the AIOContext instead and flush from there asynchronously.

The position is wrong, definitely.  We want to start the asynchronous
fdatasync() at the start of migration, or every X milliseconds.  At this
point, we *think* that we can finish the migration within max_downtime
(basically we are ignoring the time it is going to take to migrate the
device state and do the block-layer flush, but normally that takes on
the order of 100-200ms, so it doesn't matter at all).

> The other thing is that flushing once doesn't mean that new data isn't
> accumulating in the cache, especially if you decide to do the background
> flush at the start of the migration.

But "pontentially", we would arrive to this point with less "cached"
data everything on the system.

> The obvious way to avoid that would be to switch to a writethrough mode,
> so any write request goes directly to the disk. This will, however,
> impact performance so heavily that it's probably not a realistic option.
>
> An alternative approach could be to repeat the background flush
> periodically, either time based or after every x bytes that are written
> to a device. Time based should actually be quite easy to implement.

We can do it periodically on the migration thread, if the call is
thread-safe.  We already have a loop there and a "kind of" time
control; what we are missing is an interface.

> Once we have the flushing in the background, the migration code can
> apply any timeouts it wants. If the timeout is exceeded, the flush
> wouldn't be aborted but keep running in the background, but migration
> can go back to the iterative state anyway.

Yeap, that is what we really want/need.

>> > Notice that *normally* bdrv_flush_all() is very fast, the problem is that
>> > sometimes it get really, really slow (NFS decided to go slow, TCP drop a
>> > packet, whatever).
>> > 
>> > Right now, we don't have an interface to detect that cases and got back to
>> > the iterative stage.
>> 
>> How about go back to the iterative stage when detect that the pending_size is larger
>> Than max_size, like this:
>> 
>> +                /* do flush here is aimed to shorten the VM downtime,
>> +                 * bdrv_flush_all is a time consuming operation
>> +                 * when the guest has done some file writing */
>> +                bdrv_flush_all();
>> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
>> +                if (pending_size && pending_size >= max_size) {
>> +                    qemu_mutex_unlock_iothread();
>> +                    continue;
>> +                }
>>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>>                   if (ret >= 0) {
>>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
>> 
>> and this is quite simple.
>
> Yes, but it is too simple. If you hold all the locks during
> bdrv_flush_all(), your VM will effectively stop as soon as it performs
> the next I/O access, so you don't win much. And you still don't have a
> timeout for cases where the flush takes really long.

This is probably better than what we have now (basically we would be
"measuring" after bdrv_flush_all() how much the amount of dirty memory
has changed, and returning to the iterative stage if it changed too
much).  A timeout would be better anyway, and an interface to start the
synchronization earlier, asynchronously, would also be good.

Notice that my understanding is that any proper fix for this is 2.4 material.

Thanks, Juan.
Paolo Bonzini March 18, 2015, 12:59 p.m. UTC | #5
On 18/03/2015 13:36, Juan Quintela wrote:
> I know that the code has changed a lot on that area, the select() don't
> exist anymore.

It is still there in aio_poll():

    ret = qemu_poll_ns((GPollFD *)ctx->pollfds->data,
                         ctx->pollfds->len,
                         blocking ? aio_compute_timeout(ctx) : 0);

where aio_compute_timeout will usually return -1.

Paolo
Li, Liang Z March 18, 2015, 1:39 p.m. UTC | #6
> > > First explanation, why I think this don't fix the full problem.
> > > Whith this patch, we fix the problem where we have a dirty block
> > > layer but basically nothing dirtying the memory on the guest (we are
> > > moving the 20 seconds from max_downtime for the blocklayer flush),
> > > to 20 seconds until we have decided that the amount of dirty memory
> > > is small enough to be transferred during max_downtime.  But it is
> > > still going to take 20 seconds to flush the block layer, and during
> > > that 20 seconds, the amount of memory that can be dirty is HUGE.
> >
> > It's true.
> 
> What kind of cache is it actually that takes 20s to flush here?
> 

I ran a script in the guest which does a dd operation, like this:

#!/bin/bash
for i in {1..1000000}
do
	time dd if=/dev/zero of=/time.bdf bs=4k count=200000
	rm /time.bdf
done

It's an extreme  case.
Kevin Wolf March 18, 2015, 1:42 p.m. UTC | #7
On 18.03.2015 at 13:36, Juan Quintela wrote:
> Kevin Wolf <kwolf@redhat.com> wrote:
> > The problem is that the block layer really doesn't have an option to
> > control what is getting synced once the data is cached outside of qemu.
> > Essentially we can do an fdatasync() or we can leave it, that's the only
> > choice we have.
> 
> See my explanation, if all that qemu is doing is an fdatasync(), just
> spawn a thread that do the fdatasync() and if the thread don't finish in
> <timeout> time, just return one error.  You can implement that behaviour
> whatever you want.

Yes, and if you use bdrv_co_flush(), the fdatasync() already runs in one
of the worker threads that raw-posix uses, so there is no new code to
write for that.

> > Now doing an asynchronous fdatasync() in the background is completely
> > reasonable in order to reduce the amount of data to be flushed later.
> > But the patch is doing it while holding both the BQL and the AIOContext
> > lock of the device, which doesn't feel right. Maybe it should schedule a
> > BH in the AIOContext instead and flush from there asynchronously.
> 
> Position is wrong, definitelly.  We want to start the asynchronous
> fdatasync() at the start of migration, or each X milliseconds.  At this
> point, we *think* that we can finish the migration on max_downtime
> (basically we are ignoring the time that is going to take to migrate
> device state and do the block layer flush, but normally this takes in
> the other of 100-200ms, so it don't matter at all).
> 
> > The other thing is that flushing once doesn't mean that new data isn't
> > accumulating in the cache, especially if you decide to do the background
> > flush at the start of the migration.
> 
> But "pontentially", we would arrive to this point with less "cached"
> data everything on the system.
> 
> > The obvious way to avoid that would be to switch to a writethrough mode,
> > so any write request goes directly to the disk. This will, however,
> > impact performance so heavily that it's probably not a realistic option.
> >
> > An alternative approach could be to repeat the background flush
> > periodically, either time based or after every x bytes that are written
> > to a device. Time based should actually be quite easy to implement.
> 
> We can do it periodically on the migration thread, if the call is
> thread_safe.  We already have a loop there, and "kind" of time control,
> what we miss is an interface.

Don't do it from the migration thread. You'd have to take locks and stop
the guest from doing I/O while the flush is running.

Instead, create a coroutine and use a BH to enter it from the main loop
of the AIOContext of the block device (i.e. the qemu main loop thread in
most cases, or the dataplane thread if it exists). Then your flush will
run in the background without blocking anyone else.

Time control in coroutines is as easy as calling co_aio_sleep_ns().
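
As a rough, untested sketch of that shape: the background_flush_* names
below are invented, and while the helpers used (aio_bh_new,
qemu_bh_schedule, qemu_coroutine_create/enter, bdrv_co_flush,
co_aio_sleep_ns) are the ones available around this release, the header
paths and exact signatures should be treated as approximate:

#include "block/block.h"
#include "block/coroutine.h"
#include "qemu/timer.h"
#include <stdbool.h>

typedef struct BackgroundFlush {
    BlockDriverState *bs;
    QEMUBH *bh;
    bool stop;           /* set from the migration code to stop flushing */
} BackgroundFlush;

static void coroutine_fn background_flush_co(void *opaque)
{
    BackgroundFlush *bf = opaque;
    AioContext *ctx = bdrv_get_aio_context(bf->bs);

    while (!bf->stop) {
        bdrv_co_flush(bf->bs);   /* yields; doesn't block the event loop */
        co_aio_sleep_ns(ctx, QEMU_CLOCK_REALTIME, 100 * 1000000LL); /* 100 ms */
    }
}

static void background_flush_bh(void *opaque)
{
    BackgroundFlush *bf = opaque;
    Coroutine *co = qemu_coroutine_create(background_flush_co);

    qemu_bh_delete(bf->bh);
    bf->bh = NULL;
    qemu_coroutine_enter(co, bf);
}

/* Called from the migration code: kick off the background flush in the
 * AioContext that owns the device (main loop or dataplane thread). */
static void background_flush_start(BackgroundFlush *bf)
{
    bf->bh = aio_bh_new(bdrv_get_aio_context(bf->bs), background_flush_bh, bf);
    qemu_bh_schedule(bf->bh);
}

The BH indirection is what lets the migration thread trigger this without
taking the AioContext lock itself; the flush then runs entirely in the
context that owns the device.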

> > Once we have the flushing in the background, the migration code can
> > apply any timeouts it wants. If the timeout is exceeded, the flush
> > wouldn't be aborted but keep running in the background, but migration
> > can go back to the iterative state anyway.
> 
> Yeap, that is what we really want/need.

Cool. I don't think it's very hard to get there.

> Notice that my understanding is that any proper fix for this is 2.4 material.

Yes, I agree.

Kevin
Dr. David Alan Gilbert March 18, 2015, 4:55 p.m. UTC | #8
* Li, Liang Z (liang.z.li@intel.com) wrote:
> > > > First explanation, why I think this don't fix the full problem.
> > > > Whith this patch, we fix the problem where we have a dirty block
> > > > layer but basically nothing dirtying the memory on the guest (we are
> > > > moving the 20 seconds from max_downtime for the blocklayer flush),
> > > > to 20 seconds until we have decided that the amount of dirty memory
> > > > is small enough to be transferred during max_downtime.  But it is
> > > > still going to take 20 seconds to flush the block layer, and during
> > > > that 20 seconds, the amount of memory that can be dirty is HUGE.
> > >
> > > It's true.
> > 
> > What kind of cache is it actually that takes 20s to flush here?
> > 
> 
> I run a script in the guest which do a dd operation,  like this:
> 
> #!/bin/sh
> for i in {1..1000000}
> do
> 	time dd if=/dev/zero of=/time.bdf bs=4k count=200000
> 	rm /time.bdf
> done
> 
> It's an extreme  case.

With what qemu options for the device, and what was your device
backed by?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Li, Liang Z March 19, 2015, 2:06 p.m. UTC | #9
> * Li, Liang Z (liang.z.li@intel.com) wrote:
> > > > > First explanation, why I think this don't fix the full problem.
> > > > > Whith this patch, we fix the problem where we have a dirty block
> > > > > layer but basically nothing dirtying the memory on the guest (we
> > > > > are moving the 20 seconds from max_downtime for the blocklayer
> > > > > flush), to 20 seconds until we have decided that the amount of
> > > > > dirty memory is small enough to be transferred during
> > > > > max_downtime.  But it is still going to take 20 seconds to flush
> > > > > the block layer, and during that 20 seconds, the amount of memory
> that can be dirty is HUGE.
> > > >
> > > > It's true.
> > >
> > > What kind of cache is it actually that takes 20s to flush here?
> > >
> >
> > I run a script in the guest which do a dd operation,  like this:
> >
> > #!/bin/sh
> > for i in {1..1000000}
> > do
> > 	time dd if=/dev/zero of=/time.bdf bs=4k count=200000
> > 	rm /time.bdf
> > done
> >
> > It's an extreme  case.
> 
> With what qemu options for the device, and what was your device backed by?

Very simple:
./qemu-system-x86_64 -enable-kvm -smp 4 -m 4096  -net none rhel6u5.img -monitor stdio

And it's a local migration.  I will do the test between two physical machines later.


Liang
Dr. David Alan Gilbert March 19, 2015, 2:40 p.m. UTC | #10
* Li, Liang Z (liang.z.li@intel.com) wrote:
> > * Li, Liang Z (liang.z.li@intel.com) wrote:
> > > > > > First explanation, why I think this don't fix the full problem.
> > > > > > Whith this patch, we fix the problem where we have a dirty block
> > > > > > layer but basically nothing dirtying the memory on the guest (we
> > > > > > are moving the 20 seconds from max_downtime for the blocklayer
> > > > > > flush), to 20 seconds until we have decided that the amount of
> > > > > > dirty memory is small enough to be transferred during
> > > > > > max_downtime.  But it is still going to take 20 seconds to flush
> > > > > > the block layer, and during that 20 seconds, the amount of memory
> > that can be dirty is HUGE.
> > > > >
> > > > > It's true.
> > > >
> > > > What kind of cache is it actually that takes 20s to flush here?
> > > >
> > >
> > > I run a script in the guest which do a dd operation,  like this:
> > >
> > > #!/bin/sh
> > > for i in {1..1000000}
> > > do
> > > 	time dd if=/dev/zero of=/time.bdf bs=4k count=200000
> > > 	rm /time.bdf
> > > done
> > >
> > > It's an extreme  case.
> > 
> > With what qemu options for the device, and what was your device backed by?
> 
> Very simple:
> ./qemu-system-x86_64 -enable-kvm -smp 4 -m 4096  -net none rhel6u5.img -monitor stdio
> 
> And it's a local migration.  I will do the test between two physical machines later.

OK, but for shared storage you would have to add cache=none (or something like that),
so that would change the behaviour anyway.

Dave
> 
> 
> Liang
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Li, Liang Z March 20, 2015, 7:22 a.m. UTC | #11
> >> > Right now, we don't have an interface to detect that cases and got
> >> > back to the iterative stage.
> >>
> >> How about go back to the iterative stage when detect that the
> >> pending_size is larger Than max_size, like this:
> >>
> >> +                /* do flush here is aimed to shorten the VM downtime,
> >> +                 * bdrv_flush_all is a time consuming operation
> >> +                 * when the guest has done some file writing */
> >> +                bdrv_flush_all();
> >> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
> >> +                if (pending_size && pending_size >= max_size) {
> >> +                    qemu_mutex_unlock_iothread();
> >> +                    continue;
> >> +                }
> >>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> >>                   if (ret >= 0) {
> >>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
> >>
> >> and this is quite simple.
> >
> > Yes, but it is too simple. If you hold all the locks during
> > bdrv_flush_all(), your VM will effectively stop as soon as it performs
> > the next I/O access, so you don't win much. And you still don't have a
> > timeout for cases where the flush takes really long.
> 
> This is probably better than what we had now (basically we are "meassuring"
> after bdrv_flush_all how much the amount of dirty memory has changed,
> and return to iterative stage if it took too much.  A timeout would be better
> anyways.  And an interface te start the synchronization sooner
> asynchronously would be also good.
> 
> Notice that my understanding is that any proper fix for this is 2.4 material.

Then how should we deal with this issue in 2.3: leave it as it is, or make an incomplete fix like the one above?

Liang

> Thanks, Juan.
Juan Quintela March 25, 2015, 10:50 a.m. UTC | #12
"Li, Liang Z" <liang.z.li@intel.com> wrote:
>> >> > Right now, we don't have an interface to detect that cases and got
>> >> > back to the iterative stage.
>> >>
>> >> How about go back to the iterative stage when detect that the
>> >> pending_size is larger Than max_size, like this:
>> >>
>> >> +                /* do flush here is aimed to shorten the VM downtime,
>> >> +                 * bdrv_flush_all is a time consuming operation
>> >> +                 * when the guest has done some file writing */
>> >> +                bdrv_flush_all();
>> >> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
>> >> +                if (pending_size && pending_size >= max_size) {
>> >> +                    qemu_mutex_unlock_iothread();
>> >> +                    continue;
>> >> +                }
>> >>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>> >>                   if (ret >= 0) {
>> >>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
>> >>
>> >> and this is quite simple.
>> >
>> > Yes, but it is too simple. If you hold all the locks during
>> > bdrv_flush_all(), your VM will effectively stop as soon as it performs
>> > the next I/O access, so you don't win much. And you still don't have a
>> > timeout for cases where the flush takes really long.
>> 
>> This is probably better than what we had now (basically we are "meassuring"
>> after bdrv_flush_all how much the amount of dirty memory has changed,
>> and return to iterative stage if it took too much.  A timeout would be better
>> anyways.  And an interface te start the synchronization sooner
>> asynchronously would be also good.
>> 
>> Notice that my understanding is that any proper fix for this is 2.4 material.
>
> Then, how to deal with this issue in 2.3, leave it here? or make an
> incomplete fix like I do above?

I think it is better to leave it as it is for 2.3. With a patch like
this one, we improve one workload and make a different one worse (it
depends a lot on the ratio of memory dirtying vs. disk writes).  I have
no data on which workload is more common, so I prefer to be
conservative this late in the cycle.  What do you think?

Later, Juan.


>
> Liang
>
>> Thanks, Juan.
Kevin Wolf March 25, 2015, 10:53 a.m. UTC | #13
On 25.03.2015 at 11:50, Juan Quintela wrote:
> "Li, Liang Z" <liang.z.li@intel.com> wrote:
> >> >> > Right now, we don't have an interface to detect that cases and got
> >> >> > back to the iterative stage.
> >> >>
> >> >> How about go back to the iterative stage when detect that the
> >> >> pending_size is larger Than max_size, like this:
> >> >>
> >> >> +                /* do flush here is aimed to shorten the VM downtime,
> >> >> +                 * bdrv_flush_all is a time consuming operation
> >> >> +                 * when the guest has done some file writing */
> >> >> +                bdrv_flush_all();
> >> >> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
> >> >> +                if (pending_size && pending_size >= max_size) {
> >> >> +                    qemu_mutex_unlock_iothread();
> >> >> +                    continue;
> >> >> +                }
> >> >>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> >> >>                   if (ret >= 0) {
> >> >>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
> >> >>
> >> >> and this is quite simple.
> >> >
> >> > Yes, but it is too simple. If you hold all the locks during
> >> > bdrv_flush_all(), your VM will effectively stop as soon as it performs
> >> > the next I/O access, so you don't win much. And you still don't have a
> >> > timeout for cases where the flush takes really long.
> >> 
> >> This is probably better than what we had now (basically we are "meassuring"
> >> after bdrv_flush_all how much the amount of dirty memory has changed,
> >> and return to iterative stage if it took too much.  A timeout would be better
> >> anyways.  And an interface te start the synchronization sooner
> >> asynchronously would be also good.
> >> 
> >> Notice that my understanding is that any proper fix for this is 2.4 material.
> >
> > Then, how to deal with this issue in 2.3, leave it here? or make an
> > incomplete fix like I do above?
> 
> I think it is better to leave it here for 2.3. With a patch like this
> one, we improve in one load and we got worse in a different load (depens
> a lot in the ratio of dirtying memory vs disk).  I have no data which
> load is more common, so I prefer to be conservative so late in the
> cycle.  What do you think?

I agree, it's too late in the release cycle for such a change.

Kevin
Li, Liang Z March 26, 2015, 1:13 a.m. UTC | #14
> > > Then, how to deal with this issue in 2.3, leave it here? or make an
> > > incomplete fix like I do above?
> >
> > I think it is better to leave it here for 2.3. With a patch like this
> > one, we improve in one load and we got worse in a different load
> > (depens a lot in the ratio of dirtying memory vs disk).  I have no
> > data which load is more common, so I prefer to be conservative so late
> > in the cycle.  What do you think?
> 
> I agree, it's too late in the release cycle for such a change.
> 
> Kevin

That's OK.
Li, Liang Z June 24, 2015, 11:08 a.m. UTC | #15
> > >> >> > Right now, we don't have an interface to detect that cases and
> > >> >> > got back to the iterative stage.
> > >> >>
> > >> >> How about go back to the iterative stage when detect that the
> > >> >> pending_size is larger Than max_size, like this:
> > >> >>
> > >> >> +                /* do flush here is aimed to shorten the VM downtime,
> > >> >> +                 * bdrv_flush_all is a time consuming operation
> > >> >> +                 * when the guest has done some file writing */
> > >> >> +                bdrv_flush_all();
> > >> >> +                pending_size = qemu_savevm_state_pending(s->file,
> max_size);
> > >> >> +                if (pending_size && pending_size >= max_size) {
> > >> >> +                    qemu_mutex_unlock_iothread();
> > >> >> +                    continue;
> > >> >> +                }
> > >> >>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> > >> >>                   if (ret >= 0) {
> > >> >>                       qemu_file_set_rate_limit(s->file,
> > >> >> INT64_MAX);
> > >> >>
> > >> >> and this is quite simple.
> > >> >
> > >> > Yes, but it is too simple. If you hold all the locks during
> > >> > bdrv_flush_all(), your VM will effectively stop as soon as it
> > >> > performs the next I/O access, so you don't win much. And you
> > >> > still don't have a timeout for cases where the flush takes really long.
> > >>
> > >> This is probably better than what we had now (basically we are
> "meassuring"
> > >> after bdrv_flush_all how much the amount of dirty memory has
> > >> changed, and return to iterative stage if it took too much.  A
> > >> timeout would be better anyways.  And an interface te start the
> > >> synchronization sooner asynchronously would be also good.
> > >>
> > >> Notice that my understanding is that any proper fix for this is 2.4
> material.
> > >
> > > Then, how to deal with this issue in 2.3, leave it here? or make an
> > > incomplete fix like I do above?
> >
> > I think it is better to leave it here for 2.3. With a patch like this
> > one, we improve in one load and we got worse in a different load
> > (depens a lot in the ratio of dirtying memory vs disk).  I have no
> > data which load is more common, so I prefer to be conservative so late
> > in the cycle.  What do you think?
> 
> I agree, it's too late in the release cycle for such a change.
> 
> Kevin

Hi Juan & Kevin,

I have not found any follow-up patches that fix the issue which leads to the long VM downtime.  How is it going?

Liang
Stefan Hajnoczi June 25, 2015, 12:34 p.m. UTC | #16
On Wed, Jun 24, 2015 at 11:08:43AM +0000, Li, Liang Z wrote:
> > > >> >> > Right now, we don't have an interface to detect that cases and
> > > >> >> > got back to the iterative stage.
> > > >> >>
> > > >> >> How about go back to the iterative stage when detect that the
> > > >> >> pending_size is larger Than max_size, like this:
> > > >> >>
> > > >> >> +                /* do flush here is aimed to shorten the VM downtime,
> > > >> >> +                 * bdrv_flush_all is a time consuming operation
> > > >> >> +                 * when the guest has done some file writing */
> > > >> >> +                bdrv_flush_all();
> > > >> >> +                pending_size = qemu_savevm_state_pending(s->file,
> > max_size);
> > > >> >> +                if (pending_size && pending_size >= max_size) {
> > > >> >> +                    qemu_mutex_unlock_iothread();
> > > >> >> +                    continue;
> > > >> >> +                }
> > > >> >>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> > > >> >>                   if (ret >= 0) {
> > > >> >>                       qemu_file_set_rate_limit(s->file,
> > > >> >> INT64_MAX);
> > > >> >>
> > > >> >> and this is quite simple.
> > > >> >
> > > >> > Yes, but it is too simple. If you hold all the locks during
> > > >> > bdrv_flush_all(), your VM will effectively stop as soon as it
> > > >> > performs the next I/O access, so you don't win much. And you
> > > >> > still don't have a timeout for cases where the flush takes really long.
> > > >>
> > > >> This is probably better than what we had now (basically we are
> > "meassuring"
> > > >> after bdrv_flush_all how much the amount of dirty memory has
> > > >> changed, and return to iterative stage if it took too much.  A
> > > >> timeout would be better anyways.  And an interface te start the
> > > >> synchronization sooner asynchronously would be also good.
> > > >>
> > > >> Notice that my understanding is that any proper fix for this is 2.4
> > material.
> > > >
> > > > Then, how to deal with this issue in 2.3, leave it here? or make an
> > > > incomplete fix like I do above?
> > >
> > > I think it is better to leave it here for 2.3. With a patch like this
> > > one, we improve in one load and we got worse in a different load
> > > (depens a lot in the ratio of dirtying memory vs disk).  I have no
> > > data which load is more common, so I prefer to be conservative so late
> > > in the cycle.  What do you think?
> > 
> > I agree, it's too late in the release cycle for such a change.
> > 
> > Kevin
> 
> Hi Juan & Kevin,
> 
> I have not found the related patches to fix the issue which lead to long VM downtime,  how is it going?

Kevin is on vacation and QEMU is currently in 2.4 soft freeze.  Unless
patches have been posted/merged that I'm not aware of, it is unlikely
that anything will happen before QEMU 2.4 is released.

Stefan

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 2c805f1..fc4735c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -655,6 +655,10 @@  static void *migration_thread(void *opaque)
                 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
                 old_vm_running = runstate_is_running();
 
+                /* do flush here is aimed to shorten the VM downtime,
+                 * bdrv_flush_all is a time consuming operation
+                 * when the guest has done some file writing */
+                bdrv_flush_all();
                 ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
                 if (ret >= 0) {
                     qemu_file_set_rate_limit(s->file, INT64_MAX);