diff mbox series

[regression] Clock jump on VM migration

Message ID 2932080.UxbmD43V0u@neil
State New
Headers show
Series [regression] Clock jump on VM migration | expand

Commit Message

Neil Skrypuch Feb. 7, 2019, 10:33 p.m. UTC
We (ab)use migration + block mirroring to perform transparent zero downtime VM 
backups. Basically:

1) do a block mirror of the source VM's disk
2) migrate the source VM to a destination VM using the disk copy
3) cancel the block mirroring
4) resume the source VM
5) shut down the destination VM gracefully and move the disk to backup

Note that both source and destination VMs are running on the same host and 
same disk array.

Relatively recently, the source VM's clock started jumping after step #4. The 
specific amount of clock jump is generally around 1s, but sometimes as much as 
2-3s. I was able to bisect this down to the following QEMU change:

commit dd577a26ff03b6829721b1ffbbf9e7c411b72378
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Fri Apr 27 17:23:11 2018 +0100

    block/file-posix: implement bdrv_co_invalidate_cache() on Linux
    
    On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*.  Use
    this to drop page cache on the destination host during shared storage
    migration.  This way the destination host will read the latest copy of
    the data and will not use stale data from the page cache.
    
    The flow is as follows:
    
    1. Source host writes out all dirty pages and inactivates drives.
    2. QEMU_VM_EOF is sent on migration stream.
    3. Destination host invalidates caches before accessing drives.
    
    This patch enables live migration even with -drive cache.direct=off.
    
    * Terms and conditions may apply, please see patch for details.
    
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Message-id: 20180427162312.18583-2-stefanha@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

That patch went in for QEMU 3.0.0, and I can confirm this issue affects both 
3.1.0 and 3.0.0, but not 2.12.0. Most testing has been done on kernel 4.20.5, 
but I also confirmed the issue with 4.13.0.

Reproducing this issue is easy, it happens 100% of the time with a CentOS 7 
guest (other guests not tested), NTP will notice the clock jump quite soon 
after migration.

We are seeing this issue across our entire fleet of VMs, but the specific VM 
I've been testing with has a 20G disk and 1.5G of RAM.

To further debug this issue, I made the following changes:

     if (ret < 0) {                                                                                                                                                            
@@ -2581,6 +2582,8 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
         return; /* No host kernel page cache */                                                                                                                               
     }                                                                                                                                                                         
                                                                                                                                                                               
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("before: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                               
 #if defined(__linux__)                                                                                                                                                        
     /* This sets the scene for the next syscall... */                                                                                                                         
     ret = bdrv_co_flush(bs);                                                                                                                                                  
@@ -2610,6 +2613,8 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
      * configurations that should not cause errors.                                                                                                                           
      */                                                                                                                                                                       
 #endif /* !__linux__ */                                                                                                                                                       
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("after: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                                
 }                                                                                                                                                                             
                                                                                                                                                                               
 static coroutine_fn int

In two separate runs, they produced the following:

before: 1549567702.412048
after: 1549567703.295500
-> clock jump: 949ms

before: 1549576767.707454
after: 1549576768.584981
-> clock jump: 941ms

The clock jump numbers above are from NTP, but you can see that they are quite 
close to the amount of time spent in raw_co_invalidate_cache. So, it looks 
like flushing the cache is just taking a long time and stalling the guest, 
which causes the clock jump. This isn't too surprising as the entire disk 
image was just written as part of the block mirror and would likely still be 
in the cache.

I see the use case for this feature, but I don't think it applies here, as 
we're not technically using shared storage. I believe an option to toggle this 
behaviour on/off and/or some sort of heuristic to guess whether or not it 
should be enabled by default would be in order here.

- Neil

Comments

Stefan Hajnoczi Feb. 8, 2019, 6:24 a.m. UTC | #1
On Thu, Feb 07, 2019 at 05:33:25PM -0500, Neil Skrypuch wrote:

Thanks for your email!

Please post your QEMU command-line.

> The clock jump numbers above are from NTP, but you can see that they are quite 
> close to the amount of time spent in raw_co_invalidate_cache. So, it looks 
> like flushing the cache is just taking a long time and stalling the guest, 
> which causes the clock jump. This isn't too surprising as the entire disk 
> image was just written as part of the block mirror and would likely still be 
> in the cache.
> 
> I see the use case for this feature, but I don't think it applies here, as 
> we're not technically using shared storage. I believe an option to toggle this 
> behaviour on/off and/or some sort of heuristic to guess whether or not it 
> should be enabled by default would be in order here.

It would be good to figure out how to perform the flush without
affecting guest time at all.  The clock jump will also inconvenience
users who do need the flush, so I rather not work around the clock jump
for a subset of users only.

Stefan
Dr. David Alan Gilbert Feb. 8, 2019, 9:48 a.m. UTC | #2
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Thu, Feb 07, 2019 at 05:33:25PM -0500, Neil Skrypuch wrote:
> 
> Thanks for your email!
> 
> Please post your QEMU command-line.
> 
> > The clock jump numbers above are from NTP, but you can see that they are quite 
> > close to the amount of time spent in raw_co_invalidate_cache. So, it looks 
> > like flushing the cache is just taking a long time and stalling the guest, 
> > which causes the clock jump. This isn't too surprising as the entire disk 
> > image was just written as part of the block mirror and would likely still be 
> > in the cache.
> > 
> > I see the use case for this feature, but I don't think it applies here, as 
> > we're not technically using shared storage. I believe an option to toggle this 
> > behaviour on/off and/or some sort of heuristic to guess whether or not it 
> > should be enabled by default would be in order here.
> 
> It would be good to figure out how to perform the flush without
> affecting guest time at all.  The clock jump will also inconvenience
> users who do need the flush, so I rather not work around the clock jump
> for a subset of users only.

One thing that makes Neil's setup different is that having the source
and destination on the same host, that fadvise is bound to drop pages
that are actually in use by the source on the same host.

But I'm also curious at what point in the migration we call the
invalidate and so which threads get held up, in which state.

Neil: Another printf would also be interesting, between the
bdrv_co_flush and the posix_fadvise;  I'm assuming it's the
bdrv_co_flush that's taking the time but it would be good to check.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Neil Skrypuch Feb. 8, 2019, 10:52 p.m. UTC | #3
On Friday, February 8, 2019 4:48:19 AM EST Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Thu, Feb 07, 2019 at 05:33:25PM -0500, Neil Skrypuch wrote:
> > 
> > Thanks for your email!
> > 
> > Please post your QEMU command-line.

For the test VM, we're using the following:

/root/qemu-build/x86_64-softmmu/qemu-system-x86_64 -machine accel=kvm -vga 
cirrus -m 1536 -smp 2,cores=2,sockets=1,threads=1 -drive file=/var/lib/kvm/
testbackup-t1-n.polldev.com.img,format=raw,if=virtio -device virtio-net-
pci,mac=52:54:98:42:45:17,netdev=net1 -netdev type=tap,script=/etc/qemu-ifup-
br2,downscript=no,id=net1 -curses -monitor unix:/var/lib/kvm/monitor/
testbackup-t1-n,server,nowait -name "testbackup-t1-
n.polldev.com",process=testbackup-t1-n.polldev.com

Others are generally along the same lines.

> > > The clock jump numbers above are from NTP, but you can see that they are
> > > quite close to the amount of time spent in raw_co_invalidate_cache. So,
> > > it looks like flushing the cache is just taking a long time and
> > > stalling the guest, which causes the clock jump. This isn't too
> > > surprising as the entire disk image was just written as part of the
> > > block mirror and would likely still be in the cache.
> > > 
> > > I see the use case for this feature, but I don't think it applies here,
> > > as
> > > we're not technically using shared storage. I believe an option to
> > > toggle this behaviour on/off and/or some sort of heuristic to guess
> > > whether or not it should be enabled by default would be in order here.
> > 
> > It would be good to figure out how to perform the flush without
> > affecting guest time at all.  The clock jump will also inconvenience
> > users who do need the flush, so I rather not work around the clock jump
> > for a subset of users only.
> 
> One thing that makes Neil's setup different is that having the source
> and destination on the same host, that fadvise is bound to drop pages
> that are actually in use by the source on the same host.

Yes, flushing those pages in a setup like ours is bound to be expensive. Out 
of curiosity, I tried calling sync(1) after the drive mirror completes but 
before starting the migration. The sync itself was quite slow (~10s), but the 
clock jump was still observed and QEMU's own cache flush still took just as 
long.

I took one more step and both called sync(1) and dropped the system cache with 
echo 3 >/proc/sys/vm/drop_caches right before starting migration. In this 
case, qemu's cache flush was nearly instantaneous and no clock jump was 
observed. Now, dropping the whole system cache is not very nice, but it does 
suggest that if QEMU does the cache flush immediately before starting the 
migration as well as right after that the clock jump wouldn't occur, as long 
as there isn't too much cache dirtying during the migration.

> But I'm also curious at what point in the migration we call the
> invalidate and so which threads get held up, in which state.
> 
> Neil: Another printf would also be interesting, between the
> bdrv_co_flush and the posix_fadvise;  I'm assuming it's the
> bdrv_co_flush that's taking the time but it would be good to check.

I added two more printfs, like so:

diff --git a/block/file-posix.c b/block/file-posix.c                                                                                                                           
index 07bbdab953..ee64a8fc6e 100644                                                                                                                                            
--- a/block/file-posix.c                                                                                                                                                       
+++ b/block/file-posix.c                                                                                                                                                       
@@ -2570,6 +2570,7 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
 {                                                                                                                                                                             
     BDRVRawState *s = bs->opaque;                                                                                                                                             
     int ret;                                                                                                                                                                  
+    struct timeval t;                                                                                                                                                         
                                                                                                                                                                               
     ret = fd_open(bs);                                                                                                                                                        
     if (ret < 0) {                                                                                                                                                            
@@ -2581,9 +2582,13 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                  
         return; /* No host kernel page cache */                                                                                                                               
     }                                                                                                                                                                         
                                                                                                                                                                               
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("before: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                               
 #if defined(__linux__)                                                                                                                                                        
     /* This sets the scene for the next syscall... */                                                                                                                         
     ret = bdrv_co_flush(bs);                                                                                                                                                  
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("after bdrv_co_flush: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                  
     if (ret < 0) {                                                                                                                                                            
         error_setg_errno(errp, -ret, "flush failed");                                                                                                                         
         return;                                                                                                                                                               
@@ -2594,6 +2599,8 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
      * we don't use mmap, and the file should not be in use by other 
processes.                                                                                               
      */                                                                                                                                                                       
     ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED);                                                                                                                    
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("after posix_fadvise: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                  
     if (ret != 0) { /* the return value is a positive errno */                                                                                                                
         error_setg_errno(errp, ret, "fadvise failed");                                                                                                                        
         return;                                                                                                                                                               
@@ -2610,6 +2617,8 @@ static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
      * configurations that should not cause errors.                                                                                                                           
      */                                                                                                                                                                       
 #endif /* !__linux__ */                                                                                                                                                       
+    gettimeofday(&t, NULL);                                                                                                                                                   
+    printf("after: %d.%d\n", (int) t.tv_sec, (int) t.tv_usec);                                                                                                                
 }                                                                                                                                                                             
                                                                                                                                                                               
 static coroutine_fn int

Here's the output from a sample run:

before: 1549662160.360669
after bdrv_co_flush: 1549662160.360685
after posix_fadvise: 1549662161.244629
after: 1549662161.244648
-> clock jump: 1002ms

So, the time is more or less all spent in posix_fadvise.

- Neil
Stefan Hajnoczi Feb. 12, 2019, 2:56 a.m. UTC | #4
Hi Neil,
I'll return to this issue on Monday.

Stefan
Stefan Hajnoczi Feb. 26, 2019, 10:45 a.m. UTC | #5
On Tue, Feb 12, 2019 at 10:56:58AM +0800, Stefan Hajnoczi wrote:
> I'll return to this issue on Monday.

I have played around with posix_fadvise(POSIX_FADV_DONTNEED) and
measured 100+ millisecond latencies.  This is not a great primitive to
rely on during migration downtime since it can be slow.

On the other hand, we need it to ensure consistency when migrating to
another host with cache.direct=off.  In the cross-host migration case
the invalidation will be relatively fast because few pages will be
present.

An asynchronous flush and page cache invalidation on the source host at
the beginning of live migration would help your use case, but I don't
think the complexity of adding it is worthwhile (it doesn't help the
cross-host migration case).

Let's go back to the original idea of adding a parameter to disable the
flush and page cache invalidation.  I will send a patch.

Stefan
diff mbox series

Patch

diff --git a/block/file-posix.c b/block/file-posix.c                                                                                                                           
index 07bbdab953..4724b543df 100644                                                                                                                                            
--- a/block/file-posix.c                                                                                                                                                       
+++ b/block/file-posix.c                                                                                                                                                       
@@ -2570,6 +2570,7 @@  static void coroutine_fn 
raw_co_invalidate_cache(BlockDriverState *bs,                                                                                   
 {                                                                                                                                                                             
     BDRVRawState *s = bs->opaque;                                                                                                                                             
     int ret;                                                                                                                                                                  
+    struct timeval t;                                                                                                                                                         
                                                                                                                                                                               
     ret = fd_open(bs);