Patchwork: fix migration with large mem

Submitter Izik Eidus
Date April 13, 2010, 9:33 a.m.
Message ID <20100413123318.0f2cd334@redhat.com>
Permalink /patch/50051/
State New

Comments

Izik Eidus - April 13, 2010, 9:33 a.m.
From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00 2001
From: Izik Eidus <ieidus@redhat.com>
Date: Tue, 13 Apr 2010 12:24:57 +0300
Subject: [PATCH] fix migration with large mem

For guests with large memory that have many pages whose
bytes all hold the same value, we spend a lot of time
reading the memory from the guest (is_dup_page()).

This happens because ram_save_live() has a limit on how
much we can send to the destination but not on how much
we read from the guest, so when we get many is_dup_page()
hits we may read a huge amount of data without updating
important things like the timers...

The guest loses its responsiveness and gets many soft
lockups inside itself.

This patch adds a limit on the amount we can read from the
guest in each iteration.

    Thanks.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 arch_init.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)
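
A minimal, self-contained illustration of the failure mode the commit message
describes (a sketch only, not QEMU code: the page count, sizes, and helper
names are made up for the example; the real loop lives in ram_save_live() and
ram_save_block() in arch_init.c). When almost every page deduplicates, each
page costs a full scan but only a few bytes on the wire, so a loop bounded
only by a send-side budget keeps scanning until it runs out of RAM.

/*
 * Illustration only (not QEMU code).
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096
#define NUM_PAGES   (64 * 1024)             /* 256 MiB of "guest RAM"  */
#define SEND_LIMIT  (32 * 1024 * 1024)      /* per-pass send budget    */

/* Stand-in for is_dup_page(): true if every byte in the page is equal. */
static int page_is_dup(const uint8_t *page)
{
    size_t i;

    for (i = 1; i < PAGE_SIZE; i++) {
        if (page[i] != page[0]) {
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    static uint8_t ram[NUM_PAGES][PAGE_SIZE];   /* zero-filled: all dup */
    uint64_t sent = 0, scanned = 0;
    size_t page = 0;

    /*
     * Like the original loop, the only limits are the send budget and
     * running out of pages.  The scanning cost is unbounded.
     */
    while (sent < SEND_LIMIT && page < NUM_PAGES) {
        scanned += PAGE_SIZE;
        if (page_is_dup(ram[page])) {
            sent += 9;              /* tiny marker on the wire, roughly */
        } else {
            sent += PAGE_SIZE;      /* whole page on the wire           */
        }
        page++;
    }

    printf("scanned %llu MiB to put %llu KiB on the wire\n",
           (unsigned long long)(scanned >> 20),
           (unsigned long long)(sent >> 10));
    return 0;
}

Compiled and run, this reports scanning hundreds of MiB while staying far
below the send limit, which is exactly the window in which the timers and the
rest of the main loop are starved.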
Izik Eidus - May 9, 2010, 12:29 p.m.
On Tue, 13 Apr 2010 12:33:18 +0300
Izik Eidus <ieidus@redhat.com> wrote:

> From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00 2001
> From: Izik Eidus <ieidus@redhat.com>
> Date: Tue, 13 Apr 2010 12:24:57 +0300
> Subject: [PATCH] fix migration with large mem

Anyone ???


> 
> For guests with large memory that have many pages whose
> bytes all hold the same value, we spend a lot of time
> reading the memory from the guest (is_dup_page()).
>
> This happens because ram_save_live() has a limit on how
> much we can send to the destination but not on how much
> we read from the guest, so when we get many is_dup_page()
> hits we may read a huge amount of data without updating
> important things like the timers...
>
> The guest loses its responsiveness and gets many soft
> lockups inside itself.
>
> This patch adds a limit on the amount we can read from the
> guest in each iteration.
> 
>     Thanks.
> 
> Signed-off-by: Izik Eidus <ieidus@redhat.com>
> ---
>  arch_init.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index cfc03ea..e27b1a0 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -88,6 +88,8 @@ const uint32_t arch_type = QEMU_ARCH;
>  #define RAM_SAVE_FLAG_PAGE	0x08
>  #define RAM_SAVE_FLAG_EOS	0x10
>  
> +#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
> +
>  static int is_dup_page(uint8_t *page, uint8_t ch)
>  {
>      uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> @@ -175,6 +177,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      uint64_t bytes_transferred_last;
>      double bwidth = 0;
>      uint64_t expected_time = 0;
> +    int data_read = 0;
>  
>      if (stage < 0) {
>          cpu_physical_memory_set_dirty_tracking(0);
> @@ -205,10 +208,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      bytes_transferred_last = bytes_transferred;
>      bwidth = qemu_get_clock_ns(rt_clock);
>  
> -    while (!qemu_file_rate_limit(f)) {
> +    while (!qemu_file_rate_limit(f) && data_read < MAX_SAVE_BLOCK_READ) {
>          int ret;
>  
>          ret = ram_save_block(f);
> +        data_read += ret * TARGET_PAGE_SIZE;
>          bytes_transferred += ret * TARGET_PAGE_SIZE;
>          if (ret == 0) { /* no more blocks */
>              break;
Paolo Bonzini - May 9, 2010, 1:57 p.m.
On 05/09/2010 02:29 PM, Izik Eidus wrote:
> On Tue, 13 Apr 2010 12:33:18 +0300
> Izik Eidus <ieidus@redhat.com> wrote:
>
>>  From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00 2001
>> From: Izik Eidus<ieidus@redhat.com>
>> Date: Tue, 13 Apr 2010 12:24:57 +0300
>> Subject: [PATCH] fix migration with large mem
>
> Anyone ???

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Maybe this could be merged via uq/master.

Paolo
Avi Kivity - May 9, 2010, 2:01 p.m.
On 05/09/2010 04:57 PM, Paolo Bonzini wrote:
> On 05/09/2010 02:29 PM, Izik Eidus wrote:
>> On Tue, 13 Apr 2010 12:33:18 +0300
>> Izik Eidus <ieidus@redhat.com> wrote:
>>
>>>  From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00 2001
>>> From: Izik Eidus<ieidus@redhat.com>
>>> Date: Tue, 13 Apr 2010 12:24:57 +0300
>>> Subject: [PATCH] fix migration with large mem
>>
>> Anyone ???
>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Maybe this could be merged via uq/master.

It's not intrinsically kvm related.
Anthony Liguori - May 10, 2010, 8:24 p.m.
On 04/13/2010 04:33 AM, Izik Eidus wrote:
>  From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00 2001
> From: Izik Eidus<ieidus@redhat.com>
> Date: Tue, 13 Apr 2010 12:24:57 +0300
> Subject: [PATCH] fix migration with large mem
>
> For guests with large memory that have many pages whose
> bytes all hold the same value, we spend a lot of time
> reading the memory from the guest (is_dup_page()).
>
> This happens because ram_save_live() has a limit on how
> much we can send to the destination but not on how much
> we read from the guest, so when we get many is_dup_page()
> hits we may read a huge amount of data without updating
> important things like the timers...
>
> The guest loses its responsiveness and gets many soft
> lockups inside itself.
>
> This patch adds a limit on the amount we can read from the
> guest in each iteration.
>
>      Thanks.
>
> Signed-off-by: Izik Eidus<ieidus@redhat.com>
> ---
>   arch_init.c |    6 +++++-
>   1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index cfc03ea..e27b1a0 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -88,6 +88,8 @@ const uint32_t arch_type = QEMU_ARCH;
>   #define RAM_SAVE_FLAG_PAGE	0x08
>   #define RAM_SAVE_FLAG_EOS	0x10
>
> +#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
> +
>   static int is_dup_page(uint8_t *page, uint8_t ch)
>   {
>       uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> @@ -175,6 +177,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       uint64_t bytes_transferred_last;
>       double bwidth = 0;
>       uint64_t expected_time = 0;
> +    int data_read = 0;
>
>       if (stage < 0) {
>           cpu_physical_memory_set_dirty_tracking(0);
> @@ -205,10 +208,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       bytes_transferred_last = bytes_transferred;
>       bwidth = qemu_get_clock_ns(rt_clock);
>
> -    while (!qemu_file_rate_limit(f)) {
> +    while (!qemu_file_rate_limit(f) && data_read < MAX_SAVE_BLOCK_READ) {
>    

The effect of this patch is that we'll never send more than 10mb/s 
during live migration?  If so, it's totally wrong as a fix to the problem.

It would be better to account the deduplicated pages as part of the rate 
limiting calculations.

Regards,

Anthony Liguori

>           int ret;
>
>           ret = ram_save_block(f);
> +        data_read += ret * TARGET_PAGE_SIZE;
>           bytes_transferred += ret * TARGET_PAGE_SIZE;
>           if (ret == 0) { /* no more blocks */
>               break;
>
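
A rough sketch of the accounting Anthony suggests, kept deliberately generic
(the limiter structure and helper names are invented for this illustration
and are not QEMU's API): pages that were only examined, including
deduplicated ones, are charged against the same budget the rate limiter
checks, so the pass still yields on schedule while a fast link remains free
to use its full configured bandwidth.

/*
 * Illustration only (not QEMU code).
 */
#include <stdint.h>
#include <stdio.h>

#define TARGET_PAGE_SIZE 4096
#define TOTAL_PAGES      (64 * 1024)            /* 256 MiB, mostly dup */
#define PASS_BUDGET      (32ULL * 1024 * 1024)  /* bytes per pass      */

static size_t next_page;

/*
 * Toy stand-in for ram_save_block(): looks at one page, reports the
 * bytes put on the wire through *sent, and returns the number of pages
 * examined (0 means no more dirty pages).
 */
static int save_one_block(uint64_t *sent)
{
    if (next_page >= TOTAL_PAGES) {
        return 0;
    }
    /* Pretend 99 of 100 pages deduplicate to a tiny marker. */
    *sent = (next_page % 100 == 0) ? TARGET_PAGE_SIZE : 3;
    next_page++;
    return 1;
}

int main(void)
{
    uint64_t accounted = 0, wire_bytes = 0, pages = 0;

    while (accounted < PASS_BUDGET) {
        uint64_t sent = 0;
        int examined = save_one_block(&sent);

        if (examined == 0) {
            break;                              /* no more pages */
        }
        /*
         * Charge the larger of scan cost and wire cost, so neither
         * unbounded scanning nor unbounded sending is possible.
         */
        uint64_t scanned = (uint64_t)examined * TARGET_PAGE_SIZE;
        accounted += scanned > sent ? scanned : sent;
        wire_bytes += sent;
        pages += examined;
    }

    printf("pass examined %llu pages, sent %llu KiB, accounted %llu MiB\n",
           (unsigned long long)pages,
           (unsigned long long)(wire_bytes >> 10),
           (unsigned long long)(accounted >> 20));
    return 0;
}

The exact cost to charge per deduplicated page (a full page, the bytes
actually put on the wire, or something in between) is a design choice the
thread leaves open; the sketch charges the larger of scan cost and wire cost
so that both reading and sending are bounded by the same budget.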
Izik Eidus - May 10, 2010, 9:45 p.m.
On Mon, 10 May 2010 15:24:20 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 04/13/2010 04:33 AM, Izik Eidus wrote:
> >  From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00
> > 2001 From: Izik Eidus<ieidus@redhat.com>
> > Date: Tue, 13 Apr 2010 12:24:57 +0300
> > Subject: [PATCH] fix migration with large mem
> >
> > For guests with large memory that have many pages whose
> > bytes all hold the same value, we spend a lot of time
> > reading the memory from the guest (is_dup_page()).
> >
> > This happens because ram_save_live() has a limit on how
> > much we can send to the destination but not on how much
> > we read from the guest, so when we get many is_dup_page()
> > hits we may read a huge amount of data without updating
> > important things like the timers...
> >
> > The guest loses its responsiveness and gets many soft
> > lockups inside itself.
> >
> > This patch adds a limit on the amount we can read from the
> > guest in each iteration.
> >
> >      Thanks.
> >
> > Signed-off-by: Izik Eidus<ieidus@redhat.com>
> > ---
> >   arch_init.c |    6 +++++-
> >   1 files changed, 5 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch_init.c b/arch_init.c
> > index cfc03ea..e27b1a0 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -88,6 +88,8 @@ const uint32_t arch_type = QEMU_ARCH;
> >   #define RAM_SAVE_FLAG_PAGE	0x08
> >   #define RAM_SAVE_FLAG_EOS	0x10
> >
> > +#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
> > +
> >   static int is_dup_page(uint8_t *page, uint8_t ch)
> >   {
> >       uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> > @@ -175,6 +177,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f,
> > int stage, void *opaque) uint64_t bytes_transferred_last;
> >       double bwidth = 0;
> >       uint64_t expected_time = 0;
> > +    int data_read = 0;
> >
> >       if (stage < 0) {
> >           cpu_physical_memory_set_dirty_tracking(0);
> > @@ -205,10 +208,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f,
> > int stage, void *opaque) bytes_transferred_last = bytes_transferred;
> >       bwidth = qemu_get_clock_ns(rt_clock);
> >
> > -    while (!qemu_file_rate_limit(f)) {
> > +    while (!qemu_file_rate_limit(f) && data_read <
> > MAX_SAVE_BLOCK_READ) { 
> 
> The effect of this patch is that we'll never send more than 10mb/s 
> during live migration?  If so, it's totally wrong as a fix to the
> problem.

It is 100mb/s... (if I remember correctly, the migration code will run
this thing 10 times for each iteration)

My feeling is that limiting it to the network's 32mb/s limit is too low;
reading memory at 100mb/s is not such a problem as long as we don't
read gigabytes out of memory every second...

(Still, we want to optimize the billions-of-zeros case of Windows guests)

Anyway, if the above does not make sense to you, I will just change it
to what you suggested.

So?

> 
> It would be better to account the deduplicated pages as part of the
> rate limiting calculations.
> 
> Regards,
> 
> Anthony Liguori
> 
> >           int ret;
> >
> >           ret = ram_save_block(f);
> > +        data_read += ret * TARGET_PAGE_SIZE;
> >           bytes_transferred += ret * TARGET_PAGE_SIZE;
> >           if (ret == 0) { /* no more blocks */
> >               break;
> >    
>
Anthony Liguori - May 10, 2010, 9:55 p.m.
On 05/10/2010 04:45 PM, Izik Eidus wrote:
> On Mon, 10 May 2010 15:24:20 -0500
> Anthony Liguori <anthony@codemonkey.ws> wrote:
>
>    
>> On 04/13/2010 04:33 AM, Izik Eidus wrote:
>>      
>>>    From f881b371e08760a67bf1f5b992a586c3de600f7a Mon Sep 17 00:00:00
>>> 2001 From: Izik Eidus<ieidus@redhat.com>
>>> Date: Tue, 13 Apr 2010 12:24:57 +0300
>>> Subject: [PATCH] fix migration with large mem
>>>
>>> For guests with large memory that have many pages whose
>>> bytes all hold the same value, we spend a lot of time
>>> reading the memory from the guest (is_dup_page()).
>>>
>>> This happens because ram_save_live() has a limit on how
>>> much we can send to the destination but not on how much
>>> we read from the guest, so when we get many is_dup_page()
>>> hits we may read a huge amount of data without updating
>>> important things like the timers...
>>>
>>> The guest loses its responsiveness and gets many soft
>>> lockups inside itself.
>>>
>>> This patch adds a limit on the amount we can read from the
>>> guest in each iteration.
>>>
>>>       Thanks.
>>>
>>> Signed-off-by: Izik Eidus<ieidus@redhat.com>
>>> ---
>>>    arch_init.c |    6 +++++-
>>>    1 files changed, 5 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index cfc03ea..e27b1a0 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -88,6 +88,8 @@ const uint32_t arch_type = QEMU_ARCH;
>>>    #define RAM_SAVE_FLAG_PAGE	0x08
>>>    #define RAM_SAVE_FLAG_EOS	0x10
>>>
>>> +#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
>>> +
>>>    static int is_dup_page(uint8_t *page, uint8_t ch)
>>>    {
>>>        uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
>>> @@ -175,6 +177,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f,
>>> int stage, void *opaque) uint64_t bytes_transferred_last;
>>>        double bwidth = 0;
>>>        uint64_t expected_time = 0;
>>> +    int data_read = 0;
>>>
>>>        if (stage < 0) {
>>>            cpu_physical_memory_set_dirty_tracking(0);
>>> @@ -205,10 +208,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f,
>>> int stage, void *opaque) bytes_transferred_last = bytes_transferred;
>>>        bwidth = qemu_get_clock_ns(rt_clock);
>>>
>>> -    while (!qemu_file_rate_limit(f)) {
>>> +    while (!qemu_file_rate_limit(f) && data_read <
>>> MAX_SAVE_BLOCK_READ) {
>>>        
>> The effect of this patch is that we'll never send more than 10mb/s
>> during live migration?  If so, it's totally wrong as a fix to the
>> problem.
>>      
> It is 100mb/s... (if I remember correctly, the migration code will run
> this thing 10 times for each iteration)
>    

No, it only runs it once.

> My feeling is that limiting it to the network's 32mb/s limit is too low;
> reading memory at 100mb/s is not such a problem as long as we don't
> read gigabytes out of memory every second...
>    

You've limited bandwidth to 10 mb/sec.  Even if it was 100 mb/sec, a
fixed limit is wrong.  On a 10gbit (or 40gbit) link, 100 mb/sec is not
enough.

> (Still, we want to optimize the billions-of-zeros case of Windows guests)
>
> Anyway, if the above does not make sense to you, I will just change it
> to what you suggested.
>
> So?
>    

That would work for me.

Regards,

Anthony Liguori
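
For reference, the arithmetic behind the exchange above, with the caveat that
the pass period used here is an assumption for the illustration, not
something stated in the thread (how often the handler runs is precisely the
point Izik and Anthony read differently). Whatever the exact frequency, a
hard-coded per-pass read cap becomes a migration-speed ceiling that ignores
the link entirely, which is the objection both sides converge on.

/*
 * Illustration only; the 100 ms pass period is an assumption.
 */
#include <stdio.h>

int main(void)
{
    const double cap_bytes = 10.0 * 1024 * 1024;  /* MAX_SAVE_BLOCK_READ */
    const double pass_sec  = 0.1;                 /* assumed pass period */
    const double link_Bps  = 10e9 / 8;            /* 10 Gbit/s link      */

    printf("ceiling imposed by the cap:      %6.0f MiB/s\n",
           cap_bytes / pass_sec / (1024 * 1024));
    printf("what a 10 Gbit/s link can carry: %6.0f MiB/s\n",
           link_Bps / (1024 * 1024));
    return 0;
}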

Patch

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..e27b1a0 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -88,6 +88,8 @@  const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_PAGE	0x08
 #define RAM_SAVE_FLAG_EOS	0x10
 
+#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
+
 static int is_dup_page(uint8_t *page, uint8_t ch)
 {
     uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
@@ -175,6 +177,7 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t bytes_transferred_last;
     double bwidth = 0;
     uint64_t expected_time = 0;
+    int data_read = 0;
 
     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
@@ -205,10 +208,11 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     bytes_transferred_last = bytes_transferred;
     bwidth = qemu_get_clock_ns(rt_clock);
 
-    while (!qemu_file_rate_limit(f)) {
+    while (!qemu_file_rate_limit(f) && data_read < MAX_SAVE_BLOCK_READ) {
         int ret;
 
         ret = ram_save_block(f);
+        data_read += ret * TARGET_PAGE_SIZE;
         bytes_transferred += ret * TARGET_PAGE_SIZE;
         if (ret == 0) { /* no more blocks */
             break;