
[09/10] Exit loop if we have been there too long

Message ID 9b23b9b4cee242591bdb356c838a9cfb9af033c1.1290552026.git.quintela@redhat.com
State New

Commit Message

Juan Quintela Nov. 23, 2010, 11:03 p.m. UTC
From: Juan Quintela <quintela@trasno.org>

Checking every 64 pages is an arbitrary magic number, as good as any other.
We don't want to check too often, but on the other hand,
qemu_get_clock_ns() is not that expensive either.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

Comments

Michael S. Tsirkin Nov. 24, 2010, 10:40 a.m. UTC | #1
On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
> From: Juan Quintela <quintela@trasno.org>
> 
> cheking each 64 pages is a random magic number as good as any other.
> We don't want to test too many times, but on the other hand,
> qemu_get_clock_ns() is not so expensive either.
> 

Could you please explain what's the problem this fixes?
I would like to see an API that documents the contract
we are making with the backend.

> Signed-off-by: Juan Quintela <quintela@trasno.org>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  arch_init.c |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index d32aaae..b463798 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -40,6 +40,7 @@
>  #include "net.h"
>  #include "gdbstub.h"
>  #include "hw/smbios.h"
> +#include "buffered_file.h"
> 
>  #ifdef TARGET_SPARC
>  int graphic_width = 1024;
> @@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      uint64_t bytes_transferred_last;
>      uint64_t t0;
>      double bwidth = 0;
> +    int i;
> 
>      if (stage < 0) {
>          cpu_physical_memory_set_dirty_tracking(0);
> @@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      bytes_transferred_last = bytes_transferred;
>      t0 = qemu_get_clock_ns(rt_clock);
> 
> +    i = 0;
>      while (!qemu_file_rate_limit(f)) {
>          int bytes_sent;
> 
> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>          if (bytes_sent == 0) { /* no more blocks */
>              break;
>          }
> +	/* we want to check in the 1st loop, just in case it was the 1st time
> +           and we had to sync the dirty bitmap.
> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
> +           iterations
> +	*/
> +        if ((i & 63) == 0) {
> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;

This adds even more non-determinism to savevm behaviour.  If the bandwidth
limit is high enough, I expect it to just keep going.

> +            if (t1 > buffered_file_interval/2) {

arch_init should not depend on buffered_file implementation IMO.

Also - / 2?

> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);

Is this a debugging aid?

> +		break;
> +	    }
> +	}
> +        i++;
>      }
> 
>      t0 = qemu_get_clock_ns(rt_clock) - t0;
> -- 
> 1.7.3.2
>
Juan Quintela Nov. 24, 2010, 11:01 a.m. UTC | #2
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
>> From: Juan Quintela <quintela@trasno.org>
>> 
>> cheking each 64 pages is a random magic number as good as any other.
>> We don't want to test too many times, but on the other hand,
>> qemu_get_clock_ns() is not so expensive either.
>> 
>
> Could you please explain what's the problem this fixes?
> I would like to see an API that documents the contract
> we are making with the backend.

buffered_file is an "abstraction" that uses a buffer.

The live migration code (remember it can't sleep, it runs on the main loop)
stores its "stuff" in that buffer.  And a timer writes that buffer to
the fd that is associated with migration.

This design is due to the main_loop/no-threads qemu model.

The buffered_file timer runs every 100ms.  And we "try" to measure channel
bandwidth from there.  If we are not able to run the timer, all the
calculations are wrong, and then stalls happen.
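
In rough, hypothetical C (invented names and layout, not the actual
buffered_file.c code), the mechanism is something like:

#include <stddef.h>
#include <stdint.h>

#define INTERVAL_MS 100                    /* assumed 100ms timer period */

typedef struct SketchBufferedFile {
    uint8_t  buf[64 * 1024];
    size_t   used;                         /* bytes queued by the producer */
    double   rate_bytes_per_ms;            /* estimated channel bandwidth */
} SketchBufferedFile;

/* Called (ideally) every INTERVAL_MS from the main loop; if a tick is
   missed, the bandwidth estimate below covers the wrong interval and the
   rate limiting goes off. */
static void sketch_timer_tick(SketchBufferedFile *s,
                              size_t (*flush)(const uint8_t *, size_t))
{
    size_t sent = flush(s->buf, s->used);  /* write to the migration fd */
    s->rate_bytes_per_ms = (double)sent / INTERVAL_MS;
    s->used -= sent;
}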


>> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>          if (bytes_sent == 0) { /* no more blocks */
>>              break;
>>          }
>> +	/* we want to check in the 1st loop, just in case it was the 1st time
>> +           and we had to sync the dirty bitmap.
>> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
>> +           iterations
>> +	*/
>> +        if ((i & 63) == 0) {
>> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
>
> This adds even more non-determinism to savevm behaviour.  If bandwidth
> limit is higth enough, I expect it to just keep going.

If we find a run of 512MB of zero pages together (and that happens if
you have a 64GB idle guest), then you can spend more than 3 seconds
filling the default bandwidth.  After that, everything that uses the main
loop has had stalls.


>> +            if (t1 > buffered_file_interval/2) {
>
> arch_init should not depend on buffered_file implementation IMO.
>
> Also - / 2?

We need to run a timer every 100ms.  For the timings, look at the 0/6 patch.
We can't spend more than 50ms in each function.  It is something that
should hold for all functions called from io_handlers.

>> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
>
> Is this a debugging aid?

I left that in on purpose, to show that it happens a lot.  There is no
DEBUG_ARCH or DEBUG_RAM around; I can create them if you prefer.  But
notice that this is something that shouldn't happen (but it happens).

A DPRINTF for that file would be a good idea, will do.

>> +		break;
>> +	    }
>> +	}
>> +        i++;
>>      }
>> 
>>      t0 = qemu_get_clock_ns(rt_clock) - t0;
>> -- 
>> 1.7.3.2
>>
Michael S. Tsirkin Nov. 24, 2010, 11:14 a.m. UTC | #3
On Wed, Nov 24, 2010 at 12:01:51PM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
> >> From: Juan Quintela <quintela@trasno.org>
> >> 
> >> cheking each 64 pages is a random magic number as good as any other.
> >> We don't want to test too many times, but on the other hand,
> >> qemu_get_clock_ns() is not so expensive either.
> >> 
> >
> > Could you please explain what's the problem this fixes?
> > I would like to see an API that documents the contract
> > we are making with the backend.
> 
> buffered_file is an "abstraction" that uses a buffer.
> 
> live migration code (remember it can't sleep, it runs on the main loop)
> stores its "stuff" on that buffer.  And a timer writes that buffer to
> the fd that is associated with migration.
> 
> This design is due to the main_loop/no threads qemu model.
> 
> buffered_file timer runs each 100ms.  And we "try" to measure channel
> bandwidth from there.  If we are not able to run the timer, all the
> calculations are wrong, and then stalls happens.

So the problem is the timer in the buffered file abstraction?
Why don't we just flush out data if the buffer is full?

> 
> 
> >> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >>          if (bytes_sent == 0) { /* no more blocks */
> >>              break;
> >>          }
> >> +	/* we want to check in the 1st loop, just in case it was the 1st time
> >> +           and we had to sync the dirty bitmap.
> >> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
> >> +           iterations
> >> +	*/
> >> +        if ((i & 63) == 0) {
> >> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
> >
> > This adds even more non-determinism to savevm behaviour.  If bandwidth
> > limit is higth enough, I expect it to just keep going.
> 
> If we find a row of 512MB of zero pages together (and that happens if
> you have a 64GB iddle guest, then you can spent more than 3seconds to
> fill the default bandwith).  After that everything that uses the main
> loop has had stalls.
> 
> 
> >> +            if (t1 > buffered_file_interval/2) {
> >
> > arch_init should not depend on buffered_file implementation IMO.
> >
> > Also - / 2?
> 
> We need to run a timer each 100ms.  For times look at the 0/6 patch.
> We can't spent more that 50ms in each function.  It is something that
> should happen for all funnctions called from io_handlers.
> 
> >> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
> >
> > Is this a debugging aid?
> 
> I left that on purpose, to show that it happens a lot.  There is no
> DEBUG_ARCH or DEBUG_RAM around, I can create them if you preffer.  But
> notice that this is something that shouldn't happen (but it happens).
> 
> DPRINTF for that file should be a good idea, will do.
> 
> >> +		break;
> >> +	    }
> >> +	}
> >> +        i++;
> >>      }
> >> 
> >>      t0 = qemu_get_clock_ns(rt_clock) - t0;
> >> -- 
> >> 1.7.3.2
> >>
Paolo Bonzini Nov. 24, 2010, 3:16 p.m. UTC | #4
On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
>> >  buffered_file timer runs each 100ms.  And we "try" to measure channel
>> >  bandwidth from there.  If we are not able to run the timer, all the
>> >  calculations are wrong, and then stalls happens.
>
> So the problem is the timer in the buffered file abstraction?
> Why don't we just flush out data if the buffer is full?

It takes a lot to fill the buffer if you have many zero pages, and if 
that happens the guest is starved by the main loop filling the buffer.

Paolo
Michael S. Tsirkin Nov. 24, 2010, 3:59 p.m. UTC | #5
On Wed, Nov 24, 2010 at 04:16:04PM +0100, Paolo Bonzini wrote:
> On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
> >>>  buffered_file timer runs each 100ms.  And we "try" to measure channel
> >>>  bandwidth from there.  If we are not able to run the timer, all the
> >>>  calculations are wrong, and then stalls happens.
> >
> >So the problem is the timer in the buffered file abstraction?
> >Why don't we just flush out data if the buffer is full?
> 
> It takes a lot to fill the buffer if you have many zero pages, and
> if that happens the guest is starved by the main loop filling the
> buffer.
> 
> Paolo

To solve starvation we really should return to the main loop
as soon as possible and give I/O a chance to run.
The timer-based hack is probably the best we can do by 0.14, though.
Anthony Liguori Nov. 30, 2010, 2:11 a.m. UTC | #6
On 11/23/2010 05:03 PM, Juan Quintela wrote:
> From: Juan Quintela<quintela@trasno.org>
>
> cheking each 64 pages is a random magic number as good as any other.
> We don't want to test too many times, but on the other hand,
> qemu_get_clock_ns() is not so expensive either.
>
> Signed-off-by: Juan Quintela<quintela@trasno.org>
> Signed-off-by: Juan Quintela<quintela@redhat.com>
>    

This is just introducing a future problem.

BufferedFile should hit the qemu_file_rate_limit check when the socket 
buffer gets filled up.  Are you claiming that the problem is due to 
trying to send a large number of zero pages?  Is this purely synthetic 
because you're testing an unutilized large-memory guest?

Regards,

Anthony Liguori

> ---
>   arch_init.c |   16 ++++++++++++++++
>   1 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index d32aaae..b463798 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -40,6 +40,7 @@
>   #include "net.h"
>   #include "gdbstub.h"
>   #include "hw/smbios.h"
> +#include "buffered_file.h"
>
>   #ifdef TARGET_SPARC
>   int graphic_width = 1024;
> @@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       uint64_t bytes_transferred_last;
>       uint64_t t0;
>       double bwidth = 0;
> +    int i;
>
>       if (stage<  0) {
>           cpu_physical_memory_set_dirty_tracking(0);
> @@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       bytes_transferred_last = bytes_transferred;
>       t0 = qemu_get_clock_ns(rt_clock);
>
> +    i = 0;
>       while (!qemu_file_rate_limit(f)) {
>           int bytes_sent;
>
> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>           if (bytes_sent == 0) { /* no more blocks */
>               break;
>           }
> +	/* we want to check in the 1st loop, just in case it was the 1st time
> +           and we had to sync the dirty bitmap.
> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
> +           iterations
> +	*/
> +        if ((i&  63) == 0) {
> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
> +            if (t1>  buffered_file_interval/2) {
> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
> +		break;
> +	    }
> +	}
> +        i++;
>       }
>
>       t0 = qemu_get_clock_ns(rt_clock) - t0;
>
Anthony Liguori Nov. 30, 2010, 2:15 a.m. UTC | #7
On 11/24/2010 09:16 AM, Paolo Bonzini wrote:
> On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
>>> >  buffered_file timer runs each 100ms.  And we "try" to measure 
>>> channel
>>> >  bandwidth from there.  If we are not able to run the timer, all the
>>> >  calculations are wrong, and then stalls happens.
>>
>> So the problem is the timer in the buffered file abstraction?
>> Why don't we just flush out data if the buffer is full?
>
> It takes a lot to fill the buffer if you have many zero pages, and if 
> that happens the guest is starved by the main loop filling the buffer.

Sounds like the sort of thing you'd only see if you created a large
guest that was mostly unallocated and then tried to migrate.  That
doesn't seem like a very real case to me though.

The best approach would be to drop qemu_mutex while processing this code
instead of having an arbitrary back-off point.  The latter is deferring
the problem to another day, when it becomes the source of a future problem.

Regards,

Anthony Liguori

> Paolo
>
Juan Quintela Nov. 30, 2010, 2:23 a.m. UTC | #8
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/23/2010 05:03 PM, Juan Quintela wrote:
>> From: Juan Quintela<quintela@trasno.org>
>>
>> cheking each 64 pages is a random magic number as good as any other.
>> We don't want to test too many times, but on the other hand,
>> qemu_get_clock_ns() is not so expensive either.
>>
>> Signed-off-by: Juan Quintela<quintela@trasno.org>
>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>>    
>
> This is just introducing a future problem.
>
> BufferedFile should hit the qemu_file_rate_limit check when the socket
> buffer gets filled up.  Are you claiming that the problem is due to
> trying to send a large number of zero pages?  Is this purely synthetic
> because you're testing an utilized large memory guest?

Get a big-memory machine.
Launch a 64GB guest.
Let it idle after start.
Migrate it, and you will get stalls of around 3-4 seconds (see the patch
0/6 file).  I got 375000 pages sent in one go (all zero pages).

In non-idle/new guests it doesn't happen (you don't have several 500MB
runs of zero pages together in a 1GB guest).

The other solution to this problem is to just limit the number of pages
that we can send in one go, but the problem is what number to put there.
For my test hosts, around 8000 pages are OK, but this is a 16-core
machine with 128GB of RAM.  I don't know how to get a value that works
for every machine.

The timeout value is not trivial either.  It clearly needs to be less
than 100ms (otherwise the rate limit calculation is completely broken),
and other things in qemu don't work very well otherwise (the monitor
comes to mind).

I tested several values:
- no limit: migration takes 112 seconds, but we have lots of stalls of
  around 3 seconds.
- 15ms (migration_max_downtime()/2): migration total time moves to 404
  seconds.
- 50ms: sweet spot, the biggest value that doesn't have stalls (migration
  takes 150s).
- 75ms-100ms: tried both of them; total speed is a bit better (~125s), but
  we are back to having stalls.


Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> ---
>>   arch_init.c |   16 ++++++++++++++++
>>   1 files changed, 16 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index d32aaae..b463798 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -40,6 +40,7 @@
>>   #include "net.h"
>>   #include "gdbstub.h"
>>   #include "hw/smbios.h"
>> +#include "buffered_file.h"
>>
>>   #ifdef TARGET_SPARC
>>   int graphic_width = 1024;
>> @@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>       uint64_t bytes_transferred_last;
>>       uint64_t t0;
>>       double bwidth = 0;
>> +    int i;
>>
>>       if (stage<  0) {
>>           cpu_physical_memory_set_dirty_tracking(0);
>> @@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>       bytes_transferred_last = bytes_transferred;
>>       t0 = qemu_get_clock_ns(rt_clock);
>>
>> +    i = 0;
>>       while (!qemu_file_rate_limit(f)) {
>>           int bytes_sent;
>>
>> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>           if (bytes_sent == 0) { /* no more blocks */
>>               break;
>>           }
>> +	/* we want to check in the 1st loop, just in case it was the 1st time
>> +           and we had to sync the dirty bitmap.
>> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
>> +           iterations
>> +	*/
>> +        if ((i&  63) == 0) {
>> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
>> +            if (t1>  buffered_file_interval/2) {
>> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
>> +		break;
>> +	    }
>> +	}
>> +        i++;
>>       }
>>
>>       t0 = qemu_get_clock_ns(rt_clock) - t0;
>>
Anthony Liguori Nov. 30, 2010, 2:27 a.m. UTC | #9
On 11/29/2010 08:23 PM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 11/23/2010 05:03 PM, Juan Quintela wrote:
>>      
>>> From: Juan Quintela<quintela@trasno.org>
>>>
>>> cheking each 64 pages is a random magic number as good as any other.
>>> We don't want to test too many times, but on the other hand,
>>> qemu_get_clock_ns() is not so expensive either.
>>>
>>> Signed-off-by: Juan Quintela<quintela@trasno.org>
>>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>>>
>>>        
>> This is just introducing a future problem.
>>
>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>> buffer gets filled up.  Are you claiming that the problem is due to
>> trying to send a large number of zero pages?  Is this purely synthetic
>> because you're testing an utilized large memory guest?
>>      
> Get big memory machine
> Launch a 64GB guest.
> Let it idle after start
> Migrate it, and you will get stalls of around 3-4 seconds (see the patch
> 0/6 file).  I got 375000 pages sent in one go (all zero pages).
>    

That's a purely synthetic test case that's totally unrealistic.  There 
simply aren't a lot of zero pages in modern guests after a reasonable 
period of time.

> In not idle/new guests, it don't happen (you don't have several 500MB of
> zeropages together in one 1GB guest).
>
> The other solution to this problem is to just limit the number of pages
> that we can sent in one go, but the problem is what number to put there.
> For my test hosts, around 8000 pages are ok, but this is a 16core
> machine with 128GB of RAM.  I don't know how to get a value that works
> for every machine.
>    

Or just account zero pages as full-sized pages and let rate limiting do
its thing.  That's a much better short-term approach.
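
Something like the following, purely as an illustration (hypothetical
helper, not a patch against the real ram_save_block()):

/* Report a zero page as if a full page had gone over the wire, so the
   existing qemu_file_rate_limit() check also throttles the page scan. */
static int sketch_account_sent(int bytes_on_wire, int is_zero_page)
{
    const int page_size = 4096;            /* assumed TARGET_PAGE_SIZE */
    return is_zero_page ? page_size : bytes_on_wire;
}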

> The timout value is not trivial either.  We need "clearly" that is less
> than 100ms (otherwise the rate limit calculation is completely broken),
> and other things in qemu don't work very well (monitor cames to mind).
>    

Your testing has been broken because you didn't update the code correctly.

Regards,

Anthony Liguori

> I tested several values:
> - no limit: migration takes 112 seconds, but we have lots of stalls of
>    around 3 seconds.
> - 15ms (migration_max_downtime()/2). Migration total time moves to 404
>    seconds.
> - 50ms: sweet spot, bigger value that don't have stalls. (migration
>    takes 150s)
> - 75ms-100ms: try both of them, total speed is a bit better (~125s), but
>    we are back to having stalls.
>
>
> Later, Juan.
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>> ---
>>>    arch_init.c |   16 ++++++++++++++++
>>>    1 files changed, 16 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index d32aaae..b463798 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -40,6 +40,7 @@
>>>    #include "net.h"
>>>    #include "gdbstub.h"
>>>    #include "hw/smbios.h"
>>> +#include "buffered_file.h"
>>>
>>>    #ifdef TARGET_SPARC
>>>    int graphic_width = 1024;
>>> @@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>>        uint64_t bytes_transferred_last;
>>>        uint64_t t0;
>>>        double bwidth = 0;
>>> +    int i;
>>>
>>>        if (stage<   0) {
>>>            cpu_physical_memory_set_dirty_tracking(0);
>>> @@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>>        bytes_transferred_last = bytes_transferred;
>>>        t0 = qemu_get_clock_ns(rt_clock);
>>>
>>> +    i = 0;
>>>        while (!qemu_file_rate_limit(f)) {
>>>            int bytes_sent;
>>>
>>> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>>            if (bytes_sent == 0) { /* no more blocks */
>>>                break;
>>>            }
>>> +	/* we want to check in the 1st loop, just in case it was the 1st time
>>> +           and we had to sync the dirty bitmap.
>>> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
>>> +           iterations
>>> +	*/
>>> +        if ((i&   63) == 0) {
>>> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
>>> +            if (t1>   buffered_file_interval/2) {
>>> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
>>> +		break;
>>> +	    }
>>> +	}
>>> +        i++;
>>>        }
>>>
>>>        t0 = qemu_get_clock_ns(rt_clock) - t0;
>>>
>>>
Paolo Bonzini Nov. 30, 2010, 7:15 a.m. UTC | #10
On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>
> BufferedFile should hit the qemu_file_rate_limit check when the socket
> buffer gets filled up.

The problem is that the file rate limit is not hit because work is done 
elsewhere.  The rate can limit the bandwidth used and makes QEMU aware 
that socket operations may block (because that's what the buffered file 
freeze/unfreeze logic does); but it cannot be used to limit the _time_ 
spent in the migration code.

Paolo
Paolo Bonzini Nov. 30, 2010, 8:10 a.m. UTC | #11
On 11/30/2010 03:15 AM, Anthony Liguori wrote:
>
> Sounds like the sort of thing you'd only see if you created a guest a
> large guest that was mostly unallocated and then tried to migrate.

Or if you migrate a guest that was booted using the memory= option.

Paolo
Juan Quintela Nov. 30, 2010, 1:26 p.m. UTC | #12
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/24/2010 09:16 AM, Paolo Bonzini wrote:
>> On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
>>>> >  buffered_file timer runs each 100ms.  And we "try" to measure
>>>> channel
>>>> >  bandwidth from there.  If we are not able to run the timer, all the
>>>> >  calculations are wrong, and then stalls happens.
>>>
>>> So the problem is the timer in the buffered file abstraction?
>>> Why don't we just flush out data if the buffer is full?
>>
>> It takes a lot to fill the buffer if you have many zero pages, and
>> if that happens the guest is starved by the main loop filling the
>> buffer.
>
> Sounds like the sort of thing you'd only see if you created a guest a
> large guest that was mostly unallocated and then tried to migrate.
> That doesn't seem like a very real case to me though.

No, this is the "easy to reproduce" case.  You can get it in normal
use.  An idle guest with loads of memory is just the worst possible
case, and trivial to reproduce.

> The best approach would be to drop qemu_mutex while processing this
> code instead of having an arbitrary back-off point.  The later is
> deferring the problem to another day when it becomes the source of a
> future problem.

As I said in the other mail, you are offering me half a solution.  If I
implement the qemu_mutex change (which I will do) we would still have
this problem in the main loop.  The "CPU stuck for 10s" issue would be
gone, but nothing else.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Paolo
>>
Anthony Liguori Nov. 30, 2010, 1:47 p.m. UTC | #13
On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>
>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>> buffer gets filled up.
>
> The problem is that the file rate limit is not hit because work is 
> done elsewhere.  The rate can limit the bandwidth used and makes QEMU 
> aware that socket operations may block (because that's what the 
> buffered file freeze/unfreeze logic does); but it cannot be used to 
> limit the _time_ spent in the migration code.

Yes, it can, if you set the rate limit sufficiently low.

The caveats are 1) the kvm.ko interface for dirty bits doesn't scale for
large-memory guests, so we spend a lot more CPU time walking it than we
should, and 2) zero pages cause us to burn a lot more CPU time than we
otherwise would because compressing them is so effective.

In the short term, fixing (2) by accounting zero pages as full-sized
pages should "fix" the problem.

In the long term, we need a new dirty bit interface from kvm.ko that 
uses a multi-level table.  That should dramatically improve scan 
performance.  We also need to implement live migration in a separate 
thread that doesn't carry qemu_mutex while it runs.

Regards,

Anthony Liguori

> Paolo
Avi Kivity Nov. 30, 2010, 1:58 p.m. UTC | #14
On 11/30/2010 03:47 PM, Anthony Liguori wrote:
> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>
>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>> buffer gets filled up.
>>
>> The problem is that the file rate limit is not hit because work is 
>> done elsewhere.  The rate can limit the bandwidth used and makes QEMU 
>> aware that socket operations may block (because that's what the 
>> buffered file freeze/unfreeze logic does); but it cannot be used to 
>> limit the _time_ spent in the migration code.
>
> Yes, it can, if you set the rate limit sufficiently low.
>
> The caveats are 1) the kvm.ko interface for dirty bits doesn't scale 
> for large memory guests so we spend a lot more CPU time walking it 
> than we should 2) zero pages cause us to burn a lot more CPU time than 
> we otherwise would because compressing them is so effective.

What's the problem with burning that CPU?  Per guest page, compressing
takes less time than sending.  Is it just an issue of qemu mutex hold time?

>
> In the short term, fixing (2) by accounting zero pages as full sized 
> pages should "fix" the problem.
>
> In the long term, we need a new dirty bit interface from kvm.ko that 
> uses a multi-level table.  That should dramatically improve scan 
> performance. 

Why would a multi-level table help?  (or rather, please explain what you 
mean by a multi-level table).

Something we could do is divide memory into more slots, and poll each
slot when we start to scan its page range.  That reduces the time
between sampling a page's dirtiness and sending it off, and reduces the 
latency incurred by the sampling.  There are also non-interface-changing 
ways to reduce this latency, like O(1) write protection, or using dirty 
bits instead of write protection when available.

> We also need to implement live migration in a separate thread that 
> doesn't carry qemu_mutex while it runs.

IMO that's the biggest hit currently.
Paolo Bonzini Nov. 30, 2010, 2:12 p.m. UTC | #15
On 11/30/2010 02:47 PM, Anthony Liguori wrote:
> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>
>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>> buffer gets filled up.
>>
>> The problem is that the file rate limit is not hit because work is
>> done elsewhere. The rate can limit the bandwidth used and makes QEMU
>> aware that socket operations may block (because that's what the
>> buffered file freeze/unfreeze logic does); but it cannot be used to
>> limit the _time_ spent in the migration code.
>
> Yes, it can, if you set the rate limit sufficiently low.

You mean, just like you can drive a car without brakes by keeping the 
speed sufficiently low.

> [..] accounting zero pages as full sized
> pages should "fix" the problem.

I know you used quotes, but it's a very very generous definition of fix. 
  Both these proposed "fixes" are nothing more than workarounds, and 
even particularly ugly ones.  The worst thing about them is that there 
is no guarantee of migration finishing in a reasonable time, or at all.

If you account zero pages as full, you don't effectively use the
bandwidth that was allotted to you; you use only 0.2% of it (8/4096).
It then takes an exaggerated amount of time to start iterating on pages
that matter.  If you set the bandwidth low, instead, you do not have the
bandwidth you need in order to converge.

Even from an aesthetic point of view, if there is such a thing, I don't 
understand why you advocate conflating network bandwidth and CPU usage 
into a single measurement.  Nobody disagrees that all you propose is 
nice to have, and that what Juan sent is a stopgap measure (though a 
very effective one).  However, this doesn't negate that Juan's 
accounting patches make a lot of sense in the current design.

> In the long term, we need a new dirty bit interface from kvm.ko that
> uses a multi-level table. That should dramatically improve scan
> performance. We also need to implement live migration in a separate
> thread that doesn't carry qemu_mutex while it runs.

This may be a good way to fix it, but it's also basically a rewrite.

Paolo
Anthony Liguori Nov. 30, 2010, 2:17 p.m. UTC | #16
On 11/30/2010 07:58 AM, Avi Kivity wrote:
> On 11/30/2010 03:47 PM, Anthony Liguori wrote:
>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>>
>>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>>> buffer gets filled up.
>>>
>>> The problem is that the file rate limit is not hit because work is 
>>> done elsewhere.  The rate can limit the bandwidth used and makes 
>>> QEMU aware that socket operations may block (because that's what the 
>>> buffered file freeze/unfreeze logic does); but it cannot be used to 
>>> limit the _time_ spent in the migration code.
>>
>> Yes, it can, if you set the rate limit sufficiently low.
>>
>> The caveats are 1) the kvm.ko interface for dirty bits doesn't scale 
>> for large memory guests so we spend a lot more CPU time walking it 
>> than we should 2) zero pages cause us to burn a lot more CPU time 
>> than we otherwise would because compressing them is so effective.
>
> What's the problem with burning that cpu?  per guest page, compressing 
> takes less than sending.  Is it just an issue of qemu mutex hold time?

If you have a 512GB guest, then you have a 16MB dirty bitmap, which ends
up being a 128MB dirty bitmap in QEMU because we represent each dirty bit
with 8 bits.
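
For reference, the arithmetic (assuming 4KB target pages):

    512 GB / 4 KB         = 128M pages
    128M pages * 1 bit    = 16 MB   (kvm.ko bitmap)
    128M pages * 1 byte   = 128 MB  (QEMU's 8-bits-per-page bitmap)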

Walking 16MB (or 128MB) of memory just to find a few pages to send
over the wire is a big waste of CPU time.  If kvm.ko used a multi-level
table to represent dirty info, we could walk the memory mapping in 2MB
chunks, allowing us to skip a large number of the comparisons.
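
Roughly (an invented layout, not a concrete kvm.ko proposal), the idea is
one summary bit per 2MB chunk so the scan can skip clean 512-page runs:

#define BITS_PER_LONG   (8 * (int)sizeof(unsigned long))
#define PAGES_PER_CHUNK 512                 /* 2MB / 4KB */

struct sketch_dirty_log {
    unsigned long *summary;                 /* 1 bit per 2MB chunk */
    unsigned long *pages;                   /* 1 bit per 4KB page  */
};

static void sketch_scan(struct sketch_dirty_log *log, long nchunks,
                        void (*send_page)(long pfn))
{
    for (long c = 0; c < nchunks; c++) {
        if (!(log->summary[c / BITS_PER_LONG] & (1UL << (c % BITS_PER_LONG))))
            continue;                       /* whole 2MB chunk is clean */
        for (long p = 0; p < PAGES_PER_CHUNK; p++) {
            long pfn = c * PAGES_PER_CHUNK + p;
            if (log->pages[pfn / BITS_PER_LONG] &
                (1UL << (pfn % BITS_PER_LONG)))
                send_page(pfn);
        }
    }
}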

>> In the short term, fixing (2) by accounting zero pages as full sized 
>> pages should "fix" the problem.
>>
>> In the long term, we need a new dirty bit interface from kvm.ko that 
>> uses a multi-level table.  That should dramatically improve scan 
>> performance. 
>
> Why would a multi-level table help?  (or rather, please explain what 
> you mean by a multi-level table).
>
> Something we could do is divide memory into more slots, and polling 
> each slot when we start to scan its page range.  That reduces the time 
> between sampling a page's dirtiness and sending it off, and reduces 
> the latency incurred by the sampling.  There are also 
> non-interface-changing ways to reduce this latency, like O(1) write 
> protection, or using dirty bits instead of write protection when 
> available.

BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
instead of mapping it to the main dirty bitmap.

>> We also need to implement live migration in a separate thread that 
>> doesn't carry qemu_mutex while it runs.
>
> IMO that's the biggest hit currently.

Yup.  That's the Correct solution to the problem.

Regards,

Anthony Liguori
Avi Kivity Nov. 30, 2010, 2:27 p.m. UTC | #17
On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>> What's the problem with burning that cpu?  per guest page, 
>> compressing takes less than sending.  Is it just an issue of qemu 
>> mutex hold time?
>
>
> If you have a 512GB guest, then you have a 16MB dirty bitmap which 
> ends up being an 128MB dirty bitmap in QEMU because we represent dirty 
> bits with 8 bits.

Was there not a patchset to split each bit into its own bitmap?  And 
then copy the kvm or qemu master bitmap into each client bitmap as it 
became needed?

> Walking 16mb (or 128mb) of memory just fine find a few pages to send 
> over the wire is a big waste of CPU time.  If kvm.ko used a 
> multi-level table to represent dirty info, we could walk the memory 
> mapping at 2MB chunks allowing us to skip a large amount of the 
> comparisons.

There's no reason to assume dirty pages would be clustered.  If 0.2% of 
memory were dirty, but scattered uniformly, there would be no win from 
the two-level bitmap.  A loss, in fact: 2MB can be represented as 512 
bits or 64 bytes, just one cache line.  Any two-level thing will need more.

We might have a more compact encoding for sparse bitmaps, like 
run-length encoding.

>
>>> In the short term, fixing (2) by accounting zero pages as full sized 
>>> pages should "fix" the problem.
>>>
>>> In the long term, we need a new dirty bit interface from kvm.ko that 
>>> uses a multi-level table.  That should dramatically improve scan 
>>> performance. 
>>
>> Why would a multi-level table help?  (or rather, please explain what 
>> you mean by a multi-level table).
>>
>> Something we could do is divide memory into more slots, and polling 
>> each slot when we start to scan its page range.  That reduces the 
>> time between sampling a page's dirtiness and sending it off, and 
>> reduces the latency incurred by the sampling.  There are also 
>> non-interface-changing ways to reduce this latency, like O(1) write 
>> protection, or using dirty bits instead of write protection when 
>> available.
>
> BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
> instead of mapping it to the main dirty bitmap.

That's what the patch set I was alluding to did.  Or maybe I imagined 
the whole thing.

>>> We also need to implement live migration in a separate thread that 
>>> doesn't carry qemu_mutex while it runs.
>>
>> IMO that's the biggest hit currently.
>
> Yup.  That's the Correct solution to the problem.

Then let's just Do it.
Anthony Liguori Nov. 30, 2010, 2:50 p.m. UTC | #18
On 11/30/2010 08:27 AM, Avi Kivity wrote:
> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>>> What's the problem with burning that cpu?  per guest page, 
>>> compressing takes less than sending.  Is it just an issue of qemu 
>>> mutex hold time?
>>
>>
>> If you have a 512GB guest, then you have a 16MB dirty bitmap which 
>> ends up being an 128MB dirty bitmap in QEMU because we represent 
>> dirty bits with 8 bits.
>
> Was there not a patchset to split each bit into its own bitmap?  And 
> then copy the kvm or qemu master bitmap into each client bitmap as it 
> became needed?
>
>> Walking 16mb (or 128mb) of memory just fine find a few pages to send 
>> over the wire is a big waste of CPU time.  If kvm.ko used a 
>> multi-level table to represent dirty info, we could walk the memory 
>> mapping at 2MB chunks allowing us to skip a large amount of the 
>> comparisons.
>
> There's no reason to assume dirty pages would be clustered.  If 0.2% 
> of memory were dirty, but scattered uniformly, there would be no win 
> from the two-level bitmap.  A loss, in fact: 2MB can be represented as 
> 512 bits or 64 bytes, just one cache line.  Any two-level thing will 
> need more.
>
> We might have a more compact encoding for sparse bitmaps, like 
> run-length encoding.
>
>>
>>>> In the short term, fixing (2) by accounting zero pages as full 
>>>> sized pages should "fix" the problem.
>>>>
>>>> In the long term, we need a new dirty bit interface from kvm.ko 
>>>> that uses a multi-level table.  That should dramatically improve 
>>>> scan performance. 
>>>
>>> Why would a multi-level table help?  (or rather, please explain what 
>>> you mean by a multi-level table).
>>>
>>> Something we could do is divide memory into more slots, and polling 
>>> each slot when we start to scan its page range.  That reduces the 
>>> time between sampling a page's dirtiness and sending it off, and 
>>> reduces the latency incurred by the sampling.  There are also 
>>> non-interface-changing ways to reduce this latency, like O(1) write 
>>> protection, or using dirty bits instead of write protection when 
>>> available.
>>
>> BTW, we should also refactor qemu to use the kvm dirty bitmap 
>> directly instead of mapping it to the main dirty bitmap.
>
> That's what the patch set I was alluding to did.  Or maybe I imagined 
> the whole thing.

No, it just split the main bitmap into three bitmaps.  I'm suggesting 
that we have the dirty interface have two implementations, one that 
refers to the 8-bit bitmap when TCG is in use and another one that uses the
KVM representation.

TCG really needs multiple dirty bits but KVM doesn't.  A shared 
implementation really can't be optimal.

>
>>>> We also need to implement live migration in a separate thread that 
>>>> doesn't carry qemu_mutex while it runs.
>>>
>>> IMO that's the biggest hit currently.
>>
>> Yup.  That's the Correct solution to the problem.
>
> Then let's just Do it.
>

Yup.

Regards,

Anthony Liguori
Anthony Liguori Nov. 30, 2010, 3 p.m. UTC | #19
On 11/30/2010 08:12 AM, Paolo Bonzini wrote:
> On 11/30/2010 02:47 PM, Anthony Liguori wrote:
>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>>
>>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>>> buffer gets filled up.
>>>
>>> The problem is that the file rate limit is not hit because work is
>>> done elsewhere. The rate can limit the bandwidth used and makes QEMU
>>> aware that socket operations may block (because that's what the
>>> buffered file freeze/unfreeze logic does); but it cannot be used to
>>> limit the _time_ spent in the migration code.
>>
>> Yes, it can, if you set the rate limit sufficiently low.
>
> You mean, just like you can drive a car without brakes by keeping the 
> speed sufficiently low.
>
>> [..] accounting zero pages as full sized
>> pages should "fix" the problem.
>
> I know you used quotes, but it's a very very generous definition of 
> fix.  Both these proposed "fixes" are nothing more than workarounds, 
> and even particularly ugly ones.  The worst thing about them is that 
> there is no guarantee of migration finishing in a reasonable time, or 
> at all.
>
> If you account zero pages as full, you don't use effectively the 
> bandwidth that was allotted to you, you use only 0.2% of it (8/4096). 
> It then takes an exaggerate amount of time to start iteration on pages 
> that matter.  If you set the bandwidth low, instead, you do not have 
> the bandwidth you need in order to converge.
>
> Even from an aesthetic point of view, if there is such a thing, I 
> don't understand why you advocate conflating network bandwidth and CPU 
> usage into a single measurement.  Nobody disagrees that all you 
> propose is nice to have, and that what Juan sent is a stopgap measure 
> (though a very effective one).  However, this doesn't negate that 
> Juan's accounting patches make a lot of sense in the current design.

Juan's patch, IIUC, does the following: If you've been iterating in a 
tight loop, return to the main loop for *one* iteration every 50ms.

But this means that during this 50ms period of time, a VCPU may be 
blocked from running.  If the guest isn't doing a lot of device I/O 
*and* you're on a relatively low link speed, then this will mean that 
you don't hold qemu_mutex for more than 50ms at a time.

But in the degenerate case where you have a high speed link and you have 
a guest doing a lot of device I/O, you'll see the guest VCPU being 
blocked for 50ms, then getting to run for a very brief period of time, 
followed by another block for 50ms.  The guest's execution will be 
extremely sporadic.

This isn't fixable with this approach.  The only way to really fix this 
is to say that over a given period of time, migration may only consume 
XX amount of CPU time which guarantees the VCPUs get the qemu_mutex for 
the rest of the time.

This is exactly what rate limiting does.  Yes, it results in a longer 
migration time but that's the trade-off we have to make if we want 
deterministic VCPU execution until we can implement threading properly.

If you want a simple example, do I/O with the rtl8139 adapter while
doing your migration test, and run a tight loop in the guest calling
gettimeofday().  Graph the results to see how much execution time the
guest is actually getting.
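
A minimal sketch of that in-guest measurement (the 10ms threshold is an
arbitrary choice):

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval prev, now;
    gettimeofday(&prev, NULL);
    for (;;) {
        gettimeofday(&now, NULL);
        long delta_us = (now.tv_sec - prev.tv_sec) * 1000000L
                      + (now.tv_usec - prev.tv_usec);
        if (delta_us > 10000)               /* log gaps longer than 10ms */
            printf("stall: %ld us\n", delta_us);
        prev = now;
    }
    return 0;
}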


>> In the long term, we need a new dirty bit interface from kvm.ko that
>> uses a multi-level table. That should dramatically improve scan
>> performance. We also need to implement live migration in a separate
>> thread that doesn't carry qemu_mutex while it runs.
>
> This may be a good way to fix it, but it's also basically a rewrite.

The only correct short-term solution I can see is rate limiting,
unfortunately.

Regards,

Anthony Liguori

> Paolo
> -- 
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juan Quintela Nov. 30, 2010, 5:43 p.m. UTC | #20
Avi Kivity <avi@redhat.com> wrote:
> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>>> What's the problem with burning that cpu?  per guest page,
>>> compressing takes less than sending.  Is it just an issue of qemu
>>> mutex hold time?
>>
>>
>> If you have a 512GB guest, then you have a 16MB dirty bitmap which
>> ends up being an 128MB dirty bitmap in QEMU because we represent
>> dirty bits with 8 bits.
>
> Was there not a patchset to split each bit into its own bitmap?  And
> then copy the kvm or qemu master bitmap into each client bitmap as it
> became needed?
>
>> Walking 16mb (or 128mb) of memory just fine find a few pages to send
>> over the wire is a big waste of CPU time.  If kvm.ko used a
>> multi-level table to represent dirty info, we could walk the memory
>> mapping at 2MB chunks allowing us to skip a large amount of the
>> comparisons.
>
> There's no reason to assume dirty pages would be clustered.  If 0.2%
> of memory were dirty, but scattered uniformly, there would be no win
> from the two-level bitmap.  A loss, in fact: 2MB can be represented as
> 512 bits or 64 bytes, just one cache line.  Any two-level thing will
> need more.
>
> We might have a more compact encoding for sparse bitmaps, like
> run-length encoding.


I haven't measured it, but I think that it would be much better that
way.  When we start, it doesn't matter too much (everything is dirty);
what we should optimize for is the last rounds, and in the last rounds
it would be much better to ask kvm:

fill this array with dirty page offsets, and be done with it.
Not sure if adding a size field would improve things; both options need to
be measured.
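
Something like this, completely made up (not an existing kvm.ko ioctl),
just to show the shape of the interface:

/* Userspace passes in a capacity; the kernel fills offsets[] with the
   frame numbers of dirty pages in the slot and reports how many. */
struct sketch_dirty_list {
    unsigned long capacity;        /* in:  size of the offsets[] array */
    unsigned long nr_dirty;        /* out: entries actually filled in  */
    unsigned long offsets[];       /* out: dirty page offsets in slot  */
};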

What would be a winner independent of that is a way to ask qemu for the
number of dirty pages.  Right now we need to calculate it by walking the
bitmap (one of my patches just simplifies this).

Adding the feature to qemu means that we could always give recent
information to "info migrate" without incurring a big cost.

>> BTW, we should also refactor qemu to use the kvm dirty bitmap
>> directly instead of mapping it to the main dirty bitmap.
>
> That's what the patch set I was alluding to did.  Or maybe I imagined
> the whole thing.

It existed.  And today it would be easier because KQEMU and VGA are not
needed anymore.

>>>> We also need to implement live migration in a separate thread that
>>>> doesn't carry qemu_mutex while it runs.
>>>
>>> IMO that's the biggest hit currently.
>>
>> Yup.  That's the Correct solution to the problem.
>
> Then let's just Do it.

Will take a look at splitting the qemu_mutex bit.

Later, Juan.
Juan Quintela Nov. 30, 2010, 5:59 p.m. UTC | #21
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 08:12 AM, Paolo Bonzini wrote:
>> On 11/30/2010 02:47 PM, Anthony Liguori wrote:
>>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:

> Juan's patch, IIUC, does the following: If you've been iterating in a
> tight loop, return to the main loop for *one* iteration every 50ms.

I don't think of it that way, more along the lines of: if you have spent
50ms, it's time to let others run O:-)  But yes, this is one of the changes.

> But this means that during this 50ms period of time, a VCPU may be
> blocked from running.  If the guest isn't doing a lot of device I/O
> *and* you're on a relatively low link speed, then this will mean that
> you don't hold qemu_mutex for more than 50ms at a time.
>
> But in the degenerate case where you have a high speed link and you
> have a guest doing a lot of device I/O, you'll see the guest VCPU
> being blocked for 50ms, then getting to run for a very brief period of
> time, followed by another block for 50ms.  The guest's execution will
> be extremely sporadic.

OK, let's go back a bit.  We have two problems:
- vcpu stalls
- main_loop stalls

My patch reduces stalls to a maximum of 50ms; right now they can be much
bigger.  I fully agree that doing ram_save_live without qemu_mutex is
an improvement.  But we are still hogging the main loop, so we need both
my patch and splitting qemu_mutex.

In other words, splitting qemu_mutex will fix half of the problem, not
all of it.

> This isn't fixable with this approach.  The only way to really fix
> this is to say that over a given period of time, migration may only
> consume XX amount of CPU time which guarantees the VCPUs get the
> qemu_mutex for the rest of the time.
>
> This is exactly what rate limiting does.  Yes, it results in a longer
> migration time but that's the trade-off we have to make if we want
> deterministic VCPU execution until we can implement threading
> properly.

I just give up on this discussion.  Basically I abused one thing (a
timeout) that was trivial to understand: if we have spent too much
time, leave others to run.

It is not a perfect solution, but it is a big improvement over what we
had.

It has been suggested that I use the other approach, and that I abuse
qemu_file_rate_limit(), creating a new qemu_file_write_nop() that only
increases the xfer bytes transmitted, and then trust the rate limiting
code to fix the problem.  How this is nicer/clearer is completely alien
to me.  That probably also reduces the stalls, perhaps (I have to look at
the interactions), but it is a complete lie: we are counting as
transferred stuff that we are not transferring :(

Why don't I like it?  Because what I found when I started is that this
rate_limit works very badly if we are not able to run the buffered_file
timer every 100ms.  If we lose ticks of that timer, our calculations are
wrong.

> If you want a simple example, doing I/O with the rtl8139 adapter while
> doing your migration test and run a tight loop in the get running
> gettimeofday().  Graph the results to see how much execution time the
> guest is actually getting.

I am still convinced that we need to limit the amount of time that an
io_handler can use.

And that we should write scary things to the logs when an io_handler
needs too much time.

There are two independent problems.  I fully agree that having
ram_save_live() without qemu_mutex is an improvement.  But it is an
improvement independent of my patch.

>>> In the long term, we need a new dirty bit interface from kvm.ko that
>>> uses a multi-level table. That should dramatically improve scan
>>> performance. We also need to implement live migration in a separate
>>> thread that doesn't carry qemu_mutex while it runs.
>>
>> This may be a good way to fix it, but it's also basically a rewrite.
>
> The only correct short term solution I can see if rate limiting
> unfortunately.

I still think that this is an inferior solution: we are using something
that limits the amount of data we write to the network in order to limit
the number of pages that we walk.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Paolo
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takuya Yoshikawa Dec. 1, 2010, 1:20 a.m. UTC | #22
On Tue, 30 Nov 2010 16:27:13 +0200
Avi Kivity <avi@redhat.com> wrote:

> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
> >> What's the problem with burning that cpu?  per guest page, 
> >> compressing takes less than sending.  Is it just an issue of qemu 
> >> mutex hold time?
> >
> >
> > If you have a 512GB guest, then you have a 16MB dirty bitmap which 
> > ends up being an 128MB dirty bitmap in QEMU because we represent dirty 
> > bits with 8 bits.
> 
> Was there not a patchset to split each bit into its own bitmap?  And 
> then copy the kvm or qemu master bitmap into each client bitmap as it 
> became needed?
> 
> > Walking 16mb (or 128mb) of memory just fine find a few pages to send 
> > over the wire is a big waste of CPU time.  If kvm.ko used a 
> > multi-level table to represent dirty info, we could walk the memory 
> > mapping at 2MB chunks allowing us to skip a large amount of the 
> > comparisons.
> 
> There's no reason to assume dirty pages would be clustered.  If 0.2% of 
> memory were dirty, but scattered uniformly, there would be no win from 
> the two-level bitmap.  A loss, in fact: 2MB can be represented as 512 
> bits or 64 bytes, just one cache line.  Any two-level thing will need more.
> 
> We might have a more compact encoding for sparse bitmaps, like 
> run-length encoding.
> 

Is anyone profiling these dirty bitmap things?

 - Is a 512GB guest really the target?
 - How much CPU time can we use for these things?
 - How many dirty pages do we have to care about?

Since we are planning to do some profiling for these, taking into account
Kemari, can you please share this information?


> >
> >>> In the short term, fixing (2) by accounting zero pages as full sized 
> >>> pages should "fix" the problem.
> >>>
> >>> In the long term, we need a new dirty bit interface from kvm.ko that 
> >>> uses a multi-level table.  That should dramatically improve scan 
> >>> performance. 
> >>
> >> Why would a multi-level table help?  (or rather, please explain what 
> >> you mean by a multi-level table).
> >>
> >> Something we could do is divide memory into more slots, and polling 
> >> each slot when we start to scan its page range.  That reduces the 
> >> time between sampling a page's dirtiness and sending it off, and 
> >> reduces the latency incurred by the sampling.  There are also 

If we use the rmap approach with one more interface, we can specify which
range of the dirty bitmap to get. This has the same effect as splitting
into more slots.


> >> non-interface-changing ways to reduce this latency, like O(1) write 
> >> protection, or using dirty bits instead of write protection when 
> >> available.

IIUC, O(1) will lazily write protect pages beginning from the top level?
Does this have any impact other than the timing of get_dirty_log()?


Thanks,
  Takuya


> >
> > BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
> > instead of mapping it to the main dirty bitmap.
> 
> That's what the patch set I was alluding to did.  Or maybe I imagined 
> the whole thing.
> 
> >>> We also need to implement live migration in a separate thread that 
> >>> doesn't carry qemu_mutex while it runs.
> >>
> >> IMO that's the biggest hit currently.
> >
> > Yup.  That's the Correct solution to the problem.
> 
> Then let's just Do it.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juan Quintela Dec. 1, 2010, 1:52 a.m. UTC | #23
Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:
> On Tue, 30 Nov 2010 16:27:13 +0200
> Avi Kivity <avi@redhat.com> wrote:

> Does anyone is profiling these dirty bitmap things?

I am.

>  - 512GB guest is really the target?

No, problems exist with smaller amounts of RAM.  With a 16GB guest it is
trivial to get 1s stalls; with a 64GB guest, 3-4s; with more memory,
migration is flaky to say the least.

>  - how much cpu time can we use for these things?

The problem here is that we are forced to walk the bitmap too many
times; we want to do it fewer times.

>  - how many dirty pages do we have to care?

With default values and assuming 1 Gigabit ethernet for ourselves, ~9.5MB
of dirty pages to have only 30ms of downtime.

But notice that this is what we are advertising; we aren't near there at all.

> Since we are planning to do some profiling for these, taking into account
> Kemari, can you please share these information?

If you look at the 0/10 email with this setup, you can see how much time
we are spending on stuff.  Just now (for migration; kemari is a bit
different) we have to fix other things first.

The next item for me is to improve that bitmap handling (we can at least
trivially divide the space used by 8, and use ffs to find dirty pages).
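
Roughly, as an illustration only (ffsl() is the glibc long variant):

#define _GNU_SOURCE
#include <string.h>                         /* ffsl() */

/* Return the first dirty page number at or after 'start', or -1. */
static long sketch_next_dirty(const unsigned long *bitmap, long nbits,
                              long start)
{
    const long bpw = 8 * sizeof(unsigned long);      /* bits per word */
    for (long i = start / bpw; i < (nbits + bpw - 1) / bpw; i++) {
        unsigned long w = bitmap[i];
        if (i == start / bpw)
            w &= ~0UL << (start % bpw);     /* skip bits below 'start' */
        if (w)
            return i * bpw + ffsl(w) - 1;
    }
    return -1;
}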

I am thinking about changing the kvm interface when it becomes the
bottleneck (it is not at this point).

>> >>> In the short term, fixing (2) by accounting zero pages as full sized 
>> >>> pages should "fix" the problem.
>> >>>
>> >>> In the long term, we need a new dirty bit interface from kvm.ko that 
>> >>> uses a multi-level table.  That should dramatically improve scan 
>> >>> performance. 
>> >>
>> >> Why would a multi-level table help?  (or rather, please explain what 
>> >> you mean by a multi-level table).
>> >>
>> >> Something we could do is divide memory into more slots, and polling 
>> >> each slot when we start to scan its page range.  That reduces the 
>> >> time between sampling a page's dirtiness and sending it off, and 
>> >> reduces the latency incurred by the sampling.  There are also 
>
> If we use rmap approach with one more interface, we can specify which
> range of dirty bitmap to get. This has the same effect to splitting into
> more slots.

kvm allows us to do that today.  It is qemu that doesn't use this
information; qemu always asks for the whole memory.  kvm is happy to
give only a range.

>> >> non-interface-changing ways to reduce this latency, like O(1) write 
>> >> protection, or using dirty bits instead of write protection when 
>> >> available.
>
> IIUC, O(1) will lazily write protect pages beggining from top level?
> Does this have any impact other than the timing of get_dirty_log()?

dunno.

At this point I am trying to:
- get migration with 16-64GB guests to not have stalls.
- get infrastructure to be able to know what is going on.

So far, the bigger stalls are gone, and we are discussing what to do next.
As Anthony suggested, I ran the ram_save_live() loop without qemu_mutex,
and now guests get much better interaction, but my current patch (for
this) is just to put qemu_mutex_unlock_iothread()/qemu_mutex_lock_iothread()
around it.  I think that we are racy with the access to the bitmap, but
it was just a test.

With respect to Kemari, we can discuss what you need and how you are
going to test, just so we don't do overlapping work.

Thanks, Juan.
Takuya Yoshikawa Dec. 1, 2010, 2:22 a.m. UTC | #24
On Wed, 01 Dec 2010 02:52:08 +0100
Juan Quintela <quintela@redhat.com> wrote:

> > Since we are planning to do some profiling for these, taking into account
> > Kemari, can you please share this information?
> 
> If you look at the 0/10 email with this setup, you can see how much time
> we are spending on stuff.  Just now (for migration; kemari is a bit
> different) we have to fix other things first.
> 

Thank you for the information.
  Sorry, I only had the [9/10] in my kvm mailbox and did not notice this [0/10].

> >> >> non-interface-changing ways to reduce this latency, like O(1) write 
> >> >> protection, or using dirty bits instead of write protection when 
> >> >> available.
> >
> > IIUC, O(1) will lazily write protect pages beginning from the top level?
> > Does this have any impact other than the timing of get_dirty_log()?
> 
> dunno.
> 
> At this point I am trying to:
> - get migration with 16-64GB guests to not have stalls.
> - get infrastructure to be able to know what is going on.
> 
> So far, the bigger stalls are gone, and we are discussing what to do next.  As
> Anthony suggested, I ran the ram_save_live() loop without qemu_mutex, and now
> guests get much better interaction, but my current patch (for this) is
> just to put qemu_mutex_unlock_iothread()/qemu_mutex_lock_iothread()
> around it.  I think that we are racy with the access to the bitmap, but
> it was just a test.
> 
> With respect to Kemari, we can discuss what you need and how you are
> going to test, so that we don't do overlapping work.

I see, we've just started to talk about what we have to achieve and what we
can expect. I need to talk with Tamura-san about this a bit more before
showing some example targets.

Thanks,
  Takuya
Avi Kivity Dec. 1, 2010, 12:35 p.m. UTC | #25
On 12/01/2010 03:52 AM, Juan Quintela wrote:
> >   - 512GB guest is really the target?
>
> no, problems exist with smaller amounts of RAM.  with a 16GB guest it is
> trivial to get 1s stalls; with a 64GB guest, 3-4s; with more memory, migration
> is flaky to say the least.
>
> >   - how much cpu time can we use for these things?
>
> the problem here is that we are forced to walk the bitmap too many
> times; we want to do it fewer times.

How much time is spent walking bitmaps?  Are you sure this is the problem?

> >   - how many dirty pages do we have to care?
>
> default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> dirty pages to have only 30ms of downtime.

1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
Avi Kivity Dec. 1, 2010, 12:40 p.m. UTC | #26
On 11/30/2010 04:50 PM, Anthony Liguori wrote:
>> That's what the patch set I was alluding to did.  Or maybe I imagined 
>> the whole thing.
>
>
> No, it just split the main bitmap into three bitmaps.  I'm suggesting 
> that we have the dirty interface have two implementations, one that 
> refers to the 8-bit bitmap when TCG in use and another one that uses 
> the KVM representation.
>
> TCG really needs multiple dirty bits but KVM doesn't.  A shared 
> implementation really can't be optimal.

Live migration and the framebuffer can certainly share code with kvm and 
tcg:

- tcg or kvm maintain an internal bitmap (kvm in the kernel, tcg updates 
a private bitmap)
- a dirty log client wants to see an updated bitmap; migration on a new 
pass, vga on screen refresh
    - ask the producer (kvm or tcg) to fetch-and-clear a dirty bitmap
    - broadcast it ( |= ) into any active clients (migration or framebuffer)
- everyone's happy

The code dirty thing might need special treatment; we can have a special
tcg-only bitmap for it.
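
In code the shared part is small; something like this (sketch only, with
made-up names; the real thing would hook into the existing dirty memory
tracking code):

#include <stddef.h>

#define MAX_DIRTY_CLIENTS 2    /* e.g. migration + framebuffer */

static unsigned long *dirty_clients[MAX_DIRTY_CLIENTS];  /* client bitmaps */
static int nb_dirty_clients;

/* producer_bitmap is assumed to have been fetched-and-cleared from kvm
 * (in the kernel) or from tcg's private bitmap by the caller */
static void dirty_log_broadcast(const unsigned long *producer_bitmap,
                                size_t words)
{
    int c;
    size_t i;

    for (c = 0; c < nb_dirty_clients; c++) {
        for (i = 0; i < words; i++) {
            dirty_clients[c][i] |= producer_bitmap[i];
        }
    }
}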
Juan Quintela Dec. 1, 2010, 1:45 p.m. UTC | #27
Avi Kivity <avi@redhat.com> wrote:
> On 12/01/2010 03:52 AM, Juan Quintela wrote:
>> >   - 512GB guest is really the target?
>>
> >> no, problems exist with smaller amounts of RAM.  with a 16GB guest it is
> >> trivial to get 1s stalls; with a 64GB guest, 3-4s; with more memory, migration
> >> is flaky to say the least.
>>
>> >   - how much cpu time can we use for these things?
>>
> >> the problem here is that we are forced to walk the bitmap too many
> >> times; we want to do it fewer times.
>
> How much time is spent walking bitmaps?  Are you sure this is the problem?

see my 10/10 patch; it moves ram_save_live() from 80% of the time in the
profile down to second place, at ~5% or so.

The important bit is:

 static uint64_t ram_save_remaining(void)
 {
-    RAMBlock *block;
-    uint64_t count = 0;
-
-    QLIST_FOREACH(block, &ram_list.blocks, next) {
-        ram_addr_t addr;
-        for (addr = block->offset; addr < block->offset + block->length;
-             addr += TARGET_PAGE_SIZE) {
-            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
-                count++;
-            }
-        }
-    }
-
-    return count;
+    return ram_list.dirty_pages;
 }


We don't need to walk the whole bitmap to see how much memory is remaining.
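
The counter stays correct by updating it wherever the dirty bit changes,
roughly like this (the helper name is made up for the example; the real
patch hooks this into the existing cpu_physical_memory_* dirty tracking):

static void migration_page_set_dirty(ram_addr_t addr, int dirty)
{
    int was_dirty = cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG);

    if (dirty && !was_dirty) {
        ram_list.dirty_pages++;
    } else if (!dirty && was_dirty) {
        ram_list.dirty_pages--;
    }
    /* ...and then update the MIGRATION_DIRTY_FLAG bit itself, as the
     * existing code already does */
}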

syncing the bitmap is also expensive, but just now we have other issues
that hide it.  That is the reason I haven't started trying to get a
better interface with kvm; first I will try to remove the bigger
bottlenecks.

>> >   - how many dirty pages do we have to care?
>>
>> default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
>> dirty pages to have only 30ms of downtime.
>
> 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.

I will learn to do math with time and bytes at some point O:-)

Later, Juan.
Takuya Yoshikawa Dec. 2, 2010, 1:31 a.m. UTC | #28
Thanks for the answers Avi, Juan,

Some FYI, (not about the bottleneck)

On Wed, 01 Dec 2010 14:35:57 +0200
Avi Kivity <avi@redhat.com> wrote:

> > >   - how many dirty pages do we have to care?
> >
> > default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> > dirty pages to have only 30ms of downtime.
> 
> 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> 

3MB / 4KB/page = 750 pages.

Then, KVM side processing is near the theoretical goal!

In my framebuffer test, I tested

  nr_dirty_pages/npages = 576/4096

case at a rate of 20 updates/s (1 update/50ms).

Using the rmap optimization, write protection took only 46,718 TSC cycles.
Bitmap copy was not a problem of course.

The display was working anyway at this rate!


My guess is that, within 1,000 dirty pages, kvm_vm_ioctl_get_dirty_log()
can be processed within 200us or so even for a large RAM slot.
 - the rmap optimization depends mainly on nr_dirty_pages, not on npages.

Avi, can you guess the properties of O(1) write protection?
I want to test the rmap optimization taking these issues into account.



Of course, Kemari has to continue synchronization, and may see
more dirty pages. This will be a future task!


Thanks,
  Takuya
Avi Kivity Dec. 2, 2010, 8:37 a.m. UTC | #29
On 12/02/2010 03:31 AM, Takuya Yoshikawa wrote:
> Thanks for the answers Avi, Juan,
>
> Some FYI, (not about the bottleneck)
>
> On Wed, 01 Dec 2010 14:35:57 +0200
> Avi Kivity<avi@redhat.com>  wrote:
>
> >  >  >    - how many dirty pages do we have to care?
> >  >
> >  >  default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> >  >  dirty pages to have only 30ms of downtime.
> >
> >  1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> >
>
> 3MB / 4KB/page = 750 pages.
>
> Then, KVM side processing is near the theoretical goal!
>
> In my framebuffer test, I tested
>
>    nr_dirty_pages/npages = 576/4096
>
> case at a rate of 20 updates/s (1 update/50ms).
>
> Using the rmap optimization, write protection took only 46,718 TSC cycles.

Yes, using rmap to drive write protection with sparse dirty bitmaps 
really helps.

> Bitmap copy was not a problem of course.
>
> The display was working anyway at this rate!
>
>
> My guess is that, within 1,000 dirty pages, kvm_vm_ioctl_get_dirty_log()
> can be processed within 200us or so even for a large RAM slot.
>   - the rmap optimization depends mainly on nr_dirty_pages, not on npages.
>
> Avi, can you guess the properties of O(1) write protection?
> I want to test the rmap optimization taking these issues into account.

I think we should use O(1) write protection only if there is a large 
number of dirty pages.  With a small number, using rmap guided by the 
previous dirty bitmap is faster.

So, under normal operation where only the framebuffer is logged, we'd 
use rmap write protection; when enabling logging for live migration we'd 
use O(1) write protection; after a few iterations, when the number of 
dirty pages drops, we'd switch back to rmap write protection.
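
Something like this policy, in pseudo-C (the threshold and both helpers are
invented names, not existing kvm functions; it only illustrates the
switch-over):

#define O1_WP_THRESHOLD 10000    /* made-up cut-off, would need measuring */

void wp_slot_all_o1(int slot_id);          /* protect everything via top levels */
void wp_slot_dirty_via_rmap(int slot_id);  /* walk last pass's dirty pages */

static void write_protect_for_next_pass(int slot_id,
                                        unsigned long dirty_last_pass)
{
    if (dirty_last_pass > O1_WP_THRESHOLD) {
        /* many dirty pages (early migration passes) */
        wp_slot_all_o1(slot_id);
    } else {
        /* few dirty pages (framebuffer only, or late migration passes) */
        wp_slot_dirty_via_rmap(slot_id);
    }
}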

> Of course, Kemari has to continue synchronization, and may see
> more dirty pages. This will be a future task!
>

There's yet another option, of using dirty bits instead of write 
protection.  Or maybe using write protection in the upper page tables 
and dirty bits in the lowest level.
diff mbox

Patch

diff --git a/arch_init.c b/arch_init.c
index d32aaae..b463798 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,7 @@ 
 #include "net.h"
 #include "gdbstub.h"
 #include "hw/smbios.h"
+#include "buffered_file.h"

 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -218,6 +219,7 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t bytes_transferred_last;
     uint64_t t0;
     double bwidth = 0;
+    int i;

     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
@@ -261,6 +263,7 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     bytes_transferred_last = bytes_transferred;
     t0 = qemu_get_clock_ns(rt_clock);

+    i = 0;
     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;

@@ -269,6 +272,19 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         if (bytes_sent == 0) { /* no more blocks */
             break;
         }
+	/* we want to check in the 1st loop, just in case it was the 1st time
+           and we had to sync the dirty bitmap.
+           qemu_get_clock_ns() is a bit expensive, so we only check each some
+           iterations
+	*/
+        if ((i & 63) == 0) {
+            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
+            if (t1 > buffered_file_interval/2) {
+                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
+		break;
+	    }
+	}
+        i++;
     }

     t0 = qemu_get_clock_ns(rt_clock) - t0;