mbox

[RFC,00/27] Migration thread (WIP)

Message ID 1343155012-26316-1-git-send-email-quintela@redhat.com
State New
Headers show

Pull-request

http://repo.or.cz/r/qemu/quintela.git migration-thread-v1

Message

Juan Quintela July 24, 2012, 6:36 p.m. UTC
Hi

This series are on top of the migration-next-v5 series just posted.

First of all, this is an RFC/Work in progress.  Just a lot of people
asked for it, and I would like review of the design.

It does:
- get a new bitmap for migration, and that bitmap uses 1 bit by page
- it unfolds migration_buffered_file.  Only one user existed.
- it simplifies buffered_file a lot.

- About the migration thread, special attention was giving to try to
  get the series review-able (reviewers would tell if I got it).

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything
  except the waits with the iothread lock.
- we move all the writting to outside the iothread lock.  i.e.
  we walk the state with the iothread hold, and copy everything to one buffer.
  then we write that buffer to the sockets outside the iothread lock.
- once here, we move to writting synchronously to the sockets.
- this allows us to simplify quite a lot.

And basically, that is it.  Notice that we still do the iterate page
walking with the iothread held.  Light testing show that we got
similar speed and latencies than without the thread (notice that
almost no optimizations done here yet).

Appart of the review:
- Are there any locking issues that I have missed (I guess so)
- stop all cpus correctly.  vm_stop should be called from the iothread,
  I use the trick of using a bottom half to get that working correctly.
  but this _implementation_ is ugly as hell.  Is there an easy way
  of doing it?
- Do I really have to export last_ram_offset(), there is no other way
  of knowing the ammount of RAM?

Known issues:

- for some reason, when it has to start a 2nd round of bitmap
  handling, it decides to dirty all pages.  Haven't found still why
  this happens.

If you can test it, and said me where it breaks, it would also help.

Work is based on Umesh thread work, and work that Paolo Bonzini had
work on top of that.  All the mirgation thread was done from scratch
becase I was unable to debug why it was failing, but it "owes" a lot
to the previous design.

Thanks in advance, Juan.

The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:

  Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)

are available in the git repository at:


  http://repo.or.cz/r/qemu/quintela.git migration-thread-v1

for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:

  buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)


Juan Quintela (23):
  buffered_file: g_realloc() can't fail
  savevm: Factorize ram globals reset in its own function
  ram: introduce migration_bitmap_set_dirty()
  ram: Introduce migration_bitmap_test_and_reset_dirty()
  ram: Export last_ram_offset()
  ram: introduce migration_bitmap_sync()
  Separate migration bitmap
  buffered_file: rename opaque to migration_state
  buffered_file: opaque is MigrationState
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: unfold migrate_fd_put_ready
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: We can access directly to bandwidth_limit
  buffered_file: Move from using a timer to use a thread
  migration: make qemu_fopen_ops_buffered() return void
  migration: stop all cpus correctly
  migration: make writes blocking
  migration: remove unfreeze logic
  migration: take finer locking
  buffered_file: Unfold the trick to restart generating migration data
  buffered_file: don't flush on put buffer
  buffered_file: unfold buffered_append in buffered_put_buffer

Paolo Bonzini (2):
  split MRU ram list
  BufferedFile: append, then flush

Umesh Deshpande (2):
  add a version number to ram_list
  protect the ramlist with a separate mutex

 arch_init.c      |  108 +++++++++++++++++++++++++-------
 buffered_file.c  |  179 +++++++++++++++++-------------------------------------
 buffered_file.h  |   12 +---
 cpu-all.h        |   17 +++++-
 exec-obsolete.h  |   10 ---
 exec.c           |   45 +++++++++++---
 migration-exec.c |    2 -
 migration-fd.c   |    6 --
 migration-tcp.c  |    2 +-
 migration-unix.c |    2 -
 migration.c      |  111 ++++++++++++++-------------------
 migration.h      |    6 ++
 qemu-file.h      |    5 --
 savevm.c         |    5 --
 14 files changed, 249 insertions(+), 261 deletions(-)

Comments

Orit Wasserman July 25, 2012, 9:55 a.m. UTC | #1
Hi,
I started with some performance testing , the results look very good, one of the workload that wasn't converging with upstream version (migration completed only when I stopped the workload) seems to converge.

I used FC16 guest with 1G ram , default speed and downtime

workload     upstream transferred   upstream total time  thread transferred   thread total time
boot	      372028k		    14402ms 		 463966k	      2764ms
idle	      496512k	            14741ms		 976334k	      2847ms
iozone        913535k               27346ms              108872k              3021ms
stressapptest  stopped the workload after 400s           1827221k             4998ms

Regards,
Orit

On 07/24/2012 09:36 PM, Juan Quintela wrote:
> Hi
> 
> This series are on top of the migration-next-v5 series just posted.
> 
> First of all, this is an RFC/Work in progress.  Just a lot of people
> asked for it, and I would like review of the design.
> 
> It does:
> - get a new bitmap for migration, and that bitmap uses 1 bit by page
> - it unfolds migration_buffered_file.  Only one user existed.
> - it simplifies buffered_file a lot.
> 
> - About the migration thread, special attention was giving to try to
>   get the series review-able (reviewers would tell if I got it).
> 
> Basic design:
> - we create a new thread instead of a timer function
> - we move all the migration work to that thread (but run everything
>   except the waits with the iothread lock.
> - we move all the writting to outside the iothread lock.  i.e.
>   we walk the state with the iothread hold, and copy everything to one buffer.
>   then we write that buffer to the sockets outside the iothread lock.
> - once here, we move to writting synchronously to the sockets.
> - this allows us to simplify quite a lot.
> 
> And basically, that is it.  Notice that we still do the iterate page
> walking with the iothread held.  Light testing show that we got
> similar speed and latencies than without the thread (notice that
> almost no optimizations done here yet).
> 
> Appart of the review:
> - Are there any locking issues that I have missed (I guess so)
> - stop all cpus correctly.  vm_stop should be called from the iothread,
>   I use the trick of using a bottom half to get that working correctly.
>   but this _implementation_ is ugly as hell.  Is there an easy way
>   of doing it?
> - Do I really have to export last_ram_offset(), there is no other way
>   of knowing the ammount of RAM?
> 
> Known issues:
> 
> - for some reason, when it has to start a 2nd round of bitmap
>   handling, it decides to dirty all pages.  Haven't found still why
>   this happens.
> 
> If you can test it, and said me where it breaks, it would also help.
> 
> Work is based on Umesh thread work, and work that Paolo Bonzini had
> work on top of that.  All the mirgation thread was done from scratch
> becase I was unable to debug why it was failing, but it "owes" a lot
> to the previous design.
> 
> Thanks in advance, Juan.
> 
> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
> 
>   Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
> 
> are available in the git repository at:
> 
> 
>   http://repo.or.cz/r/qemu/quintela.git migration-thread-v1
> 
> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
> 
>   buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
> 
> 
> Juan Quintela (23):
>   buffered_file: g_realloc() can't fail
>   savevm: Factorize ram globals reset in its own function
>   ram: introduce migration_bitmap_set_dirty()
>   ram: Introduce migration_bitmap_test_and_reset_dirty()
>   ram: Export last_ram_offset()
>   ram: introduce migration_bitmap_sync()
>   Separate migration bitmap
>   buffered_file: rename opaque to migration_state
>   buffered_file: opaque is MigrationState
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_ready
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: We can access directly to bandwidth_limit
>   buffered_file: Move from using a timer to use a thread
>   migration: make qemu_fopen_ops_buffered() return void
>   migration: stop all cpus correctly
>   migration: make writes blocking
>   migration: remove unfreeze logic
>   migration: take finer locking
>   buffered_file: Unfold the trick to restart generating migration data
>   buffered_file: don't flush on put buffer
>   buffered_file: unfold buffered_append in buffered_put_buffer
> 
> Paolo Bonzini (2):
>   split MRU ram list
>   BufferedFile: append, then flush
> 
> Umesh Deshpande (2):
>   add a version number to ram_list
>   protect the ramlist with a separate mutex
> 
>  arch_init.c      |  108 +++++++++++++++++++++++++-------
>  buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>  buffered_file.h  |   12 +---
>  cpu-all.h        |   17 +++++-
>  exec-obsolete.h  |   10 ---
>  exec.c           |   45 +++++++++++---
>  migration-exec.c |    2 -
>  migration-fd.c   |    6 --
>  migration-tcp.c  |    2 +-
>  migration-unix.c |    2 -
>  migration.c      |  111 ++++++++++++++-------------------
>  migration.h      |    6 ++
>  qemu-file.h      |    5 --
>  savevm.c         |    5 --
>  14 files changed, 249 insertions(+), 261 deletions(-)
>
Jan Kiszka July 26, 2012, 10:57 a.m. UTC | #2
On 2012-07-24 20:36, Juan Quintela wrote:
> Hi
> 
> This series are on top of the migration-next-v5 series just posted.
> 
> First of all, this is an RFC/Work in progress.  Just a lot of people
> asked for it, and I would like review of the design.
> 
> It does:
> - get a new bitmap for migration, and that bitmap uses 1 bit by page
> - it unfolds migration_buffered_file.  Only one user existed.
> - it simplifies buffered_file a lot.
> 
> - About the migration thread, special attention was giving to try to
>   get the series review-able (reviewers would tell if I got it).
> 
> Basic design:
> - we create a new thread instead of a timer function
> - we move all the migration work to that thread (but run everything
>   except the waits with the iothread lock.
> - we move all the writting to outside the iothread lock.  i.e.
>   we walk the state with the iothread hold, and copy everything to one buffer.
>   then we write that buffer to the sockets outside the iothread lock.
> - once here, we move to writting synchronously to the sockets.
> - this allows us to simplify quite a lot.
> 
> And basically, that is it.  Notice that we still do the iterate page
> walking with the iothread held.  Light testing show that we got
> similar speed and latencies than without the thread (notice that
> almost no optimizations done here yet).
> 
> Appart of the review:
> - Are there any locking issues that I have missed (I guess so)
> - stop all cpus correctly.  vm_stop should be called from the iothread,
>   I use the trick of using a bottom half to get that working correctly.
>   but this _implementation_ is ugly as hell.  Is there an easy way
>   of doing it?

vm_stop is prepared to be called from vcpu context as well. I'm not sure
right now if we actually do, but the code is there.

Jan
Juan Quintela July 26, 2012, 11:16 a.m. UTC | #3
Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-07-24 20:36, Juan Quintela wrote:
>> Hi

>> Appart of the review:
>> - Are there any locking issues that I have missed (I guess so)
>> - stop all cpus correctly.  vm_stop should be called from the iothread,
>>   I use the trick of using a bottom half to get that working correctly.
>>   but this _implementation_ is ugly as hell.  Is there an easy way
>>   of doing it?
>
> vm_stop is prepared to be called from vcpu context as well. I'm not sure
> right now if we actually do, but the code is there.

But this is a migation_thread (i.e. neither iothread of vcpu), and we
need to wait for vm_stop to finish.  My reading is that in vcpu context,
we just ask the iothread to stop all cpus.

void vm_stop(RunState state)
{
    if (!qemu_thread_is_self(&io_thread)) {
        qemu_system_vmstop_request(state);
        /*
         * FIXME: should not return to device code in case
         * vm_stop() has been requested.
         */
        cpu_stop_current();
        return;
    }
    do_vm_stop(state);
}

Or I am reading it wrong?

Later, Juan.
Jan Kiszka July 26, 2012, 11:56 a.m. UTC | #4
On 2012-07-26 13:16, Juan Quintela wrote:
> Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-07-24 20:36, Juan Quintela wrote:
>>> Hi
> 
>>> Appart of the review:
>>> - Are there any locking issues that I have missed (I guess so)
>>> - stop all cpus correctly.  vm_stop should be called from the iothread,
>>>   I use the trick of using a bottom half to get that working correctly.
>>>   but this _implementation_ is ugly as hell.  Is there an easy way
>>>   of doing it?
>>
>> vm_stop is prepared to be called from vcpu context as well. I'm not sure
>> right now if we actually do, but the code is there.
> 
> But this is a migation_thread (i.e. neither iothread of vcpu), and we
> need to wait for vm_stop to finish.  My reading is that in vcpu context,
> we just ask the iothread to stop all cpus.
> 
> void vm_stop(RunState state)
> {
>     if (!qemu_thread_is_self(&io_thread)) {
>         qemu_system_vmstop_request(state);
>         /*
>          * FIXME: should not return to device code in case
>          * vm_stop() has been requested.
>          */
>         cpu_stop_current();
>         return;
>     }
>     do_vm_stop(state);
> }
> 
> Or I am reading it wrong?

Ah, indeed. Did you try top make this service ready for such a use case
(sorry, didn't look at the code yet)? Something like issuing
qemu_system_vmstop_request and then resuming the with the next step on a
vmstate change notification?

Jan
Chegu Vinod July 26, 2012, 6:41 p.m. UTC | #5
> -------- Original Message --------
> Subject: 	[Qemu-devel] [RFC 00/27] Migration thread (WIP)
> Date: 	Tue, 24 Jul 2012 20:36:25 +0200
> From: 	Juan Quintela <quintela@redhat.com>
> To: 	qemu-devel@nongnu.org
>
>
>
> Hi
>
> This series are on top of the migration-next-v5 series just posted.
>
> First of all, this is an RFC/Work in progress.  Just a lot of people
> asked for it, and I would like review of the design.
Hello,

Thanks for sharing this early/WIP version for evaluation.

Still in the middle of  code review..but wanted to share a couple of 
quick  observations.
'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and 
downtime 2s).
Once with no workload (i.e. idle guest) and the second was with a 
SpecJBB running in the guest.

The idle guest case seemed to migrate fine...


capabilities: xbzrle: off
Migration status: completed
transferred ram: 3811345 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 199743 milliseconds


In the case of the SpecJBB I ran into issues during stage 3...the source 
host's qemu and the guest hung. I need to debug this more... (if  
already have some hints pl. let me know.).


capabilities: xbzrle: off
Migration status: active
transferred ram: 127618578 kbytes
remaining ram: 2386832 kbytes
total ram: 134226368 kbytes
total time: 526139 milliseconds
(qemu) qemu_savevm_state_complete called
qemu_savevm_state_complete calling ram_save_complete

<---  hung somewhere after this ('need to get more info).


---

As with the non-migration-thread version the Specjbb workload completed 
before the migration attempted to move to stage 3 (i.e. didn't converge 
while the workload was still active).

BTW, with this version of the bits (i.e. while running SpecJBB which is 
supposed to dirty quite a bit of memory) I noticed that there wasn't 
much change in the b/w usage of the dedicated 10Gb private network link 
(It was still < ~1.5-3.0Gb/sec).   Expected this to be a little better 
since we have a separate thread...  not sure what else is in play here ? 
(numa locality of where the migration thread runs or something other 
basic tuning in the implementation ?)

'have a hi-level design question... (perhaps folks have already thought 
about it..and categorized it as potential future optimization..?)

Would it be possible to off load the iothread completely [from all 
migration related activity] and have one thread (with the appropriate 
protection) get involved with getting the list of the dirty pages ? Have 
one or more threads dedicated for trying to push multiple streams of 
data to saturate the allocated network bandwidth ?  This may help in 
large + busy guests. Comments?    There  are perhaps other implications 
of doing all of this (like burning more host cpu cycles) but perhaps 
this can be configurable based on user's needs... e.g. fewer but large 
guests on a host with no over subscription.

Thanks
Vinod


> It does:
> - get a new bitmap for migration, and that bitmap uses 1 bit by page
> - it unfolds migration_buffered_file.  Only one user existed.
> - it simplifies buffered_file a lot.
>
> - About the migration thread, special attention was giving to try to
>    get the series review-able (reviewers would tell if I got it).
>
> Basic design:
> - we create a new thread instead of a timer function
> - we move all the migration work to that thread (but run everything
>    except the waits with the iothread lock.
> - we move all the writting to outside the iothread lock.  i.e.
>    we walk the state with the iothread hold, and copy everything to one buffer.
>    then we write that buffer to the sockets outside the iothread lock.
> - once here, we move to writting synchronously to the sockets.
> - this allows us to simplify quite a lot.
>
> And basically, that is it.  Notice that we still do the iterate page
> walking with the iothread held.  Light testing show that we got
> similar speed and latencies than without the thread (notice that
> almost no optimizations done here yet).
>
> Appart of the review:
> - Are there any locking issues that I have missed (I guess so)
> - stop all cpus correctly.  vm_stop should be called from the iothread,
>    I use the trick of using a bottom half to get that working correctly.
>    but this _implementation_ is ugly as hell.  Is there an easy way
>    of doing it?
> - Do I really have to export last_ram_offset(), there is no other way
>    of knowing the ammount of RAM?
>
> Known issues:
>
> - for some reason, when it has to start a 2nd round of bitmap
>    handling, it decides to dirty all pages.  Haven't found still why
>    this happens.
>
> If you can test it, and said me where it breaks, it would also help.
>
> Work is based on Umesh thread work, and work that Paolo Bonzini had
> work on top of that.  All the mirgation thread was done from scratch
> becase I was unable to debug why it was failing, but it "owes" a lot
> to the previous design.
>
> Thanks in advance, Juan.
>
> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
>
>    Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
>
> are available in the git repository at:
>
>
>    http://repo.or.cz/r/qemu/quintela.git  migration-thread-v1
>
> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
>
>    buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
>
>
> Juan Quintela (23):
>    buffered_file: g_realloc() can't fail
>    savevm: Factorize ram globals reset in its own function
>    ram: introduce migration_bitmap_set_dirty()
>    ram: Introduce migration_bitmap_test_and_reset_dirty()
>    ram: Export last_ram_offset()
>    ram: introduce migration_bitmap_sync()
>    Separate migration bitmap
>    buffered_file: rename opaque to migration_state
>    buffered_file: opaque is MigrationState
>    buffered_file: unfold migrate_fd_put_buffer
>    buffered_file: unfold migrate_fd_put_ready
>    buffered_file: unfold migrate_fd_put_buffer
>    buffered_file: unfold migrate_fd_put_buffer
>    buffered_file: We can access directly to bandwidth_limit
>    buffered_file: Move from using a timer to use a thread
>    migration: make qemu_fopen_ops_buffered() return void
>    migration: stop all cpus correctly
>    migration: make writes blocking
>    migration: remove unfreeze logic
>    migration: take finer locking
>    buffered_file: Unfold the trick to restart generating migration data
>    buffered_file: don't flush on put buffer
>    buffered_file: unfold buffered_append in buffered_put_buffer
>
> Paolo Bonzini (2):
>    split MRU ram list
>    BufferedFile: append, then flush
>
> Umesh Deshpande (2):
>    add a version number to ram_list
>    protect the ramlist with a separate mutex
>
>   arch_init.c      |  108 +++++++++++++++++++++++++-------
>   buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>   buffered_file.h  |   12 +---
>   cpu-all.h        |   17 +++++-
>   exec-obsolete.h  |   10 ---
>   exec.c           |   45 +++++++++++---
>   migration-exec.c |    2 -
>   migration-fd.c   |    6 --
>   migration-tcp.c  |    2 +-
>   migration-unix.c |    2 -
>   migration.c      |  111 ++++++++++++++-------------------
>   migration.h      |    6 ++
>   qemu-file.h      |    5 --
>   savevm.c         |    5 --
>   14 files changed, 249 insertions(+), 261 deletions(-)
>
> -- 
> 1.7.10.4
>
>
>
>
>
Chegu Vinod July 26, 2012, 9:26 p.m. UTC | #6
On 7/26/2012 11:41 AM, Chegu Vinod wrote:
>
>
>> -------- Original Message --------
>> Subject: 	[Qemu-devel] [RFC 00/27] Migration thread (WIP)
>> Date: 	Tue, 24 Jul 2012 20:36:25 +0200
>> From: 	Juan Quintela <quintela@redhat.com>
>> To: 	qemu-devel@nongnu.org
>>
>>
>>
>> Hi
>>
>> This series are on top of the migration-next-v5 series just posted.
>>
>> First of all, this is an RFC/Work in progress.  Just a lot of people
>> asked for it, and I would like review of the design.
> Hello,
>
> Thanks for sharing this early/WIP version for evaluation.
>
> Still in the middle of  code review..but wanted to share a couple of 
> quick  observations.
> 'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and 
> downtime 2s).
> Once with no workload (i.e. idle guest) and the second was with a 
> SpecJBB running in the guest.
>
> The idle guest case seemed to migrate fine...
>
>
> capabilities: xbzrle: off
> Migration status: completed
> transferred ram: 3811345 kbytes
> remaining ram: 0 kbytes
> total ram: 134226368 kbytes
> total time: 199743 milliseconds
>
>
> In the case of the SpecJBB I ran into issues during stage 3...the 
> source host's qemu and the guest hung. I need to debug this more... 
> (if  already have some hints pl. let me know.).
>
>
> capabilities: xbzrle: off
> Migration status: active
> transferred ram: 127618578 kbytes
> remaining ram: 2386832 kbytes
> total ram: 134226368 kbytes
> total time: 526139 milliseconds
> (qemu) qemu_savevm_state_complete called
> qemu_savevm_state_complete calling ram_save_complete
>
> <---  hung somewhere after this ('need to get more info).
>


Appears to be some race condition...as there are cases when it hangs and 
in some cases it succeeds.

(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
transferred ram: 129937687 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 543228 milliseconds

Need to review/debug...

Vinod


>
> ---
>
> As with the non-migration-thread version the Specjbb workload 
> completed before the migration attempted to move to stage 3 (i.e. 
> didn't converge while the workload was still active).
>
> BTW, with this version of the bits (i.e. while running SpecJBB which 
> is supposed to dirty quite a bit of memory) I noticed that there 
> wasn't much change in the b/w usage of the dedicated 10Gb private 
> network link (It was still < ~1.5-3.0Gb/sec). Expected this to be a 
> little better since we have a separate thread...  not sure what else 
> is in play here ? (numa locality of where the migration thread runs or 
> something other basic tuning in the implementation ?)
>
> 'have a hi-level design question... (perhaps folks have already 
> thought about it..and categorized it as potential future optimization..?)
>
> Would it be possible to off load the iothread completely [from all 
> migration related activity] and have one thread (with the appropriate 
> protection) get involved with getting the list of the dirty pages ? 
> Have one or more threads dedicated for trying to push multiple streams 
> of data to saturate the allocated network bandwidth ?  This may help 
> in large + busy guests. Comments? There  are perhaps other 
> implications of doing all of this (like burning more host cpu cycles) 
> but perhaps this can be configurable based on user's needs... e.g. 
> fewer but large guests on a host with no over subscription.
>
> Thanks
> Vinod
>
>
>> It does:
>> - get a new bitmap for migration, and that bitmap uses 1 bit by page
>> - it unfolds migration_buffered_file.  Only one user existed.
>> - it simplifies buffered_file a lot.
>>
>> - About the migration thread, special attention was giving to try to
>>    get the series review-able (reviewers would tell if I got it).
>>
>> Basic design:
>> - we create a new thread instead of a timer function
>> - we move all the migration work to that thread (but run everything
>>    except the waits with the iothread lock.
>> - we move all the writting to outside the iothread lock.  i.e.
>>    we walk the state with the iothread hold, and copy everything to one buffer.
>>    then we write that buffer to the sockets outside the iothread lock.
>> - once here, we move to writting synchronously to the sockets.
>> - this allows us to simplify quite a lot.
>>
>> And basically, that is it.  Notice that we still do the iterate page
>> walking with the iothread held.  Light testing show that we got
>> similar speed and latencies than without the thread (notice that
>> almost no optimizations done here yet).
>>
>> Appart of the review:
>> - Are there any locking issues that I have missed (I guess so)
>> - stop all cpus correctly.  vm_stop should be called from the iothread,
>>    I use the trick of using a bottom half to get that working correctly.
>>    but this _implementation_ is ugly as hell.  Is there an easy way
>>    of doing it?
>> - Do I really have to export last_ram_offset(), there is no other way
>>    of knowing the ammount of RAM?
>>
>> Known issues:
>>
>> - for some reason, when it has to start a 2nd round of bitmap
>>    handling, it decides to dirty all pages.  Haven't found still why
>>    this happens.
>>
>> If you can test it, and said me where it breaks, it would also help.
>>
>> Work is based on Umesh thread work, and work that Paolo Bonzini had
>> work on top of that.  All the mirgation thread was done from scratch
>> becase I was unable to debug why it was failing, but it "owes" a lot
>> to the previous design.
>>
>> Thanks in advance, Juan.
>>
>> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
>>
>>    Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
>>
>> are available in the git repository at:
>>
>>
>>    http://repo.or.cz/r/qemu/quintela.git  migration-thread-v1
>>
>> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
>>
>>    buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
>>
>>
>> Juan Quintela (23):
>>    buffered_file: g_realloc() can't fail
>>    savevm: Factorize ram globals reset in its own function
>>    ram: introduce migration_bitmap_set_dirty()
>>    ram: Introduce migration_bitmap_test_and_reset_dirty()
>>    ram: Export last_ram_offset()
>>    ram: introduce migration_bitmap_sync()
>>    Separate migration bitmap
>>    buffered_file: rename opaque to migration_state
>>    buffered_file: opaque is MigrationState
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: unfold migrate_fd_put_ready
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: We can access directly to bandwidth_limit
>>    buffered_file: Move from using a timer to use a thread
>>    migration: make qemu_fopen_ops_buffered() return void
>>    migration: stop all cpus correctly
>>    migration: make writes blocking
>>    migration: remove unfreeze logic
>>    migration: take finer locking
>>    buffered_file: Unfold the trick to restart generating migration data
>>    buffered_file: don't flush on put buffer
>>    buffered_file: unfold buffered_append in buffered_put_buffer
>>
>> Paolo Bonzini (2):
>>    split MRU ram list
>>    BufferedFile: append, then flush
>>
>> Umesh Deshpande (2):
>>    add a version number to ram_list
>>    protect the ramlist with a separate mutex
>>
>>   arch_init.c      |  108 +++++++++++++++++++++++++-------
>>   buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>>   buffered_file.h  |   12 +---
>>   cpu-all.h        |   17 +++++-
>>   exec-obsolete.h  |   10 ---
>>   exec.c           |   45 +++++++++++---
>>   migration-exec.c |    2 -
>>   migration-fd.c   |    6 --
>>   migration-tcp.c  |    2 +-
>>   migration-unix.c |    2 -
>>   migration.c      |  111 ++++++++++++++-------------------
>>   migration.h      |    6 ++
>>   qemu-file.h      |    5 --
>>   savevm.c         |    5 --
>>   14 files changed, 249 insertions(+), 261 deletions(-)
>>
>> -- 
>> 1.7.10.4
>>
>>
>>
>>
>>
>
>
Michael Roth July 26, 2012, 9:36 p.m. UTC | #7
On Tue, Jul 24, 2012 at 08:36:25PM +0200, Juan Quintela wrote:
> Hi
> 
> This series are on top of the migration-next-v5 series just posted.
> 
> First of all, this is an RFC/Work in progress.  Just a lot of people
> asked for it, and I would like review of the design.
> 
> It does:
> - get a new bitmap for migration, and that bitmap uses 1 bit by page
> - it unfolds migration_buffered_file.  Only one user existed.
> - it simplifies buffered_file a lot.
> 
> - About the migration thread, special attention was giving to try to
>   get the series review-able (reviewers would tell if I got it).
> 
> Basic design:
> - we create a new thread instead of a timer function
> - we move all the migration work to that thread (but run everything
>   except the waits with the iothread lock.
> - we move all the writting to outside the iothread lock.  i.e.
>   we walk the state with the iothread hold, and copy everything to one buffer.
>   then we write that buffer to the sockets outside the iothread lock.
> - once here, we move to writting synchronously to the sockets.
> - this allows us to simplify quite a lot.
> 
> And basically, that is it.  Notice that we still do the iterate page
> walking with the iothread held.  Light testing show that we got

Is the plan to eventually hold the iothread lock only to re-sync the
dirty bitmap, and then rely on qemu_mutex_lock_ramlist() to walk the
ramlist?

It seems like the non-MRU block list, "local" migration
bitmap copy, and ramlist mutex are all toward this end, but we still
have basically the same locking protocol as before. If not, can you elaborate
on what purpose they serve in this series?

> similar speed and latencies than without the thread (notice that
> almost no optimizations done here yet).
> 
> Appart of the review:
> - Are there any locking issues that I have missed (I guess so)
> - stop all cpus correctly.  vm_stop should be called from the iothread,
>   I use the trick of using a bottom half to get that working correctly.
>   but this _implementation_ is ugly as hell.  Is there an easy way
>   of doing it?
> - Do I really have to export last_ram_offset(), there is no other way
>   of knowing the ammount of RAM?
> 
> Known issues:
> 
> - for some reason, when it has to start a 2nd round of bitmap
>   handling, it decides to dirty all pages.  Haven't found still why
>   this happens.
> 
> If you can test it, and said me where it breaks, it would also help.
> 
> Work is based on Umesh thread work, and work that Paolo Bonzini had
> work on top of that.  All the mirgation thread was done from scratch
> becase I was unable to debug why it was failing, but it "owes" a lot
> to the previous design.
> 
> Thanks in advance, Juan.
> 
> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
> 
>   Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
> 
> are available in the git repository at:
> 
> 
>   http://repo.or.cz/r/qemu/quintela.git migration-thread-v1
> 
> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
> 
>   buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
> 
> 
> Juan Quintela (23):
>   buffered_file: g_realloc() can't fail
>   savevm: Factorize ram globals reset in its own function
>   ram: introduce migration_bitmap_set_dirty()
>   ram: Introduce migration_bitmap_test_and_reset_dirty()
>   ram: Export last_ram_offset()
>   ram: introduce migration_bitmap_sync()
>   Separate migration bitmap
>   buffered_file: rename opaque to migration_state
>   buffered_file: opaque is MigrationState
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_ready
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: We can access directly to bandwidth_limit
>   buffered_file: Move from using a timer to use a thread
>   migration: make qemu_fopen_ops_buffered() return void
>   migration: stop all cpus correctly
>   migration: make writes blocking
>   migration: remove unfreeze logic
>   migration: take finer locking
>   buffered_file: Unfold the trick to restart generating migration data
>   buffered_file: don't flush on put buffer
>   buffered_file: unfold buffered_append in buffered_put_buffer
> 
> Paolo Bonzini (2):
>   split MRU ram list
>   BufferedFile: append, then flush
> 
> Umesh Deshpande (2):
>   add a version number to ram_list
>   protect the ramlist with a separate mutex
> 
>  arch_init.c      |  108 +++++++++++++++++++++++++-------
>  buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>  buffered_file.h  |   12 +---
>  cpu-all.h        |   17 +++++-
>  exec-obsolete.h  |   10 ---
>  exec.c           |   45 +++++++++++---
>  migration-exec.c |    2 -
>  migration-fd.c   |    6 --
>  migration-tcp.c  |    2 +-
>  migration-unix.c |    2 -
>  migration.c      |  111 ++++++++++++++-------------------
>  migration.h      |    6 ++
>  qemu-file.h      |    5 --
>  savevm.c         |    5 --
>  14 files changed, 249 insertions(+), 261 deletions(-)
> 
> -- 
> 1.7.10.4
> 
>
Juan Quintela July 27, 2012, 11:05 a.m. UTC | #8
Chegu Vinod <chegu_vinod@hp.com> wrote:
> On 7/26/2012 11:41 AM, Chegu Vinod wrote:
>
>
>     
>         
>         -------- Original Message -------- 
>
>                                                                    
>          Subject:  [Qemu-devel] [RFC 00/27] Migration thread (WIP) 
>                                                                    
>          Date:     Tue, 24 Jul 2012 20:36:25 +0200                 
>                                                                    
>          From:     Juan Quintela <quintela@redhat.com>             
>                                                                    
>          To:       qemu-devel@nongnu.org                           
>                                                                    
>
>         
>         
>         Hi
>
> This series are on top of the migration-next-v5 series just posted.
>
> First of all, this is an RFC/Work in progress.  Just a lot of people
> asked for it, and I would like review of the design.
>
>     Hello,
>     
>     Thanks for sharing this early/WIP version for evaluation. 
>     
>     Still in the middle of  code review..but wanted to share a couple
>     of quick  observations.
>     'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G
>     and downtime 2s). 
>     Once with no workload (i.e. idle guest) and the second was with a
>     SpecJBB running in the guest.  
>     
>     The idle guest case seemed to migrate fine...
>     
>     
>     capabilities: xbzrle: off
>     Migration status: completed
>     transferred ram: 3811345 kbytes
>     remaining ram: 0 kbytes
>     total ram: 134226368 kbytes
>     total time: 199743 milliseconds
>     
>     
>     In the case of the SpecJBB I ran into issues during stage 3...the
>     source host's qemu and the guest hung. I need to debug this
>     more... (if  already have some hints pl. let me know.).
>     
>     
>     capabilities: xbzrle: off 
>     Migration status: active
>     transferred ram: 127618578 kbytes
>     remaining ram: 2386832 kbytes
>     total ram: 134226368 kbytes
>     total time: 526139 milliseconds
>     (qemu) qemu_savevm_state_complete called 
>     qemu_savevm_state_complete calling ram_save_complete
>      
>     <---  hung somewhere after this ('need to get more info).
>     
>     
>
>
> Appears to be some race condition...as there are cases when it hangs
> and in some cases it succeeds.

Weird guess, try to use less vcpus, same ram.  The way that we stop cpus
is _hacky_ to say it somewhere.  Will try to think about that part.

Thanks for the testing.  All my testing has been done with 8GB guests
and 2vcps.  Will try with more vcpus to see if it makes a difference.




>
> (qemu) info migrate
> capabilities: xbzrle: off 
> Migration status: completed
> transferred ram: 129937687 kbytes
> remaining ram: 0 kbytes
> total ram: 134226368 kbytes
> total time: 543228 milliseconds

Humm, _that_ is more strange.  This means that it finished.  Could you
run qemu under gdb and sent me the stack traces?

I don't know your gdb thread kung-fu, so here are the instructions just
in case:

gdb --args <exact qemu commandh line you used>
C-c to break when it hangs
(gdb)info threads
you see all the threads running
(gdb)thread 1
or whatever other number
(gdb)bt
the backtrace of that thread.

I am specially interested in the backtrace of the migration thread and
of the iothread.

Thanks, Juan.

>
> Need to review/debug...
>
> Vinod
>
>
>
>     ---
>     
>     As with the non-migration-thread version the Specjbb workload
>     completed before the migration attempted to move to stage 3 (i.e.
>     didn't converge while the workload was still active). 
>     
>     BTW, with this version of the bits (i.e. while running SpecJBB
>     which is supposed to dirty quite a bit of memory) I noticed that
>     there wasn't much change in the b/w usage of the dedicated 10Gb
>     private network link (It was still < ~1.5-3.0Gb/sec).   Expected
>     this to be a little better since we have a separate thread...  not
>     sure what else is in play here ? (numa locality of where the
>     migration thread runs or something other basic tuning in the
>     implementation ?)
>     
>     'have a hi-level design question... (perhaps folks have already
>     thought about it..and categorized it as potential future
>     optimization..?)
>     
>     Would it be possible to off load the iothread completely [from all
>     migration related activity] and have one thread (with the
>     appropriate protection) get involved with getting the list of the
>     dirty pages ? Have one or more threads dedicated for trying to
>     push multiple streams of data to saturate the allocated network
>     bandwidth ?  This may help in large + busy guests. Comments?   
>     There  are perhaps other implications of doing all of this (like
>     burning more host cpu cycles) but perhaps this can be configurable
>     based on user's needs... e.g. fewer but large guests on a host
>     with no over subscription. 
>     
>     Thanks
>     Vinod
>     
>     
>         
>         
>         It does:
> - get a new bitmap for migration, and that bitmap uses 1 bit by page
> - it unfolds migration_buffered_file.  Only one user existed.
> - it simplifies buffered_file a lot.
>
> - About the migration thread, special attention was giving to try to
>   get the series review-able (reviewers would tell if I got it).
>
> Basic design:
> - we create a new thread instead of a timer function
> - we move all the migration work to that thread (but run everything
>   except the waits with the iothread lock.
> - we move all the writting to outside the iothread lock.  i.e.
>   we walk the state with the iothread hold, and copy everything to one buffer.
>   then we write that buffer to the sockets outside the iothread lock.
> - once here, we move to writting synchronously to the sockets.
> - this allows us to simplify quite a lot.
>
> And basically, that is it.  Notice that we still do the iterate page
> walking with the iothread held.  Light testing show that we got
> similar speed and latencies than without the thread (notice that
> almost no optimizations done here yet).
>
> Appart of the review:
> - Are there any locking issues that I have missed (I guess so)
> - stop all cpus correctly.  vm_stop should be called from the iothread,
>   I use the trick of using a bottom half to get that working correctly.
>   but this _implementation_ is ugly as hell.  Is there an easy way
>   of doing it?
> - Do I really have to export last_ram_offset(), there is no other way
>   of knowing the ammount of RAM?
>
> Known issues:
>
> - for some reason, when it has to start a 2nd round of bitmap
>   handling, it decides to dirty all pages.  Haven't found still why
>   this happens.
>
> If you can test it, and said me where it breaks, it would also help.
>
> Work is based on Umesh thread work, and work that Paolo Bonzini had
> work on top of that.  All the mirgation thread was done from scratch
> becase I was unable to debug why it was failing, but it "owes" a lot
> to the previous design.
>
> Thanks in advance, Juan.
>
> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
>
>   Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
>
> are available in the git repository at:
>
>
>   http://repo.or.cz/r/qemu/quintela.git migration-thread-v1
>
> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
>
>   buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
>
>
> Juan Quintela (23):
>   buffered_file: g_realloc() can't fail
>   savevm: Factorize ram globals reset in its own function
>   ram: introduce migration_bitmap_set_dirty()
>   ram: Introduce migration_bitmap_test_and_reset_dirty()
>   ram: Export last_ram_offset()
>   ram: introduce migration_bitmap_sync()
>   Separate migration bitmap
>   buffered_file: rename opaque to migration_state
>   buffered_file: opaque is MigrationState
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_ready
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: unfold migrate_fd_put_buffer
>   buffered_file: We can access directly to bandwidth_limit
>   buffered_file: Move from using a timer to use a thread
>   migration: make qemu_fopen_ops_buffered() return void
>   migration: stop all cpus correctly
>   migration: make writes blocking
>   migration: remove unfreeze logic
>   migration: take finer locking
>   buffered_file: Unfold the trick to restart generating migration data
>   buffered_file: don't flush on put buffer
>   buffered_file: unfold buffered_append in buffered_put_buffer
>
> Paolo Bonzini (2):
>   split MRU ram list
>   BufferedFile: append, then flush
>
> Umesh Deshpande (2):
>   add a version number to ram_list
>   protect the ramlist with a separate mutex
>
>  arch_init.c      |  108 +++++++++++++++++++++++++-------
>  buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>  buffered_file.h  |   12 +---
>  cpu-all.h        |   17 +++++-
>  exec-obsolete.h  |   10 ---
>  exec.c           |   45 +++++++++++---
>  migration-exec.c |    2 -
>  migration-fd.c   |    6 --
>  migration-tcp.c  |    2 +-
>  migration-unix.c |    2 -
>  migration.c      |  111 ++++++++++++++-------------------
>  migration.h      |    6 ++
>  qemu-file.h      |    5 --
>  savevm.c         |    5 --
>  14 files changed, 249 insertions(+), 261 deletions(-)
Chegu Vinod July 27, 2012, 2:21 p.m. UTC | #9
On 7/27/2012 7:11 AM, Vinod, Chegu wrote:
>
> -----Original Message-----
> From: Juan Quintela [mailto:quintela@redhat.com]
> Sent: Friday, July 27, 2012 4:06 AM
> To: Vinod, Chegu
> Cc: qemu-devel@nongnu.org; Orit Wasserman
> Subject: Re: Fwd: [RFC 00/27] Migration thread (WIP)
>
> Chegu Vinod <chegu_vinod@hp.com> wrote:
>> On 7/26/2012 11:41 AM, Chegu Vinod wrote:
>>
>>
>>      
>>          
>>          -------- Original Message --------
>>
>>                                                                     
>>           Subject:  [Qemu-devel] [RFC 00/27] Migration thread (WIP)
>>                                                                     
>>           Date:     Tue, 24 Jul 2012 20:36:25 +0200
>>                                                                     
>>           From:     Juan Quintela <quintela@redhat.com>
>>                                                                     
>>           To:       qemu-devel@nongnu.org
>>                                                                     
>>
>>          
>>          
>>          Hi
>>
>> This series are on top of the migration-next-v5 series just posted.
>>
>> First of all, this is an RFC/Work in progress.  Just a lot of people
>> asked for it, and I would like review of the design.
>>
>>      Hello,
>>      
>>      Thanks for sharing this early/WIP version for evaluation.
>>      
>>      Still in the middle of  code review..but wanted to share a couple
>>      of quick  observations.
>>      'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G
>>      and downtime 2s).
>>      Once with no workload (i.e. idle guest) and the second was with a
>>      SpecJBB running in the guest.
>>      
>>      The idle guest case seemed to migrate fine...
>>      
>>      
>>      capabilities: xbzrle: off
>>      Migration status: completed
>>      transferred ram: 3811345 kbytes
>>      remaining ram: 0 kbytes
>>      total ram: 134226368 kbytes
>>      total time: 199743 milliseconds
>>      
>>      
>>      In the case of the SpecJBB I ran into issues during stage 3...the
>>      source host's qemu and the guest hung. I need to debug this
>>      more... (if  already have some hints pl. let me know.).
>>      
>>      
>>      capabilities: xbzrle: off
>>      Migration status: active
>>      transferred ram: 127618578 kbytes
>>      remaining ram: 2386832 kbytes
>>      total ram: 134226368 kbytes
>>      total time: 526139 milliseconds
>>      (qemu) qemu_savevm_state_complete called
>>      qemu_savevm_state_complete calling ram_save_complete
>>       
>>      <---  hung somewhere after this ('need to get more info).
>>      
>>      
>>
>>
>> Appears to be some race condition...as there are cases when it hangs
>> and in some cases it succeeds.
> Weird guess, try to use less vcpus, same ram.

Ok..will try that.
> The way that we stop cpus is _hacky_ to say it somewhere.  Will try to think about that part.
Ok.
> Thanks for the testing.  All my testing has been done with 8GB guests and 2vcps.  Will try with more vcpus to see if it makes a difference.
>
>
>
>
>> (qemu) info migrate
>> capabilities: xbzrle: off
>> Migration status: completed
>> transferred ram: 129937687 kbytes
>> remaining ram: 0 kbytes
>> total ram: 134226368 kbytes
>> total time: 543228 milliseconds
> Humm, _that_ is more strange.  This means that it finished.

There are cases where the migration is finishing just fine... even with 
larger guest configurations (256G/20VCPUs).


>   Could you run qemu under gdb and sent me the stack traces?
>
> I don't know your gdb thread kung-fu, so here are the instructions just in case:
>
> gdb --args <exact qemu commandh line you used> C-c to break when it hangs (gdb)info threads you see all the threads running (gdb)thread 1 or whatever other number (gdb)bt the backtrace of that thread.

The hang is intermittent...
I ran it 4-5 times (under gdb) just now and I didn't see the issue :-(


> I am specially interested in the backtrace of the migration thread and of the iothread.

Will keep re-trying with different configs. and see if i get lucky in 
reproducing it (under gdb).

Vinod
>
> Thanks, Juan.
>
>> Need to review/debug...
>>
>> Vinod
>>
>>
>>
>>      ---
>>      
>>      As with the non-migration-thread version the Specjbb workload
>>      completed before the migration attempted to move to stage 3 (i.e.
>>      didn't converge while the workload was still active).
>>      
>>      BTW, with this version of the bits (i.e. while running SpecJBB
>>      which is supposed to dirty quite a bit of memory) I noticed that
>>      there wasn't much change in the b/w usage of the dedicated 10Gb
>>      private network link (It was still < ~1.5-3.0Gb/sec).   Expected
>>      this to be a little better since we have a separate thread...  not
>>      sure what else is in play here ? (numa locality of where the
>>      migration thread runs or something other basic tuning in the
>>      implementation ?)
>>      
>>      'have a hi-level design question... (perhaps folks have already
>>      thought about it..and categorized it as potential future
>>      optimization..?)
>>      
>>      Would it be possible to off load the iothread completely [from all
>>      migration related activity] and have one thread (with the
>>      appropriate protection) get involved with getting the list of the
>>      dirty pages ? Have one or more threads dedicated for trying to
>>      push multiple streams of data to saturate the allocated network
>>      bandwidth ?  This may help in large + busy guests. Comments?
>>      There  are perhaps other implications of doing all of this (like
>>      burning more host cpu cycles) but perhaps this can be configurable
>>      based on user's needs... e.g. fewer but large guests on a host
>>      with no over subscription.
>>      
>>      Thanks
>>      Vinod
>>      
>>      
>>          
>>          
>>          It does:
>> - get a new bitmap for migration, and that bitmap uses 1 bit by page
>> - it unfolds migration_buffered_file.  Only one user existed.
>> - it simplifies buffered_file a lot.
>>
>> - About the migration thread, special attention was giving to try to
>>    get the series review-able (reviewers would tell if I got it).
>>
>> Basic design:
>> - we create a new thread instead of a timer function
>> - we move all the migration work to that thread (but run everything
>>    except the waits with the iothread lock.
>> - we move all the writting to outside the iothread lock.  i.e.
>>    we walk the state with the iothread hold, and copy everything to one buffer.
>>    then we write that buffer to the sockets outside the iothread lock.
>> - once here, we move to writting synchronously to the sockets.
>> - this allows us to simplify quite a lot.
>>
>> And basically, that is it.  Notice that we still do the iterate page
>> walking with the iothread held.  Light testing show that we got
>> similar speed and latencies than without the thread (notice that
>> almost no optimizations done here yet).
>>
>> Appart of the review:
>> - Are there any locking issues that I have missed (I guess so)
>> - stop all cpus correctly.  vm_stop should be called from the iothread,
>>    I use the trick of using a bottom half to get that working correctly.
>>    but this _implementation_ is ugly as hell.  Is there an easy way
>>    of doing it?
>> - Do I really have to export last_ram_offset(), there is no other way
>>    of knowing the ammount of RAM?
>>
>> Known issues:
>>
>> - for some reason, when it has to start a 2nd round of bitmap
>>    handling, it decides to dirty all pages.  Haven't found still why
>>    this happens.
>>
>> If you can test it, and said me where it breaks, it would also help.
>>
>> Work is based on Umesh thread work, and work that Paolo Bonzini had
>> work on top of that.  All the mirgation thread was done from scratch
>> becase I was unable to debug why it was failing, but it "owes" a lot
>> to the previous design.
>>
>> Thanks in advance, Juan.
>>
>> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
>>
>>    Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23
>> 13:15:34 -0500)
>>
>> are available in the git repository at:
>>
>>
>>    http://repo.or.cz/r/qemu/quintela.git migration-thread-v1
>>
>> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
>>
>>    buffered_file: unfold buffered_append in buffered_put_buffer
>> (2012-07-24 16:46:13 +0200)
>>
>>
>> Juan Quintela (23):
>>    buffered_file: g_realloc() can't fail
>>    savevm: Factorize ram globals reset in its own function
>>    ram: introduce migration_bitmap_set_dirty()
>>    ram: Introduce migration_bitmap_test_and_reset_dirty()
>>    ram: Export last_ram_offset()
>>    ram: introduce migration_bitmap_sync()
>>    Separate migration bitmap
>>    buffered_file: rename opaque to migration_state
>>    buffered_file: opaque is MigrationState
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: unfold migrate_fd_put_ready
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: unfold migrate_fd_put_buffer
>>    buffered_file: We can access directly to bandwidth_limit
>>    buffered_file: Move from using a timer to use a thread
>>    migration: make qemu_fopen_ops_buffered() return void
>>    migration: stop all cpus correctly
>>    migration: make writes blocking
>>    migration: remove unfreeze logic
>>    migration: take finer locking
>>    buffered_file: Unfold the trick to restart generating migration data
>>    buffered_file: don't flush on put buffer
>>    buffered_file: unfold buffered_append in buffered_put_buffer
>>
>> Paolo Bonzini (2):
>>    split MRU ram list
>>    BufferedFile: append, then flush
>>
>> Umesh Deshpande (2):
>>    add a version number to ram_list
>>    protect the ramlist with a separate mutex
>>
>>   arch_init.c      |  108 +++++++++++++++++++++++++-------
>>   buffered_file.c  |  179 +++++++++++++++++-------------------------------------
>>   buffered_file.h  |   12 +---
>>   cpu-all.h        |   17 +++++-
>>   exec-obsolete.h  |   10 ---
>>   exec.c           |   45 +++++++++++---
>>   migration-exec.c |    2 -
>>   migration-fd.c   |    6 --
>>   migration-tcp.c  |    2 +-
>>   migration-unix.c |    2 -
>>   migration.c      |  111 ++++++++++++++-------------------
>>   migration.h      |    6 ++
>>   qemu-file.h      |    5 --
>>   savevm.c         |    5 --
>>   14 files changed, 249 insertions(+), 261 deletions(-)
Juan Quintela Aug. 2, 2012, 12:01 p.m. UTC | #10
Michael Roth <mdroth@linux.vnet.ibm.com> wrote:
> On Tue, Jul 24, 2012 at 08:36:25PM +0200, Juan Quintela wrote:
>> Hi
>> 
>> This series are on top of the migration-next-v5 series just posted.
>> 
>> First of all, this is an RFC/Work in progress.  Just a lot of people
>> asked for it, and I would like review of the design.
>> 
>> It does:
>> - get a new bitmap for migration, and that bitmap uses 1 bit by page
>> - it unfolds migration_buffered_file.  Only one user existed.
>> - it simplifies buffered_file a lot.
>> 
>> - About the migration thread, special attention was giving to try to
>>   get the series review-able (reviewers would tell if I got it).
>> 
>> Basic design:
>> - we create a new thread instead of a timer function
>> - we move all the migration work to that thread (but run everything
>>   except the waits with the iothread lock.
>> - we move all the writting to outside the iothread lock.  i.e.
>>   we walk the state with the iothread hold, and copy everything to one buffer.
>>   then we write that buffer to the sockets outside the iothread lock.
>> - once here, we move to writting synchronously to the sockets.
>> - this allows us to simplify quite a lot.
>> 
>> And basically, that is it.  Notice that we still do the iterate page
>> walking with the iothread held.  Light testing show that we got
>
> Is the plan to eventually hold the iothread lock only to re-sync the
> dirty bitmap, and then rely on qemu_mutex_lock_ramlist() to walk the
> ramlist?

Yeap.  I want to drop the walking of the RAM without iothread lock, but
aren't yet there.  This series basically move:
- all migration operations are done in its own thread
- all copy from "guest" to "buffer" is done with the io thread lock (to
  call it something)
- all writes from that buffer to the socket/fd is done synchronously and
  without the iothread
- we measure bandwith/downtime more correctly (still not perfect)

> It seems like the non-MRU block list, "local" migration
> bitmap copy, and ramlist mutex are all toward this end, but we still
> have basically the same locking protocol as before. If not, can you elaborate
> on what purpose they serve in this series?

The problem to move the "live part" outside is that we still miss some
locking issues.

Later, Juan.