[RFC] New thread for the VM migration

Message ID 1447945249.1317755.1310627692984.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com
State New

Commit Message

Umesh Deshpande July 14, 2011, 7:14 a.m. UTC
The following patch deals with VCPU and iothread starvation during the migration of a guest. Currently the iothread is responsible for performing the migration: it holds qemu_mutex for the duration of the migration, which prevents VCPUs from entering qemu mode and delays their return to the guest. The guest migration, executed as an iohandler, also delays the execution of other iohandlers. This patch moves the migration to a separate thread to reduce qemu_mutex contention and iohandler starvation.


Signed-off-by: Umesh Deshpande <udeshpan@redhat.com>
---
 arch_init.c      |   19 +++++++++-----
 buffered_file.c  |   10 ++++----
 cpu-all.h        |   37 +++++++++++++++++++++++++++++
 exec.c           |   59 +++++++++++++++++++++++++++++++++++++++++++++++
 migration-exec.c |   17 ++++++++++---
 migration-fd.c   |   15 ++++++++++-
 migration-tcp.c  |   34 +++++++++++++++------------
 migration-unix.c |   23 ++++++++++--------
 migration.c      |   67 +++++++++++++++++++++++++++++++++---------------------
 migration.h      |    6 +++-
 qemu-timer.c     |   28 +++++++++++++++++++++-
 qemu-timer.h     |    3 ++
 12 files changed, 245 insertions(+), 73 deletions(-)

Comments

Avi Kivity July 14, 2011, 8:36 a.m. UTC | #1
On 07/14/2011 10:14 AM, Umesh Deshpande wrote:
> The following patch deals with VCPU and iothread starvation during the migration of a guest. Currently the iothread is responsible for performing the migration: it holds qemu_mutex for the duration of the migration, which prevents VCPUs from entering qemu mode and delays their return to the guest. The guest migration, executed as an iohandler, also delays the execution of other iohandlers. This patch moves the migration to a separate thread to reduce qemu_mutex contention and iohandler starvation.
>
>
> @@ -260,10 +260,15 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>           return 0;
>       }
>
> +    if (stage != 3)
> +        qemu_mutex_lock_iothread();

Please read CODING_STYLE, especially the bit about braces.

Does this mean that the following code is sometimes executed without 
qemu_mutex?  I don't think any of it is thread safe.

Even just reading memory is not thread safe.  You either have to copy it 
into a buffer under lock, or convert the memory API to RCU.
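
A minimal sketch of the copy-under-lock option, assuming a hypothetical
helper built on the existing qemu_safe_ram_ptr() and iothread-lock APIs:

/* Snapshot one page under qemu_mutex so the migration thread never
 * reads guest RAM while vcpus may still be writing it.  The helper
 * name and the static buffer are illustrative, not part of the patch. */
static uint8_t page_copy[TARGET_PAGE_SIZE];

static const uint8_t *migration_snapshot_page(ram_addr_t addr)
{
    qemu_mutex_lock_iothread();
    memcpy(page_copy, qemu_safe_ram_ptr(addr), TARGET_PAGE_SIZE);
    qemu_mutex_unlock_iothread();
    return page_copy;    /* now safe to write out without the lock */
}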
Stefan Hajnoczi July 14, 2011, 9:09 a.m. UTC | #2
On Thu, Jul 14, 2011 at 9:36 AM, Avi Kivity <avi@redhat.com> wrote:
> On 07/14/2011 10:14 AM, Umesh Deshpande wrote:
>> @@ -260,10 +260,15 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int
>> stage, void *opaque)
>>          return 0;
>>      }
>>
>> +    if (stage != 3)
>> +        qemu_mutex_lock_iothread();
>
> Please read CODING_STYLE, especially the bit about braces.

Please use scripts/checkpatch.pl to check coding style before
submitting patches to the list.

You can also set git's pre-commit hook to automatically run checkpatch.pl:
http://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html

Stefan
Anthony Liguori July 14, 2011, 12:30 p.m. UTC | #3
On 07/14/2011 03:36 AM, Avi Kivity wrote:
> On 07/14/2011 10:14 AM, Umesh Deshpande wrote:
>> The following patch deals with VCPU and iothread starvation during
>> the migration of a guest. Currently the iothread is responsible for
>> performing the migration: it holds qemu_mutex for the duration of
>> the migration, which prevents VCPUs from entering qemu mode and
>> delays their return to the guest. The guest migration, executed as
>> an iohandler, also delays the execution of other iohandlers. This
>> patch moves the migration to a separate thread to reduce qemu_mutex
>> contention and iohandler starvation.
>>
>>
>> @@ -260,10 +260,15 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int
>> stage, void *opaque)
>> return 0;
>> }
>>
>> + if (stage != 3)
>> + qemu_mutex_lock_iothread();
>
> Please read CODING_STYLE, especially the bit about braces.
>
> Does this mean that the following code is sometimes executed without
> qemu_mutex? I don't think any of it is thread safe.

That was my reaction too.

I think the most rational thing to do is have a separate thread and a 
pair of producer/consumer queues.

The I/O thread can push virtual addresses and sizes to the queue for the 
migration thread to compress/write() to the fd.  The migration thread 
can then push sent regions onto a separate queue for the I/O thread to 
mark as dirty.

Regards,

Anthony Liguori

>
> Even just reading memory is not thread safe. You either have to copy it
> into a buffer under lock, or convert the memory API to RCU.
>
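
Roughly, that two-queue scheme could look like this (a sketch; every
name below is hypothetical, only QSIMPLEQ, QemuMutex and
qemu_put_buffer() are existing QEMU primitives):

typedef struct MigRegion {
    uint8_t *host;                          /* virtual address pushed by the I/O thread */
    size_t size;
    QSIMPLEQ_ENTRY(MigRegion) next;
} MigRegion;

static QemuMutex mig_queue_lock;            /* guards both queues; init elsewhere */
static QSIMPLEQ_HEAD(, MigRegion) to_send =
    QSIMPLEQ_HEAD_INITIALIZER(to_send);     /* filled by the I/O thread */
static QSIMPLEQ_HEAD(, MigRegion) done =
    QSIMPLEQ_HEAD_INITIALIZER(done);        /* drained by the I/O thread */

/* Migration thread: pop one region, write it to the fd, hand it back
 * so the I/O thread can re-mark it dirty under qemu_mutex. */
static void migration_consume_one(QEMUFile *f)
{
    MigRegion *r;

    qemu_mutex_lock(&mig_queue_lock);
    r = QSIMPLEQ_FIRST(&to_send);
    if (r) {
        QSIMPLEQ_REMOVE_HEAD(&to_send, next);
    }
    qemu_mutex_unlock(&mig_queue_lock);
    if (!r) {
        return;
    }

    qemu_put_buffer(f, r->host, r->size);   /* the compress/write() step */

    qemu_mutex_lock(&mig_queue_lock);
    QSIMPLEQ_INSERT_TAIL(&done, r, next);
    qemu_mutex_unlock(&mig_queue_lock);
}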
Avi Kivity July 14, 2011, 12:32 p.m. UTC | #4
On 07/14/2011 03:30 PM, Anthony Liguori wrote:
>> Does this mean that the following code is sometimes executed without
>> qemu_mutex? I don't think any of it is thread safe.
>
>
> That was my reaction too.
>
> I think the most rational thing to do is have a separate thread and a 
> pair of producer/consumer queues.
>
> The I/O thread can push virtual addresses and sizes to the queue for 
> the migration thread to compress/write() to the fd.  The migration 
> thread can then push sent regions onto a separate queue for the I/O 
> thread to mark as dirty.

Even virtual addresses are not safe enough, because of hotunplug.  
Without some kind of locking, you have to copy the data.
Juan Quintela July 14, 2011, 3:30 p.m. UTC | #5
Avi Kivity <avi@redhat.com> wrote:
> On 07/14/2011 03:30 PM, Anthony Liguori wrote:
>>> Does this mean that the following code is sometimes executed without
>>> qemu_mutex? I don't think any of it is thread safe.
>>
>>
>> That was my reaction too.
>>
>> I think the most rational thing to do is have a separate thread and
>> a pair of producer/consumer queues.
>>
>> The I/O thread can push virtual addresses and sizes to the queue for
>> the migration thread to compress/write() to the fd.  The migration
>> thread can then push sent regions onto a separate queue for the I/O
>> thread to mark as dirty.
>
> Even virtual addresses are not safe enough, because of hotunplug.
> Without some kind of locking, you have to copy the data.

Disabling hotplug should be enough? Notice that hotplug/unplug during
migration doesn't make a lot of sense anyway.

Not all the bitmap syncing has proper locking now (copying towards one
place), but the rest of the code looks really thread safe to me (migration
code is only called from this thread, so it should be safe).

My understanding of how this works:

 vcpu thread modifies memory
 iothread (sometimes) modifies memory

 migration thread: reads memory, and takes the lock before syncing its
 bitmap with the kvm one and the qemu one (clearing them at the same time).

Assume we disable hotplug/unplug (which we have to do anyway).  What is
the locking problem that we have?

We do stage 3 with the iothread locked, i.e. at that point everything
else is stopped.  Before stage 3, can kvm or qemu modify a page and
_not_ modify the bitmap?  My understanding is that they cannot.

The only real variable that we are sharing is ram_list, or am I missing
something obvious?

Later, Juan.
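
Distilled into code, that model is what the patch below does in
ram_save_live() for stages 1 and 2 (a sketch; stage handling and
bandwidth limiting are elided):

int bytes_sent;

/* bitmap sync is the only part that needs the iothread lock */
qemu_mutex_lock_iothread();
cpu_physical_sync_dirty_bitmap(0, TARGET_PHYS_ADDR_MAX);  /* kvm -> qemu */
sync_migration_bitmap(0, TARGET_PHYS_ADDR_MAX);   /* qemu -> private copy */
qemu_mutex_unlock_iothread();

/* lock dropped: stream pages flagged in the private migration bitmap */
while ((bytes_sent = ram_save_block(f)) > 0) {
    /* ... */
}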
Avi Kivity July 14, 2011, 3:44 p.m. UTC | #6
On 07/14/2011 06:30 PM, Juan Quintela wrote:
> Avi Kivity <avi@redhat.com> wrote:
> >  On 07/14/2011 03:30 PM, Anthony Liguori wrote:
> >>>  Does this mean that the following code is sometimes executed without
> >>>  qemu_mutex? I don't think any of it is thread safe.
> >>
> >>
> >>  That was my reaction too.
> >>
> >>  I think the most rational thing to do is have a separate thread and
> >>  a pair of producer/consumer queues.
> >>
> >>  The I/O thread can push virtual addresses and sizes to the queue for
> >>  the migration thread to compress/write() to the fd.  The migration
> >>  thread can then push sent regions onto a separate queue for the I/O
> >>  thread to mark as dirty.
> >
> >  Even virtual addresses are not safe enough, because of hotunplug.
> >  Without some kind of locking, you have to copy the data.
>
> Disabling hotplug should be enough?

So is powering down the destination host.

> Notice that hotplug/unplug during
> migration doesn't make a lot of sense anyway.

That's completely wrong.  Hotplug is a guest/end-user operation; 
migration is a host/admin operation.  The two don't talk to each other 
at all - if the admin (usually a program) wants to migrate at the same 
time the user wants to hotplug, which one gets the bad news?  Who will 
actually test the combination?

It's true that with the current setup we can't really do migration and 
hotplug together, since we can't synchronize the hotplug on the 
destination with the migration.  But we should be able to, and we should 
design migration with that in mind.

> Not all the bitmap syncing has proper locking now (copying towards one
> place), but the rest of the code looks really thread safe to me (migration
> code is only called from this thread, so it should be safe).
>
> My understanding of how this works:
>
>   vcpu thread modifies memory
>   iothread (sometimes) modifies memory
>
>   migration thread: reads memory, and takes the lock before syncing its
>   bitmap with the kvm one and the qemu one (clearing them at the same time).
>
> Assume we disable hotplug/unplug (which we have to do anyway).  What is
> the locking problem that we have?

I didn't really grok the buffering of the migration bitmap.  It does 
look correct.  Would be best in a separate patch to point out the new 
mechanism (but that doesn't really excuse the bad review).

> We do stage 3 with the iothread locked, i.e. at that point everything
> else is stopped.  Before stage 3, can kvm or qemu modify a page and
> _not_ modify the bitmap?  My understanding is that they cannot.
>
> The only real variable that we are sharing is ram_list, or am I missing
> something obvious?

You are right.  ram_list _is_ volatile though (but we can't really 
change it these days during migration).
Juan Quintela July 14, 2011, 3:52 p.m. UTC | #7
Avi Kivity <avi@redhat.com> wrote:

>> Disabling hotplug should be enough?
>
> So is powering down the destination host.

O:-)  You see that I explained that later O:-)

>
>> Notice that hotplug/unplug during
>> migration doesn't make a lot of sense anyway.
>
> That's completely wrong.  Hotplug is a guest/end-user operation;
> migration is a host/admin operation.  The two don't talk to each other
> at all - if the admin (usually a program) wants to migrate at the same
> time the user wants to hotplug, which one gets the bad news?  Who will
> actually test the combination?

I am not sure if it makes sense, but to be able to allow hotplug during
migration we need to change lots of things.  It doesn't work today either,
so I think it is excessive to fix that in this patch.

> It's true that with the current setup we can't really do migration and
> hotplug together, since we can't synchronize the hotplug on the
> destination with the migration.  But we should be able to, and we
> should design migration with that in mind.

ok then.

>> Not all the bitmap syncing has proper locking now (copying towards one
>> place), but the rest of the code looks really thread safe to me (migration
>> code is only called from this thread, so it should be safe).
>>
>> My understanding of how this works:
>>
>>   vcpu thread modifies memory
>>   iothread (sometimes) modifies memory
>>
>>   migration thread: reads memory, and takes the lock before syncing its
>>   bitmap with the kvm one and the qemu one (clearing them at the same time).
>>
>> Assume we disable hotplug/unplug (which we have to do anyway).  What is
>> the locking problem that we have?
>
> I didn't really grok the buffering of the migration bitmap.  It does
> look correct.  Would be best in a separate patch to point out the new
> mechanism (but that doesn't really excuse the bad review).

I agree that the patch needs to be split into meaningful chunks.  As it is
today, it is difficult to review.

>> We do stage 3 with the iothread locked, i.e. at that point everything
>> else is stopped.  Before stage 3, can kvm or qemu modify a page and
>> _not_ modify the bitmap?  My understanding is that they cannot.
>>
>> The only real variable that we are sharing is ram_list, or am I missing
>> something obvious?
>
> You are right.  ram_list _is_ volatile though (but we can't really
> change it these days during migration).

I have thought a little bit about hotplug & migration, and haven't arrived
at a nice solution.

- Disabling hotplug/unplug during migration: easy to do.  But it is not
  exactly user friendly (we are here).

- Allowing hotplug during migration. Solutions:
  * allow the transfer of hotplug events over the migration protocol
    (make it even more complicated, but not too much.  The big problem is
    if the hotplug succeeds on the source but not the destination, ...)
  * migrate the device list in stage 3: it fixes the hotplug problem
    nicely, but it makes the interesting problem that after migrating
    all the ram, we can find "interesting" problems like: disk not
    readable, etc.  Not funny.
  * <insert your nice idea here>

As far as I can see, if we send the device list first, we can create the
full machine at the destination, but hotplug is "interesting" to manage.
Sending the device list late solves hotplug, but allows errors after
migrating all memory (also known as: why didn't you tell me *sooner*).

No clue about a nice & good solution.  For now, I think that we should
go with disabling hotplug/unplug during migration, until we get a better
idea.

Later, Juan.
Avi Kivity July 14, 2011, 4:07 p.m. UTC | #8
On 07/14/2011 06:52 PM, Juan Quintela wrote:
> >
> >>  Notice that hotplug/unplug during
> >>  migration don't make a lot of sense anyways.
> >
> >  That's completely wrong.  Hotplug is a guest/end-user operation;
> >  migration is a host/admin operation.  The two don't talk to each other
> >  at all - if the admin (usually a program) wants to migrate at the same
> >  time the user wants to hotplug, which one gets the bad news?  Who will
> >  actually test the combination?
>
> I am not sure if it makes sense, but to be able to allow hotplug during
> migration we need to change lots of things.  It doesn't work today either,
> so I think it is excessive to fix that in this patch.

Reluctantly agree.  We're adding another obstacle that we have to remove 
later, which I don't really like, but that's life.

> >
> >  You are right.  ram_list _is_ volatile though (but we can't really
> >  change it these days during migration).
>
> I have thought a little bit about hotplug & migration, and haven't arrived
> at a nice solution.
>
> - Disabling hotplug/unplug during migration: easy to do.  But it is not
>    exactly user friendly (we are here).
>
> - Allowing hotplug during migration. Solutions:
>    * allow the transfer of hotplug events over the migration protocol
>      (make it even more complicated, but not too much.  The big problem is
>      if the hotplug succeeds on the source but not the destination, ...)

If we transfer only the guest side of the device (and really, that's the 
only thing we can do), then nothing should ever fail, except for guest 
memory allocation.  If that happens, kill the target.  The source should 
recover.

Maybe we can do this via a magic subsection whose contents are the 
hotplug event.

>    * migrate the device list in stage 3: it fixes the hotplug problem
>      nicely, but it creates the problem that after migrating
>      all the ram, we can find "interesting" problems like: disk not
>      readable, etc.  Not funny.

Yes, that's not very nice.

>    * <insert your nice idea here>
>
> As far as I can see, if we send the device list first, we can create the
> full machine at the destination, but hotplug is "interesting" to manage.
> Sending the device list late solves hotplug, but allows errors after
> migrating all memory (also known as: why didn't you tell me *sooner*).
>
> No clue about a nice & good solution.  For now, I think that we should
> go with disabling hotplug/unplug during migration, until we get a better
> idea.

Yes.
Anthony Liguori July 14, 2011, 4:49 p.m. UTC | #9
On 07/14/2011 07:32 AM, Avi Kivity wrote:
> On 07/14/2011 03:30 PM, Anthony Liguori wrote:
>>> Does this mean that the following code is sometimes executed without
>>> qemu_mutex? I don't think any of it is thread safe.
>>
>>
>> That was my reaction too.
>>
>> I think the most rational thing to do is have a separate thread and a
>> pair of producer/consumer queues.
>>
>> The I/O thread can push virtual addresses and sizes to the queue for
>> the migration thread to compress/write() to the fd. The migration
>> thread can then push sent regions onto a separate queue for the I/O
>> thread to mark as dirty.
>
> Even virtual addresses are not safe enough, because of hotunplug.
> Without some kind of locking, you have to copy the data.

We don't know yet how we're going to implement hot unplug so let's not 
worry about this for now.

I think a reference count based approach is really the only sane thing 
to do and if we did that, it wouldn't be a problem since the reference 
would be owned by the I/O thread and would live until the migration 
thread is done with the VA.

Regards,

Anthony Liguori

>
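
A sketch of what that reference-count approach might look like
(entirely hypothetical; no such reference counting exists in the tree):

/* The I/O thread takes a reference before queueing a VA for the
 * migration thread; hot-unplug drops its own reference, so the block
 * is only freed once the migration thread is done with it. */
typedef struct RAMBlockRef {
    RAMBlock *block;
    int refcount;                       /* protected by qemu_mutex */
} RAMBlockRef;

static void ramblock_ref(RAMBlockRef *r)
{
    qemu_mutex_lock_iothread();
    r->refcount++;
    qemu_mutex_unlock_iothread();
}

static void ramblock_unref(RAMBlockRef *r)
{
    qemu_mutex_lock_iothread();
    if (--r->refcount == 0) {
        qemu_ram_free(r->block->offset);    /* the deferred unplug */
    }
    qemu_mutex_unlock_iothread();
}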
Avi Kivity July 14, 2011, 4:59 p.m. UTC | #10
On 07/14/2011 07:49 PM, Anthony Liguori wrote:
>
> I think a reference count based approach is really the only sane thing 
> to do and if we did that, it wouldn't be a problem since the reference 
> would be owned by the I/O thread and would live until the migration 
> thread is done with the VA.
>

I was thinking about RCU.
Paolo Bonzini July 15, 2011, 7:59 a.m. UTC | #11
On 07/14/2011 06:07 PM, Avi Kivity wrote:
>
> Maybe we can do this via a magic subsection whose contents are the
> hotplug event.

What about making the device list just another "thing" that has to be 
migrated live, together with block and ram?

Paolo
Anthony Liguori July 15, 2011, 9:09 p.m. UTC | #12
On 07/15/2011 02:59 AM, Paolo Bonzini wrote:
> On 07/14/2011 06:07 PM, Avi Kivity wrote:
>>
>> Maybe we can do this via a magic subsection whose contents are the
>> hotplug event.
>
> What about making the device list just another "thing" that has to be
> migrated live, together with block and ram?

In an ideal world, you would only create the backends on the destination 
node and all of the devices would be created through the migration process.

Regards,

Anthony Liguori

> Paolo
>
Avi Kivity July 17, 2011, 8:39 a.m. UTC | #13
On 07/15/2011 10:59 AM, Paolo Bonzini wrote:
> On 07/14/2011 06:07 PM, Avi Kivity wrote:
>>
>> Maybe we can do this via a magic subsection whose contents are the
>> hotplug event.
>
> What about making the device list just another "thing" that has to be 
> migrated live, together with block and ram?

Excellent idea.

Sticky points:
- if a new device includes RAM, the device update must precede the 
migration of the new RAM section
- the reverse for unplug
- need better decoupling between host and guest state
Markus Armbruster July 18, 2011, 7:08 a.m. UTC | #14
Juan Quintela <quintela@redhat.com> writes:

[...]
> I have thought a little bit about hotplug & migration, and haven't arrived
> at a nice solution.
>
> - Disabling hotplug/unplug during migration: easy to do.  But it is not
>   exactly user friendly (we are here).
>
> - Allowing hotplug during migration. Solutions:
>   * allow the transfer of hotplug events over the migration protocol
>     (make it even more complicated, but not too much.  The big problem is
>     if the hotplug succeeds on the source but not the destination, ...)
>   * migrate the device list in stage 3: it fixes the hotplug problem
>     nicely, but it creates the problem that after migrating
>     all the ram, we can find "interesting" problems like: disk not
>     readable, etc.  Not funny.
>   * <insert your nice idea here>
>
> As far as I can see, if we send the device list first, we can create the
> full machine at the destination, but hotplug is "interesting" to manage.
> Sending the device list late solves hotplug, but allows errors after
> migrating all memory (also known as: why didn't you tell me *sooner*).

I figure the errors relevant here happen in device backends (host parts)
mostly.

Maybe updating just backends is easier than full device hot plug.
Configure backends before migrating memory, to catch errors.
Reconfigure backends afterwards for hot plug[*].  Then build machine.

You still get errors from frontends (device models) after migrating
memory, but they should be rare.

[...]

[*] You could do it "in the middle" to catch errors as early as
possible, but I doubt it's worth the trouble.

Patch

diff --git a/arch_init.c b/arch_init.c
index 484b39d..f18dda2 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -123,13 +123,13 @@  static int ram_save_block(QEMUFile *f)
     current_addr = block->offset + offset;
 
     do {
-        if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
+        if (migration_bitmap_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
             uint8_t *p;
             int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;
 
-            cpu_physical_memory_reset_dirty(current_addr,
-                                            current_addr + TARGET_PAGE_SIZE,
-                                            MIGRATION_DIRTY_FLAG);
+            migration_bitmap_reset_dirty(current_addr,
+                                         current_addr + TARGET_PAGE_SIZE,
+                                         MIGRATION_DIRTY_FLAG);
 
             p = block->host + offset;
 
@@ -185,7 +185,7 @@  static ram_addr_t ram_save_remaining(void)
         ram_addr_t addr;
         for (addr = block->offset; addr < block->offset + block->length;
              addr += TARGET_PAGE_SIZE) {
-            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+            if (migration_bitmap_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
                 count++;
             }
         }
@@ -260,10 +260,15 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         return 0;
     }
 
+    if (stage != 3)
+        qemu_mutex_lock_iothread();
     if (cpu_physical_sync_dirty_bitmap(0, TARGET_PHYS_ADDR_MAX) != 0) {
         qemu_file_set_error(f);
         return 0;
     }
+    sync_migration_bitmap(0, TARGET_PHYS_ADDR_MAX);
+    if (stage != 3)
+        qemu_mutex_unlock_iothread();
 
     if (stage == 1) {
         RAMBlock *block;
@@ -276,9 +281,9 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         QLIST_FOREACH(block, &ram_list.blocks, next) {
             for (addr = block->offset; addr < block->offset + block->length;
                  addr += TARGET_PAGE_SIZE) {
-                if (!cpu_physical_memory_get_dirty(addr,
+                if (!migration_bitmap_get_dirty(addr,
                                                    MIGRATION_DIRTY_FLAG)) {
-                    cpu_physical_memory_set_dirty(addr);
+                    migration_bitmap_set_dirty(addr);
                 }
             }
         }
diff --git a/buffered_file.c b/buffered_file.c
index 41b42c3..e05efe8 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -237,7 +237,7 @@  static void buffered_rate_tick(void *opaque)
         return;
     }
 
-    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
+    qemu_mod_timer(s->timer, qemu_get_clock_ms(migration_clock) + 100);
 
     if (s->freeze_output)
         return;
@@ -246,8 +246,8 @@  static void buffered_rate_tick(void *opaque)
 
     buffered_flush(s);
 
-    /* Add some checks around this */
     s->put_ready(s->opaque);
+    usleep(qemu_timer_difference(s->timer, migration_clock) * 1000);
 }
 
 QEMUFile *qemu_fopen_ops_buffered(void *opaque,
@@ -271,11 +271,11 @@  QEMUFile *qemu_fopen_ops_buffered(void *opaque,
     s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
                              buffered_close, buffered_rate_limit,
                              buffered_set_rate_limit,
-			     buffered_get_rate_limit);
+                             buffered_get_rate_limit);
 
-    s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s);
+    s->timer = qemu_new_timer_ms(migration_clock, buffered_rate_tick, s);
 
-    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
+    qemu_mod_timer(s->timer, qemu_get_clock_ms(migration_clock) + 100);
 
     return s->file;
 }
diff --git a/cpu-all.h b/cpu-all.h
index e839100..80ce601 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -932,6 +932,7 @@  typedef struct RAMBlock {
 
 typedef struct RAMList {
     uint8_t *phys_dirty;
+    uint8_t *migration_bitmap;
     QLIST_HEAD(ram, RAMBlock) blocks;
 } RAMList;
 extern RAMList ram_list;
@@ -1004,8 +1005,44 @@  static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
     }
 }
 
+
+
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags);
+
+static inline int migration_bitmap_get_dirty(ram_addr_t addr,
+                                             int dirty_flags)
+{
+    return ram_list.migration_bitmap[addr >> TARGET_PAGE_BITS] & dirty_flags;
+}
+
+static inline void migration_bitmap_set_dirty(ram_addr_t addr)
+{
+    ram_list.migration_bitmap[addr >> TARGET_PAGE_BITS] = 0xff;
+}
+
+static inline void migration_bitmap_mask_dirty_range(ram_addr_t start,
+                                                     int length,
+                                                     int dirty_flags)
+{
+    int i, mask, len;
+    uint8_t *p;
+
+    len = length >> TARGET_PAGE_BITS;
+    mask = ~dirty_flags;
+    p = ram_list.migration_bitmap + (start >> TARGET_PAGE_BITS);
+    for (i = 0; i < len; i++) {
+        p[i] &= mask;
+    }
+}
+
+
+void migration_bitmap_reset_dirty(ram_addr_t start,
+                                  ram_addr_t end,
+                                  int dirty_flags);
+
+void sync_migration_bitmap(ram_addr_t start, ram_addr_t end);
+
 void cpu_tlb_update_dirty(CPUState *env);
 
 int cpu_physical_memory_set_dirty_tracking(int enable);
diff --git a/exec.c b/exec.c
index 0e2ce57..86f56c4 100644
--- a/exec.c
+++ b/exec.c
@@ -2106,6 +2106,9 @@  void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
         abort();
     }
 
+    if (kvm_enabled())
+       return;
+
     for(env = first_cpu; env != NULL; env = env->next_cpu) {
         int mmu_idx;
         for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
@@ -2114,8 +2117,58 @@  void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                       start1, length);
         }
     }
+
 }
 
+void migration_bitmap_reset_dirty(ram_addr_t start, ram_addr_t end,
+                                  int dirty_flags)
+{
+    unsigned long length, start1;
+
+    start &= TARGET_PAGE_MASK;
+    end = TARGET_PAGE_ALIGN(end);
+
+    length = end - start;
+    if (length == 0)
+        return;
+    migration_bitmap_mask_dirty_range(start, length, dirty_flags);
+
+    /* we modify the TLB cache so that the dirty bit will be set again
+       when accessing the range */
+    start1 = (unsigned long)qemu_safe_ram_ptr(start);
+    /* Check that we don't span multiple blocks - this breaks the
+       address comparisons below.  */
+    if ((unsigned long)qemu_safe_ram_ptr(end - 1) - start1
+            != (end - 1) - start) {
+        abort();
+    }
+}
+
+void sync_migration_bitmap(ram_addr_t start, ram_addr_t end)
+{
+    unsigned long length, len, i;
+    ram_addr_t addr;
+    start &= TARGET_PAGE_MASK;
+    end = TARGET_PAGE_ALIGN(end);
+
+    length = end - start;
+    if (length == 0)
+        return;
+
+    len = length >> TARGET_PAGE_BITS;
+    for (i = 0; i < len; i++) {
+        if (ram_list.phys_dirty[i] & MIGRATION_DIRTY_FLAG) {
+            addr = i << TARGET_PAGE_BITS;
+            migration_bitmap_set_dirty(addr);
+            cpu_physical_memory_reset_dirty(addr, addr + TARGET_PAGE_SIZE,
+                                            MIGRATION_DIRTY_FLAG);
+        }
+    }
+
+}
+
+
+
 int cpu_physical_memory_set_dirty_tracking(int enable)
 {
     int ret = 0;
@@ -2979,6 +3032,12 @@  ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
     memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
            0xff, size >> TARGET_PAGE_BITS);
 
+    ram_list.migration_bitmap = qemu_realloc(ram_list.phys_dirty,
+                                       last_ram_offset() >> TARGET_PAGE_BITS);
+    memset(ram_list.migration_bitmap + (new_block->offset >> TARGET_PAGE_BITS),
+           0xff, size >> TARGET_PAGE_BITS);
+
+
     if (kvm_enabled())
         kvm_setup_guest_memory(new_block->host, size);
 
diff --git a/migration-exec.c b/migration-exec.c
index 4b7aad8..5085703 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -13,6 +13,7 @@ 
  *
  */
 
+#include "qemu-thread.h"
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "migration.h"
@@ -117,13 +118,21 @@  err_after_alloc:
     return NULL;
 }
 
-static void exec_accept_incoming_migration(void *opaque)
+static void *exec_incoming_migration_thread(void *opaque)
 {
     QEMUFile *f = opaque;
-
     process_incoming_migration(f);
-    qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
     qemu_fclose(f);
+    return NULL;
+}
+
+static void exec_accept_incoming_migration(void *opaque)
+{
+    QEMUFile *f = opaque;
+    struct QemuThread migrate_incoming_thread;
+    qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
+    qemu_thread_create(&migrate_incoming_thread, exec_incoming_migration_thread,
+                       f);
 }
 
 int exec_start_incoming_migration(const char *command)
@@ -138,7 +147,7 @@  int exec_start_incoming_migration(const char *command)
     }
 
     qemu_set_fd_handler2(qemu_stdio_fd(f), NULL,
-			 exec_accept_incoming_migration, NULL, f);
+                         exec_accept_incoming_migration, NULL, f);
 
     return 0;
 }
diff --git a/migration-fd.c b/migration-fd.c
index 66d51c1..4220566 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -11,6 +11,7 @@ 
  *
  */
 
+#include "qemu-thread.h"
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "migration.h"
@@ -100,13 +101,23 @@  err_after_alloc:
     return NULL;
 }
 
+
+static void *fd_incoming_migration_thread(void *opaque)
+{
+    QEMUFile *f = opaque;
+    process_incoming_migration(f);
+    qemu_fclose(f);
+    return NULL;
+}
+
 static void fd_accept_incoming_migration(void *opaque)
 {
     QEMUFile *f = opaque;
+    struct QemuThread migrate_incoming_thread;
 
-    process_incoming_migration(f);
     qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
-    qemu_fclose(f);
+    qemu_thread_create(&migrate_incoming_thread, fd_incoming_migration_thread,
+                       f);
 }
 
 int fd_start_incoming_migration(const char *infd)
diff --git a/migration-tcp.c b/migration-tcp.c
index d3d80c9..4ef58c7 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -11,6 +11,7 @@ 
  *
  */
 
+#include "qemu-thread.h"
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "migration.h"
@@ -65,11 +66,9 @@  static void tcp_wait_for_connect(void *opaque)
         return;
     }
 
-    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
-
-    if (val == 0)
+    if (val == 0) {
         migrate_fd_connect(s);
-    else {
+    } else {
         DPRINTF("error connecting %d\n", val);
         migrate_fd_error(s);
     }
@@ -79,8 +78,8 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,
                                              const char *host_port,
                                              int64_t bandwidth_limit,
                                              int detach,
-					     int blk,
-					     int inc)
+                                             int blk,
+                                             int inc)
 {
     struct sockaddr_in addr;
     FdMigrationState *s;
@@ -111,7 +110,7 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,
     }
 
     socket_set_nonblock(s->fd);
-
+    
     if (!detach) {
         migrate_fd_monitor_suspend(s, mon);
     }
@@ -121,20 +120,22 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,
         if (ret == -1)
             ret = -(s->get_error(s));
 
-        if (ret == -EINPROGRESS || ret == -EWOULDBLOCK)
-            qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
     } while (ret == -EINTR);
 
     if (ret < 0 && ret != -EINPROGRESS && ret != -EWOULDBLOCK) {
         DPRINTF("connect failed\n");
         migrate_fd_error(s);
-    } else if (ret >= 0)
+    } else if (ret >= 0) {
         migrate_fd_connect(s);
+    } else { 
+        migrate_fd_wait_for_unfreeze(s);
+        tcp_wait_for_connect(s);
+    }
 
     return &s->mig_state;
 }
 
-static void tcp_accept_incoming_migration(void *opaque)
+static void *tcp_accept_incoming_migration(void *opaque)
 {
     struct sockaddr_in addr;
     socklen_t addrlen = sizeof(addr);
@@ -142,6 +143,8 @@  static void tcp_accept_incoming_migration(void *opaque)
     QEMUFile *f;
     int c;
 
+    migrate_fd_wait(s);
+
     do {
         c = qemu_accept(s, (struct sockaddr *)&addr, &addrlen);
     } while (c == -1 && socket_error() == EINTR);
@@ -164,15 +167,16 @@  static void tcp_accept_incoming_migration(void *opaque)
 out:
     close(c);
 out2:
-    qemu_set_fd_handler2(s, NULL, NULL, NULL, NULL);
     close(s);
+    return NULL;
 }
 
 int tcp_start_incoming_migration(const char *host_port)
 {
     struct sockaddr_in addr;
     int val;
-    int s;
+    struct QemuThread migrate_incoming_thread;
+    static int s;
 
     if (parse_host_port(&addr, host_port) < 0) {
         fprintf(stderr, "invalid host/port combination: %s\n", host_port);
@@ -192,8 +196,8 @@  int tcp_start_incoming_migration(const char *host_port)
     if (listen(s, 1) == -1)
         goto err;
 
-    qemu_set_fd_handler2(s, NULL, tcp_accept_incoming_migration, NULL,
-                         (void *)(intptr_t)s);
+    qemu_thread_create(&migrate_incoming_thread, tcp_accept_incoming_migration,
+                       (void *)(unsigned long)s);
 
     return 0;
 
diff --git a/migration-unix.c b/migration-unix.c
index c8625c7..9ba2944 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -11,6 +11,7 @@ 
  *
  */
 
+#include "qemu-thread.h"
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "migration.h"
@@ -64,8 +65,6 @@  static void unix_wait_for_connect(void *opaque)
         return;
     }
 
-    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
-
     if (val == 0)
         migrate_fd_connect(s);
     else {
@@ -116,13 +115,14 @@  MigrationState *unix_start_outgoing_migration(Monitor *mon,
         if (ret == -1)
 	    ret = -(s->get_error(s));
 
-        if (ret == -EINPROGRESS || ret == -EWOULDBLOCK)
-	    qemu_set_fd_handler2(s->fd, NULL, NULL, unix_wait_for_connect, s);
     } while (ret == -EINTR);
 
     if (ret < 0 && ret != -EINPROGRESS && ret != -EWOULDBLOCK) {
         DPRINTF("connect failed\n");
         goto err_after_open;
+    } else if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
+        migrate_fd_wait_for_unfreeze(s);
+        unix_wait_for_connect(s);
     }
 
     if (!detach) {
@@ -142,7 +142,7 @@  err_after_alloc:
     return NULL;
 }
 
-static void unix_accept_incoming_migration(void *opaque)
+static void *unix_accept_incoming_migration(void *opaque)
 {
     struct sockaddr_un addr;
     socklen_t addrlen = sizeof(addr);
@@ -150,6 +150,8 @@  static void unix_accept_incoming_migration(void *opaque)
     QEMUFile *f;
     int c;
 
+    migrate_fd_wait(s);
+
     do {
         c = qemu_accept(s, (struct sockaddr *)&addr, &addrlen);
     } while (c == -1 && socket_error() == EINTR);
@@ -158,7 +160,7 @@  static void unix_accept_incoming_migration(void *opaque)
 
     if (c == -1) {
         fprintf(stderr, "could not accept migration connection\n");
-        return;
+        return NULL;
     }
 
     f = qemu_fopen_socket(c);
@@ -170,15 +172,16 @@  static void unix_accept_incoming_migration(void *opaque)
     process_incoming_migration(f);
     qemu_fclose(f);
 out:
-    qemu_set_fd_handler2(s, NULL, NULL, NULL, NULL);
     close(s);
     close(c);
+    return NULL;
 }
 
 int unix_start_incoming_migration(const char *path)
 {
     struct sockaddr_un un;
-    int sock;
+    struct QemuThread migrate_incoming_thread;
+    static int sock;
 
     DPRINTF("Attempting to start an incoming migration\n");
 
@@ -202,8 +205,8 @@  int unix_start_incoming_migration(const char *path)
         goto err;
     }
 
-    qemu_set_fd_handler2(sock, NULL, unix_accept_incoming_migration, NULL,
-			 (void *)(intptr_t)sock);
+    qemu_thread_create(&migrate_incoming_thread, unix_accept_incoming_migration,
+                       (void *)(unsigned long)sock);
 
     return 0;
 
diff --git a/migration.c b/migration.c
index af3a1f2..5ada647 100644
--- a/migration.c
+++ b/migration.c
@@ -12,6 +12,8 @@ 
  */
 
 #include "qemu-common.h"
+#include "qemu-thread.h"
+#include "qemu-timer.h"
 #include "migration.h"
 #include "monitor.h"
 #include "buffered_file.h"
@@ -35,6 +37,7 @@ 
 static int64_t max_throttle = (32 << 20);
 
 static MigrationState *current_migration;
+char host_port[50];
 
 static NotifierList migration_state_notifiers =
     NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -284,8 +287,6 @@  int migrate_fd_cleanup(FdMigrationState *s)
 {
     int ret = 0;
 
-    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
-
     if (s->file) {
         DPRINTF("closing file\n");
         if (qemu_fclose(s->file) != 0) {
@@ -307,14 +308,6 @@  int migrate_fd_cleanup(FdMigrationState *s)
     return ret;
 }
 
-void migrate_fd_put_notify(void *opaque)
-{
-    FdMigrationState *s = opaque;
-
-    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
-    qemu_file_put_notify(s->file);
-}
-
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 {
     FdMigrationState *s = opaque;
@@ -327,9 +320,7 @@  ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
     if (ret == -1)
         ret = -(s->get_error(s));
 
-    if (ret == -EAGAIN) {
-        qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s);
-    } else if (ret < 0) {
+    if (ret < 0 && ret != -EAGAIN) {
         if (s->mon) {
             monitor_resume(s->mon);
         }
@@ -340,10 +331,27 @@  ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
     return ret;
 }
 
-void migrate_fd_connect(FdMigrationState *s)
+void *migrate_run_timers(void *arg)
 {
+    FdMigrationState *s = arg;
     int ret;
+    ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
+            s->mig_state.shared);
+    if (ret < 0) {
+        DPRINTF("failed, %d\n", ret);
+        migrate_fd_error(s);
+        return NULL;
+    }
+
+    migrate_fd_put_ready(s);
+    while (s->state == MIG_STATE_ACTIVE)
+        qemu_run_timers(migration_clock);
+    return NULL;
+}
 
+void migrate_fd_connect(FdMigrationState *s)
+{
+    struct QemuThread migrate_thread;
     s->file = qemu_fopen_ops_buffered(s,
                                       s->bandwidth_limit,
                                       migrate_fd_put_buffer,
@@ -352,15 +360,7 @@  void migrate_fd_connect(FdMigrationState *s)
                                       migrate_fd_close);
 
     DPRINTF("beginning savevm\n");
-    ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
-                                  s->mig_state.shared);
-    if (ret < 0) {
-        DPRINTF("failed, %d\n", ret);
-        migrate_fd_error(s);
-        return;
-    }
-    
-    migrate_fd_put_ready(s);
+    qemu_thread_create(&migrate_thread, migrate_run_timers, s);
 }
 
 void migrate_fd_put_ready(void *opaque)
@@ -376,8 +376,7 @@  void migrate_fd_put_ready(void *opaque)
     if (qemu_savevm_state_iterate(s->mon, s->file) == 1) {
         int state;
         int old_vm_running = vm_running;
-
-        DPRINTF("done iterating\n");
+        qemu_mutex_lock_iothread();
         vm_stop(VMSTOP_MIGRATE);
 
         if ((qemu_savevm_state_complete(s->mon, s->file)) < 0) {
@@ -396,6 +395,10 @@  void migrate_fd_put_ready(void *opaque)
         }
         s->state = state;
         notifier_list_notify(&migration_state_notifiers);
+        qemu_mutex_unlock_iothread();
+    } else {
+        migrate_fd_wait_for_unfreeze(s);
+        qemu_file_put_notify(s->file);
     }
 }
 
@@ -454,11 +457,23 @@  void migrate_fd_wait_for_unfreeze(void *opaque)
     } while (ret == -1 && (s->get_error(s)) == EINTR);
 }
 
+void migrate_fd_wait(int fd)
+{
+    int ret;
+    do {
+        fd_set rfds;
+
+        FD_ZERO(&rfds);
+        FD_SET(fd, &rfds);
+
+        ret = select(fd + 1, &rfds, NULL, NULL, NULL);
+    } while (ret == -1);
+}
+
 int migrate_fd_close(void *opaque)
 {
     FdMigrationState *s = opaque;
 
-    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
     return s->close(s);
 }
 
diff --git a/migration.h b/migration.h
index 050c56c..2f46d9e 100644
--- a/migration.h
+++ b/migration.h
@@ -72,6 +72,8 @@  void do_info_migrate(Monitor *mon, QObject **ret_data);
 
 int exec_start_incoming_migration(const char *host_port);
 
+void *migrate_run_timers(void *);
+
 MigrationState *exec_start_outgoing_migration(Monitor *mon,
                                               const char *host_port,
 					      int64_t bandwidth_limit,
@@ -112,8 +114,6 @@  void migrate_fd_error(FdMigrationState *s);
 
 int migrate_fd_cleanup(FdMigrationState *s);
 
-void migrate_fd_put_notify(void *opaque);
-
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
 void migrate_fd_connect(FdMigrationState *s);
@@ -128,6 +128,8 @@  void migrate_fd_release(MigrationState *mig_state);
 
 void migrate_fd_wait_for_unfreeze(void *opaque);
 
+void migrate_fd_wait(int fd);
+
 int migrate_fd_close(void *opaque);
 
 static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
diff --git a/qemu-timer.c b/qemu-timer.c
index 72066c7..3a0b114 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -144,6 +144,7 @@  void cpu_disable_ticks(void)
 #define QEMU_CLOCK_REALTIME 0
 #define QEMU_CLOCK_VIRTUAL  1
 #define QEMU_CLOCK_HOST     2
+#define QEMU_CLOCK_MIGRATE  3
 
 struct QEMUClock {
     int type;
@@ -364,9 +365,10 @@  next:
     }
 }
 
-#define QEMU_NUM_CLOCKS 3
+#define QEMU_NUM_CLOCKS 4
 
 QEMUClock *rt_clock;
+QEMUClock *migration_clock;
 QEMUClock *vm_clock;
 QEMUClock *host_clock;
 
@@ -561,12 +563,30 @@  int qemu_timer_pending(QEMUTimer *ts)
     return 0;
 }
 
+int64_t qemu_timer_difference(QEMUTimer *ts, QEMUClock *clock)
+{
+    int64_t expire_time, current_time;
+    QEMUTimer *t;
+
+    current_time = qemu_get_clock_ms(clock);
+    for(t = active_timers[clock->type]; t != NULL; t = t->next) {
+        if (t == ts) {
+            expire_time = ts->expire_time / SCALE_MS;
+            if (current_time >= expire_time)
+                return 0;
+            else
+                return expire_time - current_time;
+        }
+    }
+    return 0;
+}
+
 int qemu_timer_expired(QEMUTimer *timer_head, int64_t current_time)
 {
     return qemu_timer_expired_ns(timer_head, current_time * timer_head->scale);
 }
 
-static void qemu_run_timers(QEMUClock *clock)
+void qemu_run_timers(QEMUClock *clock)
 {
     QEMUTimer **ptimer_head, *ts;
     int64_t current_time;
@@ -595,6 +615,9 @@  int64_t qemu_get_clock_ns(QEMUClock *clock)
     switch(clock->type) {
     case QEMU_CLOCK_REALTIME:
         return get_clock();
+
+    case QEMU_CLOCK_MIGRATE:
+        return get_clock();
     default:
     case QEMU_CLOCK_VIRTUAL:
         if (use_icount) {
@@ -610,6 +633,7 @@  int64_t qemu_get_clock_ns(QEMUClock *clock)
 void init_clocks(void)
 {
     rt_clock = qemu_new_clock(QEMU_CLOCK_REALTIME);
+    migration_clock = qemu_new_clock(QEMU_CLOCK_MIGRATE);
     vm_clock = qemu_new_clock(QEMU_CLOCK_VIRTUAL);
     host_clock = qemu_new_clock(QEMU_CLOCK_HOST);
 
diff --git a/qemu-timer.h b/qemu-timer.h
index 06cbe20..014b70b 100644
--- a/qemu-timer.h
+++ b/qemu-timer.h
@@ -23,6 +23,7 @@  typedef void QEMUTimerCB(void *opaque);
    machine is stopped. The real time clock has a frequency of 1000
    Hz. */
 extern QEMUClock *rt_clock;
+extern QEMUClock *migration_clock;
 
 /* The virtual clock is only run during the emulation. It is stopped
    when the virtual machine is stopped. Virtual timers use a high
@@ -45,7 +46,9 @@  QEMUTimer *qemu_new_timer(QEMUClock *clock, int scale,
 void qemu_free_timer(QEMUTimer *ts);
 void qemu_del_timer(QEMUTimer *ts);
 void qemu_mod_timer(QEMUTimer *ts, int64_t expire_time);
+void qemu_run_timers(QEMUClock *clock);
 int qemu_timer_pending(QEMUTimer *ts);
+int64_t qemu_timer_difference(QEMUTimer *ts, QEMUClock *);
 int qemu_timer_expired(QEMUTimer *timer_head, int64_t current_time);
 
 void qemu_run_all_timers(void);