
migration: Introduce migration_in_completion()

Message ID 012c01d110a6$cd61cf90$68256eb0$@samsung.com
State New

Commit Message

Pavel Fedin Oct. 27, 2015, 11:01 a.m. UTC
This makes it possible to signal migration notifiers that the migration has
entered its final phase. The condition is set after vm_stop_force_state().

This will be necessary for ITS live migration on ARM, which will have to
dump its state into guest RAM at this point.

Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
---
 include/migration/migration.h | 2 ++
 migration/migration.c         | 7 +++++++
 2 files changed, 9 insertions(+)

Comments

Juan Quintela Oct. 27, 2015, 1:41 p.m. UTC | #1
Pavel Fedin <p.fedin@samsung.com> wrote:
> This makes it possible to signal migration notifiers that the migration has
> entered its final phase. The condition is set after vm_stop_force_state().
>
> This will be necessary for ITS live migration on ARM, which will have to
> dump its state into guest RAM at this point.
>
> Signed-off-by: Pavel Fedin <p.fedin@samsung.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Applied, thanks.

I wonder why it has never been needed before.  BTW, what is ITS?
Pavel Fedin Oct. 27, 2015, 2:03 p.m. UTC | #2
Hello!

> Applied, thanks.

 Thank you.

> I wonder why it has never been needed before.  BTW, what is ITS?

 ITS is the Interrupt Translation Service, used by the ARM64 architecture. In
short, it is special hardware that performs a table-based lookup of MSI IDs and
their destination CPUs and routes the interrupts accordingly. There can be lots
of CPUs and lots of interrupts, so these tables are large and are held in
system RAM.
 The QEMU implementation of this device needs to be able to flush its internal
cached state into memory for live migration. This has to be done during the
final stage, right after the CPUs have been stopped, because right after this
point the migration code rechecks the RAM status and sends out the modified
data for the last time.
 The implementation itself has to wait, because the necessary KVM API is not
ready yet either, but I prefer to upstream as much of the infrastructure as
possible, because: a) we want to reduce the amount of out-of-tree patches for
our project; b) these things affect QEMU core code, and upstreaming them makes
sure that we do it right and that our out-of-tree code does not diverge too
much from upstream.
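
 For illustration, the device-side hook would be roughly as below (untested
sketch; migration_in_completion() comes from this patch, while the KVM device
attribute names and its_dev_fd are made up, since that API is not upstream
yet):

#include "migration/migration.h"
#include "sysemu/kvm.h"

static int its_dev_fd;   /* KVM device fd, obtained at realize time */

static void its_migration_state_notify(Notifier *notifier, void *data)
{
    MigrationState *s = data;
    struct kvm_device_attr attr = {
        .group = KVM_DEV_ARM_ITS_GRP_CTRL,     /* placeholder name */
        .attr  = KVM_DEV_ARM_ITS_SAVE_TABLES,  /* placeholder name */
    };

    if (!migration_in_completion(s)) {
        return;
    }

    /*
     * CPUs are already stopped here and the final RAM pass has not run yet,
     * so whatever the kernel dirties now is still sent to the destination.
     */
    kvm_device_ioctl(its_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}

static Notifier its_migration_notifier = {
    .notify = its_migration_state_notify,
};

/* ...registered at device realize time with:
 *     add_migration_state_change_notifier(&its_migration_notifier);
 */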

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Juan Quintela Oct. 28, 2015, 9:58 a.m. UTC | #3
Pavel Fedin <p.fedin@samsung.com> wrote:
>  Hello!
>
>> Applied, thanks.
>
>  Thank you.
>
>> I wonder why it has never been needed before.  BTW, what is ITS?
>
>  ITS is the Interrupt Translation Service, used by the ARM64
> architecture. In short, it is special hardware that performs a
> table-based lookup of MSI IDs and their destination CPUs and routes
> the interrupts accordingly. There can be lots of CPUs and lots of
> interrupts, so these tables are large and are held in system RAM.
>  The QEMU implementation of this device needs to be able to flush its
> internal cached state into memory for live migration. This has to be
> done during the final stage, right after the CPUs have been stopped,
> because right after this point the migration code rechecks the RAM
> status and sends out the modified data for the last time.
>  The implementation itself has to wait, because the necessary KVM API
> is not ready yet either, but I prefer to upstream as much of the
> infrastructure as possible, because: a) we want to reduce the amount
> of out-of-tree patches for our project; b) these things affect QEMU
> core code, and upstreaming them makes sure that we do it right and
> that our out-of-tree code does not diverge too much from upstream.

The Power people have a similar problem with their hashed page tables: they
integrated their own save_live implementation because the tables are too big
for the last stage.  You can look there for inspiration.
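
The shape to copy is roughly this (untested sketch from memory; the its_*
functions and the "arm_its" idstr are made up for illustration):

#include "migration/vmstate.h"   /* SaveVMHandlers, register_savevm_live */

static int its_save_setup(QEMUFile *f, void *opaque)
{
    return 0;                    /* start tracking table changes */
}

static int its_save_iterate(QEMUFile *f, void *opaque)
{
    return 1;                    /* > 0 means "nothing left to send" */
}

static int its_save_complete(QEMUFile *f, void *opaque)
{
    return 0;                    /* final flush, CPUs already stopped */
}

static int its_load(QEMUFile *f, void *opaque, int version_id)
{
    return 0;
}

static SaveVMHandlers savevm_its_handlers = {
    .save_live_setup    = its_save_setup,
    .save_live_iterate  = its_save_iterate,
    .save_live_complete = its_save_complete,
    .load_state         = its_load,
};

/* Registered once, e.g. at device realize time:
 *     register_savevm_live(NULL, "arm_its", -1, 1, &savevm_its_handlers, NULL);
 */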

How big are we talking here?

Thanks, Juan.

>
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
Pavel Fedin Oct. 28, 2015, 10:27 a.m. UTC | #4
Hello!

> Power people have a similar problem with its hashed page tables, they
> integrated their own save_live implementation because they are too big
> for the last stage.  You can look there for inspiration.

 Hm, I'll take a look...

> How big are we talking here?

 The tables are normally large (about 640K on my VM), but extremely sparse. So
the space actually modified during this callback would be a few pages, hardly
more.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin Oct. 28, 2015, 11:36 a.m. UTC | #5
Hello!

> The Power people have a similar problem with their hashed page tables: they
> integrated their own save_live implementation because the tables are too big
> for the last stage.  You can look there for inspiration.

 I examined their code. Interesting, and indeed it opens up a way to decrease
downtime by implementing iterative migration for the ITS.
 However, this is not really what is needed here. That mechanism is designed to
produce its own data chunk, which does not suit the ITS. The ITS already stores
everything in system RAM, so savevm_ram_handlers already take perfect care of
that data. The only thing left to do is to tell the ITS to dump its state into
RAM, which is what I currently do using migration_in_completion().
 An alternative, perhaps better, approach would be to hook into
ram_save_iterate() and ram_save_complete(). This way we could kick the ITS
right before attempting to migrate RAM.
 Could we extend the infrastructure so that:
a) handlers are prioritized, and we can determine the order of their execution?
b) we can choose whether our handlers actually produce an extra chunk or not?

 OTOH, what I have done is actually a way to hook into save_live_complete
before any other registered handlers get executed. What is missing is one more
notifier_list_notify() call right before qemu_savevm_state_iterate(), and a
corresponding migration_is_active() checker, as sketched below.
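
 Roughly (untested, neither hunk exists in the applied patch; just to show the
shape):

/* A checker for the active phase, mirroring migration_in_setup(): */
bool migration_is_active(MigrationState *s)
{
    return s->state == MIGRATION_STATUS_ACTIVE;
}

/* ...plus one more notification in migration_thread(), right before the
 * iteration, so that notifiers can dirty RAM before each pass:
 */
        notifier_list_notify(&migration_state_notifiers, s);
        qemu_savevm_state_iterate(s->file);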

 What do you think?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Juan Quintela Oct. 29, 2015, 12:20 p.m. UTC | #6
Pavel Fedin <p.fedin@samsung.com> wrote:
>  Hello!
>
>> The Power people have a similar problem with their hashed page tables:
>> they integrated their own save_live implementation because the tables
>> are too big for the last stage.  You can look there for inspiration.
>
>  I examined their code. Interesting, and indeed it opens up a way to
> decrease downtime by implementing iterative migration for the ITS.
>  However, this is not really what is needed here. That mechanism is
> designed to produce its own data chunk, which does not suit the ITS.
> The ITS already stores everything in system RAM, so savevm_ram_handlers
> already take perfect care of that data. The only thing left to do is to
> tell the ITS to dump its state into RAM, which is what I currently do
> using migration_in_completion().
>  An alternative, perhaps better, approach would be to hook into
> ram_save_iterate() and ram_save_complete(). This way we could kick the
> ITS right before attempting to migrate RAM.
>  Could we extend the infrastructure so that:
> a) handlers are prioritized, and we can determine the order of their
> execution?
> b) we can choose whether our handlers actually produce an extra chunk
> or not?
>
>  OTOH, what I have done is actually a way to hook into
> save_live_complete before any other registered handlers get executed.
> What is missing is one more notifier_list_notify() call right before
> qemu_savevm_state_iterate(), and a corresponding migration_is_active()
> checker.
>
>  What do you think?

OK, your problem here is that you modify RAM.  Could you take a look at
how vhost manages this?  It is done at migration_bitmap_sync(), and it
just marks the pages that are dirty.

The "just" part is the interesting bit.  It uses the memory_region
operations.  Michael, do you understand that code better than me?  Could
you give an introduction and say whether it would work for the ITS?  Thanks.


Later, Juan.
Pavel Fedin Oct. 29, 2015, 1:36 p.m. UTC | #7
Hello!

> OK, your problem here is that you modify RAM.  Could you take a look at
> how vhost manages this?  It is done at migration_bitmap_sync(), and it
> just marks the pages that are dirty.

 Hm, interesting... I see it hooks into memory_region_sync_dirty_bitmap().
Sorry if this is a lame question; I do not know the whole code, and it will be
much faster for you to explain it to me than for me to dig into it myself. At
what moment is it called during migration?

 To help you better understand what is needed: the ITS can be implemented
in-kernel by KVM, and that is exactly what I am working on. In my
implementation I add an ioctl, which is called after the CPUs are stopped. It
flushes the internal caches of the vITS to RAM, and this happens inside the
kernel. I guess dirty state tracking works correctly in this case, because the
memory gets modified by the kernel, and I guess from QEMU's point of view that
is the same as memory being modified by the guest. Therefore, I do not need to
touch the memory state bitmaps; I only need to tell the kernel to actually
write out the data.
 If we want to make this thing iterative, we need this ioctl anyway. It could
be modified inside the kernel to update only those RAM parts whose data have
been modified since the last flush. The semantics would stay the same: it is
just an ioctl telling the virtual device to store its data in RAM.
 Ah, and again, these memory listeners are not prioritized either. I guess I
could use them, but I would need a guarantee that my listener is called before
KVMMemoryListener, which picks up the changes.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Juan Quintela Oct. 29, 2015, 2:37 p.m. UTC | #8
Pavel Fedin <p.fedin@samsung.com> wrote:
>  Hello!
>
>> OK, your problem here is that you modify RAM.  Could you take a look at
>> how vhost manages this?  It is done at migration_bitmap_sync(), and it
>> just marks the pages that are dirty.
>
>  Hm, interesting... I see it hooks into
> memory_region_sync_dirty_bitmap(). Sorry if this is a lame question; I
> do not know the whole code, and it will be much faster for you to
> explain it to me than for me to dig into it myself. At what moment is
> it called during migration?

When we call migration_bitmap_sync() we end up calling all of the registered
memory listeners, and they can do whatever they want: modify RAM, synchronize
tables, whatever.
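
The hook itself is just a MemoryListener; roughly like this (untested sketch,
its_flush_tables_to_ram() is a made-up placeholder):

#include "exec/memory.h"

void its_flush_tables_to_ram(void);   /* placeholder, implemented elsewhere */

static void its_log_sync(MemoryListener *listener,
                         MemoryRegionSection *section)
{
    /*
     * Called from migration_bitmap_sync() for every section; a real
     * implementation would want to flush only once per sync round.
     */
    its_flush_tables_to_ram();
}

static MemoryListener its_memory_listener = {
    .log_sync = its_log_sync,
};

/* At init time:
 *     memory_listener_register(&its_memory_listener, &address_space_memory);
 */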

>
>  To help you better understand what is needed: the ITS can be
> implemented in-kernel by KVM, and that is exactly what I am working
> on. In my implementation I add an ioctl, which is called after the
> CPUs are stopped. It flushes the internal caches of the vITS to RAM,
> and this happens inside the kernel. I guess dirty state tracking
> works correctly in this case, because the memory gets modified by the
> kernel, and I guess from QEMU's point of view that is the same as
> memory being modified by the guest. Therefore, I do not need to touch
> the memory state bitmaps; I only need to tell the kernel to actually
> write out the data.
>  If we want to make this thing iterative, we need this ioctl anyway.
> It could be modified inside the kernel to update only those RAM parts
> whose data have been modified since the last flush. The semantics
> would stay the same: it is just an ioctl telling the virtual device
> to store its data in RAM.
>  Ah, and again, these memory listeners are not prioritized either. I
> guess I could use them, but I would need a guarantee that my listener
> is called before KVMMemoryListener, which picks up the changes.

Wait a bit to see if Michael answers about how it is done for vhost.
I don't remember the details O:-)


What you really need is a call before we do the completion stage for
RAM.  I'm thinking about how to do this, and whether there are other users.

Thanks, Juan.

Patch

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 8334621..51b0ea2 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -69,6 +69,7 @@  struct MigrationState
     int parameters[MIGRATION_PARAMETER_MAX];
 
     int state;
+    bool in_completion;
     MigrationParams params;
     double mbps;
     int64_t total_time;
@@ -117,6 +118,7 @@  int migrate_fd_close(MigrationState *s);
 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
 bool migration_in_setup(MigrationState *);
+bool migration_in_completion(MigrationState *);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
 MigrationState *migrate_get_current(void);
diff --git a/migration/migration.c b/migration/migration.c
index b092f38..f4a2421 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -674,6 +674,11 @@  bool migration_in_setup(MigrationState *s)
     return s->state == MIGRATION_STATUS_SETUP;
 }
 
+bool migration_in_completion(MigrationState *s)
+{
+    return s->in_completion;
+}
+
 bool migration_has_finished(MigrationState *s)
 {
     return s->state == MIGRATION_STATUS_COMPLETED;
@@ -996,6 +1001,8 @@  static void migration_completion(MigrationState *s, bool *old_vm_running,
     if (!ret) {
         ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
         if (ret >= 0) {
+            s->in_completion = true;
+            notifier_list_notify(&migration_state_notifiers, s);
             qemu_file_set_rate_limit(s->file, INT64_MAX);
             qemu_savevm_state_complete(s->file);
         }