diff mbox series

possible ahci/migrate fix

Message ID 20190227121052.GD2602@work-vm
State New
Headers show
Series possible ahci/migrate fix | expand

Commit Message

Dr. David Alan Gilbert Feb. 27, 2019, 12:10 p.m. UTC
Hi Alex,
  Can you see if the attached patch fixes the ahci/migrate failure you
see;  it won't fail for me however mean I am to it.


From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Wed, 27 Feb 2019 12:04:37 +0000
Subject: [PATCH] migration: Cleanup during exit

Currently we cleanup the migration object as we exit main after the
main_loop finishes; however if there's a migration running things
get messy and we can end up with the migration thread still trying
to access freed structures.

We now take a ref to the object around the migration thread itself,
so the act of dropping the ref during exit doesn't cause us to lose
the state until the thread quits.

Cancelling the migration during migration also tries to get the thread
to quit.

We do this a bit earlier; so hopefully migration gets out of the way
before all the devices etc are freed.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c | 8 ++++++++
 vl.c                  | 7 ++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

Comments

Juan Quintela Feb. 27, 2019, 12:41 p.m. UTC | #1
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> Hi Alex,
>   Can you see if the attached patch fixes the ahci/migrate failure you
> see;  it won't fail for me however mean I am to it.
>
>

....

>  void migration_object_finalize(void)
>  {
> +    /*
> +     * Cancel the current migration - that will (eventually)
> +     * stop the migration using this structure
> +     */
> +    migrate_fd_cancel(current_migration);

This can only happen during "civilized" exit of qemu, right?
Otherwise, we are changing the migration status.

>      object_unref(OBJECT(current_migration));
>  }
>  
> @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque)
>  
>      rcu_register_thread();
>  
> +    object_ref(OBJECT(s));

It is weird that this is not enough :-(

>      s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  
>      qemu_savevm_state_header(s->to_dst_file);
> @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque)
>  
>      trace_migration_thread_after_loop();
>      migration_iteration_finish(s);
> +    object_unref(OBJECT(s));
>      rcu_unregister_thread();
>      return NULL;
>  }
> diff --git a/vl.c b/vl.c
> index 2f340686a7..c1920165f3 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp)
>  
>      gdbserver_cleanup();
>  
> +    /*
> +     * cleaning up the migration object cancels any existing migration
> +     * try to do this early so that it also stops using devices.
> +     */
> +    migration_object_finalize();
> +
>      /* No more vcpu or device emulation activity beyond this point */
>      vm_shutdown();
>  
> @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp)
>      monitor_cleanup();
>      qemu_chr_cleanup();
>      user_creatable_cleanup();
> -    migration_object_finalize();
>      /* TODO: unref root container, check all devices are ok */
>  
>      return 0;

Ok, it was happening really late.

Once that you are at this, can we rename it?

migration_cleanup()

looks more consistent with everything else, and makes sure that it does
"more" than finalize the object, no?

Furthermore, it is enough to "just" cancel the migration?  It is not
needed that we wait for the migration thread to finish?

If the problem was that migration_thread() was accessing the object
after main thread freed it, then just the ref counting should be enough,
no?  My understanding is that returning from main is the quivalent of
exit() and kill all threads?  Or are we doing something special for that
not to happen?

Later, Juan.
Alex Bennée Feb. 27, 2019, 4:46 p.m. UTC | #2
Dr. David Alan Gilbert <dgilbert@redhat.com> writes:

> Hi Alex,
>   Can you see if the attached patch fixes the ahci/migrate failure you
> see;  it won't fail for me however mean I am to it.

over 2000 iterations without issue, commit before crashed within 30

Tested-by: Alex Bennée <alex.bennee@linaro.org>

How quickly can we get this merged?

>
>
> From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Date: Wed, 27 Feb 2019 12:04:37 +0000
> Subject: [PATCH] migration: Cleanup during exit
>
> Currently we cleanup the migration object as we exit main after the
> main_loop finishes; however if there's a migration running things
> get messy and we can end up with the migration thread still trying
> to access freed structures.
>
> We now take a ref to the object around the migration thread itself,
> so the act of dropping the ref during exit doesn't cause us to lose
> the state until the thread quits.
>
> Cancelling the migration during migration also tries to get the thread
> to quit.
>
> We do this a bit earlier; so hopefully migration gets out of the way
> before all the devices etc are freed.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/migration.c | 8 ++++++++
>  vl.c                  | 7 ++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e44f77af02..f612313d09 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp);
>  static int migration_maybe_pause(MigrationState *s,
>                                   int *current_active_state,
>                                   int new_state);
> +static void migrate_fd_cancel(MigrationState *s);
>
>  void migration_object_init(void)
>  {
> @@ -169,6 +170,11 @@ void migration_object_init(void)
>
>  void migration_object_finalize(void)
>  {
> +    /*
> +     * Cancel the current migration - that will (eventually)
> +     * stop the migration using this structure
> +     */
> +    migrate_fd_cancel(current_migration);
>      object_unref(OBJECT(current_migration));
>  }
>
> @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque)
>
>      rcu_register_thread();
>
> +    object_ref(OBJECT(s));
>      s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>
>      qemu_savevm_state_header(s->to_dst_file);
> @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque)
>
>      trace_migration_thread_after_loop();
>      migration_iteration_finish(s);
> +    object_unref(OBJECT(s));
>      rcu_unregister_thread();
>      return NULL;
>  }
> diff --git a/vl.c b/vl.c
> index 2f340686a7..c1920165f3 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp)
>
>      gdbserver_cleanup();
>
> +    /*
> +     * cleaning up the migration object cancels any existing migration
> +     * try to do this early so that it also stops using devices.
> +     */
> +    migration_object_finalize();
> +
>      /* No more vcpu or device emulation activity beyond this point */
>      vm_shutdown();
>
> @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp)
>      monitor_cleanup();
>      qemu_chr_cleanup();
>      user_creatable_cleanup();
> -    migration_object_finalize();
>      /* TODO: unref root container, check all devices are ok */
>
>      return 0;
> --
> 2.20.1


--
Alex Bennée
Dr. David Alan Gilbert Feb. 27, 2019, 4:51 p.m. UTC | #3
* Alex Bennée (alex.bennee@linaro.org) wrote:
> 
> Dr. David Alan Gilbert <dgilbert@redhat.com> writes:
> 
> > Hi Alex,
> >   Can you see if the attached patch fixes the ahci/migrate failure you
> > see;  it won't fail for me however mean I am to it.
> 
> over 2000 iterations without issue, commit before crashed within 30
> 
> Tested-by: Alex Bennée <alex.bennee@linaro.org>

Great, just reposted as a proper patch.

> How quickly can we get this merged?

Once it gets a review-by I'm happy if someone wants to merge it directly
given the pain it's causing.
Alternatively we'll do a migration pull next week with it.

Dave

> >
> >
> > From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > Date: Wed, 27 Feb 2019 12:04:37 +0000
> > Subject: [PATCH] migration: Cleanup during exit
> >
> > Currently we cleanup the migration object as we exit main after the
> > main_loop finishes; however if there's a migration running things
> > get messy and we can end up with the migration thread still trying
> > to access freed structures.
> >
> > We now take a ref to the object around the migration thread itself,
> > so the act of dropping the ref during exit doesn't cause us to lose
> > the state until the thread quits.
> >
> > Cancelling the migration during migration also tries to get the thread
> > to quit.
> >
> > We do this a bit earlier; so hopefully migration gets out of the way
> > before all the devices etc are freed.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/migration.c | 8 ++++++++
> >  vl.c                  | 7 ++++++-
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index e44f77af02..f612313d09 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp);
> >  static int migration_maybe_pause(MigrationState *s,
> >                                   int *current_active_state,
> >                                   int new_state);
> > +static void migrate_fd_cancel(MigrationState *s);
> >
> >  void migration_object_init(void)
> >  {
> > @@ -169,6 +170,11 @@ void migration_object_init(void)
> >
> >  void migration_object_finalize(void)
> >  {
> > +    /*
> > +     * Cancel the current migration - that will (eventually)
> > +     * stop the migration using this structure
> > +     */
> > +    migrate_fd_cancel(current_migration);
> >      object_unref(OBJECT(current_migration));
> >  }
> >
> > @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque)
> >
> >      rcu_register_thread();
> >
> > +    object_ref(OBJECT(s));
> >      s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> >
> >      qemu_savevm_state_header(s->to_dst_file);
> > @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque)
> >
> >      trace_migration_thread_after_loop();
> >      migration_iteration_finish(s);
> > +    object_unref(OBJECT(s));
> >      rcu_unregister_thread();
> >      return NULL;
> >  }
> > diff --git a/vl.c b/vl.c
> > index 2f340686a7..c1920165f3 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp)
> >
> >      gdbserver_cleanup();
> >
> > +    /*
> > +     * cleaning up the migration object cancels any existing migration
> > +     * try to do this early so that it also stops using devices.
> > +     */
> > +    migration_object_finalize();
> > +
> >      /* No more vcpu or device emulation activity beyond this point */
> >      vm_shutdown();
> >
> > @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp)
> >      monitor_cleanup();
> >      qemu_chr_cleanup();
> >      user_creatable_cleanup();
> > -    migration_object_finalize();
> >      /* TODO: unref root container, check all devices are ok */
> >
> >      return 0;
> > --
> > 2.20.1
> 
> 
> --
> Alex Bennée
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Juan Quintela Feb. 27, 2019, 8:30 p.m. UTC | #4
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> Hi Alex,
>   Can you see if the attached patch fixes the ahci/migrate failure you
> see;  it won't fail for me however mean I am to it.

Reviewed-by: Juan Quintela <quintela@redhat.com>

It fixes the issue at point, and even if it is not a full solution (not
sure if it is or not), it is a step in the right direction.

Later, Juan.

>
> From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Date: Wed, 27 Feb 2019 12:04:37 +0000
> Subject: [PATCH] migration: Cleanup during exit
>
> Currently we cleanup the migration object as we exit main after the
> main_loop finishes; however if there's a migration running things
> get messy and we can end up with the migration thread still trying
> to access freed structures.
>
> We now take a ref to the object around the migration thread itself,
> so the act of dropping the ref during exit doesn't cause us to lose
> the state until the thread quits.
>
> Cancelling the migration during migration also tries to get the thread
> to quit.
>
> We do this a bit earlier; so hopefully migration gets out of the way
> before all the devices etc are freed.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/migration.c | 8 ++++++++
>  vl.c                  | 7 ++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e44f77af02..f612313d09 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp);
>  static int migration_maybe_pause(MigrationState *s,
>                                   int *current_active_state,
>                                   int new_state);
> +static void migrate_fd_cancel(MigrationState *s);
>  
>  void migration_object_init(void)
>  {
> @@ -169,6 +170,11 @@ void migration_object_init(void)
>  
>  void migration_object_finalize(void)
>  {
> +    /*
> +     * Cancel the current migration - that will (eventually)
> +     * stop the migration using this structure
> +     */
> +    migrate_fd_cancel(current_migration);
>      object_unref(OBJECT(current_migration));
>  }
>  
> @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque)
>  
>      rcu_register_thread();
>  
> +    object_ref(OBJECT(s));
>      s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  
>      qemu_savevm_state_header(s->to_dst_file);
> @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque)
>  
>      trace_migration_thread_after_loop();
>      migration_iteration_finish(s);
> +    object_unref(OBJECT(s));
>      rcu_unregister_thread();
>      return NULL;
>  }
> diff --git a/vl.c b/vl.c
> index 2f340686a7..c1920165f3 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp)
>  
>      gdbserver_cleanup();
>  
> +    /*
> +     * cleaning up the migration object cancels any existing migration
> +     * try to do this early so that it also stops using devices.
> +     */
> +    migration_object_finalize();
> +
>      /* No more vcpu or device emulation activity beyond this point */
>      vm_shutdown();
>  
> @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp)
>      monitor_cleanup();
>      qemu_chr_cleanup();
>      user_creatable_cleanup();
> -    migration_object_finalize();
>      /* TODO: unref root container, check all devices are ok */
>  
>      return 0;
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index e44f77af02..f612313d09 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -126,6 +126,7 @@  static bool migration_object_check(MigrationState *ms, Error **errp);
 static int migration_maybe_pause(MigrationState *s,
                                  int *current_active_state,
                                  int new_state);
+static void migrate_fd_cancel(MigrationState *s);
 
 void migration_object_init(void)
 {
@@ -169,6 +170,11 @@  void migration_object_init(void)
 
 void migration_object_finalize(void)
 {
+    /*
+     * Cancel the current migration - that will (eventually)
+     * stop the migration using this structure
+     */
+    migrate_fd_cancel(current_migration);
     object_unref(OBJECT(current_migration));
 }
 
@@ -3134,6 +3140,7 @@  static void *migration_thread(void *opaque)
 
     rcu_register_thread();
 
+    object_ref(OBJECT(s));
     s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 
     qemu_savevm_state_header(s->to_dst_file);
@@ -3230,6 +3237,7 @@  static void *migration_thread(void *opaque)
 
     trace_migration_thread_after_loop();
     migration_iteration_finish(s);
+    object_unref(OBJECT(s));
     rcu_unregister_thread();
     return NULL;
 }
diff --git a/vl.c b/vl.c
index 2f340686a7..c1920165f3 100644
--- a/vl.c
+++ b/vl.c
@@ -4579,6 +4579,12 @@  int main(int argc, char **argv, char **envp)
 
     gdbserver_cleanup();
 
+    /*
+     * cleaning up the migration object cancels any existing migration
+     * try to do this early so that it also stops using devices.
+     */
+    migration_object_finalize();
+
     /* No more vcpu or device emulation activity beyond this point */
     vm_shutdown();
 
@@ -4594,7 +4600,6 @@  int main(int argc, char **argv, char **envp)
     monitor_cleanup();
     qemu_chr_cleanup();
     user_creatable_cleanup();
-    migration_object_finalize();
     /* TODO: unref root container, check all devices are ok */
 
     return 0;