Message ID | 20190227121052.GD2602@work-vm |
---|---|
State | New |
Headers | show |
Series | possible ahci/migrate fix | expand |
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote: > Hi Alex, > Can you see if the attached patch fixes the ahci/migrate failure you > see; it won't fail for me however mean I am to it. > > .... > void migration_object_finalize(void) > { > + /* > + * Cancel the current migration - that will (eventually) > + * stop the migration using this structure > + */ > + migrate_fd_cancel(current_migration); This can only happen during "civilized" exit of qemu, right? Otherwise, we are changing the migration status. > object_unref(OBJECT(current_migration)); > } > > @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque) > > rcu_register_thread(); > > + object_ref(OBJECT(s)); It is weird that this is not enough :-( > s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > > qemu_savevm_state_header(s->to_dst_file); > @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque) > > trace_migration_thread_after_loop(); > migration_iteration_finish(s); > + object_unref(OBJECT(s)); > rcu_unregister_thread(); > return NULL; > } > diff --git a/vl.c b/vl.c > index 2f340686a7..c1920165f3 100644 > --- a/vl.c > +++ b/vl.c > @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp) > > gdbserver_cleanup(); > > + /* > + * cleaning up the migration object cancels any existing migration > + * try to do this early so that it also stops using devices. > + */ > + migration_object_finalize(); > + > /* No more vcpu or device emulation activity beyond this point */ > vm_shutdown(); > > @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp) > monitor_cleanup(); > qemu_chr_cleanup(); > user_creatable_cleanup(); > - migration_object_finalize(); > /* TODO: unref root container, check all devices are ok */ > > return 0; Ok, it was happening really late. Once that you are at this, can we rename it? migration_cleanup() looks more consistent with everything else, and makes sure that it does "more" than finalize the object, no? Furthermore, it is enough to "just" cancel the migration? It is not needed that we wait for the migration thread to finish? If the problem was that migration_thread() was accessing the object after main thread freed it, then just the ref counting should be enough, no? My understanding is that returning from main is the quivalent of exit() and kill all threads? Or are we doing something special for that not to happen? Later, Juan.
Dr. David Alan Gilbert <dgilbert@redhat.com> writes: > Hi Alex, > Can you see if the attached patch fixes the ahci/migrate failure you > see; it won't fail for me however mean I am to it. over 2000 iterations without issue, commit before crashed within 30 Tested-by: Alex Bennée <alex.bennee@linaro.org> How quickly can we get this merged? > > > From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001 > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > Date: Wed, 27 Feb 2019 12:04:37 +0000 > Subject: [PATCH] migration: Cleanup during exit > > Currently we cleanup the migration object as we exit main after the > main_loop finishes; however if there's a migration running things > get messy and we can end up with the migration thread still trying > to access freed structures. > > We now take a ref to the object around the migration thread itself, > so the act of dropping the ref during exit doesn't cause us to lose > the state until the thread quits. > > Cancelling the migration during migration also tries to get the thread > to quit. > > We do this a bit earlier; so hopefully migration gets out of the way > before all the devices etc are freed. > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/migration.c | 8 ++++++++ > vl.c | 7 ++++++- > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/migration/migration.c b/migration/migration.c > index e44f77af02..f612313d09 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp); > static int migration_maybe_pause(MigrationState *s, > int *current_active_state, > int new_state); > +static void migrate_fd_cancel(MigrationState *s); > > void migration_object_init(void) > { > @@ -169,6 +170,11 @@ void migration_object_init(void) > > void migration_object_finalize(void) > { > + /* > + * Cancel the current migration - that will (eventually) > + * stop the migration using this structure > + */ > + migrate_fd_cancel(current_migration); > object_unref(OBJECT(current_migration)); > } > > @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque) > > rcu_register_thread(); > > + object_ref(OBJECT(s)); > s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > > qemu_savevm_state_header(s->to_dst_file); > @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque) > > trace_migration_thread_after_loop(); > migration_iteration_finish(s); > + object_unref(OBJECT(s)); > rcu_unregister_thread(); > return NULL; > } > diff --git a/vl.c b/vl.c > index 2f340686a7..c1920165f3 100644 > --- a/vl.c > +++ b/vl.c > @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp) > > gdbserver_cleanup(); > > + /* > + * cleaning up the migration object cancels any existing migration > + * try to do this early so that it also stops using devices. > + */ > + migration_object_finalize(); > + > /* No more vcpu or device emulation activity beyond this point */ > vm_shutdown(); > > @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp) > monitor_cleanup(); > qemu_chr_cleanup(); > user_creatable_cleanup(); > - migration_object_finalize(); > /* TODO: unref root container, check all devices are ok */ > > return 0; > -- > 2.20.1 -- Alex Bennée
* Alex Bennée (alex.bennee@linaro.org) wrote: > > Dr. David Alan Gilbert <dgilbert@redhat.com> writes: > > > Hi Alex, > > Can you see if the attached patch fixes the ahci/migrate failure you > > see; it won't fail for me however mean I am to it. > > over 2000 iterations without issue, commit before crashed within 30 > > Tested-by: Alex Bennée <alex.bennee@linaro.org> Great, just reposted as a proper patch. > How quickly can we get this merged? Once it gets a review-by I'm happy if someone wants to merge it directly given the pain it's causing. Alternatively we'll do a migration pull next week with it. Dave > > > > > > From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001 > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > Date: Wed, 27 Feb 2019 12:04:37 +0000 > > Subject: [PATCH] migration: Cleanup during exit > > > > Currently we cleanup the migration object as we exit main after the > > main_loop finishes; however if there's a migration running things > > get messy and we can end up with the migration thread still trying > > to access freed structures. > > > > We now take a ref to the object around the migration thread itself, > > so the act of dropping the ref during exit doesn't cause us to lose > > the state until the thread quits. > > > > Cancelling the migration during migration also tries to get the thread > > to quit. > > > > We do this a bit earlier; so hopefully migration gets out of the way > > before all the devices etc are freed. > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > > --- > > migration/migration.c | 8 ++++++++ > > vl.c | 7 ++++++- > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index e44f77af02..f612313d09 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp); > > static int migration_maybe_pause(MigrationState *s, > > int *current_active_state, > > int new_state); > > +static void migrate_fd_cancel(MigrationState *s); > > > > void migration_object_init(void) > > { > > @@ -169,6 +170,11 @@ void migration_object_init(void) > > > > void migration_object_finalize(void) > > { > > + /* > > + * Cancel the current migration - that will (eventually) > > + * stop the migration using this structure > > + */ > > + migrate_fd_cancel(current_migration); > > object_unref(OBJECT(current_migration)); > > } > > > > @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque) > > > > rcu_register_thread(); > > > > + object_ref(OBJECT(s)); > > s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > > > > qemu_savevm_state_header(s->to_dst_file); > > @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque) > > > > trace_migration_thread_after_loop(); > > migration_iteration_finish(s); > > + object_unref(OBJECT(s)); > > rcu_unregister_thread(); > > return NULL; > > } > > diff --git a/vl.c b/vl.c > > index 2f340686a7..c1920165f3 100644 > > --- a/vl.c > > +++ b/vl.c > > @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp) > > > > gdbserver_cleanup(); > > > > + /* > > + * cleaning up the migration object cancels any existing migration > > + * try to do this early so that it also stops using devices. > > + */ > > + migration_object_finalize(); > > + > > /* No more vcpu or device emulation activity beyond this point */ > > vm_shutdown(); > > > > @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp) > > monitor_cleanup(); > > qemu_chr_cleanup(); > > user_creatable_cleanup(); > > - migration_object_finalize(); > > /* TODO: unref root container, check all devices are ok */ > > > > return 0; > > -- > > 2.20.1 > > > -- > Alex Bennée -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote: > Hi Alex, > Can you see if the attached patch fixes the ahci/migrate failure you > see; it won't fail for me however mean I am to it. Reviewed-by: Juan Quintela <quintela@redhat.com> It fixes the issue at point, and even if it is not a full solution (not sure if it is or not), it is a step in the right direction. Later, Juan. > > From f4c327d14d656d1c0f0e694d0efc6165493416f0 Mon Sep 17 00:00:00 2001 > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > Date: Wed, 27 Feb 2019 12:04:37 +0000 > Subject: [PATCH] migration: Cleanup during exit > > Currently we cleanup the migration object as we exit main after the > main_loop finishes; however if there's a migration running things > get messy and we can end up with the migration thread still trying > to access freed structures. > > We now take a ref to the object around the migration thread itself, > so the act of dropping the ref during exit doesn't cause us to lose > the state until the thread quits. > > Cancelling the migration during migration also tries to get the thread > to quit. > > We do this a bit earlier; so hopefully migration gets out of the way > before all the devices etc are freed. > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/migration.c | 8 ++++++++ > vl.c | 7 ++++++- > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/migration/migration.c b/migration/migration.c > index e44f77af02..f612313d09 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp); > static int migration_maybe_pause(MigrationState *s, > int *current_active_state, > int new_state); > +static void migrate_fd_cancel(MigrationState *s); > > void migration_object_init(void) > { > @@ -169,6 +170,11 @@ void migration_object_init(void) > > void migration_object_finalize(void) > { > + /* > + * Cancel the current migration - that will (eventually) > + * stop the migration using this structure > + */ > + migrate_fd_cancel(current_migration); > object_unref(OBJECT(current_migration)); > } > > @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque) > > rcu_register_thread(); > > + object_ref(OBJECT(s)); > s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > > qemu_savevm_state_header(s->to_dst_file); > @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque) > > trace_migration_thread_after_loop(); > migration_iteration_finish(s); > + object_unref(OBJECT(s)); > rcu_unregister_thread(); > return NULL; > } > diff --git a/vl.c b/vl.c > index 2f340686a7..c1920165f3 100644 > --- a/vl.c > +++ b/vl.c > @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp) > > gdbserver_cleanup(); > > + /* > + * cleaning up the migration object cancels any existing migration > + * try to do this early so that it also stops using devices. > + */ > + migration_object_finalize(); > + > /* No more vcpu or device emulation activity beyond this point */ > vm_shutdown(); > > @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp) > monitor_cleanup(); > qemu_chr_cleanup(); > user_creatable_cleanup(); > - migration_object_finalize(); > /* TODO: unref root container, check all devices are ok */ > > return 0;
diff --git a/migration/migration.c b/migration/migration.c index e44f77af02..f612313d09 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -126,6 +126,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp); static int migration_maybe_pause(MigrationState *s, int *current_active_state, int new_state); +static void migrate_fd_cancel(MigrationState *s); void migration_object_init(void) { @@ -169,6 +170,11 @@ void migration_object_init(void) void migration_object_finalize(void) { + /* + * Cancel the current migration - that will (eventually) + * stop the migration using this structure + */ + migrate_fd_cancel(current_migration); object_unref(OBJECT(current_migration)); } @@ -3134,6 +3140,7 @@ static void *migration_thread(void *opaque) rcu_register_thread(); + object_ref(OBJECT(s)); s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); qemu_savevm_state_header(s->to_dst_file); @@ -3230,6 +3237,7 @@ static void *migration_thread(void *opaque) trace_migration_thread_after_loop(); migration_iteration_finish(s); + object_unref(OBJECT(s)); rcu_unregister_thread(); return NULL; } diff --git a/vl.c b/vl.c index 2f340686a7..c1920165f3 100644 --- a/vl.c +++ b/vl.c @@ -4579,6 +4579,12 @@ int main(int argc, char **argv, char **envp) gdbserver_cleanup(); + /* + * cleaning up the migration object cancels any existing migration + * try to do this early so that it also stops using devices. + */ + migration_object_finalize(); + /* No more vcpu or device emulation activity beyond this point */ vm_shutdown(); @@ -4594,7 +4600,6 @@ int main(int argc, char **argv, char **envp) monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); - migration_object_finalize(); /* TODO: unref root container, check all devices are ok */ return 0;