Message ID | 20170804070957.GJ5561@pxdev.xzpeter.org |
---|---|
State | New |
Headers | show |
* Peter Xu (peterx@redhat.com) wrote: > On Fri, Aug 04, 2017 at 03:04:19PM +0800, Peter Xu wrote: > > On Thu, Aug 03, 2017 at 12:05:41PM +0100, Dr. David Alan Gilbert wrote: > > > > [...] > > > > > > +static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) > > > > +{ > > > > + /* > > > > + * This means source VM is ready to resume the postcopy migration. > > > > + * It's time to switch state and release the fault thread to > > > > + * continue service page faults. > > > > + */ > > > > + migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER, > > > > + MIGRATION_STATUS_POSTCOPY_ACTIVE); > > > > + qemu_sem_post(&mis->postcopy_pause_sem_fault); > > > > > > Is it worth sanity checking that you were in RECOVER at this point? > > > > Yeah, it never hurts. Will do. > > Not sure whether this would be good (note: I returned 0 in the if): > > diff --git a/migration/savevm.c b/migration/savevm.c > index b7843c2..b34f59b 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -1709,6 +1709,12 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > > static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) > { > + if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) { > + error_report("%s: illegal resume received", __func__); > + /* Don't fail the load, only for this. */ > + return 0; > + } > + > /* > * This means source VM is ready to resume the postcopy migration. > * It's time to switch state and release the fault thread to > > Basically I just don't want to crash the dest VM (it holds hot dirty > pages) even if it receives a faulty RESUME command. Yes, so now that's a fun problem; effectively you then have 3 valid failure modes: a) An IO failure so we need to go into POSTCOPY_PAUSE b) A fatal migration stream problem to quit c) A non-fatal migration stream problem to go .. back into PAUSE? Dave > -- > Peter Xu -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Aug 04, 2017 at 09:30:01AM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (peterx@redhat.com) wrote: > > On Fri, Aug 04, 2017 at 03:04:19PM +0800, Peter Xu wrote: > > > On Thu, Aug 03, 2017 at 12:05:41PM +0100, Dr. David Alan Gilbert wrote: > > > > > > [...] > > > > > > > > +static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) > > > > > +{ > > > > > + /* > > > > > + * This means source VM is ready to resume the postcopy migration. > > > > > + * It's time to switch state and release the fault thread to > > > > > + * continue service page faults. > > > > > + */ > > > > > + migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER, > > > > > + MIGRATION_STATUS_POSTCOPY_ACTIVE); > > > > > + qemu_sem_post(&mis->postcopy_pause_sem_fault); > > > > > > > > Is it worth sanity checking that you were in RECOVER at this point? > > > > > > Yeah, it never hurts. Will do. > > > > Not sure whether this would be good (note: I returned 0 in the if): > > > > diff --git a/migration/savevm.c b/migration/savevm.c > > index b7843c2..b34f59b 100644 > > --- a/migration/savevm.c > > +++ b/migration/savevm.c > > @@ -1709,6 +1709,12 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > > > > static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) > > { > > + if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) { > > + error_report("%s: illegal resume received", __func__); > > + /* Don't fail the load, only for this. */ > > + return 0; > > + } > > + > > /* > > * This means source VM is ready to resume the postcopy migration. > > * It's time to switch state and release the fault thread to > > > > Basically I just don't want to crash the dest VM (it holds hot dirty > > pages) even if it receives a faulty RESUME command. > > Yes, so now that's a fun problem; effectively you then have 3 valid > failure modes: > a) An IO failure so we need to go into POSTCOPY_PAUSE > b) A fatal migration stream problem to quit > c) A non-fatal migration stream problem to go .. back into PAUSE? Hmm yes... So I got at least three TODO ITEMs now: - support manual switch source into PAUSED state - support migrate_cancel during PAUSED/RECOVER state - when anything wrong happens during PAUSED/RECOVER, switching back to PAUSED state on both sides It just depends on whether we would like to postpone these work, or we think any of them are essential even for the first version. IMHO we can postpone this 3rd one as well.
diff --git a/migration/savevm.c b/migration/savevm.c index b7843c2..b34f59b 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1709,6 +1709,12 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) { + if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) { + error_report("%s: illegal resume received", __func__); + /* Don't fail the load, only for this. */ + return 0; + } + /* * This means source VM is ready to resume the postcopy migration. * It's time to switch state and release the fault thread to