diff mbox

[RFC,23/29] migration: new cmd MIG_CMD_POSTCOPY_RESUME

Message ID 20170804070957.GJ5561@pxdev.xzpeter.org
State New
Headers show

Commit Message

Peter Xu Aug. 4, 2017, 7:09 a.m. UTC
On Fri, Aug 04, 2017 at 03:04:19PM +0800, Peter Xu wrote:
> On Thu, Aug 03, 2017 at 12:05:41PM +0100, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
> > > +static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> > > +{
> > > +    /*
> > > +     * This means source VM is ready to resume the postcopy migration.
> > > +     * It's time to switch state and release the fault thread to
> > > +     * continue service page faults.
> > > +     */
> > > +    migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
> > > +                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > +    qemu_sem_post(&mis->postcopy_pause_sem_fault);
> > 
> > Is it worth sanity checking that you were in RECOVER at this point?
> 
> Yeah, it never hurts.  Will do.

Not sure whether this would be good (note: I returned 0 in the if):


Basically I just don't want to crash the dest VM (it holds hot dirty
pages) even if it receives a faulty RESUME command.

Comments

Dr. David Alan Gilbert Aug. 4, 2017, 8:30 a.m. UTC | #1
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Aug 04, 2017 at 03:04:19PM +0800, Peter Xu wrote:
> > On Thu, Aug 03, 2017 at 12:05:41PM +0100, Dr. David Alan Gilbert wrote:
> > 
> > [...]
> > 
> > > > +static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> > > > +{
> > > > +    /*
> > > > +     * This means source VM is ready to resume the postcopy migration.
> > > > +     * It's time to switch state and release the fault thread to
> > > > +     * continue service page faults.
> > > > +     */
> > > > +    migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
> > > > +                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > +    qemu_sem_post(&mis->postcopy_pause_sem_fault);
> > > 
> > > Is it worth sanity checking that you were in RECOVER at this point?
> > 
> > Yeah, it never hurts.  Will do.
> 
> Not sure whether this would be good (note: I returned 0 in the if):
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b7843c2..b34f59b 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1709,6 +1709,12 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
>  
>  static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
>  {
> +    if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> +        error_report("%s: illegal resume received", __func__);
> +        /* Don't fail the load, only for this. */
> +        return 0;
> +    }
> +
>      /*
>       * This means source VM is ready to resume the postcopy migration.
>       * It's time to switch state and release the fault thread to
> 
> Basically I just don't want to crash the dest VM (it holds hot dirty
> pages) even if it receives a faulty RESUME command.

Yes, so now that's a fun problem; effectively you then have 3 valid
failure modes:
    a) An IO failure so we need to go into POSTCOPY_PAUSE
    b) A fatal migration stream problem to quit
    c) A non-fatal migration stream problem to go .. back into PAUSE?

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu Aug. 4, 2017, 9:22 a.m. UTC | #2
On Fri, Aug 04, 2017 at 09:30:01AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Aug 04, 2017 at 03:04:19PM +0800, Peter Xu wrote:
> > > On Thu, Aug 03, 2017 at 12:05:41PM +0100, Dr. David Alan Gilbert wrote:
> > > 
> > > [...]
> > > 
> > > > > +static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> > > > > +{
> > > > > +    /*
> > > > > +     * This means source VM is ready to resume the postcopy migration.
> > > > > +     * It's time to switch state and release the fault thread to
> > > > > +     * continue service page faults.
> > > > > +     */
> > > > > +    migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
> > > > > +                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > > +    qemu_sem_post(&mis->postcopy_pause_sem_fault);
> > > > 
> > > > Is it worth sanity checking that you were in RECOVER at this point?
> > > 
> > > Yeah, it never hurts.  Will do.
> > 
> > Not sure whether this would be good (note: I returned 0 in the if):
> > 
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index b7843c2..b34f59b 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1709,6 +1709,12 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
> >  
> >  static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> >  {
> > +    if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> > +        error_report("%s: illegal resume received", __func__);
> > +        /* Don't fail the load, only for this. */
> > +        return 0;
> > +    }
> > +
> >      /*
> >       * This means source VM is ready to resume the postcopy migration.
> >       * It's time to switch state and release the fault thread to
> > 
> > Basically I just don't want to crash the dest VM (it holds hot dirty
> > pages) even if it receives a faulty RESUME command.
> 
> Yes, so now that's a fun problem; effectively you then have 3 valid
> failure modes:
>     a) An IO failure so we need to go into POSTCOPY_PAUSE
>     b) A fatal migration stream problem to quit
>     c) A non-fatal migration stream problem to go .. back into PAUSE?

Hmm yes...

So I got at least three TODO ITEMs now:

- support manual switch source into PAUSED state
- support migrate_cancel during PAUSED/RECOVER state
- when anything wrong happens during PAUSED/RECOVER, switching back to
  PAUSED state on both sides

It just depends on whether we would like to postpone these work, or we
think any of them are essential even for the first version.

IMHO we can postpone this 3rd one as well.
diff mbox

Patch

diff --git a/migration/savevm.c b/migration/savevm.c
index b7843c2..b34f59b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1709,6 +1709,12 @@  static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
 
 static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
 {
+    if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
+        error_report("%s: illegal resume received", __func__);
+        /* Don't fail the load, only for this. */
+        return 0;
+    }
+
     /*
      * This means source VM is ready to resume the postcopy migration.
      * It's time to switch state and release the fault thread to