diff mbox series

[v4,31/32] migration, qmp: new command "migrate-pause"

Message ID 20171108060130.3772-32-peterx@redhat.com
State New
Headers show
Series Migration: postcopy failure recovery | expand

Commit Message

Peter Xu Nov. 8, 2017, 6:01 a.m. UTC
It is used to manually trigger the postcopy pause state.  It works just
like when we found the migration stream failed during postcopy, but
provide an explicit way for user in case of misterious socket hangs.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 18 ++++++++++++++++++
 qapi/migration.json   | 22 ++++++++++++++++++++++
 2 files changed, 40 insertions(+)

Comments

Dr. David Alan Gilbert Dec. 1, 2017, 4:53 p.m. UTC | #1
* Peter Xu (peterx@redhat.com) wrote:
> It is used to manually trigger the postcopy pause state.  It works just
> like when we found the migration stream failed during postcopy, but
> provide an explicit way for user in case of misterious socket hangs.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can we change the name to something like 'migrate-disconnect' - pause
is a bit easy to confuse with other things and this is really more
an explicit network disconnect (Is it worth just making it a flag to
migrate-cancel?)


> ---
>  migration/migration.c | 18 ++++++++++++++++++
>  qapi/migration.json   | 22 ++++++++++++++++++++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 536a771803..30348a5e27 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1485,6 +1485,24 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
>      once = false;
>  }
>  
> +void qmp_migrate_pause(Error **errp)
> +{
> +    int ret;
> +    MigrationState *ms = migrate_get_current();
> +
> +    if (ms->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> +        error_setg(errp, "Migration pause is currently only allowed during"
> +                   " an active postcopy phase.");
> +        return;
> +    }
> +
> +    ret = qemu_file_shutdown(ms->to_dst_file);
> +
> +    if (ret) {
> +        error_setg(errp, "Failed to pause migration stream.");
> +    }
> +}
> +
>  bool migration_is_blocked(Error **errp)
>  {
>      if (qemu_savevm_state_blocked(errp)) {
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4a3eff62f1..52901f7e2e 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1074,6 +1074,28 @@
>  { 'command': 'migrate-incoming', 'data': {'uri': 'str' } }
>  
>  ##
> +# @migrate-pause:
> +#
> +# Pause an migration.  Currently it can only pause a postcopy
> +# migration.  Pausing a precopy migration is not supported yet.
> +#
> +# It is mostly used as a manual way to trigger the postcopy paused
> +# state when the network sockets hang due to some reason, so that we
> +# can try a recovery afterward.

Can we say this explicitly;
'Force closes the migration connection to trigger the postcopy paused
 state when the network sockets hang due to some reason, so that we
can try a recovery afterwards'

Dave

> +# Returns: nothing on success
> +#
> +# Since: 2.12
> +#
> +# Example:
> +#
> +# -> { "execute": "migrate-pause" }
> +# <- { "return": {} }
> +#
> +##
> +{ 'command': 'migrate-pause' }
> +
> +##
>  # @xen-save-devices-state:
>  #
>  # Save the state of all devices to file. The RAM and the block devices
> -- 
> 2.13.6
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu Dec. 4, 2017, 4:48 a.m. UTC | #2
On Fri, Dec 01, 2017 at 04:53:28PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > It is used to manually trigger the postcopy pause state.  It works just
> > like when we found the migration stream failed during postcopy, but
> > provide an explicit way for user in case of misterious socket hangs.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Can we change the name to something like 'migrate-disconnect' - pause
> is a bit easy to confuse with other things and this is really more
> an explicit network disconnect (Is it worth just making it a flag to
> migrate-cancel?)

Then I would prefer to reuse the migrate_cancel command.  

Actually this reminded me about what would happen now if someone on
src VM sends a "migrate_cancel" during postcopy active.  It should
crash the VM, right?

Considering above, I'm thinking whether we should just make it a
default behavior that when do migrate_cancel during postcopy-active we
just do a pause instead of real cancel. After all it cannot re-start
the VM any more on source, so IMHO a real cancel does not mean much
here.  More importantly, what if someone wants to manually trigger
this pause but accidentally forgot to type that new flag (say,
-D[isconnect])?  It'll crash the VM directly.

What do you think?

> 
> 
> > ---
> >  migration/migration.c | 18 ++++++++++++++++++
> >  qapi/migration.json   | 22 ++++++++++++++++++++++
> >  2 files changed, 40 insertions(+)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 536a771803..30348a5e27 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1485,6 +1485,24 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
> >      once = false;
> >  }
> >  
> > +void qmp_migrate_pause(Error **errp)
> > +{
> > +    int ret;
> > +    MigrationState *ms = migrate_get_current();
> > +
> > +    if (ms->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > +        error_setg(errp, "Migration pause is currently only allowed during"
> > +                   " an active postcopy phase.");
> > +        return;
> > +    }
> > +
> > +    ret = qemu_file_shutdown(ms->to_dst_file);
> > +
> > +    if (ret) {
> > +        error_setg(errp, "Failed to pause migration stream.");
> > +    }
> > +}
> > +
> >  bool migration_is_blocked(Error **errp)
> >  {
> >      if (qemu_savevm_state_blocked(errp)) {
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 4a3eff62f1..52901f7e2e 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -1074,6 +1074,28 @@
> >  { 'command': 'migrate-incoming', 'data': {'uri': 'str' } }
> >  
> >  ##
> > +# @migrate-pause:
> > +#
> > +# Pause an migration.  Currently it can only pause a postcopy
> > +# migration.  Pausing a precopy migration is not supported yet.
> > +#
> > +# It is mostly used as a manual way to trigger the postcopy paused
> > +# state when the network sockets hang due to some reason, so that we
> > +# can try a recovery afterward.
> 
> Can we say this explicitly;
> 'Force closes the migration connection to trigger the postcopy paused
>  state when the network sockets hang due to some reason, so that we
> can try a recovery afterwards'

Sure!  I'll just see where I should properly put these sentences.

Thanks,
Dr. David Alan Gilbert Dec. 4, 2017, 5:10 p.m. UTC | #3
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Dec 01, 2017 at 04:53:28PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > It is used to manually trigger the postcopy pause state.  It works just
> > > like when we found the migration stream failed during postcopy, but
> > > provide an explicit way for user in case of misterious socket hangs.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Can we change the name to something like 'migrate-disconnect' - pause
> > is a bit easy to confuse with other things and this is really more
> > an explicit network disconnect (Is it worth just making it a flag to
> > migrate-cancel?)
> 
> Then I would prefer to reuse the migrate_cancel command.  
> 
> Actually this reminded me about what would happen now if someone on
> src VM sends a "migrate_cancel" during postcopy active.  It should
> crash the VM, right?
> 
> Considering above, I'm thinking whether we should just make it a
> default behavior that when do migrate_cancel during postcopy-active we
> just do a pause instead of real cancel. After all it cannot re-start
> the VM any more on source, so IMHO a real cancel does not mean much
> here.  More importantly, what if someone wants to manually trigger
> this pause but accidentally forgot to type that new flag (say,
> -D[isconnect])?  It'll crash the VM directly.
> 
> What do you think?

Yes, that's OK, just be careful about race conditions between the
states,  for example what happens if you do a cancel and you enter
migrate_fd_cancel in postcopy-active, but before you can actually
cancel you end up completing, or the opposite where you do a
migrate-start-postcopy almost immediately before migrade-cancel;
do you get to cancel in teh active or postcopy-active state?


> 
> > 
> > 
> > > ---
> > >  migration/migration.c | 18 ++++++++++++++++++
> > >  qapi/migration.json   | 22 ++++++++++++++++++++++
> > >  2 files changed, 40 insertions(+)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 536a771803..30348a5e27 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -1485,6 +1485,24 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
> > >      once = false;
> > >  }
> > >  
> > > +void qmp_migrate_pause(Error **errp)
> > > +{
> > > +    int ret;
> > > +    MigrationState *ms = migrate_get_current();
> > > +
> > > +    if (ms->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > +        error_setg(errp, "Migration pause is currently only allowed during"
> > > +                   " an active postcopy phase.");
> > > +        return;
> > > +    }
> > > +
> > > +    ret = qemu_file_shutdown(ms->to_dst_file);
> > > +
> > > +    if (ret) {
> > > +        error_setg(errp, "Failed to pause migration stream.");
> > > +    }
> > > +}
> > > +
> > >  bool migration_is_blocked(Error **errp)
> > >  {
> > >      if (qemu_savevm_state_blocked(errp)) {
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 4a3eff62f1..52901f7e2e 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -1074,6 +1074,28 @@
> > >  { 'command': 'migrate-incoming', 'data': {'uri': 'str' } }
> > >  
> > >  ##
> > > +# @migrate-pause:
> > > +#
> > > +# Pause an migration.  Currently it can only pause a postcopy
> > > +# migration.  Pausing a precopy migration is not supported yet.
> > > +#
> > > +# It is mostly used as a manual way to trigger the postcopy paused
> > > +# state when the network sockets hang due to some reason, so that we
> > > +# can try a recovery afterward.
> > 
> > Can we say this explicitly;
> > 'Force closes the migration connection to trigger the postcopy paused
> >  state when the network sockets hang due to some reason, so that we
> > can try a recovery afterwards'
> 
> Sure!  I'll just see where I should properly put these sentences.

Thanks.

Dave

> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu Dec. 5, 2017, 2:52 a.m. UTC | #4
On Mon, Dec 04, 2017 at 05:10:29PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Dec 01, 2017 at 04:53:28PM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > It is used to manually trigger the postcopy pause state.  It works just
> > > > like when we found the migration stream failed during postcopy, but
> > > > provide an explicit way for user in case of misterious socket hangs.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Can we change the name to something like 'migrate-disconnect' - pause
> > > is a bit easy to confuse with other things and this is really more
> > > an explicit network disconnect (Is it worth just making it a flag to
> > > migrate-cancel?)
> > 
> > Then I would prefer to reuse the migrate_cancel command.  
> > 
> > Actually this reminded me about what would happen now if someone on
> > src VM sends a "migrate_cancel" during postcopy active.  It should
> > crash the VM, right?
> > 
> > Considering above, I'm thinking whether we should just make it a
> > default behavior that when do migrate_cancel during postcopy-active we
> > just do a pause instead of real cancel. After all it cannot re-start
> > the VM any more on source, so IMHO a real cancel does not mean much
> > here.  More importantly, what if someone wants to manually trigger
> > this pause but accidentally forgot to type that new flag (say,
> > -D[isconnect])?  It'll crash the VM directly.
> > 
> > What do you think?
> 
> Yes, that's OK, just be careful about race conditions between the
> states,  for example what happens if you do a cancel and you enter
> migrate_fd_cancel in postcopy-active, but before you can actually
> cancel you end up completing,

If I am going to modify that, migrate_fd_cancel won't be called if we
are in postcopy-active state, instead we'll just do the disconnect
only.

For finally solving all the races between QMP commands and migration
thread, I do think (again) that we need locks or some other sync
method.  I really hope we can have this fixed in QEMU 2.12.  Basically
we will need to go over every migration command to see whether it'll
need to take the migration lock (to be added) or not.  With that,
it'll save a lot of our future time IMHO thinking about races.

> or the opposite where you do a
> migrate-start-postcopy almost immediately before migrade-cancel;
> do you get to cancel in teh active or postcopy-active state?

This is a good example that at least migrate-start-postcopy is
synchronized somehow nicely with the migration thread using a single
variable (actually it can be non-atomic operation I think, anyway, no
race is here as long as we are delivering the message via a single
variable, and qmp command is the only writter).  For this command I
think it's pretty safe.  After all, user should not run that command
too fast if he/she wants a paused postcopy, at least he/she should do
query-migration before that to make sure the state is postcopy-active.
So IMHO this is totally fine.

Thanks,
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 536a771803..30348a5e27 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1485,6 +1485,24 @@  void qmp_migrate_incoming(const char *uri, Error **errp)
     once = false;
 }
 
+void qmp_migrate_pause(Error **errp)
+{
+    int ret;
+    MigrationState *ms = migrate_get_current();
+
+    if (ms->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+        error_setg(errp, "Migration pause is currently only allowed during"
+                   " an active postcopy phase.");
+        return;
+    }
+
+    ret = qemu_file_shutdown(ms->to_dst_file);
+
+    if (ret) {
+        error_setg(errp, "Failed to pause migration stream.");
+    }
+}
+
 bool migration_is_blocked(Error **errp)
 {
     if (qemu_savevm_state_blocked(errp)) {
diff --git a/qapi/migration.json b/qapi/migration.json
index 4a3eff62f1..52901f7e2e 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1074,6 +1074,28 @@ 
 { 'command': 'migrate-incoming', 'data': {'uri': 'str' } }
 
 ##
+# @migrate-pause:
+#
+# Pause an migration.  Currently it can only pause a postcopy
+# migration.  Pausing a precopy migration is not supported yet.
+#
+# It is mostly used as a manual way to trigger the postcopy paused
+# state when the network sockets hang due to some reason, so that we
+# can try a recovery afterward.
+#
+# Returns: nothing on success
+#
+# Since: 2.12
+#
+# Example:
+#
+# -> { "execute": "migrate-pause" }
+# <- { "return": {} }
+#
+##
+{ 'command': 'migrate-pause' }
+
+##
 # @xen-save-devices-state:
 #
 # Save the state of all devices to file. The RAM and the block devices