migration: Fix possible bug for migrate cancel

Message ID 1395666264-12060-1-git-send-email-arei.gonglei@huawei.com
State New

Commit Message

Gonglei (Arei) March 24, 2014, 1:04 p.m. UTC
From: zengjunliang <zengjunliang@huawei.com>

Return an error for migrate cancel when the migration status is not
MIG_STATE_SETUP or MIG_STATE_ACTIVE, so that libvirt can
perceive that the operation failed.

Signed-off-by: zengjunliang <zengjunliang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 include/qapi/qmp/qerror.h | 3 +++
 migration.c               | 5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

Eric Blake March 24, 2014, 2:14 p.m. UTC | #1
On 03/24/2014 07:04 AM, arei.gonglei@huawei.com wrote:
> From: zengjunliang <zengjunliang@huawei.com>
> 
> Return error for migrate cancel, when migration status is not
> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> perceive the operation fails.
> 
> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/qapi/qmp/qerror.h | 3 +++
>  migration.c               | 5 +++--
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/qapi/qmp/qerror.h b/include/qapi/qmp/qerror.h
> index da75abf..b13e3e0 100644
> --- a/include/qapi/qmp/qerror.h
> +++ b/include/qapi/qmp/qerror.h
> @@ -164,6 +164,9 @@ void qerror_report_err(Error *err);
>  #define QERR_MIGRATION_ACTIVE \
>      ERROR_CLASS_GENERIC_ERROR, "There's a migration process in progress"
>  
> +#define QERR_MIGRATION_COMPLETED \

New code should NOT be adding macros in qerror.h, but just directly
report the error.

> +    ERROR_CLASS_GENERIC_ERROR, "There's no migration process in progress"

You use a generic error both for migration active and for no migration
in progress.  The error API documents that clients (such as libvirt)
must NOT parse the human-readable string.  If libvirt is actually going
to behave differently for this particular error, that argues that it may
need a different error category than GENERIC_ERROR.
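
A minimal sketch of the two review points above: report the error where it occurs instead of adding a shared macro, and give it a class that a client can tell apart without reading the message. Everything here is a hypothetical stand-in (`MockError`, `mock_error_set`, `MOCK_CLASS_NO_MIGRATION`); none of it is QEMU's real error API, and the extra class is invented purely for illustration.

```c
#include <stdio.h>

/* Hypothetical stand-ins for an error object; not QEMU's real types. */
typedef enum {
    MOCK_CLASS_GENERIC,       /* all generic errors look alike to a client */
    MOCK_CLASS_NO_MIGRATION   /* invented class, for illustration only */
} MockErrorClass;

typedef struct {
    int is_set;
    MockErrorClass err_class;
    char msg[128];
} MockError;

/* Report an error directly at the call site, with no shared macro. */
static void mock_error_set(MockError *err, MockErrorClass cls, const char *msg)
{
    if (err == NULL) {
        return;               /* caller is not interested in errors */
    }
    err->is_set = 1;
    err->err_class = cls;
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
}

/* A client branches on the class, never on the human-readable text. */
static int client_sees_no_migration(const MockError *err)
{
    return err->is_set && err->err_class == MOCK_CLASS_NO_MIGRATION;
}
```

With only a generic class, `client_sees_no_migration` has nothing to key on except the message string, which is exactly what clients are told not to parse.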
Paolo Bonzini March 24, 2014, 3:47 p.m. UTC | #2
On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
> From: zengjunliang <zengjunliang@huawei.com>
>
> Return error for migrate cancel, when migration status is not
> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> perceive the operation fails.
>
> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

I think this is done on purpose, because canceling migration is racy. 
Instead, libvirt should do "query-migrate" and check if the migration 
was completed or canceled.

Paolo
Eric Blake March 24, 2014, 4 p.m. UTC | #3
[adding libvirt]

On 03/24/2014 09:47 AM, Paolo Bonzini wrote:
> On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
>> From: zengjunliang <zengjunliang@huawei.com>
>>
>> Return error for migrate cancel, when migration status is not
>> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
>> perceive the operation fails.
>>
>> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> 
> I think this is done on purpose, because canceling migration is racy.
> Instead, libvirt should do "query-migrate" and check if the migration
> was completed or canceled.

Can you please give more details on how you are triggering the problem
with libvirt?  I think Paolo is probably right - the bug is more likely
to be in libvirt not expecting the race and not recovering correctly
when the race occurs, than it is to be in changing qemu's state algorithm.
Gonglei (Arei) March 25, 2014, 11:15 a.m. UTC | #4
> -----Original Message-----
> From: Eric Blake [mailto:eblake@redhat.com]
> Sent: Tuesday, March 25, 2014 12:01 AM
> To: Paolo Bonzini; Gonglei (Arei); qemu-devel@nongnu.org
> Cc: quintela@redhat.com; owasserm@redhat.com; Yanqiangjun; Zhaoyanbin
> (A); Zengjunliang; libvir-list@redhat.com
> Subject: Re: [PATCH] migration: Fix possible bug for migrate cancel
>
> [adding libvirt]
>
> On 03/24/2014 09:47 AM, Paolo Bonzini wrote:
> > On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
> >> From: zengjunliang <zengjunliang@huawei.com>
> >>
> >> Return error for migrate cancel, when migration status is not
> >> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> >> perceive the operation fails.
> >>
> >> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> >
> > I think this is done on purpose, because canceling migration is racy.
> > Instead, libvirt should do "query-migrate" and check if the migration
> > was completed or canceled.
>
> Can you please give more details at how you are triggering the problem
> with libvirt?  I think Paolo is probably right - the bug is more likely
> to be in libvirt not expecting the race and not recovering correctly
> when the race occurs, than it is to be in changing qemu's state algorithm.
>

When the migration progress reaches 100%, the migration status becomes MIG_STATE_COMPLETED in QEMU.
There is then a delay between reaching MIG_STATE_COMPLETED and the migration thread's resources being cleaned up.
If we cancel the migration during this window, the migrate_fd_cancel function breaks out of its loop without
reporting an error code. Libvirt then considers the cancel operation a success, which is contrary to the facts.

Best regards,
-Gonglei
Gonglei (Arei) March 28, 2014, 9:18 a.m. UTC | #5
> > >> Return error for migrate cancel, when migration status is not
> > >> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> > >> perceive the operation fails.
> > >>
> > >> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> > >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > >
> > > I think this is done on purpose, because canceling migration is racy.
> > > Instead, libvirt should do "query-migrate" and check if the migration
> > > was completed or canceled.
> >
> > Can you please give more details at how you are triggering the problem
> > with libvirt?  I think Paolo is probably right - the bug is more likely
> > to be in libvirt not expecting the race and not recovering correctly
> > when the race occurs, than it is to be in changing qemu's state algorithm.
> >
> When the migration progress reaches 100%, and the migration status becomes
> MIG_STATE_COMPLETED in Qemu.
> It will take some time which from MIG_STATE_COMPLETED to the migration
> thread resources are recovered.
> If we cancel the migration at this moment, the migrate_fd_cancel function will
> break directly without reporting
> error code. Then, libvirt considers the cancel operation a success, contrary
> facts.
>


Ping... 


Best regards,
-Gonglei
Paolo Bonzini March 28, 2014, 9:28 a.m. UTC | #6
On 28/03/2014 10:18, Gonglei (Arei) wrote:
>> > > Can you please give more details at how you are triggering the problem
>> > > with libvirt?  I think Paolo is probably right - the bug is more likely
>> > > to be in libvirt not expecting the race and not recovering correctly
>> > > when the race occurs, than it is to be in changing qemu's state algorithm.
>> > >
>> When the migration progress reaches 100%, and the migration status becomes
>> MIG_STATE_COMPLETED in Qemu.
>> It will take some time which from MIG_STATE_COMPLETED to the migration
>> thread resources are recovered.
>> If we cancel the migration at this moment, the migrate_fd_cancel function will
>> break directly without reporting
>> error code. Then, libvirt considers the cancel operation a success, contrary
>> facts.

There is no error; once migration is completed you can still shut down on 
the destination and continue on the source.  Libvirt should either:

1) poll with "query-migrate" after migrate_cancel, and report an error 
there if it's the desired semantics;

2) toggle a "cancelled" flag before asking QEMU to cancel migration, 
check it in the migration functions after "query-migrate" reported 
completion; if it is true, do not resume on the destination.

Another reason for doing it in libvirt is that the serialization between 
cancellation and completion of migration ultimately is controlled by 
libvirt's lock.  Doing this in QEMU makes it harder to reason about 
concurrency.

Paolo
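
The two options above can be sketched from the management side roughly as follows. This is a self-contained illustration, not libvirt or QEMU code: `MockStatus`, `poll_until_terminal`, `should_resume_on_destination`, and the scripted query helper are all hypothetical names, and the query callback merely stands in for issuing "query-migrate".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values a manager might derive from "query-migrate". */
typedef enum {
    ST_ACTIVE, ST_CANCELLING, ST_CANCELLED, ST_COMPLETED, ST_FAILED
} MockStatus;

/* Stand-in for one "query-migrate" round trip. */
typedef MockStatus (*MockQueryFn)(void *opaque);

/* Scripted answers, so the sketch can run without a real QEMU. */
typedef struct {
    const MockStatus *steps;
    size_t len;               /* must be at least 1 */
    size_t pos;
} MockScript;

static MockStatus scripted_query(void *opaque)
{
    MockScript *sc = (MockScript *)opaque;
    MockStatus s = sc->steps[sc->pos];
    if (sc->pos + 1 < sc->len) {
        sc->pos++;            /* the last answer repeats forever */
    }
    return s;
}

static bool is_terminal(MockStatus s)
{
    return s == ST_CANCELLED || s == ST_COMPLETED || s == ST_FAILED;
}

/* Option 1: after asking for a cancel, poll until a terminal state. */
static bool poll_until_terminal(MockQueryFn query, void *opaque,
                                int max_polls, MockStatus *final)
{
    for (int i = 0; i < max_polls; i++) {
        MockStatus s = query(opaque);
        if (is_terminal(s)) {
            *final = s;
            return true;
        }
        /* A real caller would sleep or wait for an event here. */
    }
    return false;             /* gave up; *final is left untouched */
}

/* Option 2: a manager-side flag, set before the cancel request is sent,
 * decides whether a completed migration is resumed on the destination. */
static bool should_resume_on_destination(MockStatus final,
                                         bool cancel_requested)
{
    return final == ST_COMPLETED && !cancel_requested;
}
```

If the cancel request loses the race and the poll reports completion, `should_resume_on_destination(ST_COMPLETED, true)` is false, so the manager keeps the guest on the source without needing QEMU to return an error.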
Dr. David Alan Gilbert March 28, 2014, 11:30 a.m. UTC | #7
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 28/03/2014 10:18, Gonglei (Arei) wrote:
> >>> > Can you please give more details at how you are triggering the problem
> >>> > with libvirt?  I think Paolo is probably right - the bug is more likely
> >>> > to be in libvirt not expecting the race and not recovering correctly
> >>> > when the race occurs, than it is to be in changing qemu's state algorithm.
> >>> >
> >>When the migration progress reaches 100%, and the migration status becomes
> >>MIG_STATE_COMPLETED in Qemu.
> >>It will take some time which from MIG_STATE_COMPLETED to the migration
> >>thread resources are recovered.
> >>If we cancel the migration at this moment, the migrate_fd_cancel function will
> >>break directly without reporting
> >>error code. Then, libvirt considers the cancel operation a success, contrary
> >>facts.
> 
> There is no error, once migration is completed you can still
> shutdown on the destination and continue on the source.  Libvirt
> should either:

(I've rewritten my reply below about 4 times - swinging between
different answers, this stuff really isn't obvious, and certainly
not documented)

I think I agree that it's not an error; but I think migrate_fd_cancel
knows what the outcome will be.

If it was MIG_STATE_ERROR on entry to migrate_fd_cancel, then yes it
could tell you that the cancel failed because you were already in error.

If it was MIG_STATE_COMPLETED on entry to migrate_fd_cancel, then yes it
could tell you that the cancel failed because you already finished.

If it was MIG_STATE_ACTIVE on entry to migrate_fd_cancel - it will go to
MIG_STATE_CANCELLING and I believe eventually to MIG_STATE_CANCELLED;
I don't believe it can get to MIG_STATE_ERROR from that point, since
all of the places in the migrate_thread that transition to error
do explicit ACTIVE->ERROR transitions.  I don't believe it can get to
MIG_STATE_COMPLETED for the same reason.

So migrate_fd_cancel knows that the eventual outcome will be Error
or Cancelled or completed, even if the state isn't there yet, and it
could reply to say that.
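
As a rough sketch of the case analysis above, the mapping from the state on entry to the outcome that a cancel request could report might look like this. The enum and function names are hypothetical mocks rather than QEMU's definitions, and only the three entry states discussed above are mapped; everything else is deliberately left as unknown.

```c
/* Hypothetical mirror of the migration states named in this thread. */
typedef enum {
    MOCK_SETUP, MOCK_ACTIVE, MOCK_CANCELLING,
    MOCK_CANCELLED, MOCK_COMPLETED, MOCK_ERROR
} MockMigState;

typedef enum {
    OUTCOME_CANCELLED,   /* cancel takes effect, eventually */
    OUTCOME_COMPLETED,   /* too late, migration already finished */
    OUTCOME_ERROR,       /* too late, migration already failed */
    OUTCOME_UNKNOWN      /* entry state not covered by the analysis */
} MockOutcome;

/* Predict the eventual outcome from the state seen on entry to cancel. */
static MockOutcome predict_cancel_outcome(MockMigState on_entry)
{
    switch (on_entry) {
    case MOCK_ERROR:
        return OUTCOME_ERROR;
    case MOCK_COMPLETED:
        return OUTCOME_COMPLETED;
    case MOCK_ACTIVE:
        /* Goes to CANCELLING first; believed to end up in CANCELLED. */
        return OUTCOME_CANCELLED;
    default:
        return OUTCOME_UNKNOWN;
    }
}
```

The prediction for the active case rests on the belief stated above that the migration thread only performs explicit ACTIVE->ERROR transitions, so it is an assumption of the sketch rather than a guarantee.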

> 1) poll with "query-migrate" after migrate_cancel, and report an
> error there if it's the desired semantics;
> 2) toggle a "cancelled" flag before asking QEMU to cancel migration,
> check it in the migration functions after "query-migrate" reported
> completion; if it is true, do not resume on the destination.

I think you're right that you have to poll with query-migrate until you
get one of cancelled/failed/completed.

However it's a bit odd; prior to the introduction of 'CANCELLING', the
state that you would get from a query-migrate after migrate_fd_cancel
returned would in principle be the state you ended up in - i.e.
cancelled/failed/completed.  With cancelling added, query-migrate
might lie to you and say 'active' (when it's really hiding the
fact that cancelling is happening).  So while 'cancelling' apparently
didn't alter the API, in practice it did: query-migrate after a cancel
can now return active where it couldn't before.

> Another reason for doing it in libvirt is that the serialization
> between cancellation and completion of migration ultimately is
> controlled by libvirt's lock.  Doing this in QEMU makes it harder to
> reason about concurrency.

I think you have to be careful when you talk about 'cancellation and completion
of migration' - in that paragraph I don't think you mean the same thing
as MIG_STATE_CANCELLED and MIG_STATE_COMPLETED, I think you're talking
about the larger scale idea of completion after you take into account
that the VM might be paused after qemu has gone to MIG_STATE_COMPLETED and
libvirt might still decide it wants to give up and use the version on
the source that's still paused.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini March 28, 2014, 12:16 p.m. UTC | #8
On 28/03/2014 12:30, Dr. David Alan Gilbert wrote:
>> > Another reason for doing it in libvirt is that the serialization
>> > between cancellation and completion of migration ultimately is
>> > controlled by libvirt's lock.  Doing this in QEMU makes it harder to
>> > reason about concurrency.
> I think you have to be careful when you talk about 'cancellation and completion
> of migration' - in that paragraph I don't think you mean the same thing
> as MIG_STATE_CANCELLED and MIG_STATE_COMPLETED, I think you're talking
> about the larger scale idea of completion after you take into account
> that the VM might be paused after qemu has gone to MIG_STATE_COMPLETED and
> libvirt might still decide it wants to give up and use the version on
> the source that's still paused.

Yes, exactly.  This is why I considered the possibility of adding a 
"cancelled" flag within libvirt.

Libvirt always uses -S on the destination, so it's always possible to 
cancel migration even after MIG_STATE_COMPLETED.

Paolo

Patch

diff --git a/include/qapi/qmp/qerror.h b/include/qapi/qmp/qerror.h
index da75abf..b13e3e0 100644
--- a/include/qapi/qmp/qerror.h
+++ b/include/qapi/qmp/qerror.h
@@ -164,6 +164,9 @@  void qerror_report_err(Error *err);
 #define QERR_MIGRATION_ACTIVE \
     ERROR_CLASS_GENERIC_ERROR, "There's a migration process in progress"
 
+#define QERR_MIGRATION_COMPLETED \
+    ERROR_CLASS_GENERIC_ERROR, "There's no migration process in progress"
+
 #define QERR_MIGRATION_NOT_SUPPORTED \
     ERROR_CLASS_GENERIC_ERROR, "State blocked by non-migratable device '%s'"
 
diff --git a/migration.c b/migration.c
index e0e24d4..2f34c67 100644
--- a/migration.c
+++ b/migration.c
@@ -336,7 +336,7 @@  void migrate_fd_error(MigrationState *s)
     notifier_list_notify(&migration_state_notifiers, s);
 }
 
-static void migrate_fd_cancel(MigrationState *s)
+static void migrate_fd_cancel(MigrationState *s, Error **errp)
 {
     int old_state ;
     DPRINTF("cancelling migration\n");
@@ -344,6 +344,7 @@  static void migrate_fd_cancel(MigrationState *s)
     do {
         old_state = s->state;
         if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
+            error_set(errp, QERR_MIGRATION_COMPLETED);
             break;
         }
         migrate_set_state(s, old_state, MIG_STATE_CANCELLING);
@@ -470,7 +471,7 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
 void qmp_migrate_cancel(Error **errp)
 {
-    migrate_fd_cancel(migrate_get_current());
+    migrate_fd_cancel(migrate_get_current(), errp);
 }
 
 void qmp_migrate_set_cache_size(int64_t value, Error **errp)