Message ID | 1395666264-12060-1-git-send-email-arei.gonglei@huawei.com |
---|---|
State | New |

On 03/24/2014 07:04 AM, arei.gonglei@huawei.com wrote:
> From: zengjunliang <zengjunliang@huawei.com>
>
> Return error for migrate cancel, when migration status is not
> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> perceive the operation fails.
>
> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/qapi/qmp/qerror.h | 3 +++
>  migration.c               | 5 +++--
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/qapi/qmp/qerror.h b/include/qapi/qmp/qerror.h
> index da75abf..b13e3e0 100644
> --- a/include/qapi/qmp/qerror.h
> +++ b/include/qapi/qmp/qerror.h
> @@ -164,6 +164,9 @@ void qerror_report_err(Error *err);
>  #define QERR_MIGRATION_ACTIVE \
>      ERROR_CLASS_GENERIC_ERROR, "There's a migration process in progress"
>
> +#define QERR_MIGRATION_COMPLETED \

New code should NOT be adding macros in qerror.h, but just directly
report the error.

> +    ERROR_CLASS_GENERIC_ERROR, "There's no migration process in progress"

You use a generic error both for migration active and for no migration
in progress. The error API documents that clients (such as libvirt)
must NOT parse the human-readable string. If libvirt is actually going
to behave differently for this particular error, that argues that it
may need a different error category than GENERIC_ERROR.

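
To illustrate the two review comments above, here is a minimal, self-contained sketch. It does not use QEMU's real Error API: `SketchError`, `SketchErrorClass`, `sketch_error_set`, `sketch_client_should_requery` and the class `SKETCH_CLASS_MIGRATION_NOT_ACTIVE` are all hypothetical stand-ins invented for this example. It only shows the shape of the idea: report the error directly at the call site instead of through a shared macro, and let the client branch on an error class rather than on the human-readable string.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT QEMU's real error types. */
typedef enum {
    SKETCH_CLASS_GENERIC_ERROR,
    SKETCH_CLASS_MIGRATION_NOT_ACTIVE   /* assumed dedicated class */
} SketchErrorClass;

typedef struct {
    int is_set;
    SketchErrorClass err_class;
    char msg[128];
} SketchError;

/* Report an error directly at the call site, without a shared macro. */
static void sketch_error_set(SketchError *err, SketchErrorClass err_class,
                             const char *msg)
{
    assert(msg != NULL);
    if (err == NULL) {
        return;                 /* caller is not interested in errors */
    }
    err->is_set = 1;
    err->err_class = err_class;
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
}

/* Client side: decide on the class only, never on the message text. */
static int sketch_client_should_requery(const SketchError *err)
{
    return err->is_set &&
           err->err_class == SKETCH_CLASS_MIGRATION_NOT_ACTIVE;
}
```

With a generic class, two errors that differ only in their message look the same to the client, which is the concern raised in the review.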
On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
> From: zengjunliang <zengjunliang@huawei.com>
>
> Return error for migrate cancel, when migration status is not
> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> perceive the operation fails.
>
> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

I think this is done on purpose, because canceling migration is racy.
Instead, libvirt should do "query-migrate" and check if the migration
was completed or canceled.

Paolo

[adding libvirt]

On 03/24/2014 09:47 AM, Paolo Bonzini wrote:
> On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
>> From: zengjunliang <zengjunliang@huawei.com>
>>
>> Return error for migrate cancel, when migration status is not
>> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
>> perceive the operation fails.
>>
>> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>
> I think this is done on purpose, because canceling migration is racy.
> Instead, libvirt should do "query-migrate" and check if the migration
> was completed or canceled.

Can you please give more details on how you are triggering the problem
with libvirt? I think Paolo is probably right: the bug is more likely
to be in libvirt not expecting the race and not recovering correctly
when the race occurs, than in qemu's state algorithm.

> -----Original Message-----
> From: Eric Blake [mailto:eblake@redhat.com]
> Sent: Tuesday, March 25, 2014 12:01 AM
> To: Paolo Bonzini; Gonglei (Arei); qemu-devel@nongnu.org
> Cc: quintela@redhat.com; owasserm@redhat.com; Yanqiangjun; Zhaoyanbin
> (A); Zengjunliang; libvir-list@redhat.com
> Subject: Re: [PATCH] migration: Fix possible bug for migrate cancel
>
> [adding libvirt]
>
> On 03/24/2014 09:47 AM, Paolo Bonzini wrote:
> > On 24/03/2014 14:04, arei.gonglei@huawei.com wrote:
> >> From: zengjunliang <zengjunliang@huawei.com>
> >>
> >> Return error for migrate cancel, when migration status is not
> >> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> >> perceive the operation fails.
> >>
> >> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> >
> > I think this is done on purpose, because canceling migration is racy.
> > Instead, libvirt should do "query-migrate" and check if the migration
> > was completed or canceled.
>
> Can you please give more details on how you are triggering the problem
> with libvirt? I think Paolo is probably right: the bug is more likely
> to be in libvirt not expecting the race and not recovering correctly
> when the race occurs, than in qemu's state algorithm.
>

When the migration progress reaches 100%, the migration status in QEMU
becomes MIG_STATE_COMPLETED. It then takes some time from
MIG_STATE_COMPLETED until the migration thread's resources are
reclaimed. If we cancel the migration during this window, the
migrate_fd_cancel function breaks out of its loop directly without
reporting an error code. libvirt then considers the cancel operation a
success, contrary to the facts.

Best regards,
-Gonglei

> > >> Return error for migrate cancel, when migration status is not
> > >> MIG_STATE_SETUP or MIG_STATE_ACTIVE. Thus, libvirt can
> > >> perceive the operation fails.
> > >>
> > >> Signed-off-by: zengjunliang <zengjunliang@huawei.com>
> > >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > >
> > > I think this is done on purpose, because canceling migration is racy.
> > > Instead, libvirt should do "query-migrate" and check if the migration
> > > was completed or canceled.
> >
> > Can you please give more details on how you are triggering the problem
> > with libvirt? I think Paolo is probably right: the bug is more likely
> > to be in libvirt not expecting the race and not recovering correctly
> > when the race occurs, than in qemu's state algorithm.
> >
> When the migration progress reaches 100%, the migration status in QEMU
> becomes MIG_STATE_COMPLETED. It then takes some time from
> MIG_STATE_COMPLETED until the migration thread's resources are
> reclaimed. If we cancel the migration during this window, the
> migrate_fd_cancel function breaks out of its loop directly without
> reporting an error code. libvirt then considers the cancel operation a
> success, contrary to the facts.
>

Ping...

Best regards,
-Gonglei

On 28/03/2014 10:18, Gonglei (Arei) wrote:
>> Can you please give more details on how you are triggering the problem
>> with libvirt? I think Paolo is probably right: the bug is more likely
>> to be in libvirt not expecting the race and not recovering correctly
>> when the race occurs, than in qemu's state algorithm.
>>
> When the migration progress reaches 100%, the migration status in QEMU
> becomes MIG_STATE_COMPLETED. It then takes some time from
> MIG_STATE_COMPLETED until the migration thread's resources are
> reclaimed. If we cancel the migration during this window, the
> migrate_fd_cancel function breaks out of its loop directly without
> reporting an error code. libvirt then considers the cancel operation a
> success, contrary to the facts.

There is no error; once migration is completed you can still shut down
on the destination and continue on the source. Libvirt should either:

1) poll with "query-migrate" after migrate_cancel, and report an
error there if that is the desired semantics;

2) toggle a "cancelled" flag before asking QEMU to cancel migration,
and check it in the migration functions after "query-migrate" has
reported completion; if it is true, do not resume on the destination.

Another reason for doing it in libvirt is that the serialization
between cancellation and completion of migration ultimately is
controlled by libvirt's lock. Doing this in QEMU makes it harder to
reason about concurrency.

Paolo

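
The two options above can be combined into one small decision function on the management side. The following is only a sketch under stated assumptions, not libvirt code: `SkJob`, `SkStatus`, `SkAction`, `sk_request_cancel` and `sk_decide` are hypothetical names, and the status values merely mirror the outcomes discussed in this thread (active, cancelled, completed, failed) as a client would see them after polling "query-migrate".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Status as the client would see it after polling; hypothetical names. */
typedef enum { SK_ACTIVE, SK_CANCELLED, SK_COMPLETED, SK_FAILED } SkStatus;

/* What the management layer does next; hypothetical names. */
typedef enum {
    SK_KEEP_POLLING,
    SK_RESUME_ON_DESTINATION,
    SK_RESUME_ON_SOURCE
} SkAction;

typedef struct {
    bool cancel_requested;  /* the "cancelled" flag from option 2 */
} SkJob;

/* Set the flag BEFORE the cancel command would be sent to QEMU. */
static void sk_request_cancel(SkJob *job)
{
    assert(job != NULL);
    job->cancel_requested = true;
    /* ...the real cancel command would be issued here... */
}

/* Decide from the polled status plus the local flag. */
static SkAction sk_decide(const SkJob *job, SkStatus reported)
{
    assert(job != NULL);
    switch (reported) {
    case SK_ACTIVE:
        return SK_KEEP_POLLING;         /* no final state yet */
    case SK_COMPLETED:
        /* Completion raced with the cancel: honour the cancel request. */
        return job->cancel_requested ? SK_RESUME_ON_SOURCE
                                     : SK_RESUME_ON_DESTINATION;
    case SK_CANCELLED:
    case SK_FAILED:
        return SK_RESUME_ON_SOURCE;
    }
    return SK_KEEP_POLLING;             /* defensive default */
}
```

Because the flag is set before the cancel command is sent, a completion that races with the cancel is still resolved in favour of the source, without any error having to come back from QEMU.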
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 28/03/2014 10:18, Gonglei (Arei) wrote:
> >>> Can you please give more details on how you are triggering the problem
> >>> with libvirt? I think Paolo is probably right: the bug is more likely
> >>> to be in libvirt not expecting the race and not recovering correctly
> >>> when the race occurs, than in qemu's state algorithm.
> >>>
> >> When the migration progress reaches 100%, the migration status in QEMU
> >> becomes MIG_STATE_COMPLETED. It then takes some time from
> >> MIG_STATE_COMPLETED until the migration thread's resources are
> >> reclaimed. If we cancel the migration during this window, the
> >> migrate_fd_cancel function breaks out of its loop directly without
> >> reporting an error code. libvirt then considers the cancel operation a
> >> success, contrary to the facts.
>
> There is no error; once migration is completed you can still
> shut down on the destination and continue on the source. Libvirt
> should either:

(I've rewritten my reply below about 4 times, swinging between
different answers; this stuff really isn't obvious, and certainly not
documented.)

I think I agree that it's not an error; but I think migrate_fd_cancel
knows what the outcome will be.

If it was MIG_STATE_ERROR on entry to migrate_fd_cancel, then yes it
could tell you that the cancel failed because you were already in error.

If it was MIG_STATE_COMPLETED on entry to migrate_fd_cancel, then yes it
could tell you that the cancel failed because you already finished.

If it was MIG_STATE_ACTIVE on entry to migrate_fd_cancel, it will go to
MIG_STATE_CANCELLING and I believe eventually to MIG_STATE_CANCELLED; I
don't believe it can get to MIG_STATE_ERROR from that point, since all
of the places in the migrate_thread that transition to error do explicit
ACTIVE->ERROR transitions. I don't believe it can get to
MIG_STATE_COMPLETED for the same reason.

So migrate_fd_cancel knows that the eventual outcome will be error,
cancelled or completed, even if the state isn't there yet, and it could
reply to say that.

> 1) poll with "query-migrate" after migrate_cancel, and report an
> error there if that is the desired semantics;
> 2) toggle a "cancelled" flag before asking QEMU to cancel migration,
> and check it in the migration functions after "query-migrate" has
> reported completion; if it is true, do not resume on the destination.

I think you're right that you have to poll with query-migrate until you
get one of cancelled/failed/completed.

However it's a bit odd; prior to the introduction of 'CANCELLING', the
state that you would get from a query-migrate after migrate_fd_cancel
returned would in principle be the state you ended up in, i.e.
cancelled/failed/completed. With cancelling added, query-migrate might
lie to you and say 'active' (when it's really hiding the fact that
cancelling is happening). So while 'cancelling' apparently didn't alter
the API, it did, in that query-migrate after a cancel can now return
active where it couldn't before.

> Another reason for doing it in libvirt is that the serialization
> between cancellation and completion of migration ultimately is
> controlled by libvirt's lock. Doing this in QEMU makes it harder to
> reason about concurrency.

I think you have to be careful when you talk about 'cancellation and
completion of migration'. In that paragraph I don't think you mean the
same thing as MIG_STATE_CANCELLED and MIG_STATE_COMPLETED; I think
you're talking about the larger-scale idea of completion, once you take
into account that the VM might be paused after qemu has gone to
MIG_STATE_COMPLETED and libvirt might still decide it wants to give up
and use the version on the source that's still paused.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

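
The state argument above can be made concrete with a small model. This is a sketch only, not QEMU code: `SkState`, `sk_set_state` and `sk_predict_outcome` are hypothetical names, the guarded transition is a single-threaded stand-in for whatever atomic compare-and-swap a real implementation would need, and treating SETUP like ACTIVE is an assumption taken from the condition in the patch rather than from the analysis above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the states named in this thread. */
typedef enum {
    SK_STATE_SETUP,
    SK_STATE_ACTIVE,
    SK_STATE_CANCELLING,
    SK_STATE_CANCELLED,
    SK_STATE_COMPLETED,
    SK_STATE_ERROR
} SkState;

/*
 * Guarded transition: move to new_state only if the current state is
 * old_state. Single-threaded stand-in; real code would need an atomic
 * compare-and-swap here.
 */
static bool sk_set_state(SkState *state, SkState old_state, SkState new_state)
{
    if (*state != old_state) {
        return false;
    }
    *state = new_state;
    return true;
}

/* Eventual outcome as predicted from the state seen on entry to cancel. */
static SkState sk_predict_outcome(SkState on_entry)
{
    switch (on_entry) {
    case SK_STATE_COMPLETED:
        return SK_STATE_COMPLETED;  /* too late: already finished */
    case SK_STATE_ERROR:
        return SK_STATE_ERROR;      /* too late: already failed */
    case SK_STATE_SETUP:            /* assumption: handled like ACTIVE */
    case SK_STATE_ACTIVE:
    case SK_STATE_CANCELLING:
    case SK_STATE_CANCELLED:
        return SK_STATE_CANCELLED;
    }
    assert(!"unknown state");
    return SK_STATE_ERROR;
}
```

Once the cancel path has moved ACTIVE to CANCELLING, a later guarded ACTIVE->ERROR or ACTIVE->COMPLETED attempt no longer matches and is rejected, which is the reason given above for why only CANCELLED remains reachable.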
On 28/03/2014 12:30, Dr. David Alan Gilbert wrote:
>> Another reason for doing it in libvirt is that the serialization
>> between cancellation and completion of migration ultimately is
>> controlled by libvirt's lock. Doing this in QEMU makes it harder to
>> reason about concurrency.
> I think you have to be careful when you talk about 'cancellation and
> completion of migration'. In that paragraph I don't think you mean the
> same thing as MIG_STATE_CANCELLED and MIG_STATE_COMPLETED; I think
> you're talking about the larger-scale idea of completion, once you take
> into account that the VM might be paused after qemu has gone to
> MIG_STATE_COMPLETED and libvirt might still decide it wants to give up
> and use the version on the source that's still paused.

Yes, exactly. This is why I considered the possibility of adding a
"cancelled" flag within libvirt. Libvirt always uses -S on the
destination, so it's always possible to cancel migration even after
MIG_STATE_COMPLETED.

Paolo

diff --git a/include/qapi/qmp/qerror.h b/include/qapi/qmp/qerror.h
index da75abf..b13e3e0 100644
--- a/include/qapi/qmp/qerror.h
+++ b/include/qapi/qmp/qerror.h
@@ -164,6 +164,9 @@ void qerror_report_err(Error *err);
 #define QERR_MIGRATION_ACTIVE \
     ERROR_CLASS_GENERIC_ERROR, "There's a migration process in progress"
 
+#define QERR_MIGRATION_COMPLETED \
+    ERROR_CLASS_GENERIC_ERROR, "There's no migration process in progress"
+
 #define QERR_MIGRATION_NOT_SUPPORTED \
     ERROR_CLASS_GENERIC_ERROR, "State blocked by non-migratable device '%s'"
 
diff --git a/migration.c b/migration.c
index e0e24d4..2f34c67 100644
--- a/migration.c
+++ b/migration.c
@@ -336,7 +336,7 @@ void migrate_fd_error(MigrationState *s)
     notifier_list_notify(&migration_state_notifiers, s);
 }
 
-static void migrate_fd_cancel(MigrationState *s)
+static void migrate_fd_cancel(MigrationState *s, Error **errp)
 {
     int old_state ;
     DPRINTF("cancelling migration\n");
@@ -344,6 +344,7 @@ static void migrate_fd_cancel(MigrationState *s)
     do {
         old_state = s->state;
         if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
+            error_set(errp, QERR_MIGRATION_COMPLETED);
             break;
         }
         migrate_set_state(s, old_state, MIG_STATE_CANCELLING);
@@ -470,7 +471,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
 void qmp_migrate_cancel(Error **errp)
 {
-    migrate_fd_cancel(migrate_get_current());
+    migrate_fd_cancel(migrate_get_current(), errp);
 }
 
 void qmp_migrate_set_cache_size(int64_t value, Error **errp)