[V8,11/17] qapi: Add new command to query colo status

Message ID	20180603050546.6827-12-zhangckid@gmail.com
State	New
Headers
    Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
    From: Zhang Chen <zhangckid@gmail.com>
    To: qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>, Juan Quintela <quintela@redhat.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, Jason Wang <jasowang@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>
    Date: Sun, 3 Jun 2018 13:05:40 +0800
    Message-Id: <20180603050546.6827-12-zhangckid@gmail.com>
    In-Reply-To: <20180603050546.6827-1-zhangckid@gmail.com>
    References: <20180603050546.6827-1-zhangckid@gmail.com>
    Subject: [Qemu-devel] [PATCH V8 11/17] qapi: Add new command to query colo status
    Precedence: list
    Cc: zhanghailiang <zhang.zhanghailiang@huawei.com>, Li Zhijian <lizhijian@cn.fujitsu.com>, Zhang Chen <zhangckid@gmail.com>
    Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
    Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	COLO: integrate colo frame with block replication and COLO proxy
    [V8,00/17] COLO: integrate colo frame with block replication and COLO proxy
    [V8,01/17] filter-rewriter: fix memory leak for connection in connection_track_table
    [V8,02/17] colo-compare: implement the process of checkpoint
    [V8,03/17] colo-compare: use notifier to notify packets comparing result
    [V8,04/17] COLO: integrate colo compare with colo frame
    [V8,05/17] COLO: Add block replication into colo process
    [V8,06/17] COLO: Remove colo_state migration struct
    [V8,07/17] COLO: Load dirty pages into SVM's RAM cache firstly
    [V8,08/17] ram/COLO: Record the dirty pages that SVM received
    [V8,09/17] COLO: Flush memory data from ram cache
    [V8,10/17] qmp event: Add COLO_EXIT event to notify users while exited COLO
    [V8,11/17] qapi: Add new command to query colo status
    [V8,12/17] savevm: split the process of different stages for loadvm/savevm
    [V8,13/17] COLO: flush host dirty ram from cache
    [V8,14/17] filter: Add handle_event method for NetFilterClass
    [V8,15/17] filter-rewriter: handle checkpoint and failover event
    [V8,16/17] COLO: notify net filters about checkpoint/failover event
    [V8,17/17] COLO: quick failover process by kick COLO thread

Zhang Chen June 3, 2018, 5:05 a.m. UTC

Libvirt or other high-level software can use this command to query the COLO status.
You can test this command like this:
{'execute':'query-colo-status'}

Signed-off-by: Zhang Chen <zhangckid@gmail.com>
---
 migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
 qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)
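
For anyone who wants to try the command without libvirt, here is a minimal
Python sketch of a QMP client. It is only an illustration under a few
assumptions: QEMU is assumed to expose QMP on a UNIX socket whose path you
pass in, and the helper names qmp_command() and query_colo_status() are
made up for this example and are not part of the patch.

```python
import json
import socket


def qmp_command(chan, name):
    """Send one QMP command over a file-like channel and return its payload.

    chan only needs write(), flush() and readline(); name is the command name.
    """
    chan.write(json.dumps({"execute": name}) + "\n")
    chan.flush()
    while True:
        line = chan.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)
        if "event" in msg:
            # Asynchronous events may arrive before the reply; skip them.
            continue
        if "error" in msg:
            raise RuntimeError(msg["error"].get("desc", "QMP error"))
        return msg["return"]


def query_colo_status(path):
    """Connect to a QMP UNIX socket (path is an assumption) and query COLO."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        chan = sock.makefile("rw", encoding="utf-8")
        json.loads(chan.readline())            # discard the QMP greeting
        qmp_command(chan, "qmp_capabilities")  # leave negotiation mode
        return qmp_command(chan, "query-colo-status")
```

With a running guest this would be called as
query_colo_status("/path/to/qmp.sock"), where the path is whatever was given
to QEMU when the QMP socket was set up.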

Eric Blake June 4, 2018, 10:23 p.m. UTC | #1

On 06/03/2018 12:05 AM, Zhang Chen wrote:
> Libvirt or other high level software can use this command query colo status.
> You can test this command like that:
> {'execute':'query-colo-status'}
> 
> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> ---

> +++ b/qapi/migration.json
> @@ -1231,6 +1231,40 @@
>   ##
>   { 'command': 'xen-colo-do-checkpoint' }
>   
> +##
> +# @COLOStatus:
> +#
> +# The result format for 'query-colo-status'.
> +#
> +# @mode: COLO running mode. If COLO is running, this field will return
> +#        'primary' or 'secodary'.

s/secodary/secondary/

> +#
> +# @colo-running: true if COLO is running.
> +#
> +# @reason: describes the reason for the COLO exit.
> +#
> +# Since: 2.13

3.0

> +##
> +{ 'struct': 'COLOStatus',
> +  'data': { 'mode': 'COLOMode', 'colo-running': 'bool', 'reason': 'COLOExitReason' } }
> +
> +##
> +# @query-colo-status:
> +#
> +# Query COLO status while the vm is running.
> +#
> +# Returns: A @COLOStatus object showing the status.
> +#
> +# Example:
> +#
> +# -> { "execute": "query-colo-status" }
> +# <- { "return": { "mode": "primary", "colo-running": true, "reason": "request" } }
> +#
> +# Since: 2.13

3.0

> +##
> +{ 'command': 'query-colo-status',
> +  'returns': 'COLOStatus' }
> +
>   ##
>   # @migrate-recover:
>   #
>

Markus Armbruster June 7, 2018, 12:59 p.m. UTC | #2

Zhang Chen <zhangckid@gmail.com> writes:

> Libvirt or other high level software can use this command query colo status.
> You can test this command like that:
> {'execute':'query-colo-status'}
>
> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> ---
>  migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
>  qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 73 insertions(+)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index bedb677788..8c6b8e9a4e 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -29,6 +29,7 @@
>  #include "net/colo.h"
>  #include "block/block.h"
>  #include "qapi/qapi-events-migration.h"
> +#include "qapi/qmp/qerror.h"
>  
>  static bool vmstate_loading;
>  static Notifier packets_compare_notifier;
> @@ -237,6 +238,44 @@ void qmp_xen_colo_do_checkpoint(Error **errp)
>  #endif
>  }
>  
> +COLOStatus *qmp_query_colo_status(Error **errp)
> +{
> +    int state;
> +    COLOStatus *s = g_new0(COLOStatus, 1);
> +
> +    s->mode = get_colo_mode();
> +
> +    switch (s->mode) {
> +    case COLO_MODE_UNKNOWN:
> +        error_setg(errp, "COLO is disabled");
> +        state = MIGRATION_STATUS_NONE;
> +        break;
> +    case COLO_MODE_PRIMARY:
> +        state = migrate_get_current()->state;
> +        break;
> +    case COLO_MODE_SECONDARY:
> +        state = migration_incoming_get_current()->state;
> +        break;
> +    default:
> +        abort();
> +    }
> +
> +    s->colo_running = state == MIGRATION_STATUS_COLO;
> +
> +    switch (failover_get_state()) {
> +    case FAILOVER_STATUS_NONE:
> +        s->reason = COLO_EXIT_REASON_NONE;
> +        break;
> +    case FAILOVER_STATUS_REQUIRE:
> +        s->reason = COLO_EXIT_REASON_REQUEST;
> +        break;
> +    default:
> +        s->reason = COLO_EXIT_REASON_ERROR;
> +    }
> +
> +    return s;
> +}
> +
>  static void colo_send_message(QEMUFile *f, COLOMessage msg,
>                                Error **errp)
>  {
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 93136ce5a0..356a370949 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1231,6 +1231,40 @@
>  ##
>  { 'command': 'xen-colo-do-checkpoint' }
>  
> +##
> +# @COLOStatus:
> +#
> +# The result format for 'query-colo-status'.
> +#
> +# @mode: COLO running mode. If COLO is running, this field will return
> +#        'primary' or 'secodary'.
> +#
> +# @colo-running: true if COLO is running.
> +#
> +# @reason: describes the reason for the COLO exit.

What's the value of @reason before a "COLO exit"?

> +#
> +# Since: 2.13
> +##
> +{ 'struct': 'COLOStatus',
> +  'data': { 'mode': 'COLOMode', 'colo-running': 'bool', 'reason': 'COLOExitReason' } }
> +
> +##
> +# @query-colo-status:
> +#
> +# Query COLO status while the vm is running.
> +#
> +# Returns: A @COLOStatus object showing the status.
> +#
> +# Example:
> +#
> +# -> { "execute": "query-colo-status" }
> +# <- { "return": { "mode": "primary", "colo-running": true, "reason": "request" } }
> +#
> +# Since: 2.13
> +##
> +{ 'command': 'query-colo-status',
> +  'returns': 'COLOStatus' }
> +
>  ##
>  # @migrate-recover:
>  #

Zhang Chen June 10, 2018, 5:39 p.m. UTC | #3

On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com> wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > Libvirt or other high level software can use this command query colo
> status.
> > You can test this command like that:
> > {'execute':'query-colo-status'}
> >
> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> > ---
> >  migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
> >  qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 73 insertions(+)
> >
> > diff --git a/migration/colo.c b/migration/colo.c
> > index bedb677788..8c6b8e9a4e 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -29,6 +29,7 @@
> >  #include "net/colo.h"
> >  #include "block/block.h"
> >  #include "qapi/qapi-events-migration.h"
> > +#include "qapi/qmp/qerror.h"
> >
> >  static bool vmstate_loading;
> >  static Notifier packets_compare_notifier;
> > @@ -237,6 +238,44 @@ void qmp_xen_colo_do_checkpoint(Error **errp)
> >  #endif
> >  }
> >
> > +COLOStatus *qmp_query_colo_status(Error **errp)
> > +{
> > +    int state;
> > +    COLOStatus *s = g_new0(COLOStatus, 1);
> > +
> > +    s->mode = get_colo_mode();
> > +
> > +    switch (s->mode) {
> > +    case COLO_MODE_UNKNOWN:
> > +        error_setg(errp, "COLO is disabled");
> > +        state = MIGRATION_STATUS_NONE;
> > +        break;
> > +    case COLO_MODE_PRIMARY:
> > +        state = migrate_get_current()->state;
> > +        break;
> > +    case COLO_MODE_SECONDARY:
> > +        state = migration_incoming_get_current()->state;
> > +        break;
> > +    default:
> > +        abort();
> > +    }
> > +
> > +    s->colo_running = state == MIGRATION_STATUS_COLO;
> > +
> > +    switch (failover_get_state()) {
> > +    case FAILOVER_STATUS_NONE:
> > +        s->reason = COLO_EXIT_REASON_NONE;
> > +        break;
> > +    case FAILOVER_STATUS_REQUIRE:
> > +        s->reason = COLO_EXIT_REASON_REQUEST;
> > +        break;
> > +    default:
> > +        s->reason = COLO_EXIT_REASON_ERROR;
> > +    }
> > +
> > +    return s;
> > +}
> > +
> >  static void colo_send_message(QEMUFile *f, COLOMessage msg,
> >                                Error **errp)
> >  {
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 93136ce5a0..356a370949 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -1231,6 +1231,40 @@
> >  ##
> >  { 'command': 'xen-colo-do-checkpoint' }
> >
> > +##
> > +# @COLOStatus:
> > +#
> > +# The result format for 'query-colo-status'.
> > +#
> > +# @mode: COLO running mode. If COLO is running, this field will return
> > +#        'primary' or 'secodary'.
> > +#
> > +# @colo-running: true if COLO is running.
> > +#
> > +# @reason: describes the reason for the COLO exit.
>
> What's the value of @reason before a "COLO exit"?
>

Before a "COLO exit", we just return 'none' in this field.

Thanks
Zhang Chen


>
> > +#
> > +# Since: 2.13
> > +##
> > +{ 'struct': 'COLOStatus',
> > +  'data': { 'mode': 'COLOMode', 'colo-running': 'bool', 'reason':
> 'COLOExitReason' } }
> > +
> > +##
> > +# @query-colo-status:
> > +#
> > +# Query COLO status while the vm is running.
> > +#
> > +# Returns: A @COLOStatus object showing the status.
> > +#
> > +# Example:
> > +#
> > +# -> { "execute": "query-colo-status" }
> > +# <- { "return": { "mode": "primary", "colo-running": true, "reason":
> "request" } }
> > +#
> > +# Since: 2.13
> > +##
> > +{ 'command': 'query-colo-status',
> > +  'returns': 'COLOStatus' }
> > +
> >  ##
> >  # @migrate-recover:
> >  #
>

Zhang Chen June 10, 2018, 5:42 p.m. UTC | #4

On Tue, Jun 5, 2018 at 6:23 AM, Eric Blake <eblake@redhat.com> wrote:

> On 06/03/2018 12:05 AM, Zhang Chen wrote:
>
>> Libvirt or other high level software can use this command query colo
>> status.
>> You can test this command like that:
>> {'execute':'query-colo-status'}
>>
>> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
>> ---
>>
>
> +++ b/qapi/migration.json
>> @@ -1231,6 +1231,40 @@
>>   ##
>>   { 'command': 'xen-colo-do-checkpoint' }
>>   +##
>> +# @COLOStatus:
>> +#
>> +# The result format for 'query-colo-status'.
>> +#
>> +# @mode: COLO running mode. If COLO is running, this field will return
>> +#        'primary' or 'secodary'.
>>
>
> s/secodary/secondary/
>
> +#
>> +# @colo-running: true if COLO is running.
>> +#
>> +# @reason: describes the reason for the COLO exit.
>> +#
>> +# Since: 2.13
>>
>
> 3.0
>
> +##
>> +{ 'struct': 'COLOStatus',
>> +  'data': { 'mode': 'COLOMode', 'colo-running': 'bool', 'reason':
>> 'COLOExitReason' } }
>> +
>> +##
>> +# @query-colo-status:
>> +#
>> +# Query COLO status while the vm is running.
>> +#
>> +# Returns: A @COLOStatus object showing the status.
>> +#
>> +# Example:
>> +#
>> +# -> { "execute": "query-colo-status" }
>> +# <- { "return": { "mode": "primary", "colo-running": true, "reason":
>> "request" } }
>> +#
>> +# Since: 2.13
>>
>
> 3.0


Oh, I can't see the new Qemu plan...

Thank you for the reminder.
Zhang Chen



>
>
> +##
>> +{ 'command': 'query-colo-status',
>> +  'returns': 'COLOStatus' }
>> +
>>   ##
>>   # @migrate-recover:
>>   #
>>
>>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>

Zhang Chen June 10, 2018, 5:53 p.m. UTC | #5

On Mon, Jun 11, 2018 at 1:42 AM, Zhang Chen <zhangckid@gmail.com> wrote:

>
>
> On Tue, Jun 5, 2018 at 6:23 AM, Eric Blake <eblake@redhat.com> wrote:
>
>> On 06/03/2018 12:05 AM, Zhang Chen wrote:
>>
>>> Libvirt or other high level software can use this command query colo
>>> status.
>>> You can test this command like that:
>>> {'execute':'query-colo-status'}
>>>
>>> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
>>> ---
>>>
>>
>> +++ b/qapi/migration.json
>>> @@ -1231,6 +1231,40 @@
>>>   ##
>>>   { 'command': 'xen-colo-do-checkpoint' }
>>>   +##
>>> +# @COLOStatus:
>>> +#
>>> +# The result format for 'query-colo-status'.
>>> +#
>>> +# @mode: COLO running mode. If COLO is running, this field will return
>>> +#        'primary' or 'secodary'.
>>>
>>
>> s/secodary/secondary/
>>
>> +#
>>> +# @colo-running: true if COLO is running.
>>> +#
>>> +# @reason: describes the reason for the COLO exit.
>>> +#
>>> +# Since: 2.13
>>>
>>
>> 3.0
>>
>> +##
>>> +{ 'struct': 'COLOStatus',
>>> +  'data': { 'mode': 'COLOMode', 'colo-running': 'bool', 'reason':
>>> 'COLOExitReason' } }
>>> +
>>> +##
>>> +# @query-colo-status:
>>> +#
>>> +# Query COLO status while the vm is running.
>>> +#
>>> +# Returns: A @COLOStatus object showing the status.
>>> +#
>>> +# Example:
>>> +#
>>> +# -> { "execute": "query-colo-status" }
>>> +# <- { "return": { "mode": "primary", "colo-running": true, "reason":
>>> "request" } }
>>> +#
>>> +# Since: 2.13
>>>
>>
>> 3.0
>
>
> Oh, I can't see the new Qemu plan...
>

Typo: Sorry, I meant that I just forgot to check the new plan.


>
> Thank you for the reminder.
> Zhang Chen
>
>
>
>>
>>
>> +##
>>> +{ 'command': 'query-colo-status',
>>> +  'returns': 'COLOStatus' }
>>> +
>>>   ##
>>>   # @migrate-recover:
>>>   #
>>>
>>>
>> --
>> Eric Blake, Principal Software Engineer
>> Red Hat, Inc.           +1-919-301-3266
>> Virtualization:  qemu.org | libvirt.org
>>
>
>

Markus Armbruster June 11, 2018, 6:48 a.m. UTC | #6

Zhang Chen <zhangckid@gmail.com> writes:

> On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com> wrote:
>
>> Zhang Chen <zhangckid@gmail.com> writes:
>>
>> > Libvirt or other high level software can use this command query colo
>> status.
>> > You can test this command like that:
>> > {'execute':'query-colo-status'}
>> >
>> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
>> > ---
>> >  migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
>> >  qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
>> >  2 files changed, 73 insertions(+)
>> >
>> > diff --git a/migration/colo.c b/migration/colo.c
>> > index bedb677788..8c6b8e9a4e 100644
>> > --- a/migration/colo.c
>> > +++ b/migration/colo.c
>> > @@ -29,6 +29,7 @@
>> >  #include "net/colo.h"
>> >  #include "block/block.h"
>> >  #include "qapi/qapi-events-migration.h"
>> > +#include "qapi/qmp/qerror.h"
>> >
>> >  static bool vmstate_loading;
>> >  static Notifier packets_compare_notifier;
>> > @@ -237,6 +238,44 @@ void qmp_xen_colo_do_checkpoint(Error **errp)
>> >  #endif
>> >  }
>> >
>> > +COLOStatus *qmp_query_colo_status(Error **errp)
>> > +{
>> > +    int state;
>> > +    COLOStatus *s = g_new0(COLOStatus, 1);
>> > +
>> > +    s->mode = get_colo_mode();
>> > +
>> > +    switch (s->mode) {
>> > +    case COLO_MODE_UNKNOWN:
>> > +        error_setg(errp, "COLO is disabled");
>> > +        state = MIGRATION_STATUS_NONE;
>> > +        break;
>> > +    case COLO_MODE_PRIMARY:
>> > +        state = migrate_get_current()->state;
>> > +        break;
>> > +    case COLO_MODE_SECONDARY:
>> > +        state = migration_incoming_get_current()->state;
>> > +        break;
>> > +    default:
>> > +        abort();
>> > +    }
>> > +
>> > +    s->colo_running = state == MIGRATION_STATUS_COLO;
>> > +
>> > +    switch (failover_get_state()) {
>> > +    case FAILOVER_STATUS_NONE:
>> > +        s->reason = COLO_EXIT_REASON_NONE;
>> > +        break;
>> > +    case FAILOVER_STATUS_REQUIRE:
>> > +        s->reason = COLO_EXIT_REASON_REQUEST;
>> > +        break;
>> > +    default:
>> > +        s->reason = COLO_EXIT_REASON_ERROR;
>> > +    }
>> > +
>> > +    return s;
>> > +}
>> > +
>> >  static void colo_send_message(QEMUFile *f, COLOMessage msg,
>> >                                Error **errp)
>> >  {
>> > diff --git a/qapi/migration.json b/qapi/migration.json
>> > index 93136ce5a0..356a370949 100644
>> > --- a/qapi/migration.json
>> > +++ b/qapi/migration.json
>> > @@ -1231,6 +1231,40 @@
>> >  ##
>> >  { 'command': 'xen-colo-do-checkpoint' }
>> >
>> > +##
>> > +# @COLOStatus:
>> > +#
>> > +# The result format for 'query-colo-status'.
>> > +#
>> > +# @mode: COLO running mode. If COLO is running, this field will return
>> > +#        'primary' or 'secodary'.
>> > +#
>> > +# @colo-running: true if COLO is running.
>> > +#
>> > +# @reason: describes the reason for the COLO exit.
>>
>> What's the value of @reason before a "COLO exit"?
>>
>
> Before a "COLO exit", we just return 'none' in this field.

Please add that to the documentation.

Please excuse my ignorance on COLO...  I'm still not sure I fully
understand how the three members are related, or even how the COLO state
machine works and how it's related to / embedded in RunState.  I searched
docs/ for a state diagram, but couldn't find one.

According to runstate_transitions_def[], the part of the RunState state
machine that's directly connected to state "colo" looks like this:

    inmigrate  -+
                |
    paused  ----+
                |
    migrate  ---+->  colo  <------>  running
                |
    suspended  -+
                |
    watchdog  --+

For each of the seven state transitions: how is the state transition
triggered (e.g. by QMP command, spontaneously when a certain condition
is detected, ...), and what events (if any) are emitted then?
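
To make the seven transitions concrete, the diagram above can be written
down as a small table. This is only a transcription of the diagram into
Python for discussion; the state names are copied from the diagram as drawn,
and the helper names are invented for this sketch.

```python
# The seven transitions touching "colo", transcribed from the diagram above.
COLO_TRANSITIONS = [
    ("inmigrate", "colo"),
    ("paused", "colo"),
    ("migrate", "colo"),
    ("suspended", "colo"),
    ("watchdog", "colo"),
    ("colo", "running"),
    ("running", "colo"),
]


def can_transition(src, dst):
    """Return True if the diagram has an edge from src to dst."""
    return (src, dst) in COLO_TRANSITIONS


def predecessors(state):
    """Return the sorted list of states with an edge into the given state."""
    return sorted(src for src, dst in COLO_TRANSITIONS if dst == state)
```

Each tuple in the table is one of the transitions for which I'd like to know
the trigger and the emitted events.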

How is @colo-running related to the run state?

Which run states are considered to be "before a COLO exit"?  If "before
a COLO exit" doesn't map to run states, the state machine is too coarse
to fully describe COLO, and I'd like to see a suitably refined one.

If @colo-running is true, then @mode is either "primary" or "secondary".
What are the possible values when @colo-running is false?

[...]

Zhang Chen June 11, 2018, 3:34 p.m. UTC | #7

On Mon, Jun 11, 2018 at 2:48 PM, Markus Armbruster <armbru@redhat.com>
wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com>
> wrote:
> >
> >> Zhang Chen <zhangckid@gmail.com> writes:
> >>
> >> > Libvirt or other high level software can use this command query colo
> >> status.
> >> > You can test this command like that:
> >> > {'execute':'query-colo-status'}
> >> >
> >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> >> > ---
> >> >  migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
> >> >  qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
> >> >  2 files changed, 73 insertions(+)
> >> >
> >> > diff --git a/migration/colo.c b/migration/colo.c
> >> > index bedb677788..8c6b8e9a4e 100644
> >> > --- a/migration/colo.c
> >> > +++ b/migration/colo.c
> >> > @@ -29,6 +29,7 @@
> >> >  #include "net/colo.h"
> >> >  #include "block/block.h"
> >> >  #include "qapi/qapi-events-migration.h"
> >> > +#include "qapi/qmp/qerror.h"
> >> >
> >> >  static bool vmstate_loading;
> >> >  static Notifier packets_compare_notifier;
> >> > @@ -237,6 +238,44 @@ void qmp_xen_colo_do_checkpoint(Error **errp)
> >> >  #endif
> >> >  }
> >> >
> >> > +COLOStatus *qmp_query_colo_status(Error **errp)
> >> > +{
> >> > +    int state;
> >> > +    COLOStatus *s = g_new0(COLOStatus, 1);
> >> > +
> >> > +    s->mode = get_colo_mode();
> >> > +
> >> > +    switch (s->mode) {
> >> > +    case COLO_MODE_UNKNOWN:
> >> > +        error_setg(errp, "COLO is disabled");
> >> > +        state = MIGRATION_STATUS_NONE;
> >> > +        break;
> >> > +    case COLO_MODE_PRIMARY:
> >> > +        state = migrate_get_current()->state;
> >> > +        break;
> >> > +    case COLO_MODE_SECONDARY:
> >> > +        state = migration_incoming_get_current()->state;
> >> > +        break;
> >> > +    default:
> >> > +        abort();
> >> > +    }
> >> > +
> >> > +    s->colo_running = state == MIGRATION_STATUS_COLO;
> >> > +
> >> > +    switch (failover_get_state()) {
> >> > +    case FAILOVER_STATUS_NONE:
> >> > +        s->reason = COLO_EXIT_REASON_NONE;
> >> > +        break;
> >> > +    case FAILOVER_STATUS_REQUIRE:
> >> > +        s->reason = COLO_EXIT_REASON_REQUEST;
> >> > +        break;
> >> > +    default:
> >> > +        s->reason = COLO_EXIT_REASON_ERROR;
> >> > +    }
> >> > +
> >> > +    return s;
> >> > +}
> >> > +
> >> >  static void colo_send_message(QEMUFile *f, COLOMessage msg,
> >> >                                Error **errp)
> >> >  {
> >> > diff --git a/qapi/migration.json b/qapi/migration.json
> >> > index 93136ce5a0..356a370949 100644
> >> > --- a/qapi/migration.json
> >> > +++ b/qapi/migration.json
> >> > @@ -1231,6 +1231,40 @@
> >> >  ##
> >> >  { 'command': 'xen-colo-do-checkpoint' }
> >> >
> >> > +##
> >> > +# @COLOStatus:
> >> > +#
> >> > +# The result format for 'query-colo-status'.
> >> > +#
> >> > +# @mode: COLO running mode. If COLO is running, this field will
> return
> >> > +#        'primary' or 'secodary'.
> >> > +#
> >> > +# @colo-running: true if COLO is running.
> >> > +#
> >> > +# @reason: describes the reason for the COLO exit.
> >>
> >> What's the value of @reason before a "COLO exit"?
> >>
> >
> > Before a "COLO exit", we just return 'none' in this field.
>
> Please add that to the documentation.
>

OK.


>
> Please excuse my ignorance on COLO...  I'm still not sure I fully
> understand how the three members are related, or even how the COLO state
> machine works and how its related to / embedded in RunState.  I searched
> docs/ for a state diagram, but couldn't find one.
>
> According to runstate_transitions_def[], the part of the RunState state
> machine that's directly connected to state "colo" looks like this:
>
>     inmigrate  -+
>                 |
>     paused  ----+
>                 |
>     migrate  ---+->  colo  <------>  running
>                 |
>     suspended  -+
>                 |
>     watchdog  --+
>
> For each of the seven state transitions: how is the state transition
> triggered (e.g. by QMP command, spontaneously when a certain condition
> is detected, ...), and what events (if any) are emitted then?
>
>
Once you start COLO, the VM always runs in "MIGRATION_STATUS_COLO" until a
failover occurs.
In terms of the flow diagram, you can think of COLO as always running in
the migrate state, because once we are in COLO mode the COLO code controls
the VM state itself. For example:
When we start COLO, it does the first migration as a normal live
migration. After that we enter the COLO process. At that point COLO
considers the primary VM state to be the same as the secondary VM state
(the first checkpoint), so we use vm_start() to start the primary VM
(unlike normal migration) and the secondary VM.
From then on the primary VM and the secondary VM run in parallel, and if
COLO finds that the two VM states are not the same, it triggers a
checkpoint (like another migration). Finally, if some fault occurs, it
triggers a failover, after which the primary VM may return to normal
running mode (the secondary being dead).
So if we only look at the primary VM state, it may have left the RunState
state machine, or it may still be in the migrate state.
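
As a rough illustration of the flow described above, here is a toy Python
model. It is not QEMU code: the step names and the observation labels
('same', 'mismatch', 'fault') are hypothetical and only mirror the
description in this mail.

```python
def colo_lifecycle(observations):
    """Return the ordered steps COLO goes through for the given observations.

    Each observation is one of the hypothetical labels 'same' (both VMs
    agree), 'mismatch' (the VM states differ) or 'fault' (something failed).
    """
    steps = [
        "initial live migration",
        "first checkpoint",
        "start primary and secondary",  # both VMs now run in parallel
    ]
    for obs in observations:
        if obs == "same":
            continue
        if obs == "mismatch":
            steps.append("checkpoint")  # like another migration
        elif obs == "fault":
            steps.append("failover")
            steps.append("primary returns to normal running")
            break  # COLO has exited; later observations are ignored
        else:
            raise ValueError("unknown observation: %r" % (obs,))
    return steps
```

The point of the sketch is that everything after the initial migration
happens inside the COLO code, without a matching RunState transition.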




> How is @colo-running related to the run state?
>

Not related, as I said above.


>
> Which run states are considered to be "before a COLO exit"?  If "before
> a COLO exit" doesn't map to run states, the state machine is too coarse
> to fully describe COLO, and I'd like to see a suitably refined one.
>
>
COLO is just a special case. Is it worth adding a refined state machine for it?
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Any comments?



> If @colo-running is true, then @mode is either "primary" or "secondary".
> What are the possible values when @colo-running is false?
>

The @mode will be "unknown".


Thanks
Zhang Chen



>
> [...]
>

Dr. David Alan Gilbert June 13, 2018, 4:50 p.m. UTC | #8

* Zhang Chen (zhangckid@gmail.com) wrote:
> On Mon, Jun 11, 2018 at 2:48 PM, Markus Armbruster <armbru@redhat.com>
> wrote:
> 
> > Zhang Chen <zhangckid@gmail.com> writes:
> >
> > > On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com>
> > wrote:
> > >
> > >> Zhang Chen <zhangckid@gmail.com> writes:
> > >>
> > >> > Libvirt or other high level software can use this command query colo
> > >> status.
> > >> > You can test this command like that:
> > >> > {'execute':'query-colo-status'}
> > >> >
> > >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> > >> > ---
> > >> >  migration/colo.c    | 39 +++++++++++++++++++++++++++++++++++++++
> > >> >  qapi/migration.json | 34 ++++++++++++++++++++++++++++++++++
> > >> >  2 files changed, 73 insertions(+)
> > >> >
> > >> > diff --git a/migration/colo.c b/migration/colo.c
> > >> > index bedb677788..8c6b8e9a4e 100644
> > >> > --- a/migration/colo.c
> > >> > +++ b/migration/colo.c
> > >> > @@ -29,6 +29,7 @@
> > >> >  #include "net/colo.h"
> > >> >  #include "block/block.h"
> > >> >  #include "qapi/qapi-events-migration.h"
> > >> > +#include "qapi/qmp/qerror.h"
> > >> >
> > >> >  static bool vmstate_loading;
> > >> >  static Notifier packets_compare_notifier;
> > >> > @@ -237,6 +238,44 @@ void qmp_xen_colo_do_checkpoint(Error **errp)
> > >> >  #endif
> > >> >  }
> > >> >
> > >> > +COLOStatus *qmp_query_colo_status(Error **errp)
> > >> > +{
> > >> > +    int state;
> > >> > +    COLOStatus *s = g_new0(COLOStatus, 1);
> > >> > +
> > >> > +    s->mode = get_colo_mode();
> > >> > +
> > >> > +    switch (s->mode) {
> > >> > +    case COLO_MODE_UNKNOWN:
> > >> > +        error_setg(errp, "COLO is disabled");
> > >> > +        state = MIGRATION_STATUS_NONE;
> > >> > +        break;
> > >> > +    case COLO_MODE_PRIMARY:
> > >> > +        state = migrate_get_current()->state;
> > >> > +        break;
> > >> > +    case COLO_MODE_SECONDARY:
> > >> > +        state = migration_incoming_get_current()->state;
> > >> > +        break;
> > >> > +    default:
> > >> > +        abort();
> > >> > +    }
> > >> > +
> > >> > +    s->colo_running = state == MIGRATION_STATUS_COLO;
> > >> > +
> > >> > +    switch (failover_get_state()) {
> > >> > +    case FAILOVER_STATUS_NONE:
> > >> > +        s->reason = COLO_EXIT_REASON_NONE;
> > >> > +        break;
> > >> > +    case FAILOVER_STATUS_REQUIRE:
> > >> > +        s->reason = COLO_EXIT_REASON_REQUEST;
> > >> > +        break;
> > >> > +    default:
> > >> > +        s->reason = COLO_EXIT_REASON_ERROR;
> > >> > +    }
> > >> > +
> > >> > +    return s;
> > >> > +}
> > >> > +
> > >> >  static void colo_send_message(QEMUFile *f, COLOMessage msg,
> > >> >                                Error **errp)
> > >> >  {
> > >> > diff --git a/qapi/migration.json b/qapi/migration.json
> > >> > index 93136ce5a0..356a370949 100644
> > >> > --- a/qapi/migration.json
> > >> > +++ b/qapi/migration.json
> > >> > @@ -1231,6 +1231,40 @@
> > >> >  ##
> > >> >  { 'command': 'xen-colo-do-checkpoint' }
> > >> >
> > >> > +##
> > >> > +# @COLOStatus:
> > >> > +#
> > >> > +# The result format for 'query-colo-status'.
> > >> > +#
> > >> > +# @mode: COLO running mode. If COLO is running, this field will
> > return
> > >> > +#        'primary' or 'secodary'.
> > >> > +#
> > >> > +# @colo-running: true if COLO is running.
> > >> > +#
> > >> > +# @reason: describes the reason for the COLO exit.
> > >>
> > >> What's the value of @reason before a "COLO exit"?
> > >>
> > >
> > > Before a "COLO exit", we just return 'none' in this field.
> >
> > Please add that to the documentation.
> >
> 
> OK.
> 
> 
> >
> > Please excuse my ignorance on COLO...  I'm still not sure I fully
> > understand how the three members are related, or even how the COLO state
> > machine works and how its related to / embedded in RunState.  I searched
> > docs/ for a state diagram, but couldn't find one.
> >
> > According to runstate_transitions_def[], the part of the RunState state
> > machine that's directly connected to state "colo" looks like this:
> >
> >     inmigrate  -+
> >                 |
> >     paused  ----+
> >                 |
> >     migrate  ---+->  colo  <------>  running
> >                 |
> >     suspended  -+
> >                 |
> >     watchdog  --+
> >
> > For each of the seven state transitions: how is the state transition
> > triggered (e.g. by QMP command, spontaneously when a certain condition
> > is detected, ...), and what events (if any) are emitted then?
> >
> >
> When you start COLO, the VM always running in "MIGRATION_STATUS_COLO" still
> occur failover.
> And in the flow diagram, you can think COLO always running in migrate state.
> Because into COLO mode, we will control VM state in COLO code itself, for
> example:
> When we start COLO, it will do the first migration as normal live
> migration, after that we will enter
> the COLO process, at that time COLO think the primary VM state is same with
> secondary VM(the first checkpoint),
> so we will use vm_start() start the primary VM(unlike to normal migration)
> and secondary VM.
> In this time, primary VM and secondary VM will parallel running, and if
> COLO found two VM state are
> not same, it will trigger checkpoint(like another migration). Finally, if
> occurred some fault that will trigger
> failover, after that primary VM maybe return to normal running
> mode(secondary dead).
> So, if we just see the primary VM state, may be it has out of the RunState
> state
> machine or it still in migrate state.
> 
> 
> 
> 
> > How is @colo-running related to the run state?
> >
> 
> Not related, as I say above.

Right; this is a different type of 'running' - it might be better to say
'active' rather than running.

  COLO has a pair of VMs in sync with a constant stream of migrations
between them.
The 'mode' is whether it's the source (primary) or destination (secondary) VM.
(Also sometimes written PVM/SVM)

If COLO fails for some reason (e.g. the
secondary host fails) then I think this is saying the 'colo-running'
would be false.

Some monitoring tool would be watching this to make sure you
really do have a redundant pair of VMs, and if one of them failed
you'd want to know and alert.
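
A monitoring check along those lines might look roughly like the Python
sketch below. It assumes the reply has the three members from this patch
('mode', 'colo-running', 'reason') with the values discussed in this thread;
the function name colo_alert() and the alert wording are made up for the
example, and the policy itself is only one possible choice.

```python
def colo_alert(status):
    """Return None if the pair looks redundant, else a short alert string.

    status is the 'return' dict of query-colo-status as proposed here.
    """
    mode = status["mode"]
    running = status["colo-running"]
    reason = status["reason"]

    if running and mode in ("primary", "secondary"):
        return None  # COLO is active on this side, nothing to report

    if reason == "request":
        return "COLO not active on %s: failover was requested" % mode
    if reason == "error":
        return "COLO not active on %s: exited because of an error" % mode
    return "COLO not active (mode=%s)" % mode
```

A tool would poll this on both hosts and raise an alarm whenever the result
is not None.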

Dave

> > Which run states are considered to be "before a COLO exit"?  If "before
> > a COLO exit" doesn't map to run states, the state machine is too coarse
> > to fully describe COLO, and I'd like to see a suitably refined one.
> >
> >
> COLO just is a special case. It's worthy to refined one?
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Any comments?
> 
> 
> 
> > If @colo-running is true, then @mode is either "primary" or "secondary".
> > What are the possible values when @colo-running is false?
> >
> 
> The @mode will in "unknown" state.
> 
> 
> Thanks
> Zhang Chen
> 
> 
> 
> >
> > [...]
> >
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Markus Armbruster June 14, 2018, 8:42 a.m. UTC | #9

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Zhang Chen (zhangckid@gmail.com) wrote:
>> On Mon, Jun 11, 2018 at 2:48 PM, Markus Armbruster <armbru@redhat.com>
>> wrote:
>> 
>> > Zhang Chen <zhangckid@gmail.com> writes:
>> >
>> > > On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> > >
>> > >> Zhang Chen <zhangckid@gmail.com> writes:
>> > >>
>> > >> > Libvirt or other high level software can use this command query colo status.
>> > >> > You can test this command like that:
>> > >> > {'execute':'query-colo-status'}
>> > >> >
>> > >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
[...]
>> > >> > diff --git a/qapi/migration.json b/qapi/migration.json
>> > >> > index 93136ce5a0..356a370949 100644
>> > >> > --- a/qapi/migration.json
>> > >> > +++ b/qapi/migration.json
>> > >> > @@ -1231,6 +1231,40 @@
>> > >> >  ##
>> > >> >  { 'command': 'xen-colo-do-checkpoint' }
>> > >> >
>> > >> > +##
>> > >> > +# @COLOStatus:
>> > >> > +#
>> > >> > +# The result format for 'query-colo-status'.
>> > >> > +#
>> > >> > +# @mode: COLO running mode. If COLO is running, this field will return
>> > >> > +#        'primary' or 'secodary'.
>> > >> > +#
>> > >> > +# @colo-running: true if COLO is running.
>> > >> > +#
>> > >> > +# @reason: describes the reason for the COLO exit.
>> > >>
>> > >> What's the value of @reason before a "COLO exit"?
>> > >>
>> > >
>> > > Before a "COLO exit", we just return 'none' in this field.
>> >
>> > Please add that to the documentation.
>> >
>> 
>> OK.
>> 
>> 
>> >
>> > Please excuse my ignorance on COLO...  I'm still not sure I fully
>> > understand how the three members are related, or even how the COLO state
>> > machine works and how its related to / embedded in RunState.  I searched
>> > docs/ for a state diagram, but couldn't find one.
>> >
>> > According to runstate_transitions_def[], the part of the RunState state
>> > machine that's directly connected to state "colo" looks like this:
>> >
>> >     inmigrate  -+
>> >                 |
>> >     paused  ----+
>> >                 |
>> >     migrate  ---+->  colo  <------>  running
>> >                 |
>> >     suspended  -+
>> >                 |
>> >     watchdog  --+
>> >
>> > For each of the seven state transitions: how is the state transition
>> > triggered (e.g. by QMP command, spontaneously when a certain condition
>> > is detected, ...), and what events (if any) are emitted then?
>> >
>> >
>> When you start COLO, the VM always running in "MIGRATION_STATUS_COLO" still
>> occur failover.
>> And in the flow diagram, you can think COLO always running in migrate state.
>> Because into COLO mode, we will control VM state in COLO code itself, for
>> example:
>> When we start COLO, it will do the first migration as normal live
>> migration, after that we will enter
>> the COLO process, at that time COLO think the primary VM state is same with
>> secondary VM(the first checkpoint),
>> so we will use vm_start() start the primary VM(unlike to normal migration)
>> and secondary VM.
>> In this time, primary VM and secondary VM will parallel running, and if
>> COLO found two VM state are
>> not same, it will trigger checkpoint(like another migration). Finally, if
>> occurred some fault that will trigger
>> failover, after that primary VM maybe return to normal running
>> mode(secondary dead).
>> So, if we just see the primary VM state, may be it has out of the RunState
>> state
>> machine or it still in migrate state.
>> 
>> 
>> 
>> 
>> > How is @colo-running related to the run state?
>> >
>> 
>> Not related, as I say above.
>
> Right; this is a different type of 'running' - it might be better to say
> 'active' rather than running.

Rename?

>   COLO has a pair of VMs in sync with a constant stream of migrations
> between them.
> The 'mode' is whether it's the source (primary) or destination (secondary) VM.
> (Also sometimes written PVM/SVM)
>
> If COLO fails for some reason (e.g. the
> secondary host fails) then I think this is saying the 'colo-running'
> would be false.
>
> Some monitoring tool would be watching this to make sure you
> really do have a redundent pair of VMs, and if one of them failed
> you'd want to know and alert.

Let me try to explain what I learned in my own words, so you can correct
my misunderstandings.

A VM doing COLO is either the primary or the secondary of a pair.  A
monitoring process watches them.

At some time, it enters MigrationStatus 'colo'.  Peeking at the code, it
looks like it enters it from state 'active', and never leaves it.  This
happens after we successfully created the secondary by migrating the
primary.

Aside: migrate_set_state() appears to do nothing when @old_state doesn't
match @state, yet callers appear to assume it works.  Feels brittle.  Am
I confused?

The monitoring process orchestrates fault tolerance:

* It initially creates the secondary by migrating the primary.  This is
  called the first checkpoint.

* If the primary goes down, the monitor sends x-colo-lost-heartbeat to
  the secondary.  The secondary becomes the primary, and we create a new
  secondary by live-migrating the primary.

* If the secondary goes down or out of sync, we abandon it and send
  x-colo-lost-heartbeat to the primary.  We can then create a new
  secondary by live-migrating the primary.  This is called another
  checkpoint.
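
The failover rule in the last two bullets can be sketched as follows. It encodes only the understanding stated above, which is offered for correction and may be wrong; heartbeat_lost_command is a made-up helper, and only the command name x-colo-lost-heartbeat is taken from the schema.

```python
def heartbeat_lost_command(failed_side: str) -> tuple:
    """Pick the surviving side and build the QMP command to send to it.

    Assumption: the monitoring process already knows which side failed.
    """
    if failed_side not in ("primary", "secondary"):
        raise ValueError(f"unknown side: {failed_side!r}")
    survivor = "secondary" if failed_side == "primary" else "primary"
    return survivor, {"execute": "x-colo-lost-heartbeat"}
```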

x-colo-lost-heartbeat's doc comment:

# Tell qemu that heartbeat is lost, request it to do takeover procedures.
# If this command is sent to the PVM, the Primary side will exit COLO mode.

What does "exiting COLO mode" mean, and how is it reflected in
ColoStatus member mode?  Do we reenter COLO mode eventually?  How?

# If sent to the Secondary, the Secondary side will run failover work,
# then takes over server operation to become the service VM.

Undefined term "service VM".  Do you mean primary VM?

Cases:

(1) This VM isn't doing COLO.  ColoStatus:

    { "mode": "unknown",
      "running": false,
      "reason": "none" }

(2) This VM is a COLO primary

(2a) and it hasn't received x-colo-lost-heartbeat since it last became
     primary.  ColoStatus:

    { "mode": "primary",
      "running": true,          # I guess
      "reason": "none" }

(2b) and it has received x-colo-lost-heartbeat since it last became
     primary

    { "mode": "primary",
      "running": true,          # I guess
      "reason": "request" }

(2c) and it has run into some error condition I don't understand (but
     probably should)

    { "mode": "primary",
      "running": true,          # I guess
      "reason": "error" }

(3) This VM is a COLO secondary

(3a-c) like (2a-c)

If that's correct (and I doubt it), then @running is entirely redundant:
it's false if and only if @mode is "unknown".
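
The redundancy claim can be checked mechanically against the case table. The sketch below copies the guessed values from cases (1) to (3c), including the "running" entries marked "I guess", so it shows only that the table as written makes the member redundant, not that QEMU behaves this way. running_is_redundant is a hypothetical name.

```python
# Case table as guessed in the mail above; not verified against QEMU.
CASES = {
    "1": {"mode": "unknown", "running": False, "reason": "none"},
    "2a": {"mode": "primary", "running": True, "reason": "none"},
    "2b": {"mode": "primary", "running": True, "reason": "request"},
    "2c": {"mode": "primary", "running": True, "reason": "error"},
    "3a": {"mode": "secondary", "running": True, "reason": "none"},
    "3b": {"mode": "secondary", "running": True, "reason": "request"},
    "3c": {"mode": "secondary", "running": True, "reason": "error"},
}


def running_is_redundant(cases: dict) -> bool:
    """True if 'running' is false exactly when 'mode' is "unknown"."""
    return all(s["running"] == (s["mode"] != "unknown") for s in cases.values())
```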

Speaking of mode "unknown": that's a bad name.  "none" would be better.
Or maybe query-colo-status should fail in case (1), to get rid of it at
the interface entirely.

We really, really, really need a state diagram complete with QMP
commands and events.  COLO-FT.txt covers architecture and provides an
example, but it's entirely inadequate at explaining how the QMP commands
and events fit in, and their doc comments don't really help.  I feel
this is the reason why we're at v8 and I'm still groping in the dark,
unable to pass judgement on the proposed QAPI schema changes.

[...]

Dr. David Alan Gilbert June 14, 2018, 9:25 a.m. UTC | #10

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Zhang Chen (zhangckid@gmail.com) wrote:
> >> On Mon, Jun 11, 2018 at 2:48 PM, Markus Armbruster <armbru@redhat.com>
> >> wrote:
> >> 
> >> > Zhang Chen <zhangckid@gmail.com> writes:
> >> >
> >> > > On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com> wrote:
> >> > >
> >> > >> Zhang Chen <zhangckid@gmail.com> writes:
> >> > >>
> >> > >> > Libvirt or other high level software can use this command query colo status.
> >> > >> > You can test this command like that:
> >> > >> > {'execute':'query-colo-status'}
> >> > >> >
> >> > >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> [...]
> >> > >> > diff --git a/qapi/migration.json b/qapi/migration.json
> >> > >> > index 93136ce5a0..356a370949 100644
> >> > >> > --- a/qapi/migration.json
> >> > >> > +++ b/qapi/migration.json
> >> > >> > @@ -1231,6 +1231,40 @@
> >> > >> >  ##
> >> > >> >  { 'command': 'xen-colo-do-checkpoint' }
> >> > >> >
> >> > >> > +##
> >> > >> > +# @COLOStatus:
> >> > >> > +#
> >> > >> > +# The result format for 'query-colo-status'.
> >> > >> > +#
> >> > >> > +# @mode: COLO running mode. If COLO is running, this field will return
> >> > >> > +#        'primary' or 'secodary'.
> >> > >> > +#
> >> > >> > +# @colo-running: true if COLO is running.
> >> > >> > +#
> >> > >> > +# @reason: describes the reason for the COLO exit.
> >> > >>
> >> > >> What's the value of @reason before a "COLO exit"?
> >> > >>
> >> > >
> >> > > Before a "COLO exit", we just return 'none' in this field.
> >> >
> >> > Please add that to the documentation.
> >> >
> >> 
> >> OK.
> >> 
> >> 
> >> >
> >> > Please excuse my ignorance on COLO...  I'm still not sure I fully
> >> > understand how the three members are related, or even how the COLO state
> >> > machine works and how its related to / embedded in RunState.  I searched
> >> > docs/ for a state diagram, but couldn't find one.
> >> >
> >> > According to runstate_transitions_def[], the part of the RunState state
> >> > machine that's directly connected to state "colo" looks like this:
> >> >
> >> >     inmigrate  -+
> >> >                 |
> >> >     paused  ----+
> >> >                 |
> >> >     migrate  ---+->  colo  <------>  running
> >> >                 |
> >> >     suspended  -+
> >> >                 |
> >> >     watchdog  --+
> >> >
> >> > For each of the seven state transitions: how is the state transition
> >> > triggered (e.g. by QMP command, spontaneously when a certain condition
> >> > is detected, ...), and what events (if any) are emitted then?
> >> >
> >> >
> >> When you start COLO, the VM always running in "MIGRATION_STATUS_COLO" still
> >> occur failover.
> >> And in the flow diagram, you can think COLO always running in migrate state.
> >> Because into COLO mode, we will control VM state in COLO code itself, for
> >> example:
> >> When we start COLO, it will do the first migration as normal live
> >> migration, after that we will enter
> >> the COLO process, at that time COLO think the primary VM state is same with
> >> secondary VM(the first checkpoint),
> >> so we will use vm_start() start the primary VM(unlike to normal migration)
> >> and secondary VM.
> >> In this time, primary VM and secondary VM will parallel running, and if
> >> COLO found two VM state are
> >> not same, it will trigger checkpoint(like another migration). Finally, if
> >> occurred some fault that will trigger
> >> failover, after that primary VM maybe return to normal running
> >> mode(secondary dead).
> >> So, if we just see the primary VM state, may be it has out of the RunState
> >> state
> >> machine or it still in migrate state.
> >> 
> >> 
> >> 
> >> 
> >> > How is @colo-running related to the run state?
> >> >
> >> 
> >> Not related, as I say above.
> >
> > Right; this is a different type of 'running' - it might be better to say
> > 'active' rather than running.
> 
> Rename?
> 
> >   COLO has a pair of VMs in sync with a constant stream of migrations
> > between them.
> > The 'mode' is whether it's the source (primary) or destination (secondary) VM.
> > (Also sometimes written PVM/SVM)
> >
> > If COLO fails for some reason (e.g. the
> > secondary host fails) then I think this is saying the 'colo-running'
> > would be false.
> >
> > Some monitoring tool would be watching this to make sure you
> > really do have a redundent pair of VMs, and if one of them failed
> > you'd want to know and alert.
> 
> Let me try to explain what I learned in my own words, so you can correct
> my misunderstandings.
> 
> A VM doing COLO is either the primary or the secondary of a pair.  A
> monitoring process watches them.

Right

> At some time, it enters MigrationStatus 'colo'.  Peeking at the code, it
> looks like it enters it from state 'active', and never leaves it.  This
> happens after we successfully created the secondary by migrating the
> primary.

Yes, I think that's right.

> Aside: migrate_set_state() appears to do nothing when @old_state doesn't
> match @state, yet callers appear to assume it works.  Feels brittle.  Am
> I confused?

It's an atomic-compare-exchange used to set the state; most of the time you only
care about the fact it's atomic and you know the state you expect to be
coming from; normally the cases where this isn't right are failure
paths, but those are explicitly checked by checking error states.
There are some places where we explicitly check the exchanged value but
they're pretty rare, and are normally special cases (e.g. when forcing a
cancel).
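
To make the compare-exchange behaviour concrete, here is a small model of it. This is not the QEMU code, which is C and uses an atomic primitive; the class and method names are invented for illustration, and a lock stands in for the atomic operation. Unlike the function under discussion, the model returns the state it observed, so the rare "check the exchanged value" case is visible.

```python
import threading


class MigrationStateModel:
    """Toy model of a state that is only changed by compare-exchange."""

    def __init__(self, state: str) -> None:
        self._state = state
        self._lock = threading.Lock()  # stand-in for an atomic cmpxchg

    @property
    def state(self) -> str:
        with self._lock:
            return self._state

    def set_state(self, old_state: str, new_state: str) -> str:
        """Switch to new_state only if the current state is old_state.

        Returns the state seen before the call; the caller can compare it
        with old_state to learn whether the transition happened.
        """
        with self._lock:
            seen = self._state
            if seen == old_state:
                self._state = new_state
            return seen
```

With this model, a call whose expected old state does not match leaves the state untouched, which is the "does nothing" behaviour asked about above.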

> The monitoring process orchestrates fault tolerance:
> 
> * It initially creates the secondary by migrating the primary.  This is
>   called the first checkpoint.

Right.

(And the step you haven't mentioned; that we keep sending checkpoints)

> * If the primary goes down, the monitor sends x-colo-lost-heartbeat to
>   the secondary.  The secondary becomes the primary, and we create a new
>   secondary by live-migrating the primary.

I don't think there are mechanisms yet for resyncing to bring a failed
pair back into a new pair - so you survive one failure at the moment.
(I might be wrong, that was the case previously)

> * If the secondary goes down or out of sync, we abandon it and send
>   x-colo-lost-heartbeat to the primary.  We can then create a new
>   secondary by live-migrating the primary.  This is called another
>   checkpoint.

Yes

> x-colo-lost-heartbeat's doc comment:
> 
> # Tell qemu that heartbeat is lost, request it to do takeover procedures.
> # If this command is sent to the PVM, the Primary side will exit COLO mode.
> 
> What does "exiting COLO mode" mean

The VM is running unprotected - there's no migration/checkpointing.  At
that point it's pretty much just a normal VM.

> and how is it reflected in
> ColoStatus member mode?  Do we reenter COLO mode eventually?  How?

I'm not sure of the status in that case (I'll leave that to Zhang Chen)
but at that point it's just a normal VM; so I think we go through the
startup-like path of having to do that first migration again.

> # If sent to the Secondary, the Secondary side will run failover work,
> # then takes over server operation to become the service VM.
> 
> Undefined term "service VM".  Do you mean primary VM?

I think that means the VM that's actually running the workload; at that
point there is no primary/secondary any more because COLO isn't
synchronising.

> Cases:
> 
> (1) This VM isn't doing COLO.  ColoStatus:
> 
>     { "mode": "unknown",
>       "running": false,
>       "reason": "none" }
> 
> (2) This VM is a COLO primary
> 
> (2a) and it hasn't received x-colo-lost-heartbeat since it last became
>      primary.  ColoStatus:
> 
>     { "mode": "primary",
>       "running": true,          # I guess
>       "reason": "none" }
> 
> (2b) and it has received x-colo-lost-heartbeat since it last became
>      primary
> 
>     { "mode": "primary",
>       "running": true,          # I guess
>       "reason": "request" }
> 
> (2c) and it has run into some error condition I don't understand (but
>      probably should)
> 
>     { "mode": "primary",
>       "running": true,          # I guess
>       "reason": "error" }
> 
> (3) This VM is a COLO secondary
> 
> (3a-c) like (2a-c)
> 
> If that's correct (and I doubt it), then @running is entirely redundant:
> it's false if and only if @mode is "unknown".

That's probably true; both fields do derive from the migration state;
I think the mode is primary if your outgoing migration state is COLO,
it's secondary if your incoming state is COLO, and unknown if neither
state is COLO.  And 'running' is the OR of those.
Note that there's one other piece of state, the 'colo' migration
capability (that is displayed in the normal capabilities stuff).

So for example, if you're in the process of starting COLO up,
your colo capability is set, your migration mode is still normal
migration setup/active/complete - so these would still show
'unknown/false/none' which probably could be better.
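
The derivation described above can be written down as a short sketch. It follows the "I think" description in this mail rather than the actual QEMU source; derive_colo_status and its string-valued states are assumptions made for illustration.

```python
def derive_colo_status(outgoing_state: str, incoming_state: str) -> dict:
    """Derive 'mode' and 'running' from the two migration states."""
    is_primary = outgoing_state == "colo"
    is_secondary = incoming_state == "colo"
    if is_primary:
        mode = "primary"
    elif is_secondary:
        mode = "secondary"
    else:
        mode = "unknown"
    # 'running' is the OR of the two checks
    return {"mode": mode, "running": is_primary or is_secondary}
```

Under these assumptions, a VM that is still in the initial migration (outgoing state "active") yields mode "unknown" and running false, which is the start-up case mentioned above.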

> Speaking of mode "unknown": that's a bad name.  "none" would be better.
> Or maybe query-colo-status should fail in case (1), to get rid of it at
> the interface entirely.
> 
> We really, really, really need a state diagram complete with QMP
> commands and events.  COLO-FT.txt covers architecture and provides an
> example, but it's entirely inadequate at explaining how the QMP commands
> and events fit in, and their doc comments don't really help.  I feel
> this is the reason why we're at v8 and I'm still groping in the dark,
> unable to pass judgement on the proposed QAPI schema changes.

COLO is a big series that touches lots of bits of QEMU (and has bounced
through the hands of a few people); most of the iterations haven't been
that much about the interface.

Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Zhang Chen June 19, 2018, 4 a.m. UTC | #11

On Thu, Jun 14, 2018 at 5:25 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:

> * Markus Armbruster (armbru@redhat.com) wrote:
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >
> > > * Zhang Chen (zhangckid@gmail.com) wrote:
> > >> On Mon, Jun 11, 2018 at 2:48 PM, Markus Armbruster <armbru@redhat.com> wrote:
> > >>
> > >> > Zhang Chen <zhangckid@gmail.com> writes:
> > >> >
> > >> > > On Thu, Jun 7, 2018 at 8:59 PM, Markus Armbruster <armbru@redhat.com> wrote:
> > >> > >
> > >> > >> Zhang Chen <zhangckid@gmail.com> writes:
> > >> > >>
> > >> > >> > Libvirt or other high level software can use this command query colo status.
> > >> > >> > You can test this command like that:
> > >> > >> > {'execute':'query-colo-status'}
> > >> > >> >
> > >> > >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> > [...]
> > >> > >> > diff --git a/qapi/migration.json b/qapi/migration.json
> > >> > >> > index 93136ce5a0..356a370949 100644
> > >> > >> > --- a/qapi/migration.json
> > >> > >> > +++ b/qapi/migration.json
> > >> > >> > @@ -1231,6 +1231,40 @@
> > >> > >> >  ##
> > >> > >> >  { 'command': 'xen-colo-do-checkpoint' }
> > >> > >> >
> > >> > >> > +##
> > >> > >> > +# @COLOStatus:
> > >> > >> > +#
> > >> > >> > +# The result format for 'query-colo-status'.
> > >> > >> > +#
> > >> > >> > +# @mode: COLO running mode. If COLO is running, this field will return
> > >> > >> > +#        'primary' or 'secodary'.
> > >> > >> > +#
> > >> > >> > +# @colo-running: true if COLO is running.
> > >> > >> > +#
> > >> > >> > +# @reason: describes the reason for the COLO exit.
> > >> > >>
> > >> > >> What's the value of @reason before a "COLO exit"?
> > >> > >>
> > >> > >
> > >> > > Before a "COLO exit", we just return 'none' in this field.
> > >> >
> > >> > Please add that to the documentation.
> > >> >
> > >>
> > >> OK.
> > >>
> > >>
> > >> >
> > >> > Please excuse my ignorance on COLO...  I'm still not sure I fully
> > >> > understand how the three members are related, or even how the COLO state
> > >> > machine works and how its related to / embedded in RunState.  I searched
> > >> > docs/ for a state diagram, but couldn't find one.
> > >> >
> > >> > According to runstate_transitions_def[], the part of the RunState state
> > >> > machine that's directly connected to state "colo" looks like this:
> > >> >
> > >> >     inmigrate  -+
> > >> >                 |
> > >> >     paused  ----+
> > >> >                 |
> > >> >     migrate  ---+->  colo  <------>  running
> > >> >                 |
> > >> >     suspended  -+
> > >> >                 |
> > >> >     watchdog  --+
> > >> >
> > >> > For each of the seven state transitions: how is the state transition
> > >> > triggered (e.g. by QMP command, spontaneously when a certain condition
> > >> > is detected, ...), and what events (if any) are emitted then?
> > >> >
> > >> >
> > >> When you start COLO, the VM always running in "MIGRATION_STATUS_COLO" still
> > >> occur failover.
> > >> And in the flow diagram, you can think COLO always running in migrate state.
> > >> Because into COLO mode, we will control VM state in COLO code itself, for
> > >> example:
> > >> When we start COLO, it will do the first migration as normal live
> > >> migration, after that we will enter
> > >> the COLO process, at that time COLO think the primary VM state is same with
> > >> secondary VM(the first checkpoint),
> > >> so we will use vm_start() start the primary VM(unlike to normal migration)
> > >> and secondary VM.
> > >> In this time, primary VM and secondary VM will parallel running, and if
> > >> COLO found two VM state are
> > >> not same, it will trigger checkpoint(like another migration). Finally, if
> > >> occurred some fault that will trigger
> > >> failover, after that primary VM maybe return to normal running
> > >> mode(secondary dead).
> > >> So, if we just see the primary VM state, may be it has out of the RunState
> > >> state
> > >> machine or it still in migrate state.
> > >>
> > >>
> > >>
> > >>
> > >> > How is @colo-running related to the run state?
> > >> >
> > >>
> > >> Not related, as I say above.
> > >
> > > > Right; this is a different type of 'running' - it might be better to say
> > > 'active' rather than running.
> >
> > Rename?
>

OK, I will rename it in the next version.



> >
> > >   COLO has a pair of VMs in sync with a constant stream of migrations
> > > between them.
> > > > The 'mode' is whether it's the source (primary) or destination (secondary) VM.
> > > (Also sometimes written PVM/SVM)
> > >
> > > If COLO fails for some reason (e.g. the
> > > secondary host fails) then I think this is saying the 'colo-running'
> > > would be false.
> > >
> > > Some monitoring tool would be watching this to make sure you
> > > really do have a redundent pair of VMs, and if one of them failed
> > > you'd want to know and alert.
> >
> > Let me try to explain what I learned in my own words, so you can correct
> > my misunderstandings.
> >
> > A VM doing COLO is either the primary or the secondary of a pair.  A
> > monitoring process watches them.
>
> Right
>
> > At some time, it enters MigrationStatus 'colo'.  Peeking at the code, it
> > looks like it enters it from state 'active', and never leaves it.  This
> > happens after we successfully created the secondary by migrating the
> > primary.
>
> Yes, I think that's right.
>
> > Aside: migrate_set_state() appears to do nothing when @old_state doesn't
> > match @state, yet callers appear to assume it works.  Feels brittle.  Am
> > I confused?
>
> It's an atomic-compare-exchange used to set the state; most of the time you only
> care about the fact it's atomic and you know the state you expect to be
> coming from; normally the cases where this isn't right are failure
> paths, but those are explicitly checked by checking error states.
> There are some places where we explicitly check the exchanged value but
> they're pretty rare, and are normally special cases (e.g. when forcing a
> cancel).
>
> > The monitoring process orchestrates fault tolerance:
> >
> > * It initially creates the secondary by migrating the primary.  This is
> >   called the first checkpoint.
>
> Right.
>
> (And the step you haven't mentioned; that we keep sending checkpoints)
>
> > * If the primary goes down, the monitor sends x-colo-lost-heartbeat to
> >   the secondary.  The secondary becomes the primary, and we create a new
> >   secondary by live-migrating the primary.
>
> I don't think there's mechanisms yet for resyncing to bring a failed
> pair back into a new pair - so you survive one failure at the moment.
> (I might be wrong, that was the case previously)
>


Dave is right. On the QEMU side, we just provide some QMP commands to
upper-layer software such as OpenStack or libvirt. Users can implement
their own policy at that higher level and then call the QMP commands to
use COLO.



>
> > * If the secondary goes down or out of sync, we abandon it and send
> >   x-colo-lost-heartbeat to the primary.  We can then create a new
> >   secondary by live-migrating the primary.  This is called another
> >   checkpoint.
>
> Yes
>
> > x-colo-lost-heartbeat's doc comment:
> >
> > # Tell qemu that heartbeat is lost, request it to do takeover procedures.
> > # If this command is sent to the PVM, the Primary side will exit COLO mode.
> >
> > What does "exiting COLO mode" mean
>
> The VM is running unprotected - there's no migration/checkpointing.  At
> that point it's pretty much just a normal VM.
>
> > and how is it reflected in
> > ColoStatus member mode?  Do we reenter COLO mode eventually?  How?
>
> I'm not sure of the status in that case (I'll leave that to Zhang Chen)
> but at that point it's just a normal VM; so I think we go through the
> startup-like path of having to do that first migration again.
>
> > # If sent to the Secondary, the Secondary side will run failover work,
> > # then takes over server operation to become the service VM.
> >
> > Undefined term "service VM".  Do you mean primary VM?
>
> I think that means the VM that's actually running the workload; at that
> point there is no primary/secondary any more because COLO isn't
> synchronising.
>
> > Cases:
> >
> > (1) This VM isn't doing COLO.  ColoStatus:
> >
> >     { "mode": "unknown",
> >       "running": false,
> >       "reason": "none" }
> >
> > (2) This VM is a COLO primary
> >
> > (2a) and it hasn't received x-colo-lost-heartbeat since it last became
> >      primary.  ColoStatus:
> >
> >     { "mode": "primary",
> >       "running": true,          # I guess
> >       "reason": "none" }
> >
> > (2b) and it has received x-colo-lost-heartbeat since it last became
> >      primary
> >
> >     { "mode": "primary",
> >       "running": true,          # I guess
> >       "reason": "request" }
> >
> > (2c) and it has run into some error condition I don't understand (but
> >      probably should)
> >
> >     { "mode": "primary",
> >       "running": true,          # I guess
> >       "reason": "error" }
> >
> > (3) This VM is a COLO secondary
> >
> > (3a-c) like (2a-c)
> >
> > If that's correct (and I doubt it), then @running is entirely redundant:
> > it's false if and only if @mode is "unknown".
>
> That's probably true; both fields do derive from the migration state;
> I think the mode is primary if you're outgoing migration state is COLO,
> it's secondary if you're incoming state is COLO, and unknown if neither
> state is COLO.  And 'running' is the OR of those.
> Note that there's one other piece of state, the 'colo' migration
> capability (that is displayed in the normal capabilities stuff).
>
> So for example, if you're in the process of starting COLO up,
> your colo capability is set, your migration mode is still normal
> migration setup/active/complete - so these would still show
> 'unknown/false/none' which probably could be better.
>
> > Speaking of mode "unknown": that's a bad name.  "none" would be better.
> > Or maybe query-colo-status should fail in case (1), to get rid of it at
> > the interface entirely.
>

Currently in case (1):

{'execute':'query-colo-status'}{"return": {}}

{"error": {"class": "GenericError", "desc": "COLO is disabled"}}
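
A client could handle this error form roughly as sketched below. The "return"/"error" reply keys are the usual QMP reply layout; colo_status_from_reply is a hypothetical helper, and treating every error as "no COLO status available" is a simplifying assumption (the sketch deliberately does not match on the human-readable "desc" text).

```python
from typing import Optional


def colo_status_from_reply(reply: dict) -> Optional[dict]:
    """Return the status payload, or None if QEMU answered with an error."""
    if "error" in reply:
        # e.g. {"class": "GenericError", "desc": "COLO is disabled"}
        return None
    return reply["return"]
```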



> >
> > We really, really, really need a state diagram complete with QMP
> > commands and events.  COLO-FT.txt covers architecture and provides an
> > example, but it's entirely inadequate at explaining how the QMP commands
> > and events fit in, and their doc comments don't really help.  I feel
> > this is the reason why we're at v8 and I'm still groping in the dark,
> > unable to pass judgement on the proposed QAPI schema changes.
>
>
OK, got it. I will try to add a new patch that puts a state diagram,
complete with QMP commands and events, into COLO-FT.txt.



> COLO is a big series that touches lots of bits of QEMU (and has bounced
> through the hands of a few people); most of the iterations haven't been
> that much about the interface.
>


Yes, I will add the state diagram as Markus said.

Thanks
Zhang Chen


>
> Dave
>
> > [...]
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
