[COLO-Frame,v12,25/38] qmp event: Add event notification for COLO error

Message ID 1450167779-9960-26-git-send-email-zhang.zhanghailiang@huawei.com
State New

Commit Message

Zhanghailiang Dec. 15, 2015, 8:22 a.m. UTC
If some errors happen during VM's COLO FT stage, it's important to notify the users
of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
failover work immediately.
If users don't want to get involved in COLO's failover verdict,
it is still necessary to notify users that we exited COLO mode.

Cc: Markus Armbruster <armbru@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
v11:
- Fix several typos found by Eric

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
 docs/qmp-events.txt | 17 +++++++++++++++++
 migration/colo.c    | 11 +++++++++++
 qapi-schema.json    | 16 ++++++++++++++++
 qapi/event.json     | 17 +++++++++++++++++
 4 files changed, 61 insertions(+)

Comments

Eric Blake Dec. 18, 2015, 4:03 p.m. UTC | #1
On 12/15/2015 01:22 AM, zhanghailiang wrote:
> If some errors happen during VM's COLO FT stage, it's important to notify the users
> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
> failover work immediately.
> If users don't want to get involved in COLO's failover verdict,
> it is still necessary to notify users that we exited COLO mode.
> 
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
> v11:
> - Fix several typos found by Eric
> 
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> ---

> +++ b/docs/qmp-events.txt
> @@ -184,6 +184,23 @@ Example:
>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>  event.
>  
> +COLO_EXIT
> +---------
> +
> +Emitted when VM finishes COLO mode due to some errors happening or
> +at the request of users.
> +
> +Data:
> +
> + - "mode": COLO mode, primary or secondary side (json-string)
> + - "reason":  the exit reason, internal error or external request. (json-string)
> + - "error": error message (json-string, operation)

s/operation/optional/
May want to word it as:

- "error": error message for human consumption (json-string, optional)

to point out that machines shouldn't parse it.

> +++ b/migration/colo.c
> @@ -18,6 +18,7 @@
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
>  #include "migration/failover.h"
> +#include "qapi-event.h"
>  
>  /* colo buffer */
>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>  out:
>      if (ret < 0) {
>          error_report("%s: %s", __func__, strerror(-ret));

Unrelated: I mentioned in another thread that we may want to start
thinking about adding error_report_errno(); this would be another client.

> +++ b/qapi-schema.json
> @@ -778,6 +778,22 @@
>    'data': [ 'unknown', 'primary', 'secondary'] }
>  
>  ##
> +# @COLOExitReason
> +#
> +# The reason for a COLO exit
> +#
> +# @unknown: unknown reason
> +#

If we never return 'unknown', then it is not worth having it in the enum
(we can always add it later if we find a reason to have it; but adding
it now feels premature if the code base isn't using it).

Otherwise looks okay to me.
Markus Armbruster Dec. 19, 2015, 10:02 a.m. UTC | #2
Copying qemu-block because this seems related to generalising block jobs
to background jobs.

zhanghailiang <zhang.zhanghailiang@huawei.com> writes:

> If some errors happen during VM's COLO FT stage, it's important to notify the users
> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
> failover work immediately.
> If users don't want to get involved in COLO's failover verdict,
> it is still necessary to notify users that we exited COLO mode.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
> v11:
> - Fix several typos found by Eric
>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> ---
>  docs/qmp-events.txt | 17 +++++++++++++++++
>  migration/colo.c    | 11 +++++++++++
>  qapi-schema.json    | 16 ++++++++++++++++
>  qapi/event.json     | 17 +++++++++++++++++
>  4 files changed, 61 insertions(+)
>
> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
> index d2f1ce4..19f68fc 100644
> --- a/docs/qmp-events.txt
> +++ b/docs/qmp-events.txt
> @@ -184,6 +184,23 @@ Example:
>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>  event.
>  
> +COLO_EXIT
> +---------
> +
> +Emitted when VM finishes COLO mode due to some errors happening or
> +at the request of users.

How would the event's recipient distinguish between "due to error" and
"at the user's request"?

> +
> +Data:
> +
> + - "mode": COLO mode, primary or secondary side (json-string)
> + - "reason":  the exit reason, internal error or external request. (json-string)
> + - "error": error message (json-string, operation)
> +
> +Example:
> +
> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> +

Pardon my ignorance again...  Does "VM finishes COLO mode" mean we have
some kind of COLO background job, and it just finished for whatever
reason?

If yes, this COLO job could be an instance of the general background job
concept we're trying to grow from the existing block job concept.

I'm not asking you to rebase your work onto the background job
infrastructure, not least for the simple reason that it doesn't exist,
yet.  But I think it would be fruitful to compare your COLO job
management QMP interface with the one we have for block jobs.  Not only
may that avoid unnecessary inconsistency, it could also help shape the
general background job interface.

Quick overview of the block job QMP interface:

* Commands to create a job: block-commit, block-stream, drive-mirror,
  drive-backup.

* Get information on jobs: query-block-jobs

* Pause a job: block-job-pause

* Resume a job: block-job-resume

* Cancel a job: block-job-cancel

* Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED

* Block job error event: BLOCK_JOB_ERROR

* Block job synchronous completion: event BLOCK_JOB_READY and command
  block-job-complete

>  DEVICE_DELETED
>  --------------
>  
> diff --git a/migration/colo.c b/migration/colo.c
> index d1dd4e1..d06c14f 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -18,6 +18,7 @@
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
>  #include "migration/failover.h"
> +#include "qapi-event.h"
>  
>  /* colo buffer */
>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>  out:
>      if (ret < 0) {
>          error_report("%s: %s", __func__, strerror(-ret));
> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
> +                                  true, strerror(-ret), NULL);
> +    } else {
> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
> +                                  false, NULL, NULL);
>      }
>  
>      qsb_free(buffer);
> @@ -516,6 +522,11 @@ out:
>      if (ret < 0) {
>          error_report("colo incoming thread will exit, detect error: %s",
>                       strerror(-ret));
> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
> +                                  true, strerror(-ret), NULL);
> +    } else {
> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
> +                                  false, NULL, NULL);
>      }
>  
>      if (fb) {
> diff --git a/qapi-schema.json b/qapi-schema.json
> index feb7d53..f6ecb88 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -778,6 +778,22 @@
>    'data': [ 'unknown', 'primary', 'secondary'] }
>  
>  ##
> +# @COLOExitReason
> +#
> +# The reason for a COLO exit
> +#
> +# @unknown: unknown reason

How can @unknown happen?

> +#
> +# @request: COLO exit is due to an external request
> +#
> +# @error: COLO exit is due to an internal error
> +#
> +# Since: 2.6
> +##
> +{ 'enum': 'COLOExitReason',
> +  'data': [ 'unknown', 'request', 'error'] }
> +
> +##
>  # @x-colo-lost-heartbeat
>  #
>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
> diff --git a/qapi/event.json b/qapi/event.json
> index f0cef01..f63d456 100644
> --- a/qapi/event.json
> +++ b/qapi/event.json
> @@ -255,6 +255,23 @@
>    'data': {'status': 'MigrationStatus'}}
>  
>  ##
> +# @COLO_EXIT
> +#
> +# Emitted when VM finishes COLO mode due to some errors happening or
> +# at the request of users.
> +#
> +# @mode: which COLO mode the VM was in when it exited.

Can we get 'unknown' here?

> +#
> +# @reason: describes the reason for the COLO exit.

Can we get 'unknown' here?

> +#
> +# @error: #optional, error message. Only present on error happening.
> +#
> +# Since: 2.6
> +##
> +{ 'event': 'COLO_EXIT',
> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
> +
> +##
>  # @ACPI_DEVICE_OST
>  #
>  # Emitted when guest executes ACPI _OST method.
John Snow Dec. 21, 2015, 9:14 p.m. UTC | #3
On 12/19/2015 05:02 AM, Markus Armbruster wrote:
> Copying qemu-block because this seems related to generalising block jobs
> to background jobs.
> 
> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
> 
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> ---
>>  docs/qmp-events.txt | 17 +++++++++++++++++
>>  migration/colo.c    | 11 +++++++++++
>>  qapi-schema.json    | 16 ++++++++++++++++
>>  qapi/event.json     | 17 +++++++++++++++++
>>  4 files changed, 61 insertions(+)
>>
>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>> index d2f1ce4..19f68fc 100644
>> --- a/docs/qmp-events.txt
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>  event.
>>  
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
> 
> How would the event's recipient distinguish between "due to error" and
> "at the user's request"?
> 
>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, operation)
>> +
>> +Example:
>> +
>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> +
> 
> Pardon my ignorance again...  Does "VM finishes COLO mode" means have
> some kind of COLO background job, and it just finished for whatever
> reason?
> 
> If yes, this COLO job could be an instance of the general background job
> concept we're trying to grow from the existing block job concept.
> 
> I'm not asking you to rebase your work onto the background job
> infrastructure, not least for the simple reason that it doesn't exist,
> yet.  But I think it would be fruitful to compare your COLO job
> management QMP interface with the one we have for block jobs.  Not only
> may that avoid unnecessary inconsistency, it could also help shape the
> general background job interface.
> 

Yes. The "background job" concept doesn't exist in a formal way outside
of the block layer yet, but we're looking to expand it as we re-tool the
block jobs themselves.

It may be the case that the COLO commands and events need to go in as
they are now, but later we can bring them back into the generalized job
infrastructure.

> Quick overview of the block job QMP interface:
> 
> * Commands to create a job: block-commit, block-stream, drive-mirror,
>   drive-backup.
> 
> * Get information on jobs: query-block-jobs
> 
> * Pause a job: block-job-pause
> 
> * Resume a job: block-job-resume
> 
> * Cancel a job: block-job-cancel
> 
> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
> 
> * Block job error event: BLOCK_JOB_ERROR
> 
> * Block job synchronous completion: event BLOCK_JOB_READY and command
>   block-job-complete
> 

The block-agnostic version of these commands would likely be:

query-jobs
job-pause
job-resume
job-cancel
job-complete

Events: JOB_COMPLETED, JOB_CANCELLED, JOB_ERROR, JOB_READY.


It looks like COLO_EXIT would be an instance of JOB_COMPLETED, and if it
occurred due to an error, we'd also see JOB_ERROR emitted.
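
The mapping above can be sketched from a client's point of view. This is only
an illustration in Python: JOB_COMPLETED and JOB_ERROR are the proposed names
from the list above and do not exist yet, the helper name generic_job_events is
hypothetical, and emitting JOB_ERROR before JOB_COMPLETED is an assumption.

```python
def generic_job_events(colo_exit_data):
    """Translate COLO_EXIT data into the proposed generic job event names.

    Hypothetical helper: the generic job events are only a proposal, and the
    ordering (error first, then completion) is an assumption.
    """
    events = []
    if colo_exit_data["reason"] == "error":
        # An exit caused by an error would additionally surface as JOB_ERROR.
        events.append("JOB_ERROR")
    # Every COLO exit would count as the job having completed.
    events.append("JOB_COMPLETED")
    return events


print(generic_job_events({"mode": "primary", "reason": "request"}))
print(generic_job_events({"mode": "secondary", "reason": "error",
                          "error": "Input/output error"}))
```

A COLO exit at the user's request yields only the completion event, while an
exit due to an error yields both.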

>>  DEVICE_DELETED
>>  --------------
>>  
>> diff --git a/migration/colo.c b/migration/colo.c
>> index d1dd4e1..d06c14f 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>  #include "qemu/error-report.h"
>>  #include "qemu/sockets.h"
>>  #include "migration/failover.h"
>> +#include "qapi-event.h"
>>  
>>  /* colo buffer */
>>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>  out:
>>      if (ret < 0) {
>>          error_report("%s: %s", __func__, strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>      }
>>  
>>      qsb_free(buffer);
>> @@ -516,6 +522,11 @@ out:
>>      if (ret < 0) {
>>          error_report("colo incoming thread will exit, detect error: %s",
>>                       strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>      }
>>  
>>      if (fb) {
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index feb7d53..f6ecb88 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -778,6 +778,22 @@
>>    'data': [ 'unknown', 'primary', 'secondary'] }
>>  
>>  ##
>> +# @COLOExitReason
>> +#
>> +# The reason for a COLO exit
>> +#
>> +# @unknown: unknown reason
> 
> How can @unknown happen?
> 
>> +#
>> +# @request: COLO exit is due to an external request
>> +#
>> +# @error: COLO exit is due to an internal error
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum': 'COLOExitReason',
>> +  'data': [ 'unknown', 'request', 'error'] }
>> +
>> +##
>>  # @x-colo-lost-heartbeat
>>  #
>>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>> diff --git a/qapi/event.json b/qapi/event.json
>> index f0cef01..f63d456 100644
>> --- a/qapi/event.json
>> +++ b/qapi/event.json
>> @@ -255,6 +255,23 @@
>>    'data': {'status': 'MigrationStatus'}}
>>  
>>  ##
>> +# @COLO_EXIT
>> +#
>> +# Emitted when VM finishes COLO mode due to some errors happening or
>> +# at the request of users.
>> +#
>> +# @mode: which COLO mode the VM was in when it exited.
> 
> Can we get 'unknown' here?
> 
>> +#
>> +# @reason: describes the reason for the COLO exit.
> 
> Can we get 'unknown' here?
> 
>> +#
>> +# @error: #optional, error message. Only present on error happening.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'event': 'COLO_EXIT',
>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>> +
>> +##
>>  # @ACPI_DEVICE_OST
>>  #
>>  # Emitted when guest executes ACPI _OST method.
>
Wen Congyang Dec. 23, 2015, 1:24 a.m. UTC | #4
On 12/19/2015 06:02 PM, Markus Armbruster wrote:
> Copying qemu-block because this seems related to generalising block jobs
> to background jobs.
> 
> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
> 
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> ---
>>  docs/qmp-events.txt | 17 +++++++++++++++++
>>  migration/colo.c    | 11 +++++++++++
>>  qapi-schema.json    | 16 ++++++++++++++++
>>  qapi/event.json     | 17 +++++++++++++++++
>>  4 files changed, 61 insertions(+)
>>
>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>> index d2f1ce4..19f68fc 100644
>> --- a/docs/qmp-events.txt
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>  event.
>>  
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
> 
> How would the event's recipient distinguish between "due to error" and
> "at the user's request"?
> 
>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, operation)
>> +
>> +Example:
>> +
>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> +
> 
> Pardon my ignorance again...  Does "VM finishes COLO mode" means have
> some kind of COLO background job, and it just finished for whatever
> reason?
> 
> If yes, this COLO job could be an instance of the general background job
> concept we're trying to grow from the existing block job concept.
> 
> I'm not asking you to rebase your work onto the background job
> infrastructure, not least for the simple reason that it doesn't exist,
> yet.  But I think it would be fruitful to compare your COLO job
> management QMP interface with the one we have for block jobs.  Not only
> may that avoid unnecessary inconsistency, it could also help shape the
> general background job interface.

COLO is not a block job. If live migration is a background job, COLO
is also a background job.

> 
> Quick overview of the block job QMP interface:
> 
> * Commands to create a job: block-commit, block-stream, drive-mirror,
>   drive-backup.
> 
> * Get information on jobs: query-block-jobs
> 
> * Pause a job: block-job-pause
> 
> * Resume a job: block-job-resume
> 
> * Cancel a job: block-job-cancel
> 
> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
> 
> * Block job error event: BLOCK_JOB_ERROR
> 
> * Block job synchronous completion: event BLOCK_JOB_READY and command
>   block-job-complete

What is the background job infrastructure? Do you mean implementing all of
the above interfaces for each background job?

Thanks
Wen Congyang

> 
>>  DEVICE_DELETED
>>  --------------
>>  
>> diff --git a/migration/colo.c b/migration/colo.c
>> index d1dd4e1..d06c14f 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>  #include "qemu/error-report.h"
>>  #include "qemu/sockets.h"
>>  #include "migration/failover.h"
>> +#include "qapi-event.h"
>>  
>>  /* colo buffer */
>>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>  out:
>>      if (ret < 0) {
>>          error_report("%s: %s", __func__, strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>      }
>>  
>>      qsb_free(buffer);
>> @@ -516,6 +522,11 @@ out:
>>      if (ret < 0) {
>>          error_report("colo incoming thread will exit, detect error: %s",
>>                       strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>      }
>>  
>>      if (fb) {
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index feb7d53..f6ecb88 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -778,6 +778,22 @@
>>    'data': [ 'unknown', 'primary', 'secondary'] }
>>  
>>  ##
>> +# @COLOExitReason
>> +#
>> +# The reason for a COLO exit
>> +#
>> +# @unknown: unknown reason
> 
> How can @unknown happen?
> 
>> +#
>> +# @request: COLO exit is due to an external request
>> +#
>> +# @error: COLO exit is due to an internal error
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum': 'COLOExitReason',
>> +  'data': [ 'unknown', 'request', 'error'] }
>> +
>> +##
>>  # @x-colo-lost-heartbeat
>>  #
>>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>> diff --git a/qapi/event.json b/qapi/event.json
>> index f0cef01..f63d456 100644
>> --- a/qapi/event.json
>> +++ b/qapi/event.json
>> @@ -255,6 +255,23 @@
>>    'data': {'status': 'MigrationStatus'}}
>>  
>>  ##
>> +# @COLO_EXIT
>> +#
>> +# Emitted when VM finishes COLO mode due to some errors happening or
>> +# at the request of users.
>> +#
>> +# @mode: which COLO mode the VM was in when it exited.
> 
> Can we get 'unknown' here?
> 
>> +#
>> +# @reason: describes the reason for the COLO exit.
> 
> Can we get 'unknown' here?
> 
>> +#
>> +# @error: #optional, error message. Only present on error happening.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'event': 'COLO_EXIT',
>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>> +
>> +##
>>  # @ACPI_DEVICE_OST
>>  #
>>  # Emitted when guest executes ACPI _OST method.
Zhanghailiang Dec. 23, 2015, 1:55 a.m. UTC | #5
On 2015/12/19 0:03, Eric Blake wrote:
> On 12/15/2015 01:22 AM, zhanghailiang wrote:
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> ---
>
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>   event.
>>
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, operation)
>
> s/operation/optional/
> May want to word it as:
>
> - "error": error message for human consumption (json-string, optional)
>
> to point out that machines shouldn't parse it.
>

Good idea, I will fix it like that.

>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>   #include "qemu/error-report.h"
>>   #include "qemu/sockets.h"
>>   #include "migration/failover.h"
>> +#include "qapi-event.h"
>>
>>   /* colo buffer */
>>   #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>   out:
>>       if (ret < 0) {
>>           error_report("%s: %s", __func__, strerror(-ret));
>
> Unrelated: I mentioned in another thread that we may want to start
> thinking about adding error_report_errno(); this would be another client.
>

Hmm, yes, we may need such a helper function.

>> +++ b/qapi-schema.json
>> @@ -778,6 +778,22 @@
>>     'data': [ 'unknown', 'primary', 'secondary'] }
>>
>>   ##
>> +# @COLOExitReason
>> +#
>> +# The reason for a COLO exit
>> +#
>> +# @unknown: unknown reason
>> +#
>
> If we never return 'unknown', then it is not worth having it in the enum
> (we can always add it later if we find a reason to have it; but adding
> it now feels premature if the code base isn't using it).
>

You are right, it should never happen. I will remove it in the next version, thanks.

> Otherwise looks okay to me.
>
Zhanghailiang Dec. 23, 2015, 3:10 a.m. UTC | #6
On 2015/12/19 18:02, Markus Armbruster wrote:
> Copying qemu-block because this seems related to generalising block jobs
> to background jobs.
>

Er, this event is just used to help users know what happened to a VM with COLO FT
on. If users get this event, they can check further what went wrong, and
decide which side should take over the work.

> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> ---
>>   docs/qmp-events.txt | 17 +++++++++++++++++
>>   migration/colo.c    | 11 +++++++++++
>>   qapi-schema.json    | 16 ++++++++++++++++
>>   qapi/event.json     | 17 +++++++++++++++++
>>   4 files changed, 61 insertions(+)
>>
>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>> index d2f1ce4..19f68fc 100644
>> --- a/docs/qmp-events.txt
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>   event.
>>
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
>
> How would the event's recipient distinguish between "due to error" and
> "at the user's request"?
>

If they get this event with 'reason' set to 'request', it is 'at the user's request'.
Otherwise it is 'due to error' (the value of 'reason' will be 'error', and we have an optional
error message which may help to figure out what happened).
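
A management client could branch on 'reason' roughly as follows. This is a
minimal sketch in Python, assuming each event arrives as one JSON text in the
shape of the example in docs/qmp-events.txt; the function name
classify_colo_exit and the returned strings are hypothetical and not part of
this patch.

```python
import json


def classify_colo_exit(line):
    """Classify one QMP event text; return None if it is not COLO_EXIT."""
    event = json.loads(line)
    if event.get("event") != "COLO_EXIT":
        return None
    data = event.get("data", {})
    mode = data["mode"]      # "primary" or "secondary"
    reason = data["reason"]  # "request" or "error"
    if reason == "request":
        return f"{mode}: COLO exited at the user's request"
    # "error" is optional and meant for humans: report it, do not parse it.
    detail = data.get("error", "no error message supplied")
    return f"{mode}: COLO exited due to an error ({detail})"


sample = ('{"timestamp": {"seconds": 2032141960, "microseconds": 417172},'
          ' "event": "COLO_EXIT",'
          ' "data": {"mode": "primary", "reason": "request"}}')
print(classify_colo_exit(sample))
```

The decision is made on 'reason' alone; the optional 'error' string is only
passed through for display.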

>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, operation)
>> +
>> +Example:
>> +
>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> +
>
> Pardon my ignorance again...  Does "VM finishes COLO mode" means have
> some kind of COLO background job, and it just finished for whatever
> reason?
>

As I said above.

> If yes, this COLO job could be an instance of the general background job
> concept we're trying to grow from the existing block job concept.
>
> I'm not asking you to rebase your work onto the background job
> infrastructure, not least for the simple reason that it doesn't exist,
> yet.  But I think it would be fruitful to compare your COLO job
> management QMP interface with the one we have for block jobs.  Not only
> may that avoid unnecessary inconsistency, it could also help shape the
> general background job interface.
>

Interesting, I'm not very familiar with this block background job infrastructure.
If we consider COLO FT a background job, we can certainly use it. I will have a look
at it.

> Quick overview of the block job QMP interface:
>
> * Commands to create a job: block-commit, block-stream, drive-mirror,
>    drive-backup.
>
> * Get information on jobs: query-block-jobs
>
> * Pause a job: block-job-pause
>
> * Resume a job: block-job-resume
>
> * Cancel a job: block-job-cancel
>
> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
>
> * Block job error event: BLOCK_JOB_ERROR
>
> * Block job synchronous completion: event BLOCK_JOB_READY and command
>    block-job-complete
>
>>   DEVICE_DELETED
>>   --------------
>>
>> diff --git a/migration/colo.c b/migration/colo.c
>> index d1dd4e1..d06c14f 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>   #include "qemu/error-report.h"
>>   #include "qemu/sockets.h"
>>   #include "migration/failover.h"
>> +#include "qapi-event.h"
>>
>>   /* colo buffer */
>>   #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>   out:
>>       if (ret < 0) {
>>           error_report("%s: %s", __func__, strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>       }
>>
>>       qsb_free(buffer);
>> @@ -516,6 +522,11 @@ out:
>>       if (ret < 0) {
>>           error_report("colo incoming thread will exit, detect error: %s",
>>                        strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>       }
>>
>>       if (fb) {
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index feb7d53..f6ecb88 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -778,6 +778,22 @@
>>     'data': [ 'unknown', 'primary', 'secondary'] }
>>
>>   ##
>> +# @COLOExitReason
>> +#
>> +# The reason for a COLO exit
>> +#
>> +# @unknown: unknown reason
>
> How can @unknown happen?
>

>> +#
>> +# @request: COLO exit is due to an external request
>> +#
>> +# @error: COLO exit is due to an internal error
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum': 'COLOExitReason',
>> +  'data': [ 'unknown', 'request', 'error'] }
>> +
>> +##
>>   # @x-colo-lost-heartbeat
>>   #
>>   # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>> diff --git a/qapi/event.json b/qapi/event.json
>> index f0cef01..f63d456 100644
>> --- a/qapi/event.json
>> +++ b/qapi/event.json
>> @@ -255,6 +255,23 @@
>>     'data': {'status': 'MigrationStatus'}}
>>
>>   ##
>> +# @COLO_EXIT
>> +#
>> +# Emitted when VM finishes COLO mode due to some errors happening or
>> +# at the request of users.
>> +#
>> +# @mode: which COLO mode the VM was in when it exited.
>
> Can we get 'unknown' here?
>

No, I will remove it :)

>> +#
>> +# @reason: describes the reason for the COLO exit.
>
> Can we get 'unknown' here?
>

No, that should never happen for now. I will remove it.

>> +#
>> +# @error: #optional, error message. Only present on error happening.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'event': 'COLO_EXIT',
>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>> +
>> +##
>>   # @ACPI_DEVICE_OST
>>   #
>>   # Emitted when guest executes ACPI _OST method.
Zhanghailiang Dec. 23, 2015, 3:14 a.m. UTC | #7
On 2015/12/22 5:14, John Snow wrote:
>
>
> On 12/19/2015 05:02 AM, Markus Armbruster wrote:
>> Copying qemu-block because this seems related to generalising block jobs
>> to background jobs.
>>
>> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>>
>>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>>> failover work immediately.
>>> If users don't want to get involved in COLO's failover verdict,
>>> it is still necessary to notify users that we exited COLO mode.
>>>
>>> Cc: Markus Armbruster <armbru@redhat.com>
>>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> ---
>>> v11:
>>> - Fix several typos found by Eric
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> ---
>>>   docs/qmp-events.txt | 17 +++++++++++++++++
>>>   migration/colo.c    | 11 +++++++++++
>>>   qapi-schema.json    | 16 ++++++++++++++++
>>>   qapi/event.json     | 17 +++++++++++++++++
>>>   4 files changed, 61 insertions(+)
>>>
>>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>>> index d2f1ce4..19f68fc 100644
>>> --- a/docs/qmp-events.txt
>>> +++ b/docs/qmp-events.txt
>>> @@ -184,6 +184,23 @@ Example:
>>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>>   event.
>>>
>>> +COLO_EXIT
>>> +---------
>>> +
>>> +Emitted when VM finishes COLO mode due to some errors happening or
>>> +at the request of users.
>>
>> How would the event's recipient distinguish between "due to error" and
>> "at the user's request"?
>>
>>> +
>>> +Data:
>>> +
>>> + - "mode": COLO mode, primary or secondary side (json-string)
>>> + - "reason":  the exit reason, internal error or external request. (json-string)
>>> + - "error": error message (json-string, optional)
>>> +
>>> +Example:
>>> +
>>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>>> +
>>
>> Pardon my ignorance again...  Does "VM finishes COLO mode" mean we have
>> some kind of COLO background job, and it just finished for whatever
>> reason?
>>
>> If yes, this COLO job could be an instance of the general background job
>> concept we're trying to grow from the existing block job concept.
>>
>> I'm not asking you to rebase your work onto the background job
>> infrastructure, not least for the simple reason that it doesn't exist,
>> yet.  But I think it would be fruitful to compare your COLO job
>> management QMP interface with the one we have for block jobs.  Not only
>> may that avoid unnecessary inconsistency, it could also help shape the
>> general background job interface.
>>
>
> Yes. The "background job" concept doesn't exist in a formal way outside
> of the block layer yet, but we're looking to expand it as we re-tool the
> block jobs themselves.
>
> It may be the case that the COLO commands and events need to go in as
> they are now, but later we can bring them back into the generalized job
> infrastructure.
>

Agreed. ;)

>> Quick overview of the block job QMP interface:
>>
>> * Commands to create a job: block-commit, block-stream, drive-mirror,
>>    drive-backup.
>>
>> * Get information on jobs: query-block-jobs
>>
>> * Pause a job: block-job-pause
>>
>> * Resume a job: block-job-resume
>>
>> * Cancel a job: block-job-cancel
>>
>> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
>>
>> * Block job error event: BLOCK_JOB_ERROR
>>
>> * Block job synchronous completion: event BLOCK_JOB_READY and command
>>    block-job-complete
>>
>
> The block-agnostic version of these commands would likely be:
>
> query-jobs
> job-pause
> job-resume
> job-cancel
> job-complete
>
> Events: JOB_COMPLETED, JOB_CANCELLED, JOB_ERROR, JOB_READY.
>
>
> It looks like COLO_EXIT would be an instance of JOB_COMPLETED, and if it
> occurred due to an error, we'd also see JOB_ERROR emitted.
>

Yes, if we use this job framework for COLO, COLO_EXIT will look like that.
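
A minimal sketch of that mapping, assuming the generic event names
JOB_COMPLETED and JOB_ERROR proposed above. The job framework does not
exist yet, so every identifier below is hypothetical and only illustrates
the idea:

```c
#include <assert.h>

/* Exit reasons from the patch's COLOExitReason enum. 'unknown' is left
 * out here because the author plans to remove it. */
typedef enum {
    SKETCH_REASON_REQUEST,
    SKETCH_REASON_ERROR
} SketchColoExitReason;

/* Hypothetical generic job events, used as bit flags. The names follow
 * the proposal in this thread; nothing like this exists in QEMU yet. */
enum {
    SKETCH_JOB_COMPLETED = 1 << 0,
    SKETCH_JOB_ERROR     = 1 << 1
};

/* Return the set of generic job events one COLO_EXIT would turn into. */
static unsigned sketch_colo_exit_to_job_events(SketchColoExitReason reason)
{
    unsigned events = SKETCH_JOB_COMPLETED;   /* reported on every exit */

    assert(reason == SKETCH_REASON_REQUEST || reason == SKETCH_REASON_ERROR);
    if (reason == SKETCH_REASON_ERROR) {
        events |= SKETCH_JOB_ERROR;           /* additionally on error */
    }
    return events;
}
```

With this mapping an exit on request yields only the completion event,
while an exit on error yields both the completion and the error event.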

>>>   DEVICE_DELETED
>>>   --------------
>>>
>>> diff --git a/migration/colo.c b/migration/colo.c
>>> index d1dd4e1..d06c14f 100644
>>> --- a/migration/colo.c
>>> +++ b/migration/colo.c
>>> @@ -18,6 +18,7 @@
>>>   #include "qemu/error-report.h"
>>>   #include "qemu/sockets.h"
>>>   #include "migration/failover.h"
>>> +#include "qapi-event.h"
>>>
>>>   /* colo buffer */
>>>   #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>>   out:
>>>       if (ret < 0) {
>>>           error_report("%s: %s", __func__, strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>       }
>>>
>>>       qsb_free(buffer);
>>> @@ -516,6 +522,11 @@ out:
>>>       if (ret < 0) {
>>>           error_report("colo incoming thread will exit, detect error: %s",
>>>                        strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>       }
>>>
>>>       if (fb) {
>>> diff --git a/qapi-schema.json b/qapi-schema.json
>>> index feb7d53..f6ecb88 100644
>>> --- a/qapi-schema.json
>>> +++ b/qapi-schema.json
>>> @@ -778,6 +778,22 @@
>>>     'data': [ 'unknown', 'primary', 'secondary'] }
>>>
>>>   ##
>>> +# @COLOExitReason
>>> +#
>>> +# The reason for a COLO exit
>>> +#
>>> +# @unknown: unknown reason
>>
>> How can @unknown happen?
>>
>>> +#
>>> +# @request: COLO exit is due to an external request
>>> +#
>>> +# @error: COLO exit is due to an internal error
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'enum': 'COLOExitReason',
>>> +  'data': [ 'unknown', 'request', 'error'] }
>>> +
>>> +##
>>>   # @x-colo-lost-heartbeat
>>>   #
>>>   # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>>> diff --git a/qapi/event.json b/qapi/event.json
>>> index f0cef01..f63d456 100644
>>> --- a/qapi/event.json
>>> +++ b/qapi/event.json
>>> @@ -255,6 +255,23 @@
>>>     'data': {'status': 'MigrationStatus'}}
>>>
>>>   ##
>>> +# @COLO_EXIT
>>> +#
>>> +# Emitted when VM finishes COLO mode due to some errors happening or
>>> +# at the request of users.
>>> +#
>>> +# @mode: which COLO mode the VM was in when it exited.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @reason: describes the reason for the COLO exit.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @error: #optional, error message. Only present on error happening.
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'event': 'COLO_EXIT',
>>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>>> +
>>> +##
>>>   # @ACPI_DEVICE_OST
>>>   #
>>>   # Emitted when guest executes ACPI _OST method.
>>
>
John Snow Jan. 5, 2016, 7:21 p.m. UTC | #8
On 12/22/2015 08:24 PM, Wen Congyang wrote:
> On 12/19/2015 06:02 PM, Markus Armbruster wrote:
>> Copying qemu-block because this seems related to generalising block jobs
>> to background jobs.
>>
>> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>>
>>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>>> failover work immediately.
>>> If users don't want to get involved in COLO's failover verdict,
>>> it is still necessary to notify users that we exited COLO mode.
>>>
>>> Cc: Markus Armbruster <armbru@redhat.com>
>>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> ---
>>> v11:
>>> - Fix several typos found by Eric
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> ---
>>>  docs/qmp-events.txt | 17 +++++++++++++++++
>>>  migration/colo.c    | 11 +++++++++++
>>>  qapi-schema.json    | 16 ++++++++++++++++
>>>  qapi/event.json     | 17 +++++++++++++++++
>>>  4 files changed, 61 insertions(+)
>>>
>>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>>> index d2f1ce4..19f68fc 100644
>>> --- a/docs/qmp-events.txt
>>> +++ b/docs/qmp-events.txt
>>> @@ -184,6 +184,23 @@ Example:
>>>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>>  event.
>>>  
>>> +COLO_EXIT
>>> +---------
>>> +
>>> +Emitted when VM finishes COLO mode due to some errors happening or
>>> +at the request of users.
>>
>> How would the event's recipient distinguish between "due to error" and
>> "at the user's request"?
>>
>>> +
>>> +Data:
>>> +
>>> + - "mode": COLO mode, primary or secondary side (json-string)
>>> + - "reason":  the exit reason, internal error or external request. (json-string)
>>> + - "error": error message (json-string, optional)
>>> +
>>> +Example:
>>> +
>>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>>> +
>>
>> Pardon my ignorance again...  Does "VM finishes COLO mode" mean we have
>> some kind of COLO background job, and it just finished for whatever
>> reason?
>>
>> If yes, this COLO job could be an instance of the general background job
>> concept we're trying to grow from the existing block job concept.
>>
>> I'm not asking you to rebase your work onto the background job
>> infrastructure, not least for the simple reason that it doesn't exist,
>> yet.  But I think it would be fruitful to compare your COLO job
>> management QMP interface with the one we have for block jobs.  Not only
>> may that avoid unnecessary inconsistency, it could also help shape the
>> general background job interface.
> 
> COLO is not a block job. If live migration is a background job, COLO
> is also a background job.
> 

Right. We are contemplating expanding the "block job" subsystem into a
generic "background job" system. Live migration might be one target to
be converted to this Jobs API, and COLO might also be a fit.

The framework doesn't exist yet, though.

>>
>> Quick overview of the block job QMP interface:
>>
>> * Commands to create a job: block-commit, block-stream, drive-mirror,
>>   drive-backup.
>>
>> * Get information on jobs: query-block-jobs
>>
>> * Pause a job: block-job-pause
>>
>> * Resume a job: block-job-resume
>>
>> * Cancel a job: block-job-cancel
>>
>> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
>>
>> * Block job error event: BLOCK_JOB_ERROR
>>
>> * Block job synchronous completion: event BLOCK_JOB_READY and command
>>   block-job-complete
> 
> What is background job infrastructure? Do you mean implement all the above
> interfaces for each background job?
> 
> Thanks
> Wen Congyang
> 

Markus is laying out how block jobs currently work, as background on the
job system as it exists today. He's highlighting the commands to
create, query, pause, resume, and cancel jobs; as well as demonstrating
the QMP events that the Block Job system uses to indicate completion,
cancellation, error and convergence.

We're thinking of making a generic background job system that would
replace the blockjobs API with a new generic Jobs API that looks very
similar.

Something like this:

Commands:
query: query-jobs
pause: job-pause
resume: job-resume
cancel: job-cancel
complete: job-complete (finalizes a long running command that has converged)

Events:
completion: JOB_COMPLETED, JOB_CANCELLED
error: JOB_ERROR
convergence indicator: JOB_READY

The system doesn't exist yet, but your proposed events that indicate
success/failure etc for COLO caught Markus' attention as perhaps quite
neatly fitting into the above proposed system.
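
To make the correspondence concrete, here is a small lookup-table sketch
that pairs each existing block job command or event with the generic name
proposed above. The generic names are only a proposal, and the table and
helper names (sketch_job_names, sketch_generic_job_name) are made up for
this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Existing block job name paired with the proposed (not yet existing)
 * generic job name from the list above. */
static const struct {
    const char *block_name;
    const char *generic_name;
} sketch_job_names[] = {
    { "query-block-jobs",    "query-jobs"    },
    { "block-job-pause",     "job-pause"     },
    { "block-job-resume",    "job-resume"    },
    { "block-job-cancel",    "job-cancel"    },
    { "block-job-complete",  "job-complete"  },
    { "BLOCK_JOB_COMPLETED", "JOB_COMPLETED" },
    { "BLOCK_JOB_CANCELLED", "JOB_CANCELLED" },
    { "BLOCK_JOB_ERROR",     "JOB_ERROR"     },
    { "BLOCK_JOB_READY",     "JOB_READY"     },
};

/* Return the proposed generic name, or NULL if block_name is not listed
 * (for example the job-creating commands, which have no generic form). */
static const char *sketch_generic_job_name(const char *block_name)
{
    size_t n = sizeof(sketch_job_names) / sizeof(sketch_job_names[0]);
    size_t i;

    assert(block_name != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(sketch_job_names[i].block_name, block_name) == 0) {
            return sketch_job_names[i].generic_name;
        }
    }
    return NULL;
}
```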

--js

>>
>>>  DEVICE_DELETED
>>>  --------------
>>>  
>>> diff --git a/migration/colo.c b/migration/colo.c
>>> index d1dd4e1..d06c14f 100644
>>> --- a/migration/colo.c
>>> +++ b/migration/colo.c
>>> @@ -18,6 +18,7 @@
>>>  #include "qemu/error-report.h"
>>>  #include "qemu/sockets.h"
>>>  #include "migration/failover.h"
>>> +#include "qapi-event.h"
>>>  
>>>  /* colo buffer */
>>>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>>  out:
>>>      if (ret < 0) {
>>>          error_report("%s: %s", __func__, strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>      }
>>>  
>>>      qsb_free(buffer);
>>> @@ -516,6 +522,11 @@ out:
>>>      if (ret < 0) {
>>>          error_report("colo incoming thread will exit, detect error: %s",
>>>                       strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>      }
>>>  
>>>      if (fb) {
>>> diff --git a/qapi-schema.json b/qapi-schema.json
>>> index feb7d53..f6ecb88 100644
>>> --- a/qapi-schema.json
>>> +++ b/qapi-schema.json
>>> @@ -778,6 +778,22 @@
>>>    'data': [ 'unknown', 'primary', 'secondary'] }
>>>  
>>>  ##
>>> +# @COLOExitReason
>>> +#
>>> +# The reason for a COLO exit
>>> +#
>>> +# @unknown: unknown reason
>>
>> How can @unknown happen?
>>
>>> +#
>>> +# @request: COLO exit is due to an external request
>>> +#
>>> +# @error: COLO exit is due to an internal error
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'enum': 'COLOExitReason',
>>> +  'data': [ 'unknown', 'request', 'error'] }
>>> +
>>> +##
>>>  # @x-colo-lost-heartbeat
>>>  #
>>>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>>> diff --git a/qapi/event.json b/qapi/event.json
>>> index f0cef01..f63d456 100644
>>> --- a/qapi/event.json
>>> +++ b/qapi/event.json
>>> @@ -255,6 +255,23 @@
>>>    'data': {'status': 'MigrationStatus'}}
>>>  
>>>  ##
>>> +# @COLO_EXIT
>>> +#
>>> +# Emitted when VM finishes COLO mode due to some errors happening or
>>> +# at the request of users.
>>> +#
>>> +# @mode: which COLO mode the VM was in when it exited.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @reason: describes the reason for the COLO exit.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @error: #optional, error message. Only present on error happening.
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'event': 'COLO_EXIT',
>>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>>> +
>>> +##
>>>  # @ACPI_DEVICE_OST
>>>  #
>>>  # Emitted when guest executes ACPI _OST method.
>>
Markus Armbruster Jan. 11, 2016, 1:24 p.m. UTC | #9
Hailiang Zhang <zhang.zhanghailiang@huawei.com> writes:

> On 2015/12/19 18:02, Markus Armbruster wrote:
>> Copying qemu-block because this seems related to generalising block jobs
>> to background jobs.
>>
>
> Er, this event is just used to let users know what happened to a VM with
> COLO FT on. If users get this event, they can check further what went
> wrong and decide which side should take over the work.
>
>> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>>
>>> If some errors happen during VM's COLO FT stage, it's important to
>>> notify the users of this event. Together with 'colo_lost_heartbeat',
>>> users can intervene in COLO's failover work immediately.
>>> If users don't want to get involved in COLO's failover verdict,
>>> it is still necessary to notify users that we exited COLO mode.
>>>
>>> Cc: Markus Armbruster <armbru@redhat.com>
>>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> ---
>>> v11:
>>> - Fix several typos found by Eric
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> ---
>>>   docs/qmp-events.txt | 17 +++++++++++++++++
>>>   migration/colo.c    | 11 +++++++++++
>>>   qapi-schema.json    | 16 ++++++++++++++++
>>>   qapi/event.json     | 17 +++++++++++++++++
>>>   4 files changed, 61 insertions(+)
>>>
>>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>>> index d2f1ce4..19f68fc 100644
>>> --- a/docs/qmp-events.txt
>>> +++ b/docs/qmp-events.txt
>>> @@ -184,6 +184,23 @@ Example:
>>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>>   event.
>>>
>>> +COLO_EXIT
>>> +---------
>>> +
>>> +Emitted when VM finishes COLO mode due to some errors happening or
>>> +at the request of users.
>>
>> How would the event's recipient distinguish between "due to error" and
>> "at the user's request"?
>>
>
> If they get this event with 'reason' set to 'request', it is 'at the
> user's request'. Otherwise it is 'due to error': the value of 'reason'
> will be 'error', and an optional error message may help to figure out
> what happened.

For what it's worth, block jobs use separate events BLOCK_JOB_CANCELLED
and BLOCK_JOB_ERROR.
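
For comparison, a minimal sketch of how a recipient could tell the two
cases apart with the single COLO_EXIT event. It assumes the event's
"reason" and optional "error" members have already been extracted from
the JSON by some QMP client; the type and function names are made up for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    SKETCH_EXIT_BY_REQUEST,
    SKETCH_EXIT_BY_ERROR,
    SKETCH_EXIT_UNRECOGNIZED
} SketchExitKind;

/* Classify a COLO_EXIT event from its "reason" string.
 * error_msg is the optional "error" member (NULL when absent). On an
 * error exit, *detail points at it, or at a placeholder if it is
 * missing; in every other case *detail is set to NULL. */
static SketchExitKind sketch_classify_colo_exit(const char *reason,
                                                const char *error_msg,
                                                const char **detail)
{
    assert(reason != NULL && detail != NULL);
    *detail = NULL;

    if (strcmp(reason, "request") == 0) {
        return SKETCH_EXIT_BY_REQUEST;
    }
    if (strcmp(reason, "error") == 0) {
        *detail = error_msg ? error_msg : "(no error message)";
        return SKETCH_EXIT_BY_ERROR;
    }
    return SKETCH_EXIT_UNRECOGNIZED;   /* e.g. a value added later */
}
```

With separate events, as block jobs have, the recipient would dispatch on
the event name instead of inspecting a member.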

>>> +
>>> +Data:
>>> +
>>> + - "mode": COLO mode, primary or secondary side (json-string)
>>> + - "reason":  the exit reason, internal error or external request. (json-string)
>>> + - "error": error message (json-string, optional)
>>> +
>>> +Example:
>>> +
>>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>>> +
>>
>> Pardon my ignorance again...  Does "VM finishes COLO mode" mean we have
>> some kind of COLO background job, and it just finished for whatever
>> reason?
>>
>
> As I said above.
>
>> If yes, this COLO job could be an instance of the general background job
>> concept we're trying to grow from the existing block job concept.
>>
>> I'm not asking you to rebase your work onto the background job
>> infrastructure, not least for the simple reason that it doesn't exist,
>> yet.  But I think it would be fruitful to compare your COLO job
>> management QMP interface with the one we have for block jobs.  Not only
>> may that avoid unnecessary inconsistency, it could also help shape the
>> general background job interface.
>>
>
> Interesting, I'm not very familiar with this block background job
> infrastructure. If we consider COLO FT as a background job, we can
> certainly use it. I will have a look at it.

Thanks!  Let's avoid unnecessary differences between COLO and block job
interfaces.  Later on, we can hopefully make them both use a common
background job infrastructure, and the smaller their differences are,
the easier that'll be.

[...]

Patch

diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
index d2f1ce4..19f68fc 100644
--- a/docs/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -184,6 +184,23 @@  Example:
 Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
 event.
 
+COLO_EXIT
+---------
+
+Emitted when VM finishes COLO mode due to some errors happening or
+at the request of users.
+
+Data:
+
+ - "mode": COLO mode, primary or secondary side (json-string)
+ - "reason":  the exit reason, internal error or external request. (json-string)
+ - "error": error message (json-string, optional)
+
+Example:
+
+{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
+ "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
+
 DEVICE_DELETED
 --------------
 
diff --git a/migration/colo.c b/migration/colo.c
index d1dd4e1..d06c14f 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -18,6 +18,7 @@ 
 #include "qemu/error-report.h"
 #include "qemu/sockets.h"
 #include "migration/failover.h"
+#include "qapi-event.h"
 
 /* colo buffer */
 #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
@@ -349,6 +350,11 @@  static void colo_process_checkpoint(MigrationState *s)
 out:
     if (ret < 0) {
         error_report("%s: %s", __func__, strerror(-ret));
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
+                                  true, strerror(-ret), NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
+                                  false, NULL, NULL);
     }
 
     qsb_free(buffer);
@@ -516,6 +522,11 @@  out:
     if (ret < 0) {
         error_report("colo incoming thread will exit, detect error: %s",
                      strerror(-ret));
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
+                                  true, strerror(-ret), NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
+                                  false, NULL, NULL);
     }
 
     if (fb) {
diff --git a/qapi-schema.json b/qapi-schema.json
index feb7d53..f6ecb88 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -778,6 +778,22 @@ 
   'data': [ 'unknown', 'primary', 'secondary'] }
 
 ##
+# @COLOExitReason
+#
+# The reason for a COLO exit
+#
+# @unknown: unknown reason
+#
+# @request: COLO exit is due to an external request
+#
+# @error: COLO exit is due to an internal error
+#
+# Since: 2.6
+##
+{ 'enum': 'COLOExitReason',
+  'data': [ 'unknown', 'request', 'error'] }
+
+##
 # @x-colo-lost-heartbeat
 #
 # Tell qemu that heartbeat is lost, request it to do takeover procedures.
diff --git a/qapi/event.json b/qapi/event.json
index f0cef01..f63d456 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -255,6 +255,23 @@ 
   'data': {'status': 'MigrationStatus'}}
 
 ##
+# @COLO_EXIT
+#
+# Emitted when VM finishes COLO mode due to some errors happening or
+# at the request of users.
+#
+# @mode: which COLO mode the VM was in when it exited.
+#
+# @reason: describes the reason for the COLO exit.
+#
+# @error: #optional, error message. Only present on error happening.
+#
+# Since: 2.6
+##
+{ 'event': 'COLO_EXIT',
+  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
+
+##
 # @ACPI_DEVICE_OST
 #
 # Emitted when guest executes ACPI _OST method.