
[V4,10/16] qmp event: Add COLO_EXIT event to notify users while exited COLO

Message ID 1516369485-5374-11-git-send-email-zhangckid@gmail.com
State New
Series COLO: integrate colo frame with block replication and COLO proxy

Commit Message

Zhang Chen Jan. 19, 2018, 1:44 p.m. UTC
From: zhanghailiang <zhang.zhanghailiang@huawei.com>

If an error occurs during a VM's COLO FT stage, it is important to
notify users of the event. Together with 'x-colo-lost-heartbeat',
this lets users intervene in COLO's failover work immediately.
Even if users don't want to get involved in COLO's failover verdict,
it is still necessary to notify them that we exited COLO mode.

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Zhang Chen <zhangckid@gmail.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 migration/colo.c    | 19 +++++++++++++++++++
 qapi/migration.json | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

Comments

Markus Armbruster Feb. 3, 2018, 3:49 p.m. UTC | #1
Zhang Chen <zhangckid@gmail.com> writes:

> From: zhanghailiang <zhang.zhanghailiang@huawei.com>
>
> If some errors happen during VM's COLO FT stage, it's important to
> notify the users of this event. Together with 'x-colo-lost-heartbeat',
> Users can intervene in COLO's failover work immediately.
> If users don't want to get involved in COLO's failover verdict,
> it is still necessary to notify users that we exited COLO mode.
>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
[...]
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 70e7b67..6fc95b7 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -869,6 +869,41 @@
>    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
>  
>  ##
> +# @COLO_EXIT:
> +#
> +# Emitted when VM finishes COLO mode due to some errors happening or
> +# at the request of users.
> +#
> +# @mode: which COLO mode the VM was in when it exited.
> +#
> +# @reason: describes the reason for the COLO exit.
> +#
> +# Since: 2.12
> +#
> +# Example:
> +#
> +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
> +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> +#
> +##
> +{ 'event': 'COLO_EXIT',
> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }

Standard question when I see a new event: is there a way to poll for the
event's information?  If not, why don't we need one?

Remember, management applications might miss events when they lose the
connection and have to reconnect, say because the management application
needs to be restarted.

> +
> +##
> +# @COLOExitReason:
> +#
> +# The reason for a COLO exit
> +#
> +# @request: COLO exit is due to an external request
> +#
> +# @error: COLO exit is due to an internal error
> +#
> +# Since: 2.12
> +##
> +{ 'enum': 'COLOExitReason',
> +  'data': [ 'request', 'error' ] }
> +
> +##
>  # @x-colo-lost-heartbeat:
>  #
>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
Zhang Chen Feb. 6, 2018, 3:13 a.m. UTC | #2
On Sat, Feb 3, 2018 at 3:49 PM, Markus Armbruster <armbru@redhat.com> wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >
> > If some errors happen during VM's COLO FT stage, it's important to
> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
> > Users can intervene in COLO's failover work immediately.
> > If users don't want to get involved in COLO's failover verdict,
> > it is still necessary to notify users that we exited COLO mode.
> >
> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> > Reviewed-by: Eric Blake <eblake@redhat.com>
> [...]
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 70e7b67..6fc95b7 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -869,6 +869,41 @@
> >    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
> >
> >  ##
> > +# @COLO_EXIT:
> > +#
> > +# Emitted when VM finishes COLO mode due to some errors happening or
> > +# at the request of users.
> > +#
> > +# @mode: which COLO mode the VM was in when it exited.
> > +#
> > +# @reason: describes the reason for the COLO exit.
> > +#
> > +# Since: 2.12
> > +#
> > +# Example:
> > +#
> > +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
> > +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> > +#
> > +##
> > +{ 'event': 'COLO_EXIT',
> > +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
>
> Standard question when I see a new event: is there a way to poll for the
> event's information?  If not, why don't we need one?
>
>
Do you mean we'd better print the information to a log file or something
like that for all QEMU events?
CC Eric Blake <eblake@redhat.com>
Any ideas about this?

Thanks
Zhang Chen


> Remember, management applications might miss events when they lose the
> connection and have to reconnect, say because the management application
> needs to be restarted.
>
> > +
> > +##
> > +# @COLOExitReason:
> > +#
> > +# The reason for a COLO exit
> > +#
> > +# @request: COLO exit is due to an external request
> > +#
> > +# @error: COLO exit is due to an internal error
> > +#
> > +# Since: 2.12
> > +##
> > +{ 'enum': 'COLOExitReason',
> > +  'data': [ 'request', 'error' ] }
> > +
> > +##
> >  # @x-colo-lost-heartbeat:
> >  #
> >  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>
Markus Armbruster Feb. 6, 2018, 7:27 a.m. UTC | #3
Zhang Chen <zhangckid@gmail.com> writes:

> On Sat, Feb 3, 2018 at 3:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>
>> Zhang Chen <zhangckid@gmail.com> writes:
>>
>> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> >
>> > If some errors happen during VM's COLO FT stage, it's important to
>> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
>> > Users can intervene in COLO's failover work immediately.
>> > If users don't want to get involved in COLO's failover verdict,
>> > it is still necessary to notify users that we exited COLO mode.
>> >
>> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
>> > Reviewed-by: Eric Blake <eblake@redhat.com>
>> [...]
>> > diff --git a/qapi/migration.json b/qapi/migration.json
>> > index 70e7b67..6fc95b7 100644
>> > --- a/qapi/migration.json
>> > +++ b/qapi/migration.json
>> > @@ -869,6 +869,41 @@
>> >    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
>> >
>> >  ##
>> > +# @COLO_EXIT:
>> > +#
>> > +# Emitted when VM finishes COLO mode due to some errors happening or
>> > +# at the request of users.
>> > +#
>> > +# @mode: which COLO mode the VM was in when it exited.
>> > +#
>> > +# @reason: describes the reason for the COLO exit.
>> > +#
>> > +# Since: 2.12
>> > +#
>> > +# Example:
>> > +#
>> > +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> > +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> > +#
>> > +##
>> > +{ 'event': 'COLO_EXIT',
>> > +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
>>
>> Standard question when I see a new event: is there a way to poll for the
>> event's information?  If not, why don't we need one?
>>
>>
> Your means is we'd better print the information to a log file or something
> like that for all qemu events?
> CC  Eric Blake <eblake@redhat.com>
> any idea about this?

Events carrying state change information management applications want to
track are generally paired with a query- command.  While the management
application is connected, it can track by passively listening for state
change events.  After (re)connect, it has to actively query the current
state.

Questions?
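
The pairing described above can be sketched from the management application's side. This is only an illustration under stated assumptions: ColoTracker, TrackedMode and the tracker_* functions are hypothetical names, not QEMU or libvirt API, and the query result is passed in as a plain argument instead of being fetched over a monitor connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client-side model; none of these names come from QEMU. */
typedef enum { VM_NOT_IN_COLO, VM_COLO_PRIMARY, VM_COLO_SECONDARY } TrackedMode;

typedef struct {
    TrackedMode mode;   /* last state the client believes the VM is in */
    bool connected;     /* whether the monitor connection is up */
} ColoTracker;

/* Passive path: an exit event only reaches the client while connected. */
void tracker_on_exit_event(ColoTracker *t)
{
    assert(t != NULL);
    if (t->connected) {
        t->mode = VM_NOT_IN_COLO;
    }
}

/* Connection dropped: events emitted from now on are lost to the client. */
void tracker_on_disconnect(ColoTracker *t)
{
    assert(t != NULL);
    t->connected = false;
}

/* Active path: after (re)connect, overwrite the cached state with the
 * answer of a query command (passed in here as queried_mode). */
void tracker_on_reconnect(ColoTracker *t, TrackedMode queried_mode)
{
    assert(t != NULL);
    t->connected = true;
    t->mode = queried_mode;
}
```

If the client disconnects while the VM is a COLO primary and the exit event fires in the meantime, the cached mode stays stale until tracker_on_reconnect() replaces it with the queried state. An event alone cannot close that gap.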

>> Remember, management applications might miss events when they lose the
>> connection and have to reconnect, say because the management application
>> needs to be restarted.
>>
>> > +
>> > +##
>> > +# @COLOExitReason:
>> > +#
>> > +# The reason for a COLO exit
>> > +#
>> > +# @request: COLO exit is due to an external request
>> > +#
>> > +# @error: COLO exit is due to an internal error
>> > +#
>> > +# Since: 2.12
>> > +##
>> > +{ 'enum': 'COLOExitReason',
>> > +  'data': [ 'request', 'error' ] }
>> > +
>> > +##
>> >  # @x-colo-lost-heartbeat:
>> >  #
>> >  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
Zhang Chen Feb. 6, 2018, 8:01 a.m. UTC | #4
On Tue, Feb 6, 2018 at 3:27 PM, Markus Armbruster <armbru@redhat.com> wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > On Sat, Feb 3, 2018 at 3:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> >
> >> Zhang Chen <zhangckid@gmail.com> writes:
> >>
> >> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> >
> >> > If some errors happen during VM's COLO FT stage, it's important to
> >> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
> >> > Users can intervene in COLO's failover work immediately.
> >> > If users don't want to get involved in COLO's failover verdict,
> >> > it is still necessary to notify users that we exited COLO mode.
> >> >
> >> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> >> > Reviewed-by: Eric Blake <eblake@redhat.com>
> >> [...]
> >> > diff --git a/qapi/migration.json b/qapi/migration.json
> >> > index 70e7b67..6fc95b7 100644
> >> > --- a/qapi/migration.json
> >> > +++ b/qapi/migration.json
> >> > @@ -869,6 +869,41 @@
> >> >    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
> >> >
> >> >  ##
> >> > +# @COLO_EXIT:
> >> > +#
> >> > +# Emitted when VM finishes COLO mode due to some errors happening or
> >> > +# at the request of users.
> >> > +#
> >> > +# @mode: which COLO mode the VM was in when it exited.
> >> > +#
> >> > +# @reason: describes the reason for the COLO exit.
> >> > +#
> >> > +# Since: 2.12
> >> > +#
> >> > +# Example:
> >> > +#
> >> > +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
> >> > +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> >> > +#
> >> > +##
> >> > +{ 'event': 'COLO_EXIT',
> >> > +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
> >>
> >> Standard question when I see a new event: is there a way to poll for the
> >> event's information?  If not, why don't we need one?
> >>
> >>
> > Your means is we'd better print the information to a log file or something
> > like that for all qemu events?
> > CC  Eric Blake <eblake@redhat.com>
> > any idea about this?
>
> Events carrying state change information management applications want to
> track are generally paired with a query- command.  While the management
> application is connected, it can track by passively listening for state
> change events.  After (re)connect, it has to actively query the current
> state.
>
> Questions?
>


If I understand correctly, maybe we need a general history mechanism for
QEMU events to solve this problem, because many QEMU events can't resend
the current state. Yes, when the management application (like libvirt)
loses its connection to QEMU, it can't get the information after it
reconnects.

Thanks
Zhang Chen


>
> >> Remember, management applications might miss events when they lose the
> >> connection and have to reconnect, say because the management application
> >> needs to be restarted.
> >>
> >> > +
> >> > +##
> >> > +# @COLOExitReason:
> >> > +#
> >> > +# The reason for a COLO exit
> >> > +#
> >> > +# @request: COLO exit is due to an external request
> >> > +#
> >> > +# @error: COLO exit is due to an internal error
> >> > +#
> >> > +# Since: 2.12
> >> > +##
> >> > +{ 'enum': 'COLOExitReason',
> >> > +  'data': [ 'request', 'error' ] }
> >> > +
> >> > +##
> >> >  # @x-colo-lost-heartbeat:
> >> >  #
> >> >  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>
Markus Armbruster Feb. 6, 2018, 9:53 a.m. UTC | #5
Zhang Chen <zhangckid@gmail.com> writes:

> On Tue, Feb 6, 2018 at 3:27 PM, Markus Armbruster <armbru@redhat.com> wrote:
>
>> Zhang Chen <zhangckid@gmail.com> writes:
>>
>> > On Sat, Feb 3, 2018 at 3:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> >> Standard question when I see a new event: is there a way to poll for the
>> >> event's information?  If not, why don't we need one?
>> >>
>> >>
>> > Your means is we'd better print the information to a log file or something
>> > like that for all qemu events?
>> > CC  Eric Blake <eblake@redhat.com>
>> > any idea about this?
>>
>> Events carrying state change information management applications want to
>> track are generally paired with a query- command.  While the management
>> application is connected, it can track by passively listening for state
>> change events.  After (re)connect, it has to actively query the current
>> state.
>>
>> Questions?
>>
>
>
> If I understand correctly, maybe we need a qemu events general history
> mechanism
> to solve this problem,
> because lots of qemu events can't resend the current state. Yes, when the
> "management application"(like libvirt)
> lose the connection to qemu,  management application can't get the
> information after reconnect.

Events can't resend the current state, but query commands can.

Designing of an "events general history mechanism" could well be
non-trivial.  Its implementation might not be simple, either.  Query
commands, on the other hand, are well understood and easy to implement.
Zhang Chen Feb. 6, 2018, 12:44 p.m. UTC | #6
On Tue, Feb 6, 2018 at 5:53 PM, Markus Armbruster <armbru@redhat.com> wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > On Tue, Feb 6, 2018 at 3:27 PM, Markus Armbruster <armbru@redhat.com> wrote:
> >
> >> Zhang Chen <zhangckid@gmail.com> writes:
> >>
> >> > On Sat, Feb 3, 2018 at 3:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> >> >> Standard question when I see a new event: is there a way to poll for the
> >> >> event's information?  If not, why don't we need one?
> >> >>
> >> >>
> >> > Your means is we'd better print the information to a log file or something
> >> > like that for all qemu events?
> >> > CC  Eric Blake <eblake@redhat.com>
> >> > any idea about this?
> >>
> >> Events carrying state change information management applications want to
> >> track are generally paired with a query- command.  While the management
> >> application is connected, it can track by passively listening for state
> >> change events.  After (re)connect, it has to actively query the current
> >> state.
> >>
> >> Questions?
> >>
> >
> >
> > If I understand correctly, maybe we need a qemu events general history
> > mechanism
> > to solve this problem,
> > because lots of qemu events can't resend the current state. Yes, when the
> > "management application"(like libvirt)
> > lose the connection to qemu,  management application can't get the
> > information after reconnect.
>
> Events can't resend the current state, but query commands can.
>
> Designing of an "events general history mechanism" could well be
> non-trivial.  Its implementation might not be simple, either.  Query
> commands, on the other hand, are well understood and easy to implement.
>

OK, I got it.
I will add a new query command for COLO state in the next version.
Thanks for your comments.

Zhang Chen
Eric Blake Feb. 6, 2018, 3:20 p.m. UTC | #7
On 02/05/2018 09:13 PM, Zhang Chen wrote:

>>> +##
>>> +{ 'event': 'COLO_EXIT',
>>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
>>
>> Standard question when I see a new event: is there a way to poll for the
>> event's information?  If not, why don't we need one?
>>
>>
> Your means is we'd better print the information to a log file or something
> like that for all qemu events?
> CC  Eric Blake <eblake@redhat.com>
> any idea about this?

Nothing to add, Markus is right. Implementing a new mechanism that logs
all events as they are issued, and teaching libvirt to parse that log at
startup, is more work than just implementing a query-foo command that
libvirt already knows how to use to query current state on first connect
(and, based on that query, make an intelligent decision on whether at
least one event was missed during downtime).  So far, no one has come up
with an event that is so important it must be logged.  The working
alternative is to treat events as a performance optimization, so that the
query- command doesn't have to be polled all the time; there is no severe
loss if an event is missed, because the query- command can be used in its
place.
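
The "intelligent decision" about missed events can be as small as comparing two snapshots. The sketch below assumes the client kept its last known state across the restart. ColoSnapshot and colo_exit_was_missed() are hypothetical names for illustration, not libvirt or QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client-side record; these names are not QEMU's. */
typedef struct {
    bool in_colo;   /* true while the VM is believed to run in COLO mode */
} ColoSnapshot;

/*
 * Compare the snapshot saved before the connection was lost with the
 * one built from a query on first connect. A transition out of COLO
 * mode means at least one exit event was emitted while nobody listened.
 */
bool colo_exit_was_missed(const ColoSnapshot *before, const ColoSnapshot *after)
{
    assert(before != NULL && after != NULL);
    return before->in_colo && !after->in_colo;
}
```

This only detects that a transition happened. Why it happened (request or error) would have to be part of what the query command returns.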

Patch

diff --git a/migration/colo.c b/migration/colo.c
index 8d2e3f8..790b122 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -516,6 +516,18 @@  out:
         qemu_fclose(fb);
     }
 
+    /*
+     * There are only two reasons we can get here: either some error
+     * happened, or the user triggered failover.
+     */
+    if (failover_get_state() == FAILOVER_STATUS_NONE) {
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
+                                  COLO_EXIT_REASON_ERROR, NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
+                                  COLO_EXIT_REASON_REQUEST, NULL);
+    }
+
     /* Hope this not to be too long to wait here */
     qemu_sem_wait(&s->colo_exit_sem);
     qemu_sem_destroy(&s->colo_exit_sem);
@@ -746,6 +758,13 @@  out:
     if (local_err) {
         error_report_err(local_err);
     }
+    if (failover_get_state() == FAILOVER_STATUS_NONE) {
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
+                                  COLO_EXIT_REASON_ERROR, NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
+                                  COLO_EXIT_REASON_REQUEST, NULL);
+    }
 
     if (fb) {
         qemu_fclose(fb);
diff --git a/qapi/migration.json b/qapi/migration.json
index 70e7b67..6fc95b7 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -869,6 +869,41 @@ 
   'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
 
 ##
+# @COLO_EXIT:
+#
+# Emitted when VM finishes COLO mode due to some errors happening or
+# at the request of users.
+#
+# @mode: which COLO mode the VM was in when it exited.
+#
+# @reason: describes the reason for the COLO exit.
+#
+# Since: 2.12
+#
+# Example:
+#
+# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
+#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
+#
+##
+{ 'event': 'COLO_EXIT',
+  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
+
+##
+# @COLOExitReason:
+#
+# The reason for a COLO exit
+#
+# @request: COLO exit is due to an external request
+#
+# @error: COLO exit is due to an internal error
+#
+# Since: 2.12
+##
+{ 'enum': 'COLOExitReason',
+  'data': [ 'request', 'error' ] }
+
+##
 # @x-colo-lost-heartbeat:
 #
 # Tell qemu that heartbeat is lost, request it to do takeover procedures.
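
Both hunks in migration/colo.c pick the exit reason with the same rule: a failover state of "none" means nobody requested the exit, so it is reported as an error. A standalone sketch of that rule follows. The enums are local stand-ins that reuse the names from the patch so the example compiles outside QEMU, and colo_exit_reason_from_failover() is a hypothetical helper that is not part of this series.

```c
#include <assert.h>

/* Local stand-ins so the sketch compiles on its own; in QEMU the real
 * definitions are generated from the QAPI schema. */
typedef enum {
    FAILOVER_STATUS_NONE,
    FAILOVER_STATUS_REQUIRE,
    FAILOVER_STATUS_ACTIVE,
    FAILOVER_STATUS_COMPLETED,
    FAILOVER_STATUS_RELAUNCH
} FailoverStatus;

typedef enum {
    COLO_EXIT_REASON_REQUEST,
    COLO_EXIT_REASON_ERROR
} COLOExitReason;

/* Hypothetical helper, not in the patch: map the failover state seen at
 * exit time to the reason carried by the COLO_EXIT event. */
COLOExitReason colo_exit_reason_from_failover(FailoverStatus status)
{
    assert(status <= FAILOVER_STATUS_RELAUNCH);
    /* No failover in progress means the exit was not requested. */
    return status == FAILOVER_STATUS_NONE ? COLO_EXIT_REASON_ERROR
                                          : COLO_EXIT_REASON_REQUEST;
}
```

With a helper of this shape, the primary and secondary paths would differ only in the mode argument they pass to qapi_event_send_colo_exit().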