diff mbox series

[PULL,16/45] vl: Add option to avoid stopping VM upon guest panic

Message ID 20201215175445.1272776-17-pbonzini@redhat.com
State New
Headers show
Series [PULL,01/45] remove preconfig state | expand

Commit Message

Paolo Bonzini Dec. 15, 2020, 5:54 p.m. UTC
From: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>

The current default action of pausing a guest after a panic event
is received leaves the responsibility to resume guest execution to the
management layer. The reasons for this behavior are discussed here:
https://lore.kernel.org/qemu-devel/52148F88.5000509@redhat.com/

However, in instances like the case of older guests (Linux and
Windows) using a pvpanic device but missing support for the
PVPANIC_CRASHLOADED event, and Windows guests using the hv-crash
enlightenment, it is desirable to allow the guests to continue
running after sending a PVPANIC_PANICKED event. This allows such
guests to proceed to capture a crash dump and automatically reboot
without intervention of a management layer.

Add an option to avoid stopping a VM after a panic event is received,
by passing:

-action panic=none

in the command line arguments, or during runtime by using an upcoming
QMP command.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Message-Id: <1607705564-26264-3-git-send-email-alejandro.j.jimenez@oracle.com>
[Do not fix panic action in the variable, instead modify -no-shutdown. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/sysemu/runstate-action.h |  1 +
 qapi/run-state.json              | 16 ++++++++++++++++
 qemu-options.hx                  |  3 +++
 softmmu/runstate-action.c        |  5 +++++
 softmmu/runstate.c               | 18 ++++++++++++++----
 softmmu/vl.c                     |  5 ++++-
 6 files changed, 43 insertions(+), 5 deletions(-)

Comments

Peter Maydell Jan. 19, 2021, 9:34 p.m. UTC | #1
On Tue, 15 Dec 2020 at 18:11, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>
> The current default action of pausing a guest after a panic event
> is received leaves the responsibility to resume guest execution to the
> management layer. The reasons for this behavior are discussed here:
> https://lore.kernel.org/qemu-devel/52148F88.5000509@redhat.com/
>
> However, in instances like the case of older guests (Linux and
> Windows) using a pvpanic device but missing support for the
> PVPANIC_CRASHLOADED event, and Windows guests using the hv-crash
> enlightenment, it is desirable to allow the guests to continue
> running after sending a PVPANIC_PANICKED event. This allows such
> guests to proceed to capture a crash dump and automatically reboot
> without intervention of a management layer.
>
> Add an option to avoid stopping a VM after a panic event is received,
> by passing:
>
> -action panic=none
>
> in the command line arguments, or during runtime by using an upcoming
> QMP command.

Hi. This commit message doesn't say it's changing the default
action, but the change does:

> @@ -3899,6 +3899,8 @@ DEF("action", HAS_ARG, QEMU_OPTION_action,
>      "                   action when guest reboots [default=none]\n"
>      "-action shutdown=poweroff|pause\n"
>      "                   action when guest shuts down [default=poweroff]\n"
> +    "-action panic=poweroff|pause|none\n"
> +    "                   action when guest panics [default=poweroff]\n"
>      "-action watchdog=reset|shutdown|poweroff|inject-nmi|pause|debug|none\n"
>      "                   action when watchdog fires [default=reset]\n",
>      QEMU_ARCH_ALL)

>  RebootAction reboot_action = REBOOT_ACTION_NONE;
>  ShutdownAction shutdown_action = SHUTDOWN_ACTION_POWEROFF;
> +PanicAction panic_action = PANIC_ACTION_POWEROFF;

We used to default to 'pause' and now we default to 'poweroff'.

We noticed this because it broke an in-flight test case for
the pvpanic-pci device from Mihai (which was expecting to see
the device in 'pause' state and found it was now in 'poweroff').
Test cases aren't very exciting, but was it really intentional
to change the default behaviour? It's part of the user-facing
surface of QEMU, so if we did intend a default change that ought
really to be more clearly stated (and noted in the Changelog) I think.

thanks
-- PMM
Alejandro Jimenez Jan. 20, 2021, 5:28 a.m. UTC | #2
On 1/19/2021 4:34 PM, Peter Maydell wrote:
> On Tue, 15 Dec 2020 at 18:11, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> From: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>
>> The current default action of pausing a guest after a panic event
>> is received leaves the responsibility to resume guest execution to the
>> management layer. The reasons for this behavior are discussed here:
>> https://lore.kernel.org/qemu-devel/52148F88.5000509@redhat.com/
>>
>> However, in instances like the case of older guests (Linux and
>> Windows) using a pvpanic device but missing support for the
>> PVPANIC_CRASHLOADED event, and Windows guests using the hv-crash
>> enlightenment, it is desirable to allow the guests to continue
>> running after sending a PVPANIC_PANICKED event. This allows such
>> guests to proceed to capture a crash dump and automatically reboot
>> without intervention of a management layer.
>>
>> Add an option to avoid stopping a VM after a panic event is received,
>> by passing:
>>
>> -action panic=none
>>
>> in the command line arguments, or during runtime by using an upcoming
>> QMP command.
> Hi. This commit message doesn't say it's changing the default
> action, but the change does:
>
>> @@ -3899,6 +3899,8 @@ DEF("action", HAS_ARG, QEMU_OPTION_action,
>>       "                   action when guest reboots [default=none]\n"
>>       "-action shutdown=poweroff|pause\n"
>>       "                   action when guest shuts down [default=poweroff]\n"
>> +    "-action panic=poweroff|pause|none\n"
>> +    "                   action when guest panics [default=poweroff]\n"
>>       "-action watchdog=reset|shutdown|poweroff|inject-nmi|pause|debug|none\n"
>>       "                   action when watchdog fires [default=reset]\n",
>>       QEMU_ARCH_ALL)
>>   RebootAction reboot_action = REBOOT_ACTION_NONE;
>>   ShutdownAction shutdown_action = SHUTDOWN_ACTION_POWEROFF;
>> +PanicAction panic_action = PANIC_ACTION_POWEROFF;
> We used to default to 'pause' and now we default to 'poweroff'.
Hi Peter.

My rationale for setting the panic action to 'poweroff' was to keep the 
default behavior of QEMU when '-no-shutdown' is not specified, and a 
panic occurs. I believe that in order to accomplish that, the default 
panic action should still be 'poweroff', but as you point out there is 
an instance where the behavior changes. Specifically, when 
'-no-shutdown' is not used there is now one fewer QMP event issued when 
a guest panic is detected, before stopping the VM and powering off.

I tried to account for this scenario in the original patches, but I 
failed to catch the problem after the rebase when the changes were 
merged. I'll test and send a fix for this issue in the next few days.

>
> We noticed this because it broke an in-flight test case for
> the pvpanic-pci device from Mihai (which was expecting to see
> the device in 'pause' state and found it was now in 'poweroff').
The test is just checking for the arrival of the QMP event, and not 
actually expecting the VM to be paused, correct? I see that if a 
test/management app is expecting to receive a GUEST_PANICKED event with 
the specific 'pause' action, then it might be confused. But any such 
tests would only be able to check for the arrival of the QMP event, and 
not actually expect to issue any commands to a paused VM, since the next 
block of code in QEMU immediately powers off and shutdowns when 
'-no-shutdown' is not requested. This was the typical behavior before 
the patches.

> Test cases aren't very exciting, but was it really intentional
> to change the default behaviour?
My intention was to preserve the default behavior. Perhaps Paolo wanted 
to reduce the number of GUEST_PANICKED events by removing the one with 
'pause' action? You could consider it superfluous since it is 
immediately followed by another indicating the 'poweroff' action... 
Unless I hear otherwise from either of you, I'll work on a fix to keep 
the same number and type of events sent.

Thank you,
Alejandro

>   It's part of the user-facing
> surface of QEMU, so if we did intend a default change that ought
> really to be more clearly stated (and noted in the Changelog) I think.
>
> thanks
> -- PMM
Peter Maydell Jan. 20, 2021, 1:47 p.m. UTC | #3
On Wed, 20 Jan 2021 at 05:28, Alejandro Jimenez
<alejandro.j.jimenez@oracle.com> wrote:
> On 1/19/2021 4:34 PM, Peter Maydell wrote:
> > Test cases aren't very exciting, but was it really intentional
> > to change the default behaviour?
> My intention was to preserve the default behavior. Perhaps Paolo wanted
> to reduce the number of GUEST_PANICKED events by removing the one with
> 'pause' action? You could consider it superfluous since it is
> immediately followed by another indicating the 'poweroff' action...
> Unless I hear otherwise from either of you, I'll work on a fix to keep
> the same number and type of events sent.

I'm happy to defer to you/Paolo on the behaviour -- I mostly
wanted to check if the change was intentional.

-- PMM
Daniel P. Berrangé Jan. 20, 2021, 1:54 p.m. UTC | #4
On Wed, Jan 20, 2021 at 12:28:14AM -0500, Alejandro Jimenez wrote:
> 
> 
> On 1/19/2021 4:34 PM, Peter Maydell wrote:
> > On Tue, 15 Dec 2020 at 18:11, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > From: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> > > 
> > > The current default action of pausing a guest after a panic event
> > > is received leaves the responsibility to resume guest execution to the
> > > management layer. The reasons for this behavior are discussed here:
> > > https://lore.kernel.org/qemu-devel/52148F88.5000509@redhat.com/
> > > 
> > > However, in instances like the case of older guests (Linux and
> > > Windows) using a pvpanic device but missing support for the
> > > PVPANIC_CRASHLOADED event, and Windows guests using the hv-crash
> > > enlightenment, it is desirable to allow the guests to continue
> > > running after sending a PVPANIC_PANICKED event. This allows such
> > > guests to proceed to capture a crash dump and automatically reboot
> > > without intervention of a management layer.
> > > 
> > > Add an option to avoid stopping a VM after a panic event is received,
> > > by passing:
> > > 
> > > -action panic=none
> > > 
> > > in the command line arguments, or during runtime by using an upcoming
> > > QMP command.
> > Hi. This commit message doesn't say it's changing the default
> > action, but the change does:
> > 
> > > @@ -3899,6 +3899,8 @@ DEF("action", HAS_ARG, QEMU_OPTION_action,
> > >       "                   action when guest reboots [default=none]\n"
> > >       "-action shutdown=poweroff|pause\n"
> > >       "                   action when guest shuts down [default=poweroff]\n"
> > > +    "-action panic=poweroff|pause|none\n"
> > > +    "                   action when guest panics [default=poweroff]\n"
> > >       "-action watchdog=reset|shutdown|poweroff|inject-nmi|pause|debug|none\n"
> > >       "                   action when watchdog fires [default=reset]\n",
> > >       QEMU_ARCH_ALL)
> > >   RebootAction reboot_action = REBOOT_ACTION_NONE;
> > >   ShutdownAction shutdown_action = SHUTDOWN_ACTION_POWEROFF;
> > > +PanicAction panic_action = PANIC_ACTION_POWEROFF;
> > We used to default to 'pause' and now we default to 'poweroff'.
> Hi Peter.
> 
> My rationale for setting the panic action to 'poweroff' was to keep the
> default behavior of QEMU when '-no-shutdown' is not specified, and a panic
> occurs. I believe that in order to accomplish that, the default panic action
> should still be 'poweroff', but as you point out there is an instance where
> the behavior changes. Specifically, when '-no-shutdown' is not used there is
> now one fewer QMP event issued when a guest panic is detected, before
> stopping the VM and powering off.
> 
> I tried to account for this scenario in the original patches, but I failed
> to catch the problem after the rebase when the changes were merged. I'll
> test and send a fix for this issue in the next few days.
> 
> > 
> > We noticed this because it broke an in-flight test case for
> > the pvpanic-pci device from Mihai (which was expecting to see
> > the device in 'pause' state and found it was now in 'poweroff').
> The test is just checking for the arrival of the QMP event, and not actually
> expecting the VM to be paused, correct? I see that if a test/management app
> is expecting to receive a GUEST_PANICKED event with the specific 'pause'
> action, then it might be confused. But any such tests would only be able to
> check for the arrival of the QMP event, and not actually expect to issue any
> commands to a paused VM, since the next block of code in QEMU immediately
> powers off and shutdowns when '-no-shutdown' is not requested. This was the
> typical behavior before the patches.
> 
> > Test cases aren't very exciting, but was it really intentional
> > to change the default behaviour?
> My intention was to preserve the default behavior. Perhaps Paolo wanted to
> reduce the number of GUEST_PANICKED events by removing the one with 'pause'
> action? You could consider it superfluous since it is immediately followed
> by another indicating the 'poweroff' action... Unless I hear otherwise from
> either of you, I'll work on a fix to keep the same number and type of events
> sent.

Why would pause be immediately followed by poweroff ? These are independant
actions, and the mgmt app should be deciding what todo next after the
pause action. It may wish to capture a guest crash image before poweroff.


Regards,
Daniel
Paolo Bonzini Jan. 20, 2021, 1:58 p.m. UTC | #5
On 19/01/21 22:34, Peter Maydell wrote:
> We used to default to 'pause' and now we default to 'poweroff'.
> 
> We noticed this because it broke an in-flight test case for
> the pvpanic-pci device from Mihai (which was expecting to see
> the device in 'pause' state and found it was now in 'poweroff').
> Test cases aren't very exciting, but was it really intentional
> to change the default behaviour? It's part of the user-facing
> surface of QEMU, so if we did intend a default change that ought
> really to be more clearly stated (and noted in the Changelog) I think.

To sum up the difference:

1) before, without -no-shutdown:
{"event": "GUEST_PANICKED", "data": {"action": "pause"}}
{"event": "STOP"}
{"event": "GUEST_PANICKED", "data": {"action": "poweroff"}}
{"event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-panic"}}

after, without -no-shutdown:
{"event": "GUEST_PANICKED", "data": {"action": "poweroff"}}
{"event": "STOP"}
{"event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-panic"}}

2) before, with -no-shutdown:
{"event": "GUEST_PANICKED", "data": {"action": "pause"}}
{"event": "STOP"}

after, with -no-shutdown (aka -action panic=pause,shutdown=pause):
{"event": "GUEST_PANICKED", "data": {"action": "pause"}}
{"event": "STOP"}


I think the new behavior without -no-shutdown is (albeit not 
intentionally) preferrable, since QEMU used to report twice that the 
guest panicked.

Looking again at the -action CLI, though, there are some inconsistencies 
in the naming.  I'll send a patches for them, but they will not affect 
the HMP output.

Paolo
Paolo Bonzini Jan. 20, 2021, 2:47 p.m. UTC | #6
On 20/01/21 14:54, Daniel P. Berrangé wrote:
> Why would pause be immediately followed by poweroff ? These are independant
> actions, and the mgmt app should be deciding what todo next after the
> pause action. It may wish to capture a guest crash image before poweroff.

That's exactly why -action was introduced (and before that, why 
-no-shutdown affected the behavior of panicking too).

Guest panic will exit QEMU by default, but you can use "-action 
panic=pause" or "-no-shutdown" to change the default.

Paolo
diff mbox series

Patch

diff --git a/include/sysemu/runstate-action.h b/include/sysemu/runstate-action.h
index 51a461c4ac..cff45a047b 100644
--- a/include/sysemu/runstate-action.h
+++ b/include/sysemu/runstate-action.h
@@ -14,5 +14,6 @@ 
 /* in softmmu/runstate-action.c */
 extern RebootAction reboot_action;
 extern ShutdownAction shutdown_action;
+extern PanicAction panic_action;
 
 #endif /* RUNSTATE_ACTION_H */
diff --git a/qapi/run-state.json b/qapi/run-state.json
index 25e82d1be1..1f3b329f05 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -353,6 +353,18 @@ 
 { 'enum': 'ShutdownAction',
   'data': [ 'poweroff', 'pause' ] }
 
+##
+# @PanicAction:
+#
+# @none: Continue VM execution
+#
+# @pause: Pause the VM
+#
+# Since: 6.0
+##
+{ 'enum': 'PanicAction',
+  'data': [ 'poweroff', 'pause', 'none' ] }
+
 ##
 # @watchdog-set-action:
 #
@@ -372,6 +384,8 @@ 
 #
 # @shutdown: @ShutdownAction action taken on guest shutdown.
 #
+# @panic: @PanicAction action taken on guest panic.
+#
 # @watchdog: @WatchdogAction action taken when watchdog timer expires .
 #
 # Returns: Nothing on success.
@@ -383,12 +397,14 @@ 
 # -> { "execute": "set-action",
 #      "arguments": { "reboot": "shutdown",
 #                     "shutdown" : "pause",
+#                     "panic": "pause",
 #                     "watchdog": "inject-nmi" } }
 # <- { "return": {} }
 ##
 { 'command': 'set-action',
   'data': { '*reboot': 'RebootAction',
             '*shutdown': 'ShutdownAction',
+            '*panic': 'PanicAction',
             '*watchdog': 'WatchdogAction' },
   'allow-preconfig': true }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index eb55cd0eea..94cdfcf32e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3899,6 +3899,8 @@  DEF("action", HAS_ARG, QEMU_OPTION_action,
     "                   action when guest reboots [default=none]\n"
     "-action shutdown=poweroff|pause\n"
     "                   action when guest shuts down [default=poweroff]\n"
+    "-action panic=poweroff|pause|none\n"
+    "                   action when guest panics [default=poweroff]\n"
     "-action watchdog=reset|shutdown|poweroff|inject-nmi|pause|debug|none\n"
     "                   action when watchdog fires [default=reset]\n",
     QEMU_ARCH_ALL)
@@ -3911,6 +3913,7 @@  SRST
 
     Examples:
 
+    ``-action panic=none``
     ``-action reboot=shutdown,shutdown=pause``
     ``-watchdog i6300esb -action watchdog=pause``
 
diff --git a/softmmu/runstate-action.c b/softmmu/runstate-action.c
index 44de01adaf..99ce880886 100644
--- a/softmmu/runstate-action.c
+++ b/softmmu/runstate-action.c
@@ -15,6 +15,7 @@ 
 
 RebootAction reboot_action = REBOOT_ACTION_NONE;
 ShutdownAction shutdown_action = SHUTDOWN_ACTION_POWEROFF;
+PanicAction panic_action = PANIC_ACTION_POWEROFF;
 
 /*
  * Receives actions to be applied for specific guest events
@@ -30,6 +31,10 @@  void qmp_set_action(bool has_reboot, RebootAction reboot,
         reboot_action = reboot;
     }
 
+    if (has_panic) {
+        panic_action = panic;
+    }
+
     if (has_watchdog) {
         qmp_watchdog_set_action(watchdog, errp);
     }
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index bd0522ed9e..636aab0add 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -469,13 +469,23 @@  void qemu_system_guest_panicked(GuestPanicInformation *info)
     if (current_cpu) {
         current_cpu->crash_occurred = true;
     }
-    qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_PAUSE,
-                                   !!info, info);
-    vm_stop(RUN_STATE_GUEST_PANICKED);
-    if (shutdown_action == SHUTDOWN_ACTION_POWEROFF) {
+    /*
+     * TODO:  Currently the available panic actions are: none, pause, and
+     * poweroff, but in principle debug and reset could be supported as well.
+     * Investigate any potential use cases for the unimplemented actions.
+     */
+    if (panic_action == PANIC_ACTION_PAUSE) {
+        qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_PAUSE,
+                                        !!info, info);
+        vm_stop(RUN_STATE_GUEST_PANICKED);
+    } else if (panic_action == PANIC_ACTION_POWEROFF) {
         qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_POWEROFF,
                                        !!info, info);
+        vm_stop(RUN_STATE_GUEST_PANICKED);
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_PANIC);
+    } else {
+        qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_RUN,
+                                        !!info, info);
     }
 
     if (info) {
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 6d1a7ebb08..3921a04f77 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -489,6 +489,9 @@  static QemuOptsList qemu_action_opts = {
         },{
             .name = "reboot",
             .type = QEMU_OPT_STRING,
+        },{
+            .name = "panic",
+            .type = QEMU_OPT_STRING,
         },{
             .name = "watchdog",
             .type = QEMU_OPT_STRING,
@@ -3212,7 +3215,7 @@  void qemu_init(int argc, char **argv, char **envp)
                 break;
             case QEMU_OPTION_no_shutdown:
                 olist = qemu_find_opts("action");
-                qemu_opts_parse_noisily(olist, "shutdown=pause", false);
+                qemu_opts_parse_noisily(olist, "panic=pause,shutdown=pause", false);
                 break;
             case QEMU_OPTION_show_cursor:
                 warn_report("The -show-cursor option is deprecated. Please "